1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Creating absolute links from parsed ones

Discussion in 'VBA Macros' started by shahin, Sep 22, 2017.

  1. shahin

    shahin Active Member

    Messages:
    601
    Hi there! Hope you all are doing well. I have written a macro to pull a tags from different websites listed in the "storage" variable in my script. In case of just scraping the a tags from listed sites, it is doing awesome. However, the thing I wanna know is that whether there is any built-in function, method whatever in vba with which I can make newly scraped "relative" links "absolute". If it were not for several sites i could have used split method in combination with string concatenation to make usable links. But, as the sites are more than one and the parsed urls are different in their look so I can't think of any idea to make those links absolute. Any input on this will be vastly appreciated.

    Code (vb):

    Sub Creating_absolute_links()
        Dim http As New XMLHTTP60, html As New HTMLDocument
        Dim storage As String, post As Object, arr() As String, items

        storage = "https://www.yify-torrent.org/search/1080p/," & _
                "https://yts.ag/browse-movies,https://www.houzz.com/professionals," & _
                "https://www.wiseowl.co.uk/videos/"
     
        arr() = Split(storage, ",")

        For Each items In arr
            With http
                .Open "GET", items, False
                .send
                html.body.innerHTML = .responseText
            End With
         
            For Each post In html.getElementsByTagName("a")
                row = row + 1: Cells(row, 1) = post.href
            Next post
        Next items
    End Sub
     
    Btw, it is not necessary to think of any pattern to deal with these predefined sites only because there may be numerous sites in the storage. Thanks in advance.
  2. Kenneth Hobson

    Kenneth Hobson Active Member

    Messages:
    189
    I did not see any relative url strings returned. If it were me, I would skip the return strings without "//" using Instr(). Otherwise, a concatenation could be used for relative references. The syntax should be known to do that though. Some might use "//" or "../whatever" sort of thing.

    Nothing wrong with what you did. I formatted it to make concatenation easier. If I did a lot of those, I would poke the results into an array and write to the range once.

    Code (vb):

    Sub Creating_absolute_links()
      'Tools > References > Microsoft XML, v6.0
     Dim http As New XMLHTTP60
      'Tools > References Microsoft HTML Object Library
     Dim html As New HTMLDocument
      Dim s As String, post As Object, a, b, i, r As Long
      'Note trailing backslashes added to urls.
     a = Array( _
        "https://www.yify-torrent.org/search/1080p/", _
        "https://yts.ag/browse-movies/", _
        "https://www.houzz.com/professionals/", _
        "https://www.wiseowl.co.uk/videos/")
       
      For Each i In a
        With http
          .Open "GET", i, False
          .send
          html.body.innerHTML = .responseText
        End With
       
        r = 0 'Next r will be row 1.
       For Each post In html.getElementsByTagName("a")
          r = r + 1
          Cells(r, 1) = post.href
        Next post
      Next i
    End Sub
    shahin likes this.
  3. shahin

    shahin Active Member

    Messages:
    601
    I expected that there might be any functionality in vba using which it is possible to do the trick. In other languages, there is a built-in method which can do this without any hardship. As I am not an advanced user of vba so I thought it would be possible in vba as well. However, thanks for the cleaning.
  4. Kenneth Hobson

    Kenneth Hobson Active Member

    Messages:
    189
    What other language and what is the code?

    As I said, I saw no relative urls returned so I don't know what to act on. I used to use relative references in my old web site. These days, some links are obfuscated so they are not readily obtained from source text.
  5. shahin

    shahin Active Member

    Messages:
    601
    As you wanted to know: the newly produced links I'm getting from this scraper suffices to create absolute links. One such method is "urljoin" in python.
  6. Kenneth Hobson

    Kenneth Hobson Active Member

    Messages:
    189
    The answer is no, there is no built-in function like that.

    I never got into python nor ruby. I did not see any function written by others to do what it is "about" nor a good python description for what it does. I can make a good guess though and write a function for it if you like. It looks like for About returns, it concatenates the suffix to the end of the base url.
  7. shahin

    shahin Active Member

    Messages:
    601
    You are just wasting time bro. I didn't say anything on basis of assumption. I coded it first in python and then I tried to do the same thing in vba. I always try to do similar stuffs using both the languages. If you wanna see how to do that in python then I can give you a demo but don't erect useless arguments.
  8. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,919
    Really... be nice to someone trying to help you, taking time out of their day without any compensation.

    At any rate, what you can do is do Regex or other pattern matching.

    If pattern match "^about:/" then replace it with appropriate absolute link for each site. Ex. "https://www.wiseowl.co.uk/"

    If pattern contains "javascript" skip it.

    FYI - This isn't relative link in strict use of the term. And the reason for confusion I believe.

    Relative links use "../../file", "./file.html", "/file.htm" etc.

    You can find detail in the link.
    https://www.w3schools.com/html/html_filepaths.asp
    https://www.navegabem.com/absolute-or-relative-links.html

    There are two school of thought on use of relative path/link. Those that promote use of relative path/link for all internal link for ease of code maintenance.

    And those that promote use of absolute reference everywhere, to ensure proper linking and redirect when scraped by crawler/bots. It is good idea to use absolute reference when you want to promote your site/page with Google etc.
    shrivallabha likes this.
  9. shahin

    shahin Active Member

    Messages:
    601
    Thanks sir Chihiro, for your elaborative response. I don't get excited so easily because I hardly fabricate things and i don't say anything for sure that i can't. When i find somebody expressing doubt on what I've already achieved, it's difficult to curb the anger. This is it. However, i will try the regex out to serve the purpose.
  10. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,787
  11. shahin

    shahin Active Member

    Messages:
    601
    If i consider scraping links with "home" page and go like this, it creates invalid links for "home" page. Moreover, the usage of "and" parameter in the script strictly limits the desired links from getting parsed.
    Code (vb):

    Sub Creating_absolute_links()
        Dim http As New XMLHTTP60, html As New HTMLDocument
        Dim post As Object, storage As Variant, link As Variant
       
        storage = Array( _
        "https://www.yify-torrent.org/search/1080p/", _
        "https://yts.ag/browse-movies/", _
        "https://www.wiseowl.co.uk/videos/")
       
        For Each link In storage
            With http
                .Open "GET", link, False
                .send
                html.body.innerHTML = .responseText
            End With
            For Each post In html.getElementsByTagName("a")
                If InStr(1, post.innerText, "home", 1) > 0 And InStr(post.href, "about:") > 0 Then
                    r = r + 1: Cells(r, 1) = link & Split(post.href, "about:")(1)
                End If
            Next post
        Next link
    End Sub
     
  12. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,919
    shahin likes this.
  13. shahin

    shahin Active Member

    Messages:
    601
    Thanks sir, for the suggestion. I'll abide by for sure. The thing I wanted to say from the very beginning is that after executing the above script some websites produce "about:", others produce "http:" and the rest "https:". In such cases how can I handle them all? This is it, sir.
  14. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,919
    Pattern matching as I said.

    For an example if you use pattern "^about:\/".

    It will only return match/true when string starts with "about:/".

    Therefore, if statement is used to check that RegEx.Test(String) return TRUE, before using REPLACE or other operation on the string.

    See link for detail on using RegEx in VBA.
    http://analystcave.com/excel-regex-tutorial/

    EDIT: Forgot to escape "/". See updated pattern.
    Last edited: Sep 27, 2017
    shahin likes this.
  15. shahin

    shahin Active Member

    Messages:
    601
    @Sir chihiro, Once again, it is my incapability that I could not make things clear in the first place what my desired output should be. Suppose, I execute my vba script to parse links from five different websites and it produced 50 links out of which I just filtered 30 preferable links that started with "about:". Now, as i want to reuse these newly scraped links for another http requests, the incomplete urls need to be replaced by "https://www.yify-torrent.org" or "https://www.houzz.com" or "https://yts.ag" and so on to be able to form valid urls. My question starts here: how can I make all of them valid urls (by regex or by string manipulation or some other way)? At this point I don't even know which incomplete urls should be added to which base urls, how can regex or any other method will be able to detect an incomplete link and join it with specific base url unless there is any built in function to do the job. This is it. Thanks.
  16. Marc L

    Marc L Excel Ninja

    Messages:
    3,277
    Hi !

    It's just about respecting the great Logic and
    observation of behavior of any webpage !

    For a webpage, observe manually any link when surfing the page
    and compare this link with the codepage or
    internal href property of the element : just write down on a paper and
    just see what to do to match it correctly with link observed when surfing …

    It's just about the domain or a base URL …
    See samples in this forum.
    shahin likes this.
  17. shahin

    shahin Active Member

    Messages:
    601
    @Marc L, You are barely visible in this forum. Miss you a lot. However, thanks for you suggestion. I've been trying to fix this problem but finding no idea to move on. Where is it I should find sample and what the "search keyword" might be?
  18. shahin

    shahin Active Member

    Messages:
    601
    If I decide to keep links started with "http" then I thing it is possible to reuse them like below:
    Code (vb):

    Sub Creating_absolute_links()
        Dim http As New XMLHTTP60, html As New HTMLDocument
        Dim link_storage As Variant, posts As Object, post As Variant

        link_storage = Array( _
                       "https://www.yify-torrent.org/search/1080p/", _
                       "https://yts.ag/browse-movies/", _
                       "https://www.houzz.com/professionals/", _
                       "https://www.wiseowl.co.uk/videos/")

        For Each post In link_storage
            With http
                .Open "GET", post, False
                .send
                html.body.innerHTML = .responseText
            End With

            For Each posts In html.getElementsByTagName("a")
                If InStr(posts.href, "http") > 0 Then r = r + 1: Cells(r, 1) = posts.href
            Next posts
        Next post
    End Sub
     
    Otherwise, If i stick to define the loop filtering links started with "about:", it gets complicated and make me confuse to go forward. How can I join the newly produced broken links with their respective base urls? I gave a try but it threw an error "Object variable or with block---" when It reaches "debug.print" this line.
    Code (vb):

    For Each posts In html.getElementsByTagName("a")
        If InStr(posts.href, "about:") > 0 Then r = r + 1: Cells(r, 1) = posts.href
    Next posts
    Debug.Print Split(post, ".com/")(0) & ".com/" & Split(posts.href, "about:/")(1)
     
    Last edited: Nov 12, 2017
  19. Marc L

    Marc L Excel Ninja

    Messages:
    3,277

    When you grab a link you already know the domain from where you read it,
    no ? (easy logic)
    So when you read a link starting with "about:/" just replace it
    by the correct domain or base of URL you already wrote on a paper
    when you manually observed the webpage.
    So for each webpage you need its URL and the base URL of its links …
    All this is only about pure Logic, not upon any coding trick !
    shahin likes this.
  20. shahin

    shahin Active Member

    Messages:
    601
    @Marc L, I know it's not a good question to ask why it didn't "Split(post, ".com/")(0) & ".com/" & Split(posts.href, "about:/")(1)" work out within my script but I'm willing to know. Thanks in advance.
  21. Marc L

    Marc L Excel Ninja

    Messages:
    3,277


    Or just use Replace VBA function …​
    shahin likes this.
  22. shahin

    shahin Active Member

    Messages:
    601
    Why didn't this Replace function come to my mind as an option? Thanks.
  23. Marc L

    Marc L Excel Ninja

    Messages:
    3,277


    As already written in posts #12, 14 & 19 …​
  24. shahin

    shahin Active Member

    Messages:
    601
    Looks like I'm very close to the solution I am after. Just look at the below script:
    Code (vb):

    Sub Creating_absolute_links()
        Dim http As New XMLHTTP60, html As New HTMLDocument
        Dim post As Object, vault As Variant, link As Variant
     
        vault = Array( _
            "https://yts.ag/browse-movies/", _
            "https://chandoo.org/wp/vba-classes/", _
            "https://www.wiseowl.co.uk/videos/")
     
        For Each link In vault
            With http
                .Open "GET", link, False
                .send
                html.body.innerHTML = .responseText
            End With
         
            For Each post In html.getElementsByTagName("a")
                r = r + 1: Cells(r, 1) = link & post.href   ''I'm talking about this line
           Next post
        Next link
    End Sub
     
    The first portion of the concatenated string aka "link" always produces the relevant site address and get joined with the "post.href" portion. I vastly check for any anomaly, as in if the site link is joining any newly produced "href" from "another link" but no what I found so far it accurate. That means, if i can filter out the fully qualified domain names from the "link" portion then the rest is easy to manage, i meant filtering preferable "broken links" from "href" part.
  25. Marc L

    Marc L Excel Ninja

    Messages:
    3,277
    Unclear … No needs to filter but to replace "about:" !

    Compare a link of this kind with its URL when you manually click the link.
    Give us a sample of a link and its true URL …

    It's just about easy observing, visualizing, comparing, …

Share This Page