Creating absolute links from parsed ones

shahin · Sep 22, 2017

Hi there! Hope you all are doing well. I have written a macro to pull a tags from different websites listed in the "storage" variable in my script. In case of just scraping the a tags from listed sites, it is doing awesome. However, the thing I wanna know is that whether there is any built-in function, method whatever in vba with which I can make newly scraped "relative" links "absolute". If it were not for several sites i could have used split method in combination with string concatenation to make usable links. But, as the sites are more than one and the parsed urls are different in their look so I can't think of any idea to make those links absolute. Any input on this will be vastly appreciated.

Code:

Sub Creating_absolute_links()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim storage As String, post As Object, arr() As String, items

    storage = "https://www.yify-torrent.org/search/1080p/," & _
            "https://yts.ag/browse-movies,https://www.houzz.com/professionals," & _
            "https://www.wiseowl.co.uk/videos/"
  
    arr() = Split(storage, ",")

    For Each items In arr
        With http
            .Open "GET", items, False
            .send
            html.body.innerHTML = .responseText
        End With
      
        For Each post In html.getElementsByTagName("a")
            row = row + 1: Cells(row, 1) = post.href
        Next post
    Next items
End Sub

Btw, it is not necessary to think of any pattern to deal with these predefined sites only because there may be numerous sites in the storage. Thanks in advance.

Kenneth Hobson · Sep 22, 2017

I did not see any relative url strings returned. If it were me, I would skip the return strings without "//" using Instr(). Otherwise, a concatenation could be used for relative references. The syntax should be known to do that though. Some might use "//" or "../whatever" sort of thing.

Nothing wrong with what you did. I formatted it to make concatenation easier. If I did a lot of those, I would poke the results into an array and write to the range once.

Code:

Sub Creating_absolute_links()
  'Tools > References > Microsoft XML, v6.0
  Dim http As New XMLHTTP60
  'Tools > References Microsoft HTML Object Library
  Dim html As New HTMLDocument
  Dim s As String, post As Object, a, b, i, r As Long
  'Note trailing backslashes added to urls.
  a = Array( _
    "https://www.yify-torrent.org/search/1080p/", _
    "https://yts.ag/browse-movies/", _
    "https://www.houzz.com/professionals/", _
    "https://www.wiseowl.co.uk/videos/")
   
  For Each i In a
    With http
      .Open "GET", i, False
      .send
      html.body.innerHTML = .responseText
    End With
   
    r = 0 'Next r will be row 1.
    For Each post In html.getElementsByTagName("a")
      r = r + 1
      Cells(r, 1) = post.href
    Next post
  Next i
End Sub

shahin · Sep 22, 2017

I expected that there might be any functionality in vba using which it is possible to do the trick. In other languages, there is a built-in method which can do this without any hardship. As I am not an advanced user of vba so I thought it would be possible in vba as well. However, thanks for the cleaning.

Kenneth Hobson · Sep 22, 2017

What other language and what is the code?

As I said, I saw no relative urls returned so I don't know what to act on. I used to use relative references in my old web site. These days, some links are obfuscated so they are not readily obtained from source text.

shahin · Sep 22, 2017

As you wanted to know: the newly produced links I'm getting from this scraper suffices to create absolute links. One such method is "urljoin" in python.

Kenneth Hobson · Sep 22, 2017

The answer is no, there is no built-in function like that.

I never got into python nor ruby. I did not see any function written by others to do what it is "about" nor a good python description for what it does. I can make a good guess though and write a function for it if you like. It looks like for About returns, it concatenates the suffix to the end of the base url.

shahin · Sep 22, 2017

You are just wasting time bro. I didn't say anything on basis of assumption. I coded it first in python and then I tried to do the same thing in vba. I always try to do similar stuffs using both the languages. If you wanna see how to do that in python then I can give you a demo but don't erect useless arguments.

Chihiro · Sep 25, 2017

Really... be nice to someone trying to help you, taking time out of their day without any compensation.

At any rate, what you can do is do Regex or other pattern matching.

If pattern match "^about:/" then replace it with appropriate absolute link for each site. Ex. "https://www.wiseowl.co.uk/"

If pattern contains "javascript" skip it.

FYI - This isn't relative link in strict use of the term. And the reason for confusion I believe.

Relative links use "../../file", "./file.html", "/file.htm" etc.

You can find detail in the link.
https://www.w3schools.com/html/html_filepaths.asp
https://www.navegabem.com/absolute-or-relative-links.html

There are two school of thought on use of relative path/link. Those that promote use of relative path/link for all internal link for ease of code maintenance.

And those that promote use of absolute reference everywhere, to ensure proper linking and redirect when scraped by crawler/bots. It is good idea to use absolute reference when you want to promote your site/page with Google etc.

shahin · Sep 26, 2017

Thanks sir Chihiro, for your elaborative response. I don't get excited so easily because I hardly fabricate things and i don't say anything for sure that i can't. When i find somebody expressing doubt on what I've already achieved, it's difficult to curb the anger. This is it. However, i will try the regex out to serve the purpose.

shrivallabha · Sep 26, 2017

Maybe you can use python's feature through VBA like this example:
https://stackoverflow.com/questions/39516875/return-result-from-python-to-vba

PS: Code is unverified (for me at least!).

shahin · Sep 26, 2017

If i consider scraping links with "home" page and go like this, it creates invalid links for "home" page. Moreover, the usage of "and" parameter in the script strictly limits the desired links from getting parsed.

Code:

Sub Creating_absolute_links()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, storage As Variant, link As Variant
   
    storage = Array( _
    "https://www.yify-torrent.org/search/1080p/", _
    "https://yts.ag/browse-movies/", _
    "https://www.wiseowl.co.uk/videos/")
   
    For Each link In storage
        With http
            .Open "GET", link, False
            .send
            html.body.innerHTML = .responseText
        End With
        For Each post In html.getElementsByTagName("a")
            If InStr(1, post.innerText, "home", 1) > 0 And InStr(post.href, "about:") > 0 Then
                r = r + 1: Cells(r, 1) = link & Split(post.href, "about:")(1)
            End If
        Next post
    Next link
End Sub

Chihiro · Sep 27, 2017

... like I wrote. Replace 'about:/' with root directory/path, like given in example.
"https://www.wiseowl.co.uk/".

So...
"about:/videos/" becomes...
"https://www.wiseowl.co.uk/videos/"

FYI - instead of SPLIT function, use REPLACE function.

shahin · Sep 27, 2017

Thanks sir, for the suggestion. I'll abide by for sure. The thing I wanted to say from the very beginning is that after executing the above script some websites produce "about:", others produce "http:" and the rest "https:". In such cases how can I handle them all? This is it, sir.

Chihiro · Sep 27, 2017

Pattern matching as I said.

For an example if you use pattern "^about:\/".

It will only return match/true when string starts with "about:/".

Therefore, if statement is used to check that RegEx.Test(String) return TRUE, before using REPLACE or other operation on the string.

See link for detail on using RegEx in VBA.
http://analystcave.com/excel-regex-tutorial/

EDIT: Forgot to escape "/". See updated pattern.

shahin · Nov 11, 2017

@Sir chihiro, Once again, it is my incapability that I could not make things clear in the first place what my desired output should be. Suppose, I execute my vba script to parse links from five different websites and it produced 50 links out of which I just filtered 30 preferable links that started with "about:". Now, as i want to reuse these newly scraped links for another http requests, the incomplete urls need to be replaced by "https://www.yify-torrent.org" or "https://www.houzz.com" or "https://yts.ag" and so on to be able to form valid urls. My question starts here: how can I make all of them valid urls (by regex or by string manipulation or some other way)? At this point I don't even know which incomplete urls should be added to which base urls, how can regex or any other method will be able to detect an incomplete link and join it with specific base url unless there is any built in function to do the job. This is it. Thanks.

Marc L · Nov 11, 2017

Hi !

It's just about respecting the great Logic and
observation of behavior of any webpage !

For a webpage, observe manually any link when surfing the page
and compare this link with the codepage or
internal href property of the element : just write down on a paper and
just see what to do to match it correctly with link observed when surfing …

It's just about the domain or a base URL …
See samples in this forum.

shahin · Nov 11, 2017

@Marc L, You are barely visible in this forum. Miss you a lot. However, thanks for you suggestion. I've been trying to fix this problem but finding no idea to move on. Where is it I should find sample and what the "search keyword" might be?

shahin · Nov 12, 2017

If I decide to keep links started with "http" then I thing it is possible to reuse them like below:

Code:

Sub Creating_absolute_links()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim link_storage As Variant, posts As Object, post As Variant

    link_storage = Array( _
                   "https://www.yify-torrent.org/search/1080p/", _
                   "https://yts.ag/browse-movies/", _
                   "https://www.houzz.com/professionals/", _
                   "https://www.wiseowl.co.uk/videos/")

    For Each post In link_storage
        With http
            .Open "GET", post, False
            .send
            html.body.innerHTML = .responseText
        End With

        For Each posts In html.getElementsByTagName("a")
            If InStr(posts.href, "http") > 0 Then r = r + 1: Cells(r, 1) = posts.href
        Next posts
    Next post
End Sub

Otherwise, If i stick to define the loop filtering links started with "about:", it gets complicated and make me confuse to go forward. How can I join the newly produced broken links with their respective base urls? I gave a try but it threw an error "Object variable or with block---" when It reaches "debug.print" this line.

Code:

For Each posts In html.getElementsByTagName("a")
    If InStr(posts.href, "about:") > 0 Then r = r + 1: Cells(r, 1) = posts.href
Next posts
Debug.Print Split(post, ".com/")(0) & ".com/" & Split(posts.href, "about:/")(1)

Marc L · Nov 13, 2017

When you grab a link you already know the domain from where you read it,
no ? (easy logic)
So when you read a link starting with "about:/" just replace it
by the correct domain or base of URL you already wrote on a paper
when you manually observed the webpage.
So for each webpage you need its URL and the base URL of its links …
All this is only about pure Logic, not upon any coding trick !

shahin · Nov 13, 2017

@Marc L, I know it's not a good question to ask why it didn't "Split(post, ".com/")(0) & ".com/" & Split(posts.href, "about:/")(1)" work out within my script but I'm willing to know. Thanks in advance.

Marc L · Nov 13, 2017

Or just use Replace VBA function …

shahin · Nov 13, 2017

Why didn't this Replace function come to my mind as an option? Thanks.

Marc L · Nov 13, 2017

As already written in posts #12, 14 & 19 …

shahin · Nov 16, 2017

Looks like I'm very close to the solution I am after. Just look at the below script:

Code:

Sub Creating_absolute_links()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, vault As Variant, link As Variant
 
    vault = Array( _
        "https://yts.ag/browse-movies/", _
        "https://chandoo.org/wp/vba-classes/", _
        "https://www.wiseowl.co.uk/videos/")
 
    For Each link In vault
        With http
            .Open "GET", link, False
            .send
            html.body.innerHTML = .responseText
        End With
     
        For Each post In html.getElementsByTagName("a")
            r = r + 1: Cells(r, 1) = link & post.href   ''I'm talking about this line
        Next post
    Next link
End Sub

The first portion of the concatenated string aka "link" always produces the relevant site address and get joined with the "post.href" portion. I vastly check for any anomaly, as in if the site link is joining any newly produced "href" from "another link" but no what I found so far it accurate. That means, if i can filter out the fully qualified domain names from the "link" portion then the rest is easy to manage, i meant filtering preferable "broken links" from "href" part.

Marc L · Nov 16, 2017

Unclear … No needs to filter but to replace "about:" !

Compare a link of this kind with its URL when you manually click the link.
Give us a sample of a link and its true URL …

It's just about easy observing, visualizing, comparing, …

Creating absolute links from parsed ones

Active Member

Active Member

Active Member

Active Member

Active Member

Active Member

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Excel Ninja