• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

How to parse links conditionally?

shahin

Active Member
Hi there!! Hope you all are doing well. This time I've got stuck on a slightly different issue with my script. The script is supposed to find "contact" or "about" link from some webpages stored within "linklists" variable. It is doing it's job not following the condition I tried to create. I expected my script parse the link connected to "contact" and If "contact" is not available then only it will look for "about" link to parse. However, when both of the "contact" and "about" link are available, the scraper sources the link connected to "about" whereas my initial and main condition is to get link connected to "contact". How can I create the condition in such a way so the script will first look for "contact" link and If "contact" link is not available then it will go for parsing "about" link.

This is what i have tried so far:
Code:
Sub Get_Conditional_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object
    Dim newlink As String, linklists As Variant, link As Variant

    linklists = [{"http://www.mount-zion.biz/","http://www.innovaprint.com.sg/","http://www.cityscape.com.sg/"}]

    For Each link In linklists
        With IE
            .Visible = True
            .navigate link
            While .Busy = True Or .readyState < 4: DoEvents: Wend
            Set HTML = .document
        End With

        For Each post In HTML.getElementsByTagName("a")
            If InStr(1, post.innerText, "contact", 1) > 0 Then
                newlink = post.getAttribute("href")
                Exit For

            ElseIf InStr(1, post.innerText, "about", 1) > 0 Then
                newlink = post.getAttribute("href")
                Exit For
            End If
        Next post
        R = R + 1: Cells(R, 1) = newlink
    Next link
    IE.Quit
End Sub

Btw, If I comment out the below portion from my above script then the script does parse the "contact" link from each webpage but I would like to keep both the conditions active within my script so that if one is missing another will be of use.
Code:
ElseIf InStr(1, post.innerText, "about", 1) > 0 Then
    newlink = post.getAttribute("href")
    Exit For
 
Hi !

The logic issue is where your Cells = newlink codeline is located !
Just move this codeline in each If condition.
In fact you even do not need newlink variable neither HTML neither …
 
... you are looping each <a> and checking for If conditions.

Hence, if "about" is found before an element that contains "contact", then code will not loop further down the list/collection to check for it.

Rather than looping through collection. You are better off doing string manipulation or RegExp.

Otherwise, do two pass on the collection. First loop to find "contact" if not found, do second loop.

Edit: Or only have "Exit For", at "contact" condition only. So you don't exit loop prematurely.
 
Last edited:
It's long since I heard from you sir. Hope you are doing well. I've tried one but it needs some optimization to work flawlessly (or I couldn't understand the logic at all).
Code:
Sub Get_Conditional_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object
    Dim newlink As String, linklists As Variant, link As Variant

    linklists = [{"http://www.mount-zion.biz/","http://www.innovaprint.com.sg/","http://www.cityscape.com.sg/","https://www.plexure.com.sg/"}]

    For Each link In linklists
        With IE
            .Visible = True
            .navigate link
            While .Busy = True Or .readyState < 4: DoEvents: Wend
            Set HTML = .document
        End With

        For Each post In HTML.getElementsByTagName("a")
            If InStr(1, post.innerText, "contact", 1) > 0 Then
                newlink = post.getAttribute("href")
                Exit For
              
            ElseIf InStr(1, post.innerText, "about", 1) > 0 Then
                newlink = post.getAttribute("href")
            End If
        Next post
        R = R + 1: Cells(R, 1) = newlink
    Next link
    IE.Quit
End Sub

It works for the first three links but for the last link the scraper checks the "else block" and throws an error "invalid use of null" because in that link there is no "href" connected to "about".
 
Last edited:
See the edit sir. I'm not sure I have been able to go through the instruction you have given.
 
Yes I was a bit lost 'cause of all your not necessary stuff !

To work flawlessly forget IE and just use any request.
I just try on my side from a clean code of this forum
and I get next results without any issue :
http://www.mount-zion.biz/contactus.html
http://www.innovaprint.com.sg/contact.html
http://www.cityscape.com.sg/?page_id=37

So as the logic needed here is at beginner level,
as it just needs a well observation of the webpage content,
as always just check content of variables during code execution …
 
Use Debug.Print post.innerHtml inside inner For loop, and comment out entire "If... End If" block.

You'll see what I mean.

Note:
For looping through object/collection in HTMLDocument, it's never a good idea to loop using generic object variable. As collection can contain more than one object type. This can add unnecessary loop iteration and cause various errors. Instead, use index to loop through collection.
Ex:
Code:
With HTML.getElementsByTagName("a")
  For i = 0 to .Length - 1
      Debug.Print .Item(i).innerHtml
  Next
End With
 
Via For Each oElt In oDoc.getElementsByTagName("A")
even if the collection is empty no issue 'cause the loop is not executed
and as here it's always the same element type
so it's easy to directly work with oElt.innerText and oElt.href
like you can see in threads of this forum, often shahin's own !
 
For sir chihiro:

This is how I got success in python:
Code:
def FetchLink(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            abslink = urljoin(site,item['href'])
            print(abslink)
            return 0           # Exit from function
    for item in soup.select("a[href]"):
        if "about" in item.text.lower():
            abslink = urljoin(site,item['href'])
            print(abslink)
            return 0            # Exit from function
 

So why do you compare "CONTACT" = "contact" in VBA via InStr ?
Of course your VBA code can't work like this ! If you just test your code …

Must see in VBA inner help : LCase and Option Compare
 
But, why they don't? Because using the below script I can see that vba is able to handle all cases if I try like below:
Code:
Sub GetNames()
    Dim namestorage As Variant, elem As Variant
 
    namestorage = [{"CONTACT","Contact","CoNtAcT","cOnTaCt"}]
    For Each elem In namestorage
        If InStr(1, elem, "contact", 1) > 0 Then
            R = R + 1: Cells(R, 1) = elem
        End If
    Next elem
End Sub

Output:

CONTACT
Contact
CoNtAcT
cOnTaCt

If this is what you meant @Marc L.
 
... compare your python code to vba. In python you have two separate loop for each check. Where as you are trying to do it in single loop for both check in vba. Hence your issue of prematurely exiting loop, when "about" is encountered before "contact" (which is actually every case in this sample).
 
Now it works. But don't you think the code looks uglier @sir chihiro:
Code:
Sub Get_Conditional_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim post As Object, elem As Object, newlink As String
    Dim linklists As Variant, link As Variant

    linklists = [{"https://www.plexure.com.sg/","http://www.mount-zion.biz/","http://www.innovaprint.com.sg/","http://www.cityscape.com.sg/"}]

    For Each link In linklists
        With IE
            .Visible = True
            .navigate link
            While .Busy = True Or .readyState < 4: DoEvents: Wend
            Set HTML = .document
        End With

        For Each post In HTML.getElementsByTagName("a")
            If InStr(1, post.innerText, "contact", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
        Next post
       
        For Each elem In HTML.getElementsByTagName("a")
            If InStr(1, post.innerText, "about", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
        Next elem
        R = R + 1: Cells(R, 1) = newlink
    Next link
    IE.Quit
End Sub
 
Ugly 'cause your code is not logical unless you prefer about on contact
'cause it doesn't need to run the second loop if contact already found !
And check your results as some are not good enough …
On my side I just use an unique loop …

But, why they don't?
'Cause your VBA code doesn't do the same as your Python code !

If this is what you meant
I just meant this :
Must see in VBA inner help : LCase and Option Compare
Read, just read and apply …

Edit : sorry shahin, I did not see your compare text option used in Instr !

 
I have edited my previous post …

WebContactLinks.gif

 
One last issue I would like to get resolved. Although this has been asked before, I hope to get a little twitch on it.

Upon execution, the above script produces the below links:

Code:
https://www.plexure.com.sg/contact
contactus.html
/contact.html
http://www.cityscape.com.sg/?page_id=37

When I tried conditional statement in order to make them broken to complete links, I got the below results:

The statement I used:

Code:
If InStr(newlink, "http") > 0 Then
    R = R + 1: Cells(R, 1) = newlink
Else:
    R = R + 1: Cells(R, 1) = link & newlink
End If

The result it produces (one but all are okay):

Code:
https://www.plexure.com.sg/contact
http://www.mount-zion.biz/contactus.html
http://www.innovaprint.com.sg//contact.html  ''this is invalid: notice the double slash
http://www.cityscape.com.sg/?page_id=37

The problem I get with this broken link "/contact.html" is that it starts with a forward slash and as a result the conditional statement creates a useless link "http://www.innovaprint.com.sg//contact.html".
 
It seems I've made it but the code looks like the "Joker" in "The Dark Knight" movie. It takes ages to produce the results.
Code:
Sub Get_Conditional_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim post As Object, elem As Object, newlink As String
    Dim frstchr$, linklists As Variant, link As Variant

    linklists = [{"http://www.innovaprint.com.sg/","https://www.plexure.com.sg/","http://www.mount-zion.biz/","http://www.cityscape.com.sg/"}]

    For Each link In linklists
        With IE
            .Visible = True
            .navigate link
            While .Busy = True Or .readyState < 4: DoEvents: Wend
            Set HTML = .document
        End With

        For Each post In HTML.getElementsByTagName("a")
            If InStr(1, post.innerText, "contact", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
        Next post

        For Each elem In HTML.getElementsByTagName("a")
            If InStr(1, post.innerText, "about", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
        Next elem

        If InStr(newlink, "http") > 0 Then
            R = R + 1: Cells(R, 1) = newlink
        Else:
            frstchr = Left(newlink, 1)
            If frstchr = "/" Then newlink = Split(newlink, frstchr)(1)
            R = R + 1: Cells(R, 1) = link & newlink
        End If
    Next link
    IE.Quit
End Sub
 
Thanks for the pointer @Marc L. This is very likely to have. So, I should choose replace function over split function in this very case, right?
 
Back
Top