1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

How to parse links conditionally?

Discussion in 'VBA Macros' started by shahin, Apr 17, 2018.

  1. shahin

    shahin Active Member

    Messages:
    897
    Hi there!! Hope you all are doing well. This time I've got stuck on a slightly different issue with my script. The script is supposed to find "contact" or "about" link from some webpages stored within "linklists" variable. It is doing it's job not following the condition I tried to create. I expected my script parse the link connected to "contact" and If "contact" is not available then only it will look for "about" link to parse. However, when both of the "contact" and "about" link are available, the scraper sources the link connected to "about" whereas my initial and main condition is to get link connected to "contact". How can I create the condition in such a way so the script will first look for "contact" link and If "contact" link is not available then it will go for parsing "about" link.

    This is what i have tried so far:
    Code (vb):

    Sub Get_Conditional_Links()
        Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object
        Dim newlink As String, linklists As Variant, link As Variant

        linklists = [{"http://www.mount-zion.biz/","http://www.innovaprint.com.sg/","http://www.cityscape.com.sg/"}]

        For Each link In linklists
            With IE
                .Visible = True
                .navigate link
                While .Busy = True Or .readyState < 4: DoEvents: Wend
                Set HTML = .document
            End With

            For Each post In HTML.getElementsByTagName("a")
                If InStr(1, post.innerText, "contact", 1) > 0 Then
                    newlink = post.getAttribute("href")
                    Exit For

                ElseIf InStr(1, post.innerText, "about", 1) > 0 Then
                    newlink = post.getAttribute("href")
                    Exit For
                End If
            Next post
            R = R + 1: Cells(R, 1) = newlink
        Next link
        IE.Quit
    End Sub
     
    Btw, If I comment out the below portion from my above script then the script does parse the "contact" link from each webpage but I would like to keep both the conditions active within my script so that if one is missing another will be of use.
    Code (vb):

    ElseIf InStr(1, post.innerText, "about", 1) > 0 Then
        newlink = post.getAttribute("href")
        Exit For
     
  2. Marc L

    Marc L Excel Ninja

    Messages:
    4,253
    Hi !

    The logic issue is where your Cells = newlink codeline is located !
    Just move this codeline in each If condition.
    In fact you even do not need newlink variable neither HTML neither …
  3. shahin

    shahin Active Member

    Messages:
    897
    I did it according to your suggestion @Marc L. However, the result is still the same.
  4. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,935
    ... you are looping each <a> and checking for If conditions.

    Hence, if "about" is found before an element that contains "contact", then code will not loop further down the list/collection to check for it.

    Rather than looping through collection. You are better off doing string manipulation or RegExp.

    Otherwise, do two pass on the collection. First loop to find "contact" if not found, do second loop.

    Edit: Or only have "Exit For", at "contact" condition only. So you don't exit loop prematurely.
    Last edited: Apr 17, 2018
    shahin likes this.
  5. shahin

    shahin Active Member

    Messages:
    897
    It's long since I heard from you sir. Hope you are doing well. I've tried one but it needs some optimization to work flawlessly (or I couldn't understand the logic at all).
    Code (vb):

    Sub Get_Conditional_Links()
        Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object
        Dim newlink As String, linklists As Variant, link As Variant

        linklists = [{"http://www.mount-zion.biz/","http://www.innovaprint.com.sg/","http://www.cityscape.com.sg/","https://www.plexure.com.sg/"}]

        For Each link In linklists
            With IE
                .Visible = True
                .navigate link
                While .Busy = True Or .readyState < 4: DoEvents: Wend
                Set HTML = .document
            End With

            For Each post In HTML.getElementsByTagName("a")
                If InStr(1, post.innerText, "contact", 1) > 0 Then
                    newlink = post.getAttribute("href")
                    Exit For
                 
                ElseIf InStr(1, post.innerText, "about", 1) > 0 Then
                    newlink = post.getAttribute("href")
                End If
            Next post
            R = R + 1: Cells(R, 1) = newlink
        Next link
        IE.Quit
    End Sub
     
    It works for the first three links but for the last link the scraper checks the "else block" and throws an error "invalid use of null" because in that link there is no "href" connected to "about".
    Last edited: Apr 17, 2018
  6. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,935
    You did not read my post. Take time to read it over and digest what's written there.
  7. shahin

    shahin Active Member

    Messages:
    897
    See the edit sir. I'm not sure I have been able to go through the instruction you have given.
  8. Marc L

    Marc L Excel Ninja

    Messages:
    4,253
    Yes I was a bit lost 'cause of all your not necessary stuff !

    To work flawlessly forget IE and just use any request.
    I just try on my side from a clean code of this forum
    and I get next results without any issue :
    http://www.mount-zion.biz/contactus.html
    http://www.innovaprint.com.sg/contact.html
    http://www.cityscape.com.sg/?page_id=37

    So as the logic needed here is at beginner level,
    as it just needs a well observation of the webpage content,
    as always just check content of variables during code execution …
  9. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,935
    Use Debug.Print post.innerHtml inside inner For loop, and comment out entire "If... End If" block.

    You'll see what I mean.

    Note:
    For looping through object/collection in HTMLDocument, it's never a good idea to loop using generic object variable. As collection can contain more than one object type. This can add unnecessary loop iteration and cause various errors. Instead, use index to loop through collection.
    Ex:
    Code (vb):
    With HTML.getElementsByTagName("a")
      For i = 0 to .Length - 1
          Debug.Print .Item(i).innerHtml
      Next
    End With
  10. Marc L

    Marc L Excel Ninja

    Messages:
    4,253
    Via For Each oElt In oDoc.getElementsByTagName("A")
    even if the collection is empty no issue 'cause the loop is not executed
    and as here it's always the same element type
    so it's easy to directly work with oElt.innerText and oElt.href
    like you can see in threads of this forum, often shahin's own !
    shahin likes this.
  11. shahin

    shahin Active Member

    Messages:
    897
    For sir chihiro:

    This is how I got success in python:
    Code (vb):

    def FetchLink(site):
        res = requests.get(site)
        soup = BeautifulSoup(res.text,"lxml")
        for item in soup.select("a[href]"):
            if "contact" in item.text.lower():
                abslink = urljoin(site,item['href'])
               print(abslink)
                return 0           # Exit from function
        for item in soup.select("a[href]"):
            if "about" in item.text.lower():
                abslink = urljoin(site,item['href'])
               print(abslink)
                return 0            # Exit from function
     
  12. Marc L

    Marc L Excel Ninja

    Messages:
    4,253

    So why do you compare "CONTACT" = "contact" in VBA via InStr ?
    Of course your VBA code can't work like this ! If you just test your code …

    Must see in VBA inner help : LCase and Option Compare
  13. shahin

    shahin Active Member

    Messages:
    897
    But, why they don't? Because using the below script I can see that vba is able to handle all cases if I try like below:
    Code (vb):

    Sub GetNames()
        Dim namestorage As Variant, elem As Variant
     
        namestorage = [{"CONTACT","Contact","CoNtAcT","cOnTaCt"}]
        For Each elem In namestorage
            If InStr(1, elem, "contact", 1) > 0 Then
                R = R + 1: Cells(R, 1) = elem
            End If
        Next elem
    End Sub
     
    Output:

    CONTACT
    Contact
    CoNtAcT
    cOnTaCt

    If this is what you meant @Marc L.
  14. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,935
    ... compare your python code to vba. In python you have two separate loop for each check. Where as you are trying to do it in single loop for both check in vba. Hence your issue of prematurely exiting loop, when "about" is encountered before "contact" (which is actually every case in this sample).
    shahin likes this.
  15. shahin

    shahin Active Member

    Messages:
    897
    Now it works. But don't you think the code looks uglier @sir chihiro:
    Code (vb):

    Sub Get_Conditional_Links()
        Dim IE As New InternetExplorer, HTML As HTMLDocument
        Dim post As Object, elem As Object, newlink As String
        Dim linklists As Variant, link As Variant

        linklists = [{"https://www.plexure.com.sg/","http://www.mount-zion.biz/","http://www.innovaprint.com.sg/","http://www.cityscape.com.sg/"}]

        For Each link In linklists
            With IE
                .Visible = True
                .navigate link
                While .Busy = True Or .readyState < 4: DoEvents: Wend
                Set HTML = .document
            End With

            For Each post In HTML.getElementsByTagName("a")
                If InStr(1, post.innerText, "contact", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
            Next post
           
            For Each elem In HTML.getElementsByTagName("a")
                If InStr(1, post.innerText, "about", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
            Next elem
            R = R + 1: Cells(R, 1) = newlink
        Next link
        IE.Quit
    End Sub
     
  16. Marc L

    Marc L Excel Ninja

    Messages:
    4,253
    Ugly 'cause your code is not logical unless you prefer about on contact
    'cause it doesn't need to run the second loop if contact already found !
    And check your results as some are not good enough …
    On my side I just use an unique loop …

    'Cause your VBA code doesn't do the same as your Python code !

    I just meant this :
    Read, just read and apply …

    Edit : sorry shahin, I did not see your compare text option used in Instr !

  17. Marc L

    Marc L Excel Ninja

    Messages:
    4,253
    I have edited my previous post …

    [​IMG]
    shahin likes this.
  18. shahin

    shahin Active Member

    Messages:
    897
    One last issue I would like to get resolved. Although this has been asked before, I hope to get a little twitch on it.

    Upon execution, the above script produces the below links:

    Code (vb):

    https://www.plexure.com.sg/contact
    contactus.html
    /contact.html
    http://www.cityscape.com.sg/?page_id=37
     
    When I tried conditional statement in order to make them broken to complete links, I got the below results:

    The statement I used:

    Code (vb):

    If InStr(newlink, "http") > 0 Then
        R = R + 1: Cells(R, 1) = newlink
    Else:
        R = R + 1: Cells(R, 1) = link & newlink
    End If
     
    The result it produces (one but all are okay):

    Code (vb):

    https://www.plexure.com.sg/contact
    http://www.mount-zion.biz/contactus.html
    http://www.innovaprint.com.sg//contact.html  ''this is invalid: notice the double slash
    http://www.cityscape.com.sg/?page_id=37
     
    The problem I get with this broken link "/contact.html" is that it starts with a forward slash and as a result the conditional statement creates a useless link "http://www.innovaprint.com.sg//contact.html".
  19. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,935
    So check left most char of returned string or use replace, split or any other suitable string manipulation.
    shahin likes this.
  20. Marc L

    Marc L Excel Ninja

    Messages:
    4,253

    I had the same double "/" issue with a request
    and it just needs a replace tweak …
  21. shahin

    shahin Active Member

    Messages:
    897
    It seems I've made it but the code looks like the "Joker" in "The Dark Knight" movie. It takes ages to produce the results.
    Code (vb):

    Sub Get_Conditional_Links()
        Dim IE As New InternetExplorer, HTML As HTMLDocument
        Dim post As Object, elem As Object, newlink As String
        Dim frstchr$, linklists As Variant, link As Variant

        linklists = [{"http://www.innovaprint.com.sg/","https://www.plexure.com.sg/","http://www.mount-zion.biz/","http://www.cityscape.com.sg/"}]

        For Each link In linklists
            With IE
                .Visible = True
                .navigate link
                While .Busy = True Or .readyState < 4: DoEvents: Wend
                Set HTML = .document
            End With

            For Each post In HTML.getElementsByTagName("a")
                If InStr(1, post.innerText, "contact", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
            Next post

            For Each elem In HTML.getElementsByTagName("a")
                If InStr(1, post.innerText, "about", 1) > 0 Then newlink = post.getAttribute("href"): Exit For
            Next elem

            If InStr(newlink, "http") > 0 Then
                R = R + 1: Cells(R, 1) = newlink
            Else:
                frstchr = Left(newlink, 1)
                If frstchr = "/" Then newlink = Split(newlink, frstchr)(1)
                R = R + 1: Cells(R, 1) = link & newlink
            End If
        Next link
        IE.Quit
    End Sub
     
  22. Marc L

    Marc L Excel Ninja

    Messages:
    4,253
    shahin likes this.
  23. shahin

    shahin Active Member

    Messages:
    897
    Right you are Marc L. It's time I switched to "xmlhttp" or "serverxmlhttp" request.
  24. Marc L

    Marc L Excel Ninja

    Messages:
    4,253

    You choose risk using Split if a link has more than one "/"
    like for example /adm/contact.html …​
    shahin likes this.
  25. shahin

    shahin Active Member

    Messages:
    897
    Thanks for the pointer @Marc L. This is very likely to have. So, I should choose replace function over split function in this very case, right?

Share This Page