
Convert my code from xmlhttp method into winhttp method

Discussion in 'VBA Macros' started by shahin, Feb 17, 2017.

  1. shahin

    shahin Member

    Messages:
    187
    Dear all, hope you are doing fine. I have made a scraper using the VBA XMLHTTP method, and it is parsing yell.com very smoothly at the moment. But I noticed that the site sometimes uses redirection, and as a result I end up with nothing when I run the code. So if somebody out here could help me create this same code using the WinHTTP method, I would really be indebted to them. I have never worked with that method before, which is why it is totally foreign to me. For your consideration, I am pasting below the code I have written. Thanks a trillion in advance.

    Code (vb):

    Option Explicit
    ' Requires references (Tools > References): Microsoft XML, v6.0
    ' and Microsoft HTML Object Library.
    Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
    Sub ScrapingYell()

    Dim http As New MSXML2.XMLHTTP60
    Dim html As New HTMLDocument
    Dim posts As Object, post As Object, links As Object, link As Object
    Dim x As Long, u As Long

    x = 2

        For u = 2 To 6
     
        http.Open "GET", pageurl & u, False
        http.send
        html.body.innerHTML = http.responseText
     
        Set posts = html.getElementsByClassName("row businessCapsule--title")
     
            For Each post In posts
            Set links = post.getElementsByTagName("a")(0)
            Cells(x, 1) = links.innerText
            x = x + 1
            Next post
        Next u
    End Sub
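
    As an aside on the redirection question: the WinHttpRequest object exposes an Option property that can switch off automatic redirect-following (option index 6, WinHttpRequestOption_EnableRedirects), which lets you inspect where the site is trying to send you. A minimal sketch, reusing the pageurl constant above and untested against the live site:

    Code (vb):

    Sub CheckRedirect()
    ' Sketch: ask WinHttpRequest NOT to follow redirects, then inspect
    ' the status and the Location header.
    Dim http As Object
        Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
        http.Option(6) = False          ' 6 = WinHttpRequestOption_EnableRedirects
        http.Open "GET", pageurl & 2, False
        http.send
        If http.Status = 301 Or http.Status = 302 Then
            Debug.Print "Redirected to: "; http.getResponseHeader("Location")
        Else
            Debug.Print "No redirect, status"; http.Status
        End If
        Set http = Nothing
    End Sub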
     
  2. Marc L

    Marc L Excel Ninja

    Messages:
    2,910
  3. shahin

    shahin Member

    Hi Marc L! It's nice to have you in the loop. Actually, when I decided to use this WinHTTP method, I first remembered your code posted in several places in this forum. I tried to follow your style, but I'm at beginner level, which is why my head spins when I go through your code; it's unfathomable to me. I tried to sort out the code to the best of my ability, though. Here is what I could do, mingling your style with the regular one. Don't laugh at me. Thanks in advance.

    Code (vb):

    Option Explicit
    Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="

    Sub ScrapingYell()

    Dim http As Object
    Dim html As New HTMLDocument
    Dim posts As Object, post As Object, links As Object, link As Object
    Dim x As Long, u As Long, titles As String

    x = 2

        For u = 2 To 6

        Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")

        http.Open "GET", pageurl & u, False
        http.setRequestHeader "DNT", "1"
        http.setRequestHeader "Content-Type", "text/xml"
        http.send

        html.body.innerHTML = http.responseText

        Set posts = html.getElementsByClassName("row businessCapsule--title")
            For Each post In posts
            Set links = post.getElementsByTagName("a")(0)
            Cells(x, 1) = links.innerText
            x = x + 1
            Next post
        Next u
    End Sub
     
    It is returning nothing and not showing any error either.
  4. Marc L

    Marc L Excel Ninja


    • Create the Http object only once, just before starting the loop …

    • Do not forget to free object variables before ending, like

    Set posts = Nothing


    Your problem here is using getElementsByClassName,
    which does not work on every web page !
    So it was not an issue with the Msxml2 object …

    So you must observe the initial web page's code and change your strategy !

    For example, search all "A" tag names and
    check whether the parent element has a class name starting with "col-sm-" …
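
    In code, that strategy might look like the following sketch: walk every "a" tag and keep only those whose parent carries a "col-sm-…" class. The class-name prefix comes from the hint above and may not match the live page:

    Code (vb):

    Sub GrabLinksByParentClass()
    ' Sketch of the tag-name strategy; needs a reference to
    ' Microsoft HTML Object Library, and pageurl is the module constant.
    Dim http As Object, html As New HTMLDocument
    Dim a As Object, x As Long
        Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
        http.Open "GET", pageurl & 2, False
        http.send
        html.body.innerHTML = http.responseText
        x = 2
        For Each a In html.getElementsByTagName("a")
            If Not a.parentElement Is Nothing Then
                If a.parentElement.className Like "col-sm-*" Then
                    Cells(x, 1) = a.innerText
                    x = x + 1
                End If
            End If
        Next a
        Set http = Nothing: Set html = Nothing
    End Sub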
  5. shahin

    shahin Member

    Thanks again for your sharp response. I tried to follow what you just said but am still getting no result. Perhaps I'm doing something wrong somewhere in my code. Here is what I modified according to your instructions.
    Code (vb):

    Option Explicit
    Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="

    Sub ScrapingYell()

    Dim http As Object
    Dim html As New HTMLDocument
    Dim posts As Object, post As Object, links As Object, link As Object
    Dim x As Long, u As Long

    x = 2
        Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
     
        For u = 2 To 6
     
        http.Open "GET", pageurl & u, False
        http.setRequestHeader "DNT", "1"
        http.setRequestHeader "Content-Type", "text/xml"
        http.send

        html.body.innerHTML = http.responseText
     
        Set posts = html.getElementsByClassName("col-sm-20") ' another class name here is "col-sm-24"

            For Each post In posts
            Set links = post.getElementsByTagName("a")(0)
            Cells(x, 1) = links.innerText
            x = x + 1
            Next post
         
        Next u
       Set posts = Nothing
    End Sub

     
  6. shahin

    shahin Member

    You are right, Marc L!!!!!! Changing the website, I'm getting results now. I'm not worried about how yell.com reacts to this code. However, I'm happy that it even works without request headers when I try to scrape Real YP. So, at this point I'm confused about setting request headers. Is it necessary to always put those lines in the code, or are they for POST requests? Btw, could you please suggest any method by which I won't be redirected from a webpage? Thanks again.
    Code (vb):

    Option Explicit
    Const pageurl As String = "http://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=San%20Francisco%2C%20CA&page=2"

    Sub ScrapingYell()

    Dim http As New WinHttpRequest
    Dim html As New HTMLDocument
    Dim posts As Object, post As Object, links As Object, link As Object
    Dim x As Long, u As Long

    x = 2

        http.Open "GET", pageurl, False
        http.send

        html.body.innerHTML = http.responseText
       
        Set posts = html.getElementsByClassName("info")
           
            For Each post In posts
            Set links = post.getElementsByTagName("a")(0)
            Cells(x, 1) = links.innerText
            x = x + 1
            Next post

       Set posts = Nothing
    End Sub

     
  7. Marc L

    Marc L Excel Ninja

    No ‼ You did not really follow anything, 'cause as I wrote :
    you still use getElementsByClassName in your code !

    Well, it works on my side on yell.com without this statement …


    You do not ever need to change the website !
    Just good observation, thinking, and applying an appropriate strategy …

    Concerning request headers, it depends on each webpage,
    but "POST" often requires more.

    You can't avoid redirection.
    But here I didn't get any on your initial website (yell.com),
    I just saw a bad strategy within your code !
    The idea was good but not applicable to this website …
    As there is a better tag to use under yell.com !

    It depends in fact on what exactly you wanna grab,
    'cause again you did not explain your need crystal clearly !


    And some object variables are not empty at the end of your code, free them !
  8. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,006
    FYI, from my experience, getElementsByClassName will often have issues when you don't add a reference to the library in the VBA project (i.e. the late-bind method). I can't find the source that explains why this happens, but 9 times out of 10 you can resolve it by adding a reference to the library in the project and using early binding.

    In fact, web scraping is one of the few instances where I prefer to add a reference to the library and use early binding, even though it provides less compatibility between systems (and versions of Excel, etc.).
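
    In code, the difference looks roughly like this; the early-bound half only compiles after ticking the two libraries under Tools > References:

    Code (vb):

    Sub BindingStyles()
    ' Late binding: no references needed, members resolved at run time;
    ' this is where getElementsByClassName is reported to misbehave.
    Dim httpLate As Object, htmlLate As Object
        Set httpLate = CreateObject("WinHTTP.WinHTTPRequest.5.1")
        Set htmlLate = CreateObject("htmlfile")

    ' Early binding: requires Tools > References >
    '   "Microsoft WinHTTP Services, version 5.1" and
    '   "Microsoft HTML Object Library".
    Dim httpEarly As New WinHttpRequest
    Dim htmlEarly As New HTMLDocument
    End Sub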
  9. Marc L

    Marc L Excel Ninja


    Chihiro, as you can see in the initial post,
    shahin already used references and still had the issue !

    Better than a class name is to use a tag like H2, for example,
    even if on my side it works via the class name …
  10. shahin

    shahin Member

    Hi Marc L! Using the tag name, I'm still getting nothing, as I did earlier.
    Code (vb):

    Option Explicit
    Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="

    Sub ScrapingYell()
    Dim http As Object
    Dim html As New HTMLDocument
    Dim posts As Object, post As Object, links As Object, link As Object
    Dim x As Long, u As Long

    x = 2
        Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
        For u = 2 To 6
       
            http.Open "GET", pageurl & u, False
            http.Send
       
            html.body.innerHTML = http.responseText
         
            Set posts = html.getElementsByTagName("h2")
            For Each post In posts
                Cells(x, 1) = post.innerText
                x = x + 1
            Next post
        Next u
       Set posts = Nothing
    End Sub
     
    Tried using early binding but still no luck.
  11. shahin

    shahin Member

    Dear sir Chihiro, the reason I wanted to use the WinHttpRequest method is that somewhere in this forum I read that this method has the ability to prevent redirection. However, redirection is not taking place here; rather, the site requires solving a captcha to make sure the operator is not a bot. When I wrote a full code using the XMLHTTP method, I noticed that it only grabs the data of the first page, and instead of grabbing the next page it stops working: trying manually, I saw the page required a captcha solution before showing its next page. That's it. Here is how I tried.

    Code (vb):

    Option Explicit
    Const mainurl As String = "https://www.yell.com"
    Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
    Sub ScrapingYell()

    Dim http As New MSXML2.XMLHTTP60
    Dim html As New HTMLDocument, hmm As New HTMLDocument
    Dim posts As Object, post As Object, links As Object, link As Object
    Dim cc As Object, dd As Object, ee As Object, ff As Object
    Dim pagelink As String
    Dim x As Long, u As Long, t As Long

    x = 2

        For u = 2 To 3
     
        http.Open "GET", pageurl & u, False
        http.send
        html.body.innerHTML = http.responseText
     
        Set posts = html.getElementsByClassName("row businessCapsule--title")
     
            For Each post In posts
            Set links = post.getElementsByTagName("a")(0)
            pagelink = mainurl & Replace(links.href, "about:", "")

        http.Open "GET", pagelink, False
        http.send
        hmm.body.innerHTML = http.responseText
         
         
            Set cc = hmm.getElementsByClassName("businessCapsule--title")
            Set dd = hmm.getElementsByClassName("address")
            Set ee = hmm.getElementsByClassName("business-telephone")
     
            For t = 0 To cc.Length - 1
                If cc.Length > 0 Or dd.Length > 0 Or ee.Length > 0 Then
                Cells(x, 1) = cc(t).innerText
                Cells(x, 2) = dd(t).innerText
                Cells(x, 3) = ee(t).innerText
                x = x + 1
                End If
            Next t
            Next post
        Next u
    End Sub
     
  12. Marc L

    Marc L Excel Ninja


    Test after test, I never met any issue …

    And as I wrote …
    … at least explain exactly what you need to grab ‼
  13. shahin

    shahin Member

  14. Marc L

    Marc L Excel Ninja


    Really, do not forget to release object variables ‼

    As it seems getElementsByClassName works on your side for the first page,
    try "POST" instead of "GET", if the website accepts that …

    Maybe more request headers are necessary …
  15. shahin

    shahin Member

    Thanks again for responding, Marc L. I double-checked the form element of that page. It accepts the "GET" method only. As you mentioned beforehand, the next page might be missing "getElementsByClassName"; it's not like that, though. Every page, from at least 2 to 4, has the same element I have written in my code. Anyway, perhaps you overlooked what I mentioned earlier: the website uses a "CAPTCHA" if more than one page is scraped. By the way, you will face it when you make the second "GET" request as I did and go across two pages. Thanks.
  16. Marc L

    Marc L Excel Ninja


    After running your last code, a yell webpage was already open in a web
    browser; when I asked for the next page, I got the message asking if I'm a human
    or a robot ! So this website seems to limit requests, like Google …

    In fact, when inspecting the web browser's request, there are some cookies
    specific to the browser session … They could maybe be reproduced
    within an http request (via headers), but sometimes it's a mess !

    So first use WinHttp, and in case of failure (even after adding more request headers),
    two choices :
    • stay with this website but pilot a web browser,
    • leave this website in favor of another, less restrictive one …
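
    Reproducing those session cookies would amount to copying them from the browser's network inspector and replaying them via a Cookie header. A sketch with placeholder values ("SESSIONID=xxxx" is not a real yell.com cookie):

    Code (vb):

    Sub SendWithSessionCookie()
    ' Sketch only: copy the real cookie string from your browser's
    ' developer tools; pageurl is the module constant.
    Dim http As Object
        Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
        http.Open "GET", pageurl & 2, False
        http.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64)"
        http.setRequestHeader "Cookie", "SESSIONID=xxxx"
        http.send
        Debug.Print http.Status; " "; http.statusText
        Set http = Nothing
    End Sub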
  17. Marc L

    Marc L Excel Ninja

    I just tried it on the names under yell.com, and
    it seems to work with these headers :
    Code (vb):
    Sub Demo00()
      Const URL = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
        Dim P%, oReq As New WinHttpRequest, oDoc As New HTMLDocument, oElt As HTMLHeaderElement
        For P = 1 To 4
                oReq.Open "GET", URL & P, False
                oReq.setRequestHeader "DNT", "1"
                oReq.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64)"
                oReq.send
             If oReq.Status = 200 Then
                    oDoc.body.innerHTML = oReq.responseText
                    Debug.Print vbLf; "Page"; P; ":"
                For Each oElt In oDoc.getElementsByTagName("H2")
                    Debug.Print oElt.innerText
                Next
             End If
        Next
            Set oReq = Nothing:  Set oDoc = Nothing
    End Sub
    Results in VBE Immediate window …
  18. Marc L

    Marc L Excel Ninja


    The easy way : just use VBA's built-in text functions on responseText !

    I'm on the road now, I'll be back after at least 5 hours or next day …
  19. shahin

    shahin Member

    Dear Marc L, I tested the code using two different requests in a single subroutine, following your suggestion: the first one XMLHTTP and the second one WinHttp. However, I found that it works about 80%; that is, across two pages there are 30 names and docs, out of which it is able to extract 20 records irrespective of the captcha. Thanks a zillion, dear. I hope that in your free time, with your soft touch on my code, the performance might reach 100%. Have a nice trip.
  20. Marc L

    Marc L Excel Ninja

    On my side, 60 names are extracted from pages 1 to 4
    (plus address, locality, postal code, region and telephone),
    whether using Msxml2 or WinHttp :
    Code (vb):
    Sub Demo()
          Const URL = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
            Dim L&, P%, SPQ$(), VA$(), N&, R&
                L = 2
                ActiveSheet.UsedRange.Offset(1).Clear
                On Error Resume Next
        With New WinHttpRequest     ' or XMLHTTP
           For P = 1 To 4
                With ActiveWindow.VisibleRange.Rows
                    If L > .Item(.Count).Row Then ActiveWindow.LargeScroll 1
                End With
                    Cells(L, 1).Value = " Page " & P & " :"
                    Err.Clear
                   .Open "GET", URL & P, False
                   .setRequestHeader "DNT", "1"
                   .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64)"
                   .send
                If Err.Number Then
                    Debug.Print P; ": #"; Err.Number; " "; Err.Description
                ElseIf .Status = 200 Then
                      SPQ = Split(.responseText, "<h2 itemprop=""name"">")
                        N = UBound(SPQ)
                   ReDim VA(1 To N, 1 To 6)
                    For R = 1 To N
                        VA(R, 1) = Split(SPQ(R), "<")(0)
                        VA(R, 2) = Split(Split(SPQ(R), "itemprop=""streetAddress"">")(1), "<")(0)
                        VA(R, 3) = Split(Split(SPQ(R), "itemprop=""addressLocality"">")(1), "<")(0)
                        VA(R, 4) = Split(Split(SPQ(R), "itemprop=""postalCode"">")(1), "<")(0)
                        VA(R, 5) = Split(Split(SPQ(R), "itemprop=""addressRegion"">")(1), "<")(0)
                        VA(R, 6) = Split(Split(SPQ(R), "itemprop=""telephone"">")(1), "<")(0)
                    Next
                        Cells(L, 2).Resize(N, 6).Value = VA
                        L = L + N
                Else
                    Debug.Print P; ": status "; .Status; " "; .statusText
                End If
            Next
        End With
    End Sub
    Errors are displayed within Immediate window …
  21. shahin

    shahin Member

    @Marc L
    OMG!!!! You are such a gem!!!!!! Your code is working perfectly. I wish I could hit the like button a million times. However, I am still unable to find out what was wrong with my code. Btw, can you give me a link to another place where you used this responseText style, because I am very willing to learn it? I'm still wondering how you could get past the "CAPTCHA"!!! Thanks a trillion.
  22. Marc L

    Marc L Excel Ninja

  23. shahin

    shahin Member

    Hi Marc L! Hope you are doing fine. The MSDN website is a vast area to search for anything specific; it's even harder for a novice like me. I like your responseText style very much, and to learn it I tried a lot throughout the day to find a suitable source to start from. However, the style is perhaps not that common, because I could not find any match. If you can give me a hint as to which link or site I should follow, I would be very happy. Thanks a lot for everything.
  24. Marc L

    Marc L Excel Ninja

    No, the search is at kid level : try this one !

    So responseText is just the request's answer, as explained on MSDN …

    Maybe within the Web Scraping tutorial …
    You can also learn from threads on the same subject !

    So here the issue was not the xmlhttp request method
    but just understanding how an easy webpage works,
    just by reading its HTML source code …
  25. shahin

    shahin Member

    Thanksssssssssssss.:)
