
Convert my code from the XMLHTTP method into the WinHTTP method

shahin

Dear all, hope you are doing fine. I have made a scraper using the VBA XMLHTTP method which is parsing yell.com very smoothly at the moment. But I noticed that the site sometimes uses a redirection method, so I end up with nothing when I run the code. If somebody here could help me create this same code using the WinHTTP method, I would really be indebted to them. I have never worked with this method before, which is why it is totally foreign to me. For your consideration I am pasting below the code I have written. Thanks a trillion in advance.

Code:
Option Explicit
Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
Sub ScrapingYell()

Dim http As New MSXML2.XMLHTTP60    ' requires reference: Microsoft XML, v6.0
Dim html As New HTMLDocument        ' requires reference: Microsoft HTML Object Library
Dim posts As Object, post As Object, links As Object, link As Object
Dim x As Long, u As Long

x = 2

    For u = 2 To 6
  
    http.Open "GET", pageurl & u, False
    http.send
    html.body.innerHTML = http.responseText
  
    Set posts = html.getElementsByClassName("row businessCapsule--title")
  
        For Each post In posts
        Set links = post.getElementsByTagName("a")(0)
        Cells(x, 1) = links.innerText
        x = x + 1
        Next post
    Next u
End Sub
 
Hi Marc L! It's nice to have you in the loop. Actually, when I decided to use this WinHTTP method, the first thing I remembered was your code, posted in several places on this forum. I tried to follow your style, but I'm at a beginner level, which is why my head spins when I go through your code. It's unfathomable to me. I tried to sort out the code to the best of my ability, though. Here is what I could do, mingling your style with the regular style. Don't laugh at me. Thanks in advance.

Code:
Option Explicit
Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="

Sub ScrapingYell()

Dim http As Object
Dim html As New HTMLDocument
Dim posts As Object, post As Object, links As Object, link As Object
Dim x As Long, u As Long, titles As String

x = 2

    For u = 2 To 6

    Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")

    http.Open "GET", pageurl & u, False
    http.setRequestHeader "DNT", "1"
    http.setRequestHeader "Content-Type", "text/xml"
    http.send

    html.body.innerHTML = http.responseText

    Set posts = html.getElementsByClassName("row businessCapsule--title")
        For Each post In posts
        Set links = post.getElementsByTagName("a")(0)
        Cells(x, 1) = links.innerText
        x = x + 1
        Next post
    Next u
End Sub

It is returning nothing and not showing any error either.
 

• Create the Http object only once! Just before starting the loop …

• Do not forget to free object variables before ending, like

Set posts = Nothing


Your problem here is using getElementsByClassName,
which does not work on every web page!
So it was not an issue with the Msxml2 object …

So you must inspect the initial web page's source code and change your strategy!

For example, search all "A" tag names and
check whether the parent element has a class name starting with "col-sm-" …
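A minimal sketch of that strategy, under the assumption that `pageurl` is the Const from the code above and that the results go in column A starting at row 2; this is an illustration of the parent-class idea, not Marc L's actual code:

```vba
Sub LinksUnderColSm()
    ' Hedged sketch: fetch one page, then keep only <a> tags whose
    ' parent element's class name starts with "col-sm-".
    Dim http As Object, html As New HTMLDocument
    Dim a As Object, x As Long
    Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
    http.Open "GET", pageurl & 2, False   ' pageurl: the Const declared above
    http.send
    html.body.innerHTML = http.responseText
    x = 2
    For Each a In html.getElementsByTagName("a")
        If LCase$(a.parentElement.className) Like "col-sm-*" Then
            Cells(x, 1) = a.innerText
            x = x + 1
        End If
    Next a
    Set http = Nothing
End Sub
```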
 
Thanks again for your prompt response. I tried to follow what you just said, but I'm still getting no result. Perhaps I'm doing something wrong somewhere in my code. Here is what I modified according to your instructions.
Code:
Option Explicit
Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="

Sub ScrapingYell()

Dim http As Object
Dim html As New HTMLDocument
Dim posts As Object, post As Object, links As Object, link As Object
Dim x As Long, u As Long

x = 2
    Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
  
    For u = 2 To 6
  
    http.Open "GET", pageurl & u, False
    http.setRequestHeader "DNT", "1"
    http.setRequestHeader "Content-Type", "text/xml"
    http.send

    html.body.innerHTML = http.responseText
  
    Set posts = html.getElementsByClassName("col-sm-20") ''Another class name "col-sm-24"

        For Each post In posts
        Set links = post.getElementsByTagName("a")(0)
        Cells(x, 1) = links.innerText
        x = x + 1
        Next post
      
    Next u
   Set posts = Nothing
End Sub
 
You are right, Marc L!!!!!! Changing the website, I'm getting results now. I'm not worried about how yell.com reacts to this code. However, I'm happy that it even works without request headers when I try to scrape the real YP. So, at this point I'm confused about setting request headers. Is it necessary to always put those lines in the code, or are they only for POST requests? Btw, could you please suggest any method by which I won't be redirected from a webpage? Thanks again.
Code:
Option Explicit
Const pageurl As String = "http://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=San%20Francisco%2C%20CA&page=2"

Sub ScrapingYell()

Dim http As New WinHttpRequest
Dim html As New HTMLDocument
Dim posts As Object, post As Object, links As Object, link As Object
Dim x As Long, u As Long

x = 2

    http.Open "GET", pageurl, False
    http.send

    html.body.innerHTML = http.responseText
   
    Set posts = html.getElementsByClassName("info")
       
        For Each post In posts
        Set links = post.getElementsByTagName("a")(0)
        Cells(x, 1) = links.innerText
        x = x + 1
        Next post

   Set posts = Nothing
End Sub
 
I tried to follow what you just said.
No ‼ You did not follow anything at all, because as I wrote:​
Your problem here is using getElementsByClassName,
which does not work on every web page!
So it was not an issue with the Msxml2 object …

So you must inspect the initial web page's source code and change your strategy!

For example, search all "A" tag names …
And you still use getElementsByClassName in your code!

Well, it works on my side on yell.com even without this statement …​


Changing the website, I'm getting results now. I'm not worried about how yell.com reacts to this code. However, I'm happy that it even works without request headers when I try to scrape the real YP. So, at this point I'm confused about setting request headers. Is it necessary to always put those lines in the code, or are they only for POST requests? Btw, could you please suggest any method by which I won't be redirected from a webpage? Thanks again.
You never need to change the website!
Just good observation, thinking, and applying an appropriate strategy …

Concerning request headers, it depends on each webpage,
but "POST" often requires more.

You can't avoid redirection.
But here I didn't encounter any on your initial website (yell.com),
I just saw a bad strategy within your code!
The idea was good, but not applicable to this website …
as there is a better tag to use on yell.com!

In fact it depends on what exactly you want to grab,
because again you did not explain your need crystal clearly!


And some object variables are not empty at the end of your code; free them!
 
FYI - From my experience, getElementsByClassName will often have issues when you don't add a reference to the library in the VBA project (i.e. the late-bind method). I can't find the source that explains why this happens, but 9 times out of 10 you can resolve it by adding a reference to the library in the project and using early binding.

In fact, web scraping is one of the few instances where I prefer to add a reference to the library and use early binding, even though it provides less compatibility between systems (and versions of Excel, etc.).
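For readers new to the distinction, a quick sketch of the two binding styles Chihiro describes; the library names are the standard entries in Tools > References:

```vba
Sub BindingDemo()
    ' Early bound: needs references to "Microsoft XML, v6.0" and
    ' "Microsoft HTML Object Library"; gives IntelliSense and
    ' compile-time type checking.
    Dim httpEarly As MSXML2.XMLHTTP60
    Set httpEarly = New MSXML2.XMLHTTP60

    ' Late bound: no reference required; every member call is
    ' resolved at run time.
    Dim httpLate As Object
    Set httpLate = CreateObject("MSXML2.XMLHTTP.6.0")

    Set httpEarly = Nothing: Set httpLate = Nothing
End Sub
```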
 

Chihiro, as you can see in the initial post,
shahin already used references and still had the issue!

Better than a class name is to use a tag such as H2,
even if on my side it works via the class name …
 
Hi Marc L! Using the tag name I'm still getting nothing, just as before.
Code:
Option Explicit
Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="

Sub ScrapingYell()
Dim http As Object
Dim html As New HTMLDocument
Dim posts As Object, post As Object, links As Object, link As Object
Dim x As Long, u As Long

x = 2
    Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
    For u = 2 To 6
   
        http.Open "GET", pageurl & u, False
        http.Send
   
        html.body.innerHTML = http.responseText
     
        Set posts = html.getElementsByTagName("h2")
        For Each post In posts
            Cells(x, 1) = post.innerText
            x = x + 1
        Next post
    Next u
   Set posts = Nothing
End Sub

Tried using early binding but still no luck.
 
Dear sir Chihiro, the reason I wanted to use the "WinHttpRequest" method is that somewhere on this forum I read that this method has the ability to prevent redirection. However, redirection is not taking place here; rather, the site requires a captcha solution to make sure the operator is not a bot. When I wrote a full code using the "XMLHTTP" method, I noticed that it only grabs the data of the first page; instead of grabbing the next page it stops working, because on manual inspection I saw that the page required solving a captcha before showing its next page. That's it. Here is how I tried.

Code:
Option Explicit
Const mainurl As String = "https://www.yell.com"
Const pageurl As String = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
Sub ScrapingYell()

Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument, hmm As New HTMLDocument
Dim posts As Object, post As Object, links As Object, link As Object
Dim cc As Object, dd As Object, ee As Object, ff As Object
Dim pagelink As String
Dim x As Long, u As Long, t As Long

x = 2

    For u = 2 To 3
  
    http.Open "GET", pageurl & u, False
    http.send
    html.body.innerHTML = http.responseText
  
    Set posts = html.getElementsByClassName("row businessCapsule--title")
  
        For Each post In posts
        Set links = post.getElementsByTagName("a")(0)
        pagelink = mainurl & Replace(links.href, "about:", "")

    http.Open "GET", pagelink, False
    http.send
    hmm.body.innerHTML = http.responseText
      
      
        Set cc = hmm.getElementsByClassName("businessCapsule--title")
        Set dd = hmm.getElementsByClassName("address")
        Set ee = hmm.getElementsByClassName("business-telephone")
  
        For t = 0 To cc.Length - 1
            ' guard each collection separately: their lengths can differ
            Cells(x, 1) = cc(t).innerText
            If t < dd.Length Then Cells(x, 2) = dd(t).innerText
            If t < ee.Length Then Cells(x, 3) = ee(t).innerText
            x = x + 1
        Next t
        Next post
    Next u
End Sub
 

Test after test, I never met any issue …​

And as I wrote …
In fact it depends on what exactly you want to grab,
because again you did not explain your need crystal clearly!
… at least explain exactly what you need to grab ‼
 

Do not forget to release object variables ‼

As it seems getElementsByClassName works on your side for the first page,
try "POST" instead of "GET" if the website accepts it …

Maybe more request headers are necessary …
 
Thanks again for responding, Marc L. I double-checked the form element of that page. It accepts the "GET" method only. As you mentioned beforehand, the next page might be missing the elements getElementsByClassName looks for, but it's not like that: every page, at least from 2 to 4, has the same element I have written in my code. Anyway, perhaps you overlooked what I mentioned earlier: the website shows a "CAPTCHA" if more than one page is scraped. By the way, you will face it when you make the second "GET" request as I did and go across two pages. Thanks.
 

After running your last code, a yell webpage was already open
in a web browser; when I asked for the next page, I got the message asking
whether I'm a human or a robot! So this website seems to limit requests, like Google …

In fact, when inspecting the web browser's request, there are some cookies
specific to the browser session … They could perhaps be reproduced
within an http request (via headers), but sometimes it's a mess!

So first use WinHttp and, in case of failure (after adding more request headers),
two choices:
• stay with this website but pilot a web browser,
• leave this website in favor of another, less restrictive one …
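If you try the session-cookie route, a hedged sketch might look like the following; the cookie name and value are placeholders I invented, to be replaced with the real Cookie header captured from the browser's developer tools (Network tab) for the same site:

```vba
Sub CookieReplayDemo()
    ' Hedged sketch: replay a browser session cookie in a WinHttp request.
    ' "SESSION=placeholder" is NOT a real value; substitute the Cookie
    ' header copied from a live browser session.
    Dim req As Object
    Set req = CreateObject("WinHTTP.WinHTTPRequest.5.1")
    req.Open "GET", "https://www.yell.com/", False
    req.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64)"
    req.setRequestHeader "Cookie", "SESSION=placeholder"
    req.send
    Debug.Print req.Status   ' 200 if the server accepted the request
    Set req = Nothing
End Sub
```

Whether this defeats the captcha depends entirely on how the site ties the cookie to the browser; as Marc L says, sometimes it's a mess.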
 
I just tried it on the names under yell.com and
it seems to work with these headers:​
Code:
Sub Demo00()
  Const URL = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
    Dim P%, oReq As New WinHttpRequest, oDoc As New HTMLDocument, oElt As HTMLHeaderElement
    For P = 1 To 4
            oReq.Open "GET", URL & P, False
            oReq.setRequestHeader "DNT", "1"
            oReq.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64)"
            oReq.send
         If oReq.Status = 200 Then
                oDoc.body.innerHTML = oReq.responseText
                Debug.Print vbLf; "Page"; P; ":"
            For Each oElt In oDoc.getElementsByTagName("H2")
                Debug.Print oElt.innerText
            Next
         End If
    Next
        Set oReq = Nothing:  Set oDoc = Nothing
End Sub
Results in VBE Immediate window …
 
Dear Marc L, I tested the code using two different requests in a single subroutine, following your suggestion: first "XMLHTTP" and second "WinHttp". However, I found that it works about 80%: across two pages there are 30 names, out of which it is able to extract 20 records irrespective of the captcha. Thanks a zillion, dear. I hope that in your free time, with your soft touch on my code, the performance might reach 100%. Have a nice trip.
 
On my side 60 names are extracted from pages 1 to 4
(plus address, locality, postal code, region and telephone),
whether using Msxml2 or WinHttp:​
Code:
Sub Demo()
      Const URL = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=cafes+%26+coffee+shops&location=all+states&scrambleSeed=1864223494&pageNum="
        Dim L&, P%, SPQ$(), VA$(), N&, R&
            L = 2
            ActiveSheet.UsedRange.Offset(1).Clear
            On Error Resume Next
    With New WinHttpRequest     ' or XMLHTTP
        For P = 1 To 4
            With ActiveWindow.VisibleRange.Rows
                If L > .Item(.Count).Row Then ActiveWindow.LargeScroll 1
            End With
                Cells(L, 1).Value = " Page " & P & " :"
                Err.Clear
               .Open "GET", URL & P, False
               .setRequestHeader "DNT", "1"
               .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64)"
               .send
            If Err.Number Then
                Debug.Print P; ": #"; Err.Number; " "; Err.Description
            ElseIf .Status = 200 Then
                  SPQ = Split(.responseText, "<h2 itemprop=""name"">")
                    N = UBound(SPQ)
               ReDim VA(1 To N, 1 To 6)
                For R = 1 To N
                    VA(R, 1) = Split(SPQ(R), "<")(0)
                    VA(R, 2) = Split(Split(SPQ(R), "itemprop=""streetAddress"">")(1), "<")(0)
                    VA(R, 3) = Split(Split(SPQ(R), "itemprop=""addressLocality"">")(1), "<")(0)
                    VA(R, 4) = Split(Split(SPQ(R), "itemprop=""postalCode"">")(1), "<")(0)
                    VA(R, 5) = Split(Split(SPQ(R), "itemprop=""addressRegion"">")(1), "<")(0)
                    VA(R, 6) = Split(Split(SPQ(R), "itemprop=""telephone"">")(1), "<")(0)
                Next
                    Cells(L, 2).Resize(N, 6).Value = VA
                    L = L + N
            Else
                Debug.Print P; ": status "; .Status; " "; .statusText
            End If
        Next
    End With
End Sub
Errors are displayed within Immediate window …
 
@Marc L
OMG!!!! You are such a gem!!!!!! Your code works perfectly. I wish I could hit the like button a million times. However, I am still unable to find out what was wrong with my code. Btw, can you give me a link where you used this (responseText) style another time, because I am very willing to learn it. I'm still wondering how you could get past the "CAPTCHA"!!! Thanks a trillion.
 
Hi Marc L! Hope you are doing fine. The MSDN website is a vast area to search for anything specific, and it's even harder for a novice like me. I like your "responseText" style very much, and to learn it I tried throughout the day to find a suitable source to start from. However, the style is perhaps not that common, because I could not find any match. If you can give me a hint as to which link or site I should follow, I would be very happy. Thanks a lot for everything.
 
No, searching is kid-level stuff: try this one!

So responseText is just the request's answer, as explained on MSDN …

Maybe look within a Web Scraping tutorial.
You can also learn from threads on the same subject!

So here the issue was not with using the xmlhttp request method,
but just with understanding how a simple webpage works,
just by reading its HTML source code …
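To see the responseText-splitting idea in isolation, here is a tiny self-contained demo on an invented HTML snippet; the markers mirror the ones in Marc L's Demo above:

```vba
Sub SplitDemo()
    ' Cut the raw HTML string on the marker that precedes each value,
    ' then cut each piece again on "<" to isolate the text.
    Dim htmlText As String, parts() As String, r As Long
    htmlText = "<h2 itemprop=""name"">Cafe One</h2> … " & _
               "<h2 itemprop=""name"">Cafe Two</h2>"
    parts = Split(htmlText, "<h2 itemprop=""name"">")
    For r = 1 To UBound(parts)
        Debug.Print Split(parts(r), "<")(0)   ' prints: Cafe One, Cafe Two
    Next r
End Sub
```

Element 0 of the first Split is everything before the first marker, which is why the loop starts at 1.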
 