My crawler skips the content of first page

shahin

Hi there! I've written some code to parse the name, address and phone number of different shops from yell.com. Given any link, my crawler parses the whole content no matter how many pages it is spread across. However, the one problem I have found is that it always skips the content of the first page: if there are 10 pages, my crawler scrapes only the last 9. A small tweak might be all I need for a workaround. Here is the complete code. Thanks in advance.

Code:
Sub YellUK()
Const mlink = "https://www.yell.com"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
Dim post As HTMLHtmlElement, page As Object, newlink As String

With http
    .Open "GET", "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001", False
    .send
    html.body.innerHTML = .responseText
End With
Set page = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")
For i = 0 To page.Length - 1
    newlink = mlink & Replace(page(i).href, "about:", "")
    With http
        .Open "GET", newlink, False
        .send
        htm.body.innerHTML = .responseText
    End With

    For Each post In htm.getElementsByClassName("js-LocalBusiness")
        x = x + 1
        With post.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("a")
            If .Length Then Cells(x + 1, 1) = .Item(0).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 1 Then Cells(x + 1, 2) = .Item(1).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 2 Then Cells(x + 1, 3) = .Item(2).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 3 Then Cells(x + 1, 4) = .Item(3).innerText
        End With
        With post.getElementsByClassName("businessCapsule--tel")
            If .Length > 1 Then Cells(x + 1, 5) = .Item(1).innerText
        End With
    Next post
Next i
End Sub
 
Hi!

The code just does what is written: it goes from the second page to the last one …

Code line #12: if you subtract one from page.Length,
why do you start from 1, which is the second page?!
 
Thanks, Marc L, for your answer. It is always a pleasure to have you in the loop. Anyway, I did that because the scraped links look exactly like this:

----2
----3
----4
----5
----6
----7
----8
----9
----10
----2
 
So, with your object variable named page:

• The number of elements within this page collection
is given by its Length property.

• As the last element number within this collection is Length - 1,
the first element cannot be 1: 1 is the number of the second element!

So what is the number of the first element within this collection?!
You already know it, as in code line #11 …
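
To make the zero-based numbering concrete, here is a minimal sketch rather than the code from this thread. It assumes a reference to the Microsoft HTML Object Library is set; the helper name AnchorIndexReport and the shortened markup are made up for illustration. It lists each anchor of a small HTML fragment next to its index, which should run from 0 to Length - 1.

```vba
' Requires a reference to "Microsoft HTML Object Library" (Tools > References).
' AnchorIndexReport is a hypothetical helper name used only for this sketch.
Function AnchorIndexReport(ByVal markup As String) As String
    Dim doc As New HTMLDocument, links As Object, i As Long, report As String
    doc.body.innerHTML = markup
    Set links = doc.getElementsByTagName("a")
    ' Element collections are zero-based: valid indexes run from 0 to Length - 1
    For i = 0 To links.Length - 1
        report = report & i & ": " & links(i).innerText & vbCrLf
    Next i
    AnchorIndexReport = report
End Function

Sub ShowAnchorIndexes()
    ' Made-up markup: the current page is a <span>, so it is not among the anchors
    Debug.Print AnchorIndexReport( _
        "<span>1</span><a href=""/x?pageNum=2"">2</a>" & _
        "<a href=""/x?pageNum=3"">3</a><a href=""/x?pageNum=2"">Next</a>")
End Sub
```

Running ShowAnchorIndexes prints one line per anchor in the Immediate window, so the index of the first anchor can be read off directly.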
 
Change made to the length. FYC, I'm going to paste here the HTML element in which the page numbers are stored. By the way, find the "pageNum=" portion in the pasted element and you will see why I used the Length property differently in my code.

Code:
<div class="row pagination">
<div class="col-sm-24">
&nbsp;<span class="pagination--page is-selected">1</span>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=2" data-tracking="DISPLAY:PAGINATION:NUMBER">2</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=3" data-tracking="DISPLAY:PAGINATION:NUMBER">3</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=4" data-tracking="DISPLAY:PAGINATION:NUMBER">4</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=5" data-tracking="DISPLAY:PAGINATION:NUMBER">5</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=6" data-tracking="DISPLAY:PAGINATION:NUMBER">6</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=7" data-tracking="DISPLAY:PAGINATION:NUMBER">7</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=8" data-tracking="DISPLAY:PAGINATION:NUMBER">8</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=9" data-tracking="DISPLAY:PAGINATION:NUMBER">9</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=10" data-tracking="DISPLAY:PAGINATION:NUMBER">10</a>
&nbsp;<a rel="nofollow" class="pagination--next" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=2" data-tracking="DISPLAY:PAGINATION:NEXT">Next</a>
</div>
</div>
 
Edit: these are not really page numbers, just text …

On my side, starting the loop from zero brings more data than starting from one!

Provided, of course, that the loop ending is not changed as you did,
because the last page index is still page.Length - 1!
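
One way to check which page each pagination link really targets is to read the number after "pageNum=" in its href. The function below is only a sketch: the name PageNumFromHref is hypothetical, and it assumes the hrefs carry a numeric pageNum parameter like the ones pasted above.

```vba
' Hypothetical helper: returns the number following "pageNum=" in a link,
' or 0 when the link has no such parameter.
Function PageNumFromHref(ByVal href As String) As Long
    Dim p As Long
    p = InStr(1, href, "pageNum=", vbTextCompare)
    ' Val reads leading digits and stops at the first non-numeric character
    If p > 0 Then PageNumFromHref = Val(Mid$(href, p + Len("pageNum=")))
End Function
```

Printing PageNumFromHref(page(i).href) for every i from 0 to page.Length - 1 would show whether any page is requested twice or not at all.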
 
This attached file is for Sir Chihiro and Marc L, in the hope of getting advice on how I could improve my code. If you run the macro and feed any URL from yell.com to the input box, you will see that the parser does its job perfectly whether or not the URL has pagination. Once you open the code, though, you may be a little frustrated to see that what I did there is quite disorganized. I hope you can show me a way to improve it. Thanks in advance.
 


As the code does the job, what is the matter?
Code you have created yourself is code you can easily maintain!

Anyway, here is a demonstration revisiting your code:
Code:
' Requires references to "Microsoft HTML Object Library" and "Microsoft XML, v6.0"
Dim oDoc As New HTMLDocument, oReq As New XMLHTTP60, R&

' Requests one results page into oDoc and writes each listing to the next row
Sub YellRequest(ByVal URL$)
    Dim oElt As HTMLHtmlElement, C&
    oReq.Open "GET", "https://www.yell.com" & URL, False
    oReq.send
    oDoc.body.innerHTML = oReq.responseText
    For Each oElt In oDoc.getElementsByClassName("js-LocalBusiness")
        R = R + 1
        With oElt.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("A")
            If .Length Then Cells(R, 1).Value = .Item(0).innerText
        End With
        With oElt.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("SPAN")
            ' Columns 2 to 4 take spans 1 to 3, as far as they exist
            For C = 2 To Application.Min(.Length, 4)
                Cells(R, C).Value = .Item(C - 1).innerText
            Next
        End With
        With oElt.getElementsByClassName("businessCapsule--tel")
            If .Length > 1 Then Cells(R, 5).Value = .Item(1).innerText
        End With
    Next
End Sub

Sub Demo1()
    Dim oCol As New Collection, I&, V
    R = 1
    ' The first call parses page 1 and leaves its HTML in oDoc
    YellRequest "/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001"
    ' Store the links first: each later YellRequest call overwrites oDoc.
    ' Item 0 (page 2) is skipped; judging by the pagination HTML posted earlier,
    ' the final "Next" link also points to page 2, so that page is requested last.
    With oDoc.getElementsByClassName("row pagination")(0).getElementsByTagName("A")
        For I = 1 To .Length - 1
            oCol.Add Replace(.Item(I).href, "about:", "")
        Next
    End With
    For Each V In oCol
        YellRequest V
    Next
    Set oCol = Nothing:  Set oDoc = Nothing:  Set oReq = Nothing
End Sub
Do you like it? Then please click Like at the bottom right!
 
Or like this:
Code:
Sub Demo2()
    Dim hDoc As New HTMLDocument, I&
    R = 1
    YellRequest "/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001"
    hDoc.body.innerHTML = oReq.responseText
    With hDoc.getElementsByClassName("row pagination")(0).getElementsByTagName("A")
        For I = 1 To .Length - 1
            YellRequest Replace(.Item(I).href, "about:", "")
        Next
    End With
    Set hDoc = Nothing:  Set oDoc = Nothing:  Set oReq = Nothing
End Sub
 
No way! It works like magic. Thanks a ton. And again, your code is not that easy for me to understand at first.
 
Every time I go through your code, something new comes along, such as the Collection here. Very precise and to the point. It handles paginated results very smoothly. However, with single-page data it does its job but then breaks, which is why I had to use an error handler. A URL of that type is:
"/ucs/UcsSearchAction.do?keywords=uranium+mining&location=United+Kingdom&scrambleSeed=1211913649"
 
No need for On Error with:
Code:
Sub Demo1()
    Dim oCol As New Collection, I&, V
    R = 1
    YellRequest "/ucs/UcsSearchAction.do?keywords=uranium+mining&location=United+Kingdom&scrambleSeed=1211913649"
    With oDoc.getElementsByClassName("row pagination")
        If .Length Then
            With .Item(0).getElementsByTagName("A")
                For I = 1 To .Length - 1
                    oCol.Add Replace(.Item(I).href, "about:", "")
                Next
            End With
        End If
    End With
    For Each V In oCol
        YellRequest V
    Next
    Set oCol = Nothing:  Set oDoc = Nothing:  Set oReq = Nothing
End Sub
or via:
Code:
Sub Demo2()
    Dim hDoc As New HTMLDocument, I&
    R = 1
    YellRequest "/ucs/UcsSearchAction.do?keywords=uranium+mining&location=United+Kingdom&scrambleSeed=1211913649"
    hDoc.body.innerHTML = oReq.responseText
    With hDoc.getElementsByClassName("row pagination")
        If .Length Then
            With .Item(0).getElementsByTagName("A")
                For I = 1 To .Length - 1
                    YellRequest Replace(.Item(I).href, "about:", "")
                Next
            End With
        End If
    End With
    Set hDoc = Nothing:  Set oDoc = Nothing:  Set oReq = Nothing
End Sub
for this case …
 
Voila! It solves every single thing I was worried about. You are one of the great teachers. Thanks Marc L for everything.
 