• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Multipage web-scraping without even knowing the last page number

shahin

Active Member
I was very happy running my code for a site to crawl the titles of different tutorials spreading across several pages. I tried to write some code not depending on the last page number the url has but on the status code until it shows http.status<>200. The code I'm pasting below is working impeccably in this case. However, Trouble comes up when I try to use another url to see whether it breaks automatically but found that the code did fetch all the results but did not break. What is the workaround in this case? Here is the working one?

Code:
Sub wise_owl()

Const mlink = "http://www.wiseowl.co.uk/videos/default"
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim post As Object

y = 1
Do While True

    With http
        .Open "GET", mlink & "-" & y & ".htm", False
        .send
        If .Status <> 200 Then
            MsgBox "It's done"
            Exit Sub
        End If
        html.body.innerHTML = .responseText
    End With

    For Each post In html.getElementsByClassName("woVideoListDefaultSeriesTitle")
    With post.getElementsByTagName("a")
        x = x + 1
        If .Length Then Cells(x, 1) = .item(0).innerText
    End With
    Next post
    y = y + 1
Loop
End Sub
 
Issue always comes from a bad logic !

And you have different ways :
- calculate last page number from first page, is is so difficult ? :rolleyes:
- If exists a Next button ?
- If exists at least an item in the page ?
- Compare page number read to the one of its URL …
 
@Marc L,
That is why I consider you the legend when it comes to play with something different beyond my knowledge. You dug out several logic whereas I was able to think nothing. Btw, I think I should go with this "If exists a Next button?" but how can I? Please excuse my ignorance. Thanks for your comment.
 
As usual, inpect the item within your webbrowser to see its id, its classe, …

Within your code, set an object variable to this item and if the variable
Is Nothing, so the item does not exist in the page …
Or directly check webpage code via InStr statement …
 
Hi Marc L, if this is what you suggested, it works. However, it fails to scrape the last page because there is not any next button. A disabled button is there which is under class="disabled" but I can't place this in my scraper so that it can do the last page as well.

Code:
Sub yp()
Const mlink = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA&page="
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim post As Object

y = 10
Do While True
    With http
        .Open "GET", mlink & y, False
        .send
        html.body.innerHTML = .responseText
    End With
    If InStr(http.responseText, "next ajax-page") Then
        For Each post In html.getElementsByClassName("n")
            With post.getElementsByTagName("span")
                x = x + 1
                If .Length Then Cells(x, 1) = .item(0).innerText
            End With
        Next post
    Else: Exit Do
    End If
    y = y + 1
Loop
MsgBox "It's done"
End Sub
 
Just a bad logic, observe last page with your webbrowser !

Read first data on current page then
if not exists Next button the loop must ends …

Do
…​
Loop While InStr()
 
@Marc L,
Perhaps I could not make you understand what i meant. The last code I pasted above works and is able to break the loop but It breaks without scraping the last page because there is not any next button in the last page. This is it.
 
This is the solution I was after. Whatever the last page number is just put any number smaller than that in the url, it will work flawlessly:
Code:
Sub yp()
Const mlink = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA&page="
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim posts As Object, post As Object

y = 8
Do
    With http
        .Open "GET", mlink & y, False
        .send
        html.body.innerHTML = .responseText
    End With
  
        Set posts = html.getElementsByClassName("n")
        On Error GoTo Endofpage
        Debug.Print Len(posts) 'to force Error 91
  
        For Each post In posts
            With post.getElementsByTagName("span")
                x = x + 1
                If .Length Then Cells(x, 1) = .item(0).innerText
            End With
        Next post
        y = y + 1
Endofpage:
Loop Until Err.Number = 91
MsgBox "It's over"
End Sub
 
@Marc L,
Perhaps I could not make you understand what i meant. The last code I pasted above works and is able to break the loop but It breaks without scraping the last page because there is not any next button in the last page. This is it.
Misobservation of the webpage, always a bad logic
as Next button is independant from data !
So first read data and after check the button.

And your code does not work on my side with several computers.
Try from first page …
 
I tried with two other sites [both of them from it's first page] and found no issues at all. This is the other one:
Code:
Sub yify()
Const mlink = "https://www.yify-torrent.org/genres/western/p-"
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim post As Object, posts As Object

y = 1
Do
    With http
        .Open "GET", mlink & y & "/", False
        .send
        html.body.innerHTML = .responseText
    End With

    Set posts = html.getElementsByClassName("mv")
    On Error GoTo Endofpage
    Debug.Print Len(posts) 'to force Error 91

    For Each post In posts
        With post.getElementsByTagName("div")
            x = x + 1
            If .Length Then Cells(x, 1) = .item(0).innerText
        End With
    Next post
    y = y + 1
Endofpage:
Loop Until Err.Number = 91
MsgBox "It's over"
End Sub
 
Last edited:
Just a bad logic, observe last page with your webbrowser !

Read first data on current page then
if not exists Next button the loop must ends …

Do
…​
Loop While InStr()

So no need to trap any error :​
Code:
Sub Demo_yify()
    Const URL = "https://www.yify-torrent.org/genres/western/p-"
    Dim oReq As New XMLHTTP60, oDoc As New HTMLDocument, oElt As Object, P%, R&
    ActiveSheet.UsedRange.Clear
Do
        P = P + 1
        oReq.Open "GET", URL & P & "/", False
        oReq.send
        oDoc.body.innerHTML = oReq.responseText
    For Each oElt In oDoc.getElementsByClassName("mv")
        With oElt.getElementsByTagName("div")
            If .Length Then R = R + 1: Cells(R, 1).Value = .Item(0).innerText Else Exit Do
        End With
    Next
Loop While InStr(oReq.responseText, "/"">Next</a>")
    Set oReq = Nothing:  Set oDoc = Nothing
    MsgBox ActiveSheet.UsedRange.Rows.Count & " items downloaded …", vbInformation, "  Demo yify"
End Sub
 
Thanks Marc L for your answer. Again you put an end to my doubt of choosing the ideal way. This is just awesome. Believe me, when you suggested to use something like you've shown above, I tried but I got stuck because of my inability to fix the "responsetext part" in the right way. Thanks again.
 
Back
Top