• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

How to modify my below code to adjust everything within a single subroutine?

shahin

Active Member
Few days back when I insisted a lot to Marc L on creating me a recursive crawler in vba, he provided me with an effective one. I modified that piece of code to apply on another website to scrape all the movie names traversing multiple pages (3 pages long here). It is just working great. However, this time I would like to do the same without creating two subroutines. As you can see (if you take a look at the below scraper), I've used two subroutines to accomplish the task.

This is the scraper which parses the site recursively:

Code:
Sub Get_Info(link As String)
    Dim HTTP As New XMLHTTP60, HTML As New HTMLDocument
    Dim posts As Object, post As Object, elem As Object
    Dim URL As String, base As String, preff As String
  
    base = "https://yts.am/browse-movies/0/all/biography/8/rating?page="
  
    With HTTP
        .Open "GET", link, False
        .send
        HTML.body.innerHTML = .responseText
    End With
  
    For Each post In HTML.getElementsByClassName("browse-movie-title")
        Range("A" & Rows.Count).End(xlUp).Offset(1, 0).Value = post.innerText
    Next post
  
    For Each elem In HTML.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(elem.innerText, "Next") > 0 Then preff = Split(elem.href, "page=")(1): Exit For
    Next elem
  
    If preff <> "" Then
        URL = base & preff
        Get_Info (URL)
    End If
End Sub

Sub RCrawler()
    Get_Info "https://yts.am/browse-movies/0/all/biography/8/rating"
End Sub


Once again, my goal is to create one subroutine and still it will parse the data recursively.
 
Hi Shahin ,

I think you have not understood how recursion works. Let me try and explain.

Suppose we have a recursive procedure to calculate the factorial of a number.

The procedure will use the pseudo code :

Factorial(anynumber) = anynumber * factorial(anynumber - 1)

Of course , the code will contain a check to see when we have reached factorial(0) which will return 1 , and the execution will thereafter unwind by going back up.

So , if we have a recursive procedure called factorial(anynumber) , we will need another procedure to call this recursive procedure with the initial number whose factorial we wish to calculate.

Thus , if I want to calculate the factorial of 15 , I will have to have a line of code which says :

factorial(15)

Clearly this has to be in a separate procedure.

In your case , the procedure RCrawler is the initial calling procedure which calls the recursive procedure Get_Info with the starting URL ; thereafter Get_Info is recursively called with progressively deeper URLs.

All recursive routines will have a second routine to call them with the starting parameter , and this parameter is then reduced and passed to the recursive procedure on successive calls.

Narayan
 
Sorry Narayan, it was a dumb thinking of mine. I thought it might be possible like I see in other languages.
 
Last edited:
I short what Narayan is explaining that... you can't do it.
Since you have to initiate recursive process with starting value. Which cannot be supplied within recursive procedure itself.

It has to be initiated by another sub.
 
Thanks sir, for your comment. It seems I was expecting much from vba. However, is it possible to use different IDE to write codes other than on the built-in vbe?
 
... Recursive procedure, be it VBA, Python or any other language, follows same principle (and construct). You call the function/sub via some other process/function to initiate.
 
Unless there's very compelling reason to do so, such as branching calculation where identical operation/calculation occur in multiple times in a process... there really is no reason to use iterative process.

In such case (most famous being Fibonacci numbers), iterative calculation/process is much faster as recursion will not remember previously calculated values.

However, by using dictionary object as memory to store previously calculated value in recursion, you can actually get faster result than iterative version.

Especially in your sample, there isn't identical process repeated multiple times to warrant writing more complex iterative procedure (over recursion).

Edit: Complex as in harder to read and much longer ;)
 
To sir Chihiro: there is something I've created recently using python. I did it in a slightly different way. However, I tried to do the same thing in vba as well but ended up what I did above. That is the reason for this post actually to find out similar technique. Here is the one I just talked about.

The first "URL" has been used once in the request. The request can't take the first "URL" anymore because of the indentation. However, the one with the same name located below takes that place instead and goes on until the condition forces the loop to break.
Code:
import requests
from bs4 import BeautifulSoup

URL = "https://yts.am/browse-movies/0/1080p/comedy/7/seeds"
while True:
    response = requests.get(URL)  
    soup = BeautifulSoup(response.text, "html5lib")
  
    for result in soup.select(".browse-movie-title"):
        print(result.text)

    next_page = [item['href'] for item in soup.select(".tsc_pagination li a") if "Next" in item.text]
    if not next_page:break
    
    next_page = next_page[0].split("page=")[1]
    URL = "https://yts.am/browse-movies/0/1080p/comedy/7/seeds?page={}".format(next_page)
Apology to others who come across this post because this is not the right place to demonstrate this code.
 
So in python you are creating list object "next_page" to hold links to iterate over.

... just build array, collection, concatenated string or any thing that you can populate in vba during following loop and then iterate over in separate loop.
Code:
    For Each elem In HTML.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(elem.innerText, "Next") > 0 Then preff = Split(elem.href, "page=")(1): Exit For
    Next elem
 
@sir Chihiro, you made my head spin using those terms "array", "collection object". I always see how others write code and try to follow along. I highly doubt that I'll be able to create one following your guideline. Thanks.
 
Actually you don't even need that. You just need to update link variable and use GoTo statement.

Code:
Sub Get_Info()
    Dim HTTP As New XMLHTTP60, HTML As New HTMLDocument
    Dim posts As Object, post As Object, elem As Object
    Dim URL As String, base As String, preff As String
    link = "https://yts.am/browse-movies/0/all/biography/8/rating"
    base = "https://yts.am/browse-movies/0/all/biography/8/rating?page="

StartProcess:
    With HTTP
        .Open "GET", link, False
        .send
        HTML.body.innerHTML = .responseText
    End With
 
    For Each post In HTML.getElementsByClassName("browse-movie-title")
        Range("A" & Rows.Count).End(xlUp).Offset(1, 0).Value = post.innerText
    Next post
    link = ""
    For Each elem In HTML.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(elem.innerText, "Next") > 0 Then
            link = base & Split(elem.href, "page=")(1)
            Exit For
        End If
    Next elem

    If link <> "" Then
        GoTo StartProcess:
    Else
        Exit Sub
    End If
   
End Sub
 
This is the code I've been trying to create for years. An epic code sir. Tremendous one. I can't find the right word to praise you. Would like to know something to get the clarity: is this portion link="" you have used in the above code to make sure that every time when the command crosses that line "link =", the variable becomes empty to get refilled with a new one used right after instr().
 
Another variant using While loop.
Code:
Sub Get_Info()
    Dim HTTP As New XMLHTTP60, HTML As New HTMLDocument
    Dim posts As Object, post As Object, elem As Object
    Dim URL As String, base As String, preff As String
    link = "https://yts.am/browse-movies/0/all/biography/8/rating"
    base = "https://yts.am/browse-movies/0/all/biography/8/rating?page="

While link <> ""
    With HTTP
        .Open "GET", link, False
        .send
        HTML.body.innerHTML = .responseText
    End With
    For Each post In HTML.getElementsByClassName("browse-movie-title")
        Range("A" & Rows.Count).End(xlUp).Offset(1, 0).Value = post.innerText
    Next post
    link = ""
    For Each elem In HTML.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(elem.innerText, "Next") > 0 Then
            link = base & Split(elem.href, "page=")(1)
            Exit For
        End If
    Next elem
Wend
 
End Sub
 
you have used in the above code to make sure that every time when the command crosses that line "link =", the variable becomes empty to get refilled with a new one used right after instr().

The purpose to set it to blank is to ensure that if Instr condition isn't met, loop will terminate based on link being blank. Otherwise, you will have infinite loop.
 
The last one is identical to the one I've created in python. You are just impossible. I never thought it would be possible in vba.
 
It's just following logic. Use break points, debug.print, locals window, watch window etc to understand how variable is updated and how it should be handled.
 
That means, link="" is directly connected to else block, right? I will store this two codes in my desktop so that every time I switch my PC on, I can see it. Thanks a lot sir.
 
@ sir Chihiro, I still can't believe that It is possible what you have shown above. What a plain and simple logic and still awesomely elegant. Thankssssssssss a zillion.
 
@ sir Chihiro, do you know any link where I can learn how to work with "locals window" and "watch window"? I only know a little about immediate window. Thanks in advance sir.
 
Back
Top