
Should I use a dictionary in my script for better performance?

shahin

Active Member
I've created a script in VBA to parse all the available quotes from "http://quotes.toscrape.com/". The site is built for scraping practice. There are a few topics, each of which lists several quotes. The scraper I've written below follows this logic:

1. From the landing page, the scraper first parses the ten topic links in the right-hand sidebar.
2. Following each topic link, the scraper visits the corresponding pages and parses the quotes.
3. Some topics span several pages via pagination, and my scraper can fetch those as well.

There are 86 quotes in total, including duplicates (a few quotes fall under more than one topic because they are relevant to each). My scraper parses all 86 quotes flawlessly.

A few days back, while trying to create a recursive crawler, I got a very nice piece of code from sir Chihiro that did the trick with an iterative approach. I've tried to apply the same logic here.

I've used a dictionary so that I can take the iterative approach. My question: should I stick with what I've done below, or is there a different logic that is faster or more elegant? Any suggestion or guideline on this will be highly appreciated. I'm pasting a VBA script and a Python script; I tried to follow the logic of the Python script, which is why the two are similar. Both scripts run flawlessly.

This is what I've tried with VBA:

Code:
Sub Parse_Quotes()
    Const URL As String = "http://quotes.toscrape.com/"
    Dim html As New HTMLDocument, posts As HTMLDivElement
    Dim post As HTMLDivElement, elem As HTMLHtmlElement
    Dim ldic As Object, key As Variant
   
    Set ldic = CreateObject("Scripting.Dictionary")
   
    With CreateObject("MSXML2.serverXMLHTTP")
        .Open "GET", URL, False
        .send
        html.body.innerHTML = .responseText
    End With
   
    For Each posts In html.getElementsByClassName("tag-item")
        With posts.getElementsByClassName("tag")
            If .Length Then ldic(URL & Split(.Item(0).getAttribute("href"), ":/")(1)) = 1
        End With
    Next posts
   
    For Each key In ldic.keys
        While key <> ""
            With CreateObject("MSXML2.serverXMLHTTP")
                .Open "GET", key, False
                .send
                html.body.innerHTML = .responseText
            End With
           
            For Each post In html.getElementsByClassName("quote")
                With post.getElementsByClassName("text")
                    If .Length Then r = r + 1: Cells(r, 1) = .Item(0).innerText
                End With
            Next post
           
            key = ""
           
            If Not html.getElementsByClassName("next")(0) Is Nothing Then
                For Each elem In html.getElementsByClassName("next")(0).getElementsByTagName("a")
                    key = URL & Split(elem.getAttribute("href"), ":/")(1)
                Next elem
            End If
        Wend
    Next key
End Sub

Here goes the python script:

Code:
import requests
from lxml import html

core_link = "http://quotes.toscrape.com/"

def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(core_link + titles.attrib['href'])

def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = core_link + next_page
        processing_docs(page_link)

quotes_scraper(core_link)
 
Hi !

You use a dictionary on the wrong tag part, so you get duplicate quotes !
As always you forgot to free the object variables before the procedure ends,
and the R variable is not declared …

To avoid duplicates, instead of a dictionary you can use a Collection,
or just an easy Excel worksheet function : MATCH …
Or use RemoveDuplicates.
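Since the thread also carries a Python version, the same dedup idea can be sketched there with a set of already-seen quotes; this is just an illustration of the principle, mirroring the "check before adding" role that a Collection or MATCH lookup would play in the VBA suggestions above:

```python
# Sketch: keep only the first occurrence of each quote while collecting
# by tag. The set gives O(1) membership checks; the list preserves order.
def dedup(quotes):
    seen = set()
    unique = []
    for q in quotes:
        if q not in seen:
            seen.add(q)
            unique.append(q)
    return unique
```

For example, dedup(["a", "b", "a"]) returns ["a", "b"], keeping first-seen order.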
 
The total quotes are 86 in number including duplicates
No, in fact there are 100 quotes, without any duplicates !
So there is no need for a dictionary, and it was a bad idea to scan by tags …
Code:
Sub Demo1()
    Dim oReq As New WinHttpRequest, oDoc As New HTMLDocument, oDiv As HTMLDivElement, R&, S$
    Me.UsedRange.Clear
Do
        oReq.Open "GET", "http://quotes.toscrape.com/" & S, False
        oReq.Send
        oDoc.body.innerHTML = oReq.responseText
    For Each oDiv In oDoc.getElementsByClassName("quote")
        R = R + 1
        Cells(R, 1).Value = Replace(Replace(oDiv.children(1).innerText, "by ", ""), " (about) ", "")
        Cells(R, 2).Value = Replace(Replace(oDiv.children(0).innerText, "“", ""), "”", "")
    Next
    With oDoc.getElementsByClassName("next")
        If .length Then S = .item(0).all(0).pathname Else Exit Do
    End With
Loop
    Set oReq = Nothing:  Set oDoc = Nothing
    Me.UsedRange.Columns(1).AutoFit
End Sub
Do you like it ? Then thanks for clicking the bottom-right Like !
 
You are always numerous steps ahead, no matter what idea I come up with. The idea you applied here, "GET", URL & S, False, is much appreciated; I've never seen such usage before. A little clarification on ".all(0).pathname" would be great. Thanks a trillion, Marc L.
 
.all(0).pathname is the result of my object observation within the VBE
Locals window, as anyone can do by just setting an object variable on
oDoc.getElementsByClassName("next").
It's like the href content but without "about:/" :
for the first Next button its pathname content is "page/2/" …

The idea you applied here, "GET", URL & S, False, is much appreciated. I've never seen such usage before.
You've already seen this within this forum and even in your own thread !
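For comparison, Python's standard library exposes the same URL component that the DOM anchor's pathname property returns; a small offline sketch (the example URL is taken from the site discussed above; note that urlparse keeps the leading slash, while the VBA pathname seen above drops it):

```python
from urllib.parse import urlparse

# An anchor's .pathname is the URL path without scheme and host:
# for the first Next button it corresponds to "page/2/".
url = "http://quotes.toscrape.com/page/2/"
path = urlparse(url).path        # "/page/2/"
print(path.lstrip("/"))          # prints "page/2/"
```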

Instead of the path name you can just use a page number variable :​
Code:
Sub Demo2()
    Dim oReq As New WinHttpRequest, oDoc As New HTMLDocument, oDiv As HTMLDivElement, P%, R&
    Me.UsedRange.Clear
Do
        P = P + 1
        oReq.Open "GET", "http://quotes.toscrape.com/page/" & P, False
        oReq.Send
        oDoc.body.innerHTML = oReq.responseText
    For Each oDiv In oDoc.getElementsByClassName("quote")
        R = R + 1
        Cells(R, 1).Value = Replace(Replace(oDiv.children(1).innerText, "by ", ""), " (about) ", "")
        Cells(R, 2).Value = Replace(Replace(oDiv.children(0).innerText, "“", ""), "”", "")
    Next
Loop While oDoc.getElementsByClassName("next").length
    Set oReq = Nothing:  Set oDoc = Nothing
    Me.UsedRange.Columns(1).AutoFit
End Sub
You may Like it !
 
Yeah, you are right that these approaches have been used in this forum before. The thing is, I could not understand them very well at the time, so I just followed along to get the job done. Now I'm trying to understand every portion of the code, which is why a few things that have been used before seem new to me when I come across them.
 
As again it's just about O B S E R V A T I O N !

When you click on the Next button, you must see the new url
ending in "page/2/", and when clicking the same button again
the new url ends in "page/3/". Is it so difficult to combine
logic and imagination in order to just dare to try "page/1/" ?‼
(At child level …)

When it works for the first page, as here, this mere 10-second
browser navigation test gives you your main & easy code lead !

As the great Albert wrote :
« Logic will get you from A to Z; imagination will get you everywhere. »
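The same page-number trick (as in Demo2) can be sketched in Python; a minimal, offline version of the loop shape, where has_next is a hypothetical callback standing in for fetching the page and testing for a Next button:

```python
# Offline sketch of the page-number loop: bump a counter until the
# "next" check fails. No network access; has_next is a stand-in.
def page_url(base, page):
    # Page 1 is the bare base URL; later pages follow the /page/N/
    # pattern observed while browsing the site.
    return base if page == 1 else f"{base}page/{page}/"

def visit_all(base, has_next):
    page, urls = 1, []
    while True:
        urls.append(page_url(base, page))
        if not has_next(page):
            break
        page += 1
    return urls
```

For instance, visit_all("http://quotes.toscrape.com/", lambda p: p < 10) builds the ten page URLs without touching the network.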
 
@Marc L, don't scold me for keeping on posting similar problems in search of a solution. I like the approach you demonstrated in post # 3. I tried the below, but the loop seems to be never-ending. How can I fix it?
Code:
Sub Get_Info()
    Dim HTTP As New XMLHTTP60, HTML As New HTMLDocument
    Dim post As HTMLDivElement, elem As Object, link As Variant
    base = "https://yts.am/browse-movies/0/all/biography/8/rating?page="

    Do
        With HTTP
            .Open "GET", base & link, False
            .send
            HTML.body.innerHTML = .responseText
        End With
       
        For Each post In HTML.getElementsByClassName("browse-movie-bottom")
            With post.getElementsByClassName("browse-movie-title")
                If .Length Then Range("A" & Rows.Count).End(xlUp).Offset(1, 0).Value = .Item(0).innerText
            End With
        Next post

        For Each elem In HTML.getElementsByClassName("tsc_pagination tsc_paginationA tsc_paginationA06")
            With elem.getElementsByTagName("a")
                If InStr(.Item(0).innerText, "Next") > 0 Then
                    If .Length Then link = Split(.Item(0).href, "page=")(1)
                    Exit For
                End If
            End With
        Next elem
    Loop Until link <= 1
End Sub
 
Not a scolding but an essential reminder
of what you must do before starting this kind of code !
So when you ask the same type of question again & again,
don't be surprised if some helpers are less present …

Your loop condition is Until link <= 1,
but you forgot to reset this variable's value to zero;
just follow your code in step-by-step mode and you will see !

Or you can just apply Demo1 Exit Do under condition …
Or you can just apply Demo2 page number …
Or you can compare link with previous one …
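The last suggestion, stopping when the next link repeats or disappears, can be sketched in Python; next_link_of is a hypothetical stand-in for fetching a page and parsing its pagination anchor:

```python
# Sketch of the "compare link with the previous one" fix: stop when the
# next link is falsy or has already been visited, guarding against a
# never-ending loop. max_pages is an extra safety cap.
def crawl(start, next_link_of, max_pages=1000):
    visited = []
    link = start
    while link and link not in visited and len(visited) < max_pages:
        visited.append(link)
        link = next_link_of(link)  # falsy when there is no Next button
    return visited
```

Even if the site's last page pointed back to an earlier one, the visited check would still terminate the loop.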
 
A Dictionary is only needed if you have to store unique key-item pairs that meet a condition and perform further manipulation down the line.

If you don't need to retrieve and process the data later, it isn't required, though I sometimes use it for the convenience factor.

If you didn't need to use a dictionary in Python, then you won't need it in VBA either.
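A tiny Python illustration of that point, with hypothetical sample data: a dict earns its keep only when you look items up by key later; for mere uniqueness a set (or VBA's Collection) is enough.

```python
# Hypothetical (author, quote-id) pairs standing in for scraped data.
quotes = [("Einstein", "q1"), ("Austen", "q2"), ("Einstein", "q3")]

# Further manipulation by key later -> a dict of lists is justified:
by_author = {}
for author, text in quotes:
    by_author.setdefault(author, []).append(text)

# Only uniqueness needed, no later lookup -> a plain set suffices:
authors = {author for author, _ in quotes}
```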
 
Start the chrono …
• Browsing website to observe url & buttons change between pages,
chrono shows 10 seconds …
• Inspecting elements needed to code, chrono shows 30 seconds …
• Creating Demo3 from Demo2, mod it according to elements needed,
using a variant variable to observe elements during the execution
within VBE Locals window, mod code accordingly,
chrono shows less than 2 minutes …
• Testing procedure, checking result, code final cleaning,
chrono end time : ~ 3 minutes.​
Code:
Sub Demo3()
  Const URL = "https://yts.am/browse-movies/0/all/biography/8/rating", _
        ULI = "tsc_pagination tsc_paginationA tsc_paginationA06"
    Dim oReq As New WinHttpRequest, oDoc As New HTMLDocument, oAnc As HTMLAnchorElement, P%, R&
        Cells(1).CurrentRegion.Clear
Do
        P = P + 1
        oReq.Open "GET", IIf(P > 1, URL & "?page=" & P, URL)
        oReq.Send
        oDoc.body.innerHTML = oReq.ResponseText
    For Each oAnc In oDoc.getElementsByClassName("browse-movie-title")
        R = R + 1
        Cells(R, 1).Value = oAnc.innerText
    Next
Loop Until IsNumeric(oDoc.getElementsByClassName(ULI)(0).lastChild.innerText)
    Set oReq = Nothing:  Set oDoc = Nothing
End Sub
You should Like it !
 