shahin
Active Member
I've created a script in VBA to parse all the available quotes from "http://quotes.toscrape.com/", a site built for scraping practice. There are a few topics, each of which lists several quotes. My scraper follows this logic:
1. From the landing page, the scraper first parses the ten topic links available in the sidebar on the right.
2. Following each topic link, the scraper visits the corresponding pages and parses the quotes.
3. Some topics run across several pages through pagination, and my scraper fetches those as well.
There are 86 quotes in total, counting duplicates (a few quotes fall under more than one topic because they are relevant to each). My scraper parses all 86 flawlessly.
A few days back, while trying to create a recursive crawler, I got a very nice piece of code from Chihiro that did the trick with an iterative approach. I've tried to apply the same logic here.
I've used a dictionary so that I can take that iterative approach. My question: should I stick with what I've done below, or is there different logic that is faster or more elegant? Any suggestions or guidelines would be highly appreciated. I'm pasting a VBA script and a Python script; I tried to follow the logic of the Python script, which is why the two are closely related. Once again, both scripts run flawlessly.
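To show what I mean by using a dictionary for the iterative approach: the dict's keys act both as a de-duplicated work list and as the starting points whose pagination is then followed in a plain loop. A minimal sketch of just that idea, with made-up names (`crawl`, `get_next`) and a fake pagination map instead of real requests:

Code:
```python
def crawl(start_links, get_next):
    # A dict silently drops duplicate links and preserves insertion order,
    # so its keys form a clean work list for a plain loop (no recursion).
    pending = {link: True for link in start_links}
    visited = []
    for link in list(pending):
        while link:                  # follow pagination for this topic
            visited.append(link)
            link = get_next(link)    # "" (falsy) means no next page
    return visited

# Hypothetical pagination map standing in for the site's "next" button
next_page = {"tag/love/": "tag/love/page/2/", "tag/love/page/2/": ""}
order = crawl(["tag/love/", "tag/life/", "tag/love/"],
              lambda l: next_page.get(l, ""))
# → ['tag/love/', 'tag/love/page/2/', 'tag/life/']
```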
This is what I've tried with vba:
Code:
Sub Parse_Quotes()
    Const URL As String = "http://quotes.toscrape.com/"
    Dim html As New HTMLDocument, posts As HTMLDivElement
    Dim post As HTMLDivElement, elem As HTMLHtmlElement
    Dim ldic As Object, key As Variant, r As Long

    Set ldic = CreateObject("Scripting.Dictionary")

    'Fetch the landing page
    With CreateObject("MSXML2.serverXMLHTTP")
        .Open "GET", URL, False
        .send
        html.body.innerHTML = .responseText
    End With

    'Collect the topic links into the dictionary; hrefs come back resolved
    'against "about:", so Split on ":/" recovers the relative path
    For Each posts In html.getElementsByClassName("tag-item")
        With posts.getElementsByClassName("tag")
            If .Length Then ldic(URL & Split(.Item(0).getAttribute("href"), ":/")(1)) = 1
        End With
    Next posts

    'Visit each topic, following pagination until there is no "next" link
    For Each key In ldic.keys
        While key <> ""
            With CreateObject("MSXML2.serverXMLHTTP")
                .Open "GET", key, False
                .send
                html.body.innerHTML = .responseText
            End With

            For Each post In html.getElementsByClassName("quote")
                With post.getElementsByClassName("text")
                    If .Length Then r = r + 1: Cells(r, 1) = .Item(0).innerText
                End With
            Next post

            key = ""
            If Not html.getElementsByClassName("next")(0) Is Nothing Then
                For Each elem In html.getElementsByClassName("next")(0).getElementsByTagName("a")
                    key = URL & Split(elem.getAttribute("href"), ":/")(1)
                Next elem
            End If
        Wend
    Next key
End Sub
Here goes the python script:
Code:
import requests
from lxml import html

core_link = "http://quotes.toscrape.com/"

def quotes_scraper(base_link):
    # Parse the topic links from the landing page
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(core_link + titles.attrib['href'])

def processing_docs(base_link):
    # Parse the quotes on one page, then recurse into the next page, if any
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = core_link + next_page
        processing_docs(page_link)

quotes_scraper(core_link)
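Since I'm asking about alternative logic: the recursion at the end of processing_docs could also be flattened into a while loop, the way my VBA version follows pagination. This is only a rough sketch of that alternative, not the script I actually ran; `scrape_topic` and the `fetch` parameter are made-up names, and the URL join mirrors the original script:

Code:
```python
import requests
from lxml import html

core_link = "http://quotes.toscrape.com/"

def scrape_topic(topic_link, fetch=lambda url: requests.get(url).text):
    """Follow one topic's pagination with a loop instead of recursion;
    returns the (quote, author) pairs found along the way."""
    results = []
    link = topic_link
    while link:
        root = html.fromstring(fetch(link))
        for quote in root.xpath('//div[@class="quote"]'):
            results.append((quote.xpath('.//span[@class="text"]/text()')[0],
                            quote.xpath('.//small[@class="author"]/text()')[0]))
        nxt = root.xpath('//li[@class="next"]/a/@href')
        link = core_link + nxt[0] if nxt else ""  # "" ends the loop
    return results
```
The `fetch` parameter just makes the pagination logic testable without hitting the network; by default it fetches pages with requests as before.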