• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Web-scraping using IE

shahin

Active Member
I have written some code using IE to parse some email addresses from google maps (seriously complicated one). There is no way to get a pass if hardcoded delay is applied to serve the purpose. I'm really enthralled to see how it works. Even if for python + selenium it is quite difficult to go with but in case of IE, I found it very helpful because the "Explicit Wait" defined within IE is just incredible. I thought to demonstrate this piece of code to let the fellow programmer know IE is no less than any scripting language when It comes to deal with web-scraping.

Just run the code and see the magic (try to overlook the way I named the variables):

Code:
Sub Fetch_Google_Data()
    URL$ = "https://www.google.com/maps/d/u/0/embed?mid=1PCGcGIVnNMtDQtEcWDqBx-G-iT8&ll=32.444813638299706%2C-87.71010306562499&z=8"
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim posts As Object, post As Object, elem As Object
    Dim topics As Object, topic As Object

    With IE
        .Visible = True
        .navigate URL
        While .Busy = True Or .ReadyState < 4: DoEvents: Wend
        Set HTML = .Document
    End With
   
    Do: Set topics = HTML.getElementsByClassName("i4ewOd-pzNkMb-ornU0b-b0t70b-Bz112c")(0): DoEvents: Loop While topics Is Nothing
    topics.Click
    Do: Set topic = HTML.querySelectorAll("div[role='checkbox']")(1): DoEvents: Loop While topic Is Nothing
    topic.Click
   
    With HTML.querySelectorAll(".HzV7m-pbTTYe-JNdkSc .suEOdc")
        For i = 1 To .Length - 1
            .Item(i).Click
            If Not HTML.querySelector("a[href^='mailto:']") Is Nothing Then
                r = r + 1: Cells(r, 1) = HTML.querySelector("a[href^='mailto:']").innerText
            End If
            Do: Set posts = HTML.querySelector(".HzV7m-tJHJj-LgbsSe-Bz112c.qqvbed-a4fUwd-LgbsSe-Bz112c"): DoEvents: Loop While posts Is Nothing
            posts.Click
            Do: Set elem = HTML.querySelector(".qqvbed-p83tee"): DoEvents: Loop While elem Is Nothing
        Next i
    End With
End Sub

Reference to add to the library:

1. Microsoft Internet Controls
2. Microsoft HTML Object Library
 
This is very effective method as long as element you are trying to access is always present in a web page that's being scraped.

Do be careful though. Especially when you are scraping by address/company name etc.

This type of coding can cause infinite loop, if script gives it different element tag and/or skips element altogether when value is absent (or if page is dynamically generated via jquery, with best estimate search result).

Edit: It may be beneficial to add counter check as well. If loop count exceeds certain amount, consider it lost cause and move on. This will prevent it from being stuck in infinite loop.
 
@ Sir chihiro, I was just trying to figure out why most people tend to ignore VBA when it comes to automate information from the website when It has tremendous capabilities. This is the only reason I always try to compare python and vba to see if there is any lacking in it (in the field of web-scraping). However, except for multithreading and asynchronous approach (I'm not fully sure, though) I didn't find any bigger shortcoming to automate stuffs using vba.
 
It's purpose is different and it will largely depend on your need, environment and preference.

1. VBA
  • If you are scraping data for use in Excel, it's the obvious choice (i.e. runs on your local machine).
  • Works well with PC, not so much for MAC. As far as I know, no Linux.
  • Requires bit more coding than python when dealing with JSON format
  • Easier to access and set up environment when compared to python
  • Requires you to have Office/Excel installation
2. Python
  • Obvious choice if you want scraping done on schedule and data pushed to database and/or web site (i.e. housed in server side).
  • JSON can be directly put into dictionary object and used.
  • Much easier to write cross platform code (and works in MAC, PC, Linux and many other OS without issue).
  • All open source and free
  • As you mentioned, multi-threading and asynchronous is very powerful tool to have. Especially when scraping large amount of info.
 
Hi folks ! :cool:

For JSon, there are almost easy ways for VBA (some shown here) …

There are ways too (one yet shown here on Chandoo) to launch
VBA "multi-threading requests" through Windows.
For example 60 VBA sequentials requests take an average of 53 seconds
and through Windows only an average of 7 seconds …

Of course VBA is less efficient but with some "external objects" faculties.
 
For multi-threading yes it can be done. Then again it's dependent on it being Windows environment. Async I don't believe can be done via VBA.

Really it's mostly what's your environment and what you are comfortable with.

With enough persistence and ingenuity, almost anything is possible via either approach. But it boils down to what's best suited to "YOUR" need and skill based on investment needed (time, effort, $ cost etc).
 
I'm damn sure, all of you are already update with the news that using power query it's the simplest thing to parse any json data from a webpage.
 
Yes. I use API through PowerQuery to get JSON data and report on it.

Though it has bit of quirk (search for Formula.Firewall PowerQuery) and you have to be careful when you build multi-stage query (i.e. Using 1 API to get parameters, then using parameters obtained in 1st API in 2nd to query data).

I'm still experimenting with that portion.
 
Follow up on above. What I discovered is that unless there is connector. I can't use dynamic query on API that uses bearer token on PowerBI Cloud service (you can do it on desktop or using PowerQuery in Excel).

Workaround -
On local server have process to query API and store data in a db table. Set up On-Premises Gateway Server, which only transfers data queried in secure manner using Azure Service Bus.
https://powerbi.microsoft.com/en-us/gateway/

This has added benefit in that I can store historical data beyond what's made available through API. Though not ideal for realtime data. For that, data feed/streaming is better route.
 
Back
Top