
Web-scraping using IE

Discussion in 'The Vault' started by shahin, Mar 8, 2018.

  1. shahin

    shahin Active Member

    I have written some code using IE to parse email addresses from Google Maps (a seriously complicated page). A hardcoded delay is no use here; the page loads too unpredictably for that to work. I'm really thrilled with how it performs. Even with Python + Selenium this page is quite difficult to handle, but with IE I found it very manageable, because the kind of "explicit wait" you can write around IE is just incredible. I thought I would share this piece of code to show fellow programmers that VBA with IE is no less capable than any scripting language when it comes to web scraping.

    Just run the code and see the magic (try to overlook the way I named the variables):

    Code (vb):

    Sub Fetch_Google_Data()
        Const URL$ = "https://www.google.com/maps/d/u/0/embed?mid=1PCGcGIVnNMtDQtEcWDqBx-G-iT8&ll=32.444813638299706%2C-87.71010306562499&z=8"
        Dim IE As New InternetExplorer, HTML As HTMLDocument
        Dim posts As Object, elem As Object
        Dim topics As Object, topic As Object
        Dim i As Long, r As Long

        With IE
            .Visible = True
            .navigate URL
            While .Busy = True Or .ReadyState < 4: DoEvents: Wend   'wait for the page to finish loading
            Set HTML = .Document
        End With

        'Explicit wait: keep polling until the element exists in the DOM
        Do: Set topics = HTML.getElementsByClassName("i4ewOd-pzNkMb-ornU0b-b0t70b-Bz112c")(0): DoEvents: Loop While topics Is Nothing
        topics.Click
        Do: Set topic = HTML.querySelectorAll("div[role='checkbox']")(1): DoEvents: Loop While topic Is Nothing
        topic.Click

        With HTML.querySelectorAll(".HzV7m-pbTTYe-JNdkSc .suEOdc")
            For i = 1 To .Length - 1
                .Item(i).Click
                If Not HTML.querySelector("a[href^='mailto:']") Is Nothing Then
                    r = r + 1: Cells(r, 1) = HTML.querySelector("a[href^='mailto:']").innerText
                End If
                'Go back to the list and wait until it is rendered again
                Do: Set posts = HTML.querySelector(".HzV7m-tJHJj-LgbsSe-Bz112c.qqvbed-a4fUwd-LgbsSe-Bz112c"): DoEvents: Loop While posts Is Nothing
                posts.Click
                Do: Set elem = HTML.querySelector(".qqvbed-p83tee"): DoEvents: Loop While elem Is Nothing
            Next i
        End With
    End Sub
     
    References to add (Tools > References in the VBA editor):

    1. Microsoft Internet Controls
    2. Microsoft HTML Object Library
  2. Chihiro

    Chihiro Excel Ninja

    This is a very effective method, as long as the element you are trying to access is always present in the page being scraped.

    Do be careful, though, especially when you are scraping by address, company name, etc.

    This type of coding can cause an infinite loop if the site serves a different element tag, skips the element altogether when a value is absent, or generates the page dynamically (e.g. via jQuery, with a best-estimate search result).

    Edit: It may be beneficial to add a counter check as well: if the loop count exceeds a certain limit, consider the element a lost cause and move on. That will prevent the code from getting stuck in an infinite loop.
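
    A minimal sketch of that counter check, applied to one of the wait loops above (the 500-poll cap and the `tries` variable are my own illustrative choices, not from the original code):

    Code (vb):

        Dim tries As Long
        Do
            Set elem = HTML.querySelector(".qqvbed-p83tee")
            DoEvents
            tries = tries + 1
        Loop While elem Is Nothing And tries < 500
        If elem Is Nothing Then Exit Sub   'element never appeared: lost cause, move on

    The same pattern can wrap every `Do ... Loop While x Is Nothing` wait in the original routine.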
  3. shahin

    shahin Active Member

    @Sir Chihiro, I was just trying to figure out why most people tend to ignore VBA when it comes to automating data collection from websites, when it has tremendous capabilities. That is the only reason I keep comparing Python and VBA, to see whether VBA is lacking anything in the field of web scraping. So far, except for multithreading and an asynchronous approach (I'm not fully sure about those, though), I haven't found any major shortcoming in automating things with VBA.
  4. Chihiro

    Chihiro Excel Ninja

    Their purposes are different, and the choice will largely depend on your needs, environment and preference.

    1. VBA
    • If you are scraping data for use in Excel, it's the obvious choice (i.e. it runs on your local machine).
    • Works well on Windows, not so well on Mac. As far as I know, no Linux.
    • Requires a bit more coding than Python when dealing with the JSON format.
    • Easier to access and set up the environment compared to Python.
    • Requires an Office/Excel installation.
    2. Python
    • The obvious choice if you want scraping done on a schedule, with data pushed to a database and/or web site (i.e. housed server-side).
    • JSON can be put directly into a dictionary object and used.
    • Much easier to write cross-platform code (works on Mac, PC, Linux and many other OSes without issue).
    • All open source and free.
    • As you mentioned, multithreading and asynchronous execution are very powerful tools to have, especially when scraping a large amount of info.
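
    To illustrate the JSON point: VBA has no native JSON parser, so one common route is a third-party module such as VBA-JSON (JsonConverter), whereas Python's json module gives you a dictionary in a single call. A sketch, assuming JsonConverter.bas has been imported and a reference to Microsoft Scripting Runtime is set (the sample string is mine, for illustration only):

    Code (vb):

        Sub Parse_Json_Sketch()
            'Requires JsonConverter.bas (VBA-JSON) imported into the project
            'and a reference to Microsoft Scripting Runtime.
            Dim json As Object
            Set json = JsonConverter.ParseJson("{""name"":""shahin"",""posts"":899}")
            Debug.Print json("name"), json("posts")
        End Sub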
  5. Marc L

    Marc L Excel Ninja

    Hi folks ! :cool:

    For JSON, there are some fairly easy ways in VBA (some shown here) …

    There are also ways (one already shown here on Chandoo) to launch
    VBA "multi-threaded requests" through Windows.
    For example, 60 sequential VBA requests take an average of 53 seconds,
    but dispatched through Windows only an average of 7 seconds …

    Of course VBA is less efficient on its own, but it can borrow such "external object" faculties.
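
    One common way to approximate this from VBA is to launch several independent script processes via Shell and let Windows run them concurrently. A hypothetical sketch only (the worker path `C:\workers\fetch.vbs` and its behavior are my own invention, not necessarily the technique Marc L refers to):

    Code (vb):

        Sub Launch_Parallel_Requests()
            'Hypothetical illustration: start 10 worker scripts concurrently.
            'Each worker (fetch.vbs, not shown) would perform one HTTP request
            'and write its result to disk; Windows schedules them in parallel.
            Dim n As Long
            For n = 1 To 10
                Shell "wscript.exe ""C:\workers\fetch.vbs"" " & n, vbHide
            Next n
        End Sub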
  6. Chihiro

    Chihiro Excel Ninja

    For multithreading, yes, it can be done; then again, that depends on being in a Windows environment. Async, I don't believe, can be done via VBA.

    Really, it's mostly a question of what your environment is and what you are comfortable with.

    With enough persistence and ingenuity, almost anything is possible via either approach. But it boils down to what's best suited to YOUR needs and skills, based on the investment required (time, effort, $ cost, etc.).
  7. shahin

    shahin Active Member

    I'm quite sure all of you are already up to date on this, but with Power Query, parsing any JSON data from a webpage is the simplest thing.
  8. Chihiro

    Chihiro Excel Ninja

    Yes. I use APIs through Power Query to get JSON data and report on it.

    It has a bit of a quirk, though (search for "Formula.Firewall Power Query"), and you have to be careful when you build a multi-stage query (i.e. using one API call to get parameters, then feeding the parameters obtained from the first API into a second one to query the data).

    I'm still experimenting with that portion.
  9. Chihiro

    Chihiro Excel Ninja

    Follow-up on the above. What I discovered is that, unless there is a connector, I can't use a dynamic query against an API that uses a bearer token on the Power BI cloud service (you can do it on the desktop, or using Power Query in Excel).

    Workaround:
    On a local server, have a process query the API and store the data in a database table. Then set up an On-Premises Data Gateway, which only transfers the queried data, in a secure manner, using Azure Service Bus.
    https://powerbi.microsoft.com/en-us/gateway/

    This has the added benefit that I can store historical data beyond what's made available through the API. It's not ideal for real-time data, though; for that, a data feed/streaming is the better route.
