1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions, please remember that our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing. Remember that we have people here for whom English is not their first language, and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, please include a sample file in the initial post to receive a quicker and more targeted answer.

Scraping data from a specific site

Discussion in 'VBA Macros' started by claudia80, Aug 10, 2018.

  1. claudia80

    claudia80 Member

    Messages:
    81
    Could you help me scrape the schools for each state?
    It seems that the only data that can be extracted are: name, address, telephone and website. I don't think the contact email is among the data available.

    Site to scrape:
    https://maps.me/catalog/education/amenity-school/

    Thanks in advance
  2. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Try this code
    Code (vb):
    Sub Test()
        'Requires a reference to "Microsoft HTML Object Library"
        Dim html        As New HTMLDocument
        Dim posts       As Object
        Dim post        As Object
        Dim v           As Object
        Dim w           As Object
        Dim x           As Object
        Dim a()         As Variant
        Dim p           As Long
        Dim r           As Long

        ReDim a(1 To 100000, 1 To 4)

        Application.ScreenUpdating = False
            For p = 1 To 2
                html.body.innerHTML = GetHTMLSource("https://maps.me/catalog/education/amenity-school/?page=" & p)
       
                Set posts = html.getElementsByClassName("item__title")
       
                For Each post In posts
                    r = r + 1
                    a(r, 1) = post.innerText
       
                    Set v = post.NextSibling.getElementsByTagName("p")(0)
                    If Not v Is Nothing Then a(r, 2) = v.innerText
       
                    Set w = post.NextSibling.getElementsByTagName("p")(1)
                    If Not w Is Nothing Then If Mid(w.innerText, 1, 5) = "Phone" Then a(r, 3) = Replace(w.innerText, "Phone: ", "'")    'leading apostrophe keeps the number as text
       
                    Set x = post.NextSibling.getElementsByTagName("p")(2)
                    If Not x Is Nothing Then If Mid(x.innerText, 1, 7) = "Website" Then a(r, 4) = Replace(x.innerText, "Website: ", "")
                Next post
       
                Set html = Nothing
            Next p
       
            Columns("A:D").ClearContents
            Range("A1").Resize(1, 4).Font.Bold = True
            Range("A1").Resize(1, 4).Value = Array("School Name", "Address", "Phone", "Website")
            Range("A2").Resize(r, UBound(a, 2)).Value = a
        Application.ScreenUpdating = True
    End Sub

    Function GetHTMLSource(strURL As String) As String
        Dim XMLHTTP    As Object

        Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")

        With XMLHTTP
            .setTimeouts 4000, 4000, 4000, 4000
            .Open "GET", strURL, False
            .send
            If .Status = 200 Then GetHTMLSource = XMLHTTP.responseText Else GetHTMLSource = ""
        End With
    End Function
  3. claudia80

    When I run the macro, a dialog appears with an error message:
    Compile error:
    User-defined type not defined.

    Clicking the "?" takes me to this page:

    https://msdn.microsoft.com/it-it/VB...ce-shared-vblr6-chm1032807?f1url=https://msdn. microsoft.com 2Fquery%%% 2Fdev11.query 3FappId% 3DDev11IDEF1% 26l% 3Dit-EN% 26k% 3DK (vblr6.chm1032807)% 3BK (TargetFrameworkMoniker-Office.Version 3Dv15%)%% 26rd 3Dtrue

    Should I enable other references?

    Attached Files:

    • 1.PNG
  4. YasserKhalil

    You have to enable the reference "Microsoft HTML Object Library", or change the first line of the variable declarations to Dim html As Object.
  5. Marc L

    Marc L Excel Ninja

    Messages:
    4,221
    So do not forget the CreateObject function …
  6. claudia80

    claudia80 Member

    Messages:
    81
    I enabled "Microsoft HTML Object Library" and ran the macro. It only worked up to the second page, stopping at: #6 Kindergarten.
    Could this block be solved? Moreover, could the extraction be done by state (at the top of the page there is the option to choose the country), making sure that the state is also inserted in the first extraction column?
    I noticed that when clicking on the links, the geographic coordinates (GPS) are also shown. Can they be extracted?
  7. Marc L

    Can you explain crystal clear? (As state = country?!)
    Can you attach an expected result?
    What is the purpose of downloading more than 51,000 webpages
    with one of the slowest languages?!
    Looks like Stefano …
  8. claudia80

    Answer attached.
    The macro is fast.

    Attached Files:

  9. Marc L

    Obviously just mod the last page # in the code …
  10. claudia80

    Could you make the changes to the code?
  11. YasserKhalil

    I can't get what you mean exactly …
  12. claudia80

    I would need these adjustments:
    1) The macro hangs on the second page. Could you make the macro scrape all the pages?
    2) Is it possible to extract the geographical coordinates? You can find them when you click on the name of a school, next to the label "GPS".
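The "scrape all the pages" request in (1) boils down to open-ended pagination: keep fetching page 1, 2, 3, … until a page comes back empty. A minimal sketch of that stopping rule, in Python for brevity, with a stubbed fetcher standing in for the real maps.me requests (the stub data below is invented for illustration):

```python
def scrape_all_pages(fetch_page):
    """Fetch page 1, 2, 3, ... until a page yields no items.

    fetch_page(n) should return a list of records for page n,
    and an empty list once n is past the last page."""
    records = []
    page = 1
    while True:
        items = fetch_page(page)
        if not items:        # past the last page: stop
            break
        records.extend(items)
        page += 1
    return records

# Stub standing in for the real HTTP request: three pages of
# invented school names, then nothing.
fake_site = {1: ["School A", "School B"], 2: ["School C"], 3: ["School D"]}

def fake_fetch(n):
    return fake_site.get(n, [])

print(scrape_all_pages(fake_fetch))
# -> ['School A', 'School B', 'School C', 'School D']
```

In the VBA scripts in this thread, the same idea would replace the fixed upper bound (`For p = 1 To 2`, `For pagenum = 1 To 10`) with a `Do While` loop that stops once a page yields zero `item__title` matches.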
  13. shahin

    shahin Active Member

    Messages:
    887
    You have already got an excellent answer. I'm just trying to provide you with a script created using ".querySelector()" instead of the usual approach. This method is comparatively easier to deal with. However, the webpage itself is really slow. The following script will give you "Names", "Address along with co-ordinates", and "Phone" numbers. I've used a hardcoded number for pagination because the number of pages is huge. Feel free to replace the number "10" with the highest page number you want to grab data from.

    Try this:
    Code (vb):

    Sub GetInformation()
        Dim Http As New XMLHTTP60, Html As New HTMLDocument
        Dim pagenum&, I&, R&
     
        For pagenum = 1 To 10
            With Http
                .Open "GET", "https://maps.me/catalog/education/amenity-school/?page=" & pagenum, False
                .setRequestHeader "User-Agent", "Mozilla/5.0"
                .send
                Html.body.innerHTML = .responseText
            End With
         
        With Html.querySelectorAll(".item .item__title a")
            For I = 0 To .Length - 1
                'Links parsed via innerHTML get an "about:" prefix, so strip it off
                Http.Open "GET", "https://maps.me" & Split(.Item(I).getAttribute("href"), "about:")(1), False
                    Http.setRequestHeader "User-Agent", "Mozilla/5.0"
                    Http.send
                    Html.body.innerHTML = Http.responseText
                 
                    R = R + 1: Cells(R, 1) = Html.querySelector("h1[itemprop='name']").innerText
                    If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                        Cells(R, 2) = Html.querySelector(".item__desc-location span[itemprop='address']").innerText
                    End If
                    If Not Html.querySelector("p.item__desc-phone span") Is Nothing Then
                        Cells(R, 3) = Split(Html.querySelector("p.item__desc-phone span").innerText, "Phone: ")(1)
                    End If
                Next I
            End With
        Next pagenum
    End Sub
     
    To execute the above script, you need to add the following references to the library:

    Code (vb):

    Microsoft XML, V6.0
    Microsoft HTML Object Library
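shahin's `.querySelector()` approach hinges on picking elements out by CSS class. For readers without the MSHTML references, the same extraction idea can be sketched with Python's standard-library `HTMLParser`; the sample HTML below is invented to mirror the maps.me listing markup, not copied from the site:

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collect the text of every element whose class attribute contains
    'item__title' (the same hook the CSS selector above relies on).
    Simplified: assumes no unclosed/void tags inside matched elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a matching element
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "item__title" in classes.split():
            self.depth += 1
            self.titles.append("")
        elif self.depth:
            self.depth += 1  # nested tag inside a matching element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.titles[-1] += data.strip()

sample = """
<div class="item">
  <h2 class="item__title"><a href="/catalog/x">School A</a></h2>
  <p>Address A</p>
</div>
<div class="item">
  <h2 class="item__title"><a href="/catalog/y">School B</a></h2>
</div>
"""

scraper = TitleScraper()
scraper.feed(sample)
print(scraper.titles)   # -> ['School A', 'School B']
```

MSHTML's `querySelectorAll(".item .item__title a")` does the equivalent matching (with proper CSS selector handling) inside the VBA script.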
     
  14. claudia80

    Thank you for your intervention.
    I tried the macro and I can say that in the first macro the data were extracted directly from the first search page and the extrapolation is done in an orderly manner but the coordinates that are found in the page that appears by clicking on the name of the school are missing.In the second the data are extrapolated directly from the page that appears by clicking on the name of the school (with another macro I will split the data name school from the address, etc.) but the extraction of the website is missing.

    I ask if it's possible to:
    1) Add the extraction of the website to the code (an example of a school with a website: https://maps.me/catalog/education/a...ionnaya-shkola-izucheniya-inostrannyh-yazykov -4,611,686,023,970,145,825 /)
    2) Make the extraction happen automatically for all the pages, without having to enter the number of pages in the code
    3) Understand from the page's structure whether some schools might also have a contact email? If so, I would also be interested in the email.

    thank you so much
  15. shahin

    @claudia80, You didn't mention whether my approach could traverse multiple pages. Answer could be "yes" or "no". Additional requirement can be fulfilled later. Thanks.
  16. claudia80

    I confirm that the macro extracts the data of the pages indicated in the macro. It was implied in the previous message, when I asked whether it was possible to extract all the pages without having to enter their number in the code.
    Is it necessary to format the Excel sheet first by data type (for example, the telephone number column as text)?
  17. Marc L

    It seems a classic VBA procedure would need more than 10 days
    to load all the pages, if Windows or Excel doesn't hang first!

    It's safer to move to a more efficient way (a pro tool, a web language) …
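Marc L's estimate can be sanity-checked with back-of-envelope arithmetic. The per-request latencies below are assumptions for illustration, not measurements of maps.me:

```python
# Rough sequential-scrape run time; the latencies are assumptions
# for illustration, not measurements.
detail_pages = 51_000        # Marc L's figure for the whole catalogue
seconds_per_request = 2.0    # assumed average round trip

total_hours = detail_pages * seconds_per_request / 3600
print(f"{total_hours:.0f} hours")    # -> 28 hours

# At 20 s per page (timeouts, retries, throttling), the same job takes:
total_days = detail_pages * 20 / 86400
print(f"{total_days:.0f} days")      # -> 12 days
```

Sequential VBA requests on a slow site sit toward the high end of this range, which is why switching tools or splitting the job comes up later in the thread.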
  18. claudia80

    Mr Marc, if you want, could you post the most appropriate code following your logic?
    Last edited by a moderator: Aug 12, 2018
  19. Marc L

    As it's pretty much the same as Yasser's in post #2, and
    scraping 51,000+ webpages under VBA is insane …

    See with your IT department to prepare a PC without any web upload service
    (Windows, antivirus, Java, Acrobat Reader, …) to try to avoid any issue …
  20. claudia80

    What is the system you referred to in message #17?
    I could proceed in stages, using shahin's macro and splitting the pages over several days.
  21. Marc L



    Some web oriented language or tool, ask your IT …​
  22. claudia80

    I have no IT department.
    For me, the two macros can work well once they are fitted with the missing data and functions.
    Meanwhile, I'll wait for the modification of the code by Shahin ...
  23. Marc L

    AutoIt may be near beginner level, like Import.io, OutWit, Scrape Box,
    or whatever tool you can easily find yourself on the Web …

    As any modification should dramatically slow down the execution,
    first try the actual code after updating it with the last page #,
    in order to get an idea of how long its execution lasts
    (multiply by 20 for a complete VBA scrape)
    and to see how viable it is as well,
    as some websites monitor their activity and
    can block the IP address in case of such scraping …
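The IP-blocking risk Marc L mentions is usually mitigated by throttling and retrying politely. A minimal sketch of a retry-with-backoff wrapper, in Python for brevity; `flaky_fetch` is a stub standing in for the real HTTP call, and the delay numbers are illustrative assumptions:

```python
import time

def polite_get(fetch, url, retries=3, delay=1.0, backoff=2.0):
    """Call fetch(url), pausing between attempts and doubling the wait
    after each failure. fetch should raise on a failed request."""
    wait = delay
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise            # out of attempts: give up
            time.sleep(wait)
            wait *= backoff

# Stubbed fetcher: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"

print(polite_get(flaky_fetch, "https://example.com", delay=0.01))
# -> <html>ok</html>
```

A fixed pause between successful requests (say one or two seconds) matters just as much as the retry logic when a site monitors traffic.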
  24. claudia80

    I would need the final modification of shahin's code for the extraction of the school website. Then I can proceed with a test extraction from page 1 to page 1000. If I try it without the change, I'll have to do it all again (I would waste time twice).
  25. Marc L


    As I do not think any helper has more time to spend on this subject,
    it's better if you first try to load the first 1000 pages
    and report how much time it needs on your end …
