
Scraping data from a specific site

YasserKhalil

Well-Known Member
Try this code
Code:
Sub Test()
    Dim html        As New HTMLDocument
    Dim posts       As Object
    Dim post        As Object
    Dim v           As Object
    Dim w           As Object
    Dim x           As Object
    Dim p           As Long
    Dim r           As Long
    Dim a()         As Variant

    ReDim a(1 To 100000, 1 To 4)

    Application.ScreenUpdating = False
        For p = 1 To 2
            'Download the listing page and load it into an HTMLDocument
            html.body.innerHTML = GetHTMLSource("https://maps.me/catalog/education/amenity-school/?page=" & p)

            Set posts = html.getElementsByClassName("item__title")

            For Each post In posts
                r = r + 1
                a(r, 1) = post.innerText

                'First <p> after the title: address
                Set v = post.NextSibling.getElementsByTagName("p")(0)
                If Not v Is Nothing Then a(r, 2) = v.innerText

                'Second <p>: phone (prefixed with ' so Excel keeps it as text)
                Set w = post.NextSibling.getElementsByTagName("p")(1)
                If Not w Is Nothing Then If Mid(w.innerText, 1, 5) = "Phone" Then a(r, 3) = Replace(w.innerText, "Phone: ", "'")

                'Third <p>: website
                Set x = post.NextSibling.getElementsByTagName("p")(2)
                If Not x Is Nothing Then If Mid(x.innerText, 1, 7) = "Website" Then a(r, 4) = Replace(x.innerText, "Website: ", "")
            Next post

            Set html = Nothing
        Next p

        Columns("A:D").ClearContents
        Range("A1").Resize(1, 4).Font.Bold = True
        Range("A1").Resize(1, 4).Value = Array("School Name", "Address", "Phone", "Website")
        Range("A2").Resize(r, UBound(a, 2)).Value = a
    Application.ScreenUpdating = True
End Sub

Function GetHTMLSource(strURL As String) As String
    Dim XMLHTTP    As Object

    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")

    With XMLHTTP
        .setTimeouts 4000, 4000, 4000, 4000
        .Open "GET", strURL, False
        .send
        If .Status = 200 Then GetHTMLSource = XMLHTTP.responseText Else GetHTMLSource = ""
    End With
End Function
 

claudia80

Member
If I start the macro, a dialog appears with an error message:
Compile error:
User-defined type not defined.

By clicking on the "?" it directs me to this page:

https://msdn.microsoft.com/it-it/VBA/office-f1-landing/user-defined-type-not-defined-office-shared-vblr6-chm1032807

Should I activate other references?
 


YasserKhalil

Well-Known Member
You have to enable the reference "Microsoft HTML Object Library", or change the first line of the variable declarations to Dim html As Object
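The late-binding alternative can be sketched as follows; `CreateObject("htmlfile")` is a common stand-in for an early-bound `HTMLDocument`, and the sample markup here is invented purely for illustration:

```vba
' Late-bound variant: no "Microsoft HTML Object Library" reference required.
' The sample markup below is invented for illustration only.
Sub TestLateBound()
    Dim html As Object
    Set html = CreateObject("htmlfile")          ' late-bound HTML document
    html.body.innerHTML = "<p class=""item__title"">Sample School</p>"
    Debug.Print html.getElementsByTagName("p")(0).innerText
End Sub
```

Note that the late-bound "htmlfile" object exposes an older document mode, so `getElementsByClassName` and `querySelector` may be unavailable on it; `getElementsByTagName` remains safe.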
 

claudia80

Member
I enabled "Microsoft HTML Object Library" and launched the macro. It only worked up to the second page, stopping at: #6 Kindergarten.
Could the block be solved? Moreover, could the extraction be done by state (at the top of the page is there the possibility to choose the country) making sure that in the first extraction column the status is also inserted?
I noticed that clicking on the links also shows the geographic coordinates (GPS). Can they be extracted?
 

Marc L

Excel Ninja
Moreover, could the extraction be done by state (at the top of the page is there the possibility to choose the country) making sure that in the first extraction column the status is also inserted?
Can you explain crystal clearly? (So state = country?!)
Can you attach an expected result?
What is the purpose of downloading more than 51,000 webpages
with one of the slowest languages?!
Looks like Stefano …
 

claudia80

Member
I would need these adjustments:
1) The macro hangs on the second page. Could you make the macro scrape all the pages?
2) Is it possible to extract the geographical coordinates? You can find them when you click on the name of a school, next to the label "GPS".
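Point 2 could be sketched roughly as below, assuming the detail page renders the coordinates in a paragraph whose text starts with "GPS" (the label and the page structure are assumptions, not verified markup; the "Microsoft HTML Object Library" reference from earlier in the thread is required):

```vba
' Hedged sketch: fetch one school's detail page and print its "GPS: …" line.
' The "GPS" label and the page structure are assumptions about the site.
Sub GetGPS(ByVal pageURL As String)
    Dim http As Object, html As New HTMLDocument, p As Object

    Set http = CreateObject("MSXML2.ServerXMLHTTP")
    http.Open "GET", pageURL, False
    http.send
    If http.Status <> 200 Then Exit Sub

    html.body.innerHTML = http.responseText

    'Scan every paragraph for text beginning with "GPS"
    For Each p In html.getElementsByTagName("p")
        If Left$(Trim$(p.innerText), 3) = "GPS" Then
            Debug.Print Replace(p.innerText, "GPS: ", "")
            Exit For
        End If
    Next p
End Sub
```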
 

shahin

Active Member
You have already got an excellent answer. I'm just trying to provide you with a script created using ".querySelector()" instead of the usual approach. This method is comparatively easier to deal with. However, the webpage itself is really slow. The following script will give you "Names", "Address along with co-ordinates", and "Phone" numbers. I've used a hardcoded number for pagination because the number of pages is huge. Feel free to replace the number "10" with the highest page number you want to grab data from.

Try this:
Code:
Sub GetInformation()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim pagenum&, I&, R&

    For pagenum = 1 To 10
        'Load one listing page
        With Http
            .Open "GET", "https://maps.me/catalog/education/amenity-school/?page=" & pagenum, False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            Html.body.innerHTML = .responseText
        End With

        'Each listing links to a detail page; relative hrefs come back
        'prefixed with "about:" once innerHTML is assigned, hence the Split
        With Html.querySelectorAll(".item .item__title a")
            For I = 0 To .Length - 1
                Http.Open "GET", "https://maps.me" & Split(.Item(I).getAttribute("href"), "about:")(1), False
                Http.setRequestHeader "User-Agent", "Mozilla/5.0"
                Http.send
                Html.body.innerHTML = Http.responseText

                'Name, then address (which includes co-ordinates), then phone
                R = R + 1: Cells(R, 1) = Html.querySelector("h1[itemprop='name']").innerText
                If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                    Cells(R, 2) = Html.querySelector(".item__desc-location span[itemprop='address']").innerText
                End If
                If Not Html.querySelector("p.item__desc-phone span") Is Nothing Then
                    Cells(R, 3) = Split(Html.querySelector("p.item__desc-phone span").innerText, "Phone: ")(1)
                End If
            Next I
        End With
    Next pagenum
End Sub
To execute the above script, you need to add the following references in the VBE (Tools ▸ References):

Code:
Microsoft XML, V6.0
Microsoft HTML Object Library
 

claudia80

Member
Thank you for your intervention.
I tried both macros. In the first, the data are extracted directly from the first search page and the extraction is done in an orderly manner, but the coordinates found on the page that appears by clicking on the school name are missing.
In the second, the data are extracted directly from the page that appears by clicking on the school name (with another macro I will split the school name from the address, etc.), but the extraction of the website is missing.

I ask if it's possible to:
1) add extraction of the website to the code (an example of a school with a website: https://maps.me/catalog/education/amenity-school/get-club-innovacionnaya-shkola-izucheniya-inostrannyh-yazykov-4611686023970145825/)
2) make the extraction happen automatically for all the pages, without having to enter the number of pages in the code
3) understand from the page structure whether some schools might also have a contact email. If so, I would also be interested in the email.

thank you so much
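The auto-pagination request can be sketched by looping until a page comes back with no ".item__title" entries instead of hard-coding the page count. The stop condition (an empty listing past the last page) is an assumption about how the site behaves, and this sketch reuses the `GetHTMLSource` function from the first macro:

```vba
' Hedged sketch: request pages until one returns no ".item__title" hits.
' Assumes pages past the end come back empty; requires the
' "Microsoft HTML Object Library" reference and GetHTMLSource from above.
Sub ScrapeAllPages()
    Dim html As HTMLDocument
    Dim pageNum As Long, hits As Long

    pageNum = 0
    Do
        pageNum = pageNum + 1
        Set html = New HTMLDocument
        html.body.innerHTML = GetHTMLSource( _
            "https://maps.me/catalog/education/amenity-school/?page=" & pageNum)
        hits = html.getElementsByClassName("item__title").Length
        '... process the "hits" items here, as in the earlier macros ...
    Loop While hits > 0
End Sub
```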
 

shahin

Active Member
@claudia80, You didn't mention whether my approach could traverse multiple pages. Answer could be "yes" or "no". Additional requirement can be fulfilled later. Thanks.
 

claudia80

Member
I confirm that the macro extracts the data for the pages indicated in the macro. That was implied in my previous message, when I asked whether it was possible to extract all the pages without having to enter their number in the code.
Is it necessary to format the Excel sheet first by data type (for example, the telephone-number column)?
 

Marc L

Excel Ninja
It seems a classic VBA procedure would need more than 10 days
to load all the pages, if only Windows or Excel doesn't hang!

It's safer to move to a more efficient approach (a pro tool, a web language) …
 

Marc L

Excel Ninja
As it's pretty much the same as Yasser's in post #2, and
scraping 51,000+ webpages under VBA is insane …

See with your IT department about preparing a PC without any web update service
(Windows, antivirus, Java, Acrobat Reader, …) to try to avoid any issue …
 

claudia80

Member
What is the system you referred to in message 17?
I could proceed by degrees, using shahin's macro and dividing the pages over several days.
 

claudia80

Member
I have no IT department.
For me, the two macros can work well once they are fitted with the missing data and functions.
Meanwhile, I await the modification of the code by Shahin ...
 

Marc L

Excel Ninja
AutoIt may be close to beginner level, like Import.io, OutWit, Scrape Box,
or whatever other tool you can easily find yourself on the Web …

As any modification should dramatically slow down the execution,
first try the actual code after updating it with the last page number,
in order to get an idea of how long its execution lasts
(multiply by 20 for a complete VBA scrape)
and to see how viable it is as well,
as some websites monitor their activity and
can block the IP address in case of such scraping …
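One common way to reduce the blocking risk mentioned here is to pause between requests. A minimal sketch; the 2-second delay is an arbitrary choice, not a limit the site documents:

```vba
' Pause between requests to lower the chance of an IP block.
' The 2-second delay is an arbitrary assumption, not a documented site limit.
Sub PoliteDelayExample()
    Dim p As Long
    For p = 1 To 5
        '... request and parse page p here, as in the macros above ...
        Application.Wait Now + TimeSerial(0, 0, 2)   'wait 2 seconds
    Next p
End Sub
```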
 

claudia80

Member
I would need the final modification of shahin's code for the extraction of the school website. Then I can proceed with a test extraction from page 1 to page 1000. If I try it without the change, I'll have to do it all again (wasting time twice).
 

Marc L

Excel Ninja

As I do not think any helper has more time to spend on this subject,
it's better if you first try to load the first 1000 pages
and report how long it takes on your end …
 