1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions, please remember that our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing. Remember that we have people here for whom English is not their first language, and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, please include a sample file in the initial post to receive a quicker and more targeted answer.

Scraping data from a specific site

Discussion in 'VBA Macros' started by claudia80, Aug 10, 2018.

  1. claudia80

    claudia80 Member

    Messages:
    81
    Could you help me scrape the schools for each state?
    It seems that the only data that can be extracted are: name, address, telephone and website. I don't think the contact email is among the data available.

    Site to scrape:
    https://maps.me/catalog/education/amenity-school/

    Thanks in advance
  2. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Try this code
    Code (vb):
    Sub Test()
        'Requires a reference to "Microsoft HTML Object Library"
        Dim html        As New HTMLDocument
        Dim posts       As Object
        Dim post        As Object
        Dim v           As Object
        Dim w           As Object
        Dim x           As Object
        Dim a()         As Variant
        Dim p           As Long
        Dim r           As Long

        ReDim a(1 To 100000, 1 To 4)

        Application.ScreenUpdating = False
            For p = 1 To 2
                html.body.innerHTML = GetHTMLSource("https://maps.me/catalog/education/amenity-school/?page=" & p)
       
                Set posts = html.getElementsByClassName("item__title")
       
                For Each post In posts
                    r = r + 1
                    a(r, 1) = post.innerText
       
                    Set v = post.NextSibling.getElementsByTagName("p")(0)
                    If Not v Is Nothing Then a(r, 2) = v.innerText
       
                    Set w = post.NextSibling.getElementsByTagName("p")(1)
                    If Not w Is Nothing Then If Mid(w.innerText, 1, 5) = "Phone" Then a(r, 3) = Replace(w.innerText, "Phone: ", "'")    'leading apostrophe keeps the number as text
       
                    Set x = post.NextSibling.getElementsByTagName("p")(2)
                    If Not x Is Nothing Then If Mid(x.innerText, 1, 7) = "Website" Then a(r, 4) = Replace(x.innerText, "Website: ", "")
                Next post
       
                Set html = Nothing
            Next p
       
            Columns("A:D").ClearContents
            Range("A1").Resize(1, 4).Font.Bold = True
            Range("A1").Resize(1, 4).Value = Array("School Name", "Address", "Phone", "Website")
            Range("A2").Resize(r, UBound(a, 2)).Value = a
        Application.ScreenUpdating = True
    End Sub

    Function GetHTMLSource(strURL As String) As String
        Dim XMLHTTP    As Object

        Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")

        With XMLHTTP
            .setTimeouts 4000, 4000, 4000, 4000
            .Open "GET", strURL, False
            .send
            If .Status = 200 Then GetHTMLSource = XMLHTTP.responseText Else GetHTMLSource = ""
        End With
    End Function
  3. claudia80

    When I run the macro, a dialog appears with an error message:
    Compile error:
    User-defined type not defined.

    Clicking the "?" takes me to this page:

    https://msdn.microsoft.com/it-it/VB...ce-shared-vblr6-chm1032807?f1url=https://msdn. microsoft.com 2Fquery%%% 2Fdev11.query 3FappId% 3DDev11IDEF1% 26l% 3Dit-EN% 26k% 3DK (vblr6.chm1032807)% 3BK (TargetFrameworkMoniker-Office.Version 3Dv15%)%% 26rd 3Dtrue

    Should I enable other references?

    Attached Files:

    • 1.PNG
  4. YasserKhalil

    You have to enable the reference "Microsoft HTML Object Library", or change the first line of the variable declarations to Dim html As Object.
  5. Marc L

    Marc L Excel Ninja

    Messages:
    4,221
    So do not forget the CreateObject function …
  6. claudia80

    claudia80 Member

    Messages:
    81
    I enabled "Microsoft HTML Object Library" and ran the macro. It only worked up to the second page, stopping at: #6 Kindergarten.
    Could this block be solved? Moreover, could the extraction be done by state (at the top of the page there is the option to choose the country), making sure that the state is also inserted in the first extraction column?
    I noticed that when clicking on the links, the geographic coordinates (GPS) are also shown. Can they be extracted?
  7. Marc L

    Can you explain crystal clear? (As state = country?!)
    Can you attach an expected result?
    What is the purpose of downloading more than 51,000 webpages
    with one of the slowest languages?!
    Looks like Stefano …
  8. claudia80

    Answer attached.
    The macro is fast.

    Attached Files:

  9. Marc L

    Obviously just mod the last page # in the code …
  10. claudia80

    Could you make the changes to the code?
  11. YasserKhalil

    I can't get what you mean exactly …
  12. claudia80

    I would need these adjustments:
    1) The macro hangs on the second page. Could you make the macro scrape all the pages?
    2) Is it possible to extract the geographical coordinates? You can find them when you click on the name of a school, next to the label "GPS".
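The "scrape all the pages" request in (1) boils down to open-ended pagination: keep fetching page 1, 2, 3, … until a page comes back empty. A minimal sketch of that stopping rule, in Python for brevity, with a stubbed fetcher standing in for the real maps.me requests (the stub data below is invented for illustration):

```python
def scrape_all_pages(fetch_page):
    """Fetch page 1, 2, 3, ... until a page yields no items.

    fetch_page(n) should return a list of records for page n,
    and an empty list once n is past the last page."""
    records = []
    page = 1
    while True:
        items = fetch_page(page)
        if not items:        # past the last page: stop
            break
        records.extend(items)
        page += 1
    return records

# Stub standing in for the real HTTP request: three pages of
# invented school names, then nothing.
fake_site = {1: ["School A", "School B"], 2: ["School C"], 3: ["School D"]}

def fake_fetch(n):
    return fake_site.get(n, [])

print(scrape_all_pages(fake_fetch))
# -> ['School A', 'School B', 'School C', 'School D']
```

In the VBA scripts in this thread, the same idea would replace the fixed upper bound (`For p = 1 To 2`, `For pagenum = 1 To 10`) with a `Do While` loop that stops once a page yields zero `item__title` matches.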
  13. shahin

    shahin Active Member

    Messages:
    887
    You have already got an excellent answer. I'm just trying to provide you with a script created using ".querySelector()" instead of the usual approach. This method is comparatively easier to deal with. However, the webpage itself is really slow. The following script will give you "Names", "Address along with co-ordinates", and "Phone" numbers. I've used a hardcoded number for pagination because the number of pages is huge. Feel free to replace the number "10" with the highest page number you want to grab data from.

    Try this:
    Code (vb):

    Sub GetInformation()
        Dim Http As New XMLHTTP60, Html As New HTMLDocument
        Dim pagenum&, I&, R&
     
        For pagenum = 1 To 10
            With Http
                .Open "GET", "https://maps.me/catalog/education/amenity-school/?page=" & pagenum, False
                .setRequestHeader "User-Agent", "Mozilla/5.0"
                .send
                Html.body.innerHTML = .responseText
            End With
         
        With Html.querySelectorAll(".item .item__title a")
            For I = 0 To .Length - 1
                'Links parsed via innerHTML get an "about:" prefix, so strip it off
                Http.Open "GET", "https://maps.me" & Split(.Item(I).getAttribute("href"), "about:")(1), False
                    Http.setRequestHeader "User-Agent", "Mozilla/5.0"
                    Http.send
                    Html.body.innerHTML = Http.responseText
                 
                    R = R + 1: Cells(R, 1) = Html.querySelector("h1[itemprop='name']").innerText
                    If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                        Cells(R, 2) = Html.querySelector(".item__desc-location span[itemprop='address']").innerText
                    End If
                    If Not Html.querySelector("p.item__desc-phone span") Is Nothing Then
                        Cells(R, 3) = Split(Html.querySelector("p.item__desc-phone span").innerText, "Phone: ")(1)
                    End If
                Next I
            End With
        Next pagenum
    End Sub
     
    To execute the above script, you need to add the following references to the library:

    Code (vb):

    Microsoft XML, V6.0
    Microsoft HTML Object Library
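shahin's `.querySelector()` approach hinges on picking elements out by CSS class. For readers without the MSHTML references, the same extraction idea can be sketched with Python's standard-library `HTMLParser`; the sample HTML below is invented to mirror the maps.me listing markup, not copied from the site:

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collect the text of every element whose class attribute contains
    'item__title' (the same hook the CSS selector above relies on).
    Simplified: assumes no unclosed/void tags inside matched elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a matching element
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "item__title" in classes.split():
            self.depth += 1
            self.titles.append("")
        elif self.depth:
            self.depth += 1  # nested tag inside a matching element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.titles[-1] += data.strip()

sample = """
<div class="item">
  <h2 class="item__title"><a href="/catalog/x">School A</a></h2>
  <p>Address A</p>
</div>
<div class="item">
  <h2 class="item__title"><a href="/catalog/y">School B</a></h2>
</div>
"""

scraper = TitleScraper()
scraper.feed(sample)
print(scraper.titles)   # -> ['School A', 'School B']
```

MSHTML's `querySelectorAll(".item .item__title a")` does the equivalent matching (with proper CSS selector handling) inside the VBA script.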
     
  14. claudia80

    Thank you for your intervention.
    I tried the macro and I can say that in the first macro the data were extracted directly from the first search page and the extrapolation is done in an orderly manner but the coordinates that are found in the page that appears by clicking on the name of the school are missing.In the second the data are extrapolated directly from the page that appears by clicking on the name of the school (with another macro I will split the data name school from the address, etc.) but the extraction of the website is missing.

    I ask if it's possible to:
    1) Add the extraction of the website to the code (an example of a school with a website: https://maps.me/catalog/education/a...ionnaya-shkola-izucheniya-inostrannyh-yazykov -4,611,686,023,970,145,825 /)
    2) Make the extraction happen automatically for all the pages, without having to enter the number of pages in the code
    3) Understand from the page's structure whether some schools might also have a contact email? If so, I would also be interested in the email.

    thank you so much
  15. shahin

    @claudia80, You didn't mention whether my approach could traverse multiple pages. Answer could be "yes" or "no". Additional requirement can be fulfilled later. Thanks.
  16. claudia80

    I confirm that the macro extracts the data of the pages indicated in the macro. It was implied in the previous message, when I asked whether it was possible to extract all the pages without having to enter their number in the code.
    Is it necessary to format the Excel sheet first by data type (for example, the telephone number column as text)?
  17. Marc L

    It seems a classic VBA procedure would need more than 10 days
    to load all the pages, if Windows or Excel doesn't hang first!

    It's safer to move to a more efficient way (a pro tool, a web language) …
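Marc L's estimate can be sanity-checked with back-of-envelope arithmetic. The per-request latencies below are assumptions for illustration, not measurements of maps.me:

```python
# Rough sequential-scrape run time; the latencies are assumptions
# for illustration, not measurements.
detail_pages = 51_000        # Marc L's figure for the whole catalogue
seconds_per_request = 2.0    # assumed average round trip

total_hours = detail_pages * seconds_per_request / 3600
print(f"{total_hours:.0f} hours")    # -> 28 hours

# At 20 s per page (timeouts, retries, throttling), the same job takes:
total_days = detail_pages * 20 / 86400
print(f"{total_days:.0f} days")      # -> 12 days
```

Sequential VBA requests on a slow site sit toward the high end of this range, which is why switching tools or splitting the job comes up later in the thread.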
  18. claudia80

    Mr Marc, if you want, could you post the most appropriate code following your logic?
    Last edited by a moderator: Aug 12, 2018
  19. Marc L

    As it's pretty much the same as Yasser's in post #2, and
    scraping 51,000+ webpages under VBA is insane …

    See with your IT department to prepare a PC without any web upload service
    (Windows, antivirus, Java, Acrobat Reader, …) to try to avoid any issue …
  20. claudia80

    What is the system you referred to in message #17?
    I could proceed in stages, using shahin's macro and splitting the pages over several days.
  21. Marc L



    Some web oriented language or tool, ask your IT …​
  22. claudia80

    I have no IT department.
    For me, the two macros can work well once they are fitted with the missing data and functions.
    Meanwhile, I'll wait for the modification of the code by Shahin ...
  23. Marc L

    AutoIt may be near beginner level, like Import.io, OutWit, Scrape Box,
    or whatever tool you can easily find yourself on the Web …

    As any modification should dramatically slow down the execution,
    first try the actual code after updating it with the last page #,
    in order to get an idea of how long its execution lasts
    (multiply by 20 for a complete VBA scrape)
    and to see how viable it is as well,
    as some websites monitor their activity and
    can block the IP address in case of such scraping …
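The IP-blocking risk Marc L mentions is usually mitigated by throttling and retrying politely. A minimal sketch of a retry-with-backoff wrapper, in Python for brevity; `flaky_fetch` is a stub standing in for the real HTTP call, and the delay numbers are illustrative assumptions:

```python
import time

def polite_get(fetch, url, retries=3, delay=1.0, backoff=2.0):
    """Call fetch(url), pausing between attempts and doubling the wait
    after each failure. fetch should raise on a failed request."""
    wait = delay
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise            # out of attempts: give up
            time.sleep(wait)
            wait *= backoff

# Stubbed fetcher: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"

print(polite_get(flaky_fetch, "https://example.com", delay=0.01))
# -> <html>ok</html>
```

A fixed pause between successful requests (say one or two seconds) matters just as much as the retry logic when a site monitors traffic.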
  24. claudia80

    I would need the final modification of shahin's code for the extraction of the school website. Then I can proceed with a test extraction from page 1 to page 1000. If I try it without the change, I'll have to do it all again (I would waste time twice).
  25. Marc L


    As I do not think any helper has more time to spend on this subject,
    it's better if you first try to load the first 1000 pages
    and report how much time it needs on your end …
