• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

scrapping data from specific site

Marc L

Excel Ninja
Well, it seems you just misread …
in the webpage display of any school ?!
Compare with the phone part
in the code of the web page it is written in the two versions:
url and website.
… I hope now you see the difference as I never wrote « code » !
So open any school webpage under your favorite webbrowser
and just see what appears on screen …

If you see something else than I wrote so I just expect a screenshot
or notice the country, the page # and the school name …

With a couple of neurons, that just needs a couple of chars to cut via Split :
what are the two chars just before where to cut the string ?
(at easy beginner level as just reading a string on a webpage screen …)
 

claudia80

Member
I know that the website exits on the web page.
I also wrote that I tried to write website in the code of the macro but it does not change anything ..
 

Marc L

Excel Ninja

As I can't guess your try, better is to post it as I wanna see …
As I yet wrote only a couple of chars are necessary …
As a reminder, see Yasser's code …​
 

claudia80

Member
Now the macro no longer works. Perhaps the site temporarily blocked the ip.

Code:
 If Not Html.querySelector("p.item__desc-url span") Is Nothing Then
                    Cells(R, 4) = Split(Html.querySelector("p.item__desc-url span").innerText, "Website: ")(2)

I tried to put "href =" and it does not go ..
 

Marc L

Excel Ninja
Ok you fixed the first error even if there are superfluous chars …

The second error in dark red : , "Website: ")(2)

Again, just compare with the phone part …
 

claudia80

Member
Ok you fixed the first error even if there are superfluous chars …

The second error in dark red : , "Website: ")(2)

Again, just compare with the phone part …

I saw that at first there was no number then (1) and so I thought it was a progression for this I have included ...

should I take it off?
 

Marc L

Excel Ninja

No ! As it represents the part to keep.

And just reading the help for Split,
you must see even no delimiter char is necessary …
 

Marc L

Excel Ninja
See the original codes of Yasser and shahin which value they use
as I yet wrote to compare with the phone part !
What did you mod something you do not understand, that's weird …

Don't need to explain what is yet written in the VBA inner help :
« Returns a zero-based one-dimensional array
containing a specified number of substrings. »

So the first sub-string has the index zero … :rolleyes:
 

claudia80

Member
I understood the progression of numbers.
I did not understand this part of your message:
you must see even no delimiter char is necessary ...
what would it be: delimiter char
 

claudia80

Member
I'm trying to understand by comparing the code to separate the address and the tel code in the page code to get to extrapolate the url.So far analyzing the page of the website I understood that:1) we must consider the "p class" when at the beginning of the macro I find "Dim Http As New XMLHTTP60, Html As New HTMLDocument";2) the "div class" part must be considered when "object" is placed at the beginning of the macro.Probably a page can be analyzed in the html version or for objects.Then in the first case to extrapolate a text you must select the part that precedes it with the code = Html.querySelector ()The part part that follows the code is inserted in parentheses: (p class =) by inserting before writing: (.p). It is continued by inserting: span - next to this code there is immediately the part that interests should be left so otherwise between the square brackets [] it is necessary to insert the code string that precedes the text to extract (as in the case of the address: [ itemprop = 'address'].In the case of the extraction of the website as for the address if the reasoning is right, I would have to insert the value up to (hrf =) between the square brackets, so it would become:

Code:
If Not Html.querySelector ("p.item__desc-url span [Website: a herf =]") Is Nothing ThenCells (R, 4) = Split (Html.querySelector ("p.item__desc-url span [Website: <a herf =]"). InnerText, "rel =") (2)

Then in the next part of the code I should insert a delimiter (https://www.techonthenet.com/excel/formulas/split.php) that tells the macro to extract up to the first of the word "rel" - this always if you should extract the url that is on the left instead of the one on the right ...If I have reasoned well so far I can not understand how to indicate to the macro to exclude the whole next part ..
 

Marc L

Excel Ninja
Again as yet written twice just compare the phone part

Cells(R, 3) = Split(Html.querySelector("p.item__desc-phone span").innerText, "Phone: ")(1)

wtih your website part

Cells(R, 4) = Split(Html.querySelector("p.item__desc-url span").innerText, "Website: ")(2)

as the error is just in red(superfluous in bold & underlined, see VBA help of Split)
 

claudia80

Member
I understood that (1) refers to the text so inserting it I extracted the website.
Now I'm trying to split the second column into: status, address, GPS:
I'm trying to apply the split function. Maybe I did not understand it well ..
Between the quotation marks "" I indicate which parameter to take into consideration (in my case the comma) so it becomes ",". While the number before the parenthesis indicates up to which comma (the value between the quotation marks) to extract. At least this I understood unless there is an other function for extracting the text ..

I entered this part of the code:

Code:
R = R + 1: Cells(R, 1) = Html.querySelector("h1[itemprop='name']").innerText
                If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                    Cells(R, 2) = Html.querySelector(".item__desc-location span[itemprop='address']").innerText
                End If
                If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                    Cells(R, 3) = Split(Html.querySelector(".item__desc-location span[itemprop='address']").innerText, ", ", 2)
                End If
                If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                    Cells(R, 4) = Split(Html.querySelector(".item__desc-location span[itemprop='address']").innerText, "GPS:", 2)
                End If
                If Not Html.querySelector("p.item__desc-phone span") Is Nothing Then
                    Cells(R, 5) = Split(Html.querySelector("p.item__desc-phone span").innerText, "Phone: ")(1)
                End If
                If Not Html.querySelector("p.item__desc-url span") Is Nothing Then
                    Cells(R, 6) = Split(Html.querySelector("p.item__desc-url span").innerText, "Website: ")(1)
                End If

I can extract the state but I can not continue. Why?
 

Marc L

Excel Ninja
I understood that (1) refers to the text
As the function Split returns a zero-based array the (#) just
indicates the index # of the desired substring within this array …

Split is the easy way for those having a good logic.
You can use a variable to store the array result.

Other VBA text functions like in Yasser's code
via InStr, Left, Mid, Replace & Right to see in VBA help …

split the second column into: status
« status » ?‼ What it could be, any clear sample ?

 

claudia80

Member
This part is difficult for me ..
The macro may be fine. I could extract the data with formulas and insert the pages by inserting the links of each state ... I was wrong to write before, I wanted to write state
I would need the code to save the data when the macro ends the search or stops. Where can I find it?
 

Marc L

Excel Ninja

To save for example ThisWorkbook.Save and apart the VBA help
(where you can see the methods of the Workbook object)
do not forget the beginner best friend : the Macro Recorder !​
 

Marc L

Excel Ninja

The issue to parse the global location is the inconsistent data
of this website as some webpages don't have any address
so an error will raise when trying to split such lacking data
without any mind checking strategy …
 

claudia80

Member
then it is better to divide the data with formulas.

Ps. I have vouched a fundamental thing.
How do you make the empty lines appear at the start of the macro without deleting the previously saved data?
Thank you
 

Marc L

Excel Ninja

Initialize the variable R with the last row used, different ways,

an example : R = Cells(Rows.Count, 1).End(xlUp).Row …​
 

claudia80

Member
R = R + 1
becomes:
R = Cells (Rows.Count, 1) .End (xlUp) .Row + 1

here the complete code I will use will end. More in the I will try to understand how to set the time of the macro and the ability to set pauses.

thank you


Code:
Sub GetInformation()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim pagenum&, I&, R&
    For pagenum = 1 To 1
        With Http
            .Open "GET", "https://maps.me/catalog/education/amenity-school/?page=" & pagenum, False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            Html.body.innerHTML = .responseText
        End With
   
        With Html.querySelectorAll(".item .item__title a")
            For I = 0 To .Length - 1
                Http.Open "GET", "https://maps.me" & Split(.Item(I).getAttribute("href"), "about:")(1), False
                Http.setRequestHeader "User-Agent", "Mozilla/5.0"
                Http.send
                Html.body.innerHTML = Http.responseText
           
                R = Cells(Rows.Count, 1).End(xlUp).Row + 1: Cells(R, 1) = Html.querySelector("h1[itemprop='name']").innerText
                If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                    Cells(R, 2) = Html.querySelector(".item__desc-location span[itemprop='address']").innerText
                End If
                If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                    Cells(R, 3) = Split(Html.querySelector(".item__desc-location span[itemprop='address']").innerText, ", ", 2)
                End If
                If Not Html.querySelector(".item__desc-location span[itemprop='address']") Is Nothing Then
                    Cells(R, 4) = Split(Html.querySelector(".item__desc-location span[itemprop='address']").innerText, "GPS:", 2)
                End If
                If Not Html.querySelector("p.item__desc-phone span") Is Nothing Then
                    Cells(R, 5) = Split(Html.querySelector("p.item__desc-phone span").innerText, "Phone: ")(1)
                End If
                If Not Html.querySelector("p.item__desc-url span") Is Nothing Then
                    Cells(R, 6) = Split(Html.querySelector("p.item__desc-url span").innerText, "Website: ")(1)
                End If
            Next I
        End With
    Next pagenum
    ThisWorkbook.Save
End Sub
 

Marc L

Excel Ninja
R = R + 1
becomes:
R = Cells (Rows.Count, 1) .End (xlUp) .Row + 1
I never wrote that as I just wrote I N I T I A L I Z E so
at the beginning of the procedure before the first loop.
By chance here it's not a concern as recalculating each time the
last used row from the last row of the worksheet (# 1 048 576)
instead of just adding 1 to the previous row …
 
Top