Help me find "href"

shahin · Feb 22, 2017

Hi there all! Good day. Hope you are doing well. I'm here to seek help to find "href". I have tried to write a code to parse "Title" and "href" of that Title. I can parse Title but in case of "href" I can't write that portion. Any help would be greatly appreciated. Here is the code I've written. Thanks in advance.

Code:

Sub TorrentData()
Const URL = "https://www.yify-torrent.org/search/1080p/"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim P As Long, N As Long, L As Long, str As Variant

L = 2

 http.Open "GET", URL, False
 http.send
 html.body.innerHTML = http.ResponseText
 str = Split(http.ResponseText, "<span class=""name"">")
 N = UBound(str)
    For P = 1 To N - 1
        Cells(L, 1) = Split(str(P), "<")(0)
        ''Cells(L, 2) = Here should be that code for "href"
        L = L + 1
    Next P
End Sub

Html element for "href" is between the block I'm pasting.

Code:

<a href="/movie/51608/download-iliza-shlesinger-confirmed-kills-2016-1080p-mp4-yify-torrent.html" target="_blank" title="Download Iliza Shlesinger: Confirmed Kills (2016) 1080p"><span class="play" style="display: none;"><span class="name">Iliza Shlesinger: Confirmed Kills (2016) 1080p</span></span><img src="//pic.yify-torrent.org/20170222/51608/iliza-shlesinger-confirmed-kills-2016-1080p-poster.jpg" alt="Iliza Shlesinger: Confirmed Kills (2016) 1080p"></a>

Chihiro · Feb 22, 2017

Just use html.getElementsByTagName("H3")(i).innerText for title.

Then html.getElementsByTagName("H3")(i).getElementsByTagName("A")(0).href

So something like.

Code:

Sub TorrentData()
Const Site = "https://www.yify-torrent.org"
Const SearchP = "/search/1080p/"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim i As Long, L As Long
Dim objH3 As Object

L = 2

http.Open "GET", Site & SearchP, False
http.send
html.body.innerHTML = http.ResponseText
Set objH3 = html.getElementsByTagName("H3")
For i = 0 To objH3.Length - 1
    Cells(L, 1) = objH3(i).innerText
    Cells(L, 2) = Replace(objH3(i).getElementsByTagName("A")(0).href, "about:", Site)
    L = L + 1
Next
End Sub

shahin · Feb 22, 2017

Thanks sir Chihiro, I have already learnt from you the style you applied here. Basically, I always try to follow your style in my code. However, I wish to parse the "href" using the method I started here. Thanks again.

Chihiro · Feb 22, 2017

You are splitting responseText at "<span class=""name"">" and looping from str(1 to N-1). So you are not able to capture href which occurred before that.
EDIT: You are actually excluding LBound(str) & UBound(str) using 1 to N-1. Since str is Base 0 array.

Instead split responseText at "<h3><a href='".
Also since you are directly working with http.responseText, no need to put it into html.body.innerHTML.

So code would be something like...

Code:

Sub TorrentData()
Const URL = "https://www.yify-torrent.org/search/1080p/"
Dim http As New MSXML2.XMLHTTP60
Dim P As Long, N As Long, L As Long, str As Variant

L = 2

    http.Open "GET", URL, False
    http.send
    str = Split(http.ResponseText, "<h3><a href='")

    N = UBound(str)

    For P = 1 To N
        Cells(L, 1) = Split(Split(str(P), "title=""")(1), """>")
        Cells(L, 2) = Split(str(P), "'")(0)
        L = L + 1
    Next P
End Sub

shahin · Feb 23, 2017

You are just awesome. This is what i wanted. Actually, I didn't even have any basic idea on this style that is because you found bunch of mistakes in my code. Now I'm gonna study on yours to learn. Thankssssssssssssssssssssssssssssssssssss.

shahin · Feb 23, 2017

Dear sir, can you please provide me with any link following which I can learn more on this style to avoid errors. Thanks again.

Marc L · Feb 23, 2017

Hi !

It is not about any style but just about Logic …

Like I yet wrote in your previous thread on same subject,
just train with easy VBA basics text functions as explained
in VBA inner help and in tutorials on web !

For those not able to well read a webpage code,
parsing through object structure like in Chihiro's post #2 is easier,
just needs at least to understand document' structure …

shahin · Feb 23, 2017

Dear sir Chihiro, you might have noticed that using only the class elements in this three categories, one can get results with some flaws. Because, if any elements among them is missing then the alignments doesn't go parallely and
becomes a genuine mess.[More specifically-suppose a page should contain: Name count 30, phone count 30, address count 30 but if there are 25 phone counts and 27 address count in that page then the three columns are not parallel with the elements of same item]. Is there any way to get rid of that problem or the usage of three class elements is itself a flaw? If you run the code you will get what i meant. Thanks in advance.

Code:

Sub GetData()
Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
Const weblink = "https://www.yellowpages.com.au"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim topics As Object, topic As Object, posts As Object
Dim post As Object, links As Object, link As Object
Dim x As Long, i As Long

x = 2

http.Open "GET", URL, False
http.send
html.body.innerHTML = http.ResponseText

Set topics = html.getElementsByClassName("listing-name")
Set topic = html.getElementsByClassName("click-to-call contact contact-preferred contact-phone ")
Set posts = html.getElementsByClassName("listing-address mappable-address mappable-address-with-poi")

For i = 0 To topics.Length - 1
If topics.Length > 0 Or topic.Length > 0 Or posts.Length > 0 Then
Cells(x, 1) = topics(i).innerText
Cells(x, 2) = topic(i).innerText
Cells(x, 3) = posts(i).innerText
x = x + 1
End If
Next i
End Sub

Chihiro · Feb 23, 2017

That's because all of the class that you used is actually wrapped in div class="listing listing-search listing-data" element. Not all class need to be present within and you should not use those to loop through the list.

Like Marc L and I have said before, this is more to do with understanding HTML source code structure rather than VBA skill. I'd recommend studying HTML/DOM structure by reading and testing at w3School.

shahin · Feb 24, 2017

Dear sir Chihiro, is it possible to make any logical statement or something
(if you consider retaining the original code of my second post in this thread ) to store the scraped document in its right position? I meant, as you said earlier that some class elements are missing values that is because phone-number or address of certain names are getting settled in wrong rows especially when it finds a gap or has filled in a gap already. For this reason, name address and phone numbers placed across the rows are not always of the same identity. I suppose there should be any way to fix this. Thanks in advance.

Chihiro · Feb 24, 2017

So you scrape using class="listing listing-search listing-data" for each row.

And then within it, search for each of the class you used (it may be easier to do string manipulation or use getElementsByTagName). If error (i.e. not found) then just leave cell blank.

shahin · Feb 24, 2017

Dear sir Chihiro, in my opinion, using second request can be a remedy of this problem. I did it and found the results as i expected it to be. Your one request is enough to scrape the whole docs accordingly, though! Will there be any issue if i stick to this second request thing to serve my purpose? Your any suggestion on this will be a great help to me. Thanks in advance.

Code:

Sub GetData()
Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
Const weblink = "https://www.yellowpages.com.au"
Dim http As New MSXML2.XMLHTTP60, htm As New HTMLDocument, html As New HTMLDocument
Dim topics As Object, topic As Object, posts As Object
Dim post As Object, links As Object, link As Object
Dim str As String
Dim x As Long, i As Long

x = 2
    http.Open "GET", URL, False
    http.send
    htm.body.innerHTML = http.responseText
   
    Set links = htm.getElementsByClassName("listing-name")

    For Each link In links
    str = weblink & Replace(link.href, "about:", "")
   
    http.Open "GET", str, False
    http.send
    html.body.innerHTML = http.responseText
   
    Set topics = html.getElementsByClassName("listing-name")
    Set topic = html.getElementsByClassName("listing-address mappable-address mappable-address-with-poi")
    Set posts = html.getElementsByClassName("text middle  ")
   
        For i = 0 To topics.Length - 1
            If topics.Length > 0 And topic.Length > 0 And posts.Length > 0 Then
            Cells(x, 1) = topics(i).innerText
            Cells(x, 2) = topic(i).innerText
            Cells(x, 3) = posts(i).getElementsByTagName("div")(0).innerText
            x = x + 1
            End If
        Next i
    Next link
End Sub

Chihiro · Feb 24, 2017

Whichever way you feel comfortable with is the one you should go with.

There are many ways to accomplish same end result and in this instance there isn't clear advantage for one over the other that immediately comes to mind.

shahin · Feb 24, 2017

Thanks sir. I'm very happy to hear that.

shahin · Feb 26, 2017

Dear sir Chihiro, hope you are doing well. Following your instruction I made a parser which is scraping perfectly except for 3 or 4 phone numbers out of all. Am i doing anything wrong? Here is the complete code. Phone number is in column 6. Thanks in advance.

Code:

Sub RealYP()
Const Site = "http://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=San%20Francisco%2C%20CA&page=2"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim i As Long, L As Long
Dim topics As Object

L = 2

http.Open "GET", Site, False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("info")
For i = 0 To topics.Length - 1
    Cells(L, 1) = topics(i).getElementsByTagName("a")(0).innerText
    Cells(L, 2) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(0).innerText
    Cells(L, 3) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(1).innerText
    Cells(L, 4) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(2).innerText
    Cells(L, 5) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(3).innerText
    Cells(L, 6) = topics(i).getElementsByTagName("div")(2).innerText
    L = L + 1
Next i
End Sub

shahin · Feb 26, 2017

Solved it finally.
The line should be:
Cells(L, 6) = topics(i).getElementsByTagName("div")(0).LastChild.innerText

shahin · Feb 28, 2017

For those who like to use response text in combination with split function to parse a site. I've not been able to write code for image, though!

Code:

Sub GettingData()
Const URL = "http://www.yellowpages.com/search?search_terms=coffee%20shops&geo_location_terms=San%20Francisco%2C%20CA&page=2"
Const link = "http://www.yellowpages.com"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim P As Long, N As Long, L As Long, str As Variant

L = 2
http.Open "GET", URL, False
http.send
str = Split(http.responseText, "<div class=""info"">")
N = UBound(str)
For P = 1 To N

    Cells(L, 1) = Split(Split(str(P), "<span itemprop=""name"">")(1), "<")(0)
    Cells(L, 2) = link & Split(Split(str(P), "<a href=""")(1), """")(0)
    Cells(L, 3) = Split(Split(str(P), "itemprop=""streetAddress"" class=""street-address"">")(1), "<")(0)
    Cells(L, 4) = Replace(Split(Split(str(P), "<span itemprop=""addressLocality"" class=""locality"">")(1), "<")(0), ",&nbsp;", "")
    Cells(L, 5) = Split(Split(str(P), "<span itemprop=""addressRegion"">")(1), "<")(0)
    Cells(L, 6) = Split(Split(str(P), "<span itemprop=""postalCode"">")(1), "<")(0)
    Cells(L, 7) = Split(Split(str(P), "itemprop=""telephone"" class=""phones phone primary"">")(1), "<")(0)

    L = L + 1
Next P
End Sub

shahin · Mar 1, 2017

But, in case of table data, It's hard for me to sort out. However, what I've written is able to parse the required data. Not refined, though! It brings "td" and "tr" signs with the result. Any help would be greatly appreciated on this. Here is what I am up to.

Code:

Const URL = "http://mymd.ae/docdetail/4633"
Sub ParsingDocs()
Dim http As Object
Dim P As Long, N As Long, L As Long, str As Variant

L = 2

Set http = CreateObject("Msxml2.ServerXMLHTTP")

http.Open "GET", URL, False
http.send
str = Split(http.responseText, "<table class=""description_table m_bottom_10"">")
N = UBound(str)
For P = 1 To N
    Cells(L, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(0)
    Cells(L, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(1)
    Cells(L + 1, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(2)
    Cells(L + 1, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(3)
    Cells(L + 2, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(4)
    Cells(L + 2, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(5)
    Cells(L + 3, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(6)
    Cells(L + 3, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(7)
Next P
End Sub

Chihiro · Mar 1, 2017

For table data. Just use object to loop through as I showed you in the past instead of using text to split. It's much easier to manage and more readable.

Code:

Sub TableData()

Dim xmlpage As New MSXML2.XMLHTTP60
Dim htmldoc As New MSHTML.HTMLDocument
Dim htmlas As Object, htmla As Object
Dim tRow As Object, tCel As Object
Dim s As Long, x As Long, c As Long

x = 2
c = 1
xmlpage.Open "GET", "https://fantasy.premierleague.com/player-list/", False
xmlpage.send
htmldoc.body.innerHTML = xmlpage.responseText

Set htmlas = htmldoc.getElementsByTagName("table")(0)

For Each tRow In htmlas.Rows
    For Each tCel In tRow.Cells
        Sheet1.Cells(x, c) = tCel.innerText
        c = c + 1
    Next tCel
    c = 1
    x = x + 1
Next tRow


End Sub

shahin · Mar 1, 2017

Thanks sir Chihiro for your kind reply. I was just trying to apply the split function to extract a table data using responsetext so that , if necessary, I can work with this method as well. Btw, there is no doubt that the method you applied here is undoubtedly the best in this case.

shahin · Mar 2, 2017

Dear sir Chihiro, using the class name "listing listing-search listing-data" in yellowpage Australia (Coffee shop category) I have tried to make a scraper which is now able to grab the Name and the Link. However, even after trying a lot I can't write the code for Address and Phone Number of that Name using the class I mentioned first. I hope in your free time you will take a look into this to come up with a solution. I'm pasting the code below I've written so far. Thanks in advance.

Code:

Sub GetData()
Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
Const weblink = "https://www.yellowpages.com.au"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim topics As Object, topic As Object, posts As Object, post As Object
Dim x As Long

x = 2

http.Open "GET", URL, False
http.send
html.body.innerHTML = http.responseText

Set topics = html.getElementsByClassName("listing listing-search listing-data")

For Each topic In topics
    Set posts = topic.getElementsByTagName("div")(0).getElementsByTagName("a")(1)
    Set post = topic.getElementsByTagName("div")(0).getElementsByTagName("a")(1)

        Cells(x, 1) = posts.innerText
        Cells(x, 2) = weblink & Replace(post.href, "about:", "")
    x = x + 1
Next topic
End Sub

shahin · May 17, 2017

Rectified version of fourth demo in this thread.

Code:

Sub GetData()
Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim obj As HTMLHtmlElement

With http
    .Open "GET", URL, False
    .send
    html.body.innerHTML = .responseText
End With
x = 2

For Each obj In html.getElementsByClassName("listing listing-search listing-data")
    With obj.getElementsByClassName("listing-name")
        If .Length Then Cells(x, 1) = .Item(0).innerText
    End With
   
    With obj.getElementsByClassName("click-to-call contact contact-preferred contact-phone")
        If .Length Then Cells(x, 2) = .Item(0).innerText
    End With
   
    With obj.getElementsByClassName("listing-address mappable-address mappable-address-with-poi")
        If .Length Then Cells(x, 3) = .Item(0).innerText
    End With
    x = x + 1
Next obj
Set http = Nothing: Set html = Nothing: Set obj = Nothing
End Sub

stefanoste78 · May 17, 2017

this macro dont work on my pc

Chihiro · May 17, 2017

this macro dont work on my pc

My guess, you don't have references to appropriate libraries in your VBA project. For above code to work, you'll need Microsoft XML v6.0 and Microsoft HTML Object libraries referenced.

stefanoste78 · May 17, 2017

thank you.
I'll do as you wrote.

Help me find "href"

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Active Member

Active Member

Active Member

Active Member

Excel Ninja

Active Member

Active Member

Active Member

Member

Excel Ninja

Member