1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Help me find "href"

Discussion in 'VBA Macros' started by shahin, Feb 22, 2017.

  1. shahin

    shahin Active Member

    Messages:
    433
    Hi there all! Good day. Hope you are doing well. I'm here to seek help to find "href". I have tried to write a code to parse "Title" and "href" of that Title. I can parse Title but in case of "href" I can't write that portion. Any help would be greatly appreciated. Here is the code I've written. Thanks in advance.

    Code (vb):

    Sub TorrentData()
    Const URL = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim P As Long, N As Long, L As Long, str As Variant

    L = 2

     http.Open "GET", URL, False
     http.send
     html.body.innerHTML = http.ResponseText
     str = Split(http.ResponseText, "<span class=""name"">")
     N = UBound(str)
        For P = 1 To N - 1
            Cells(L, 1) = Split(str(P), "<")(0)
            ''Cells(L, 2) = Here should be that code for "href"
           L = L + 1
        Next P
    End Sub

     

    Html element for "href" is between the block I'm pasting.

    Code (vb):

    <a href="/movie/51608/download-iliza-shlesinger-confirmed-kills-2016-1080p-mp4-yify-torrent.html" target="_blank" title="Download Iliza Shlesinger: Confirmed Kills (2016) 1080p"><span class="play" style="display: none;"><span class="name">Iliza Shlesinger: Confirmed Kills (2016) 1080p</span></span><img src="//pic.yify-torrent.org/20170222/51608/iliza-shlesinger-confirmed-kills-2016-1080p-poster.jpg" alt="Iliza Shlesinger: Confirmed Kills (2016) 1080p"></a>
     
  2. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,585
    Just use html.getElementsByTagName("H3")(i).innerText for title.

    Then html.getElementsByTagName("H3")(i).getElementsByTagName("A")(0).href

    So something like.
    Code (vb):
    Sub TorrentData()
    Const Site = "https://www.yify-torrent.org"
    Const SearchP = "/search/1080p/"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim i As Long, L As Long
    Dim objH3 As Object

    L = 2

    http.Open "GET", Site & SearchP, False
    http.send
    html.body.innerHTML = http.ResponseText
    Set objH3 = html.getElementsByTagName("H3")
    For i = 0 To objH3.Length - 1
        Cells(L, 1) = objH3(i).innerText
        Cells(L, 2) = Replace(objH3(i).getElementsByTagName("A")(0).href, "about:", Site)
        L = L + 1
    Next
    End Sub
    shahin likes this.
  3. shahin

    shahin Active Member

    Messages:
    433
    Thanks sir Chihiro, I have already learnt from you the style you applied here. Basically, I always try to follow your style in my code. However, I wish to parse the "href" using the method I started here. Thanks again.
  4. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,585
    You are splitting responseText at "<span class=""name"">" and looping from str(1 to N-1). So you are not able to capture href which occurred before that.
    EDIT: You are actually excluding LBound(str) & UBound(str) using 1 to N-1. Since str is Base 0 array.

    Instead split responseText at "<h3><a href='".
    Also since you are directly working with http.responseText, no need to put it into html.body.innerHTML.

    So code would be something like...
    Code (vb):
    Sub TorrentData()
    Const URL = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New MSXML2.XMLHTTP60
    Dim P As Long, N As Long, L As Long, str As Variant

    L = 2

        http.Open "GET", URL, False
        http.send
        str = Split(http.ResponseText, "<h3><a href='")

        N = UBound(str)

        For P = 1 To N
            Cells(L, 1) = Split(Split(str(P), "title=""")(1), """>")
            Cells(L, 2) = Split(str(P), "'")(0)
            L = L + 1
        Next P
    End Sub
    Last edited: Feb 22, 2017
    Thomas Kuriakose and shahin like this.
  5. shahin

    shahin Active Member

    Messages:
    433
    You are just awesome. This is what i wanted. Actually, I didn't even have any basic idea on this style that is because you found bunch of mistakes in my code. Now I'm gonna study on yours to learn. Thankssssssssssssssssssssssssssssssssssss.:):):):)
  6. shahin

    shahin Active Member

    Messages:
    433
    Dear sir, can you please provide me with any link following which I can learn more on this style to avoid errors. Thanks again.
  7. Marc L

    Marc L Excel Ninja

    Messages:
    3,146
    Hi !

    It is not about any style but just about Logic …

    Like I yet wrote in your previous thread on same subject,
    just train with easy VBA basics text functions as explained
    in VBA inner help and in tutorials on web !

    For those not able to well read a webpage code,
    parsing through object structure like in Chihiro's post #2 is easier,
    just needs at least to understand document' structure …
  8. shahin

    shahin Active Member

    Messages:
    433
    Dear sir Chihiro, you might have noticed that using only the class elements in this three categories, one can get results with some flaws. Because, if any elements among them is missing then the alignments doesn't go parallely and
    becomes a genuine mess.[More specifically-suppose a page should contain: Name count 30, phone count 30, address count 30 but if there are 25 phone counts and 27 address count in that page then the three columns are not parallel with the elements of same item]. Is there any way to get rid of that problem or the usage of three class elements is itself a flaw? If you run the code you will get what i meant. Thanks in advance.

    Code (vb):

    Sub GetData()
    Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
    Const weblink = "https://www.yellowpages.com.au"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim topics As Object, topic As Object, posts As Object
    Dim post As Object, links As Object, link As Object
    Dim x As Long, i As Long

    x = 2

    http.Open "GET", URL, False
    http.send
    html.body.innerHTML = http.ResponseText

    Set topics = html.getElementsByClassName("listing-name")
    Set topic = html.getElementsByClassName("click-to-call contact contact-preferred contact-phone ")
    Set posts = html.getElementsByClassName("listing-address mappable-address mappable-address-with-poi")

    For i = 0 To topics.Length - 1
    If topics.Length > 0 Or topic.Length > 0 Or posts.Length > 0 Then
    Cells(x, 1) = topics(i).innerText
    Cells(x, 2) = topic(i).innerText
    Cells(x, 3) = posts(i).innerText
    x = x + 1
    End If
    Next i
    End Sub

     
  9. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,585
    That's because all of the class that you used is actually wrapped in div class="listing listing-search listing-data" element. Not all class need to be present within and you should not use those to loop through the list.

    Like Marc L and I have said before, this is more to do with understanding HTML source code structure rather than VBA skill. I'd recommend studying HTML/DOM structure by reading and testing at w3School.
    shahin likes this.
  10. shahin

    shahin Active Member

    Messages:
    433
    Dear sir Chihiro, is it possible to make any logical statement or something
    (if you consider retaining the original code of my second post in this thread ) to store the scraped document in its right position? I meant, as you said earlier that some class elements are missing values that is because phone-number or address of certain names are getting settled in wrong rows especially when it finds a gap or has filled in a gap already. For this reason, name address and phone numbers placed across the rows are not always of the same identity. I suppose there should be any way to fix this. Thanks in advance.
  11. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,585
    So you scrape using class="listing listing-search listing-data" for each row.

    And then within it, search for each of the class you used (it may be easier to do string manipulation or use getElementsByTagName). If error (i.e. not found) then just leave cell blank.
    shahin likes this.
  12. shahin

    shahin Active Member

    Messages:
    433
    Dear sir Chihiro, in my opinion, using second request can be a remedy of this problem. I did it and found the results as i expected it to be. Your one request is enough to scrape the whole docs accordingly, though! Will there be any issue if i stick to this second request thing to serve my purpose? Your any suggestion on this will be a great help to me. Thanks in advance.

    Code (vb):

    Sub GetData()
    Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
    Const weblink = "https://www.yellowpages.com.au"
    Dim http As New MSXML2.XMLHTTP60, htm As New HTMLDocument, html As New HTMLDocument
    Dim topics As Object, topic As Object, posts As Object
    Dim post As Object, links As Object, link As Object
    Dim str As String
    Dim x As Long, i As Long

    x = 2
        http.Open "GET", URL, False
        http.send
        htm.body.innerHTML = http.responseText
       
        Set links = htm.getElementsByClassName("listing-name")

        For Each link In links
        str = weblink & Replace(link.href, "about:", "")
       
        http.Open "GET", str, False
        http.send
        html.body.innerHTML = http.responseText
       
        Set topics = html.getElementsByClassName("listing-name")
        Set topic = html.getElementsByClassName("listing-address mappable-address mappable-address-with-poi")
        Set posts = html.getElementsByClassName("text middle  ")
       
            For i = 0 To topics.Length - 1
                If topics.Length > 0 And topic.Length > 0 And posts.Length > 0 Then
                Cells(x, 1) = topics(i).innerText
                Cells(x, 2) = topic(i).innerText
                Cells(x, 3) = posts(i).getElementsByTagName("div")(0).innerText
                x = x + 1
                End If
            Next i
        Next link
    End Sub
     
  13. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,585
    Whichever way you feel comfortable with is the one you should go with.

    There are many ways to accomplish same end result and in this instance there isn't clear advantage for one over the other that immediately comes to mind.
    shahin likes this.
  14. shahin

    shahin Active Member

    Messages:
    433
    Thanks sir. I'm very happy to hear that.:)
  15. shahin

    shahin Active Member

    Messages:
    433
    Dear sir Chihiro, hope you are doing well. Following your instruction I made a parser which is scraping perfectly except for 3 or 4 phone numbers out of all. Am i doing anything wrong? Here is the complete code. Phone number is in column 6. Thanks in advance.

    Code (vb):

    Sub RealYP()
    Const Site = "http://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=San%20Francisco%2C%20CA&page=2"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim i As Long, L As Long
    Dim topics As Object

    L = 2

    http.Open "GET", Site, False
    http.send
    html.body.innerHTML = http.responseText
    Set topics = html.getElementsByClassName("info")
    For i = 0 To topics.Length - 1
        Cells(L, 1) = topics(i).getElementsByTagName("a")(0).innerText
        Cells(L, 2) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(0).innerText
        Cells(L, 3) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(1).innerText
        Cells(L, 4) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(2).innerText
        Cells(L, 5) = topics(i).getElementsByTagName("p")(0).getElementsByTagName("span")(3).innerText
        Cells(L, 6) = topics(i).getElementsByTagName("div")(2).innerText
        L = L + 1
    Next i
    End Sub
     
  16. shahin

    shahin Active Member

    Messages:
    433
    Solved it finally.
    The line should be:
    Cells(L, 6) = topics(i).getElementsByTagName("div")(0).LastChild.innerText
  17. shahin

    shahin Active Member

    Messages:
    433
    For those who like to use response text in combination with split function to parse a site. I've not been able to write code for image, though!

    Code (vb):

    Sub GettingData()
    Const URL = "http://www.yellowpages.com/search?search_terms=coffee%20shops&geo_location_terms=San%20Francisco%2C%20CA&page=2"
    Const link = "http://www.yellowpages.com"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim P As Long, N As Long, L As Long, str As Variant

    L = 2
    http.Open "GET", URL, False
    http.send
    str = Split(http.responseText, "<div class=""info"">")
    N = UBound(str)
    For P = 1 To N

        Cells(L, 1) = Split(Split(str(P), "<span itemprop=""name"">")(1), "<")(0)
        Cells(L, 2) = link & Split(Split(str(P), "<a href=""")(1), """")(0)
        Cells(L, 3) = Split(Split(str(P), "itemprop=""streetAddress"" class=""street-address"">")(1), "<")(0)
        Cells(L, 4) = Replace(Split(Split(str(P), "<span itemprop=""addressLocality"" class=""locality"">")(1), "<")(0), ",&nbsp;", "")
        Cells(L, 5) = Split(Split(str(P), "<span itemprop=""addressRegion"">")(1), "<")(0)
        Cells(L, 6) = Split(Split(str(P), "<span itemprop=""postalCode"">")(1), "<")(0)
        Cells(L, 7) = Split(Split(str(P), "itemprop=""telephone"" class=""phones phone primary"">")(1), "<")(0)

        L = L + 1
    Next P
    End Sub
     
  18. shahin

    shahin Active Member

    Messages:
    433
    But, in case of table data, It's hard for me to sort out. However, what I've written is able to parse the required data. Not refined, though! It brings "td" and "tr" signs with the result. Any help would be greatly appreciated on this. Here is what I am up to.

    Code (vb):

    Const URL = "http://mymd.ae/docdetail/4633"
    Sub ParsingDocs()
    Dim http As Object
    Dim P As Long, N As Long, L As Long, str As Variant

    L = 2

    Set http = CreateObject("Msxml2.ServerXMLHTTP")

    http.Open "GET", URL, False
    http.send
    str = Split(http.responseText, "<table class=""description_table m_bottom_10"">")
    N = UBound(str)
    For P = 1 To N
        Cells(L, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(0)
        Cells(L, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(1)
        Cells(L + 1, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(2)
        Cells(L + 1, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(3)
        Cells(L + 2, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(4)
        Cells(L + 2, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(5)
        Cells(L + 3, 1) = Split(Split(str(P), "<td>""")(0), "</td>")(6)
        Cells(L + 3, 2) = Split(Split(str(P), "<td>""")(0), "</td>")(7)
    Next P
    End Sub
     
  19. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,585
    For table data. Just use object to loop through as I showed you in the past instead of using text to split. It's much easier to manage and more readable.

    Code (vb):
    Sub TableData()

    Dim xmlpage As New MSXML2.XMLHTTP60
    Dim htmldoc As New MSHTML.HTMLDocument
    Dim htmlas As Object, htmla As Object
    Dim tRow As Object, tCel As Object
    Dim s As Long, x As Long, c As Long

    x = 2
    c = 1
    xmlpage.Open "GET", "https://fantasy.premierleague.com/player-list/", False
    xmlpage.send
    htmldoc.body.innerHTML = xmlpage.responseText

    Set htmlas = htmldoc.getElementsByTagName("table")(0)

    For Each tRow In htmlas.Rows
        For Each tCel In tRow.Cells
            Sheet1.Cells(x, c) = tCel.innerText
            c = c + 1
        Next tCel
        c = 1
        x = x + 1
    Next tRow


    End Sub
    shahin likes this.
  20. shahin

    shahin Active Member

    Messages:
    433
    Thanks sir Chihiro for your kind reply. I was just trying to apply the split function to extract a table data using responsetext so that , if necessary, I can work with this method as well. Btw, there is no doubt that the method you applied here is undoubtedly the best in this case.
  21. shahin

    shahin Active Member

    Messages:
    433
    Dear sir Chihiro, using the class name "listing listing-search listing-data" in yellowpage Australia (Coffee shop category) I have tried to make a scraper which is now able to grab the Name and the Link. However, even after trying a lot I can't write the code for Address and Phone Number of that Name using the class I mentioned first. I hope in your free time you will take a look into this to come up with a solution. I'm pasting the code below I've written so far. Thanks in advance.

    Code (vb):

    Sub GetData()
    Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
    Const weblink = "https://www.yellowpages.com.au"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim topics As Object, topic As Object, posts As Object, post As Object
    Dim x As Long

    x = 2

    http.Open "GET", URL, False
    http.send
    html.body.innerHTML = http.responseText

    Set topics = html.getElementsByClassName("listing listing-search listing-data")

    For Each topic In topics
        Set posts = topic.getElementsByTagName("div")(0).getElementsByTagName("a")(1)
        Set post = topic.getElementsByTagName("div")(0).getElementsByTagName("a")(1)

            Cells(x, 1) = posts.innerText
            Cells(x, 2) = weblink & Replace(post.href, "about:", "")
        x = x + 1
    Next topic
    End Sub

     
    Last edited: Mar 2, 2017
  22. shahin

    shahin Active Member

    Messages:
    433
    Rectified version of fourth demo in this thread.
    Code (vb):

    Sub GetData()
    Const URL = "https://www.yellowpages.com.au/search/listings?clue=coffee+shops&locationClue=all+states&lat=&lon=&selectedViewMode=list"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim obj As HTMLHtmlElement

    With http
        .Open "GET", URL, False
        .send
        html.body.innerHTML = .responseText
    End With
    x = 2

    For Each obj In html.getElementsByClassName("listing listing-search listing-data")
        With obj.getElementsByClassName("listing-name")
            If .Length Then Cells(x, 1) = .Item(0).innerText
        End With
       
        With obj.getElementsByClassName("click-to-call contact contact-preferred contact-phone")
            If .Length Then Cells(x, 2) = .Item(0).innerText
        End With
       
        With obj.getElementsByClassName("listing-address mappable-address mappable-address-with-poi")
            If .Length Then Cells(x, 3) = .Item(0).innerText
        End With
        x = x + 1
    Next obj
    Set http = Nothing: Set html = Nothing: Set obj = Nothing
    End Sub
     
  23. stefanoste78

    stefanoste78 Member

    Messages:
    69
    this macro dont work on my pc
  24. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,585
    My guess, you don't have references to appropriate libraries in your VBA project. For above code to work, you'll need Microsoft XML v6.0 and Microsoft HTML Object libraries referenced.
  25. stefanoste78

    stefanoste78 Member

    Messages:
    69
    thank you.
    I'll do as you wrote.

Share This Page