1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Can't scrape title from webpages

Discussion in 'VBA Macros' started by shahin, Jun 2, 2017.

  1. shahin

    shahin Active Member

    Messages:
    428
    Writing a script when I run it I get error. I can't figure out the mistakes I'm doing. Any help would be appreciated. I wanted to get the title of any random sites I would like to input in my excel file. Thanks in advance. Here is what I've tried so far.

    Code (vb):

    Sub Test_title()
    Dim req As New XMLHTTP60, html As New HTMLDocument
    Dim topic As Object, post As Object
    Dim cel As Range, site As String

    For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
        If cel.Value <> "" Then
            site = "http://" & cel.Value
            With req
                .Open "GET", site, False
                .send
                html.body.innerHTML = .responseText
            End With
       
            Set topic = html.getElementsByTagName("title")
            For Each post In topic
                x = x + 1
                Cells(x, 2) = post.innerText
            Next post
        End If
    Next cel
    End Sub
     
    Websites I tried with:
    Code (vb):

    www.dailynews.com
    www.dailynews.co.zw
    www.gulf-daily-news.com
    www.dailynews.gov.bw
     
    Last edited: Jun 2, 2017
  2. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,577
    Well... at least for "www.dailynews.com" title is in <Head> and not in <Body>.

    Instead of loading it to html.body.innerHTML. Just parse the .responseText

    EDIT: There is html.write method to put <Head> section in the document. But that method can't be directly used via VBA. You will get compile error.
    Last edited: Jun 2, 2017
  3. shahin

    shahin Active Member

    Messages:
    428
    Thanks sir, for your reply. It is agonizing that a simple requirement sometimes i find it difficult to satisfy with vba in a simple manner whereas with only seven lines of code it can very easily be achieved in python.
  4. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,577
    Well, if you have python. Why not just write to text file using python and read result in Excel?

    Though I'm not sure what your use case is.
  5. shahin

    shahin Active Member

    Messages:
    428
    Thanks again sir, for your reply. I always try to do the same with both the languages for learning purpose.
  6. shahin

    shahin Active Member

    Messages:
    428
    I think I've found something to get around. I have tried with several urls and it works perfectly. Before putting any url in the scraper I have made sure putting that in internet explorer first to see what the url looks like. Here is what I have tried with. I can't make the whole process using xmlhttp request. I hope someone will take a look into this. Thanks in advance.

    Code (vb):

    Sub Title_scraping()
    Dim IE As Object
    Dim doc As Object

    Set IE = CreateObject("InternetExplorer.Application")

    IE.navigate "http://www.dailynews.com/"

    While IE.Busy
        DoEvents
    Wend

    Set doc = IE.document
    Cells(1, 2) = doc.title

    End Sub
     
    The urls I've tried with:

    Code (vb):

    http://www.dailynews.com/
    http://www.dailynews.co.zw/
    http://www.gdnonline.com/
    http://www.dailynews.gov.bw/
    http://chandoo.org/forum/forums/vba-macros/
     
    Last edited: Jun 3, 2017
  7. Marc L

    Marc L Excel Ninja

    Messages:
    3,141
    Hi !

    As far easy just taking a glance at webpage html code !
    Search within it the title and use VBA text functions,
    just needing around ten codelines and only one variable
    shahin likes this.
  8. shahin

    shahin Active Member

    Messages:
    428
    Thanks marc L, for your response. Sounds like there is an easy way. I can't really get any idea beyond this.
  9. Marc L

    Marc L Excel Ninja

    Messages:
    3,141

    Did you at least see any webpage code of your sample links ?
    As all is there ! As a webpage code is nothing but text,
    as always using VBA text functions to extract desired text.

    Around ten codelines with only one variable …
  10. shahin

    shahin Active Member

    Messages:
    428
    I'm trying to figure out what you just suggested. Meanwhile, I've made a loop to get all the titles from those links if the links are placed in range A1 to Range A5. The populated results will be displayed in the adjacent cells.
    Code (vb):

    Sub Title_scraping()
    Dim IE As Object
    Dim doc As Object, cel As Range

    For Each cel In Range("A1:A5")
        Set IE = CreateObject("InternetExplorer.Application")
        IE.navigate cel.Value
       
        While IE.Busy
            DoEvents
        Wend
       
        Set doc = IE.document
        x = x + 1
        Cells(x, 2) = doc.title
    Next cel
    End Sub
     
  11. shahin

    shahin Active Member

    Messages:
    428
    Hi Marc L!! Thanks again for showing me how I could do this. I'm almost there. Now, it gives me all the Title tags from each site but I need only the first one from populated results.
    Code (vb):

    Sub Title_Data_from_sites()
    Dim http As New MSXML2.XMLHTTP60
    Dim str As Variant

    For Each cel In Range("A1:A3")
        With http
            .Open "GET", cel.Value, False
            .send
            str = Split(.responseText, " title=")
        End With
     
        x = UBound(str)
     
        For V = 1 To x
            L = L + 1
            Cells(L, 2) = Split(Split(str(V), """")(1), """")(0)
        Next V
    Next cel
    End Sub
     
    Sites I've tried with:

    Code (vb):

    https://stackoverflow.com/documentation/
    https://codereview.stackexchange.com/
    https://yts.ag/browse-movies
     
    Last edited: Jun 3, 2017
  12. Marc L

    Marc L Excel Ninja

    Messages:
    3,141
    So you misread 'cause there is only one title tag …

    • Open a webpage via a webbrowser.
    • Write on a paper exactly its title.
    • Via the webbrowser, activate the command to see the webpage html code
    and search exactly same title as written so you will find the unique tag
    whatever the webpage as it is the official web document structure !

    Doable with one variable only …
    As again you still forgot to release object variable so do not use any !
    shahin likes this.
  13. shahin

    shahin Active Member

    Messages:
    428
    Thanks Marc L, for the proper guidance. I've fixed it already. Here is the code:
    Code (vb):

    Sub Title_Data()
    Dim http As New MSXML2.XMLHTTP60
    Dim str As Variant

    For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
        With http
            .Open "GET", cel.Value, False
            .send
            str = Split(.responseText, "<title>")
        End With
       
        x = UBound(str)
       
        For V = 1 To x
            L = L + 1
            Cells(L, 2) = Split(str(V), "</title>")(0)
        Next V
    Next cel
    Set http = Nothing
    End Sub
     
  14. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,577
    You can combine the string manipulation into single line. Since you know there is only one <title>...</title> string. Splitting string by <title>, 2nd array element (index 1) is always the part you want.

    Code (vb):
    Split(Split(.responseText, "<title>")(1),"</title">)(0)
    shahin likes this.
  15. shahin

    shahin Active Member

    Messages:
    428
    Oh my god!!! Now it is even shorter and works like magic. Thanks sir. Btw, isn't it how the code should look like?
    Code (vb):

    Sub Title_Data()
    Dim http As New MSXML2.XMLHTTP60

    For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
        With http
            .Open "GET", cel.Value, False
            .send
       
            L = L + 1
            Cells(L, 2) = Split(Split(.responseText, "<title>")(1), "</title>")(0)
        End With
    Next cel
    Set http = Nothing
    End Sub
     
  16. stefanoste78

    stefanoste78 Member

    Messages:
    69
    Hello.
    I know that you often make web-related jobs. This is what I'm interested in ...
    Can you tell me why I never go?
    I attach the photo to the issue with the title.
    Would not it be better to insert the file with the macro? Mistakes will be avoided ...
    Thank you

    Attached Files:

  17. shahin

    shahin Active Member

    Messages:
    428
    Hey stefanoste78, I've uploaded a file with the macro. In column A input any website you like to get the title of that in column B. Most of the time it works. If you find it useful, don't forget to give it a thumbs up.

    Attached Files:

    stefanoste78 likes this.
  18. stefanoste78

    stefanoste78 Member

    Messages:
    69
    compliments
  19. Marc L

    Marc L Excel Ninja

    Messages:
    3,141
    Code (vb):
    Sub Demo1()
        With New XMLHTTP60
                On Error Resume Next
            For R& = 1 To Cells(Rows.Count, 1).End(xlUp).Row
                .Open "GET", Cells(R, 1).Value, False
                .send
                If .Status = 200 Then Cells(R, 2).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
            Next
        End With
    End Sub
    shahin likes this.
  20. stefanoste78

    stefanoste78 Member

    Messages:
    69
    Hi mark.

    What does it change from your code to that of shinin?
    Last edited by a moderator: Jun 6, 2017
  21. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,577
    Instead of looping through each cell object and then using another variable for row index, Marc's code uses row index alone.

    Also, by using "With New XMLHTTP60", object is automatically set to "Nothing" at end of procedure without need to set it to "Nothing".

    You can also write it using cell object only (without row index) like...
    Code (vb):
    Sub Demo2()
        Dim cel As Range
        With New XMLHTTP60
                On Error Resume Next
            For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
                .Open "GET", cel.Value, False
                .send
                If .Status = 200 Then cel.Offset(, 1).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
            Next
        End With
    End Sub
    shahin likes this.
  22. shahin

    shahin Active Member

    Messages:
    428
    Hi Marc L, is it possible to make an "If" statement so that it will cover both <title> and <TITLE> as few websites use TITLE in uppercase.
  23. stefanoste78

    stefanoste78 Member

    Messages:
    69

    Thanks for the explanation.
    Now can you tell me what changes from your code to mark?
  24. shahin

    shahin Active Member

    Messages:
    428
    One more thing to know about this code: you once suggested me not to use "on error resume next" in my code but this time you used this here. There might be any strong reason behind it. I would like to know. Thanks.
  25. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,577
    @stefanoste78

    Like I wrote, it's just a demo of using different variable. His code uses row index as variable, mine uses cell (range) object as variable.

    @shahin

    You can add argument to split function, "vbTextCompare" or 1. By default it's using binary compare method.
    See link for detail on split function.
    https://msdn.microsoft.com/en-us/library/6x627e5f(v=vs.90).aspx

    As for "On Error Resume Next". It's used to skip over blank cells or when site isn't there. Other than that, there really isn't much that will cause error and isn't hard to debug / fix if needed.

    In more complex code, you will want to pinpoint the cause of error, and thus trap it using methods other than "Resume Next".
    shahin likes this.

Share This Page