• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Can't scrape title from webpages

shahin

Active Member
Writing a script when I run it I get error. I can't figure out the mistakes I'm doing. Any help would be appreciated. I wanted to get the title of any random sites I would like to input in my excel file. Thanks in advance. Here is what I've tried so far.

Code:
Sub Test_title()
Dim req As New XMLHTTP60, html As New HTMLDocument
Dim topic As Object, post As Object
Dim cel As Range, site As String

For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
    If cel.Value <> "" Then
        site = "http://" & cel.Value
        With req
            .Open "GET", site, False
            .send
            html.body.innerHTML = .responseText
        End With
   
        Set topic = html.getElementsByTagName("title")
        For Each post In topic
            x = x + 1
            Cells(x, 2) = post.innerText
        Next post
    End If
Next cel
End Sub

Websites I tried with:
Code:
www.dailynews.com
www.dailynews.co.zw
www.gulf-daily-news.com
www.dailynews.gov.bw
 
Last edited:
Well... at least for "www.dailynews.com" title is in <Head> and not in <Body>.

Instead of loading it to html.body.innerHTML. Just parse the .responseText

EDIT: There is html.write method to put <Head> section in the document. But that method can't be directly used via VBA. You will get compile error.
 
Last edited:
Thanks sir, for your reply. It is agonizing that a simple requirement sometimes i find it difficult to satisfy with vba in a simple manner whereas with only seven lines of code it can very easily be achieved in python.
 
Well, if you have python. Why not just write to text file using python and read result in Excel?

Though I'm not sure what your use case is.
 
Thanks again sir, for your reply. I always try to do the same with both the languages for learning purpose.
 
I think I've found something to get around. I have tried with several urls and it works perfectly. Before putting any url in the scraper I have made sure putting that in internet explorer first to see what the url looks like. Here is what I have tried with. I can't make the whole process using xmlhttp request. I hope someone will take a look into this. Thanks in advance.

Code:
Sub Title_scraping()
Dim IE As Object
Dim doc As Object

Set IE = CreateObject("InternetExplorer.Application")

IE.navigate "http://www.dailynews.com/"

While IE.Busy
    DoEvents
Wend

Set doc = IE.document
Cells(1, 2) = doc.title

End Sub

The urls I've tried with:

Code:
http://www.dailynews.com/
http://www.dailynews.co.zw/
http://www.gdnonline.com/
http://www.dailynews.gov.bw/
http://chandoo.org/forum/forums/vba-macros/
 
Last edited:
Hi !

As far easy just taking a glance at webpage html code !
Search within it the title and use VBA text functions,
just needing around ten codelines and only one variable
 
Thanks marc L, for your response. Sounds like there is an easy way. I can't really get any idea beyond this.
 

Did you at least see any webpage code of your sample links ?
As all is there ! As a webpage code is nothing but text,
as always using VBA text functions to extract desired text.

Around ten codelines with only one variable …
 
I'm trying to figure out what you just suggested. Meanwhile, I've made a loop to get all the titles from those links if the links are placed in range A1 to Range A5. The populated results will be displayed in the adjacent cells.
Code:
Sub Title_scraping()
Dim IE As Object
Dim doc As Object, cel As Range

For Each cel In Range("A1:A5")
    Set IE = CreateObject("InternetExplorer.Application")
    IE.navigate cel.Value
   
    While IE.Busy
        DoEvents
    Wend
   
    Set doc = IE.document
    x = x + 1
    Cells(x, 2) = doc.title
Next cel
End Sub
 
Hi Marc L!! Thanks again for showing me how I could do this. I'm almost there. Now, it gives me all the Title tags from each site but I need only the first one from populated results.
Code:
Sub Title_Data_from_sites()
Dim http As New MSXML2.XMLHTTP60
Dim str As Variant

For Each cel In Range("A1:A3")
    With http
        .Open "GET", cel.Value, False
        .send
        str = Split(.responseText, " title=")
    End With
  
    x = UBound(str)
  
    For V = 1 To x
        L = L + 1
        Cells(L, 2) = Split(Split(str(V), """")(1), """")(0)
    Next V
Next cel
End Sub

Sites I've tried with:

Code:
https://stackoverflow.com/documentation/
https://codereview.stackexchange.com/
https://yts.ag/browse-movies
 
Last edited:
So you misread 'cause there is only one title tag …

• Open a webpage via a webbrowser.
• Write on a paper exactly its title.
• Via the webbrowser, activate the command to see the webpage html code
and search exactly same title as written so you will find the unique tag
whatever the webpage as it is the official web document structure !

Doable with one variable only …
As again you still forgot to release object variable so do not use any !
 
Thanks Marc L, for the proper guidance. I've fixed it already. Here is the code:
Code:
Sub Title_Data()
Dim http As New MSXML2.XMLHTTP60
Dim str As Variant

For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
    With http
        .Open "GET", cel.Value, False
        .send
        str = Split(.responseText, "<title>")
    End With
   
    x = UBound(str)
   
    For V = 1 To x
        L = L + 1
        Cells(L, 2) = Split(str(V), "</title>")(0)
    Next V
Next cel
Set http = Nothing
End Sub
 
You can combine the string manipulation into single line. Since you know there is only one <title>...</title> string. Splitting string by <title>, 2nd array element (index 1) is always the part you want.

Code:
Split(Split(.responseText, "<title>")(1),"</title">)(0)
 
Oh my god!!! Now it is even shorter and works like magic. Thanks sir. Btw, isn't it how the code should look like?
Code:
Sub Title_Data()
Dim http As New MSXML2.XMLHTTP60

For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
    With http
        .Open "GET", cel.Value, False
        .send
   
        L = L + 1
        Cells(L, 2) = Split(Split(.responseText, "<title>")(1), "</title>")(0)
    End With
Next cel
Set http = Nothing
End Sub
 
Hello.
I know that you often make web-related jobs. This is what I'm interested in ...
Can you tell me why I never go?
I attach the photo to the issue with the title.
Would not it be better to insert the file with the macro? Mistakes will be avoided ...
Thank you
 

Attachments

  • error.JPG
    error.JPG
    50.6 KB · Views: 6
Hey stefanoste78, I've uploaded a file with the macro. In column A input any website you like to get the title of that in column B. Most of the time it works. If you find it useful, don't forget to give it a thumbs up.
 

Attachments

  • Scraping_title.xlsm
    17.4 KB · Views: 6
Doable with one variable only …
Code:
Sub Demo1()
    With New XMLHTTP60
            On Error Resume Next
        For R& = 1 To Cells(Rows.Count, 1).End(xlUp).Row
            .Open "GET", Cells(R, 1).Value, False
            .send
            If .Status = 200 Then Cells(R, 2).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
        Next
    End With
End Sub
 
Instead of looping through each cell object and then using another variable for row index, Marc's code uses row index alone.

Also, by using "With New XMLHTTP60", object is automatically set to "Nothing" at end of procedure without need to set it to "Nothing".

You can also write it using cell object only (without row index) like...
Code:
Sub Demo2()
    Dim cel As Range
    With New XMLHTTP60
            On Error Resume Next
        For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
            .Open "GET", cel.Value, False
            .send
            If .Status = 200 Then cel.Offset(, 1).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
        Next
    End With
End Sub
 
Hi Marc L, is it possible to make an "If" statement so that it will cover both <title> and <TITLE> as few websites use TITLE in uppercase.
 
Instead of looping through each cell object and then using another variable for row index, Marc's code uses row index alone.

Also, by using "With New XMLHTTP60", object is automatically set to "Nothing" at end of procedure without need to set it to "Nothing".

You can also write it using cell object only (without row index) like...
Code:
Sub Demo2()
    Dim cel As Range
    With New XMLHTTP60
            On Error Resume Next
        For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
            .Open "GET", cel.Value, False
            .send
            If .Status = 200 Then cel.Offset(, 1).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
        Next
    End With
End Sub


Thanks for the explanation.
Now can you tell me what changes from your code to mark?
 
One more thing to know about this code: you once suggested me not to use "on error resume next" in my code but this time you used this here. There might be any strong reason behind it. I would like to know. Thanks.
 
@stefanoste78

Like I wrote, it's just a demo of using different variable. His code uses row index as variable, mine uses cell (range) object as variable.

@shahin

You can add argument to split function, "vbTextCompare" or 1. By default it's using binary compare method.
See link for detail on split function.
https://msdn.microsoft.com/en-us/library/6x627e5f(v=vs.90).aspx

As for "On Error Resume Next". It's used to skip over blank cells or when site isn't there. Other than that, there really isn't much that will cause error and isn't hard to debug / fix if needed.

In more complex code, you will want to pinpoint the cause of error, and thus trap it using methods other than "Resume Next".
 
Back
Top