Can't scrape title from webpages

shahin · Jun 2, 2017

Writing a script when I run it I get error. I can't figure out the mistakes I'm doing. Any help would be appreciated. I wanted to get the title of any random sites I would like to input in my excel file. Thanks in advance. Here is what I've tried so far.

Code:

Sub Test_title()
Dim req As New XMLHTTP60, html As New HTMLDocument
Dim topic As Object, post As Object
Dim cel As Range, site As String

For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
    If cel.Value <> "" Then
        site = "http://" & cel.Value
        With req
            .Open "GET", site, False
            .send
            html.body.innerHTML = .responseText
        End With
   
        Set topic = html.getElementsByTagName("title")
        For Each post In topic
            x = x + 1
            Cells(x, 2) = post.innerText
        Next post
    End If
Next cel
End Sub

Websites I tried with:

Code:

www.dailynews.com
www.dailynews.co.zw
www.gulf-daily-news.com
www.dailynews.gov.bw

Chihiro · Jun 2, 2017

Well... at least for "www.dailynews.com" title is in <Head> and not in <Body>.

Instead of loading it to html.body.innerHTML. Just parse the .responseText

EDIT: There is html.write method to put <Head> section in the document. But that method can't be directly used via VBA. You will get compile error.

shahin · Jun 2, 2017

Thanks sir, for your reply. It is agonizing that a simple requirement sometimes i find it difficult to satisfy with vba in a simple manner whereas with only seven lines of code it can very easily be achieved in python.

Chihiro · Jun 2, 2017

Well, if you have python. Why not just write to text file using python and read result in Excel?

Though I'm not sure what your use case is.

shahin · Jun 2, 2017

Thanks again sir, for your reply. I always try to do the same with both the languages for learning purpose.

shahin · Jun 3, 2017

I think I've found something to get around. I have tried with several urls and it works perfectly. Before putting any url in the scraper I have made sure putting that in internet explorer first to see what the url looks like. Here is what I have tried with. I can't make the whole process using xmlhttp request. I hope someone will take a look into this. Thanks in advance.

Code:

Sub Title_scraping()
Dim IE As Object
Dim doc As Object

Set IE = CreateObject("InternetExplorer.Application")

IE.navigate "http://www.dailynews.com/"

While IE.Busy
    DoEvents
Wend

Set doc = IE.document
Cells(1, 2) = doc.title

End Sub

The urls I've tried with:

Code:

http://www.dailynews.com/
http://www.dailynews.co.zw/
http://www.gdnonline.com/
http://www.dailynews.gov.bw/
http://chandoo.org/forum/forums/vba-macros/

Marc L · Jun 3, 2017

Hi !

As far easy just taking a glance at webpage html code !
Search within it the title and use VBA text functions,
just needing around ten codelines and only one variable …

shahin · Jun 3, 2017

Thanks marc L, for your response. Sounds like there is an easy way. I can't really get any idea beyond this.

Marc L · Jun 3, 2017

Did you at least see any webpage code of your sample links ?
As all is there ! As a webpage code is nothing but text,
as always using VBA text functions to extract desired text.

Around ten codelines with only one variable …

shahin · Jun 3, 2017

I'm trying to figure out what you just suggested. Meanwhile, I've made a loop to get all the titles from those links if the links are placed in range A1 to Range A5. The populated results will be displayed in the adjacent cells.

Code:

Sub Title_scraping()
Dim IE As Object
Dim doc As Object, cel As Range

For Each cel In Range("A1:A5")
    Set IE = CreateObject("InternetExplorer.Application")
    IE.navigate cel.Value
   
    While IE.Busy
        DoEvents
    Wend
   
    Set doc = IE.document
    x = x + 1
    Cells(x, 2) = doc.title
Next cel
End Sub

shahin · Jun 3, 2017

Hi Marc L!! Thanks again for showing me how I could do this. I'm almost there. Now, it gives me all the Title tags from each site but I need only the first one from populated results.

Code:

Sub Title_Data_from_sites()
Dim http As New MSXML2.XMLHTTP60
Dim str As Variant

For Each cel In Range("A1:A3")
    With http
        .Open "GET", cel.Value, False
        .send
        str = Split(.responseText, " title=")
    End With
  
    x = UBound(str)
  
    For V = 1 To x
        L = L + 1
        Cells(L, 2) = Split(Split(str(V), """")(1), """")(0)
    Next V
Next cel
End Sub

Sites I've tried with:

Code:

https://stackoverflow.com/documentation/
https://codereview.stackexchange.com/
https://yts.ag/browse-movies

Marc L · Jun 3, 2017

So you misread 'cause there is only one title tag …

• Open a webpage via a webbrowser.
• Write on a paper exactly its title.
• Via the webbrowser, activate the command to see the webpage html code
and search exactly same title as written so you will find the unique tag
whatever the webpage as it is the official web document structure !

Doable with one variable only …
As again you still forgot to release object variable so do not use any !

shahin · Jun 3, 2017

Thanks Marc L, for the proper guidance. I've fixed it already. Here is the code:

Code:

Sub Title_Data()
Dim http As New MSXML2.XMLHTTP60
Dim str As Variant

For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
    With http
        .Open "GET", cel.Value, False
        .send
        str = Split(.responseText, "<title>")
    End With
   
    x = UBound(str)
   
    For V = 1 To x
        L = L + 1
        Cells(L, 2) = Split(str(V), "</title>")(0)
    Next V
Next cel
Set http = Nothing
End Sub

Chihiro · Jun 3, 2017

You can combine the string manipulation into single line. Since you know there is only one <title>...</title> string. Splitting string by <title>, 2nd array element (index 1) is always the part you want.

Code:

Split(Split(.responseText, "<title>")(1),"</title">)(0)

shahin · Jun 3, 2017

Oh my god!!! Now it is even shorter and works like magic. Thanks sir. Btw, isn't it how the code should look like?

Code:

Sub Title_Data()
Dim http As New MSXML2.XMLHTTP60

For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).row)
    With http
        .Open "GET", cel.Value, False
        .send
   
        L = L + 1
        Cells(L, 2) = Split(Split(.responseText, "<title>")(1), "</title>")(0)
    End With
Next cel
Set http = Nothing
End Sub

stefanoste78 · Jun 5, 2017

Hello.
I know that you often make web-related jobs. This is what I'm interested in ...
Can you tell me why I never go?
I attach the photo to the issue with the title.
Would not it be better to insert the file with the macro? Mistakes will be avoided ...
Thank you

shahin · Jun 5, 2017

Hey stefanoste78, I've uploaded a file with the macro. In column A input any website you like to get the title of that in column B. Most of the time it works. If you find it useful, don't forget to give it a thumbs up.

stefanoste78 · Jun 5, 2017

shahin said:
Hey stefanoste78, I've uploaded a file with the macro. In column A input any website you like to get the title of that in column B. Most of the time it works. If you find it useful, don't forget to give it a thumbs up.

compliments

Marc L · Jun 6, 2017

Marc L said:
Doable with one variable only …

Code:

Sub Demo1()
    With New XMLHTTP60
            On Error Resume Next
        For R& = 1 To Cells(Rows.Count, 1).End(xlUp).Row
            .Open "GET", Cells(R, 1).Value, False
            .send
            If .Status = 200 Then Cells(R, 2).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
        Next
    End With
End Sub

stefanoste78 · Jun 6, 2017

Hi mark.

What does it change from your code to that of shinin?

Chihiro · Jun 6, 2017

Instead of looping through each cell object and then using another variable for row index, Marc's code uses row index alone.

Also, by using "With New XMLHTTP60", object is automatically set to "Nothing" at end of procedure without need to set it to "Nothing".

You can also write it using cell object only (without row index) like...

Code:

Sub Demo2()
    Dim cel As Range
    With New XMLHTTP60
            On Error Resume Next
        For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
            .Open "GET", cel.Value, False
            .send
            If .Status = 200 Then cel.Offset(, 1).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
        Next
    End With
End Sub

shahin · Jun 6, 2017

Hi Marc L, is it possible to make an "If" statement so that it will cover both <title> and <TITLE> as few websites use TITLE in uppercase.

stefanoste78 · Jun 6, 2017

Chihiro said:
Instead of looping through each cell object and then using another variable for row index, Marc's code uses row index alone.

Also, by using "With New XMLHTTP60", object is automatically set to "Nothing" at end of procedure without need to set it to "Nothing".

You can also write it using cell object only (without row index) like...

Code:

Sub Demo2() Dim cel As Range With New XMLHTTP60 On Error Resume Next For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row) .Open "GET", cel.Value, False .send If .Status = 200 Then cel.Offset(, 1).Value = Split(Split(.responseText, "<title>")(1), "</")(0) Next End With End Sub

Thanks for the explanation.
Now can you tell me what changes from your code to mark?

shahin · Jun 6, 2017

One more thing to know about this code: you once suggested me not to use "on error resume next" in my code but this time you used this here. There might be any strong reason behind it. I would like to know. Thanks.

Chihiro · Jun 6, 2017

@stefanoste78

Like I wrote, it's just a demo of using different variable. His code uses row index as variable, mine uses cell (range) object as variable.

@shahin

You can add argument to split function, "vbTextCompare" or 1. By default it's using binary compare method.
See link for detail on split function.
https://msdn.microsoft.com/en-us/library/6x627e5f(v=vs.90).aspx

As for "On Error Resume Next". It's used to skip over blank cells or when site isn't there. Other than that, there really isn't much that will cause error and isn't hard to debug / fix if needed.

In more complex code, you will want to pinpoint the cause of error, and thus trap it using methods other than "Resume Next".

Can't scrape title from webpages

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Active Member

Excel Ninja

Active Member

Excel Ninja

Active Member

Member

Attachments

Active Member

Attachments

Member

Excel Ninja

Member

Excel Ninja

Active Member

Member

Active Member

Excel Ninja