
List URLs that include specific string

It would help if you posted what you've managed so far, along with a sample workbook.

For what you are trying to do, I'd imagine RegEx can be used, or another string manipulation function.

The RegEx pattern would be something like...
"http.+excel.+"
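A minimal, untested sketch of how such a pattern could be applied with VBScript.RegExp (the sample URL is made up for illustration):
Code:
'\\ Sketch only: test the pattern against a sample string
Sub TestPattern()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "http.+excel.+"
    re.IgnoreCase = True
    '\\ Prints True when "excel" appears somewhere after "http"
    Debug.Print re.Test("https://chandoo.org/excel-forum")
End Sub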
 
In fact I have no idea how to start solving this issue.
I would appreciate it if you could help me out.
Thanks a lot
 
Thanks a lot for the links. I have already seen them but couldn't adapt them.
I can wait for you (I am not in a hurry) ..
 
I am also starting to learn some of this for my own benefit, so your post caught my interest. Here's some code which is basic and may well be inefficient, but it works and gets some data. Experts like Chihiro and Marc L will surely offer ideas down the line once we start.

You need to add your own check (e.g. InStr) and URL search. Also note the references listed at the beginning; I set them to early-bind and see what the libraries offer in IntelliSense.

Code:
Option Explicit
'\\ Reference set to  1. Microsoft XML, v6.0
'\\                  2. Microsoft HTML Object Library
Public Sub FetchGoogleData()
Dim xmlhttp As MSXML2.XMLHTTP60
Dim strURL As String, strOutput As String
Dim htmlfil As HTMLDocument
Dim objResults As HTMLDivElement
Dim objH3 As Object, objH3S As Object, objLink As Object, objLinks As Object
Dim i As Long

'\\ Get Data from site from specified url
strURL = "https://www.google.com/search?q=excel+forum"
Set xmlhttp = New MSXML2.XMLHTTP60 '\\ early bind to match the declaration
xmlhttp.Open "GET", strURL, False
xmlhttp.setRequestHeader "Content-Type", "Text/xml"
xmlhttp.send

'\\ Load response data in HTML Document and play with it
i = 1
Set htmlfil = CreateObject("htmlfile")
htmlfil.body.innerHTML = xmlhttp.responseText
Set objResults = htmlfil.getElementById("rso")
Set objH3S = objResults.getElementsByTagName("H3")
For Each objH3 In objH3S
    Set objLinks = objH3.getElementsByTagName("a")
    For Each objLink In objLinks
        Cells(i, "A").Value = objLink.href
        i = i + 1
    Next
Next

'\\ Release objects
Set htmlfil = Nothing
Set xmlhttp = Nothing
End Sub
 
That's great. Thank you very much
When I tested the code I encountered an error (screenshot attached: Help.png)

At this line
Code:
xmlhttp.send
 
I edited this line (by chance) and it worked well
Code:
strURL = "https://www.google.com.eg/search?q=excel+forum"
 
Can you be a bit more descriptive?

Using @shrivallabha's code, you get...

From this, are you looking to extract the domain only? Or the entire href, as long as the domain contains "excel"?

In either case, I'd use VBScript.RegExp.

For extracting the domain only:
Code:
    For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "(http.+excel.*\.\w+\/+)"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = .Execute(objLink.href)(0)
                i = i + 1
            End If
        End With
    Next

For extracting the entire href when the domain contains "excel":
Code:
    For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "http.+excel.*\.\w+\/.*"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = objLink.href
                i = i + 1
            End If
        End With
    Next
 
Thanks a lot for the great help Chihiro
I couldn't adapt the snippet you posted; I encountered an error.
Can you post the whole final version of the code?

When I tried this
Code:
    Set objResults = htmlfil.getElementById("rso")
    Set objLinks = objResults.getElementsByTagName("H3")

    For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "(http.+excel.*\.\w+\/+)"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = .Execute(objLink.href)(0)
                i = i + 1
            End If
        End With
    Next

I encountered an error 438 (object doesn't support this property or method) at this line
Code:
If .Test(objLink.href) Then
 
Hmm? What error message are you getting?

At any rate, just replace the section below with one of the snippets above. No other change is made to the code.
Code:
    For Each objLink In objLinks
        Cells(i, "A").Value = objLink.href
        i = i + 1
    Next
 
So you must be missing the outer loop...

Code:
Option Explicit
'\\ Reference set to  1. Microsoft XML, v6.0
'\\                  2. Microsoft HTML Object Library
Public Sub FetchGoogleData()
Dim xmlhttp As MSXML2.XMLHTTP60
Dim strURL As String, strOutput As String
Dim htmlfil As HTMLDocument
Dim objResults As HTMLDivElement
Dim objH3 As Object, objH3S As Object, objLink As Object, objLinks As Object
Dim i As Long

'\\ Get Data from site from specified url
strURL = "https://www.google.com.eg/search?q=excel+forum"
Set xmlhttp = New MSXML2.XMLHTTP60 '\\ early bind to match the declaration
xmlhttp.Open "GET", strURL, False
xmlhttp.setRequestHeader "Content-Type", "Text/xml"
xmlhttp.send

'\\ Load response data in HTML Document and play with it
i = 1
Set htmlfil = CreateObject("htmlfile")
htmlfil.body.innerHTML = xmlhttp.responseText
Set objResults = htmlfil.getElementById("rso")
Set objH3S = objResults.getElementsByTagName("H3")
For Each objH3 In objH3S
    Set objLinks = objH3.getElementsByTagName("a")
    For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "(http.+excel.*\.\w+\/+)"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = .Execute(objLink.href)(0)
                i = i + 1
            End If
        End With
    Next
Next

'\\ Release objects
Set htmlfil = Nothing
Set xmlhttp = Nothing
End Sub
 
You're right, that's my bad.
Thank you very much for the great and awesome help.

Last point: are the results for just one page of Google search results?
If so, can it be extended to 10 pages, for example?
 
It is one page. XMLHTTP sends a single request to the web server and fetches the result for that URL.

If you want more control over the results... you'll need to use their JSON/Atom API.

However, Google limits the number of API requests from the same IP and will temporarily block your IP after so many attempts.

You used to be able to leverage public proxy servers to circumvent this, but recently (in the last year or so) Google has blacklisted most/all public proxies.

If you want to make heavy use of it, you'll need to subscribe or pay.

There are a few other ways to circumvent this... though not as efficient:

1. Pilot IE and read the content.
2. Understand how Google's search query URL works and manipulate it.

For the 2nd option, to get the response for the second page you need:
"https://www.google.com.eg/search?q=excel+forum&start=10"

Increment by 10 on each loop, since Google returns 10 items per page.
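The paging described above could be sketched like this (untested; assumes Google still returns 10 results per page and that the request/parsing is done as in FetchGoogleData):
Code:
'\\ Sketch only: loop over the first 10 result pages
Dim page As Long, strURL As String
For page = 0 To 90 Step 10
    strURL = "https://www.google.com.eg/search?q=excel+forum&start=" & page
    '\\ ...send the XMLHTTP request and parse the response as in FetchGoogleData...
Next page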

As for your second question...
The reason your query returns only one result is that, when searching for just the string "excel", the top results are "www.office.com" etc.

Without leveraging the advanced search options, the search looks for the most popular sites with "excel" in the content, not in the URL.

You can set it to...
"https://www.google.com.eg/search?q=excel&as_occt=url"

But this will not distinguish where in the URL "excel" occurs.
Experiment with the Advanced Search options and study the results and the URLs generated.
 
As a side note.

The most important thing in web scraping isn't your coding skill, but your skill in deciphering how the site behaves and is structured.

This can be learned through use of the Developer Tools in the browser of your choice, and through practice.

A few other good things to know:
  • RegEx - Very useful for finding the information you are looking for in a large/complex string.
  • HTML DOM / XML DOM structure - Knowing this will make it easier to get at the info (study W3Schools' documentation).
  • JSON - A very popular and compact way to store data for use in a web site. You will often encounter sites that use JavaScript to populate a table/page from a JSON response.
  • JavaScript - Sometimes it's far easier to scrape a web site by leveraging JavaScript. See one of the example codes by Marc L in this forum.
 
To find a specific link among many, you can rely on string manipulation as well. If you run the script below, you can see that it parses only those links containing "wikipedia:" out of a thousand links, irrespective of case.

Code:
Sub Wikidata()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim link As Object, row As Long

    With http
        .Open "GET", "https://en.wikipedia.org/wiki/Main_Page", False
        .send
        html.body.innerHTML = .responseText
    End With
   
    For Each link In html.getElementsByTagName("a")
        If InStr(1, link.href, "Wikipedia:", 1) > 0 Then
            row = row + 1: Cells(row, 1) = link.href
        End If
    Next link
End Sub
 