List URLs that include specific string

YasserKhalil · Oct 12, 2017

Hello everyone
I am learning web scraping but I find it difficult for me to get it ..
I am seeking to search google for URLs that contains specific string (Say "Excel") so I need to list all the urls that has this string "Excel" in the URL itself
For example :
https://www.excelforum.com
https://www.mrexcel.com/

Thanks advanced for help

Chihiro · Oct 12, 2017

It would help if you post what you've managed and a sample workbook.

For what you are trying to do, I'd imagine RegEx can be used, or other string manipulation function.

RegEx pattern would be something like...
"http.+excel.+"

YasserKhalil · Oct 12, 2017

In fact I have no idea how to start solving this issue
I would appreciate if you can help me out
Thanks a lot

Chihiro · Oct 12, 2017

I'm bit busy at work, check links in the mean time.
http://analystcave.com/web-scraping-tutorial/
http://analystcave.com/scrape-google-search-results-to-csv/

YasserKhalil · Oct 12, 2017

Thanks a lot for the links. I have already seen these links but couldn't adapt them.
I can wait for you (I am not in hurry) ..

shrivallabha · Oct 13, 2017

I am also starting to learn some part of it for my own help. So just got interested in your post. Here's a code which is basic and can well be inefficient but works and gets some data. Experts like Chihiro and Marc L will surely give ideas down the line if we start.

You need to add your own check (e.g. Instr) and url search. And note the references written in the beginning. I used them to early bind and see what they offer in intellisense.

Code:

Option Explicit
'\\ Reference set to  1. Microsoft XML, v6.0
'\\                  2. Microsoft HTML Object Library
Public Sub FetchGoogleData()
Dim xmlhttp As MSXML2.XMLHTTP60
Dim strURL As String, strOutput As String
Dim htmlfil As HTMLDocument
Dim objResults As HTMLDivElement
Dim objH3 As Object, objH3S As Object, objLink As Object, objLinks As Object
Dim i As Long

'\\ Get Data from site from specified url
strURL = "https://www.google.com/search?q=excel+forum"
Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
xmlhttp.Open "GET", strURL, False
xmlhttp.setRequestHeader "Content-Type", "Text/xml"
xmlhttp.send

'\\ Load response data in HTML Document and play with it
i = 1
Set htmlfil = CreateObject("htmlfile")
htmlfil.body.innerHTML = xmlhttp.responseText
Set objResults = htmlfil.getElementById("rso")
Set objH3S = objResults.getElementsByTagName("H3")
For Each objH3 In objH3S
    Set objLinks = objH3.getElementsByTagName("a")
    For Each objLink In objLinks
        Cells(i, "A").Value = objLink.href
        i = i + 1
    Next
Next

'\\ Release objects
Set htmlfil = Nothing
Set xmlhttp = Nothing
End Sub

YasserKhalil · Oct 13, 2017

That's great. Thank you very much
When I tested the code I encountered an error like that

At this line

Code:

xmlhttp.send

YasserKhalil · Oct 13, 2017

I have edited this line (by chance) and it worked well

Code:

strURL = "https://www.google.com.eg/search?q=excel+forum"

YasserKhalil · Oct 13, 2017

How can I restrict and limit the results to search in the main domain only?
I mean such link "https://www.quora.com/Which-are-the-best-forums-for-help-on-Microsoft-Excel" .. This link I don't need it in results as the main domain is "https://www.quora.com/"

Thanks a lot for great and awesome help

Chihiro · Oct 13, 2017

Can you be bit more descriptive?

Using @shrivallabha 's code, you get...

https://www.excelforum.com/
https://www.mrexcel.com/forum/index.php
https://www.mrexcel.com/forum/excel-questions/
https://social.technet.microsoft.com/Forums/office/en-US/home?forum=excel
https://techcommunity.microsoft.com/t5/Excel/ct-p/Excel_Cat
http://www.msofficeforums.com/excel/
https://www.quora.com/Which-are-the-best-forums-for-help-on-Microsoft-Excel
http://excelmatters.com/excel-forums/
https://forum.chandoo.org/
http://www.vbaexpress.com/forum/forumdisplay.php?17-Excel-Help

From this, are you only looking to extract domain only? Or as long as domain contains "excel" entire href?

Either case, I'd use VBScript.RegExp

For extracting Domain only.

Code:

    For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "(http.+excel.*\.\w+\/+)"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = .Execute(objLink.href)(0)
                i = i + 1
            End If
        End With
    Next

For extracting entire href when domain contains "excel"

Code:

    For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "http.+excel.*\.\w+\/.*"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = objLink.href
                i = i + 1
            End If
        End With
    Next

YasserKhalil · Oct 13, 2017

Thanks a lot for great help Chihiro
I couldn't adapt the snippet you have posted. I encountered an error
Can you post the whole final version of code?

When I tried this

Code:

    Set objResults = htmlfil.getElementById("rso")
    Set objLinks = objResults.getElementsByTagName("H3")

For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "(http.+excel.*\.\w+\/+)"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = .Execute(objLink.href)(0)
                i = i + 1
            End If
        End With
    Next

I encountered an error 438 (object doesn't support this property or method) at this line

Code:

If .Test(objLink.href) Then

Chihiro · Oct 13, 2017

Hmm? What error message are you getting?

At any rate just replace below section with one of above. No other change made on code.

Code:

    For Each objLink In objLinks
        Cells(i, "A").Value = objLink.href
        i = i + 1
    Next

YasserKhalil · Oct 13, 2017

The same error I received ...

Chihiro · Oct 13, 2017

So you must be missing outer loop...

Code:

Option Explicit
'\\ Reference set to  1. Microsoft XML, v6.0
'\\                  2. Microsoft HTML Object Library
Public Sub FetchGoogleData()
Dim xmlhttp As MSXML2.XMLHTTP60
Dim strURL As String, strOutput As String
Dim htmlfil As HTMLDocument
Dim objResults As HTMLDivElement
Dim objH3 As Object, objH3S As Object, objLink As Object, objLinks As Object
Dim i As Long

'\\ Get Data from site from specified url
strURL = "https://www.google.com.eg/search?q=excel+forum"
Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
xmlhttp.Open "GET", strURL, False
xmlhttp.setRequestHeader "Content-Type", "Text/xml"
xmlhttp.send

'\\ Load response data in HTML Document and play with it
i = 1
Set htmlfil = CreateObject("htmlfile")
htmlfil.body.innerHTML = xmlhttp.responseText
Set objResults = htmlfil.getElementById("rso")
Set objH3S = objResults.getElementsByTagName("H3")
For Each objH3 In objH3S
    Set objLinks = objH3.getElementsByTagName("a")
    For Each objLink In objLinks
        With CreateObject("VBScript.RegExp")
            .Pattern = "(http.+excel.*\.\w+\/+)"
            If .Test(objLink.href) Then
                Cells(i, "A").Value = .Execute(objLink.href)(0)
                i = i + 1
            End If
        End With
    Next
Next

'\\ Release objects
Set htmlfil = Nothing
Set xmlhttp = Nothing
End Sub

YasserKhalil · Oct 13, 2017

You're right. That's my bad
Thank you very very much for great and awesome help

Last point : Are the results for just one page of results in google search ..?
If so can it be extended for 10 pages for example

YasserKhalil · Oct 13, 2017

When I changed this line

Code:

strURL = "https://www.google.com.eg/search?q=excel"

I just got one result .."https://microsoft-excel.en.softonic.com/" Why?!!

Chihiro · Oct 13, 2017

It is one page. XMLHttp sends request to Web server to fetch result using URL.

If you want more control on result... you'll need to use their JSON/Atom API

However, Google limits number of API request from same IP and will block your IP temporarily after so many attempts.

You used to be able to leverage public proxy servers to circumvent this, but recently (last year or so) Google has black listed most/all of public proxy.

If you want to make heavy use of it, you'll need to subscribe or pay.

There are few other ways to circumvent this... though not as efficient.

1. Pilot IE and read content
2. Understand how Google's search query URL works and manipulate it.

For the 2nd option. To get response for second page... you need.
"https://www.google.com.eg/search?q=excel forum&start=10"

Increment each loop by 10, since google returns 10 items at a time in page.

As for your second question...
The reason your query returns only one result is because when searching for just the string "excel" top search returns are "www.office.com" etc.

Without leveraging advanced search options, search will look for most popular site with "excel" in the content and not in URL.

You can set it to...
"https://www.google.com.eg/search?q=excel&as_occt=url"

But this will not distinguish where in URL "excel" occurs.
Experiment with Advanced search option and study result and the URL generated.

Chihiro · Oct 13, 2017

As a side note.

Most important thing in web scraping isn't your coding skill, but your skill in deciphering how the site behaves and is structured.

This can be achieved through use of Developer Tool in the browser of your choice, and by practice.

Few other good knowledge to have.

RegEx - Very useful in finding the information you are looking for in large/complex string.
HTMLDom/XMLDom structure - Knowing this will make it easier to get at the info (study W3School's documentation)
JSON - Very popular and compact way to store data for use in web site. You will often encounter sites that use java script to populate table/page from JSON response
Java - Sometimes, it's far easier to scrape web site by leveraging java script. See one of example codes done by Marc L in this forum.

shrivallabha · Oct 13, 2017

Thank you @Chihiro for your inputs and guidelines. I will keep these in mind.

shahin · Oct 14, 2017

To find specific link out of numerous you can rely on string manipulation as well. If you run the below script, you can see that it will parse those links containing "wikipedia:" only out of thousand links irrespective of which case they hold.

Code:

Sub Wikidata()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim link As Object

    With http
        .Open "GET", "https://en.wikipedia.org/wiki/Main_Page", False
        .send
        html.body.innerHTML = .responseText
    End With
   
    For Each link In html.getElementsByTagName("a")
        If InStr(1, link.href, "Wikipedia:", 1) > 0 Then
            row = row + 1: Cells(row, 1) = link.href
        End If
    Next link
End Sub

YasserKhalil · Oct 14, 2017

Thank you very much everyone for sharing this issue
You are awesome

Marc L · Oct 14, 2017

Using Google can be a mess as often limited and not accurate …

YasserKhalil · Oct 14, 2017

Thanks Mr. Marc
So what's the alternative approaches?

Marc L · Oct 14, 2017

If result using Google is not what expected,
use another web search engine …

List URLs that include specific string

Well-Known Member

Excel Ninja

Well-Known Member

Excel Ninja

Well-Known Member

Excel Ninja

Well-Known Member

Well-Known Member

Well-Known Member

Excel Ninja

Well-Known Member

Excel Ninja

Well-Known Member

Excel Ninja

Well-Known Member

Well-Known Member

Excel Ninja

Excel Ninja

Excel Ninja

Active Member

Well-Known Member

Excel Ninja

Well-Known Member

Excel Ninja