1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

List URLs that include specific string

Discussion in 'VBA Macros' started by YasserKhalil, Oct 12, 2017.

  1. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    Hello everyone
    I am learning web scraping but I find it difficult for me to get it ..
    I am seeking to search google for URLs that contains specific string (Say "Excel") so I need to list all the urls that has this string "Excel" in the URL itself
    For example :
    https://www.excelforum.com
    https://www.mrexcel.com/

    Thanks advanced for help
  2. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,914
    It would help if you post what you've managed and a sample workbook.

    For what you are trying to do, I'd imagine RegEx can be used, or other string manipulation function.

    RegEx pattern would be something like...
    "http.+excel.+"
    YasserKhalil likes this.
  3. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    In fact I have no idea how to start solving this issue
    I would appreciate if you can help me out
    Thanks a lot
  4. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,914
  5. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    Thanks a lot for the links. I have already seen these links but couldn't adapt them.
    I can wait for you (I am not in hurry) ..
  6. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,787
    I am also starting to learn some part of it for my own help. So just got interested in your post. Here's a code which is basic and can well be inefficient but works and gets some data. Experts like Chihiro and Marc L will surely give ideas down the line if we start.

    You need to add your own check (e.g. Instr) and url search. And note the references written in the beginning. I used them to early bind and see what they offer in intellisense.

    Code (vb):
    Option Explicit
    '\\ Reference set to  1. Microsoft XML, v6.0
    '\\                  2. Microsoft HTML Object Library
    Public Sub FetchGoogleData()
    Dim xmlhttp As MSXML2.XMLHTTP60
    Dim strURL As String, strOutput As String
    Dim htmlfil As HTMLDocument
    Dim objResults As HTMLDivElement
    Dim objH3 As Object, objH3S As Object, objLink As Object, objLinks As Object
    Dim i As Long

    '\\ Get Data from site from specified url
    strURL = "https://www.google.com/search?q=excel+forum"
    Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
    xmlhttp.Open "GET", strURL, False
    xmlhttp.setRequestHeader "Content-Type", "Text/xml"
    xmlhttp.send

    '\\ Load response data in HTML Document and play with it
    i = 1
    Set htmlfil = CreateObject("htmlfile")
    htmlfil.body.innerHTML = xmlhttp.responseText
    Set objResults = htmlfil.getElementById("rso")
    Set objH3S = objResults.getElementsByTagName("H3")
    For Each objH3 In objH3S
        Set objLinks = objH3.getElementsByTagName("a")
        For Each objLink In objLinks
            Cells(i, "A").Value = objLink.href
            i = i + 1
        Next
    Next

    '\\ Release objects
    Set htmlfil = Nothing
    Set xmlhttp = Nothing
    End Sub
    Chihiro and YasserKhalil like this.
  7. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    That's great. Thank you very much
    When I tested the code I encountered an error like that
    Help.png

    At this line
    Code (vb):
    xmlhttp.send
  8. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    I have edited this line (by chance) and it worked well
    Code (vb):
    strURL = "https://www.google.com.eg/search?q=excel+forum"
  9. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
  10. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,914
    Can you be bit more descriptive?

    Using @shrivallabha 's code, you get...
    From this, are you only looking to extract domain only? Or as long as domain contains "excel" entire href?

    Either case, I'd use VBScript.RegExp

    For extracting Domain only.
    Code (vb):
        For Each objLink In objLinks
            With CreateObject("VBScript.RegExp")
                .Pattern = "(http.+excel.*\.\w+\/+)"
                If .Test(objLink.href) Then
                    Cells(i, "A").Value = .Execute(objLink.href)(0)
                    i = i + 1
                End If
            End With
        Next
    For extracting entire href when domain contains "excel"
    Code (vb):
        For Each objLink In objLinks
            With CreateObject("VBScript.RegExp")
                .Pattern = "http.+excel.*\.\w+\/.*"
                If .Test(objLink.href) Then
                    Cells(i, "A").Value = objLink.href
                    i = i + 1
                End If
            End With
        Next
    YasserKhalil likes this.
  11. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    Thanks a lot for great help Chihiro
    I couldn't adapt the snippet you have posted. I encountered an error
    Can you post the whole final version of code?

    When I tried this
    Code (vb):
        Set objResults = htmlfil.getElementById("rso")
        Set objLinks = objResults.getElementsByTagName("H3")

    For Each objLink In objLinks
            With CreateObject("VBScript.RegExp")
                .Pattern = "(http.+excel.*\.\w+\/+)"
                If .Test(objLink.href) Then
                    Cells(i, "A").Value = .Execute(objLink.href)(0)
                    i = i + 1
                End If
            End With
        Next
     
    I encountered an error 438 (object doesn't support this property or method) at this line
    Code (vb):
    If .Test(objLink.href) Then
  12. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,914
    Hmm? What error message are you getting?

    At any rate just replace below section with one of above. No other change made on code.
    Code (vb):
        For Each objLink In objLinks
            Cells(i, "A").Value = objLink.href
            i = i + 1
        Next
  13. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    The same error I received ...
  14. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,914
    So you must be missing outer loop...

    Code (vb):
    Option Explicit
    '\\ Reference set to  1. Microsoft XML, v6.0
    '\\                  2. Microsoft HTML Object Library
    Public Sub FetchGoogleData()
    Dim xmlhttp As MSXML2.XMLHTTP60
    Dim strURL As String, strOutput As String
    Dim htmlfil As HTMLDocument
    Dim objResults As HTMLDivElement
    Dim objH3 As Object, objH3S As Object, objLink As Object, objLinks As Object
    Dim i As Long

    '\\ Get Data from site from specified url
    strURL = "https://www.google.com.eg/search?q=excel+forum"
    Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
    xmlhttp.Open "GET", strURL, False
    xmlhttp.setRequestHeader "Content-Type", "Text/xml"
    xmlhttp.send

    '\\ Load response data in HTML Document and play with it
    i = 1
    Set htmlfil = CreateObject("htmlfile")
    htmlfil.body.innerHTML = xmlhttp.responseText
    Set objResults = htmlfil.getElementById("rso")
    Set objH3S = objResults.getElementsByTagName("H3")
    For Each objH3 In objH3S
        Set objLinks = objH3.getElementsByTagName("a")
        For Each objLink In objLinks
            With CreateObject("VBScript.RegExp")
                .Pattern = "(http.+excel.*\.\w+\/+)"
                If .Test(objLink.href) Then
                    Cells(i, "A").Value = .Execute(objLink.href)(0)
                    i = i + 1
                End If
            End With
        Next
    Next

    '\\ Release objects
    Set htmlfil = Nothing
    Set xmlhttp = Nothing
    End Sub
    YasserKhalil likes this.
  15. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    You're right. That's my bad
    Thank you very very much for great and awesome help

    Last point : Are the results for just one page of results in google search ..?
    If so can it be extended for 10 pages for example
  16. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
  17. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,914
    It is one page. XMLHttp sends request to Web server to fetch result using URL.

    If you want more control on result... you'll need to use their JSON/Atom API

    However, Google limits number of API request from same IP and will block your IP temporarily after so many attempts.

    You used to be able to leverage public proxy servers to circumvent this, but recently (last year or so) Google has black listed most/all of public proxy.

    If you want to make heavy use of it, you'll need to subscribe or pay.

    There are few other ways to circumvent this... though not as efficient.

    1. Pilot IE and read content
    2. Understand how Google's search query URL works and manipulate it.

    For the 2nd option. To get response for second page... you need.
    "https://www.google.com.eg/search?q=excel forum&start=10"

    Increment each loop by 10, since google returns 10 items at a time in page.

    As for your second question...
    The reason your query returns only one result is because when searching for just the string "excel" top search returns are "www.office.com" etc.

    Without leveraging advanced search options, search will look for most popular site with "excel" in the content and not in URL.

    You can set it to...
    "https://www.google.com.eg/search?q=excel&as_occt=url"

    But this will not distinguish where in URL "excel" occurs.
    Experiment with Advanced search option and study result and the URL generated.
    YasserKhalil likes this.
  18. Chihiro

    Chihiro Excel Ninja

    Messages:
    3,914
    As a side note.

    Most important thing in web scraping isn't your coding skill, but your skill in deciphering how the site behaves and is structured.

    This can be achieved through use of Developer Tool in the browser of your choice, and by practice.

    Few other good knowledge to have.
    • RegEx - Very useful in finding the information you are looking for in large/complex string.
    • HTMLDom/XMLDom structure - Knowing this will make it easier to get at the info (study W3School's documentation)
    • JSON - Very popular and compact way to store data for use in web site. You will often encounter sites that use java script to populate table/page from JSON response
    • Java - Sometimes, it's far easier to scrape web site by leveraging java script. See one of example codes done by Marc L in this forum.
    YasserKhalil and shrivallabha like this.
  19. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,787
    Thank you @Chihiro for your inputs and guidelines. I will keep these in mind.
  20. shahin

    shahin Active Member

    Messages:
    597
    To find specific link out of numerous you can rely on string manipulation as well. If you run the below script, you can see that it will parse those links containing "wikipedia:" only out of thousand links irrespective of which case they hold.

    Code (vb):

    Sub Wikidata()
        Dim http As New XMLHTTP60, html As New HTMLDocument
        Dim link As Object

        With http
            .Open "GET", "https://en.wikipedia.org/wiki/Main_Page", False
            .send
            html.body.innerHTML = .responseText
        End With
       
        For Each link In html.getElementsByTagName("a")
            If InStr(1, link.href, "Wikipedia:", 1) > 0 Then
                row = row + 1: Cells(row, 1) = link.href
            End If
        Next link
    End Sub
     
    YasserKhalil likes this.
  21. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    Thank you very much everyone for sharing this issue
    You are awesome
  22. Marc L

    Marc L Excel Ninja

    Messages:
    3,277


    Using Google can be a mess as often limited and not accurate …​
    YasserKhalil likes this.
  23. YasserKhalil

    YasserKhalil Active Member

    Messages:
    774
    Thanks Mr. Marc
    So what's the alternative approaches?
  24. Marc L

    Marc L Excel Ninja

    Messages:
    3,277

    If result using Google is not what expected,
    use another web search engine …​
    YasserKhalil likes this.

Share This Page