• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Unable to use regex within vba scraper in the right way

shahin

Active Member
Hi there!! This is my first ever attempt to work with regex which is supposed to be applied on "responsetext" to find any phone number from a webpage. As I never used regex within vba, I'm not at all sure what I did in my below script is at all permissible cause I found it from a site and tried to clone the usage of it. However, it throws an error "run time error 5" "invalid procedure call or argument".

Script I tried with:
Code:
Sub requests_with_regex()
    Dim IE As New InternetExplorer, html As New HTMLDocument
       
    With IE
        .Visible = True
        .navigate "http://www.peekayfarmequipments.com/enquiry.html"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
    End With
   
    With CreateObject("VBScript.RegExp")
        .Pattern = "\+?\d+.-?\(?\d+.\)?-?\s?\w*.-?\s?\w*."
        r = r + 1: Cells(r, 1) = .Execute(html).Item(0).SubMatches(0)  ''error thrown here
    End With
   
    IE.Quit
End Sub

One more thing: i tried to use here (Regex) late binding. It would be awesome if i get any hint how to do it using early binding as well. Thanks in advance. Btw, the credit for the regex pattern goes to me lol!!
 
Hi ,

This at least runs and outputs something !
Code:
Sub requests_with_regex()
    Dim IE As New InternetExplorer, html As New HTMLDocument, regex As Object, allmatches As Object
    Dim i As Long, pos As Long, found As String
    Dim test_string As String
   
    With IE
        .Visible = True
        .navigate "http://www.peekayfarmequipments.com/enquiry.html"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
    End With
   
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = "[\91]*\-\d+"
        .Global = True
        r = r + 1
        Application.Wait (Now + TimeValue("0:00:10"))
        test_string = html.all.Item(99).innerText
        If .Test(test_string) Then
          Set allmatches = .Execute(test_string)
          For i = 0 To allmatches.Count - 1
              Debug.Print allmatches(i)
          Next
      End If
    End With
 
    IE.Quit
End Sub
Narayan
 
It did run and produced results without any issues. You are always the life saver Narayan. In "xmlhttp" requests we usually catch the "responsetext" by using something like "html.body.innerhtml = http.responsetext". What I would like to know is how to get the "responsetext" in IE (InternetExplorer). What should be the line actually? Btw, I've solved the issue using "xmlhttp" requests. It is not giving accurate results though, as my created expression is not accurate. However, it at least responds.

Code:
Sub requests_regex()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim regex As New RegExp, Matches As Object, Match As Object

    With http
        .Open "GET", "https://stackoverflow.com/questions/123559/a-comprehensive-regex-for-phone-number-validation", False
        .send
    End With
    With regex
         .Pattern = "\+?.\d+.-?\(?\d+.\)?-?\s?\w*.-?\s?\w*."
         .Global = True
         Set Matches = .Execute(http.responseText)
'         MsgBox Matches(0).Value
     End With
     For Each Match In Matches
        r = r + 1: Cells(r, 1) = Match
     Next Match
End Sub
 
Hi ,

I am afraid I don't have the answer to your question.

Every website needs to have its page source scrutinized to understand what expression will return what we are interested in.

Even after that , determining what element will return our required string is , on more than one occasion , a matter of trial and error.

Hopefully someone else may have the answer.

Narayan
 
In "xmlhttp" requests we usually catch the "responsetext" by using something like "html.body.innerhtml = http.responsetext". What I would like to know is how to get the "responsetext" in IE (InternetExplorer). What should be the line actually?

In your code, you've set HTMLdocument at...
Code:
Set html = .document

If you want just the body innerhtml... you just access it as...
Code:
html.body.innerhtml

Since .document has all components including .body.

FYI - For site in question you can use following pattern.
\+\d{2}.\d{10}

Always test your RegEx pattern using some tester.
https://regex101.com/

Edited typo.
 
Last edited:
@sir chihiro, what i would like to know is: in "xmlhttp" requests i used this very line and got valid response "Set Matches = .Execute(http.responseText)" [strictly consider the portion within .Execute()]. What to do if i go for IE. I kept that portion blank for your consideration in the below script:
Code:
Sub requests_with_regex()
    Dim IE As New InternetExplorer, html As New HTMLDocument
    Dim regxp As New RegExp, post As Object, phone_list As Object
      
    With IE
        .Visible = True
        .navigate "http://www.peekayfarmequipments.com/enquiry.html"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        html.body.innerHTML = .document
    End With
    Application.Wait (Now + TimeValue("0:00:10"))
    With regxp
        .Pattern = "\+\d{2}.\d{10}"
        Set phone_list = .Execute("this is where i get stuck") ''take a look here
    End With
    For Each post In phone_list
        R = R + 1: Cells(R, 1) = post
    Next post
    IE.Quit
End Sub
 
Last edited:
You misunderstood me...
Code:
Sub requests_with_regex()
    Dim IE As New InternetExplorer, html As New HTMLDocument
    Dim regxp As New RegExp, post As Object, phone_list As Object
     
    With IE
        .Visible = True
        .navigate "http://www.peekayfarmequipments.com/enquiry.html"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
       
    End With
    Application.Wait (Now + TimeValue("0:00:10"))
    With regxp
        .Pattern = "\+\d{2}.\d{10}"
        Set phone_list = .Execute(html.body.innerHTML)
  End With
    For Each post In phone_list
        r = r + 1: Cells(r, 1) = post
    Next post
    IE.Quit
End Sub
 
Definitely i misunderstood you, sir. But i'm happy i did that otherwise i would not get my script refined from you. Thanks a trillion. Gonna give it a try.
 
@sir chihiro, It works like magic. One last thing to know on this topic. What if i attempt to scrape two categories, I meant multiple patterns, as in phone and email in this case. Please take a look below:
Code:
    With regxp
        .Pattern = "\+\d{2}.\d{10}" ''for phone
        .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+" '' for email
        Set phone_list = .Execute(html.body.innerHTML)
        Range("A1") = phone_list(0)
        Range("B1")= 'what to input here to catch the email
    End With

Once again, i'm very new on this so please don't get bothered.
 
@sir chihiro, If i could understand you properly then this is what you meant, right?
Code:
Sub requests_with_regex()
    Dim IE As New InternetExplorer, html As New HTMLDocument
    Dim regxp As New RegExp, post As Object, phone_list As Object, email_list As Object
    
    With IE
        .Visible = False
        .navigate "http://agroengine.com/contact.html"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
      
    End With
    Application.Wait (Now + TimeValue("0:00:05"))
  
    With regxp
        .Pattern = "\d+.\s\d+.\s\d+."
        Set phone_list = .Execute(html.body.innerHTML)
    End With
  
    With regxp
        .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
        Set email_list = .Execute(html.body.innerHTML)
    End With
  
    Range("A1") = phone_list(0)
    Range("B1") = email_list(0)
  
    IE.Quit
End Sub

And it fetches the below results:
Code:
0091 124 4323900   
windsor@windsorauto.com
 
Close, but something like below is what I meant.
Code:
    With regxp
        .Pattern = "\d+.\s\d+.\s\d+."
        Set phone_list = .Execute(html.body.innerHTML)
        .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
        Set email_list = .Execute(html.body.innerHTML)
    End With
 
hi
what would be the change to make to the code to ensure that Macro extrapolates all the mobile phone numbers that it finds rather than just only one?
 
Back
Top