shahin
Active Member
I've created a scraper that can parse a "contact us" (or similar) link from any given website, follow that link, and extract the email address from that page (if there is one). It took me days to come up with the idea and build this. Although it works smoothly, there is room for improvement to make it flawless. If you put any website link in the "link_var" variable and run it, it will most probably fetch the email address from that site, provided there is one and the site doesn't rely on JavaScript to render its content. Btw, is there any way I can use something other than "On Error Resume Next" in my scraper, for future reference?
Code:
Sub Email_parser()
    'Requires references: Microsoft XML, v6.0; Microsoft HTML Object Library;
    'Microsoft VBScript Regular Expressions 5.5
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, link_var As Variant, link As Variant
    Dim refined_links As String, base_url As String
    Dim rxp As New RegExp, email_list As Object
    Dim R As Long

    link_var = Array( _
        "http://spltech.in/", _
        "http://www.unifrostindia.com/", _
        "http://advanta.in/", _
        "http://www.superrefrigerations.com/", _
        "http://www.greenplanet.in/")

    For Each link In link_var
        With http
            .Open "GET", link, False
            .send
            html.body.innerHTML = .responseText
        End With

        On Error Resume Next 'not required for these particular links, but may be useful for less forgiving sites

        'Base URL (scheme + host), used to resolve relative "contact" links
        If InStr(link, "http:") > 0 Then base_url = Left(link, InStr(8, link, "/") - 1)

        refined_links = "" 'reset so a stale link from the previous site is never reused
        For Each post In html.getElementsByTagName("a")
            If InStr(1, post.innerText, "contact", vbTextCompare) > 0 Then
                refined_links = Replace(post.href, "about:", base_url & "/")
                Exit For
            End If
        Next post

        If Len(refined_links) > 0 Then
            With http
                .Open "GET", refined_links, False
                .send
            End With

            With rxp
                .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
                .Global = True
                Set email_list = .Execute(http.responseText)
            End With

            'Guard against sites whose contact page contains no email address
            If email_list.Count > 0 Then
                R = R + 1
                Cells(R + 1, 2) = email_list(0)
            End If
        End If
    Next link
End Sub
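For context, the only alternative I'm aware of is a labelled handler via "On Error GoTo", something like the sketch below, but I'm not sure whether that's considered better practice for a scraper like this. FetchPage is just a hypothetical helper name I made up for illustration, not part of my scraper:

Code:
'A minimal sketch of structured error handling in VBA, as an alternative
'to a blanket "On Error Resume Next". FetchPage is a hypothetical helper.
Function FetchPage(ByVal url As String) As String
    Dim http As New XMLHTTP60
    On Error GoTo RequestFailed          'jump to the handler on any run-time error
    http.Open "GET", url, False
    http.send
    FetchPage = http.responseText
    Exit Function                        'skip the handler on the success path
RequestFailed:
    Debug.Print "Request failed for " & url & ": " & Err.Description
    FetchPage = vbNullString             'caller can test for an empty result
End Function

The caller could then check for an empty string and skip that site deliberately, instead of errors being silently swallowed.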