Creating a universal email scraper

Status
Not open for further replies.

shahin

Active Member
I've created a scraper that finds the "contact us" (or similarly named) link on any given website, follows that link, and parses the email address from the resulting page, if there is one. It took me days to come up with the idea and build it. Although it is working smoothly, there is room for improvement. If you put any website link in the "link_var" variable and run it, it will most probably fetch the email address from that site, provided the site lists one and does not rely on JavaScript to render its content. Btw, is there anything I can use in place of "On Error Resume Next" in my scraper, for future reference?

Code:
Sub Email_parser()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, link_var As Variant, link As Variant, refined_links As String
    Dim x As String, R As Long  ' site root and output row counter
    Dim rxp As New RegExp, email_list As Object

    link_var = Array( _
        "http://spltech.in/", _
        "http://www.unifrostindia.com/", _
        "http://advanta.in/", _
        "http://www.superrefrigerations.com/", _
        "http://www.greenplanet.in/")

    For Each link In link_var
        With http
            .Open "GET", link, False
            .send
            html.body.innerHTML = .responseText
        End With
     
        On Error Resume Next  ' not required for these particular links, but it may be useful for other sites
     
        For Each post In html.getElementsByTagName("a")
            If InStr(link, "http:") > 0 Then x = Left(link, InStr(8, link, "/") - 1)
            If InStr(1, post.innerText, "contact", 1) > 0 Then refined_links = Replace(post.href, "about:", x & "/"): Exit For
        Next post
     
        With http
            .Open "GET", refined_links, False
            .send
        End With

        With rxp
            .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
            .Global = True
            Set email_list = .Execute(http.responseText)
        End With
        R = R + 1: Cells(R + 1, 2) = email_list(0)
    Next link
End Sub
 

NARAYANK991

Excel Ninja
Hi,

These links give a lot of background information on error handling:

http://www.cpearson.com/excel/errorhandling.htm

https://www.dropbox.com/s/98ipp2lcvr2u8vo/Error_Handling.pdf?dl=0

With specific reference to your code, the outermost loop, which starts here:
Code:
   For Each link In link_var
is unlikely to generate an error.

The next inner loop, which starts here:
Code:
       For Each post In html.getElementsByTagName("a")
can either be skipped altogether or finish without populating the refined_links string variable.

Either way, the refined_links variable will remain a null string on the first pass; on later passes it keeps the value left over from the previous link.

With a null string, the subsequent http request will generate an error.

To handle this, you can reset refined_links at the top of each pass and check whether it is equal to vbNullString before making the request.

Further down, the object email_list is set to the return value from the RegExp .Execute statement. If there are no matches, reading email_list(0) on the next line generates an error.

If you can write error handlers to take care of these two types of errors, you can dispense with the On Error statement.
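
A minimal sketch of the vbNullString check described above, not a definitive implementation. It assumes the same references as the original macro (Microsoft XML, v6.0 and Microsoft HTML Object Library). FindContactLink and CheckLinkBeforeRequest are hypothetical names introduced only for illustration, and network failures are not handled here.

Code:
' Returns the first "contact" link on the page, or vbNullString if none is found.
' FindContactLink is a hypothetical helper, not part of the original macro.
Function FindContactLink(ByVal siteUrl As String) As String
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, baseUrl As String, slashPos As Long

    http.Open "GET", siteUrl, False
    http.send
    html.body.innerHTML = http.responseText

    ' Site root, e.g. "http://advanta.in" from "http://advanta.in/".
    ' Starting at position 9 skips the "//" of both http:// and https://.
    slashPos = InStr(9, siteUrl, "/")
    If slashPos > 0 Then baseUrl = Left$(siteUrl, slashPos - 1) Else baseUrl = siteUrl

    ' Start from an empty result so nothing carries over from an earlier call.
    FindContactLink = vbNullString
    For Each post In html.getElementsByTagName("a")
        If InStr(1, post.innerText, "contact", vbTextCompare) > 0 Then
            FindContactLink = Replace(post.href, "about:", baseUrl & "/")
            Exit For
        End If
    Next post
End Function

' Usage: skip the second request when no contact link was found.
Sub CheckLinkBeforeRequest()
    Dim link As Variant, contactUrl As String
    For Each link In Array("http://advanta.in/", "http://www.greenplanet.in/")
        contactUrl = FindContactLink(CStr(link))
        If contactUrl = vbNullString Then
            Debug.Print link & ": no contact link found, skipping"
        Else
            Debug.Print link & " -> " & contactUrl
        End If
    Next link
End Sub

Because the function starts from vbNullString on every call, a site without a contact link can no longer reuse the link found for the previous site.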

Narayan
 

Chihiro

Excel Ninja
FYI - To avoid the error that arises when there are no matches, you can use RegExp.Test() to check for a match before calling RegExp.Execute().
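
A rough sketch of the Test-before-Execute idea, assuming a reference to Microsoft VBScript Regular Expressions 5.5 as in the original macro. FirstEmail and WriteFirstEmail are hypothetical names used only for illustration; the pattern is the one from the original post.

Code:
' Returns the first e-mail address found in pageText, or vbNullString if there is none.
' FirstEmail is a hypothetical helper, not part of the original macro.
Function FirstEmail(ByVal pageText As String) As String
    Dim rxp As New RegExp
    With rxp
        .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
        .Global = False                 ' only the first match is needed
        If .Test(pageText) Then
            ' At least one match exists, so item 0 is safe to read.
            FirstEmail = .Execute(pageText)(0).Value
        Else
            FirstEmail = vbNullString
        End If
    End With
End Function

' Usage: writes the address, or an empty string when nothing matched, to cell B2.
Sub WriteFirstEmail()
    Cells(2, 2) = FirstEmail("Write to info@example.com for details")
End Sub

Returning vbNullString instead of indexing an empty match collection means the caller can write the result straight to the sheet without On Error Resume Next.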
 
I tried shahin's macro, and sometimes, for a link where it does not find a contact email, it inserts the email from the previous link instead.
Is it possible to prevent this from happening, perhaps by first verifying that the URL exists?
One question: does a macro like this look at all the pages on the site? If so, would it also be possible to extract all the emails from a site?
Thank you.
 

Chihiro

Excel Ninja
As I've stated time and again, there really is no universal VBA code that will handle web scraping smoothly. You need to understand each site's structure and how its information is populated.

VBA probably isn't the right tool for you. Though it's a bit more accessible than other programming languages.

For web crawling/scraping, try Python. It's the most popular language for web scraping.
There are ready-made frameworks based on Python out there.
 

liamsoren

New Member
I've been using macros for some time already, and can affirm that they work very well if you know how to use them.
 

liamsoren

New Member
When I first wanted to scrape some emails, I was afraid that it was not legal, but after a little research I found out that scraping is lawful if the data being scraped is publicly available. So email scraping is legal as long as you collect contacts that are publicly available on the Internet. When scraping a website for contact information, make sure to read the website's Terms of Service. For this, I use an online email scraper, which has proved very efficient, easy to use, and a big time saver.
 