• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Can't find any way to kick out duplicate leads on the fly

shahin

Active Member
I've written a script to automate some information from a webpage. My script is doing fine if I don't bother about duplicate leads. The script is able to go down a certain level of that page managing lazy-load and parse the information (company name and industry) from there. The only issue I would like to deal with is kick out duplicate leads. Is there any way I can shake off duplicate entries on the fly? Thanks in advance.

Here is my attempt so far:
Code:
Sub Handle_SlowLoad()
    URL$ = "https://www.inc.com/profile/sumup-payments-limited"
    Dim post As Object, container As Object, scroll As Long

    With CreateObject("InternetExplorer.Application")
        .Visible = True
        .navigate URL
        While .Busy = True Or .readyState < 4: DoEvents: Wend
      
        For scroll = 1 To 10
            Set container = .document.getElementsByClassName("profile")
            .document.parentWindow.scrollBy 0, 99999
            Application.Wait Now + TimeValue("00:00:003")
        Next scroll
      
        For Each post In container
            With post.getElementsByTagName("h1")
                If .Length Then r = r + 1: Cells(r, 1) = .Item(0).innerText
            End With
            With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                If .Length Then Cells(r, 2) = .Item(0).innerText
            End With
        Next post
        .Quit
    End With
End Sub

This is the partial output I'm having at this moment (the page does not contain any duplicate leads ,though):

Code:
Sumup Payments Limited    IT Services
Sumup Payments Limited    IT Services
Sumup Payments Limited    IT Services
Restel Fast Food Oy    Travel & Hospitality
Restel Fast Food Oy    Travel & Hospitality
 
So you are putting value into cell right as you scrape.

You already have data in cell(s) above, if it's duplicate. Just use Application.Match, Evaluate(), or use collection/dictionary (in these cases I'd store all data and then put into cell at end) to check for duplicate.
 
You can record macro in Excel and clean it up which will tell you to add one more line after
Code:
End With
which will be
Code:
ActiveSheet.UsedRange.RemoveDuplicates Columns:=Array(1, 2), Header:=xlNo

If you want to do this at run-time then follow Chihiro's advice.
 
Happy to get you in touch, sir. The thing is the methods you suggested me to apply are totally foreign to me so, I'm not sure I will ever be able to accomplish it myself.
 
Then go with method @shrivallabha suggested.

In this case it makes no difference in terms of performance really. Since you need to scrape the info to check if its duplicate anyways. Dictionary or other collection method will come in handy if you are collecting list of urls for further scraping down the line.
 
@ sir chihiro, I'm very curious to know how dictionary or other collection method could have been handy. Any link or any hint will be a great help to go further. Thanks sir.
 
@sir chihiro, Yes you did and that was the first time I ever worked with dictionary. That is why I'm not so confident I'm capable of handling this. However, I'll give it a go and try to paste here the refined one for your consideration.
 
It seems I've managed to do it slightly disorganized way. The code doesn't look that good and the output It produces are being written in a single line (which I cant manage).
Here is what I've tried:

Code:
Sub Handle_SlowLoad()
    URL$ = "https://www.inc.com/profile/sumup-payments-limited"
    Dim post As Object, container As Object, scroll As Long
    Dim ldic As Object, key

    Set idic = CreateObject("Scripting.Dictionary")

    With CreateObject("InternetExplorer.Application")
        .Visible = True
        .navigate URL
        While .Busy = True Or .readyState < 4: DoEvents: Wend

        For scroll = 1 To 5
            Set container = .document.getElementsByClassName("profile")
            .document.parentWindow.scrollBy 0, 99999
            Application.Wait Now + TimeValue("00:00:003")
        Next scroll

        For Each post In container
            With post.getElementsByTagName("h1")
                If .Length Then
                    If Not idic.exists(.Item(0).innerText) Then
                        idic.Add .Item(0).innerText, 1
                    End If
                End If
            End With
            With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                If .Length Then
                    If Not idic.exists(.Item(0).innerText) Then
                        idic.Add .Item(0).innerText, 1
                    End If
                End If
            End With
        Next post
        For Each key In idic.Keys
            r = r + 1: Cells(r, 1) = key
        Next key
        .Quit
    End With
End Sub

It produces (partial output):
Code:
Sumup Payments Limited
IT Services
Restel Fast Food Oy
Travel & Hospitality
New Teck Well Service Limited Liability Company
Energy

whereas I'm expecting to have like:

Code:
Sumup Payments Limited                             IT Services
Restel Fast Food Oy                                Travel & Hospitality
New Teck Well Service Limited Liability Company    Energy

And last of all: what If there are 4/5 fields to parse? Because, the dictionary is supposed to handle two items (key,value) if I've not seriously misunderstood.
 
Read first dictionary VBA inner help or see documentation
on MSDN or some tutorial elsewhere on Web
without forgetting the samples in threads of this forum !

An EZ dictionary object handle a key and an item;
as the item can handle several data like for example
within a string with a delimiter or an array variable …

Sky is the limit ! (your imagination …)
 
Basically use...
Code:
post.getElementsByTagName("h1")
As Key.

And
Code:
post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
As Item.

Or just concatenate/join the two using delimiter and use that as key.

Also, in this instance, you'd want implicitly add items to dictionary instead of explicitly checking for .Exists and then adding. Since, you are not interested in counting duplicates or doing some other operation to pre-existing dictionary key/item pair.

Ex:Implicit add
Code:
idic(.Item(0).innerText) = 1

FYI - Your declared variable and actual is different.
Code:
    Dim ldic As Object, key

    Set idic = CreateObject("Scripting.Dictionary")
 
That was a typo, sir. On one hand, you are trying to make me understand and on the other hand I'm trying my best to decipher the message but unable to.
 
It's not hard really.
Code:
        For Each post In container
            With post.getElementsByTagName("h1")
                If .Length Then
                    a = .Item(0).innerText
                End If
            End With
            With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                If .Length Then
                    b = .Item(0).innerText
                End If
            End With
            idic(a) = b
        Next post
 
@ sir chihiro, you remind me of that quotation "one small step for a man ----". It's unbelievable what you have shown. I would never be able to think that way, not a chance. However, I expect that you will take a look at the below expression as well because things are going beyond my head.

What should be the ending expression if I start here?

Code:
       ForEach key In idic.Keys
            r = r + 1: Cells(r, 1) = key
       Next key


Thanks a zillion once again.
 
Exactly this is what I was expecting to have, sir. I've already rectified the script according to how you suggested. However, I executed the script and found it working great. Here it is for future reference:
Code:
Sub Handle_SlowLoad()
    URL$ = "https://www.inc.com/profile/sumup-payments-limited"
    Dim post As Object, container As Object, scroll As Long
    Dim idic As Object, key
  
    Set idic = CreateObject("Scripting.Dictionary")
  
    With CreateObject("InternetExplorer.Application")
        .Visible = True
        .navigate URL
        While .Busy = True Or .readyState < 4: DoEvents: Wend
      
        For scroll = 1 To 5
            Set container = .document.getElementsByClassName("profile")
            .document.parentWindow.scrollBy 0, 99999
            Application.Wait Now + TimeValue("00:00:003")
        Next scroll
      
        For Each post In container
            With post.getElementsByTagName("h1")
                If .Length Then
                    a = .Item(0).innerText
                End If
            End With
            With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                If .Length Then
                    b = .Item(0).innerText
                End If
            End With
            idic(a) = b
        Next post
      
        For Each key In idic.Keys
            r = r + 1: Cells(r, 1) = key
            Cells(r, 2) = idic(key)
        Next key
        .Quit
    End With
End Sub
 
One more thing to ask, sir. I've recently installed "Office 2013" in my window 7, 32 bit operating system. Everything is working fine under it. However, the only issue I notice is that when I copy any particular item from the editor and paste it somewhere else then I can see some illegible greek thing along with it. As example:

The original link:
Code:
https://www.inc.com/profile/sumup-payments-limited

But, I find the same link like below when directly copy from vbe console and paste here
Code:
https://www.inc.com/profile/sumup-payments-limited
)Ì    Ð)Ì    *Ì    z Gèe¸Žö§çe,§çe¬ö§çe,§çe§çe,§çe§çe,§çe§çe,§çe§çe,§çe§çe,§çe§çe,§çe    §çe,§çe 
§çe,§çe
§çe,§çe§çe,§çe
§çe,§çe
§çe,§çe§çe,§çe§çe,§çe §çe,§çe!§çe,§çe"§çe,§çeL‹ö#§çe,§çe$§çe,§çe%§çe,§çe&§çe,§çe'§çe,§çe(§çe,§çe)§çe,§çe¸ŽöL‹ö GèeL‹ö

Is there any change I should make in the setting. It doesn't hamper anything (when i do any stuff) ,though.
 
Back
Top