1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Can't find any way to kick out duplicate leads on the fly

Discussion in 'VBA Macros' started by shahin, Mar 15, 2018.

  1. shahin

    shahin Active Member

    Messages:
    857
    I've written a script to automate some information from a webpage. My script is doing fine if I don't bother about duplicate leads. The script is able to go down a certain level of that page managing lazy-load and parse the information (company name and industry) from there. The only issue I would like to deal with is kick out duplicate leads. Is there any way I can shake off duplicate entries on the fly? Thanks in advance.

    Here is my attempt so far:
    Code (vb):

    Sub Handle_SlowLoad()
        URL$ = "https://www.inc.com/profile/sumup-payments-limited"
        Dim post As Object, container As Object, scroll As Long

        With CreateObject("InternetExplorer.Application")
            .Visible = True
            .navigate URL
            While .Busy = True Or .readyState < 4: DoEvents: Wend
         
            For scroll = 1 To 10
                Set container = .document.getElementsByClassName("profile")
                .document.parentWindow.scrollBy 0, 99999
                Application.Wait Now + TimeValue("00:00:003")
            Next scroll
         
            For Each post In container
                With post.getElementsByTagName("h1")
                    If .Length Then r = r + 1: Cells(r, 1) = .Item(0).innerText
                End With
                With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                    If .Length Then Cells(r, 2) = .Item(0).innerText
                End With
            Next post
            .Quit
        End With
    End Sub
     
    This is the partial output I'm having at this moment (the page does not contain any duplicate leads ,though):

    Code (vb):

    Sumup Payments Limited    IT Services
    Sumup Payments Limited    IT Services
    Sumup Payments Limited    IT Services
    Restel Fast Food Oy    Travel & Hospitality
    Restel Fast Food Oy    Travel & Hospitality
     
  2. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,556
    So you are putting value into cell right as you scrape.

    You already have data in cell(s) above, if it's duplicate. Just use Application.Match, Evaluate(), or use collection/dictionary (in these cases I'd store all data and then put into cell at end) to check for duplicate.
    shahin likes this.
  3. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,865
    You can record macro in Excel and clean it up which will tell you to add one more line after
    Code (vb):
    End With
    which will be
    Code (vb):
    ActiveSheet.UsedRange.RemoveDuplicates Columns:=Array(1, 2), Header:=xlNo
    If you want to do this at run-time then follow Chihiro's advice.
    shahin likes this.
  4. shahin

    shahin Active Member

    Messages:
    857
    Happy to get you in touch, sir. The thing is the methods you suggested me to apply are totally foreign to me so, I'm not sure I will ever be able to accomplish it myself.
  5. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,556
    Then go with method @shrivallabha suggested.

    In this case it makes no difference in terms of performance really. Since you need to scrape the info to check if its duplicate anyways. Dictionary or other collection method will come in handy if you are collecting list of urls for further scraping down the line.
    shahin likes this.
  6. shahin

    shahin Active Member

    Messages:
    857
    Thanks a lot, shrivallabha. Your suggestion worked as it always does.
  7. shahin

    shahin Active Member

    Messages:
    857
    @ sir chihiro, I'm very curious to know how dictionary or other collection method could have been handy. Any link or any hint will be a great help to go further. Thanks sir.
  8. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,556
    shahin likes this.
  9. shahin

    shahin Active Member

    Messages:
    857
    @sir chihiro, Yes you did and that was the first time I ever worked with dictionary. That is why I'm not so confident I'm capable of handling this. However, I'll give it a go and try to paste here the refined one for your consideration.
  10. Marc L

    Marc L Excel Ninja

    Messages:
    3,793

    Hi !

    See also the dictionary samples within threads of this forum …​
  11. shahin

    shahin Active Member

    Messages:
    857
    It seems I've managed to do it slightly disorganized way. The code doesn't look that good and the output It produces are being written in a single line (which I cant manage).
    Here is what I've tried:

    Code (vb):

    Sub Handle_SlowLoad()
        URL$ = "https://www.inc.com/profile/sumup-payments-limited"
        Dim post As Object, container As Object, scroll As Long
        Dim ldic As Object, key

        Set idic = CreateObject("Scripting.Dictionary")

        With CreateObject("InternetExplorer.Application")
            .Visible = True
            .navigate URL
            While .Busy = True Or .readyState < 4: DoEvents: Wend

            For scroll = 1 To 5
                Set container = .document.getElementsByClassName("profile")
                .document.parentWindow.scrollBy 0, 99999
                Application.Wait Now + TimeValue("00:00:003")
            Next scroll

            For Each post In container
                With post.getElementsByTagName("h1")
                    If .Length Then
                        If Not idic.exists(.Item(0).innerText) Then
                            idic.Add .Item(0).innerText, 1
                        End If
                    End If
                End With
                With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                    If .Length Then
                        If Not idic.exists(.Item(0).innerText) Then
                            idic.Add .Item(0).innerText, 1
                        End If
                    End If
                End With
            Next post
            For Each key In idic.Keys
                r = r + 1: Cells(r, 1) = key
            Next key
            .Quit
        End With
    End Sub
     
    It produces (partial output):
    Code (vb):

    Sumup Payments Limited
    IT Services
    Restel Fast Food Oy
    Travel & Hospitality
    New Teck Well Service Limited Liability Company
    Energy
     
    whereas I'm expecting to have like:

    Code (vb):

    Sumup Payments Limited                             IT Services
    Restel Fast Food Oy                                Travel & Hospitality
    New Teck Well Service Limited Liability Company    Energy
     
    And last of all: what If there are 4/5 fields to parse? Because, the dictionary is supposed to handle two items (key,value) if I've not seriously misunderstood.
  12. Marc L

    Marc L Excel Ninja

    Messages:
    3,793
    Read first dictionary VBA inner help or see documentation
    on MSDN or some tutorial elsewhere on Web
    without forgetting the samples in threads of this forum !

    An EZ dictionary object handle a key and an item;
    as the item can handle several data like for example
    within a string with a delimiter or an array variable …

    Sky is the limit ! (your imagination …)
  13. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,556
    Basically use...
    Code (vb):
    post.getElementsByTagName("h1")
    As Key.

    And
    Code (vb):
    post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
    As Item.

    Or just concatenate/join the two using delimiter and use that as key.

    Also, in this instance, you'd want implicitly add items to dictionary instead of explicitly checking for .Exists and then adding. Since, you are not interested in counting duplicates or doing some other operation to pre-existing dictionary key/item pair.

    Ex:Implicit add
    Code (vb):
    idic(.Item(0).innerText) = 1
    FYI - Your declared variable and actual is different.
    Code (vb):
        Dim ldic As Object, key

        Set idic = CreateObject("Scripting.Dictionary")
    shahin likes this.
  14. shahin

    shahin Active Member

    Messages:
    857
    That was a typo, sir. On one hand, you are trying to make me understand and on the other hand I'm trying my best to decipher the message but unable to.
  15. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,556
    It's not hard really.
    Code (vb):
            For Each post In container
                With post.getElementsByTagName("h1")
                    If .Length Then
                        a = .Item(0).innerText
                    End If
                End With
                With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                    If .Length Then
                        b = .Item(0).innerText
                    End If
                End With
                idic(a) = b
            Next post
    shahin likes this.
  16. shahin

    shahin Active Member

    Messages:
    857
    @ sir chihiro, you remind me of that quotation "one small step for a man ----". It's unbelievable what you have shown. I would never be able to think that way, not a chance. However, I expect that you will take a look at the below expression as well because things are going beyond my head.

    What should be the ending expression if I start here?

    Code (vb):

           ForEach key In idic.Keys
                r = r + 1: Cells(r, 1) = key
           Next key
     

    Thanks a zillion once again.
  17. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,556
    Code (vb):
          ForEach key In idic.Keys
                r = r + 1: Cells(r, 1) = key: Cells(r, 2) = idic(key)
          Next key
    shahin likes this.
  18. shahin

    shahin Active Member

    Messages:
    857
    Exactly this is what I was expecting to have, sir. I've already rectified the script according to how you suggested. However, I executed the script and found it working great. Here it is for future reference:
    Code (vb):

    Sub Handle_SlowLoad()
        URL$ = "https://www.inc.com/profile/sumup-payments-limited"
        Dim post As Object, container As Object, scroll As Long
        Dim idic As Object, key
     
        Set idic = CreateObject("Scripting.Dictionary")
     
        With CreateObject("InternetExplorer.Application")
            .Visible = True
            .navigate URL
            While .Busy = True Or .readyState < 4: DoEvents: Wend
         
            For scroll = 1 To 5
                Set container = .document.getElementsByClassName("profile")
                .document.parentWindow.scrollBy 0, 99999
                Application.Wait Now + TimeValue("00:00:003")
            Next scroll
         
            For Each post In container
                With post.getElementsByTagName("h1")
                    If .Length Then
                        a = .Item(0).innerText
                    End If
                End With
                With post.getElementsByClassName("ifi_industry")(0).getElementsByTagName("dd")
                    If .Length Then
                        b = .Item(0).innerText
                    End If
                End With
                idic(a) = b
            Next post
         
            For Each key In idic.Keys
                r = r + 1: Cells(r, 1) = key
                Cells(r, 2) = idic(key)
            Next key
            .Quit
        End With
    End Sub
     
  19. shahin

    shahin Active Member

    Messages:
    857
    One more thing to ask, sir. I've recently installed "Office 2013" in my window 7, 32 bit operating system. Everything is working fine under it. However, the only issue I notice is that when I copy any particular item from the editor and paste it somewhere else then I can see some illegible greek thing along with it. As example:

    The original link:
    Code (vb):

    https://www.inc.com/profile/sumup-payments-limited
     
    But, I find the same link like below when directly copy from vbe console and paste here
    Code (vb):

    https://www.inc.com/profile/sumup-payments-limited
    )Ì    Ð)Ì    *Ì    z Gèe¸Žö §çe,§çe¬ö §çe,§çe§çe,§çe§çe,§çe§çe,§çe§çe,§çe§çe,§çe§çe,§çe    §çe,§çe
    §çe,§çe
    §çe,§çe §çe,§çe
    §çe,§çe
    §çe,§çe§çe,§çe§çe,§çe §çe,§çe!§çe,§çe"§çe,§çeL‹ö #§çe,§çe$§çe,§çe%§çe,§çe&§çe,§çe'§çe,§çe(§çe,§çe)§çe,§çe¸Žö L‹ö GèeL‹ö
    Is there any change I should make in the setting. It doesn't hamper anything (when i do any stuff) ,though.
  20. Marc L

    Marc L Excel Ninja

    Messages:
    3,793
    No need a loop to load data in cells as you can see in this thread
  21. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,556
    Yes, I know. ;)

    Just showing shahin how item is accessible via key
  22. Marc L

    Marc L Excel Ninja

    Messages:
    3,793


    OKey ! :cool:

Share This Page