
Find duplicate values in 1.8M data

Damodar

New Member
Hi,

I am using Excel 2007. I need to check whether there are any duplicate values among 1.8 million data points (a single field: mobile numbers). Since it is 1.8M values, I cannot use 'Data - Remove Duplicates'. Please suggest any possible ways.
 
Do you just need them found/highlighted, or do you need them removed?
Is your data in 2 columns, more than that, or something else? Can we sort it? If you can sort it, you can use the built-in Data - Remove Duplicates, just doing it in two shots.
 
Hi!

Why?
Internal functions are quicker than any VBA procedure …

I was guessing the reason to be that 1.8 million data points exceeds the row limit in Excel, so you can't analyse all of them in one column. :(
 
Not sure that it's 1.8 million rows (the Excel 2007 limit is lower),
but since 'data points' can mean cells spread across several columns, I understand …
And without a crystal-clear explanation, I can't help …
 
As Luke M has pointed out, it cannot be 1.8 million rows of a single column (one mobile number per cell), as that exceeds the row limit. You could have multiple columns with the same data type, but this is not clear from the OP.
 
If you know that your 1.8 million rows will return fewer than 1,048,576 unique values, and that the data is spread over just a few columns (say columns A and B), then here's an easy way: use Remove Duplicates on column A, then use Remove Duplicates on column B, and then append the remaining records in B to the bottom of A and use Remove Duplicates on the combined column.
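
If it helps, here's a minimal VBA sketch of that two-pass idea (the sheet name, the A/B column layout and the no-header assumption are just for illustration, not from anyone's actual file):

Code:
Sub TwoPassDedupe()
    ' Assumes the numbers sit in Sheet1, columns A and B, with no header row.
    With Worksheets("Sheet1")
        ' Pass 1: dedupe each column on its own.
        .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
        .Columns("B").RemoveDuplicates Columns:=1, Header:=xlNo
        ' Pass 2: move the survivors of B underneath A, then dedupe the combined column.
        Dim lastA As Long, lastB As Long
        lastA = .Cells(.Rows.Count, "A").End(xlUp).Row
        lastB = .Cells(.Rows.Count, "B").End(xlUp).Row
        .Range(.Cells(1, "B"), .Cells(lastB, "B")).Cut .Cells(lastA + 1, "A")
        .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
    End With
End Sub

Note this only works when the combined survivors fit in a single column, which is exactly the "fewer than 1,048,576 unique values" condition above.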

Otherwise you can use a dictionary object. I've got some code that I'm in the process of amending to suit.
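
While that code is in progress, a bare-bones sketch of the Dictionary idea (this is not the code from the attachment; the sheet name and output location are assumptions for illustration):

Code:
Sub DictionaryDedupe()
    ' Late-bound Scripting.Dictionary, so no reference to Microsoft Scripting Runtime is needed.
    Dim dic As Object, v As Variant, r As Long, c As Long
    Dim out() As Variant, n As Long
    Set dic = CreateObject("Scripting.Dictionary")
    v = Worksheets("Sheet1").UsedRange.Value        ' read the whole block into an array in one hit
    ReDim out(1 To UBound(v, 1) * UBound(v, 2), 1 To 1)
    For c = 1 To UBound(v, 2)
        For r = 1 To UBound(v, 1)
            If Len(v(r, c)) > 0 Then
                If Not dic.Exists(v(r, c)) Then     ' first sighting of this value: keep it
                    dic.Add v(r, c), Empty
                    n = n + 1
                    out(n, 1) = v(r, c)
                End If
            End If
        Next r
    Next c
    ' Write the unique values to a new sheet (assumes they fit in one column).
    Worksheets.Add.Range("A1").Resize(n, 1).Value = out
End Sub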
 
Marc, your comment that internal functions are quicker than any VBA procedure is not always true, and it isn't in this case.

It depends very much on what you are trying to do, and on the nature of the data you are doing it to. Many of Excel's routines are optimized for what MS presumably thinks is the most common scenario users are likely to face. This means that a hand-rolled VBA routine, optimized for a particular data source or scenario, can be many times faster than an internal Excel routine.

To demonstrate, check out the attached file.

If you click "Populate/Change Sample Data", an input box comes up asking for a number. Put 100 in that box for starters. This populates the whole of column A with random numbers between 1 and whatever number you just specified.

There are three buttons that trigger different approaches to removing duplicates. Two use Excel's internal routines: the Remove Duplicates functionality and the Advanced Filter functionality. The third button uses a VBA routine I put together that leans heavily on the Dictionary object.

Check out what happens under a range of different scenarios by running the three macros, then re-clicking the "Populate/Change Sample Data" button and selecting increasingly larger ranges of numbers.

As per the results table in the attached file, the VBA routine is almost always faster.
 

Attachments

  • Removing Duplicates_20131009 v2.xlsm
    23.6 KB · Views: 16
Damodar - my VBA method that uses the Dictionary approach can be amended to handle non-contiguous ranges, and so can solve your problem. Can you post some sample data and give me an idea of how the data is laid out?

Also note that you can use SQL to solve this.
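
The SQL route boils down to running a SELECT DISTINCT against the sheet via ADO. A rough sketch of the shape of it (the ACE provider string, the [Mobile] header and the saved-workbook requirement are assumptions for illustration):

Code:
Sub SqlDedupe()
    ' Late-bound ADO; assumes the workbook is saved to disk and Sheet1 has a header row.
    Dim cn As Object, rs As Object
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
            "Data Source=" & ThisWorkbook.FullName & ";" & _
            "Extended Properties=""Excel 12.0 Macro;HDR=Yes"""
    Set rs = CreateObject("ADODB.Recordset")
    rs.Open "SELECT DISTINCT [Mobile] FROM [Sheet1$]", cn
    ' Dump the distinct values onto a new sheet in one shot.
    Worksheets.Add.Range("A1").CopyFromRecordset rs
    rs.Close: cn.Close
End Sub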
 
Jeffrey, you're right about the Dictionary object (which I regularly use),
because, like Excel's internal functions, it's an object inside a DLL with optimized, compiled properties …
Without a crystal-clear explanation, I don't want to start thinking about a solution!
 
Hi Marc. I don't think his info is poor. Right at the top he says: "Since it is 1.8M values, I cannot use 'Data - Remove Duplicates'."
 
Wow, you're so quick to respond, while I was still editing my post!
Maybe, but I need some light, a clear sky without cumulonimbus …
 
True enough. It would have been even clearer had he added "...meaning my data is in separate columns, which rules out simple use of the 'Data - Remove Duplicates' functionality" :)
 
I added an SQL routine to my file. SQL is faster than anything else when you have around 1,000,000 distinct records. Check it out... very interesting to see the four approaches side by side.

This was a great learning assignment...thanks Damodar.
 

Attachments

  • Removing Duplicates_20131009 v4.xlsm
    25.1 KB · Views: 43
Because of you, Jeff, I can't stop thinking about this issue!
So, just in case the Dictionary way is too heavy on RAM resources (though it seems OK with the 2007 version),
I have another way to do it, depending on the quality of the next explanation …
 
I'd love to see it. Any chance you could use my sample file, and add it to the different routines there? This is a great learning exercise. :)
 
I didn't see your previous post while I was writing mine, and often I can't upload during the working day …
But surely my way is already in your code!
Well done on the SQL way; I never think of it for simple forum questions …
 
Good evening Jeff

I have been looking at your file - nice. Is there any reason why the SQL run does not populate the figures as the other three runs do?

SirJB7

That would be the best route to go.
 
Hi bobhc. Are you referring to the fact that they are listed in numerical order? Or is it not working at all for you? Or something else?
 
All,
Sorry for the gory details.
Usually the code (a mobile-number extractor) adds values in column A, and once that column is full it carries on in column B, and so on. I was not aware of this at the time I started this thread.
Jeffreyweir, you are awesome, man. I saw your code. I just set the UsedRange as Table2 and the code is working like a GEM. Thanks a lot.
 
All: I wrote something similar regarding this at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude

I think I'll be writing a couple more articles there on this, because my testing has shown some interesting things.
For instance, put the numbers 1 to 1,048,576 in column A. Make a couple of those numbers duplicates. Run Excel's Remove Duplicates feature on it. Should take around 1 second.

Now sort it from Z to A. Run Excel's Remove Duplicates feature on it. Takes around 5.5 minutes!

Likewise, the Dictionary object is much faster when the data is sorted. For that same sample data, it takes 25 seconds to dedupe if sorted A to Z; Z to A takes 40 seconds.

Moral of the story...sort ascending before deduping.

Have only tested this on a very rudimentary data set.
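
If you want to try this on your own data, a minimal sketch that sorts ascending and then times the dedupe (the sheet and column names are assumptions for illustration):

Code:
Sub SortThenDedupe()
    Dim t As Double
    t = Timer
    With Worksheets("Sheet1")
        ' Sort ascending first: Remove Duplicates runs far faster on ordered data.
        .Columns("A").Sort Key1:=.Range("A1"), Order1:=xlAscending, Header:=xlNo
        .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
    End With
    Debug.Print "Deduped in " & Format(Timer - t, "0.00") & " seconds"
End Sub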
 