
Find duplicate values in 1.8M data

Damodar

New Member
Hi,

I am using Excel 2007. I need to check whether there are any duplicate values among 1.8 million data points (a single field: mobile numbers). Since it is 1.8M values, I cannot use 'Data - Remove Duplicates'. Please suggest any possible ways.
 
Do you just need them found/highlighted, or do you need them removed?
Is your data in 2 columns, more than that, or something else? Can we sort it? If you can sort it, you can use the built-in Data - Remove Duplicates, just doing it in two shots.
 
Hi!

Why?
Internal functions are quicker than any VBA procedure …

I was guessing the reason to be that 1.8 million data points exceeds the row limit in Excel, so you can't analyse all of them in one column. :(
 
Not sure that it's 1.8 million rows (the Excel 2007 limit is lower),
but since 'data points' can mean cells spread across several columns, I understand …
And without a crystal-clear explanation, I can't help …
 
As Luke M has pointed out, it cannot be 1.8 million rows of a single column (one mobile number per cell), as that exceeds the row limit. You could have multiple columns with the same data type, but this is not clear from the OP.
 
If you know that your 1.8 million rows will return fewer than 1,048,576 unique values, and that the data is spread over just a few columns (say columns A and B), then here's an easy way: use Remove Duplicates on column A, then use Remove Duplicates on column B, and then append the remaining records in B to the bottom of A and use Remove Duplicates on the combined column.
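
If it helps, here's a minimal VBA sketch of that two-pass idea (the sheet name, the A/B column layout and the no-header assumption are just for illustration, not from anyone's actual file):

Code:
Sub TwoPassDedupe()
    ' Assumes the numbers sit in Sheet1, columns A and B, with no header row.
    With Worksheets("Sheet1")
        ' Pass 1: dedupe each column on its own.
        .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
        .Columns("B").RemoveDuplicates Columns:=1, Header:=xlNo
        ' Pass 2: move the survivors of B underneath A, then dedupe the combined column.
        Dim lastA As Long, lastB As Long
        lastA = .Cells(.Rows.Count, "A").End(xlUp).Row
        lastB = .Cells(.Rows.Count, "B").End(xlUp).Row
        .Range(.Cells(1, "B"), .Cells(lastB, "B")).Cut .Cells(lastA + 1, "A")
        .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
    End With
End Sub

Note this only works when the combined survivors fit in a single column, which is exactly the "fewer than 1,048,576 unique values" condition above.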

Otherwise you can use a dictionary object. I've got some code that I'm in the process of amending to suit.
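
While that code is in progress, a bare-bones sketch of the Dictionary idea (this is not the code from the attachment; the sheet name and output location are assumptions for illustration):

Code:
Sub DictionaryDedupe()
    ' Late-bound Scripting.Dictionary, so no reference to Microsoft Scripting Runtime is needed.
    Dim dic As Object, v As Variant, r As Long, c As Long
    Dim out() As Variant, n As Long
    Set dic = CreateObject("Scripting.Dictionary")
    v = Worksheets("Sheet1").UsedRange.Value        ' read the whole block into an array in one hit
    ReDim out(1 To UBound(v, 1) * UBound(v, 2), 1 To 1)
    For c = 1 To UBound(v, 2)
        For r = 1 To UBound(v, 1)
            If Len(v(r, c)) > 0 Then
                If Not dic.Exists(v(r, c)) Then     ' first sighting of this value: keep it
                    dic.Add v(r, c), Empty
                    n = n + 1
                    out(n, 1) = v(r, c)
                End If
            End If
        Next r
    Next c
    ' Write the unique values to a new sheet (assumes they fit in one column).
    Worksheets.Add.Range("A1").Resize(n, 1).Value = out
End Sub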
 
Marc, your comment that internal functions are quicker than any VBA procedure is not always true, and it isn't in this case.

It depends very much on what you are trying to do, and on the nature of the data you are doing it to. Many of Excel's routines are optimized for what MS presumably thinks is the most common scenario users are likely to face. This means that a hand-rolled VBA routine, optimized for a particular data source or scenario, can be many times faster than an internal Excel routine.

To demonstrate, check out the attached file.

If you click "Populate/Change Sample Data", an input box comes up asking for a number. Put 100 in that box for starters. This populates the whole of column A with random numbers between 1 and whatever number you just specified.

There are three buttons that trigger different approaches to removing duplicates. Two use Excel's internal routines: the Remove Duplicates functionality and the Advanced Filter functionality. The third button uses a VBA routine I put together that leans heavily on the Dictionary object.

Check out what happens under a range of different scenarios by running the three macros, then re-clicking the "Populate/Change Sample Data" button and selecting increasingly larger ranges of numbers.

As per the results table in the attached file, the VBA routine is almost always faster.
 

Attachments

  • Removing Duplicates_20131009 v2.xlsm
    23.6 KB · Views: 16
Damodar - my VBA method that uses the Dictionary approach can be amended to handle non-contiguous ranges, and so can solve your problem. Can you post some sample data and give me an idea of how the data is laid out?

Also note that you can use SQL to solve this.
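
The SQL route boils down to running a SELECT DISTINCT against the sheet via ADO. A rough sketch of the shape of it (the ACE provider string, the [Mobile] header and the saved-workbook requirement are assumptions for illustration):

Code:
Sub SqlDedupe()
    ' Late-bound ADO; assumes the workbook is saved to disk and Sheet1 has a header row.
    Dim cn As Object, rs As Object
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
            "Data Source=" & ThisWorkbook.FullName & ";" & _
            "Extended Properties=""Excel 12.0 Macro;HDR=Yes"""
    Set rs = CreateObject("ADODB.Recordset")
    rs.Open "SELECT DISTINCT [Mobile] FROM [Sheet1$]", cn
    ' Dump the distinct values onto a new sheet in one shot.
    Worksheets.Add.Range("A1").CopyFromRecordset rs
    rs.Close: cn.Close
End Sub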
 
Jeffrey, you're right about the Dictionary object (which I regularly use),
because, like Excel's internal functions, it's an object inside a DLL with optimized, compiled properties …
Without a crystal-clear explanation, I don't want to start thinking about a solution!
 
Hi Marc. I don't think his info is poor. Right at the top he says: "Since it is 1.8M values, I cannot use 'Data - Remove Duplicates'."
 
Wow, you're so quick to respond, while I was still editing my post!
Maybe, but I need some light, a clear sky without cumulonimbus …
 
True enough. It would have been even clearer had he added "...meaning my data is in separate columns, which rules out simple use of the 'Data - Remove Duplicates' functionality" :)
 
I added an SQL routine to my file. SQL is faster than anything else when you have around 1,000,000 distinct records. Check it out... very interesting to see the four approaches side by side.

This was a great learning assignment...thanks Damodar.
 

Attachments

  • Removing Duplicates_20131009 v4.xlsm
    25.1 KB · Views: 43
Because of you, Jeff, I can't stop thinking about this issue!
So, just in case the Dictionary way is too heavy on RAM resources (though it seems OK with the 2007 version),
I have another way to do it, depending on the quality of the next explanation …
 
I'd love to see it. Any chance you could use my sample file, and add it to the different routines there? This is a great learning exercise. :)
 
I didn't see your previous post while I was writing mine, and often I can't upload during the working day …
But surely my way is already in your code!
Well done on the SQL way; I never think of it for simple forum questions …
 
Good evening Jeff

I have been looking at your file - nice. Is there any reason why the SQL run does not populate the figures as the other three runs do?

SirJB7

That would be the best route to go.
 
Hi bobhc. Are you referring to the fact that they are listed in numerical order? Or is it not working at all for you? Or something else?
 
All,
Sorry for the gory details.
Usually the code (a mobile-number extractor) adds values in column A, and once that column is full it carries on in column B, and so on. I was not aware of this at the time I started this thread.
Jeffreyweir, you are awesome, man. I saw your code. I just set the UsedRange as Table2 and the code is working like a GEM. Thanks a lot.
 
All: I wrote something similar regarding this at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude

I think I'll be writing a couple more articles there on this, because my testing has shown some interesting things.
For instance, put the numbers 1 to 1,048,576 in column A. Make a couple of those numbers duplicates. Run Excel's Remove Duplicates feature on it. Should take around 1 second.

Now sort it from Z to A. Run Excel's Remove Duplicates feature on it. Takes around 5.5 minutes!

Likewise, the Dictionary object is much faster when the data is sorted. For that same sample data, it takes 25 seconds to dedupe if sorted A to Z; Z to A takes 40 seconds.

Moral of the story...sort ascending before deduping.

Have only tested this on a very rudimentary data set.
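
If you want to try this on your own data, a minimal sketch that sorts ascending and then times the dedupe (the sheet and column names are assumptions for illustration):

Code:
Sub SortThenDedupe()
    Dim t As Double
    t = Timer
    With Worksheets("Sheet1")
        ' Sort ascending first: Remove Duplicates runs far faster on ordered data.
        .Columns("A").Sort Key1:=.Range("A1"), Order1:=xlAscending, Header:=xlNo
        .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
    End With
    Debug.Print "Deduped in " & Format(Timer - t, "0.00") & " seconds"
End Sub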
 