• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Checking for duplicate or similar rows in an array

JAMIE1BOY

New Member
Hi Guys,

Wonder if someone can help me with this one.


I have a large array of data, with each line representing a number of values which are very similar, but generally not identical.


I would like to highlight full rows in the array which are identical. Is this possible to compare rows, rather than cells? (there is normally abotu 20 populated cells in each row)


Secondly, I would like to identify is the there are any rows which are 'almost' identical to another row in the array....I dont know if this is possible with excel either. So, if all 20 cells are the same as another row which has only maybe 18 identical values, then I would like to highlight that.


I would love to hear if anyone knows how to do this!

Thanks
 
Does the trick described here work for you? It's technically to remove duplicate, but perhaps that is your end goal?

http://answers.oreilly.com/topic/1788-handling-duplicate-rows-in-excel-2010/
 
Luke M. Thanks for your post. I checked that out and I don't think that will work for this issue. I'm not actually looking to delete any duplicates, but only to identify them.


But I guess the key difference is that I do not want to compare 'individual cells', but actual full rows against another full row.


So if all the values in the row are the same in another duplicate row, then I would look to highlight it. The example below, which i have used ** to indicate the duplicate repeating rows.


12 32 41 21 32 33 21 51

62 33 12 13 12 33 12 14 **

56 24 22 31 41 17 17 13

62 33 12 13 12 33 12 14 **

56 22 52 01 21 22 41 32


I hope I have explained this well enough ?

Thanks
 
Longer setup, but relatively simply idea is to use a helper column with formula like:

=A2&"_"&B2&"_"&C2&"_"&...

Then, just check the helper column for duplicates using conditional formatting. Only would work for eact matches, not 18/20 matches, but hopefully that gets close?
 
Thanks Luke.


Would that work essentially the same as CONCATENATING all of the cells and then checking for duplicates in the concatenated column?


Im a novice, so im just trying to undertsand what the formula would do.


Thanks!
 
Exactly. I just personally avoid the CONCATENATE function since it's a long word to write, and it's limited to only 30 words. =/
 
OK, I will give that a whirl.....Thanks


If you happen to figure out how I can compare partial matches then I would love to know. Im guessing that its not possible!


Thanks again

Jamie
 
Hi Jamie,

Interesting problem. Here are a couple of suggestions. The sample data I used is in the range A1:H32, and was named "data" for ease of reference.

[pre]
Code:
A       B       C        D       E       F       G       H        Check1  Check2
12	32	41	21	32	33	21	51	FALSE	FALSE
62	33	12	13	12	1	12	14	FALSE	TRUE
56	24	22	31	41	17	17	13	FALSE	FALSE
2	4	6	8	10	12	14	16	TRUE	TRUE
2	4	6	8	10	12	14	16	TRUE	TRUE
62	33	12	13	1	12	14	12	FALSE	TRUE
56	22	52	1	21	22	41	32	FALSE	FALSE
[/pre]

If the position of the numbers is significant when comparing two rows, then the following formula (for cell I1) should identify duplicates:

=SUM(N(MMULT(N(A1:H1=data),TRANSPOSE(COLUMN(A1:H1)^0))=COLUMNS(A1:H1)))>=2


entered with Ctrl + Shift + Enter

In the sample data, the results are shown in "Check1". Since only rows 4 and 5 are identical, with every value in the same location, the Check1 column shows TRUE for both.


If the position of the numbers is not significant when comparing two rows, then the formula gets a little longer, to accommodate sorting of the list. The formula below is for cell J1:

=SUM(N(MMULT(N(ROUND(MOD(SMALL(ROW(data) + data %, (COLUMN(data))*(ROW(data)^0) + ((ROW($A$1:INDEX($A:$A, ROWS(data)))-1) * COLUMNS(A1:H1))), 1)*100, 0) = SMALL(A1:H1,COLUMN(A1:H1)-COLUMN(A1)+1)), TRANSPOSE(COLUMN(A1:H1)^0))=COLUMNS(A1:H1)))>=2


entered with Ctrl + Shift +_Enter

In the sample data, the results are shown in column "Check2".

In addition to rows 4 and 5, rows 2 and 6 are also identified as duplicates. This is because in rows 2 and 6, the same values are present, but not in the same positions.


Both formulas include a check "=COLUMNS(A1:H1)" to find exact matches. If you need partial matches, replace it with something like ">=COLUMNS(A1:H1)-2" to return rows where up to 2 mismatches are allowed.


In my sample data, as in your original post, the values were less than 100. However, if your data needs to support larger values, adjust the second formula to say "data %%" and replace "100" with "10000". This will then support values up to 9999. To support larger values, adjust the formula to higher decimal places.


Cheers,

Sajan.
 
Back
Top