• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Finding Duplicates - multiple criteria - text word order jumbled

ShyamExcel

New Member
Here's an interesting question for you rockstars. its in 2 parts


1. I have a excel sheet with 6 columns. A, B, C, D, E and F.


All columns are text except F.


Column E contains the core phrase that I'm interesting in finding duplicates of.


A, B, C and D can be looked at as phrases that narrow down the location of the phrase E.


As as example, you can think of A as a country name, B as state name, C as district name and D as city name, E is the person's name. F is his age (the only number column as I mentioned earlier)


So A could be India, B Karnataka, C Shimgoa D Sagar, E John Peter smith. F 25


So the entries A to D could be in duplicates and that's fine. We want to mark rows only when the names are same AND from the same city. I used to do this easily by concatenating the various columns and using countif and so any non zero entry would indicate a duplicate.


Here comes the challenge.


The name column in E can be in any order of the same words in the names. so I would like to consider as a duplicate even if the words (of varying number of words in each name set) are in different order, such as 'Peter John smith' or 'smith john peter' and having the same A, B, C and D.


To clarify, I would like to highlight all people from a particular city (so obviously same district, state and country names) whose name may have been written in any word order.


How would you approach this? I already searched in the forum before posting.


2. Another variation of this question would be to highlight the john peter smith who is the oldest, if there are many john peter smith entries from the same city, with names jumbled up. So in this part, if there are multiple entries for john peter smith from the same city, I would like to highlight the eldest or all but the eldest (which I want to delete)


Thanks!

Shyam
 
Hi


Please see this file:


http://dl.dropbox.com/u/60644346/Name%20Search.xlsx


I have few question like:


1. How will you search for the name? You will list name in some cell and then search or something else?


2. Will you name will always be consisting of 3 parts?


Faseeh
 
Hi Shyam ,


An alternative :


1. Create a range of names you wish to look for , and name this range Look_For


2. Name the range where you want to look for the names , Look_In


3. Select a range of cells in any unused column , which has as many cells as your range Look_In , and enter the following formula , as an array formula , using CTRL SHIFT ENTER :


=MMULT(--ISNUMBER((TRANSPOSE(SEARCH(Look_For,TRANSPOSE(Look_In))))),--ISTEXT(Look_For))=ROWS(Look_For)


This will have TRUE where ever all the names match , and FALSE otherwise.


Narayan
 
Hi Shyam,

If you do not know the names ahead of time, one approach would be to convert the names into a numeric value so that you can then compare two numbers to determine if they are from the same name.


For example, the following formula turns a text into a number. Assuming that the text is in F2:

=SUMPRODUCT(CODE(MID(F2,ROW($A$1:INDEX($A:$A,LEN(F2))),1)))

The formula would return the same value for "Peter John Smith", "Smith John Peter", and "John Peter Smith".


Once you have a way to compare two names, you could use an MMULT formula as Narayan suggests.


For the following sample data, column H shows whether it is a duplicate row:

[pre]
Code:
Row	A	B	C	D	E	F	                G   H
1	Country	State	County	City	NameVal	Name	                Age Duplicate?
2	US	FL	Miami	Miami	1492	Peter John Smith	25  TRUE
3	US	FL	Miami	Miami	1492	Smith Peter John	37  TRUE
4	US	FL	Tampa	Marina	1492	John Peter Smith	23  FALSE
5	US	FL	Orlando	St Luke	1492	Smith John Peter	45  FALSE
[/pre]
Formula for column H is:

=SUMPRODUCT(--(MMULT(--(A2:E2=$A$2:$E$5),{1;1;1;1;1})=5))>1


Please note that I included the "NameVal" column before the Name column to make the formula simpler.


Cheers,

Sajan.
 
@Faseeh Thanks. Yes, the names will be different each time. I just made up the example to avoid sharing sensitive data. All I want to actually do is remove duplicates, retaining just one unique entry satisfying the A to D criteria and treating the name in E as duplicate irrespective of the word order.


@Narayan I like your MMult logic. I can now add that to my kitty. Thanks!


@Sajan I'm going to try out that formula to make the text into number and the MMult formula. It should work I guess. Thanks for your time!
 
Looking at Shyam's first post, Sajan's formula needs small adjustment:

=SUMPRODUCT(CODE(MID(PROPER(A2),ROW($A$1:INDEX($A:$A,LEN(A2))),1)))

just so that it makes

'Peter John Smith' and 'john peter smith' the same.
 
Back
Top