• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Finding a duplicate in multiple columns with varying responses

pr4peace

New Member
Hi,


A example of what I am doing is attached. Basically we are setting up a labour management system to make insurances for them. We dont want to make two insurances for the same person. So a centralized database is being created , but the file should warn when we are entering a duplicate.


The problem is people have similar names. SO we need a small warning when name matches, a bigger warning when both first name and last name matches, a alert when names and age also matches, and call it a duplicate when names, age and blood group matches.


I went through the concatenate and the countif condition which works well when all conditions have to match . But for varying degree of the duplication can it be made automatic.


File: https://www.dropbox.com/s/he88rv5p0wpbhwn/labour%20data.xls


Thanks in advance


Prashanth
 
Would a UDF work for you? Using this function:

[pre]
Code:
Function MatchCount(Crit1 As Variant, Crit1Range As Range, Crit2 As Variant, _
Crit2Range As Range, Crit3 As Variant, Crit3Range As Range, Crit4 As Variant, _
Crit4Range As Range) As String

Dim xCheck1 As Integer
Dim xCheck2 As Integer
Dim xCheck3 As Integer
Dim xCheck4 As Integer

On Error Resume Next
xCheck1 = WorksheetFunction.Match(Crit1, Crit1Range, 0)
xCheck2 = WorksheetFunction.Match(Crit2, Crit2Range, 0)
xCheck3 = WorksheetFunction.Match(Crit3, Crit3Range, 0)
xCheck4 = WorksheetFunction.Match(Crit4, Crit4Range, 0)
On Error GoTo 0

'*************
'Change the following outputs to whatever text you need
If xCheck1 = 0 Then
MatchCount = "No matches"
ElseIf xCheck2 <> xCheck1 Then
MatchCount = "1 match"
ElseIf xCheck3 <> xCheck1 Then
MatchCount = "2 matches"
ElseIf xCheck4 <> xCheck1 Then
MatchCount = "3 matches"
Else
MatchCount = "4 matches!"
End If

End Function
[/pre]
The formula in your workbook then would be something like this:

=MatchCount(firstName,FirstNameRange,LastName,LastNameRange,Age,AgeRange,BloodType,BloodRange)
 
Hi Luke M,


Thank you so much for your time. I understood the function but I am not able to execute it. I am getting a #NAME error on the formula when I type it in the excel sheet.


Sorry for being a noob but I must be missing something fundamental for it to work.


PFA https://www.dropbox.com/s/atun0mv9o6oz4p2/l1abour%20data.xls


Thanks,

Prashanth
 
I can't access your file, but based on the error, I'm guessing that you accidentally put the code into a sheet module instead of a regular module. Go back into the VBE, and make sure that the code is in a plain module (try going to Insert - Module). After putting the UDF there, make sure to delete it from any other modules.
 
Hi, all!


Luke M:


Here's the main worksheet, it's very simple:

[pre]
Code:
GE Id	Date		First Name	Last Name	Age	Blood Group	skill	lw id	ac policy	Medical checkup
0001	25/Jul/12	Manju		S		33	A+ve					ManjuS
0002	25/Jul/12	Manju		T								ManjuT	1 match
[/pre]
pr4peace:


I downloaded your file and it works, as you can see there is the value "1 match" displayed in cell N2 (which is correct regarding the data).

About the #¿NAME? error displayed while typing the formula, I don't get it and the formula for N2 is:

=MatchCount(E3;$E$2:E2;F3;$F$2:F2;G3;$G$2:G2;H3;$H$2:H2)


Regards!
 
Super dumb error, I had disabled macros !!!!! :(


But the UDF still had some problems. It was not returning the right answer when I drag down further


I did some tinkering to your formula and got it working , but I still dont know the use case where it will fail.


Function MatchCount(Crit1 As Variant, Crit1Range As Range, Crit2 As Variant, _

Crit2Range As Range, Crit3 As Variant, Crit3Range As Range, Crit4 As Variant, _

Crit4Range As Range) As String


Dim xCheck1 As Integer

Dim xCheck2 As Integer

Dim xCheck3 As Integer

Dim xCheck4 As Integer


On Error Resume Next

xCheck1 = WorksheetFunction.Match(Crit1, Crit1Range, 0)

xCheck2 = WorksheetFunction.Match(Crit2, Crit2Range, 0)

xCheck3 = WorksheetFunction.Match(Crit3, Crit3Range, 0)

xCheck4 = WorksheetFunction.Match(Crit4, Crit4Range, 0)

On Error GoTo 0


'*************

'Change the following outputs to whatever text you need

If xCheck1 = 0 Then

MatchCount = "No matches"

ElseIf xCheck2 = 0 Then

MatchCount = "1 match"

ElseIf xCheck3 = 0 Then

MatchCount = "2 matches"

ElseIf xCheck4 = 0 Then

MatchCount = "3 matches"

Else

MatchCount = "4 matches!"

End If


End Function


I converted the "<> xCheck1" to "=0" , as the former was not working when both the MATCH functions were returning the same answers.


Thank you so much for the formula.


Prashanth
 
Be careful with that. By just checking if the xCheck's equal 0, you're only determining if that specific criteria appears within that range. You are no longer guaranteeing that the data is all on the same row. So, if one person is named John Smith, and someone else is named Tom Jones, your formula will think that Tom Smith should have 2 matches.
 
Hi Luke M,


:(


Understood your use case. Thats a showstopper!!!

And I am not able to wrap around the solution.


Help.


Just to dissect the solution. Since all the match functions are returning the row numbers. The result is just to see how many of these numbers match . Isnt it?


I.E if all row numbers match (in the column order) = 4 matches

if 3 row number match(again in the column order) = 3 matches and so on....


Am I right or is my logic flawed?


Thanks,


Prashanth
 
Hi Prashanth ,


Are you particular about a VBA solution ? Can you not use helper columns ?


A formula such as :


=SUM(IF(($C$2:$C$13=C2)*(ROW($C$2:$C$13)<>ROW(C2)),1,0))


will return the number of occurrences of the first name matching the entry in C2 , whose row number is different from 2. This will be 0 if there is no other occurrence ; any result > 0 means there is a duplicate. Of course , this will have to be entered as an array formula , using CTRL SHIFT ENTER.


Extending this to two columns , such as First Name and Last Name is a matter of adding more multiplications within the IF statement :


=SUM(IF((($C$2:$C$13=C2)*($D$2:$D$13=D2))*(ROW($C$2:$C$13)<>ROW(C2)),1,0))


I have assumed that your data extends from row 2 through row 13.


Narayan
 
@pr4peace,

Ah, I see what you mean. It's only looking at the first instance in each column, and that's no good.

Here's a formula approach that should work. Requires no helper columns. All possible outputs are listed at end of formula.

Apologies for the length...

[pre]
Code:
=CHOOSE(MAX(1,
2*(SUMPRODUCT(1*(FirstNameRange=FirstName))>0),
3*(SUMPRODUCT((FirstNameRange=FirstName)*(LastNameRange=LastName))>0),
4*(SUMPRODUCT((FirstNameRange=FirstName)*(LastNameRange=LastName)*(AgeRange=Age))>0),
5*(SUMPRODUCT((FirstNameRange=FirstName)*(LastNameRange=LastName)*(AgeRange=Age)*(BloodRange=BloodType))>0)),
"No match","1 match","2 matches","3 matches","4 matches!")
[/pre]
 
Hi Prashanth,

Here is one more formula based option...


Assuming your data is setup with the name "patientdata", similar to the following:

[pre]
Code:
# A     B               C               D               E       F
1 GE ID	Date	        First Name	Last Name	Age	Blood group
2 0001	25-Jul-12	Manju	        S	        56	AB
3 0002	15-Jul-12	Manju	        S	        35	AB
In the above example, "patientdata" would refer to C2:F3


And your search criteria is setup similar to the following:

#   A           B       C       D               E
1   Fname	Lname	Age	BloodType	Matches
2   Manju	S	35	AB	        4
3   Manju	T	35	AB	        3
[/pre]
Then, you could use a formula such as the following (shown for cell E2) to get the number of matches for each search row:

=MAX(MMULT(--(ISNUMBER(MATCH(patientdata,A2:D2,0))),TRANSPOSE(COLUMN(patientdata)^0)))

entered with Ctrl+Shift+Enter


the MMULT function returns a single column matrix with the number of matches in each row in the patientdata. By getting the MAX of that matrix, we get the maximum number of matches that were found across all of the rows in patientdata.


Hope this helps.


Cheers,

Sajan.
 
Hi All,


Firstly sorry for the delayed response, got stuck with work.


Secondly a huge thanks for all the wonderful replies .


@narayank: thanks for the formula, was a little complicated to understand but got it to work.


@luke: that formula is so simple and awesome, I felt stupid not to have thought of it!!! the sumproduct makes the sheet slow , but I guess it is a compromise for the simplicity.


@sajan: learning new formulas everyday now, thanks for this mmult now. Your solution seems to work when I have input field to search for, in my case I need it to alert as and when i am entering the raw data. Great help nevertheless.


I finally used Luke's formula as I could understand it better :)


Thank you so much for all the great support.


Bye

Prashanth
 
Thanks pr4peace. My approach may be simple, but it is bulky and does repeated calculations of the same thing. As you said, advantage is that it's fairly easy to understand.


sthomas's formula is much more efficient than mine, but does work slightly differently. My formula checks for matches in order (first column matches, first 2 columns match, etc), while sthomas's checks for all columns and returns how many matches. This is best seen with #3 of sthomas's sample data in search criteria. Last name does not match any raw data. My formula will return a value of 1, sthomas's returns 3. Based on your situation, I think the former might be better as you are very likely that have repeating blood types and/or ages.


@sthomas

Tip my hat to you! Very efficient formula, and a reminder that I need to learn more about the MMULT function.
 
Hi Luke,

Excellent observation regarding the MATCH function. It does not tell whether a match was found in the correct column. (Thanks for pointing that out. Most of my datasets apparently don't have overlapping data! You gave me something to think about!)


I think a better approach for Prashanth's problem might be to find the actual match pattern (such as "1101" indicating that the match was found in the first, second and fourth column). Since multiple matches could be found in the data, one would need to find the match with the most ones (1s).


A formula such as the following would provide that:

=LOOKUP("1111",DEC2BIN(MMULT((patientdata=$A8:$D8)*{8,4,2,1},{1;1;1;1}),4))


where $A8:$D8 is the data being sought, and "patientdata" refers to the range being searched.


The result from the above could be used to conditionally format the row being searched to highlight just the fields that matched.


I am still learning about all the things one could do with the MMULT function. I agree that it is a powerhouse function, and has a lot to offer!


Regards,

Sajan.
 
Hi Sajan and Luke,


Did not even understand a little bit of both the formulas, hence could not adapt it to my problem :| :(


I will need a forensic on that formula and unless you are ok with time constraints, you dont need to explain it anytime now!!


I am going with a slower sheet :)


Thanks anyways !!!


Bye,

Prashanth
 
Hi,

In case anyone else is interested, here is a brief explanation of the formula... We can write up something more if there is interest.


The following formula generates a pattern for the matches, and returns the pattern with the most number of matches. (The original problem was to check if a given range had matching values in another range, and then identify the maximum number of matches found. The formula below goes one step beyond by providing the actual match pattern, instead of just a max count of the matches.)


For example, a result of "1011" would indicate that a match was found in fields one, three and four.


The formula is:

=LOOKUP("1111",DEC2BIN(MMULT((patientdata=$A8:$D8)*{8,4,2,1},{1;1;1;1}),4))

The formula searches for the pattern where A8:D8 is found in the range called "patientdata".


Reviewing the formula from the middle...


"patientdata=$A8:$D8" returns an array consisting of TRUEs for every match, and FALSEs for every non-match.

(If TRUE is represented by a 1, and FALSE by a 0, you would get a binary pattern of the matches and non-matches.)


The next step is to convert the binary pattern into a decimal number. You might recall that to convert a binary sequence into a decimal, you need to multiply each 1 with the power of 2 for that position, and then sum the values.

e.g.

[pre]
Code:
Binary sequence:                   1   0   1   1
Position in sequence:              3   2   1   0
Power of 2 for the position:       8   4   2   1
Decimal value for binary sequence: 8   0   2   1
Adding them up, you get the decimal value: 8+0+2+1=11
So 1011 in binary is the same as 11 in decimal.
[/pre]
"(patientdata=$A8:$D8)*{8,4,2,1}" converts the TRUEs and FALSEs into values like {8,0,2,1}.


Now that you have a value like {8,0,2,1}, we need to add them up. That is where the MMULT function comes in handy. MMULT performs matrix multiplication.

MMULT((patientdata=$A8:$D8)*{8,4,2,1},{1;1;1;1}) returns a single column array, with the sum of the individual values. (In this example, {8,0,2,1} would become {11})


As you may recall regarding matrix multiplication,

multiplying {8,0,2,1} with {1;1;1;1} results in 8*1 + 0*1 + 2*1 + 1*1, to result in {11}


Now that we have the decimal value (11 in our example), we will convert it back to binary.

DEC2BIN() takes a decimal number, and the number of positions and returns a binary number.

In our example, DEC2BIN(11,4) returns 1011.


You might ask what the point of all that was, to convert from a binary sequence back into the binary number. Well, as you might have observed, the original values were an array, and we effectively concatenated the individual 0s and 1s.

So... {1,0,1,1} became "1011".


Depending on how many rows exist in the range "patientdata" we would have one or more rows of values in the array.

For example, if patientdata referred to a range of 3 rows, then the following formula segment returns an array with 3 values.

DEC2BIN(MMULT((patientdata=$A8:$D8)*{8,4,2,1},{1;1;1;1}),4) would return something like

{"1001";"0000";"0111"}

Now, we need to identify the value with the most 1s. That is where the LOOKUP function comes in handy.

The LOOKUP function will return the closest alphabetical match to the value being sought. In the formula, we search for "1111" which is the binary representation of decimal 15. So, in the above example, it would return "0111".

(In reality, you could search for any value that is alphabetically greater than "1111", including something like "A". I used "1111" because it was more representative of the max value.)

The net result from the formula is something like "0111" indicating that the range A8:D8 matched according to that pattern. (i.e. A8 did not match, but B8, C8 and D8 did.)


We can now use the pattern for additional processing, if desired, such as to conditionally format the range A8:D8 to highlight just the fields that had a match.

(One approach would be to compare the column number of the field to the corresponding digit in the result, and highlight the field if the digit is a 1.)


(The formula was based on the original posted dataset, where the "patientdata" had four columns. That is the reason for the segments like {8,4,2,1} and {1;1;1;1}. The formula can easily be extended to support more columns. I left out the generic version from the post so as to make it easier to read the formula. I can post the generic version if anyone is interested.)


Since I don't have a way to post a sample worksheet demonstrating the formula in action, I hope the above explanation is clear. (If anyone would be willing to receive an e-mail from me, I can send a sample file. That person can then, perhaps, post the sample file to this site.)


Cheers,

Sajan.
 
@sthomas


Hi!

Do you get paid by character? And then people go around saying that I write a lot. :p

Regards!


PS: BTW, I enjoyed very much the binary treatment.


PS2: I'd post it if I could, but I since some time ago I have this policy (which in this case I'd have gladly overruled as it's not applicable to you, but I thing I shouldn't do it):

http://chandoo.org/forums/topic/locking-the-cell-automaticall-once-the-date-passed#post-21903

So we have two ways:

a) you dig a little to find my email address

b) you post yours, I get it and then update your post to clear it

Just tell me if you're willing to.
 
Hi sthomas,


Am interested to study your sample worksheet for personal use. Please send to this email address: delvillar.dennis@gmail.com


Thanks in advance!
 
Hi SirJB7,

Thanks for your kind words. I assumed not many folks read my posts, and that I would be safe with my rambling...! :)


In any case, I will post a sample file in the next few days... (I need to take a few extra steps.)


Regards,

Sajan.
 
@shtomas

Hi!

First assessment, don't mention it.

Second one, I don't know how many, but I do... not always, but I do.

Third, one's never safe since Windows starts.

Fourth, be waiting for your extra steps.

Regards!
 
Hello,

Following is a link to a sample worksheet that illustrate the match pattern formula. The formula has been tweaked a bit.


http://speedy.sh/JDqhY/Highlight-Matching-Fields.xlsx


Cheers,

Sajan.


P.S. Since this is my first file post, hope it comes through.
 
Sajan & Hui,


Thanks for the formula, From last 3 week, I am in a bed-rest, met with an accident, broken Right Collar Bone, Not even able to hold a PEN, spend my whole day with and only support of my keyboard..


Today morning Hui's Formula, and Sajan's Formula force me to pick up Copy-Pen, evaluate the formula to digest it properly..


Thanks again, not for the Formula's, but for the re-build my confidence..

God bless you..

Regards,

Deb
 
@Debraj Roy

Hi!

Very sorry to hear that the accident was so serious. I wish you the best and sooner recovery.

Regards!

PS: BTW, if you can't hold a pen I should call back the six pack of Carlsberg that might be on the way... I don't want you to make any effort not authorized by the doctors.
 
@ SirJB7,

No way, DOC told me that 'bottoms up' with right hand, is the best exercise for recovery of my Collar Bone..
 
Back
Top