• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Unique ID to duplicate rows

jasmin

New Member
Hi Chandoo,


I have 800,000 records in three columns with hundreds of duplicate records. Data is not arranged in any particular (ascending or descending) order. For eg:


X Y J

x1 y1 j1

x1 y2 j1

x1 y1 j1

x3 y3 j2

x4 y3 j3

x1 y2 j1


and I want to assign them IDs based on identical rows:


X Y J ID

x1 y1 j1 1

x1 y2 j1 2

x1 y1 j1 1

x3 y3 j2 3

x4 y3 j3 4

x1 y2 j1 2


so basically I want to assign unique ID to duplicate records. Please help :)
 
Hi ,


First , what do you want to do after assigning IDs to all records ? Do you want to retain duplicate records or remove them from your data set ? If you want to remove duplicates Excel ( depending on the version you have ) may have a Remove Duplicates feature.


Second , you say that you want to assign unique IDs to duplicate records ; but the example you have given , the IDs for two duplicate records are the same.


I would suggest the following :


Just concatenate the three columns to get the unique ID e.g. x4 y3 j3 concatenated will give you x4y3j3 in column D. Now sort the entire set of data , including column D , with column D being used for the sort order. This will ensure that all duplicate records are together.


After that it is up to you to decide what is to be done with this tagged data.


Narayan
 
Hi Jasmin,


Assumptions:

1. The header row is in row 1, X is in cell A1, Y is in cell B2, etc.

2. The data starts in row 2.

3. You cannot insert a helper column to concatenate the data.


I have taken your small sample and made this formula work:

=IFERROR(INDEX($D$2:D2,MATCH(A3&B3&C3,$A$2:A2&$B$2:B2&$C$2:C2,0),1),MAX($D$2:D2)+1)


However you will need to do two things for this to work:

1. In cell D2, where the first ID number would be, enter "1". Because of the way the above formula works, this sidesteps the circular reference problem. I'm sure there is a better way, but this is simple solution :)


2. In cell D3, enter the above formula, but instead of pressing Enter, press Ctrl+Shift+Enter. This makes the formula an array formula. You will know it because Excel will put {} around it like this:

{=IFERROR(INDEX($D$2:D2,MATCH(A3&B3&C3,$A$2:A2&$B$2:B2&$C$2:C2,0),1),MAX($D$2:D2)+1)}


3. Copy/paste the formula down 800,000 rows.


Now mind you, this is likely to bog down the spreadsheet.


I could also suggest this solution:

1. On a separate sheet, copy the data.

2. In the empty fourth column of the second sheet, concatenate the data, i.e. A2&B2&C2.

3. Use Excel's delete duplicate tools to build a list of only unique values.

4. Generate unique IDs for the resulting unique values.

5. Use a lookup type function in the original sheet to find the unique IDs on the second sheet.


Good luck!
 
Thank you Narayan and daffy333 for your inputs. Assigning unique identifier is really easy, but since I am a beginner in excel its little tough for me.

I should have mentioned my next step as well. The problem is that I cannot delete duplicate records from 800,000 data points before finding maximum and minimum from another column 'Z' from the range of duplicates. Yes daffy333 you were right, it works till I concatenate but as soon as I put another array formula in next column it bogs down my system.

Any further suggestions on the same?
 
Hi Jasmin,


Just so that I understand, you said that you have a fourth column 'Z' that plays into this project? The Z column has a value in it, and if it passes a Low or High threshold, you wish to delete rows in the second table in your original post?


I'll second Narayan's point, that I'm curious about what you hope the end result to be. If you can describe it further, we might pick up on some further solution that will more efficient. As we see the array formula is not efficient at all (I probably shouldn't have floated it in the first place) :)


I look forward to finding out more!


Daffy
 
Jasmin

Is this data 3D triangle triplets from a geological/mine planning package?

Why I ask is that some of these packages have this functionality built in
 
Back
Top