• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Reverse Duplicates

rm3aquva

New Member
Hi All,


I have a challenge for the group. I know how to remove duplicates from a normal list. I routinely work with city routes in aviation. For example, JFK-SFO (New York-San Francisco) or LHR-ORD(London Heathrow to Chicago)


Many times I have a list which contain JFK-SFO AND its reverse SFO-JFK. Essentially I would like to remove duplicates in a way so that in a list where both JFK-SFO and SFO-JFK, only one value either JFK-SFO or SFO-JFK remains. This will help pulling data since either way it is written, the route is the same. Currently, Excel recognizes them as unique entries and thus doesn't remove duplicates...


Any ideas? Thanks!
 
Hi,

Does your data only have the start/end cities, or does it also include any stopovers?


Could you post a sample of your data that is representative of the types of situations present in the data? (You could just paste it directly into a post.)


Cheers,

Sajan.
 
This is just a small portion of a 1,000 item list. JFK-ORD is mentioned and then later ORD JFK is mentioned too. Essentially it is the same route but just written opposite. Assume it is a non-stop flight and no stops in the middle.
 
Looks like the delimiter changed from a '-' to a space between your two posts. No problem.


I think your problem is pretty straightforward. I will post a solution later this evening, if you do not get any other responses. (I have to be away from my computer for the next few hours).


Cheers,

Sajan.
 
Hi, rm3aquva!


If you can afford to use a helper column do this:


a) B2: =SI(IZQUIERDA(A2;ENCONTRAR(" ";A2)-1)<=DERECHA(A2;LARGO(A2)-ENCONTRAR(" ";A2));IZQUIERDA(A2;ENCONTRAR(" ";A2)-1);DERECHA(A2;LARGO(A2)-ENCONTRAR(" ";A2)))&" "&SI(IZQUIERDA(A2;ENCONTRAR(" ";A2)-1)>DERECHA(A2;LARGO(A2)-ENCONTRAR(" ";A2));IZQUIERDA(A2;ENCONTRAR(" ";A2)-1);DERECHA(A2;LARGO(A2)-ENCONTRAR(" ";A2))) -----> in english: =IF(LEFT(A2,FIND(" ",A2)-1)<=RIGHT(A2,LEN(A2)-FIND(" ",A2)),LEFT(A2,FIND(" ",A2)-1),RIGHT(A2,LEN(A2)-FIND(" ",A2)))&" "&IF(LEFT(A2,FIND(" ",A2)-1)>RIGHT(A2,LEN(A2)-FIND(" ",A2)),LEFT(A2,FIND(" ",A2)-1),RIGHT(A2,LEN(A2)-FIND(" ",A2)))


b) Copy column B pasting values at column C.


c) Remove duplicates from column C.


If you couldn't afford to do that you'll require VBA code, I guess.


Just advise if any issue.


Regards!


PS: The formula foresees From & To of different lengths than 3 chars each; if you're sure that aviation codes for cities are 3 char length (as I think they are, I'm just leaving the original formula for further uses) then use this easier version:

B2: =SI(IZQUIERDA(A2;3)<=DERECHA(A2;3);IZQUIERDA(A2;3);DERECHA(A2;3))&" "&SI(IZQUIERDA(A2;3)>DERECHA(A2;3);IZQUIERDA(A2;3);DERECHA(A2;3)) -----> in english: =IF(LEFT(A2,3)<=RIGHT(A2,3),LEFT(A2,3),RIGHT(A2,3))&" "&IF(LEFT(A2,3)>RIGHT(A2,3),LEFT(A2,3),RIGHT(A2,3))


@Sajan

Hi!

Please don't come up with a single cell formula to do the job, I'd feel so sad and helpless... but I'd love to give a look at it if I could :)

Regards!
 
Hi SirJB7, in column A I had the code, for example, LHRORD and then everytime the formula said A2, I referenced the cell with LHRORD. It came back as a value error.
 
Hi, rm3aquva!


You posted you had "LHR ORD" and not "LHRORD", such a not so slight difference the space.


Could you post data as what your file actually has? Otherwise consider uploading a sample file (including manual examples of desired output), it'd be very useful for those who read this and might be able to help you. Thank you. Give a look at the green sticky posts at this forums main page for uploading guidelines.


Regards!
 
Hi, rm3aquva!

Something kept going round in my head. If you have a 6 char string as cell contents the 1st general formula won't work since there's no embedded space, but the short last one it should.

Regards!
 
Hello SirJB7,

Assuming that the source text has a delimiter, the following formula (shown for source text in cell A10) will identify the last of the absolute row# where the same city codes exist.


=IFERROR(MAX(LARGE(IF(MMULT(SIGN(IFERROR(SEARCH(TRIM(MID(SUBSTITUTE(A10,Delimiter,REPT(" ",4)),{1,4},{4,8})), List),0)), {1;1})=2, ROW(List)), {1,2})), "")


enter with Ctrl + Shift + Enter


Here List is the Name for the entire range of city routes (such as A1:A1000)

Delimiter can be a hyphen, space, etc.


So, for example, if the source text is duplicated in rows #5 and #17, then both rows will show #17.

If the row is unique, then it will display a blank.


If the source text does not have a delimiter, then the segment of the formula that extracts the city codes will need to be revised.


Cheers,

Sajan.


P.S. Since the OP indicated that the city codes are three chars long, and there are only two city codes per string, I have simplified the formula for that specific case. If city codes can have more than three chars, or if there are more than two city codes in the string, then the formula will need to be generalized.
 
@Sajan

Hi!


In a couple of days maybe I finish analyzing your simple formula (I counted 13 functions vs. 3 of mine)... But that's my fault, who pushes me to ask you for a single cell formula?


In the meanwhile please give a look at this file:

https://dl.dropboxusercontent.com/u/60558749/Reverse%20Duplicates%20%28by%20Sajan%20%26%20SirJB7%20for%20rm3aquva%20at%20chandoo.org%29.xlsx


I works fine but when I duplicate entries in column A a blank cell appears in your list.


Regards!
 
Hi SirJB7,

I duplicated A5 to A7 as you requested, and I got a blank cell in cell D4. That shows that A4 is now a unique entry in the list. (i.e. the blank cell is by design.)


If you wish, you could change the last "" in the formula to "Unique" or some other text.


Cheers,

Sajan.
 
@Sajan

Hi!

My mistake, I misunderstood your previous post, I read as if the numbers resulting from the formula were displaying the actual row, even if unique. Astonishing formula, today's not the time of the day (GMT-3) to dissect it. Thanks and good job, buddy.

Regards!
 
Hi Sajan ,


I was thinking along a slightly different line :


Given the two named ranges of List and Delimiter , and given the constraints of fixed length codes , and an identical delimiter in all the codes , the following formula would give the row number , within the list , of the matching duplicate :


=MATCH(1,(IFERROR(SEARCH(INDEX(List,ROW(A1)),List&Delimiter&List),0)>0)*(ROW(List)<>ROW()),0)


Narayan
 
Hi Narayan,

Your approach certainly makes more sense for the constraints given.


In retrospect, the formula I posted is certainly overkill for that situation. However, I started with an attempt to make it more general purpose so that it could handle more than two tokens. Here is my original formula:


=IFERROR(MAX(LARGE(IF(MMULT(SIGN(IFERROR(SEARCH(MID(A10, (COLUMN($A$1:INDEX($1:$1,LEN(A10)-LEN(SUBSTITUTE(A10,Delimiter,""))+1))-1)*4+1,3), List),0)), ROW($A$1:INDEX($A:$A, LEN(A10)-LEN(SUBSTITUTE(A10,Delimiter,""))+1))^0)=LEN(A10)-LEN(SUBSTITUTE(A10,Delimiter,""))+1, ROW(List)), {1,2})), "")


entered with Ctrl + Shift + Enter


The above formula should handle cases where there are more than two tokens in a string. (It still expects each token to be 3 chars long, and delimiter to be 1 char long.)


As you can see, the formula gets pretty long very quickly, just to determine the number of tokens! Also, the formula has an unintended (from my perspective) side effect when it encounters a proper subset of tokens. (e.g. one string has five tokens, and the second has the same five tokens plus additional tokens. In that case, the formula identifies the subset string as a duplicate.)


I did not attempt to make it more general purpose to handle variable-length tokens, or delimiters larger than 1 char long, since that would have made the formula even longer!


The basic idea was pretty simple to implement. The only part that gave me pause was figuring out how to differentiate between duplicate rows and unique rows in the results.


Perhaps there is a different approach for shortening the formula while enabling a general solution.


Regards,

Sajan.
 
@Sajan

Hi!

I told I was taking a couple of days to dissect and digest your "simple" formula and you change it within that period. That's not fair, you shouldn't do such things...

Regards!

PS: Of writing those formulas, of course :)
 
Hi SirJB7,

As you know, Chandoo pays based on the length of the formulas. A couple of more long ones, and I should be able to retire. I am sure you are working towards the same goals!! :-)


-Sajan.
 
Back
Top