• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

How to find text duplicates within a free-text cell?

Rediska

Member
Hi again Excellians,

How to find duplicate words in a single free-text cell?
The rules are: number of duplicates in one cell in not important, it can be any word, in any position, it has to match exact word and duplicates don’t need to be removed.

For example, in a cell: “I need you help help” formula should detect “help” repeated twice, and in example cell “I need you help hel” formula won’t find any matches.

As an outcome, I’d like to have a message saying “review the input for duplicates”, if no duplicates – leave cell empty.

Many Thanks!
Have a great weekend!
 
Can we use helper cells or VB?
Splitting data up w/o those 2 options will be a bit...tricky.
 
Hello,
You could try the following array formula for a value in cell A2:
=IF(SUM(N(IFERROR(FIND(" "&MID(A2,ROW(OFFSET(A$1,,,LEN(A2))), MMULT(FIND(" ", {""," "} &A2&" ",ROW(OFFSET(A$1,,,LEN(A2)))),{1;-1}))&" ", " "&A2&" "), LEN(A2)+1)<ROW(OFFSET(A$1,,,LEN(A2)))))>0, "Dupes", "All good")

enter with Ctrl + Shift + Enter

I am sure it can be optimized, but I am a bit pressed for time. Will look into it further if needed.

Cheers,
Sajan.
 
Hi,
I noticed that the previous formula would result in some false positives under some conditions.
As such, try the following regular formula for a value in cell A2:
=IF(LOOKUP(1,N(FIND(" "&MID(TRIM(A2), ROW(OFFSET(A$1,,,LEN(TRIM(A2)))), MMULT(FIND(" ",{""," "}&TRIM(A2)&" ",ROW(OFFSET(A$1,,,LEN(TRIM(A2))))),{1;-1}))&" ", " "&TRIM(A2)&" ")<ROW(OFFSET(A$1,,,LEN(TRIM(A2)))))),"Dupes", "No Dupes")

Cheers,
Sajan.
 
Hi,
The following is another approach for a string in A2:
=IF(ISNA(MODE(FIND(" "&TRIM(MID(SUBSTITUTE(TRIM(A2)," ",REPT(" ",100)), (ROW(OFFSET(A$1,,,LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1))-1)*100+1, 100))&" ", " "&TRIM(A2)&" "))), "No Dupes", "Dupes")

Cheers,
Sajan.
 
Luke



Do you have British blood in your American veins as you definitely use the understatement....:p
I've learned awhile ago that as soon as I start making something sound like it will be a monumental amount of work (which it would be for me), someone like Sajan comes along and blows it away. :p
 
Hi Sajan,

Thanks a lot, well done - the third formula is a definite winner!
Both formulas (second & third) recognise duplicate symbols as well as words, however for empty cells second formula returns "#REF, but the third one reads empty cell as a value and returns "No duplicate". To avoid this I replaced "No duplicate" with " ", and everything worked just as doctor ordered.....


All the best!
Thanks again!
 
Hi Rediska,
Thanks for the feedback. Glad to help.

Hi Luke,
We all learn from each other. (I have certainly learned a lot from you.)

Regards,
-Sajan.
 
Good day Luke M

I get the impression that Sajan is monitoring the posts and has a smile on his face as it reads the suggested answers for the mere mortals as they talk about helper columns/text to columns/UDF's/CF's ect, ect and waits until we are all scratching around for an answer and they he pops in a formula that makes others stare and read with mouth open...............:p
 
  • Like
Reactions: Xiq
Sajan,

I am really curious - could you please go through your third formula and "decode" step-by-step brackets in a plain language?
Of course, if you'll have time.
Thanks!
 
Hi Bob,
Interesting comment about monitoring of posts.... Chandoo.org is indeed addictive!

Hi Rediska,
In the third formula, the first argument to FIND uses an old technique that I believe is credited to Rick Rothstein. It expands each space found in the original string into 100 spaces, and then picks up the segments at every 100 chars, and then trims the results. That results in an array of strings corresponding to each substring in the original string.
the second argument to FIND is the original string, padded with the space delimiter on either end. When FIND returns the location where each string in the array was found, the only results from FIND that would get repeated are those for the duplicate strings.
MODE returns the number that is repeated most often, and #N/A if there are no repeats. So it is a good test to see if there were any duplicates.

Hope this helps.

Cheers,
Sajan.
 
Hi Sajan,

So basically the arrays with a "spatial" builder twist are the best answer to complex queries. Didn't though of that, usually avoiding arrays like nowhere land.
And now another question - you used "FIND", but
couldn't "LOOKUP", INDEX_MATCH" or "MATCH" (to the entries) do the same job?

Thanks,
Rediska
 
Hi Rediska,
An array is just a collection of items... The only difference between a single item and an array is that an operation is applied to all of the items in that collection.

FIND is needed to traverse a string. (You could also use SEARCH.)

INDEX and MATCH works with discrete elements in a list, whereas FIND works with portions of a string. A string can be considered as a single-item list.

MATCH with a wild-card search comes close to FIND's functionality for a single item, but FIND returns the location in the string where a match is found, whereas MATCH only tells you if an item as a whole matched the lookup value.

Hope that helps.

-Sajan.
 
Hi,
You are welcome.

Though you did not ask for it, here is a brief explanation of the first formula:

It is a technique I learned from Lori.
=IF(LOOKUP(1,N(FIND(" "&MID(TRIM(A2), ROW(OFFSET(A$1,,,LEN(TRIM(A2)))), MMULT(FIND(" ",{""," "}&TRIM(A2)&" ",ROW(OFFSET(A$1,,,LEN(TRIM(A2))))),{1;-1}))&" ", " "&TRIM(A2)&" ")<ROW(OFFSET(A$1,,,LEN(TRIM(A2)))))),"Dupes", "No Dupes")

The right-most FIND finds the position of the space delimiters in the string, so as to determine the length of each substring. Since the second argument prefixes a blank and a space as a constant array, FIND returns a 2-dimensional array for the condition where the prefix is a blank and the condition where the prefix is a space. We then flatten it with MMULT to get a 1-dimensional array. The resulting array would have positive numbers indicating the length of the substring at that location in the string, and zero or negative numbers in all other cases.

We then pass the one-dimensional array to MID to create an array of substrings. MID will return an error for those positions in the array with zero or negative values.

Now we have the array of substrings.

Then, we locate the array of substrings in the main string (using the left-most FIND). Remember that each substring is at the same position in the original string. This allows us to compare the results from FIND with the corresponding position in the string. If the string is not a duplicate, the result from FIND should match the position in the string. If a duplicate is found, the second (and subsequent) duplicates will not match the position in the string. (i.e. for duplicates, FIND will return a value that points to the first/earlier instance found in the original string.)

if the result from the leftmost FIND does not match all of the corresponding positions in the string, we know we have duplicates in the source string. For every duplicate, we will get a TRUE. These TRUE/FALSE are converted to 1/0 using the N().

Since LOOKUP ignores error values, we can use it to lookup the value of 1, and return 1 if it is found. That gets fed to the IF statement, and we have our final results.

Of course, you can avoid the comparison to the position in the original string (and shorten the formula) using the MODE approach used in the second formula! But what is the fun in that?!! :p

Cheers,
Sajan.
 
So far it is a walk of shame on my part, but actually it is fun (well & long forgotten),
well, you definitely reincarnate curiosity :rolleyes:.

Thanks for the formula explanation, especially in such cases.
 
=IF(ISNA(MODE(FIND(" "&TRIM(MID(SUBSTITUTE(TRIM(A2)," ",REPT(" ",100)), (ROW(OFFSET(A$1,,,LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1))-1)*100+1, 100))&" ", " "&TRIM(A2)&" "))), "No Dupes", "Dupes")

@Sajan,
Do I understand correct that this formula has it's limitations in terms of number of words and word length in the text-string, because of the fixed 100 characters cut in the function MID( ... , ... , 100 )?
 
Back
Top