Handling spelling mistakes while searching your data using excel

Posted on September 25th, 2008 in Analytics, ideas, Learn Excel - 8 comments


Spelling mistakes are a fact of day-to-day carporate life. Most of the data in spreadsheets is entered by people and is hence prone to spelling mistakes or alternate spellings. For example, a person named John could have been entered as Jon. When John calls you back to confirm his reservation and you use search or VLOOKUP to find his information, the result would be empty.

Here is one technique that I often use when the data has spelling mistakes or when I need to do a fuzzy search to fetch items that sound or are spelled similarly. Take the 2 texts you want to compare and,

  • Remove all the vowels – AEIOU
  • Replace PH with F, Z & J with G, CK with K, W with V, LL with L, SS with S
  • Remove any Hs
  • Finally compare both texts

To simplify the above 4 steps I have written a small VBA UDF (User Defined Function) that takes a text parameter and performs the first 3 steps, so that all you have to do is compare the results.


Function SimpleText(ByVal thisTxt As String) As String
    ' This function generates a simple text from the input text that
    ' can be used for fuzzy search. ByVal keeps the caller's variable
    ' unchanged when the function is called from other VBA code.
    thisTxt = LCase(thisTxt)
    ' Step 1: remove all the vowels
    thisTxt = Replace(thisTxt, "a", "")
    thisTxt = Replace(thisTxt, "e", "")
    thisTxt = Replace(thisTxt, "i", "")
    thisTxt = Replace(thisTxt, "o", "")
    thisTxt = Replace(thisTxt, "u", "")
    ' Step 2: replace letters that sound alike
    thisTxt = Replace(thisTxt, "ph", "f")
    thisTxt = Replace(thisTxt, "z", "g")
    thisTxt = Replace(thisTxt, "ck", "k")
    thisTxt = Replace(thisTxt, "w", "v")
    thisTxt = Replace(thisTxt, "j", "g")
    thisTxt = Replace(thisTxt, "ll", "l")
    thisTxt = Replace(thisTxt, "ss", "s")
    ' Step 3: remove any Hs
    thisTxt = Replace(thisTxt, "h", "")
    SimpleText = thisTxt
End Function

The above code can be used to perform fuzzy text searches or searches on unclean data. Of course, the above substitution rules are what I find good enough. Feel free to define additional rules as per your needs so that your fuzzy searches work even better.
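As a rough sketch of how the comparison step could be wired up, the two helpers below build on SimpleText. The names FuzzyMatch and FuzzyLookup are made up for this illustration, and the code assumes SimpleText is in the same VBA project.

```vba
' Hypothetical helper: True when both texts reduce to the same simple text.
' Assumes the SimpleText function from this post is in the same project.
Function FuzzyMatch(ByVal text1 As String, ByVal text2 As String) As Boolean
    FuzzyMatch = (SimpleText(text1) = SimpleText(text2))
End Function

' Hypothetical helper: returns the first value in lookupRange whose
' simple text equals that of lookupText, or #N/A when nothing matches.
Function FuzzyLookup(ByVal lookupText As String, lookupRange As Range) As Variant
    Dim cell As Range
    Dim target As String
    Dim candidate As String

    target = SimpleText(lookupText)
    FuzzyLookup = CVErr(xlErrNA)   ' default result when nothing matches

    For Each cell In lookupRange.Cells
        If Not IsError(cell.Value) Then
            candidate = CStr(cell.Value)
            ' Skip empty cells so a blank never counts as a match
            If Len(candidate) > 0 Then
                If SimpleText(candidate) = target Then
                    FuzzyLookup = cell.Value
                    Exit Function
                End If
            End If
        End If
    Next cell
End Function
```

With names in A2:A100, a formula such as =FuzzyLookup("Jon", A2:A100) would return "John" if that is the first matching entry, and the result can then be fed to a regular VLOOKUP. The function scans the range cell by cell, so it is likely to be slow on very large lists.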

If you want to generate SOUNDEX codes for strings in Excel, you can use this excel soundex UDF. Soundex codes are phonetic codes generated for words based on how they sound, so 2 words that sound similar (for example, Smith and Smyth) get the same Soundex code. Note that Soundex keeps the first letter of the word, so words that differ in their first letter get different codes. You can use these codes to perform fuzzy searches.
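For readers who want to see how such codes are built, here is a minimal sketch of the common American Soundex rules. It is not the UDF mentioned above; the name SoundexCode is made up for this illustration, and it only handles the letters A to Z.

```vba
' Hypothetical sketch of American Soundex: first letter + 3 digits.
Function SoundexCode(ByVal txt As String) As String
    ' Digit for each letter A..Z ("0" means the letter is not coded)
    Const CODES As String = "01230120022455012623010202"
    Dim i As Long
    Dim n As Long
    Dim ch As String
    Dim code As String
    Dim lastCode As String
    Dim result As String

    txt = UCase(txt)
    For i = 1 To Len(txt)
        ch = Mid(txt, i, 1)
        n = Asc(ch)
        If n >= 65 And n <= 90 Then          ' letters A..Z only
            code = Mid(CODES, n - 64, 1)
            If Len(result) = 0 Then
                result = ch                  ' keep the first letter as is
                lastCode = code
            ElseIf ch = "H" Or ch = "W" Then
                ' H and W do not separate letters with the same code
            Else
                ' Add the digit unless it repeats the previous code;
                ' vowels reset lastCode so a repeat after a vowel counts
                If code <> "0" And code <> lastCode Then result = result & code
                lastCode = code
            End If
        End If
        If Len(result) = 4 Then Exit For
    Next i

    ' Pad short codes with zeros, for example "S53" becomes "S530"
    If Len(result) > 0 Then result = Left(result & "000", 4)
    SoundexCode = result
End Function
```

With this sketch, both =SoundexCode("Smith") and =SoundexCode("Smyth") give S530, so comparing the codes treats the two spellings as a match.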


Written by Chandoo

8 Responses to “Handling spelling mistakes while searching your data using excel”

  1. James says:

    As ever, a great "practical" example that is easy to customise / add to :>) I think I will add "K -> C" so that Katherine is matched to Catherine.

    Is similar to the Metaphone function which is an improvement on SOUNDEX.

    I need to do this thing, though, with a few million records :>(

    I'd love to see more practical data cleaning "how to's" e.g. transforming and standardizing phone numbers

    Input: (301) 754-6350
    Transform: (999)999-9999
    Output: 301 | 754 | 6350

  2. Chandoo says:

    @James : thanks very much. Cleaning up phone numbers is a good idea. I will write about it sometime.

  3. Alex J says:

    Chandoo,
    Did you realize that your blog article spells "corporate life as "carporate life".

    Ironic, no? Shouldn't blogs have speel checkers (sic)? How about blog comment boxes?

  4. Chandoo says:

    @Alex... you noticed! That was intentional... believe it or not... it is meant be situational humor. :)

  5. Alex J says:

    so was my comment about "speel checkers" :)

  6. [...] Handling spelling mistakes in your data Splitting text using excel formulas Generating initials from names using excel Adding a range of cells using Concat() [...]

  7. JP says:

    Ross over at Methods In Excel has a post about fuzzy matching. I posted some code and some links there, and Ross has a workbook with some of the more popular methods (Levenshtein, Soundex, etc).

    http://www.blog.methodsinexcel.co.uk/2008/09/17/fuzzy-logic/

    Enjoy,
    JP

  8. [...] I have used fuzzyText UDF (user defined formula) so that we can search against this list even when you have a spelling mistake in the fund name. For more information see fuzzy text search using excel. [...]
