Handling spelling mistakes while searching your data using Excel
Spelling mistakes are a fact of day-to-day carporate life. Most of the data in spreadsheets is entered by people and is hence prone to spelling mistakes or alternate spellings. For example, a person named John could have been entered as Jon. When John calls you back to confirm his reservation and you use search or VLOOKUP to find his information, you come up empty.
Here is one technique that I often use when the data has spelling mistakes or when I need a fuzzy search to fetch items that sound similar or are spelled similarly. Take the 2 texts you want to compare and,
- Remove all the vowels – AEIOU
- Replace PH with F, Z & J with G, CK with K, W with V, LL with L, SS with S
- Remove any Hs
- Finally compare both texts
To simplify the above 4 steps I have written a small VBA UDF (User Defined Function) that takes a text parameter and performs the first 3 steps on it. Comparing 2 texts then comes down to comparing the outputs of the function.
Function SimpleText(ByVal thisTxt As String) As String
    ' this function generates a simple text from input text that
    ' can be used for fuzzy search
    ' ByVal keeps the caller's variable unchanged when called from other VBA code
    thisTxt = LCase(thisTxt)
    ' step 1: remove all the vowels
    thisTxt = Replace(thisTxt, "a", "")
    thisTxt = Replace(thisTxt, "e", "")
    thisTxt = Replace(thisTxt, "i", "")
    thisTxt = Replace(thisTxt, "o", "")
    thisTxt = Replace(thisTxt, "u", "")
    ' step 2: replace letters that sound alike
    thisTxt = Replace(thisTxt, "ph", "f")
    thisTxt = Replace(thisTxt, "z", "g")
    thisTxt = Replace(thisTxt, "ck", "k")
    thisTxt = Replace(thisTxt, "w", "v")
    thisTxt = Replace(thisTxt, "j", "g")
    thisTxt = Replace(thisTxt, "ll", "l")
    thisTxt = Replace(thisTxt, "ss", "s")
    ' step 3: remove any Hs
    thisTxt = Replace(thisTxt, "h", "")
    SimpleText = thisTxt
End Function
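Here is a minimal sketch of how the function could be put to work for the comparison step and for a lookup. The names FuzzyMatch and FuzzyMatchRow are made up for this illustration, and both assume SimpleText sits in the same module.

```vba
' Hypothetical helpers for illustration; both assume the SimpleText
' function above is in the same module.

Function FuzzyMatch(ByVal textA As String, ByVal textB As String) As Boolean
    ' True when both texts reduce to the same simple text
    FuzzyMatch = (SimpleText(textA) = SimpleText(textB))
End Function

Function FuzzyMatchRow(ByVal lookupText As String, searchRange As Range) As Variant
    ' returns the position of the first cell in searchRange whose simple text
    ' equals that of lookupText, or #N/A when nothing matches
    Dim wanted As String
    Dim cell As Range
    Dim idx As Long

    wanted = SimpleText(lookupText)
    FuzzyMatchRow = CVErr(xlErrNA)
    If wanted = "" Then Exit Function   ' nothing left to compare against

    For Each cell In searchRange.Cells
        idx = idx + 1
        If Not IsError(cell.Value) Then   ' skip cells holding #N/A, #REF! etc.
            If SimpleText(CStr(cell.Value)) = wanted Then
                FuzzyMatchRow = idx
                Exit Function
            End If
        End If
    Next cell
End Function
```

In a worksheet, =FuzzyMatch("John", "Jon") gives TRUE because both names reduce to "gn". With names in a single column, something like =INDEX(B2:B100, FuzzyMatchRow("Jon", A2:A100)) behaves like a forgiving VLOOKUP. The ranges here are just placeholders.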
The above code can be used to perform fuzzy text searches or searches on unclean data. Of course, the substitution rules above are simply the ones I find good enough. Feel free to define additional rules as per your needs so that your fuzzy searches work even better.
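If you expect to add rules often, one option is to keep them in two lists and loop over them. This is only a sketch of that idea: SimpleTextRules is a made-up name, and the rules are the same ones as above.

```vba
Function SimpleTextRules(ByVal thisTxt As String) As String
    ' sketch: same substitutions as SimpleText, but kept in two arrays
    ' so that a new rule is one extra entry in each array
    Dim findWhat As Variant
    Dim replaceWith As Variant
    Dim i As Long

    findWhat = Array("a", "e", "i", "o", "u", _
                     "ph", "z", "ck", "w", "j", "ll", "ss", "h")
    replaceWith = Array("", "", "", "", "", _
                        "f", "g", "k", "v", "g", "l", "s", "")

    thisTxt = LCase(thisTxt)
    ' rules are applied in order, first entry to last
    For i = LBound(findWhat) To UBound(findWhat)
        thisTxt = Replace(thisTxt, findWhat(i), replaceWith(i))
    Next i
    SimpleTextRules = thisTxt
End Function
```

To add a rule, append the text to find to findWhat and its replacement to replaceWith at the same position. Order matters, since a later rule sees the output of the earlier ones.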
If you are looking to generate SOUNDEX codes for Excel strings, you can use this Excel soundex UDF. Soundex codes are phonetic codes generated for words based on how they sound, so 2 words that sound similar (for example, Robert and Rupert, which both get R163) have the same soundex code. Note that soundex keeps the first letter of the word, so words that start with different letters get different codes. You can use these codes to perform fuzzy searches.
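To give an idea of what such a function does, here is a rough sketch of the standard American Soundex rules in VBA. It is my own illustration under the name SoundexSketch, not the UDF mentioned above, and it only looks at the letters A to Z.

```vba
Function SoundexSketch(ByVal thisTxt As String) As String
    ' sketch of American Soundex: first letter + 3 digits, e.g. R163
    ' digit for each letter A..Z; 0 means the letter is not coded
    Const CODES As String = "01230120022455012623010202"
    Dim i As Long
    Dim n As Long
    Dim ch As String
    Dim code As String
    Dim lastCode As String
    Dim result As String

    thisTxt = UCase(thisTxt)
    For i = 1 To Len(thisTxt)
        ch = Mid(thisTxt, i, 1)
        n = Asc(ch)
        If n >= 65 And n <= 90 Then             ' letters A..Z only
            code = Mid(CODES, n - 64, 1)
            If result = "" Then
                result = ch                     ' keep the first letter as is
                lastCode = code
            ElseIf ch = "H" Or ch = "W" Then
                ' H and W are skipped and do not separate equal codes
            Else
                ' add the digit unless it repeats the previous code
                If code <> "0" And code <> lastCode Then result = result & code
                lastCode = code                 ' vowels reset this to 0
            End If
        End If
    Next i

    If result = "" Then
        SoundexSketch = ""
    Else
        SoundexSketch = Left(result & "000", 4) ' pad or cut to 4 characters
    End If
End Function
```

With this sketch, both Robert and Rupert come out as R163, so comparing the codes of 2 cells works the same way as comparing their SimpleText outputs.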
More on text processing using Excel:
- Split text using Excel formulas
- Get initials from names
- Access & process RSS Feed data from your Excel sheets
Comments
As ever, a great “practical” example that is easy to customise / add to :>) I think I will add “K -> C” so that Katherine is matched to Catherine.
This is similar to the Metaphone function, which is an improvement on SOUNDEX.
I need to do this thing, though, with a few million records :>(
I’d love to see more practical data cleaning “how to’s” e.g. transforming and standardizing phone numbers
Input: (301) 754-6350
Transform: (999)999-9999
Output: 301 | 754 | 6350
@James : thanks very much. Cleaning up phone numbers is a good idea. I will write about it sometime.
Chandoo,
Did you realize that your blog article spells “corporate life” as “carporate life”?
Ironic, no? Shouldn’t blogs have speel checkers (sic)? How about blog comment boxes?
@Alex… you noticed! That was intentional… believe it or not… it is meant to be situational humor.
so was my comment about “speel checkers”
Ross over at Methods In Excel has a post about fuzzy matching. I posted some code and some links there, and Ross has a workbook with some of the more popular methods (Levenshtein, Soundex, etc).
http://www.blog.methodsinexcel.co.uk/2008/09/17/fuzzy-logic/
Enjoy,
JP