• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

A comprehensive fuzzy match (counting bananas!)

Stuck1

New Member
Hi all,


I recently posted this on Mr Excel, but I've not had a reply in 48 hours so I thought it safe to see if anyone here at Chandoo could help. Hope someone can!


This is a tough one to explain, but I’ll do my best as I’ve not found anyone who can offer a solution.


I have a list of measures that a number of organizations are monitoring and I've been asked to find out which organizations are measuring similar things. There are thousands of measures, so I’ve used the fuzzy match plug-in to identify similarities in the text. I’ve set the threshold at 0.6.


The problem that I’ve now got is that I now need to categorically state who measures similar topics. However, the way the fuzzy match works means that some similar measures might be matched for some organizations and not others, that’s because there are differences in how the measures are worded.


In my example (see below), we’ve got one organization measuring ‘Banana supply’ and the fuzzy match shows that another 3 are measuring the same sort of thing. However, further down the list I’ve got an organization measuring ‘Supply of local bananas’ and the match against this measure has revealed that a further organization is measuring something along the same lines. So, if I wanted to state who has an interest in counting bananas I should be listing a total 5 organizations, I need to somehow amalgamate the two groups.


This is a very simple example. I’m guessing that I won’t be able to avoid doing a manual check at some point, but if anyone can give me any hints as to how I can cut hours out of this task I’d really, really appreciate it.


Let me know if this isn’t clear and I’ll try and clarify.


I'm struggling to format the example table and I can't access any upload sites from my work LAN. If the table below comes out all wonky, here is the example on Mr Excel: http://www.mrexcel.com/forum/excel-questions/670379-comprehensive-fuzzy-match-counting-bananas.html

[pre]
Code:
Organisation     Measure                  Organisation       Similar Measure
Org 1            Banana supply            Org 2              Supply of bananas
Org 3              Banana Volume
Org 4              The number of bananas
Org 2           Supply of local bananas   Org 1              Banana supply
Org 3              Banana Volume
Org 4              The number of bananas
Org 5              Supply of yellow bananas
[/pre]
 
Hi ,


An interesting problem. I have no idea how your fuzzy match plug-in works , but I would think that the starting point would be to establish a set of keywords ; in this case , the first set of data would have the keywords BANANA and SUPPLY , while the second would have BANANA , SUPPLY and LOCAL , assuming that the word LOCAL was significant enough.


If the number of keywords is just 2 , then how would a fuzzy match of Banana Volume with Banana Supply happen ? Does it mean that a match of just 1 out of 2 keywords is enough ? Or can you allocate weights to the keywords , so that BANANA has a greater weight than SUPPLY ? In that case would the words Banana Purchase or even Banana Recipes match , even though Banana Supply is in no way related to Banana Purchase or Banana Recipes ?


Any comments ?


Narayan
 
Many thanks for the reply. I'd considered keywords, but there are so many diverse measures, that it'd take forever to define a set of keywords, there are perhaps over 1000 unique measures and I don't even understand what all of them are about.


My initial thought was to give each different measure for each orgnisation a unique referrence, then I'd look for similiarites within the pools and try to sort them into batches (via a pivot), but the pivot would be huge and I'd still have to do a very, very lengthy manual trawl.


The plug-in is the Microsoft addin for 2007 onwards and just matches the text of the measure.
 
Hi ,


I think if we have to take it forward , having some data before us would help , since visualizing all the possibilities is impossible.


Is there any possibility of uploading your file , if not today then tomorrow ?


Narayan
 
Hi, Stuck1!


First of all welcome to Chandoo's website Excel forums. Thank you for your joining us and glad to have you here.


As a starting point I'd recommend you to read the three first green sticky topics at this forums main page. There you'll find general guidelines about how this site and community operates (introducing yourself, posting files, netiquette rules, and so on).


Among them you're prompted to perform searches within this site before posting, because maybe your question had been answered yet.


Feel free to play with different keywords so as to be led thru a wide variety of articles and posts, and if you don't find anything that solves your problem or guides you towards a solution, you'll always be welcome back here. Tell us what you've done, consider uploading a sample file as recommended, and somebody surely will read your post and help you.


And about your question...


Interesting problem, but I don't fully understand your issue. Let me explain: as per your example, if your fuzzy plug-in retrieves:

[pre]
Code:
Org 1            Banana supply            Org 2              Supply of bananas
Org 3              Banana Volume
Org 4              The number of bananas
[/pre]
Org 2, 3 and 4, as similar or analogue to Org 1, then what's your requirement?


I apologize if I'm missing anything.


Regards!
 
Thanks for the welcome. I've not been able to find a soloution online.


To use my example, my problem is that the fuzzy match has successfuly matched Org1's bananna measure to similar measures in other organisations, but its not found Org 5's measure. That's because it's not met the match threshold as the text isn't 'that' close. However, I know that there should be this extra match because I can see that Org2 are also measuring banannas and it shows that Org5 also have a bananna related measure.


In a way, it's like Org 2's measure is a bridge (or a link) between Org1's measure and Org5's. So, what I want to do is group all these bananna measures together, I guess I want to improve the match accuracy by doing some kind of second sweep.


I was going to upload a file, but I don't think it's help clarify things. Sorry if this hasn't been explained to well.
 
@Stuck1


Hi


As per the instruction of Naryank please upload a sample file it is better to get the correct answer from the readers and Ninja's


Click the Below to Upload a sample WorkBook


http://chandoo.org/forums/topic/posting-a-sample-workbook


Thanks


SP
 
Hi, Stuck1!

Thanks for uploading your file. There's now clear what is your problem, but I'm afraid that it's related to the fuzzy logic process and not to any Excel issues, so how do you think we could help you? Please describe in detail what do you want to do within the Excel workbook that you uploaded.

Regards!
 
Now that the problem is understood, I was hoping that someone may give me an idea how to organise the data or have a process/formulae to help achieve my objective (to group the measures). I realise that the fuzzy match algorythm will never be perfect, I trialed a few different permatations, but this is the best I can get. Maybe I'm pushing the fuzzy match methodolgy too far.
 
I've just checked. It's still available at:


https://rapidshare.com/#!download|92p8|980583197|Fuzzy%20Problems.xlsx|369|0|0

or

https://rapidshare.com/files/980583197/Fuzzy Problems.xlsx


Just select the green download box. Thanks for any help.
 
Hi Stuck1,


I checked last night and it was available, its a big data set, let me try, might i find something! :)


Faseeh
 
I edited it (from 46,000+ rows!) and - as you may have realised - removed some text from the metrics indicator column. Column C is just the Indicators column anomolysed (slightly!), it's not in the actucal working file.


Thanks :)
 
Huummmmmm i think there is no harm in trying something, might we end up with some procedure that other fellow ninja can translate into a vba code? Or at-least i will learn some new thing! :)


EDIT: I didn't read anything. Me.
 
Hi Stuck1,


This is what i have been try, feel free to criticize/appreciate :D


http://dl.dropbox.com/u/60644346/Fuzzy%20Problems%20Trail.xlsx


Regards,
 
Many thanks for your efforts, I only just noticed your last response, I didn't think you'd work so fast! Brilliant.


I see your solution is centred around identifying keywords and then using those to group the measures. I really like your idea of organising the text to help filter and identify keywords, I'll have a play around and see how I get on. Given the size of the file it's a still a daunting task.


I'll apply further thought to grouping the measures without the need for keywords though, I'm sure there must be a way to identify correleation in the 'matching reference' I allocated.


Edit: I should add that I'm, working on Excel 2003 today, so I can't fully view the soloution in full until I get to my laptop on 2007.
 
Back
Top