Fuzzy Lookup - bug/logic issue?

I've been using Fuzzy Lookup to try to match similarly named source values and have come across a result that surely can't be correct.

For the name "Hopkins W.", I have three potential matches (according to the fuzzy lookup) with similarity as follows

Higgins W.	0.632809937000275
Zoins W.	0.618580460548401
Hopkins K.	0.583522439002991

Considering that "Hopkins K." matches on the entire surname while the other two match only on the "W." initial, why on earth are the other two rated as more likely matches? Even treating this purely as a token-level match, each candidate still has exactly one matching token, so at worst all three should score the same. Can anyone come up with a sound reason for this result, or is it worth raising with Microsoft?
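
To make the token-level argument concrete, here is a minimal Python sketch of a plain, equal-weight token overlap (Jaccard) score. This is only an illustration of the reasoning above, not Fuzzy Lookup's actual scoring; the function names and the whitespace tokenization are my own assumptions.

```python
def tokens(name):
    # Assumption: tokens are whitespace-separated and compared case-insensitively.
    return set(name.lower().split())


def jaccard(a, b):
    # Equal-weight token overlap: shared tokens divided by all distinct tokens.
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


source = "Hopkins W."
candidates = ["Higgins W.", "Zoins W.", "Hopkins K."]

# Each candidate shares exactly one of three distinct tokens with the source.
scores = {name: jaccard(source, name) for name in candidates}
for name, score in scores.items():
    print(name, round(score, 3))
```

Under this naive measure all three candidates tie, which is why a lower score for "Hopkins K." looks odd if tokens are all that count.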

July 23rd, 2015 7:42am

This paper gives some explanation of how fuzzy matching algorithms work:

http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

July 23rd, 2015 9:06am

Man, that got my head spinning after the first page. It didn't really answer why the above occurred, but I'm figuring each token carries roughly the same weight, so a full match on the surname token earned the same score as a match on the initial token. That would suggest quite a simplistic formula: token length really should count for something, otherwise a sentence with lots of "a"s, "an"s and "the"s will score higher than it should.
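
As a rough sketch of what I mean by token length having value, the Python below compares an equal-weight token overlap with a length-weighted one. Both weighting functions and the whitespace tokenization are hypothetical choices for illustration; I am not claiming either is what Fuzzy Lookup does internally.

```python
def weighted_overlap(a, b, weight):
    """Weighted Jaccard-style overlap between the whitespace tokens of a and b."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    shared = sum(weight(t) for t in ta & tb)
    total = sum(weight(t) for t in ta | tb)
    return shared / total if total else 0.0


def equal_weight(token):
    # Every token counts the same, whether it is a surname or an initial.
    return 1.0


def length_weight(token):
    # Hypothetical alternative: longer tokens contribute more to the score.
    return float(len(token))


source = "Hopkins W."
candidates = ["Higgins W.", "Zoins W.", "Hopkins K."]

for name in candidates:
    flat = weighted_overlap(source, name, equal_weight)
    by_len = weighted_overlap(source, name, length_weight)
    print(name, round(flat, 3), round(by_len, 3))
```

With equal weights the three candidates are indistinguishable, while the length-weighted version ranks "Hopkins K." first because the shared surname outweighs a shared initial.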
July 26th, 2015 4:54pm

