You are not logged in.

#1 2014-05-07 20:53:15

alphaniner
Member
From: Ancapistan
Registered: 2010-07-12
Posts: 2,810

[Python]'Fuzzy' string match

Reference:

get_close_matches(word, possibilities, n=3, cutoff=0.6)
    Use SequenceMatcher to return list of the best "good enough" matches.

    word is a sequence for which close matches are desired (typically a
    string).

    possibilities is a list of sequences against which to match word
    (typically a list of strings).

I found that while searching for a way to 'fuzzy match' strings. But I'm only interested in whether or not two strings 'match'.

As I see it I have three options:

  • Use get_close_matches() directly:

    if difflib.get_close_matches(string1, [string2]):
  • Define a function which basically just returns the value of g_c_m():

    def fuzzy_match(test_string, possible_match):
        return difflib.get_close_matches(test_string, [possible_match])
  • Define a function that is a 'stripped down' version of g_c_m() and returns True/False. FWIW g_c_m() is short and simple, so this was effortless.

IMO even option 2 is worthwhile, if only because of the odd way g_c_m() needs to be called for my case. But I also think it's more 'explicit' and just looks cleaner.

Anyway, I'd like to hear opinions on which of the three options is 'best'.

Last edited by alphaniner (2014-05-07 21:33:12)


But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner

Offline

#2 2014-05-07 21:20:34

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 20,346

Re: [Python]'Fuzzy' string match

I am not sure exactly what you are asking, but have you ever looked at soundex ?


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way

Offline

#3 2014-05-07 21:56:03

firecat53
Member
From: Lake Stevens, WA, USA
Registered: 2007-05-14
Posts: 1,542
Website

Re: [Python]'Fuzzy' string match

IMO, option 1 is simple and clean. You know difflib will only return a list (empty or not), so just checking for entries in that list seems (to me at least) simple and pythonic.

Scott

Offline

#4 2014-05-07 22:25:26

progandy
Member
Registered: 2012-05-17
Posts: 5,280

Re: [Python]'Fuzzy' string match

You might want to use the SequenceMatcher directly, you could even keep the object around if you use it repeatetly.

s = SequenceMatcher(None, "", "what_should_be")
s.set_seq1("what_is")
if s.ratio() > 0.6: return MATCHING

... or for single use:

s = SequenceMatcher(None, "what_is", "what_should_be")
if s.ratio() > 0.6: return MATCHING

Edit: At least I believe that get_close_matches uses this internally. You can use the sequencematcher to log the differences if you want to debug your output, too.
Edit: Yes, here is the source: https://github.com/python/cpython/blob/ … ib.py#L688

Last edited by progandy (2014-05-07 22:35:04)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

Online

#5 2014-05-07 23:11:25

alphaniner
Member
From: Ancapistan
Registered: 2010-07-12
Posts: 2,810

Re: [Python]'Fuzzy' string match

progandy wrote:

You might want to use the SequenceMatcher directly...

Thanks, I'll definitely look into that. I didn't take the time to really understand how SM works when I reworked g_c_m(), so I didn't realize it's that simple. Do you think it's not worthwhile to do all three comparisons, though?

if s.real_quick_ratio() >= cutoff and \
   s.quick_ratio() >= cutoff and \
   s.ratio() >= cutoff:

But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner

Offline

#6 2014-05-07 23:23:14

progandy
Member
Registered: 2012-05-17
Posts: 5,280

Re: [Python]'Fuzzy' string match

alphaniner wrote:
progandy wrote:

You might want to use the SequenceMatcher directly...

Thanks, I'll definitely look into that. I didn't take the time to really understand how SM works when I reworked g_c_m(), so I didn't realize it's that simple. Do you think it's not worthwhile to do all three comparisons, though?

if s.real_quick_ratio() >= cutoff and \
   s.quick_ratio() >= cutoff and \
   s.ratio() >= cutoff:

That depends on your input data. If you want to speed up the function for your use case, strip everything that slows the calculation down. If your data always clears e.g. real_quick_ratio, then you won't need it. You might also want to tweak the cutoff if you get too many false positives/negatives.

Maybe using something different like the Levenshtein-distance might suit you even more. Here is a list of different string metrics: http://en.wikipedia.org/wiki/String_metrics
The NLTK has some implementations: http://www.nltk.org/api/nltk.metrics.ht … s.distance and then there is jellyfish https://pypi.python.org/pypi/jellyfish/

Last edited by progandy (2014-05-07 23:37:02)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

Online

#7 2014-05-08 00:57:06

alphaniner
Member
From: Ancapistan
Registered: 2010-07-12
Posts: 2,810

Re: [Python]'Fuzzy' string match

ATM I'm limiting myself to the stdlib, but I'm keeping pylevenshtein and jellyfish in mind.


But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner

Offline

Board footer

Powered by FluxBB