Question

I need to compare two strings, or at least find a sequence of characters from one string inside another. The two strings contain the MD5 hashes of files, which I must compare to report whether there is a match.

My current code is:

def comparemd5():
    origmd5=getreferrerurl()
    dlmd5=md5_for_file(file_name)
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    s = difflib.SequenceMatcher(None, origmd5, dlmd5)
    print "ratio is:",s.ratio()

The output I get is:

original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40
12db46', '59739CCDA2F15D5AC16DB6695CAE3378']

downloader file md5 is 59739ccda2f15d5ac16db6695cae3378

ratio is : 0.0

So there is a match for dlmd5 in origmd5, but somehow the code is not finding it. I must be doing something wrong somewhere. Please help me out.

Solution

Basically, you want the idiom if test_string in list_of_strings. It looks like the comparison should ignore case, so you might want:

if test_string.lower() in (s.lower() for s in list_of_strings)

In your case:

>>> originals = ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
>>> test = '59739ccda2f15d5ac16db6695cae3378'
>>> if test.lower() in (s.lower() for s in originals):
...    print '%s is match, yeih!' % test
... 
59739ccda2f15d5ac16db6695cae3378 is match, yeih!
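
For use outside the interactive prompt, the same check can be wrapped in a small function. This is only a sketch in Python 3 syntax (the thread itself uses Python 2), and the name find_match is made up for illustration:

```python
def find_match(test, candidates):
    """Return the first entry of candidates equal to test, ignoring case, or None."""
    wanted = test.lower()
    for candidate in candidates:
        if candidate.lower() == wanted:
            return candidate
    return None


originals = [
    "0430f244a18146a0815aa1dd4012db46",
    "59739CCDA2F15D5AC16DB6695CAE3378",
]
match = find_match("59739ccda2f15d5ac16db6695cae3378", originals)
print(match)  # the matching entry, returned with its original casing
```

Returning the matching entry rather than a bare True/False makes it easy to log which reference hash was hit.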

OTHER TIPS

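It looks like the comparison fails because the letters differ in case. Lowercasing alone won't change the ratio, though, because SequenceMatcher is still comparing a list against a string, so test for membership instead. You may want to try:

def comparemd5():
    # lowercase every reference hash so the comparison ignores case
    origmd5=[item.lower() for item in getreferrerurl()]
    dlmd5=md5_for_file(file_name).lower()
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    # a hash either matches or it doesn't, so check membership rather than a ratio
    print "match found:",dlmd5 in origmd5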

Given the input:

original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']

downloader file md5 is 59739ccda2f15d5ac16db6695cae3378

You have two problems.

First of all, origmd5 isn't a single MD5 string but a list of three strings: the MD5 you want plus two other entries.

To fix that: If you know that origmd5 will always be in this format, just use origmd5[2] instead of origmd5. If you have no idea what origmd5 is, except that one of the things in it is the actual MD5, you'll have to compare against all of the elements.

Second, the actual MD5 values are both hex strings representing the same binary data, but they're different hex strings (because one is in uppercase, the other in lowercase). You could fix this by just doing a case-insensitive comparison, but it's probably more robust to unhexlify them both and compare the binary values.

In fact, if you've copied and pasted the output correctly, at least one of those hex strings has a space in the middle of it, so you actually need to unhexlify hex strings with optional spaces between hex pairs. AFAIK, there is no stdlib function that does this, but you can write it yourself in one step:

import binascii

def unhexlify(s):
    # drop optional spaces between hex pairs before decoding
    return binascii.unhexlify(s.replace(' ', ''))
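
On Python 3 the helper may not be needed, since bytes.fromhex already accepts both letter cases and ignores spaces between hex pairs. A minimal sketch, assuming Python 3 (the thread's code is Python 2); to_bytes is a hypothetical name:

```python
import binascii


def to_bytes(hex_string):
    # bytes.fromhex accepts upper- and lowercase digits and skips
    # spaces that fall between two-digit pairs
    return bytes.fromhex(hex_string)


upper = to_bytes("59739CCDA2F15D5AC16DB6695CAE3378")
spaced = to_bytes("59739ccda2f15d5ac16db669 5cae3378")
print(upper == spaced)  # True: both decode to the same 16 bytes

# Same result as stripping the space by hand and calling binascii.unhexlify
print(spaced == binascii.unhexlify("59739ccda2f15d5ac16db6695cae3378"))
```

A space that splits a pair in half (an odd number of digits on either side) is still rejected with a ValueError.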

Meanwhile, I'm not sure why you're trying to use difflib.SequenceMatcher at all. Two slightly different MD5 hashes refer to completely different original sources; that's kind of the whole point of MD5, and crypto hash functions in general. There's no such thing as a 95% match; there's either a match, or a non-match.
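
The 0.0 in the question also follows from what SequenceMatcher was given: a list on one side and a string on the other, so whole list entries get compared against single characters and nothing can ever be equal. A small sketch in Python 3 syntax; the two hashed inputs are made up purely to show that a ratio between different digests carries no meaning:

```python
import difflib
import hashlib

dlmd5 = "59739ccda2f15d5ac16db6695cae3378"
origmd5 = ["0430f244a18146a0815aa1dd4012db46", dlmd5.upper()]

# List vs. string: the elements are whole strings on one side and
# single characters on the other, so no element ever matches.
list_ratio = difflib.SequenceMatcher(None, origmd5, dlmd5).ratio()
print(list_ratio)  # 0.0

# Two inputs differing in a single byte give unrelated digests; whatever
# ratio comes out says nothing about how similar the inputs were.
a = hashlib.md5(b"hello").hexdigest()
b = hashlib.md5(b"hellp").hexdigest()
print(a == b, difflib.SequenceMatcher(None, a, b).ratio())
```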

So, if you know the 3rd value in origmd5 is the one you want, just do this:

s = unhexlify(origmd5[2]) == unhexlify(dlmd5)

Otherwise, do this:

s = any(unhexlify(origthingy) == unhexlify(dlmd5) for origthingy in origmd5)

Or, turning it around to make it simpler:

s = unhexlify(dlmd5) in map(unhexlify, origmd5)

Or whatever equivalent you find most readable.
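
Putting the pieces together, one possible end-to-end version might look like the following. It is a sketch in Python 3 syntax, not the asker's actual code: md5_for_file here is a stand-in written for illustration (the original helper is not shown in the question), has_matching_md5 is a hypothetical name, and the temporary file only simulates a download.

```python
import hashlib
import os
import tempfile


def md5_for_file(path, chunk_size=8192):
    """Hex MD5 of a file, read in chunks (stand-in for the asker's helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_matching_md5(dlmd5, origmd5):
    """True if any entry of origmd5 encodes the same bytes as dlmd5."""
    wanted = bytes.fromhex(dlmd5)
    return any(bytes.fromhex(entry) == wanted for entry in origmd5)


# Demo with a throwaway file standing in for the download
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
file_name = tmp.name

dlmd5 = md5_for_file(file_name)
origmd5 = ["0430f244a18146a0815aa1dd4012db46", dlmd5.upper()]
print(has_matching_md5(dlmd5, origmd5))  # True
os.remove(file_name)
```

Comparing decoded bytes rather than text means letter case in the reference list does not matter.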

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow