Question

In order to be able to detect RT of a particular tweet, I plan to store hashes of each formatted tweet in the database.

What hashing algorithm should I use. Cryptic is of course not essential. Just a minimal way of storing a data as something which can then be compared if it is the same, in an efficient way.

My first attempt at this was by using md5 hashes. But I figured there can be hashing algorithms that are much more efficient, as security is not required.

Was it helpful?

Solution

You are trying to hash a string right? Builtin types can be hashed right away, just do hash("some string") and you get some int. Its the same function python uses for dictonarys, so it is probably the best choice.

OTHER TIPS

Do you really need to hash at all? Twitter messages are short enough (and disk space cheap enough) that it may be better to just store the whole message, rather than eating up clock cycles to hash it.

I am not familiar with Python (sorry, Ruby guy typing here) however you could try a few things.

Assumptions: You will likely be storing hundreds of thousands of Tweets over time, so comparing one hash against "every record" in the table will be inefficient. Also, RTs are not always carbon copies of the original tweet. After all, the original author's name is usually included and takes up some of the 140 character limit. So perhaps you could use a solution that matches more accurately than a "dumb" hash?

  1. Tagging & Indexing

    Tag and index the component parts of the message in a standard way. This could include treating hashed #...., at-marked @.... and URL strings as "tags". After removing noise words and punctuation, you could also treat the remaining words as tags too.

  2. Fast Searching

    Databases are terrible at finding multiple group membership very quickly (I'll assume your using either Mysql or Postgresql, which are terrible at this). Instead try one of the free text engines like Sphinx Search. They are very very fast at resolving multiple group membership (i.e. checking if keywords are present).

    Using Sphinx or similar, we search on all of the "tags" we extracted. This will probably return a smallish result set of "potential original Tweets". Then compare them one by one using similarity matching algorithm (here is one in Python http://code.google.com/p/pylevenshtein/)

Now let me warmly welcome you to the world of text mining.

Good luck!

I echo Chris' comment about not using a hash at all (your database engine can hopefully index 140-character fields efficiently).

If you did want to use a hash, MD5 would be my first choice as well (16 bytes), followed by SHA-1 (20 bytes).

Whatever you do, don't use sum-of-characters. I can't immediately come up with a function that would have more collisions (all anagrams hash the same), plus it's slower!

$ python -m timeit -s 'from hashlib import md5' 'd=md5("There once was a man named Michael Finnegan.").digest()'
100000 loops, best of 3: 2.47 usec per loop
$ python -m timeit 'd=sum(ord(c) for c in "There once was a man named Michael Finnegan.")'
100000 loops, best of 3: 13.9 usec per loop

There are a few issues here. First, RT's are not always identical. Some people add a comment. Others change the URL for tracking. Others add in the person that they are RT'ing (which may or may not be the originator).

So if you are going to hash the tweet, you need to boil it down to the meat of the tweet, and only hash that. Good luck.

Above, someone mentioned that with 32-bits, you will start having collisions at about 65K tweets. Of course, you could have collisions on tweet #2. But I think the author of that comment was confused, since 2^16 = ~65K, but 2^32 = ~4 Trillion. So you have a little more room there.

A better algorithm might be to try to derive the "unique" parts of the tweet, and fingerprint it. It's not a hash, it's a fingerprint of a few key words that define uniqueness.

Well, tweets are only 140 characters long, so you could even store the entire tweet in the database...

but if you really want to "hash" them somehow, a simple way would be to just take the sum of the ASCII values of all the characters in the tweet:

sum(ord(c) for c in tweet)

Of course, whenever you have a match of hashes, you should check the tweets themselves for sameness, because the probability of finding two tweets that give the same "sum-hash" is probably non-negligible.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top