Question

I had the idea of a search engine that would index web items as other search engines do now, but would store only each file's title, URL, and a hash of its contents.

This way it would be easy to find items on the web if you already had them and didn't know where they came from or wanted to know all the places that something appeared.

This would be most useful for non-textual items such as images, executables, and archives.

I was wondering if there is already something similar?

Solution

Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.

In general, there are several flavors available: hashes for strings (such as simhash), sets or 0/1 features (such as min-wise hashes), and for real vectors.
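
To make the set flavor concrete, here is a minimal min-wise hashing sketch. The helper names (`minhash_signature`, `estimated_jaccard`), the use of seeded SHA-1 as the hash family, and the choice of 64 hash functions are all my own assumptions for illustration, not a reference implementation.

```python
import hashlib


def _seeded_hash(seed: int, item: str) -> int:
    """One member of a hypothetical hash family: SHA-1 over 'seed:item'."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes: int = 64):
    """Keep the minimum hash value per seed; similar sets share many minima."""
    unique = set(items)
    if not unique:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, x) for x in unique) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_one = minhash_signature(["cat", "dog", "fish", "bird"])
sig_same = minhash_signature(["bird", "fish", "dog", "cat"])
sig_near = minhash_signature(["cat", "dog", "fish", "horse"])
print(estimated_jaccard(sig_one, sig_same))
print(estimated_jaccard(sig_one, sig_near))
```

The estimate gets tighter as `num_hashes` grows, at the cost of a longer signature per item.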

So far, the main trick for numerical hashes is basically dimension reduction. For strings, the idea is to come up with a representation that's robust in the face of minor edits.
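
For the string case, a simhash-style fingerprint can be sketched roughly as follows. Splitting on whitespace and using MD5 as the per-token hash are my assumptions to keep the example short; real systems usually hash weighted shingles or features instead.

```python
import hashlib

BITS = 64  # fingerprint width, chosen arbitrarily for this sketch


def simhash(text: str) -> int:
    """Each token votes +1/-1 on every bit; the sign of the tally sets the bit."""
    tally = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: BITS // 8], "big")
        for i in range(BITS):
            tally[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, votes in enumerate(tally):
        if votes > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


original = simhash("the quick brown fox jumps over the lazy dog")
edited = simhash("the quick brown fox jumped over the lazy dog")
print(hamming_distance(original, edited))
```

A small edit changes only a few votes, so the two fingerprints tend to differ in few bits, whereas a cryptographic hash of the whole string would change completely.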

I'm also doing a little research in this field, although I guess Stack Overflow might not be the right place for nascent work.

OTHER TIPS

Well, for images, there's TinEye (http://tineye.com/), which will one-up that and find you similar images too.

The question seems to focus on exact-match hashes, which are better understood than nearest-neighbor approaches and are indeed worthwhile, especially if people can share tags and other metadata that way.
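
The exact-match idea from the question (store only title, URL, and a content hash) could look something like this toy in-memory index. The class name `HashIndex`, the example URLs, and the choice of SHA-256 are all hypothetical; a real crawler-backed service would need persistent storage.

```python
import hashlib
from collections import defaultdict


class HashIndex:
    """Toy index mapping a content hash to every (title, url) where it was seen."""

    def __init__(self):
        self._entries = defaultdict(list)

    @staticmethod
    def digest(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def add(self, title: str, url: str, content: bytes) -> str:
        """Record one sighting; the content itself is never stored."""
        key = self.digest(content)
        self._entries[key].append((title, url))
        return key

    def lookup(self, content: bytes):
        """Return all known locations of byte-identical content."""
        return list(self._entries.get(self.digest(content), []))


index = HashIndex()
sample = b"\x89PNG pretend image bytes"
index.add("Logo", "http://example.com/logo.png", sample)
index.add("Logo (mirror)", "http://mirror.example.org/logo.png", sample)
index.add("Archive", "http://example.com/files.zip", b"some other bytes")

matches = index.lookup(sample)
for title, url in matches:
    print(title, url)
```

Note that a single changed byte produces a different key, which is exactly why this only answers "where else does this exact file appear?" and not "what looks similar?".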

As @rjmunro notes, hash-based searching is a popular idea in the P2P world, and Bitzi did pretty much this. They have since shut down and their Bitpedia (Digital Media Encyclopedia) is no longer hosted there, though at least some of it is still available at Archive.org.

Bitzi also produced software like Bitcollider (SourceForge.net), and the Magnet URI scheme, which allows for specifying a file by hash and is thus a content-based identifier. Various applications support searching at various databases via Magnet URIs as described at that Wikipedia page.
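
As a rough sketch of what a content-based identifier looks like, the snippet below builds a Magnet link from a file's bytes using the `urn:sha1` form (Base32-encoded SHA-1) plus a display name. The function name `magnet_for` and the sample data are my own; other hash URNs exist and different clients support different ones.

```python
import base64
import hashlib
from urllib.parse import quote


def magnet_for(content: bytes, name: str) -> str:
    """Build a magnet link whose exact topic (xt) is the Base32 SHA-1 of the content."""
    digest = hashlib.sha1(content).digest()            # 20 raw bytes
    encoded = base64.b32encode(digest).decode("ascii")  # 32 chars, no padding needed
    return f"magnet:?xt=urn:sha1:{encoded}&dn={quote(name)}"


link = magnet_for(b"pretend these are the bytes of a song", "song one.mp3")
print(link)
```

Because the identifier is derived purely from the bytes, anyone holding the same file computes the same link, regardless of where they got it or what it is called locally.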

The same idea is popular in the password-cracking scene: see, e.g., findmyhash, a Python script to crack hashes using online services.

Going a step further, I think it would be great if there were databases and online repositories identifying content by hash and providing tags and other metadata about the content from various perspectives. Then I could leave my music collection in its pristine state (no wasted backup space and time), but still tag the files myself and add other metadata via external tag databases. If my applications knew how to grab the tags, that would seem much better than the current system, where we modify and copy around big files just to move tags from, e.g., my desktop to my phone.

See a related idea at Metadata Independent Hashing for Media Identification & P2P Transfer Optimisation (pdf).

It's not a bad idea. Sometimes I stumble upon some file and find myself trying to figure out where it came from :) But how are you going to track an item's sources? Content can be obtained by various means: a web browser, a download manager, or simply by copying from a network share.

If I understand your proposal right, http://bitzi.com/ has done this for a while.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow