Question

I run a fairly large site where members upload thousands of images every day. Naturally there is a lot of duplication, and I was wondering whether, during an upload, I can generate a signature or hash of the image and store it. Then every time someone uploads a picture, I would simply check whether that signature already exists and raise an error saying the image has already been uploaded. I'm not sure whether this kind of technology already exists for ASP.NET, but I am aware of tineye.com, which does something similar.

If you think you can help, I would appreciate your input.

Kris


Solution

You can use any class derived from HashAlgorithm to generate a hash from the file's byte array. MD5 is the usual choice, but you could substitute any of the algorithms provided in the System.Security.Cryptography namespace. This works for any binary data, not just images.

Lots of sites publish MD5 hashes alongside their downloads so you can verify that a file arrived intact. For instance, an ISO CD/DVD image may be missing bytes even though the download appeared to complete. Once you've downloaded the file, you generate its hash and make sure it matches the one the site lists. If the two match, you've got an exact copy.

I would probably use something similar to this:

public static class Helpers
{
    //If you're running .NET 2.0 or lower, remove the 'this' keyword from the
    //method signature as 2.0 doesn't support extension methods.
    public static string GetHashString(this byte[] bytes, HashAlgorithm cryptoProvider)
    {
        byte[] hash = cryptoProvider.ComputeHash(bytes);
        return Convert.ToBase64String(hash);
    }
}

Requires:

using System.Security.Cryptography;

Call using:

byte[] bytes = File.ReadAllBytes("FilePath");
string filehash = bytes.GetHashString(new MD5CryptoServiceProvider());

or if you're running in .NET 2.0 or lower:

string filehash = Helpers.GetHashString(File.ReadAllBytes("FilePath"), new MD5CryptoServiceProvider());

If you decide to use a different hashing method instead of MD5 to reduce the already minuscule probability of collisions:

string filehash = bytes.GetHashString(new SHA1CryptoServiceProvider());

This way your hash method isn't tied to a specific crypto provider; if you decide to change which one you're using, you just pass a different one in the cryptoProvider parameter.

You can use any of the other hashing classes just by changing the service provider you pass in:

string md5Hash = bytes.GetHashString(new MD5CryptoServiceProvider());
string sha1Hash = bytes.GetHashString(new SHA1CryptoServiceProvider());
string sha256Hash = bytes.GetHashString(new SHA256CryptoServiceProvider());
string sha384Hash = bytes.GetHashString(new SHA384CryptoServiceProvider());
string sha512Hash = bytes.GetHashString(new SHA512CryptoServiceProvider());

OTHER TIPS

Typically you'd just use MD5 or similar to create a hash. This isn't guaranteed to be unique though, so I'd recommend you use the hash as a starting point. Identify if the image matches any known hashes you stored, then individually load the ones that it does match and do a full byte comparison on the potential collisions to be sure.
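The two-step check described above could be sketched roughly as follows. This is only an illustration: the DuplicateIndex class and its TryAdd method are hypothetical names, and the in-memory dictionary stands in for whatever database and file storage the site actually uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical in-memory index. A real site would store the hashes in a
// database and load candidate files from disk only when a hash matches.
public class DuplicateIndex
{
    private readonly Dictionary<string, List<byte[]>> _byHash =
        new Dictionary<string, List<byte[]>>();

    // Returns true if the image was added, false if an identical one exists.
    public bool TryAdd(byte[] image)
    {
        string key;
        using (MD5 md5 = MD5.Create())
        {
            key = Convert.ToBase64String(md5.ComputeHash(image));
        }

        List<byte[]> candidates;
        if (!_byHash.TryGetValue(key, out candidates))
        {
            candidates = new List<byte[]>();
            _byHash[key] = candidates;
        }

        // A matching hash is only a hint; confirm with a full byte comparison.
        if (candidates.Any(existing => existing.SequenceEqual(image)))
        {
            return false;
        }

        candidates.Add(image);
        return true;
    }
}
```

Only images that share a hash are ever compared byte by byte, so the expensive comparison runs rarely.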

Another, simpler technique is to pick a smallish number of bits and read just the first part of the image, then store those starting bits as if they were a hash. This still leaves you a small number of potential collisions to check, but has much less overhead.
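A minimal sketch of that prefix idea, working in whole bytes rather than bits. PrefixKey and FromStream are hypothetical names, and the prefix length is left to the caller. Files of the same format tend to start with identical header bytes, so a very short prefix would probably produce many false candidates.

```csharp
using System;
using System.IO;

public static class PrefixKey
{
    // Reads up to prefixLength bytes from the start of the stream and
    // returns them as a Base64 string to use as a cheap lookup key.
    public static string FromStream(Stream input, int prefixLength)
    {
        byte[] buffer = new byte[prefixLength];
        int total = 0;
        int read;

        // Stream.Read may return fewer bytes than requested, so keep reading
        // until the buffer is full or the stream ends.
        while (total < prefixLength &&
               (read = input.Read(buffer, total, prefixLength - total)) > 0)
        {
            total += read;
        }

        // Encode only the bytes that were actually read.
        return Convert.ToBase64String(buffer, 0, total);
    }
}
```

As with a real hash, files that share a key still need a full byte comparison before being treated as duplicates.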

Look in the System.Security.Cryptography namespace. You have your choice of several hashing algorithms and implementations. Here's an example using MD5, but since you have a lot of these you might want something bigger like SHA1:

public byte[] HashImage(Stream imageData)
{
    // Dispose the provider once the hash has been computed.
    using (var md5 = new MD5CryptoServiceProvider())
    {
        return md5.ComputeHash(imageData);
    }
}

I don't know if it already exists or not, but I can't think of a reason you can't do this yourself. Something similar to this will get you a hash of the file.

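var fileStream = Request.Files[0].InputStream; // the uploaded file
// Plain MD5 rather than HMACMD5: a parameterless HMACMD5 uses a randomly
// generated key, so its hashes would not match between uploads.
var hasher = System.Security.Cryptography.MD5.Create();
var theHash = hasher.ComputeHash(fileStream);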

See the System.Security.Cryptography namespace.

A keyword that might be of interest is perceptual hashing.
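Unlike MD5 or SHA1, a perceptual hash aims to give similar values for visually similar images, even when the bytes differ. One simple variant is the "average hash", sketched below. This is not necessarily what the answer above had in mind. The AverageHash class is a hypothetical name, and the sketch assumes the image has already been decoded, shrunk to 8x8 pixels and converted to grayscale by some other code.

```csharp
using System;

public static class AverageHash
{
    // Computes a 64-bit average hash from an 8x8 grid of grayscale values.
    // Assumes the caller already decoded, resized and grayscaled the image.
    public static ulong Compute(byte[,] gray8x8)
    {
        if (gray8x8.GetLength(0) != 8 || gray8x8.GetLength(1) != 8)
        {
            throw new ArgumentException("Expected an 8x8 grid.", "gray8x8");
        }

        int sum = 0;
        foreach (byte value in gray8x8)
        {
            sum += value;
        }
        double mean = sum / 64.0;

        // Set one bit per pixel: 1 if the pixel is brighter than the mean.
        ulong hash = 0;
        int bit = 0;
        foreach (byte value in gray8x8)
        {
            if (value > mean)
            {
                hash |= 1UL << bit;
            }
            bit++;
        }
        return hash;
    }

    // Number of differing bits; small distances suggest similar images.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong x = a ^ b;
        int count = 0;
        while (x != 0)
        {
            x &= x - 1; // clear the lowest set bit
            count++;
        }
        return count;
    }
}
```

Two images would then be treated as likely duplicates when the Hamming distance between their hashes falls below some threshold, which would have to be tuned on real data.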

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow