Question

let's say file A has the bytes:

2
5
8
0
33
90
1
3
200
201
23
12
55

and I have a simple hashing algorithm where I store the sum of the last three consecutive bytes so:

2   
5   
8   - = 8+5+2 = 15
0   
33  
90  - = 90+33+0 = 123
1   
3   
200 - = 204
201 
23  
12  - = 236

so I will be able to represent file A as 15, 123, 204, 236

let's say I copy that file to a new computer B and make some minor modifications, so the bytes of file B are:

255 
2   
5   
8   
0   
33  
90  
1   
3   
200 
201 
23  
12
55
255
255

"note the difference is an extra byte at the beginning of the file and 2 extra bytes at the end but the rest is very similar"

so I can perform the same algorithm to determine if some parts of the files are the same. Remember that file A was represented by the hash codes 15, 123, 204, 236. Let's see if file B gives me some of those hash codes!

so on file B I will have to do it over every 3 consecutive bytes:

int[] sums; // array where we will hold the running sums


255 sums[0]  =            255
2   sums[1]  =  2 + sums[0]    = 257
5   sums[2]  =  5 + sums[1]    = 262
8   sums[3]  =  8 + sums[2]    = 270   hash = sums[3]-sums[0]   = 15   --> MATCHES FILE A!
0   sums[4]  =  0 + sums[3]    = 270   hash = sums[4]-sums[1]   = 13
33  sums[5]  =  33 + sums[4]   = 303   hash = sums[5]-sums[2]   = 41
90  sums[6]  =  90 + sums[5]   = 393   hash = sums[6]-sums[3]   = 123  --> MATCHES FILE A!
1   sums[7]  =  1 + sums[6]    = 394   hash = sums[7]-sums[4]   = 124
3   sums[8]  =  3 + sums[7]    = 397   hash = sums[8]-sums[5]   = 94
200 sums[9]  =  200 + sums[8]  = 597   hash = sums[9]-sums[6]   = 204  --> MATCHES FILE A!
201 sums[10] =  201 + sums[9]  = 798   hash = sums[10]-sums[7]  = 404
23  sums[11] =  23 + sums[10]  = 821   hash = sums[11]-sums[8]  = 424
12  sums[12] =  12 + sums[11]  = 833   hash = sums[12]-sums[9]  = 236  --> MATCHES FILE A!
55  sums[13] =  55 + sums[12]  = 888   hash = sums[13]-sums[10] = 90
255 sums[14] =  255 + sums[13] = 1143  hash = sums[14]-sums[11] = 322
255 sums[15] =  255 + sums[14] = 1398  hash = sums[15]-sums[12] = 565

so from looking at that table I know that file B contains the bytes from file A plus some additional ones, because the hash codes match.

The reason I show this algorithm is that it is of order n. In other words, I was able to calculate the hash of the last 3 consecutive bytes without having to iterate through them!
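In C#, the trick looks something like this (a minimal sketch; the method name and types are mine, not part of the description above):

// O(n): one pass to build the prefix sums, then one subtraction per window
static long[] WindowHashes(byte[] data, int windowSize)
{
    var prefix = new long[data.Length + 1]; // prefix[i] = data[0] + ... + data[i-1]
    for (int i = 0; i < data.Length; i++)
        prefix[i + 1] = prefix[i] + data[i];

    var hashes = new long[data.Length - windowSize + 1];
    for (int i = 0; i < hashes.Length; i++)
        hashes[i] = prefix[i + windowSize] - prefix[i]; // sum of data[i .. i+windowSize-1], no inner loop
    return hashes;
}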

If I were to use a more complex algorithm, such as taking the MD5 of the last 3 bytes, it would be of order n*m (where m is the window size): as I iterate through file B, I would need an inner loop that recomputes the hash of the last m bytes at every position.

So my question is:

How can I improve the algorithm while keeping it of order n, that is, computing each hash just once? If I use an existing hashing algorithm such as MD5, I will have to place an inner loop inside the algorithm that will significantly increase its order.

Note that it is possible to do the same thing with multiplication instead of addition, but the running product grows really fast. Maybe I can combine multiplication, addition and subtraction...

Edit

Also if I google for:

recursive hashing functions for n-grams

a lot of information comes up and I think those algorithms are very difficult to understand...

I have to implement this algorithm for a project; that's why I am reinventing the wheel... I know there are a lot of algorithms out there.

Also, an alternative solution I was thinking of is to perform the same algorithm plus another one that is strong: on file A I will perform the rolling algorithm every 3 bytes plus an MD5 of each 3-byte chunk, and on the second file I will only run the strong algorithm when the first one matches...
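Here is a sketch of that two-tier idea in C# (the helper's shape is mine; it assumes the cheap rolling sums of both chunks have already been computed):

using System.Security.Cryptography;

// only run the expensive MD5 comparison when the cheap sums already agree
static bool ChunkMatches(byte[] a, int offsetA, byte[] b, int offsetB,
                         int size, long sumA, long sumB)
{
    if (sumA != sumB)
        return false; // the O(1) filter rejects almost every position

    using (var md5 = MD5.Create()) // strong confirmation, only for candidates
    {
        byte[] hashA = md5.ComputeHash(a, offsetA, size);
        byte[] hashB = md5.ComputeHash(b, offsetB, size);
        for (int i = 0; i < hashA.Length; i++)
            if (hashA[i] != hashB[i])
                return false;
        return true;
    }
}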


Solution

Edit:

The more I think about what you meant by "recursive", the more I doubt the solution I presented earlier is what you should implement to do anything useful.

You probably want to implement a hash tree algorithm, which is a recursive operation.

To do this you hash the list, divide the list in two, and recurse into those two sub-lists. Terminate either when your list is size 1, or a minimum desired hash size, as each level of recursion will double the size of your total hash output.

Pseudo-code:

create-hash-tree(input list, minimum size: default = 1):
  initialize the output list
  hash-sublist(input list, output list, minimum size)
  return output list

hash-sublist(input list, output list, minimum size):
  add sum-based-hash(list) result to output list // easily swap hash styles here
  if size(input list) > minimum size:
    split the list into two halves
    hash-sublist(first half of list, output list, minimum size)
    hash-sublist(second half of list, output list, minimum size)

sum-based-hash(list):
  initialize the running total to 0

  for each item in the list:
    add the current item to the running total

  return the running total
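A C# rendering of that pseudo-code might look like this (a sketch; it uses offsets into the original array rather than allocating sub-lists, and the method names are mine):

using System.Collections.Generic;

static List<long> CreateHashTree(byte[] input, int minimumSize = 1)
{
    var output = new List<long>();
    HashSublist(input, 0, input.Length, output, minimumSize);
    return output;
}

static void HashSublist(byte[] input, int start, int length,
                        List<long> output, int minimumSize)
{
    output.Add(SumBasedHash(input, start, length)); // easily swap hash styles here
    if (length > minimumSize)
    {
        int half = length / 2; // split the list into two halves and recurse
        HashSublist(input, start, half, output, minimumSize);
        HashSublist(input, start + half, length - half, output, minimumSize);
    }
}

static long SumBasedHash(byte[] input, int start, int length)
{
    long total = 0;
    for (int i = start; i < start + length; i++)
        total += input[i];
    return total;
}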

The running time of this whole algorithm, I think, is O(hash(m)), where m = n * (log(n) + 1), with hash(m) usually taking linear time.

The storage space is something like O(hash * s), where s = 2n - 1, with the hash usually being constant size.

Note that for C#, I'd make the output list a List<HashType>, but I'd make the input list an IEnumerable<ItemType> to save on storage space, and use LINQ to quickly "split" the list without allocating two new sub-lists.

Original:

I think you can get this to be O(n + m) execution time; where n is the size of the list, m is the size of the running tally, and m < n (otherwise all sums would be equal).

With Double-Ended Queue

The memory consumption will be the stack size, plus size m for temporary storage.

To do this, use a double-ended queue and a running total. Push newly encountered values onto the back of the queue while adding them to the running total, and when the queue reaches size m, pop values off the front and subtract them from the running total.

Here's some pseudo-code:

initialize the running total to 0

for each item in the list:
  add the current item to the running total
  push the current value onto the back of the deque
  if deque.length > m:
    pop off the front of the deque
    subtract the popped value from the running total
  assign the running total to the current sum slot in the list

reset the index to the beginning of the list

while the deque isn't empty:
  add the item in the list at the current index to the running total
  pop off the front of the deque
  subtract the popped value from the running total
  assign the running total to the current sum slot in the list
  increment the index

This is not recursive; it is iterative.
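In C# the main loop might look like this (a sketch; since only front-pops and back-pushes are needed, Queue&lt;T&gt; stands in for the deque):

using System.Collections.Generic;

static long[] RollingSums(byte[] data, int m)
{
    var sums = new long[data.Length];
    var window = new Queue<byte>();
    long total = 0;

    for (int i = 0; i < data.Length; i++)
    {
        total += data[i];
        window.Enqueue(data[i]);
        if (window.Count > m)
            total -= window.Dequeue(); // drop the value that just left the window
        sums[i] = total; // sum of the last (up to m) bytes
    }
    return sums;
}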

A run of this algorithm looks like this (for m = 3):

value   sum slot   overwritten sum slot
2       2          69
5       7          62
8       15         15
0       13
33      41
90      123
1       124
3       94
200     204
201     404
23      424
12      236
55      90

With indexing

You can remove the queue and the re-assignment of any slots by taking a sum of the last m values to start with, and using an offset of your index instead of popping off a deque, e.g. array[i - m].

This won't decrease your execution time as you still have to have two loops, one to build up the running tally, and another to populate all the values. But it will decrease your memory usage to stack-space only (effectively O(1)).

Here's some pseudo-code:

initialize the running total to 0

for the last m items in the list:
  add those items to the running total

for each item in the list:
  add the current item to the running total
  subtract the value of the item m slots earlier from the running total
  assign the running total to the current sum slot in the list

The "m slots earlier" part is the tricky bit. You can either split this up into two loops:

  • One that indexes from the end of the list, minus m, plus i
  • One that indexes from i minus m

Or you can use modulo arithmetic to "wrap" the value around when i - m < 0:

int valueToSubtract = array[((i - m) % n + n) % n]; // stays non-negative even when i - m < 0 (C#'s % can return negative values)
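Put together, the indexing variant might look like this in C# (a sketch of the pseudo-code above, assuming m < n, with the wrap-around correction applied):

static long[] IndexedRollingSums(byte[] array, int m)
{
    int n = array.Length;
    var sums = new long[n];
    long total = 0;

    for (int i = n - m; i < n; i++) // seed the total with the last m items
        total += array[i];

    for (int i = 0; i < n; i++)
    {
        total += array[i];
        total -= array[((i - m) % n + n) % n]; // the item m slots earlier, wrapped
        sums[i] = total;
    }
    return sums;
}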

OTHER TIPS

The Rabin–Karp algorithm (http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm) uses an updatable hash function, which it calls a rolling hash (http://en.wikipedia.org/wiki/Rolling_hash). This is going to be a lot easier to compute than MD5/SHA, and may not be inferior.

You can prove some things about it: the hash is a polynomial of degree d in a chosen constant a. Suppose somebody provides two stretches of text and you choose a at random. What is the probability of a collision? Well, if the hash values are the same, subtracting the two polynomials gives you a non-zero polynomial with a as a root. Since a non-zero polynomial of degree d has at most d roots, and a was chosen at random, the probability is at most d / modulus, which will be very small for large moduli.
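For concreteness, here is a sketch of such a polynomial rolling hash in C# (the base and modulus are illustrative choices of mine, not prescribed above):

const long A = 256;        // illustrative base
const long P = 1000000007; // illustrative large prime modulus

// hash of the first m bytes: b[0]*A^(m-1) + b[1]*A^(m-2) + ... + b[m-1]  (mod P)
static long InitialHash(byte[] data, int m)
{
    long h = 0;
    for (int i = 0; i < m; i++)
        h = (h * A + data[i]) % P;
    return h;
}

// slide the window one byte in O(1); aPowM1 is A^(m-1) mod P, precomputed
static long Roll(long h, byte outgoing, byte incoming, long aPowM1)
{
    h = (h + P - outgoing * aPowM1 % P) % P; // remove the leading term
    return (h * A + incoming) % P;           // shift and append the new byte
}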

Of course MD5/SHA is secure, but see http://cr.yp.to/mac/poly1305-20050329.pdf for a secure variant.

That's what I've got so far. I am just missing the steps that shouldn't take much time, like comparing the arrays of hashes and opening the files for reading.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace RecursiveHashing
{
    static class Utilities
    {

        // Used for circular arrays. If my circular array is of size 5 and its
        // current position is 2, then shifting 3 units to the left should put
        // me at index 4 of the array.
        public static int Shift(this int number, int shift, int divisor)
        {
            var tempa = (number + shift) % divisor;
            if (tempa < 0)
                tempa = divisor + tempa;
            return tempa;
        }

    }
    class Program
    {
        const int CHUNCK_SIZE = 4; // split the files in chunks of 4 bytes

        /* 
         * formula that I will use to compute the hash
         * 
         *      formula =  sum(chunk) * (a[c]+1)*(a[c-1]+1)*(a[c-2]+1)*(-1^a[c])
         *      
         *          where:
         *              sum(chunk)  = sum of the current chunk
         *              a[c]        = current byte
         *              a[c-1]      = last byte
         *              a[c-2]      = byte before the last byte
         *              -1^a[c]     = either -1 or +1  
         *              
         *      this formula is efficient because I can get the sum of any chunk in O(1)
         *      by keeping track of the overall (prefix) sum,
         *      thus this algorithm should be of order n
         */

        static void Main(string[] args)
        {
            Part1(); // Missing implementation to open file for reading
            Part2();
        }



        // first part: compute the hashes of the first file
        static void Part1()
        {
            // pretend the file read returned these bytes
            byte[] FileB = new byte[]{2,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11,};

            // create an array in which to store the hashes
            // index 0 will use a fast hash algorithm; index 1 will use a more secure hashing algorithm
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            // used to track what index of the file we are at
            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array needed to remember the last few bytes
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array needed to remember the last sums
            int index = 0; // position within the circular arrays

            int numberOfHashes = 0; // number of hashes created so far


            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    if (counter % CHUNCK_SIZE == 0 || counter == FileB.Length)
                    {
                        // get the sum of the last chunk
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);
                        Int64 tempHash = (Int64)a;

                        // compute my hash function
                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));


                        // add the hashes to the array: column 0 holds the fast
                        // hash, column 1 will later hold a stronger hash
                        hashes[numberOfHashes, 0] = tempHash;
                        hashes[numberOfHashes, 1] = -1; // MISSING IMPLEMENTATION TO STORE A SECOND STRONGER HASH FUNCTION
                        numberOfHashes++;

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1); // if index is out of bounds in circular array place it at position 0
                }
            }
        }


        static void Part2()
        {
            // simulate file read of a similar file
            byte[] FileB = new byte[]{1,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11};            

            // where we will store all matching hashes
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array
            int index = 0; // position within the circular arrays



            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    // here we compute the hash at every position; the implementation
                    // that checks whether the hash appears in the other file's hashes is missing
                    if (counter >= CHUNCK_SIZE)
                    {
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);

                        Int64 tempHash = (Int64)a;

                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1);
                }
            }
        }
    }
}

The same files represented in tables using the same algorithm:

                                                    hashes
bytes   running sum  Ac  A[c-1]  A[c-2]  -1^Ac   hash = chunk sum * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1^Ac)
2       2                   
3       5                   
5       10  5   3   2   -1  
8       18  8   5   3   1   3888
2       20  2   8   5   1   
0       20  0   2   8   1   
1       21  1   0   2   -1  
0       21  0   1   0   1   6
0       21  0   0   1   1   
0       21  0   0   0   1   
1       22  1   0   0   -1  
2       24  2   1   0   1   18
4       28  4   2   1   1   
5       33  5   4   2   -1  
6       39  6   5   4   1   
7       46  7   6   5   -1  -7392
8       54  8   7   6   1   
2       56  2   8   7   1   
3       59  3   2   8   -1  
4       63  4   3   2   1   1020
5       68  5   4   3   -1  
6       74  6   5   4   1   
7       81  7   6   5   -1  
8       89  8   7   6   1   13104
11      100 11  8   7   -1  -27648






file B
                                                 rolling hashes
bytes   running sum  Ac  A[c-1]  A[c-2]  -1^Ac   hash = chunk sum * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1^Ac)
1       1                   
3       4                   
5       9   5   3   1   -1  
8       17  8   5   3   1   3672
2       19  2   8   5   1   2916
0       19  0   2   8   1   405
1       20  1   0   2   -1  -66
0       20  0   1   0   1   6
0       20  0   0   1   1   2
0       20  0   0   0   1   1
1       21  1   0   0   -1  -2
2       23  2   1   0   1   18
4       27  4   2   1   1   210
5       32  5   4   2   -1  -1080
6       38  6   5   4   1   3570
7       45  7   6   5   -1  -7392
8       53  8   7   6   1   13104
2       55  2   8   7   1   4968
3       58  3   2   8   -1  -2160
4       62  4   3   2   1   1020
5       67  5   4   3   -1  -1680
6       73  6   5   4   1   3780
7       80  7   6   5   -1  -7392
8       88  8   7   6   1   13104
11      99  11  8   7   -1  -27648