Question

I'm trying to implement a collision attack on hashes (I'm taking the course 'Cryptography'). I have two arrays of hashes (i.e. byte sequences, byte[]) and want to find the hashes that are present in both arrays. After some research and a lot of thinking, I am sure that the best solution on a single-core machine would be a HashSet: add all elements of the first array, then check via contains whether each element of the second array is already present.

However, I want to implement a concurrent solution, since I have access to a machine with 8 cores and 12 GB of RAM. The best solution I can think of is a concurrent hash set, which could be created via Collections.newSetFromMap(new ConcurrentHashMap<A,B>()). Using this data structure I could add all elements of the first array in parallel and, after all elements have been added, concurrently check via contains for identical hashes.
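
For concreteness, here is a minimal sketch of what I have in mind; the ByteBuffer wrapper and the class/method names are only illustrative, since a plain byte[] element in a set would be compared by reference rather than by content:

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetSketch {
    // Wrap each byte[] in a ByteBuffer so the set compares contents;
    // a raw byte[] element would only be compared by reference.
    static void findCollisions(byte[][] firstArray, byte[][] secondArray) {
        Set<ByteBuffer> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<ByteBuffer, Boolean>());
        for (byte[] hash : firstArray) {        // this loop would be split across threads
            seen.add(ByteBuffer.wrap(hash));
        }
        for (byte[] hash : secondArray) {       // and this one too
            if (seen.contains(ByteBuffer.wrap(hash))) {
                // collision: this hash occurs in both arrays
            }
        }
    }
}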

So my question is: do you know an algorithm designed for this exact problem? If not, do you have experience with such a concurrent hash set, in terms of pitfalls and practical runtime? Or can you recommend another prebuilt data structure which could help me?

PS: If anyone is interested in the details: I plan to use Skandium to parallelize my program.


Solution

I think it would be a complete waste of time to use any form of HashMap. I am guessing you are calculating multi-byte hashes of various data; these are already hashes, so there is no need to perform any more hashing on them.

Although you do not state it, I am guessing your hashes are byte sequences. Clearly either a trie or a DAWG (directed acyclic word graph) would be ideal for storing these.

I would therefore suggest you implement a trie/DAWG and use it to store all of the hashes in the first array. You could then use all of your computing power in parallel to look up each element of your second array in this trie. No locks would be required.

Added

Here's a simple Dawg implementation I knocked together. It seems to work.

public class Dawg {
  // All my children.
  Dawg[] children = new Dawg[256];
  // Am I a leaf.
  boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    Dawg loc = find ( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add(word.getBytes());
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte [] word ) {
    // Finds its location, no growing allowed.
    Dawg d = find ( word, 0, false );
    return d != null && d.isLeaf; 
  }

  // String form.
  public boolean contains ( String word ) {
    return contains(word.getBytes());
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private Dawg find ( byte [] word, int i, boolean grow ) {
    // Mask to 0..255 - byte is signed in Java, so hash bytes >= 0x80 would
    // otherwise produce a negative array index.
    int index = word[i] & 0xFF;
    Dawg child = children[index];
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new Dawg();
        children[index] = child;
      }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find(word, i+1, grow);
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    Dawg d = new Dawg();
    d.add("H");
    d.add("Hello");
    d.add("World");
    d.add("Hell");
    System.out.println("Hello is "+(d.contains("Hello")?"in":"out"));
    System.out.println("World is "+(d.contains("World")?"in":"out"));
    System.out.println("Hell is "+(d.contains("Hell")?"in":"out"));
    System.out.println("Hal is "+(d.contains("Hal")?"in":"out"));
    System.out.println("Hel is "+(d.contains("Hel")?"in":"out"));
    System.out.println("H is "+(d.contains("H")?"in":"out"));
  }
}
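
To connect this back to the original problem, usage would look roughly like the sketch below (it could live in Dawg's main, for instance). firstArray and secondArray are just placeholders for your two hash arrays; this part is single-threaded.

  // Rough usage sketch for the collision search (single-threaded).
  static void findCollisions ( byte[][] firstArray, byte[][] secondArray ) {
    Dawg dawg = new Dawg();
    for ( byte[] hash : firstArray ) {
      dawg.add( hash );              // store every hash from the first array
    }
    for ( byte[] hash : secondArray ) {
      if ( dawg.contains( hash ) ) {
        // hash occurs in both arrays - a collision candidate
      }
    }
  }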

Added

This could be a good start on a concurrent, lock-free version. These things are notoriously difficult to test, so I cannot guarantee it will work, but to my mind it certainly should.

import java.util.concurrent.atomic.AtomicReferenceArray;


public class LFDawg {
  // All my children.
  AtomicReferenceArray<LFDawg> children = new AtomicReferenceArray<LFDawg> ( 256 );
  // Am I a leaf. Volatile so a leaf flag set by one thread is visible to threads doing lookups.
  volatile boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    LFDawg loc = find( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add( word.getBytes() );
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte[] word ) {
    // Finds its location, no growing allowed.
    LFDawg d = find( word, 0, false );
    return d != null && d.isLeaf;
  }

  // String form.
  public boolean contains ( String word ) {
    return contains( word.getBytes() );
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private LFDawg find ( byte[] word, int i, boolean grow ) {
    // Mask to 0..255 - byte is signed in Java, so hash bytes >= 0x80 would
    // otherwise produce a negative index.
    int index = word[i] & 0xFF;
    LFDawg child = children.get( index );
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new LFDawg();
        if ( !children.compareAndSet( index, null, child ) ) {
          // Someone else got there before me. Get the one they set.
          child = children.get( index );
        }
      }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find( word, i + 1, grow );
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    LFDawg d = new LFDawg();
    d.add( "H" );
    d.add( "Hello" );
    d.add( "World" );
    d.add( "Hell" );
    System.out.println( "Hello is " + ( d.contains( "Hello" ) ? "in" : "out" ) );
    System.out.println( "World is " + ( d.contains( "World" ) ? "in" : "out" ) );
    System.out.println( "Hell is " + ( d.contains( "Hell" ) ? "in" : "out" ) );
    System.out.println( "Hal is " + ( d.contains( "Hal" ) ? "in" : "out" ) );
    System.out.println( "Hel is " + ( d.contains( "Hel" ) ? "in" : "out" ) );
    System.out.println( "H is " + ( d.contains( "H" ) ? "in" : "out" ) );
  }
}
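
As a rough sketch of how this could be driven in parallel for the original problem - the pool size, the array names and the one-task-per-hash granularity are assumptions on my part, not requirements of the structure:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCollisionSearch {
  // Sketch: build the LFDawg from firstArray in parallel, then probe it with secondArray in parallel.
  static void search ( final byte[][] firstArray, final byte[][] secondArray ) throws InterruptedException {
    final LFDawg dawg = new LFDawg();

    // Phase 1: parallel adds - safe because find() grows the tree with compareAndSet.
    ExecutorService addPool = Executors.newFixedThreadPool( 8 );
    for ( final byte[] hash : firstArray ) {
      addPool.execute( new Runnable() {
        public void run () { dawg.add( hash ); }
      } );
    }
    addPool.shutdown();
    addPool.awaitTermination( 1, TimeUnit.HOURS );

    // Phase 2: parallel lookups - read-only, so no coordination is needed.
    ExecutorService lookupPool = Executors.newFixedThreadPool( 8 );
    for ( final byte[] hash : secondArray ) {
      lookupPool.execute( new Runnable() {
        public void run () {
          if ( dawg.contains( hash ) ) {
            // hash occurs in both arrays - a collision candidate
          }
        }
      } );
    }
    lookupPool.shutdown();
    lookupPool.awaitTermination( 1, TimeUnit.HOURS );
  }
}

In practice you would submit one task per chunk of each array rather than one per hash, but the shape of the solution is the same.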

OTHER TIPS

A simpler approach would be to just split the first array into N equal (or near-equal) parts (with 8 cores, N = 8 seems reasonable). Then solve the problem in the "normal" way, by checking whether any hashes in the second array are present in each of the N smaller sub-arrays of the first array. This can be done in parallel.

That said, I've never heard of tries/dawgs before and I found the main discussion fascinating and informative. (I mainly work with numbers, not words)

This assumes that the byte[] hashes are of some finite, shortish length, so that you really can split up the original array and process the pieces in parallel. Is that the case?

EDIT ADDED

For an example of this idea, see GPU Computing Gems (edited by Wen-Mei W. Hwu), chapter 11, an article by Ligowski, Rudnicki, Liu and Schmidt. They parallelize a massive protein-sequence database search by splitting the huge single database into many smaller pieces, then running the normal algorithm on each piece. I like this quote: "The described algorithm is embarrassingly parallel." In their case they used CUDA and had to do a lot of memory optimization, but the principle should still apply to multi-core machines.

Semi-pseudocode follows. I'll use Lists for the incoming byte[] hashes; hope that's OK.

Original, 1 core method

originalProcess(List<byte[]> list1, List<byte[]> list2) {
   HashSet<byte[]> bigHugeHashOfList1 = new HashSet<byte[]>();
   bigHugeHashOfList1.addAll(list1);
   for (byte[] hash : list2) {
      if (bigHugeHashOfList1.contains(hash)) {
         // do something
      }
   }
}

New method. It calls the exact same originalProcess() method at the end. No DAWGs or tries here...

preprocess(List<byte[]> list1, List<byte[]> list2) {
   // Note: Java does not allow "new ArrayList<byte[]>[8]" (generic array creation),
   // so create a raw List[] and fill it with ArrayLists.
   List<byte[]>[] splitLists = new List[8];
   for (int i = 0; i < 8; i++)
      splitLists[i] = new ArrayList<byte[]>();
   for (byte[] hash : list1) {
      int idx = hash[0] & 7; // I'm taking the 3 low order bits, YMMV
      splitLists[idx].add(hash);
      // a minor speedup would be to create the HashSet here instead of in originalProcess()
   }

   // now, using your favorite parallel/concurrency technique,
   // do the equivalent of
   for (int i = 0; i < 8; i++)
      originalProcess(splitLists[i], list2);
}
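
For that parallel part, one possible sketch is a fixed thread pool with one task per sub-list; the pool size and names simply follow the pseudocode above, and ExecutorService/Executors come from java.util.concurrent.

// Sketch only: run originalProcess on each sub-list in its own task.
void parallelProcess(final List<byte[]>[] splitLists, final List<byte[]> list2) throws InterruptedException {
   ExecutorService pool = Executors.newFixedThreadPool(8); // one thread per core
   for (int i = 0; i < 8; i++) {
      final List<byte[]> subList = splitLists[i];
      pool.execute(new Runnable() {
         public void run() { originalProcess(subList, list2); }
      });
   }
   pool.shutdown();
   pool.awaitTermination(1, TimeUnit.HOURS); // wait for all 8 tasks to finish
}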