Question

I wanted to try combining the use of memcmp with multithreading.

The code below is the fastest I've benchmarked so far, but I wanted to check whether I could accelerate the comparison further.

Thoughts I had:

1) Multithreading it.

2) Another idea: in the case of a byte[], is it possible to check whether the arrays are 80% to 100% the same (if that is possible at all)? Would having that option reduce the computation time?

The first question has the higher priority, if I must choose.

    // memcmp needs this P/Invoke declaration to compile
    [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern int memcmp(byte[] b1, byte[] b2, long count);

    public static bool ByteArrayCompare(byte[] b1, byte[] b2)
    {
        return b1.Length == b2.Length && memcmp(b1, b2, b1.Length) == 0;
    }

Solution

You might speed it up by multithreading: simply launch N threads and have each thread compare 1/N of your array. HOWEVER, this only makes sense on a multicore machine. Also keep in mind that spawning new threads and collecting their results puts a fixed time penalty on the operation, which might in fact be greater than just doing it on a single thread. Another thing to consider is that the whole operation is as slow as the slowest thread, so if you spawn too many threads you will have to wait until the last one completes.
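For illustration, here is a minimal sketch of that partitioned approach, assuming a modern .NET runtime. It compares each slice with ReadOnlySpan&lt;byte&gt;.SequenceEqual rather than memcmp, because the simple memcmp P/Invoke signature cannot start at an offset inside an array; the thread count of 4 and the class name PartitionedComparer are illustrative choices, not tuned values.

    using System;
    using System.Threading;

    public static class PartitionedComparer
    {
        public static bool ParallelCompare(byte[] b1, byte[] b2, int threadCount = 4)
        {
            if (b1.Length != b2.Length) return false;
            if (b1.Length == 0) return true;

            int chunk = (b1.Length + threadCount - 1) / threadCount;
            var results = new bool[threadCount];
            var threads = new Thread[threadCount];

            for (int i = 0; i < threadCount; i++)
            {
                int t = i; // capture a fresh copy for the closure
                threads[t] = new Thread(() =>
                {
                    int offset = Math.Min(t * chunk, b1.Length);
                    int length = Math.Min(chunk, b1.Length - offset);
                    // Each thread compares only its own slice of the arrays.
                    results[t] = b1.AsSpan(offset, length)
                                   .SequenceEqual(b2.AsSpan(offset, length));
                });
                threads[t].Start();
            }

            foreach (var thread in threads) thread.Join();
            return Array.TrueForAll(results, ok => ok);
        }
    }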

You might get extra clever and consider a more devious scheme, in which a thread that detects the data is not the same signals the master thread, which can then safely terminate the rest of the running threads early.
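One way to sketch that early-exit signalling, assuming .NET's Task Parallel Library is available, is Parallel.For with ParallelLoopState.Stop(): the first worker to find a differing block stops the loop, instead of hand-rolling the master-thread signalling. The 4 KB block size is an arbitrary illustrative value.

    using System;
    using System.Threading.Tasks;

    public static class EarlyExitComparer
    {
        public static bool CompareWithEarlyExit(byte[] b1, byte[] b2, int blockSize = 4096)
        {
            if (b1.Length != b2.Length) return false;

            int blocks = (b1.Length + blockSize - 1) / blockSize;
            bool equal = true;

            // Parallel.For hands out blocks to worker threads; state.Stop()
            // plays the role of signalling the master thread, so the
            // remaining workers quit as soon as any block differs.
            Parallel.For(0, blocks, (i, state) =>
            {
                int offset = i * blockSize;
                int length = Math.Min(blockSize, b1.Length - offset);
                if (!b1.AsSpan(offset, length).SequenceEqual(b2.AsSpan(offset, length)))
                {
                    equal = false; // benign race: every writer stores false
                    state.Stop();
                }
            });

            return equal;
        }
    }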

The actual speedup depends on how large the data being compared is (I am assuming it is pretty large, since you are considering this) and on how often the data turns out to be the same or different. If your data differs often enough, you will get a considerable speedup.

As for your second idea: yes, you can come up with a scheme that gets you an answer faster. You can randomly sample both arrays, and once you find at least one byte that differs you can declare the datasets different and terminate early. However, again, you should be careful how you implement this. First, you should sample and compare in blocks, not individual bytes; look into your hardware's cache and prefetch behavior to determine the optimal block size. Also, if you know something about your data, try to sample the areas that are more likely to differ first. You might also get better results by sampling at a predictable pattern rather than randomly: that way, if you fail to find a difference during sampling, you can compare the rest of the data without re-comparing what you already processed.
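A rough sketch of the block-sampling idea under the same .NET assumptions, using evenly spaced probes as the predictable pattern. Note that this simplified version re-compares the sampled blocks during the final full pass; skipping them, as suggested above, would be a further refinement. The block size and probe count are illustrative, not tuned.

    using System;

    public static class SampledComparer
    {
        public static bool SampledCompare(byte[] b1, byte[] b2,
                                          int blockSize = 512, int probes = 16)
        {
            if (b1.Length != b2.Length) return false;
            if (b1.Length == 0) return true;

            // Phase 1: probe evenly spaced blocks. On "mostly different"
            // data this catches a mismatch after touching only a fraction.
            int stride = Math.Max(blockSize, b1.Length / probes);
            for (int offset = 0; offset < b1.Length; offset += stride)
            {
                int length = Math.Min(blockSize, b1.Length - offset);
                if (!b1.AsSpan(offset, length).SequenceEqual(b2.AsSpan(offset, length)))
                    return false;
            }

            // Phase 2: no sampled block differed, so a full pass is still
            // needed to prove true equality.
            return b1.AsSpan().SequenceEqual(b2.AsSpan());
        }
    }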

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow