Optimizing Long.bitCount

https://stackoverflow.com/questions/4839128

27-10-2019

Question

I have a program that is making a huge number of calls to Long.bitCount(), so many that it is taking 33% of cycles on one CPU core. Is there a way to implement it that is faster than the Sun JDK version?

I have tried:

This algorithm (I think this is exactly how the JDK implements it)
lookup tables of various sizes between 2⁸ and 2²² (looking at a few bits at a time and adding the results)

But I couldn't do any better than a 2¹⁶-entry lookup table with a manually-unrolled loop (about 27% CPU.)
How else might this be optimized for Java?
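
For reference, a 2¹⁶-entry table version with the loop unrolled by hand might look roughly like the sketch below. The names TablePopCount and TABLE are illustrative only and are not taken from the original program.

```java
final class TablePopCount {
    // 2^16-entry table: TABLE[i] holds the number of set bits in i (0..16).
    private static final byte[] TABLE = new byte[1 << 16];

    static {
        // bits(i) = bits(i / 2) + lowest bit of i
        for (int i = 1; i < TABLE.length; i++) {
            TABLE[i] = (byte) (TABLE[i >>> 1] + (i & 1));
        }
    }

    // Manually unrolled: one lookup per 16-bit slice of the 64-bit word.
    static int bitCount(long x) {
        return TABLE[(int) (x & 0xffff)]
             + TABLE[(int) ((x >>> 16) & 0xffff)]
             + TABLE[(int) ((x >>> 32) & 0xffff)]
             + TABLE[(int) (x >>> 48)];
    }
}
```

The table occupies 64 KB, so it competes with the program's own data for cache space.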

Note: this question is about Java-specific optimization, but this similar (language-agnostic) question has many other algorithms.

Solution

If you are on a recent x86 CPU there is an instruction for this, popcnt.

In recent versions of Java, HotSpot compiles Long.bitCount() to this instruction. The relevant flag is -XX:+UsePopCountInstruction, which is the default in recent versions.

However, there are some bugs with it in JRE 6.0_u18 through 7.0_u5: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7063674
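
As a minimal sketch of what this means in practice, plain calls to Long.bitCount() are all that is needed in the source; the JIT decides whether to emit popcnt. The class name PopCountCheck is made up for illustration, and whether the intrinsic is actually used depends on the JVM version and the CPU.

```java
final class PopCountCheck {
    // Sums the set bits of every word using the JDK method.
    static long totalBits(long[] words) {
        long total = 0;
        for (long w : words) {
            total += Long.bitCount(w);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] words = {0L, 1L, -1L, 0x00ff00ff00ff00ffL};
        System.out.println(totalBits(words));
    }
}
```

Timing a loop like this once with -XX:+UsePopCountInstruction and once with -XX:-UsePopCountInstruction is one way to see whether the instruction makes a difference on a given machine.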

OTHER TIPS

This seems like one of those problems that is simply perfect for the GPU to work on. It should be able to slash your time by a couple orders of magnitude.

Otherwise I think you may have to deal with it at a higher level. Having multiple threads working on different segments of data at a time (which I'm sure you already do), processing the data while you are collecting it, sharing the work around multiple systems--something like that.
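
Splitting the work across threads can be sketched with a parallel stream, assuming the values are already sitting in a long[] (the class name ParallelPopCount is hypothetical):

```java
import java.util.Arrays;

final class ParallelPopCount {
    // Counts set bits across the whole array, letting the common
    // fork/join pool split the array into segments.
    static long totalBits(long[] words) {
        return Arrays.stream(words)
                     .parallel()
                     .map(w -> Long.bitCount(w))
                     .sum();
    }
}
```

For small arrays the cost of splitting and joining is likely to outweigh the gain, so this only makes sense for large inputs.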

If your machine has an integer ALU that can process data wider than 64 bits, in multiples of 64 (also known as SIMD, such as SSE2 or VMX), you can compute the bit counts of several 64-bit elements at once.

Unfortunately, this will require you to provide machine-specific implementations in a lower-level language than Java.

I suspect that your app is memory-bound rather than CPU-bound, i.e. it spends more time fetching the values from memory than counting their bits. In that case you should try to reduce the size of the working set or improve access locality to reduce cache misses (if the algorithm allows it).

I'm no expert in the subject, but in case you haven't seen these pages, they may help:

http://www.reddit.com/r/programming/comments/84sht/fast_bit_couting_algorithms/

http://www-graphics.stanford.edu/~seander/bithacks.html

You may also want to poke around the many graphics libraries out there, especially those that are lower-level and/or speak directly to hardware.

EDIT: looks like you can use the relatively newly introduced POPCNT instruction (available on some recent AMD and Intel processors) for a potential speed increase, if you have the option to write low-level platform-specific code, and can target that specific architecture. http://kent-vandervelden.blogspot.com/2009/10/counting-bits-population-count-and.html and another article with benchmarks: http://www.strchr.com/crc32_popcnt

From my understanding:

I would use the 33% as an indicator only, since profiling small methods can really change the overall performance. So I would run the algorithm on a big dataset and measure the total time, and I would judge the efficiency of each optimization by how that total time changes. I would also include a warm-up phase so that the JIT can do its optimisations.
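
A rough sketch of that kind of measurement, with a warm-up phase followed by a timed phase over a large array, could look like this. The class name, run counts, and data size are arbitrary choices for illustration; a dedicated harness such as JMH would give more reliable numbers.

```java
import java.util.Random;

final class TotalTimeBenchmark {
    // The work being measured: count the bits of every word.
    static long countAll(long[] data) {
        long total = 0;
        for (long w : data) {
            total += Long.bitCount(w);
        }
        return total;
    }

    // Runs warm-up passes first so the JIT can compile countAll,
    // then returns the elapsed nanoseconds for the measured passes.
    static long timeNanos(long[] data, int warmupRuns, int measuredRuns) {
        long sink = 0;
        for (int i = 0; i < warmupRuns; i++) {
            sink += countAll(data);
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            sink += countAll(data);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {
            System.out.println(sink); // keeps the result live so the loops are not removed
        }
        return elapsed;
    }

    public static void main(String[] args) {
        Random random = new Random(1);
        long[] data = new long[1 << 20];
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextLong();
        }
        System.out.println(timeNanos(data, 20, 100) / 1_000_000 + " ms");
    }
}
```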

In fact, bit counting seems to be one of the key parts of your algorithm anyway. If you optimize everything and manage to make every key part 10 times faster, the profiler will still show something near 33% for this part. That is not bad in itself.

Taking inspiration from this link, http://bmagic.sourceforge.net/bmsse2opt.html, you could try to use the SSE instructions that are present in all Intel/AMD processors now, if I remember right (you could always fall back to Java otherwise). An interesting point in the article is that, most of the time, the work is memory-bound anyway. But I would still try to see how this could work for you.

A GPU would be a perfect fit for insanely fast processing (easily a hundred times that of a single CPU core) and bandwidth. The main problem would be pushing the data to the GPU's dedicated memory and getting the result back. But if you perform more operations than just bit counting, this could bring huge gains.

There is no shortcut anyway: you must try several approaches and see which brings the most gain. Don't go by the percentage, go by the total time spent.

I am now using this method, which interleaves four popcnt operations at a time. It is based on this C implementation.

private static final long M0=0x5555555555555555L,
                          M1=0x3333333333333333L,
                          M2=0x0f0f0f0f0f0f0f0fL;
public void store4Tags(long tag0, long tag1, long tag2, long tag3) {
    long count0 = tag0,
         count1 = tag1,
         count2 = tag2,
         count3 = tag3;
    count0 = (count0 & M0) + ((count0 >>> 1) & M0);
    count1 = (count1 & M0) + ((count1 >>> 1) & M0);
    count2 = (count2 & M0) + ((count2 >>> 1) & M0);
    count3 = (count3 & M0) + ((count3 >>> 1) & M0);

    count0 = (count0 & M1) + ((count0 >>> 2) & M1);
    count1 = (count1 & M1) + ((count1 >>> 2) & M1);
    count2 = (count2 & M1) + ((count2 >>> 2) & M1);
    count3 = (count3 & M1) + ((count3 >>> 2) & M1);

    count0 = (count0 + (count0 >>> 4)) & M2;
    count1 = (count1 + (count1 >>> 4)) & M2;
    count2 = (count2 + (count2 >>> 4)) & M2;
    count3 = (count3 + (count3 >>> 4)) & M2;

    count0 += count0 >>> 8;
    count1 += count1 >>> 8;
    count2 += count2 >>> 8;
    count3 += count3 >>> 8;

    count0 += count0 >>> 16;
    count1 += count1 >>> 16;
    count2 += count2 >>> 16;
    count3 += count3 >>> 16;

    count0 += count0 >>> 32;
    count1 += count1 >>> 32;
    count2 += count2 >>> 32;
    count3 += count3 >>> 32;

    // Mask with 0x7f, not 0x3f: a tag with all 64 bits set has a count of 64 (0x40).
    storeWithPopCnt(tag0, 0x7f & (int) count0);
    storeWithPopCnt(tag1, 0x7f & (int) count1);
    storeWithPopCnt(tag2, 0x7f & (int) count2);
    storeWithPopCnt(tag3, 0x7f & (int) count3);
}

This outperforms the lookup table version slightly, and consumes no cache.
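
A single-value version of the same steps is convenient for checking the arithmetic against Long.bitCount(). This is only a sketch: the class name SwarPopCount is made up, and the final mask is 0x7f because a word with all 64 bits set needs seven bits for its count.

```java
final class SwarPopCount {
    private static final long M0 = 0x5555555555555555L,
                              M1 = 0x3333333333333333L,
                              M2 = 0x0f0f0f0f0f0f0f0fL;

    static int bitCount(long x) {
        x = (x & M0) + ((x >>> 1) & M0);   // counts per 2-bit field
        x = (x & M1) + ((x >>> 2) & M1);   // counts per 4-bit field
        x = (x + (x >>> 4)) & M2;          // counts per byte (0..8)
        x += x >>> 8;                      // fold bytes together;
        x += x >>> 16;                     // only the low byte is
        x += x >>> 32;                     // meaningful afterwards
        return (int) x & 0x7f;             // result is in 0..64
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, Long.MIN_VALUE, 0x123456789abcdef0L};
        for (long s : samples) {
            System.out.println(bitCount(s) == Long.bitCount(s));
        }
    }
}
```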

Rather than optimise this function, you are likely to be better off optimising the usage of this function. E.g. you could keep a counter.

public void set(int n) {
   if(!get(n)) bitCount++;
   // set the bit
}
public void clear(int n) {
   if(get(n)) bitCount--;
   // clear the bit
}
public int bitCount() {
   return bitCount;
}

This avoids scanning the data by keeping track of the number of bits set. It moves the overhead to the operations that set or clear bits, and makes getting the number of set bits trivial. It appears that in your use case the latter happens much more often.
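
Filled in as a complete class, the idea might look like the sketch below. The class name CountingBitSet and the long[] backing store are assumptions made for illustration; the original snippet does not say how the bits are stored.

```java
final class CountingBitSet {
    private final long[] words;   // 64 bits per element
    private int bitCount;         // number of bits currently set

    CountingBitSet(int nBits) {
        words = new long[(nBits + 63) >>> 6];
    }

    boolean get(int n) {
        // 1L << n uses only the low 6 bits of n as the shift distance
        return (words[n >>> 6] & (1L << n)) != 0;
    }

    void set(int n) {
        if (!get(n)) {
            bitCount++;
            words[n >>> 6] |= 1L << n;
        }
    }

    void clear(int n) {
        if (get(n)) {
            bitCount--;
            words[n >>> 6] &= ~(1L << n);
        }
    }

    // O(1): no scan of the words is needed.
    int bitCount() {
        return bitCount;
    }
}
```

This only helps when bits are changed one at a time through set and clear; bulk operations on whole words would still need a real population count.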

Licensed under: CC-BY-SA with attribution