How to count the number of set bits in a 32-bit integer?

https://stackoverflow.com/questions/109023

01-07-2019

Question

The 8-bit representation of the number 7 looks like this:

00000111

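Three bits are set.

What algorithms are there to determine the number of set bits in a 32-bit integer?

Solution

This is known as the 'Hamming weight', 'popcount' or 'sideways addition'.

The "best" algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions (like x86's popcnt, on CPUs where it is supported) will almost certainly be fastest. Some other architectures may have a slow instruction implemented with a microcoded loop that tests one bit per cycle (citation needed).

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these operations in a tight loop. However, it can suffer from the expense of a "cache miss", where the CPU has to fetch some of the table from main memory.

If you know that your bytes will be mostly 0s or mostly 1s, then there are very efficient algorithms for these scenarios.

I believe a very good general-purpose algorithm is the following, known as the "parallel" or "variable-precision SWAR algorithm". I have expressed this in a C-like pseudo language; you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):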

int numberOfSetBits(int i)
{
     // Java: use >>> instead of >>
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

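This has the best worst-case behaviour of any of the algorithms discussed, so it will efficiently deal with any usage pattern or values you throw at it.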
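As a minimal sketch of the adjustment mentioned in the code comments, the same routine can be written in C on an explicitly unsigned 32-bit type, so that every right shift is a logical shift. The function name popcount32_swar is only an illustrative choice, and the final multiply-and-shift is split into two statements for readability.

```c
#include <stdint.h>

/* SWAR popcount on uint32_t; the name is hypothetical. */
static unsigned popcount32_swar(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);                 /* count per 2-bit field */
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u); /* count per 4-bit field */
    i = (i + (i >> 4)) & 0x0F0F0F0Fu;                 /* count per byte        */
    return (unsigned)((i * 0x01010101u) >> 24);       /* sum of the four bytes */
}
```

Because the type is unsigned, an input with the top bit set (for example 0x80000000) is handled without relying on implementation-defined signed shifts.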

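This bitwise-SWAR algorithm could be parallelized to run in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually to use a variable shuffle to do a table lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16-entry table held in a vector register.)

On Intel CPUs, the hardware 64-bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

References: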

https://graphics.stanford.edu/~seander/bithacks.html

https://en.wikipedia.org/wiki/Hamming_weight

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)

Other tips

Also consider the built-in functions of your compilers.

On the GNU compiler, for example, you can just use:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a CPU instruction to do the same job faster.
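A hedged sketch of how the built-in might be wrapped so the code still compiles elsewhere: count_bits is a made-up wrapper name, and the fallback loop is just one simple choice for compilers that do not define __GNUC__.

```c
/* Hypothetical wrapper: uses the GCC/Clang built-in when available,
 * otherwise falls back to clearing the lowest set bit in a loop. */
static unsigned count_bits(unsigned x)
{
#if defined(__GNUC__)
    return (unsigned)__builtin_popcount(x);
#else
    unsigned n = 0;
    for (; x != 0; x &= x - 1)   /* each pass removes one set bit */
        ++n;
    return n;
#endif
}
```

Whether the built-in turns into a single instruction or a library call depends on the target options discussed next.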

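The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have had popcount for years.

On x86, you can tell the compiler that it can assume support for the popcnt instruction with -mpopcnt, or with -msse4.2 to also enable the vector instructions that were added in the same generation. See the GCC x86 options. -march=nehalem (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc's it is really an intrinsic for the hardware instruction and requires hardware support.

Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, although technically there is a separate feature bit for popcnt.)

But at least you get something portable that works everywhere, and with gcc/clang and the right target options, you get hardware popcount on architectures that support it.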

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See the asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret
unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax
    ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

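This source isn't x86-specific or GNU-specific at all, but it only compiles well for x86 with gcc/clang/icc.

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution, which some have already provided, but I prefer readability over cleverness.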

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

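These rely on specific data type sizes, though, so they're not that portable. But since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.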
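The listing above elides most of the 256 table rows. One way to avoid typing them out, sketched here under the assumption that a one-time initialisation call is acceptable, is to fill the table at startup with the readable loop. The names bits_in_byte, init_bits_in_byte and bits_in_u32 are invented for this example.

```c
#include <stdint.h>

static unsigned char bits_in_byte[256];   /* hypothetical table name */

/* Fill the table once; entry i holds the number of set bits in i. */
static void init_bits_in_byte(void)
{
    for (unsigned i = 0; i < 256; ++i) {
        unsigned v = i, n = 0;
        while (v != 0) {          /* readable one-bit-at-a-time count */
            n += v & 1u;
            v >>= 1;
        }
        bits_in_byte[i] = (unsigned char)n;
    }
}

/* Sum the table entries for the four bytes of a 32-bit value. */
static unsigned bits_in_u32(uint32_t x)
{
    return bits_in_byte[x & 0xFFu]
         + bits_in_byte[(x >> 8) & 0xFFu]
         + bits_in_byte[(x >> 16) & 0xFFu]
         + bits_in_byte[x >> 24];
}
```

init_bits_in_byte() has to run before the first lookup; using uint32_t instead of unsigned int also removes the dependence on the platform's int width.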

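Hacker's Delight is delightful! Highly recommended.

I think the fastest way, without using lookup tables and popcount, is the following. It counts the set bits with just 12 operations.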

int popcount(int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits  
    return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

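It works because you can count the total number of set bits by dividing the word into two halves, counting the number of set bits in both halves and then adding them up. This is also known as the Divide and Conquer paradigm. Let's get into detail.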

v = v - ((v >> 1) & 0x55555555);

The number of set bits in two bits can be 0b00, 0b01 or 0b10. Let's try to work this out on 2 bits.

 ---------------------------------------------
 |   v    |   (v >> 1) & 0b0101   |  v - x   |
 ---------------------------------------------
   0b00           0b00               0b00   
   0b01           0b00               0b01     
   0b10           0b01               0b01
   0b11           0b01               0b10

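This is what was required: the last column shows the count of set bits in every two-bit pair. If the two-bit number is >= 2 (0b10) then the AND produces 0b01, else it produces 0b00.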
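The four-row table can also be checked mechanically. The helper below is only an illustration (the name two_bit_step_holds is made up); it compares v - ((v >> 1) & 1) against a direct bit count for every 2-bit value.

```c
/* Returns 1 if the subtraction trick yields the number of set bits
 * for every 2-bit value v, and 0 otherwise. */
static int two_bit_step_holds(void)
{
    for (unsigned v = 0; v < 4; ++v) {
        unsigned expected = (v & 1u) + ((v >> 1) & 1u);  /* direct count */
        unsigned got = v - ((v >> 1) & 1u);              /* first SWAR step */
        if (got != expected)
            return 0;
    }
    return 1;
}
```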

v = (v & 0x33333333) + ((v >> 2) & 0x33333333);

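This statement should be easy to understand. After the first operation we have the count of set bits in every two bits; now we sum up those counts in every 4 bits.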

v & 0b00110011         //masks out even two bits
(v >> 2) & 0b00110011  // masks out odd two bits

We then sum up the results above, giving us the total count of set bits in every 4 bits. The last statement is the trickiest.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further...

v + (v >> 4)

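It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know, because of our previous operations, that every nibble holds the count of set bits in it. Let's look at an example. Suppose we have the byte 0b01000010. It means the first nibble holds a count of 4 and the second one holds a count of 2. Now we add those nibbles together.

0b01000010 + 0b00000100 = 0b01000110

This leaves the count of set bits for the whole byte in the low nibble (0b0110), and therefore we mask off the high four bits of every byte in the number (discarding them).

0b01000110 & 0x0F = 0b00000110

Now every byte holds the count of set bits in it. We need to add them all together. The trick is to multiply the result by 0x01010101, which has an interesting property. If our number has four bytes, A B C D, the multiplication produces a new number with the bytes A+B+C+D B+C+D C+D D. A 4-byte number can have at most 32 bits set, which can be represented as 0b00100000.

All we need now is the first byte, which holds the sum of all set bits in all the bytes, and we get it with >> 24. This algorithm was designed for 32 bit words but can be easily modified for 64 bit words.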
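One possible 64-bit modification, given as a sketch rather than as the author's own version: the masks are simply widened to 64 bits and the final shift becomes 56, since the multiplication now accumulates eight byte counts in the top byte. The name popcount64_swar is hypothetical.

```c
#include <stdint.h>

/* 64-bit variant of the same divide-and-conquer steps. */
static unsigned popcount64_swar(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);                           /* 2-bit counts */
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL); /* 4-bit counts */
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* byte counts  */
    return (unsigned)((v * 0x0101010101010101ULL) >> 56);                 /* top byte = total */
}
```

The total is at most 64, so it still fits in the single top byte that the shift extracts.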

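If you happen to be using Java, the built-in method Integer.bitCount will do that.

I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.

Fastest is the following, at 3.7 seconds: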

static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}

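Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling "sideways addition" approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was way slower, and I got bored of waiting for it to complete.

So if you care about performance above all else, use the first approach. If you care, but not enough to spend 64Kb of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.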
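The 65536-entry table in the fastest variant is not spelled out above. A minimal sketch of one way to build it, assuming a one-time setup call is acceptable, uses the recurrence count(i) = count(i >> 1) + (i & 1). The names wordbits16, init_wordbits16 and popcount_table16 are invented here and are not from the benchmark.

```c
#include <stdint.h>

static unsigned char wordbits16[65536];   /* 64 KiB table, hypothetical name */

/* Each entry reuses the already-computed entry for i >> 1. */
static void init_wordbits16(void)
{
    wordbits16[0] = 0;
    for (uint32_t i = 1; i < 65536; ++i)
        wordbits16[i] = (unsigned char)(wordbits16[i >> 1] + (i & 1u));
}

/* Two lookups: low halfword and high halfword. */
static unsigned popcount_table16(uint32_t x)
{
    return wordbits16[x & 0xFFFFu] + wordbits16[x >> 16];
}
```

Whether this beats the arithmetic versions depends heavily on whether the table stays in cache, which is the trade-off the timings above are probing.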

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.

unsigned int count_bit(unsigned int x)
{
  x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
  x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
  x = (x & 0x0000FFFF) + ((x >> 16)& 0x0000FFFF);
  return x;
}

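Let me explain this algorithm.

This algorithm is based on the Divide and Conquer approach. Suppose there is an 8-bit integer 213 (11010101 in binary); the algorithm works like this (each time merging two neighbouring blocks):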

+-------------------------------+
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |  <- x
|  1 0  |  0 1  |  0 1  |  0 1  |  <- first time merge
|    0 0 1 1    |    0 0 1 0    |  <- second time merge
|        0 0 0 0 0 1 0 1        |  <- third time ( answer = 00000101 = 5)
+-------------------------------+
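The three merges in the diagram can be reproduced with an 8-bit cut-down of count_bit. This is only a sketch for following the example; count_bits8_stages is a hypothetical helper that records the intermediate value after each merge.

```c
#include <stdint.h>

/* Stores the value after the first, second and third merge
 * in stages[0], stages[1] and stages[2] respectively. */
static void count_bits8_stages(uint8_t x, uint8_t stages[3])
{
    unsigned v = x;
    v = (v & 0x55u) + ((v >> 1) & 0x55u);   /* merge neighbouring bits    */
    stages[0] = (uint8_t)v;
    v = (v & 0x33u) + ((v >> 2) & 0x33u);   /* merge neighbouring pairs   */
    stages[1] = (uint8_t)v;
    v = (v & 0x0Fu) + ((v >> 4) & 0x0Fu);   /* merge neighbouring nibbles */
    stages[2] = (uint8_t)v;
}
```

For the input 213 the three stored values are 0x95 (10010101), 0x32 (00110010) and 5, matching the three rows of the diagram.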

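This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3, using C++ inlines to eliminate function call overhead, one billion iterations, keeping a running sum of all counts to ensure the compiler doesn't remove anything important, and using rdtsc for timing (clock-cycle precise).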

inline int pop2(unsigned x, unsigned y)
{
    x = x - ((x >> 1) & 0x55555555);
    y = y - ((y >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    y = y + (y >> 8);
    x = x + (x >> 16);
    y = y + (y >> 16);
    return (x+y) & 0x000000FF;
}

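The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5s total elapsed for both together on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.

This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.

With 128-bit SIMD registers, there is yet another factor of two, and the SSE instruction sets often have clever shortcuts, too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced online in many places, and it's amenable to comprehensive unit testing. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to bench the tweaked 64-bit version. For this one sizeof(unsigned long) == 8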

inline int pop2(unsigned long x, unsigned long y)
{
    x = x - ((x >> 1) & 0x5555555555555555);
    y = y - ((y >> 1) & 0x5555555555555555);
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
    y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
    x = x + y; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32); 
    return x & 0xFF;
}

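That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That latter number summed 128 billion bits and corresponds to 5.9s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, all combined sum to 256.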

inline int pop4(unsigned long x, unsigned long y, 
                unsigned long u, unsigned long v)
{
  enum { m1 = 0x5555555555555555, 
         m2 = 0x3333333333333333, 
         m3 = 0x0F0F0F0F0F0F0F0F, 
         m4 = 0x000000FF000000FF };

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y; 
    u = u + v; 
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4; 
    x = x + (x >> 32);
    return x & 0x000001FF;
}

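I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc, for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.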

   hitime b4 = rdtsc(); 
   for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i) 
      sum += pop4 (i,  i^1, ~i, i|1); 
   hitime e4 = rdtsc();

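256 billion bits summed in 8.17s elapsed. That works out to 1.02s for 32 million bits as benchmarked in the 16-bit table lookup. I can't compare directly, because the other bench doesn't give a clock speed, but it looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.

Update: decided to do the obvious and create pop6() by adding four more duplicated lines. It came out to 22.8gc, 384 billion bits summed in 9.5s elapsed. So there's another 20%. Now at 800ms for 32 billion bits.

Why not iteratively divide by 2?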

count = 0
while n > 0
  if (n % 2) == 1
    count += 1
  n /= 2

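I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue, though, that "best" should have an element of clarity.

The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.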

unsigned int bitCount(unsigned int x)
{
  x = ((x >> 1) & 0b01010101010101010101010101010101)
     + (x       & 0b01010101010101010101010101010101);
  x = ((x >> 2) & 0b00110011001100110011001100110011)
     + (x       & 0b00110011001100110011001100110011); 
  x = ((x >> 4) & 0b00001111000011110000111100001111)
     + (x       & 0b00001111000011110000111100001111); 
  x = ((x >> 8) & 0b00000000111111110000000011111111)
     + (x       & 0b00000000111111110000000011111111); 
  x = ((x >> 16)& 0b00000000000000001111111111111111)
     + (x       & 0b00000000000000001111111111111111); 
  return x;
}

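The first step adds the even bits to the odd bits, producing a sum of bits in each pair. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until the final count takes up the entire int.

For a happy medium between a 2^32 lookup table and iterating through each bit individually: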

int bitcount(unsigned int num){
    int count = 0;
    static int nibblebits[] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for(; num != 0; num >>= 4)
        count += nibblebits[num & 0x0f];
    return count;
}

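From http://ctips.pbwiki.com/CountBits

It's not the fastest or best solution, but I ran into the same question and started to think and think. Finally I realized that if you approach the problem from the mathematical side and draw a graph, you find that it is a function with some periodic parts, and then you notice the difference between the periods... so it can be done like this. Here you go: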

unsigned int f(unsigned int x)
{
    switch (x) {
        case 0:
            return 0;
        case 1:
            return 1;
        case 2:
            return 1;
        case 3:
            return 2;
        default:
            return f(x/4) + f(x%4);
    }
}

This can be done in O(k), where k is the number of bits set.

int NumberOfSetBits(int n)
{
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}

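The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp. 11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.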
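As a hedged illustration of the sparse/dense distinction (a sketch of the general idea in C, not a transcription of the article), the sparse variant below clears the lowest set bit until nothing is left, while the dense variant sets the lowest clear bit until the word is all ones and subtracts the number of steps from 32. Both function names are made up.

```c
#include <stdint.h>

/* Loops once per set bit: cheap when few bits are set. */
static unsigned popcount_sparse(uint32_t n)
{
    unsigned ones = 0;
    while (n != 0) {
        n &= n - 1;          /* clear the lowest set bit */
        ++ones;
    }
    return ones;
}

/* Loops once per clear bit: cheap when most bits are set. */
static unsigned popcount_dense(uint32_t n)
{
    unsigned zeros = 0;
    while (n != 0xFFFFFFFFu) {
        n |= n + 1;          /* set the lowest clear bit */
        ++zeros;
    }
    return 32u - zeros;
}
```

Both return the same count for any input; they differ only in how many iterations they need.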

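A few open questions:

What if the number is negative?
If the number is 1024, then the "iteratively divide by 2" method will iterate 10 times.

We can modify the algorithm to support negative numbers as follows: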

count = 0
while n != 0
  if ((n % 2) == 1 || (n % 2) == -1)
    count += 1
  n /= 2
return count
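Note that with truncating division, the pseudocode above counts the bits of the magnitude of a negative number (for n = -1 it returns 1). If the goal is instead to count the bits of the two's-complement representation, one option in C is to convert to unsigned first. This is a sketch under that assumption; bit_count_signed is a hypothetical name.

```c
/* Counts the set bits in the unsigned representation of n.
 * Converting a negative int to unsigned is well defined in C
 * (the value is reduced modulo 2^N, N being the width of unsigned). */
static unsigned bit_count_signed(int n)
{
    unsigned u = (unsigned)n;
    unsigned count = 0;
    while (u != 0) {
        count += u & 1u;     /* add the lowest bit */
        u >>= 1;             /* logical shift, since u is unsigned */
    }
    return count;
}
```

With this version -1 yields the full width of unsigned int, and the loop still terminates because the shift is logical.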

Now, to overcome the second problem, we can write the algorithm like this:

int bit_count(int num)
{
    int count=0;
    while(num)
    {
        num=(num)&(num-1);
        count++;
    }
    return count;
}

For a complete reference, see:

http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html

  private int get_bits_set(int v)
    {
      int c; // c accumulates the total bits set in v
        for (c = 0; v>0; c++)
        {
            v &= v - 1; // clear the least significant bit set
        }
        return c;
    }

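I think Brian Kernighan's method will be useful too... It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

int countSetBits(unsigned int n) {
    unsigned int c; // c accumulates the total bits set in n
    for (c = 0; n > 0; n = n & (n - 1)) c++;
    return c;
}

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

I use the code below, which is more intuitive.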

int countSetBits(int n) {
    return !n ? 0 : 1 + countSetBits(n & (n-1));
}

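Logic: n & (n-1) resets the last set bit of n.

P.S.: I know this is not an O(1) solution, although it is an interesting one.

What do you mean by "best algorithm"? The shortest code or the fastest code? Your code looks very elegant and it has a constant execution time. The code is also very short.

But if speed is the major factor and not code size, then I think the following can be faster: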

       static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };
        static int bitCountOfByte( int value ){
            return BIT_COUNT[ value & 0xFF ];
        }

        static int bitCountOfInt( int value ){
            return bitCountOfByte( value ) 
                 + bitCountOfByte( value >> 8 ) 
                 + bitCountOfByte( value >> 16 ) 
                 + bitCountOfByte( value >> 24 );
        }

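I think this will not be faster for a 64-bit value, but for a 32-bit value it can be faster.

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), or branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same number of cycles). It assumes that small constants (such as 6, 12, 24) cost nothing to load into registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lower bound.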

#define BitCount(X,Y)           \
                Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
                Y = ((Y + (Y >> 3)) & 030707070707); \
                Y =  (Y + (Y >> 6)); \
                Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is the secret to the first and most complex step:

input output
AB    CD             Note
00    00             = AB
01    01             = AB
10    01             = AB - (A >> 1) & 0x1
11    10             = AB - (A >> 1) & 0x1

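So if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.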
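The 8-row check for the 3-bit case can be sketched in C as follows. Within one octal digit the macro's masks 033333333333 and 011111111111 reduce to 3 and 1, which is what the helper uses; the name three_bit_step_holds is invented for this example.

```c
/* Returns 1 if x - ((x >> 1) & 3) - ((x >> 2) & 1) equals the number of
 * set bits for every 3-bit value x, and 0 otherwise. */
static int three_bit_step_holds(void)
{
    for (unsigned x = 0; x < 8; ++x) {
        unsigned expected = (x & 1u) + ((x >> 1) & 1u) + ((x >> 2) & 1u);
        unsigned got = x - ((x >> 1) & 3u) - ((x >> 2) & 1u);
        if (got != expected)
            return 0;
    }
    return 1;
}
```

Writing the 3-bit value as 4a + 2b + c, the two subtractions remove 2a + b and a, leaving a + b + c.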

Don Gillies

If you're using C++, another option is to use template metaprogramming:

// recursive template to sum bits in an int
template <int BITS>
int countBits(int val) {
        // return the least significant bit plus the result of calling ourselves with
        // .. the shifted value
        return (val & 0x1) + countBits<BITS-1>(val >> 1);
}

// template specialisation to terminate the recursion when there's only one bit left
template<>
int countBits<1>(int val) {
        return val & 0x1;
}

Usage would be:

// to count bits in a byte/char (this returns 8)
countBits<8>( 255 )

// another byte (this returns 7)
countBits<8>( 254 )

// counting bits in a word/short (this returns 1)
countBits<16>( 256 )

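You could of course further expand this template to use different types (even auto-detecting the bit size), but I've kept it simple for clarity.

Edit: I forgot to mention that this is good because it should work in any C++ compiler, and it basically just unrolls your loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find).

I am particularly fond of this example from the fortune file: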

#define BITCOUNT(x)    (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x)         ((x) - (((x)>>1)&0x77777777) \
                             - (((x)>>2)&0x33333333) \
                             - (((x)>>3)&0x11111111))

I like it best because it's so pretty!

Java JDK1.5

Integer.bitCount(n);

Where n is the number whose 1s are to be counted.

Check also,

Integer.highestOneBit(n);
Integer.lowestOneBit(n);
Integer.numberOfLeadingZeros(n);
Integer.numberOfTrailingZeros(n);

//Beginning with the value 1, rotate left 16 times
     n = 1;
         for (int i = 0; i < 16; i++) {
            n = Integer.rotateLeft(n, 1);
            System.out.println(n);
         }

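I found an implementation of bit counting over an array using SIMD instructions (SSSE3 and AVX2). It performs 2-2.5 times better than a version using the __popcnt64 intrinsic function.

SSSE3 version: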

#include <smmintrin.h>
#include <stdint.h>

const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m128i _sum = _mm_setzero_si128();
    for (size_t i = 0; i < size; i += 16)
    {
        //load 16-byte vector
        __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
        //get low 4 bit for every byte in vector
        __m128i lo = _mm_and_si128(_src, F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
    }
    uint64_t sum[2];
    _mm_storeu_si128((__m128i*)sum, _sum);
    return sum[0] + sum[1];
}

AVX2 version:

#include <immintrin.h>
#include <stdint.h>

const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
                                   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m256i _sum =  _mm256_setzero_si256();
    for (size_t i = 0; i < size; i += 32)
    {
        //load 32-byte vector
        __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
        //get low 4 bit for every byte in vector
        __m256i lo = _mm256_and_si256(_src, F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
    }
    uint64_t sum[4];
    _mm256_storeu_si256((__m256i*)sum, _sum);
    return sum[0] + sum[1] + sum[2] + sum[3];
}

I always use this in competitive programming; it's easy to write and efficient:

#include <bits/stdc++.h>

using namespace std;

int countOnes(int n) {
    bitset<32> b(n);
    return b.count();
}

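There are many algorithms to count the set bits, but I think the best one is the fastest one! You can see the details on this page:

Bit Twiddling Hacks

I suggest this one:

Counting bits set in 14, 24, or 32-bit words using 64-bit instructions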

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

// option 2, for at most 24-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) 
     % 0x1f;

// option 3, for at most 32-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 
     0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

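This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.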
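For experimenting with the third option, it can be wrapped in a function as sketched below. The name popcount32_mod and the explicit uint64_t casts are additions for this example; the constants are the ones shown above. Each 12-bit chunk is replicated by the multiplication, every fifth bit is kept by the mask, and because 2^5 leaves remainder 1 modulo 31, the % 0x1f adds those bits up.

```c
#include <stdint.h>

/* Option 3 wrapped in a function: three 12/12/8-bit chunks. */
static unsigned popcount32_mod(uint32_t v)
{
    uint64_t c;
    c  = (((uint64_t)(v & 0xfff) * 0x1001001001001ULL)
          & 0x84210842108421ULL) % 0x1f;
    c += (((uint64_t)((v & 0xfff000) >> 12) * 0x1001001001001ULL)
          & 0x84210842108421ULL) % 0x1f;
    c += (((uint64_t)(v >> 24) * 0x1001001001001ULL)
          & 0x84210842108421ULL) % 0x1f;
    return (unsigned)c;
}
```

The casts make sure the multiplications are carried out in 64 bits even where unsigned int is only 32 bits wide.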

A fast C# solution using a pre-calculated table of byte bit counts, with branching on the input size.

public static class BitCount
{
    public static uint GetSetBitsCount(uint n)
    {
        var counts = BYTE_BIT_COUNTS;
        return n <= 0xff ? counts[n]
             : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
             : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
             : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
    }

    public static readonly uint[] BYTE_BIT_COUNTS = 
    {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
}

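Here is a portable module (ANSI C) which can benchmark each of your algorithms on any architecture.

Your CPU has 9-bit bytes? No problem :-) At the moment it implements 2 algorithms, the K&R algorithm and a byte-wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure out a way to make the "Hacker's Delight" algorithm portable, feel free to add it in.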

#ifndef _BITCOUNT_H_
#define _BITCOUNT_H_

/* Return the Hamming Weight of val, i.e. the number of 'on' bits. */
int bitcount( unsigned int );

/* List of available bitcount algorithms.  
 * onTheFly:    Calculate the bitcount on demand.
 *
 * lookupTable: Uses a small lookup table to determine the bitcount.  This
 * method is on average 3 times as fast as onTheFly, but incurs a small
 * upfront cost to initialize the lookup table on the first call.
 *
 * strategyCount is just a placeholder. 
 */
enum strategy { onTheFly, lookupTable, strategyCount };

/* String representations of the algorithm names */
extern const char *strategyNames[];

/* Choose which bitcount algorithm to use. */
void setStrategy( enum strategy );

#endif

#include <limits.h>

#include "bitcount.h"

/* The number of entries needed in the table is equal to the number of unique
 * values a char can represent which is always UCHAR_MAX + 1*/
static unsigned char _bitCountTable[UCHAR_MAX + 1];
static unsigned int _lookupTableInitialized = 0;

static int _defaultBitCount( unsigned int val ) {
    int count;

    /* Starting with:
     * 1100 - 1 == 1011,  1100 & 1011 == 1000
     * 1000 - 1 == 0111,  1000 & 0111 == 0000
     */
    for ( count = 0; val; ++count )
        val &= val - 1;

    return count;
}

/* Looks up each byte of the integer in a lookup table.
 *
 * The first time the function is called it initializes the lookup table.
 */
static int _tableBitCount( unsigned int val ) {
    int bCount = 0;

    if ( !_lookupTableInitialized ) {
        unsigned int i;
        for ( i = 0; i != UCHAR_MAX + 1; ++i )
            _bitCountTable[i] =
                ( unsigned char )_defaultBitCount( i );

        _lookupTableInitialized = 1;
    }

    for ( ; val; val >>= CHAR_BIT )
        bCount += _bitCountTable[val & UCHAR_MAX];

    return bCount;
}

static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

const char *strategyNames[] = { "onTheFly", "lookupTable" };

void setStrategy( enum strategy s ) {
    switch ( s ) {
    case onTheFly:
        _bitcount = _defaultBitCount;
        break;
    case lookupTable:
        _bitcount = _tableBitCount;
        break;
    case strategyCount:
        break;
    }
}

/* Just a forwarding function which will call whichever version of the
 * algorithm has been selected by the client 
 */
int bitcount( unsigned int val ) {
    return _bitcount( val );
}

#ifdef _BITCOUNT_EXE_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Use the same sequence of pseudo random numbers to benchmark each Hamming
 * Weight algorithm.
 */
void benchmark( int reps ) {
    clock_t start, stop;
    int i, j;
    static const int iterations = 1000000;

    for ( j = 0; j != strategyCount; ++j ) {
        setStrategy( j );

        srand( 257 );

        start = clock(  );

        for ( i = 0; i != reps * iterations; ++i )
            bitcount( rand(  ) );

        stop = clock(  );

        printf
            ( "\n\t%d pseudo-random integers using %s: %f seconds\n\n",
              reps * iterations, strategyNames[j],
              ( double )( stop - start ) / CLOCKS_PER_SEC );
    }
}

int main( void ) {
    int option;

    while ( 1 ) {
        printf( "Menu Options\n"
            "\t1.\tPrint the Hamming Weight of an Integer\n"
            "\t2.\tBenchmark Hamming Weight implementations\n"
            "\t3.\tExit ( or cntl-d )\n\n\t" );

        if ( scanf( "%d", &option ) == EOF )
            break;

        switch ( option ) {
        case 1:
            printf( "Please enter the integer: " );
            if ( scanf( "%d", &option ) != EOF )
                printf
                    ( "The Hamming Weight of %d ( 0x%X ) is %d\n\n",
                      option, option, bitcount( option ) );
            break;
        case 2:
            printf
                ( "Please select number of reps ( in millions ): " );
            if ( scanf( "%d", &option ) != EOF )
                benchmark( option );
            break;
        case 3:
            goto EXIT;
            break;
        default:
            printf( "Invalid option\n" );
        }

    }

 EXIT:
    printf( "\n" );

    return 0;
}

#endif

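32-bit or not? I just came up with this method in Java after reading "Cracking the Coding Interview", 4th edition, exercise 5.5 (chapter 5: Bit Manipulation). If the least significant bit is 1, increment count, then right-shift the integer.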

public static int bitCount( int n){
    int count = 0;
    for (int i=n; i!=0; i = i >>> 1){  // unsigned shift so negative inputs terminate
        count += i & 1;
    }
    return count;
}

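I think this one is more intuitive than the solutions with the constant 0x33333333, no matter how fast they are. It depends on your definition of "best algorithm".

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow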