C/C++ Bit Array or Bit Vector

https://stackoverflow.com/questions/4604130

25-09-2019
|

Question

I am learning C/C++ programming & have encountered the usage of 'Bit arrays' or 'Bit Vectors'. Am not able to understand their purpose? here are my doubts -

Are they used as boolean flags?
Can one use int arrays instead? (more memory of course, but..)
What's this concept of Bit-Masking?
If bit-masking is simple bit operations to get an appropriate flag, how do one program for them? is it not difficult to do this operation in head to see what the flag would be, as apposed to decimal numbers?

I am looking for applications, so that I can understand better. for Eg -

Q. You are given a file containing integers in the range (1 to 1 million). There are some duplicates and hence some numbers are missing. Find the fastest way of finding missing numbers?

For the above question, I have read solutions telling me to use bit arrays. How would one store each integer in a bit?

Solution

I think you've got yourself confused between arrays and numbers, specifically what it means to manipulate binary numbers.

I'll go about this by example. Say you have a number of error messages and you want to return them in a return value from a function. Now, you might label your errors 1,2,3,4... which makes sense to your mind, but then how do you, given just one number, work out which errors have occured?

Now, try labelling the errors 1,2,4,8,16... increasing powers of two, basically. Why does this work? Well, when you work base 2 you are manipulating a number like 00000000 where each digit corresponds to a power of 2 multiplied by its position from the right. So let's say errors 1, 4 and 8 occur. Well, then that could be represented as 00001101. In reverse, the first digit = 1*2^0, the third digit 1*2^2 and the fourth digit 1*2^3. Adding them all up gives you 13.

Now, we are able to test if such an error has occured by applying a bitmask. By example, if you wanted to work out if error 8 has occured, use the bit representation of 8 = 00001000. Now, in order to extract whether or not that error has occured, use a binary and like so:

  00001101
& 00001000
= 00001000

I'm sure you know how an and works or can deduce it from the above - working digit-wise, if any two digits are both 1, the result is 1, else it is 0.

Now, in C:

int func(...)
{
    int retval = 0;

    if ( sometestthatmeans an error )
    {
        retval += 1;
    }


    if ( sometestthatmeans an error )
    {
        retval += 2;
    }
    return retval
}

int anotherfunc(...)
{
    uint8_t x = func(...)

    /* binary and with 8 and shift 3 plaes to the right
     * so that the resultant expression is either 1 or 0 */
    if ( ( ( x & 0x08 ) >> 3 ) == 1 )
    {
        /* that error occurred */
    }
}

Now, to practicalities. When memory was sparse and protocols didn't have the luxury of verbose xml etc, it was common to delimit a field as being so many bits wide. In that field, you assign various bits (flags, powers of 2) to a certain meaning and apply binary operations to deduce if they are set, then operate on these.

I should also add that binary operations are close in idea to the underlying electronics of a computer. Imagine if the bit fields corresponded to the output of various circuits (carrying current or not). By using enough combinations of said circuits, you make... a computer.

OTHER TIPS

regarding the usage the bits array :

if you know there are "only" 1 million numbers - you use an array of 1 million bits. in the beginning all bits will be zero and every time you read a number - use this number as index and change the bit in this index to be one (if it's not one already).

after reading all numbers - the missing numbers are the indices of the zeros in the array.

for example, if we had only numbers between 0 - 4 the array would look like this in the beginning: 0 0 0 0 0. if we read the numbers : 3, 2, 2 the array would look like this: read 3 --> 0 0 0 1 0. read 3 (again) --> 0 0 0 1 0. read 2 --> 0 0 1 1 0. check the indices of the zeroes: 0,1,4 - those are the missing numbers

BTW, of course you can use integers instead of bits but it may take (depends on the system) 32 times memory

Sivan

Bit Arrays or Bit Vectors can be though as an array of boolean values. Normally a boolean variable needs at least one byte storage, but in a bit array/vector only one bit is needed. This gets handy if you have lots of such data so you save memory at large.

Another usage is if you have numbers which do not exactly fit in standard variables which are 8,16,32 or 64 bit in size. You could this way store into a bit vector of 16 bit a number which consists of 4 bit, one that is 2 bit and one that is 10 bits in size. Normally you would have to use 3 variables with sizes of 8,8 and 16 bit, so you only have 50% of storage wasted.

But all these uses are very rarely used in business aplications, the come to use often when interfacing drivers through pinvoke/interop functions and doing low level programming.

Bit Arrays of Bit Vectors are used as a mapping from position to some bit value. Yes it's basically the same thing as an array of Bool, but typical Bool implementation is one to four bytes long and it uses too much space.

We can store the same amount of data much more efficiently by using arrays of words and binary masking operations and shifts to store and retrieve them (less overall memory used, less accesses to memory, less cache miss, less memory page swap). The code to access individual bits is still quite straightforward.

There is also some bit field support builtin in C language (you write things like int i:1; to say "only consume one bit") , but it is not available for arrays and you have less control of the overall result (details of implementation depends on compiler and alignment issues).

Below is a possible way to answer to your "search missing numbers" question. I fixed int size to 32 bits to keep things simple, but it could be written using sizeof(int) to make it portable. And (depending on the compiler and target processor) the code could only be made faster using >> 5 instead of / 32 and & 31 instead of % 32, but that is just to give the idea.

#include <stdio.h>
#include <errno.h>
#include <stdint.h>

int main(){
    /* put all numbers from 1 to 1000000 in a file, except 765 and 777777 */
    {
        printf("writing test file\n");
        int x = 0;
        FILE * f = fopen("testfile.txt", "w");
        for (x=0; x < 1000000; ++x){
            if (x == 765 || x == 777760 || x == 777791){
                continue;
            }
            fprintf(f, "%d\n", x);
        }
        fprintf(f, "%d\n", 57768); /* this one is a duplicate */
        fclose(f);
    }

    uint32_t bitarray[1000000 / 32];

    /* read file containing integers in the range [1,1000000] */
    /* any non number is considered as separator */
    /* the goal is to find missing numbers */
    printf("Reading test file\n");
    {
        unsigned int x = 0;
        FILE * f = fopen("testfile.txt", "r");
        while (1 == fscanf(f, " %u",&x)){
            bitarray[x / 32] |= 1 << (x % 32);
        }
        fclose(f);
    }
    /* find missing number in bitarray */
    {
        int x = 0;
        for (x=0; x < (1000000 / 32) ; ++x){
            int n = bitarray[x];
            if (n != (uint32_t)-1){
                printf("Missing number(s) between %d and %d [%x]\n",
                    x * 32, (x+1) * 32, bitarray[x]);
                int b;
                for (b = 0 ; b < 32 ; ++b){
                    if (0 == (n & (1 << b))){
                        printf("missing number is %d\n", x*32+b);
                    }
                }
            }
        }
    }
}

That is used for bit flags storage, as well as for parsing different binary protocols fields, where 1 byte is divided into a number of bit-fields. This is widely used, in protocols like TCP/IP, up to ASN.1 encodings, OpenPGP packets, and so on.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow