Matching bit strings

https://stackoverflow.com/questions/12103438

28-06-2021
|

Question

I have needed to implement a string searching algorithm that finds a pattern of bits in a text of bits (the match may not be byte/word aligned). For starters, I implemented the Boyer-Moore algorithm, but comparing individual bits was too slow for my purposes. So instead I tried implementing a blocked based version that would compare entire bytes/words as described in this paper, but it has become complex and unmanageable (in part due to me not entirely understanding what I'm doing.)

Does anyone have good implementation of such an algorithm?

My specific use case is with pattern length N >= 32, text window 2N, and bits packed into ints. Also N in this case is a multiple of char size N % 8 == 0. I preprocess once and use many times on changing text, like the Boyer-Moore. First match is all I need. Performance is key.

Edit: After successfully implementing the Blocked Boyer-Moore algorithm, I notice no improvement(my bit by bit version is faster!) It's probably a mistake of my own, because I've been racking my brain on it and optimized it to the point where it makes no sense without many lines of comments, yet it's still slower. Here it is.

Solution

Overview

I'm new to the SO community, but I look forward to giving something back.

Interesting problem. I put together an implementation which only does byte-based comparisons (with the aid of pre-computed bit patterns and bit masks) as opposed to performing expensive bit-manipulations on compare. As a result, it should be reasonably fast. It doesn't implement any of the Shift Rules (performance optimizations) discussed for the Boyer-Moore algorithm and thus could be further improved.

Although this implementation does depend on the number of pattern bits % CHAR_BIT == 0--on an 8-bit machine, meets your criteria of N % 8 == 0, the implementation will find non-byte-aligned bit patterns. (It also currently requires 8-bit characters ( CHAR_BIT == 8 ), but in the unlikely event your system is not using 8-bit chars, is easily adaptable by changing the all of the arrays/vectors from uint8_t to char, and adjusting the values they contain to reflect the correct number of bits.)

Given that the search doesn't do any bit-twiddling (besides anding pre-computed byte masks) it should be pretty performant.

Algorithm summary

In a nutshell, the pattern to be searched for is specified, and the implementation shifts it by one bit and records the shifted pattern. It also computes masks for the shifted pattern, as for non-byte-aligned bit patterns, some bits at the beginning and end of the comparison will need to be ignored for proper behavior.

The search is conducted for all the pattern bits in each shift position until a match is found or the end of the data buffer is reached.

//
//  BitStringMatch.cpp
//

#include "stdafx.h"
#include <iostream>
#include <cstdint>
#include <vector>
#include <memory>
#include <cassert>

int _tmain(int argc, _TCHAR* argv[])
{
    //Enter text and pattern data as appropriate for your application.  This implementation assumes pattern bits % CHAR_BIT == 0
    uint8_t text[] = { 0xcc, 0xcc, 0xcc, 0x5f, 0xe0, 0x1f, 0xe0, 0x0c }; //1010 1010, 1010 1010, 1010 1010, 010*1 1111, 1110 0000, 0001 1111, 1110 0000, 000*0 1010 
    uint8_t pattern[] = { 0xff, 0x00, 0xff, 0x00 }; //Set pattern to 1111 1111, 0000 0000, 1111 1111, 0000 0000

    assert( CHAR_BIT == 8 ); //Sanity check
    assert ( sizeof( text ) >= sizeof( pattern ) ); //Sanity check

    std::vector< std::vector< uint8_t > > shiftedPatterns( CHAR_BIT, std::vector< uint8_t >( sizeof( pattern ) + 1, 0 ) );  //+1 to accomodate bit shifting of CHAR_BIT bits.
    std::vector< std::pair< uint8_t, uint8_t > > compareMasks( CHAR_BIT, std::pair< uint8_t, uint8_t >( 0xff, 0x00 ) );

    //Initialize pattern shifting through all bit positions
    for( size_t i = 0; i < sizeof( pattern ); ++i ) //Start by initializing the unshifted pattern
    {
        shiftedPatterns[ 0 ][ i ] = pattern[ i ];
    }

    for( size_t i = 1; i < CHAR_BIT; ++i )  //Initialize the other patterns, shifting the previous vector pattern to the right by 1 bit position
    {
        compareMasks[ i ].first >>= i;  //Set the bits to consider in the first...
        compareMasks[ i ].second = 0xff << ( CHAR_BIT - i ); //and last bytes of the pattern

        bool underflow = false;
        for( size_t j = 0; j < sizeof( pattern ) + 1; ++j )
        {
            bool thisUnderflow = shiftedPatterns[ i - 1 ][ j ] & 0x01 ? true : false; 
            shiftedPatterns[ i ][ j ] = shiftedPatterns[ i - 1][ j ] >> 1;

            if( underflow ) //Previous byte shifted out a 1; shift in a 1
            {
                shiftedPatterns[ i ][ j ] |= 0x80;  //Set MSb to 1
            }

            underflow = thisUnderflow;
        }
    }

    //Search text for pattern
    size_t maxTextPos = sizeof( text ) - sizeof( pattern );
    size_t byte = 0;
    bool match = false;
    for( size_t byte = 0; byte <= maxTextPos && !match; ++byte )
    {
        for( size_t bit = 0; bit < CHAR_BIT && ( byte < maxTextPos || ( byte == maxTextPos && bit < 1 ) ); ++bit )
        {
            //Compare first byte of pattern
            if( ( shiftedPatterns[ bit ][ 0 ] & compareMasks[ bit ].first ) != ( text[ byte ] & compareMasks[ bit ].first ) )
            {
                continue;
            }

            size_t foo = sizeof( pattern );
            //Compare all middle bytes of pattern
            bool matchInProgress = true;
            for( size_t pos = 1; pos < sizeof( pattern ) && matchInProgress; ++pos )
            {
                matchInProgress = shiftedPatterns[ bit ][ pos ] == text[ byte + pos ];
            }
            if( !matchInProgress )
            {
                continue;
            }

            if( bit != 0 )  //If compare failed or we're comparing the unshifted pattern, there's no need to compare final pattern buffer byte
            {
                if( ( shiftedPatterns[ bit ][ sizeof( pattern ) ] & compareMasks[ bit ].second ) != ( text[ byte + sizeof( pattern ) ] & compareMasks[ bit ].second ) )
                {
                    continue;
                };
            }

            //We found a match!
            match = true;   //Abandon search
            std::cout << "Match found!  Pattern begins at byte index " << byte << ", bit position " << CHAR_BIT - bit - 1 << ".\n";
            break;
        }
    }
    //If no match
    if( !match )
    {
        std::cout << "No match found.\n";
    }

    std::cout << "\nPress a key to exit...";
    std::getchar();

    return 0;
}

I hope this is helpful.

OTHER TIPS

If N is big (bigger than, say, 16 bits) then it would be pretty easy to do a preliminary search against 8 shifted copies of the bit pattern (truncating the pattern to eliminate 'partial bytes'). Then you can refine the results by looking at adjacent bits. The byte search (against the 8 shifted copies) could be done using Boyer-Moore or a similarly efficient algorithm.

In case you are wondering: 8 byte searches is probably faster than one bit search since each byte comparison takes just one instruction, whereas the bit manipulation needed to do the bit search takes far more instructions per bit. However, you should always profile to be sure.

Performance will be highly dependent on the types of bit patterns. For example if the "key" bit sequence and the "search" bit sequence is highly different, then some of the byte shifting solutions, or even your bit by bit solution will be pretty fast. Because at the vast majority of bit offsets, the first comparison will fail and you can move on to the next one.

If the sequences are highly similar then you need a more sophisticated algorithm. Imagine for example 1 million bits that are all 10101010101010...... except that somewhere in the middle a 1 is flipped to a zero e.g. ...101000101... and you are looking for a 10k bit sequence that ends in "...101000" then byte shift comparison algorithms will do poorly because they will have to compare a ton of bytes - 8 times - before failing the match a half a million times.

So the statistical characteristics of your data are important here. Also if the key string is used multiple times, or you expect multiple matches, then proprocessing the buffer can speed things up.

For example you could transform the buffer once to add up the number of bits in each pair of bytes, and in each byte, and then do that to each key string. You can then scan the two buffers. For the string to potentially match, the number of bits in each byte of the key string must always be smaller than the number of bits in each byte pair, and the number of bits in each byte pair of the key string must always be smaller than the bit count in each byte of the search string.

If your key string is large, then you could mark low and high bit "anchors" and scan for those. For example say I was comparing a 10k key string that had two 0 bytes at offsets x, y, z. Then I could scan my search string for positions the number of bits in single bytes at those offsets match zero. That would be really fast.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow