Decimal concatenation of two integers using bit operations

Question 1

The most efficient way to do it, is likely something similar to this:

uint32_t uintcat (uint32_t ms, uint32_t ls)
{
  uint32_t mult=1;

  do
  {
    mult *= 10; 
  } while(mult <= ls);

  return ms * mult + ls;
}

Then let the compiler worry about optimization. There's likely not a lot it can improve since this is base 10, which doesn't get along well with the various instructions of the computer, like shifting.

EDIT : BENCHMARK TEST

Intel i7-3770 2 3,4 GHz
OS: Windows 7/64
Mingw, GCC version 4.6.2
gcc -O3 -std=c99 -pedantic-errors -Wall

10 million random values, from 0 to 3276732767.

Result (approximates):

Algorithm 1: 60287 micro seconds
Algorithm 2: 65185 micro seconds

Benchmark code used:

#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#include <time.h>

uint32_t uintcat (uint32_t ms, uint32_t ls)
{
  uint32_t mult=1;

  do
  {
    mult *= 10; 
  } while(mult <= ls);

  return ms * mult + ls;
}


uint32_t myConcat (uint32_t a, uint32_t b) {
    switch( (b >= 10000000) ? 7 : 
            (b >= 1000000) ? 6 : 
            (b >= 100000) ? 5 : 
            (b >= 10000) ? 4 : 
            (b >= 1000) ? 3 : 
            (b >= 100) ? 2 : 
            (b >= 10) ? 1 : 0 ) {
        case 1: return a*100+b; break;
        case 2: return a*1000+b; break;
        case 3: return a*10000+b; break;
        case 4: return a*100000+b; break;
        case 5: return a*1000000+b; break;
        case 6: return a*10000000+b; break;
        case 7: return a*100000000+b; break;

        default: return a*10+b; break;
    }
}


static LARGE_INTEGER freq;

static void print_benchmark_results (LARGE_INTEGER* start, LARGE_INTEGER* end)
{
  LARGE_INTEGER elapsed;

  elapsed.QuadPart = end->QuadPart - start->QuadPart;
  elapsed.QuadPart *= 1000000;
  elapsed.QuadPart /= freq.QuadPart;

  printf("%lu micro seconds", elapsed.QuadPart);
}

int main()
{
  const uint32_t TEST_N = 10000000;
  uint32_t* data1 = malloc (sizeof(uint32_t) * TEST_N);
  uint32_t* data2 = malloc (sizeof(uint32_t) * TEST_N);
  volatile uint32_t* result_algo1 = malloc (sizeof(uint32_t) * TEST_N);
  volatile uint32_t* result_algo2 = malloc (sizeof(uint32_t) * TEST_N);

  srand (time(NULL));
  // Mingw rand() apparently gives numbers up to 32767
  // worst case should therefore be 3,276,732,767

  // fill up random data in arrays
  for(uint32_t i=0; i<TEST_N; i++)
  {
    data1[i] = rand();
    data2[i] = rand();
  }


  QueryPerformanceFrequency(&freq); 


  LARGE_INTEGER start, end;

  // run algorithm 1
  QueryPerformanceCounter(&start);
  for(uint32_t i=0; i<TEST_N; i++)
  {
    result_algo1[i] = uintcat(data1[i], data2[i]);
  } 
  QueryPerformanceCounter(&end);

  // print results
  printf("Algorithm 1: ");
  print_benchmark_results(&start, &end);
  printf("\n");

  // run algorithm 2
  QueryPerformanceCounter(&start);
  for(uint32_t i=0; i<TEST_N; i++)
  {
    result_algo2[i] = myConcat(data1[i], data2[i]);
  } 
  QueryPerformanceCounter(&end);

  // print results
  printf("Algorithm 2: ");
  print_benchmark_results(&start, &end);
  printf("\n\n");


  // sanity check both algorithms against each other
  for(uint32_t i=0; i<TEST_N; i++)
  {
    if(result_algo1[i] != result_algo2[i])
    {
      printf("Results mismatch for %lu %lu. Expected: %lu%lu, algo1: %lu, algo2: %lu\n",
             data1[i], 
             data2[i],
             data1[i],
             data2[i],
             result_algo1[i],
             result_algo2[i]);
    }
  }


  // clean up
  free((void*)data1);
  free((void*)data2);
  free((void*)result_algo1);
  free((void*)result_algo2);
}

Question 2

Bit operations use the binary representation of the numbers. However what you try to achieve is to concatenate the numbers in decimal notation. Please note that concatenating the decimal representations has little to do with concatenating the binary representations. Though it is theoretically possible to solve the problem using binary operations I am sure it will be far from the most efficient way.

Question 3

We need to calculate a*10^N + b very fast.

Bit operations isn't a best idea to optimize it (even using tricks like a := (a<<1) + (a<<3) <=> a := a*10 as compiler can make it himself).

The first problem is to calculate 10^N, but there is no need to calculate it, there is just 9 possible values.

The second problem is to calculate N from b (length of 10 representation). If your data have uniform distribution you can minimize operations count in average case.

Check b <= 10^9, b <= 10^8, ..., b <= 10 with ()?: (it's faster than if( ) after optimizations, it has much simplier grammar and functionality), call the result N. Next, make switch(N) with lines "return a*10^N + b" (where 10^N is constant). As I know, switch() with 3-4 "case" is faster than same if( ) construction after optimizations.

unsigned int myConcat(unsigned int& a, unsigned int& b) {
    switch( (b >= 10000000) ? 7 : 
            (b >= 1000000) ? 6 : 
            (b >= 100000) ? 5 : 
            (b >= 10000) ? 4 : 
            (b >= 1000) ? 3 : 
            (b >= 100) ? 2 : 
            (b >= 10) ? 1 : 0 ) {
        case 1: return a*100+b; break;
        case 2: return a*1000+b; break;
        case 3: return a*10000+b; break;
        case 4: return a*100000+b; break;
        case 5: return a*1000000+b; break;
        case 6: return a*10000000+b; break;
        case 7: return a*100000000+b; break;
        default: return a*10+b; break;
        // I don't really know what to do here
        //case 8: return a*1000*1000*1000+b; break;
        //case 9: return a*10*1000*1000*1000+b; break;
    }
}

As you can see, there is 2-3 operations in average case + optimisations is very effective here. I've benchmarked it in comparison with Lundin's suggestion, here is the result. 0ms vs 100ms

Question 4

If you care about decimal digit concatenation, you might want to simply do that as you're printing, and convert both numbers to a sequence of digits sequentially. e.g. How do I print an integer in Assembly Level Programming without printf from the c library? shows an efficient C function, as well as asm. Call it twice into the same buffer.

@Lundin's answer loops increasing powers of 10 to find the right decimal-shift, i.e. a linear search for the right power of 10. If it's called very frequently so a lookup table can stay hot in cache, a speedup maybe be possible.

If you can use GNU C __builtin_clz (Count Leading Zeros) or some other way of quickly finding the MSB position of the right-hand input (ls, the least-significant part of the resulting concatenation), you can start the search for the right mult from a 32-entry lookup table. (And you only have to check at most one more iteration, so it's not a loop.)

Most of the common modern CPU architectures have a HW instruction that the compiler can use directly or with a bit of processing to implement clz. https://en.wikipedia.org/wiki/Find_first_set#Hardware_support. (And on all but x86, the result is well-defined for an input of 0, but unfortunately GNU C doesn't portably give us access to that.)

If the table stays hot in L1d cache, this can be pretty good. The extra latency of a clz and a table lookup are comparable to a couple iterations of the loop (on a modern x86 like Skylake or Ryzen for example, where bsf or tzcnt is 3 cycle latency, L1d latency is 4 or 5 cycles, imul latency is 3 cycles.)

Of course, on many architectures (including x86), multiplying by 10 is cheaper than by a runtime variable, using shift and add. 2 LEA instructions on x86, or an add+lsl on ARM/AArch64 using a shifted input to do tmp = x + x*4 with the add. So on Intel CPUs, we're only looking at a 2-cycle loop-carried dependency chain, not 3. But AMD CPUs have slower LEA when using a scaled index.

This doesn't sound great for small numbers. But it can reduce branch mispredictions by needing at most one iteration. It even makes a branchless implementation possible. And it means less total work for large lower parts (large powers of 10). But large integers will easily overflow unless you use a wider result type.

Unfortunately, 10 is not a power of 2, so the MSB position alone can't give us the exact power of 10 to multiply by. e.g. all numbers from 64 to 127 all have MSB = 1<<7, but some of them have 2 decimal digits and some have 3. Since we want to avoid division (because it requires a multiplication by a magic constant and shifting the high half), we want to always start with the lower power of 10 and see if that's big enough.

But fortunately, a bitscan does get us within one power of 10 so we no longer need a loop.

I probably wouldn't have written the part with _lzcnt_u32 or ARM __clz if I'd learned of the clz(a|1) trick for avoiding problems with input=0 beforehand. But I did, and played around with the source a bit to try to get nicer asm from gcc and clang. Index the table on clz or BSR depending on target platform makes it a bit of a mess.

#include <stdint.h>
#include <limits.h>
#include <assert.h>

   // builtin_clz matches Intel's docs for x86 BSR: garbage result for input=0
   // actual x86 HW leaves the destination register unmodified; AMD even documents this.
   // but GNU C doesn't let us take advantage with intrinsics.
   // unless you use BMI1 _lzcnt_u32


// if available, use an intrinsic that gives us a leading-zero count
// *without* an undefined result for input=0
#ifdef __LZCNT__      // x86 CPU feature
#include <immintrin.h>  // Intel's intrinsics
#define HAVE_LZCNT32
#define lzcnt32(a) _lzcnt_u32(a)
#endif

#ifdef __ARM__      // TODO: do older ARMs not have this?
#define HAVE_LZCNT32
#define lzcnt32(a) __clz(a)  // builtin, no header needed
#endif

// Some POWER compilers define `__cntlzw`?



// index = msb position, or lzcnt, depending on which the HW can do more efficiently
// defined later; one or the other is unused and optimized out, depending on target platform
// alternative: fill this at run-time startup
// with a loop that does mult*=10 when (x<<1)-1 > mult, or something
//#if INDEX_BY_MSB_POS == 1
  __attribute__((unused))
  static const uint32_t catpower_msb[] = {
       10,    // 1 and 0
       10,    // 2..3
       10,    // 4..7
       10,    // 8..15
       100,    // 16..31     // 2 digits even for the low end of the range
       100,    // 32..63
       100,    // 64..127
       1000,   // 128..255   // 3 digits
       1000,   // 256..511
       1000,   // 512..1023
       10000,   // 1024..2047
       10000,   // 2048..4095
       10000,   // 4096..8191
       10000,   // 8192..16383
       100000,   // 16384..32767
       100000,   // 32768..65535      // up to 2^16-1, enough for 16-bit inputs
       //  ...   // fill in the rest yourself
  };
//#elif INDEX_BY_MSB_POS == 0
  // index on leading zeros
  __attribute__((unused))
  static const uint32_t catpower_lz32[] = {
      // top entries overflow: 10^10 doesn't fit in uint32_t
      // intentionally wrong to make it easier to spot bad output.
    4000000000,    // 2^31 .. 2^32-1    2*10^9 .. 4*10^9
    2000000000,    // 1,073,741,824 .. 2,147,483,647
    // first correct entry
    1000000000,    //   536,870,912 .. 1,073,741,823

    // ... fill in the rest
    // for testing, skip until 16 leading zeros
    [16] = 100000,   // 32768..65535      // up to 2^16-1, enough for 16-bit inputs
       100000,   // 16384..32767
       10000,   // 8192..16383
       10000,   // 4096..8191
       10000,   // 2048..4095
       10000,   // 1024..2047
       1000,   // 512..1023
       1000,   // 256..511
       1000,   // 128..255
       100,    // 64..127
       100,    // 32..63
       100,    // 16..31     // low end of the range has 2 digits
       10,    // 8..15
       10,    // 4..7
       10,    // 2..3
       10,    // 1
                       // lzcnt32(0) == 32
       10,    // 0     // treat 0 as having one significant digit.
  };
//#else
//#error "INDEX_BY_MSB_POS not set correctly"
//#endif



//#undef HAVE_LZCNT32  // codegen for the other path, for fun

static inline uint32_t msb_power10(uint32_t a)
{
#ifdef HAVE_LZCNT32  // 0-safe lzcnt32 macro available
    #define INDEX_BY_MSB_POS 0
    // a |= 1 would let us shorten the table, in case 32*4 is a lot nicer than 33*4 bytes
    unsigned lzcnt = lzcnt32(a);  // 32 for a=0
    return catpower_lz32[lzcnt];
#else
  // only generic __builtin_clz available

  static_assert(sizeof(uint32_t) == sizeof(unsigned) && UINT_MAX == (1ULL<<32)-1, "__builtin_clz isn't 32-bit");
  // See also https://foonathan.net/blog/2016/02/11/implementation-challenge-2.html
  // for C++ templates for fixed-width wrappers for __builtin_clz

  #if defined(__i386__) || defined(__x86_64__)
    // x86 where MSB_index = 31-clz = BSR is most efficient
    #define INDEX_BY_MSB_POS 1
    unsigned msb = 31 - __builtin_clz(a|1);  // BSR
    return catpower_msb[msb];
    //return unlikely(a==0) ? 10 : catpower_msb[msb];
  #else
    // use clz directly while still avoiding input=0
    // I think all non-x86 CPUs with hardware CLZ do define clz(0) = 32 or 64 (the operand width),
    // but gcc's builtin is still documented as not valid for input=0
    // Most ISAs like PowerPC and ARM that have a bitscan instruction have clz, not MSB-index

    // set the LSB to avoid the a==0 special case
    unsigned clz = __builtin_clz(a|1);
    // table[32] unused, could add yet another #ifdef for that
    #define INDEX_BY_MSB_POS 0
    //return unlikely(a==0) ? 10 : catpower_lz32[clz];
    return catpower_lz32[clz];   // a|1 avoids the special-casing
  #endif  // optimize for BSR or not
#endif // HAVE_LZCNT32
}


uint32_t uintcat (uint32_t ms, uint32_t ls)
{
//  if (ls==0) return ms * 10;  // Another way to avoid the special case for clz

  uint32_t mult = msb_power10(ls); // catpower[clz(ls)];
  uint32_t high = mult * ms;
#if 0
  if (mult <= ls)
      high *= 10;
  return high + ls;
#else
  // hopefully compute both and then select
  // because some CPUs can shift and add at the same time (x86, ARM)
  // so this avoids having an ADD *after* the cmov / csel, if the compiler is smart
  uint32_t another10 = high*10 + ls;
  uint32_t enough = high + ls; 
  return (mult<=ls) ? another10 : enough;
#endif
}

From the Godbolt compiler explorer, this compiles efficiently for x86-64 with and without BSR:

# clang7.0 -O3 for x86-64 SysV,  -march=skylake -mno-lzcnt
uintcat(unsigned int, unsigned int):
    mov     eax, esi
    or      eax, 1
    bsr     eax, eax                    # 31-clz(ls|1)
    mov     ecx, dword ptr [4*rax + catpower_msb]
    imul    edi, ecx                    # high = mult * ms
    lea     eax, [rdi + rdi]
    lea     eax, [rax + 4*rax]          # retval = high * 10
    cmp     ecx, esi
    cmova   eax, edi                    # if(mult>ls) retval = high   (drop the *10 result)
    add     eax, esi                    # retval += ls
    ret

Or with lzcnt, (enabled by -march=haswell or later, or some AMD uarches),

uintcat(unsigned int, unsigned int):
          # clang doesn't try to break the false dependency on EAX; gcc uses xor eax,eax
    lzcnt   eax, esi                    # C source avoids the |1, saving instructions
    mov     ecx, dword ptr [4*rax + catpower_lz32]
    imul    edi, ecx                    # same as above from here on
    lea     eax, [rdi + rdi]
    lea     eax, [rax + 4*rax]
    cmp     ecx, esi
    cmova   eax, edi
    add     eax, esi
    ret

Factoring the last add out of both sides of the ternary is a missed optimization, adding 1 cycle of latency after the cmov. We can multiply by 10 and add just as cheaply as multiplying by 10 alone, on Intel CPUs:

    ... same start         # hand-optimized version that clang should use
    imul    edi, ecx                    # high = mult * ms
    lea     eax, [rdi + 4*rdi]          # high * 5
    lea     eax, [rsi + rdi*2]          # retval = high * 10 + ls
    add     edi, esi                    # tmp = high + ls
    cmp     ecx, esi
    cmova   eax, edi                    # if(mult>ls) retval = high+ls
    ret

So the high + ls latency would run in parallel with the high*10 + ls latency, both needed as inputs for cmov.

GCC branches instead of using CMOV for the last condition. GCC also makes a mess of 31-clz(a|1), calculating clz with BSR and XOR with 31. But then subtracting that from 31. And it has some extra mov instructions. Strangely, gcc seems to do better with that BSR code when lzcnt is available, even though it chooses not to use it.

clang has no trouble optimizing away the 31-clz double-inversion and just using BSR directly.

For PowerPC64, clang also makes branchless asm. gcc does something similar, but with a branch like on x86-64.

uintcat:
.Lfunc_gep0:
    addis 2, 12, .TOC.-.Lfunc_gep0@ha
    addi 2, 2, .TOC.-.Lfunc_gep0@l
    ori 6, 4, 1                              # OR immediate
    addis 5, 2, catpower_lz32@toc@ha
    cntlzw 6, 6                              # CLZ  word
    addi 5, 5, catpower_lz32@toc@l           # static table address
    rldic 6, 6, 2, 30                        # rotate left and clear immediate (shift and zero-extend the CLZ result)
    lwzx 5, 5, 6                             # Load Word Zero eXtend, catpower_lz32[clz]
    mullw 3, 5, 3                            # mul word
    cmplw   5, 4                             # compare   mult, ls
    mulli 6, 3, 10                           # mul immediate
    isel 3, 3, 6, 1                          # conditional select high vs. high*10
    add 3, 3, 4                              # + ls
    clrldi  3, 3, 32                         # zero extend, clearing upper 32 bits
    blr                                      # return

Compressing the table

Using clz(ls|1) >> 1 or that +1 should work, because 4 < 10. The table always takes at least 3 entries to gain another digit. I haven't investigated this. (And have already spent longer than I meant to on this. :P)

Or right-shift a lot more to just get a starting point for the loop. e.g. mult = clz(ls) >= 18 ? 100000 : 10;, or a 3 or 4-way chain of if.

Or loop on mult *= 100, and after exiting that loop sort out whether you want old_mult * 10 or mult. (i.e. check if you went too far). This cuts the iteration count in half for even numbers of digits.

(Watch out for a possible infinite loop on large ls that would overflow the result. If mult *= 100 wraps to 0, it will always stay <= ls for ls = 1000000000, for example.)