Question

Right now I'm working on a project which requires an integer to be converted to a base 62 string many times a second. The faster this conversion is completed, the better.

The problem is that I'm having a hard time getting my own base conversion methods to be fast and reliable. If I use strings, it's generally reliable and works well, but it's slow. If I use char arrays, it's generally much faster, but it's also very messy, and unreliable. (It produces heap corruption, comparison of strings that should match return a negative, etc.)

So what's the fastest and most reliable way of converting from a very large integer to a base 62 key? In the future, I plan on utilizing SIMD model code in my application, so is this operation parallelizable at all?

EDIT: This operation is performed several million times a second; as soon as the operation finishes, it begins again as part of a loop, so the faster it runs, the better. The integer being converted is of arbitrary size, and can easily be as large as a 128 bit integer (or larger).

EDIT: This is the function I am currently using.

char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int charsetLength = (int)(strlen(charset));

//maxChars is an integer specifying the maximum length of the key
char* currentKey = new char[maxChars];

void integerToKey(unsigned long long location)
{
    unsigned long long num = location;
    int i = 0;

    for(; num > 0; i++)
    {
            currentKey[i] = charset[num % (charsetLength)];
            num /= charsetLength + 1;
    }

    currentKey[i + 1] = '\0';
}

I ripped this out of a class that is part of my application, and some of the code is modified so that it makes sense sans its owning class.

Was it helpful?

Solution

Probably what you want is some version of itoa. Here is a link that shows various versions of itoa with performance tests: http://www.jb.man.ac.uk/~slowe/cpp/itoa.html

In general, I know of two ways to do this. One way it to perform successive divisions to strip off one digit at a time. Another way is to precompute conversions in "blocks". So you could precompute a block of int to text conversion of size 62^3 then do the digits 3 at a time. Provided you do the memory layout and lookup efficiently this can be slightly faster at runtime but incurs a startup penalty.

OTHER TIPS

I feel bad because I cant remember where I originally found this, but I have been using this in my code and have found it to be pretty fast. You could modify this to be more efficient in certain places I am sure.

Oh and I feel worse because this is written in Java, but a quick c&p and refactor could get it working in c++

public class BaseConverterUtil {

     private static final String baseDigits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

     public static String toBase62( int decimalNumber ) {
         return fromDecimalToOtherBase( 62, decimalNumber );
     }

     public static String toBase36( int decimalNumber ) {
         return fromDecimalToOtherBase( 36, decimalNumber );
     }

     public static String toBase16( int decimalNumber ) {
         return fromDecimalToOtherBase( 16, decimalNumber );
     }

     public static String toBase8( int decimalNumber ) {
         return fromDecimalToOtherBase( 8, decimalNumber );
     }

     public static String toBase2( int decimalNumber ) {
         return fromDecimalToOtherBase( 2, decimalNumber );
     }

     public static int fromBase62( String base62Number ) {
         return fromOtherBaseToDecimal( 62, base62Number );
     }

     public static int fromBase36( String base36Number ) {
         return fromOtherBaseToDecimal( 36, base36Number );
     }

     public static int fromBase16( String base16Number ) {
         return fromOtherBaseToDecimal( 16, base16Number );
     }

     public static int fromBase8( String base8Number ) {
         return fromOtherBaseToDecimal( 8, base8Number );
     }

     public static int fromBase2( String base2Number ) {
         return fromOtherBaseToDecimal( 2, base2Number );
     }

     private static String fromDecimalToOtherBase ( int base, int decimalNumber ) {
         String tempVal = decimalNumber == 0 ? "0" : "";
         int mod = 0;

         while( decimalNumber != 0 ) {
             mod = decimalNumber % base;
             tempVal = baseDigits.substring( mod, mod + 1 ) + tempVal;
             decimalNumber = decimalNumber / base;
         }

         return tempVal;
     }

     private static int fromOtherBaseToDecimal( int base, String number ) {
         int iterator = number.length();
         int returnValue = 0;
         int multiplier = 1;

         while( iterator > 0 ) {
             returnValue = returnValue + ( baseDigits.indexOf( number.substring( iterator - 1, iterator ) ) * multiplier );
             multiplier = multiplier * base;
             --iterator;
         }
         return returnValue;
     }

 }

Off the top of me head I'd expect an implementation to look a lot like this.

const char lookUpTable[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 
  'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
  'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
  'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' };

std::string ConvertToBase62( int integer )
{
   char res[MAX_BASE62_LENGTH];
   char* pWritePos = res;
   int leftOver = integer;
   while( leftOver )
   {
      int value62     = leftOver % 62;     
      *pWritePos      = lookUpTable[value62];
      pWritePos++;

      leftOver        /= value62;
   }
   *pWritePos = 0;    

   return std::string( res );
}

At the moment this isn't very SIMD optimisable. There is no SIMD modulo.

If we do Modulo ourselves we could in turn rewrite the loop as follows.

   while( leftOver )
   {
      const int newLeftOver = leftOver / 62;
      int digit62     = leftOver - (62 * newLeftOver);     
      *pWritePos      = lookUpTable[digit62];
      pWritePos++;

      leftOver        = newLeftOver;
   }

Now we have something that would be easy to SIMD if it weren't for that lookup ...

Though you can still get a good speed improvement by doing the modulo for several values simultaneously. It'd probably even be worth unrolling the loop a second time so you can process the next 4 or so modulos while the previous set are calculating (Due to instruction latency). You should be able to hide latencies pretty effectively this way. #

I'll come back if i can think of a way to eliminate the table lookup ...

Edit: That said as the maximum number of base62 digits you can get from a 32-bit integer is 6 you should just be able to fully unwind the loop and process all 6 digits simultaneously. I'm not entirely sure SIMD would give you much of a win here. It'd be an interesting experiment but I do really doubt you'd get all that much of a speed up over the loop above. Would be interesting to try it if someone hadn't poured tea over my dev machine's keyboard :(

Edit 2: while i think about it. A constant / 62 can be craftily optimised by the compiler using scary magic numbers ... so I don't even reckon the loop above would do a divide.

there are reversing issues in the above - the low orders come first in the generated string - I don't know if that is actually an issue because it depends on the subsequent usage of the generated string.

Generally this sort of radix conversion can be accelerated by doing it in radix*radix chunks In your case a char[2][62*62] is needed. This array can be constructed at initialization time(it is const).

This must be benchmarked though. The divide cost used to be HUGE so saving half the divides was a sure win. It depends on the ability to cache this 7000+ byte table and the cost of the divide.

If you are getting heap corruption, you have issues beyond the code you're showing here.

You can make the string class faster by reserving the space for the string before you begin, with string::reserve.

Your string is coming out in reverse order, the lower order base-62 digit is the first character in the string. This might explain your comparison issues.

Your implementation is pretty much as quick as it's going to get. I would suggest a couple of changes though:

void integerToKey(unsigned long long location)
{
    unsigned long long num = location;
    int i = 0;
    for(; num > 0; i++)
    {
            currentKey[i] = charset[num % (charsetLength)];
            num /= charsetLength; // use charsetLength
    }
    currentKey[i] = '\0'; // put the null after the last written char
}

The first change (divide by charsetLength) may have been causing your string comparison problems. With your original code (dividing by charsetLength + 1), there may be different values of integer that incorrectly get converted to the same string. For base 62, then both 0 and 62 would be encoded as "0".

It's hard to say whether either of the above changes would be causing your reported heap corruption problems, without a bit more context (such as the value of maxChars).

Also, you should be aware that the above code will write the digits of the string representation in reverse order (try it with base 10 and convert a number such as 12345 to see what I mean). This may not matter for your application, though.

Here's a solution I use in php for Base 10 to N (62 in this example)
My whole post is over here: http://ken-soft.com/?p=544

public class BNID {
        // Alphabet of Base N (This is a Base 62 Implementation)
        var $bN = array(
            '0','1','2','3','4','5','6','7','8','9',
            'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
            'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'
        );

        var $baseN;

        function __construct() {
            $this->baseN = count($this->bN);
        }

        // convert base 10 to base N
        function base10ToN($b10num=0) {
            $bNnum = "";
            do {
                $bNnum = $this->bN[$b10num % $this->baseN] . $bNnum;
                $b10num /= $this->baseN;
            } while($b10num >= 1);     
            return $bNnum;
        }

        // convert base N to base 10
        function baseNTo10($bNnum = "") {
           $b10num = 0;
            $len = strlen($bNnum);
            for($i = 0; $i < $len; $i++) {
                $val = array_keys($this->bN, substr($bNnum, $i, 1));
                $b10num += $val[0] * pow($this->baseN, $len - $i - 1);
            }
            return $b10num;
        }

}

I'm piling on with another answer because a couple of answers I tried didn't produce the output I expected. Though, this is optimized for readability, not speed.

string toStr62(unsigned long long num) {
   string charset = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
   int base = charset.length();
   string str = num ? "" : "0";

   while (num) {
      str = charset.substr(num % base, 1) + str;
      num /= base;
   }

   return str;
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top