Question

Below is my current char* to hex string conversion function. I wrote it as an exercise in bit manipulation. It takes ~7ms on an AMD Athlon MP 2800+ to hexify a 10 million byte array. Is there any trick or other approach that I am missing?

How can I make this faster?

Compiled with -O3 in g++

static const char _hex2asciiU_value[256][2] =
     { {'0','0'}, {'0','1'}, /* snip..., */ {'F','E'},{'F','F'} };

std::string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
    std::string str;
    str.resize(_len*2);
    char* pszHex = &str[0];
    const unsigned char* pEnd = _pArray + _len;

    clock_t stick, etick;
    stick = clock();
    for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
        pszHex[0] = _hex2asciiU_value[*pChar][0];
        pszHex[1] = _hex2asciiU_value[*pChar][1];
    }
    etick = clock();

    std::cout << "ticks to hexify " << etick - stick << std::endl;

    return str;
}

Updates

Added timing code

Brian R. Bondy: replaced the std::string with a heap-allocated buffer and changed ofs*16 to ofs << 4. However, the heap-allocated buffer seems to slow it down (result ~11ms)

Antti Sykäri: replaced the inner loop with

 int upper = *pChar >> 4;
 int lower = *pChar & 0x0f;
 pszHex[0] = pHex[upper];
 pszHex[1] = pHex[lower];

result ~8ms

Robert: replaced _hex2asciiU_value with a full 256-entry table, sacrificing memory space; result ~7ms!

HoyHoy: noted it was producing incorrect results.


Solution

At the cost of more memory you can create a full 256-entry table of the hex codes:

static const char _hex2asciiU_value[256][2] =
    { {'0','0'}, {'0','1'}, /* ..., */ {'F','E'},{'F','F'} };

Then index directly into the table; no bit fiddling required.

const char *pHexVal = _hex2asciiU_value[*pChar];
pszHex[0] = pHexVal[0];
pszHex[1] = pHexVal[1];
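
If writing the 256 entries out by hand is tedious, the table can also be generated once at startup. A minimal sketch (hex_table and init_hex_table are illustrative names, not from the original answer):

static char hex_table[256][2];

// Fill the 256-entry table once, before any conversions.
static void init_hex_table()
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 256; ++i) {
        hex_table[i][0] = digits[i >> 4];   // high nibble
        hex_table[i][1] = digits[i & 0x0f]; // low nibble
    }
}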

OTHER TIPS

This assembly function (based on my previous post here, but I had to modify the concept a bit to get it to actually work) processes 3.3 billion input characters per second (6.6 billion output characters) on one core of a 3 GHz Core 2 (Conroe). Penryn is probably faster.

%include "x86inc.asm"

SECTION_RODATA
pb_f0: times 16 db 0xf0
pb_0f: times 16 db 0x0f
pb_hex: db 48,49,50,51,52,53,54,55,56,57,65,66,67,68,69,70

SECTION .text

; int convert_string_to_hex( char *input, char *output, int len )

cglobal _convert_string_to_hex,3,3
    movdqa xmm6, [pb_f0 GLOBAL]
    movdqa xmm7, [pb_0f GLOBAL]
.loop:
    movdqa xmm5, [pb_hex GLOBAL] ; reload the digit table each iteration,
    movdqa xmm4, [pb_hex GLOBAL] ; since pshufb overwrites its destination
    movq   xmm0, [r0+r2-8]
    movq   xmm2, [r0+r2-16]
    movq   xmm1, xmm0
    movq   xmm3, xmm2
    pand   xmm0, xmm6 ;high bits
    pand   xmm2, xmm6
    psrlq  xmm0, 4
    psrlq  xmm2, 4
    pand   xmm1, xmm7 ;low bits
    pand   xmm3, xmm7
    punpcklbw xmm0, xmm1
    punpcklbw xmm2, xmm3
    pshufb xmm4, xmm0
    pshufb xmm5, xmm2
    movdqa [r1+r2*2-16], xmm4
    movdqa [r1+r2*2-32], xmm5
    sub r2, 16
    jg .loop
    REP_RET

Note that it uses x264 assembly syntax, which makes it more portable (to 32-bit vs. 64-bit, etc.). Converting it into the syntax of your choice is trivial: r0, r1, r2 are the three arguments to the function, passed in registers. It's a bit like pseudocode. Or you can just get common/x86/x86inc.asm from the x264 tree and include that to run it natively.

P.S. Stack Overflow, am I wrong for wasting time on such a trivial thing? Or is this awesome?

Faster C Implementation

This runs nearly 3x faster than the C++ implementation. I'm not sure why, as it's pretty similar. The last C++ implementation I posted took 6.8 seconds to run through a 200,000,000-character array; this C implementation took only 2.2 seconds.

#include <stdio.h>
#include <stdlib.h>

char* char_to_hex(const unsigned char* p_array, 
                  unsigned int p_array_len,
                  char** hex2ascii)
{
    unsigned char* str = malloc(p_array_len*2+1);
    const unsigned char* p_end = p_array + p_array_len;
    size_t pos=0;
    const unsigned char* p;
    for( p = p_array; p != p_end; p++, pos+=2 ) {
       str[pos] = hex2ascii[*p][0];
       str[pos+1] = hex2ascii[*p][1];
    }
    str[pos] = '\0';  /* terminate so the caller can print the result with %s */
    return (char*)str;
}

int main()
{
  size_t hex2ascii_len = 256;
  char** hex2ascii;
  int i;
  hex2ascii = malloc(hex2ascii_len*sizeof(char*));
  for(i=0; i<hex2ascii_len; i++) {
    hex2ascii[i] = malloc(3*sizeof(char));    
    snprintf(hex2ascii[i], 3,"%02X", i);
  }
  size_t len = 8;
  const unsigned char a[] = "DO NOT WANT";
  printf("%s\n", char_to_hex((const unsigned char*)a, len, (char**)hex2ascii));
}
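
(Note that the caller is responsible for free()ing the returned buffer, and the hex2ascii entries are never freed; that's fine for a one-shot benchmark like this, but worth tidying in real code.)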


Operate on 32 bits at a time (4 chars), then deal with the tail if needed. When I did this exercise with URL encoding, a full table lookup for each char was slightly faster than logic constructs, so you may want to test this in context as well to take caching issues into account.
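
A minimal sketch of this idea (hexify32 is an illustrative name; the byte extraction order assumes a little-endian host):

#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint32_t
#include <cstring>   // std::memcpy

// Emits 8 hex digits per 4-byte word; the tail loop handles leftovers.
void hexify32(const unsigned char* src, char* dst, std::size_t len)
{
    static const char digits[] = "0123456789ABCDEF";
    std::size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        std::uint32_t w;
        std::memcpy(&w, src + i, sizeof w);            // one 32-bit load
        for (int b = 0; b < 4; ++b) {                  // little-endian byte order
            unsigned char c = (unsigned char)(w >> (8 * b));
            *dst++ = digits[c >> 4];
            *dst++ = digits[c & 0x0f];
        }
    }
    for (; i < len; ++i) {                             // 0-3 leftover bytes
        *dst++ = digits[src[i] >> 4];
        *dst++ = digits[src[i] & 0x0f];
    }
}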

It works for me with unsigned char:

unsigned char  c1 =  byteVal >> 4;
unsigned char  c2 =  byteVal & 0x0f;

c1 +=  c1 <= 9 ? '0' : ('a' - 10);
c2 +=  c2 <= 9 ? '0' : ('a' - 10);

std::string sHex("  ");
sHex[0] = c1;
sHex[1] = c2;

// sHex now contains what we need, e.g. "0f"

For one, instead of multiplying by 16, do a bit shift: << 4.

Also, don't use std::string; instead, create a buffer on the heap and delete it yourself. It will be more efficient than the object construction and destruction required by the string.
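
A minimal sketch of that suggestion, where hex2 stands for the 256-entry table from the accepted answer (note the OP's update above actually found this variant slower):

char* char_to_hex_heap(const unsigned char* p, unsigned int len)
{
    char* buf = new char[len * 2];   // caller must delete[] the result
    char* out = buf;
    for (const unsigned char* end = p + len; p != end; ++p, out += 2) {
        out[0] = hex2[*p][0];
        out[1] = hex2[*p][1];
    }
    return buf;
}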

Not going to make a lot of difference... *pChar-(ofs*16) can be done with *pChar & 0x0F.

This is my version, which, unlike the OP's version, doesn't assume that std::basic_string keeps its data in a contiguous region:

#include <string>

using std::string;

static char const* digits("0123456789ABCDEF");

string
tohex(string const& data)
{
    string result(data.size() * 2, 0);
    string::iterator ptr(result.begin());
    for (string::const_iterator cur(data.begin()), end(data.end()); cur != end; ++cur) {
        unsigned char c(*cur);
        *ptr++ = digits[c >> 4];
        *ptr++ = digits[c & 15];
    }
    return result;
}

I assume this is Windows+IA32 (the table layout below depends on little-endian byte order).
Try using a short int in place of the two separate hexadecimal characters, so each input byte is written with a single 16-bit store.

short int hex_table[256] = {'0'*256+'0', '1'*256+'0', '2'*256+'0', ..., 'E'*256+'F', 'F'*256+'F'};
unsigned short int* pszHex = (unsigned short int*) &str[0];  // two output chars per store

stick = clock();

for (const unsigned char* pChar = _pArray; pChar != pEnd; pChar++) 
    *pszHex++ = hex_table[*pChar];

etick = clock();

Changing

    ofs = *pChar >> 4;
    pszHex[0] = pHex[ofs];
    pszHex[1] = pHex[*pChar-(ofs*16)];

to

    int upper = *pChar >> 4;
    int lower = *pChar & 0x0f;
    pszHex[0] = pHex[upper];
    pszHex[1] = pHex[lower];

results in roughly 5% speedup.

Writing the result two bytes at a time, as suggested by Robert, results in about an 18% speedup. The code changes to:

_result.resize(_len*2);
short* pszHex = (short*) &_result[0];
const unsigned char* pEnd = _pArray + _len;

for (const unsigned char* pChar = _pArray;
     pChar != pEnd;
     pChar++, ++pszHex)
{
    *pszHex = short_table[*pChar];
}

Required initialization:

short short_table[256];

for (int i = 0; i < 256; ++i)
{
    char* pc = (char*) &short_table[i];
    pc[0] = pHex[i >> 4];    // pHex: the 16-character digit table used above
    pc[1] = pHex[i & 0x0f];
}

Doing it 2 bytes at a time or 4 bytes at a time will probably result in even greater speedups, as pointed out by Allan Wind, but then it gets trickier when you have to deal with the odd characters.

If you're feeling adventurous, you might try to adapt Duff's device to do this.
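
For instance, a minimal sketch of a Duff's device adaptation, assuming the short_table from the snippet above is in scope (it handles any len, including zero):

#include <cstddef>   // std::size_t

extern short short_table[256];   // from the initialization snippet above

void hexify_duff(const unsigned char* src, short* dst, std::size_t len)
{
    if (len == 0) return;
    std::size_t n = (len + 7) / 8;     // iterations, rounded up
    switch (len % 8) {                 // jump into the unrolled loop
    case 0: do { *dst++ = short_table[*src++];
    case 7:      *dst++ = short_table[*src++];
    case 6:      *dst++ = short_table[*src++];
    case 5:      *dst++ = short_table[*src++];
    case 4:      *dst++ = short_table[*src++];
    case 3:      *dst++ = short_table[*src++];
    case 2:      *dst++ = short_table[*src++];
    case 1:      *dst++ = short_table[*src++];
            } while (--n > 0);
    }
}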

Results are on an Intel Core 2 Duo processor with gcc -O3.

Always measure that you actually get faster results — a pessimization pretending to be an optimization is less than worthless.

Always test that you get the correct results — a bug pretending to be an optimization is downright dangerous.

And always keep in mind the tradeoff between speed and readability — life is too short for anyone to maintain unreadable code.

(Obligatory reference to coding for the violent psychopath who knows where you live.)

Make sure your compiler optimization is turned on to the highest working level.

You know, flags like -O1 to -O3 in gcc.

I have found that using an index into an array, rather than a pointer, can speed things up a tick. It all depends on how your compiler chooses to optimize. The key is that the processor can compute a complex address like [i*2+1] in a single instruction.
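
For example, an index-based version of the question's inner loop (equivalent in behavior; which form wins depends on the compiler):

for (unsigned int i = 0; i < _len; ++i) {
    pszHex[i*2]     = _hex2asciiU_value[_pArray[i]][0];
    pszHex[i*2 + 1] = _hex2asciiU_value[_pArray[i]][1];
}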

If you're rather obsessive about speed here, you can do the following:

Each character is one byte, representing two hex values. Thus, each character is really two four-bit values.

So, you can do the following:

  1. Unpack the four-bit values to 8-bit values using a multiplication or similar instruction.
  2. Use pshufb, the SSSE3 instruction (Core2-only though). It takes an array of 16 8-bit input values and shuffles them based on the 16 8-bit indices in a second vector. Since you have only 16 possible characters, this fits perfectly; the input array is a vector of 0 through F characters, and the index array is your unpacked array of 4-bit values.

Thus, in a single instruction, you will have performed 16 table lookups in fewer clocks than it normally takes to do just one (pshufb is 1 clock latency on Penryn).

So, in computational steps:

  1. A B C D E F G H I J K L M N O P (64-bit vector of input values, "Vector A") -> 0A 0B 0C 0D 0E 0F 0G 0H 0I 0J 0K 0L 0M 0N 0O 0P (128-bit vector of indices, "Vector B"). The easiest way is probably two 64-bit multiplies.
  2. pshufb [0123456789ABCDEF], Vector B (see the intrinsics sketch below)
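
For reference, a minimal intrinsics sketch of this approach (SSSE3; compile with -mssse3 or similar; hexify_ssse3 is an illustrative name, and len is assumed to be a multiple of 16):

#include <cstddef>      // std::size_t
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

void hexify_ssse3(const unsigned char* src, char* dst, std::size_t len)
{
    const __m128i table = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                        '8','9','A','B','C','D','E','F');
    const __m128i mask  = _mm_set1_epi8(0x0f);
    for (std::size_t i = 0; i < len; i += 16) {
        __m128i in   = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i hi   = _mm_and_si128(_mm_srli_epi16(in, 4), mask); // high nibbles
        __m128i lo   = _mm_and_si128(in, mask);                    // low nibbles
        // Interleave so each input byte yields (high digit, low digit).
        __m128i idx0 = _mm_unpacklo_epi8(hi, lo);
        __m128i idx1 = _mm_unpackhi_epi8(hi, lo);
        _mm_storeu_si128((__m128i*)(dst + 2*i),      _mm_shuffle_epi8(table, idx0));
        _mm_storeu_si128((__m128i*)(dst + 2*i + 16), _mm_shuffle_epi8(table, idx1));
    }
}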

I'm not sure doing it more bytes at a time will be better... you'll probably just get tons of cache misses and slow it down significantly.

What you might try, though, is unrolling the loop: take larger steps and process more characters each time through, to remove some of the loop overhead.
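
For example, a minimal sketch of a 4x unroll of the question's loop, where hex2 stands for the 256-entry table; the second loop handles the 0-3 leftover bytes:

const unsigned char* p = _pArray;
const unsigned char* pEnd4 = _pArray + (_len & ~3u);   // largest multiple of 4
for (; p != pEnd4; p += 4, pszHex += 8) {
    pszHex[0] = hex2[p[0]][0]; pszHex[1] = hex2[p[0]][1];
    pszHex[2] = hex2[p[1]][0]; pszHex[3] = hex2[p[1]][1];
    pszHex[4] = hex2[p[2]][0]; pszHex[5] = hex2[p[2]][1];
    pszHex[6] = hex2[p[3]][0]; pszHex[7] = hex2[p[3]][1];
}
for (; p != _pArray + _len; ++p, pszHex += 2) {
    pszHex[0] = hex2[*p][0]; pszHex[1] = hex2[*p][1];
}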

Consistently getting ~4ms on my Athlon 64 4200+ (~7ms with the original code):

for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++) {
    const char* pchars = _hex2asciiU_value[*pChar];
    *pszHex++ = *pchars++;
    *pszHex++ = *pchars;
}

The function as it is shown when I'm writing this produces incorrect output even when _hex2asciiU_value is fully specified. The following code works, and on my 2.33GHz MacBook Pro runs in about 1.9 seconds for a 200,000,000-character array.

#include <iostream>
#include <string>
#include <cstdio>   // snprintf
#include <cstdlib>  // rand, srand
#include <ctime>    // clock, time

using namespace std;

static const size_t _h2alen = 256;
static char _hex2asciiU_value[_h2alen][3];

string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
    string str;
    str.resize(_len*2);
    char* pszHex = &str[0];
    const unsigned char* pEnd = _pArray + _len;
    for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
       pszHex[0] = _hex2asciiU_value[*pChar][0];
       pszHex[1] = _hex2asciiU_value[*pChar][1];
    }
    return str;
}


int main() {
  for(size_t i=0; i<_h2alen; i++) {
    snprintf(_hex2asciiU_value[i], 3,"%02X", i);
  }
  size_t len = 200000000;
  char* a = new char[len];
  string t1;
  clock_t start;
  srand(time(NULL));
  for(size_t i=0; i<len; i++) a[i] = rand()&0xFF;
  start = clock();
  t1=char_to_hex((const unsigned char*)a, len);
  cout << "char_to_hex conversion took ---> " << (clock() - start)/(double)CLOCKS_PER_SEC << " seconds\n";
}
Licensed under: CC-BY-SA with attribution