Question

A production issue has led our team to the following questions:

  1. Under RHEL6 using GCC 4.4.6, how are ntohs and ntohl implemented?
  2. Are the implementations known to be fast or slow?
  3. How can I actually see the generated assembly code for the functions?

I know the implications behind these questions may seem far-fetched and ridiculous, but I have been asked to investigate.

The hardware in question is a little-endian, 64-bit Intel box, and the code is compiled as 64-bit.


Solution 2

  1. They're provided by glibc, not GCC. Look in /usr/include/bits/byteswap.h for the __bswap_16 and __bswap_32 functions, which are used when optimization is enabled (see <netinet/in.h> for details of how).
  2. You didn't say what architecture you're using. On a big-endian system they're no-ops, so optimally fast! On little-endian systems they're architecture-specific, hand-optimized assembly code.
  3. Use GCC's -save-temps option to keep the intermediate .s files, use -S to stop after compilation and before assembling the code, or use http://gcc.godbolt.org/

OTHER TIPS

Do the following:

test.c

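#include <arpa/inet.h>  /* ntohl */
#include <stdint.h>     /* uint32_t */

int main(void)
{
   /* volatile keeps the compiler from folding the swap away at -O3 */
   volatile uint32_t x = 0x12345678;
   x = ntohl(x);
   return 0;
}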

Then compile with:

$ gcc -O3 -g -save-temps test.c

And analyze the resulting test.s file, or alternatively run objdump -S test.o.

On my machine (Ubuntu 13.04, 32-bit build) the relevant assembly is:

movl    $305419896, 12(%esp)
movl    12(%esp), %eax
bswap   %eax
movl    %eax, 12(%esp)

Hints:

  • 305419896 is 0x12345678 in decimal.
  • 12(%esp) is the address of the volatile variable.
  • All the movl instructions are there for the volatile-ness of x. The only really interesting instruction is bswap.
  • Obviously, ntohl is compiled as an inline-intrinsic.

Moreover, if I look at test.i (the preprocessed output), I find that ntohl is simply #defined as __bswap_32(), which is an inline function containing just a call to __builtin_bswap32().
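
To make that chain concrete, here is a minimal sketch of what the call reduces to on a little-endian host. The name my_ntohl_le is hypothetical and only for illustration; the real glibc header is more involved and differs between versions. __builtin_bswap32 is the GCC builtin mentioned above.

```c
#include <stdint.h>

/* Hypothetical stand-in for what ntohl() reduces to on a little-endian
   host when the header routes it through the compiler builtin.
   On a big-endian host ntohl() would be a no-op instead. */
static inline uint32_t my_ntohl_le(uint32_t net)
{
    /* GCC builtin; typically emitted as a single bswap on x86 */
    return __builtin_bswap32(net);
}
```

Calling my_ntohl_le(0x12345678) yields 0x78563412, and applying it twice returns the original value.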

These are implemented in glibc. Look at /usr/include/netinet/in.h. They will most likely rely on the glibc byteswap macros (/usr/include/bits/byteswap.h on my machine).

In my header these are implemented in assembly, so they should be pretty fast. For constant arguments, the swap is done at compile time.
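
As a rough illustration of the compile-time case, a swap written with shifts and masks is an ordinary constant expression when its argument is a constant. The macro SWAP32_CONST and the variable swapped_magic below are hypothetical names, modeled on the shift-and-mask form such headers use for constants; this is a sketch, not the glibc source.

```c
#include <stdint.h>

/* Hypothetical macro: byte swap built only from masks and shifts, so the
   compiler can evaluate it completely when x is a constant. */
#define SWAP32_CONST(x)                          \
    ((((uint32_t)(x) & 0xff000000u) >> 24) |     \
     (((uint32_t)(x) & 0x00ff0000u) >>  8) |     \
     (((uint32_t)(x) & 0x0000ff00u) <<  8) |     \
     (((uint32_t)(x) & 0x000000ffu) << 24))

/* A static initializer has to be a constant expression, so this line only
   compiles if the swap can be computed at compile time. */
static const uint32_t swapped_magic = SWAP32_CONST(0x12345678);
```

No swap instruction needs to appear in the generated code for swapped_magic; the object file simply contains the already swapped value.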

GCC and glibc cause ntohl() and htonl() to be inlined into the calling code, so the function call overhead is avoided. Furthermore, each ntohl() or htonl() call is translated into a single bswap instruction. According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", bswap has both a latency and a throughput of "1" on all current Intel CPUs, so a single clock cycle is enough to execute ntohl() or htonl().

ntohs() and htons() are implemented as a rotation by 8 bits. This effectively swaps the two halves (bytes) of the 16-bit operand. Latency and throughput are similar to those of bswap.
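
The rotation trick can be sketched in plain C as follows. The function name rot8_swap16 is hypothetical; whether the compiler turns it into a single rotate instruction depends on the compiler version and optimization level.

```c
#include <stdint.h>

/* Hypothetical helper: swap the two bytes of a 16-bit value by rotating
   it 8 bits. For a 16-bit operand, rotating left or right by 8 gives the
   same result. */
static inline uint16_t rot8_swap16(uint16_t v)
{
    /* v is promoted to int, so the cast truncates back to 16 bits */
    return (uint16_t)((v << 8) | (v >> 8));
}
```

For example, rot8_swap16(0x1234) gives 0x3412, which is what ntohs(0x1234) returns on a little-endian host.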

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow