- They're provided by glibc, not GCC. Look in /usr/include/bits/byteswap.h for the __bswap_16 and __bswap_32 functions, which are used when optimization is enabled (see <netinet/in.h> for details of how).
- You didn't say what architecture you're using. On a big-endian system they're no-ops, so optimally fast! On little-endian they're architecture-specific, hand-optimized assembly code.
- Use GCC's -save-temps option to keep the intermediate .s files, or use -S to stop after compilation and before assembling the code, or use http://gcc.godbolt.org/
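To make the big-endian versus little-endian point concrete, here is a minimal sketch that checks ntohl against a hand-written model of it. The helper names (swap32_portable, expected_ntohl, ntohl_matches) are made up for illustration and are not part of glibc; the byte-order test assumes glibc's <endian.h>, which defines __BYTE_ORDER and __BIG_ENDIAN.

```c
#include <arpa/inet.h>  /* ntohl */
#include <assert.h>
#include <endian.h>     /* __BYTE_ORDER, __BIG_ENDIAN (glibc-specific) */
#include <stdint.h>

/* Hypothetical helper: reverse the four bytes with plain shifts and masks. */
static uint32_t swap32_portable(uint32_t x)
{
    return ((x & 0xff000000u) >> 24) | ((x & 0x00ff0000u) >> 8) |
           ((x & 0x0000ff00u) << 8)  | ((x & 0x000000ffu) << 24);
}

/* Hypothetical model of ntohl: a no-op on big-endian hosts,
 * a byte swap on little-endian hosts. */
static uint32_t expected_ntohl(uint32_t net)
{
#if __BYTE_ORDER == __BIG_ENDIAN
    return net;
#else
    return swap32_portable(net);
#endif
}

/* Returns 1 if the library's ntohl agrees with the model above. */
static int ntohl_matches(uint32_t net)
{
    return ntohl(net) == expected_ntohl(net);
}

/* Small self-check; call it from main() to try the sketch. */
static void self_check(void)
{
    assert(ntohl_matches(0x12345678u));
    assert(swap32_portable(0x12345678u) == 0x78563412u);
}
```

Compiling a file like this with -O2 -S is one way to see whether the compiler reduces the shift-and-mask version to the same instruction it emits for ntohl.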
How are the ntoh functions implemented under RHEL/GCC?
Question
A production issue has led our team to the following questions:
- Under RHEL6 using GCC 4.4.6, how are ntohs and ntohl implemented?
- Are the implementations known to be fast or slow?
- How can I actually see the generated assembly code for the functions?
I know the implications behind these questions may seem far-fetched and ridiculous, but I have been asked to investigate.
The hardware in question is a little-endian, 64-bit Intel box, and the code is compiled as 64-bit.
Solution 2
OTHER TIPS
Do the following:
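test.c
    #include <arpa/inet.h>  /* ntohl */
    #include <stdint.h>     /* uint32_t */

    int main(void)
    {
        /* volatile keeps the compiler from folding the swap away */
        volatile uint32_t x = 0x12345678;
        x = ntohl(x);
        return 0;
    }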
Then compile with:
$ gcc -O3 -g -save-temps test.c
And analyze the resulting test.s file, or alternatively run objdump -S test.o.
On my machine (Ubuntu 13.4) the relevant assembly is:
    movl $305419896, 12(%esp)
    movl 12(%esp), %eax
    bswap %eax
    movl %eax, 12(%esp)
Hints:
- 305419896 is 0x12345678 in decimal.
- 12(%esp) is the address of the volatile variable.
- All the movl instructions are there for the volatile-ness of x. The only really interesting instruction is bswap.
- Obviously, ntohl is compiled as an inline intrinsic.
Moreover, if I look at test.i (the preprocessed output), I find that ntohl is #defined as simply __bswap_32(), which is an inline function with just a call to __builtin_bswap32().
These are implemented in glibc. Look at /usr/include/netinet/in.h. They will most likely rely on the glibc byteswap macros (/usr/include/bits/byteswap.h on my machine).
These are implemented in assembly in my header, so they should be pretty fast. For constant arguments, the swap is done at compile time.
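The split between a compile-time path for constants and a fast runtime path can be sketched roughly as follows. This is not glibc's actual header text: MY_BSWAP_CONSTANT_32 and my_bswap_32 are hypothetical names, and the sketch assumes a GCC-compatible compiler that provides __builtin_constant_p and __builtin_bswap32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro: pure arithmetic, so the compiler can fold it
 * completely when x is a compile-time constant. */
#define MY_BSWAP_CONSTANT_32(x)                                   \
    ((((x) & 0xff000000u) >> 24) | (((x) & 0x00ff0000u) >> 8) |   \
     (((x) & 0x0000ff00u) << 8)  | (((x) & 0x000000ffu) << 24))

/* Hypothetical wrapper: take the foldable path for constants and the
 * compiler builtin (typically one bswap instruction on x86) otherwise. */
static inline uint32_t my_bswap_32(uint32_t x)
{
    if (__builtin_constant_p(x))
        return MY_BSWAP_CONSTANT_32(x);
    return __builtin_bswap32(x);
}

/* Small self-check covering both a constant and a runtime value. */
static void self_check(void)
{
    volatile uint32_t runtime_value = 0x12345678u;  /* not foldable */

    assert(my_bswap_32(0x12345678u) == 0x78563412u);
    assert(my_bswap_32(runtime_value) == 0x78563412u);
}
```

Both paths return the same value; the distinction only affects what the compiler emits, which is why it shows up in the assembly rather than in program behaviour.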
GCC/glibc causes ntohl() and htonl() to be inlined into the calling code, so the function call overhead is avoided. Furthermore, each ntohl() or htonl() call is translated into a single bswap assembler operation. According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", bswap has both a latency and a throughput of "1" on all current Intel CPUs. So, only a single CPU clock cycle is required to execute ntohl() or htonl().
ntohs() and htons() are implemented as a rotation by 8 bits. This effectively swaps the two halves of the 16-bit operand. Latency and throughput are similar to those of bswap.
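The rotation trick is easy to verify in plain C. The function below is a hypothetical illustration (rotate16_by_8 is not a glibc name) of why rotating a 16-bit value by 8 bits is the same as swapping its two bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rotate a 16-bit value left by 8 bits.
 * The high byte wraps around to the low position, so the two
 * bytes trade places. */
static uint16_t rotate16_by_8(uint16_t x)
{
    /* x is promoted to int for the shifts; the cast drops the excess bits. */
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Small self-check; call it from main() to try the sketch. */
static void self_check(void)
{
    assert(rotate16_by_8(0x1234) == 0x3412);
    /* Rotating twice by 8 is a full 16-bit rotation: the identity. */
    assert(rotate16_by_8(rotate16_by_8(0xbeef)) == 0xbeef);
}
```

For 16 bits, rotating left by 8 and rotating right by 8 give the same result, so the direction does not matter.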