How are the ntoh functions implemented under RHEL/GCC?

Question 1

They're provided by glibc, not GCC. Look in /usr/include/bits/byteswap.h for the __bswap_16 and __bswap_32 functions, which are used when optimization is enabled (see <netinet/in.h> for details of how).
You didn't say what architecture you're using. On a big-endian system they're no-ops, so optimally fast! On little-endian systems they're architecture-specific, hand-optimized assembly code.
To see what your compiler actually generates, use GCC's -save-temps option to keep the intermediate .s files, or use -S to stop after compilation and before assembling the code, or use http://gcc.godbolt.org/
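
To make the endianness point concrete, here is a minimal sketch of what ntohl is expected to return on either kind of host. The names host_is_little_endian and expected_ntohl are made up for illustration and are not part of glibc; the shift-and-mask swap is only a portable reference, not the assembly glibc actually uses.

```c
#include <arpa/inet.h>   /* ntohl() */
#include <stdint.h>

/* Hypothetical helper: returns 1 if the host stores the least
   significant byte at the lowest address (little-endian). */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Hypothetical reference implementation: the value ntohl() should
   produce on this host. */
uint32_t expected_ntohl(uint32_t x)
{
    if (!host_is_little_endian())
        return x;                       /* big-endian host: no-op */

    /* little-endian host: reverse the four bytes */
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) << 8)  |
           ((x & 0x00ff0000u) >> 8)  |
           ((x & 0xff000000u) >> 24);
}
```

Comparing ntohl(v) against expected_ntohl(v) for a few values is a quick sanity check that the library version behaves the same way on your machine.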

Question 2

Do the following:

test.c

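#include <arpa/inet.h>   /* ntohl() */
#include <stdint.h>      /* uint32_t */

int main(void)
{
   /* volatile keeps the compiler from folding the swap away at compile time */
   volatile uint32_t x = 0x12345678;
   x = ntohl(x);
   return 0;
}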

Then compile with:

$ gcc -O3 -g -save-temps test.c

And analyze the resulting test.s file, or alternatively run objdump -S test.o.

On my machine (Ubuntu 13.04) the relevant assembly is:

movl    $305419896, 12(%esp)
movl    12(%esp), %eax
bswap   %eax
movl    %eax, 12(%esp)

Hints:

305419896 is 0x12345678 in decimal.
12(%esp) is the address of the volatile variable.
All the movl instructions are there for the volatile-ness of x. The only really interesting instruction is bswap.
So ntohl is compiled as an inline intrinsic, not as a function call.

Moreover, if I look at test.i (the preprocessed output), I find that ntohl is simply #defined as __bswap_32(), which is an inline function containing just a call to __builtin_bswap32().
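
That chain can be sketched roughly as follows. This is not the glibc source: my_bswap_32 and my_ntohl are hypothetical stand-ins, and the sketch assumes GCC or Clang, which provide __builtin_bswap32 and the predefined __BYTE_ORDER__ macros.

```c
#include <stdint.h>

/* Hypothetical stand-in for __bswap_32: a thin inline wrapper around
   the compiler builtin, which GCC turns into a bswap on x86. */
static inline uint32_t my_bswap_32(uint32_t x)
{
    return __builtin_bswap32(x);
}

/* Hypothetical stand-in for the ntohl macro: swap on little-endian
   hosts, pass the value through unchanged on big-endian hosts. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define my_ntohl(x) my_bswap_32(x)
#else
#define my_ntohl(x) ((uint32_t)(x))
#endif
```

Because the wrapper is static inline and its body is a single builtin, the optimizer has nothing left to call, which is why only the bswap shows up in the listing above.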

Question 3

These are implemented in glibc. Look at /usr/include/netinet/in.h. They will most likely rely on the glibc byte-swap macros (/usr/include/bits/byteswap.h on my machine).

In my header these are implemented in assembly, so they should be pretty fast. For constant arguments, the swap is done at compile time.
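
As a rough illustration of the compile-time case, a swap written purely with shifts and masks is an integer constant expression, so the compiler can evaluate it while compiling. SWAP32_CONST and swapped_magic are hypothetical names for this sketch (glibc's own constant-swapping macro is similar in spirit), and _Static_assert assumes a C11 compiler.

```c
#include <stdint.h>

/* Hypothetical macro: byte-swap a 32-bit constant using only shifts
   and masks, so the result is itself a constant expression. */
#define SWAP32_CONST(x)                        \
    ((((uint32_t)(x) & 0xff000000u) >> 24) |   \
     (((uint32_t)(x) & 0x00ff0000u) >> 8)  |   \
     (((uint32_t)(x) & 0x0000ff00u) << 8)  |   \
     (((uint32_t)(x) & 0x000000ffu) << 24))

/* Checked by the compiler; no code is generated for this. */
_Static_assert(SWAP32_CONST(0x12345678) == 0x78563412u,
               "constant swap is evaluated at compile time");

/* A static initializer must be a constant expression, so the swapped
   value ends up directly in the object file. */
const uint32_t swapped_magic = SWAP32_CONST(0x12345678);
```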

Question 4

GCC/glibc causes ntohl() and htonl() to be inlined into the calling code, so the function call overhead is avoided. Furthermore, each ntohl() or htonl() call is translated into a single bswap instruction. According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", bswap has both latency and throughput of "1" on all current Intel CPUs. So only a single CPU clock cycle is required to execute ntohl() or htonl().

ntohs() and htons() are implemented as a rotation by 8 bits. This effectively swaps the two halves of the 16-bit operand. Latency and throughput are similar to those of bswap.
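
The rotation trick is easy to write out in plain C. The function name swap16_by_rotate is made up for this sketch; it only shows the idea and is not what the glibc header contains.

```c
#include <stdint.h>

/* Hypothetical helper: rotating a 16-bit value by 8 bits moves the low
   byte into the high position and the high byte into the low position,
   which is exactly a byte swap. */
uint16_t swap16_by_rotate(uint16_t x)
{
    /* x is promoted to int before shifting; the cast drops the bits
       that were shifted above bit 15. */
    return (uint16_t)((x << 8) | (x >> 8));
}
```

For example, 0x1234 becomes 0x3412, and applying the function twice returns the original value.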