Understanding the TCP checksum function

Question 1

The checksum function appears to be for big-endian processors only.

The first while loop is optimized for speed.

The &answer trick loads the last byte (if there is an odd number of bytes) into the high byte of answer, leaving the low byte zero, similar to what your code does with data[i] & 0xff00. It works like this:

1) take the address of answer      (&answer)
2) convert that to a byte pointer  (uint8_t *)  
2a) on a big endian processor the first byte of a 16-bit quantity is the high byte
3) overwrite the high byte with the last byte of the data
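
As a minimal sketch of the steps above, the store can be isolated in a small helper. The name pad_odd_byte is hypothetical and not part of the function under discussion. The point to check is the byte layout in memory: the data byte comes first and a zero byte second, which on a big-endian processor reads as the data byte in the high half of answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: builds the 16-bit word for a trailing odd byte. */
uint16_t pad_odd_byte(const uint8_t *last)
{
    uint16_t answer = 0;             /* both bytes start out as zero */

    assert(last != NULL);

    /* Steps 1-3: take &answer, view it as a byte pointer, and
     * overwrite the first byte in memory with the last data byte. */
    *(uint8_t *)(&answer) = *last;

    return answer;                   /* second byte in memory is still zero */
}
```

For a last byte of 0xAB, the two bytes of the returned word in memory are 0xAB followed by 0x00.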

The checksum is supposed to be computed with the carries added back in. The code assumes that it is running on a machine where an int is 32 bits. Therefore, (sum & 0xffff) is the 16-bit checksum, and (sum >> 16) holds the carry bits (if any) that need to be added back in. Hence, the line

sum = (sum >> 16) + (sum & 0xffff);

adjusts the sum to include the carries. However, that line of code could itself generate another carry bit. So the next line sum += (sum >> 16) adds that carry (if any) back into the checksum.
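
The two folding lines can be tried in isolation. This is only a sketch: fold_carries is a hypothetical name, and uint32_t is used in place of int so the 32-bit assumption is explicit. The assert documents why a single extra addition is enough.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: folds a 32-bit running sum into 16 bits. */
uint16_t fold_carries(uint32_t sum)
{
    /* Add the carry bits (upper 16 bits) to the 16-bit partial sum. */
    sum = (sum >> 16) + (sum & 0xffff);

    /* The result is at most 0xffff + 0xffff, so at most one carry
     * bit can be left in bit 16. */
    assert(sum <= 0x1fffeu);

    /* Add that last carry (if any) back in. */
    sum += (sum >> 16);

    return (uint16_t)sum;            /* truncation drops bit 16 */
}
```

For example, 0x1ffff becomes 0x10000 after the first line, so the second line is needed: it adds the new carry, and the 16-bit result is 0x0001.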

Finally, take the ones-complement of the sum and store it in answer. Note that htons is not used since the whole function implicitly assumes that it is running on a big endian processor.

Question 2

What does the statement *(uint8_t *) (&answer) = *(uint8_t *) w; actually do?

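This casts both pointers from uint16_t * to uint8_t *, so only a single byte, the first byte in memory, is copied from *w into answer. On a little-endian machine that byte holds the 8 rightmost bits. Consider, with answer starting at 0:

uint16_t x = 0x1234;
uint16_t answer = 0;
uint16_t *w = &x;                           // *w = 0001001000110100

*(uint16_t *) (&answer) = *(uint16_t *) w;  // answer = 0001001000110100

answer = 0;
*(uint8_t *) (&answer) = *(uint8_t *) w;    // answer = 0000000000110100 (little-endian)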
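
A runnable version of the example above might look like the following. It is a sketch under stated assumptions: copy_first_byte and host_is_little_endian are hypothetical helper names, and answer is initialized to 0 before the store. Which numeric value comes out depends on the host byte order, because the cast always selects the first byte in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Hypothetical helper: copies only the first byte in memory of *w. */
uint16_t copy_first_byte(const uint16_t *w)
{
    uint16_t answer = 0;

    assert(w != NULL);

    /* Same statement as in the question, with const added on the read. */
    *(uint8_t *)(&answer) = *(const uint8_t *)w;

    return answer;
}
```

With x = 0x1234, copy_first_byte(&x) yields 0x0034 on a little-endian host and 0x1200 on a big-endian host.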

Why do we take the sum as:

sum = (sum >> 16) + (sum & 0xFFFF);
sum += (sum >> 16);
answer = ~sum;

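The sum is 32 bits. Since 65536 ≡ 1 (mod 65535), the end-around carry expression (sum & 0xffff) + (sum >> 16) reduces sum modulo 65535 (with 0xffff standing in for a nonzero multiple of 65535). The second line is needed because that addition can itself produce a carry, which must also be added back into the sum. The last line then takes the complement.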
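
The modulo-65535 claim can be checked with a small experiment. The sketch below writes the end-around carry as a loop, a variant of the two-line form, and fold_loop is a hypothetical name. The folded value should always agree with the original sum modulo 65535.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: end-around carry written as a loop. */
uint16_t fold_loop(uint32_t sum)
{
    /* Each pass replaces k * 65536 + r with k + r, which leaves the
     * value unchanged modulo 65535 because 65536 is congruent to 1. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    assert(sum <= 0xffffu);          /* no carry bits remain */
    return (uint16_t)sum;
}
```

For instance, fold_loop(0x2ddf0) gives 0xddf2, and 0x2ddf0 % 65535 is also 0xddf2.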

Question 3

*(uint8_t *) (&answer) = *(uint8_t *) w; On the right side, it converts w to a uint8_t * and dereferences it, so only one byte is read. This cuts off the garbage byte that would be read when dereferencing a uint16_t * pointing to the last byte. On the left side, it takes the address (pointer) of answer, converts it to a uint8_t *, and dereferences it. So it takes the first byte pointed to by w and assigns the value to the first byte of answer. In effect, this line performs step 2: add one byte of zero padding to the end of the last block if it is not 2 bytes long, to make it 2 bytes. The conversions on the left side are needed to support big endian systems... I think.

Question 4

This statement accommodates the case (see RFC 793 or RFC 1071) where the packet has an odd number of bytes: [A,B] + [C,D] + ... + [Z,0], by incorporating into the sum a 16-bit quantity (answer) whose first byte is Z and whose second byte is 0. Remember that + here is always 1's complement addition.

sum is a 32-bit accumulator. To add in 1's complement, we add the carry back in after accumulating. The 2 most significant bytes of sum contain the carry bit(s), if any.

If you take a look at RFC 1071, you can see at the top which RFCs update it. There are none that supersede it.
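
Putting the odd-byte padding and the carry folding together, a complete function could be sketched as below. This is not the function from the question, only an illustration of the same steps: the name checksum_sketch is hypothetical, uint32_t stands in for the 32-bit int, and memcpy replaces the direct uint16_t * dereference to sidestep alignment problems. The accumulator cannot overflow for inputs up to 65535 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a 16-bit 1's complement checksum. */
uint16_t checksum_sketch(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;                /* 32-bit accumulator */
    uint16_t word;
    uint16_t answer = 0;

    assert(data != NULL || len == 0);

    /* [A,B] + [C,D] + ... : add the data two bytes at a time. */
    while (len > 1) {
        memcpy(&word, p, sizeof word);   /* native-order 16-bit read */
        sum += word;
        p += 2;
        len -= 2;
    }

    /* Odd length: add [Z,0], with Z in the first byte of answer. */
    if (len == 1) {
        *(uint8_t *)(&answer) = *p;
        sum += answer;
    }

    /* Add the carries (upper 16 bits) back in, twice. */
    sum = (sum >> 16) + (sum & 0xffff);
    sum += (sum >> 16);

    answer = (uint16_t)~sum;         /* complement, truncated to 16 bits */
    return answer;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7, the returned value is laid out in memory as the bytes 22 0d. For the three bytes 00 01 f2, it is laid out as 0d fe.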