Question

I read the memcpy implementation in the http://androidxref.com/4.0.4/xref/bionic/libc/string/bcopy.c find following code is hard to understand, can anybody explain it?

 36 /*
 37  * sizeof(word) MUST BE A POWER OF TWO
 38  * SO THAT wmask BELOW IS ALL ONES
 39  */
 40 typedef long word;      /* "word" used for optimal copy speed */
 41 
 42 #define wsize   sizeof(word)
 43 #define wmask   (wsize - 1)
 44 

...

        /*
 78          * Copy forward.
 79          */
 80         t = (long)src;  /* only need low bits */
 81         if ((t | (long)dst) & wmask) {
 82             /*
 83              * Try to align operands.  This cannot be done
 84              * unless the low bits match.
 85              */
 86             if ((t ^ (long)dst) & wmask || length < wsize)
 87                 t = length;
 88             else
 89                 t = wsize - (t & wmask);

What is mean of these bitwise operations? What is their intent?

Was it helpful?

Solution

The basic idea is to meet alignment constraints: each "word" to be copied one word at a time must be aligned on a "word" boundary.

Some CPUs have this as a fundamental constraint, that loads and stores must occur on a "natural" boundary. On older ARM processors the low bits of the address are actually ignored entirely, so that a load or store of two bytes from an odd-valued address have the same effect as:

short w = *(short *)(addr & ~1);

for instance. On some other CPUs an unaligned load or store results in a trap (MIPS and SPARC for instance), and still others will just do it, but with a performance penalty (x86).

So, suppose you're copying a large number of bytes (say, 4096 of them) from address 0x12345 to address 0x22345, and that the "word size" is 4 bytes. If we first copy three bytes, the addresses will now be 0x12348 and 0x22348. At this point we can copy just 1023 4-byte words, one word at a time, without tripping over any alignment issues. After that we'll have one remaining byte to copy, because 4096 = 3 + (4 * 1023) + 1.

This all makes the assumption that bytes are all addressed individually, even when loading and storing "words". This assumption is false on some machines: for instance, the old Data General MV10000 CPU would address "words" using "word addresses", which are essentially byte addresses divided by two. (It's thus impossible to address a "word" that spans two bytes: the word at location 0 has byte addresses 0 and 1 but word address 0; the word at location 1 has byte addresses 2 and 3; the word at location 2 has byte addresses 4 and 5; and so on.) On machines like this, you may need to use a different version of bcopy.c.

As @Alex noted, the XOR is just making sure that it is actually possible to align the two addresses. If you're copying from 0x12345 to 0x22345, it is; but if you're copying from 0x12345 to 0x22344, the two addresses will never line up.

OTHER TIPS

Just do it step by step:

t = (long)src;
if ((t | (long)dst) & wmask)

These check if at least one of src and dst isn't a multiple of sizeof(long).

if ((t ^ (long)dst) & wmask || length < wsize)

This checks if src and dst are aligned differently w.r.t. sizeof(long) (IOW, aren't "equal" multiples of sizeof(long)) or if length < sizeof(long)-1.

At the end you receive in t how many bytes have to be copied as bytes between unaligned locations, either all (length), or just enough (less than sizeof(long)) to reach addresses that are multiple of sizeof(long) from where the rest can be copied in units of long. And the latter is the speed optimization.

To see all that you have to know that an integer, when expressed in binary, is a multiple of a certain power of 2 when its least significant bits below that power of 2 are all zeroes.

Examples:

1002 (410) is a multiple of 1002 (410)
11002 (1210) is a multiple of 1002 (410)
100002 (1610) is a multiple of 1002 (410)
02 (010) is a multiple of 1002 (410)
112 (310) is not a multiple of 1002 (410)
11012 (1310) is not a multiple of 1002 (410)

This is what & (sizeof(long)-1) is used for.

You also need to know that a value XORed with itself gives 0 and when you XOR different values the result is non-zero. So you can use XOR for comparison. And that's what (t ^ (long)dst) is for.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top