Implementing a logical shift right

Question 1

A logical shift of an N-bit word by n places, either left or right, amounts to copying N-n bits from one word to another. Thus:

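enum { N = 16, n = 4 };        // word size and shift count (example values, assumed)
unsigned int a = 0x1321;
unsigned int b = 0;
unsigned int mask1 = 1;
unsigned int mask2 = 1u << n;  // use repeated addition for left shift...
int i;
for (i = 0; i < N-n; i++) {
    if (a & mask2)             // source bit i+n is set...
        b |= mask1;            // ...so set destination bit i
    mask1 += mask1;
    mask2 += mask2;
}
// b now holds a shifted right by n bits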

Swapping the roles of mask1 and mask2 in the loop (test mask1, set mask2) would implement left shift (with bitwise operations only).
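As a minimal sketch of the swapped version, assuming a 16-bit word and a hypothetical helper name `shift_left_by_copy` (neither is fixed by the snippet above), the loop tests the low mask and sets the high one:

```c
#include <stdint.h>

#define WORD_BITS 16  /* assumed word size N; not specified in the original */

/* Hypothetical helper: left-shift a by n (0 <= n < WORD_BITS) by copying
   bits, using only AND, OR and self-addition. */
uint16_t shift_left_by_copy(uint16_t a, unsigned n)
{
    uint16_t b = 0;
    uint16_t src = 1;   /* walks over source bits 0 .. N-n-1 */
    uint16_t dst = 1;   /* walks over destination bits n .. N-1 */
    unsigned i;

    for (i = 0; i < n; i++)              /* dst = 1 << n by repeated doubling */
        dst = (uint16_t)(dst + dst);

    for (i = 0; i < WORD_BITS - n; i++) {
        if (a & src)                     /* source bit i is set... */
            b |= dst;                    /* ...so set destination bit i+n */
        src = (uint16_t)(src + src);
        dst = (uint16_t)(dst + dst);
    }
    return b;
}
```

The top n bits of the source are never visited, which is what discards them.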

Question 2

In keeping with the nature of the Nand2Tetris course, I've tried to walk a line in this answer, giving examples of Hack assembly coding techniques and general algorithms, but leaving the final code as an exercise.

The Hack ALU does not have any data paths that connect bit N with bit N-1. This means that right-shifts and rotates must be implemented using left rotates. (Note: left = most significant bits, right = least significant bits)

A left-shift is easy, since it's just multiplication by 2, which is itself just self-addition. For example:

// left-shift variable someVar 1 bit

@someVar     // A = address of someVar
D = M        // D = Memory[A]
M = M + D    // Memory[A] = Memory[A] * 2

Left-rotate is a bit more difficult. You need to keep a copy of the leftmost bit, and move it into the rightmost bit after doing the multiply. Note however that you have a copy of the original value of "someVar" in the D register, and you can test and jump based on its value -- if the leftmost bit of D is 1, then D will be less than zero. Furthermore, note that after you multiply "someVar" by 2, its rightmost bit will always be 0, which makes it easy to set without changing any of the other bits.
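To make the idea concrete without giving away the Hack code, here is a rough C model of a 1-bit left-rotate on a 16-bit word. The function name `rotate_left_1` is made up for illustration, and the comparison against 0x8000 stands in for the "D is less than zero" test:

```c
#include <stdint.h>

/* Hypothetical C model of a 1-bit left-rotate on a 16-bit word, using only
   addition and a test of the leftmost bit. */
uint16_t rotate_left_1(uint16_t x)
{
    uint16_t doubled = (uint16_t)(x + x);   /* left shift; bit 0 is now 0 */

    /* x >= 0x8000 means the leftmost bit is 1, i.e. the value would be
       negative when read as a signed 16-bit number. */
    if (x >= 0x8000u)
        doubled = (uint16_t)(doubled + 1);  /* set bit 0, other bits untouched */

    return doubled;
}
```

Because bit 0 is known to be 0 after the doubling, adding 1 is enough to set it and cannot carry into the other bits.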

Once you have left-rotate, right-rotate is straightforward: if you want to right-rotate N bits, you instead left-rotate 16-N bits. Note that this assumes N is in the range 0-15.

Right-shift is the most complicated operation. In this instance, you need to first do the right-rotate, then generate a mask that has the upper N bits set to zero. You AND the result of the right-rotate with the mask.

The basic way to generate the mask is to start with -1 (all bits set) and double it N times (each time adding the running value to itself); this makes the rightmost N bits of the mask 0. Then left-rotate this 16-N times to move all the 0 bits to the leftmost N bits.
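The rotate-then-mask recipe can be modelled in C as follows. This is only a sketch under the same assumptions as the text (16-bit words, N in the range 0-15); the names `rotl16_once`, `rotl16` and `lsr16_via_rotate` are invented for the example, and the loops are the slow, cycle-hungry version:

```c
#include <stdint.h>

/* Rotate a 16-bit word left by one bit using addition and a top-bit test. */
static uint16_t rotl16_once(uint16_t x)
{
    uint16_t doubled = (uint16_t)(x + x);
    return (x >= 0x8000u) ? (uint16_t)(doubled + 1) : doubled;
}

/* Rotate left by count bits, one bit at a time. */
static uint16_t rotl16(uint16_t x, unsigned count)
{
    while (count-- > 0)
        x = rotl16_once(x);
    return x;
}

/* Hypothetical helper: logical right shift by n, assuming 0 <= n <= 15. */
uint16_t lsr16_via_rotate(uint16_t x, unsigned n)
{
    uint16_t mask = 0xFFFFu;            /* -1: all bits set */
    unsigned i;

    for (i = 0; i < n; i++)             /* double n times: low n bits become 0 */
        mask = (uint16_t)(mask + mask);
    mask = rotl16(mask, 16 - n);        /* move the n zero bits to the top */

    /* A right-rotate by n is done as a left-rotate by 16 - n. */
    return (uint16_t)(rotl16(x, 16 - n) & mask);
}
```

For example, shifting 0x1321 right by 4 rotates it to 0x1132, builds the mask 0x0FFF, and ANDs the two to give 0x0132.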

However, this is a lot of cycles, and when programming in assembly language, saving cycles is what it's all about. There are a couple of techniques you can use.

The first is using address arithmetic to implement the equivalent of a case statement. For each of the 16 possible rotate values, you need to load a 16-bit mask value into the D register, then jump to the end of the case. You have to be careful because you can only load 15-bit constants using the @ instruction, but you can do the load and unconditional jump in 6 instructions (4 to load the full 16-bit constant, and 2 to jump).

So if you have 16 of these starting at location (CASE), you just need to multiply N by 6, add it to @CASE, and jump to that location. When thinking about how to multiply by 6, keep in mind one of the really cute features of the HACK instruction set; you can store the results of an ALU operation in multiple registers simultaneously.

The most efficient solution, however, is to precompute a mask table. During your program initialization, you generate the 16 bit masks and store them in some fixed location in memory, then you can just add N to the address of the start of the table and read the mask.

Since the HACK CPU can't access the program ROM other than to fetch instructions, you can't store the table in ROM; you have to use several instructions per table entry to load the value into the D register and then save it into RAM. I ended up writing a simple Python script that generates the code to initialize tables.
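The table approach looks roughly like this when modelled in C. The names `lsr_mask`, `init_lsr_masks` and `lsr_mask_for` are hypothetical, and the table here is filled by a loop rather than by generated load-and-store code:

```c
#include <stdint.h>

/* lsr_mask[n] has the upper n bits clear and the remaining bits set. */
static uint16_t lsr_mask[16];

/* Fill the table once at start-up, using only addition. */
void init_lsr_masks(void)
{
    uint16_t m = 1;                      /* mask for n = 15: only bit 0 set */
    int n;

    for (n = 15; n >= 0; n--) {
        lsr_mask[n] = m;
        m = (uint16_t)(m + m + 1);       /* grow the run of ones by one bit */
    }
}

/* Look up the mask for a right shift of n bits, assuming 0 <= n <= 15. */
uint16_t lsr_mask_for(unsigned n)
{
    return lsr_mask[n];
}
```

After initialization, fetching a mask is a single indexed read, which is the C equivalent of adding N to the table's base address.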

Question 3

It gets easier if you treat the value to shift as unsigned, since a logical right shift won't preserve the sign anyway. Then you just subtract 2 repeatedly until the result is less than 2, at which point the number of subtractions is your quotient (i.e. the value shifted right by one bit).

An example implementation in C:

#include <stdint.h>  // uint16_t

int lsr(int valueToShift)
{
    int shifted = 0;
    uint16_t u = valueToShift;

    while (u >= 2) {
        u -= 2;
        shifted++;
    }

    return shifted;
}

Question 4

You should use binary or hexadecimal since using decimal makes it hard to imagine the number representation.

If you have arithmetic shift but not logical shift, the most obvious solution is to clear the top bits, which the arithmetic shift fills with ones when the value is negative:

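#include <limits.h>  // CHAR_BIT

int LogicalRightShift(int x, int shift)
{
    // valid for 0 < shift < width of int; shift == 0 would shift the mask by the full width
    return (x >> shift) & ((1U << (CHAR_BIT*sizeof(x) - shift)) - 1);
    // or, equivalently (note the unsigned ~0U, since left-shifting a negative int is undefined):
    // return (x >> shift) & ~(~0U << (CHAR_BIT*sizeof(x) - shift));
}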

If you don't have arithmetic right shift either, you can copy it bit by bit:

int LogicalRightShift(int x, int shift)
{
    // assuming int size is 32; the table is unsigned so that 0x80000000 fits
    static const unsigned int bits[] = {
                    0x1,        0x2,        0x4,        0x8,        0x10,       0x20,       0x40,       0x80,
                    0x100,      0x200,      0x400,      0x800,      0x1000,     0x2000,     0x4000,     0x8000,
                    0x10000,    0x20000,    0x40000,    0x80000,    0x100000,   0x200000,   0x400000,   0x800000,
                    0x1000000,  0x2000000,  0x4000000,  0x8000000,  0x10000000, 0x20000000, 0x40000000, 0x80000000
    };
    unsigned int res = 0;
    for (int i = 31; i >= shift; i--)   // walk from the top bit down to bit `shift`
    {
        if ((unsigned int)x & bits[i])
            res |= bits[i - shift];     // copy source bit i to bit i - shift
    }
    return (int)res;
}

Another way is repeatedly dividing by 2. Or you can store the powers of 2 in a lookup table and divide by that power. This may be slower than the bit copying method above if you don't have a hardware divider, but it is still much faster than having to subtract thousands of times like your method. To shift -27139 (38397 as an unsigned 16-bit value) right 1 bit you need to subtract 2 from the number 19198 times, and even more if the number is larger or if you need to shift a different number of bits.
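Both division-based variants are short in C. The sketch below assumes 16-bit unsigned values and a shift count of 0-15; the function names `lsr_by_halving` and `lsr_by_table` are illustrative only:

```c
#include <stdint.h>

/* Logical right shift by dividing by 2 once per bit. */
uint16_t lsr_by_halving(uint16_t value, unsigned shift)
{
    while (shift-- > 0)
        value /= 2;
    return value;
}

/* Same result with a single division by a looked-up power of two. */
uint16_t lsr_by_table(uint16_t value, unsigned shift)
{
    static const uint16_t pow2[16] = {
        1,   2,   4,    8,    16,   32,   64,    128,
        256, 512, 1024, 2048, 4096, 8192, 16384, 32768
    };
    return (uint16_t)(value / pow2[shift]);
}
```

Because the value is unsigned, the division truncates toward zero exactly as a logical shift would, so 38397 shifted right by 1 gives 19198 either way.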

Question 5

A faster way might be to use addition. For a crude example:

#include <stdint.h>  // uint32_t

uint32_t LSR(uint32_t value, int count) {
    uint32_t result = 0;
    uint32_t temp;

    while(count < 32) {
        temp = value + value;
        if(temp < value) {                // Did the addition overflow?
            result = result + result + 1;
        } else {
            result = result + result;
        }
        value = temp;
        count++;
    }
    return result;
}

The basic idea is to shift a 64-bit unsigned integer left "32 - count" times then return the highest 32 bits.

In assembly, most of the code above (the branches, etc) would hopefully become something like add value, value then add_with_carry result, result.
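The 64-bit view can also be written out directly. In this sketch (the name `lsr_via_wide_shift` is invented, and count is assumed to be in the range 0-32) the doubling is done on a 64-bit value, and the final `>> 32` only stands for "take the upper 32-bit half", which on a real machine would be a register pick rather than a shift:

```c
#include <stdint.h>

/* Hypothetical helper: shift a 64-bit copy of value left (32 - count) times
   by self-addition, then return the highest 32 bits. */
uint32_t lsr_via_wide_shift(uint32_t value, int count)
{
    uint64_t wide = value;   /* low half holds value, high half starts at 0 */
    int i;

    for (i = count; i < 32; i++)
        wide += wide;        /* one left shift of the 64-bit pair */

    return (uint32_t)(wide >> 32);   /* upper half = value >> count */
}
```

The two-variable loop above does the same thing, with `result` playing the upper half and `value` the lower half.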