How does modulus work?

https://softwareengineering.stackexchange.com/questions/283035

08-10-2020
|

Question

I am not asking asking how the mathematical concept of modulus works, or how 20 % 3 == 2 I understand this fine. I'm more curious how the compiler/interpreter determines this. I can think of a very naive way to do it, like so:

int modulus(int a, int b)
{
    while (a > b)
    {
        a -= b;
    }
return a;
}

(obviously this needs a few more cases because a can be smaller then b, but you get the idea.) But this would would have some serious performance issues. I threw together some C++ to test the efficiency of both the modulus operator and my naive function.

#include <iostream>
#include <chrono>
#include <limits>
#include "chrono_io"
using namespace std;

typedef chrono::high_resolution_clock Clock;

int mod(int a, int b)
{
    while (a > b)
    {
        a -= b;
    }
    return a;
}

int main()
{
    auto start = Clock::now();
    cout << 20 % 3 << endl;
    auto end = Clock::now();
    cout << "Modulus on a small number took "  << end - start << " seconds to calculate.\n";

    start = Clock::now();
    cout << numeric_limits<int>::max() % 3 << endl;
    end = Clock::now();
    cout << "Modulus on a large number took "  << end - start << " seconds to calculate.\n";

    start = Clock::now();
    cout << mod(20, 3) << endl;
    end = Clock::now();
    cout << "My modulus function on a small number took "  << end - start << " seconds to calculate.\n";

    start = Clock::now();
    cout << mod(numeric_limits<int>::max(), 3) << endl;
    end = Clock::now();
    cout << "My modulus function on a large number took "  << end - start << " seconds to calculate.\n";
}

and the output is:

2
Modulus on a small number took 43629 nanoseconds seconds to calculate.
1
Modulus on a large number took 5644 nanoseconds seconds to calculate.
2
My modulus function on a small number took 3682 nanoseconds seconds to calculate.
1
My modulus function on a large number took 2153824554 nanoseconds seconds to calculate.

The modulus operator varied from 4,000-70,000 nanoseconds. My function always won on the small numbers, but took around 2 seconds on the large number. Since the modulus operator is so much more consistent, I assume that it compares the number on a bitwise level.

So how does the modulus operator work? I can think of some cases that would be very easy to determine. For example, with mod 2, you can just look at the very last byte. Or with mod 4 you can look at the second to last byte. But not all numbers have such a simple pattern.

Solution

Nearly all CPUs have a single instruction that will return the modulus of a value. For instance, consider this program:

int main()
{
    int i = 10;
    return i % 3;
}

If I compile this on my Intel OS X machine using g++ -S, the result will be some boilerplate and this:

movl    $3, %eax
movl    $0, -4(%rbp)
movl    $10, -8(%rbp)
movl    -8(%rbp), %ecx
movl    %eax, -12(%rbp)         ## 4-byte Spill
movl    %ecx, %eax
cltd
movl    -12(%rbp), %ecx         ## 4-byte Reload
idivl   %ecx
movl    %edx, %eax

The actual modulus happens with this instruction: idivl %ecx. When this instruction executes, the quotient will be placed in in %eax and the remainder will be placed in in %edx.

Since really, this means the % operation is only going to take a few clock cycles, the bulk of which is actually moving the data into the right register. Also note that with Intel at least, the same operation finds both the quotient and the remainder, so in reference to your comment, / and % take exactly the same time. It is the same operation. The only thing that changes is what register the compiler gets the results from.

With any CPU made in the last couple decades, you can assume that any basic mathematical operation (including things that look like library functions like sqrt or cos) is actually a single machine instruction and generally only take a few clock cycles at most.

[UPDATE]

As people have noted in the comments, to see something approaching a correct timing, you need to remove the output from the timed section like this:

int i;
auto start = Clock::now();
i = 20 % 3;
auto end = Clock::now();
cout << i << endl;

But even that is likely not accurate as the actual granularity of the timings may exceed what you are trying to time. Instead, you might want to do this:

int i=0;
int x;
auto start = Clock::now();
for(i=0;i<1000000;i++)
    x = i % 3;
auto end = Clock::now();
cout << i << endl;

Then divide your timing by 1,000,000. (This will be slightly high as it includes the time taken for an assignment, a comparison and an increment.) On my machine, this gives a time of 5 nanoseconds.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange