Question

If you have two positive numbers that can be represented exactly in the finite arithmetic of your computer, that is, two machine numbers, and you perform their sum or difference with perfect arithmetic, is the result a machine number as well?

I suppose that in the case of multiplication the answer is "not necessarily", but I'm not sure about sums and subtractions.


Solution

It really depends on what you mean by perfect arithmetic.

If you mean school arithmetic (for want of a better word), then the answer for addition is: not necessarily.

For addition, imagine that the two numbers are both the largest that can be represented by the machine. When you add them, the result will be larger than the largest representable number. However, if they are both 1, then you'll probably be fine (unless you're using a one-bit machine).

For subtraction, it depends. If you restrict the result to positive numbers, then the answer is again: not necessarily; it will depend on the relative sizes of the two numbers. If the result can be negative, then you can likely represent all subtractions of positive numbers, though that depends on the encoding of negative numbers being used. The most common is two's complement, which is fine if both inputs are positive.

But there are other definitions of arithmetic in regular use in computing. Probably the most common are wrapping and saturating arithmetic, which are well defined for fixed machine word sizes, though they may not produce the answers you would expect from "school" arithmetic.
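
For a concrete contrast, here is a minimal sketch in Rust (not part of the original answer; the values are arbitrary), whose standard integer types expose both behaviors as methods:

    fn main() {
        let a: u8 = 200;
        // Wrapping arithmetic reduces the result mod 2^8 = 256.
        assert_eq!(a.wrapping_add(100), 44);
        // Saturating arithmetic clamps at the type's maximum instead.
        assert_eq!(a.saturating_add(100), 255);
    }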

Other tips

No: every data type has a maximum value, and adding two arbitrary values can overflow. No concrete data type can be closed under an operation that can increase the value unless it uses an arbitrary amount of space per variable.

There are two types of numbers that are commonly natively supported by computers: modular integers and floating point.

Modular integers have a lot of nice properties. Given that a1 ≡ b1 (mod n), a2 ≡ b2 (mod n), and a ≡ b (mod n), it always holds that:

  • a + k ≡ b + k (mod n) for any integer k
  • k a ≡ k b (mod n) for any integer k
  • a1 + a2 ≡ b1 + b2 (mod n)
  • a1 – a2 ≡ b1 – b2 (mod n)
  • a1 a2 ≡ b1 b2 (mod n)
  • a^k ≡ b^k (mod n) for any non-negative integer k

To put it simply, the result of addition, subtraction, multiplication, or exponentiation by a constant (b^k) applied to two integers is congruent to the result of the same operation applied to the corresponding mod-n integers. Notably missing here are division and exponentiation of a constant (k^b), which aren't always congruent.
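
For illustration, a minimal sketch in Rust (not part of the original answer): the wrapping operations on an unsigned N-bit type implement exactly arithmetic mod 2^N, so the congruences above carry over directly:

    fn main() {
        let (a, b): (u8, u8) = (200, 100);
        // Wrapping addition agrees with (200 + 100) mod 256 = 44.
        assert_eq!(a.wrapping_add(b), ((200u32 + 100) % 256) as u8);
        // Wrapping multiplication agrees with (200 * 100) mod 256 = 32.
        assert_eq!(a.wrapping_mul(b), ((200u32 * 100) % 256) as u8);
    }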

For floating point, well... floating point is weird. The most common floating-point implementation in computers is IEEE754, which represents floating-point values in three parts: a sign bit, an exponent, and a mantissa. To answer your question in the context of IEEE754: no, adding two IEEE754 floating-point numbers does not always produce a number that can be exactly represented as an IEEE754 number. For example, there exist IEEE754 floating-point values a and c such that c != 0 and a + c = a. This happens, for example, when a has a large exponent and c has a small exponent.
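
A small sketch of that absorption effect (in Rust; any IEEE754 double-precision environment behaves the same):

    fn main() {
        let a: f64 = 1.0e16; // large enough that 1.0 is below half an ulp of a
        let c: f64 = 1.0;
        assert!(c != 0.0);
        // The exact sum 10000000000000001 is not representable in binary64,
        // so it rounds back to a.
        assert_eq!(a + c, a);
    }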

The answer is not as obvious as it could be, for several reasons.

Well, nearly all modern computers (essentially since the mid-1960s) use two's complement (jargon that has become official; a more proper name would be "zero's complement"). And when you perform an ordinary addition or subtraction at the assembly level, the result is truncated to the N least-significant bits, where N is the word length in use. The carry and overflow flags of the operation are either stored in a Flags/PS/etc. register (typical for CISC, but also for SPARC, ARM, POWER) or ignored (MIPS, Alpha, RISC-V, etc.). Adding 30000 to 30000 with a 16-bit word, you'll get -5536 as the result of such truncation. You may treat carry + result as an (N+1)-bit result for an unsigned operation; for a signed one, the "true sign" is (SF xor OF) (x86 names) or (N xor V) (NZVC names).
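
A minimal sketch of that truncation, using Rust (discussed further below) so the wrap-around is explicit rather than left undefined:

    fn main() {
        let a: i16 = 30000;
        // A plain `a + a` would panic in a debug build; wrapping_add gives
        // the truncated two's-complement result described above.
        assert_eq!(a.wrapping_add(30000), -5536); // 60000 mod 2^16, read as signed
    }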

But even at the assembly level the result is not unambiguous; one could use an exception-on-overflow mode or instruction (e.g. MIPS add raises an exception on signed overflow, but addu doesn't).

Similarly for multiplication: if you want the N low bits of the product of two N-bit numbers, you needn't distinguish signed from unsigned multiplication, and you will get the truncated value.
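
This signed/unsigned agreement in the low N bits is easy to check; a sketch in Rust:

    fn main() {
        // The same 16-bit pattern: -2 * 3 as signed and 65534 * 3 as unsigned
        // truncate to identical low 16 bits (0xFFFA).
        let s = (-2i16).wrapping_mul(3); // -6
        let u = 65534u16.wrapping_mul(3); // 196602 mod 65536 = 65530
        assert_eq!(s as u16, u);
    }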

The same approach is defined for a wide range of programming languages, either as the default mode or via specific operators. Some of the most important ones:

  • Java - default operations on +, -, *
  • C# - operations on +, -, * in "unchecked" mode
  • Go - default operations on +, -, *
  • Swift - operations on &+, &-, &*
  • C, C++ - operations on +, -, * on unsigned numbers

All of these truncate the result and ignore overflow.

But since this is error prone, there is a steady trend toward alternatives that provide some checking of the result. They may be defined so that, if overflow happens, the result is not defined at all.

For example, Rust has following functions for its numeric types:

  • checked_add() returns Option<T> for a type T; so, if overflow happened, the Option<> value is empty (None).
  • overflowing_add() returns a pair of the N-bit value (truncated if it overflowed) and an overflow flag;
  • wrapping_add() returns just the N-bit value.

The first of them (checked_add), as noted, doesn't carry a real value if overflow happened; if you call unwrap on it, you'll get a runtime panic.
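
A short runnable illustration of all three (the values are arbitrary):

    fn main() {
        let a: u8 = 250;
        assert_eq!(a.checked_add(10), None); // overflow: no value at all
        assert_eq!(a.checked_add(5), Some(255)); // fits: Some(value)
        assert_eq!(a.overflowing_add(10), (4, true)); // truncated value + overflow flag
        assert_eq!(a.wrapping_add(10), 4); // truncated value only
        // a.checked_add(10).unwrap() would panic at runtime here.
    }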

Analogs for other languages are:

  • Java: Math.addExact() and its analogs, which raise an exception on overflow.
  • C#: operators +, -, * in "checked" mode: the same.
  • Swift: operators +, -, * (but functions like Rust's overflowing_add also exist).
  • GCC extensions: functions like __builtin_add_overflow that return a boolean overflow flag.

But the weirdest case is C and C++ signed integer arithmetic. It is declared to be the programmer's responsibility to avoid overflow. If the compiler detects an overflow condition, it can either narrow the set of possible values for the operator's arguments or insert a so-called poison value. This is not really a value but a marker that an improper operation was requested. Depending on the compiler mode and various (generally unpredictable) circumstances, this can cause an explicit CPU exception, a change in the number of loop iterations around the overflowing operator, and so on.

As for me, I'd strongly vote for making the "checked" mode (exception on overflow) the default and allowing the mode to be varied with syntax-level modifiers (pragmas, etc.).

Finally, floating point requires a separate view. I'll assume IEEE754 as the default. It introduces two special values, INF and NaN (not-a-number). While NaN is formally an allowed value (it has a defined representation, defined processing in standard operations, etc.), it has characteristics specially crafted to differ from normal numbers: for example, a == a is false if a is NaN.
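
A quick sketch of that NaN property in Rust:

    fn main() {
        let a = f64::NAN;
        assert!(a != a); // NaN compares unequal even to itself
        assert!(a.is_nan()); // the reliable way to test for NaN
    }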

With the spread of languages that provide combined data types, the share of cases like Rust's Option<> will grow, and the simple truncating mode will correspondingly decline in popularity. I treat this as a good sign of movement toward program safety.

Addition (and subtraction) of two N-bit numbers results in an (N+1)-bit number, while multiplication of two N-bit numbers results in a 2N-bit number. So, arithmetic in N bits is susceptible to overflow.
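
For instance (a sketch in Rust, not from the original answer): computing in a type twice as wide keeps the full result, since the 2N-bit product and the (N+1)-bit sum both fit:

    fn main() {
        let (a, b): (u16, u16) = (50000, 50000);
        // Widen both operands first; then the full product fits.
        let product = (a as u32) * (b as u32);
        assert_eq!(product, 2_500_000_000);
        let sum = (a as u32) + (b as u32);
        assert_eq!(sum, 100_000);
    }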

So, you might ask, how does computing work if so many common operations are susceptible to overflow?

The answer is that we often know something about the values we're manipulating.

For one example:

for ( int i = 0; i < 100; i++ ) ...

this does addition (i++) but will not overflow, since the range of i is constrained.

For another scenario, multiplication and addition are necessary during array indexing.  Your language may hide this indexing from you, but certainly in assembly language you'll see indexes being multiplied by the size of the array elements and added to pointers.  However, if we use the right data types for index computations (namely, unsigned int for 32-bit machines and long for 64-bit machines) and we access only array elements that we know exist, then these indexing multiplications and additions will not overflow.  This means we check for errors on dynamic memory allocation instead of checking for overflow during index computation.

If we take input from a user or a file, to compute an average for example, we cannot know for sure whether we will have an overflow without a constraint on the problem.  If we don't have such a constraint, then we should check for overflow.  (If we do have a constraint, like no more than 1000 numbers, each in the range -1000 to 1000, we might still consider validating that constraint at runtime.)
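
A sketch of such a check while summing untrusted input (in Rust; checked_sum is a hypothetical helper, and the values are arbitrary):

    // Returns None as soon as the running sum overflows.
    fn checked_sum(values: &[i32]) -> Option<i32> {
        let mut total: i32 = 0;
        for &v in values {
            total = total.checked_add(v)?;
        }
        Some(total)
    }

    fn main() {
        assert_eq!(checked_sum(&[1000, 2000, 3000]), Some(6000));
        assert_eq!(checked_sum(&[i32::MAX, 1]), None);
    }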

We can also use floating-point data types, which are also fixed-size data types, but they trade off accuracy/precision for additional range.

And there are other options as well, such as arbitrary-precision numeric packages (search for BigInteger), though these trade performance for greater range at full precision, compared with the built-in fixed-size data types.

It is the programmer's responsibility to decide which computations to check for overflow and which not to bother with, as well as which data types to choose.  Some languages check all possible calculations and halt the program on overflow, but that is somewhat harsh (such program termination won't tell the user what is really wrong with the input, for example), so programming overflow handling to be more graceful is desirable, and also up to the programmer.

And still, for some calculations we expect overflow and want the truncated answer, because overflow produces predictable values: for example, in computations of checksums, parity bits, and hashes.
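
A sketch of deliberate wrap-around: a toy additive checksum (hypothetical, not any standard algorithm) where truncation mod 2^8 is exactly the behavior we want:

    fn checksum(data: &[u8]) -> u8 {
        data.iter().fold(0u8, |acc, &b| acc.wrapping_add(b))
    }

    fn main() {
        // 0xFF + 0x01 wraps to 0x00; adding 0x02 gives 0x02.
        assert_eq!(checksum(&[0xFF, 0x01, 0x02]), 0x02);
    }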

It can never be exact.

For every possible number representation, there is a largest number it can store.

Take two of those “largest numbers” and add them together. The answer is larger. Can it be represented exactly?

YES - then the “largest number” wasn’t the largest. Contradiction!

NO - then you have the result of an addition which cannot be represented exactly - just as you asked.

This answer is going to be a little pedantic.

In your title you ask if the result will be "exact" but in the body of your question you ask if it is a "machine number".

Because you specifically mention "machine number" in the body of your question, the answer may be a cautious yes/maybe.

You have a standard: IEEE 754 for machine numbers.

If I read it correctly, when you add two numbers and the result would be slightly larger than the largest representable number, it will round to that largest number. That's an exact answer (because it is specified by the standard) and a machine number.

If, however, the result of your math is, say, an order of magnitude higher than the highest representable number, the result will be +infinity (for a positive result).
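
A sketch of both behaviors for IEEE754 double precision under the default round-to-nearest mode (in Rust):

    fn main() {
        let m = f64::MAX;
        // An addend below half an ulp of MAX rounds the sum back to MAX.
        assert_eq!(m + 1.0e291, m);
        // A sum far beyond MAX overflows to +infinity.
        assert_eq!(m + m, f64::INFINITY);
    }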

This may not be the answer you meant though.

The standard is an interesting read nevertheless (and it specifies alternate ways of handling cases like this).

Licensed under: CC-BY-SA with attribution