Why is size_t unsigned?

https://stackoverflow.com/questions/10168079

31-05-2021
|

Pregunta

Bjarne Stroustrup wrote in The C++ Programming Language:

The unsigned integer types are ideal for uses that treat storage as a bit array. Using an unsigned instead of an int to gain one more bit to represent positive integers is almost never a good idea. Attempts to ensure that some values are positive by declaring variables unsigned will typically be defeated by the implicit conversion rules.

size_t seems to be unsigned "to gain one more bit to represent positive integers". So was this a mistake (or trade-off), and if so, should we minimize use of it in our own code?

Another relevant article by Scott Meyers is here. To summarize, he recommends not using unsigned in interfaces, regardless of whether the value is always positive or not. In other words, even if negative values make no sense, you shouldn't necessarily use unsigned.

Solución

size_t is unsigned for historical reasons.

On an architecture with 16 bit pointers, such as the "small" model DOS programming, it would be impractical to limit strings to 32 KB.

For this reason, the C standard requires (via required ranges) ptrdiff_t, the signed counterpart to size_t and the result type of pointer difference, to be effectively 17 bits.

Those reasons can still apply in parts of the embedded programming world.

However, they do not apply to modern 32-bit or 64-bit programming, where a much more important consideration is that the unfortunate implicit conversion rules of C and C++ make unsigned types into bug attractors, when they're used for numbers (and hence, arithmetical operations and magnitude comparisions). With 20-20 hindsight we can now see that the decision to adopt those particular conversion rules, where e.g. string( "Hi" ).length() < -3 is practically guaranteed, was rather silly and impractical. However, that decision means that in modern programming, adopting unsigned types for numbers has severe disadvantages and no advantages – except for satisfying the feelings of those who find unsigned to be a self-descriptive type name, and fail to think of typedef int MyType.

Summing up, it was not a mistake. It was a decision for then very rational, practical programming reasons. It had nothing to do with transferring expectations from bounds-checked languages like Pascal to C++ (which is a fallacy, but a very very common one, even if some of those who do it have never heard of Pascal).

Otros consejos

size_t is unsigned because negative sizes make no sense.

(From the comments:)

It's not so much ensuring, as stating what is. When is the last time you saw a list of size -1? Follow that logic too far and you find that unsigned should not exist at all and bit operations shouldn't be permitted either. – geekosaur

More to the point: addresses, for reasons you should think about, are not signed. Sizes are generated by comparing addresses; treating an address as signed will do very much the wrong thing, and using a signed value for the result will lose data in a way that your reading of the Stroustrup quote evidently thinks is acceptable, but in fact is not. Perhaps you can explain what a negative address should do instead. – geekosaur

A reason for making index types unsigned is for symmetry with C and C++'s preference for half-open intervals. And if your index types are going to be unsigned, then it's convenient to also have your size type unsigned.

In C, you can have a pointer that points into an array. A valid pointer can point to any element of the array or one element past the end of the array. It cannot point to one element before the beginning of the array.

int a[2] = { 0, 1 };
int * p = a;  // OK
++p;  // OK, points to the second element
++p;  // Still OK, but you cannot dereference this one.
++p;  // Nope, now you've gone too far.
p = a;
--p;  // oops!  not allowed

C++ agrees and extends this idea to iterators.

Arguments against unsigned index types often trot out an example of traversing an array from back to front, and the code often looks like this:

// WARNING:  Possibly dangerous code.
int a[size] = ...;
for (index_type i = size - 1; i >= 0; --i) { ... }

This code works only if index_type is signed, which is used as an argument that index types should be signed (and that, by extension, sizes should be signed).

That argument is unpersuasive because that code is non-idiomatic. Watch what happens if we try to rewrite this loop with pointers instead of indices:

// WARNING:  Bad code.
int a[size] = ...;
for (int * p = a + size - 1; p >= a; --p) { ... }

Yikes, now we have undefined behavior! Ignoring the problem when size is 0, we have a problem at the end of the iteration because we generate an invalid pointer that points to the element before the first. That's undefined behavior even if we never try dereference that pointer.

So you could argue to fix this by changing the language standard to make it legit to have a pointer that points to the element before the first, but that's not likely to happen. The half-open interval is a fundamental building block of these languages, so let's write better code instead.

A correct pointer-based solution is:

int a[size] = ...;
for (int * p = a + size; p != a; ) {
  --p;
  ...
}

Many find this disturbing because the decrement is now in the body of the loop instead of in the header, but that's what happens when your for-syntax is designed primarily for forward loops through half-open intervals. (Reverse iterators solve this asymmetry by postponing the decrement.)

Now, by analogy, the index-based solution becomes:

int a[size] = ...;
for (index_type i = size; i != 0; ) {
  --i;
  ...
}

This works whether index_type is signed or unsigned, but the unsigned choice yields code that maps more directly to the idiomatic pointer and iterator versions. Unsigned also means that, as with pointers and iterators, we'll be able to access every element of the sequence--we don't surrender half of our possible range in order to represent nonsensical values. While that's not a practical concern in a 64-bit world, it can be a very real concern in a 16-bit embedded processor or in building an abstract container type for sparse data over a massive range that can still provide the identical API as a native container.

On the other hand ...

Myth 1: std::size_t is unsigned is because of legacy restrictions that no longer apply.

There are two "historical" reasons commonly referred to here:

sizeof returns std::size_t, which has been unsigned since the days of C.
Processors had smaller word sizes, so it was important to squeeze that extra bit of range out.

But neither of these reasons, despite being very old, are actually relegated to history.

sizeof still returns a std::size_t which is still unsigned. If you want to interoperate with sizeof or the standard library containers, you're going to have to use std::size_t.

The alternatives are all worse: You could disable signed/unsigned comparison warnings and size conversion warnings and hope that the values will always be in the overlapping ranges so that you can ignore the latent bugs using different types couple potentially introduce. Or you could do a lot of range-checking and explicit conversions. Or you could introduce your own size type with clever built-in conversions to centralize the range checking, but no other library is going to use your size type.

And while most mainstream computing is done on 32- and 64-bit processors, C++ is still used on 16-bit microprocessors in embedded systems, even today. On those microprocessors, it's often very useful to have a word-sized value that can represent any value in your memory space.

Our new code still has to interoperate with the standard library. If our new code used signed types while the standard library continues to use unsigned ones, we make it harder for every consumer that has to use both.

Myth 2: You don't need that extra bit. (A.K.A., You're never going to have a string larger than 2GB when your address space is only 4GB.)

Sizes and indexes aren't just for memory. Your address space may be limited, but you might process files that are much larger than your address space. And while you might not have a string with more the 2GB, you could comfortably have a bitset with more than 2Gbits. And don't forget virtual containers designed for sparse data.

Myth 3: You can always use a wider signed type.

Not always. It's true that for a local variable or two, you could use a std::int64_t (assuming your system has one) or a signed long long and probably write perfectly reasonable code. (But you're still going to need some explicit casts and twice as much bounds checking or you'll have to disable some compiler warnings that might've alerted you to bugs elsewhere in your code.)

But what if you're building a large table of indices? Do you really want an extra two or four bytes for every index when you need just one bit? Even if you have plenty of memory and a modern processor, making that table twice as large could have deleterious effects on locality of reference, and all your range checks are now two-steps, reducing the effectiveness of branch prediction. And what if you don't have all that memory?

Myth 4: Unsigned arithmetic is surprising and unnatural.

This implies that signed arithmetic is not surprising or somehow more natural. And, perhaps it is when thinking in terms of mathematics where all the basic arithmetic operations are closed over the set of all integers.

But our computers don't work with integers. They work with an infinitesimal fraction of the integers. Our signed arithmetic is not closed over the set of all integers. We have overflow and underflow. To many, that's so surprising and unnatural, they mostly just ignore it.

This is bug:

auto mid = (min + max) / 2;  // BUGGY

If min and max are signed, the sum could overflow, and that yields undefined behavior. Most of us routinely miss this these kinds of bugs because we forget that addition is not closed over the set of signed ints. We get away with it because our compilers typically generate code that does something reasonable (but still surprising).

If min and max are unsigned, the sum could still overflow, but the undefined behavior is gone. You'll still get the wrong answer, so it's still surprising, but not any more surprising than it was with signed ints.

The real unsigned surprise comes with subtraction: If you subtract a larger unsigned int from a smaller one, you're going to end up with a big number. This result isn't any more surprising than if you divided by 0.

Even if you could eliminate unsigned types from all your APIs, you still have to be prepared for these unsigned "surprises" if you deal with the standard containers or file formats or wire protocols. Is it really worth adding friction to your APIs to "solve" only part of the problem?

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow