Question

After reading this question on signed/unsigned compares (they come up every couple of days, I'd say), I wondered: why don't we have proper signed/unsigned compares, and instead this horrible mess? Take the output from this small program:

#include <stdio.h>

/* Compare (signed T1)-1 against (unsigned T2)1 and print the result. */
#define C(T1,T2)\
 {signed   T1 a=-1;\
  unsigned T2 b=1;\
  printf("(signed %5s)%d < (unsigned %5s)%d = %d\n",#T1,(int)a,#T2,(int)b,(a<b));}

/* Print sizeof(T), then compare signed T against each unsigned type. */
#define C1(T) printf("%s:%d\n",#T,(int)sizeof(T)); C(T,char);C(T,short);C(T,int);C(T,long);

int main()
{
 C1(char); C1(short); C1(int); C1(long);
}

Compiled with my standard compiler (gcc, 64-bit), I get this:

char:1
(signed  char)-1 < (unsigned  char)1 = 1
(signed  char)-1 < (unsigned short)1 = 1
(signed  char)-1 < (unsigned   int)1 = 0
(signed  char)-1 < (unsigned  long)1 = 0
short:2
(signed short)-1 < (unsigned  char)1 = 1
(signed short)-1 < (unsigned short)1 = 1
(signed short)-1 < (unsigned   int)1 = 0
(signed short)-1 < (unsigned  long)1 = 0
int:4
(signed   int)-1 < (unsigned  char)1 = 1
(signed   int)-1 < (unsigned short)1 = 1
(signed   int)-1 < (unsigned   int)1 = 0
(signed   int)-1 < (unsigned  long)1 = 0
long:8
(signed  long)-1 < (unsigned  char)1 = 1
(signed  long)-1 < (unsigned short)1 = 1
(signed  long)-1 < (unsigned   int)1 = 1
(signed  long)-1 < (unsigned  long)1 = 0

If I compile for a 32-bit target, the result is the same except that:

long:4
(signed  long)-1 < (unsigned   int)1 = 0

The "How?" of all this is easy to find: Just goto section 6.3 of the C99 standard or chapter 4 of C++ and dig up the clauses which describe how the operands are converted to a common type and this can break if the common type reinterprets negative values.

But what about the "Why?"? As we can see, the '<' gives the mathematically wrong answer in about half of the cases above, and the outcome depends on the concrete sizes of the types, so it is platform-dependent. Here are some points to consider:

  • The convert-and-compare process is not exactly a prime example of the Rule of Least Surprise.

  • I don't believe there is code out there that relies on the proposition that (short)-1 > (unsigned)1 and was not written by terrorists.

  • This gets really terrible in C++ template code, because you need type-trait magic to knit together a correct "<".


After all, comparing signed and unsigned values of different types correctly is easy to implement:

signed X < unsigned Y  ->  (a < (X)0) || ((Z)a < (Z)b)   where Z is the common type of X and Y

The pre-check is cheap, and the compiler can optimize it away whenever it can statically prove that a >= 0.
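
Here is a minimal sketch for one concrete type pair, following the formula above (safe_lt is just an illustrative name; a real fix would of course have to cover all signed/unsigned pairs, e.g. via templates in C++):

#include <stdio.h>

/* Value-correct "signed long < unsigned long": a negative a is smaller
   than any unsigned value; otherwise both sides fit into the common
   type Z = unsigned long. */
static int safe_lt(long a, unsigned long b)
{
    return a < 0 || (unsigned long)a < b;
}

int main(void)
{
    printf("%d\n", -1L < 1UL);          /* 0: built-in compare, -1 wraps  */
    printf("%d\n", safe_lt(-1L, 1UL));  /* 1: the pre-check catches a < 0 */
    return 0;
}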

So here's my question:

Would it break the language or existing code if we added safe signed/unsigned compares to C/C++?

("Would it break the language" means would we need to make massive changes to different parts of the language to accommodate this change)


UPDATE: I ran this on my good old Turbo-C++ 3.0 and got this output:

char:1
(signed  char)-1 < (unsigned  char)1 = 0

Why does (signed char)-1 < (unsigned char)1 evaluate to 0 here?


Solution

Yes, it would break the language/existing code. The language, as you have noted, carefully specifies the behavior when signed and unsigned operands are used together, and that behavior of the comparison operators is essential for some important idioms, like:

if (x-'0' < 10U)
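
The trick is that x - '0' is computed in int, and the mixed comparison against 10U turns a negative difference into a huge unsigned value, so the single test rejects characters below '0' just as well as those at or above '0' + 10. For illustration (is_digit is just a name for this sketch):

#include <stdio.h>

/* One-test digit check: anything outside '0'..'9' yields either a value
   >= 10 or a negative value that wraps to something huge when compared
   against the unsigned constant 10U. */
static int is_digit(char x)
{
    return x - '0' < 10U;
}

int main(void)
{
    printf("%d %d %d\n", is_digit('7'), is_digit('a'), is_digit(' '));  /* 1 0 0 */
    return 0;
}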

Not to mention things like (equality comparison):

size_t l = mbrtowc(&wc, s, n, &state);
if (l==-1) ... /* Note that mbrtowc returns (size_t)-1 on failure */
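
The reason l == -1 does the right thing is the very conversion being complained about: the int -1 is converted to size_t and becomes exactly the (size_t)-1, i.e. SIZE_MAX, that mbrtowc returns on error. A stand-alone illustration (no real mbrtowc call, just the comparison):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    size_t l = (size_t)-1;          /* stand-in for a failed mbrtowc() */
    printf("%d\n", l == -1);        /* 1: -1 converts to SIZE_MAX      */
    printf("%d\n", l == SIZE_MAX);  /* 1: the same test, spelled out   */
    return 0;
}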

As an aside, specifying "natural" behavior for mixed signed/unsigned comparisons would also incur a significant performance penalty, even in programs that currently use such comparisons in safe ways, where they already have their "natural" behavior because of constraints on the input that the compiler would have a hard time determining (or might not be able to determine at all). In writing your own code to handle these tests, I'm sure you have already seen what the performance penalty would look like, and it's not pretty.

OTHER TIPS

My answer is for C only.

There is no type in C that can accommodate all possible values of all possible integer types. The closest C99 comes to this is intmax_t and uintmax_t, and their intersection covers only half of each type's range.
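
A quick way to see the mismatch (the exact values are platform-dependent):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* uintmax_t cannot hold any negative value, and intmax_t cannot hold
       the upper half of the uintmax_t range. */
    printf("intmax_t : %jd .. %jd\n", INTMAX_MIN, INTMAX_MAX);
    printf("uintmax_t: 0 .. %ju\n", UINTMAX_MAX);
    return 0;
}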

Therefore, you cannot implement a mathematical value comparison such as x <= y by first converting x and y to a common type and then doing a simple operation. That would be a major departure from a general principle of how operators work, and it would also break the intuition that operators correspond to things that tend to be single instructions on common hardware.

Even if you added this extra complexity to the language (and the extra burden to implementation writers), it wouldn't have very nice properties. For example, x <= y would still not be equivalent to x - y <= 0. If you wanted all these nice properties, you'd have to make arbitrary-sized integers part of the language.
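
The equivalence already fails in today's C, because fixed-width subtraction wraps around. A small example, using unsigned arithmetic to stay clear of undefined behaviour:

#include <stdio.h>

int main(void)
{
    unsigned x = 0, y = 1;
    printf("%d\n", x <= y);      /* 1 */
    printf("%d\n", x - y <= 0);  /* 0: x - y wraps around to UINT_MAX */
    return 0;
}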

I'm sure there's plenty of old unix code out there, possibly some running on your machine, that assumes that (int)-1 > (unsigned)1. (Ok, maybe it was written by freedom fighters ;-)

If you want lisp/haskell/python/$favorite_language_with_bignums_built_in, you know where to find it...

I do not think it would break the language, but yes, it could break some existing code (and the breakage would probably be hard to detect at the compiler level).

There is a lot more code written in C and C++ than you and I together can imagine (some of it may even be written by terrorists).

Relying on "proposition that (short)-1 > (unsigned)1" may be done unintentionally by someone. There exists a lot of C code dealing with complex bit manipulation and similar things. It is quite possible some programmer may be using the current comparison behaviour in such code. (Other people have already provided nice examples of such code, and a the code is even simpler than I would expect).

The current solution is to warn on such comparisons instead and leave the fix to the programmer, which I think is in the spirit of how C and C++ work. Also, solving it at the compiler level would incur a performance penalty, and that is something C and C++ programmers are extremely sensitive about. Two tests instead of one might seem like a minor issue to you, but there is probably plenty of C code where it would matter. The previous behaviour could still be forced, e.g. by explicit casts to a common data type, but that again requires the programmer's attention, so it is no better than a simple warning.
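
For what it's worth, that warning already exists in mainstream compilers (gcc and clang spell it -Wsign-compare), and the explicit cast is the usual way to say "yes, I really do mean the converting compare":

#include <stdio.h>

int main(void)
{
    int      a = -1;
    unsigned b =  1;

    if (a < b)            /* warns when -Wsign-compare is enabled */
        puts("never reached: a is converted to UINT_MAX here");

    if ((unsigned)a < b)  /* same generated code, but the intent is
                             explicit and the warning goes away    */
        puts("still never reached");
    return 0;
}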

I think C++ is like the Roman Empire. It's big, and too established to fix the things that are going to destroy it.

C++0x and Boost are examples of horrible, horrible syntax, the kind of baby only its parents can love, and a long, long way from the simple, elegant (but severely limited) C++ of 10 years ago.

The point is, by the time one has "fixed" something as terribly simple as comparisons of integral types, enough legacy and existing C++ code would have been broken that one might as well just call it a new language.

And once broken, there is so much else that is also eligible for retroactive fixing.

The only ways for a language to define rules that come close to upholding the Principle of Least Surprise at run time when combining operands of different C-language types would be to have the compiler forbid implicit type conversions in at least some contexts (shifting the 'surprise' to "why won't this compile?" and making it less likely to cause unexpected bugs down the road), to define multiple types for each storage format (e.g. both wrapping and non-wrapping variants of each integer type), or both.

Having multiple types for each storage format, e.g. both wrapping and non-wrapping versions of signed and unsigned 16-bit integers, would let the compiler distinguish between "I'm using a 16-bit value here in case it makes things more efficient, but it's never going to exceed the range 0-65535 and I wouldn't care what happened if it did" and "I'm using a 16-bit value which needs to wrap to 65535 if it goes negative". In the latter case, a compiler that kept such a value in a 32-bit register would have to mask it after each arithmetic operation, but in the former case the compiler could omit that.

With regard to your particular wish, the meaning of a comparison between a non-wrapping signed long and a non-wrapping unsigned long would be clear, and it would be appropriate for a compiler to generate the multi-instruction sequence necessary to make it happen (since converting a negative number to a non-wrapping unsigned long would be Undefined Behavior, having the compiler define a behavior for comparison operators on those types would not conflict with anything else that might be specified).

Unfortunately, beyond having the compiler generate warnings for mixed-operand comparisons, I don't really see much that can be done with the C language as it exists without adding new types to it as described above; although I would regard the addition of such new types as an improvement, I wouldn't hold my breath.

If a comparison between integer types compared the actual mathematical values, I'd want the same to happen for comparisons between integer and floating-point types. And comparing the exact values of an arbitrary 64-bit integer and an arbitrary double-precision floating-point number is quite difficult. But then the compiler would probably be better at it than I am.
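
Quite difficult indeed: a double has only 53 significant bits, so distinct 64-bit integers can collapse to the same double, and the naive convert-and-compare stops being exact. A small illustration, assuming the usual IEEE-754 double:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t big = (INT64_C(1) << 53) + 1;  /* 2^53 + 1, not representable as double */
    double  d   = 9007199254740992.0;      /* 2^53, exactly representable           */

    printf("%d\n", (double)big == d);      /* 1: big rounds to 2^53 on conversion   */
    printf("%d\n", big > (int64_t)d);      /* 1: yet big is mathematically larger   */
    return 0;
}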

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow