Question

TL;DR: Why isn't everybody screaming, "Don't use short, int, and long unless you really need to, and you very likely don't need to!"


I understand that, in theory, by using the types short, int, and long, you let the compiler choose the length that is most efficient for the given processor.

But is this a case of premature optimization being the root of all evil?

Suppose I have an integer variable that I know will always hold numbers from 1 to 1000. My understanding is that, assuming I am not worried about the memory difference between two and four bytes, the proponents of short/int/long would have me make that variable an int because that way the compiler can choose 16 bits or 32 bits depending on what is more efficient for the processor. If I had made it a uint16_t, the compiler may not be able to make code that is quite as fast.

But on modern hardware is that even true? Or rather, is the speed it's going to gain me (if any) really worth the much more likely possibility that using an imprecise type leads to a major bug in my program? For instance, I might use int throughout my program and think of it as representing a 32-bit value because that's how it has been represented on every platform I've used for the past 20 years, but then my code is compiled on an unusual platform where int is two bytes and all sorts of bugs happen.

And aside from bugs, it just seems like an annoyingly imprecise way for programmers to talk about data. As an example, here is the definition that Microsoft gives in 2019 for a GUID structure:

typedef struct _GUID {
  unsigned long  Data1;
  unsigned short Data2;
  unsigned short Data3;
  unsigned char  Data4[8];
} GUID;

Because of what a UUID is, that long has to mean 32 bits, those shorts have to mean 16 bits, and that char has to mean 8 bits. So why continue to talk in this imprecise language of "short", "long" and (heaven help us) "long long"?


Solution

I understand that, in theory, by using the types short, int, and long, you let the compiler choose the length that is most efficient for the given processor.

That is only partially true. All those types have a guaranteed minimum size in ANSI C (AFAIK even in ANSI C89). Code relying only on those minimum sizes is still portable. Cases where the maximum size of a type matters to portability are far less frequent. That said, I have seen (and written) lots of code over the years where int was assumed to be at least 32 bits, code clearly written for environments with 32-bit-or-wider CPUs.
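
If code does assume more than the guaranteed minimum -- say, a 32-bit int -- a compile-time check can at least turn that silent assumption into an explicit, enforced one. A minimal sketch, assuming a C++11 compiler:

#include <climits>

// Fail the build early if this platform's int is narrower than the 32 bits
// the rest of the code base silently assumes.
static_assert(sizeof(int) * CHAR_BIT >= 32,
              "this code assumes int has at least 32 bits");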

But is this a case of premature optimization [...]?

Premature optimization is not only about optimizing for speed. It is about investing extra effort into code, and making code more complicated, for an (often pathological) "just in case" reason. "Just in case it could be slow" is only one of those potential reasons. So avoiding the use of int "just in case" it could be ported to a 16-bit platform in the future can also be seen as a form of premature optimization, when that kind of porting will likely never happen.

That said, I think the part you wrote about int is to some degree correct: if there is any evidence a program might get ported from a 32-bit to a 16-bit platform, it would be best not to rely on int having 32 bits, and to use either long or a specific C99 data type like int32_t or int_least32_t wherever one is unsure whether 16 bits are enough. One could also use a global typedef to provide int32_t on platforms which are not C99 compliant. All of this is a little bit of extra effort (at least in teaching the team which special data types were used in the project, and why).
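
Such a global typedef could look roughly like the following; the HAVE_STDINT_H macro is a placeholder for whatever mechanism the build system uses to detect a C99-capable compiler, so treat this as a sketch rather than a complete portability layer:

/* Hypothetical portability header, e.g. my_int32.h */
#if defined(HAVE_STDINT_H)       /* assumed build-system macro, for illustration */
  #include <stdint.h>            /* provides the real int32_t */
#else
  typedef long int32_t;          /* long is guaranteed to be at least 32 bits */
#endif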

See also this older SO article, where the top answer says that most people don't need that degree of portability.

And to your example about the GUID structure: the data structure shown seems to be mostly OK; it uses data types which are guaranteed to be large enough for each of the parts on every ANSI-compliant platform. So even if someone tried to use this structure for writing portable code, that would be perfectly possible.

As you noted yourself, if someone were to try to use this structure as a spec for a GUID, they could complain that it is imprecise to some degree and that it requires reading the documentation in full to get an unambiguous spec. This is one of the less frequent cases where the maximum size of the types may matter.

Other problems could arise when the content of such a struct is string-formatted, binary-serialized, stored or transmitted somewhere whilst making assumptions about the individual maximum size of each field, the total size being exactly 128 bits, the endianness, or the precise binary encoding of those data types. But since the documentation of the GUID struct does not make any promises about the underlying binary representation, one should not make any assumptions about it when trying to write portable code.
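
If such a struct ever does have to cross a system boundary, the usual remedy is to serialize it field by field into an explicitly specified byte layout instead of copying its in-memory representation. A minimal sketch, assuming (purely for illustration) a little-endian layout for the numeric fields and the documented 32/16/16/64-bit field widths:

#include <cstdint>

struct Guid {                 // illustrative stand-in for the GUID struct above
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint8_t  data4[8];
};

// Produce exactly 16 bytes, little-endian for the numeric fields. Spelling the
// layout out byte by byte keeps the result independent of the host's integer
// sizes, struct padding and endianness.
void guid_to_bytes(const Guid& g, uint8_t out[16]) {
    for (int i = 0; i < 4; ++i) out[i]     = uint8_t(g.data1 >> (8 * i));
    for (int i = 0; i < 2; ++i) out[4 + i] = uint8_t(g.data2 >> (8 * i));
    for (int i = 0; i < 2; ++i) out[6 + i] = uint8_t(g.data3 >> (8 * i));
    for (int i = 0; i < 8; ++i) out[8 + i] = g.data4[i];
}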

OTHER TIPS

They're not deprecated because there's no reason to deprecate them.

I'm almost tempted to leave it at that, because there's honestly not a lot more that really needs to be said--deprecating them would accomplish precisely nothing, so nobody's written a paper trying to deprecate them, and I can't quite imagine anybody bothering to write such a paper either (except, I suppose, perhaps as an April Fool's joke, or something on that order).

But, let's consider a typical use of int:

for (int i=0; i<10; i++)
    std::cout << "something or other\n";

Now, would anybody gain anything by changing i to an int_fast8_t, int_fast16_t, or something similar? I'd posit that the answer is a resounding "no". We'd gain essentially nothing at all.

Now, it's certainly true that there are situations where it makes sense to use explicitly sized types such as int8_t, int16_t and int32_t (or their unsigned variants).

But, part of the intent of C and C++ is to support system programming, and for that, there are definitely times I want a type that reflects the exact size of a register on the target machine. Given that this is an explicit intent of both C and C++, deprecating types that support that makes no sense at all.

What it really comes down to is pretty simple: yes, there are cases where you want a type that's a specific number of bits--and if you need that, C and C++ provide types that are guaranteed to be exactly the size you specify. But there are also cases where you don't care much about the size, as long as it's large enough for the range you're using--and C and C++ provide types to satisfy that need as well.
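
As a sketch of that split (the variable names here are invented purely to illustrate the two situations):

#include <cstdint>

// When the exact width is part of a contract (file format, protocol, hardware
// register), say so explicitly:
uint32_t crc_register = 0;        // must be exactly 32 bits, no more, no less

// When all that matters is "big enough for this range", the loose types (or the
// *_least / *_fast aliases) express exactly that and nothing more:
int loop_counter = 0;             // guaranteed to hold at least [-32767, 32767]
int_least32_t frame_count = 0;    // at least 32 bits; the compiler picks the rest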

From there, it's up to you, the programmer, to know what you really want, and act appropriately. Yes, you've pointed to a case where somebody (at least arguably) made a poor choice. But that doesn't mean it's always a poor choice, or even necessarily a poor choice most of the time.

Another thing to keep in mind is that although there are cases where portability is important, there are also a lot where it matters little, and still others where it doesn't matter at all. At least in my experience, however, sizes of integer types are rarely a significant factor in portability. On one hand, it's probably true that if you looked at a lot of current code, there's undoubtedly quite a bit that actually depends on int being at least 32 bits, rather than the 16-bit minimum required by the standards. But if you tried to port most of that code to (say) a compiler for MS-DOS that used 16-bit ints, you'd quickly run into much larger problems, such as the fact that they were using that int to index into an array of around 10 million doubles--and your real problem in porting the code is a lot less with that int than with storing 80 million bytes on a system that only supports 640K.

Deprecated today means gone tomorrow.

The cost of removing these types from C and C++ would be incredibly high. Not just causing unneeded work, but also likely to cause bugs all over the place.

Microsoft's documentation for GUID should be read in conjunction with Microsoft's C++ compiler's platform-specific definitions of those types, which have well-defined sizes, not with the ANSI C/C++ standards' definitions. So in a sense, the sizes of those GUID fields are well defined for Microsoft's compilers.

The GUID header is of course buggy on non-Microsoft platforms, but the error here is in thinking that Microsoft gives a damn about the standard or other implementations.

Compiled C code (typically) runs natively, and native word sizes vary (they were especially variable in the early ‘70s when C was first developed). You still have code running on 16-bit machines, machines where word sizes aren’t powers of 2 (9-bit bytes, 36-bit words), machines that use padding bits, etc.

Each type guarantees that it can represent a minimum range of values. int is guaranteed to represent values in at least the range [-32767..32767], meaning it’s at least 16 bits wide. On modern desktop and server systems it tends to be 32 bits wide, but that’s not guaranteed.
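
A small test program (illustrative only) makes the gap between the guaranteed minimum and what a given platform actually provides visible:

#include <climits>
#include <iostream>

int main() {
    // The standard only promises INT_MAX >= 32767; the actual width depends
    // on the platform you compile for.
    std::cout << "int:   " << sizeof(int)   * CHAR_BIT << " bits, INT_MAX = " << INT_MAX << '\n';
    std::cout << "short: " << sizeof(short) * CHAR_BIT << " bits\n";
    std::cout << "long:  " << sizeof(long)  * CHAR_BIT << " bits\n";
}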

So no, the bit widths of char, short, int, long, etc, are not fixed, and this is a good thing from C’s perspective. It’s what has allowed C to be ported to such a wide variety of hardware.

It is kind of like talking.

If you talk to yourself, it really does not matter what language, sounds, etc., you use, you will probably understand yourself.

If you talk to someone else, there are specific rules that must be followed in order for both parties to clearly understand. The language matters. The grammar rules for the language matter. Meanings of specific phrases or words matter. When language is written, the spelling matters, and the layout on the page matters.

You are free to not conform with the rules and standards, but other parties are not likely to understand, and you may even cause damage by insulting or using phrases that are ambiguous. Wars have been fought due to failures in communication.

In software there are analogous rules and standards.

If the software does not need to exchange information with any other systems, then yes, use of short/long is unnecessary in most cases as long as the data you are processing fits into the containers you define or use -- overflow is still possible.
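
For instance, even on a single, isolated system the "fits into the container" caveat bites; a deliberately small fixed-width type makes the effect easy to see:

#include <cstdint>
#include <iostream>

int main() {
    uint16_t count = 65535;   // largest value a 16-bit unsigned type can hold
    ++count;                  // wraps silently around to 0 -- no warning, no error
    std::cout << count << '\n';
}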

If -- on the other hand -- the software exchanges information with another system, then that software must be aware of how that information is structured.

For example:

Networking -- packets absolutely must have correct byte order -- little-endian vs big-endian -- and fields within the packet must be the correct number of bits. Even when you think you are sending 'obvious' data like JSON, that data must be converted into network packets that may be much shorter than the total data in your JSON stream, and the packets also have fields for packet type, for sequencing -- so you can reassemble the data on the receiving end -- for error detection and correction, and much, much more. All of the possible network packets must be defined in such a way that there can be no ambiguity on either the sender's or the receiver's part. For this to be possible, you must be able to specify exact sizes for packet fields that work with existing systems and with systems which will use those packets in the future.
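
A sketch of what that means in code: the header layout below is entirely made up, but every field has an exact, agreed width and is written to the wire in network byte order, so the bytes do not depend on the sender's int sizes or endianness:

#include <cstdint>

// Made-up packet header, purely for illustration.
struct PacketHeader {
    uint16_t type;        // 16-bit packet type
    uint16_t sequence;    // 16-bit sequence number, used for reassembly
    uint32_t length;      // 32-bit payload length
    uint32_t checksum;    // 32-bit error-detection field
};

// Serialize into exactly 12 bytes, big-endian (network byte order).
void pack_header(const PacketHeader& h, uint8_t out[12]) {
    auto put16 = [&](int off, uint16_t v) {
        out[off] = uint8_t(v >> 8); out[off + 1] = uint8_t(v);
    };
    auto put32 = [&](int off, uint32_t v) {
        out[off]     = uint8_t(v >> 24); out[off + 1] = uint8_t(v >> 16);
        out[off + 2] = uint8_t(v >> 8);  out[off + 3] = uint8_t(v);
    };
    put16(0, h.type);
    put16(2, h.sequence);
    put32(4, h.length);
    put32(8, h.checksum);
}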

Device Control -- Very similar to networking when you think about it -- where the packet 'fields' roughly correspond to device registers, bits, memory, etc., and controlling a specific device roughly corresponds to using a specific NIC or a specific network IP address. You 'send' a 'packet' by writing bits to specific locations, and you 'receive' a 'packet' by reading bits from specific locations. If you are not the device creator -- as is typical -- you must follow the 'protocol' spelled out by the creator in the device datasheet. The fields (registers) have to be the correct size. The bits have to be in the correct locations. The registers must be correctly located in the system's address or I/O space. The device creator tells you the 'protocol' for exchanging data with the device. The system designer tells you the 'protocol' -- address space and mapping -- for accessing the device.
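
In code, that typically looks something like the following; the register addresses and bit positions here are invented for illustration, and in real code they come straight from the device datasheet and the system's memory map:

#include <cstdint>

// Hypothetical memory-mapped UART. The (made-up) datasheet says these registers
// are exactly 32 bits wide at fixed addresses, so fixed-width types are the only
// honest way to describe them.
constexpr uintptr_t UART_BASE = 0x40001000;   // made-up base address

volatile uint32_t* const UART_STATUS = reinterpret_cast<volatile uint32_t*>(UART_BASE + 0x0);
volatile uint32_t* const UART_DATA   = reinterpret_cast<volatile uint32_t*>(UART_BASE + 0x4);

constexpr uint32_t STATUS_TX_READY = 1u << 0; // made-up bit position

void uart_put(uint8_t byte) {
    while ((*UART_STATUS & STATUS_TX_READY) == 0) {
        // spin until the device reports it is ready to accept a byte
    }
    *UART_DATA = byte;                        // write exactly where the device expects it
}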

You are free to do whatever you want in the software you write, but it is likely that the other party -- network recipient, specific device, etc. -- will not understand what you think you are doing, and in some cases you can even damage the system.

The Ping-of-Death is a network example where intentional violation of the packet format resulted in crashing network receivers that presumed network packets would be correctly formed.

The Fork-Bomb is a system example where intentional abuse of system fork 'protocol' can hang a system until rebooted.

The Buffer-Overrun is a program example where assuming "everything just works" fails when someone (even yourself as the programmer) puts too much data into a container which cannot hold it.

Licensed under: CC-BY-SA with attribution