Question

In Java there are primitive types for byte, short, int and long and the same thing for float and double. Why is it necessary to have a person set how many bytes should be used for a primitive value? Couldn't the size just be determined dynamically depending on how big the number passed in was?

There are 2 reasons I can think of:

  1. Dynamically setting the size of the data would mean it would need to be able to change dynamically as well. This could potentially cause performance issues?
  2. Perhaps the programmer wouldn't want someone to be able to use a bigger number than a certain size and this lets them limit it.

I still think there could've been a lot to gain by simply using a single int and float type, was there a specific reason Java decided not to go this route?


Solution

Like so many aspects of language design, it comes down to a trade-off between elegance and performance (not to mention some historical influence from earlier languages).

Alternatives

It is certainly possible (and quite simple) to make a programming language that has just a single type of natural numbers nat. Almost all programming languages used for academic study (e.g. PCF, System F) have this single number type, which is the more elegant solution, as you surmised. But language design in practice is not just about elegance; we must also consider performance (the extent to which performance is considered depends on the intended application of the language). The performance comprises both time and space constraints.

Space constraints

Letting the programmer choose the number of bytes up-front can save space in memory-constrained programs. If all your numbers are going to be less than 256, then you can fit eight times as many bytes as longs in the same space, or use the saved storage for more complex objects. The standard Java application developer does not have to worry about these constraints, but they do come up.
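As a rough illustration (exact sizes depend on the JVM, object headers, and padding), compare the element storage of a primitive byte array with that of a long array holding the same number of values; the array size here is arbitrary:

```java
public class ArrayFootprint {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Each byte[] element occupies 1 byte of payload...
        byte[] small = new byte[n];   // roughly 1 MB of element data

        // ...while each long[] element occupies 8 bytes.
        long[] large = new long[n];   // roughly 8 MB of element data

        // If every value fits into a byte, the byte[] stores the same
        // information in about one eighth of the space (both arrays also
        // carry a small, constant per-array header added by the JVM).
        small[0] = 100;
        large[0] = 100L;
        System.out.println(n + " bytes vs " + (8L * n) + " bytes of element data");
    }
}
```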

Efficiency

Even if we ignore space, we are still constrained by the CPU, which only has instructions that operate on a fixed number of bytes (8 bytes on a 64-bit architecture). That means even providing a single 8-byte long type would make the implementation of the language significantly simpler than an unbounded natural number type, because arithmetic operations could be mapped directly to a single underlying CPU instruction. If you allow the programmer to use arbitrarily large numbers, then a single arithmetic operation must be mapped to a sequence of complex machine instructions, which slows the program down. This is point (1) that you brought up.
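To make the cost difference concrete, here is a rough sketch (not a rigorous benchmark; absolute numbers depend heavily on the JVM, JIT warm-up, and hardware) contrasting a primitive long sum with the same sum done via java.math.BigInteger, Java's arbitrary-precision integer type:

```java
import java.math.BigInteger;

public class AddCost {
    public static void main(String[] args) {
        final int iterations = 1_000_000;

        // Primitive long: after JIT compilation, each addition is
        // essentially a single machine add instruction.
        long primitiveSum = 0L;
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            primitiveSum += i;
        }
        long t1 = System.nanoTime();

        // BigInteger: each addition is a method call that allocates a new
        // object and loops over an internal word array.
        BigInteger bigSum = BigInteger.ZERO;
        long t2 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            bigSum = bigSum.add(BigInteger.valueOf(i));
        }
        long t3 = System.nanoTime();

        System.out.printf("long:       %d in %d ms%n", primitiveSum, (t1 - t0) / 1_000_000);
        System.out.printf("BigInteger: %s in %d ms%n", bigSum, (t3 - t2) / 1_000_000);
    }
}
```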

Floating-point types

The discussion so far has only concerned integers. Floating-point types are a complex beast, with extremely subtle semantics and edge cases. Thus, even though we could easily replace int, long, short, and byte with a single nat type, it is not clear what the type of floating-point numbers even is. They aren't real numbers, obviously, as real numbers cannot exist in a programming language. They aren't quite rational numbers, either (though it's straightforward to create a rational type if desired). Basically, IEEE decided on a way to kinda sorta approximate real numbers, and all languages (and programmers) have been stuck with them ever since.
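A tiny taste of those subtleties, using standard IEEE 754 double behaviour (nothing Java-specific):

```java
public class FloatQuirks {
    public static void main(String[] args) {
        // 0.1 and 0.2 have no exact binary representation, so the sum
        // is not exactly 0.3.
        System.out.println(0.1 + 0.2);                 // 0.30000000000000004
        System.out.println(0.1 + 0.2 == 0.3);          // false

        // Special values behave unlike any real number.
        System.out.println(0.0 / 0.0);                 // NaN
        System.out.println(Double.NaN == Double.NaN);  // false: NaN is not equal to itself
        System.out.println(1.0 / 0.0);                 // Infinity
    }
}
```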

Finally:

Perhaps the programmer wouldn't want someone to be able to use a bigger number than a certain size and this lets them limit it.

This isn't a valid reason. I can't think of many situations in which types naturally encode numerical bounds, and the chances are astronomically low that the bounds the programmer wants to enforce would correspond exactly to the size of any of the primitive types.

OTHER TIPS

The reason is very simple: efficiency. In multiple ways.

  1. Native data types: The closer the data types of a language match the underlying data types of the hardware, the more efficient the language is considered to be. (Not in the sense that your programs will necessarily be efficient, but in the sense that you may, if you really know what you are doing, write code that will run about as efficiently as the hardware can run it.) The data types offered by Java correspond to the bytes, words, doublewords and quadwords of the most popular hardware out there. That's the most efficient way to go.

  2. Unwarranted overhead on 32-bit systems: If the decision had been made to map everything to a fixed-size 64-bit long, this would have imposed a huge penalty on 32-bit architectures that need considerably more clock cycles to perform a 64-bit operation than a 32-bit operation.

  3. Memory wastefulness: There is a lot of hardware out there that is not too picky about memory alignment (the Intel x86 and x64 architectures being examples of that), so an array of 100 bytes on that hardware can occupy only 100 bytes of memory. However, if you do not have a byte anymore and have to use a long instead, the same array will occupy eight times as much memory. And byte arrays are very common.

  4. Calculating number sizes: Your notion of determining the size of an integer dynamically, depending on how big the number passed in was, is too simplistic: there is no single point at which a number is "passed in". How large a number needs to be would have to be recalculated at runtime on every single operation that may produce a larger result: every time you increment a number, every time you add two numbers, every time you multiply two numbers, and so on.

  5. Operations on numbers of different sizes: Consequently, having numbers of potentially different sizes floating around in memory would complicate all operations: even in order to simply compare two numbers, the runtime would first have to check whether both numbers to be compared are of the same size, and if not, resize the smaller one to match the size of the larger one.

  6. Operations that require specific operand sizes: Certain bit-wise operations rely on the integer having a specific size. Having no pre-determined size, these operations would have to be emulated (a small sketch follows this list).

  7. Overhead of polymorphism: Changing the size of a number at runtime essentially means that it has to be polymorphic. This in turn means that it cannot be a fixed-size primitive allocated on the stack, it has to be an object, allocated on the heap. That is terribly inefficient. (Re-read #1 above.)
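To make point 6 above concrete, here is a small sketch of code that only works because an int is known to be exactly 32 bits wide; the value is arbitrary, and the JDK's Integer.reverseBytes does the same job:

```java
public class FixedWidthOps {
    public static void main(String[] args) {
        int x = 0x12345678;

        // The shift distances and masks below hard-code the fact that an
        // int is exactly 32 bits (4 bytes) wide.
        int swapped = ((x >>> 24) & 0x000000FF)
                    | ((x >>>  8) & 0x0000FF00)
                    | ((x <<   8) & 0x00FF0000)
                    | ((x <<  24) & 0xFF000000);
        System.out.printf("%08X -> %08X%n", x, swapped);  // 12345678 -> 78563412

        // Unsigned shift likewise fills zeros from bit 31 downwards,
        // which presumes a fixed 32-bit representation.
        System.out.println(-1 >>> 28);                     // 15
    }
}
```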

To avoid repeating the points that have been discussed in other answers, I will instead try to outline multiple perspectives.

From language design perspective

  • It is certainly possible to design and implement a programming language and its execution environment that will automatically accommodate the results of integer operations that don't fit in the machine width.
  • It is the language designer's choice whether to make such dynamic-width integers the default integer type for the language.
  • However, the language designer has to consider the following drawbacks:
    • The CPU will have to execute more code, which takes more time. However, it is possible to optimize for the most frequent case in which the integer fits within a single machine word. See tagged pointer representation.
    • The size of that integer becomes dynamic.
    • Reading a dynamic width integer from memory may require more than one trip.
    • Structs (objects) and arrays that contain dynamic width integers inside their fields/elements will have a total (occupied) size that is dynamic as well.

Historical reasons

This is already discussed in the Wikipedia article about the history of Java, and is also briefly discussed in Marco13's answer.

I would point out that:

  • Language designers must juggle between an aesthetic and a pragmatic mindset. The aesthetic mindset wants to design a language which is not prone to well-known problems, such as integer overflows. The pragmatic mindset reminds the designer that the programming language needs to be good enough to implement useful software applications, and to inter-operate with other software parts which are implemented in different languages.
  • Programming languages which intend to capture market share from older programming languages might be more inclined to be pragmatic. One possible consequence is that they are more willing to incorporate or borrow existing programming constructs and styles from those older languages.

Efficiency reasons

When does efficiency matter?

  • When you intend to advertise a programming language as being fit for development of large-scale applications.
  • When you need to work on millions and billions of small items, in which every bit of efficiency adds up.
  • When you need to compete with another programming language, your language needs to perform decently - it need not be the best, but it certainly helps to stay close to the best performance.

Efficiency of storage (in memory, or on disk)

  • Computer memory was once a scarce resource. In those old days, the size of application data that could be processed by a computer was limited by the amount of computer memory, although that could arguably be worked around using clever programming (which would cost more to implement).

Efficiency of execution (within CPU, or between CPU and memory)

  • Already discussed in gardenhead's answer.
  • If a program needs to process very large arrays of small numbers stored consecutively, the efficiency of in-memory representation has a direct effect on its execution performance, because the large amount of data causes the throughput between CPU and memory to become a bottleneck. In this case, packing data more densely means that a single cache line fetch can retrieve more pieces of data.
  • However, this reasoning does not apply if the data isn't stored or processed consecutively.

The need for programming languages to provide an abstraction for small integers, even if limited to specific contexts

  • These needs often arise in the development of software libraries, including the language's own standard libraries. Below are several such cases.

Interoperability

  • Often, higher-level programming languages need to interact with the operating system, or with pieces of software (libraries) written in other lower-level languages. These lower-level languages often communicate using "structs": rigid specifications of the memory layout of a record consisting of fields of different types.
  • For example, a higher-level language may need to specify that a certain foreign function accepts a char array of size 256. (Example.)
  • Some abstractions used by operating systems and file systems require the use of byte streams.
  • Some programming languages choose to provide utility functions (e.g. BitConverter) to help with packing and unpacking narrow integers into bit-streams and byte-streams (a small sketch follows this list).
  • In these cases, the narrower integer types need not be primitive types built into the language. Instead, they can be provided as library types.
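In Java, for example, java.nio.ByteBuffer plays that library role: the narrow primitive types let you state exactly how many bytes each field of a wire or file format occupies. The record layout below is invented purely for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PackRecord {
    // A made-up 7-byte record layout: 1-byte tag, 2-byte length, 4-byte id.
    static byte[] pack(byte tag, short length, int id) {
        ByteBuffer buf = ByteBuffer.allocate(7).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(tag);         // exactly 1 byte
        buf.putShort(length); // exactly 2 bytes
        buf.putInt(id);       // exactly 4 bytes
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] wire = pack((byte) 0x01, (short) 512, 123456789);
        System.out.println(wire.length + " bytes on the wire"); // 7 bytes on the wire
    }
}
```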

String handling

  • There are applications whose main design purposes are to manipulate strings. Thus, the efficiency of string handling is important to those types of applications.

File format handling

  • A lot of file formats were designed with a C-like mindset. As such, the use of narrow-width fields was prevalent.

Desirability, software quality, and programmer's responsibility

  • For many types of applications, automatic widening of integers is actually not a desirable feature. Neither is saturation nor wrap-around (modulus).
  • Many types of applications will benefit from the programmer's explicit specification of the largest permitted values in various critical points in the software, such as at the API level.

Consider the following scenario.

  • A software API accepts a JSON request. The request contains an array of child requests. The entire JSON request can be compressed with the Deflate algorithm.
  • A malicious user creates a JSON request containing one billion child requests. All child requests are identical; the malicious user intends the system to burn some CPU cycles doing useless work. Due to compression, these identical child requests are compressed to a very small total size.
  • It is obvious that a predefined limit on the compressed size of the data is not sufficient. Instead, the API needs to impose a predefined limit on the number of child requests it can contain, and/or a predefined limit on the decompressed (inflated) size of the data.
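A hedged sketch of the kind of explicit, application-level limit this scenario calls for; the class name, field names, and threshold values are invented for illustration:

```java
public class RequestValidator {
    // Explicit, application-level bounds -- chosen by the programmer,
    // not inherited from any primitive type's range.
    private static final int MAX_CHILD_REQUESTS = 1_000;
    private static final long MAX_INFLATED_BYTES = 10L * 1024 * 1024; // 10 MiB after decompression

    static void validate(int childRequestCount, long inflatedSizeBytes) {
        if (childRequestCount > MAX_CHILD_REQUESTS) {
            throw new IllegalArgumentException(
                "Too many child requests: " + childRequestCount);
        }
        if (inflatedSizeBytes > MAX_INFLATED_BYTES) {
            throw new IllegalArgumentException(
                "Decompressed payload too large: " + inflatedSizeBytes + " bytes");
        }
    }
}
```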

Often, software that can safely scale up many orders of magnitude must be engineered for that purpose, with increasing complexity. It does not come automatically even if the issue of integer overflow is eliminated. This comes full circle back to the language design perspective: often, software that refuses to perform work when an unintended integer overflow occurs (by throwing an error or exception) is better than software that silently carries out astronomically large operations.

This means the OP's perspective,

Why is it necessary to have a person set how many bytes should be used for a primitive value?

is not correct. The programmer should be allowed, and sometimes required, to specify the maximum magnitude that an integer value can take, at critical parts of the software. As gardenhead's answer points out, the natural limits imposed by primitive types are not useful for this purpose; the language must provide ways for programmers to declare magnitudes and enforce such limits.

It all comes from hardware.

A byte is the smallest addressable unit of memory on most hardware.

Every type you just mentioned is built from some multiple of bytes.

A byte is 8 bits. With that you could express 8 booleans but you can't look up just one at a time. You address 1, you're addressing all 8.

And it used to be that simple but then we went from an 8 bit bus to a 16, 32, and now 64 bit bus.

Which means while we can still address at the byte level we can't retrieve a single byte from memory any more without getting its neighboring bytes.

Faced with this hardware, the language designers chose to give us types that fit the hardware.

You can claim that such a detail can and should be abstracted away especially in a language that aims to run on any hardware. This would have hidden performance concerns but you may be right. It just didn't happen that way.

Java actually tries to do this. Bytes are automatically promoted to ints. A fact that will drive you nuts the first time you try to do any serious bit shifting work in it.
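A small example of the kind of surprise meant here: the operands of >> and & are promoted to int, and the promotion sign-extends the byte.

```java
public class BytePromotion {
    public static void main(String[] args) {
        byte b = (byte) 0b1000_0001;   // bit pattern 1000 0001, i.e. -127 as a signed byte

        // Promotion to int sign-extends the byte to 0xFFFFFF81 before the shift.
        System.out.println(Integer.toHexString(b >> 4));           // fffffff8, probably not what you wanted
        System.out.println(Integer.toHexString((b & 0xFF) >> 4));  // 8, after masking off the sign extension
    }
}
```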

So why didn't it work well?

Java's big selling point back in the day was that you could sit down with a known good C algorithm, type it up in Java, and with minor tweaks it would work. And C is very close to the hardware.

Keeping that going and abstracting size out of integral types just didn't work together.

So they could have. They just didn't.

Perhaps the programmer wouldn't want someone to be able to use a bigger number than a certain size and this lets them limit it.

This is valid thinking. There are methods for doing this. The clamp function for one. A language could go so far as to bake arbitrary bounds into its types. And when those bounds are known at compile time, that would allow optimizations in how those numbers are stored.

Java just isn't that language.
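For completeness, the clamp approach mentioned above is easy to hand-roll (a sketch; recent JDKs also ship a built-in Math.clamp):

```java
public class Bounds {
    // A hand-rolled clamp: the bound lives in the code, not in the type.
    static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    public static void main(String[] args) {
        System.out.println(clamp(300, 0, 255));  // 255
        System.out.println(clamp(-5, 0, 255));   // 0
        System.out.println(clamp(42, 0, 255));   // 42
    }
}
```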

Likely, one important reason why these types exist in Java is simple and distressingly non-technical:

C and C++ also had these types!

Although it's hard to provide a proof that this is the reason, there is at least some strong evidence: The Oak Language Specification (Version 0.2) contains the following passage:

3.1 Integer Types

Integers in the Oak language are similar to those in C and C++, with two exceptions: all integer types are machine independent, and some of the traditional definitions have been changed to reflect changes in the world since C was introduced. The four integer types have widths of 8, 16, 32, and 64 bits, and are signed unless prefixed by the unsigned modifier.

So the question could boil down to:

Why were short, int, and long invented in C?

I'm not sure whether the answer to the latter question is satisfactory in the context of the question that was asked here. But in combination with the other answers here, it might become clear that it can be beneficial to have these types (regardless of whether their existence in Java is only a legacy from C/C++).

The most important reasons I can think of are

  • A byte is the smallest addressable memory unit (as CandiedOrange already mentioned). A byte is the elementary building block of data, which can be read from a file or over the network. Some explicit representation of this should exist (and it does exist in most languages, even when it sometimes comes in disguise).

  • It is true that, in practice, it would make sense to represent all fields and local variables using a single type, and call this type int. There is a related question about that on stackoverflow: Why does the Java API use int instead of short or byte?. As I mentioned in my answer there, one justification for having the smaller types (byte and short) is that you can create arrays of these types: Java has a representation of arrays that is still rather "close to the hardware". In contrast to other languages (and in contrast to arrays of objects, like an Integer[n] array), an int[n] array is not a collection of references where the values are scattered throughout the heap. Instead, it will in practice be a consecutive block of n*4 bytes - one chunk of memory with a known size and data layout. When you have the choice of storing 1000 bytes in a collection of arbitrarily-sized integer value objects, or in a byte[1000] (which takes 1000 bytes), the latter may indeed save some memory. (Some other advantages of this may be more subtle, and only become obvious when interfacing Java with native libraries)
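A rough sketch of the contrast described in the last point; the sizes mentioned in the comments are approximate and depend on the JVM's headers, padding, and pointer size:

```java
import java.util.ArrayList;
import java.util.List;

public class PackedVsBoxed {
    public static void main(String[] args) {
        // Packed: one object header plus 1000 consecutive bytes of payload,
        // laid out as a single contiguous block of memory.
        byte[] packed = new byte[1000];

        // Boxed: 1000 references (4 or 8 bytes each) plus separately
        // heap-allocated Integer objects, each with its own header.
        // (Values in the small Integer cache are shared, but anything
        // outside -128..127 becomes its own object.)
        List<Integer> boxed = new ArrayList<>(1000);
        for (int i = 0; i < 1000; i++) {
            packed[i] = (byte) (i % 100);
            boxed.add(i);   // autoboxing
        }

        System.out.println(packed.length + " payload bytes vs "
                + boxed.size() + " boxed references");
    }
}
```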


Regarding the points that you specifically asked about:

Couldn't the size just be determined dynamically depending on how big the number passed in was?

Dynamically setting the size of the data would mean it would need to be able to change dynamically as well. This could potentially cause performance issues?

It would likely be possible to dynamically set the size of variables, if one were to design a completely new programming language from scratch. I'm not an expert at compiler construction, but I think it would be hard to sensibly manage collections of dynamically changing types - particularly when you have a strongly typed language. So it would probably boil down to all numbers being stored in a "generic, arbitrary-precision number data type", which certainly would have performance impacts. Of course, there are programming languages that are strongly typed and/or offer arbitrarily sized number types, but I don't think that there is a real general-purpose programming language that went this way.


Side notes:

  • You might have wondered about the unsigned modifier that was mentioned in the Oak spec. In fact, it also contains a remark: "unsigned isn’t implemented yet; it might never be.". And they were right.

  • In addition to wondering why C/C++ had these different integer types at all, you might wonder why they messed them up so horribly that you never know how many bits an int has. The justifications for this are usually related to performance, and can be looked up elsewhere.

It certainly shows you have not yet been taught about performance and architectures.

  • First, not every processor can handle the big types, so you need to know the limitations and work with them.
  • Second, smaller types mean better performance when doing operations.
  • Also, size matters: if you have to store data in a file or database, the size will affect both performance and the final size of all the data. For instance, say you have a table with 15 columns and end up with several million records. The difference between choosing a size as small as necessary for each column and simply choosing the biggest type can amount to gigabytes of data and a corresponding difference in the performance of operations.
  • Also, it applies in complex calculations, where the size of the data being processed will have great impact, like in games for example.

Ignoring the importance of data size always hurts performance. Use as many resources as necessary, but no more - always!

That is the difference between a program or system that does really simple things yet is incredibly inefficient, requiring lots of resources and making the use of that system really costly, and a system that does a lot but runs faster than others and is really cheap to run.

There are a couple of good reasons

(1) While the storage of one byte variable versus one long is insignificant, the storage of millions in an array is very significant.

(2) "hardware native" arithmetic in based on particular integer sizes may be a lot more efficient, and for some algorithms on some platforms, that may be important.

Licensed under: CC-BY-SA with attribution