Question

I was reading about unions in C from K&R. As far as I understood, a union variable can hold any one of several types, and if something is stored as one type and extracted as another, the result is purely implementation defined.

Now please check this code snippet:

#include <stdio.h>

int main(void)
{
  union a
  {
     int i;
     char ch[2];
  };

  union a u;
  u.ch[0] = 3;
  u.ch[1] = 2;

  printf("%d %d %d\n", u.ch[0], u.ch[1], u.i);

  return 0;
}

Output:

3 2 515

Here I am assigning values to u.ch but retrieving from both u.ch and u.i. Is this implementation defined, or am I doing something really silly?

I know this may seem very basic to most people, but I am unable to figure out the reason behind that output.

Thanks.


Solution

This is undefined behaviour. u.i and u.ch are located at the same memory address, so the result of writing into one and reading from the other depends on the compiler, platform, architecture, and sometimes even the compiler's optimization level. Therefore the output for u.i may not always be 515.
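
For instance, here is a minimal sketch (my addition, not part of the question) that prints the members' addresses to show they overlap:

#include <stdio.h>

int main(void)
{
  union a
  {
     int i;
     char ch[2];
  };

  union a u;

  /* All three addresses are identical: the members overlap,
     so writing u.ch rewrites the low bytes of u.i. */
  printf("&u    = %p\n", (void *)&u);
  printf("&u.i  = %p\n", (void *)&u.i);
  printf("&u.ch = %p\n", (void *)u.ch);

  return 0;
}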

Example

For example, gcc on my machine produces two different answers for -O0 and -O2.

  1. Because my machine has a 32-bit little-endian architecture, with -O0 I end up with the two least significant bytes initialized to 3 and 2, while the two most significant bytes are left uninitialized. So the union's memory looks like this: {3, 2, garbage, garbage}. (A way to zero those bytes up front is sketched right after this list.)

    Hence I get the output similar to 3 2 -1216937469.

  2. With -O2, I get the output of 3 2 515 like you do, which corresponds to union memory {3, 2, 0, 0}. What happens is that gcc replaces the printf arguments with constant values, so the assembly output looks like the equivalent of:

    #include <stdio.h>
    int main() {
        printf("%d %d %d\n", 3, 2, 515);
        return 0;
    }
    

    The value 515 can be obtained as explained in other answers to this question. In essence, when gcc optimized the call it chose zeroes as the arbitrary contents of the would-be uninitialized bytes of the union.
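
As mentioned in the list above, one way to take the garbage bytes out of the picture (the int you get is still implementation defined, but you are no longer reading indeterminate memory) is to zero the whole union first. A sketch of mine, not from the original answer:

#include <stdio.h>
#include <string.h>

int main(void)
{
  union a
  {
     int i;
     char ch[2];
  };

  union a u;
  memset(&u, 0, sizeof u);  /* clear every byte, including those beyond ch */

  u.ch[0] = 3;
  u.ch[1] = 2;

  /* Now -O0 and -O2 agree: 3 2 515 on a little-endian machine. */
  printf("%d %d %d\n", u.ch[0], u.ch[1], u.i);

  return 0;
}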

Writing to one union member and reading from another usually does not make much sense, but it can be useful for type punning in programs compiled with strict aliasing rules in effect.
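
For illustration, a sketch of that use case: inspecting the bits of a float through a union, which is the sanctioned alternative to a pointer cast that would break strict aliasing. This example is mine and assumes an IEEE 754 single-precision float the same size as uint32_t:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  float f = 1.0f;

  /* *(uint32_t *)&f would violate the strict-aliasing rule;
     writing one union member and reading the other is allowed. */
  union { float f; uint32_t u; } pun;
  pun.f = f;

  printf("0x%08x\n", (unsigned)pun.u);  /* 0x3f800000 under IEEE 754 */

  return 0;
}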

OTHER TIPS

The answer to this question depends on the historical context, since the specification of the language has changed over time, and this matter happens to be one affected by those changes.

You said that you were reading K&R. The latest edition of that book (as of now) describes the first standardized version of the C language - C89/90. In that version of C, writing one member of a union and reading another member is undefined behavior. Not implementation defined (which is a different thing), but undefined behavior. The relevant portion of the language standard in this case is 6.5/7.

Now, at some later point in the evolution of C (the C99 version of the language specification with Technical Corrigendum 3 applied), it became legal to use a union for type punning, i.e. to write one member of the union and then read another.

Note that attempting to do that can still lead to undefined behavior. If the value you read happens to be invalid (a so-called "trap representation") for the type you read it through, then the behavior is still undefined. Otherwise, the value you read is implementation defined.

Your specific example is relatively safe for type punning from int to a char[2] array. It is always legal in C to reinterpret the content of any object as a char array (again, 6.5/7).
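
A sketch of that always-legal direction (mine, assuming a 32-bit little-endian int):

#include <stdio.h>

int main(void)
{
  int i = 515;
  const unsigned char *p = (const unsigned char *)&i;

  /* Viewing any object's bytes through a character type is legal;
     the order the bytes appear in depends on endianness. */
  for (size_t n = 0; n < sizeof i; n++)
      printf("%d ", p[n]);  /* prints "3 2 0 0" under these assumptions */
  printf("\n");

  return 0;
}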

However, the reverse is not true. Writing data into the char[2] array member of your union and then reading it as an int can potentially create a trap representation and lead to undefined behavior. The potential danger exists even if your char array has sufficient length to cover the entire int.

But in your specific case, if int happens to be larger than char[2], the int you read will cover an uninitialized area beyond the end of the array, which again leads to undefined behavior.

The reason behind the output is that on your machine integers are stored in little-endian format: the least-significant bytes are stored first. Hence the byte sequence [3,2,0,0] represents the integer 3+2*256=515.

This result depends on the specific implementation and the platform.
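
A tiny sketch of that arithmetic (my addition):

#include <stdio.h>

int main(void)
{
  unsigned char bytes[2] = {3, 2};

  /* Little-endian: byte n contributes bytes[n] * 256^n. */
  int value = bytes[0] + bytes[1] * 256;
  printf("%d\n", value);  /* 515 */

  return 0;
}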

The output from such code will be dependent on your platform and C compiler implementation. Your output makes me think you're running this code on a little-endian system (probably x86). If you were to put 515 into i and look at it in a debugger, you would see that the lowest-order byte would be a 3 and the next byte in memory would be a 2, which maps exactly to what you put in ch.

If you did this on a big-endian system, you would have (probably) gotten 770 (assuming 16-bit ints) or 50462720 (assuming 32-bit ints).
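
Those figures can be checked with a couple of lines (my sketch of the big-endian arithmetic, where the first byte is the most significant):

#include <stdio.h>

int main(void)
{
  printf("%d\n", 3 * 256 + 2);                   /* 770 with a 16-bit int */
  printf("%ld\n", 3L * 16777216L + 2L * 65536L); /* 50462720 with a 32-bit int */

  return 0;
}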

It is implementation dependent, and results might vary on a different platform/compiler, but it seems this is what is happening:

515 in binary is

1000000011

Padding zeros to make it two bytes (assuming 16 bit int):

0000001000000011

The two bytes are:

00000010 and 00000011

Which is 2 and 3

Hope someone explains why they are reversed - my guess is that the chars are not reversed but the int is little-endian.
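
The guess is right, and it is easy to check: a quick sketch of mine that detects byte order at run time:

#include <stdio.h>

int main(void)
{
  int one = 1;

  /* On a little-endian machine the first byte of the int 1 is 1. */
  if (*(unsigned char *)&one == 1)
      printf("little-endian\n");
  else
      printf("big-endian\n");

  return 0;
}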

The amount of memory allocated to a union is equal to the memory required to store its biggest member. In this case, you have an int and a char array of length 2. Assuming a 16-bit int and an 8-bit char, both require the same space, and hence the union is allocated two bytes.

When you assign three (00000011) and two (00000010) to the char array, the state of the union is 0000001100000010. When you read the int from this union, it converts the whole thing into an integer. Assuming a little-endian representation where the LSB is stored at the lowest address, the int read from the union would be 0000001000000011, which is the binary for 515.

NOTE: This holds true even if the int is 32-bit - check Amnon's answer.
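
The sizes are easy to verify on any given implementation (my sketch; %zu assumes C99):

#include <stdio.h>

int main(void)
{
  union a
  {
     int i;
     char ch[2];
  };

  /* The union is at least as large as its biggest member;
     on most current machines int is 4 bytes, so this prints 4 twice. */
  printf("sizeof(int)     = %zu\n", sizeof(int));
  printf("sizeof(union a) = %zu\n", sizeof(union a));

  return 0;
}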

If you're on a 32-bit system, then an int is 4 bytes but you initialise only 2 of them. Accessing uninitialised data is undefined behaviour.

Assuming you're on a system with 16-bit ints, then what you are doing is still implementation defined. If your system is little-endian, then u.ch[0] will correspond to the least significant byte of u.i and u.ch[1] will be the most significant byte. On a big-endian system, it's the other way around. Also, the C standard does not force the implementation to use two's complement to represent signed integer values, though two's complement is the most common. Obviously, the size of an int is also implementation defined.

Hint: it's easier to see what's happening if you use hexadecimal values. On a little endian system, the result in hex would be 0x0203.
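
For example (my sketch; the union is zeroed first so the bytes beyond ch are not indeterminate when int is wider than two bytes):

#include <stdio.h>
#include <string.h>

int main(void)
{
  union a
  {
     int i;
     char ch[2];
  };

  union a u;
  memset(&u, 0, sizeof u);  /* in practice the bytes beyond ch stay zero */

  u.ch[0] = 3;
  u.ch[1] = 2;

  printf("0x%04x\n", (unsigned)u.i);  /* 0x0203 on a little-endian system */

  return 0;
}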

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow