Question

I'm working on a file format that should be written and read on several different operating systems and computers. Some of those computers are x86 machines, others x86-64. Other processors may come into play, but I'm not concerned about them yet.

This file format should contain several numbers that would be read like this:

struct LongAsChars{
    char c1, c2, c3, c4;
};

long readLong(FILE* file){
    int b1 = fgetc(file);
    int b2 = fgetc(file);
    int b3 = fgetc(file);
    int b4 = fgetc(file);
    if(b1<0||b2<0||b3<0||b4<0){
        //throwError
    }

    LongAsChars lng;
    lng.c1 = (char) b1;
    lng.c2 = (char) b2;
    lng.c3 = (char) b3;
    lng.c4 = (char) b4;

    long* value = (long*) &lng;

    return *value;
}

and written as:

void writeLong(long x, FILE* f){
    long* xptr = &x;
    LongAsChars* lng = (LongAsChars*) xptr;
    fputc(lng->c1, f);
    fputc(lng->c2, f);
    fputc(lng->c3, f);
    fputc(lng->c4, f);
}

Although this seems to be working on my computer, I'm concerned that it may not on others, or that the file format may end up being different across computers (32-bit vs 64-bit computers, for example). Am I doing something wrong? How should I implement my code to use a constant number of bytes per number?

Should I just use fread instead (which would possibly make my code faster too)?


Solution

Use the types in stdint.h to ensure you get the same number of bytes in and out.

Then you're just left with dealing with endianness issues, which your code probably doesn't really handle.

Serializing the long through an aliased char* leaves you with different byte orders in the written file on platforms with different endianness.

You should decompose the value into bytes explicitly, something like so:

/* use unsigned char: plain char may be signed, which would
   break the recomposition below */
unsigned char c1 = (val >>  0) & 0xff;
unsigned char c2 = (val >>  8) & 0xff;
unsigned char c3 = (val >> 16) & 0xff;
unsigned char c4 = (val >> 24) & 0xff;

And recompose them using something like:

val = ((uint32_t)c4 << 24) |  /* cast before shifting so the top byte */
      ((uint32_t)c3 << 16) |  /* isn't shifted into an int's sign bit */
      ((uint32_t)c2 <<  8) |
      ((uint32_t)c1 <<  0);
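
Put together, a read/write pair might look like this (a minimal sketch, assuming a little-endian on-disk layout with the least significant byte written first; the function names are mine):

#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value as 4 bytes, least significant byte first. */
int write_u32le(uint32_t val, FILE* f){
    for(int i = 0; i < 4; i++)
        if(fputc((val >> (8 * i)) & 0xff, f) == EOF)
            return -1; /* write error */
    return 0;
}

/* Read 4 bytes, least significant byte first, into a 32-bit value. */
int read_u32le(uint32_t* val, FILE* f){
    uint32_t v = 0;
    for(int i = 0; i < 4; i++){
        int b = fgetc(f);
        if(b == EOF)
            return -1; /* read error or end of file */
        v |= (uint32_t)b << (8 * i);
    }
    *val = v;
    return 0;
}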

OTHER TIPS

You might also run into issues with endianness. Why not just use something like NetCDF or HDF, which take care of any portability issues that may arise?

Rather than using structures with characters in them, consider a more mathematical approach:

/* f is the FILE* being read; unsigned arithmetic avoids
   shifting a byte into the sign bit */
unsigned long l  = (unsigned long)fgetc(f) << 24;
              l |= (unsigned long)fgetc(f) << 16;
              l |= (unsigned long)fgetc(f) <<  8;
              l |= (unsigned long)fgetc(f) <<  0;

This is a little more direct and clearer about what you are trying to accomplish. It can also be implemented in a loop to handle larger numbers, as sketched below.
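
For instance (a sketch using the same most-significant-byte-first convention; the name read_be, the 64-bit accumulator, and the error convention are my own additions):

#include <stdint.h>
#include <stdio.h>

/* Read n big-endian bytes (n <= 8) into an unsigned 64-bit value.
   Returns 0 on success, -1 on read error or end of file. */
int read_be(uint64_t* out, FILE* f, int n){
    uint64_t v = 0;
    for(int i = 0; i < n; i++){
        int b = fgetc(f);
        if(b == EOF)
            return -1;
        v = (v << 8) | (uint64_t)b;
    }
    *out = v;
    return 0;
}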

You don't want to use long int. That can be different sizes on different platforms, so it's a non-starter for a platform-independent format. You have to decide what range of values needs to be stored in the file; 32 bits is probably easiest.

You say you aren't worried about other platforms yet. I'll take that to mean you want to retain the possibility of supporting them, in which case you should define the byte-order of your file format. x86 is little-endian, so you might think that's the best. But big-endian is the "standard" interchange order if anything is, since it's used in networking.

If you go for big-endian ("network byte order"):

#include <assert.h>
#include <limits.h>     /* CHAR_BIT */
#include <stdint.h>     /* uint32_t */
#include <stdio.h>
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */

// can't be bothered to support really crazy platforms: it is in
// any case difficult even to exchange files with 9-bit machines,
// so we'll cross that bridge if we come to it.
assert(CHAR_BIT == 8);
assert(sizeof(uint32_t) == 4);

{
    // write value
    uint32_t value = 23;
    const uint32_t networkOrderValue = htonl(value);
    fwrite(&networkOrderValue, sizeof(uint32_t), 1, file);
}

{
    // read value (real code should check fread's return value)
    uint32_t networkOrderValue;
    fread(&networkOrderValue, sizeof(uint32_t), 1, file);
    uint32_t value = ntohl(networkOrderValue);
}

Actually, you don't even need to declare two variables; it's just a bit confusing to overwrite "value" with its network-order equivalent in the same variable.

It works because "network byte order" is defined to be whatever arrangement of bits results in an interchangeable (big-endian) order in memory. No need to mess with unions because any stored object in C can be treated as a sequence of char. No need to special-case for endianness because that's what ntohl/htonl are for.

If this is too slow, you can start thinking about fiendishly optimised platform-specific byte-swapping, with SIMD or whatever. Or using little-endian, on the assumption that most of your platforms will be little-endian and so it's faster "on average" across them. In that case you'll need to write or find "host to little-endian" and "little-endian to host" functions, which of course on x86 just do nothing.

I believe the most cross-architecture approach is to use the uintXX_t types, as defined in stdint.h (see the stdint.h man page). For example, an int32_t will give you a 32-bit integer on both x86 and x86-64. I use these by default now in all of my code and have had no trouble, as they are fairly standard across all *NIX.
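
To illustrate the difference (a small standalone example; the exact-width types have the same size everywhere, while long varies by platform and ABI):

#include <stdint.h>
#include <stdio.h>

int main(void){
    printf("int32_t: %zu bytes\n", sizeof(int32_t));  /* always 4 */
    printf("int64_t: %zu bytes\n", sizeof(int64_t));  /* always 8 */
    printf("long:    %zu bytes\n", sizeof(long));     /* 4 or 8, varies */
    return 0;
}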

Assuming sizeof(uint32_t) == 4, there are 4! = 24 possible byte orders, of which little-endian and big-endian are the most prominent examples, but others have been used as well (e.g. PDP-endian).

Here are functions for reading and writing 32-bit unsigned integers from a stream, honoring an arbitrary byte order, which is specified by the integer whose representation is the byte sequence 0,1,2,3: endian.h, endian.c

The header defines these prototypes

_Bool read_uint32(uint32_t * value, FILE * file, uint32_t order);
_Bool write_uint32(uint32_t value, FILE * file, uint32_t order);

and these constants

LITTLE_ENDIAN
BIG_ENDIAN
PDP_ENDIAN
HOST_ORDER
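
The linked files aren't reproduced here, but a minimal sketch of the idea might look like the following. It assumes (my assumption, not confirmed against the linked header) that the constants are defined so that the byte of order at significance p gives the stream position of the value byte of significance p, e.g. LITTLE_ENDIAN as 0x03020100ul and BIG_ENDIAN as 0x00010203ul:

#include <stdint.h>
#include <stdio.h>

_Bool write_uint32(uint32_t value, FILE * file, uint32_t order)
{
    unsigned char buf[4];
    for (unsigned p = 0; p < 4; p++) {
        unsigned pos = (order >> (8 * p)) & 0xff; /* stream position of byte p */
        buf[pos] = (value >> (8 * p)) & 0xff;
    }
    return fwrite(buf, 1, 4, file) == 4;
}

_Bool read_uint32(uint32_t * value, FILE * file, uint32_t order)
{
    unsigned char buf[4];
    if (fread(buf, 1, 4, file) != 4)
        return 0;
    *value = 0;
    for (unsigned p = 0; p < 4; p++) {
        unsigned pos = (order >> (8 * p)) & 0xff;
        *value |= (uint32_t)buf[pos] << (8 * p);
    }
    return 1;
}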
Licensed under: CC-BY-SA with attribution