Random Access File and extra ASCII characters in Java

https://stackoverflow.com/questions/13767081

05-12-2021

Question

I have a Random Access File filled with Strings (I know that they are not really Strings, but it helps me explain the problem). I want to read a particular String, say String #4. This would be simple for integers and other primitive data types, because they have a fixed byte length and I can find the right position by summing up the bytes of all the previous values.

I managed to solve this problem by giving every String a fixed length of 16 chars, so the word "dog" is stored in the RAF as "dog" followed by 13 spaces, and the byte length was fixed too. I could then easily read the right value using the following method:

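static String loadOne(int n) throws IOException {
    // Open read-only and close the file again once the value has been read.
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        // Each record is assumed to take fix bytes plus the 2-byte length prefix; n is 1-based.
        raf.seek((long) (n - 1) * (fix + 2));
        return raf.readUTF();
    }
}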

Where n is the number of the value I want to read and fix is the number of chars (and bytes) of one String.

Everything seemed fine until I used a non-ASCII character (a Polish letter) in one of the Strings, because it is encoded as 2 bytes. The char length was still the same, 16, but the String now took 17 bytes and the whole thing fell apart.

What can I do?

Solution

I strongly suspect you're not using readUTF the way it's expected to be used. Did you read exactly what it does?

The first two bytes are read, starting from the current file pointer, as if by readUnsignedShort. This value gives the number of following bytes that are in the encoded string, not the length of the resulting string. The following bytes are then interpreted as bytes encoding characters in the modified UTF-8 format and are converted into characters.

Does that match what's stored in your file? (You haven't specified anything about the format of the file.)

Given that UTF-8 is not fixed width, it sounds inappropriate for your scenario.
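To see the effect concretely, here is a minimal sketch that measures how many bytes writeUTF emits for two strings that are both padded to 16 chars. It assumes the file was written with writeUTF (the question does not say so); the class name WriteUtfSizeDemo and the helpers writeUtfSize and pad are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteUtfSizeDemo {
    // Returns the number of bytes DataOutputStream.writeUTF emits for s,
    // including the 2-byte length prefix.
    static int writeUtfSize(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Hypothetical helper: pads s with trailing spaces to exactly width chars.
    static String pad(String s, int width) {
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = pad("dog", 16);
        // "\u017caba" is the Polish word "zaba" with a dotted z; U+017C needs 2 bytes.
        String polish = pad("\u017caba", 16);
        System.out.println(ascii.length() + " chars -> " + writeUtfSize(ascii) + " bytes");
        System.out.println(polish.length() + " chars -> " + writeUtfSize(polish) + " bytes");
    }
}
```

Both strings have 16 chars, but the first takes 18 bytes and the second 19, so every record after the Polish one starts one byte later than (n-1)*(fix+2) predicts.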

I'd suggest using 32 bytes per entry, which will always give 16 char values as UTF-16 code units. You can convert this very simply using new String(data, "UTF-16BE") and text.getBytes("UTF-16BE") (or use LE instead of BE if you want). That way you'll have a genuinely fixed-length string, in terms of bytes, not just characters.
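A rough sketch of that layout, under a few assumptions of mine: entries are padded with trailing spaces, record numbers are 1-based as in the question, and the names FixedWidthRecords, store, load and demo are hypothetical. It uses StandardCharsets.UTF_16BE rather than the charset name string, which avoids the checked UnsupportedEncodingException.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class FixedWidthRecords {
    static final int CHARS = 16;               // UTF-16 code units per record
    static final int RECORD_BYTES = CHARS * 2; // UTF-16BE uses 2 bytes per code unit

    // Writes text as record number n (1-based), padded with spaces to CHARS code units.
    static void store(RandomAccessFile raf, int n, String text) throws IOException {
        if (text.length() > CHARS) {
            throw new IllegalArgumentException("text longer than " + CHARS + " chars");
        }
        StringBuilder padded = new StringBuilder(text);
        while (padded.length() < CHARS) {
            padded.append(' ');
        }
        raf.seek((long) (n - 1) * RECORD_BYTES);
        raf.write(padded.toString().getBytes(StandardCharsets.UTF_16BE));
    }

    // Reads record number n (1-based) and strips the trailing space padding.
    // Note: trailing spaces that were part of the original text are lost too.
    static String load(RandomAccessFile raf, int n) throws IOException {
        byte[] data = new byte[RECORD_BYTES];
        raf.seek((long) (n - 1) * RECORD_BYTES);
        raf.readFully(data);
        return new String(data, StandardCharsets.UTF_16BE).replaceAll(" +$", "");
    }

    // Round-trips three words through a temporary file and returns record n.
    static String demo(int n) {
        try {
            File file = File.createTempFile("records", ".dat");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                store(raf, 1, "dog");
                store(raf, 2, "\u017caba"); // contains a non-ASCII Polish letter
                store(raf, 3, "cat");
                return load(raf, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(3)); // prints "cat"
    }
}
```

Because every record is exactly 32 bytes whatever characters it holds, the record after the Polish word still starts at the expected offset.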

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow