Question

Is there any one-size-fits-all (more or less) way to read a text file in D?

The requirement is that the function would auto-detect the encoding and give me the entire contents of the file in a consistent format, like a string or a dstring. It should auto-detect BOMs and interpret them as appropriate.

I tried std.file.readText(), but it doesn't handle different encodings well.

(Of course, this will have a non-zero failure rate, and that's acceptable for my application.)

Solution

I believe that the only real options for file I/O in Phobos at this point (aside from calling C functions) are std.file.readText and std.stdio.File. readText will read a file in as an array of chars, wchars, or dchars (defaulting to immutable(char)[] - i.e. string). I believe that the encoding must be UTF-8, UTF-16, or UTF-32 for char, wchar, and dchar respectively, though I'd have to go digging in the source code to be sure. Any encoding compatible with one of those (e.g. ASCII is compatible with UTF-8) should work just fine.
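For example, a minimal sketch (the file name here is made up):

import std.file : readText;

void main() {
  // Defaults to string (immutable(char)[]), i.e. UTF-8:
  string s = readText("data.txt");

  // Explicitly request UTF-16 or UTF-32 data instead:
  wstring ws = readText!wstring("data.txt");
  dstring ds = readText!dstring("data.txt");
}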

If you use File, then you have several options for functions to read the file with - including readln and rawRead. However, you essentially either read the file in using a UTF-8, UTF-16, or UTF-32 compatible encoding, just like with readText, or you read it in as binary data and decode it yourself.
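A rough sketch of the binary route (again, the file name is made up):

import std.stdio : File;

void main() {
  auto f = File("data.txt", "rb");

  // Read the whole file as raw bytes for manual decoding.
  auto bytes = new ubyte[cast(size_t) f.size];
  if (bytes.length)
    f.rawRead(bytes);

  // Alternatively, assuming a UTF-8 compatible encoding,
  // process it line by line:
  // foreach (line; File("data.txt").byLine()) { ... }
}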

Since the character types in D are char, wchar, and dchar, which are UTF-8, UTF-16, and UTF-32 code units respectively, unless you want to read the data in binary format, the file is going to have to be encoded in an encoding compatible with one of those three Unicode encodings. Given a string in a particular encoding, you can convert it to another encoding using the functions in std.utf. However, I'm not aware of any way to query a file for its encoding type other than using readText to try to read the file in a given encoding and see if it succeeds.
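For instance, re-encoding between the string types with std.utf:

import std.utf : toUTF8, toUTF16, toUTF32;

void main() {
  string  s = "héllo";     // UTF-8
  dstring d = toUTF32(s);  // to UTF-32
  wstring w = toUTF16(d);  // to UTF-16
  string  t = toUTF8(w);   // and back to UTF-8
  assert(s == t);
}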

So, unless you want to process a file yourself and determine on the fly what encoding it's in, your best bet is probably to just use readText with each string type in turn, using the first one that succeeds. However, since text files are normally in UTF-8 or a UTF-8 compatible encoding, I would expect that readText used with a normal string would almost always work just fine.
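A sketch of that fallback chain (readAnyText is a made-up name; readText throws a UTFException when validation fails, which is what the catches rely on):

import std.file : readText;
import std.utf : UTFException, toUTF32;

// Try each string type in turn and normalize the result to dstring.
dstring readAnyText(string name) {
  try return toUTF32(readText!string(name));
  catch (UTFException) {}
  try return toUTF32(readText!wstring(name));
  catch (UTFException) {}
  return toUTF32(readText!dstring(name));
}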

OTHER TIPS

As for checking the BOM:

char[] ConvertViaBOM(ubyte[] data) {
  // Each helper should skip the BOM (if any) and transcode
  // the remaining bytes to UTF-8.
  char[] UTF8()   { /*...*/ }
  char[] UTF16LE(){ /*...*/ }
  char[] UTF16BE(){ /*...*/ }
  char[] UTF32LE(){ /*...*/ }
  char[] UTF32BE(){ /*...*/ }

  switch (data.length) {
    default: // 5 or more bytes: check every BOM
    case 4:
      // Test the UTF-32 BOMs before UTF-16: "FF FE" is a prefix
      // of the UTF-32LE BOM "FF FE 00 00".
      if (data[0..4] == [cast(ubyte)0x00, 0x00, 0xFE, 0xFF]) return UTF32BE();
      if (data[0..4] == [cast(ubyte)0xFF, 0xFE, 0x00, 0x00]) return UTF32LE();
      goto case 3;

    case 3:
      if (data[0..3] == [cast(ubyte)0xEF, 0xBB, 0xBF]) return UTF8();
      goto case 2;

    case 2:
      if (data[0..2] == [cast(ubyte)0xFE, 0xFF]) return UTF16BE();
      if (data[0..2] == [cast(ubyte)0xFF, 0xFE]) return UTF16LE();
      goto case 1;

    case 1:
    case 0: // avoid slicing past the end of empty input
      // No BOM found: assume UTF-8 (which also covers ASCII).
      return UTF8();
  }
}

Adding more obscure BOMs is left as an exercise for the reader.
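For completeness, a hypothetical call site (the file name is made up):

import std.file : read;

void main() {
  auto bytes = cast(ubyte[]) read("input.txt");
  char[] text = ConvertViaBOM(bytes);
}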
