Convert a text file to UTF8 in D

https://stackoverflow.com/questions/22310869

12-06-2023

Question

I'm attempting to use the Phobos standard library functions to read in any valid UTF file (UTF-8, UTF-16, or UTF-32) and get it back as a UTF-8 string (aka D's string). After looking through the docs, the most concise function I could think of to do so is

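import std.file, std.utf;

string readToUTF8(in string filename)
{
    // Try UTF-8 first; readText validates the data and throws UTFException on failure.
    try {
        return readText(filename);
    }
    catch (UTFException) {
        // Not valid UTF-8: fall back to UTF-16, then UTF-32.
        try {
            return toUTF8(readText!wstring(filename));
        }
        catch (UTFException) {
            return toUTF8(readText!dstring(filename));
        }
    }
}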

However, catching a cascading series of exceptions seems extremely hackish. Is there a "cleaner" way to go about it without relying on catching a series of exceptions?

Additionally, the above function seems to leave the BOM (U+FEFF) at the start of the resulting string if the source file was UTF-16 or UTF-32, which I would like to omit given that the result is UTF-8. Is there a way to omit it besides explicitly stripping it?

Solution

One of your questions answers the other: the BOM allows you to identify the exact UTF encoding used in the file.

Ideally, readText would do this for you. Currently, it doesn't, so you'd have to implement it yourself.

I'd recommend reading the file with std.file.read and casting the returned void[] to a ubyte[]. Look at the first few bytes to see whether they form a BOM, then cast the data to the matching string type (string, wstring, or dstring) and convert it to a string with toUTF8 or to!string.
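The steps above could be sketched roughly as follows. This is a minimal illustration, not a Phobos API: hasPrefix, assemble, and decodeToUTF8 are hypothetical helper names. It assumes that a file without a BOM is UTF-8. Instead of casting the buffer directly to wstring or dstring, it assembles the code units byte by byte, so that both byte orders are handled regardless of the machine's endianness.

```d
import std.file : read;
import std.utf : toUTF8, validate;

/// Hypothetical helper: true if `data` begins with `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Hypothetical helper: build code units of type C (wchar or dchar)
/// from raw bytes in the given byte order. A trailing partial unit is ignored.
C[] assemble(C)(const(ubyte)[] data, bool littleEndian)
{
    enum size_t n = C.sizeof;
    auto units = new C[data.length / n];
    foreach (i, ref unit; units)
    {
        uint value = 0;
        foreach (j; 0 .. n)
        {
            // Visit bytes from most to least significant.
            immutable size_t k = littleEndian ? n - 1 - j : j;
            value = (value << 8) | data[i * n + k];
        }
        unit = cast(C) value;
    }
    return units;
}

/// Detect the encoding from the BOM, drop the BOM, and return UTF-8.
/// Assumption: input without a BOM is treated as UTF-8.
string decodeToUTF8(const(ubyte)[] bytes)
{
    static immutable ubyte[] bom32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] bom32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] bom16le = [0xFF, 0xFE];
    static immutable ubyte[] bom16be = [0xFE, 0xFF];
    static immutable ubyte[] bom8    = [0xEF, 0xBB, 0xBF];

    // Test UTF-32 before UTF-16: FF FE 00 00 also starts with FF FE.
    if (bytes.hasPrefix(bom32le)) return toUTF8(assemble!dchar(bytes[4 .. $], true));
    if (bytes.hasPrefix(bom32be)) return toUTF8(assemble!dchar(bytes[4 .. $], false));
    if (bytes.hasPrefix(bom16le)) return toUTF8(assemble!wchar(bytes[2 .. $], true));
    if (bytes.hasPrefix(bom16be)) return toUTF8(assemble!wchar(bytes[2 .. $], false));

    auto text = cast(const(char)[]) (bytes.hasPrefix(bom8) ? bytes[3 .. $] : bytes);
    validate(text); // throws UTFException if the data is not valid UTF-8
    return text.idup;
}

string readToUTF8(string filename)
{
    return decodeToUTF8(cast(const(ubyte)[]) read(filename));
}
```

Because the BOM is skipped before decoding, the returned string never starts with U+FEFF, which addresses the second part of the question without a separate stripping pass.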

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow