Question

I have a file that contains Unicode text in an unstated encoding. I want to scan through this file looking for any Arabic code points in the range U+0600 through U+06FF, and map each applicable Unicode code point to a byte of ASCII, so that the newly produced file will be composed of purely ASCII characters, with all code points under 128.

How do I go about doing this? I tried to read the file the same way I read ASCII, but my terminal shows ?? because the characters are multi-byte.

NOTE: the file is made up of a subset of the Unicode character set, and the subset size is smaller than the size of ASCII characters. Therefore I am able to do a 1:1 mapping from this particular Unicode subset to ASCII.


Solution

This is either impossible, or it’s trivial. Here are the trivial approaches:

  • If no code point exceeds 127, then simply write it out in ASCII. Done.

  • If some code points exceed 127, then you must choose how to represent them in ASCII. A common strategy is to use XML numeric character references, writing &#x3B1; for U+03B1 (α). This takes up to 8 ASCII characters for each trans-ASCII Unicode code point transcribed.
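The escaping strategy in the second bullet can be sketched in C++ like this; a minimal sketch that assumes the input has already been decoded to code points (UTF-32), with an illustrative function name:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Escape any code point above 127 as an XML numeric character
// reference, so the output is pure ASCII. Assumes the input has
// already been decoded to code points; the name is illustrative.
std::string to_ascii_with_xml_refs(const std::vector<char32_t>& codepoints) {
    std::string out;
    for (char32_t cp : codepoints) {
        if (cp < 128) {
            out += static_cast<char>(cp);       // plain ASCII passes through
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(cp));
            out += buf;                          // e.g. U+03B1 becomes &#x3B1;
        }
    }
    return out;
}
```

For example, the input a, U+03B1 comes out as a&#x3B1;, and the original text is recoverable by any XML-aware reader.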

The impossible ones I leave as an exercise for the original poster. I won’t even mention the foolish-but-possible (read: stupid) approaches, as these are legion. Data destruction is a capital crime in data processing, and should be treated as such.

Note that I am assuming by ‘Unicode character’ you actually mean ‘Unicode code point’; that is, a programmer-visible character. For user-visible characters, you need ‘Unicode grapheme (cluster)’ instead.

Also, unless you normalize your text first, you’ll hate the world. I suggest NFD.


EDIT

After further clarification by the original poster, it seems that what he wants to do is very easily accomplished using existing tools without writing a new program. For example, this converts a certain set of Arabic characters from a UTF-8 input file into an ASCII output file:

$ perl -CSAD -Mutf8 -pe 'tr[ابتثجحخد][abttjhhd]' < input.utf8 > output.ascii

That only handles these code points:

U+0627 ‭ ا  ARABIC LETTER ALEF
U+0628 ‭ ب  ARABIC LETTER BEH
U+062A ‭ ت  ARABIC LETTER TEH
U+062B ‭ ث  ARABIC LETTER THEH
U+062C ‭ ج  ARABIC LETTER JEEM
U+062D ‭ ح  ARABIC LETTER HAH
U+062E ‭ خ  ARABIC LETTER KHAH
U+062F ‭ د  ARABIC LETTER DAL

So you’ll have to extend it to whatever mapping you want.

If you want it in a script instead of a command-line tool, that’s also easy, and then you can refer to the characters by name by setting up a mapping, such as:

 "\N{ARABIC LETTER ALEF}"   =>  "a",
 "\N{ARABIC LETTER BEH}"    =>  "b",
 "\N{ARABIC LETTER TEH}"    =>  "t",
 "\N{ARABIC LETTER THEH}"   =>  "t",
 "\N{ARABIC LETTER JEEM}"   =>  "j",
 "\N{ARABIC LETTER HAH}"    =>  "h",
 "\N{ARABIC LETTER KHAH}"   =>  "h",
 "\N{ARABIC LETTER DAL}"    =>  "d",

If this is supposed to be a component in a larger C++ program, then perhaps you will want to implement it in C++, possibly but not necessarily using the ICU4C library, which includes transliteration support.
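If you go the plain C++ route without ICU, the mapping above can be sketched as a lookup table over decoded code points; names and error handling here are illustrative, and the table mirrors the eight-letter mapping shown earlier:

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Transliterate a sequence of code points to ASCII via a 1:1 table.
// Assumes the input has already been decoded to UTF-32 code points.
std::string transliterate(const std::vector<char32_t>& codepoints) {
    static const std::map<char32_t, char> table = {
        {0x0627, 'a'},  // ARABIC LETTER ALEF
        {0x0628, 'b'},  // ARABIC LETTER BEH
        {0x062A, 't'},  // ARABIC LETTER TEH
        {0x062B, 't'},  // ARABIC LETTER THEH
        {0x062C, 'j'},  // ARABIC LETTER JEEM
        {0x062D, 'h'},  // ARABIC LETTER HAH
        {0x062E, 'h'},  // ARABIC LETTER KHAH
        {0x062F, 'd'},  // ARABIC LETTER DAL
    };
    std::string out;
    for (char32_t cp : codepoints) {
        auto it = table.find(cp);
        if (it == table.end())
            throw std::runtime_error("no mapping for code point");
        out += it->second;
    }
    return out;
}
```

For example, alef followed by beh ({0x0627, 0x0628}) comes out as "ab". Extend the table to whatever mapping you settle on.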

But if all you need is a simple conversion, I don’t understand why you would write a dedicated C++ program. Seems like way too much work.

Other suggestions

You cannot read in the data unless you know the format. Open the file with Microsoft Word and go to "Save As", "Other formats", "Plain Text (.txt)", then save. In the conversion box, select "Other encoding", "Unicode" (which is UTF-16LE), and "OK". The file is now saved as UTF-16LE.

#include <fstream>
#include <stdexcept>
#include <string>

std::ifstream infile("myfile.txt", std::ios::binary);       // open stream
infile.seekg(0, std::ios::end);                             // get its size
std::streamsize length = infile.tellg();
infile.seekg(0, std::ios::beg);
std::wstring filetext(length / 2, L'\0');                   // allocate space (assumes 16-bit wchar_t, e.g. MSVC)
infile.read(reinterpret_cast<char*>(&filetext[0]), length); // read entire file
// note: Word writes a BOM (0xFEFF) at the start of a UTF-16LE file;
// skip the first two bytes if present, or the check below will reject it
std::string final(length / 2, '\0');
for (std::streamsize i = 0; i < length / 2; ++i) {          // "shift" the values to the "valid" range
    if (filetext[i] >= 0x600 && filetext[i] <= 0x6FF)
        final[i] = static_cast<char>(filetext[i] - 0x600);
    else
        throw std::runtime_error("INVALID CHARACTER");
}
// done

Warnings all over: I highly doubt this will do exactly what you want, but it is the best that can be managed, since you haven't told us the translation that needs doing or the format of the file. Also, I'm assuming your computer and compiler are the same as mine. If not, some or all of this might be wrong, but it's the best I can do with the information you haven't given us.

In order to parse out Unicode code points, you have to first decode the file into its unencoded Unicode representation (which is equivalent to UTF-32). In order to do that, you first need to know how the file was encoded so it can be decoded. For instance, Unicode code points U+0600 and U+06FF are encoded as 0xD8 0x80 and 0xDB 0xBF in UTF-8, as 0x00 0x06 and 0xFF 0x06 in UTF-16LE, as 0x06 0x00 and 0x06 0xFF in UTF-16BE, etc.
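As a sanity check of those byte sequences, here is a minimal UTF-8 encoder for a single code point (BMP-only for brevity; the function name is illustrative):

```cpp
#include <string>

// Encode one code point as UTF-8 bytes. Handles the 1-, 2-, and
// 3-byte forms, which cover U+0000 through U+FFFF.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                  // 1 byte: 0xxxxxxx
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Running it on U+0600 yields the two bytes 0xD8 0x80, and U+06FF yields 0xDB 0xBF, matching the sequences above.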

If the file starts with a BOM, then you know the exact encoding used and can interpret the rest of the file accordingly. For instance, the UTF-8 BOM is 0xEF 0xBB 0xBF, UTF-16LE is 0xFF 0xFE, UTF-16BE is 0xFE 0xFF, and so on.
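BOM sniffing over the first few bytes can be sketched like this; note that the UTF-32LE pattern must be checked before UTF-16LE, since it begins with the same two bytes (names are illustrative):

```cpp
#include <cstddef>
#include <string>

// Identify an encoding from a file's leading bytes using the BOM
// patterns above. Returns an empty string when no BOM is found.
std::string detect_bom(const unsigned char* data, std::size_t len) {
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    // check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones
    if (len >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32LE";
    if (len >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32BE";
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    return "";  // no BOM: the encoding must be determined some other way
}
```

An empty result means you fall through to the heuristics (or to asking the user), as described below.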

If the file does not start with a BOM, then you have to analyze the data and perform heuristics on it to detect the encoding, but that is not 100% reliable. Although it is fairly easy to detect UTF encodings, it is nearly impossible to detect ANSI encodings with any measure of reliability. Even detecting UTF encodings without a BOM present can produce false results at times (read this, this, and this).

Don't ever guess; you will risk data loss. If you do not know the exact encoding used, ask the user for it.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow