Is it possible to use a Unicode “argv”?

https://stackoverflow.com/questions/1664476

12-09-2019
|

Question

I'm writing a little wrapper for an application that uses files as arguments.

The wrapper needs to be in Unicode, so I'm using wchar_t for the characters and strings I have. Now I find myself in a problem, I need to have the arguments of the program in a array of wchar_t's and in a wchar_t string.

Is it possible? I'm defining the main function as

int main(int argc, char *argv[])

Should I use wchar_t's for argv?

Thank you very much, I seem not to find useful info on how to use Unicode properly in C.

Solution

In general, no. It will depend on the O/S, but the C standard says that the arguments to 'main()' must be 'main(int argc, char **argv)' or equivalent, so unless char and wchar_t are the same basic type, you can't do it.

Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.

On a Mac (10.5.8, Leopard), I got:

Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A                                 ......
0x0006:
Osiris JL:

That's all UTF-8 encoded. (odx is a hex dump program).

OTHER TIPS

Portable code doesn't support it. Windows (for example) supports using wmain instead of main, in which case argv is passed as wide characters.

On Windows, you can use GetCommandLineW() and CommandLineToArgvW() to produce an argv-style wchar_t[] array, even if the app is not compiled for Unicode.

On Windows anyway, you can have a wmain() for UNICODE builds. Not portable though. I dunno if GCC or Unix/Linux platforms provide anything similar.

Assuming that your Linux environment uses UTF-8 encoding then the following code will prepare your program for easy Unicode treatment in C++:

    int main(int argc, char * argv[]) {
      std::setlocale(LC_CTYPE, "");
      // ...
    }

Next, wchar_t type is 32-bit in Linux, which means it can hold individual Unicode code points and you can safely use wstring type for classical string processing in C++ (character by character). With setlocale call above, inserting into wcout will automatically translate your output into UTF-8 and extracting from wcin will automatically translate UTF-8 input into UTF-32 (1 character = 1 code point). The only problem that remains is that argv[i] strings are still UTF-8 encoded.

You can use the following function to decode UTF-8 into UTF-32. If the input string is corrupted it will return properly converted characters until the place where the UTF-8 rules were broken. You could improve it if you need more error reporting. But for argv data one can safely assume that it is correct UTF-8:

#define ARR_LEN(x) (sizeof(x)/sizeof(x[0]))

    wstring Convert(const char * s) {
        typedef unsigned char byte;
        struct Level { 
            byte Head, Data, Null; 
            Level(byte h, byte d) {
                Head = h; // the head shifted to the right
                Data = d; // number of data bits
                Null = h << d; // encoded byte with zero data bits
            }
            bool encoded(byte b) { return b>>Data == Head; }
        }; // struct Level
        Level lev[] = { 
            Level(2, 6),
            Level(6, 5), 
            Level(14, 4), 
            Level(30, 3), 
            Level(62, 2), 
            Level(126, 1)
        };

        wchar_t wc = 0;
        const char * p = s;
        wstring result;
        while (*p != 0) {
            byte b = *p++;
            if (b>>7 == 0) { // deal with ASCII
                wc = b;
                result.push_back(wc);
                continue;
            } // ASCII
            bool found = false;
            for (int i = 1; i < ARR_LEN(lev); ++i) {
                if (lev[i].encoded(b)) {
                    wc = b ^ lev[i].Null; // remove the head
                    wc <<= lev[0].Data * i;
                    for (int j = i; j > 0; --j) { // trailing bytes
                        if (*p == 0) return result; // unexpected
                        b = *p++;   
                        if (!lev[0].encoded(b)) // encoding corrupted
                            return result;
                        wchar_t tmp = b ^ lev[0].Null;
                        wc |= tmp << lev[0].Data*(j-1);
                    } // trailing bytes
                    result.push_back(wc);
                    found = true;
                    break;
                } // lev[i]
            }   // for lev
            if (!found) return result; // encoding incorrect
        }   // while
        return result;
    }   // wstring Convert

On Windows, you can use tchar.h and _tmain, which will be turned into wmain if the _UNICODE symbol is defined at compile time, or main otherwise. TCHAR *argv[] will similarly be expanded to WCHAR * argv[] if unicode is defined, and char * argv[] if not.

If you want to have your main method work cross platform, you can define your own macros to the same effect.

TCHAR.h contains a number of convenience macros for conversion between wchar and char.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow