Question

GCC has three compiler options, -finput-charset, -fexec-charset and -fwide-exec-charset, to specify the particular encodings involved in the compile chain, like the following:

+--------+   -finput-charset     +----------+    -fexec-charset (or)    +-----+
| source | ------------------->  | compiler |  -----------------------> | exe |
+--------+                       +----------+    -fwide-exec-charset    +-----+

Reference: GCC compiler options

I found a question about -finput-charset here: Specification of source charset encoding in MSVC++, like gcc “-finput-charset=CharSet”. But I want to know whether VC has a compiler option like -fexec-charset in GCC to specify the execution character set.

I found a seemingly related option in Visual Studio: Project Properties/Configuration Properties/General/Character Set, whose value is Use Unicode Character Set. Does it do the same thing as -fexec-charset in GCC? If so, I want to set the execution character set to UTF-8. How can I do that?

Why do I want to set the execution encoding?

I'm writing an application in C++ which needs to communicate with a db server, and the charset of the tables is utf8. When I run some tests, they catch exceptions thrown during insertion operations on the db tables. The exceptions tell me that the server met incorrect string values. I suppose this is caused by a wrong encoding, right? BTW, are there any other ways to handle this issue?


Solution

AFAIK, VC++ doesn't have a command-line flag to let you specify a UTF-8 execution character set. However, it does (sporadically) support the undocumented

#pragma execution_character_set("utf-8")

referred to here.

To get the effect of a commandline flag with this pragma you can write the pragma in a header file, say, preinclude.h and pre-include this header in every compilation by passing the flag /FI preinclude.h. See this documentation for how to set this flag from the IDE.
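
For example, a header of this shape (preinclude.h is just an illustrative name):

    // preinclude.h -- force-included into every translation unit via /FI
    #pragma execution_character_set("utf-8")

and on the command line:

    cl /FI preinclude.h main.cpp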

The pragma was supported in VC++ 2010, then forgotten in VC++ 2012, and is supported again in VC++ 2013.

OTHER TIPS

It should be noted that the execution_character_set pragma only seems to apply to character string literals ("Hello World") and not to wide character string literals (L"Hello World").
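
A minimal sketch of that observation (assuming a compiler version that honors the pragma and a source file saved as UTF-8 with BOM):

    #pragma execution_character_set("utf-8")

    const char    *narrow = "Grüße";  // stored as UTF-8 bytes because of the pragma
    const wchar_t *wide   = L"Grüße"; // unaffected: still stored as UTF-16LE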

I did some experiments to find out how source and execution character sets are implemented in MSVC. I did the experiments with Visual Studio 2015 on a Windows system where CP_ACP is 1252 and summarize the results as follows:

Character literals

  • If MSVC determines the source file to be a Unicode file, that is, it is encoded in UTF-8 or UTF-16, it converts the characters to CP_ACP. If a Unicode character is not within the range of CP_ACP, MSVC issues a C4566 warning ("character represented by universal-character-name '\U0001D575' cannot be represented in the current code page (1252)"); see the sketch after this list. MSVC assumes that the execution character set of the compiled software is the CP_ACP of the compiler. That implies that you should compile the software under the CP_ACP of the target environment, i.e. if you want to execute the software on a Windows system with code page 1252, you should compile it under code page 1252 and not execute it on a system with any other code page. In practice it might work if your literals are ASCII encoded (the C0 Controls and Basic Latin Unicode block), since most common SBCS code pages extend this encoding. However, there are some which do not, especially DBCS code pages.

  • If MSVC determines that the source file is not a Unicode file, it interprets the source file according to CP_ACP and assumes that the execution character set is CP_ACP. As with Unicode files, you should compile the software under the CP_ACP of the target environment, and you have the same problems.
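
For example, compiling the following on a code page 1252 system reproduces the warning quoted above, because U+1D575 has no CP 1252 equivalent:

    // warning C4566: character represented by universal-character-name
    // '\U0001D575' cannot be represented in the current code page (1252)
    const char *s = "\U0001D575";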

All "ANSI" Windows API functions (e.g. CreateFileA) interpret strings of type LPSTR according to CP_ACP or CP_THREAD_ACP (which defaults to CP_ACP). It's not easy to find out which functions use CP_ACP or CP_THREAD_ACP so it's best to never change CP_THREAD_ACP.

Wide character literals

The execution character set for wide character literals is always Unicode and the encoding is UTF-16LE. All wide character Windows API functions (e.g. CreateFile) interpret strings of type LPWSTR as UTF-16LE strings. That also implies that wcslen does not return the number of Unicode characters but the number of wchar_t units in a wide character string. UTF-16 also differs from UCS-2 in some cases.

  • If MSVC determines the source file to be a Unicode file, it converts the characters to UTF-16LE.
  • If MSVC determines that the source file is not a Unicode file, it reads the file according to CP_ACP and extends each character to two bytes without interpreting it. That is, if a character is encoded as 0xFF in CP_ACP, it will be written as 0x00 0xFF regardless of whether the CP_ACP character 0xFF is the Unicode character U+00FF.
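
A small program to observe this (a sketch; whether you see the second result depends on saving the file in an ANSI encoding such as code page 1252):

    #include <cstdio>

    int main() {
        // '€' is byte 0x80 in CP 1252 but U+20AC in Unicode. If the
        // behaviour described above holds, an ANSI-encoded source file
        // yields 0x0080 here, while a UTF-8 file with BOM yields 0x20AC.
        const wchar_t c = L'€';
        std::printf("0x%04X\n", (unsigned)c);
        return 0;
    }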

I haven't had the chance to repeat my experiments on a DBCS Windows system because I don't speak the languages that usually use such code pages. Perhaps somebody can repeat the experiments on such a system.

For me the conclusion of the experiment is that you should avoid character literals, even if you use the execution_character_set pragma.

The pragma just changes how character string literals are encoded in the binary but does not change the execution character set of the libraries you use or of the kernel. If you wanted to use the execution_character_set pragma consistently, you would have to recompile Windows and all other libraries you use with the pragma, which is of course impossible. So I would recommend against using it. It might work on some systems, since UTF-8 works with most character string functions in the CRT and CP_ACP usually includes ASCII, but you should check whether these assumptions really hold in your target environment and whether the required effort of this misuse is really worth it. Moreover, the pragma seems to be undocumented and it might not work in future releases.

Otherwise you have to compile separate binaries for all code pages that are in use on your target systems. The only way to avoid multiple binaries is to externalize all strings into UTF-16LE-encoded resources and convert the strings to CP_ACP if required. In this case you have to save the resource scripts (.rc files) as UTF-8, invoke rc with /c65001 (UTF-16LE does not work) and include the strings for all code pages that are in use on your target systems.

I would advise encoding your files in a Unicode encoding, such as UTF-8 or UTF-16LE, using wide character literals if you can't externalize the strings to resources, and compiling with UNICODE and _UNICODE defined. It's not advisable to use string and character literals anyhow; prefer resources. Use WideCharToMultiByte and MultiByteToWideChar for functions which expect strings that are encoded according to CP_ACP or some other code page.
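
A minimal sketch of such conversions (error handling omitted; the helper names are just illustrations):

    #include <windows.h>
    #include <string>

    // UTF-16LE -> current ANSI code page (CP_ACP)
    std::string ToAcp(const std::wstring &w) {
        if (w.empty()) return std::string();
        int len = WideCharToMultiByte(CP_ACP, 0, w.c_str(), (int)w.size(),
                                      NULL, 0, NULL, NULL);
        std::string s(len, '\0');
        WideCharToMultiByte(CP_ACP, 0, w.c_str(), (int)w.size(),
                            &s[0], len, NULL, NULL);
        return s;
    }

    // Current ANSI code page (CP_ACP) -> UTF-16LE
    std::wstring FromAcp(const std::string &s) {
        if (s.empty()) return std::wstring();
        int len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(),
                                      NULL, 0);
        std::wstring w(len, L'\0');
        MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(), &w[0], len);
        return w;
    }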

MSVC's source encoding detection heuristic works best with a BOM present (even in UTF-8 files).

I'm not an expert on Asian languages, but I have read that Han unification in Unicode is controversial. So using Unicode might not be the solution to all problems and there might be cases where it doesn't meet the requirements, but I would say that for the majority of languages Unicode is what works best under Windows.

It's a mistake that Microsoft is not explicit about this and does not document this behaviour of its compilers and operating system.

Visual Studio 2015 Update 2 and later support setting the execution character set:

You can use the option /utf-8 which combines the options /source-charset:utf-8 and /execution-charset:utf-8. From the link above:

In those cases where BOM-less UTF-8 files already exist or where changing to a BOM is a problem, use the /source-charset:utf-8 option to correctly read these files.

Use of /execution-charset or /utf-8 can help when targeting code between Linux and Windows as Linux commonly uses BOM-less UTF-8 files and a UTF-8 execution character set.
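
For example:

    cl /utf-8 main.cpp

which is shorthand for

    cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp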

Project Properties/Configuration Properties/General/Character Set only sets the Unicode/MBCS macros, not the source character set or the execution character set.

Credit to @user3998276's answer and the great experiment.

The conclusion tells me a lot:

  • when the compiler meets L"string", a wide string literal:
    • it first detects the cpp file encoding, then:
      • Unicode source --> converts to UTF-16 (a conversion may happen here, e.g. UTF-8 to UTF-16)
      • ACP source --> extends each ACP byte to a two-byte wchar_t without real conversion
  • when the compiler meets "string", an ordinary string literal:
    • it first detects the cpp file encoding, then:
      • Unicode source --> converts the Unicode characters to ACP characters
      • ACP source --> just reads the source file according to ACP

As to your problem, I think the 'insertion operations on db tables' are just calls to the db insertion API. So all you need to do is build the command, e.g. the SQL statement, in UTF-8. Once the API can understand your command, it can write the right value (imagine a binary stream) for you.

Try:

  • In C++11 and later, you can specify a UTF-8 string literal with the u8 prefix (see the sketch after this list), like

u8"INSERT INTO table_name (col1, col2,...) VALUES (v1, v2,....)"

http://en.cppreference.com/w/cpp/language/string_literal

  • Use a third-party string wrapper, like QString from Qt.

    First wrap your SQL in a QString; then it can easily be converted to UTF-8: QByteArray x = mySql.toUtf8(). A QByteArray is just an array of bytes, so you can pass x.constData(), a const char *, to the insertion API (see the sketch after this list).
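
Minimal sketches of both approaches (DbConnection and db_execute are hypothetical stand-ins for your database client's API; the Qt types are real):

    #include <QString>
    #include <QByteArray>

    void insert_row(DbConnection *conn) {
        // 1. u8 literal: the bytes in the binary are guaranteed to be UTF-8.
        //    (Note: in C++20 the literal's type changes to const char8_t *.)
        db_execute(conn, u8"INSERT INTO table_name (col1) VALUES ('Grüße')");

        // 2. Qt wrapper: build the statement as a QString, then re-encode it.
        QString sql = QString::fromWCharArray(L"INSERT INTO table_name (col1) VALUES ('Grüße')");
        QByteArray utf8 = sql.toUtf8();       // UTF-8 encoded bytes
        db_execute(conn, utf8.constData());   // constData() yields a const char *
    }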

Again, read the answer of @user3998276 carefully; you may need to change the encoding of your cpp file to Unicode if there are characters that cannot be represented in your ANSI code page.

Licensed under: CC-BY-SA with attribution