Char * encoding

蘭亭文藝 2019-12-17


If I write the statement below in C++ under Visual Studio, what will the encoding be?

const char *c = "£";

Under the Visual Studio project settings I have set "Character Set" to "Not Set".

-----------------------------

Setting the charset to 'Not Set' simply means that neither of the preprocessor macros _UNICODE and _MBCS will be set. This has no effect on what character sets are used by the compiler.
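As a rough illustration, the effect of that project setting can be observed at compile time by testing the two macros. The function name `charset_setting` is made up for this sketch; only the macro names come from the explanation above.

```cpp
#include <string>

// Hypothetical helper: reports which project "Character Set" option the
// translation unit was compiled with, judging only by the predefined macros.
std::string charset_setting() {
#if defined(_UNICODE)
    return "Unicode";
#elif defined(_MBCS)
    return "Multi-Byte";
#else
    return "Not Set";  // neither macro defined
#endif
}
```

Whatever this returns, the bytes stored for a narrow string literal are the same, because the macros only affect code that tests them (for example, the TCHAR mappings in the Windows headers).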

The two settings that determine how the bytes of your source are converted to a string literal in the program are the 'source character set' and the 'execution character set'. The compiler will convert string literals from the source encoding to the execution encoding.

Source encoding:

The source encoding is the encoding used by the compiler to interpret the source file's bytes. It applies not just to string and character literals, but also to everything else in source including, for example, identifiers.

If Visual Studio's compiler detects a Unicode 'signature' in a source file then it will use the corresponding Unicode encoding as the source encoding. Otherwise it will use the system's codepage encoding as the source encoding.
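To make "signature" concrete, here is a minimal sketch of checking a buffer for a byte order mark. It is not the compiler's actual logic; the function name `detect_signature` and its return strings are assumptions for illustration, and UTF-32 signatures are ignored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reports which Unicode signature (byte order mark),
// if any, a buffer of source-file bytes begins with.
std::string detect_signature(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    return "none";  // no signature: the system codepage would be assumed
}
```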

Execution encoding:

The execution encoding is the encoding in which the compiler stores string and character literals, so the string and character data created by literals in the compiled program is encoded using the execution encoding.

By default, Visual Studio's compiler uses the system's codepage as the execution encoding.

When Visual Studio converts string and character literal data from the source encoding to the execution encoding, it replaces characters that cannot be represented in the execution encoding with '?'.
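The replacement behaviour can be imitated with a small sketch. This is not what the compiler runs internally: the function `to_single_byte` is hypothetical, and it uses ISO-8859-1 (where only U+0000 to U+00FF are representable) as a simple stand-in for a real codepage.

```cpp
#include <string>

// Hypothetical helper: converts code points to a single-byte encoding.
// ISO-8859-1 is assumed as the target, so code points up to U+00FF map to
// the byte with the same value and everything else becomes '?'.
std::string to_single_byte(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        out.push_back(cp <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(cp))
                          : '?');
    }
    return out;
}
```

With this stand-in, U+00A3 (the pound sign) survives as the single byte 0xA3, while U+20AC (the euro sign) is not representable and comes out as '?'.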

So for your example:

const char *c = "£";

Assuming that your source is saved using Microsoft's "UTF-8 with signature" format and your system uses CP1252 as most systems in the West do, the string literal will be converted to:

0xA3 0x00

On the other hand, if the execution charset is something that doesn't include '£', such as CP1251 (Cyrillic, used in Windows' Russian locale), then the string literal will end up as:

0x3F 0x00
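One way to check which of these outcomes you actually got is to dump the bytes of the literal at run time. The helper below is a sketch; the name `hex_bytes` is an assumption, not an existing API.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: renders the bytes of a C string, including the
// terminating null, as space-separated uppercase hex pairs.
std::string hex_bytes(const char* s) {
    std::string out;
    const std::size_t n = std::strlen(s) + 1;  // +1 to include the terminator
    for (std::size_t i = 0; i < n; ++i) {
        char buf[4];
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `hex_bytes(c)` on the literal from the question would give "A3 00" in the CP1252 case and "3F 00" in the CP1251 case described above.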

If you want to avoid depending on the source code encoding you can use Universal Character Names (UCNs):

const char *c = "\u00A3"; // "£"

If you want to guarantee a UTF-8 representation you'll also need to avoid dependence on the execution encoding. You can do that by manually encoding it:

const char *c = "\xC2\xA3"; // UTF-8 encoding of "£"
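For reference, the two bytes 0xC2 0xA3 follow from the standard UTF-8 bit layout applied to U+00A3. A minimal sketch of that encoding step is shown here; the function name `utf8_encode` is hypothetical, and surrogates and values above U+10FFFF are not rejected.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx and three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00A3, the upper bits give 0xC0 | 0x02 = 0xC2 and the lower six bits give 0x80 | 0x23 = 0xA3, which matches the hand-written escape sequence.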

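C++11 introduces UTF-8 string literals, which are the better choice when your compiler supports them. (Since C++20 these literals have type const char8_t[], so initializing a const char * from them as shown here applies to C++11 through C++17.)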

const char *c = u8"£";

const char *c = u8"\u00A3"; // "£"
