Why does the varchar datatype allow unicode values?

https://dba.stackexchange.com/questions/196570

11-10-2020
|

Question

I have a table with a varchar column. It is allowing Trademark(™), copyright(©) and other Unicode characters as shown below.

Create table VarcharUnicodeCheck
(
col1 varchar(100)
)

insert into VarcharUnicodeCheck (col1) values ('MyCompany')
insert into VarcharUnicodeCheck (col1) values ('MyCompany™')
insert into VarcharUnicodeCheck (col1) values ('MyCompany░')
insert into VarcharUnicodeCheck (col1) values ('MyCompanyï')
insert into VarcharUnicodeCheck (col1) values ('MyCompany')

select * from VarcharUnicodeCheck

But the definition of varchar says, it allows non-unicode string data. But the Trademark(™) and Registered(®) symbols are Unicode characters. Does the definition contradicts the property of varchar datatype? I read couple of links like first one and second one. But still I could not understand why it allows unicode string when the definition says that it allows only non-unicode string values.

Solution

But the Trademark(™) and Registered(®) symbols are Unicode characters.

Your are wrong here. Your strings contain only ascii characters.

Here is a simple test that shows you that your characters are all ascii (+ some extended ascii with ascii codes between 128 and 255):

declare @VarcharUnicodeCheck table
(
col1 varchar(100)
)

insert into @VarcharUnicodeCheck (col1) values ('MyCompany')
insert into @VarcharUnicodeCheck (col1) values ('MyCompany™')
insert into @VarcharUnicodeCheck (col1) values ('MyCompany░')
insert into @VarcharUnicodeCheck (col1) values ('MyCompanyï')
insert into @VarcharUnicodeCheck (col1) values ('MyCompany')

select *,
        right(col1, 1)as last_char, 
        ascii(right(col1, 1)) as_last_char_ascii
from @VarcharUnicodeCheck;

Here you can clearly see that all your characters are 1-byte encoded:

Yes they are not pure ascii characters but they are Extended ASCII.

Here I show you real unicode character Trademark(™) and its code and binary representation:

declare @t table (uni_ch nchar(1), ascii_ch char(1));
insert into @t values (N'™', '™');

select unicode(uni_ch) as [unicode of ™], 
       ascii(ascii_ch) [ascii of ™], 
       cast(uni_ch as varbinary(10)) as [uni_ch as varbinary], 
       cast(ascii_ch as varbinary(10)) as [ascii_ch as varbinary]
from @t;

Finally, you can see that Trademark(™) unicode character has 8482 code and not 153:

select nchar(8482), nchar(153)

OTHER TIPS

From the comments, I agree "Extended ASCII" is really bad term that actually means a code page that maps characters/code points in the 128-255 range, beyond the standard 0-127 code point range defined by ASCII.

SQL Server supports many code pages via collations. Non-ASCII characters can be stored in varchar as long as the underlying collation supports the character.

The '™' character can be stored in varchar/char columns when the SQL Server collation code page is 1250 or greater. The query bellow will list these:

SELECT COLLATIONPROPERTY(name, 'CodePage') AS code_page, name, description
FROM sys.fn_helpcollations()
WHERE COLLATIONPROPERTY(name, 'CodePage') >= 1250
ORDER BY name;

But only a subset of these also support the '©' character so the column collation will need to be one of the following to support both:

SELECT COLLATIONPROPERTY(name, 'CodePage') AS code_page, name, description
FROM sys.fn_helpcollations()
WHERE COLLATIONPROPERTY(name, 'CodePage') IN(
    1250
    ,1251
    ,1252
    ,1253
    ,1254
    ,1255
    ,1256
    ,1257
    ,1258
)
ORDER BY name;

But the definition of varchar says, it allows non-unicode string data. But the Trademark(™) and Registered(®) symbols are Unicode characters. Does the definition contradicts the property of varchar datatype?

While the other answers are not incorrect, I think it would help to point out a confusion in base terminology. I have emphasized two words in the above quote from the question as an example of this confusion. When the SQL Server documentation speaks of Unicode and non-Unicode data, they are not talking about the characters. They are speaking of the byte sequences that represent certain characters. The primary difference between the Unicode types (NCHAR, NVARCHAR, XML, and the deprecated / evil NTEXT) and the non-Unicode types (CHAR, VARCHAR, and the deprecated / evil TEXT) is what types of byte sequences they can store.

The non-Unicode types store one of several 8-bit encodings, while the Unicode types store a single 16-bit Unicode encoding: UTF-16 Little Endian. As the other answers have mentioned, which characters can be stored in an 8-bit / non-Unicode encoding depends on the code page, which is determined by the Collation. While others have noted that the byte value of a "character" can vary across code pages that it is found on, the byte value can even vary within the same code page when dealing with one of the several EBCDIC code pages (variations of Windows-1252), which are only found in the older, shouldn't-really-be-used SQL Server Collations (i.e. those having names starting with SQL_).

Hence, the definition is accurate: whatever characters you can manage to store in a non-Unicode type are always 8-bit (even if they use two 8-bit values in combination as a single "character", which is what the Double-Byte Character Set / DBCS code pages allow for). And the Unicode datatypes are always 16-bit, even if they sometimes use two 16-bit values in combination as a single "character" (i.e. a surrogate pair which in turn represents a Supplementary Character).

AND, due to SQL Server natively supporting the UTF-8 encoding for `VARCHAR` and `CHAR` datatypes as of SQL Server 2019,

VARCHAR can no longer be referred to as "non-Unicode". So, starting with the first public beta of SQL Server 2019 in September 2018, we should refer to VARCHAR as an "8-bit datatype", even when speaking in terms of versions prior to SQL Server 2019. This terminology holds true for all 4 types of encodings that can be used with VARCHAR:

Extended ASCII
Double-Byte Character Sets (DBCS)
EBCDIC
UTF-8 (Unicode)

Only the TEXT datatype (deprecated as of SQL Server 2005, so don't use it) is "non-Unicode", but that's just a technicality, and referring to it as an "8-bit datatype" is accurate.

NVARCHAR, NCHAR, and NTEXT can be referred to as "UTF-16" or a "16-bit datatype". Oracle, I believe, uses the terminology of "Unicode-only" for NVARCHAR, but that doesn't clearly rule out the possibility of using UTF-8 (also a Unicode encoding), which won't work, so probably best to stick with the first two options.

For details on the new UTF-8 encodings, please see my post:

Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?

P.S. I am slowly working my way through updating the SQL Server documentation to reflect these changes.

P.P.S. Microsoft has already updated some pages with UTF-8 info, including the char and varchar documentation referenced in the question. It no longer contains the phrase "non-Unicode". But that is just an FYI; it doesn't change the question since this is about non-Unicode encodings containing characters that were mistakenly thought to be Unicode-only.

The question contains a central misconception about what Unicode is. The Unicode character set, along with its encodings such as UTF-8 and UTF-16, is one of many ways of representing text in a computer, and one whose aim is to supersede all other character sets and encodings. If "non-Unicode data" meant "characters not present in Unicode", then none of the text I have used in this answer could be stored in that type, because all the letters of the Latin alphabet and common punctuation used in everyday English are included in Unicode.

Text representations can broadly be thought of in two parts: a character set mapping the different characters (letters, digits, symbols, etc) to numbers on a reference chart; and an encoding representing those numbers as patterns of bits (on disk, over a network connection, etc). Here we are mostly concerned with the first part: which characters are listed on the charts for a particular character set.

Since Unicode aims to have numbers (which it calls "code points") for every character in the world, references like Wikipedia will often refer to the Unicode position of a character as a standard piece of reference information. However, that doesn't mean that other character sets don't also have a mapping for that same character.

One of the oldest and simplest character sets (and encodings) still in use is ASCII, which has mappings for 128 different characters (0 to 127), because it uses 7 bits to encode each character. Since this excludes many accented characters and common symbols, later encodings use 8 bits, and map the same first 128 characters, adding to the character set by filling positions 128 to 255. Notable among these are the standard ISO 8859-1 and ISO 8859-15, and the Microsoft-specific Windows Code Page 1252.

So, to come back to MS SQL Server: a "Unicode string", as stored in an nchar, nvarchar, or ntext column, can represent all the characters mapped in the Unicode character set, because it uses a Unicode encoding to store the data. A "non-Unicode string", as stored in a char, varchar, or text column, can represent only the characters mapped in some other encoding. Anything you can store in a non-Unicode column can also be stored in a Unicode column, but not vice versa.

To know exactly which characters you can store, you need to know the "collation" in use, which dictates what Microsoft refers to as a "code page", as explained on this Microsoft reference page. It's likely in your case that you're using the very common Code Page 1252, which I mentioned earlier.

The characters you mentioned exist in both Unicode and Code Page 1252:

Trademark(™) appears in Unicode at position 8482, and in CP1252 at position 153
Registered(®), as it happens, appears in both Unicode and CP1252 at position 174

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange

Why does the varchar datatype allow unicode values?

AND, due to SQL Server natively supporting the UTF-8 encoding for VARCHAR and CHAR datatypes as of SQL Server 2019,

AND, due to SQL Server natively supporting the UTF-8 encoding for `VARCHAR` and `CHAR` datatypes as of SQL Server 2019,