What do character set and collation mean?

https://stackoverflow.com/questions/341273

19-08-2019

Question

I can read the MySQL documentation, and it's pretty clear. But how does one decide which character set to use? Which data does the collation affect?

I'm asking for an explanation of the two and of how to choose them.

Solution

From the MySQL docs:

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: "A", "B", "a", "b". We give each letter a number: "A" = 0, "B" = 1, "a" = 2, "b" = 3. The letter "A" is a symbol, the number 0 is the encoding for "A", and the combination of all four letters and their encodings is a character set.

Now, suppose that we want to compare two string values, "A" and "B". The simplest way to do this is to look at the encodings: 0 for "A" and 1 for "B". Because 0 is less than 1, we say "A" is less than "B". What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters "a" and "b" as equivalent to "A" and "B"; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.

In real life, most character sets have many characters: not just "A" and "B" but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character, as in German "ö") and multiple-character mappings (such as the rule that "ö" = "OE" in one of the two German collations).
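The imaginary four-letter character set from the quote can be sketched in a few lines of Python. This is only an illustration of the idea, not how MySQL implements collations; the names `CHARSET`, `binary_key` and `case_insensitive_key` are made up for this example.

```python
# The imaginary character set from the quote: each symbol has an encoding.
CHARSET = {"A": 0, "B": 1, "a": 2, "b": 3}


def binary_key(text):
    """Binary collation: one rule, compare the encodings."""
    return [CHARSET[ch] for ch in text]


def case_insensitive_key(text):
    """Case-insensitive collation: treat 'a'/'b' as 'A'/'B', then compare encodings."""
    return [CHARSET[ch.upper()] for ch in text]


words = ["b", "A", "a", "B"]

# Binary collation orders strictly by encoding: A, B, a, b
binary_sorted = sorted(words, key=binary_key)

# Case-insensitive collation: "A" and "a" compare as equal, as do "B" and "b".
# Python's sort is stable, so equal items keep their input order: A, a, b, B
ci_sorted = sorted(words, key=case_insensitive_key)

print(binary_sorted)
print(ci_sorted)
```

The data is the same in both cases; only the comparison rule changes, which is exactly the difference between a character set and a collation.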

Other tips

A character encoding is a way to represent characters as bytes so that they can be stored in memory. For example, if the charset is ISO-8859-15, the euro symbol, €, is encoded as the single byte 0xa4, while in UTF-8 it is the three bytes 0xe2 0x82 0xac.
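The two encodings of the euro sign can be checked with Python's built-in codecs. This is a small sketch for illustration; the variable names are arbitrary.

```python
EURO = "\u20ac"  # the euro sign

# Same character, two different encodings
latin9_bytes = EURO.encode("iso8859_15")
utf8_bytes = EURO.encode("utf-8")

print(latin9_bytes.hex())  # a4
print(utf8_bytes.hex())    # e282ac
```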

The collation is how characters are compared. In latin9 there are letters such as e é è ê f. Sorted by their binary representation, they come out as e f è é ê; but if the collation is set to, for example, French, you'll get them in the order you expected: e é è ê all compare as equal, and then comes f.
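A rough way to see the difference is to sort the same letters with two different keys. The sketch below assumes that stripping accents is an acceptable stand-in for an accent-insensitive collation; a real French collation has more rules than this, and `strip_accents` is a helper written only for this example.

```python
import unicodedata

# e, e-acute, e-grave, e-circumflex, f
letters = ["e", "\u00e9", "\u00e8", "\u00ea", "f"]

# "Binary collation": order by the latin9 byte values.
# e = 0x65, f = 0x66, e-grave = 0xe8, e-acute = 0xe9, e-circumflex = 0xea
binary_order = sorted(letters, key=lambda ch: ch.encode("iso8859_15"))


def strip_accents(text):
    """Decompose each character and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Accent-insensitive ordering: all four e variants share the key "e",
# so they stay together (in input order) and f comes last.
accent_insensitive_order = sorted(letters, key=strip_accents)

print(binary_order)              # ['e', 'f', 'è', 'é', 'ê']
print(accent_insensitive_order)  # ['e', 'é', 'è', 'ê', 'f']
```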

A character set is a subset of all written glyphs. A character encoding specifies how those characters are mapped to numeric values. Some character encodings, like UTF-8 and UTF-16, can encode any character in the Universal Character Set. Others, like US-ASCII or ISO-8859-1 can only encode a small subset, since they use 7 and 8 bits per character, respectively. Because many standards specify both a character set and a character encoding, the term "character set" is often substituted freely for "character encoding".
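To illustrate which encodings cover which characters, here is a small Python sketch. `can_encode` is a hypothetical helper written for this example; it simply tries the encoding and reports whether it succeeded.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


samples = {
    "plain e": "e",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
}

for name, ch in samples.items():
    support = {
        enc: can_encode(ch, enc)
        for enc in ("ascii", "iso-8859-1", "utf-8", "utf-16")
    }
    print(name, support)
```

US-ASCII only handles the plain letter, ISO-8859-1 also handles the accented letter but not the euro sign, and the two Unicode encodings handle all three.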

A collation comprises rules that specify how characters are compared for sorting. Collation rules can be locale-specific: the proper order of two characters varies from language to language.
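As a rough illustration of locale-specific ordering, the sketch below sorts the same words with two hand-built keys. Both keys are simplified assumptions made for this example, not real collation implementations: the Swedish key places å, ä, ö at the end of the alphabet, and the German key treats ä, ö, ü like a, o, u.

```python
# Simplified Swedish alphabet: a-z followed by a-ring, a-umlaut, o-umlaut
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

# Simplified German dictionary rule: a-umlaut, o-umlaut, u-umlaut sort like a, o, u
GERMAN_FOLD = str.maketrans("\u00e4\u00f6\u00fc", "aou")


def swedish_key(word):
    """Order letters by their position in the Swedish alphabet."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


def german_key(word):
    """Fold umlauts to their base vowels before comparing."""
    return word.lower().translate(GERMAN_FOLD)


words = ["zon", "\u00f6l", "ost"]  # the middle word starts with o-umlaut

swedish_order = sorted(words, key=swedish_key)
german_order = sorted(words, key=german_key)

print(swedish_order)  # ['ost', 'zon', 'öl']
print(german_order)   # ['öl', 'ost', 'zon']
```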

Choosing a character set and collation comes down to whether your application is internationalized or not. If not, what locale are you targeting?

To choose which character set to support, you have to consider your application. If you are storing user-supplied input, it might be hard to foresee all the locales in which your software will eventually be used. To support them all, it might be best to support the UCS (Unicode) from the start. However, there is a cost to this: in UTF-8, for example, many western European characters (such as accented letters) require two bytes of storage per character instead of one.
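The storage cost can be measured directly. This sketch compares the encoded size of one sample word in ISO-8859-1 and UTF-8; `byte_length` is a helper name chosen for this example.

```python
def byte_length(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


word = "caf\u00e9"  # four characters, the last one is e with acute

latin1_size = byte_length(word, "iso-8859-1")  # 4: one byte per character
utf8_size = byte_length(word, "utf-8")         # 5: the accented e takes two bytes

print(latin1_size, utf8_size)
```

Plain ASCII letters still take one byte each in UTF-8, so the overhead depends on how much of the text is non-ASCII.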

Choosing the right collation can help performance if your database uses the collation to create an index, and later uses that index to provide sorted results. However, since collation rules are often locale-specific, that index will be worthless if you need to sort results according to the rules of another locale.

I suggest using utf8mb4_unicode_ci, which is based on the Unicode standard for sorting and comparison and sorts accurately in a very wide range of languages.

Licensed under: CC-BY-SA with attribution
