What do character set and collation mean exactly?

https://stackoverflow.com/questions/341273

19-08-2019

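Question

I can read the MySQL documentation and it's pretty clear. But how does one decide which character set to use? On what data does collation have an effect?

I'm asking for an explanation of the two and how to choose them.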

Solution

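A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.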

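Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.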

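Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. What we have just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.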
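As a rough illustration of the imaginary character set and its binary collation, here is a minimal Python sketch. The names `TOY_CHARSET`, `encode` and `binary_compare` are made up for this example and are not part of MySQL.

```python
# Toy character set from the example: each symbol paired with its encoding.
TOY_CHARSET = {"A": 0, "B": 1, "a": 2, "b": 3}


def encode(text):
    """Map each symbol of a string to its number in the toy character set."""
    return [TOY_CHARSET[ch] for ch in text]


def binary_compare(left, right):
    """Binary collation: the only rule is 'compare the encodings'.

    Returns -1, 0 or 1, like a classic three-way comparison.
    """
    a, b = encode(left), encode(right)
    return (a > b) - (a < b)


print(encode("AbBa"))            # [0, 3, 1, 2]
print(binary_compare("A", "B"))  # -1, because 0 < 1
print(binary_compare("a", "B"))  # 1, because 2 > 1
```

Note that under this collation 'a' sorts after 'B', which is rarely what a human reader expects.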

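But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It is a little more complex than a binary collation.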
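The two rules of the case-insensitive collation could be sketched in Python like this. The `FOLD` table and the key functions are hypothetical names chosen for the illustration; a real database implements this very differently.

```python
# Toy character set: symbol -> encoding.
TOY_CHARSET = {"A": 0, "B": 1, "a": 2, "b": 3}

# Rule 1: treat lowercase letters as equivalent to their uppercase form.
FOLD = {"a": "A", "b": "B"}


def binary_key(text):
    """Binary collation key: just the encodings."""
    return [TOY_CHARSET[ch] for ch in text]


def case_insensitive_key(text):
    """Rule 1 (fold case), then rule 2 (compare the encodings)."""
    return [TOY_CHARSET[FOLD.get(ch, ch)] for ch in text]


words = ["b", "A", "a", "B"]
print(sorted(words, key=binary_key))            # ['A', 'B', 'a', 'b']
print(sorted(words, key=case_insensitive_key))  # ['A', 'a', 'b', 'B']
print(case_insensitive_key("a") == case_insensitive_key("A"))  # True
```

Because Python's sort is stable, letters that compare as equal keep their input order, which is why 'b' comes before 'B' in the second result.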

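In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character, as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).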
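To make accent insensitivity and multiple-character mappings a bit more concrete, here is a small Python sketch using the standard `unicodedata` module. The helper names and the `EXPANSIONS` table are assumptions for illustration only; they approximate the idea and do not reproduce any actual MySQL collation.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters and drop the combining marks (the 'accents')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_key(text):
    """Case-insensitive and accent-insensitive comparison key."""
    return strip_accents(text).casefold()


# Hypothetical multiple-character mapping in the spirit of 'ö' = 'OE'.
EXPANSIONS = {"ö": "oe"}


def expansion_key(text):
    """Comparison key that expands mapped characters before comparing."""
    folded = unicodedata.normalize("NFC", text).casefold()
    return "".join(EXPANSIONS.get(ch, ch) for ch in folded)


print(accent_insensitive_key("Öl"))                    # ol
print(expansion_key("Öl"))                             # oel
print(expansion_key("Göthe") == expansion_key("GOETHE"))  # True
```

The same word therefore compares equal to different strings depending on which rule set is in force, which is exactly why a column's collation matters for `=`, `ORDER BY` and unique indexes.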

Other tips

A character encoding is a way to represent characters as bytes so that they can be stored in memory. For example, if the charset is ISO-8859-15, the euro symbol, €, is encoded as the single byte 0xA4, whereas in UTF-8 it is the three bytes 0xE2 0x82 0xAC.
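The euro example can be checked with Python's built-in codecs. This is only a quick demonstration; the variable names are arbitrary.

```python
euro = "€"

# The same character, encoded with two different character encodings.
latin9_bytes = euro.encode("iso-8859-15")
utf8_bytes = euro.encode("utf-8")

print(latin9_bytes.hex())  # a4
print(utf8_bytes.hex())    # e282ac

# Bytes only make sense together with the encoding that produced them:
# 0xA4 is the euro sign in ISO-8859-15 but the currency sign in ISO-8859-1.
print(b"\xa4".decode("iso-8859-15"))  # €
print(b"\xa4".decode("iso-8859-1"))   # ¤
```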

The collation is how characters are compared. Latin9 contains the letters e é è ê f. Sorted by their binary representation, they come out as e f è é ê, but if the collation is set to, for example, French, you get the order you would expect: e, é, è and ê are all treated as equal, and then comes f.
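The difference between the two orderings can be reproduced with a short Python sketch. The `BASE_LETTER` table is a hand-written stand-in for an accent-insensitive collation, not the actual French rules.

```python
import unicodedata

letters = ["e", "é", "è", "ê", "f"]


def binary_key(ch):
    """Sort by the byte value in ISO-8859-15 (latin9)."""
    return unicodedata.normalize("NFC", ch).encode("iso-8859-15")


# Hypothetical accent-insensitive rule: map accented letters to their base.
BASE_LETTER = {"é": "e", "è": "e", "ê": "e"}


def accent_insensitive_key(ch):
    ch = unicodedata.normalize("NFC", ch)
    return BASE_LETTER.get(ch, ch)


binary_order = sorted(letters, key=binary_key)
collated_order = sorted(letters, key=accent_insensitive_key)

print(binary_order)    # ['e', 'f', 'è', 'é', 'ê']
print(collated_order)  # ['e', 'é', 'è', 'ê', 'f']
```

In the second result the four e variants compare as equal, so the stable sort simply keeps them in their input order ahead of f.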

A character set is a subset of all written glyphs. A character encoding specifies how those characters are mapped to numeric values. Some character encodings, like UTF-8 and UTF-16, can encode any character in the Universal Character Set. Others, like US-ASCII or ISO-8859-1 can only encode a small subset, since they use 7 and 8 bits per character, respectively. Because many standards specify both a character set and a character encoding, the term "character set" is often substituted freely for "character encoding".
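A quick way to see which characters an encoding can represent is to try encoding them. The following Python sketch uses a hypothetical helper, `can_encode`, and a handful of arbitrarily chosen sample characters.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = ["A", "é", "€", "日"]

supported = {
    encoding: [ch for ch in samples if can_encode(ch, encoding)]
    for encoding in ["ascii", "iso-8859-1", "utf-8", "utf-16"]
}

for encoding, chars in supported.items():
    print(encoding, chars)
# ascii ['A']
# iso-8859-1 ['A', 'é']
# utf-8 ['A', 'é', '€', '日']
# utf-16 ['A', 'é', '€', '日']
```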

A collation comprises rules that specify how characters can be compared for sorting. Collation rules can be locale-specific: the proper order of two characters varies from language to language.
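As a toy illustration of locale-specific ordering, the sketch below sorts the same words with two hand-written alphabets. Both alphabets are heavily simplified assumptions (real German and Swedish collations have more rules); they only capture that 'ä' sorts next to 'a' in one convention and after 'z' in the other.

```python
# Hypothetical, simplified alphabets used only for this example.
GERMAN_LIKE = "aäbcdefghijklmnoöpqrstuüvwxyz"
SWEDISH_LIKE = "abcdefghijklmnopqrstuvwxyzåäö"


def make_key(alphabet):
    """Build a sort key that ranks letters by their position in alphabet."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word):
        return [rank[ch] for ch in word.casefold()]

    return key


words = ["zebra", "äpple", "apple"]

german_order = sorted(words, key=make_key(GERMAN_LIKE))
swedish_order = sorted(words, key=make_key(SWEDISH_LIKE))

print(german_order)   # ['apple', 'äpple', 'zebra']
print(swedish_order)  # ['apple', 'zebra', 'äpple']
```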

Choosing a character set and collation comes down to whether your application is internationalized or not. If not, what locale are you targeting?

To choose which character set to support, you have to consider your application. If you are storing user-supplied input, it might be hard to foresee all the locales in which your software will eventually be used. To support them all, it might be best to support the UCS (Unicode) from the start. However, there is a cost to this: in UTF-8, for example, many accented western European characters require two bytes of storage per character instead of the one byte they need in a single-byte encoding such as ISO-8859-1.
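The storage cost can be measured directly by encoding a sample string and counting bytes. This is a minimal Python sketch; `byte_length` is a helper invented for the example, and the figures say nothing about how a particular database engine lays out rows on disk.

```python
def byte_length(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


word = "café"  # four characters, one of them accented

sizes = {
    encoding: byte_length(word, encoding)
    for encoding in ["iso-8859-1", "utf-8", "utf-16-le", "utf-32-le"]
}

print(sizes)
# {'iso-8859-1': 4, 'utf-8': 5, 'utf-16-le': 8, 'utf-32-le': 16}

# Plain ASCII text costs nothing extra in UTF-8.
print(byte_length("cafe", "utf-8"))  # 4
```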

Choosing the right collation can help performance if your database uses the collation to create an index, and later uses that index to provide sorted results. However, since collation rules are often locale-specific, that index will be worthless if you need to sort results according to the rules of another locale.

I suggest using utf8mb4_unicode_ci, which is based on the Unicode standard for sorting and comparison and sorts accurately in a very wide range of languages.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow