Question

Wikipedia lists the 0x80 to 0x9F "C1" range under Latin-1 Supplement for Unicode. This range is also reserved in the ISO-8859-1 code page.

I'm looking at a file of strings, all of which are within the 7-bit ASCII range except for a few instances of \x96 where it looks like a dash would be, such as the middle of a street address.

I don't know if other characters in the C1 range might eventually show up in the data, so I'd like to know if there's a correct way to read the file. Are there any 8-bit encodings which use 0x80 through 0x9F for character data instead of terminal control characters?

Solution

There are many 8-bit encodings (potentially an unlimited number) that assign graphic characters to some or all bytes in the range 0x80 to 0x9F. Several encodings defined by Microsoft, Windows-1252 among them, have U+2013 EN DASH “–” at byte position 0x96, and this character could conceivably appear in a street address, especially between numbers.

On the other hand, MacRoman, for example, has the letter “ñ” at position 0x96, and it could well appear within a street name in Spanish.
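To see how the same byte reads under different candidate encodings, a small Python sketch along these lines may help. The sample address and the helper name decode_candidates are made up for illustration; the codec names cp1252, mac_roman and latin-1 are the ones Python's standard library uses for Windows-1252, MacRoman and ISO-8859-1.

```python
# Hypothetical sample: an address with byte 0x96 where a dash seems to belong.
SAMPLE = b"1200\x961210 Main St"


def decode_candidates(data, encodings=("cp1252", "mac_roman", "latin-1")):
    """Decode the same bytes under several 8-bit encodings.

    Returns a dict mapping codec name to the decoded string, or to None
    when the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes in 0x80-0x9F
            # unassigned, so decoding can fail for those.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, text in decode_candidates(SAMPLE).items():
        # repr() makes invisible C1 control characters visible as escapes.
        print(f"{name:10} {text!r}")
```

Under latin-1 the byte comes back as the C1 control character U+0096, which is why the repr is printed rather than the raw string. Which result is "right" still depends on where the data came from.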

For a rational analysis of the situation, you should inspect the data as a whole, possibly using a filter that finds all bytes outside the ASCII range 0x00 to 0x7F, look at the contexts in which those bytes appear, and try to find technical information about the origin of the data.
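The filter described above could be sketched in Python roughly as follows. The function names find_non_ascii and summarize and the sample bytes are assumptions for illustration only; a real run would use the bytes read from your own file.

```python
from collections import Counter


def find_non_ascii(data: bytes, context: int = 10):
    """Return (offset, byte value, surrounding bytes) for every byte above 0x7F."""
    hits = []
    for i, b in enumerate(data):
        if b > 0x7F:
            start = max(0, i - context)
            hits.append((i, b, data[start:i + context + 1]))
    return hits


def summarize(data: bytes) -> Counter:
    """Count how often each non-ASCII byte value occurs."""
    return Counter(b for b in data if b > 0x7F)


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the file in binary mode, e.g.
    # data = open(path, "rb").read() with your own path.
    data = b"100\x96200 Elm St\nCa\x96on Rd\n"
    for offset, byte, ctx in find_non_ascii(data, context=6):
        print(f"offset {offset}: 0x{byte:02X} in {ctx!r}")
    for byte, count in summarize(data).most_common():
        print(f"0x{byte:02X} occurs {count} time(s)")
```

Reading the file as raw bytes avoids committing to any encoding before the contexts have been inspected. A byte that always sits between digits points toward a dash, while one inside words points toward a letter.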

Other tips

It's most likely an en dash, which is slightly different from a hyphen (0x2D).

http://www.ascii-code.com/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow