How to determine in .NET if a file is UCS-2 vs. UTF-16

https://stackoverflow.com/questions/10746775

10-06-2021

Question

I have flat files that I can load just fine in .NET as UTF-16, even though they are technically UCS-2LE (w/o BOM), and I understand this is because UCS-2 is an older standard that UTF-16 supersedes.

However, what I'm interested in is being able to determine if a file actually is UCS-2. I know that this means I'd be guessing. I have tried the .NET ports of chardet, the IMultiLanguage2 interop, and some open-source code by Novell to try to tease out a determination of UCS-2 over UTF-16, and I haven't had any success. I haven't found any technique that can determine the difference between UCS-2LE w/o BOM and invalid/overlong UTF-8.

Should I be inspecting them byte for byte and trying to decide whether the encoding is variable or fixed length? Maybe look for missing code points? The issue is that these text files have no special code points; they only contain the bog-standard Western character set. But TextPad saves them as UCS-2LE w/o BOM, and that complicates downstream file operations in our software, which wants them to be fully compliant UTF-16 (just force-loading the files works, but doesn't meet the software's requirements).

Solution

The Wikipedia article on UTF-16, http://en.wikipedia.org/wiki/UTF-16, discusses the Basic Multilingual Plane (BMP). Every code point in the BMP is encoded identically in UTF-16 and UCS-2. If TextPad is only encoding BMP characters, then you can treat the document as either UTF-16 or UCS-2.
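As a rough illustration of that point, the sketch below (Python rather than .NET, with a made-up sample string and a hypothetical helper name) encodes text as UTF-16LE and checks that each 16-bit code unit equals the code point itself, which is exactly what a fixed-width UCS-2 encoding would produce. Python has no separate UCS-2 codec, so the comparison against `ord()` stands in for one.

```python
import struct


def code_units_le(text: str) -> list:
    """Return the little-endian 16-bit code units of text encoded as UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


# Hypothetical sample: ordinary Western text, all of it inside the BMP.
sample = "Plain Western text: café"

# For BMP-only text, every UTF-16 code unit is simply the code point,
# so the bytes are the same ones a UCS-2 encoder would write.
same_as_ucs2 = code_units_le(sample) == [ord(ch) for ch in sample]

# A code point outside the BMP needs two code units (a surrogate pair),
# which UCS-2 has no way to express.
pair = code_units_le("\U0001F600")
```

For files like the ones described in the question, this is why loading them as UTF-16 "just works": there is no byte-level difference to detect.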

A problem arises only when code points outside the BMP are encoded. UCS-2 cannot represent code points outside the BMP (see http://en.wikipedia.org/wiki/Universal_Character_Set). This would lead one to assume that if a file contains a code point outside the BMP, it can be treated as UTF-16. That assumption could be problematic if the program creating the file was doing UCS-2 improperly and using code points outside the BMP for ancillary reasons.
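One way to act on this is to scan the 16-bit code units for surrogates, the values UTF-16 reserves for encoding code points outside the BMP. The following is only a sketch under stated assumptions: the input is BOM-less little-endian data, and the function name and the three result labels are invented for illustration. It is a heuristic, not a real detector, because a BMP-only file is valid as both encodings.

```python
import struct


def classify_utf16le(data: bytes) -> str:
    """Classify BOM-less little-endian 16-bit text by its surrogate usage."""
    if len(data) % 2:
        return "malformed"  # an odd byte count cannot be whole 16-bit units
    units = struct.unpack(f"<{len(data) // 2}H", data)
    saw_pair = False
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: must be followed by a low one
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                saw_pair = True
                i += 2
                continue
            return "malformed"
        if 0xDC00 <= unit <= 0xDFFF:  # low surrogate with no preceding high one
            return "malformed"
        i += 1
    # Well-formed pairs mean the file can only be UTF-16; without any
    # surrogates it is valid as both UCS-2 and UTF-16.
    return "utf-16" if saw_pair else "bmp-only"
```

A result of `"malformed"` corresponds to the "improper UCS-2" case described above: the file uses surrogate values in a way that UTF-16 does not allow, so what the producer intended has to be worked out separately.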

Most libraries and programs that read UTF allow you to specify what to do when an encoding error occurs, on a per-character basis (raise an exception, replace with a placeholder, or simply ignore it). If an improper UCS-2 file is run through one of these as UTF-16, it will raise errors. Understanding what the author of the file was trying to do outside the BMP would be the only way to handle them appropriately.
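The three error policies can be illustrated with Python's built-in decoder, used here only as a stand-in for whatever library actually reads the files. The byte string is a contrived example of "improper UCS-2": the letter A, a lone high surrogate (0xD800), then the letter B.

```python
# Contrived input: 'A', an unpaired high surrogate, 'B' (little-endian, no BOM).
bad = b"\x41\x00" + b"\x00\xd8" + b"\x42\x00"

# Policy 1: raise an exception on the first invalid sequence.
try:
    bad.decode("utf-16-le", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Policy 2: substitute the Unicode replacement character (U+FFFD).
replaced = bad.decode("utf-16-le", errors="replace")

# Policy 3: silently drop the invalid sequence.
ignored = bad.decode("utf-16-le", errors="ignore")
```

Strict decoding is the only one of the three that tells you the file is not well-formed UTF-16. The other two hide the problem, and with it any hint about what the producer intended.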

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow