Question

I want/need a test case for testing/breaking conversions between UTF-32 and UTF-16.

For UTF-8 and UTF-16, I generally use the 'Chinese Bone' test: 0xE9 0xAA 0xA8 (UTF8) and 0x9AA8 (UTF16).

Does anyone have a negative test case that should break a poorly written implementation for UTF-16 and UTF-32? Ideally, the test will require use of at least two UTF-32 values.

Jeff

Was it helpful?

Solution

Not sure what you mean, here are some:

UTF-16

  • Lead surrogate with regular unit or another lead surrogate following \xD8\x00\x00\x00 or \xD8\x00\xDB\xFF
  • Trail surrogate without lead surrogate before it \x00\x61\xDC\00
  • Trail surrogate in lead position \xDF\xFF\xDB\xFF
  • Lead surrogate as last unit \xD8\x01<EOF>
  • Lead surrogate as last unit, followed by a half trail surrogate. This bug exists in python 2.7.3: '\xD8\x00\xDC'.decode('utf-16be')

UTF-32

  • Unit value returns true for value < 0, value > 0x10FFFF or 0xD800 <= value && value <= 0xDFFF
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top