Question

I need to get the first character of a text variable. I normally do this with one of the following simple methods:

string.sub(someText,1,1)

or

someText:sub(1,1)

If I do the following, I expect to get 'ñ' as the first letter. However, the result of either sub method is 'Ã':

local someText = 'ñññññññ'
print('Test whole: '..someText) 
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))

Here are the results from the console:

2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã

It seems like the string.sub() function is encoding the returned value in UTF-8. Just for kicks I tried the utf8_decode() function provided by Corona SDK, without success: the simulator indicated that the function expected a number but got nil instead.

I also searched the web to see if anyone else had run into this issue. I found a fair amount of discussion on Lua, Corona, Unicode and UTF-8, but nothing that addressed this specific problem.

Was it helpful?

Solution

Lua strings are 8-bit clean, which means a Lua string is a stream of bytes, not of characters. In UTF-8 the character ñ is encoded as multiple bytes (two, in this case), but someText:sub(1,1) returns only the first of those bytes.

In UTF-8, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. All other code points are encoded as a sequence of bytes where the lead byte is in the range 194-244 and every continuation byte is in the range 128-191.
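A quick sketch of what is actually happening (the byte values shown are the standard UTF-8 encoding of ñ, U+00F1):

```lua
local s = 'ñ'                 -- two bytes in UTF-8: 0xC3 0xB1
print(#s)                     -- 2: the length operator counts bytes, not characters
print(string.byte(s, 1, -1))  -- 195  177
-- s:sub(1,1) returns only the lead byte 0xC3, which a Latin-1 console
-- displays as 'Ã' -- exactly the symptom in the question
print(s:sub(1, 1))
```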

Because of this, you can use the pattern ".[\128-\191]*" (one byte followed by any number of continuation bytes) to match a single UTF-8 code point (not grapheme):

for c in "ñññññññ":gmatch(".[\128-\191]*") do -- pretend the first string is in NFC
    print(c)
end

Output:

ñ
ñ
ñ
ñ
ñ
ñ
ñ
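Applied to the original problem, the same pattern with string.match gives you just the first code point (a minimal sketch; someText is the variable from the question):

```lua
local someText = 'ñññññññ'
-- match one byte plus any following continuation bytes,
-- i.e. the first complete UTF-8 code point
local firstChar = someText:match(".[\128-\191]*")
print('first char: ' .. firstChar)  -- first char: ñ
```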

OTHER TIPS

Regarding the used character set: Just know what requirements you bake into your own code, and make sure those are actually satisfied. There are various typical requirements:

  • ASCII compatibility (i.e. any byte < 128 represents an ASCII character, and all ASCII characters are represented as themselves)
  • Fixed-size vs. variable-width (possibly self-synchronizing) encoding
  • No embedded 0-bytes

Write your code so that it relies on as few of these requirements as possible, and document the ones it does rely on.
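As an example of relying only on one such requirement, the following sketch counts code points by assuming nothing beyond UTF-8's continuation-byte layout (bytes 128-191 never start a character); the function name utf8len is my own:

```lua
-- Count UTF-8 code points in a string.
-- Assumes only that the input is valid UTF-8: every code point has
-- exactly one byte outside the continuation range 128-191.
local function utf8len(s)
  local _, count = s:gsub("[^\128-\191]", "")
  return count
end

print(utf8len('ñññññññ'))  -- 7
print(utf8len('hello'))    -- 5
```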

Match a single UTF-8 character: Be sure what you mean by "UTF-8 character". Is it a glyph (grapheme) or a code point? As far as I know, you need full Unicode tables for grapheme matching. Do you actually need to work at that level at all?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow