The length of Arabic letters in Lua

https://stackoverflow.com/questions/21138583

28-09-2022

Question

In Lua, when I ask for the length of a single Arabic letter (such as "ف"), the result is 2!

Ex.

local letter = "ف"
print( letter:len() )

Output: 2

The same problem occurs when I use string.sub(s, a, b). If I want to print the first letter of an Arabic word, I can't use word:sub(1,1).

Ex.

local word_1 = "فولت"
print( word_1:sub(1,2) )

Output: ف
As you can see, I had to pass 2 as the second argument, not 1, to get the correct answer.
If I pass 1 as the second argument, the result is:

print( word_1:sub(1,1) )

Output: Ù

Why does Lua count the length of a single Arabic letter as 2?

And is there a way to get the correct length, which is 1?

Solution

Lua is 8-bit clean.

In other words, a Lua string is a sequence of bytes; Lua has no notion of Unicode characters internally. In UTF-8, the Arabic letter "ف" is encoded as 2 bytes, so Lua reports its length as 2.
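To make the byte-level view concrete, here is a minimal sketch that inspects the bytes of the letter. It assumes the text is UTF-8 encoded; the letter is written with decimal escapes so the result does not depend on how the source file is saved, and the variable names are only illustrative.

```lua
-- "ف" (U+0641) is the two bytes 217 129 (0xD9 0x81) in UTF-8
local letter = "\217\129"

print(#letter)                     -- 2 (bytes, not characters)
print(string.byte(letter, 1, -1))  -- 217     129

-- sub(1, 1) cuts the character in half and keeps only the lead byte
local first_byte = letter:sub(1, 1)
print(string.byte(first_byte))     -- 217
```

The lone byte 217 (0xD9) is not valid UTF-8 on its own. A terminal that interprets it as Latin-1 displays it as "Ù", which is the output shown in the question.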

To work with Unicode text you need a workaround. For example, assuming the string is UTF-8 encoded, this snippet counts its characters by counting every byte that is not a continuation byte (Reference: Lua Unicode):

local _, count = string.gsub(unicode_string, "[^\128-\193]", "")
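The same idea can be wrapped in small helpers. This is a sketch, not a full UTF-8 implementation: it assumes valid UTF-8 input, and the names utf8_length and first_char are hypothetical. It uses the range \128-\191 (the continuation bytes); for valid UTF-8 this gives the same count as \128-\193, because bytes 192 and 193 never occur.

```lua
-- Count characters by counting every byte that is NOT a continuation byte
local function utf8_length(s)
  local _, count = string.gsub(s, "[^\128-\191]", "")
  return count
end

-- First character: one non-continuation byte plus the continuation
-- bytes that follow it (returns nil for an empty string)
local function first_char(s)
  return s:match("^[^\128-\191][\128-\191]*")
end

-- "فولت" written as UTF-8 bytes so the file encoding does not matter
local word = "\217\129\217\136\217\132\216\170"

print(#word)                           -- 8
print(utf8_length(word))               -- 4
print(first_char(word) == "\217\129")  -- true (the letter "ف")
```

Both helpers rely only on string.gsub and string.match, so they should also work on Lua versions older than 5.3.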

OTHER TIPS

Lua 5.3 has since been released, and it provides a basic UTF-8 library.

utf8.len can be used to get the length of a UTF-8 string:

print(utf8.len("ف"))
-- 1
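The library can also replace word:sub(1,1) from the question. The sketch below requires Lua 5.3 or later and uses utf8.offset to turn character positions into byte positions. utf8_sub is a hypothetical helper, not part of the standard library, and it only handles positive indices.

```lua
-- "فولت" written as UTF-8 bytes so the file encoding does not matter
local word = "\217\129\217\136\217\132\216\170"

-- Hypothetical helper: substring by character positions i..j (positive only)
local function utf8_sub(s, i, j)
  local start = utf8.offset(s, i)     -- byte where character i starts
  if not start then return "" end
  local stop = utf8.offset(s, j + 1)  -- byte where character j+1 starts
  return s:sub(start, stop and stop - 1 or #s)
end

print(utf8.len(word))                               -- 4
print(utf8_sub(word, 1, 1) == "\217\129")           -- true (the letter "ف")
print(utf8_sub(word, 2, 3) == "\217\136\217\132")   -- true (the letters "ول")

-- utf8.codes iterates over byte positions and code points
for pos, cp in utf8.codes(word) do
  print(pos, string.format("U+%04X", cp))
end
```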

Being 8-bit clean is enough for Lua to claim that it supports Unicode, although without an additional Unicode library that support is minimal. There are at least four ways to measure a Unicode string: code units, code points, grapheme clusters, and byte count. The byte count is a constant multiple of the number of code units, depending on the encoding: 1 byte per code unit in UTF-8, 2 in UTF-16, and 4 in UTF-32. So think carefully about which of these measures you need in each place.
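A short sketch of how these measures can differ for the same string, assuming Lua 5.3 or later. utf16_units is a hypothetical helper that counts how many UTF-16 code units the string would need. Standard Lua has no grapheme cluster segmentation, so that measure is only noted in the comments.

```lua
-- Hypothetical helper: UTF-16 code units needed for a UTF-8 string
-- (code points above U+FFFF need a surrogate pair, i.e. 2 units)
local function utf16_units(s)
  local units = 0
  for _, cp in utf8.codes(s) do
    units = units + (cp > 0xFFFF and 2 or 1)
  end
  return units
end

-- "e" followed by U+0301 (combining acute accent): one grapheme cluster
local e_acute = "e" .. utf8.char(0x0301)
-- U+1F600, a code point outside the Basic Multilingual Plane
local emoji = utf8.char(0x1F600)

-- UTF-8 bytes (= UTF-8 code units), code points, UTF-16 code units
print(#e_acute, utf8.len(e_acute), utf16_units(e_acute))  -- 3  2  2
print(#emoji, utf8.len(emoji), utf16_units(emoji))        -- 4  1  2
```

The first string is three bytes and two code points, yet a reader sees a single character, which is why the right measure depends on what the length is used for.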

Licensed under: CC-BY-SA with attribution
