In Unicode, some characters, such as ä, can be represented in two ways. They can be a single code point, such as U+00E4 in the case of ä, or they can be formed from a “base” character immediately followed by a combining character, such as a followed by U+0308 (COMBINING DIAERESIS). In the latter case, the combined character consists of two code points. Ruby’s String#length method returns the total number of code points, so you can get different lengths for what appear to be the same string.
s1 = "ä" # single codepoint
s2 = "a" # 'base' letter
s3 = "a\u0308" # base letter + combining character
[s1, s2, s3].each do |s|
  puts "Letter: #{s}"
  puts "Bytes: #{s.bytes}"
  puts "Codepoints: #{s.codepoints}"
  puts "Length: #{s.length}"
  puts
end
Output:

Letter: ä
Bytes: [195, 164]
Codepoints: [228]
Length: 1

Letter: a
Bytes: [97]
Codepoints: [97]
Length: 1

Letter: ä
Bytes: [97, 204, 136]
Codepoints: [97, 776]
Length: 2

(The bytes value is the UTF-8 encoding of the characters. In UTF-8 some characters are encoded as multiple bytes; this is a separate issue from combining characters.)
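To make the distinction between bytes and code points concrete, here is a small sketch using only core String methods. The variable names are illustrative, and it assumes the strings are UTF-8 encoded (which "\u" escapes produce):

```ruby
# "\u" escapes are used so the example does not depend on the
# encoding of the source file; both strings are UTF-8.
precomposed = "\u00E4"  # ä as a single code point
euro        = "\u20AC"  # € as a single code point

puts precomposed.length    # code points: 1
puts precomposed.bytesize  # UTF-8 bytes: 2
puts euro.length           # code points: 1
puts euro.bytesize         # UTF-8 bytes: 3
```

Both strings have a length of 1 even though they occupy 2 and 3 bytes, so the byte count is not what makes the combining form report a length of 2.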
Ruby’s core library has historically had little support for dealing with Unicode issues like this, so a common approach is to use an external library such as UnicodeUtils. The idea of length can become pretty unclear when talking about different languages (what counts as a ‘single character’?). You could use the display_width method, which will probably give what you want for Latin scripts. Another possibility is to use a normalized form, which makes sure all the characters are represented the same way: either all decomposed into a base character plus combining characters, or all composed into a single code point where one is available:
require 'unicode_utils'
combined = "a\u0308"
single = "ä"
# nfc - normalized form composed - use a single code point if possible
puts UnicodeUtils.nfc(combined).length # => 1
puts UnicodeUtils.nfc(single).length # => 1
# nfd - normalized form decomposed - always use combining characters
puts UnicodeUtils.nfd(combined).length # => 2
puts UnicodeUtils.nfd(single).length # => 2
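If installing a gem is not an option, newer Ruby versions can do the same thing without one. The following is a sketch that assumes Ruby 2.2 or later for String#unicode_normalize and Ruby 2.5 or later for String#grapheme_clusters; it is an alternative to the UnicodeUtils calls above, not a replacement for everything that library offers:

```ruby
combined = "a\u0308"  # base letter + COMBINING DIAERESIS
single   = "\u00E4"   # precomposed ä

# The two strings look the same but compare as different.
puts combined == single                          # false

# After normalizing to the same form they compare equal.
puts combined.unicode_normalize(:nfc) == single  # true

# Lengths now agree within each normalization form.
puts combined.unicode_normalize(:nfc).length     # 1
puts single.unicode_normalize(:nfd).length       # 2

# Grapheme clusters count user-perceived characters,
# regardless of how they are encoded.
puts combined.grapheme_clusters.length           # 1
puts single.grapheme_clusters.length             # 1
```

Normalizing is useful when you want to compare or store strings consistently, while counting grapheme clusters is closer to what most people mean by the “length” of a string.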