The length of a string with umlauts coming from the filesystem

Question 1

Until now, many of Ruby's String operations have supported only ASCII characters, so for non-ASCII characters you can use the unicode gem. See also its width method.

require "unicode"

s1 = "Propädeutikum"
s2 = "Propadeutikum"
Unicode::width(s1) # => 13
Unicode::width(s2) # => 13

See also this post, "Re: how to capitalize nonascii characters?":

Hi,

Yes, use unicode gem for now. String operations on non ASCII characters are one of the topics for upcoming Ruby 2.2.
         matz.

Question 2

In Unicode, some characters, such as ä, can be represented in two ways. They can be a single code point, such as U+00E4 in the case of ä, or they can be formed from a “base” character immediately followed by a combining character, such as a followed by U+0308 (COMBINING DIAERESIS). In the latter case the combined character consists of two code points. Ruby’s String#length method returns the total number of code points, so you can get different lengths for strings that appear to be the same.

s1 = "ä"        # single codepoint
s2 = "a"        # 'base' letter
s3 = "a\u0308"  # base letter + combining character

[s1, s2, s3].each do |s|
  puts "Letter:     #{s}"
  puts "Bytes:      #{s.bytes}"
  puts "Codepoints: #{s.codepoints}"
  puts "Length:     #{s.length}"
  puts
end

Output:

Letter:     ä
Bytes:      [195, 164]
Codepoints: [228]
Length:     1

Letter:     a
Bytes:      [97]
Codepoints: [97]
Length:     1

Letter:     ä
Bytes:      [97, 204, 136]
Codepoints: [97, 776]
Length:     2

(The bytes are the UTF-8 encoding of the characters. In UTF-8 some characters are encoded as multiple bytes; this is a separate issue from the combining characters.)
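To keep the two issues apart, here is a minimal sketch that reports the byte count and the code point count side by side. The helper name string_sizes is made up for this example, and it assumes the strings are UTF-8 encoded (the default for string literals in current Ruby versions).

```ruby
# string_sizes is a hypothetical helper used only for this illustration.
# It returns the two different "sizes" Ruby reports for a string.
def string_sizes(str)
  {
    bytes: str.bytesize,     # number of bytes in the string's encoding
    codepoints: str.length   # number of code points, as counted by String#length
  }
end

precomposed = "\u00E4"   # ä as a single code point
decomposed  = "a\u0308"  # a followed by COMBINING DIAERESIS

puts "precomposed: #{string_sizes(precomposed)[:bytes]} bytes, #{string_sizes(precomposed)[:codepoints]} code point(s)"
puts "decomposed:  #{string_sizes(decomposed)[:bytes]} bytes, #{string_sizes(decomposed)[:codepoints]} code point(s)"
```

The precomposed form is 2 bytes and 1 code point, the decomposed form is 3 bytes and 2 code points, matching the output above.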

Ruby itself doesn’t (currently) have much support for dealing with Unicode issues like this, so you need to use an external library such as UnicodeUtils. The idea of length can become pretty unclear when talking about different languages (what counts as a ‘single character’?). You could use the display_width method, which will probably give what you want for Latin scripts. Another possibility is to use a normalized form, which makes sure all the characters are represented the same way: either all decomposed into base letters plus combining characters, or all composed into a single code point where one is available:

require 'unicode_utils'

combined = "a\u0308"
single = "ä"

# nfc - normalized form composed - use a single code point if possible
puts UnicodeUtils.nfc(combined).length # => 1
puts UnicodeUtils.nfc(single).length   # => 1

# nfd - normalized form decomposed - always use combining characters
puts UnicodeUtils.nfd(combined).length # => 2
puts UnicodeUtils.nfd(single).length   # => 2
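If what you want is the number of characters as a reader would see them, one option that needs no gem is to count grapheme clusters with the \X regular expression escape. This is only a sketch: visible_length is a made-up helper name, and it assumes UTF-8 strings and a Ruby version whose regex engine supports \X (2.0 or later, as far as I know).

```ruby
# visible_length is a hypothetical helper for this sketch. It counts
# extended grapheme clusters (a base character plus any combining marks)
# instead of code points.
def visible_length(str)
  str.scan(/\X/).length
end

puts visible_length("\u00E4")               # precomposed ä
puts visible_length("a\u0308")              # a + combining diaeresis
puts visible_length("Propa\u0308deutikum")  # decomposed ä inside a word
```

This prints 1, 1 and 13, whereas String#length would give 1, 2 and 14 for the same strings.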

Question 3

Similar to Matt's answer, but this may be slightly more efficient.

"Propädeutikum".each_char.size
# => 13

t = Time.now
500000.times {
  "Propädeutikum".each_char.size
}
puts Time.now - t
# => 0.364056992

t = Time.now
500000.times {
  "Propädeutikum".chars.count
}
puts Time.now - t
# => 1.462392185
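The same comparison can be written with the Benchmark module from the standard library, which avoids the manual Time.now arithmetic. This is a sketch, not a rigorous benchmark: time_it, WORD and ITERATIONS are names I made up, the string is held in a constant rather than re-created on every iteration, and the timings will differ by machine and Ruby version.

```ruby
require 'benchmark'

WORD = "Prop\u00E4deutikum"
ITERATIONS = 500_000

# time_it is a hypothetical helper: it runs the given block `iterations`
# times and returns the elapsed wall-clock time in seconds.
def time_it(iterations = ITERATIONS)
  Benchmark.realtime { iterations.times { yield } }
end

each_char_time = time_it { WORD.each_char.size }
chars_time     = time_it { WORD.chars.count }

puts format("each_char.size: %.3f s", each_char_time)
puts format("chars.count:    %.3f s", chars_time)
```

Both expressions return 13 for this string; only the time taken differs.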

Question 4

Maybe you have a problem with Unicode equivalence and composed characters?

See the following example. Both texts look the same, but are encoded in different ways:

# encoding: utf-8
text  = "Myl\u00E8ne.png"   # "Mylène.png", precomposed è
text2 = "Myle\u0300ne.png"  # "Mylène.png", e + combining grave accent

puts text   # Mylène.png
puts text2  # Mylène.png

puts text.size   # 10
puts text2.size  # 11

puts text.chars.count   # 10
puts text2.chars.count  # 11

There are some more details in my answer to "Weird Characters encoding".

You can check this by comparing the code points of your texts with text.codepoints.to_a. In my example I get:

p text.codepoints.to_a   #[77, 121, 108, 232, 110, 101, 46, 112, 110, 103]
p text2.codepoints.to_a  #[77, 121, 108, 101, 768, 110, 101, 46, 112, 110, 103]
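If the two forms should be treated as the same text, you can normalize both before comparing or measuring. The sketch below uses String#unicode_normalize, which is part of core Ruby from 2.2 onward; same_text? is a hypothetical helper name, and it assumes both strings are in a Unicode encoding such as UTF-8.

```ruby
# same_text? is a hypothetical helper for this sketch. It compares two
# strings after normalizing both to NFC (composed form).
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

text  = "Myl\u00E8ne.png"   # precomposed è
text2 = "Myle\u0300ne.png"  # e + combining grave accent

puts text == text2                        # false: different code points
puts same_text?(text, text2)              # true: equal after normalization
puts text2.unicode_normalize(:nfc).size   # 10, same as text.size
```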