Julia: Strange characters in my string

https://stackoverflow.com//questions/25082347

02-01-2020

Question

I scraped some text from the internet and put it in a UTF8String. I can use this string normally, but when I select certain characters (strange characters with accents, in my case ú), which I believe are not part of the UTF8 standard, I get an error saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal strings that do not contain them.

Any way to solve this?

EDIT:

I have a variable word of type SubString{UTF8String}. When I call method(word), no problems occur. When I call method(word[2:end]) (assuming a length of at least 2), I get an error if the second character is strange (not in UTF8).

Solution

Julia indexes strings by byte position rather than by character position. That is much more efficient for a variable-length encoding like UTF-8, but it means some operations need a bit more boilerplate.

The problem is that some code points are encoded as multiple bytes. If byte 2 falls in the middle of such a character, slicing the string with 2:end would give you half of the first character, which is invalid, so you get an error.
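To make the byte-versus-character distinction concrete, here is a small sketch. It assumes a current Julia (1.x), where the type is called String rather than UTF8String; the sample string "úa" is just an illustration, not the asker's data.

```julia
s = "úa"                         # 'ú' (U+00FA) takes two bytes in UTF-8, 'a' takes one

println(length(s))               # number of characters: 2
println(ncodeunits(s))           # number of bytes: 3
println(isvalid(s, 1))           # true: byte 1 is where 'ú' starts
println(isvalid(s, 2))           # false: byte 2 is the second byte of 'ú'
println(collect(eachindex(s)))   # valid character start indices: [1, 3]
```

So index 2 is not the second character here; it points into the middle of ú, and the second character starts at index 3.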

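The solution is to use the second valid index instead of a literal 2 in the slice. I think that is something like str[nextind(str, 1):end], where nextind(str, 1) returns the index at which the character after the one at index 1 begins.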
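A minimal sketch of that approach follows, again assuming Julia 1.x. The helper name drop_first is hypothetical (it is not a Base function), and it assumes a non-empty string.

```julia
# Hypothetical helper: drop the first character of a non-empty string or SubString.
drop_first(s::AbstractString) = s[nextind(s, firstindex(s)):end]

s = "úa"
println(drop_first(s))           # prints "a"

word = SubString("xúab", 2)      # the substring "úab"
println(drop_first(word))        # prints "ab"

# For comparison, a literal 2 lands inside 'ú' and throws.
err = try
    s[2:end]
    nothing
catch e
    e
end
println(typeof(err))             # StringIndexError on Julia 1.x
```

Because the helper is written against AbstractString, the same code path is used for String and SubString.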

PS. Sorry for a less than clear answer on my phone.

EDIT: I tried this, and it seems that SubString{UTF8String} and UTF8String behave differently when sliced. I've reported it as bug #7811 on GitHub.

Licensed under: CC-BY-SA with attribution
