Question

I scraped some text from the internet and stored it in a UTF8String. I can use this string normally, but when I select certain characters (strange characters with accents, in my case ú), which are not part of the UTF8 standard, I get an error saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal strings that do not contain them.

Any way to solve this?

EDIT:

I have a variable word of type SubString{UTF8String}. When I call method(word), no problems occur. When I call method(word[2:end]) (assuming a length of at least 2), I get an error if the second character is strange (not in UTF8).

Solution

Julia indexes strings by byte position rather than by character position. That is much more efficient for a variable-length encoding like UTF-8, but it means some operations need a little more boilerplate.
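To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes a current Julia 1.x release, where UTF8String no longer exists as a separate type and plain String plays its role; the example word is made up for illustration.

```julia
# Assumption: Julia 1.x, where String is UTF-8 encoded (the old UTF8String).
s = "s\u00faper"                  # "súper"; 'ú' occupies two bytes in UTF-8

println(length(s))                # number of characters: 5
println(ncodeunits(s))            # number of bytes: 6

# Valid indices are the byte offsets at which a character starts.
println(collect(eachindex(s)))    # [1, 2, 4, 5, 6]
println(isvalid(s, 2))            # true: 'ú' starts at byte 2
println(isvalid(s, 3))            # false: byte 3 is the second byte of 'ú'
```

Index 3 exists as a byte but not as the start of a character, which is why it cannot be used for indexing or slicing.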

The problem is that some code points are encoded as multiple bytes. When the first character is one of those, slicing the string with 2:end starts in the middle of that character, which is not a valid index, so you get an error.
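The failure can be reproduced with a short sketch like the one below. It assumes Julia 1.x, where the error type is StringIndexError; older releases reported the same problem with a different error. The example word is hypothetical.

```julia
# Assumption: Julia 1.x. The first character 'ú' spans bytes 1 and 2.
word = "\u00fanico"               # "único"

# Slicing from byte 2 starts inside 'ú', so Julia refuses the index.
result = try
    word[2:end]
catch err
    err
end

println(typeof(result))           # StringIndexError
```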

The solution is to use the second valid index instead of 2 in the slice. I think that is something like str[nextind(str, 1):end].
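A small sketch of that approach, assuming Julia 1.x and a non-empty string. The helper name drop_first is hypothetical and only serves the illustration.

```julia
# Assumption: Julia 1.x; `drop_first` is a made-up helper name.
# nextind returns the index at which the character after index i starts,
# so the slice always begins on a character boundary.
drop_first(str::AbstractString) = str[nextind(str, firstindex(str)):end]

word = "\u00fanico"                     # "único"; first character is two bytes wide
println(drop_first(word))               # nico
println(drop_first("abc"))              # bc

# The same helper accepts a SubString, as in the question.
sub = SubString("el \u00fanico", 4)     # view of "único" inside "el único"
println(drop_first(sub))                # nico
```

On Julia 1.x, chop(word, head=1, tail=0) should give an equivalent result as a SubString, without any manual index arithmetic.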

PS. Sorry for the less than clear answer; I'm on my phone.

EDIT: I tried this, and it seems that SubString{UTF8String} and UTF8String behave differently when sliced. I've reported it as bug #7811 on GitHub.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow