Making non-ASCII data suitable for CRAN

https://stackoverflow.com/questions/18837855

28-06-2022
Question

I have some data containing non-ASCII characters that I want to include as an rda file in an R package. When I run R CMD check on the package, I get a warning:

Warning: found non-ASCII strings

which is blocking the package from being accepted on CRAN.

There's a similar question about removing non-ASCII characters from data files, but I want to keep the non-ASCII characters.

You can grab the CSV data here. I'm reading it into R and resaving as rda with this code:

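english_monarchs <- read.csv(
  wherever_you_downloaded_the_file_to, 
  fileEncoding     = "UTF-8",
  stringsAsFactors = TRUE,  # R >= 4.0 no longer creates factors by default
  na.strings       = ""
)
# save() needs the file name passed via file=, and the target is an rda file
save(english_monarchs, file = "english_monarchs.rda")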

It's the name column of the dataset that contains non-ASCII values.

head(levels(english_monarchs$name))
## [1] "Adda"                                "Æðelbehrt"                          
## [3] "Æðelberht I"                         "Æðelberht II and Eardwulf"          
## [5] "Æðelberht II, Ælfric and Eadberht I" "Æðelberht III"

Based on the (not very clear) guidance in the Encoding Issues section of Writing R Extensions, I think I ought to be encoding the factor levels as UTF-8, but the obvious method doesn't work:

Encoding(levels(english_monarchs$name)) <- "utf8"  # each encoding is still "unknown"

How can I make the data portable enough to be accepted on CRAN?

Solution

What worked for me was to declare the encoding as "latin1" and then use iconv to convert the levels to UTF-8.

Encoding(levels(english_monarchs$name)) <- "latin1"
levels(english_monarchs$name) <- iconv(
  levels(english_monarchs$name), 
  "latin1", 
  "UTF-8"
)
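
To see what the two steps do in isolation, here is a minimal sketch on a single made-up string rather than the real dataset. The string is built from raw bytes (an assumption for illustration, standing in for a latin1-encoded factor level) so the example does not depend on the session locale. It also shows that an unrecognised value such as "utf8" leaves the declared encoding as "unknown", which matches the behaviour described in the question.

```r
# Hypothetical example string: the latin1 bytes for "Æðel", built from raw
# bytes so the result does not depend on the file or session encoding.
x <- rawToChar(as.raw(c(0xC6, 0xF0, 0x65, 0x6C)))
enc_before <- Encoding(x)   # no encoding has been declared yet

# An unrecognised value such as "utf8" is silently treated as "unknown".
x_bad <- x
Encoding(x_bad) <- "utf8"
enc_bad <- Encoding(x_bad)

# Declare what the bytes really are, then re-encode them as UTF-8.
Encoding(x) <- "latin1"
y <- iconv(x, "latin1", "UTF-8")
enc_after <- Encoding(y)

# Inspect the declared encodings and the re-encoded bytes.
print(c(before = enc_before, bad = enc_bad, after = enc_after))
print(charToRaw(y))
```

Declaring "latin1" is only appropriate when the strings really are stored as latin1 bytes in the session that read the file. If they are already UTF-8 bytes, the same two steps would garble them, so it is worth checking the bytes with charToRaw() on one level first.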

Licensed under: CC-BY-SA with attribution
