Question

I have some data containing non-ASCII characters that I want to include as an rda file in an R package. When I run R CMD check on the package, I get a warning:

Warning: found non-ASCII strings

which is blocking the package from being accepted on CRAN.

There's a similar question about removing non-ASCII characters from data files, but I want to keep the non-ASCII characters.

You can grab the CSV data here. I'm reading it into R and resaving as rda with this code:

english_monarchs <- read.csv(
  wherever_you_downloaded_the_file_to, 
  fileEncoding     = "utf8",
  na.strings       = ""
)
# save() needs the file argument named; the target is an rda file, not a CSV
save(english_monarchs, file = "english_monarchs.rda")

It's the name column of the dataset that contains non-ASCII values.

head(levels(english_monarchs$name))
## [1] "Adda"                                "Æðelbehrt"                          
## [3] "Æðelberht I"                         "Æðelberht II and Eardwulf"          
## [5] "Æðelberht II, Ælfric and Eadberht I" "Æðelberht III"

Based upon the (not very clear) guidance in the Encoding Issues section of Writing R Extensions I think I ought to be encoding the factor levels as UTF-8, but the obvious method doesn't work:

Encoding(levels(english_monarchs$name)) <- "utf8"  #each encoding still "unknown"

How can I make the data portable enough to be accepted on CRAN?

Solution

The thing that worked for me was to declare the encoding as "latin1", and then use iconv to convert to UTF-8.

Encoding(levels(english_monarchs$name)) <- "latin1"
levels(english_monarchs$name) <- iconv(
  levels(english_monarchs$name), 
  "latin1", 
  "UTF-8"
)
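
To check that the conversion did what was intended, something like the following sketch may help. It is only an illustration: monarch_names is a hypothetical stand-in for levels(english_monarchs$name), built from Unicode escapes so that the example does not depend on how the file itself is encoded. Note also that the documented values for Encoding<- are "latin1", "UTF-8", "bytes" and "unknown"; "utf8" is not among them, which may be why the attempt in the question left every element "unknown".

```r
# Hypothetical stand-in for levels(english_monarchs$name).
# "\u00c6" is the letter AE and "\u00f0" is eth, both present in latin1.
monarch_names <- c("Adda", "\u00c6\u00f0elberht I")

# Mimic the starting point of the answer: latin1 bytes, declared as latin1.
latin1_names <- iconv(monarch_names, "UTF-8", "latin1")
Encoding(latin1_names) <- "latin1"

# The conversion step from the answer.
utf8_names <- iconv(latin1_names, "latin1", "UTF-8")

# Inspect the result. Pure-ASCII strings are never marked, so "Adda"
# is expected to stay "unknown" while the other element is "UTF-8".
Encoding(utf8_names)
validUTF8(utf8_names)

# Lists the elements that contain non-ASCII characters.
tools::showNonASCII(utf8_names)
```

If the same checks on the real factor levels report "UTF-8" for every name containing a non-ASCII character, the data should be in the form the answer describes; it is still worth re-running R CMD check to confirm.
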
Licensed under: CC-BY-SA with attribution