After reading all about iconv and Encoding, I am still confused.

I am scraping the source of a web page, and I have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string it encodes, which should be 'pretty=>big'.

More simply, if I set

x <- 'pretty\\u003D\\u003Ebig'

How do I perform a conversion on x to yield 'pretty=>big'?

Any suggestions?

Solution

Use parse, but don't evaluate the results:

x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1
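
If you have many strings, here is a minimal sketch wrapping the same parse() trick in a vectorized helper; the name unescape_all is invented for illustration, and it assumes the inputs contain no single quotes or stray backslashes:

unescape_all <- function(xs) {
  # parse each string as a quoted R literal; [[1]] extracts the
  # resulting character constant without evaluating anything else
  vapply(xs, function(x) parse(text = paste0("'", x, "'"))[[1]],
         character(1), USE.NAMES = FALSE)
}
unescape_all(c('pretty\\u003D\\u003Ebig', 'caf\\u00e9'))
# [1] "pretty=>big" "café"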

Other tips

With the stringi package:

> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"

Although I have accepted Hong ooi's answer, I can't help thinking that parse-and-eval is a heavyweight solution. Also, as pointed out, it is not secure, although for my application I can be confident that I will not get dangerous quotes.

So, I have devised an alternative, somewhat brutal, approach:

udecode <- function(string){
  # convert four hex digits to the corresponding UTF-8 character
  uconv <- function(chars) intToUtf8(strtoi(chars, 16L))
  # decode pieces tagged with a leading "|"; pass everything else through
  ufilter <- function(string) {
    if (substr(string, 1, 1) == "|") uconv(substr(string, 2, 5)) else string
  }
  # tag each \uXXXX escape and isolate it between commas
  string <- gsub("\\\\u([[:xdigit:]]{4})", ",|\\1,", string, perl=TRUE)
  # split on the commas, decode the tagged pieces, and reassemble
  strings <- unlist(strsplit(string, ","))
  string <- paste(sapply(strings, ufilter), collapse='')
  return(string)
}

Any simplifications welcomed!
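
One possible simplification, offered only as a sketch (the name udecode2 is hypothetical): overwrite each \uXXXX match in place with regmatches<-, which avoids the comma round-trip and so also preserves any literal commas in the input:

udecode2 <- function(string) {
  # locate every \uXXXX escape
  m <- gregexpr("\\\\u[[:xdigit:]]{4}", string)
  # replace each match with the character it encodes
  regmatches(string, m) <- lapply(
    regmatches(string, m),
    function(esc) vapply(esc,
                         function(e) intToUtf8(strtoi(substr(e, 3, 6), 16L)),
                         character(1))
  )
  string
}
udecode2('pretty\\u003D\\u003Ebig')
# [1] "pretty=>big"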

A use for eval(parse)!

eval(parse(text=paste0("'", x, "'")))

This has its own problems, of course, such as having to manually escape any quote marks within the string. But it should work for any valid Unicode escape sequences that may appear.
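
For what it's worth, here is a minimal sketch of that quote-escaping step (the helper name is hypothetical, and it still assumes the only backslashes in the input are the \uXXXX escapes):

unescape <- function(x) {
  # escape embedded single quotes so they cannot terminate the literal
  x <- gsub("'", "\\\\'", x)
  eval(parse(text = paste0("'", x, "'")))
}
unescape("it's pretty\\u003D\\u003Ebig")
# [1] "it's pretty=>big"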

I sympathise; I have struggled with R and Unicode text in the past, and not always successfully. If your data is in x, then first try a global replace on the escaped sequence, something like this:

x <- gsub("\\\\u003D\\\\u003E", "=>", x)

I sometimes use a construction like

lapply(x, utf8ToInt)

to see where the high code points are e.g. anything over 150. This helps me locate problems caused by non-breaking spaces, for example, which seem to pop up every now and again.
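
For example, a non-breaking space shows up as code point 160:

utf8ToInt("non\u00a0breaking")
# [1] 110 111 110 160  98 114 101  97 107 105 110 103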

> iconv('pretty\u003D\u003Ebig', "UTF-8", "ASCII")
[1] "pretty=>big"

But you appear to have an extra escape.
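
With that extra escape the string is already plain ASCII, so iconv has nothing to convert and returns it unchanged:

> iconv('pretty\\u003D\\u003Ebig', "UTF-8", "ASCII")
[1] "pretty\\u003D\\u003Ebig"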

The trick here is that '\\u003D' is actually six characters, while the '\u003D' you want is only one character. The further trick is that to match those backslashes you need to use doubly escaped backslashes in the pattern:

gsub("\\\\u003D\\\\u003E", "\u003D\u003E", x)
#[1] "pretty=>big"

To replace multiple characters with one character you need to target the entire pattern; you cannot simply delete a backslash. (Since you have indicated this is a more general problem, I think the answer might lie in modifications to your as-yet-undescribed method for downloading this text.)
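
You can see the length difference directly with nchar():

nchar('pretty\\u003D\\u003Ebig')
#[1] 21
nchar('pretty\u003D\u003Ebig')
#[1] 11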

When I load your functions and the dependencies, this code works:

> freq <- ngram(c('pretty\u003D\u003Ebig'), year_start = 1950)
> 
> str(freq)
'data.frame':   59 obs. of  4 variables:
 $ Year     : num  1950 1951 1952 1953 1954 ...
 $ Phrase   : Factor w/ 1 level "pretty=>big": 1 1 1 1 1 1 1 1 1 1 ...
 $ Frequency: num  1.52e-10 6.03e-10 5.98e-10 8.27e-10 8.13e-10 ...
 $ Corpus   : Factor w/ 1 level "eng_2012": 1 1 1 1 1 1 1 1 1 1 ...

(So I guess I am still not clear on the use case.)
