Question

To change values in my results data frame, I use a stringr-based function, replace_all, recommended in an answer by Hadley Wickham (https://stackoverflow.com/a/12829731/2872891). I left the function intact except for changing the final df to return(df), which I prefer. However, I see some strange behavior and I'm not sure what causes it: the subsequent calls to replace_all, in particular calls #3 and #4, do not recover the original data (http: and mailto:). A reproducible example follows.

Data (just one record of data):

Please see this Gist on GitHub: https://gist.github.com/abnova/1709b1e0cf8a57570bd1#file-gistfile1-r

Code (for brevity, I removed my comments with detailed explanations):

DATA_SEP <- ":"

rx <- "([[:alpha:]][^.:]|[[:blank:]])::([[:alpha:]][^:]|[[:blank:]])"
results <- gsub(rx, "\\1@@\\2", results)
results <- gsub(": ", "!@#", results) # should be after the ::-gsub
results <- gsub("http://", "http//", results)
results <- gsub("mailto:", "mailto@", results)

results <- gsub("-\\r\\n", "-", results) # order is important here
results <- gsub("\\r\\n", " ", results)

results <- gsub("\\n:gpl:962356288", ":gpl:962356288", results)

results <- readLines(textConnection(unlist(results)))
numLines <- length(results)
results <- lapply(results, function(x) gsub(".$", "", x))

data <- read.table(textConnection(unlist(results)),
                   header = FALSE, fill = TRUE,
                   sep = DATA_SEP, quote = "",
                   colClasses = "character", row.names = NULL,
                   nrows = numLines, comment.char = "",
                   strip.white = TRUE)

replace_all(data, fixed("!@#"), ": ")
replace_all(data, fixed("@@"), "::")
replace_all(data, fixed("http//"), "http://")
replace_all(data, fixed("mailto@"), "mailto:")

Results - actual:

> data$V3
[1] "http//www.accessgrid.org/"
> data$V17
[1] "http//mailto@accessgrid-tech@lists.sourceforge.net"

Results - expected:

> data$V3
[1] "http://www.accessgrid.org/"
> data$V17
[1] "http://mailto:accessgrid-tech@lists.sourceforge.net"

I'd appreciate any help and/or advice.


Solution 2

After almost finishing the alternative (gsub-based) implementation suggested by @hwnd, I realized what the problem with my original code was. I quickly tested the fixed code, and it confirmed my suspicion: each replace_all call returns a modified copy rather than changing data in place, so I needed to save the result of each call before making the next one. The fixed code looks like this:

# Now we can safely do post-processing, recovering original data
data <- replace_all(data, fixed("!@#"), ": ")
data <- replace_all(data, fixed("@@"), "::")
data <- replace_all(data, fixed("http//"), "http://")
data <- replace_all(data, fixed("mailto@"), "mailto:")

Again, thanks to @hwnd for valuable suggestions, which helped me to figure out this issue.
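
The point about saving each result can be seen in isolation with plain gsub on a single string. This is only a minimal sketch: the variable names x and unchanged are made up for illustration, and it uses base R rather than the stringr-based replace_all from the question.

```r
x <- "http//www.accessgrid.org/"

# The result is computed (and printed at the console) but never stored,
# so x itself is left untouched.
gsub("http//", "http://", x, fixed = TRUE)
unchanged <- x

# Assigning the result back is what actually updates x.
x <- gsub("http//", "http://", x, fixed = TRUE)
```

After the first call, unchanged still holds the masked string; only the assigned call changes x. The same applies to any R function that returns a modified copy of its argument.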

OTHER TIPS

I tested this and found an issue with the replacement using multiple calls to replace_all back to back.

replace_all(data, fixed("!@#"), ": ")
replace_all(data, fixed("@@"), "::")
replace_all(data, fixed("http//"), "http://")
replace_all(data, fixed("mailto@"), "mailto:")

The reason you are not seeing the expected output is that you are not assigning the result of the replace_all calls to anything. It should be:

data <- replace_all(data, fixed("!@#"), ": ")
data <- replace_all(data, fixed("@@"), "::")
data <- replace_all(data, fixed("http//"), "http://")
data <- replace_all(data, fixed("mailto@"), "mailto:")
data

Another way to do this without using stringr would be to create vectors that contain your patterns and replacements and loop through them, so that a single call performs all the replacements.

re  <- c('!@#', '@@', 'http//', 'mailto@')
val <- c(': ',  '::', 'http://', 'mailto:')

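# Apply each fixed-string replacement in turn and return the result.
# Note: gsub() coerces x with as.character(), so a data frame comes back
# as a character vector.
replace_all <- function(pattern, repl, x) {
    for (i in seq_along(pattern)) {
        x <- gsub(pattern[i], repl[i], x, fixed = TRUE)
    }
    x
}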
replace_all(re, val, data)

Output

[3] "http://www.accessgrid.org/"
[17] "http://mailto:accessgrid-tech@lists.sourceforge.net"   
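
Because the loop above passes the whole data frame to gsub, the result is a character vector. That works for the one-record example, but with several rows it is probably not what you want. A possible variant that keeps the data frame shape applies the replacements column by column. This is a sketch under assumptions: restore_all and the demo data frame are hypothetical names introduced here, not part of the original code.

```r
re  <- c("!@#", "@@", "http//", "mailto@")
val <- c(": ",  "::", "http://", "mailto:")

# Hypothetical helper: apply each fixed-string replacement to every
# character column and return a data frame of the same shape.
restore_all <- function(df, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  is_chr <- vapply(df, is.character, logical(1))
  for (i in seq_along(patterns)) {
    df[is_chr] <- lapply(df[is_chr], function(col) {
      gsub(patterns[i], replacements[i], col, fixed = TRUE)
    })
  }
  df
}

# Made-up two-row sample, only to show that rows and columns survive.
demo <- data.frame(
  url  = c("http//www.accessgrid.org/", "http//example.org/"),
  mail = c("mailto@accessgrid-tech@lists.sourceforge.net", "a!@#b"),
  stringsAsFactors = FALSE
)

restored <- restore_all(demo, re, val)
```

As with replace_all, the result has to be assigned (here to restored); calling restore_all without assignment leaves demo unchanged.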
Licensed under: CC-BY-SA with attribution