Question

I have a data frame (all_postcodes) containing more than 10 million records. [Edit] Here are a few sample records:

pcode    area  east    north   area2      area3      area4      area5
AB101AA  10    394251  806376  S92000003  S08000006  S12000033  S13002483
AB101AB  10    394232  806470  S92000003  S08000006  S12000033  S13002483
AB101AF  10    394181  806429  S92000003  S08000006  S12000033  S13002483
AB101AG  10    394251  806376  S92000003  S08000006  S12000033  S13002483

I want to create a new column containing a normalised version of the pcode column, using the following function:

pcode_normalize <- function (x) {
  x <- gsub("  ", " ", x)
  if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}

I tried to execute it as follows:

all_postcodes$npcode <- sapply(all_postcodes$pcode, pcode_normalize)

but it takes too long. Any suggestions on how to improve the performance?


Solution

All the functions you used in pcode_normalize are already vectorized, so there's no need to loop with sapply. It also looks like you're using strsplit just to check whether each value already contains a space; grepl would be faster for that.

Using fixed=TRUE in your calls to gsub and grepl will be faster, since you're not actually using regular expressions.
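
For example, on the sample postcodes a vectorised space test looks like this (the spaced value below is made up purely for illustration):

x <- c("AB101AA", "AB10 1AB")  # second value is a hypothetical already-spaced postcode
grepl(" ", x, fixed=TRUE)      # FALSE TRUE -- one logical per element, no loop needed
# the original per-element test, for comparison:
length(which(strsplit(x[1], "")[[1]] == " ")) == 0  # TRUE, i.e. no space in "AB101AA"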

pcode_normalize <- function (x) {
  x <- gsub("  ", " ", x, fixed=TRUE)  # collapse double spaces to single spaces
  sp <- grepl(" ", x, fixed=TRUE)      # which elements already contain a space
  x[!sp] <- paste(substr(x[!sp], 1, 4), substr(x[!sp], 5, 7))  # insert the space where missing
  x
}
all_postcodes$npcode <- pcode_normalize(all_postcodes$pcode)
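
Applied to the sample postcodes from the question, the result should look like this (output shown as comments; I haven't run it against the full table):

pcode_normalize(c("AB101AA", "AB101AB", "AB101AF", "AB101AG"))
# [1] "AB10 1AA" "AB10 1AB" "AB10 1AF" "AB10 1AG"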

I couldn't actually test this against the full data set, but it should get you on the right path.
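
If you want to measure the difference on your own machine, a rough timing sketch along these lines would do (pcode_normalize_old is a placeholder for your original sapply-based function, and the postcodes are synthetic):

x <- rep(c("AB101AA", "AB10 1AB"), 5e5)      # 1 million synthetic postcodes
system.time(sapply(x, pcode_normalize_old))  # per-element loop
system.time(pcode_normalize(x))              # single vectorised pass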

Licensed under: CC-BY-SA with attribution