There have been a number of duplicate questions (here and here) since this one was asked. I have collected (and improved) some of the more idiomatic answers and benchmarked them against my own Rcpp
implementation.
For simplicity, I have compared functions that take as input and return as output a character matrix, not a data frame containing only character variables. You can always coerce from one to the other with as.matrix
and as.data.frame
(see bottom for example).
Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void shift_na_in_place(CharacterMatrix x)
{
int m = x.nrow();
int n = x.ncol();
for (int i = 0, k = 0, k0 = 0; i < m; ++i) {
for (int j = 0; j < n; ++j) {
if (x[k] != NA_STRING) {
x[k0] = x[k];
k0 += m;
}
k += m;
}
while (k0 < k) {
x[k0] = NA_STRING;
k0 += m;
}
k = (k % m) + 1;
k0 = k;
}
if (x.attr("dimnames") != R_NilValue) {
List dn = x.attr("dimnames");
dn[1] = R_NilValue;
if (dn.attr("names") != R_NilValue) {
CharacterVector ndn = dn.attr("names");
ndn[1] = "";
}
}
}
// [[Rcpp::export]]
CharacterMatrix shift_na(CharacterMatrix x)
{
CharacterMatrix y = clone(x);
shift_na_in_place(y);
return y;
}
')
f1 <- function(x) {
t(apply(x, 1L, function(y) {r <- is.na(y); c(y[!r], y[r])}))
}
f2 <- function(x) {
t(apply(x, 1L, function(y) y[order(is.na(y), method = "radix")]))
}
f3 <- function(x) {
d <- dim(x)
dn <- dimnames(x)
matrix(x[order(row(x), is.na(x), method = "radix")],
nrow = d[1L], ncol = d[2L], byrow = TRUE,
dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
f4 <- function(x) {
d <- dim(x)
dn <- dimnames(x)
matrix(x[order(is.na(x) + (row(x) - 1L) * 2L + 1L, method = "radix")],
nrow = d[1L], ncol = d[2L], byrow = TRUE,
dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
set.seed(1L)
m <- 1e+05L
n <- 10L
x <- sample(c(letters, NA), size = m * n, replace = TRUE, prob = c(rep(1, 26), 13))
dim(x) <- c(m, n)
microbenchmark::microbenchmark(shift_na(x), f1(x), f2(x), f3(x), f4(x), check = "identical")
Unit: milliseconds
expr min lq mean median uq max neval
shift_na(x) 10.04959 10.32019 10.82935 10.41968 10.60104 22.69412 100
f1(x) 141.95959 150.83875 180.49025 167.01266 211.52478 248.07587 100
f2(x) 722.27211 759.75710 780.69368 773.26920 797.01253 857.07905 100
f3(x) 18.45201 19.15436 22.47760 21.59577 22.40543 66.47121 100
f4(x) 30.03168 31.62765 35.22960 33.92801 35.06384 85.92661 100
The dedicated Rcpp
implementation shift_na
is fastest, as you might expect, but f3
and f4
are not much slower. A few finer points:
f1
and f2
call apply
, which is built on an R for
loop, so it is not surprising that they are slow.
f3
and f4
have to allocate memory for is.na(x)
and row(x)
, which could be a hindrance for large enough x
.
f3
is faster than f4
because the "radix"
sort uses a faster algorithm when the range (maximum minus minimum) of the integer vector being sorted is less than 100000 (see ?sort
). Here, the ranges are:
is.na(x): 1
row(x): 99999
is.na(x) + (row(x) - 1L) * 2L + 1L: 199999
shift_na(x)
creates a copy of x
and modifies the copy in place. If you cannot or do not want to allocate memory for a copy because x
is very large, then you can do shift_na_in_place(x)
to modify x
in place.
shift_na_in_place
should be preferred over shift_na
if you have a data frame data
containing character variables, rather than a character matrix. In this situation, there is no need to preserve the intermediate as.matrix(data)
; it can be modified in place:
x <- as.matrix(data)
shift_na_in_place(x)
newdata <- as.data.frame(x)