Question

Suppose I have the vector vec <- c("D","B","B","C","C").

My objective is to end up with a list of length length(unique(vec)), where element i is the vector of indices at which unique(vec)[i] occurs in vec.

For example, for vec this list would be:

exampleList <- list()
exampleList[[1]] <- c(1) #Since "D" is the first element
exampleList[[2]] <- c(2,3) #Since "B" is the 2nd/3rd element.
exampleList[[3]] <- c(4,5) #Since "C" is the 4th/5th element.

I tried the following approach, but it's too slow. My real data is large, so I need faster code:

vec <- c("D","B","B","C","C")
uniques <- unique(vec)
exampleList <- lapply(seq_along(uniques), function(i) {
    which(vec==uniques[i])
})
exampleList

Solution

Update: DT[, list(list(.)), by=.] sometimes gave wrong results in R versions >= 3.1.0. This is now fixed in commit #1280 in the current development version of data.table, v1.9.3. From NEWS:

  • DT[, list(list(.)), by=.] returns correct results in R >=3.1.0 as well. The bug was due to recent (welcoming) changes in R v3.1.0 where list(.) does not result in a copy. Closes #481.

Using data.table is about 15x faster than tapply:

library(data.table)

vec <- c("D","B","B","C","C")

dt = as.data.table(vec)[, list(list(.I)), by = vec]
dt
#   vec  V1
#1:   D   1
#2:   B 2,3
#3:   C 4,5

# to get it in the desired format
# (perhaps in the future data.table's setnames will work for lists instead)
setattr(dt$V1, 'names', dt$vec)
dt$V1
#$D
#[1] 1
#
#$B
#[1] 2 3
#
#$C
#[1] 4 5

Speed tests:

vec = sample(letters, 1e7, T)

system.time(tapply(seq_along(vec), vec, identity)[unique(vec)])
#   user  system elapsed 
#   7.92    0.35    8.50 

system.time({dt = as.data.table(vec)[, list(list(.I)), by = vec]; setattr(dt$V1, 'names', dt$vec); dt$V1})
#   user  system elapsed 
#   0.39    0.09    0.49 

Other tips

split(seq_along(vec), vec)

This is faster and shorter than the tapply solution:

vec = sample(letters, 1e7, T)
system.time(res1 <- tapply(seq_along(vec), vec, identity)[unique(vec)])
#   user  system elapsed 
#  1.808   0.364   2.176 
system.time(res2 <- split(seq_along(vec), vec))
#   user  system elapsed 
#  0.876   0.152   1.029 
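
One caveat worth noting: split() groups by factor levels, which are sorted by default, so for the small example the groups come back as B, C, D rather than D, B, C. If order of first appearance matters, a minimal sketch (the names by_sorted and by_first are only illustrative) is to pass a factor whose levels are set to unique(vec):

```r
vec <- c("D", "B", "B", "C", "C")

# Default: groups follow the sorted factor levels
by_sorted <- split(seq_along(vec), vec)
names(by_sorted)
# [1] "B" "C" "D"

# Levels set explicitly to the order of first appearance
by_first <- split(seq_along(vec), factor(vec, levels = unique(vec)))
names(by_first)
# [1] "D" "B" "C"
```

Indexing the default result with [unique(vec)], as the tapply answer does, should give the same ordering.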

You can do this with tapply:

vec <- c("D", "B", "B", "C", "C")
tapply(seq_along(vec), vec, identity)[unique(vec)]
# $D
# [1] 1
# 
# $B
# [1] 2 3
# 
# $C
# [1] 4 5

The identity function returns its argument unchanged, so each group simply keeps its indices. Indexing by unique(vec) puts the groups in the order in which their values first appear in the original vector.
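
As a quick sanity check, here is a small sketch comparing this result with the which()-based approach from the question. The variable names grouped, ordered, reference, and same are only illustrative:

```r
vec <- c("D", "B", "B", "C", "C")
uniques <- unique(vec)

# Without the reordering step, tapply returns groups in sorted order
grouped <- tapply(seq_along(vec), vec, identity)
names(grouped)
# [1] "B" "C" "D"

# Reorder the groups by first appearance
ordered <- grouped[uniques]

# Reference result: compare vec against each unique value in turn
reference <- lapply(uniques, function(u) which(vec == u))

# Check element by element that both approaches give the same indices
same <- vapply(seq_along(reference),
               function(i) identical(ordered[[i]], reference[[i]]),
               logical(1))
all(same)
# [1] TRUE
```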

To preserve the original order when using josilber's tapply approach, simply index the result by the uniques vector you created:

vec <- c("D","B","B","C","C")

uniques <- unique(vec)

tapply(seq_along(vec), vec, identity)[uniques]

# $D
# [1] 1
#
# $B
# [1] 2 3
#
# $C
# [1] 4 5
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow