문제

Say I have a lookup table as following

dt <- data.frame(name=c("jack","jill","sam","dan"),age=c(20,14,28,13))

  name age
1 jack  20
2 jill  14
3  sam  28
4  dan  13

Now I want to convert the following vectors to the vectors containing the ages of each element.

query1 <- c("jack","dan")
query2 <- c("sam")
query3 <- c("jack","sam", "dan")

I can build the following function(which I don't like) to accomplish this task,

get.age <- function(x) {

  answer <- list()
  for(i in 1:length(x)){
    answer[[i]]<- dt[dt$name==x[i],"age"]
  }
  ldply(answer)$V1
}

which gets the job done like this

> get.age(query1)
[1] 20 13
> get.age(query2)
[1] 28
> get.age(query3)
[1] 20 28 13

But I don't like the solution because it uses a for loop and some dirty hack. Ideally, I would want to do it more R-like using vector operations like this(which doesn't seem to work)

> dt[dt$name==c("jack","dan"),"age"]
[1] 20 13 #worked
> dt[dt$name==c("jack","sam"),"age"]
[1] 20  # not the right answer

The following solution works, but this requires the prior knowledge of how many things I lookup.

dt[dt$name=="jack" | dt$name=="sam","age"]

[1] 20 28

I would like to know a method that can handle arbitrary size of vectors that converts the keys to elements without using for loop, if there is any

도움이 되었습니까?

해결책

For a true lookup table, the result should be the length of the query and also deal with replication in the query. The approaches using match(...) are the only ones that do this:

query4 <- c("jack","sam", "dan","sam","jack")
dt[match(query4,dt$name),]$age
# [1] 20 28 13 28 20

This is because match(LHS,RHS) returns an integer vector of length(LHS) which contains the row numbers of the RHS which match the corresponding element of LHS.

The approaches based on comparison (==) will generally not work. This s because, when comparing two vectors, R tries to replicate the shorter one however many times are needed to make it the same length as the longer one, and then does an element-by-element comparison. So in the case of dt$name==query1, for example, the RHS gets replicated twice and the comparison is between c("jack","jill","sam","dan") and c("jack","dan","jack","dan").

dt$name==query1   # RHS is: c("jack","dan","jack","dan")
# [1]  TRUE FALSE FALSE  TRUE
dt$name==query2   # RHS is: c("sam","sam","sam","sam")
# [1] FALSE FALSE  TRUE FALSE
dt$name==query3   # RHS is: c("jack","sam", "dan","jack") with warning
# [1]  TRUE FALSE FALSE FALSE
# with warning:   longer object length is not a multiple of shorter object length

On the other hand, using LHS %in% RHS gives a result with length(LHS) and either T or F depending on whether that element is present in RHS.

dt$name %in% query1
# [1]  TRUE FALSE FALSE  TRUE
query1 %in% dt$name
# [1] TRUE TRUE

Note that it looks like df$name %in% query1 and df$name==query1 give the same result, but that's an artifact of query1 being replicated twice in the latter comparison. See, for example:

dt$name %in% query3
# [1]  TRUE FALSE  TRUE  TRUE
dt$name  ==  query3
# [1]  TRUE FALSE FALSE FALSE

다른 팁

You want %in%, it returns of logical vector that is used to subset the data frame

dt[dt$name %in% query3,"age"]

There are a lot of ways to do this, but I'll throw out one that I find useful. match(). @jlhoward's answer goes into more detail and explains why my previous == examples were wrong.

> match(query1, dt$name) #these give us the index of the *first* matching value
[1] 1 4
> match(query2, dt$name)
[1] 3

> dt$age[match(query1, dt$name)]
[1] 20 13
> dt$age[match(query2, dt$name)]
[1] 28

You can also use %in% unlike match this returns TRUE and FALSE for the elements that exist in the comparison (be sure to get the order right for, dt$name %in% query1 returns TRUE FALSE FALSE TRUE, query1 %in% dt$name returns TRUE TRUE)

> dt[dt$name %in% query1, ][,'age',]
[1] 20 13

With dplyr you can use filter

> require(dplyr)
> filter(dt, name %in% query1)
  name age
1 jack  20
2 dan  13
> filter(dt, name %in% query1)$age
[1] 20 13
라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top