Better way to conditionally fill a column in one df based on multiple values in a row in another df in R

https://stackoverflow.com/questions/21688323

09-10-2022
|

質問

I have two dataframes that I would like to relate.

The first dataframe (LOC) has about two hundred animals studied over 5 years, so about 100000 point locations. I need to know what pack each animal was in during the study duration, but unfortunately, animals change packs every so often. I need each point location to have a pack name associated with it since I will base further analyses on the pack, not the individual.

The second dataframe (PACK) has pack affiliations for these three hundred animals, and each row represents an entry day and exit day into a pack. My real data is a table with about 700 rows, and each row represents how long that animal stayed in the pack. For example in the PACK dataframe below, animal "W1" was in SunnyLake in the first day of year "A", then on day 200 left and went to RainyLake and stayed there until the end of year "B" (a leap year).

Since I have so many animals and so many switches among years, I want to know a way to tell R which days the animal belonged to SunnyLake and RainyLake in the LOC table.

So far my approach has been the following, but it is slow. I'm pretty new to R, so I thought there must be a faster, more elegant way to do this. If you see a better solution, please let me know (and keep in mind I just started working in R a few months ago)!

My example data:

Location table

LOC <- data.frame(matrix(NA, nrow = 8, ncol = 4))
colnames(LOC) <- c("ID", "Yr", "Dy", "Pack")
LOC$ID <- "W1"
LOC$Yr <- rep(c("A", "B"), each = 4)
LOC$Dy <- c(4, 200, 300, 335, 3, 100, 150, 350)
LOC

  ID Yr  Dy Pack
1 W1  A   4   NA
2 W1  A 200   NA
3 W1  A 300   NA
4 W1  A 335   NA
5 W1  B   3   NA
6 W1  B 100   NA
7 W1  B 150   NA
8 W1  B 350   NA

Pack Table

PACK <- data.frame(matrix(nrow = 3, ncol = 5))
colnames(PACK) <- c("ID", "Pack", "EnterDay", "ExitDay", "Yr")
PACK$ID <- "W1"
PACK$Pack <- c("SunnyLake", "RainyLake", "RainyLake")
PACK$EnterDay <- c(1, 201, 1)
PACK$ExitDay <- c(200, 365, 366)
PACK$Yr <- c("A", "A", "B")
PACK

  ID      Pack EnterDay ExitDay Yr
1 W1 SunnyLake        1     200  A
2 W1 RainyLake      201     365  A
3 W1 RainyLake        1     366  B

My way of steamrolling through each row in PACK and filling the "Pack" column in LOC

for (i in 1:nrow(PACK)){
  cat("LOC$Pack[LOC$ID == \"", as.character(PACK$ID[i]),"\" & LOC$Yr == \"", as.character(PACK$Yr[i]),"\" & LOC$Dy >= ", PACK$EnterDay[i], " & LOC$Dy <= ", PACK$ExitDay[i],"] <- \"", as.character(PACK$Pack[i]),"\"\n", sep="")  
}

That command prints the following, which I paste back into the console and run.

LOC$Pack[LOC$ID == "W1" & LOC$Yr == "A" & LOC$Dy >= 1 & LOC$Dy <= 200] <- "SunnyLake"
LOC$Pack[LOC$ID == "W1" & LOC$Yr == "A" & LOC$Dy >= 201 & LOC$Dy <= 365] <- "RainyLake"
LOC$Pack[LOC$ID == "W1" & LOC$Yr == "B" & LOC$Dy >= 1 & LOC$Dy <= 366] <- "RainyLake"

The resulting LOC table looks like this:

LOC
  ID Yr  Dy      Pack
1 W1  A   4 SunnyLake
2 W1  A 200 SunnyLake
3 W1  A 300 RainyLake
4 W1  A 335 RainyLake
5 W1  B   3 RainyLake
6 W1  B 100 RainyLake
7 W1  B 150 RainyLake
8 W1  B 350 RainyLake

解決

1) sqldf Try this:

library(sqldf)
sqldf("select L.ID, L.Yr, L.Dy, P.Pack from LOC L left join PACK P
       on L.Yr = P.Yr and L.ID = P.ID and L.Dy between P.EnterDay and P.ExitDay")

giving:

  ID Yr  Dy      Pack
1 W1  A   4 SunnyLake
2 W1  A 200 SunnyLake
3 W1  A 300 RainyLake
4 W1  A 335 RainyLake
5 W1  B   3 RainyLake
6 W1  B 100 RainyLake
7 W1  B 150 RainyLake
8 W1  B 350 RainyLake

2) dplyr or

library(dplyr)

left_join(LOC, PACK, by = c("ID", "Yr")) %.% 
    filter((Dy >= EnterDay & Dy <= ExitDay) | is.na(Pack.y)) %.% 
    select(ID:Dy, Pack.y)

giving

  ID Yr  Dy    Pack.y
1 W1  A   4 SunnyLake
2 W1  A 200 SunnyLake
3 W1  A 300 RainyLake
4 W1  A 335 RainyLake
5 W1  B   3 RainyLake
6 W1  B 100 RainyLake
7 W1  B 150 RainyLake
8 W1  B 350 RainyLake

ADDED: Added second solution and improved both. Fixed dplyr solution to not drop LOC rows witih no PACK row.

他のヒント

First, don't create the Pack column in LOC at the beginning; it's not necessary.

LOC <- data.frame(matrix(NA, nrow = 8, ncol = 3))
colnames(LOC) <- c("ID", "Yr", "Dy")  # NOTE: No Pack column
LOC$ID <- "W1"
LOC$Yr <- rep(c("A", "B"), each = 4)
LOC$Dy <- c(4, 200, 300, 335, 3, 100, 150, 350)

Here's a method that uses data tables, which is likely to be much faster with large datasets.

library(data.table)
LOC      <- data.table(LOC,   key="ID,Yr")
PACK     <- data.table(PACK, key="ID,Yr")
LOC$Pack <-LOC[PACK,all=T][Dy>=EnterDay & Dy<=ExitDay,Pack]
LOC
#   ID Yr  Dy      Pack
# 1 W1  A   4 SunnyLake
# 2 W1  A 200 SunnyLake
# 3 W1  A 300 RainyLake
# 4 W1  A 335 RainyLake
# 5 W1  B   3 RainyLake
# 6 W1  B 100 RainyLake
# 7 W1  B 150 RainyLake
# 8 W1  B 350 RainyLake

And here's a method that uses data frames (and does not require sqldf).

M <- merge(LOC,PACK,by=c("ID","Yr"))
is.between <- function(x,low,hi)return(x>=low & x<=hi)
LOC$Pack <- with(M,M[is.between(Dy,EnterDay, ExitDay),]$Pack)

ライセンス： CC-BY-SA と帰属

所属していません StackOverflow