Question

I want to establish a cohort of new users of drugs (Ray 2003). My original dataset is huge (approximately 19 million rows), so a loop is proving inefficient. Here is a dummy dataset (using fruits instead of drugs):

    df2

   names      dates age sex  fruit
1    tom 2010-02-01  60   m  apple
2   mary 2010-05-01  55   f orange
3    tom 2010-03-01  60   m banana
4   john 2010-07-01  57   m   kiwi
5   mary 2010-07-01  55   f  apple
6    tom 2010-06-01  60   m  apple
7   john 2010-09-01  57   m  apple
8   mary 2010-07-01  55   f orange
9   john 2010-11-01  57   m banana
10  mary 2010-09-01  55   f  apple
11   tom 2010-08-01  60   m   kiwi
12  mary 2010-11-01  55   f  apple
13  john 2010-12-01  57   m orange
14  john 2011-01-01  57   m  apple

I have identified people who were prescribed an apple between 04-2010 and 10-2010:

temp2

  names      dates age sex fruit
6   tom 2010-06-01  60   m apple
5  mary 2010-07-01  55   f apple
7  john 2010-09-01  57   m apple

I would like to make a new column in the original data frame called "index", which is the first date that a person was prescribed a drug in the defined date range. This is what I have tried to get the dates from temp2 into df2$index:

df2$index<-temp2$dates    
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)

I'm not doing this right, as none of these work. This is the desired output:

    df2

   names      dates age sex  fruit      index
1    tom 2010-02-01  60   m  apple       <NA>
2   mary 2010-05-01  55   f orange       <NA>
3    tom 2010-03-01  60   m banana       <NA>
4   john 2010-07-01  57   m   kiwi       <NA>
5   mary 2010-07-01  55   f  apple 2010-07-01
6    tom 2010-06-01  60   m  apple 2010-06-01
7   john 2010-09-01  57   m  apple 2010-09-01
8   mary 2010-07-01  55   f orange       <NA>
9   john 2010-11-01  57   m banana       <NA>
10  mary 2010-09-01  55   f  apple       <NA>
11   tom 2010-08-01  60   m   kiwi       <NA>
12  mary 2010-11-01  55   f  apple       <NA>
13  john 2010-12-01  57   m orange       <NA>
14  john 2011-01-01  57   m  apple       <NA>

Once I have the desired output, I want to trace back from the index date to see whether each person had an apple in the previous 180 days. If they did not have an apple, I want to keep them. If they did have an apple (e.g., tom), I want to discard them. This is the code I have tried on the desired output:

df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me

I would appreciate any guidance at all on these questions, even a pointer to what I should read to learn how to do this. Perhaps my logic is flawed and my method won't work; please tell me if that's the case! Thank you in advance.

Here is my df:

names<-c("tom", "mary", "tom", "john", "mary",
 "tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", 
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
 "2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01", 
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
 "apple", "apple", "apple", "orange", "banana", "apple",
 "kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
 "f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2

Here is temp2:

data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates<  "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ] 
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL

SOLUTION

df2 <- df2[with(df2, order(names, dates)), ]

## DWin's code: index date for each name/fruit pair in the pre-period
df2$first.date <- ave(df2$dates, df2$names, df2$fruit,
       FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1])

## TRUE for an apple row in the 180 days before the index date (e.g. tom, who is not a new user)
df2$x <- df2$fruit=='apple' & df2$dates>df2$first.date-180 & df2$dates<df2$first.date
## ids that have at least one TRUE row (which() skips rows where x is NA)
ids <- with(df2, unique(names[which(x)]))
## drop every row belonging to those ids
new_users <- subset(df2, !names %in% ids)

Solution

First order by name and date (df here is the question's df2):

df <- df[with(df, order(names, dates)), ]

Then just pick the first date within each name:

df$first.date <- ave(df$dates, df$names, FUN=function(dt) dt[1])
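As a small illustration of how ave behaves here, the sketch below uses toy vectors (person and dates are made up for the example, not taken from the question). FUN is written as an anonymous function because ave treats every unnamed argument after the first as a grouping variable, so an index cannot be passed through as an extra argument.

```r
# Toy data (hypothetical): two people, two dates each, deliberately unsorted
dates  <- as.Date(c("2010-03-01", "2010-02-01", "2010-07-01", "2010-05-01"))
person <- c("tom", "tom", "mary", "mary")

# Sort so that the first element within each person is the earliest date
o      <- order(person, dates)
dates  <- dates[o]
person <- person[o]

# ave() applies FUN to each group and writes the result back
# into the positions that group occupies, so the length is unchanged
first <- ave(dates, person, FUN = function(d) d[1])

data.frame(person, dates, first)
```

Every row ends up carrying the earliest date of its own group, which is what makes the result usable as a new column.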

Now that you have seen "the power of this fully operational Death Star", er, the ave function, you are ready to pick out the first date within individual 'names' and 'fruits' within that date range:

> df$first.date <- ave(df$dates, df$names, df$fruit,
+          FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
   names      dates age sex  fruit first.date
4   john 2010-07-01  57   m   kiwi 2010-07-01
7   john 2010-09-01  57   m  apple 2010-09-01
9   john 2010-11-01  57   m banana       <NA>
13  john 2010-12-01  57   m orange       <NA>
14  john 2011-01-01  57   m  apple 2010-09-01
2   mary 2010-05-01  55   f orange 2010-05-01
5   mary 2010-07-01  55   f  apple 2010-07-01
8   mary 2010-07-01  55   f orange 2010-05-01
10  mary 2010-09-01  55   f  apple 2010-07-01
12  mary 2010-11-01  55   f  apple 2010-07-01
1    tom 2010-02-01  60   m  apple 2010-06-01
3    tom 2010-03-01  60   m banana       <NA>
6    tom 2010-06-01  60   m  apple 2010-06-01
11   tom 2010-08-01  60   m   kiwi 2010-08-01
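The output above repeats first.date on every row of a name/fruit pair, whereas the question asked for an index column that is filled only on the index row, followed by a 180-day look-back. One possible way to get there in base R is sketched below. It assumes the window 2010-04-01 to 2010-10-31 used above and the strict 180-day comparison from the question's own solution; the helper names in_window, idx_rows, person_index and excluded are made up for the example, and age and sex are left out for brevity.

```r
# Rebuild the question's data (age and sex omitted for brevity)
df <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)
df <- df[order(df$names, df$dates), ]

# Apple rows inside the assumed window
in_window <- df$fruit == "apple" &
  df$dates >= as.Date("2010-04-01") & df$dates <= as.Date("2010-10-31")

# Row positions of the first in-window apple per person (data are date-sorted)
idx_rows <- which(in_window)
idx_rows <- idx_rows[!duplicated(df$names[idx_rows])]

# index is NA everywhere except on each person's index row
df$index <- as.Date(NA)
df$index[idx_rows] <- df$dates[idx_rows]

# Look up each person's index date for every one of their rows
person_index <- df$dates[idx_rows][match(df$names, df$names[idx_rows])]

# An apple in the 180 days before the index date disqualifies the person
prior_apple <- df$fruit == "apple" &
  df$dates < person_index & df$dates > person_index - 180
excluded  <- unique(df$names[which(prior_apple)])
new_users <- df[!df$names %in% excluded, ]
```

Everything here is vectorised, so no loop over people is needed; on the dummy data only tom is excluded.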

Other tips

Since you have 19 million rows, I think you should try a data.table solution. Here is my attempt. The result is slightly different from @Dwin's result, since I filter the data to the range (begin, end) and then create a new index variable, which is the minimum date occurring in this range for each (names, fruit) pair:

library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates, as.Date("2010-04-01"), as.Date("2010-10-31")),
   index := as.character(min(dates)),
   by=c('names','fruit')]
##     names      dates age sex  fruit      index
##  1:  john 2010-07-01  57   m   kiwi 2010-07-01
##  2:  john 2010-09-01  57   m  apple 2010-09-01
##  3:  john 2010-11-01  57   m banana         NA
##  4:  john 2010-12-01  57   m orange         NA
##  5:  john 2011-01-01  57   m  apple         NA
##  6:  mary 2010-05-01  55   f orange 2010-05-01
##  7:  mary 2010-07-01  55   f  apple 2010-07-01
##  8:  mary 2010-07-01  55   f orange 2010-05-01
##  9:  mary 2010-09-01  55   f  apple 2010-07-01
## 10:  mary 2010-11-01  55   f  apple         NA
## 11:   tom 2010-02-01  60   m  apple         NA
## 12:   tom 2010-03-01  60   m banana         NA
## 13:   tom 2010-06-01  60   m  apple 2010-06-01
## 14:   tom 2010-08-01  60   m   kiwi 2010-08-01
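The 180-day look-back from the question could be done in data.table along the same lines. This is only a sketch under some assumptions: the window is 2010-04-01 to 2010-10-31 as above, the index date is defined per person for apple only, it is kept as a Date rather than a character so that date arithmetic works, and the names begin, end, idx, prior and new_users are made up for the example.

```r
library(data.table)

# Rebuild the question's data (age and sex omitted for brevity)
DT <- data.table(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple")
)

begin <- as.Date("2010-04-01")
end   <- as.Date("2010-10-31")

# One row per person: earliest apple date inside the window
idx <- DT[fruit == "apple" & between(dates, begin, end),
          .(index = min(dates)), by = "names"]

# Update join: attach the person-level index date to every row of that person
DT[idx, index := i.index, on = "names"]

# People with another apple in the 180 days before their index date
prior <- DT[fruit == "apple" & dates < index & dates > index - 180,
            unique(names)]

# Keep people who have an index date and no prior apple
new_users <- DT[!is.na(index) & !names %in% prior]
```

The update join modifies DT by reference instead of copying it, which tends to matter at the 19-million-row scale mentioned in the question.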
Licensed under: CC-BY-SA with attribution