Question

Hello I have a dataset with multiple patients, each with multiple observations.
I want to select the earliest observation for each patient.

Example: 

Patient ID    Tender    Swollen    pt_visit
101             1         10          6
101             6         12          12
101             4         3           18
102             9         5           18
102             3         6           24
103             5         2           12
103             2         1           18
103             8         0           24

The pt_visit variable is the number of months the patient was in the study at the time of the observation. What I need is the first observation from each patient based on the lowest number of months in the pt_visit column. However I need the earliest observation for each patient ID.

My desired results:

Patient ID    Tender    Swollen    pt_visit
101             1         10          6
102             9         5           18
103             5         2           12
Was it helpful?

Solution

Assuming your data frame is called df, use the ddply function in the plyr package:

require(plyr)
firstObs <- ddply(df, "PatientID", function(x) x[x$pt_visit == min(x$pt_visit), ])

OTHER TIPS

I would use the data.table package:

Data <- data.table(Data)
setkey(Data, Patient_ID, pt_visit)
Data[,.SD[1], by=Patient_ID]

Assuming that the Patient ID column is actually named Patient_ID, here are a few approaches. DF is assumed to be the name of the input data frame:

sqldf

library(sqldf)

sqldf("select Patient_ID, Tender, Swollen, min(pt_visit) pt_visit 
   from DF 
   group by Patient_ID")

or

sqldf("select *, min(pt_visit) pt_visit from DF group by Patient_ID")[-ncol(DF)]

Note: The above two alternatives use an extension to SQL only found in SQLite so be sure you are using the SQLite backend. (SQLite is the default backend for sqldf unless RH2, RProgreSQL or RMYSQL is loaded.)

subset/ave

subset(DF, ave(pt_visit, Patient_ID, FUN = rank) == 1)

Note: This makes use of the fact that there are no duplicate pt_visit values within the same Patient_ID. If there were we would need to specify the ties= argument to rank.

I almost think they should be a subset parameter named "by" that would do the same as it does in data.table. This is a base-solution:

do.call(rbind,  lapply( split(dfr, dfr$PatientID), 
                  function(x) x[which.min(x$pt_visit),] ) )

    PatientID Tender Swollen pt_visit
101       101      1      10        6
102       102      9       5       18
103       103      5       2       12

I guess you can see why @hadley built 'plyr'.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top