Applying an aggregate function over multiple different slices

https://stackoverflow.com/questions/4998846

14-11-2019

Question

I have a data frame that contains some information about people and projects, like this:

person_id | project_id | action | time
--------------------------------------
        1 |          1 |      w |    1
        1 |          2 |      w |    2
        1 |          3 |      w |    2
        1 |          3 |      r |    3
        1 |          3 |      w |    4
        1 |          4 |      w |    4
        2 |          2 |      r |    2
        2 |          2 |      w |    3

I'd like to augment this data with two more fields, "first_time" and "first_time_project". The first gives the earliest time at which any action by that person was seen; the second gives the earliest time at which that person performed any action on that project. In the end, the data should look like this:

person_id | project_id | action | time | first_time | first_time_project
------------------------------------------------------------------------
        1 |          1 |      w |    1 |          1 |                  1
        1 |          2 |      w |    2 |          1 |                  2
        1 |          3 |      w |    2 |          1 |                  2
        1 |          3 |      r |    3 |          1 |                  2
        1 |          3 |      w |    4 |          1 |                  2
        1 |          4 |      w |    4 |          1 |                  4
        2 |          2 |      r |    2 |          2 |                  2
        2 |          2 |      w |    3 |          2 |                  2

My naive way of doing this is to write a couple of loops:

for (pid in unique(data$person_id)) {
    # earliest time for this person, across all projects
    data[data$person_id==pid, "first_time"] = min(data[data$person_id==pid, "time"])
    for (projid in unique(data[data$person_id==pid, "project_id"])) {
        # earliest time for this person on this project
        data[data$person_id==pid & data$project_id==projid, "first_time_project"] = min(data[data$person_id==pid & data$project_id==projid, "time"])
    }
}

Now, it doesn't take a genius to see that this is going to be glacially slow with the nested loops. However, I can't figure out a better way to handle this in R. I'm essentially emulating SQL's GROUP BY. I know that by() might be able to help, but I can't figure out how to group on multiple slices.

Any hints on how to take my code from glacially slow to something a bit faster? I'd be happy with a snail right now.

Solution

Try ave:

transform(data, 
   first_time = ave(time, person_id, FUN = min),
   first_time_project = ave(time, person_id, project_id, drop = TRUE, FUN = min)
)
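
To make the ave approach easy to try, here is a minimal sketch that rebuilds the sample data from the question and applies the same idea. It is an illustration rather than the answerer's exact code: the two grouping columns are combined explicitly with interaction(..., drop = TRUE) so that min is never called on an empty person/project combination, which is the same purpose drop = TRUE serves above.

```r
# Sample data from the question, rebuilt as a data frame
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)

# ave() returns a vector as long as its input, with every element replaced
# by the summary of its group, so no separate merge step is needed.
result <- transform(data,
  first_time = ave(time, person_id, FUN = min),
  # One combined grouping factor; drop = TRUE removes person/project
  # combinations that never occur in the data.
  first_time_project = ave(time,
                           interaction(person_id, project_id, drop = TRUE),
                           FUN = min)
)

print(result)
```

The two new columns should match the desired table in the question.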

Other tips

The combination of Hadley's plyr and transform() is powerful. If I understand your question correctly, then:

library(plyr)
# foo is the data frame from the question
foo <- ddply(foo, .(person_id), transform, first_time=min(time))
foo <- ddply(foo, .(person_id, project_id), transform, 
  first_time_project=min(time))

If speed is what you are looking for, then data.table is the way to go.

library(data.table)
DT <- data.table(foo)
DT[, first_time := min(time), by = person_id]
DT[, first_time_project := min(time), by = list(person_id, project_id)]

Quick and dirty solution with no loops:

library(plyr)

# function to get first time by any person/project
fp <- function(dat) 
{
  dat$first_time <- min(dat$time)
  ftp <- function(d) { d$first_time_project <- min(d$time); return (d) }
  dat <- ddply(dat, .(project_id), ftp)
  return (dat)
}

# this single call should give you the result you want
result <- ddply(data, .(person_id), fp)

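A quick way I can think of, using aggregate and match:

foo <- data.frame(
       person_id=rep(1:5,each=6),
       project_id=sample(1:5,30,TRUE),
       time=sample(1:30))

# minimum time per person
first_time <- aggregate(foo$time, list(foo$person_id), min)

foo$first_time <- first_time[match(foo$person_id, first_time[,1]), 2]

# minimum time per person/project pair
first_time_project <- aggregate(foo$time, list(foo$person_id, foo$project_id), min)

# look up each row's pair via a combined key
key <- paste(foo$person_id, foo$project_id)
foo$first_time_project <- first_time_project[match(key, paste(first_time_project[,1], first_time_project[,2])), 3]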

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow