Question

I am having difficulty with subset calculations. I can get overall stats per customer (a factor), such as the average purchase, using ave, tapply or ddply, but I cannot work out how to calculate visit-by-visit stats for each customer. Some simplified data below illustrates my data and the results I want.

Current Dataframe: (Note that visit #1 is the most recent visit)

customer  visit  date        purchase_amt
sarah     2      2013-08-09  5
sarah     3      2013-07-21  8
sarah     4      2013-06-23  9
sarah     5      2013-06-02  1
sarah     1      2013-08-20  8
henry     1      2013-07-04  4
che       1      2013-08-27  2
che       2      2013-07-27  1
che       3      2013-07-05  8
che       4      2013-06-14  3
dt        3      2013-04-05  9
dt        2      2013-06-07  1
dt        1      2013-07-11  6

These are the results I am seeking:

customer  visit  date        purchase_amt  days_since  amt_diff
sarah     2      2013-08-09  5             19          -3
sarah     3      2013-07-21  8             28          -1
sarah     4      2013-06-23  9             21           8
sarah     5      2013-06-02  1             NA          NA
sarah     1      2013-08-20  8             11           3
henry     1      2013-07-04  4             NA          NA
che       1      2013-08-27  2             31           1
che       2      2013-07-27  1             22          -7
che       3      2013-07-05  8             21           5
che       4      2013-06-14  3             NA          NA
dt        3      2013-04-05  9             NA          NA
dt        2      2013-06-07  1             63          -8
dt        1      2013-07-11  6             34           5

So in summary, I would like to take a customer's most recent visit and its attributes, then the visit before it, calculate various stats on the two, and repeat this back through the earlier visits. Return NA when there is no previous visit.


Solution 2

This solution uses only base R and retains the original row order of the input:

# Sort, calculate differences and unsort.
# r is row indexes to use, order.by is ordering vector, col is vector to difference

diffs <- function(r, order.by, col) {
    order.by <- order.by[r]
    col <- col[r]
    o <- order(order.by)
    replace(r, o, c(NA, diff(col[o])))
}

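# curry() fixes every argument of fun after the first,
# returning a function of the first argument (r) alone

curry <- function (fun, ...) function(r) fun(r, ...)

# assumes DF$date is of class Date, e.g. after DF$date <- as.Date(DF$date)
ix <- seq_len(nrow(DF))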
transform(DF, 
    days_since = ave(ix, customer, FUN = curry(diffs, date, date)),
    amt_diff = ave(ix, customer, FUN = curry(diffs, date, purchase_amt))
)

The result is:

   customer visit       date purchase_amt days_since amt_diff
1     sarah     2 2013-08-09            5         19       -3
2     sarah     3 2013-07-21            8         28       -1
3     sarah     4 2013-06-23            9         21        8
4     sarah     5 2013-06-02            1         NA       NA
5     sarah     1 2013-08-20            8         11        3
6     henry     1 2013-07-04            4         NA       NA
7       che     1 2013-08-27            2         31        1
8       che     2 2013-07-27            1         22       -7
9       che     3 2013-07-05            8         21        5
10      che     4 2013-06-14            3         NA       NA
11       dt     3 2013-04-05            9         NA       NA
12       dt     2 2013-06-07            1         63       -8
13       dt     1 2013-07-11            6         34        5

UPDATE: minor improvements to code.
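
To make the sort, difference, unsort step inside diffs concrete, here is a minimal sketch for a single customer, using sarah's five rows from the question. The variable names (dates, amt, o, d, res) are illustrative only and are not part of the answer's code.

```r
# sarah's rows, in the order they appear in the question
dates <- as.Date(c("2013-08-09", "2013-07-21", "2013-06-23",
                   "2013-06-02", "2013-08-20"))
amt <- c(5, 8, 9, 1, 8)

o <- order(dates)          # row positions from oldest to newest visit
d <- c(NA, diff(amt[o]))   # chronological differences, NA for the first visit

# write each difference back to the position its row had in the input
res <- replace(rep(NA_real_, length(amt)), o, d)

res
# [1] -3 -1  8 NA  3
```

The values match the amt_diff column for sarah in the result above. ave() then applies the same idea to each customer's block of row indexes.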

OTHER TIPS

Something like this? Assuming your data is called df:

library(plyr)

# convert dates to class 'Date'
df$date <- as.Date(df$date)

# order by customer and date
df <- df[order(df$customer, df$date), ]
# or since plyr is loaded anyway:
df <- arrange(df, customer, date) 

# per customer, calculate differences in date and purchase, between consecutive visits
# pad differences with a leading NA
df2 <- ddply(.data = df, .variables = .(customer), mutate,
      days_since = c(NA, diff(date)),
      amt_diff = c(NA, diff(purchase_amt)))

df2
# customer visit       date purchase_amt days_since amt_diff
# 1       che     4 2013-06-14            3         NA       NA
# 2       che     3 2013-07-05            8         21        5
# 3       che     2 2013-07-27            1         22       -7
# 4       che     1 2013-08-27            2         31        1
# 5        dt     3 2013-04-05            9         NA       NA
# 6        dt     2 2013-06-07            1         63       -8
# 7        dt     1 2013-07-11            6         34        5
# 8     henry     1 2013-07-04            4         NA       NA
# 9     sarah     5 2013-06-02            1         NA       NA
# 10    sarah     4 2013-06-23            9         21        8
# 11    sarah     3 2013-07-21            8         28       -1
# 12    sarah     2 2013-08-09            5         19       -3
# 13    sarah     1 2013-08-20            8         11        3
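
Note that this result is sorted by customer and date, whereas the output asked for in the question keeps the input row order. One possible way to get back to the input order is to remember the row positions before sorting. The sketch below is an assumption-laden illustration, not the answer's code: row_id and lagdiff are hypothetical helper names, the data frame is rebuilt from the question's table, and base R's ave stands in for the ddply call so that the snippet runs without plyr.

```r
dat <- data.frame(
  customer = c(rep("sarah", 5), "henry", rep("che", 4), rep("dt", 3)),
  visit = c(2, 3, 4, 5, 1, 1, 1, 2, 3, 4, 3, 2, 1),
  date = as.Date(c("2013-08-09", "2013-07-21", "2013-06-23", "2013-06-02",
                   "2013-08-20", "2013-07-04", "2013-08-27", "2013-07-27",
                   "2013-07-05", "2013-06-14", "2013-04-05", "2013-06-07",
                   "2013-07-11")),
  purchase_amt = c(5, 8, 9, 1, 8, 4, 2, 1, 8, 3, 9, 1, 6),
  stringsAsFactors = FALSE
)

dat$row_id <- seq_len(nrow(dat))             # hypothetical helper: input order
dat <- dat[order(dat$customer, dat$date), ]  # same ordering step as above

# leading-NA differences within each customer (stand-in for ddply + mutate)
lagdiff <- function(x) c(NA, diff(x))
dat$days_since <- ave(as.numeric(dat$date), dat$customer, FUN = lagdiff)
dat$amt_diff <- ave(dat$purchase_amt, dat$customer, FUN = lagdiff)

dat <- dat[order(dat$row_id), ]              # back to the input row order
dat$row_id <- NULL

dat$days_since
# [1] 19 28 21 NA 11 NA 31 22 21 NA NA 63 34
```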

Here is the data.table solution in line with @Henrik:

    df<-structure(list(customer = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("che", "dt", "henry", 
"sarah"), class = "factor"), visit = c(2L, 3L, 4L, 5L, 1L, 1L, 
1L, 2L, 3L, 4L, 3L, 2L, 1L), date = structure(c(15926, 15907, 
15879, 15858, 15937, 15890, 15944, 15913, 15891, 15870, 15800, 
15863, 15897), class = "Date"), purchase_amt = c(5L, 8L, 9L, 
1L, 8L, 4L, 2L, 1L, 8L, 3L, 9L, 1L, 6L)), .Names = c("customer", 
"visit", "date", "purchase_amt"), row.names = c(NA, -13L), class =  
"data.frame")

library(data.table)
df <- data.table(df)
# sort by date within customer first, so that diff() runs in chronological order
df[order(customer, date),
   list(visit = visit, date = date, purchase_amt = purchase_amt,
        days_since = c(NA, diff(date)),
        amt_diff = c(NA, diff(purchase_amt))),
   keyby = "customer"]
    customer visit       date purchase_amt days_since amt_diff
 1:      che     4 2013-06-14            3         NA       NA
 2:      che     3 2013-07-05            8         21        5
 3:      che     2 2013-07-27            1         22       -7
 4:      che     1 2013-08-27            2         31        1
 5:       dt     3 2013-04-05            9         NA       NA
 6:       dt     2 2013-06-07            1         63       -8
 7:       dt     1 2013-07-11            6         34        5
 8:    henry     1 2013-07-04            4         NA       NA
 9:    sarah     5 2013-06-02            1         NA       NA
10:    sarah     4 2013-06-23            9         21        8
11:    sarah     3 2013-07-21            8         28       -1
12:    sarah     2 2013-08-09            5         19       -3
13:    sarah     1 2013-08-20            8         11        3
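
The grouped call above returns a new table ordered by customer. If the input row order should be kept, as in the question's desired output, one option may be to add the columns by reference with `:=` while passing order(date) in i. This is only a sketch under assumptions: it needs the data.table package installed, the table name visits is hypothetical, and the data is rebuilt from the question's table rather than taken from the dput output above.

```r
library(data.table)

visits <- data.table(
  customer = c(rep("sarah", 5), "henry", rep("che", 4), rep("dt", 3)),
  visit = c(2, 3, 4, 5, 1, 1, 1, 2, 3, 4, 3, 2, 1),
  date = as.Date(c("2013-08-09", "2013-07-21", "2013-06-23", "2013-06-02",
                   "2013-08-20", "2013-07-04", "2013-08-27", "2013-07-27",
                   "2013-07-05", "2013-06-14", "2013-04-05", "2013-06-07",
                   "2013-07-11")),
  purchase_amt = c(5, 8, 9, 1, 8, 4, 2, 1, 8, 3, 9, 1, 6)
)

# Rows are visited in date order within each customer, but the new columns
# are assigned back to the rows where they already sit, so nothing is re-sorted.
visits[order(date),
       `:=`(days_since = c(NA, as.numeric(diff(date))),
            amt_diff = c(NA, diff(purchase_amt))),
       by = customer]

visits
```

Because `:=` modifies visits in place, no list of pass-through columns (visit, date, purchase_amt) is needed.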
Licensed under: CC-BY-SA with attribution