Recombining artificially split sessions in sales data based on multiple conditions

StackOverflow https://stackoverflow.com/questions/22869840

  •  27-06-2023
  •  | 
  •  

Question

We have sales data that comes from a touchscreen vending-style machine. When a customer puts money in the machine, it starts a session, counting those sessions using a sequence of numbers unique to that machine. Most of the time, the system starts and stops sessions when it should. However, ~7% of the time, it artificially starts a new session when there is still money left in the machine to be spent.

So,

session available.spend actual.spend
    1         20            20
    2         25            17
    3          0             8
    4         15            15
    5         14             7
    6          0             7
    7         59            50
    8          0             9
    9         15            15
    10        21            21

where available.spend is a sum of all the different columns indicating money or vouchers were inserted into the machine, and actual.spend is a sum of all the money spent during that session.

So, most of the time they equal one another. However, in session 2, $25 was inserted and only $17 was spent. Session 3 shows no money available to be spent, but $8 actually spent, which balances out the first session.

I'd like to have R combine those sessions and create an indicator column telling me the new session is a result of combining sessions.

How would I have R look to see if a session balanced, then, if it does not, check the next session to see if:

  1. there was no available.spend;
  2. there was actual.spend; and,
  3. the actual.spend from both sessions == the available.spend from the first session

Then, if (and only if) all three criteria are met, those two session are combined (using either session number or a new, made-up one), and a new column with a 1 showing that the new session is a result of combining other sessions.

Here's the dput() for my made-up sample:

mydt<-    structure(list(session = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), available.spend = c(20, 
      25, 0, 15, 14, 0, 59, 0, 15, 21), actual.spend = c(20, 17, 8, 
      15, 7, 7, 50, 9, 15, 21)), .Names = c("session", "available.spend", 
      "actual.spend"), row.names = c(NA, -10L), class = c("data.table", 
      "data.frame"), .internal.selfref = <pointer: 0x0000000000300788>)

Here's what I'd like the output to look like:

session available.spend actual.spend    newsess
    1          20           20             0
    2          25           25             1
    4          15           15             0
    5          14           14             1
    7          59           59             1
    9          15           15             0
    10         21           21             0

and the dput():

mynew.dt<- structure(list(session = c(1, 2, 4, 5, 7, 9, 10), available.spend = c(20, 
            25, 15, 14, 59, 15, 21), actual.spend = c(20, 25, 15, 14, 59, 
            15, 21), newsess = c(0, 1, 0, 1, 1, 0, 0)), .Names = c("session", 
            "available.spend", "actual.spend", "newsess"), row.names = c(NA, 
            -7L), class = c("data.table", "data.frame"), .internal.selfref = <pointer:                     0x0000000000300788>)

I've been trying to find ways to this in data.table (the dataset is very large) and/or with ifelse, but I can't figure out how to check three conditions and only perform an action if it meets all three, while also deleting the old columns and creating a dummy variable column. Whew

One more wrinkle: these session IDs can (though it happens rarely) occur on more than one day. So, the code would have to either look for the very next line in the data.frame or, if it looked for the session that comes next sequentially, it would need to make sure the dates on the two sessions matched.

Thanks for any help.

Was it helpful?

Solution

If your available.spend is always zero in these cases, you can use that to group the rows (I'm assuming you sometimes have more than one of these 0's in a row, otherwise you could trivially just take actual.spend, shift it by 1 and sum back to compare):

dt[, list(session = session[1],
          available.spend = sum(available.spend),
          actual.spend = sum(actual.spend)),
     by = cumsum(available.spend != 0)]
#   cumsum session available.spend actual.spend
#1:      1       1              20           20
#2:      2       2              25           25
#3:      3       4              15           15
#4:      4       5              14           14
#5:      5       7              59           59
#6:      6       9              15           15
#7:      7      10              21           21

From this point on you should have all the info you need to proceed.

Perhaps, more generally, it would be better to group by cumsum(available.spend >= actual.spend).

OTHER TIPS

So not quite the same result data.frame you want. I am using cumsum (Cumulative Sum) on available spend and actual spend. Then I check which ones are match up, and only for those that matchup I put "1" in the new.session column.

mydt$spend.sum <-cumsum(mydt$actual.spend) #Cumulative sum of actual 
mydt$avail.sum <-cumsum(mydt$available.spend) #Cumulative sum of actual 

now make a new column and make it all NA's

mydt$new.session <-NA

Check which cumulative sums match up and replace NA's with 1's

mydt$new.session[with(mydt, which(spend.sum == avail.sum))]<-1

If you only want data.frame with the 1's in the new.session column

do this

mydt[complete.cases(my.dt),]

This is kind of a clunky solution, but given the narrow parameters and desired outcome, I can't think of a better way to do this except piece-by-piece.

mismatches <- mydt[available.spend != actual.spend, which=TRUE]
zeros <- mydt[available.spend == 0, which=TRUE]
x <- setdiff(mismatches, zeros)
followcheck <- mydt[x+1, session == mydt[zeros, session] & actual.spend > 0]
following.zeros <- zeros[followcheck]
sumthing <- mydt[x, available.spend==actual.spend + mydt[following.zeros, actual.spend]]
x <- x[sumthing]
y <- x + 1
mydt[x, actual.spend:=actual.spend + mydt[y, actual.spend]]
# Caution here, data.table gave a warning about needing to copy the table in memory to do this next line.
mydt[, newsess:=0]
mydt[x, newsess:=1]
mydt <- mydt[-y,]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top