We have sales data that comes from a touchscreen vending-style machine. When a customer puts money in the machine, it starts a session, counting those sessions using a sequence of numbers unique to that machine. Most of the time, the system starts and stops sessions when it should. However, ~7% of the time, it artificially starts a new session when there is still money left in the machine to be spent.
So,
session available.spend actual.spend
1 20 20
2 25 17
3 0 8
4 15 15
5 14 7
6 0 7
7 59 50
8 0 9
9 15 15
10 21 21
where available.spend is the sum of all the columns indicating that money or vouchers were inserted into the machine, and actual.spend is the sum of all the money spent during that session.
Most of the time the two are equal. However, in session 2, $25 was inserted and only $17 was spent. Session 3 shows no money available to be spent, but $8 actually spent, which balances out session 2 (17 + 8 = 25).
I'd like to have R combine those sessions and create an indicator column telling me the new session is a result of combining sessions.
How would I have R look to see if a session balanced, then, if it does not, check the next session to see if:
- there was no available.spend;
- there was actual.spend; and,
- the actual.spend from both sessions == the available.spend from the first session
Then, if (and only if) all three criteria are met, those two sessions are combined (using either session number or a new, made-up one), and a new column is set to 1 to show that the row is the result of combining sessions.
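To make the three checks concrete, here is a minimal sketch of how they could be written with data.table's shift(). It is only an illustration, not a tested solution for the full dataset: it assumes the rows are already in session order, that amounts compare exactly with == (whole dollars, or cents stored as integers), and the helper columns next.avail, next.actual and absorbed are names made up for this example.

```r
library(data.table)

# Made-up sample with the same values as the table above.
dt <- data.table(
  session         = 1:10,
  available.spend = c(20, 25, 0, 15, 14, 0, 59, 0, 15, 21),
  actual.spend    = c(20, 17, 8, 15, 7, 7, 50, 9, 15, 21)
)

# Values from the very next row (NA on the last row).
dt[, next.avail  := shift(available.spend, type = "lead")]
dt[, next.actual := shift(actual.spend,    type = "lead")]

# Flag a row only if it did not balance AND all three criteria hold.
dt[, newsess := as.integer(
  available.spend != actual.spend &               # session did not balance
  !is.na(next.avail) &                            # there is a next row
  next.avail == 0 &                               # next row: no available.spend
  next.actual > 0 &                               # next row: some actual.spend
  actual.spend + next.actual == available.spend   # the two spends add up
)]

# A row is absorbed when the row just before it was flagged.
dt[, absorbed := shift(newsess, type = "lag", fill = 0L) == 1L]

# Fold the next row's spend into the flagged row, then drop absorbed rows.
dt[newsess == 1L, actual.spend := actual.spend + next.actual]
result <- dt[absorbed == FALSE,
             .(session, available.spend, actual.spend, newsess)]
print(result)
```

The combined row keeps the first session's number here; using a made-up ID instead would only change the final selection step.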
Here's the dput() for my made-up sample (the .internal.selfref = <pointer: ...> entry that dput() prints is not valid R, so it is dropped and the table is rebuilt with as.data.table()):
mydt <- data.table::as.data.table(structure(list(session = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), available.spend = c(20,
25, 0, 15, 14, 0, 59, 0, 15, 21), actual.spend = c(20, 17, 8,
15, 7, 7, 50, 9, 15, 21)), .Names = c("session", "available.spend",
"actual.spend"), row.names = c(NA, -10L), class = "data.frame"))
Here's what I'd like the output to look like:
session available.spend actual.spend newsess
1 20 20 0
2 25 25 1
4 15 15 0
5 14 14 1
7 59 59 1
9 15 15 0
10 21 21 0
and the dput() (with the non-parseable .internal.selfref pointer dropped):
mynew.dt <- data.table::as.data.table(structure(list(session = c(1, 2, 4, 5, 7, 9, 10), available.spend = c(20,
25, 15, 14, 59, 15, 21), actual.spend = c(20, 25, 15, 14, 59,
15, 21), newsess = c(0, 1, 0, 1, 1, 0, 0)), .Names = c("session",
"available.spend", "actual.spend", "newsess"), row.names = c(NA,
-7L), class = "data.frame"))
I've been trying to find ways to do this in data.table (the dataset is very large) and/or with ifelse, but I can't figure out how to check three conditions and act only when all three are met, while also deleting the absorbed rows and creating a dummy variable column. Whew.
One more wrinkle: these session IDs can (though it happens rarely) occur on more than one day. So, the code would have to either look for the very next line in the data.frame or, if it looked for the session that comes next sequentially, it would need to make sure the dates on the two sessions matched.
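As a rough illustration of the date requirement, the sketch below takes the "next row" values within each date, so a session can only pair with a session from the same day. The date column and its values are hypothetical (the sample above has no date), the helper column names are made up, and the rows are assumed to be in chronological order already.

```r
library(data.table)

# Hypothetical data: `date` is an assumed column, not part of the real sample.
# Session 3 appears on two different days.
dt <- data.table(
  date            = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                              "2024-01-02", "2024-01-02")),
  session         = c(1, 2, 3, 3, 4),
  available.spend = c(25, 0, 14, 0, 10),
  actual.spend    = c(17, 8, 7, 7, 10)
)

# Lead values are computed per date, so the last session of a day sees NA
# and can never be matched with the first session of the following day.
dt[, `:=`(
  next.avail  = shift(available.spend, type = "lead"),
  next.actual = shift(actual.spend,    type = "lead")
), by = date]

# Same three criteria as before; the NA check rules out cross-day pairs.
dt[, newsess := as.integer(
  available.spend != actual.spend &
  !is.na(next.avail) &
  next.avail == 0 &
  next.actual > 0 &
  actual.spend + next.actual == available.spend
)]
print(dt)
```

In this made-up data, the third row (14 available, 7 spent) is followed by a row that would balance it, but that row falls on the next day, so only the first row is flagged.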
Thanks for any help.