I am struggling with merging two data.tables (they could just as well be data.frames) on a timestamp (POSIXct) that does not match exactly between the two tables.
For a given timestamp in table A, I'd like R to return the most recent entry in table B that occurs before that time.
An example:
I have table A that contains data about activities at a certain point in time.
EDIT: This is different data from the original post and better reflects the problem: I need to 'look up' based on the timestamp and a grouping variable, which I call Station ID. Apologies for not being clear in the first place.
Start.Time Start.Station.ID
1: 2014-04-06 18:24:32 238
2: 2014-04-06 18:20:30 238
3: 2014-04-06 01:04:13 373
4: 2014-04-06 01:03:36 373
5: 2014-04-06 01:03:37 373
6: 2014-04-06 01:03:01 373
7: 2014-04-06 01:02:42 373
8: 2014-04-06 01:02:31 373
I want to add a column to table A that indicates the status of that station at that point in time in terms of 'availability'. These statuses can be found in table B.
status_dt station_id availability
1: 2014-04-06 00:29:02 238 0.9354839
2: 2014-04-06 00:29:02 373 1.0000000
3: 2014-04-06 01:29:03 238 1.0000000
4: 2014-04-06 01:29:03 373 0.6111111
5: 2014-04-06 02:59:03 238 0.9354839
6: 2014-04-06 02:59:03 373 0.6666667
...
41: 2014-04-06 17:59:03 238 0.8387097
42: 2014-04-06 17:59:03 373 0.4444444
43: 2014-04-06 18:59:03 238 0.9032258
44: 2014-04-06 18:59:03 373 0.5000000
45: 2014-04-06 20:29:03 238 0.7741935
status_dt station_id availability
The timestamps do not match, so I'd like to add to table A the status from the last observation in table B prior to the timestamp in table A.
The expected result, with the added column 'availability', would be for example:
status_dt station_id availability
1: 2014-04-06 18:24:32 238 0.8387097
2: 2014-04-06 18:20:30 238 0.8387097
3: 2014-04-06 01:04:13 373 1.0000000
4: 2014-04-06 01:03:36 373 1.0000000
5: 2014-04-06 01:03:37 373 1.0000000
6: 2014-04-06 01:03:01 373 1.0000000
7: 2014-04-06 01:02:42 373 1.0000000
8: 2014-04-06 01:02:31 373 1.0000000
BodieG's proposal works if the entries in Start.Station.ID/station_id are unique, but applying his suggestion to this data gives
status_dt station_id availability Start.Station.ID
1: 2014-04-06 18:24:32 373 0.4444444 238
2: 2014-04-06 18:20:30 373 0.4444444 238
3: 2014-04-06 01:04:13 373 1.0000000 373
4: 2014-04-06 01:03:36 373 1.0000000 373
5: 2014-04-06 01:03:37 373 1.0000000 373
6: 2014-04-06 01:03:01 373 1.0000000 373
7: 2014-04-06 01:02:42 373 1.0000000 373
8: 2014-04-06 01:02:31 373 1.0000000 373
The entries in the first two rows are not what I expected (or rather hoped for): they refer to the 'availability' of station 373 instead of 238.
I guess the code just has to be adapted to account for the timestamp AND the station ID, but I'm banging my head against the wall here.
I also could not figure out whether the suggested xts package would help, because I clearly have duplicated timestamps here.
Again, any hint is much appreciated.
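One possible adaptation, offered as a sketch rather than a confirmed fix: a data.table rolling join that matches on the station ID exactly and rolls on the timestamp. It assumes a data.table version that supports the on argument, and for brevity it rebuilds table B from only the eight status rows printed above (using the rounded values as displayed), not from the full dput output.

```r
library(data.table)

# Table A: the eight activity rows from the post.
A <- data.table(
  Start.Time = as.POSIXct(c(
    "2014-04-06 18:24:32", "2014-04-06 18:20:30",
    "2014-04-06 01:04:13", "2014-04-06 01:03:36",
    "2014-04-06 01:03:37", "2014-04-06 01:03:01",
    "2014-04-06 01:02:42", "2014-04-06 01:02:31"
  ), tz = "UTC"),
  Start.Station.ID = c(238, 238, 373, 373, 373, 373, 373, 373)
)

# Table B: ASSUMPTION - only a subset of the status rows (rows 1-4 and
# 41-44 as printed in the post), with availability rounded as displayed.
B <- data.table(
  status_dt = as.POSIXct(rep(c(
    "2014-04-06 00:29:02", "2014-04-06 01:29:03",
    "2014-04-06 17:59:03", "2014-04-06 18:59:03"
  ), each = 2), tz = "UTC"),
  station_id = rep(c(238, 373), times = 4),
  availability = c(0.9354839, 1.0000000, 1.0000000, 0.6111111,
                   0.8387097, 0.4444444, 0.9032258, 0.5000000)
)

# Rolling join: station_id must match exactly; roll = TRUE applies to the
# last join column, so each row of A picks up the latest status_dt in B
# that is at or before its Start.Time, within the same station.
res <- B[A,
         on = c(station_id = "Start.Station.ID", status_dt = "Start.Time"),
         roll = TRUE]

# The join result has one row per row of A, in A's order.
A[, availability := res$availability]
print(A)
```

With this subset, the first two rows pick up the 17:59:03 status of station 238, and the remaining rows pick up the 00:29:02 status of station 373. If a start time has no earlier status for its station, the availability comes back as NA.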
Thanks in advance!
For reproducibility:
Table A:
structure(list(Start.Time = structure(c(1396808672, 1396808430,
1396746253, 1396746216, 1396746217, 1396746181, 1396746162, 1396746151
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Start.Station.ID = c(238,
238, 373, 373, 373, 373, 373, 373)), .Names = c("Start.Time",
"Start.Station.ID"), class = c("data.table", "data.frame"), row.names = c(NA,
-8L))
Table B:
structure(list(status_dt = structure(c(1396744142, 1396744142,
1396747743, 1396747743, 1396753143, 1396753143, 1396754942, 1396754942,
1396756743, 1396756743, 1396758542, 1396758542, 1396760343, 1396760343,
1396765743, 1396765743, 1396767542, 1396767542, 1396772943, 1396772943,
1396778402, 1396778402, 1396781943, 1396781943, 1396785542, 1396785542,
1396787342, 1396787342, 1396790942, 1396790942, 1396794543, 1396794543,
1396798143, 1396798143, 1396799943, 1396799943, 1396801743, 1396801743,
1396805343, 1396805343, 1396807143, 1396807143, 1396810743, 1396810743,
1396816143, 1396816143, 1396817942, 1396817942, 1396821542, 1396821542,
1396826942, 1396826942), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
station_id = c(238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373, 238, 373, 238, 373, 238,
373, 238, 373, 238, 373, 238, 373), availability = c(0.935483870967742,
1, 1, 0.611111111111111, 0.935483870967742, 0.666666666666667,
0.967741935483871, 0.666666666666667, 0.967741935483871,
0.666666666666667, 0.935483870967742, 0.666666666666667,
0.967741935483871, 0.666666666666667, 0.967741935483871,
0.611111111111111, 0.967741935483871, 0.611111111111111,
1, 0.444444444444444, 0.870967741935484, 0.5, 0.806451612903226,
0.5, 0.774193548387097, 0.388888888888889, 0.709677419354839,
0.388888888888889, 0.67741935483871, 0.333333333333333, 1,
0.5, 0.903225806451613, 0.444444444444444, 0.935483870967742,
0.444444444444444, 0.903225806451613, 0.444444444444444,
0.870967741935484, 0.444444444444444, 0.838709677419355,
0.444444444444444, 0.903225806451613, 0.5, 0.774193548387097,
0.611111111111111, 0.766666666666667, 0.611111111111111,
0.774193548387097, 0.555555555555556, 0.870967741935484,
0.666666666666667)), .Names = c("status_dt", "station_id",
"availability"), class = c("data.table", "data.frame"), row.names = c(NA,
-52L), sorted = "status_dt")