Question

I have a vertically arranged (stacked) pooled time series data.frame that looks like this:

date    item    qty_sold
day_1   orange  0
day_2   orange  0
day_3   orange  0
day_4   orange  0
day_5   orange  5
day_6   orange  0
day_7   orange  8
day_8   orange  0
day_1   hammer  0
day_2   hammer  0
day_3   hammer  3
day_4   hammer  0
day_5   hammer  70
day_6   hammer  70
day_7   hammer  0
Day_8   hammer  80

In each "item's" sub-series/sub-group, I need to identify and remove *all observations prior to the day on which the first positive qty_sold was observed*. For example, for the "orange" series, this means striking out days 1 through 4 and for the "hammer" series this means striking out the first 2 days.


(In case the explanation above is not clear: from each sub-series in the dataset, I need to remove all the days from date = Day_1 to date = Day_k such that qty_sold = 0 for every day in the interval 1...k, and retain all rows from date = Day_k+1 onward, where qty_sold >= 0.)

Can anyone kindly give an idea of how to go about this? The actual dataset contains about a million rows. I would also welcome suggestions for accomplishing this in SAS as well as R.


Solution

I totally agree with @joran's point there. I'll give an R answer here even though this question doesn't show any research effort. In the future, please show us the code you've tried as well.

For your problem, the first step is to use a base function or a package that helps you split your data.frame into groups, apply a function of your choice to each group, and combine the results (typically called the split-apply-combine strategy). There are a couple of nice external packages for this, namely plyr and data.table. I prefer data.table for data.frame-like operations, as it's generally a lot faster.
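For reference, the base-function route could look something like the sketch below. It is only an illustration, not the approach used in the rest of this answer: it rebuilds the question's sample data under the assumed name df (with the stray "Day_8" written as "day_8") and uses ave() to run cummax within each item.

```r
# Sample data mirroring the question; `df` is an assumed name
df <- data.frame(
  date     = rep(paste0("day_", 1:8), times = 2),
  item     = rep(c("orange", "hammer"), each = 8),
  qty_sold = c(0, 0, 0, 0, 5, 0, 8, 0,
               0, 0, 3, 0, 70, 70, 0, 80)
)

# ave() applies FUN within each group of `item` and returns the values
# in the original row order, so the comparison gives a row-wise filter.
# Assumes rows are already in date order within each item and that
# qty_sold is never negative.
keep <- ave(df$qty_sold, df$item, FUN = cummax) > 0
result <- df[keep, ]
result
```

The remainder of this answer takes the data.table route instead.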

So, first we'll convert your data.frame to a data.table. If you don't have this package installed, you can install it with install.packages("data.table").

require(data.table) # load package
dt <- data.table(df) # convert data.frame to data.table

Now, to split a data.table into groups, we can use the by argument within the data.table. Our apply function will be cummax (the cumulative maximum): within each group it stays at 0 through the leading run of zeros and is positive from the first positive value onward (provided there are no negative values in your data, which I assume here). The results are then combined automatically. So, let's do this:

dt[, .SD[cummax(qty_sold) > 0], by = item]

      item  date qty_sold
 1: orange day_5        5
 2: orange day_6        0
 3: orange day_7        8
 4: orange day_8        0
 5: hammer day_3        3
 6: hammer day_4        0
 7: hammer day_5       70
 8: hammer day_6       70
 9: hammer day_7        0
10: hammer Day_8       80
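To see why cummax does the job, here is a small sketch on a single group, using the hammer quantities from the question as a plain vector (the variable names are just for illustration):

```r
# qty_sold for the "hammer" series, days 1 to 8
hammer <- c(0, 0, 3, 0, 70, 70, 0, 80)

# Running maximum: stays at 0 until the first positive sale
running_max <- cummax(hammer)
running_max          # 0 0 3 3 70 70 70 80

# TRUE from the first positive sale onward, including later zeros
keep <- running_max > 0
hammer[keep]         # 3 0 70 70 0 80
```

Note that the zeros on day 4 and day 7 are kept, because the running maximum is already positive by then.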

To sum up:

require(data.table)
dt <- data.table(df)
dt[, .SD[cummax(qty_sold)>0], by = item]

Some more explanation of the syntax. Let's first consider by = item. This is the part that internally splits the data for you by item (that is, the rows for item = orange will be processed first, followed by the rows for item = hammer, and so on).

The middle part, .SD[cummax(qty_sold) > 0], is where the magic happens: it is the equivalent of the apply function. Here, .SD is just the split part (the subset of the data corresponding to one item at a time). To see more clearly what's in .SD each time, do: dt[, print(.SD), by = item].

This removes the rows that form the contiguous run of 0's at the start of each group and retains everything else (the solution is guaranteed to work as long as there are no negative values).
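Since the real dataset has about a million rows, it may be worth trying a variant that collects row numbers with .I instead of subsetting .SD in every group. This is a sketch under the same no-negative-values assumption, with the question's sample data rebuilt inline (and "Day_8" written as "day_8"). Whether it is actually faster depends on the number of groups, so benchmark it on your own data.

```r
library(data.table)

# Sample data mirroring the question
dt <- data.table(
  date     = rep(paste0("day_", 1:8), times = 2),
  item     = rep(c("orange", "hammer"), each = 8),
  qty_sold = c(0, 0, 0, 0, 5, 0, 8, 0,
               0, 0, 3, 0, 70, 70, 0, 80)
)

# .I holds the row numbers of the current group within the full table.
# Collect the row numbers to keep per item, then subset once at the end.
idx <- dt[, .I[cummax(qty_sold) > 0], by = item]$V1
result <- dt[idx]
result
```

The selected rows are the same as with the .SD version; only the way they are gathered differs.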

OTHER TIPS

The SAS approach would be something like this: use a retained variable to keep track of whether you have already encountered a positive value for the current item. If not, do not output the row. If so, record it in the tracking variable. After the last row of an item, reset the tracking variable. For example (sort by item and date first if necessary):

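data RESULT (drop=found_first_positive);
    set DATASET;                      /* must be sorted by item, then date */
    by item date;
    retain found_first_positive 0;    /* flag keeps its value across rows */
    if qty_sold>0 then found_first_positive=1;
    if found_first_positive;          /* subsetting IF: drop rows before the first positive sale */
    if last.item then found_first_positive=0;    /* reset for the next item */
run;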
Licensed under: CC-BY-SA with attribution