Question

I'm trying to run some basic statistics (and more detailed ones later) on a sales data frame that has several categorical variables. In addition to sales, it tracks things like area (where the merchant is located), the day of the week, the time of day (Lunch, After Work, etc.), and various other things.

Here is a small, random subset of the data. (Note that this is a simplified representation: the actual data frame has 38 columns, and I removed most of the non-applicable ones.)

    structure(list(dayofweek = structure(c(4L, 7L, 3L, 7L, 3L, 2L, 
2L, 7L, 3L, 3L, 2L, 7L, 5L, 5L, 2L, 5L, 1L, 3L, 7L, 3L, 4L, 1L, 
3L, 5L, 7L), .Label = c("Friday", "Monday", "Saturday", "Sunday", 
    "Thursday", "Tuesday", "Wednesday"), class = "factor"), timeofday = structure(c(6L, 
4L, 5L, 5L, 2L, 6L, 6L, 5L, 6L, 3L, 6L, 3L, 5L, 4L, 1L, 3L, 5L, 
6L, 5L, 4L, 6L, 6L, 3L, 2L, 5L), .Label = c("After Work", "Early AM", 
     "Evening", "Late AM", "Lunch", "MidAfternoon", "Overnight"), class = "factor"), 
 area = c(6L, 4L, 4L, 5L, 5L, 1L, 4L, 2L, 3L, 2L, 7L, 3L, 
 7L, 5L, 7L, 4L, 1L, 4L, 1L, 4L, 5L, 7L, 1L, 3L, 7L), totsales = c(40, 
 6, 5, 10, 1, 0, 0, 3, 5, 3, 10, 30, 2, 1, 2, 22, 8, 1, 1, 
 5, 11, 20, 0, 1, 5)), .Names = c("dayofweek", "timeofday", 
     "area", "totsales"), class = "data.frame", row.names = c(192278L, 
     140773L, 121051L, 157984L, 154299L, 258034L, 108031L, 43760L, 
     78005L, 42103L, 95603L, 98431L, 30252L, 165303L, 40713L, 108252L, 
     304549L, 137041L, 268473L, 124599L, 161253L, 12897L, 240815L, 
     89439L, 21032L))

The first thing I am doing is trying to get the mean and median sales in each area and at each time of day. I would like to have R go through a list of each and return all the values. I tried this:

vallist<-list(a= c("Early AM", "Late AM", "Lunch", "MidAfternoon", "After Work", 
         "Evening", "Overnight"),
          b= c(1,2,3,4,5,6,7))

sapply(vallist[['b']], function(x)
    mapply(function(a,b) mean(data$totsales[which(data$timeofday==a & data$area==b)]),
          vallist[['a']], vallist[['b']])
 )

But it only applies the mean to each timeofday segment in area 1, not to each timeofday segment in areas 1-7. So my results look like this:

                  [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
Early AM      9.192847  9.192847  9.192847  9.192847  9.192847  9.192847  9.192847
Late AM       8.020678  8.020678  8.020678  8.020678  8.020678  8.020678  8.020678
Lunch        10.096277 10.096277 10.096277 10.096277 10.096277 10.096277 10.096277
MidAfternoon 11.503961 11.503961 11.503961 11.503961 11.503961 11.503961 11.503961
After Work    8.206124  8.206124  8.206124  8.206124  8.206124  8.206124  8.206124
Evening      11.457599 11.457599 11.457599 11.457599 11.457599 11.457599 11.457599
Overnight    11.415667 11.415667 11.415667 11.415667 11.415667 11.415667 11.415667

which are the correct answers for Area 1, but you can see they are the same values for each area. How do I get R to apply the function to multiple lists and return all the combinations of values?

The next steps will be to apply medians, and to evaluate at district levels and for different weekdays, but I assume the same idea will apply to all of the different combinations.


Solution 2

Converting my comment to an answer....

It seems like you might be interested in aggregate (though there are many ways to aggregate data in R).

out <- aggregate(totsales ~ timeofday + area, data, mean)
out
#       timeofday area totsales
# 1       Evening    1      0.0
# 2         Lunch    1      4.5
# 3  MidAfternoon    1      0.0
# 4       Evening    2      3.0
# 5         Lunch    2      3.0
# 6      Early AM    3      1.0
# 7       Evening    3     30.0
# 8  MidAfternoon    3      5.0
# 9       Evening    4     22.0
# 10      Late AM    4      5.5
# 11        Lunch    4      5.0
# 12 MidAfternoon    4      0.5
# 13     Early AM    5      1.0
# 14      Late AM    5      1.0
# 15        Lunch    5     10.0
# 16 MidAfternoon    5     11.0
# 17 MidAfternoon    6     40.0
# 18   After Work    7      2.0
# 19        Lunch    7      3.5
# 20 MidAfternoon    7     15.0

If you want to go from there to a wide format, you can then use reshape (like: reshape(out, direction = "wide", idvar="timeofday", timevar="area")).
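Since the question also asks for medians, here is a minimal sketch of getting both statistics from a single aggregate call. It rebuilds only the timeofday, area and totsales columns from the sample in the question; the names `tod` and `stats` are just illustrative, and the results on the full data frame will of course differ.

```r
# Rebuild the timeofday, area and totsales columns from the question's sample.
tod <- c("After Work", "Early AM", "Evening", "Late AM",
         "Lunch", "MidAfternoon", "Overnight")
data <- data.frame(
  timeofday = factor(tod[c(6, 4, 5, 5, 2, 6, 6, 5, 6, 3, 6, 3, 5,
                           4, 1, 3, 5, 6, 5, 4, 6, 6, 3, 2, 5)], levels = tod),
  area = c(6, 4, 4, 5, 5, 1, 4, 2, 3, 2, 7, 3, 7,
           5, 7, 4, 1, 4, 1, 4, 5, 7, 1, 3, 7),
  totsales = c(40, 6, 5, 10, 1, 0, 0, 3, 5, 3, 10, 30, 2,
               1, 2, 22, 8, 1, 1, 5, 11, 20, 0, 1, 5)
)

# Return both statistics per group; aggregate() stores them in a matrix column.
stats <- aggregate(totsales ~ timeofday + area, data,
                   function(x) c(mean = mean(x), median = median(x)))

# Flatten the matrix column into totsales.mean and totsales.median.
stats <- do.call(data.frame, stats)
print(stats)
```

Adding further grouping variables to the right-hand side of the formula (for example `totsales ~ dayofweek + timeofday + area`, if that column is present) extends the same call to more combinations.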

OTHER TIPS

For this particular case you can reproduce your result with:

library(reshape2)
# Name the value column explicitly instead of letting dcast guess it
dcast(data, timeofday ~ area, value.var = "totsales", fun.aggregate = mean, fill = 0)

which produces:

     timeofday   1 2  3    4  5  6    7
1   After Work 0.0 0  0  0.0  0  0  2.0
2     Early AM 0.0 0  1  0.0  1  0  0.0
3      Evening 0.0 3 30 22.0  0  0  0.0
4      Late AM 0.0 0  0  5.5  1  0  0.0
5        Lunch 4.5 3  0  5.0 10  0  3.5
6 MidAfternoon 0.0 0  5  0.5 11 40 15.0

I'm pretty sure the discrepancy from your result is due to the data you posted being a subset of the whole.
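On why the original sapply/mapply call repeats one column: mapply walks its arguments in parallel, so it only evaluates the pairs (a[1], b[1]), (a[2], b[2]), and so on, and the outer sapply never uses its argument x, so every column is the same vector. A rough sketch of the difference, using the sample columns from the question (the names `times`, `areas`, `cell_mean`, `paired` and `all_cells` are made up for illustration, and `times` follows the factor level order rather than the order in vallist):

```r
# Rebuild the timeofday, area and totsales columns from the question's sample.
tod <- c("After Work", "Early AM", "Evening", "Late AM",
         "Lunch", "MidAfternoon", "Overnight")
data <- data.frame(
  timeofday = factor(tod[c(6, 4, 5, 5, 2, 6, 6, 5, 6, 3, 6, 3, 5,
                           4, 1, 3, 5, 6, 5, 4, 6, 6, 3, 2, 5)], levels = tod),
  area = c(6, 4, 4, 5, 5, 1, 4, 2, 3, 2, 7, 3, 7,
           5, 7, 4, 1, 4, 1, 4, 5, 7, 1, 3, 7),
  totsales = c(40, 6, 5, 10, 1, 0, 0, 3, 5, 3, 10, 30, 2,
               1, 2, 22, 8, 1, 1, 5, 11, 20, 0, 1, 5)
)

times <- levels(data$timeofday)
areas <- sort(unique(data$area))

# Mean sales for one (time of day, area) cell; NaN when the cell is empty.
cell_mean <- function(a, b) {
  mean(data$totsales[data$timeofday == a & data$area == b])
}

# mapply pairs the inputs element by element: 7 values, not 7 x 7.
paired <- mapply(cell_mean, times, areas)

# Nesting the loops so the outer variable is actually used gives every combination.
all_cells <- sapply(areas, function(b) sapply(times, function(a) cell_mean(a, b)))
colnames(all_cells) <- areas
print(all_cells)
```

Empty combinations come back as NaN here rather than 0, which keeps "no sales recorded" distinct from "no rows in the data".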

Licensed under: CC-BY-SA with attribution