Question

Intro

I am not an R expert yet, so please excuse another question I should probably be embarrassed to ask. In another question I asked on Stack Overflow I got some very helpful comments on how to aggregate irregular daily data in an xts object to weekly values with the apply.weekly() function. Unfortunately, I haven't found a function like tapply(), ddply(), by() or aggregate() that splits the data up by categories and also works together with apply.weekly().

My Data

This is my example dataset, which I already posted in the other question. For illustration purposes I am taking the liberty of posting it here as well:

example <- as.data.frame(structure(c(" 1", " 2", " 1", " 2", " 1", " 1", " 2", " 1", " 2", 
" 1", " 2", " 3", " 1", " 1", " 2", " 2", " 3", " 1", " 2", " 2", 
" 1", " 2", " 1", " 1", " 2", NA, " 2", NA, NA, " 1", " 3", " 1", 
" 3", " 3", " 2", " 3", " 3", " 3", " 2", " 2", " 2", " 3", " 3", 
" 3", " 2", " 2", " 3", " 3", " 3", " 3", " 1", " 2", " 1", " 2", 
" 2", " 1", " 2", " 1", " 2", " 2", " 2", " 3", " 1", " 1", " 2", 
" 2", " 3", " 3", " 2", " 2", " 1", " 2", " 1", " 1", " 2", NA, 
" 2", NA, NA, " 1", " 3", " 2", " 3", " 2", " 0", " 3", " 3", 
" 3", " 2", " 0", " 2", " 3", " 3", " 3", " 0", " 2", " 2", " 3", 
" 3", " 0", "12", " 5", " 9", "14", " 5", "tra", "tra", "man", 
"inf", "agc", "07-2011", "07-2011", "07-2011", "07-2011", "07-2011" 
), .indexCLASS = c("POSIXlt", "POSIXt"), .indexTZ = "", class = c("xts", 
"zoo"), .indexFORMAT = "%U-%Y", index = structure(c(1297642226, 
1297672737, 1297741204, 1297748893, 1297749513), tzone = "", tclass = c("POSIXlt", 
"POSIXt")), .Dim = c(5L, 23L), .Dimnames = list(NULL, c("rev_sit", 
"prof_sit", "emp_nr_sit", "inv_sit", "ord_home_sit", "ord_abr_sit", 
"emp_cost_sit", "usage_cost_sit", "tax_cost_sit", "gov_cost_sit", 
"rev_exp", "prof_exp", "emp_nr_exp", "inv_exp", "ord_home_exp", 
"ord_abr_exp", "emp_cost_exp", "usage_cost_exp", "tax_cost_exp", 
"gov_cost_exp", "land", "nace", "index"))))

The columns

"rev_sit", "prof_sit", "emp_nr_sit", "inv_sit", "ord_home_sit", "ord_abr_sit", "emp_cost_sit", "usage_cost_sit", "tax_cost_sit", "gov_cost_sit","rev_exp", "prof_exp", "emp_nr_exp", "inv_exp", "ord_home_exp","ord_abr_exp", "emp_cost_exp", "usage_cost_exp","tax_cost_exp","gov_cost_exp",

refer to questions in a survey. There were three possible answers, coded as "1", "2", and "3".

The columns

"land", "nace"

are categories with 16 and 8 unique levels, respectively.

My goal

My goal is to count the occurrences of "1", "2", and "3" per week for each combination of the category levels in "nace" and "land". My idea was to first create binary indicator vectors for each answer code {1,2,3} (example_1, example_2, example_3) and then apply something like:

apply.weekly(example_1, function(d){ddply(d,list(example$nace,example$land),sum)})

But this doesn't work, neither with ddply nor with aggregate, by, etc.
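For illustration, the binary indicator objects are created roughly like this (a sketch; in the example data the answer codes are stored as characters with a leading blank, and in my real data these stay xts objects so that apply.weekly() can be applied to them):

ans <- example[, 1:20]            # the 20 question columns
example_1 <- (ans == " 1") * 1    # 1 where the answer code is "1", 0 otherwise
example_2 <- (ans == " 2") * 1
example_3 <- (ans == " 3") * 1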

My workaround

My unprofessional initial workaround was not to create a time series at all, but just a date vector example$date with the given time column coded as weeks via %V, and then to use, e.g.:

tapply(example_1[,5], list(example$date,example$nace,example$land),sum)

which I would of course then have to do for each of the twenty questions shown above. For example_1 this gives me something like:

week1, nace1.land1, nace1.land2, nace1.land3, ..., nace1.land16, nace2.land1, ..., nace8.land16
week2, nace1.land1, nace1.land2, nace1.land3, ..., nace1.land16, nace2.land1, ..., nace8.land16
...
weekn, nace1.land1, nace1.land2, nace1.land3, ..., nace1.land16, nace2.land1, ..., nace8.land16

I would have to do the same for the codes 2 (example_2) and 3 (example_3), and that for each of the 20 questions, producing 16*8*3*20 = 7680 columns in total. This is extreme, and additionally the result of this method is not a time series, so it is not ordered correctly by week.
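For reference, the date vector example$date used above was created roughly like this (a sketch based on the row names of the example object, with %V giving the week number):

example$date <- format(as.POSIXct(rownames(example)), "%Y-%V")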

Summary

So can anyone teach me, or give me a hint, how to use apply.weekly() in combination with functions of the sort of tapply(), ddply(), by(), split(), unstack(), etc., or any other method to achieve the grouping described above? Every hint is really appreciated. I am already so frustrated that I am thinking about abandoning my R experiment and switching back to Stata, where so many things are much more intuitive with collapse and by etc. But don't get me wrong: I am keen to learn, so please help me!


Solution

Thank you very much for all your help. I was busy with some other things in the meantime, but now I have been working on my problem again, and with the help of your great comments I have found a solution:

I gave up working directly with time series and postponed that step to the end of my analysis. So I take the date vector and transform it into weeks:

library(ISOweek)
d$index <- ISOweek(d$date)

(I do this with ISOweek since I am using Windows.)
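For example, ISOweek() turns a date into an ISO 8601 week label, which also sorts correctly as plain text:

ISOweek(as.Date("2011-02-14"))   # "2011-W07"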

Then I use a combination of tapply and lapply. The following function calculates the number of positive answers in the survey (coded as 1) for every calendar week (d$index = t[[23]]) and every combination of the two categorical columns t[[21]] (land) and t[[22]] (nace). In the same step the whole thing is transformed into a data frame:

groupweeksums <- function(x, t) {
  as.data.frame(tapply((x == 1) * 1, list(t[[23]], t[[21]], t[[22]]),
                       function(d) sum(d, na.rm = TRUE)))
}

(x stands for the specific column, t for the data frame; I didn't know how to do it otherwise, because at one point I have to address a single column and at another the whole data frame, and I wanted to avoid lots of typing.) If d is the data frame, then:

df <- groupweeksums(d,d)
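For a single question column the call looks roughly like this (a sketch, using one of the column names from the example data above):

df1 <- groupweeksums(d$rev_sit, d)   # one row per week, one column per land/nace combination
str(df1)                             # inspect the resulting structure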

So that I don't have to repeat this procedure for each of my 20 questions, I use lapply:

df <- as.data.frame(lapply(df[,1:20],function(d)groupweeksums(d,euwifo)))

This gives me a beautiful data frame with everything I need for further analysis. Thanks for your help; with your helpful comments I came closer and closer to the solution!

P.S. I will also post this answer to the other, related question I asked on Stack Overflow. I hope this is not a problem or against any rules here.

OTHER TIPS

I would add a "week" column, as you suggest, but convert the data to a tall format before processing -- you can convert it back to a time series afterwards, if needed.

library(reshape2)
d <- melt(example, id.vars=c("land", "nace", "index"))
# You apparently want one of the following:
dcast( d, land + nace + index ~ value, length )              # counts per answer code, over all questions
dcast( d, land + nace + index + variable ~ value, length )   # counts per question and answer code
dcast( d, land + nace + index ~ variable + value, length )   # one column per question/answer combination

Equivalently, you could use ddply:

library(plyr)
d <- melt(example, id.vars=c("land", "nace", "index"))
ddply( d, 
  c("land", "nace", "index", "value"), 
  summarize, 
  number=length(value)  # The argument "value" does not play any role
)

Your index column contains the week number within the year (%U-%Y): this will only work if all the dates fall within the same calendar year. It may be safer to use an actual date instead of the week number, for instance the Sunday at the start of the current week -- that also makes it easier to turn the result into a time series.

week_start <- function(u) as.Date(u) - as.numeric(format(u, "%u"))   # %u: weekday number, Monday = 1
example$index <- week_start( as.POSIXct(rownames(example)) )
# The following may also work (ISO year and week number).
example$index <- format( as.POSIXct(rownames(example)), "%G-%V" )
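If you also want the aggregated counts back as a weekly time series, something along these lines should work (a sketch, assuming the week_start dates from above as the index and the melt/dcast calls shown earlier; zoo is the package that xts builds on):

library(reshape2)
library(zoo)
d <- melt(example, id.vars = c("land", "nace", "index"))
counts <- dcast(d, index ~ land + nace + variable + value, length)
z <- zoo(as.matrix(counts[, -1]), order.by = as.Date(counts$index))   # one row per week, ordered by date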
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow