Getting sums by multiple columns of factors of a data frame in R

https://stackoverflow.com/questions/22930875

29-06-2023

Question

I have a data frame of three columns that looks like this:

> head(d)
  YYYYMM State   Weight
1 200812    AL 1876.356
2 200812    AL 2630.503
3 200812    AL 2763.981
4 200812    AL 2693.110
5 200812    AL 2905.784
6 200812    AL 3511.313

It has 51 states and runs from 2008-04 to 2010-04, so it has 25 unique YYYYMM values:

 > levels(factor(d$YYYYMM))
 [1] "200804" "200805" "200806" "200807" "200808" "200809" "200810"
 [8] "200811" "200812" "200901" "200902" "200903" "200904" "200905"
[15] "200906" "200907" "200908" "200909" "200910" "200911" "200912"
[22] "201001" "201002" "201003" "201004"

Using table(d$YYYYMM, d$State), I get a contingency table of counts:

  head(table(d$YYYYMM,d$State))

           ME   NH   VT   MA   RI   CT   NY   NJ   PA   OH   IN   IL ...
  200804 2018 2340 1501 1651 1781 2373 4550 2181 3328 2949 1631 3242 ...
  200805 2002 2332 1556 1648 1770 2360 4521 2217 3294 2936 1671 3193 ...
  200806 1999 2369 1552 1676 1803 2390 4578 2221 3331 2997 1642 3181 ...
  200807 1988 2354 1605 1601 1769 2362 4530 2165 3318 2973 1592 3271 ...
  200808 1998 2348 1649 1667 1812 2411 4417 2191 3302 2975 1627 3198 ...
  200809 2032 2343 1679 1670 1865 2367 4599 2185 3320 2914 1625 3155 ...
  ...

However, instead of counts I want each cell to hold the sum of weights. For example, for 200804 and state ME I want not the count but the sum of weights:

> sum(d[d$YYYYMM==200804 & d$State=="ME",]$Weight)
[1] 1063323

I tried calculating this with a "for" loop, but it took far too long. Is there a way to modify the table() function to accomplish that? If not, what other options are there? Eventually I want to calculate percentages, but that is trivial once I know how to get the sums of weights by YYYYMM and state. Thank you. Below is a summary of the data in case you need it. Let me know if more clarification is necessary.

> summary(d)
     YYYYMM           State             Weight     
 Min.   :200804   CA     : 221244   Min.   :    0  
 1st Qu.:200810   TX     : 132650   1st Qu.: 1176  
 Median :200904   NY     : 114282   Median : 2496  
 Mean   :200887   FL     : 106116   Mean   : 2226  
 3rd Qu.:200910   PA     :  82482   3rd Qu.: 3139  
 Max.   :201004   IL     :  80816   Max.   :16822  
                  (Other):1906523

Solution

I think tapply is the function you're looking for:

tapply(d$Weight, list(d$State, d$YYYYMM), sum)
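
To see the shape of the result, here is a minimal sketch on made-up data. The `toy` data frame and its values are assumptions for illustration only. It puts YYYYMM in the rows to match the table() layout in the question, and it also shows xtabs(), which builds the same cross-tabulation from a formula, and prop.table() for the percentages.

```r
# Toy data with the same column names as the question (values are made up)
toy <- data.frame(
  YYYYMM = c(200804, 200804, 200804, 200805, 200805, 200805),
  State  = factor(c("ME", "ME", "NH", "ME", "NH", "NH")),
  Weight = c(10, 20, 5, 1, 2, 3)
)

# Sum of Weight for each YYYYMM (rows) and State (columns)
sums <- tapply(toy$Weight, list(toy$YYYYMM, toy$State), sum)

# xtabs() computes the same sums of Weight from a formula
sums_xt <- xtabs(Weight ~ YYYYMM + State, data = toy)

# Share of each month's total weight per state, in percent
pct <- 100 * prop.table(sums_xt, margin = 1)
```

One difference to keep in mind: for a YYYYMM/State combination with no rows, tapply() returns NA while xtabs() returns 0.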

Other suggestions

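First, reshape your data frame into wide format, summing the weights within each YYYYMM/State cell. Without fun.aggregate, dcast() falls back to length() and returns counts:

library(reshape2)
df <- dcast(d, YYYYMM ~ State, value.var = "Weight", fun.aggregate = sum)

After that, df has one row per month and one column per state, so no further summing is needed. If you prefer aggregate(), it gives the same sums in long format directly from d:

aggregate(Weight ~ YYYYMM + State, data = d, FUN = sum)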
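
For the percentages mentioned in the question, a long-format table of sums can be convenient. Below is a minimal base R sketch on made-up data, using aggregate() with a formula and ave() for the per-month totals. The `toy` data frame, its values, and the `Pct` column name are assumptions for illustration, not part of the original post.

```r
# Toy data with the same column names as the question (values are made up)
toy <- data.frame(
  YYYYMM = c(200804, 200804, 200804, 200805, 200805, 200805),
  State  = factor(c("ME", "ME", "NH", "ME", "NH", "NH")),
  Weight = c(10, 20, 5, 1, 2, 3)
)

# One row per YYYYMM/State combination, with Weight summed within each group
long <- aggregate(Weight ~ YYYYMM + State, data = toy, FUN = sum)

# Percentage of each month's total weight contributed by each state
long$Pct <- 100 * long$Weight / ave(long$Weight, long$YYYYMM, FUN = sum)
```

Only combinations that actually occur in the data get a row here, so empty cells do not appear as zeros.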

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow