Question

I am working with cross-national daily data (for which I have created a year variable) containing well over 270,000 observations and many missing values in the variable of interest here, PartyCode. The data look like this:

Data <- data.frame(
  Observation  = 1:6,
  PartyCountry = c("CHN", "CHN", "GER", "GER", "USA", "USA"),
  PartyYear    = c(1999, 2000, 2000, 2001, 1999, 1999),  # named PartyYear to match the table below
  PartyCode    = c(20, NA, 20, 22, NA, 21)
)


Observation     PartyCountry   PartyYear    PartyCode
      1              CHN       1999             20
      2              CHN       2000             NA
      3              GER       2000             20
      4              GER       2001             22
      5              USA       1999             NA
      6              USA       1999             21

I want to collapse this into annual data in country-year format, with one count column per PartyCode category:

Observation PartyCountry PartyYear PartyCode20Count PartyCode21Count PartyCode22Count
    1        CHN          1999            100             100             100
    2        CHN          2000            100             100             100
    3        CHN          2001            300             300             300
    4        GER          1999            300             300             300
    5        GER          2000            140             140             140
    6        GER          2001            212             212             200

My question is multifaceted:

1) How do I extract values from the categorical PartyCode variable to produce the count variables (for each category) I want above?

Notably, this dataset has lots of missing values for the categorical variable, PartyCode.


Solution

It sounds like you should explore dcast from "reshape2":

library(reshape2)
# DF stands for the long-format data frame with columns PartyCountry, PartyYear and PartyCode
dcast(DF, PartyCountry + PartyYear ~ PartyCode, value.var = "PartyCode")
# Aggregation function missing: defaulting to length
#   PartyCountry PartyYear 20 21 22
# 1          CHN      1999  1  0  0
# 2          CHN      2000  2  0  0
# 3          CHN      2001  0  0  1
# 4          GER      1999  3  0  0
# 5          USA      2000  0  2  0
# 6          USA      2001  2  0  2

Here, we've just counted the rows in each cell (length is the default aggregation function), but you can supply a different function (for example, sum or mean) through the fun.aggregate argument if that is more meaningful.
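As a rough sketch of how this could look on the six sample rows from the question, the code below drops the rows with a missing PartyCode before casting and then renames the count columns to the pattern asked for. Dropping the NA rows first and the renaming step are assumptions on my part, not something the question requires; note that a country-year whose codes are all missing (CHN 2000 here) disappears from the result.

```r
library(reshape2)

# Sample data from the question (year column named PartyYear)
Data <- data.frame(
  PartyCountry = c("CHN", "CHN", "GER", "GER", "USA", "USA"),
  PartyYear    = c(1999, 2000, 2000, 2001, 1999, 1999),
  PartyCode    = c(20, NA, 20, 22, NA, 21)
)

# Assumption: rows with a missing PartyCode should not be counted
clean <- Data[!is.na(Data$PartyCode), ]

# One row per country-year, one column per PartyCode, cells = number of rows
wide <- dcast(clean, PartyCountry + PartyYear ~ PartyCode,
              value.var = "PartyCode", fun.aggregate = length)

# Rename the code columns (everything after the two id columns)
code_cols <- -(1:2)
names(wide)[code_cols] <- paste0("PartyCode", names(wide)[code_cols], "Count")

print(wide)
```

If the all-missing country-years need to stay in the output, one option would be to merge the result back onto the full set of country-year combinations and fill the gaps with 0.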


Generally, "collapsing" data suggests looking at one of the many "aggregation" functions in R. Then, transforming from the "long" format that you start with to the "wide" format you want to end up with usually suggests looking at one of the "reshaping" functions.
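To make the two steps concrete without any extra packages, here is a minimal base R sketch using aggregate() for the collapsing step and reshape() for the long-to-wide step. The helper column n and the final column names are made up for illustration; the sample rows are the ones from the question.

```r
# Sample data from the question (year column named PartyYear)
Data <- data.frame(
  PartyCountry = c("CHN", "CHN", "GER", "GER", "USA", "USA"),
  PartyYear    = c(1999, 2000, 2000, 2001, 1999, 1999),
  PartyCode    = c(20, NA, 20, 22, NA, 21)
)

# Step 1 (aggregate): count rows per country-year-code.
# The formula interface omits rows containing NA by default.
Data$n <- 1  # hypothetical helper column: each row counts once
counts <- aggregate(n ~ PartyCountry + PartyYear + PartyCode,
                    data = Data, FUN = sum)

# Step 2 (reshape): spread the codes into columns n.20, n.21, n.22
wide <- reshape(counts,
                idvar = c("PartyCountry", "PartyYear"),
                timevar = "PartyCode", v.names = "n",
                direction = "wide")

# Rename to the pattern from the question and turn empty cells into 0
is_count <- grepl("^n\\.", names(wide))
names(wide)[is_count] <- paste0(sub("^n\\.", "PartyCode", names(wide)[is_count]), "Count")
wide[is_count][is.na(wide[is_count])] <- 0

# Sort by country and year for readability
wide <- wide[order(wide$PartyCountry, wide$PartyYear), ]
print(wide)
```

The dcast call shown earlier does both steps in one go; splitting them out like this mainly helps when the aggregation needs to be inspected or reused on its own.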

Licensed under: CC-BY-SA with attribution