R data.frame operations

https://stackoverflow.com/questions/19159251

30-06-2022
|

문제

I am trying to compute inter-survey estimates of population numbers between years 1981 and 1986 for each YEAR (1982...1985) and each AGE group (10-14,..., 55-59). The complicating factor in my dataset is that I have 52 provinces and around 600 ZONA91OK (districts) and each PROVINCE has a different number of districts.

The formula I would like to apply in order to obtain vectors containing the information for each missing YEAR, NATIONALITY, PROVINCE and for each district (ZONA91OK) is the following:

ex. for 1982

value for 1982, age group 10-14: x(10,1982)=[(x(10,1981)-x(15,1986))/5]-x(10,1982)
x(15,1982)=[(x(15,1981)-x(20,1986))/5]-x(15,1982)
x(20,1982)=[(x(20,1981)-x(25,1986))/5]-x(20,1982)
...
x(55,1982)=[(x(55,1981)-x(55,1986))/5]-x(55,1982) -exception-

Any help on the issue is very much appreciated!

Here is the reproducible sample (a subset of the whole database as it is very big)

mydata<-structure(list(YEAR = c(1981, 1981, 1981, 1981, 1981, 1981, 1981, 
1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 
1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 
1981, 1981, 1981, 1981, 1981, 1981, 1981, 1986, 1986, 1986, 1986, 
1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 
1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 
1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986), 
PROVINCE = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
ZONA91OK = c(101, 101, 101, 101, 101, 101, 101, 101, 101, 
102, 102, 102, 102, 102, 102, 102, 102, 102, 1036, 1036, 
1036, 1036, 1036, 1036, 1036, 1036, 1036, 1059, 1059, 1059, 
1059, 1059, 1059, 1059, 1059, 1059, 101, 101, 101, 101, 101, 
101, 101, 101, 101, 102, 102, 102, 102, 102, 102, 102, 102, 
102, 1036, 1036, 1036, 1036, 1036, 1036, 1036, 1036, 1036, 
1059, 1059, 1059, 1059, 1059, 1059, 1059, 1059, 1059), AGE5 = c(10, 
15, 20, 25, 30, 35, 40, 45, 50, 10, 15, 20, 25, 30, 35, 40, 
45, 50, 10, 15, 20, 25, 30, 35, 40, 45, 50, 10, 15, 20, 25, 
30, 35, 40, 45, 50, 10, 15, 20, 25, 30, 35, 40, 45, 50, 10, 
15, 20, 25, 30, 35, 40, 45, 50, 10, 15, 20, 25, 30, 35, 40, 
45, 50, 10, 15, 20, 25, 30, 35, 40, 45, 50), NATIONALITY = structure(c(9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L), .Label = c("España", 
"UE-15 y PD", "Resto Europa", "Magreb", "África Sub-sahariana", 
"Latinoamérica", "Asia", "Resto del Mundo", "No computable"
), class = "factor"), FREQUENCY = c(993.8141, 994.907, 894.0322, 
845.8348, 659.6786, 577.2588, 540.6329, 684.9917, 673.6348, 
910.511, 1068.9258, 936.9949, 763.547, 643.4404, 572.72, 
536.6591, 665.975, 768.6866, 967.694100000002, 980.340100000001, 
811.637500000001, 746.058500000001, 769.820600000001, 722.398000000001, 
730.371600000001, 690.084600000001, 501.9178, 8243.04149999997, 
7785.02419999994, 7505.78429999991, 7464.74579999992, 7663.47079999997, 
6700.90559999997, 5203.31959999996, 5582.66059999997, 4837.30459999996, 
869.1754, 982.7461, 945.5031, 904.2817, 813.7127, 663.955, 
577.2896, 544.1257, 689.9815, 780.3824, 879.7538, 1025.5724, 
882.475, 716.0049, 627.3571, 579.4372, 525.4546, 679.9666, 
1035.6544, 952.521599999999, 962.537599999999, 832.3296, 
733.1696, 726.1568, 704.1248, 700.1136, 667.0624, 9023.05139999993, 
8285.31719999994, 8080.95919999994, 8175.28479999993, 7786.53429999994, 
7796.56439999994, 6842.11639999996, 5239.83509999998, 5616.95939999997
)), .Names = c("YEAR", "PROVINCE", "ZONA91OK", "AGE5", "NATIONALITY", 
"FREQUENCY"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 
36L, 8173L, 8174L, 8175L, 8176L, 8177L, 8178L, 8179L, 8180L, 
8181L, 8182L, 8183L, 8184L, 8185L, 8186L, 8187L, 8188L, 8189L, 
8190L, 8191L, 8192L, 8193L, 8194L, 8195L, 8196L, 8197L, 8198L, 
8199L, 8200L, 8201L, 8202L, 8203L, 8204L, 8205L, 8206L, 8207L, 
8208L), class = "data.frame")

해결책 2

If you can guarantee that there is no missing data -- that is, every year contains exactly the same number of rows, and the same combinations of province, age, nationality, and zone, then here is a simple solution:

df<-mydata[with(mydata,order(YEAR,NATIONALITY,PROVINCE,ZONA91OK,AGE5)),]
splitdata<-split(df,df$YEAR)
for(i in 1982:1985){
  chi<-as.character(i)
  splitdata[[chi]]<-splitdata$`1981`
  splitdata[[chi]]$YEAR<-i
  splitdata[[chi]]$FREQUENCY<-splitdata$`1986`$FREQUENCY*(i-1981)/5+
    splitdata$`1981`$FREQUENCY*(1986-i)/5
}

newdata<-do.call(rbind,splitdata)
newdata

On edit: The ddply method posted by @shadow is a more "correct" R way to do this, and using ddply will make it easier to deal with missing data. If you don't have missing data, though, then the split method seems to be more efficient:

Unit: milliseconds
        expr      min       lq    median        uq      max neval
 splitmethod 2.573737 2.638159  2.676954  2.735601 235.2619   100
 ddplymethod 9.812680 9.989192 10.148241 17.128667 243.7309   100

다른 팁

Probably easiest to use ?ddply from the plyr package. This should get you started...

require(plyr)
df <- ddply(mydata, c("PROVINCE", "NATIONALITY", "ZONA91OK"), function(x){
  x1981 <- x[x[,"YEAR"]==1981,]
  x1986 <- x[x[,"YEAR"]==1986,]
  #1982
  x1982 <- x1981
  x1982[,"YEAR"] <- 1982
  # This seems strange. Should probably be "+" instead of "-"
  x1982[,"FREQUENCY"] <- (x1981[,"FREQUENCY"]-x1986[,"FREQUENCY"])/5 - x1981[,"FREQUENCY"]
  # Add additional years here...
  rbind(x1981, x1982, x1986)
})
# perhaps reorder
df[order(df[,"YEAR"]),]

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow