Question

I have just started learning the R statistical language and have been stuck on one problem for days. I hope you can give me a hand. Here is the setup:

  • I have a data set named DF with hundreds of thousands of records. It has 5 columns and is built like this: DF<-cbind(ProvinceID,CityID,House,Person,WorkingStatus)
  • CityID is 5 characters long: the first 2 characters are the ProvinceID, and the remaining 3 characters complete the unique identifier of each City.
  • Each House is uniquely identified by the combination of ProvinceID, CityID and House.
  • Person is 6 characters long: the first 4 characters are the House, and the remaining 2 characters complete the unique identifier of each Person.

Here is the code that generates the example data:

ProvinceID<-c(10,10,10,20,20,20,30,30,40,40,40,40,50)
CityID<-c(10001,10001,10002,20001,20002,20002,30001,30001,40001,40001,40001,40001,50001)
House<-c(0001,0001,0001,0001,0001,0002,0001,0002,0001,0001,0001,0002,0001)
Person<-c(000101,000102,000101,000101,000101,000101,000101,000101,000101,000102,000103,000101,000101)
WorkingStatus<-c(1,0,0,0,1,1,0,0,1,1,0,0,1)
DF<-cbind(ProvinceID,CityID,House,Person,WorkingStatus)

DF <-as.data.frame(DF)

My problem is to create a variable named "HouseIncome" that takes the value 1 if at least one member of the household is currently working (at least one "Person" of the house has WorkingStatus == 1). Since each House is unique only when the 3 columns "ProvinceID", "CityID" and "House" are combined, I wonder whether there is a way to subset the data into houses, and whether R has a function that expresses "if at least one".

The result should look like:

ProvinceID<-c(10,10,20,20,20,30,30,40,40,50)
CityID<-c(10001,10002,20001,20002,20002,30001,30001,40001,40001,50001)
House<-c(0001,0001,0001,0001,0002,0001,0002,0001,0002,0001)
HouseIncome<-c(1,0,0,1,1,0,0,1,0,1)

DF1<-cbind(ProvinceID,CityID,House,HouseIncome)

Solution

This is easy using the data.table package:

library(data.table)
dt <-data.table(DF) # your DF
setkeyv(dt, c( "ProvinceID", "CityID", "House") )

dt[, list(HouseIncome = as.integer(sum(WorkingStatus)>0)), by=key(dt)]


   ProvinceID CityID House HouseIncome
 1:         10  10001     1           1
 2:         10  10002     1           0
 3:         20  20001     1           0
 4:         20  20002     1           1
 5:         20  20002     2           1
 6:         30  30001     1           0
 7:         30  30001     2           0
 8:         40  40001     1           1
 9:         40  40001     2           0
10:         50  50001     1           1

Very nice answer from @ChristianBorck, +1. Just a couple of tips to improve it further.

setDT(DF)[, list(HouseIncome = any(WorkingStatus == 1L)*1L), 
                    by=list(ProvinceID, CityID, House)]

1) You can use setDT instead of as.data.table(.) or data.table(.). It converts your data.frame to a data.table by reference (without copying), which avoids unnecessary memory usage and is effectively instant.

2) You can use setkey for aggregation/grouping, but you don't have to, unless you really want the result sorted.
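
As a minimal sketch of both tips: the three-row DF below is made up for illustration (it is not the question's data) and is deliberately unsorted. Grouping with by keeps groups in order of first appearance, while keyby, used here as an alternative to calling setkey first, returns them sorted.

```r
library(data.table)

# Hypothetical toy data, deliberately unsorted (province 20 comes first).
DF <- data.frame(
  ProvinceID    = c(20, 10, 10),
  CityID        = c(20001, 10001, 10001),
  House         = c(1, 1, 1),
  WorkingStatus = c(0, 1, 0)
)

# Tip 1: convert by reference; no copy is made and no reassignment is needed.
setDT(DF)

# Tip 2: grouping works without a key; groups keep their order of first appearance.
res <- DF[, list(HouseIncome = any(WorkingStatus == 1L) * 1L),
          by = list(ProvinceID, CityID, House)]

# If sorted output is wanted, keyby groups and sorts in one step.
res_sorted <- DF[, list(HouseIncome = any(WorkingStatus == 1L) * 1L),
                 keyby = list(ProvinceID, CityID, House)]

print(res)
print(res_sorted)
```

Here res lists province 20 before province 10, and res_sorted lists them in ascending order.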

OTHER TIPS

It's quite easy with the plyr package (or any functions that offer split-apply-combine functionality):

library(plyr)
ddply(DF, .(ProvinceID, CityID, House), 
        summarise, HouseIncome=as.numeric(any(WorkingStatus==1)))
#    ProvinceID CityID House HouseIncome
# 1          10  10001     1           1
# 2          10  10002     1           0
# 3          20  20001     1           0
# 4          20  20002     1           1
# 5          20  20002     2           1
# 6          30  30001     1           0
# 7          30  30001     2           0
# 8          40  40001     1           1
# 9          40  40001     2           0
# 10         50  50001     1           1
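
For comparison, the same split-apply-combine idea can be sketched in base R with aggregate(), with no extra package. This is only one possible approach. The data frame below rebuilds the question's example with data.frame(), and the explicit order() call is there because aggregate() does not return the groups in the order shown above.

```r
# The question's example data (Person is omitted because it is not needed here).
DF <- data.frame(
  ProvinceID    = c(10, 10, 10, 20, 20, 20, 30, 30, 40, 40, 40, 40, 50),
  CityID        = c(10001, 10001, 10002, 20001, 20002, 20002, 30001,
                    30001, 40001, 40001, 40001, 40001, 50001),
  House         = c(1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1),
  WorkingStatus = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1)
)

# One row per house: 1 if any member has WorkingStatus == 1, else 0.
res <- aggregate(WorkingStatus ~ ProvinceID + CityID + House, data = DF,
                 FUN = function(x) as.numeric(any(x == 1)))

# aggregate() keeps the name of the aggregated column, so rename it.
names(res)[names(res) == "WorkingStatus"] <- "HouseIncome"

# Sort by the three identifying columns to match the layout used above.
res <- res[order(res$ProvinceID, res$CityID, res$House), ]
rownames(res) <- NULL
print(res)
```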

To complete the set, here's an answer with dplyr. First, I'll create the data a safer way - you should never use cbind() to make data frames because it coerces all inputs to the same type:

df <- data.frame(
  ProvinceID = c(10, 10, 10, 20, 20, 20, 30, 30, 40, 40, 40, 40, 50),
  CityID = c(10001, 10001, 10002, 20001, 20002, 20002, 30001, 30001, 40001, 40001, 40001, 40001, 50001),
  House = c(0001, 0001, 0001, 0001, 0001, 0002, 0001, 0002, 0001, 0001, 0001, 0002, 0001),
  Person = c(000101, 000102, 000101, 000101, 000101, 000101, 000101, 000101, 000101, 000102, 000103, 000101, 000101),
  WorkingStatus = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1)
)
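
The coercion is easy to see on a small made-up example (the names m, d, id and n are only for illustration). In the question all columns are numeric, so cbind() happens to work there, but a single character column would turn every value into a string. The last lines show a related pitfall: numeric literals such as 0001 do not keep their leading zeros, which is why House prints as 1.

```r
# cbind() on vectors returns a matrix, and a matrix holds a single type,
# so mixing character and numeric input turns the numbers into strings.
m <- cbind(id = c("a", "b"), n = c(1, 2))
m_is_matrix <- is.matrix(m)   # TRUE
m_type <- typeof(m)           # "character"

# data.frame() keeps each column's own type.
d <- data.frame(id = c("a", "b"), n = c(1, 2), stringsAsFactors = FALSE)
d_classes <- sapply(d, class) # id is "character", n is "numeric"

# Leading zeros are lost in numeric literals; zero-padded strings keep them.
same_number <- identical(0001, 1)  # TRUE
padded <- sprintf("%04d", 1L)      # "0001"

print(m_type)
print(d_classes)
print(padded)
```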

With dplyr, you use group_by() to set up the grouping, and mutate() to add a new column. I think you're better off leaving the variable as a logical vector, rather than converting it to 0/1.

library(dplyr)
df %>% 
  group_by(ProvinceID, CityID, House) %>%
  mutate(HouseIncome = any(WorkingStatus == 1))
#> Source: local data frame [13 x 6]
#> Groups: ProvinceID, CityID, House
#> 
#>    ProvinceID CityID House Person WorkingStatus HouseIncome
#> 1          10  10001     1    101             1        TRUE
#> 2          10  10001     1    102             0        TRUE
#> 3          10  10002     1    101             0       FALSE
#> 4          20  20001     1    101             0       FALSE
#> 5          20  20002     1    101             1        TRUE
#> 6          20  20002     2    101             1        TRUE
#> 7          30  30001     1    101             0       FALSE
#> 8          30  30001     2    101             0       FALSE
#> 9          40  40001     1    101             1        TRUE
#> 10         40  40001     1    102             1        TRUE
#> 11         40  40001     1    103             0        TRUE
#> 12         40  40001     2    101             0       FALSE
#> 13         50  50001     1    101             1        TRUE
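
Note that mutate() keeps all 13 person-level rows. If one row per house is wanted, as in the question's expected result, a sketch with summarise() in place of mutate() could look like the following. It assumes a dplyr version that provides the %>% pipe, and it rebuilds the data so the block runs on its own.

```r
library(dplyr)

# Same example data as above (Person is omitted because it is not needed here).
df <- data.frame(
  ProvinceID    = c(10, 10, 10, 20, 20, 20, 30, 30, 40, 40, 40, 40, 50),
  CityID        = c(10001, 10001, 10002, 20001, 20002, 20002, 30001,
                    30001, 40001, 40001, 40001, 40001, 50001),
  House         = c(1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1),
  WorkingStatus = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1)
)

# summarise() collapses each group to a single row.
house_income <- df %>%
  group_by(ProvinceID, CityID, House) %>%
  summarise(HouseIncome = any(WorkingStatus == 1)) %>%
  ungroup() %>%
  as.data.frame()

print(house_income)
```

This returns 10 rows, one per house, with HouseIncome as a logical column.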

Something like this perhaps, which returns TRUE/FALSE results instead of the 1/0 that you want:

library(data.table) ## >= 1.9.2
setDT(DF)[, list(HouseIncome = sum(WorkingStatus) > 0), 
                       by = list(ProvinceID,CityID,House)]

#     ProvinceID CityID House HouseIncome
#  1:         10  10001     1        TRUE
#  2:         10  10002     1       FALSE
#  3:         20  20001     1       FALSE
#  4:         20  20002     1        TRUE
#  5:         20  20002     2        TRUE
#  6:         30  30001     1       FALSE
#  7:         30  30001     2       FALSE
#  8:         40  40001     1        TRUE
#  9:         40  40001     2       FALSE
# 10:         50  50001     1        TRUE
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow