Subsetting panel data

https://stackoverflow.com/questions/10202638

01-06-2021

Question

I'm very new to this, so let me know if this is asking too much. I am trying to subset panel data in R into two categories: one that has complete information for its variables and one that has incomplete information. My data looks like this:

Person     Year Income Age Sex
    1      2003  1500   15  1
    1      2004  1700   16  1
    1      2005  2000   17  1
    2      2003  1400   25  0
    2      2004  1900   26  0
    2      2005  2000   27  0

What I need to do is go through each column (except columns 1 and 2) and, if the data is complete for a variable, return it to one data set; otherwise, put it in a different data set. A variable is defined by the id in the first column plus the column name: for instance, the variable person1Income is the first three rows of column three. Here is my pseudocode and an example of what it should do given the above data.

for (each variable in all columns except 1 and 2 in data set)
    if (variable is FULL) { return to data set "completes" }
    else { put in data set "incompletes" }
completes = person1Income, person2Income, person1Age, person2Age, person1Sex, person2Sex
incompletes = {empty because the above info is full}

I understand if someone can't answer this question completely, but any help is appreciated. Also if my goal is not clear, let me know and I will try to clarify.

tl;dr I can't yet explain it in one sentence so...sorry.

Edit: a visualization of what I mean by complete and incomplete variables was attached as a screenshot.

Solution

Using your picture, here's a stab at what you want. It may be long-winded and others may have a more elegant way of doing it, but it gets the job done:

library("reshape2")

con <- textConnection("Person Year Income Age Sex
  1      2003  1500   15  1
  1      2004  1700   16  1
  1      2005  2000   17  1
  2      2003  1400   25  0
  2      2004  1900   NA  0
  2      2005  2000   27  0
  3      2003  NA   25  0
  3      2004  1900   NA  0
  3      2005  2000   27  0")
pnls <- read.table(con, header=TRUE)

# reformat table for easier processing
pnls2 <- melt(pnls, id.vars = "Person")
# and select those rows that relate to values
# of income and age
pnls2 <- subset(pnls2,
              variable == "Income" | variable == "Age")

# create column of names in desired format (e.g Person1Age etc)
pnls2$name <- paste("Person", pnls2$Person, pnls2$variable, sep="")

# Collect full set of unique names
name.set <- unique(pnls2$name)
# find the incomplete set
incomplete <- unique( pnls2$name[ is.na(pnls2$value) ]) 
# then find the complement of the incomplete set
complete <- setdiff(name.set, incomplete) 

# These two now contain list of complete and incomplete variables
complete
incomplete

If you are not familiar with melting and the reshape2 package, you may want to run it line by line, and examine the value of pnls2 at different stages to see how this works.
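
For readers who would rather not install reshape2, the same completeness check can be sketched in base R. This is not the answer's method but a cross-check under the same toy data; the names `vars`, `has.na` and `var.names` are illustrative assumptions.

```r
# Same toy data as in the answer above
pnls <- data.frame(
  Person = rep(1:3, each = 3),
  Year   = rep(2003:2005, times = 3),
  Income = c(1500, 1700, 2000, 1400, 1900, 2000, NA, 1900, 2000),
  Age    = c(15, 16, 17, 25, NA, 27, 25, NA, 27),
  Sex    = c(1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Columns to check (the answer above also restricts itself to these two)
vars <- c("Income", "Age")

# Person-by-variable logical matrix: TRUE if any observation is missing
has.na <- sapply(vars, function(v) tapply(is.na(pnls[[v]]), pnls$Person, any))

# Matching matrix of labels such as "Person1Income"
var.names <- outer(rownames(has.na), colnames(has.na),
                   function(p, v) paste0("Person", p, v))

complete   <- var.names[!has.na]
incomplete <- var.names[has.na]
```

Both `complete` and `incomplete` end up as plain character vectors, so they can be used as keys in the same way as the vectors produced by the reshape2 version.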

EDIT: adding code to compile the values, as requested by @bstockton. I am sure there is a much more appropriate R idiom for this but, once again, in the absence of better answers, this works:

# use these lists of complete and incomplete variable names
# as keys to collect lists of values for each variable name
compile <- function(keys) {
    holder <- list()
    for (n in keys) {
        # values of the melted table that belong to variable name n
        holder[[ n ]] <- pnls2$value[ pnls2$name == n ]
    }
    return( as.data.frame(holder) )
}

complete.recs <- compile(complete)
incomplete.recs <- compile(incomplete)
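
One possible alternative to the loop is `split`, which groups the values by name in a single call. The sketch below uses a small hypothetical long-format table `long` in place of `pnls2` so that it runs on its own; `by.name` is likewise an illustrative name.

```r
# Hypothetical stand-in for the melted table: one row per observation
long <- data.frame(
  name  = rep(c("Person1Income", "Person1Age", "Person2Age"), each = 3),
  value = c(1500, 1700, 2000, 15, 16, 17, 25, NA, 27)
)
complete   <- c("Person1Income", "Person1Age")
incomplete <- "Person2Age"

# Named list with one vector of values per variable name
by.name <- split(long$value, long$name)

# Pick the entries by name and bind them as columns
complete.recs   <- as.data.frame(by.name[complete])
incomplete.recs <- as.data.frame(by.name[incomplete])
```

As with the loop version, `as.data.frame` only works here because every variable has the same number of observations.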

Other answers

Let's assume the data is in a data.frame named dfrm:

completes <- dfrm[ complete.cases(dfrm[-(1:2)]) ,]
incompletes <- dfrm[ !complete.cases(dfrm[-(1:2)]) ,]
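
As a small illustration of what this row-wise split does, here is a sketch on made-up data with a single missing income. Note that `complete.cases` classifies whole rows, not person-by-variable blocks; the helper name `ok` is an assumption.

```r
# Illustrative data: person 1 has no income recorded for 2004
dfrm <- data.frame(
  Person = c(1L, 1L, 1L, 2L, 2L, 2L),
  Year   = c(2003L, 2004L, 2005L, 2003L, 2004L, 2005L),
  Income = c(1500L, NA, 2000L, 1400L, 1900L, 2000L),
  Age    = c(15L, 16L, 17L, 25L, 26L, 27L),
  Sex    = c(1L, 1L, 1L, 0L, 0L, 0L)
)

# TRUE for rows with no NA outside the first two (id) columns
ok <- complete.cases(dfrm[-(1:2)])

completes   <- dfrm[ok, ]    # five fully observed rows
incompletes <- dfrm[!ok, ]   # the single row for person 1 in 2004
```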

Thanks to @WojciechSobala for noticing my missing parentheses. To identify which columns the missing values are in, one could build a list. Listing the ids is simple, and identifying which columns have missing values is also fairly easy, but I have no idea what you mean by "the values in that column that correspond to the id variable", since they are all NA. For the identification step, you can use:

apply(incompletes, 1, function(x) c(x[1], x[2], which(is.na(x[-(1:2)]))))
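
If column names are easier to read than positions, a variant along the following lines might help. It is only a sketch: the two-row `incompletes` table is made up for illustration, and `value.cols` and `missing.by.row` are hypothetical names.

```r
# Made-up incomplete rows: one with a single NA, one with two
incompletes <- data.frame(
  Person = c(1L, 3L),
  Year   = c(2004L, 2003L),
  Income = c(NA_integer_, NA_integer_),
  Age    = c(16L, NA),
  Sex    = c(1L, 0L)
)

# Every column except the two id columns
value.cols <- names(incompletes)[-(1:2)]

# One list entry per row: the ids plus the names of the missing columns
missing.by.row <- lapply(seq_len(nrow(incompletes)), function(i) {
  row.values <- unlist(incompletes[i, value.cols])
  list(Person  = incompletes$Person[i],
       Year    = incompletes$Year[i],
       missing = value.cols[is.na(row.values)])
})
```

Returning a list avoids the situation where `apply` produces a matrix for some inputs and a list for others, depending on whether every row has the same number of missing values.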

I see now what you are asking. I don't have a solution yet but let me show you a couple of R functions that might help when it comes to enumerating and working with the categories that are formed by cross-classifying on two column values:

dat <- structure(list(Person = c(1L, 1L, 1L, 2L, 2L, 2L), Year = c(2003L, 
2004L, 2005L, 2003L, 2004L, 2005L), Income = c(1500L, NA, 2000L, 
1400L, 1900L, 2000L), Age = c(15L, 16L, 17L, 25L, 26L, 27L), 
    Sex = c(1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("Person", "Year", 
"Income", "Age", "Sex"), row.names = c(NA, -6L), class = "data.frame")

completes <- lapply(split(dat[ , 3:5], dat$Person), function(x) {
    # keep a column's values only if none of them is missing
    sapply(x, function(y) if (all(!is.na(y))) y else NA)
})

$`1`
$`1`$Income
[1] NA

$`1`$Age
[1] 15 16 17

$`1`$Sex
[1] 1 1 1


$`2`
     Income Age Sex
[1,]   1400  25   0
[2,]   1900  26   0
[3,]   2000  27   0

incompletes <- lapply(split(dat[ , 3:5], dat$Person), function(x) {
    # keep a column's values only if at least one of them is missing
    sapply(x, function(y) if (!all(!is.na(y))) y else NA)
})

$`1`
$`1`$Income
[1] 1500   NA 2000

$`1`$Age
[1] NA

$`1`$Sex
[1] NA


$`2`
Income    Age    Sex 
    NA     NA     NA
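
Building on the output above, one way to get the two flat collections the question asks for, keyed by names such as person1Income, is to build every person-by-variable block first and then partition the blocks. This is a sketch under that reading of the question, not a finished solution; `value.cols`, `blocks` and `is.full` are illustrative names.

```r
# Same data as in the dput output above
dat <- data.frame(
  Person = c(1L, 1L, 1L, 2L, 2L, 2L),
  Year   = c(2003L, 2004L, 2005L, 2003L, 2004L, 2005L),
  Income = c(1500L, NA, 2000L, 1400L, 1900L, 2000L),
  Age    = c(15L, 16L, 17L, 25L, 26L, 27L),
  Sex    = c(1L, 1L, 1L, 0L, 0L, 0L)
)
value.cols <- c("Income", "Age", "Sex")

# One named vector per person/variable pair, e.g. blocks$person1Income
blocks <- list()
for (p in unique(dat$Person)) {
  for (v in value.cols) {
    blocks[[paste0("person", p, v)]] <- dat[dat$Person == p, v]
  }
}

# A block is "full" when it contains no NA
is.full <- vapply(blocks, function(x) !anyNA(x), logical(1))

completes   <- blocks[is.full]
incompletes <- blocks[!is.full]
```

Keeping the blocks in plain named lists sidesteps the mixed list/matrix results that `sapply` returns above when only some columns of a person are complete.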

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow