Extract only first line in a data frame from several subgroups that satisfy a conditional

StackOverflow https://stackoverflow.com/questions/22428220

  •  15-06-2023

Question

I have a data frame similar to the dummy example here:

df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))

In the original data frame there are many more groups, each with 10 values. For each group (a, b or c) I would like to extract the first row where Value is not NA, and only that first row. Because a group can contain several non-NA values that differ from each other, I can't simply subset.

I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:

ddply(df, .(Group), function(sub_data) {
    for (i in seq_along(sub_data$Value)) {
        if (sub_data$Value[i] != 'NA') {
            # take this value, but only for the first non-NA
            # return(<first row that satisfies the condition>)
        }
    }
})

Maybe this is easy with other strategies that I don't know of. Any suggestion is very much appreciated!


Solution 2

Since you suggested plyr in the first place:

library(plyr)
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)

That assumes the missing values are real NAs and not the string 'NA'. If the latter (not recommended), then:

ddply(subset(df, Value != 'NA'), .(Group), head, 1L)

Note how concise this is. I would agree with using plyr.
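
The two variants above differ only in how missing values are stored. A minimal base R sketch of turning the 'NA' strings into real NAs, so that the is.na() form applies; the final numeric conversion assumes every remaining entry is a number, as in the dummy data:

```r
# Dummy data from the question, with missing values stored as the string 'NA'.
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c('NA', 'NA', '10', 'NA', '4', '8', 'NA', 'NA', '2'),
                 stringsAsFactors = FALSE)

# The string 'NA' is ordinary text, so is.na() does not flag it.
is.na('NA')   # FALSE

# Replace the placeholder strings with real missing values ...
df$Value[df$Value == 'NA'] <- NA

# ... and, assuming the remaining entries are all numbers, convert the column.
df$Value <- as.numeric(df$Value)
```

After this step, !is.na(Value) selects the same rows that Value != 'NA' selected before.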

Other suggestions

I know this has been answered, but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:

library(data.table)
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))

> df[ Value != "NA", .SD[1], by=Group ]
    Group Value
 1:     a    10
 2:     b     4
 3:     c     2

Do yourself a favor and learn data.table.

Some other notes:

  • You can easily convert data.frames to data.tables
  • I think that you don't want "NA" but simply NA in your example, in that case the syntax is:

    df[ ! is.na(Value), .SD[1], by=Group ]

If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:

df <- data.frame(Group=rep(letters[1:3], each=3),
                 Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))

print(df)

##   Group Value
## 1     a  <NA>
## 2     a  <NA>
## 3     a    10
## 4     b  <NA>
## 5     b     4
## 6     b     8
## 7     c  <NA>
## 8     c  <NA>
## 9     c     2

df.1 <- by(df, df$Group, function(x) {
  head(x[complete.cases(x),], 1)
})

print(df.1)

## df$Group: a
##   Group Value
## 3     a    10
## ------------------------------------------------------------------------ 
## df$Group: b
##   Group Value
## 5     b     4
## ------------------------------------------------------------------------ 
## df$Group: c
##   Group Value
## 9     c     2

First you should take care of NA's:

options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))

And then maybe something like this would do the trick:

for(i in unique(df$Group)) {
  for(j in df$Value[df$Group==i]) {
    if(!is.na(j)) {
      print(paste(i,j))
      break
    }
  }
}
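
The loop above prints each group's first non-missing value. A sketch of a vectorised alternative that returns the matching rows as a data frame instead; it swaps the loop for base R's duplicated() and assumes the rows are already in the order you care about:

```r
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c(NA, NA, '10', NA, '4', '8', NA, NA, '2'),
                 stringsAsFactors = FALSE)

# Keep only rows with a non-missing Value; the original row order is preserved.
non_missing <- df[!is.na(df$Value), ]

# duplicated() is FALSE the first time each Group appears, so negating it
# selects the first remaining row of every group.
first_rows <- non_missing[!duplicated(non_missing$Group), ]
print(first_rows)
```

A group whose values are all missing simply does not appear in first_rows.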

Assuming that Value is actually numeric, not character:

> df <- data.frame(Group=rep(letters[1:3],each=3),
                   Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2))

> do.call(rbind, lapply(split(df, df$Group), function(x){
      x[ is.na(x[,2]) == FALSE, ][1,]
      }))

##   Group Value
## a     a    10
## b     b     4
## c     c     2

I don't see any solutions using aggregate(...), which would be the simplest:

df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
#   Group Value
# 1     a    10
# 2     b     4
# 3     c     2

If your df contains actual NA, and not "NA" as in your example, then use this:

df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
#   Group Value
# 1     a    10
# 2     b     4
# 3     c     2

Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.

Similar in spirit to @hrbrmstr's by() but to my eyes aggregate() gives nicer output:

> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
  Group Value
1     a    10
2     b     4
3     c     2
> aggregate(df$Value, list(Group = df$Group), foo)
  Group  x
1     a 10
2     b  4
3     c  2
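
A related point, sketched here assuming Value is numeric with real NAs: the formula interface of aggregate() uses na.action = na.omit by default, so rows with a missing Value are dropped before the function is applied, and plain head() is enough in that form.

```r
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c(NA, NA, 10, NA, 4, 8, NA, NA, 2))

# The formula method's default na.action = na.omit removes rows with a
# missing Value before grouping, so head(x, n = 1) sees only non-missing values.
res <- aggregate(Value ~ Group, data = df, FUN = head, n = 1)
print(res)
```

The non-formula call, aggregate(df$Value, list(Group = df$Group), ...), does not drop missing values on its own, which is why a helper that filters them is still needed there.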

Licensed under: CC-BY-SA with attribution