Question

I have been searching Stack Overflow for hours, hoping to find something I guessed was self-evident but that nobody seems to have asked (which might mean it is indeed self-evident).

I want to use tapply or by to find the first time a specific event occurs in a data frame (the first non-zero value). The way I did this before was via

max.col(df, ties.method = c("first"))

But somehow this does not work when used in conjunction with either tapply or by. Here is some example data:

FIRM<-as.vector(sample(c("a","b","c","d"),100,replace=T))
MOMENT<-as.vector(sample((1990:1995),100,replace=T))
EVENT<-as.vector(sample(c("x12","x43","x35","y71","y81","xy1","xy67","yy123","xx901"),100,replace=T))
OCCURENCE<-as.vector(sample(c(0,1),100,replace=T))
m<-as.data.frame(cbind(FIRM,MOMENT,EVENT,OCCURENCE))

So here is what I tried that did not work:

  1. tapply(m[,4],m[,3],max.col)
     # This gives just 1s for every EVENT, with the length of the resulting vector equal to the number of EVENTs mentioned in the dataset

  2. tapply(m[,4],m[,3],max.col(m, ties.method=c("first")))
     # Error in match.fun(FUN) : 'max.col(m, ties.method = c("first"))' is not a function, character or symbol
     # In addition: Warning message: In max.col(m, ties.method = c("first")) : NAs introduced by coercion

Number 2 is really the crux of the problem. For reasons unclear to me, max.col is not recognised as a function once I change the tie-breaking method from the default ("random") to the one I need ("first").

Additionally, I want to be able to find the year in which the first non-zero value occurs. I think a sensible alternative would be to multiply the MOMENT column by the OCCURENCE column (call the result ID), look for the first non-zero value in ID for each level of EVENT, keep that ID value, and turn the other values into zero:

m$MOMENT<-as.numeric(as.character(m$MOMENT))
m$OCCURENCE<-as.numeric(as.character(m$OCCURENCE))    
m[,"ID"]<-m$MOMENT * m$OCCURENCE

I have tried to code this with a function containing a while and an if statement and using break, but it does not work:

tapply(m$ID,m$EVENT, function(x) m$ID[i]<- while (m$ID[i] == 0) {m$ID[i]
                  if (m$ID[i]>0) {m$YEAR[i] && break }})

The idea here was to iterate the function over EVENT while m$ID == 0 and then to change the value and break once m$ID > 0. Didn't work...

Any ideas on how to fix this (or much simpler solutions)?


Solution

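The FUN argument of tapply must be a function, but the code in the question supplies a call, max.col(m, ties.method = c("first")). R evaluates that call before tapply ever sees it, so FUN receives its result rather than a function. Pass the extra argument after FUN instead: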

tapply(m[,4], m[,3], max.col, ties.method =  "first")
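To illustrate the difference on a small scale, here is a sketch using made-up toy vectors (occ and event are not from the question). Extra arguments placed after FUN are forwarded to it, and wrapping the call in an anonymous function is an equivalent alternative. Note that max.col treats each group's vector as a one-column matrix, so the call runs without error but can only return 1s.

```r
# Toy vectors, made up for illustration only
occ   <- c(0, 1, 0, 1, 1, 0)
event <- c("x", "x", "y", "y", "y", "z")

# Extra arguments after FUN are forwarded to FUN via ...
res1 <- tapply(occ, event, max.col, ties.method = "first")

# Equivalent: wrap the call in an anonymous function
res2 <- tapply(occ, event, function(v) max.col(v, ties.method = "first"))

# Each group is a plain vector, which max.col treats as a one-column
# matrix, so every entry of the result is 1
print(res1)
```

So the error goes away, but max.col on a single column is probably not what is wanted here.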

The following builds a logical indicator marking, for each EVENT, the first row that has 1 in the OCCURENCE column; the last line then selects those rows:

o <- order(m$EVENT, m$MOMENT) # omit this and next line if already ordered
m <- m[o,]

is.first <- ave(m$OCCURENCE == 1, m$EVENT, FUN = function(x) x & !duplicated(x))
m[is.first, ]

REVISED

  • Ordered by event and year.

  • Note that if it's possible for an event to have only zeros, such events will be omitted entirely from m[is.first, ].
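Here is a small worked sketch of the ave approach on hand-made data (the six rows below are invented for illustration, not taken from the question), so the result is reproducible. Event "z" has only zeros and therefore drops out.

```r
# Hand-made example data, deliberately not sorted by year
m <- data.frame(
  EVENT     = c("x", "x", "x", "y", "y", "z"),
  MOMENT    = c(1992, 1990, 1991, 1993, 1990, 1994),
  OCCURENCE = c(1, 0, 1, 1, 1, 0)
)

# Sort by event, then by year, so "first" means "earliest"
m <- m[order(m$EVENT, m$MOMENT), ]

# Within each event: TRUE only at the first row where OCCURENCE is 1.
# duplicated() is FALSE at the first TRUE and at the first FALSE, and
# the "x &" part discards the first FALSE.
is.first <- ave(m$OCCURENCE == 1, m$EVENT,
                FUN = function(x) x & !duplicated(x))

first.rows <- m[is.first, ]
print(first.rows)
```

In this sketch first.rows holds one row for "x" (year 1991) and one for "y" (year 1990), and nothing for "z".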

OTHER TIPS

I'm not quite sure what you are trying to achieve, so here is only some coding advice.

First of all, you need to read help("tapply") to learn how to pass arguments to the function that is passed to tapply:

tapply(m[,4],m[,3],max.col, ties.method="first")

However, I doubt this does what you need. Maybe something like this would be useful:

m<-data.frame(FIRM,MOMENT,EVENT,OCCURENCE)
#note how I create the data.frame in a different way 
#in order to avoid coercing all columns to factors
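To make the coercion visible, here is a minimal sketch with two made-up vectors (firm and year are illustrative names only). cbind on plain vectors builds a matrix, and a matrix can hold only one type, so the numbers stop being numeric before the data frame is created.

```r
# Two made-up vectors of different types
firm <- c("a", "b")
year <- c(1990, 1991)

# cbind() builds a character matrix first, so year is no longer numeric
# (it ends up as character or factor, depending on the R version)
via.cbind <- as.data.frame(cbind(firm, year))

# data.frame() keeps each column's own type
direct <- data.frame(firm, year)

print(sapply(via.cbind, class))
print(sapply(direct, class))
```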


tapply(m[,4],m[,3],which.max)
#  x12   x35   x43 xx901   xy1  xy67   y71   y81 yy123 
#    2     1     2     3     1     1     3     1     1 

tapply(m[,4],m[,3],function(x) m[which.max(x), "MOMENT"])
#  x12   x35   x43 xx901   xy1  xy67   y71   y81 yy123 
# 1995  1995  1995  1991  1995  1995  1991  1995  1995 
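One caveat, sketched on made-up data: which.max(x) returns a position within the group, whereas m[..., "MOMENT"] indexes rows of the whole data frame, so the two only line up by chance. which.max also returns 1 for a group that contains no 1s at all. A possible alternative, assuming the goal is the earliest MOMENT with OCCURENCE equal to 1 for each EVENT, is to split the data frame by event first. The names mixed and first.year are just illustrative.

```r
# Hand-made example data
m <- data.frame(
  EVENT     = c("x", "x", "x", "y", "y", "z"),
  MOMENT    = c(1992, 1990, 1991, 1993, 1990, 1994),
  OCCURENCE = c(1, 0, 1, 1, 1, 0)
)

# Within-group position used as a row number of the whole data frame:
# for "y" this picks row 1 of m, which belongs to event "x"
mixed <- tapply(m$OCCURENCE, m$EVENT,
                function(x) m[which.max(x), "MOMENT"])

# Split the rows by event, then take the earliest year that has a 1;
# events without any 1 get NA
first.year <- sapply(split(m, m$EVENT), function(d) {
  hit <- d$MOMENT[d$OCCURENCE == 1]
  if (length(hit) > 0) min(hit) else NA_real_
})

print(mixed)
print(first.year)
```

Splitting the data frame keeps MOMENT and OCCURENCE aligned within each group, and using min() means the rows do not need to be sorted beforehand.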
Licensed under: CC-BY-SA with attribution