Question

I'm analyzing response time (RT) data from experiments in which each person completes a number of trials of various trial types. Only RTs from correct trials are analyzed, so the number of RTs per trial type per subject differs. I'm trying to create an outlier function that applies a standard deviation cutoff that depends on the number of trials being analyzed (Van Selst & Jolicoeur, 1994). For example, if the first subject has 100 trials of trial type A, I want to calculate the mean and standard deviation of that subject's A trials, then apply the cutoff: trials whose RT deviates from the mean by more than the indicated number of standard deviations are scored 0.

The standard deviation cutoffs I'd like to use are listed below:

#n = # of trials
if n < 4 then SDout=3
if n == 4 then SDout=1.458
if n == 5 then SDout=1.68
if n == 6 then SDout=1.841
if n == 7 then SDout=1.961
if n == 8 then SDout=2.050
if n == 9 then SDout=2.12
if n == 10 then SDout=2.173
if n == 11 then SDout=2.22
if n == 12 then SDout=2.246
if n == 13 then SDout=2.274
if n == 14 then SDout=2.31
if n >= 15 and n < 20 then SDout=2.326
if n >= 20 and n < 25 then SDout=2.391
if n >= 25 and n < 30 then SDout=2.41
if n >= 30 and n < 35 then SDout=2.4305
if n >= 35 and n < 50 then SDout=2.45
if n >= 50 and n < 100 then SDout=2.48
if n >= 100 then SDout=2.5

My data has 3 columns: id (subject identifier), ttype (trial type), and RT.

In essence, what I need this function to do is: get the RT mean, SD, and number of trials for each subject for each trial type, then test each RT against mean ± SDout × SD. Finally, I'd like the function to create a new column where outlying trials are scored 0 and "good" trials are scored 1.

One way I can think to implement this is to use nested loops with trial types being nested within subjects. However, writing this function is beyond my skill level, so I’m asking for help with creating it. If anyone has suggestions or tips, or non-loopish ways to accomplish this I’d be very appreciative.

Thanks


Solution 2

If you're really hell bent on doing this...

# This function does the basic outlier rejection
vjout <- function(x) {
    n <- length(na.omit(x))
    if (n < 2) return(x)  # sd() needs at least two trials
    # Criteria from Van Selst & Jolicoeur (1994); for n < 4 the cutoff is 3
    CriterionSD <- c(3, 3, 3, 1.458, 1.68, 1.841, 1.961, 2.05, 2.12,
                     2.173, 2.22, 2.246, 2.274, 2.31, 2.326)
    crit <- if (n <= 15) CriterionSD[n] else
            if (n < 20) 2.326 else
            if (n < 25) 2.391 else
            if (n < 30) 2.41 else
            if (n < 35) 2.4305 else
            if (n < 50) 2.45 else
            if (n < 100) 2.48 else 2.5
    m <- mean(x, na.rm = TRUE)
    # guard against NAs in the logical index before assigning
    bad <- !is.na(x) & abs(x - m) > crit * sd(x, na.rm = TRUE)
    x[bad] <- NA
    x
}

# To generate the new column, apply vjout within every subject-by-condition
# cell.  ave() returns a vector the same length as rt, with trimmed trials
# set to NA; you then keep the rows that are not NA in the new column.
vjoutlier <- function(rt, subj, cond) {
    ave(rt, subj, cond, FUN = vjout)
}
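As a sketch of how the two functions above could be used end to end: the data frame `rtdata` below is entirely made up (though its columns match the ones the question describes), and `vjout()` is repeated so the snippet runs on its own.

```r
# vjout() from above, repeated here so the snippet is self-contained
vjout <- function(x) {
    n <- length(na.omit(x))
    if (n < 2) return(x)
    CriterionSD <- c(3, 3, 3, 1.458, 1.68, 1.841, 1.961, 2.05, 2.12,
                     2.173, 2.22, 2.246, 2.274, 2.31, 2.326)
    crit <- if (n <= 15) CriterionSD[n] else
            if (n < 20) 2.326 else
            if (n < 25) 2.391 else
            if (n < 30) 2.41 else
            if (n < 35) 2.4305 else
            if (n < 50) 2.45 else
            if (n < 100) 2.48 else 2.5
    bad <- !is.na(x) & abs(x - mean(x, na.rm = TRUE)) > crit * sd(x, na.rm = TRUE)
    x[bad] <- NA
    x
}

# Hypothetical data in the question's three-column format
set.seed(1)
rtdata <- data.frame(
    id    = rep(1:2, each = 40),
    ttype = rep(rep(c("A", "B"), each = 20), times = 2),
    RT    = c(rnorm(79, mean = 500, sd = 50), 2000)  # one planted slow trial
)

# Trim within each subject-by-trial-type cell, then score trials
# 0 (outlier) or 1 (good), as the question asks
trimmed     <- with(rtdata, ave(RT, id, ttype, FUN = vjout))
rtdata$good <- as.integer(!is.na(trimmed))
```

Note that `ave(rt, subj, cond, FUN = vjout)` groups by the interaction of all the grouping arguments, so `subj` and `cond` don't need to be pre-combined into a single factor.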

Other tips

So you're looking to do a Van Selst and Jolicoeur (1994) type of outlier check. There has been a lot of work on RT since then, ranging from a strong argument that any such outlier rejection is untenable (e.g., Ulrich & Miller, 1994) to suggestions for correcting the issues in other ways, such as transforming the distribution. Some also suggest analyzing the distributions under an ex-Gaussian hypothesis and ascribing different meanings to what happens in the normal and exponential parts of the distribution.

In general, any kind of data you collect will have values beyond a couple of standard deviations from time to time, and the number you see will be exaggerated in a skewed distribution like RT. Outlier removal like the one you're attempting typically discards about 3% of the values, and about 3% is what's expected for RTs with those SD cutoffs (I think it was a Miller paper that showed that). So you aren't actually removing outliers but real data that's part of the distribution of responses.

May I suggest that you don't do this. You have two issues in the RT data. One is that you can have genuinely problematic outliers. The other is that the distribution is skewed (which washes out of the mean RTs, given large enough numbers of RTs per subject, due to the central limit theorem). Correcting the latter with outlier rejection causes lots of problems; correcting the former requires outlier rejection techniques that help identify the genuine outliers.

Typically you'll also have accuracy measures. Accuracy as a function of reaction time has a characteristic pattern: it rises very quickly, stays high for a period, and falls off at a later point (even if the stimulus is constantly available). You can use an analysis of this function to get rid of RTs that are not outliers per se but that don't reflect what you wish to analyze. The early RTs below a certain accuracy threshold won't actually be responses to the stimulus; they'll be guesses and anticipations. The later RTs, above the accuracy threshold but after accuracy starts to fall off, won't reflect a response to the onset of the stimulus but a decision made at a later time. Typically both of these will be a small number of RTs (although object identification has a surprising pattern).

(There are of course many cases where you have to vary these types of assessments. If it's a response-compatibility task, early RTs may be the only thing driving your effect. If the task has very large effects, as in search functions, such an analysis might be untenable. And, of course, if you don't have an accuracy measure it's difficult. Consider different methods in those different cases. But don't just go blindly tossing RTs because they have a large z-score.)
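To make the accuracy-by-latency idea concrete, here is a minimal sketch. Everything in it is simulated: the gamma-shaped latencies, the piecewise accuracy function, and the choice of decile bins are all illustrative assumptions, not part of the answer.

```r
set.seed(2)
n  <- 1000
rt <- rgamma(n, shape = 4, scale = 120)  # skewed, RT-like latencies (ms)

# Simulated accuracy: poor for very fast responses (guesses/anticipations),
# high in the middle, lower again for very slow responses
p_correct <- ifelse(rt < 200, 0.5, ifelse(rt > 900, 0.75, 0.95))
correct   <- as.integer(runif(n) < p_correct)

# Bin latencies into deciles and compute accuracy per bin
bins       <- cut(rt, breaks = quantile(rt, probs = seq(0, 1, 0.1)),
                  include.lowest = TRUE)
acc_by_bin <- tapply(correct, bins, mean)

# Trials in bins below some accuracy threshold (say 0.85) at the fast and
# slow ends are the candidates to set aside
round(acc_by_bin, 2)
```

With real data you would replace the simulated `rt` and `correct` with your measured latencies and a 0/1 accuracy column, and look for the rise-plateau-fall shape in `acc_by_bin` rather than relying on a z-score cutoff.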

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow