So you're looking to do a Van Selst and Jolicoeur (1994) style outlier check. There has been a lot of work on RT analysis since then, ranging from strong arguments that any such outlier rejection is untenable (e.g. Ulrich & Miller, 1994) to suggestions for correcting the problems in other ways, such as transforming the distribution. Some also suggest analyzing the distributions under an ex-Gaussian hypothesis and ascribing different meanings to what happens in the normal and exponential parts of the distribution.
In general, any kind of data you collect will have values beyond a couple of standard deviations from time to time, and the number you see will be exaggerated in a skewed distribution like RT. Outlier removal of the kind you're attempting usually discards about 3% of the values; and about 3% is what's expected for RTs with those SD cutoffs (I believe it was a Miller paper that showed that). So you aren't actually removing outliers but real data that's part of the distribution of responses.
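To see why an SD cutoff mostly removes legitimate data, here's a small simulation (a sketch only; the ex-Gaussian parameters are made up for illustration) that draws skewed RTs and applies a 2.5 SD criterion:

```python
import random
import statistics

random.seed(1)

# Simulate skewed RTs from an ex-Gaussian: a normal component plus an
# exponential tail. Parameters are invented purely for illustration.
mu, sigma, tau = 500.0, 50.0, 100.0
rts = [random.gauss(mu, sigma) + random.expovariate(1.0 / tau)
       for _ in range(10_000)]

m = statistics.mean(rts)
sd = statistics.stdev(rts)
cutoff = 2.5  # a common SD criterion

removed = [rt for rt in rts if abs(rt - m) > cutoff * sd]
upper = sum(rt > m for rt in removed)
print(f"removed {len(removed) / len(rts):.1%} of trials, "
      f"{upper} of {len(removed)} from the slow tail")
```

A few percent of trials get flagged, and nearly all of them come from the slow tail, i.e. from the ordinary skew of the distribution rather than from any separate contaminating process.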
May I suggest that you don't do this? You have two issues with RTs. One is that you can have genuinely problematic outliers. The other is that the distribution is skewed (a skew that washes out of mean RTs when you have enough trials per subject, thanks to the central limit theorem). Correcting the latter with outlier rejection causes lots of problems. Correcting the former requires techniques that actually identify the genuine outliers.
Typically you'll also have accuracy measures. Accuracy as a function of reaction time (the conditional accuracy function) has a characteristic pattern: accuracy rises very quickly at short RTs, stays high for a period, and falls off at longer RTs (even if the stimulus is constantly available). You can use an analysis of this function to remove trials that are not outliers per se but that don't reflect what you wish to analyze. The early RTs below a certain accuracy threshold aren't actually responses to the stimulus; they're guesses and anticipations. The late RTs, after accuracy starts to fall off, don't reflect a response to the onset of the stimulus but a decision made at a later time. Typically both of these will be a small number of trials (although object identification shows a surprising pattern).
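One way to sketch that idea in code: estimate the conditional accuracy function from equal-count RT bins and drop the bins that fall below an accuracy criterion. Everything here is hypothetical (the simulated trial mixture, the 10-bin choice, the 0.90 threshold); it only illustrates the shape of the procedure, not a recommended recipe:

```python
import random

random.seed(2)

# Hypothetical trial generator returning (rt_ms, correct). A small
# fraction of trials are fast guesses at chance accuracy, another small
# fraction are slow off-stimulus decisions with degraded accuracy; the
# rest are ordinary responses. All numbers are made up for illustration.
def simulate_trial():
    r = random.random()
    if r < 0.03:   # anticipations / guesses
        return random.uniform(100, 250), random.random() < 0.5
    if r < 0.08:   # late decisions
        return random.uniform(1200, 2000), random.random() < 0.7
    rt = random.gauss(500, 60) + random.expovariate(1.0 / 80)
    return rt, random.random() < 0.95

trials = sorted((simulate_trial() for _ in range(5000)), key=lambda t: t[0])

# Conditional accuracy function: accuracy within equal-count RT bins.
n_bins = 10
bin_size = len(trials) // n_bins
bin_acc = [sum(c for _, c in trials[i * bin_size:(i + 1) * bin_size]) / bin_size
           for i in range(n_bins)]

# Keep only trials from bins that meet the accuracy criterion.
threshold = 0.90
kept = [t for i, t in enumerate(trials)
        if bin_acc[min(i // bin_size, n_bins - 1)] >= threshold]
print(f"kept {len(kept)} of {len(trials)} trials")
```

In a mixture like this, the fastest and slowest bins should fail the criterion, taking the guesses and late decisions with them. With real data you would inspect the function directly and choose the threshold from the task, not from a fixed number.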
(There are, of course, many cases where you have to vary these assessments. If it's a response compatibility task, the early RTs may be the only thing driving your effect. If the task has very large effects, as in search functions, such an analysis might be untenable. And, of course, if you don't have an accuracy measure it's difficult. Consider different methods in those different cases. But don't just go blindly tossing RTs because they have a large z-score.)