The following code assumes your data frame is called dat. The code adds a new variable, difficulty.scaled, equal to the deviation of difficulty from 1.5, then groups the data by values of X and Y, and then selects the observations within each group with absolute value of difficulty.scaled closest to 0 (i.e., difficulty closest to 1.5).
You can adjust the probs argument to the quantile function to select whatever percentage of each subgroup you want. In this case, I've selected 50% of the rows in each subgroup (that is, 50% of the rows representing each combination of X and Y).
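library(dplyr) # Install the dplyr package if you don't already have it
# Chained with %>%, which replaces dplyr's older %.% operator
dat2 = dat %>%
  mutate(difficulty.scaled = difficulty - 1.5) %>%  # deviation from 1.5
  group_by(X, Y) %>%                                # one group per X/Y combination
  filter(abs(difficulty.scaled) < quantile(abs(difficulty.scaled), .5))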
For the data you pasted in above (where I've converted the trial number to a variable), here's the output:
     tnum difficulty X      Y difficulty.scaled
1  trial2        1.4 1   male              -0.1
2  trial4        1.5 1 female               0.0
3  trial6        1.2 2   male              -0.3
4  trial8        1.6 2 female               0.1
5 trial10        1.4 3   male              -0.1
6 trial12        1.5 3 female               0.0
7 trial14        1.2 4   male              -0.3
8 trial16        1.6 4 female               0.1
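To see how the probs argument changes the selection, here is a minimal sketch using base R only. The vector abs.dev is made-up data standing in for abs(difficulty.scaled) within a single X/Y subgroup; it is not taken from the question.

```r
# Hypothetical absolute deviations for one X/Y subgroup (made-up values)
abs.dev = c(0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9)

# Cutoffs for keeping the closest 50% and the closest 25%
cutoff.50 = quantile(abs.dev, 0.5)
cutoff.25 = quantile(abs.dev, 0.25)

# Same comparison as in the filter step: keep values strictly below the cutoff
kept.50 = abs.dev[abs.dev < cutoff.50]
kept.25 = abs.dev[abs.dev < cutoff.25]

length(kept.50)  # 4 of the 8 values
length(kept.25)  # 2 of the 8 values
```

Because the comparison is a strict <, values tied with the cutoff are dropped, so a subgroup with many ties can return somewhat fewer rows than the percentage you asked for.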
The data you provided has equal numbers of observations for each combination of X and Y. If your real data are unbalanced on these variables, then instead of selecting a percentage of the rows in each sub-group, you can select a specific number of rows. The code below selects the n rows with the lowest absolute value of difficulty.scaled in each sub-group. That way your subset will be balanced even if your full data set is not (as long as you have at least n rows of data for each combination of X and Y).
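n = 1  # number of rows to keep per X/Y combination
# Chained with %>%, which replaces dplyr's older %.% operator
dat2 = dat %>%
  mutate(difficulty.scaled = difficulty - 1.5) %>%
  group_by(X, Y) %>%
  filter(rank(abs(difficulty.scaled), ties.method = "first") <= n)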
Setting ties.method="first" ensures that exactly n rows will be returned, even if there is more than one row with the same absolute value of difficulty.scaled.
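As a small illustration of why the tie-breaking method matters, the sketch below compares rank's default (ties.method="average") with ties.method="first". The vector abs.dev is made-up data with two tied smallest values.

```r
# Hypothetical absolute deviations with a tie for the smallest value
abs.dev = c(0.1, 0.3, 0.1, 0.5)
n = 1

rank.avg   = rank(abs.dev)                         # 1.5 3.0 1.5 4.0
rank.first = rank(abs.dev, ties.method = "first")  # 1 3 2 4

# With the default, both tied rows get rank 1.5, so neither passes <= 1
sum(rank.avg <= n)    # 0 rows kept
# With "first", ties are broken by position, so exactly n rows pass
sum(rank.first <= n)  # 1 row kept
```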
Update: How to divide subsetted data into training and test sets.
Assuming dat2 is your balanced subset, you can divide it into training and test subsets as follows:
# Note that do() needs the %>% operator here; it does not work with %.%
train = dat2 %>%
  do(sample_n(., 10))  # 10 random rows from each X/Y sub-group
This will return 10 randomly sampled rows per sub-group. Just set this value to whatever number of rows per sub-group you want in your training sample. Notice that you don't need to group by X and Y to create the training sample. This is because when you created dat2, dplyr added grouping attributes to dat2 that dplyr continues to recognize. Run str(dat2) to see this.
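If str output is hard to read, a quicker check is sketched below. The data frame toy and its values are made up for illustration; is_grouped_df and group_vars are dplyr helpers available in recent versions of the package.

```r
library(dplyr)

# Hypothetical stand-in for dat (made-up values)
toy = data.frame(X = c(1, 1, 2, 2),
                 Y = c("male", "female", "male", "female"),
                 difficulty = c(1.4, 1.5, 1.2, 1.6))

# Grouping set by group_by() survives the filter step
toy2 = toy %>%
  group_by(X, Y) %>%
  filter(difficulty > 1)

is_grouped_df(toy2)  # TRUE
group_vars(toy2)     # "X" "Y"
```

If you later want an operation to run on the whole data frame rather than per sub-group, call ungroup() first.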
do is a generic function that allows you to perform arbitrary operations on a data frame from within dplyr. The period (.) is a kind of "pronoun" that represents the data frame (dat2 in this case). This will only work with %>%, not with %.%. (dplyr is in active development and is transitioning from %.% to %>% for chaining operations, so it's probably best to just use %>% from now on.)
# The test set then includes all rows that are not part of train.
# Since tnum has a unique value for each row, use tnum to select all rows that
# are not part of train.
test = dat2[!(dat2$tnum %in% train$tnum), ]
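It can be worth confirming that the two sets don't overlap and together cover every row. The sketch below does this on a made-up data frame, toy, with 3 rows per X/Y combination; the names toy, toy.train and toy.test are hypothetical, and the sample size of 2 is chosen only to fit the toy data.

```r
library(dplyr)
set.seed(1)  # make the random sample reproducible

# Hypothetical data: 12 rows, 3 per X/Y combination, unique tnum per row
toy = data.frame(tnum = paste0("trial", 1:12),
                 X = rep(1:2, each = 6),
                 Y = rep(c("male", "female"), times = 6))

# Training set: 2 random rows from each X/Y sub-group
toy.train = toy %>%
  group_by(X, Y) %>%
  do(sample_n(., 2))

# Test set: every row whose tnum is not in the training set
toy.test = toy[!(toy$tnum %in% toy.train$tnum), ]

# Sanity checks: no shared rows, and nothing lost
length(intersect(toy.train$tnum, toy.test$tnum))  # 0
nrow(toy.train) + nrow(toy.test) == nrow(toy)     # TRUE
```

This check relies on tnum being unique per row, as in the original data. With duplicated identifiers the %in% filter would drop more rows than intended.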