Question

To illustrate my question, a dummy example: I have a data set with 16 rows (these represent trials) and 3 columns (trial difficulty, label X, and label Y). Label X is a factor with 4 levels (1–4), and label Y is a factor with 2 levels ("female", "male"). For example:

        difficulty    X    Y
trial1   3.0           1    male
trial2   1.4           1    male
trial3   2.1           1    female
trial4   1.5           1    female
trial5   0.3           2    male
trial6   1.2           2    male
trial7   3.0           2    female
trial8   1.6           2    female
trial9   0.8           3    male
trial10  1.4           3    male
trial11  2.8           3    female
trial12  1.5           3    female
trial13  0.3           4    male
trial14  1.2           4    male
trial15  3.0           4    female
trial16  1.6           4    female

I would like to create a subset of 8 of the 16 trials that adheres to the following criteria:

  1. there is an equal number of trials within the four levels of label X
  2. there is an equal number of trials within the two levels of label Y (and there should also be an equal number of trials for each level of label Y within the four levels of label X)
  3. the trial difficulty variable (numeric, ranging from 0 to 3) should be as close as possible to 1.5

The ideal subset in this dummy example would be:

        difficulty    X    Y
trial2   1.4           1    male
trial4   1.5           1    female
trial6   1.2           2    male
trial8   1.6           2    female
trial10  1.4           3    male
trial12  1.5           3    female
trial14  1.2           4    male
trial16  1.6           4    female

This subset has 2 trials per level of X, and an equal number of females and males for each level of X, while all trials have a difficulty value that is as close as possible to 1.5.

I have tried many nested while and if loops, but I am not sure how to check for two variables at the same time (at the moment I loop until the X criterion is fulfilled, then until the Y criterion is fulfilled, then until X is fulfilled again, and so on). Would this be the right approach, or is there a more sensible way of doing this?


Solution

The following code assumes your data frame is called dat. It adds a new variable difficulty.scaled equal to the deviation of difficulty from 1.5, groups the data by X and Y, and then, within each group, keeps the observations whose absolute value of difficulty.scaled is closest to 0 (i.e., whose difficulty is closest to 1.5).

You can adjust the probs argument to the quantile function to select whatever percentage of each subgroup you want. In this case, I've selected 50% of the rows in each subgroup (that is, 50% of the rows for each combination of X and Y).
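The answer assumes dat already exists; for reproducibility, here is one way to construct it from the table above (the column name tnum is the trial-number variable referred to in the output further down):

```r
# Reconstruct the example data; tnum holds the trial number as a column
dat = data.frame(
  tnum = paste0("trial", 1:16),
  difficulty = c(3.0, 1.4, 2.1, 1.5, 0.3, 1.2, 3.0, 1.6,
                 0.8, 1.4, 2.8, 1.5, 0.3, 1.2, 3.0, 1.6),
  X = factor(rep(1:4, each = 4)),
  Y = factor(rep(c("male", "male", "female", "female"), 4))
)
```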

library(dplyr)  # Install the dplyr package if you don't already have it
dat2 = dat %>%
         mutate(difficulty.scaled = difficulty - 1.5) %>%
         group_by(X, Y) %>%
         filter(abs(difficulty.scaled) < quantile(abs(difficulty.scaled), .5))
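For intuition, here is what the quantile filter does within a single sub-group, e.g. X = 1, Y = male (a standalone base-R illustration, not part of the pipeline):

```r
d = c(3.0, 1.4)      # difficulties for the X = 1, male sub-group
s = abs(d - 1.5)     # deviations from 1.5: 1.5 and 0.1
s < quantile(s, .5)  # keeps the half of the rows closest to 1.5
# → FALSE TRUE (trial2 is kept, trial1 is dropped)
```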

For the data you pasted in above (where I've converted the trial number to a variable), here's the output:

     tnum difficulty X      Y difficulty.scaled
1  trial2        1.4 1   male              -0.1
2  trial4        1.5 1 female               0.0
3  trial6        1.2 2   male              -0.3
4  trial8        1.6 2 female               0.1
5 trial10        1.4 3   male              -0.1
6 trial12        1.5 3 female               0.0
7 trial14        1.2 4   male              -0.3
8 trial16        1.6 4 female               0.1

The data you provided has equal numbers of observations for each combination of X and Y. If your real data are unbalanced on these variables, then instead of selecting a percentage of the rows in each sub-group, you can select a specific number of rows. The code below selects the n rows with the lowest absolute value of difficulty.scaled in each sub-group. That way your subset will be balanced even if your full data set is not (as long as you have at least n rows of data for each combination of X and Y).

n = 1
dat2 = dat %>%
         mutate(difficulty.scaled = difficulty - 1.5) %>%
         group_by(X, Y) %>%
         filter(rank(abs(difficulty.scaled), ties.method="first") <= n)

ties.method="first" ensures that exactly n rows will be returned, even if there is more than one row with the same absolute value of difficulty.scaled.
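The effect of ties.method is easy to see on a small vector with tied values (a standalone base-R illustration):

```r
x = c(0.1, 0.3, 0.1, 0.2)
rank(x)                         # ties share an average rank: 1.5 4.0 1.5 3.0
rank(x, ties.method = "first")  # ties broken by position: 1 4 2 3
# With ties.method = "first", rank(...) <= 1 selects exactly one element;
# with the default, rank(...) <= 1 would select none here (the minimum rank is 1.5).
```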

Update: How to divide subsetted data into training and test sets.

Assuming dat2 is your balanced subset, you can divide it into training and test subsets as follows:

# Note that do() requires the %>% chaining operator (the older %.% is deprecated)
train = dat2 %>%
  do(sample_n(., 10))

This will return 10 randomly sampled rows per sub-group; set the second argument of sample_n to however many rows per sub-group you want in your training sample. Notice that you don't need to group by X and Y to create the training sample: when you created dat2, dplyr added grouping attributes to it that dplyr continues to recognize. Run str(dat2) to see this.

do is a generic function that lets you perform arbitrary operations on a data frame from within dplyr. The period . acts as a "pronoun" representing the data frame (dat2 in this case). This only works with %>%, not %.% (dplyr has been transitioning from %.% to %>% for chaining operations, so it's best to use %>% throughout).

# The test set then includes all rows that are not part of train. 
# Since tnum has a unique value for each row, use tnum to select all rows that 
# are not part of train.
test = dat2[!(dat2$tnum %in% train$tnum), ]
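The %in% anti-join pattern above is easy to verify on a toy example (hypothetical id column, not part of the answer's data):

```r
d = data.frame(id = 1:10)
train = d[c(2, 5, 9), , drop = FALSE]            # pretend these rows were sampled
test  = d[!(d$id %in% train$id), , drop = FALSE] # everything not in train
nrow(test)  # 7 rows remain, with no overlap between train and test
```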
Licensed under: CC-BY-SA with attribution