Scikit-learn balanced subsampling

Question 1

Here is my first version that seems to be working fine, feel free to copy or make suggestions on how it could be more efficient (I have quite a long experience with programming in general but not that long with python or numpy)

This function creates single random balanced subsample.

edit: The subsample size now samples down minority classes, this should probably be changed.

def balanced_subsample(x,y,subsample_size=1.0):

    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems == None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems*subsample_size)

    xs = []
    ys = []

    for ci,this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs,ys

For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:

Replace the np.random.shuffle line with

this_xs = this_xs.reindex(np.random.permutation(this_xs.index))
Replace the np.concatenate lines with

xs = pd.concat(xs) ys = pd.Series(data=np.concatenate(ys),name='target')

Question 2

There now exists a full-blown python package to address imbalanced data. It is available as a sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn

Question 3

I found the best solutions here

And this is the one I think it's the simplest.

dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)

then you can use X_rus, y_rus data

For versions 0.4<=:

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus= rus.fit_sample(X, y)

Then, indices of the samples randomly selected can be reached by sample_indices_ attribute.

Question 4

A version for pandas Series:

import numpy as np

def balanced_subsample(y, size=None):

    subsample = []

    if size is None:
        n_smp = y.value_counts().min()
    else:
        n_smp = int(size / len(y.value_counts().index))

    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample

Question 5

This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.

What seems similar to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while retaining the structure of the whole dataset, i.e. meticulously enforcing the same unbalance that is in your main dataset. While this is not what you are looking for, you may be able to use the code therein and change the imposed ratio to 50/50 always.

(This would probably be a very good contribution to scikit-learn if you feel up to it.)

Question 6

Below is my python implementation for creating balanced data copy. Assumptions: 1. target variable (y) is binary class (0 vs. 1) 2. 1 is the minority.

from numpy import unique
from numpy import random 

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling minority class 
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarrray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if not random_seed is None:
        random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversampling on observations of positive label
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]

Question 7

Here is a version of the above code that works for multiclass groups (in my tested case group 0, 1, 2, 3, 4)

import numpy as np
def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """ return a balanced data set by sampling all classes with sample_size 
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarrray}
    y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if not random_seed is None:
        np.random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.iteritems():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx+=over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)

This also returns the indices so they can be used for other datasets and to keep track of how frequently each data set was used (helpful for training)

Question 8

Simply select 100 rows in each class with duplicates using the following code. activity is my classes (labels of the dataset)

balanced_df=Pdf_train.groupby('activity',as_index = False,group_keys=False).apply(lambda s: s.sample(100,replace=True))

Question 9

Here my 2 cents. Assume that we have the following unbalanced dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})
print(df.head())

The first rows:

  Category  Sentiment Gender
0        C          1      M
1        B          0      M
2        B          0      M
3        B          0      M
4        A          0      M

Assume now that we want to get a balanced dataset by Sentiment:

df_grouped_by = df.groupby(['Sentiment'])

df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

df_balanced = df_balanced.droplevel(['Sentiment'])
df_balanced
print(df_balanced.head())

The first rows of the balanced dataset:

  Category  Sentiment Gender
0        C          0      F
1        C          0      M
2        C          0      F
3        C          0      M
4        C          0      M

Let's verify that it is balanced in terms of Sentiment

df_balanced.groupby(['Sentiment']).size()

We get:

Sentiment
0    369
1    369
dtype: int64

As we can see we ended up with 369 positive and 369 negative Sentiment labels.

Question 10

A short, pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or oversampling (uspl=False), balanced by a specified column in that dataframe that has two or more values.

For uspl=True, this code will take a random sample without replacement of size equal to the smallest stratum from all strata. For uspl=False, this code will take a random sample with replacement of size equal to the largest stratum from all strata.

def balanced_spl_by(df, lblcol, uspl=True):
    datas_l = [ df[df[lblcol]==l].copy() for l in list(set(df[lblcol].values)) ]
    lsz = [f.shape[0] for f in datas_l ]
    return pd.concat([f.sample(n = (min(lsz) if uspl else max(lsz)), replace = (not uspl)).copy() for f in datas_l ], axis=0 ).sample(frac=1)

This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.

Question 11

A slight modification to the top answer by mikkom.

If you want to preserve ordering of the larger class data, ie. you don't want to shuffle.

Instead of

    if len(this_xs) > use_elems:
        np.random.shuffle(this_xs)

do this

        if len(this_xs) > use_elems:
            ratio = len(this_xs) / use_elems
            this_xs = this_xs[::ratio]

Question 12

My subsampler version, hope this helps

def subsample_indices(y, size):
    indices = {}
    target_values = set(y_train)
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print j
print [x[t] for t in j[-1]]
print [x[t] for t in j[1]]

Question 13

Here is my solution, which can be tightly integrated in an existing sklearn pipeline:

from sklearn.model_selection import RepeatedKFold
import numpy as np


class DownsampledRepeatedKFold(RepeatedKFold):

    def split(self, X, y=None, groups=None):
        for i in range(self.n_repeats):
            np.random.seed()
            # get index of major class (negative)
            idxs_class0 = np.argwhere(y == 0).ravel()
            # get index of minor class (positive)
            idxs_class1 = np.argwhere(y == 1).ravel()
            # get length of minor class
            len_minor = len(idxs_class1)
            # subsample of major class of size minor class
            idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor)
            original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
            np.random.shuffle(original_indx_downsampled)
            splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))

            for train_index, test_index in splits:
                yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
         super(DownsampledRepeatedKFold, self).__init__(
        n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
    )

Use it as usual:

    for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
         X_train, X_test = X[train_index], X[test_index]
         y_train, y_test = y[train_index], y[test_index]

Question 14

Here's a solution which is:

simple (< 10 lines code)
fast (besides one for loop, pure NumPy)
no external dependencies other than NumPy
is very cheap to generate new balanced random samples (just call np.random.sample()). Useful for generating different shuffled & balanced samples between training epochs

def stratified_random_sample_weights(labels):
    sample_weights = np.zeros(num_samples)
    for class_i in range(n_classes):
        class_indices = np.where(labels[:, class_i]==1)  # find indices where class_i is 1
        class_indices = np.squeeze(class_indices)  # get rid of extra dim
        num_samples_class_i = len(class_indices)
        assert num_samples_class_i > 0, f"No samples found for class index {class_i}"
        
        sample_weights[class_indices] = 1.0/num_samples_class_i  # note: samples with no classes present will get weight=0

    return sample_weights / sample_weights.sum()  # sum(weights) == 1

Then, you use re-use these weights over and over to generate balanced indices with np.random.sample():

sample_weights = stratified_random_sample_weights(labels)
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

Full example:

# generate data
from sklearn.preprocessing import OneHotEncoder

num_samples = 10000
n_classes = 10
ground_truth_class_weights = np.logspace(1,3,num=n_classes,base=10,dtype=float)  # exponentially growing
ground_truth_class_weights /= ground_truth_class_weights.sum()  # sum to 1
labels = np.random.choice(list(range(n_classes)), size=num_samples, p=ground_truth_class_weights)
labels = labels.reshape(-1, 1)  # turn each element into a list
labels = OneHotEncoder(sparse=False).fit_transform(labels)


print(f"original counts: {labels.sum(0)}")
# [  38.   76.  127.  191.  282.  556.  865. 1475. 2357. 4033.]

sample_weights = stratified_random_sample_weights(labels)
sample_size = 1000
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

print(f"rebalanced counts: {labels[chosen_indices].sum(0)}")
# [104. 107.  88. 107.  94. 118.  92.  99. 100.  91.]