Improving classifcation when some are less represented?

https://datascience.stackexchange.com/questions/60194

02-11-2019
|

Question

I have a multi-class classification problem. It performs quite well but on the least represented classes it doesn't. Indeed, here is the distribution :

And here are the classification results (I took the numbers off the labels):

Therefore how to improve classifcation when some are less represented ?

I thought of duplicating a few rows of the classes it doesn't predict well in the train sample. But maybe this assumption is entirely false, maybe it is not because they are less represented that are badly classified. Maybe I should have a look on the feature selection I did by hand and rather do a PCA ?

Update

class weight with inverted frequency

I passed the class_weight parameter in model.fit() which is a list of the inverted frequency of the classes on the dataset:

>>> lossWeights = df['grade'].value_counts(normalize=True)
>>> lossWeights = lossWeights.sort_index().tolist()
>>> print(lossWeights)
[0.204064039408867, 0.2954361054766734, 0.29536185163720663, 0.13638619240799768, 0.04878839466821211, 0.014684149521877717, 0.0052792668791654595]
weights = {0: 1 / 0.204064,
       1: 1 / 0.295436, 
       2: 1 / 0.295362,
       3: 1 / 0.136386, 
       4: 1 / 0.048788,
       5: 1 / 0.014684,
       6: 1 / 0.005279}

history = model.fit(x_train.as_matrix(),
                y_train.as_matrix(),
                validation_split=0.2,
                epochs=epochs,  
                batch_size=batch_sz, # Can I tweak the batch here to get evenly distributed data ?
                verbose=2,
                class_weight = weights,
                callbacks=[checkpoint])

It diminished on the test Set Accuracy: 86.57% (it was 88.54% before) but better balanced the results on the confusion matrix :

class weight with inverted frequency + focal loss

Focal loss is designed to address class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. It focuses on training a sparse set of hard examples.

def focal_loss(gamma=2., alpha=4.):

    gamma = float(gamma)
    alpha = float(alpha)

    def focal_loss_fixed(y_true, y_pred):
        """Focal loss for multi-classification
        FL(p_t)=-alpha(1-p_t)^{gamma}ln(p_t)
        Notice: y_pred is probability after softmax
        gradient is d(Fl)/d(p_t) not d(Fl)/d(x) as described in paper
        d(Fl)/d(p_t) * [p_t(1-p_t)] = d(Fl)/d(x)
        Focal Loss for Dense Object Detection
        https://arxiv.org/abs/1708.02002

        Arguments:
            y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
            y_pred {tensor} -- model's output, shape of [batch_size, num_cls]

        Keyword Arguments:
            gamma {float} -- (default: {2.0})
            alpha {float} -- (default: {4.0})

        Returns:
            [tensor] -- loss.
        """
        epsilon = 1.e-9
        y_true = tf.convert_to_tensor(y_true, tf.float32)
        y_pred = tf.convert_to_tensor(y_pred, tf.float32)

        model_out = tf.add(y_pred, epsilon)
        ce = tf.multiply(y_true, -tf.log(model_out))
        weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
        fl = tf.multiply(alpha, tf.multiply(weight, ce))
        reduced_fl = tf.reduce_max(fl, axis=1)
        return tf.reduce_mean(reduced_fl)
    return focal_loss_fixed

model.compile(loss=focal_loss(alpha=1),
              optimizer='nadam',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, batch_size=1000)

I just had to add it to my model:

def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))

    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))

    # output layer
    model.add(Dense(output_dim, activation='softmax'))

    # Compile model
    # model.compile(loss='categorical_crossentropy', loss_weights=None, optimizer='adam', metrics=['accuracy'])
    model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='adam', metrics=['accuracy'])

    return model

It gave me an overall accuracy of 88%. However it gave me back a very bad classification on the least represented class:

focal loss

It has a decent test Set Accuracy: 88.27% and the classification is better balanced :

Now I have to questions. I'm still not satisfied. How to improve this classification ? Which model should I use between the first and the last updates ?

Resample

I tried to down sample the majority classes

import os
from sklearn.utils import resample

# rebalance data
#df = resample_data(df)

if True:

    count_class_A, count_class_B,count_class_C, count_class_D,count_class_E, count_class_F, count_class_G = df.grade.value_counts()
    count_df = df.shape[0] 
    class_dict = {"A": count_class_A,"B" :count_class_B,"C": count_class_C,"D": count_class_D,"E": count_class_E, "F": count_class_F, "G": count_class_G}
    counts = [count_class_A, count_class_B,count_class_C, count_class_D,count_class_E, count_class_F, count_class_G]
    median = statistics.median(counts)


    for key in class_dict:
        if class_dict[key]>median:
            print(key)
            df[df.grade == key] = df[df.grade == key].sample(int(count_df/7), replace = False) 
                                             #replace=False,    # sample without replacement
                                             #n_samples=int(count_df/7),     # to match minority class
                                             #random_state=123)
# Divide the data set into training and test sets
x_train, x_test, y_train, y_test = split_data(df, APPLICANT_NUMERIC + CREDIT_NUMERIC,
                  APPLICANT_CATEGORICAL,
                  TARGET,
                  test_size = 0.2,
                  #row_limit = os.environ.get("sample"))
                  row_limit = 552160)

# Inspect our training data
print("x_train contains {} rows and {} features".format(x_train.shape[0], x_train.shape[1]))
print("y_train contains {} rows and {} features".format(y_train.shape[0], y_train.shape[1]))

print("x_test contains {} rows and {} features".format(x_test.shape[0], x_test.shape[1]))
print("y_test contains {} rows and {} features".format(y_test.shape[0], y_test.shape[1]))

# Loan grade has been one-hot encoded
print("Sample one-hot encoded 'y' value: \n{}".format(y_train.sample()))

However it the results where catastrophic. The model accuracy and the model loss looked to have some issues :

And everything was classified in "A" on the test set.

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange