Improving classifcation when some are less represented?
Question
I have a multi-class classification problem. It performs quite well but on the least represented classes it doesn't. Indeed, here is the distribution :
And here are the classification results (I took the numbers off the labels):
Therefore how to improve classifcation when some are less represented ?
I thought of duplicating a few rows of the classes it doesn't predict well in the train sample. But maybe this assumption is entirely false, maybe it is not because they are less represented that are badly classified. Maybe I should have a look on the feature selection I did by hand and rather do a PCA ?
Update
class weight with inverted frequency
I passed the class_weight
parameter in model.fit()
which is a list of the inverted frequency of the classes on the dataset:
>>> lossWeights = df['grade'].value_counts(normalize=True)
>>> lossWeights = lossWeights.sort_index().tolist()
>>> print(lossWeights)
[0.204064039408867, 0.2954361054766734, 0.29536185163720663, 0.13638619240799768, 0.04878839466821211, 0.014684149521877717, 0.0052792668791654595]
weights = {0: 1 / 0.204064,
1: 1 / 0.295436,
2: 1 / 0.295362,
3: 1 / 0.136386,
4: 1 / 0.048788,
5: 1 / 0.014684,
6: 1 / 0.005279}
history = model.fit(x_train.as_matrix(),
y_train.as_matrix(),
validation_split=0.2,
epochs=epochs,
batch_size=batch_sz, # Can I tweak the batch here to get evenly distributed data ?
verbose=2,
class_weight = weights,
callbacks=[checkpoint])
It diminished on the test Set Accuracy: 86.57% (it was 88.54% before) but better balanced the results on the confusion matrix :
class weight with inverted frequency + focal loss
Focal loss is designed to address class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. It focuses on training a sparse set of hard examples.
def focal_loss(gamma=2., alpha=4.):
gamma = float(gamma)
alpha = float(alpha)
def focal_loss_fixed(y_true, y_pred):
"""Focal loss for multi-classification
FL(p_t)=-alpha(1-p_t)^{gamma}ln(p_t)
Notice: y_pred is probability after softmax
gradient is d(Fl)/d(p_t) not d(Fl)/d(x) as described in paper
d(Fl)/d(p_t) * [p_t(1-p_t)] = d(Fl)/d(x)
Focal Loss for Dense Object Detection
https://arxiv.org/abs/1708.02002
Arguments:
y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
y_pred {tensor} -- model's output, shape of [batch_size, num_cls]
Keyword Arguments:
gamma {float} -- (default: {2.0})
alpha {float} -- (default: {4.0})
Returns:
[tensor] -- loss.
"""
epsilon = 1.e-9
y_true = tf.convert_to_tensor(y_true, tf.float32)
y_pred = tf.convert_to_tensor(y_pred, tf.float32)
model_out = tf.add(y_pred, epsilon)
ce = tf.multiply(y_true, -tf.log(model_out))
weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
fl = tf.multiply(alpha, tf.multiply(weight, ce))
reduced_fl = tf.reduce_max(fl, axis=1)
return tf.reduce_mean(reduced_fl)
return focal_loss_fixed
model.compile(loss=focal_loss(alpha=1),
optimizer='nadam',
metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, batch_size=1000)
I just had to add it to my model:
def create_model(input_dim, output_dim):
print(output_dim)
# create model
model = Sequential()
# input layer
model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
# hidden layer
model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
# output layer
model.add(Dense(output_dim, activation='softmax'))
# Compile model
# model.compile(loss='categorical_crossentropy', loss_weights=None, optimizer='adam', metrics=['accuracy'])
model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='adam', metrics=['accuracy'])
return model
It gave me an overall accuracy of 88%. However it gave me back a very bad classification on the least represented class:
focal loss
It has a decent test Set Accuracy: 88.27% and the classification is better balanced :
Now I have to questions. I'm still not satisfied. How to improve this classification ? Which model should I use between the first and the last updates ?
Resample
I tried to down sample the majority classes
import os
from sklearn.utils import resample
# rebalance data
#df = resample_data(df)
if True:
count_class_A, count_class_B,count_class_C, count_class_D,count_class_E, count_class_F, count_class_G = df.grade.value_counts()
count_df = df.shape[0]
class_dict = {"A": count_class_A,"B" :count_class_B,"C": count_class_C,"D": count_class_D,"E": count_class_E, "F": count_class_F, "G": count_class_G}
counts = [count_class_A, count_class_B,count_class_C, count_class_D,count_class_E, count_class_F, count_class_G]
median = statistics.median(counts)
for key in class_dict:
if class_dict[key]>median:
print(key)
df[df.grade == key] = df[df.grade == key].sample(int(count_df/7), replace = False)
#replace=False, # sample without replacement
#n_samples=int(count_df/7), # to match minority class
#random_state=123)
# Divide the data set into training and test sets
x_train, x_test, y_train, y_test = split_data(df, APPLICANT_NUMERIC + CREDIT_NUMERIC,
APPLICANT_CATEGORICAL,
TARGET,
test_size = 0.2,
#row_limit = os.environ.get("sample"))
row_limit = 552160)
# Inspect our training data
print("x_train contains {} rows and {} features".format(x_train.shape[0], x_train.shape[1]))
print("y_train contains {} rows and {} features".format(y_train.shape[0], y_train.shape[1]))
print("x_test contains {} rows and {} features".format(x_test.shape[0], x_test.shape[1]))
print("y_test contains {} rows and {} features".format(y_test.shape[0], y_test.shape[1]))
# Loan grade has been one-hot encoded
print("Sample one-hot encoded 'y' value: \n{}".format(y_train.sample()))
However it the results where catastrophic. The model accuracy and the model loss looked to have some issues :
And everything was classified in "A" on the test set.
No correct solution