Question

I have produced a large heatmap-like confusion matrix and am seeing horizontal and vertical lines on it, so I'm trying to determine:

  1. What they mean
  2. Why they are there
  3. How I can improve on this

Confusion Matrix Heatmap


Approach

I am relatively new to ML and in the early stages of a multi-class text classification problem. I'll be a little verbose so you can check that I'm on track and that my question isn't due to a flaw in my approach.

I have 90,000+ samples that I'd like to classify into one of 412 classes. I've taken a basic look at the data in terms of its class distribution and the unigrams and bigrams selected for each class. Continuing the exploration, I trained 4 classifiers on the data, receiving the following levels of accuracy:

LinearSVC                 0.547190
LogisticRegression        0.530063
MultinomialNB             0.368121
RandomForestClassifier    0.200568
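For reference, a comparison like this can also be run with cross-validation rather than a single split. This is only a sketch: the synthetic features below stand in for the real TF-IDF matrix, and the model settings are illustrative, not the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real TF-IDF features
X, y = make_classification(n_samples=600, n_features=50, n_informative=20,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X = np.abs(X)  # MultinomialNB requires non-negative inputs

models = [LinearSVC(), LogisticRegression(max_iter=1000),
          MultinomialNB(), RandomForestClassifier(random_state=0)]
for model in models:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f'{type(model).__name__:25s} {scores.mean():.3f}')
```

Cross-validated scores tend to be a fairer comparison than one fixed split, especially with many small classes.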

Having had a lot of trouble plotting a confusion matrix this large with Seaborn or Matplotlib, I used the following Python code to produce a confusion matrix as CSV:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def make_confusion_matrix(a, p, c):
    # Rows are predicted classes, columns are actual classes
    cm = pd.DataFrame(0, index=c, columns=c)
    for actual, predicted in zip(a, p):
        cm.loc[int(predicted), int(actual)] += 1  # .loc avoids chained-assignment issues
    return cm

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                        encoding='latin-1', ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df['DetailedDescription'])

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, df['BreakdownAgency'], df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = make_confusion_matrix(y_test.tolist(), y_pred,
                           labels_df['TOOCS Breakdown Agency'])
cm.to_csv('ConfusionMatrix.csv')
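As an aside, the hand-rolled counting loop can be replaced by pandas' `crosstab` or sklearn's `confusion_matrix`, which build the same table in one call. The toy labels below are illustrative; the orientation matches the code above (rows = predicted, columns = actual):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test / y_pred
actual    = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 1, 1, 0]

# crosstab with rows = predicted, columns = actual
cm = pd.crosstab(pd.Series(predicted, name='predicted'),
                 pd.Series(actual, name='actual'))
print(cm)

# sklearn's confusion_matrix uses rows = actual, columns = predicted,
# so transpose it to get the same layout
cm2 = confusion_matrix(actual, predicted).T
```

Both are vectorized, so they scale far better than a Python loop over 30,000 test predictions.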

I was finally able to view the confusion matrix in a heatmap style by using Excel conditional formatting, which produced the matrix above.
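For what it's worth, Matplotlib can render a matrix this size if you skip per-cell annotations and pass the raw counts to `imshow`. A minimal sketch, with a random matrix standing in for the real one:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this when running interactively
import matplotlib.pyplot as plt

cm = np.random.randint(0, 50, size=(412, 412))  # stand-in for the real matrix

fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(cm, cmap='viridis', interpolation='nearest')
ax.set_xlabel('Actual class')
ax.set_ylabel('Predicted class')
fig.colorbar(im, ax=ax)
fig.savefig('confusion_matrix.png', dpi=150)
```

The horizontal and vertical lines show up just as clearly as in the Excel version, without the per-cell formatting cost.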

Interpretation

Given that the X axis is actual and the Y axis is predicted:

I interpret the horizontal lines as the model being biased towards predicting one class for samples from many different actual classes, presumably a class with a disproportionately large number of samples?

I interpret the vertical lines as samples from one actual class being scattered across many predicted classes, i.e. predictions being pulled away from that class?

Does this show that the model is simultaneously overfitting to the large classes and underfitting the small ones? Or that the samples within my classes are overly diverse?
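Rather than eyeballing the heatmap, per-class precision and recall make these two failure modes explicit: low precision on a class corresponds to a horizontal line (it absorbs other classes' samples), and low recall corresponds to a vertical line (its own samples leak out). A sketch with toy labels (`classification_report` is sklearn's; the labels here are illustrative):

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Toy labels: class 0 absorbs samples from other classes (a horizontal line),
# so it has high recall but low precision; classes 1 and 2 have low recall.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 0, 2]

print(classification_report(y_true, y_pred, zero_division=0))
```

Sorting the report by per-class recall would point directly at the classes behind the vertical lines.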

Action

I'm contemplating:

  1. Manually adding samples to the classes that have very few (to a minimum of 10?).
  2. Using SMOTE to oversample the small classes (k_neighbors=6).
  3. Potentially removing some samples that are atypical or mislabelled.
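On point 2: SMOTE lives in the separate imbalanced-learn package (where the relevant parameter is `k_neighbors`). As a simpler baseline that needs only scikit-learn, naive random oversampling of a minority class can be sketched with `sklearn.utils.resample` (toy data, illustrative names):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: class 'a' has 5 samples, class 'b' only 2
df = pd.DataFrame({'text': ['t%d' % i for i in range(7)],
                   'label': ['a'] * 5 + ['b'] * 2})

minority = df[df['label'] == 'b']
# Sample the minority class with replacement up to the majority-class size
upsampled = resample(minority, replace=True, n_samples=5, random_state=0)
balanced = pd.concat([df[df['label'] == 'a'], upsampled])
print(balanced['label'].value_counts())
```

Whichever method is used, oversampling should be applied to the training split only, never before the train/test split, or the test accuracy will be inflated.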

Any help on my Interpretation or Action would be greatly appreciated!
