
I have produced a large heatmap-like confusion matrix and am seeing horizontal and vertical lines on it, so I'm trying to determine:

  1. What they mean
  2. Why they are there
  3. How I can improve on this

[Image: Confusion Matrix Heatmap]


I am relatively new to ML and in the early stages of a multi-class text classification problem. I may be a little verbose, so that you can check I'm on track and that my question isn't caused by a flaw in my approach.

I have 90,000+ samples that I'd like to be able to classify into one of 412 classes. I've taken a basic look at the data in terms of its class distribution and the unigrams and bigrams that are selected for each class. Continuing exploration, I trained 4 classifiers on the data, receiving the following levels of accuracy:

LinearSVC                 0.547190
LogisticRegression        0.530063
MultinomialNB             0.368121
RandomForestClassifier    0.200568

Having had a lot of trouble plotting a confusion matrix this large with Seaborn or Matplotlib, I used the following Python code to produce a confusion matrix in CSV:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def make_confusion_matrix(a, p, c):
    # a: actual labels, p: predicted labels, c: all class labels
    # Rows are predicted classes, columns are actual classes
    cm = pd.DataFrame(0, index=c, columns=c)
    for count in range(len(p)):
        # Assumed loop body: count one (predicted, actual) pair per test sample
        cm.loc[p[count], a[count]] += 1
    return cm

# df holds the samples; labels_df lists the class labels
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df['DetailedDescription'])

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, df['BreakdownAgency'], df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = make_confusion_matrix(y_test.tolist(), y_pred, labels_df['TOOCS Breakdown Agency'])

I was finally able to view the confusion matrix in a heatmap style by using Excel conditional formatting, which produced the matrix above.


Given that the x-axis is actual and the y-axis is predicted:

I interpret the horizontal lines as predictions being incorrectly biased towards a class with a disproportionately large number of samples. Is that right?

I interpret the vertical lines as samples from a class with a disproportionately large number of samples being incorrectly predicted as many other classes. Is that right?

Does this show that the model is both overfitting and underfitting the data? Or that the samples within my classes are overly diverse?
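To make the interpretation above checkable, here is a minimal sketch that puts numbers on the lines instead of reading them off the heatmap. It assumes the same orientation as stated (rows are predicted, columns are actual). The function name `line_diagnostics` and the toy matrix are made up for illustration; they are not part of my real data.

```python
import numpy as np

def line_diagnostics(cm):
    """Summarise off-diagonal errors per class for a square confusion matrix.

    Assumes rows are predicted classes and columns are actual classes.
    """
    cm = np.asarray(cm, dtype=float)
    correct = np.diag(cm)
    predicted_total = cm.sum(axis=1)  # row sums: how often each class was predicted
    actual_total = cm.sum(axis=0)     # column sums: true samples per class
    # Errors along a row: other classes wrongly predicted as this class
    false_pos = predicted_total - correct
    # Errors along a column: this class wrongly predicted as other classes
    false_neg = actual_total - correct
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(predicted_total > 0, correct / predicted_total, np.nan)
        recall = np.where(actual_total > 0, correct / actual_total, np.nan)
    return false_pos, false_neg, precision, recall

# Toy 3-class matrix with made-up counts; class 0 is the large class
toy = np.array([
    [50, 4, 3],  # predicted class 0
    [2, 5, 0],   # predicted class 1
    [1, 0, 6],   # predicted class 2
])
false_pos, false_neg, precision, recall = line_diagnostics(toy)
print("wrongly predicted AS each class (row errors):   ", false_pos)
print("wrongly predicted AWAY from each class (column):", false_neg)
print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
```

Under this orientation, a strong horizontal line should show up as a large row-error count and low precision for that class, and a strong vertical line as a large column-error count and low recall. Sorting the 412 classes by these values would show which classes the lines belong to.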


I'm contemplating:

  1. Manually adding samples to the classes that have very few (a minimum of 10?).
  2. Using SMOTE to oversample small classes (knn=6).
  3. Potentially removing some samples that are atypical or incorrect.
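
Before acting on the options above, the small classes could be listed with something like the following sketch. The helper `small_classes` and the example labels are hypothetical stand-ins for `df['BreakdownAgency']`. The second threshold rests on an assumption I have not verified against the library: that SMOTE with 6 neighbours needs at least 7 samples in a class.

```python
from collections import Counter

def small_classes(labels, min_samples=10):
    """Return {class: count} for classes with fewer than min_samples samples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_samples}

# Hypothetical labels standing in for df['BreakdownAgency']
labels = ["A"] * 40 + ["B"] * 12 + ["C"] * 8 + ["D"] * 2

# Classes below the contemplated minimum of 10 samples
below_10 = small_classes(labels, min_samples=10)
# Assumption: SMOTE with 6 neighbours needs at least 7 samples per class
below_7 = small_classes(labels, min_samples=7)

print("fewer than 10 samples:", below_10)
print("fewer than 7 samples: ", below_7)
```

Classes that appear in the label list but have no samples at all would not show up here, since only labels present in the data are counted.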

Any help on my Interpretation or Action would be greatly appreciated!

