Question

I want to build a model that classifies 473 classes (product categories), but the loss is not decreasing during training.


Data

I have almost 3,000 data points per class (473 classes), so the dataset contains roughly 1.5 million examples. Each input is a sequence of 5 words, e.g. [iPhone, Pro, Max, 0, 0], and of course the words are mapped to integer indices, e.g. [345, 344, 123, 0, 0], with 0 filling the unused positions.

Examples:

Input: [iPhone, Pro, Max, 0, 0]
Output: iPhone

Input: [Go, Pro, Camera, New, 0]  
Output: GoPro

Input: [LG, TV, 50, Inches, Used]
Output: LG_TV

Input: [Apple, Watch, 42, mm, 0]
Output: Apple_Watch
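
A simplified sketch of what this encoding amounts to (the index values are taken from the example above; the real vocabulary has 15,278 entries, and I'm treating 0 as both padding and unknown-word fallback here just for illustration):

# Hypothetical word-to-index mapping, only for illustration.
word_to_idx = {"<pad>": 0, "iPhone": 345, "Pro": 344, "Max": 123}

def encode(tokens, max_len=5):
    # Map words to indices and pad with 0 up to the fixed length of 5.
    ids = [word_to_idx.get(t, 0) for t in tokens]  # unknown words fall back to 0 here
    return ids[:max_len] + [0] * (max_len - len(ids))

print(encode(["iPhone", "Pro", "Max"]))  # [345, 344, 123, 0, 0]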

Loss

Epoch: 1, Loss: 5.607430, Val Loss: 5.538741
Epoch: 2, Loss: 5.493465, Val Loss: 5.516405 
Epoch: 3, Loss: 5.487641, Val Loss: 5.513667
Epoch: 4, Loss: 5.474956, Val Loss: 5.508683
Epoch: 5, Loss: 5.472722, Val Loss: 5.508304
Epoch: 6, Loss: 5.472691, Val Loss: 5.510557
Epoch: 7, Loss: 5.472782, Val Loss: 5.508627
Epoch: 8, Loss: 5.472320, Val Loss: 5.533378
Epoch: 9, Loss: 5.472340, Val Loss: 5.520573 

I've tried training it for 50 epochs, but the loss still does not decrease.


Model

I'm using PyTorch

LSTMClassifier(
  (embedding): Embedding(15278, 200)
  (lstm): LSTM(200, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (dense): Linear(in_features=256, out_features=473, bias=True)
)
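
In code, the printed architecture corresponds to roughly the following module. The forward pass below (classifying from the last time step's output) is one plausible reading, not necessarily the exact implementation:

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=15278, embed_dim=200, hidden_dim=256,
                 num_layers=2, num_classes=473):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=0.5)
        self.dropout = nn.Dropout(p=0.3)
        self.dense = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                      # x: (batch, seq_len) token indices
        emb = self.embedding(x)                # (batch, seq_len, 200)
        out, _ = self.lstm(emb)                # (batch, seq_len, 256)
        last = out[:, -1, :]                   # last time step (assumed)
        return self.dense(self.dropout(last))  # raw logits, (batch, 473)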

The loss function is: CrossEntropyLoss


Hyperparameters

Batch size: 512
Embedding Dim: 200
Vocabulary size: 15,278
LSTM Layers: 2
Hidden Dims: 256
Optimizer: Adam
Learning rate: 0.002
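
Put together, the training loop looks roughly like this (simplified; the DataLoader name is a placeholder). CrossEntropyLoss gets raw logits and integer class labels:

model = LSTMClassifier()                    # as defined above
criterion = nn.CrossEntropyLoss()           # applies log-softmax internally, so the model outputs raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)

for epoch in range(50):
    model.train()
    for x_batch, y_batch in train_loader:   # placeholder DataLoader, batch_size=512
        optimizer.zero_grad()
        logits = model(x_batch)             # (512, 473)
        loss = criterion(logits, y_batch)   # y_batch: (512,) integer class indices
        loss.backward()
        optimizer.step()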

Can you please point me to the problem? Is the model too weak, or is there a problem with the data?


Solution

One way to approach this is to increase model capacity and see if/when your LSTM is able to learn the training data. At this first step it is OK to overfit (also see this recent question and answer for this approach), since you can add regularization or decrease model capacity again later.

These are the parameters I would tweak:

  • find a good learning rate along a log-scale (try at least 0.1, 0.01 and 0.0001)
  • increase the number of LSTM layers, e.g. to 5
  • increase the hidden dimension, e.g. to 1024
  • slightly increase the embedding dimension, e.g. to 300

If training this model takes too long, tune the learning rate on your current model first and then make the other adjustments. However, I suggest not making the other adjustments in small steps or one by one, as that can take a lot of time.
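
As a concrete sketch of that first step, using the LSTMClassifier defined above (the values are just the examples from the list, not tuned recommendations):

# Sweep the learning rate on a log scale with a larger model,
# watching whether the *training* loss starts to drop.
for lr in (0.1, 0.01, 0.002, 0.0001):
    model = LSTMClassifier(num_layers=5, hidden_dim=1024, embed_dim=300)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # ... train for a few epochs and compare the training-loss curves ...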

Licensed under: CC-BY-SA with attribution