Question

I am fairly new to machine learning, and as a step towards a bigger project I'm trying to get the basics working with a very simple encoder-decoder model. It looks like this:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

embedding_dim = 300
lstm_layer_size_1 = 300
lstm_layer_size_2 = 300

model = Sequential()
model.add(Embedding(self.max_input_vocab, embedding_dim,
    input_length=self.max_input_length, mask_zero=True))
model.add(LSTM(lstm_layer_size_1)) # encoder
model.add(RepeatVector(self.max_output_length))
model.add(LSTM(lstm_layer_size_2, return_sequences=True)) # decoder
model.add(TimeDistributed(Dense(self.max_output_vocab, activation='softmax')))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

It takes in a sequence of words encoded as integers, padded with 0s up to max_input_length, and outputs a one-hot-encoded sequence of words up to max_output_length.

For example, with a max output length of 115 and an expected output of length 20, the network should predict 20 integers in the range of max_output_vocab, followed by 95 predicted 0s.
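For reference, padded targets like these can be built with the standard Keras utilities along these lines (the vocabulary size and the values here are made up for illustration):

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

max_output_length = 115
max_output_vocab = 5000  # illustrative value

# one target sequence of 20 word indices (values are made up)
target = np.random.randint(1, max_output_vocab, size=(1, 20))

# pad with trailing zeros to the fixed output length
padded = pad_sequences(target, maxlen=max_output_length, padding='post')

# one-hot encode: y has shape (1, 115, max_output_vocab)
y = to_categorical(padded, num_classes=max_output_vocab)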

My problem:

I've been running into the issue that the network trains far too much on the zero tokens in the output, since many of the target sequences are much shorter than the max output length. The network ends up learning that it can get the highest accuracy by predicting almost all 0s for most of the output.

I want to try to make a custom loss function that won't train on any output that comes after the first 0 token, but I'm not sure how I would go about doing this properly.

I know it will look similar to keras.backend.categorical_crossentropy, but would it be as simple as using a version of that function and only feeding it the portion of the output sequence I want (everything before the first 0 token in the expected output)?
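For what it's worth, a masked loss along those lines can be sketched with the Keras backend. This is only a sketch and the function name is mine; rather than cutting at the first 0, it zeroes out every timestep whose one-hot target is the 0 token, which amounts to the same thing when 0 only ever appears as trailing padding:

from keras import backend as K

def masked_categorical_crossentropy(y_true, y_pred):
    # y_true and y_pred have shape (batch, timesteps, vocab)
    # mask is 0.0 wherever the one-hot target points at index 0 (padding), 1.0 elsewhere
    mask = 1.0 - y_true[:, :, 0]
    # per-timestep crossentropy, shape (batch, timesteps)
    loss = K.categorical_crossentropy(y_true, y_pred)
    return loss * mask

It would then be passed to compile as loss=masked_categorical_crossentropy instead of the string 'categorical_crossentropy'.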


Solution

The issue is very easy to solve, assuming you still want to use crossentropy as your loss.

  1. Tell your model to use temporal sample weights. You can do this with model.compile(sample_weight_mode='temporal', **other_params).

  2. Generate your sample weights; I think you are smart enough to write your own implementation. The idea, as you said, is to apply a weight of 1 to each timestep that should be counted and 0 to each timestep that should not. For example, if you have a target sequence [3,5,1,-3,2,0,0,0,0], then your sample weights will be [1,1,1,1,1,0,0,0,0] (see the sketch below).

  3. Supply the sample weights during fitting. Simply use model.fit(X, y, sample_weight=sample_weights, **other_fit_params).

Done. Now loss is only counted on non-zero entries of your output.
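Putting the three steps together, a minimal sketch might look like this (X, y, and y_int are placeholder names for the padded inputs, the one-hot targets, and the integer-encoded targets before one-hot encoding; they are not from the original code):

import numpy as np

# 1. compile with one weight per output timestep
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['acc'], sample_weight_mode='temporal')

# 2. build the weights from the integer targets before one-hot encoding:
#    1.0 for real tokens, 0.0 for the trailing zero padding
#    y_int has shape (num_samples, max_output_length)
sample_weights = (y_int != 0).astype(np.float32)

# 3. pass the weights to fit; y is the one-hot encoded version of y_int
model.fit(X, y, sample_weight=sample_weights, epochs=10, batch_size=32)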

Licensed under: CC-BY-SA with attribution