Why does this TensorFlow Transformer model have a linear output instead of Softmax?
16-12-2020
Question
I am checking this official TensorFlow tutorial on a Transformer model for Portuguese-English translation.
I am quite surprised that when the Transformer is created, its final output is a Dense layer with linear activation, instead of Softmax. Why is that the case? In the original paper, Attention Is All You Need, the figure is clear: there is a Softmax layer right at the end (Fig. 1, p. 3).
How can you justify this difference, when the task is building a language model and the loss is based on sparse categorical cross-entropy?
Solution
The key is precisely in the definition of the loss:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
As you can see, the loss is created with the flag from_logits=True, which means the input to the loss is not a probability distribution but unnormalized log-probabilities, i.e. "logits". These are precisely the output of the final linear projection, before any softmax. When from_logits is true, the softmax is handled inside the loss, where it is combined with the sparse categorical cross-entropy into a single, more numerically stable computation.
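This equivalence can be checked directly. Below is a minimal sketch (the logits and target ids are made up for illustration) showing that computing the loss on raw logits with from_logits=True matches applying the softmax yourself and then using from_logits=False:

```python
import numpy as np
import tensorflow as tf

# Toy batch: 2 target token ids, vocabulary of size 4 (illustrative values).
y_true = np.array([1, 3])
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.5, -0.2, 1.5, 3.0]], dtype=np.float32)

# Loss computed directly on logits; softmax is applied internally.
loss_logits = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')(y_true, logits)

# Equivalent: apply softmax first, then treat the input as probabilities.
probs = tf.nn.softmax(logits)
loss_probs = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction='none')(y_true, probs)

print(np.allclose(loss_logits.numpy(), loss_probs.numpy(), atol=1e-6))  # True
```

The from_logits=True path is preferred because it avoids computing an explicit softmax whose small probabilities can underflow before the log is taken.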
From the docs:
from_logits: Whether y_pred is expected to be a logits tensor. By default, we assume that y_pred encodes a probability distribution.
Note - Using from_logits=True may be more numerically stable.