Question

I am checking this official TensorFlow tutorial on a Transformer model for Portuguese-English translation.

I am quite surprised that when the Transformer is created, its final output is a Dense layer with a linear activation instead of a softmax. Why is that the case? In the original paper, Attention Is All You Need, the figure is pretty clear: there is a softmax layer right at the end (Fig. 1, p. 3).

How can you justify this difference, given that the task is building a language model and the loss is based on sparse categorical cross-entropy?


Solution

The key is precisely in the definition of the loss:

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

As you can see, the loss is created with the flag from_logits=True, which means the input to the loss is not a probability distribution but a tensor of unnormalized log-probabilities, i.e. "logits". That is exactly what the final linear projection produces, before any softmax is applied.
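For concreteness, here is a minimal sketch (not the tutorial's exact code; vocab_size and d_model are placeholder values) of what that final projection looks like. A Dense layer created with no activation argument uses a linear activation, so its outputs are raw logits:

import tensorflow as tf

vocab_size = 8000   # placeholder target vocabulary size
d_model = 512       # placeholder model dimension

# No activation argument -> linear activation -> raw logits
final_layer = tf.keras.layers.Dense(vocab_size)

decoder_output = tf.random.normal((2, 10, d_model))  # (batch, seq_len, d_model)
logits = final_layer(decoder_output)                 # (batch, seq_len, vocab_size)
print(logits.shape)  # (2, 10, 8000) -- unnormalized scores, not probabilities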

When from_logits=True, the softmax is applied inside the loss itself, where it is fused with the sparse categorical cross-entropy into a single, more numerically stable computation.
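As a quick sanity check (a toy example, not the tutorial's code), the two formulations agree: feeding raw logits with from_logits=True gives the same loss values as feeding explicit softmax probabilities with from_logits=False; the difference is only in how the computation is carried out internally:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])  # (batch=2, classes=3)
labels = tf.constant([0, 1])             # integer class ids

loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction='none')

print(loss_from_logits(labels, logits).numpy())
print(loss_from_probs(labels, tf.nn.softmax(logits)).numpy())
# Both print approximately [0.417, 0.220]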

From the docs:

from_logits: Whether y_pred is expected to be a logits tensor. By default, we assume that y_pred encodes a probability distribution.

Note - Using from_logits=True may be more numerically stable.
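To see where the stability gain comes from, here is a small illustration (not the loss implementation itself): when the logits are large, applying softmax first and taking the logarithm afterwards underflows, whereas a log-sum-exp formulation over the logits stays finite:

import tensorflow as tf

logits = tf.constant([[1000.0, 0.0]])     # extreme but perfectly valid logits

probs = tf.nn.softmax(logits)             # [[1., 0.]] -- second entry underflows to 0
print(tf.math.log(probs).numpy())         # [[0., -inf]] -- log of an underflowed zero
print(tf.nn.log_softmax(logits).numpy())  # [[0., -1000.]] -- computed stably in one step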
