Need for Dense layer in Text Classification

https://datascience.stackexchange.com/questions/76139

12-12-2020
Question

While creating a model for text classification, what is the need for a Dense layer? I noticed the following structure in multiple examples. Isn't a softmax what is required, rather than the Dense layer?

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1)
])

Consider the following sentence in a 5-class classification problem: "movie is good". The model structure could be:

a = activation_unit
emb= embedding_vector(word)

a0 -> emb("movie") ->a1->emb("is") ->a2->emb("good") ->a3, and

sample_y = softmax(np.dot(Wya,a3))  

and
sample_y = [0.1,0.2,0.2,0.4,0.1]

which says the sentence belongs to "class 4". So where is the need for a "Dense Layer"? Can anyone please explain this?

Solution

In neural networks meant for classification, you need a linear layer before the softmax to project the internal representation, which has some dimensionality $d_i$, to the output space, which has dimensionality $d_o$ equal to the number of classes (5 in your case).

So you either place a Dense(5) layer after the BiLSTM or you take the output of the BiLSTM "manually" and implement the projection.
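
As a minimal sketch of the first option, the model from the question could end in a 5-unit Dense layer with a softmax activation. The vocabulary size and the dummy token ids below are made up for illustration (they stand in for encoder.vocab_size and real encoded sentences):

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 1000   # hypothetical; stands in for encoder.vocab_size
NUM_CLASSES = 5     # d_o: one output per class

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    # The BiLSTM output has d_i = 128 (64 per direction)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    # Linear projection from d_i to d_o, followed by softmax
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Two dummy "sentences" of three token ids each (made-up values)
token_ids = np.array([[12, 7, 45], [3, 99, 8]])
probs = model.predict(token_ids, verbose=0)

print(probs.shape)        # one row of class probabilities per sentence
print(probs.sum(axis=1))  # each row sums to approximately 1
```

The model is untrained here, so the probabilities are meaningless; the point is only that the Dense layer is what turns the 128-dimensional BiLSTM output into 5 class scores.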

The code above has some strange things:

It uses numpy.dot to multiply the output of the BiLSTM. Is this a typo, and did you actually mean tf.tensordot or tf.matmul?
The model ends with a tf.keras.layers.Dense(1), maybe because it was originally meant for binary classification.
It has both a Dense layer and then a dot product (i.e. a matrix multiplication). Since the Dense layer has no activation function, these two operations are equivalent to a single Dense layer, so it is pointless to have both.
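
The last point can be checked numerically. The NumPy sketch below uses random weights and illustrative sizes (none of them come from the question, apart from the 5 classes and the 128-dimensional BiLSTM output) and assumes the Dense layer has no nonlinearity, which is the Keras default:

```python
import numpy as np

rng = np.random.default_rng(0)

d_i, d_hidden, d_o = 128, 16, 5   # d_hidden is an illustrative size
a3 = rng.normal(size=d_i)         # stands in for the BiLSTM output

# Dense layer without activation: W1 @ a3 + b1
W1 = rng.normal(size=(d_hidden, d_i))
b1 = rng.normal(size=d_hidden)

# Extra projection applied afterwards, as in softmax(np.dot(Wya, a3))
Wya = rng.normal(size=(d_o, d_hidden))

# Two steps: Dense layer, then dot product
two_step = np.dot(Wya, np.dot(W1, a3) + b1)

# One step: a single Dense layer with merged weights and bias
W_merged = np.dot(Wya, W1)
b_merged = np.dot(Wya, b1)
one_step = np.dot(W_merged, a3) + b_merged

print(np.allclose(two_step, one_step))
```

If a nonlinear activation were placed between the two operations, they would no longer collapse into one linear map.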

So to answer your question: assuming that np.dot actually means a TensorFlow matrix multiplication, the Dense layer in the model is pointless.

OTHER TIPS

Softmax is simply an activation function. In your example of 5-class classification, you will need a dense layer with 5 output neurons, on which you can then apply softmax to obtain the probability for each class.
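
To make the split between the two steps concrete, here is a small NumPy sketch in the question's notation. The weights are random placeholders rather than trained values, and the bias b is an addition that the question's formula omits:

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    z = z - np.max(z)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

a3 = rng.normal(size=128)               # final hidden state (placeholder values)
Wya = rng.normal(size=(5, 128)) * 0.1   # dense-layer weights: 128 -> 5
b = np.zeros(5)                         # dense-layer bias

logits = np.dot(Wya, a3) + b   # the "dense layer": 5 unnormalized scores
sample_y = softmax(logits)     # the activation: 5 class probabilities

predicted_class = int(np.argmax(sample_y)) + 1   # 1-based, as in "class 4"
print(sample_y, predicted_class)
```

Softmax alone cannot change the number of values it receives, so without the projection it would return 128 probabilities instead of 5.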

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange