Question

The attention-scoring mechanism is a commonly used component in various seq2seq models, and I was reading about the original "Location-based Attention" in the well-known paper co-authored by Bahdanau at https://arxiv.org/pdf/1506.07503.pdf. (This attention seems to be used in various forms of GNMT and in text-to-speech synthesizers like Tacotron-2, https://github.com/Rayhane-mamah/Tacotron-2.)

Even after repeated readings of this paper and other articles about the attention mechanism, I'm confused about the dimensions of the matrices used, as the paper doesn't seem to describe them. My understanding is:

  • If I have a decoder hidden dim of 1024, that means the $s_{i-1}$ vector has length 1024.

  • If I have an encoder output dim of 512, that means the $h_{j}$ vector has length 512.

  • If the maximum number of inputs to the encoder is 256, then $j$ ranges from 1 to 256.

  • Since $W \times s_{i-1}$ is a matrix multiplication, $\text{cols}(W)$ should match $\text{rows}(s_{i-1})$, but $\text{rows}(W)$ still remains undefined. The same seems true for the matrices $V$, $U$, $w$, $b$.

This is the part of the paper (pages 3-4) that describes the attention layer:

[screenshot of the attention equations (7)-(9) from the paper]

I'm unsure how to make sense of this. Am I missing something, or can someone explain this?

What I don't understand is:

  • What is the dimension of the previous alignment (denoted by $\alpha_{i-1}$)? Shouldn't it be the number of different values of $j$ in $h_{j}$ (which is 256, i.e., the number of encoder output states)?

  • What are the dimensions of $f_{i,j}$ and of the convolution filter $F$? The paper says $F$ has shape $k \times r$ but doesn't define $r$ anywhere. What is $r$, and what does $k \times r$ mean here?

  • How are the undefined dimensions of the matrices $V$, $U$, $w$, $b$ described above determined in this model?


Solution

First to equation (7): $s_{i-1}$ is a vector, not a matrix. When you multiply it with a matrix $W$, you get another vector whose length is the dimension of the intermediate attention projection; let us call it $d_a$. The shape of $W$ is thus $1024 \times d_a$ (treating vectors as rows, as deep learning libraries do). Similarly, the shape of $V$ is $512 \times d_a$, and the bias $b$ is a vector of length $d_a$. The vector $w$ also has length $d_a$ and projects the intermediate representation down to a scalar energy.
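
As a minimal sketch of how these shapes fit together (NumPy, the row-vector convention, and the concrete value $d_a = 128$ are my own assumptions, not something from the paper):

```python
import numpy as np

d_s, d_h, d_a = 1024, 512, 128      # decoder dim, encoder dim, assumed attention dim

s_prev = np.random.randn(d_s)       # s_{i-1}: previous decoder state, length 1024
h_j    = np.random.randn(d_h)       # h_j: one encoder state, length 512

W = np.random.randn(d_s, d_a)       # 1024 x d_a
V = np.random.randn(d_h, d_a)       # 512  x d_a
b = np.zeros(d_a)                   # length d_a
w = np.random.randn(d_a)            # length d_a

# e_{i,j} = w^T tanh(W s_{i-1} + V h_j + b): a single scalar energy per (i, j) pair
e_ij = w @ np.tanh(s_prev @ W + h_j @ V + b)
```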

You would probably like to compute everything with large matrix multiplications, which is possible only at training time (at inference time the decoder states are produced one by one), and which might be the reason for your confusion about the dimensions.

  • If you have all decoder states in a matrix $S$ of shape $n \times 1024$, you can project it using $W$ into the shape $n \times d_a$.
  • You can also project all encoder states $H$ of shape $m \times 512$ using $V$ and get a matrix of shape $m \times d_a$, where $n$ and $m$ are the output and input lengths.
  • Now you need a tensor of shape $n \times m \times d_a$ from which the energies $e_{i,j}$ are computed.

There is a simple way to implement this: reshape the first matrix to $n \times 1 \times d_a$ and the second one to $1 \times m \times d_a$, and sum them together. The deep learning / linear algebra libraries will broadcast the tensors (i.e., copy along the singleton dimensions), so you get a tensor of shape $n \times m \times d_a$. After applying $\tanh$ and multiplying with $w$, you get a table of attention energies of shape $n \times m$; applying a softmax over the last dimension gives you $\alpha_{i,j}$. There is no unknown dimension anywhere.
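
The broadcasting trick looks roughly like this in NumPy (again with an assumed $d_a$; at inference time you would compute just one row of this table per decoding step):

```python
import numpy as np

n, m, d_a = 10, 256, 128            # output length, input length, assumed attention dim

S_proj = np.random.randn(n, d_a)    # decoder states already projected by W: n x d_a
H_proj = np.random.randn(m, d_a)    # encoder states already projected by V: m x d_a
b = np.zeros(d_a)
w = np.random.randn(d_a)

# reshape to n x 1 x d_a and 1 x m x d_a; the sum broadcasts to n x m x d_a
hidden = np.tanh(S_proj[:, None, :] + H_proj[None, :, :] + b)

energies = hidden @ w                                     # n x m table of e_{i,j}
alphas = np.exp(energies - energies.max(axis=-1, keepdims=True))
alphas /= alphas.sum(axis=-1, keepdims=True)              # softmax over j: each row sums to 1
print(alphas.shape)                                       # (10, 256)
```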

Equation (8) is just a stack of sliding weighted sums, i.e., a one-dimensional convolution of the previous alignment. In $F$, you have $k$ filters (vectors) of width $r$; the previous alignment $\alpha_{i-1}$ is a vector of length $m$ (one probability per encoder position, 256 in your example). Convolving $\alpha_{i-1}$ with $F$ produces, at every encoder position $j$, a $k$-dimensional location-feature vector $f_{i,j}$:

$$f_{i,j} = \sum_{l=1}^{r} F_{:,\,l}\;\alpha_{i-1,\,j+l-\lceil r/2\rceil}$$

with the alignment zero-padded at the borders (the exact centering is an implementation detail). Note that each $\alpha_{i-1,j}$ is a scalar, so $f_i$ as a whole has shape $m \times k$. These features tell the scoring mechanism where the model attended at the previous decoding step. Both $k$ (the number of filters) and $r$ (the filter width) are hyperparameters you choose, just like $d_a$.
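
A sketch of that convolution, with `np.convolve(..., mode='same')` standing in for the padded 1-D convolution and with $k$ and $r$ picked arbitrarily:

```python
import numpy as np

m, k, r = 256, 32, 31               # input length, number of filters, filter width (my picks)

alpha_prev = np.random.dirichlet(np.ones(m))   # previous alignment alpha_{i-1}, sums to 1
F = np.random.randn(k, r)                      # k filters of width r

# f_i = F * alpha_{i-1}: convolve the alignment with every filter and stack the results
f_i = np.stack([np.convolve(alpha_prev, F[c], mode='same') for c in range(k)], axis=1)
print(f_i.shape)                    # (256, 32): a k-dim feature f_{i,j} at every position j
```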

Equation (9) is basically the same as (7); it only adds the term $U f_{i,j}$ inside the $\tanh$, so $U$ must map the $k$-dimensional location features into the same $d_a$-dimensional space, i.e., it has shape $k \times d_a$ in the row-vector convention used above.
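
Putting equation (9) together for a single position $j$ (shapes and names are again my own sketch, not the paper's code):

```python
import numpy as np

d_s, d_h, d_a, k = 1024, 512, 128, 32

s_prev = np.random.randn(d_s)       # s_{i-1}
h_j    = np.random.randn(d_h)       # h_j
f_ij   = np.random.randn(k)         # location feature f_{i,j} from the convolution above

W = np.random.randn(d_s, d_a)
V = np.random.randn(d_h, d_a)
U = np.random.randn(k, d_a)         # the only new matrix in eq. (9): k x d_a
b = np.zeros(d_a)
w = np.random.randn(d_a)

# e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)
e_ij = w @ np.tanh(s_prev @ W + h_j @ V + f_ij @ U + b)
```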

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange