Question

In attention, the context vector ($c$) is the sum of the encoder hidden states ($h$) weighted by the attention weights ($\alpha$), where the weights come from a softmax over alignment scores computed between the decoder hidden state and each encoder state.

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

My question is: why compute this context vector at all, instead of simply forwarding the attention weights to the decoder? The weights already indicate how much to focus on each of the encoder states.

Could somebody explain the intuition behind this?
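For concreteness, here is a minimal NumPy sketch of the computation being asked about (dot-product alignment scores are one common choice; the sizes and random states are illustrative, not from any specific model). Note that the weights $\alpha$ are just $T_x$ scalars saying *where* to look, while the context vector carries the actual encoder content:

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, d = 5, 8                    # source length, hidden size (illustrative)
h = rng.standard_normal((T_x, d))  # encoder hidden states h_j
s = rng.standard_normal(d)         # current decoder hidden state

# Alignment scores (dot-product variant), normalized by softmax -> weights
e = h @ s
alpha = np.exp(e - e.max())
alpha /= alpha.sum()               # alpha_ij, sums to 1 over j

# Context vector: weighted sum of encoder states, c_i = sum_j alpha_ij h_j
c = alpha @ h                      # shape (d,), same size as one encoder state

print(alpha.shape)  # (5,)  -- just T_x numbers, no content
print(c.shape)      # (8,)  -- a full hidden-size vector of encoder information
```

The shapes make the contrast visible: the weights alone are a length-$T_x$ probability vector, whereas the context vector is a fixed-size summary of the source that the decoder can consume regardless of the input length.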


Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange