Question

What is the difference between the positional vector and the attention vector used in the transformer model? I saw a video on YouTube where the definition of the positional vector was given as *"a vector that gives context based on the position of a word in a sentence"*, and the definition of the attention vector was given as *"for every word we can have an attention vector generated which captures contextual relationships between words in a sentence"*.

Capturing context information based on position (positional vector) and based on attention (attention vector) sounds the same, right? Or is it different?


Solution

So the question asks about the difference between an attention vector and a positional vector.

To answer this question, I will give some context on how the transformer differs from sequential models such as RNNs and LSTMs. In the case of RNNs and LSTMs, data is fed into the model sequentially, "one-by-one", to predict the output (whether the output is produced at each time step or after observing the whole sequence is irrelevant in the context of this question).

In a transformer model, the whole sequence is fed into the model at once, just as you would with a conventional neural network. However, the problem is that, unlike with RNNs/LSTMs, the transformer then has no inherent way to understand the ordering of the instances in the sequence. Therefore, we need positional embeddings (the positional vector, in your terminology) to add information to the individual instances that tells the model where each one sits in the sequence.
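As a minimal sketch of this idea, here is the sinusoidal positional-encoding scheme from the original transformer paper, written in numpy. The sequence length, model dimension, and toy token embeddings are illustrative assumptions, not anything specific to the question:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return an array of shape (seq_len, d_model); each row encodes one position."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                        # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions use cosine
    return encoding

# Toy usage: embeddings for a 4-token sentence, model dimension 8.
token_embeddings = np.random.randn(4, 8)                    # hypothetical word embeddings
model_input = token_embeddings + sinusoidal_positional_encoding(4, 8)
```

The key point is that the positional vector is simply added to each word embedding before the model sees it, so two identical words at different positions end up with different inputs.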

Attention, in the context of transformers, works by assigning higher coefficients to the instances in the sequence that are most relevant for decoding the hidden representation from the encoder. Unlike a basic encoder-decoder model, with attention we can flexibly decide which input instances in the sequence have the most "say" in predicting the next instance of the output sequence.
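To make the contrast concrete, here is a minimal numpy sketch of scaled dot-product self-attention. Assume Q, K, and V all come from the same toy sequence; the shapes and random inputs are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return context vectors and attention weights for Q, K, V of shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V, weights

# Toy usage: self-attention over a 4-token sequence with dimension 8.
x = np.random.randn(4, 8)                                     # hypothetical token representations
context, attn = scaled_dot_product_attention(x, x, x)
print(attn.shape)  # (4, 4): one attention vector per word, weighting every other word
```

So the positional vector is a fixed (or learned) function of where a word sits, while the attention weights are computed from the content of the words and change with every input sentence.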

I hope this clarifies things. If not, there is a great article on transformers here: http://www.peterbloem.nl/blog/transformers

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange