Problem

After exhaustively reading the many blogs and papers on Transformer-XL, I still have some questions before I can say that I understand Transformer-XL (and by extension XLNet). Any help in this regard is hugely appreciated.

  1. When we say hidden states are transferred from one segment to another, what exactly is included in these hidden states? Are the weights of the networks implementing the attention mechanism (i.e., the ones computing Q, K, and V) included? Are the weights involved in computing the input word embedding included in the hidden state?
  2. When the hidden states are transferred during recurrence, is this transfer from the encoder of one segment to the encoder of the next segment? Or is it from the decoder of the current segment to the encoder of the next segment? Is the decoder involved at all in the hidden state transfer?
  3. I see images like the following in the papers and blog posts. What do the dots represent? Encoders? Decoders? Or an entire unit? I guess the answer to my second question will shed light on this one too. (From the Transformer-XL Google page.)

Thank you


Solution

By hidden states, they mean the outputs of the layers, i.e., what you get after the feed-forward sub-layer. For Transformer-XL, it is important that these are also what you use as input to the self-attention. Therefore, at inference time, if you want to compute the states recursively, segment by segment (presumably because you cannot fit the entire input in memory), this is the only thing you need to remember from the previous steps to continue the computation.
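To make that concrete, here is a minimal, simplified sketch in PyTorch (the `ToyLayer` class and its `layer(h, mem)` interface are made up for illustration, and the relative positional encodings the real model also needs are omitted). It only illustrates the point above: the cached per-layer hidden states are the sole state carried from one segment to the next, and they are concatenated with the current segment as extra attention context.

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """A hypothetical, stripped-down decoder layer: self-attention + feed-forward."""
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h, mem):
        # Keys/values span the cached previous-segment states plus the current segment.
        context = torch.cat([mem, h], dim=0)
        attn_out, _ = self.attn(h, context, context, need_weights=False)
        h = self.norm1(h + attn_out)
        return self.norm2(h + self.ff(h))

d_model, n_layers, seg_len, batch = 32, 2, 8, 1
layers = nn.ModuleList([ToyLayer(d_model) for _ in range(n_layers)])
mems = [torch.zeros(0, batch, d_model) for _ in range(n_layers)]  # empty memory at start

for segment in torch.randn(3, seg_len, batch, d_model):  # three consecutive segments
    h, new_mems = segment, []
    for layer, mem in zip(layers, mems):
        new_mems.append(h.detach())  # remember this layer's input (previous layer's output)
        h = layer(h, mem)
    mems = new_mems  # this list of cached hidden states is all that is passed on
```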

There is no encoder; you can think of Transformer-XL as a decoder-only model. Transferring the states just means remembering them so that you can do self-attention over them, but you can no longer back-propagate through them, because you only remember the values and not the computation graph that produced them.
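In PyTorch terms, "remembering the values but not the graph" corresponds to detaching the cached states. A tiny generic illustration (not Transformer-XL code; the tensors are stand-ins):

```python
import torch

h_prev = torch.randn(4, 8, requires_grad=True)  # stand-in for a previous-segment hidden state
mem = h_prev.detach()                           # what gets kept as memory: values only
w = torch.randn(8, 8, requires_grad=True)       # stand-in for the current segment's weights
loss = (mem @ w).sum()                          # current-segment computation using the memory
loss.backward()
print(w.grad is not None, h_prev.grad is None)  # True True: gradients stop at the memory
```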

The dots in the diagram correspond to the hidden states: one state per input subword and per layer. The lines between them are the self-attention links.

Another tip

Based on what @Jindřich replied above, I did another literature survey. I found an architecture diagram for Transformer-XL in the paper Language Modelling for Source Code Using Transformer-XL, which I copy here to serve as a reference/guide for those with the same question in the future:

[Transformer-XL architecture diagram from the cited paper]

Note that the paper does not appear to have been peer-reviewed yet, so corrections (if any) suggested by readers are most welcome.

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange