Question

I'm trying to understand the workings and output of BERT, and I'm wondering how/why each layer of BERT has a 'hidden state'.

I understand that RNNs have a 'hidden state' that gets passed to each time step, which is a representation of previous inputs. But I've read that BERT isn't an RNN - it's a CNN with attention.

But you can output the hidden state for each layer of a BERT model. How is it that BERT has hidden states if it's not an RNN?


Solution

BERT is a transformer.

A transformer is made of several similar layers, stacked on top of each other.
Each layer has an input and an output, so the output of layer n-1 is the input of layer n.

The hidden state you mention is simply the output of each layer.
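
As a minimal sketch (assuming the Hugging Face `transformers` library, which the original answer does not mention), you can ask BERT to return the output of every layer and see that there is one hidden-state tensor per layer:

```python
import torch
from transformers import BertTokenizer, BertModel

# Load a pretrained BERT encoder and request per-layer outputs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Hello, BERT!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# For bert-base, outputs.hidden_states is a tuple of 13 tensors:
# the embedding output plus the output of each of the 12 encoder layers.
for i, hidden in enumerate(outputs.hidden_states):
    print(i, hidden.shape)  # each: (batch_size, sequence_length, hidden_size=768)
```

Each tensor in that tuple is exactly the "hidden state" of one layer: the representation that layer passes on as input to the next layer.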


You might want to take a quick look at this explanation of the Transformer architecture: https://jalammar.github.io/illustrated-transformer/

Note that BERT uses only encoders, no decoders.

Licensed under: CC-BY-SA with attribution