I'm trying to understand the workings and output of BERT, and I'm wondering how/why each layer of BERT has a 'hidden state'.

I understand that RNNs have a 'hidden state' that gets passed to each time step, which is a representation of the previous inputs. But I've read that BERT isn't an RNN - it's a CNN with attention.

But you can output the hidden state of each layer of a BERT model. How is it that BERT has hidden states if it's not an RNN?


Solution

BERT is a transformer.

A transformer is made of several similar layers stacked on top of each other.
Each layer has an input and an output, so the output of layer n-1 is the input of layer n.
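To make the stacking concrete, here is a minimal conceptual sketch in PyTorch (a toy stand-in, not BERT's actual encoder blocks): each layer consumes the previous layer's output, and collecting those outputs gives you one hidden state per layer.

```python
import torch
import torch.nn as nn

class TinyStack(nn.Module):
    """Toy stack of layers; each Linear stands in for a transformer encoder block."""
    def __init__(self, hidden_size=768, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)]
        )

    def forward(self, x):
        hidden_states = [x]            # the embedding output counts as the first hidden state
        for layer in self.layers:
            x = layer(x)               # output of layer n-1 becomes the input of layer n
            hidden_states.append(x)    # each layer's output is one "hidden state"
        return hidden_states

states = TinyStack()(torch.randn(1, 8, 768))   # batch of 1, sequence length 8
print(len(states))                             # 13: embedding output + 12 layer outputs
```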

The hidden states you mention are simply these per-layer outputs.
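For example, with the Hugging Face transformers library (assuming that's what you're using to inspect BERT), you can ask the model to return every layer's output:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("BERT has one hidden state per layer.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Tuple with the embedding output plus one tensor per encoder layer:
# 13 tensors for bert-base, each of shape (batch, seq_len, hidden_size=768).
print(len(outputs.hidden_states))
print(outputs.hidden_states[-1].shape)
```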


You might want to have a quick look at this explanation of the Transformer architecture: https://jalammar.github.io/illustrated-transformer/

Note that BERT uses only encoders, no decoders.

Licensed under: CC-BY-SA with attribution