Why and how can BERT learn different attention patterns for each head?
Question
I read the blog linked above. It visualizes how different heads (shown in different colors) attend to different words. As far as I understand, though, the code implementation of every head is essentially identical, so how do the heads end up learning different attention patterns?
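A minimal sketch (not BERT's actual code) may make the resolution concrete: every head runs the same formula, but each head owns its own projection matrices, which are initialized independently at random. Identical code, different parameters, hence different attention maps, and training then pushes the heads along different trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))  # toy token embeddings

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_attention(x, Wq, Wk):
    # Same code path for every head: scaled dot-product attention.
    q, k = x @ Wq, x @ Wk
    return softmax(q @ k.T / np.sqrt(d_head))  # (seq_len, seq_len)

# Two heads: identical function, but each gets its own random weights.
attn_maps = [
    head_attention(
        x,
        rng.normal(size=(d_model, d_head)),  # per-head W_Q
        rng.normal(size=(d_model, d_head)),  # per-head W_K
    )
    for _ in range(2)
]

# The maps already differ before any training has happened.
print(np.allclose(attn_maps[0], attn_maps[1]))  # → False
```

So "the code is the same" is true only of the function; the learned parameters per head are separate, and symmetry is broken by random initialization.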
No accepted answer
Licensed under: CC-BY-SA with attribution