Why and how can BERT learn different attention patterns for each head?
Question
I read the blog linked above. It visualizes how different heads (shown in different colors) attend to different words. As far as I understand, though, the code implementation of every head is essentially identical, so how do the heads end up learning different attention patterns?
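A minimal sketch (not BERT's actual code) may make the resolution concrete: every head runs the same formula, but each head owns its own projection matrices, which are initialized independently at random. Identical code, different parameters, hence different attention maps, and training then pushes the heads along different trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))  # toy token embeddings

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_attention(x, Wq, Wk):
    # Same code path for every head: scaled dot-product attention.
    q, k = x @ Wq, x @ Wk
    return softmax(q @ k.T / np.sqrt(d_head))  # (seq_len, seq_len)

# Two heads: identical function, but each gets its own random weights.
attn_maps = [
    head_attention(
        x,
        rng.normal(size=(d_model, d_head)),  # per-head W_Q
        rng.normal(size=(d_model, d_head)),  # per-head W_K
    )
    for _ in range(2)
]

# The maps already differ before any training has happened.
print(np.allclose(attn_maps[0], attn_maps[1]))  # → False
```

So "the code is the same" is true only of the function; the learned parameters per head are separate, and symmetry is broken by random initialization.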
No accepted answer
Licensed under: CC-BY-SA with attribution