Question

https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77

I read the blog post above. It visualizes how different attention heads (each shown in a different color) attend to different words.

As I understand it, the code that implements each head is almost the same.
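
To make that point concrete, here is a minimal sketch in plain Python, not BERT's actual code. It assumes a simplified scaled dot-product attention with no masking, no value projection, and random rather than trained weights; the names `attention_weights`, `random_matrix`, `w_q`, and `w_k` are hypothetical. Every head runs the same function, and the only thing that differs between heads is the pair of projection matrices passed in.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of Gaussian noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attention_weights(x, w_q, w_k):
    """Attention pattern of ONE head. The code is shared by all heads;
    only the projection matrices w_q and w_k are head-specific."""
    q = matmul(x, w_q)                    # (seq_len, d_head)
    k = matmul(x, w_k)                    # (seq_len, d_head)
    k_t = [list(col) for col in zip(*k)]  # (d_head, seq_len)
    scale = math.sqrt(len(w_q[0]))
    scores = matmul(q, k_t)               # (seq_len, seq_len)
    return [softmax([s / scale for s in row]) for row in scores]


rng = random.Random(0)
seq_len, d_model, d_head, n_heads = 4, 8, 4, 2

# Stand-in for token embeddings (assumption: random, not real BERT inputs).
x = random_matrix(seq_len, d_model, rng)

# Each head gets its own (w_q, w_k) pair; in a trained model these are learned.
heads = [
    (random_matrix(d_model, d_head, rng), random_matrix(d_model, d_head, rng))
    for _ in range(n_heads)
]

patterns = [attention_weights(x, w_q, w_k) for w_q, w_k in heads]

for i, pattern in enumerate(patterns):
    print(f"head {i}:")
    for row in pattern:
        print("  ", [round(v, 2) for v in row])
```

Running this prints one attention matrix per head. The two matrices differ even though both heads went through the same function on the same input, because each head has its own parameters.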

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange