Question

I am confused about the softmax layer of YouTube's DNN candidate generation. A user may interact with many videos, yet softmax assumes classes are mutually exclusive. For example, with logits = [[4.0, 4.0, 1.0]] and labels = [[1.0, 1.0, 0.0]], the sigmoid cross entropy loss is 0.45 while the softmax cross entropy loss is 1.43.
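For reference, the two loss values above can be reproduced directly; a minimal NumPy sketch (not from the original post, following the usual TensorFlow conventions of summing softmax cross entropy over classes and averaging sigmoid cross entropy over classes):

```python
import numpy as np

logits = np.array([4.0, 4.0, 1.0])
labels = np.array([1.0, 1.0, 0.0])

# Softmax cross entropy: classes compete for one probability mass,
# so two positive labels cannot both get high probability.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
softmax_loss = -(labels * np.log(probs)).sum()  # ≈ 1.435

# Sigmoid cross entropy: one independent binary problem per class,
# averaged over the three classes.
sig = 1.0 / (1.0 + np.exp(-logits))
sigmoid_loss = -(labels * np.log(sig)
                 + (1 - labels) * np.log(1 - sig)).mean()  # ≈ 0.45
```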

Is it because in the candidate generation stage, the relative order of items does not matter?


Solution

"Is it because in the candidate generation stage, the relative order of items does not matter?"

Yes, that is exactly what appears to be going on, although YouTube is using softmax in an untraditional way. The candidate generation model simply selects the few hundred candidate videos that are subsequently ranked by the ranking model.

I think Section 3 of the paper you referenced does a good job of explaining what's going on:

"At serving time we need to compute the most likely N classes (videos) in order to choose the top N to present to the user... Since calibrated likelihoods from the softmax output layer are not needed at serving time, the scoring problem reduces to a nearest neighbor search in the dot product space for which general purpose libraries can be used."
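In other words, at serving time each video is just an embedding vector (a row of the softmax output weights), and scoring reduces to dot products against the user's final hidden-layer activation. A toy sketch of that top-N retrieval, with made-up dimensions and random embeddings (assumptions for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_videos, dim, top_n = 1000, 32, 5

# Hypothetical learned parameters: one embedding per video (the softmax
# output weights) and one user vector (the DNN's last hidden layer).
video_embeddings = rng.normal(size=(n_videos, dim))
user_embedding = rng.normal(size=dim)

# No softmax needed at serving time: softmax is monotonic in the logits,
# so ranking by raw dot product yields the same top-N order.
scores = video_embeddings @ user_embedding
top_candidates = np.argsort(scores)[::-1][:top_n]
```

In production this exact search would be replaced by an approximate nearest-neighbor library, which is the point of the quoted passage.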

As far as I can tell, this kind of recommender architecture is only beneficial at the scale at which an organization like YouTube operates, and has more to do with the practicality of organizing computational infrastructure than with model performance. I suspect the difference in model performance versus a more "traditional" architecture is negligible as far as metrics like MAP@k are concerned.

EDIT: I found the same question already answered elsewhere, in much more detail than shown here.

Licensed under: CC-BY-SA with attribution