Question

I have a set of resumes $R=\{r_1,\dots,r_n\}$, which I've transformed to a vector space using TF-IDF. Each resume has a label, which is the name of their current employer. Each of these labels comes from the set of possible employers $E = \{e_1,\dots,e_m\}$.

From this, I have trained a machine learning model. This model takes some $r_i$ from the test set and assigns a probability to each member of $E$. The results are then ranked from highest to lowest probability.

E.g. $P(e_2|r_i)=0.56,\ P(e_{52}|r_i)=0.29,\ P(e_{29}|r_i)=0.14, \dots$
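For illustration, here is a minimal sketch of this kind of setup (the toy resumes, employer names, and the choice of LogisticRegression are stand-ins, not my actual data or model):

```python
# Minimal sketch of the setup described above, using made-up toy data.
# The classifier choice (LogisticRegression) and the example resumes are
# illustrative assumptions, not the actual pipeline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

resumes = [
    "aeronautical engineer, CFD, wind tunnel testing",
    "retail assistant, stock management, customer service",
    "software engineer, python, machine learning",
    "avionics technician, aircraft maintenance",
]
employers = ["AeroCorp", "FoodMart", "TechCo", "AeroCorp"]  # labels from E

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(resumes)                 # TF-IDF vectors for R
model = LogisticRegression(max_iter=1000).fit(X, employers)

x_test = vectorizer.transform(["propulsion engineer, jet engines"])
probs = model.predict_proba(x_test)[0]                # P(e_j | r_i) for each employer
for idx in np.argsort(probs)[::-1]:                   # highest probability first
    print(model.classes_[idx], round(probs[idx], 3))
```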

The resume $r_i$ belongs to some individual, so this ranking is used to inform the individual which companies the model believes are most likely to hire them, given the details of what their resume contains (their skills, past employers, education, personal summary). In this case, company $e_2$ is most likely, followed by $e_{52}$, and so on.

My question is: how do you evaluate the performance of this recommendation system, where the information need of the user is to learn which companies their resume best matches?


My own ideas

My understanding from information retrieval is that we need to determine some measure of relevance. From this, it's possible to use a measure like mean average precision to assess performance. Determining relevance seems like the tricky part. For instance, $e_2$ has a high probability, but is it actually relevant? Maybe $r_i$ is based on aeronautical engineering, but $e_2$ is a food store, which is clearly not relevant. My current idea is to take each $r_i$ in the training set belonging to the same label $e_j$, and then compute a single TF-IDF vector which is the average of the TF-IDF vectors of all $r_i$ labelled as $e_j$.

E.g. (an unrealistic example) Suppose $r_2$ and $r_9$ are labelled as $e_4$. Now suppose $r_2$ has TF-IDF vector $[0.2, 0.1, 0.5, 0.2]$ and $r_9$ has TF-IDF vector $[0.22, 0.12, 0.44, 0.22]$. Then the average of these is $[0.21, 0.11, 0.47, 0.21]$. Repeating this process for all $e_j\in E$ results in $m$ of these vectors. From this, it's possible to compute the cosine similarity between some $e_i$ and $e_j$.
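A sketch of that averaging step, using the numbers from the example (the helper name and the extra $e_7$ row are invented for illustration):

```python
# Sketch of the per-employer centroid idea: average the TF-IDF vectors of all
# training resumes sharing a label, then compare centroids with cosine similarity.
# The helper name and the extra e_7 row are made up for illustration.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def employer_centroids(X_train, y_train, employer_names):
    """One averaged TF-IDF vector per employer; rows follow employer_names."""
    X_dense = np.asarray(X_train.todense()) if hasattr(X_train, "todense") else np.asarray(X_train)
    y_train = np.asarray(y_train)
    return np.vstack([X_dense[y_train == e].mean(axis=0) for e in employer_names])

X_toy = np.array([[0.20, 0.10, 0.50, 0.20],   # r_2, labelled e_4
                  [0.22, 0.12, 0.44, 0.22],   # r_9, labelled e_4
                  [0.70, 0.10, 0.10, 0.10]])  # an invented resume labelled e_7
y_toy = np.array(["e_4", "e_4", "e_7"])

centroids = employer_centroids(X_toy, y_toy, ["e_4", "e_7"])
print(centroids[0])                 # [0.21 0.11 0.47 0.21], as in the example
sim = cosine_similarity(centroids)  # sim[i, j] = cosineSim(e_i, e_j)
print(sim)
```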

Returning to the first example, we can take the true label of $r_i$ and find the cosine similarity between this label's vector and that of each member of $E$. Then we set some threshold and check whether $\text{cosineSim}(\text{true label}, e_j) \geq \text{threshold}$: if the cosine similarity is at or above the threshold, then $e_j$ is relevant; otherwise, $e_j$ is not relevant.
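In code, that relevance judgement combined with average precision might look something like this (the similarity values and the threshold are invented; only the ranking comes from the example above):

```python
# Hedged sketch: mark an employer as relevant if its centroid is similar enough
# to the true label's centroid, then score the model's ranked list with average
# precision. The similarity values and the threshold are invented for illustration.
import numpy as np

def average_precision(ranked_employers, relevant_set):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    hits, precisions = 0, []
    for k, employer in enumerate(ranked_employers, start=1):
        if employer in relevant_set:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

sim_to_true_label = {"e_2": 0.91, "e_52": 0.40, "e_29": 0.85}   # invented numbers
threshold = 0.8
relevant = {e for e, s in sim_to_true_label.items() if s >= threshold}

ranked = ["e_2", "e_52", "e_29"]              # the model's ranking for r_i above
print(average_precision(ranked, relevant))    # 0.833... for this toy case
```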

I'm not sure if this is a sensible/valid approach (I wonder if it defeats the point of the machine learning, since I may as well just use the cosine similarity? That said, I cannot forgo the machine learning component in this project).

Maybe this is an overcomplication, and something like top-$k$ accuracy would be fine, i.e. is the true label in the top $k$ suggestions?
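That metric is simple to compute directly from the probability matrix; a rough sketch (the variable names are assumptions):

```python
# Rough sketch of top-k accuracy: the fraction of test resumes whose true
# employer is among the k highest-probability suggestions. `probs` is assumed
# to be an (n_test x m) matrix of P(e_j | r_i); `classes` gives its column order.
import numpy as np

def top_k_accuracy(probs, y_true, classes, k=5):
    top_k = np.argsort(probs, axis=1)[:, ::-1][:, :k]     # indices of the k best employers
    class_index = {c: i for i, c in enumerate(classes)}
    hits = [class_index[label] in row for label, row in zip(y_true, top_k)]
    return float(np.mean(hits))
```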

I'm not sure; I'd be interested in a more informed perspective.


Solution

To the extent possible, you should try to evaluate based on your data rather than on some ad hoc measure. As you rightly noticed, there is a real risk that the ad hoc measure would simply confirm the model's predictions, since it uses a somewhat similar method.

I would suggest splitting your data into a training set and a test set (or, even better, using cross-validation) and, indeed, using top-$k$ accuracy (or something similar) to evaluate on the test set. That would be the safe option for a proper evaluation; you could then check whether your ad hoc measure correlates with it. If it does, you have evidence that it can be used in the future instead of a test set.
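For example, a hold-out evaluation could look like the sketch below (the classifier and the scikit-learn helpers are my assumptions, since you didn't say which model or library you use; cross-validation would repeat the same idea over folds):

```python
# Sketch of a hold-out evaluation with top-k accuracy; `resumes` and `employers`
# are placeholders for the real data, and LogisticRegression stands in for
# whichever classifier is actually used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_top_k(resumes, employers, k=5):
    txt_train, txt_test, y_train, y_test = train_test_split(
        resumes, employers, test_size=0.2, stratify=employers, random_state=0)
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(txt_train)   # fit TF-IDF on training data only
    X_test = vectorizer.transform(txt_test)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = model.predict_proba(X_test)             # one row of P(e_j | r_i) per test resume
    return top_k_accuracy_score(y_test, probs, k=k, labels=model.classes_)
```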

Side note: your instances don't contain any negative evidence, such as resumes rejected by an employer. If you could obtain that kind of data, it would probably improve the predictions.

Licensed under: CC-BY-SA with attribution