Question

I'm looking for a proper definition of what sort of problem this is, so I can research it further on my own - though I will certainly appreciate any specific advice on industry-standard ways of working on it. I'm fairly inexperienced with NLP and recommender systems, but I've done a fair amount of classical ML before, in case that matters.

The problem - given a "search" query consisting of a list of inputs and a list of expected outputs, retrieve and rank up to N best semantic matches for each input. Constraints:

  1. All inputs and outputs are anywhere between a word and a sentence.
  2. Number of outputs >= number of inputs.
  3. Inputs are unique among inputs and outputs are unique among outputs, but an output can be identical to an input.
  4. Every input is guaranteed to have at least one "good enough" output present.
  5. Every output is guaranteed to be the best match for at most one input.
  6. Labelled data is available, i.e. human-tagged queries. Inputs are sparse, i.e. inputs for queries targeting identical or very similar output sets can be very different.
  7. Data is in English, and I work in Python - if that matters for your suggestions.

An example:

Inputs

(1) Truck
(2) Assortment of lemons, limes, and oranges.
(3) Apples, pears, and oranges.
(4) A tool with broad blade, used for digging.

Outputs

(a) Citrus fruits
(b) Portable telephone that can make and receive calls over a radio frequency.
(c) Fruits traditionally grown in Germany.
(d) Vehicles used for cargo transportation.
(e) Shovel
(f) Motor vehicle used for transportation.

Desired result (numbers are arbitrary, for illustrative purposes)

1 - d (100%), f (75%)
2 - a (95%), c (60%)
3 - c (87%), a (45%)
4 - e (100%)
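
For concreteness, here is one way a labelled query like the one above might be represented in Python. The field names and structure are purely illustrative assumptions on my part, not part of any actual dataset.

```python
# Hypothetical representation of a single human-labelled query.
query = {
    "inputs": [
        "Truck",
        "Assortment of lemons, limes, and oranges.",
        "Apples, pears, and oranges.",
        "A tool with broad blade, used for digging.",
    ],
    "outputs": [
        "Citrus fruits",
        "Portable telephone that can make and receive calls over a radio frequency.",
        "Fruits traditionally grown in Germany.",
        "Vehicles used for cargo transportation.",
        "Shovel",
        "Motor vehicle used for transportation.",
    ],
    # Human labels: for each input index, a ranked list of (output index, score).
    "labels": {
        0: [(3, 1.00), (5, 0.75)],
        1: [(0, 0.95), (2, 0.60)],
        2: [(2, 0.87), (0, 0.45)],
        3: [(4, 1.00)],
    },
}
```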

The similar-question suggestions from Stack Exchange don't answer my question, and searching for the answer elsewhere just drops me into an endless stream of articles about sentiment analysis of IMDB or Twitter datasets.

Solution

I'm experienced primarily in recommender systems, but I've done enough work in NLP to have some ideas on how to approach this problem.

I don't know of a formal name for the problem you're describing, but I do know it's going to be very hard to train a model from just that data, even if you had dense labels. There's too much unstated human context embedded in the problem to learn it from scratch.

Take your example of "Fruits traditionally grown in Germany." You need some knowledge graph or embedding that has already learned the relationship between fruits and where they are grown geographically. There is a limited set of resources out there capable of this.

So what you need to do is bring a pre-trained model or language embedding to the data and engineer a solution on top of it.

The first thing that comes to mind is large-scale pre-trained skip-gram or CBOW embeddings (these embeddings are often called "word vectors" or "thought vectors"); searching for "word2vec" will get you to the basics.

The idea here is to use some pre-trained language model and calculate embeddings for your inputs and outputs. Then you just do cosine similarity between the embeddings for each input and output, and see if you can get good matches.

Because you're working with sentences, you're either going to have to use a model that embeds whole sentences or documents (doc2vec is one example), or you're going to have to find some way to aggregate the token embeddings. I would pick the former if you try this approach.
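
As a concrete illustration of the embed-and-compare idea, here is a minimal sketch using the sentence-transformers library - my choice of tool and model, not something prescribed above; any sentence or document encoder could be substituted.

```python
# Minimal sketch: embed inputs and outputs with a pre-trained sentence
# encoder, then rank the outputs for each input by cosine similarity.
# Assumes `pip install sentence-transformers`; the model name is just
# one common choice.
from sentence_transformers import SentenceTransformer, util

inputs = [
    "Truck",
    "Assortment of lemons, limes, and oranges.",
    "A tool with broad blade, used for digging.",
]
outputs = [
    "Citrus fruits",
    "Vehicles used for cargo transportation.",
    "Shovel",
    "Motor vehicle used for transportation.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
input_emb = model.encode(inputs, convert_to_tensor=True)
output_emb = model.encode(outputs, convert_to_tensor=True)

# Cosine similarity matrix: rows = inputs, columns = outputs.
scores = util.cos_sim(input_emb, output_emb)

top_n = 2
for i, text in enumerate(inputs):
    ranked = scores[i].argsort(descending=True)[:top_n]
    matches = [(outputs[int(j)], float(scores[i][j])) for j in ranked]
    print(text, "->", matches)
```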

Building on that, once you have these pre-trained embeddings, you could also train your own neural net on top of them to classify matches. I would read up on question answering with neural networks for inspiration; you might borrow ideas from that work on how to associate queries and matches in a neural net (Google "QANet" for some leads).
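
To sketch that idea, one simple setup - my assumption, not something prescribed above - is to turn each labelled (input, output) pair into a feature vector built from the two embeddings and train an ordinary classifier on match/no-match, then rank candidates by predicted probability.

```python
# Sketch: train a pairwise match classifier on top of pre-trained embeddings.
# `embed()` stands in for whatever sentence encoder you use; the feature
# construction (concatenation, absolute difference, elementwise product)
# is a common heuristic, not something from the original answer.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(emb_in: np.ndarray, emb_out: np.ndarray) -> np.ndarray:
    return np.concatenate([emb_in, emb_out, np.abs(emb_in - emb_out), emb_in * emb_out])

def build_dataset(queries, embed):
    """queries: iterable of (inputs, outputs, positive_pairs), where
    positive_pairs is a set of (input_idx, output_idx) from the human labels."""
    X, y = [], []
    for inputs, outputs, positives in queries:
        in_emb = [embed(t) for t in inputs]
        out_emb = [embed(t) for t in outputs]
        for i in range(len(inputs)):
            for j in range(len(outputs)):
                X.append(pair_features(in_emb[i], out_emb[j]))
                y.append(1 if (i, j) in positives else 0)
    return np.array(X), np.array(y)

# Usage sketch (labelled_queries and my_sentence_encoder are hypothetical):
# X, y = build_dataset(labelled_queries, embed=my_sentence_encoder)
# clf = LogisticRegression(max_iter=1000).fit(X, y)
# Rank candidate outputs for a new input by clf.predict_proba(...)[:, 1].
```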

Another approach, depending on the data you're working with, would be a knowledge graph. This solution is more complicated: you would break the documents up into pieces (POS/NER tagging) and then search for semantic equivalents in a knowledge graph. There are several publicly available KGs you could start from.
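
To make the first step of that pipeline concrete, here is a small sketch using spaCy - my choice of tagger, not one named above - that pulls out the entities and noun phrases you would then try to look up or link in a knowledge graph.

```python
# Sketch: extract candidate terms (named entities, noun chunks, nouns)
# to query against a knowledge graph. Assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_terms(text: str) -> dict:
    doc = nlp(text)
    return {
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        "noun_chunks": [chunk.text for chunk in doc.noun_chunks],
        "nouns": [tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "PROPN")],
    }

print(extract_terms("Fruits traditionally grown in Germany."))
# "Germany" should come back as a GPE entity and "Fruits" as a noun chunk;
# these terms become the nodes you try to match in the KG.
```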

Another thing to Google is the field of "ontology". It's closely related to knowledge graphs, but may surface more niche results that could prove helpful.
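
As one widely available ontology-like resource, WordNet (accessed via NLTK here - my suggestion, not one from the original answer) lets you walk hypernym chains and compute rough relatedness scores between concepts:

```python
# Sketch: use WordNet (via NLTK) as a lightweight ontology to relate terms.
# Assumes `pip install nltk` and nltk.download("wordnet") have been run.
from nltk.corpus import wordnet as wn

truck = wn.synsets("truck", pos=wn.NOUN)[0]
vehicle = wn.synsets("vehicle", pos=wn.NOUN)[0]

# One hypernym path from the WordNet root ("entity") down to "truck".
print([s.name() for s in truck.hypernym_paths()[0]])

# Wu-Palmer similarity: a rough 0..1 relatedness score between synsets.
print(truck.wup_similarity(vehicle))                        # high
print(truck.wup_similarity(wn.synsets("lemon", pos=wn.NOUN)[0]))  # lower
```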

More context about the exact problem you're trying to solve, or the kind of dataset you're working with, might illuminate other solutions. I hope some of the terms I've given you push you in the right direction. Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange