How to identify similar words as input from a dictionary

https://datascience.stackexchange.com/questions/75867

12-12-2020
|

Pergunta

Let's say I have a CSV file (single column) with a list of words as input. Meaning the file looks like below

Input Terms

Subaaaash                         
Paruuru                          
Mettromin                        
Paracetamol                      
Crocedinin                        
Vehiclehcle                      
Buildding

Dict terms #this dict has around million records and I have to find a closest match for input term from this dictionary

Metformin 250 MG
Metformin
.....
Crocin
Vehcile 
Paru
Subash

Now, I expect my output to be like as shown below

As you can see that red colored Paru is not a correct match for paracetamol, but that's the closest match we can get for the input term paracetamol from the dictionary. So, I would like to do matching based on

1) Word sounds (when pronounced). Phenotics

2) Spelling mistake corrections

3) Find the best matching word from the dictionary for input terms

Can you let me know how can I do the above?

Solução

So, your question concerns how to effectively translate the input words into their proposed correct words (e.g. Paruuuu --> Paru) via phonetics and spelling mistake corrections.

My first idea on this would be to use a deep sequence to sequence model. In a sequence to sequence model, we encode the input word (e.g. Paruuuu) as a sequence of characters into an encoder (effectively an RNN / LSTM, etc.) into a "hidden representation".
Then you decode your hidden representation as a sequence of phonemes (which denote how we pronounce the word) with a decoder (again, another RNN / LSTM, etc.).
Then, we can take this sequence of phonemes as input into another encoder and then decode with a neural network, where the output layer is a softmax layer, which computes the probability distribution over all words in your vocabulary (in you Dict terms) and select the word the highest probability.
So overall, we have:

Encode input word --> Decode output phoneme sequence --> Encode output phoneme sequence --> Decode with neural network to classify word.

The proposed method is supervised, so of course, you will need examples of input words and their correct "translations" (e.g. ("Paruuu", "Paru"))

Here is an article which gives a good intuition behind sequence to sequence models which can classify your input words: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346

Licenciado em: CC-BY-SA com atribuição

Não afiliado a datascience.stackexchange