How to identify similar words as input from a dictionary
Pergunta
Let's say I have a CSV file (single column) with a list of words as input. Meaning the file looks like below
Input Terms
Subaaaash
Paruuru
Mettromin
Paracetamol
Crocedinin
Vehiclehcle
Buildding
Dict terms #this dict has around million records and I have to find a closest match for input term from this dictionary
Metformin 250 MG
Metformin
.....
Crocin
Vehcile
Paru
Subash
Now, I expect my output to be like as shown below
As you can see that red colored Paru
is not a correct match for paracetamol
, but that's the closest match we can get for the input term paracetamol
from the dictionary. So, I would like to do matching based on
1) Word sounds (when pronounced). Phenotics
2) Spelling mistake corrections
3) Find the best matching word from the dictionary for input terms
Can you let me know how can I do the above?
Solução
So, your question concerns how to effectively translate the input words into their proposed correct words (e.g. Paruuuu --> Paru) via phonetics and spelling mistake corrections.
My first idea on this would be to use a deep sequence to sequence model. In a sequence to sequence model, we encode the input word (e.g. Paruuuu) as a sequence of characters into an encoder (effectively an RNN / LSTM, etc.) into a "hidden representation".
Then you decode your hidden representation as a sequence of phonemes (which denote how we pronounce the word) with a decoder (again, another RNN / LSTM, etc.).
Then, we can take this sequence of phonemes as input into another encoder and then decode with a neural network, where the output layer is a softmax layer, which computes the probability distribution over all words in your vocabulary (in you Dict terms) and select the word the highest probability.
So overall, we have:
Encode input word --> Decode output phoneme sequence --> Encode output phoneme sequence --> Decode with neural network to classify word.
The proposed method is supervised, so of course, you will need examples of input words and their correct "translations" (e.g. ("Paruuu", "Paru"))
Here is an article which gives a good intuition behind sequence to sequence models which can classify your input words: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346