How to find an appropriate algorithm to build a model from two natural language fields [closed]
16-12-2020
Question
What I would like to do
I would like to build a model that infers nationality from a name. I created the data frame below by combining two datasets from Kaggle.
Titanic: Machine Learning from Disaster (input/titanic/train.csv)
PassengerId Nationality Name
0 1 CelticEnglish Braund
1 2 CelticEnglish Cumings
2 3 Nordic Heikkinen
3 4 CelticEnglish Futrelle
....
Problem
How can I find an algorithm to build a first model from these two columns, Nationality and Name?
Since both are natural language, I understand that they need to be converted to vectors, and that this would be a multi-class classification problem.
However, I have no idea how to find an algorithm to train on this dataset.
Solution
There's no algorithm intended specifically for this task; you need to design the process yourself (as for most tasks, by the way).
Given that the goal is to use a person's name as the indicator, I'd suggest representing each name as a vector of character n-grams and using those as the features.
Example with bigrams ($n=2$), where # marks the start and end of the name:
"Braund" = [ #B, Br, ra, au, un, nd, d# ]
Intuitively, the goal is for the model to find the sequences of letters that are most specific to a nationality. You could try unigrams, bigrams or trigrams (the higher $n$, the more training data you need).
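As a rough illustration of the n-gram representation, here is a minimal sketch in plain Python. The function name `char_ngrams` and the choice to pad with $n-1$ boundary markers on each side are assumptions made for this example; with $n=2$ the padding reproduces the "Braund" bigrams shown above.

```python
def char_ngrams(name, n=2, pad="#"):
    """Return the list of character n-grams of `name`.

    Assumption: the name is padded with n-1 boundary markers on each
    side, so n=2 gives "#B ... d#" and n=1 gives the bare letters.
    """
    padded = pad * (n - 1) + name + pad * (n - 1)
    # Slide a window of width n over the padded string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("Braund", 2))
# ['#B', 'Br', 'ra', 'au', 'un', 'nd', 'd#']
```

Counting how often each n-gram occurs in a name then gives the feature vector for that name.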
Once the names are represented as features this way, you can train any type of supervised classifier, for example a Decision Tree or Naive Bayes.
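To make the last step concrete, here is a small sketch of a multinomial Naive Bayes classifier over character bigrams, written with the standard library only. It is one possible way to do it, not a definitive implementation: the class name `NgramNaiveBayes`, the lowercasing of names, and the add-one smoothing are assumptions made for this example, and the training data is just the four rows shown in the question.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n=2, pad="#"):
    # Assumption: lowercase the name and pad with n-1 boundary markers.
    padded = pad * (n - 1) + name.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n=2):
        self.n = n

    def fit(self, names, labels):
        self.class_counts = Counter(labels)       # names per class
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per class
        for name, label in zip(names, labels):
            self.ngram_counts[label].update(char_ngrams(name, self.n))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(count / total)
            for gram in char_ngrams(name, self.n):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# The four rows from the question; far too few for a real model.
names = ["Braund", "Cumings", "Heikkinen", "Futrelle"]
labels = ["CelticEnglish", "CelticEnglish", "Nordic", "CelticEnglish"]

model = NgramNaiveBayes(n=2).fit(names, labels)
print(model.predict("Heikkinen"))  # Nordic
print(model.predict("Braund"))     # CelticEnglish
```

Predicting names that were in the training data is only a sanity check; a proper evaluation needs a held-out test set. If scikit-learn is available, `CountVectorizer` with `analyzer="char"` and an `ngram_range`, followed by `MultinomialNB` or `DecisionTreeClassifier`, covers the same steps with less code.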