How to find an appropriate algorithm to build a model from two natural-language data columns [closed]

https://datascience.stackexchange.com/questions/85637

16-12-2020
Question

What I would like to do

I would like to create a model that infers nationality from a name. I created the data frame below by combining two datasets from Kaggle:

Titanic: Machine Learning from Disaster (input/titanic/train.csv)

titanic-nationalities

    PassengerId  Nationality    Name
0   1            CelticEnglish  Braund
1   2            CelticEnglish  Cumings
2   3            Nordic         Heikkinen
3   4            CelticEnglish  Futrelle
...

Problem

How can I find an algorithm to build a first model using these two columns, Nationality and Name?

Since both are natural-language text, I understand that it is essential to convert them to vectors, and that this would be a multi-class classification problem.

However, I have no idea how to find an algorithm to train on this dataset.

Solution

There's no algorithm intended specifically for this task; you need to design the process yourself (as for most tasks, by the way).

Given that the goal is to use a person's name as an indication of nationality, I'd suggest you represent each name as a vector of character n-grams and use those as the features.

Example with bigrams ($n=2$), where # marks the start and the end of the name:

"Braund" = [ #B, Br, ra, au, un, nd, d# ]
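The bigram example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `char_ngrams` and its `pad` parameter are hypothetical names chosen for illustration, and padding each side with $n-1$ boundary markers is one possible convention, not something the answer prescribes.

```python
def char_ngrams(name, n=2, pad="#"):
    """Return the character n-grams of `name` as a list of strings."""
    # Assumption: pad each side with n-1 boundary markers so that the
    # first and last letters also appear in boundary n-grams.
    padded = pad * (n - 1) + name + pad * (n - 1)
    # Slide a window of width n over the padded string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("Braund"))
# ['#B', 'Br', 'ra', 'au', 'un', 'nd', 'd#']
```

With `n=1` no padding is added and the function simply returns the individual letters.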

Intuitively, the goal is for the model to find the letter sequences that are most specific to a nationality. You could try unigrams, bigrams or trigrams. The higher $n$, the more training data you need, because the number of possible n-grams grows quickly with $n$ and each one is observed less often.

Once the names are represented as features this way, you can train any type of supervised classifier on them, for example a Decision Tree or Naive Bayes, with Nationality as the class label.
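To make the training step concrete, here is a rough sketch of a multinomial Naive Bayes classifier over character bigrams, written with the standard library only. The helper names (`char_ngrams`, `train_naive_bayes`, `predict`) are hypothetical, add-one smoothing is an assumption of this sketch, and the four training rows are just the sample rows shown in the question, so the result says nothing about real accuracy.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n=2, pad="#"):
    """Character n-grams of `name`, padded with boundary markers."""
    padded = pad * (n - 1) + name + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_naive_bayes(names, labels, n=2):
    """Count n-grams per class; these counts are the whole model."""
    class_counts = Counter(labels)       # how many names per class
    ngram_counts = defaultdict(Counter)  # class -> n-gram -> count
    vocab = set()                        # every n-gram seen in training
    for name, label in zip(names, labels):
        grams = char_ngrams(name, n)
        ngram_counts[label].update(grams)
        vocab.update(grams)
    return class_counts, ngram_counts, vocab


def predict(name, model, n=2):
    """Return the class with the highest log-probability for `name`."""
    class_counts, ngram_counts, vocab = model
    total_names = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_names)  # log prior
        denom = sum(ngram_counts[label].values()) + len(vocab)
        for gram in char_ngrams(name, n):
            if gram not in vocab:
                continue  # ignore n-grams never seen in training
            # Add-one (Laplace) smoothing avoids log(0).
            score += math.log((ngram_counts[label][gram] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Sample rows copied from the question's data frame.
names = ["Braund", "Cumings", "Heikkinen", "Futrelle"]
labels = ["CelticEnglish", "CelticEnglish", "Nordic", "CelticEnglish"]

model = train_naive_bayes(names, labels)
print(predict("Heikkinen", model))
```

Predicting a name that was in the training data only checks that the sketch runs; a real evaluation would hold out part of the data. In practice, scikit-learn's `CountVectorizer` with `analyzer="char"` and an `ngram_range`, combined with `MultinomialNB` or `DecisionTreeClassifier`, covers the same steps.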

Licensed under: CC-BY-SA with attribution
