How to preprocess data for Word2Vec?
09-12-2020
Question
I have text data crawled from websites, and I am preprocessing it to train a Word2Vec model. Should I remove stopwords and lemmatize? How should I preprocess data for Word2Vec?
Solution
Welcome to the community,
I do not know about other libraries, but gensim has a very good API for creating word2vec models. To preprocess the data, you first have to decide what you are going to keep in your vocabulary and what you are not, for example punctuation, numbers, and alphanumeric words (e.g. 42nd).
To my knowledge, the most generic preprocessing pipeline is the following:
1) Convert the text to lowercase
2) Remove punctuation, symbols, and numbers (this is your choice)
3) Normalize the words (lemmatize or stem them)
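A minimal sketch of the steps above, using only the standard library. The function name preprocess and its parameters keep_numbers and normalize are hypothetical, chosen for illustration; the regex assumes English (ASCII) text, and the normalization step is left as a pluggable callable because the choice of lemmatizer or stemmer is up to you.

```python
import re

def preprocess(text, keep_numbers=False, normalize=None):
    """Lowercase a string, drop punctuation/symbols, and return a token list.

    keep_numbers: if False, tokens containing digits (e.g. "42nd", "3") are dropped.
    normalize: optional callable applied to each token (a lemmatizer or stemmer).
    """
    # Step 1: lowercase. Step 2: keep only runs of ASCII letters/digits,
    # which discards punctuation and symbols (assumes English text).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not keep_numbers:
        # Drop anything that is not purely alphabetic.
        tokens = [t for t in tokens if t.isalpha()]
    # Step 3: optional normalization, supplied by the caller.
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return tokens

tokens = preprocess("The 42nd cat sat on 3 mats!")
```

Whether to pass keep_numbers=True depends on the vocabulary decision described earlier. For normalize, any function that maps a token to its base form would fit, such as a lemmatizer from a library you already use.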
Once this is done, you can tokenize each sentence into unigrams, bigrams, or trigrams.
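As an illustration of uni/bi/tri-grams, the sketch below builds contiguous n-grams from a token list. The helper name ngrams and the underscore separator are assumptions made for this example, not something the answer prescribes.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, each joined with '_'."""
    # For n larger than the number of tokens the range is empty,
    # so an empty list is returned.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["new", "york", "city"]
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
```

This produces every adjacent combination. If you only want frequent collocations merged into single tokens, gensim's Phrases model is one option that selects them statistically instead.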
Have a look at this
The generic format for the data passed to the sentences parameter of gensim.models.Word2Vec is: [[tokenized sentence 1], [tokenized sentence 2], ... and so on]
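A small sketch of that format and of passing it to gensim. The helper names build_corpus and train_word2vec are hypothetical, the whitespace tokenization is a stand-in for whatever preprocessing you chose, and the keyword names assume gensim 4.x (older releases used size instead of vector_size). The hyperparameter values are placeholders, not recommendations.

```python
def build_corpus(documents):
    """Turn raw strings into a list of token lists, one inner list per sentence."""
    # Placeholder tokenization: lowercase and split on whitespace.
    return [doc.lower().split() for doc in documents]

def train_word2vec(sentences):
    """Train a Word2Vec model on a list of token lists (requires gensim)."""
    # Imported lazily so the corpus-building part runs without gensim installed.
    from gensim.models import Word2Vec
    # Keyword names follow gensim 4.x; min_count=1 keeps every word,
    # which only makes sense for a toy corpus like this one.
    return Word2Vec(sentences=sentences, vector_size=100, window=5,
                    min_count=1, workers=1)

documents = ["the cat sat on the mat", "the dog chased the cat"]
sentences = build_corpus(documents)

# With gensim installed, training and lookup would look like:
# model = train_word2vec(sentences)
# vector = model.wv["cat"]
```

On a real crawl you would typically raise min_count so that very rare tokens are excluded from the vocabulary.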
Hope it helps, thanks!!