How to preprocess data for Word2Vec?

https://datascience.stackexchange.com/questions/68016

09-12-2020

Question

I have text data crawled from websites, and I am preprocessing it to train a Word2Vec model. Should I remove stopwords and lemmatize? How should I preprocess data for Word2Vec?

Solution

Welcome to the community,

I do not know about other libraries, but gensim has a very good API for creating word2vec models. To preprocess the data, you first have to decide what you are going to keep in your vocabulary and what you are not, e.g. punctuation, numbers, alphanumeric words (such as 42nd), etc.

To my knowledge, the most generic preprocessing pipeline is the following:

1) Convert to lowercase
2) Remove punctuation/symbols/numbers (but it is your choice)
3) Normalize the words (lemmatize or stem them)
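The three steps above can be sketched roughly as follows. This is only a minimal illustration using the standard library: `clean_text`, `keep_numbers`, and `toy_normalize` are names made up for this example, the regex assumes plain ASCII English text, and `toy_normalize` is a placeholder standing in for a real lemmatizer or stemmer from whatever NLP library you prefer.

```python
import re

def clean_text(text, keep_numbers=False, normalize=None):
    """Lowercase, drop punctuation/symbols (and digits unless kept), normalize words."""
    # Step 1: convert to lowercase
    text = text.lower()
    # Step 2: replace everything except letters/whitespace (and optionally digits)
    pattern = r"[^a-z0-9\s]" if keep_numbers else r"[^a-z\s]"
    text = re.sub(pattern, " ", text)
    words = text.split()
    # Step 3: normalize each word with the supplied function, if any
    if normalize is not None:
        words = [normalize(w) for w in words]
    return " ".join(words)

def toy_normalize(word):
    # Placeholder only (assumption): strips a trailing "s" from longer words.
    # Swap in a real lemmatizer or stemmer here.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

cleaned = clean_text("The 42nd crawler fetched 3 pages, quickly!", normalize=toy_normalize)
print(cleaned)  # the nd crawler fetched page quickly
```

Note how "42nd" collapses to "nd" when digits are removed. That is exactly the kind of vocabulary decision mentioned above; passing `keep_numbers=True` keeps "42nd" intact.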

Once this is done, you can tokenize the sentences into uni/bi/tri-grams.
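As a rough sketch of that tokenization step, assuming the text has already been cleaned so that whitespace splitting is good enough: the helpers `tokenize` and `ngrams` are hypothetical names for this example, and joining n-gram parts with an underscore is just one convention.

```python
def tokenize(sentence):
    # Whitespace tokenization; assumes punctuation was already removed
    return sentence.split()

def ngrams(tokens, n):
    # Slide a window of size n over the tokens and join each window with "_"
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("word2vec learns word vectors")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['word2vec', 'learns', 'word', 'vectors']
print(bigrams)   # ['word2vec_learns', 'learns_word', 'word_vectors']
print(trigrams)  # ['word2vec_learns_word', 'learns_word_vectors']
```

This generates every adjacent n-gram. If you only want frequent collocations merged into single tokens, gensim's `Phrases` model in `gensim.models.phrases` is worth a look instead.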

Have a look at this

The generic format for the sentences parameter of gensim.models.Word2Vec is a list of tokenized sentences, each of which is itself a list of tokens: [[tokenized sentence 1], [tokenized sentence 2], ... and so on]
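To make that format concrete, here is a small sketch. The example sentences and the `train_word2vec` helper are made up for illustration, the hyperparameter values are arbitrary, and the `vector_size` keyword assumes gensim 4.x (older 3.x releases call it `size`).

```python
# Each inner list is one tokenized sentence
raw_sentences = [
    "i have text data crawled from websites",
    "i am preprocessing data to train a word2vec model",
]
sentences = [s.split() for s in raw_sentences]
print(sentences[0])  # ['i', 'have', 'text', 'data', 'crawled', 'from', 'websites']

def train_word2vec(sentences):
    # Imported lazily so the data-shaping code above runs without gensim installed
    from gensim.models import Word2Vec
    # min_count=1 keeps every word, which only makes sense for a toy corpus
    return Word2Vec(sentences=sentences, vector_size=100, window=5,
                    min_count=1, workers=1)
```

With gensim installed, `model = train_word2vec(sentences)` should give you a model whose word vectors are available through `model.wv`, for example `model.wv["data"]`.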

Hope it helps, thanks!!

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange