Question

I am working on a small project and would like to use the word2vec technique as a text representation method. I need to classify patents, but I have only a few of them labelled. To improve the performance of my ML model, I would like to enlarge the corpus/vocabulary of my model by using a large number of patents. The question is: once I have trained my word embeddings, how do I use this larger corpus together with my training data, i.e. my labelled data?

My data set is composed of 2,000 labelled patents.

The corpus used to train my word embeddings consists of 3 million patents (some of my 2,000 labelled patents are already included in this larger corpus), which I trained using Gensim.
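For reference, a minimal sketch of the pipeline I have in mind, with toy stand-ins for the data (in practice `large_corpus` would hold the ~3 million tokenised patents and `labelled_docs`/`labels` the 2,000 labelled ones):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: each patent is a list of tokens.
large_corpus = [
    ["method", "for", "producing", "steel"],
    ["semiconductor", "device", "with", "gate", "electrode"],
]
labelled_docs = large_corpus  # the subset of the corpus that has labels
labels = [0, 1]

# 1. Train word2vec on the full unlabelled corpus with Gensim.
w2v = Word2Vec(sentences=large_corpus, vector_size=100,
               window=5, min_count=1, workers=4)

# 2. Represent each labelled patent as the mean of its word vectors.
def doc_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in labelled_docs])

# 3. Train a classifier on the labelled patents only.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```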

Do you have any suggestions on how to do it?

Thank you very much in advance.

Rob


Solution

Use the large amount of unlabelled data to fine-tune a BERT-based model: BERT can be trained in an unsupervised manner via masked language modelling. Then use that fine-tuned BERT to compute embeddings of your labelled input texts and train a classifier on top of them.
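The answer does not name a specific toolkit, so the following is only a sketch using the Hugging Face transformers and datasets libraries; the base model (bert-base-uncased), sequence length, epoch count, and toy texts are illustrative assumptions. It continues masked-language-model pretraining on the unlabelled patents, then uses the adapted encoder's [CLS] vector as the document embedding for a downstream classifier:

```python
import numpy as np
import torch
from datasets import Dataset
from sklearn.linear_model import LogisticRegression
from transformers import (AutoModel, AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bert-base-uncased"  # illustrative choice of base model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1. Unsupervised fine-tuning: masked language modelling on unlabelled patents.
#    Toy texts stand in for the ~3 million unlabelled patents.
unlabelled_texts = [
    "A method for producing steel.",
    "A semiconductor device with a gate electrode.",
]
dataset = Dataset.from_dict({"text": unlabelled_texts})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)
trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="patent-bert",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("patent-bert")

# 2. Embed the labelled patents with the adapted encoder ([CLS] vector).
encoder = AutoModel.from_pretrained("patent-bert")
encoder.eval()

def embed(text):
    inputs = tokenizer(text, truncation=True, max_length=256,
                       return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

# Toy stand-ins for the 2,000 labelled patents.
labelled_texts = unlabelled_texts
labels = [0, 1]
X = np.vstack([embed(t) for t in labelled_texts])

# 3. Train a classifier on the labelled patents only.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

Note that the MLM step over the 3 million patents is the expensive part; the classifier at the end only ever sees the 2,000 labelled examples.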

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange