How do I load a FastText pretrained model with Gensim?
Question
I tried to load a pretrained fastText model from here: Fasttext model. I am using wiki.simple.en:
from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
But it raises the following error:
Traceback (most recent call last):
File "nltk_check.py", line 28, in <module>
word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
File "P:\major_project\venv\lib\site-packages\gensim\models\keyedvectors.py", line 206, in load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
File "P:\major_project\venv\lib\site-packages\gensim\utils.py", line 235, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte
Question 1: How do I load a fastText model with Gensim?
Question 2: After loading the model, I want to find the similarity between two words, something like this:
model.find_similarity('teacher', 'teaches')
# Something like this
Output : 0.99
How do I do this?
Solution
Here's the link to the methods available in Gensim's fastText implementation: fasttext.py
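# Works on Gensim 3.x. The gensim.models.wrappers module was removed in Gensim 4.0;
# there, gensim.models.fasttext.load_facebook_vectors('wiki.simple.bin') is the replacement.
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.simple')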
print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]
print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754
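The number returned by similarity() is the cosine similarity between the two word vectors, which is also why it matches the 'teaches' score in the most_similar() output. Here is a minimal sketch of that computation in plain Python. The 3-dimensional vectors are made up for illustration and are not real fastText embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, only to show the arithmetic).
teacher = [1.0, 2.0, 0.0]
teaches = [2.0, 1.0, 0.0]

# dot = 4, both norms are sqrt(5), so the result is about 0.8.
print(cosine_similarity(teacher, teaches))
```

Scores near 1 mean the vectors point in almost the same direction, and scores near 0 mean they are unrelated.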
Other tips
For .bin files, use load_fasttext_format()
(this typically contains the full model with parameters, n-grams, etc.).
For .vec files, use load_word2vec_format()
(this contains ONLY word vectors: no n-grams, and you can't update the model).
Note: If you run into memory issues or cannot load .bin models, check out pyfasttext as an alternative.
Credits : Ivan Menshikh (Gensim Maintainer)
The FastText binary format (which is what it looks like you're trying to load) isn't compatible with Gensim's word2vec format; the former contains additional information about subword units, which word2vec doesn't make use of.
There's some discussion of the issue (and a workaround) on the FastText GitHub page. In short, you'll have to load the text format (available at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).
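This also explains the UnicodeDecodeError in the question: load_word2vec_format expects the file to start with a text header, but a fastText .bin file starts with raw binary data. Below is a rough sketch for telling the two apart before loading. It assumes that the file begins with fastText's magic number 793712314, written as a little-endian 32-bit integer. That is my understanding of recent fastText versions, not something stated in this thread, and the helper names are made up for this example:

```python
import struct

# Assumption: magic number at the start of fastText .bin files.
# 793712314 == 0x2F4F16BA, so in little-endian order the first byte is 0xBA,
# the same byte the traceback in the question complains about.
FASTTEXT_MAGIC = 793712314


def looks_like_fasttext_binary(first_bytes):
    """Return True if the given leading bytes match the assumed fastText magic."""
    if len(first_bytes) < 4:
        return False
    (magic,) = struct.unpack("<i", first_bytes[:4])
    return magic == FASTTEXT_MAGIC


def file_looks_like_fasttext_binary(path):
    """Read the first four bytes of a file and check them against the magic."""
    with open(path, "rb") as fin:
        return looks_like_fasttext_binary(fin.read(4))
```

If file_looks_like_fasttext_binary('wiki.simple.bin') returns True, the file needs a fastText-aware loader rather than load_word2vec_format.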
Once you've loaded the text format, you can use Gensim to save it in binary format, which will dramatically reduce the model size, and speed up future loading.
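A minimal sketch of that workaround, assuming you downloaded the text-format vectors as a .vec file. The function names and file paths are placeholders I made up. load_word2vec_format and save_word2vec_format are the Gensim calls doing the work:

```python
def convert_vec_to_binary(vec_path, bin_path):
    """Load text-format vectors and re-save them in word2vec binary format."""
    # Imported inside the function so this sketch can be imported without Gensim.
    from gensim.models.keyedvectors import KeyedVectors

    # binary=False because a .vec file is plain text: a "count dim" header,
    # then one word and its vector per line.
    word_vectors = KeyedVectors.load_word2vec_format(vec_path, binary=False)
    # Write the same vectors in word2vec binary format.
    word_vectors.save_word2vec_format(bin_path, binary=True)
    return word_vectors


def load_binary(bin_path):
    """Reload vectors that were saved by convert_vec_to_binary."""
    from gensim.models.keyedvectors import KeyedVectors

    return KeyedVectors.load_word2vec_format(bin_path, binary=True)
```

For example, run convert_vec_to_binary('wiki.simple.vec', 'wiki.simple.w2v.bin') once, then call load_binary('wiki.simple.w2v.bin') in later sessions. Note that the converted file holds word vectors only, so words outside the vocabulary cannot be looked up through subword n-grams.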
https://github.com/facebookresearch/fastText/issues/171#issuecomment-294295302