
I tried to load a pretrained fastText model from here: Fasttext model. I am using wiki.simple.en.

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)

But it raises the following error:

Traceback (most recent call last):
  File "", line 28, in <module>
    word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
  File "P:\major_project\venv\lib\sitepackages\gensim\models\",line 206, in load_word2vec_format
     header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "P:\major_project\venv\lib\site-packages\gensim\", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Question 1: How do I load a fastText model with Gensim?

Question 2: After loading the model, I want to find the similarity between two words:

 model.find_similarity('teacher', 'teaches')
 # Something like this
 Output : 0.99

How do I do this?



Here's the link to the methods available in gensim's fastText implementation.

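# Works on gensim < 4.0; newer releases dropped gensim.models.wrappers
# in favour of gensim.models.fasttext.load_facebook_model.
from gensim.models.wrappers import FastText

# Pass the file prefix; the wrapper looks for wiki.simple.bin itself.
model = FastText.load_fasttext_format('wiki.simple')

print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]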

print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754

Other tips

For .bin files, use load_fasttext_format() (this file typically contains the full model with parameters, ngrams, etc.).

For .vec files, use load_word2vec_format() (this file contains ONLY the word vectors: no ngrams, and you can't update the model).
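To make the ".vec contains only word vectors" point concrete, here is a minimal sketch that parses the word2vec text layout by hand and computes a cosine similarity, which is, as far as I know, what gensim's similarity() returns. The helper names load_vec and cosine_similarity and the toy vectors are made up for illustration; this is not gensim's implementation.

```python
import io
import math


def load_vec(fin):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vd' per line."""
    count, dim = map(int, fin.readline().split())
    vectors = {}
    for line in fin:
        if not line.strip():
            continue  # tolerate blank lines at the end of the file
        parts = line.rstrip().split(" ")
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("expected %d values for %r" % (dim, parts[0]))
        vectors[parts[0]] = values
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data standing in for a real .vec file (hypothetical values).
SAMPLE = "2 3\nteacher 1.0 0.0 1.0\nteaches 1.0 1.0 0.0\n"
vectors = load_vec(io.StringIO(SAMPLE))
score = cosine_similarity(vectors["teacher"], vectors["teaches"])
print(round(score, 3))
```

With a real file you would pass open('wiki.simple.vec', encoding='utf-8') instead of the StringIO object (the file name is an assumption based on the question). Only words listed in the file can be looked up, because the subword ngrams live in the .bin file.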

Note: If you run into memory issues or cannot load .bin models, try the pyfasttext package instead.

Credits : Ivan Menshikh (Gensim Maintainer)

The FastText binary format (which is what it looks like you're trying to load) isn't compatible with Gensim's word2vec format; the former contains additional information about subword units, which word2vec doesn't make use of.
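If you are not sure which kind of file you have, a rough check on the first bytes can tell the two apart. The sketch below assumes that fastText binaries start with the magic number 793712314 (0x2F4F16BA, taken from fastText's source) stored little-endian, and that word2vec files start with a plain-text "count dim" header. The function name sniff_embedding_format is made up for illustration.

```python
import struct

# Assumption: magic number at the start of fastText .bin models.
FASTTEXT_MAGIC = 793712314  # 0x2F4F16BA


def sniff_embedding_format(path):
    """Return 'fasttext-bin', 'word2vec' or 'unknown' for the file at path."""
    with open(path, "rb") as fin:
        head = fin.read(4)
        if len(head) == 4 and struct.unpack("<i", head)[0] == FASTTEXT_MAGIC:
            return "fasttext-bin"
        fin.seek(0)
        first_line = fin.readline()
    try:
        # word2vec files (text or binary) open with 'vocab_size vector_size'.
        fields = first_line.decode("utf-8").split()
        if len(fields) != 2:
            return "unknown"
        int(fields[0])
        int(fields[1])
        return "word2vec"
    except (UnicodeDecodeError, ValueError):
        return "unknown"
```

Under that assumption the first byte on disk is 0xBA, which would explain the "can't decode byte 0xba in position 0" message: load_word2vec_format tries to read the magic number as a UTF-8 header line.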

There's some discussion of the issue (and a workaround) on the fastText GitHub page. In short, you'll have to load the text format (available at

Once you've loaded the text format, you can use Gensim to save it in binary format, which will dramatically reduce the model size, and speed up future loading.
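A minimal sketch of that conversion, assuming gensim is installed and the text vectors are in a file such as wiki.simple.vec (the file names and the helper name convert_vec_to_binary are assumptions, not part of the original answer):

```python
def convert_vec_to_binary(vec_path, bin_path):
    """Load text-format vectors and re-save them in word2vec binary format."""
    # Imported here so that defining the helper does not itself require gensim.
    from gensim.models import KeyedVectors

    # binary=False: the .vec file is plain text with a 'count dim' header.
    word_vectors = KeyedVectors.load_word2vec_format(vec_path, binary=False)
    # The output is word2vec binary, not fastText's own .bin layout.
    word_vectors.save_word2vec_format(bin_path, binary=True)
    return word_vectors
```

After something like wv = convert_vec_to_binary('wiki.simple.vec', 'wiki.simple.w2v.bin'), later runs can call KeyedVectors.load_word2vec_format('wiki.simple.w2v.bin', binary=True), and wv.similarity('teacher', 'teaches') answers Question 2 for words that are in the vocabulary.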

Licensed under: CC-BY-SA with attribution