How to work with different encodings for foreign languages

https://datascience.stackexchange.com/questions/77113

12-12-2020

Question

I've got a word embedding file called model.txt. It contains 100-dimensional vectors for over a million French words. These words contain accented characters such as é, â, î or ô.

Let me explain my problem with the following example: Consider these two words and their respective vectors, both of which are taken from model.txt:

etait -0.100460 -0.127720 ... 

était 0.094601 -0.266495 ...

Both words mean the same thing, but the former is written without the accent while the latter has it.

Now I'm trying to load this word embedding using the gensim.models.KeyedVectors in the following way:

from gensim.models import KeyedVectors

# model_location is the path to model.txt
model = KeyedVectors.load_word2vec_format(open(model_location, 'r',
                                              encoding='utf8'),
                                          binary=False)
word_vectors = model.wv

This raises the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-82-e17c33c552da> in <module>
     10 model = KeyedVectors.load_word2vec_format(open(model_location, 'r',
     11                                               encoding='utf8'),
---> 12                                           binary=False)
     13 
     14 word_vectors = model.wv

D:\Anaconda\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
   1547         return _load_word2vec_format(
   1548             cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
-> 1549             limit=limit, datatype=datatype)
   1550 
   1551     @classmethod

D:\Anaconda\lib\site-packages\gensim\models\utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, binary_chunk_size)
    286                 vocab_size, vector_size, datatype, unicode_errors, binary_chunk_size)
    287         else:
--> 288             _word2vec_read_text(fin, result, counts, vocab_size, vector_size, datatype, unicode_errors, encoding)
    289     if result.vectors.shape[0] != len(result.vocab):
    290         logger.info(

D:\Anaconda\lib\site-packages\gensim\models\utils_any2vec.py in _word2vec_read_text(fin, result, counts, vocab_size, vector_size, datatype, unicode_errors, encoding)
    213 def _word2vec_read_text(fin, result, counts, vocab_size, vector_size, datatype, unicode_errors, encoding):
    214     for line_no in range(vocab_size):
--> 215         line = fin.readline()
    216         if line == b'':
    217             raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")

D:\Anaconda\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7110-7111: invalid continuation byte

That would make sense if my file were in a different encoding. However, when I checked the file's encoding by running file * (using Git), I got:

model.txt: UTF-8 Unicode text, with very long lines
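A report of "UTF-8 Unicode text" and a UnicodeDecodeError are not necessarily in conflict: tools like file typically inspect only a leading portion of the input, so a file that is mostly UTF-8 can still contain a few stray invalid bytes further in. Whether that is what happened to this model.txt is an assumption. A minimal sketch for locating such lines follows; the helper name find_invalid_utf8_lines and the toy file are hypothetical, used only for illustration.

```python
import os
import tempfile


def find_invalid_utf8_lines(path, max_reports=10):
    """Return (line_number, offset_in_line, line_prefix) for lines that fail UTF-8 decoding."""
    problems = []
    # Read raw bytes so that decoding errors can be caught one line at a time.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first offending byte within this line.
                problems.append((line_number, err.start, raw[:60]))
                if len(problems) >= max_reports:
                    break
    return problems


# Toy stand-in for model.txt: the last line stores "était" as Latin-1 (byte 0xE9),
# which is not valid UTF-8 when followed by an ASCII letter.
demo_path = os.path.join(tempfile.mkdtemp(), "toy_model.txt")
with open(demo_path, "wb") as handle:
    handle.write(b"2 3\n")
    handle.write("etait -0.1 -0.2 0.3\n".encode("utf-8"))
    handle.write("était 0.4 0.5 0.6\n".encode("latin1"))

demo_problems = find_invalid_utf8_lines(demo_path)
print(demo_problems)
```

Running the same helper on the real file would show how many lines are affected, which helps decide whether to repair the file or simply skip the bad bytes while loading.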

Now, if I set the encoding to latin1 in the code above, the file loads without any problem, but I can no longer access any word that contains an accent. Executing word_vectors.word_vec('était') raises an out-of-vocabulary error.
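The out-of-vocabulary error with latin1 is expected. Latin-1 maps every byte to a character, so decoding never fails, but each two-byte UTF-8 sequence for an accented letter turns into two unrelated characters. The stored key is then no longer 'était'. A small sketch of the effect, using a toy dictionary (an assumption for illustration) in place of the real vocabulary:

```python
word = "était"

# What the key looks like when UTF-8 bytes are decoded as Latin-1.
mojibake = word.encode("utf-8").decode("latin1")
print(repr(mojibake))  # 'Ã©tait'

# Toy vocabulary keyed the way a latin1-decoded load would key it.
toy_vocab = {mojibake: [0.094601, -0.266495]}

found_directly = word in toy_vocab
found_as_mojibake = word.encode("utf-8").decode("latin1") in toy_vocab
print(found_directly, found_as_mojibake)  # False True

# Reversing the mis-decoding restores the intended key. errors="replace" guards
# against keys whose original bytes were not valid UTF-8 in the first place.
repaired_vocab = {
    key.encode("latin1").decode("utf-8", errors="replace"): vector
    for key, vector in toy_vocab.items()
}
print(word in repaired_vocab)  # True
```

Looking words up through their mis-decoded form works as a stopgap, but loading the file with the right encoding is the cleaner fix.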

How should I approach this problem? I also have the .bin file of the model; should I use that to load the words and their corresponding vectors instead?

Solution

Never mind, the solution was trivial. Since I had the .bin file, I could simply load it in binary mode. Anyone who doesn't have the .bin file could consider converting the .txt file to .bin and continuing from there.
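A minimal sketch of both routes, assuming gensim is installed. The helper names load_vectors and convert_text_to_binary and the file names are hypothetical. The unicode_errors parameter appears in the load_word2vec_format signature shown in the traceback; passing "ignore" is an assumption about how one might want to treat the undecodable bytes, and it silently drops them from the affected words. A file path is passed instead of an open text-mode handle so that gensim reads the bytes and does the decoding itself.

```python
def load_vectors(path, binary):
    """Load word2vec-format vectors from a file path (not an open file handle)."""
    # Imported inside the function so the sketch can be read without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(
        path,
        binary=binary,
        encoding="utf8",
        unicode_errors="ignore",  # assumption: skip undecodable bytes instead of raising
    )


def convert_text_to_binary(txt_path, bin_path):
    """Load a text-format embedding file and write it back out in binary format."""
    vectors = load_vectors(txt_path, binary=False)
    vectors.save_word2vec_format(bin_path, binary=True)
    return bin_path


# Usage (hypothetical file names):
#   vectors = load_vectors("model.bin", binary=True)
#   print(vectors["était"][:5])
#
#   convert_text_to_binary("model.txt", "model_converted.bin")
```

Indexing with vectors["était"] is used here because it works on a loaded KeyedVectors object directly, without going through the .wv attribute.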

Licensed under: CC-BY-SA with attribution
