How to work with different encodings for foreign languages
12-12-2020
Question
I've got a word embedding file called model.txt. It contains 100-dimensional vectors for over a million French words. These words contain accented characters such as é, â, î or ô.
Let me explain my problem with the following example:
Consider these two words and their respective vectors, both of which are taken from model.txt:
etait -0.100460 -0.127720 ...
était 0.094601 -0.266495 ...
Both words have the same meaning, but the former is written without the accent while the latter has it.
Now I'm trying to load this word embedding using gensim.models.KeyedVectors in the following way:
model = KeyedVectors.load_word2vec_format(open(model_location, 'r',
                                               encoding='utf8'),
                                          binary=False)
word_vectors = model.wv
This raises the following error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-82-e17c33c552da> in <module>
10 model = KeyedVectors.load_word2vec_format(open(model_location, 'r',
11 encoding='utf8'),
---> 12 binary=False)
13
14 word_vectors = model.wv
D:\Anaconda\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
1547 return _load_word2vec_format(
1548 cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
-> 1549 limit=limit, datatype=datatype)
1550
1551 @classmethod
D:\Anaconda\lib\site-packages\gensim\models\utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, binary_chunk_size)
286 vocab_size, vector_size, datatype, unicode_errors, binary_chunk_size)
287 else:
--> 288 _word2vec_read_text(fin, result, counts, vocab_size, vector_size, datatype, unicode_errors, encoding)
289 if result.vectors.shape[0] != len(result.vocab):
290 logger.info(
D:\Anaconda\lib\site-packages\gensim\models\utils_any2vec.py in _word2vec_read_text(fin, result, counts, vocab_size, vector_size, datatype, unicode_errors, encoding)
213 def _word2vec_read_text(fin, result, counts, vocab_size, vector_size, datatype, unicode_errors, encoding):
214 for line_no in range(vocab_size):
--> 215 line = fin.readline()
216 if line == b'':
217 raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
D:\Anaconda\lib\codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7110-7111: invalid continuation byte
I thought this would make sense if my file were encoded in a different format. However, using git I checked the encoding of the file by running file * and got:
model.txt: UTF-8 Unicode text, with very long lines
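One possible explanation for the mismatch is that file typically inspects only a bounded portion at the start of a file, so a file that is mostly valid UTF-8 with a few stray bytes further in can still be reported as UTF-8. The sketch below is one way to check this by decoding every line strictly; the helper name find_bad_utf8_lines and its limit parameter are made up for illustration and are not part of gensim.

```python
def find_bad_utf8_lines(path, limit=5):
    """Return up to `limit` (line_number, error_message) pairs
    for lines that are not valid UTF-8."""
    bad = []
    # Read raw bytes so that nothing is decoded implicitly.
    # Splitting on b"\n" is safe: that byte never occurs inside
    # a multi-byte UTF-8 sequence.
    with open(path, "rb") as f:
        for line_no, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")  # strict decoding, raises on bad bytes
            except UnicodeDecodeError as exc:
                bad.append((line_no, str(exc)))
                if len(bad) >= limit:
                    break
    return bad
```

Calling find_bad_utf8_lines("model.txt") would list the first few offending lines, if there are any, which shows whether the problem is confined to a handful of words.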
If I run the same code with the encoding set to latin1, the file loads without any problem, but I can no longer access any word that contains an accent. For example, this throws an out-of-vocabulary error:
word_vectors.word_vec('était')
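A likely reason for the out-of-vocabulary error: latin1 maps every byte to exactly one character, so decoding never fails, but the two UTF-8 bytes of é become two separate characters. The key stored in the vocabulary is then a garbled string rather than 'était'. A minimal illustration using only the standard library:

```python
word = "était"

# The bytes as they would sit in a UTF-8 file: é is 0xC3 0xA9.
utf8_bytes = word.encode("utf-8")

# Reading those bytes as latin1 turns each byte into its own character.
as_latin1 = utf8_bytes.decode("latin1")

print(repr(as_latin1))    # 'Ã©tait'
print(as_latin1 == word)  # False, so a lookup with 'était' misses

# The damage is reversible because latin1 is a one-to-one byte mapping.
recovered = as_latin1.encode("latin1").decode("utf-8")
print(recovered == word)  # True
```

Under this assumption, the accented words are still in the loaded model, only under their garbled keys.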
How should I approach this problem? I also have the .bin file of the model; should I use that to load the words and their corresponding vectors instead?
Solution
Never mind, the solution was trivial. Since I had the .bin file, I could just load it in binary mode. Anyone who doesn't have the .bin file could consider converting the .txt file to .bin and proceeding from there.
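As a rough sketch of both routes, assuming gensim is installed: the helper names load_binary_vectors and convert_text_to_binary are hypothetical, while load_word2vec_format, its binary, encoding and unicode_errors parameters (visible in the traceback above) and save_word2vec_format are gensim's own. The conversion passes the file path rather than an already opened text handle, on the assumption that gensim then reads raw bytes and applies unicode_errors itself. Note that unicode_errors='ignore' silently drops undecodable bytes, so the affected words may end up slightly altered.

```python
def load_binary_vectors(bin_path):
    """Load word vectors from a word2vec-format .bin file."""
    # Imported here so the helper can be defined without gensim present.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(bin_path, binary=True)


def convert_text_to_binary(txt_path, bin_path):
    """Load a word2vec-format .txt file and save it again as .bin."""
    from gensim.models import KeyedVectors
    # Pass the path, not an open text-mode file, and let gensim decode
    # each line; undecodable bytes are skipped instead of raising.
    kv = KeyedVectors.load_word2vec_format(
        txt_path, binary=False, encoding="utf8", unicode_errors="ignore"
    )
    kv.save_word2vec_format(bin_path, binary=True)
    return kv
```

With either helper, the returned object can be indexed directly, for example kv = load_binary_vectors("model.bin") followed by kv["était"], where "model.bin" stands in for the actual path.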