best approach to embed variable-length sequences of words as a fixed-size vector without having a maximum length? [closed]

datascience.stackexchange https://datascience.stackexchange.com/questions/80449

Question

I have a dataset of sentences in a non-English language like:

  1. word1 word2 word3 word62

  2. word5 word1 word2

and the length of each sentence is not fixed.

Now I want to represent each sentence as a fixed-size vector and feed it to my model. I want to keep as much information as possible in the embedding, and I don't want to impose a maximum sentence length, because important information might appear at the end.

The only two approaches I can think of so far are:

  1. Convert each word to a one-hot vector and sum them

  2. Convert each word to a word embedding and then sum (or average) them (a minimal sketch follows this list)
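For the second approach, here is a minimal sketch of averaging word embeddings into a fixed-size vector. The vocabulary and embedding matrix below are toy placeholders; in practice you would load pre-trained vectors (e.g. word2vec or fastText) for your language. Averaging instead of summing keeps the vector's scale independent of sentence length:

```python
import numpy as np

# Toy vocabulary and random embedding matrix (dim 4 for illustration);
# in practice, load pre-trained vectors such as word2vec or fastText.
vocab = {"word1": 0, "word2": 1, "word3": 2, "word5": 3, "word62": 4}
emb = np.random.default_rng(0).normal(size=(len(vocab), 4))

def sentence_vector(sentence):
    """Average the word embeddings: any sentence length -> one fixed-size vector."""
    ids = [vocab[w] for w in sentence.split() if w in vocab]
    if not ids:                       # no known words: fall back to the zero vector
        return np.zeros(emb.shape[1])
    return emb[ids].mean(axis=0)      # mean, so scale does not grow with length

print(sentence_vector("word1 word2 word3 word62").shape)  # (4,)
print(sentence_vector("word5 word1 word2").shape)         # (4,)
```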

Is there a better way? What is the best approach to represent a variable-length sentence without losing information (for example, by truncating to a maximum length)? I want every word in the sentence to affect the embedding.

Was it helpful?

Solution

The Universal Sentence Encoder might work for you, if it supports the language you need: https://tfhub.dev/google/universal-sentence-encoder/4
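A minimal usage sketch, assuming TensorFlow and tensorflow_hub are installed; the loaded module maps each input string to a 512-dimensional vector regardless of its length:

```python
import tensorflow_hub as hub

# Download and load the Universal Sentence Encoder from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["word1 word2 word3 word62", "word5 word1 word2"]
vectors = embed(sentences)  # tensor of shape (2, 512): one vector per sentence

print(vectors.shape)
```

Note that this model targets English; for other languages, check whether the multilingual variant on TF Hub (universal-sentence-encoder-multilingual) covers yours.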

Licensed under: CC-BY-SA with attribution