Question

I have a dataset of sentences in a non-English language, for example:

  1. word1 word2 word3 word62

  2. word5 word1 word2

Now I want to turn each variable-length sentence into a fixed-size vector to feed to my model, and I want every word in the sentence to have an effect on the output.

I thought maybe I could use an algorithm like word2vec to turn each word into a fixed-size vector, and then sum them all to represent the sentence. Is this a meaningful approach? Is it better than summing the one-hot vectors of the words? Is there a better approach than these two?
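To make the two options concrete, here is a minimal NumPy sketch of both: summing dense word vectors versus summing one-hot vectors. The vocabulary and the random "embeddings" are toy assumptions, not trained word2vec output.

```python
import numpy as np

# Toy vocabulary and made-up 4-dimensional "word2vec-style" embeddings.
# In practice these vectors would come from a trained model; the random
# values here are purely illustrative.
vocab = ["word1", "word2", "word3", "word5", "word62"]
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=4) for w in vocab}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def sentence_vector(sentence, word_to_vec):
    # Sum the per-word vectors: every word contributes, and the result
    # has a fixed size regardless of sentence length.
    return np.sum([word_to_vec(w) for w in sentence.split()], axis=0)

dense = sentence_vector("word1 word2 word3 word62", embeddings.__getitem__)
sparse = sentence_vector("word1 word2 word3 word62", one_hot)

print(dense.shape)   # (4,)  size fixed by the embedding dimension
print(sparse.shape)  # (5,)  size grows with the vocabulary, not the sentence
```

Note the difference: the dense sum stays small and can capture word similarity, while the one-hot sum is just a word-count vector whose size scales with the vocabulary.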

EDIT1: Basically, I have a dataset of random variable-length sentences and I want to embed them as well as possible, i.e. keep as much information as possible in the resulting embedding vectors (which all have the same size).


Solution

So the question asks how to represent a sequence of words as a uniform, fixed-size vector representation that does not depend on the sequence length.

The idea you suggested is definitely not a bad one; you should try it out.

You should also try Doc2Vec, which works on the same principle as Word2Vec but outputs a single vector representing the meaning of a piece of text longer than one word.

The main problem with this sort of representation is that you lose the sequential information of the text: you treat the words in a sentence as a "bag of words". If you are happy to make that assumption, then continue with your summing approach or with Doc2Vec.
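The loss of word order is easy to demonstrate: under any summed-vector scheme, two sentences with the same words in a different order get the exact same representation. The two-dimensional toy embeddings below are assumptions; any fixed per-word vectors show the same effect.

```python
import numpy as np

# Toy per-word vectors (assumed for illustration).
emb = {"word1": np.array([1.0, 0.0]), "word2": np.array([0.0, 1.0])}

def bag_of_words_vector(sentence):
    # Summation is commutative, so word order cannot survive.
    return np.sum([emb[w] for w in sentence.split()], axis=0)

a = bag_of_words_vector("word1 word2")
b = bag_of_words_vector("word2 word1")
print(np.array_equal(a, b))  # True: both orderings collapse to one vector
```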

Otherwise, you might be better off using a sequential model architecture such as an RNN or LSTM. Here you input one word per time step, initially as a one-hot encoded vector, and add an embedding layer before the sequential model to transform the one-hot encoding into a word embedding.
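The pipeline just described (one-hot input, embedding layer, recurrent update) can be sketched in plain NumPy. All the weights below are random placeholders; in practice a framework's LSTM would learn them from data, but the data flow is the same.

```python
import numpy as np

# Minimal untrained sketch: one-hot vector -> embedding layer -> simple RNN.
# Weights are random placeholders standing in for learned parameters.
vocab = ["word1", "word2", "word3", "word5", "word62"]
vocab_size, embed_dim, hidden_dim = len(vocab), 4, 6

rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))        # embedding layer
W = rng.normal(size=(embed_dim, hidden_dim)) * 0.1  # input weights
U = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1 # recurrent weights
b = np.zeros(hidden_dim)

def encode(sentence):
    h = np.zeros(hidden_dim)
    for word in sentence.split():
        one_hot = np.zeros(vocab_size)
        one_hot[vocab.index(word)] = 1.0
        x = one_hot @ E                  # embedding lookup
        h = np.tanh(x @ W + h @ U + b)   # recurrent update, one word per step
    return h  # final hidden state: a fixed-size sentence vector

print(encode("word5 word1 word2").shape)          # (6,)
print(encode("word1 word2 word3 word62").shape)   # (6,)
```

Unlike the bag-of-words sum, the final hidden state depends on the order in which words arrive, so sequential information is preserved.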

Licensed under: CC-BY-SA with attribution