Question

I have a dataset of sentences in a non-English language, like:

  1. word1 word2 word3 word62

  2. word5 word1 word2

Now I want to turn each variable-length sentence into a fixed-size vector to feed to my model, and I want all the words in a sentence to have an effect on the output.

I thought maybe I could use an algorithm like word2vec to turn each word into a fixed-size vector, and add all of them together to represent the sentence. Is this a meaningful approach? Is it better than adding the one-hot vectors of the words to represent the sentence? Is there a better approach than these two?

EDIT 1: Basically, I have a dataset of random variable-length sentences and I want to embed them in the best way possible, meaning keeping as much information as possible in the resulting embedded vectors (which all have the same size).


The solution

So the question asks how to represent a series of words as a uniform vector representation that does not depend on the sequence length.

The idea you suggested is definitely not a bad one; you should try it out.
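
For concreteness, here is a minimal sketch of that idea using gensim's Word2Vec. The corpus, `vector_size`, and `epochs` below are illustrative assumptions, not values from the question:

```python
# Minimal sketch: train Word2Vec on the tokenized corpus and combine
# the word vectors to get one fixed-size vector per sentence.
import numpy as np
from gensim.models import Word2Vec

sentences = [
    ["word1", "word2", "word3", "word62"],
    ["word5", "word1", "word2"],
]

# vector_size is the fixed embedding dimension; min_count=1 keeps rare words.
model = Word2Vec(sentences, vector_size=100, min_count=1, epochs=50)

def sentence_vector(tokens, model):
    """Average (or sum) the word vectors, so every word affects the result."""
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)  # swap in np.sum(...) to add instead

print(sentence_vector(sentences[0], model).shape)  # (100,)
```

Averaging rather than summing keeps the magnitude comparable across sentences of different lengths, which often matters downstream.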

You should also try Doc2Vec, which works on the same principle as Word2Vec, but outputs a vector representing the meaning of a section of text longer than one word.
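
A minimal Doc2Vec sketch, again assuming gensim and illustrative hyperparameters:

```python
# Each sentence becomes one TaggedDocument, and the model learns a
# fixed-size vector per sentence jointly with the word vectors.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    ["word1", "word2", "word3", "word62"],
    ["word5", "word1", "word2"],
]
documents = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]

model = Doc2Vec(documents, vector_size=100, min_count=1, epochs=50)

# Vector for a training sentence, looked up by its tag:
print(model.dv[0].shape)  # (100,)
# Vector for an unseen sentence, inferred after training:
print(model.infer_vector(["word1", "word5"]).shape)  # (100,)
```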

The main problem with this sort of representation is that you lose the sequential information of the text and treat the words in a sentence as a "bag of words". If you are happy to make that assumption, then continue with your summing approach or with Doc2Vec.

Otherwise, you might be better off using a sequential model architecture, such as an RNN/LSTM. Here you input each word at a given time step, initially as a one-hot encoded vector, and add an embedding layer before the sequential model to transform the one-hot encoding into a word embedding, as sketched below.
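
A sketch of that setup in PyTorch (`vocab_size`, `embed_dim`, and `hidden_dim` are placeholder choices, not values from the question). The `Embedding` layer is equivalent to multiplying a one-hot vector by a weight matrix, and the final LSTM hidden state serves as the fixed-size sentence vector:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 100, 128  # illustrative sizes

class SentenceEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Embedding lookup = one-hot vector times a learned weight matrix.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)     # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)          # fixed-size sentence vector

encoder = SentenceEncoder()
batch = torch.tensor([[1, 2, 3, 62]])  # one sentence of word indices
print(encoder(batch).shape)            # torch.Size([1, 128])
```

Unlike the bag-of-words approaches, this representation is sensitive to word order, at the cost of having to train the encoder on a supervised task.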

Licensed under: CC-BY-SA with attribution