Question

I am working on a Natural Language Processing project, but I am stuck: I have an ANN with a fixed number of input neurons.

I am trying to do sentiment analysis on the IMDB movie review dataset. To do that, I first computed word embeddings by building a word-context matrix and applying SVD, so I have the word embedding matrix. But I do not know the best way to compress a sentence's representation (the embeddings of all the words in the sentence) into a fixed-size vector that I can feed to the neural net. I tried PCA, but the result was not satisfying.
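For concreteness, here is a minimal sketch of what I mean by the embedding step (the toy corpus, window size, and embedding dimension are just illustrative; I build a sparse co-occurrence matrix and factor it with SciPy's truncated SVD):

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy corpus standing in for the IMDB reviews
corpus = [
    "the movie was great and the acting was great",
    "the movie was terrible and boring",
]
window = 2          # context window size on each side
embedding_dim = 2   # tiny because the toy vocabulary is tiny

# Build the vocabulary
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count word-context co-occurrences within the window
counts = defaultdict(float)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[(index[w], index[sent[j]])] += 1.0

rows, cols = zip(*counts.keys())
matrix = csr_matrix((list(counts.values()), (rows, cols)),
                    shape=(len(vocab), len(vocab)))

# Truncated SVD: U * diag(s) gives one embedding row per word
u, s, _ = svds(matrix, k=embedding_dim)
embeddings = u * s
print(embeddings.shape)  # (vocab_size, embedding_dim)
```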

Any help?


Solution

The easiest way is to average the word embeddings; this works quite well. Another thing you can try is to represent each document as a bag of words, i.e. a vector the size of your vocabulary in which each element counts how many times a certain word occurs in the document (for example, the first element would count how many times the word "a" occurs, and so on). Afterwards, to reduce the size of that vector you can use techniques like LDA, SVD, or autoencoders. Both options are sketched in the example below.
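Here is a minimal sketch of both options, assuming NumPy and scikit-learn. The reviews and the random placeholder embeddings are only illustrative; in practice you would plug in the SVD-based embeddings you already computed:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "the movie was great and the acting was great",
    "the movie was terrible and boring",
    "great acting but a boring plot",
]

# Placeholder embeddings: replace with your own word embedding matrix,
# one row per vocabulary word.
vocab = sorted({w for r in reviews for w in r.split()})
index = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 50))  # 50-dim, illustrative

# Option 1: average the word embeddings of a review -> fixed-size vector
def average_embedding(text):
    rows = [embeddings[index[w]] for w in text.split() if w in index]
    return np.mean(rows, axis=0) if rows else np.zeros(embeddings.shape[1])

sentence_vectors = np.vstack([average_embedding(r) for r in reviews])
print(sentence_vectors.shape)  # (3, 50) -- one fixed-size vector per review

# Option 2: bag-of-words counts, then reduce the dimensionality with SVD
counts = CountVectorizer().fit_transform(reviews)           # (3, vocab_size)
reduced = TruncatedSVD(n_components=2).fit_transform(counts)
print(reduced.shape)                                        # (3, 2)
```

Either way, every document ends up as a vector of the same length, which is what your fixed-size input layer needs.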

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange