Question

I would like to know how I can create a sentiment model from scratch. I have my data, list of texts, with no labels about sentiment.

Author     Quotes 
Dan Brown  “Everything is possible. The impossible just takes longer.” 
Dan Brown  “Great minds are always feared by lesser minds.” 
Dan Brown   “Men go to far greater lengths to avoid what they fear than to obtain what they desire.” 
Dan Brown   “Google' is not a synonym for 'research'.

etc...

I have 20000 of quotes like the above, from other authors too. My dataset is the only set of data that I have, so I would need to split into training (80%) and test set (20%). I should work on my training set to build the vocabulary of 'sentiment'. What I have thought is to clean the text removing stopwords, except negative words (like no, not, ...). Then look for adjectives and assign them a score manually (?). Once done this, I should have also consider n-grams, especially with negative stopwords, in order to create a small dictionary which could take into account also this. However I do not know if it could be the right approach, if it is something insane... Is there any way to build from scratch a sentiment model? How could the 'machine' learn from this?

I would need to develop a model from scratch as I would like to analyse texts in other languages (like Italian or Spanish) and there are no models (not good model at least) in Python to do that.

Was it helpful?

Solution

What you're describing is indeed the traditional approach for building a sentiment analysis system, so I'd say it looks like a reasonable approach to me.

I'm not up to date with the sentiment analysis task at all, but I think it would be worth studying the state of the art for several reasons:

  • There might be more recent, better approaches
  • There might be datasets in the languages you're interested in, and if there is that could save you a lot of time. Check if there are any shared tasks about this, they often provide annotated datasets.

OTHER TIPS

I would suggest to use a topic model first (like Latent Dirichlet Allocation) and assign score values to the topics instead of single words.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top