What is sentiment analysis?
The term has been in use since around 2000, and has been used to cover a variety of different phenomena:
Positive, negative, or neutral attitudes expressed in text:
Suffice to say, Skyfall is one of the best Bonds in the 50-year history of moviedom’s most successful franchise.
Skyfall abounds with bum notes and unfortunate compromises.
There is a breach of MI6. 007 has to catch the rogue agent.
Sometimes, factual statements imply sentiment:
Online giant Amazon’s shares have closed 9.8% higher
Building a sentiment analysis system
Version 1: cheap and cheerful
- collect lists of positive and negative words or phrases, from public domain lists or by mining them.
- given a text, count up the number of positives and negatives, and classify based on that.
- you would be surprised how many commercial systems seem to do no more than this.
- if the number of positives equals the number of negatives, do we say ‘neutral’?
- Compositional sentiment: a phrase like ‘not wonderfully interesting’ is negative, even though ‘wonderfully’ and ‘interesting’ will be in the list of positive words.
- some words are positive in some contexts and negative in others: ‘cold beer’ is good, ‘cold coffee’ is not. (This is actually a problem for all approaches.)
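A minimal sketch of the word-counting approach above, in Python. The word lists are tiny hypothetical stand-ins for real lexicons, and the names `POSITIVE`, `NEGATIVE`, `tokenize` and `classify` are assumptions made for illustration, not part of any particular system.

```python
import re

# Tiny illustrative lexicons (assumed); a real system would use far larger lists.
POSITIVE = {"best", "wonderfully", "interesting", "successful", "amazingly", "sharp"}
NEGATIVE = {"bum", "unfortunate", "disappointing", "worse"}

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def classify(text):
    """Count lexicon hits and label the text by whichever count is larger."""
    tokens = tokenize(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # ties, including no hits at all, fall back to neutral

print(classify("Skyfall abounds with bum notes and unfortunate compromises."))
print(classify("There is a breach of MI6. 007 has to catch the rogue agent."))
print(classify("not wonderfully interesting"))
```

The last call comes out as positive, because ‘wonderfully’ and ‘interesting’ are both counted and ‘not’ is ignored: exactly the compositional problem noted above.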
Version 2: better (what most commercial systems do)
A bag-of-words classifier:
- get a training corpus of texts human annotated for sentiment (e.g. pos/neg/neut).
- represent each text as a vector of counts of n-grams(*) of (normalised) words, and train your favourite classifier on these vectors.
- should capture some ‘compositional’ effects: e.g. ‘very interesting’ is a likely signal for positivity, whereas ‘not very’ is a signal for negativity.
- will work for any language and domain where you can get accurately labelled training data.
- bag-of-words means structure is ignored:
“Airbus: orders slump but profits rise” is wrongly treated the same as
“Airbus: orders rise but profits slump”
(*) n is usually ≤ 3; as n gets bigger, more training data is required.
- Equally balanced texts will still be problematic,
- and richer compositional effects will still be missed:
clever, too clever, not too clever
fail to kill bacteria
never fail to kill bacteria
- difficult to give sentiment labels accurately to short units like sentences or phrases,
- or to pick out mixed sentiment:
“The display is amazingly sharp. However, the battery life is disappointing.”
- Complex compositional examples occur quite frequently in practice:
The Trout Hotel: This newly refurbished hotel could not fail to impress…
BT: it would not be possible to find a worse company to deal with…
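To make the bag-of-words representation concrete, here is a small sketch that turns a text into n-gram counts using only the Python standard library. The function name `ngram_counts` and its `max_n` parameter are assumptions made for illustration; a real system would feed such vectors to a trained classifier, which is not shown here.

```python
from collections import Counter
import re

def ngram_counts(text, max_n=1):
    """Return a Counter over all word n-grams of length 1..max_n."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude normalisation
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

a = "Airbus: orders slump but profits rise"
b = "Airbus: orders rise but profits slump"

# With single words only, the two headlines get identical vectors.
print(ngram_counts(a, 1) == ngram_counts(b, 1))
# Adding bigrams separates them ("orders slump" vs "orders rise").
print(ngram_counts(a, 2) == ngram_counts(b, 2))
```

With unigrams the two Airbus headlines are indistinguishable. Bigrams give them different vectors, though whether a classifier then labels them correctly still depends on having enough training data containing those bigrams.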
Version 3: best – use linguistic analysis
- do as full a parse as possible on input texts.
- use the syntax to do ‘compositional’ sentiment analysis:
Sentiment logic rules
- kill + negative ⇒ positive (kill bacteria)
- kill + positive ⇒ negative (kill kittens)
- too + anything ⇒ negative (too clever, too red, too cheap)
- In our system (www.theysay.io) we have 75,000+ such rules…
- still need extra work for context-dependence (‘cold’, ‘wicked’, ‘sick’…)
- can’t deal with reader perspective: “Oil prices are down” is good for me, not for Chevron or Shell investors.
- can’t deal with sarcasm or irony: “Oh, great, they want it to run on Windows”
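The three sentiment logic rules listed above can be sketched as a toy composition function over two-word phrases. This is an illustration under stated assumptions, not the rule engine of any real system: the `PRIOR` word polarities, the `compose` function, and the choice to leave a neutral argument of ‘kill’ neutral are all hypothetical, and in practice the head/argument pairs would come from the parse rather than being supplied by hand.

```python
# Hypothetical prior polarities for individual words (illustration only).
PRIOR = {
    "bacteria": "negative",
    "kittens": "positive",
    "clever": "positive",
    "cheap": "positive",
}

FLIP = {"positive": "negative", "negative": "positive", "neutral": "neutral"}

def compose(head, argument):
    """Return the polarity of the two-word phrase 'head argument'."""
    arg_polarity = PRIOR.get(argument, "neutral")
    if head == "too":
        # too + anything => negative
        return "negative"
    if head == "kill":
        # kill + negative => positive, kill + positive => negative.
        # Leaving a neutral argument neutral is an assumption of this sketch.
        return FLIP[arg_polarity]
    # No rule applies: fall back to the argument's own polarity.
    return arg_polarity

for phrase in [("kill", "bacteria"), ("kill", "kittens"), ("too", "clever"), ("too", "red")]:
    print(" ".join(phrase), "->", compose(*phrase))
```

Note that ‘too red’ comes out negative even though ‘red’ has no prior polarity, which is what the ‘too + anything’ rule asks for.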
Current research is focused on machine learning for compositional approaches.