What is sentiment analysis?

The term has been around since about 2000 and has been used to cover a variety of different phenomena:

Sentiment proper

Positive, negative, or neutral attitudes expressed in text:

Suffice to say, Skyfall is one of the best Bonds in the 50-year history of moviedom’s most successful franchise.

Skyfall abounds with bum notes and unfortunate compromises.

There is a breach of MI6. 007 has to catch the rogue agent.

Sometimes, factual statements imply sentiment:

Online giant Amazon’s shares have closed 9.8% higher

Building a sentiment analysis system

Version 1: cheap and cheerful

  • collect lists of positive and negative words or phrases, from public domain lists or by mining them.
  • given a text, count up the number of positives and negatives, and classify based on that.
  • you would be surprised how many commercial systems seem to do no more than this.


Problems:

  • if the number of positives equals the number of negatives, do we say ‘neutral’?
  • Compositional sentiment: a phrase like ‘not wonderfully interesting’ is negative, even though ‘wonderfully’ and ‘interesting’ will be in the list of positive words.
  • some words are positive in some contexts and negative in others: ‘cold beer’ is good, ‘cold coffee’ is not. (This is actually a problem for all approaches.)
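A minimal sketch of the Version 1 approach, to make the failure modes concrete. The two word lists and the `classify` function are made up for illustration; a real system would use a much larger lexicon.

```python
import re

# Tiny illustrative lexicons (assumed for this sketch, not a real word list).
POSITIVE = {"best", "wonderfully", "interesting", "sharp"}
NEGATIVE = {"unfortunate", "disappointing", "worse"}

def classify(text):
    """Count lexicon hits and label the text by whichever count is larger."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # ties and texts with no lexicon hits both land here

print(classify("one of the best Bonds"))        # positive
print(classify("not wonderfully interesting"))  # positive, although the phrase is negative
print(classify("sharp but disappointing"))      # neutral, because the counts tie
```

The second call shows the compositional problem: ‘not’ is invisible to a pure word count, so the two positive words win.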

Version 2: better (what most commercial systems do)

A bag-of-words classifier:

  • get a training corpus of texts that humans have annotated for sentiment (e.g. pos/neg/neut).
  • represent each text as a vector of counts of n-grams(*) of (normalised) words, and train your favourite classifier on these vectors.
  • should capture some ‘compositional’ effects: e.g. ‘very interesting’ is likely to be a signal for positivity, whereas ‘not very’ is a signal for negativity.
  • will work for any language and domain where you can get accurately labelled training data.
  • bag-of-words means structure is ignored:

“Airbus: orders slump but profits rise” is wrongly treated the same as

“Airbus: orders rise but profits slump”

(*) n is usually ≤ 3; as n gets bigger, more training data is required.
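The n-gram representation can be sketched in a few lines. The helper `ngram_counts` below is a hypothetical name for this illustration and does only the crudest normalisation (lower-casing and dropping the colon); a real system would hand these counts to a classifier.

```python
from collections import Counter

def ngram_counts(text, n_max=1):
    """Return counts of all word n-grams with 1 <= n <= n_max."""
    words = text.lower().replace(":", " ").split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

a = "Airbus: orders slump but profits rise"
b = "Airbus: orders rise but profits slump"

print(ngram_counts(a, 1) == ngram_counts(b, 1))  # True: same words, same counts
print(ngram_counts(a, 2) == ngram_counts(b, 2))  # False: 'orders slump' vs 'orders rise'
```

With single words the two headlines get exactly the same vector. Bigrams do tell them apart, but whether a classifier learns anything useful from ‘orders slump’ depends on how often such pairs occur in the training data.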


Problems:

  • equally balanced texts will still be problematic,
  • and richer compositional effects will still be missed:

clever, too clever, not too clever


kill bacteria

fail to kill bacteria

never fail to kill bacteria

  • difficult to give sentiment labels accurately to short units like sentences or phrases,
  • or to pick out mixed sentiment:

“The display is amazingly sharp. However, the battery life is disappointing.”

  • Complex compositional examples occur quite frequently in practice:

The Trout Hotel: This newly refurbished hotel could not fail to impress…

BT: it would not be possible to find a worse company to deal with…

Version 3: best – use linguistic analysis

  • do as full a parse as possible on input texts.
  • use the syntax to do ‘compositional’ sentiment analysis, e.g. on:

    this product never fails to kill bacteria


Sentiment logic rules

  • kill + negative ⇒ positive (kill bacteria)
  • kill + positive ⇒ negative (kill kittens)
  • too + anything ⇒ negative (too clever, too red, too cheap)
  • In our system (www.theysay.io) we have 75,000+ such rules…
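A toy sketch of how rules like these compose. This is not the system mentioned above: there is no parser, the phrase is simply read right to left as if each word scoped over everything after it, and the small lexicon is invented. Treating ‘fail’ and ‘never’ as polarity reversers in the same way as ‘kill’ is an assumption of the sketch.

```python
# Prior polarities for a few words (assumed for illustration).
LEXICON = {"bacteria": -1, "kittens": +1, "clever": +1, "cheap": +1}

# Words that reverse the polarity of whatever they apply to.
# "kill" comes from the rules above; "fail"/"fails"/"never" are assumptions.
REVERSERS = {"kill", "fail", "fails", "never"}

def polarity(phrase):
    """Compose polarity right to left: +1 positive, -1 negative, 0 unknown."""
    score = 0
    for word in reversed(phrase.lower().split()):
        if word in LEXICON:
            score = LEXICON[word]   # start from the word's prior polarity
        elif word == "too":
            score = -1              # too + anything => negative
        elif word in REVERSERS:
            score = -score          # reverse the polarity built so far
        # any other word (e.g. "to", "this") leaves the score unchanged
    return score

for p in ["kill bacteria", "kill kittens", "too cheap",
          "fail to kill bacteria", "never fail to kill bacteria"]:
    print(p, polarity(p))
```

Each reverser flips the result of the phrase to its right, so ‘never fail to kill bacteria’ flips three times starting from negative ‘bacteria’ and ends up positive.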


Problems:

  • still need extra work for context-dependence (‘cold’, ‘wicked’, ‘sick’…)
  • can’t deal with reader perspective: “Oil prices are down” is good for me, not for Chevron or Shell investors.
  • can’t deal with sarcasm or irony: “Oh, great, they want it to run on Windows”

Current research is focused on machine learning for compositional approaches.

Simon Guest
