Overview

Sentiment analysis is one small part of the wider field of natural language processing, in which we try to automatically identify and quantify the emotion expressed in a text. It has a number of real-world applications, ranging from politics to consumer research. For our purposes, we can ask the question: can we take a tweet and automatically quantify how positive or negative it is?

There are different ways of conducting sentiment analysis, and some methods are much more sophisticated than others. In this workshop, we’ll take a simple approach where the overall sentiment of a string of text directly reflects the sentiment of the individual words, without considering the syntactic relationship between those words. This can be problematic in some examples, which we’ll discuss in more detail later on.

1 Sentiment dictionaries

We can do some basic sentiment analysis without installing any new packages: the tidytext package you were introduced to earlier in Part 2 of this workshop already contains everything we need, including a sentiment dictionary!

A sentiment dictionary is just a list of words with a corresponding sentiment classification - we’ll be looking at two possible dictionaries today, which operationalise sentiment in different ways.

1.1 The ‘Bing’ lexicon

The ‘Bing’ lexicon, developed by Bing Liu and collaborators, is a widely-used sentiment lexicon that is accessible through tidytext. Let’s read it into R and save it to an object called sent:
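
In recent versions of tidytext this is done with the get_sentiments() function. A minimal sketch (the library() calls are shown for completeness; you will already have loaded them in Part 2):

library(tidytext)
library(dplyr)

# load the Bing lexicon as a two-column tibble: word + sentiment
sent <- get_sentiments("bing")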

In this dictionary, sentiment is measured as a binary variable - words are either classified as positive or negative. Let’s look at a random sample of the positive words:
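
One way to do this, using filter() and sample_n() from dplyr:

sent %>%
  filter(sentiment == "positive") %>%
  sample_n(5)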

## # A tibble: 5 x 2
##   word         sentiment
##   <chr>        <chr>    
## 1 delicacy     positive 
## 2 eagerly      positive 
## 3 beckoned     positive 
## 4 enchantingly positive 
## 5 best         positive

And now the same for some negative words:
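
The same idea, just filtering for the other label:

sent %>%
  filter(sentiment == "negative") %>%
  sample_n(5)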

## # A tibble: 5 x 2
##   word         sentiment
##   <chr>        <chr>    
## 1 deficiencies negative 
## 2 smelled      negative 
## 3 breakup      negative 
## 4 confusing    negative 
## 5 taxing       negative

1.2 The ‘AFINN’ lexicon

An alternative to the Bing lexicon is the AFINN lexicon, developed by Finn Årup Nielsen and released in 2011. In this dictionary, words are not classified in a binary fashion but are instead assigned a numerical value reflecting how strongly positive or negative they are, ranging from -5 (the most negative) to +5 (the most positive).

Once again, let’s read it into R and name it sent (you’ll have to install the textdata package first to access the AFINN lexicon):
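
A sketch of those two steps (the install only ever needs to be run once; the first call to get_sentiments("afinn") may also ask you to confirm the download):

# install.packages("textdata")   # run once if you haven't already
sent <- get_sentiments("afinn")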

Now let’s take a random sample to get an idea of how words are evaluated - does it fit your intuitions?
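
One quick way to do that:

sample_n(sent, 5)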

## # A tibble: 5 x 2
##   word      value
##   <chr>     <dbl>
## 1 violates     -2
## 2 lunatics     -3
## 3 mistaking    -2
## 4 vicious      -2
## 5 relaxed       2

2 Calculating sentiment

To conduct sentiment analysis on a given string of text, we need to look up each word in the sentiment lexicon and assign it the appropriate value. This involves two things:

  • splitting the text into individual words, using unnest_tokens() as we did in Part 2
  • matching each word against the sentiment dictionary and copying over its value, which is where left_join() comes in

Tangent: left_join()

left_join() is a very useful function, so it’s important to understand how it works. Take the following example dataset of Lord of the Rings characters, which contains the names of individual characters (in character) and information about their type/race (in type).
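
The construction code isn’t shown here, so the dataframe name characters below is just an illustrative choice:

characters <- data.frame(
  character = c("Gandalf", "Frodo", "Legolas", "Bilbo",
                "Arwen", "Saruman", "Pippin", "Gimli"),
  type      = c("wizard", "hobbit", "elf", "hobbit",
                "elf", "wizard", "hobbit", "dwarf")
)
characters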

##   character   type
## 1   Gandalf wizard
## 2     Frodo hobbit
## 3   Legolas    elf
## 4     Bilbo hobbit
## 5     Arwen    elf
## 6   Saruman wizard
## 7    Pippin hobbit
## 8     Gimli  dwarf

Now imagine we have another, separate dataframe, containing descriptive information about the different character types (e.g. wizards are magical, Hobbits have hairy feet etc.).
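
Again using an illustrative name (features) for this second dataframe:

features <- data.frame(
  type    = c("wizard", "hobbit", "elf"),
  feature = c("magical", "hairy feet", "pointy ears")
)
features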

##     type     feature
## 1 wizard     magical
## 2 hobbit  hairy feet
## 3    elf pointy ears

If we want to join these two dataframes together, we can use left_join(), specifying the two dataframes as arguments. It will notice that both dataframes have a column called type and look for matches between the values. If it finds a match in the second table, it copies the extra columns over to the first table. Note that for any value without a match in the second table (e.g. dwarf), it will just fill in NA:
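
With the illustrative names used above, the call is just:

left_join(characters, features)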

##   character   type     feature
## 1   Gandalf wizard     magical
## 2     Frodo hobbit  hairy feet
## 3   Legolas    elf pointy ears
## 4     Bilbo hobbit  hairy feet
## 5     Arwen    elf pointy ears
## 6   Saruman wizard     magical
## 7    Pippin hobbit  hairy feet
## 8     Gimli  dwarf        <NA>

If for some reason the joining column has a different name in the two dataframes, you will have to specify them as an extra argument. For example, let’s say the type column in one of the dataframes is actually called race instead; the join command would be left_join(dataframe1, dataframe2, by = c("type" = "race")).

left_join() is just one of a number of join commands in R - e.g. right_join(), full_join(), inner_join() etc. We’ll only be using left_join() today, but remember you can always check the help vignette by typing ?left_join() into the console if you want to know how they’re all different.


2.1 Applying sentiment analysis to Twitter

Now let’s apply these methods of sentiment analysis to some Twitter data. Maybe something topical. It can only mean one thing…

If you can’t access the Twitter API, download a copy of the data here: brexit-tweets.Rdata

Since Brexit is something people (rightly) have very strong feelings about, it’s the perfect case study for testing out our methods of sentiment analysis. Let’s take a quick look at what some people are saying (remember you can use sample_n() to take a random sample of rows from a dataframe):
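
A sketch of how you might load and inspect the data (assuming the .Rdata file contains the brexit.tweets dataframe referred to later):

load("brexit-tweets.Rdata")

brexit.tweets %>%
  sample_n(5) %>%
  select(status_id, text)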

status_id text
tweet_1001 This brexit bollocks is still going ? Will beer prices increase?
tweet_3195 Longing for the day Politics isn’t about Brexit, just for a day at least
tweet_3957 the Tories started this nightmare &amp; I will NEVER vote for them again #brexit
tweet_5832 This brexit bullshit is boring now. 🙃
tweet_5889 Brexit is a scam and you’ve been had. Revoke the fuck out of it. Now.

Ok now that we’ve got our dataframe of tweets, the next step is to use unnest_tokens() to take each tweet and convert it to one word per line (as we did in Part 2), then assign the sentiment values to each word using left_join() with the sentiment dictionary stored in sent:
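
Something along these lines, storing the result in a dataframe I’ll call brexit.words (that name is my own choice, not fixed by the workshop):

brexit.words <- brexit.tweets %>%
  unnest_tokens(word, text) %>%   # one row per word, lowercased by default
  left_join(sent, by = "word")    # attach each word's AFINN value (NA if the word isn't in the lexicon)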

We should now have one word per line, with the associated sentiment value in the value column. Let’s take a look at an example tweet to make sure it’s worked:
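
Using the brexit.words name from the sketch above:

brexit.words %>%
  filter(status_id == "tweet_3296") %>%
  select(status_id, word, value)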

## # A tibble: 9 x 3
##   status_id  word      value
##   <chr>      <chr>     <dbl>
## 1 tweet_3296 can          NA
## 2 tweet_3296 a            NA
## 3 tweet_3296 competent     2
## 4 tweet_3296 soul         NA
## 5 tweet_3296 stop         -1
## 6 tweet_3296 this         NA
## 7 tweet_3296 brexit       NA
## 8 tweet_3296 madness      -3
## 9 tweet_3296 please        1

You might disagree with some of the assigned values, but the important thing is that it’s worked!

Our next step is to produce a single value for each tweet, corresponding to how positive/negative it is. At this point, we have a couple of options:

  • for each tweet, we can average over each word’s sentiment value (which would give us a value between -5 and +5 for each tweet)
  • alternatively, we can add up all of the individual sentiment values for each tweet to give us a total overall sentiment

Let’s try both. We’ll use group_by() to group each tweet’s words back together, filter out all the words without a sentiment value using filter() and !is.na(), and finally summarise(), mean() and sum() to produce an average value and a total value for each tweet:
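
Putting those pieces together (again assuming the word-level dataframe is called brexit.words):

brexit.values <- brexit.words %>%
  group_by(status_id) %>%                     # put each tweet's words back together
  filter(!is.na(value)) %>%                   # drop words that weren't in the lexicon
  summarise(average.value = mean(value),
            total.value   = sum(value))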

Let’s take a look at our new brexit.values dataframe:
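
For instance, a random sample of five rows:

sample_n(brexit.values, 5)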

## # A tibble: 5 x 3
##   status_id  average.value total.value
##   <chr>              <dbl>       <dbl>
## 1 tweet_5441         0.333           1
## 2 tweet_6722        -2              -4
## 3 tweet_6422        -2.33           -7
## 4 tweet_4205        -2              -2
## 5 tweet_6095        -1.25           -5

So far so good - but it would be useful to actually see the content of each tweet alongside our sentiment values. This is really easy - we’ve got a status_id column in both this dataframe and the original brexit.tweets dataframe, so we can perform a simple left_join() between them.
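
A sketch of that join, saving the result to a new dataframe (the name brexit.sentiment is my own):

brexit.sentiment <- brexit.values %>%
  left_join(brexit.tweets, by = "status_id")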

Just to make sure we’re getting sensible results, let’s take a look at what’s been classified as the most negative tweet in the dataset:
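
One way to pull it out, using the joined dataframe from above:

brexit.sentiment %>%
  arrange(total.value) %>%   # most negative total first
  slice(1) %>%
  select(text)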

text
Brexit is a puking death fuck-up-pisser collapsing, pissing the broken pisser as dishonourable as the bigotry made of extremist-fucking death that implodes, and fucking a thousand unbelievably despicable sarcophagus-juice-fuckers that shits

Yep, seems pretty accurate! And what about the most positive tweet?
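
Same idea, sorted the other way:

brexit.sentiment %>%
  arrange(desc(total.value)) %>%   # most positive total first
  slice(1) %>%
  select(text)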

text
Thank you, Brexit! You lovely, lovely people, @user !!!! Thank you, @user !!!!I hope you are all very, very proud…..@user #RevokeA50Now

I guess automated sentiment analysis isn’t sensitive to sarcasm…

2.2 Visualisation

Now that we’ve got a single dataframe with two measures of sentiment score for each tweet, it’s time to plot the data!

Let’s try plotting a histogram to visualise the distribution of sentiment values (see the sketch after this list). Notice how:

  • inside geom_histogram(), we’ve changed the number of bins (essentially how fine-grained you’d like the x-axis to be), and set the fill and colour arguments to specify the colour of the bars (and their outline)
  • we’ve also added a type of ‘geom’ that hasn’t been covered yet: geom_vline() draws a vertical line at whatever position on the x-axis you specify in xintercept - you can use lwd to change its width, and lty to change the type of line
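
A minimal version of that plot might look like the following - the bin count, the colours and the choice of total.value for the x-axis are my own guesses, as is the brexit.sentiment name:

library(ggplot2)

ggplot(brexit.sentiment, aes(x = total.value)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "black") +
  geom_vline(xintercept = 0, lwd = 1, lty = 2) +   # vertical reference line at zero
  labs(x = "Total sentiment value", y = "Number of tweets")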

There is an unsurprising skew towards the negative side of the scale - interesting!

For comparison, I also collected 10000 tweets containing the words puppy or puppies and generated an equivalent dataset of sentiment scores. I won’t reproduce all of the code here, since I just used the same method of sentiment analysis we’ve just worked through, but you can download the dataset below (or even better: perform the search yourself using search_tweets() to see if you can replicate the results with a more recent set of tweets!).

In the code below, we add an identifying column type to distinguish the two sets of tweets, before combining them together into a dataframe called combined.sentiment:

Download a copy of the data here: puppy-tweets.Rdata
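
A sketch of the combining step - the object name puppy.sentiment and the exact type labels are assumptions about what the downloaded file contains:

load("puppy-tweets.Rdata")   # assumed to load a dataframe of per-tweet puppy sentiment scores

combined.sentiment <- bind_rows(
  brexit.sentiment %>% mutate(type = "brexit"),
  puppy.sentiment  %>% mutate(type = "puppy")
)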

And now let’s compare their sentiment scores:
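
One possible way to plot the comparison (an overlaid histogram of the average values; the original may well have used a different geom):

ggplot(combined.sentiment, aes(x = average.value, fill = type)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.5) +
  labs(x = "Average sentiment value", y = "Number of tweets", fill = "Topic")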

Who would’ve thought - puppies are more popular than Brexit…


Another thing we can do is plot a word cloud of the positive and negative words present in the datasets. It’s easier to use a specific package for this rather than using ggplot as we’ve done so far. Let’s go ahead and install the wordcloud package, then load it into the workspace:
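
The install and load steps are sketched below (the install line only needs running once):

# install.packages("wordcloud")   # run once
library(wordcloud)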

Since we’re not using ggplot, the syntax is a little different. To plot a word cloud of the most frequent positive words, we use filter() to only return those words with a positive sentiment value, then count() to calculate the frequency of each unique word, and then we make a call to the wordcloud() function itself.
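
A sketch using the Brexit word-level data (the intermediate object positive.words and the max.words cap are my own choices):

positive.words <- brexit.words %>%
  filter(value > 0) %>%   # keep only positively-valued words
  count(word)             # one row per word, with its frequency in column n

wordcloud(words = positive.words$word,
          freq  = positive.words$n,
          max.words = 100)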

We can do the same for the negative words, i.e. those where value < 0 (we’re also excluding the word no here as it’s just so much more frequent than the rest and it isn’t really interesting):
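
Following the same pattern:

negative.words <- brexit.words %>%
  filter(value < 0, word != "no") %>%   # drop "no", which is far more frequent than anything else
  count(word)

wordcloud(words = negative.words$word,
          freq  = negative.words$n,
          max.words = 100)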

Exercise

Pick a topic of your choice and run another Twitter search, then follow the same steps as above to perform sentiment analysis. Compare your results to those for another topic to see which prompts more positive or negative discussion on Twitter.


Discussion

What are the limitations of conducting sentiment analysis in this way? Take the following tweet as an example, which would be classified as being quite positive with an overall sentiment score of 2:

## # A tibble: 8 x 3
##   status_id  word    value
##   <chr>      <chr>   <dbl>
## 1 tweet_6087 brexit     NA
## 2 tweet_6087 is         NA
## 3 tweet_6087 never      NA
## 4 tweet_6087 going      NA
## 5 tweet_6087 to         NA
## 6 tweet_6087 be         NA
## 7 tweet_6087 a          NA
## 8 tweet_6087 success     2

How could you improve this method of sentiment analysis? Think about how things like negation, intensifiers, or mitigators can affect the overall emotion of a sentence, for example in the following:

  • He’s a comedian but he isn’t funny
  • The film was really good (cf. The film was good)
  • Her performance was kind of impressive