Overview

Before we do any analysis, we’ll need to collect some data. There are a number of ways of ‘scraping’ Twitter for tweets, and some are better than others depending on your workflow and what kind of information you’d like to collect. R packages such as rtweet are great because you can collect the data and then analyse it all within the same program, but there are other options such as the stand-alone FireAnt program that provides a nice user-friendly way of collecting data. In this section, we will cover both of these methods.

1 Mining tweets in R

The first thing we need to do is install and then load the rtweet R package. You only need to install the package once: after this, you can simply load it into the workspace using the library() function.

You’ll also need to install the httpuv package in order to authenticate your R session and gain permission to mine Twitter data.
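A minimal sketch of that setup (the install lines only need to be run once):

install.packages("rtweet")   # install once
install.packages("httpuv")   # install once
library(rtweet)              # load rtweet into the workspace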

1.1 Searching by tweet content

There are a few options when it comes to searching for tweets, but perhaps the most basic (and most useful!) is to search for a particular term. You can do this using the search_tweets() function, which collects tweets that include a particular word or phrase (unfortunately, the search is restricted to tweets from roughly the past week).

As with any R function, you can type in ?search_tweets to bring up its help vignette, which will explain how it works and what each argument means. If you’re ever unsure about what an R function does, or what arguments are necessary and what they mean, always check the help vignette first.

Let’s do a Game of Thrones search (potential spoilers ahead!). In the example below, the code will return a maximum of 100 tweets (specified by the n argument) containing the word ‘Lannister’, and write the output of this search to a dataframe called lannister.tweets. The include_rts argument allows us to exclude retweets from this search, and lang allows us to filter by language (e.g. ‘en’ for English, ‘de’ for German, etc.).
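A sketch of what that chunk might look like (the exact arguments in the original may differ slightly):

lannister.tweets <- search_tweets(
  "Lannister",          # search term
  n = 100,              # maximum number of tweets to return
  include_rts = FALSE,  # exclude retweets
  lang = "en"           # English-language tweets only
)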

If you can’t access the Twitter API, download a copy of the data here: lannister-tweets.Rdata

It’s important to note that Twitter limits your queries to around 18,000 tweets every 15 minutes. If you get a warning message about exceeding the limit, you’ll have to wait around 15 minutes until it resets and then you can search again.

We can also specify more than one search term at a time, if we separate the search terms with ‘OR’. This time, let’s search for tweets containing the word Tyrion or the word Lannister. Let’s also increase n to 300:
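Something along these lines (tyrion.tweets is an assumed object name, chosen to match the download file below):

tyrion.tweets <- search_tweets(
  "Tyrion OR Lannister",  # tweets containing either word
  n = 300,
  include_rts = FALSE,
  lang = "en"
)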

If you can’t access the Twitter API, download a copy of the data here: tyrion-tweets.Rdata

Now that we’ve collected some data, let’s see how many rows and columns our dataframe has:
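dim() does this (again assuming the dataframe is called tyrion.tweets):

dim(tyrion.tweets)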

## [1] 300  90

It includes 300 rows (as we might expect), and 90 columns. 90 columns is a lot! If we run the colnames() function on our dataframe, we can see what each column is named, revealing all the metadata that we collect alongside the content of each tweet itself.
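For example:

colnames(tyrion.tweets)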

##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "quote_count"             "reply_count"            
## [17] "hashtags"                "symbols"                
## [19] "urls_url"                "urls_t.co"              
## [21] "urls_expanded_url"       "media_url"              
## [23] "media_t.co"              "media_expanded_url"     
## [25] "media_type"              "ext_media_url"          
## [27] "ext_media_t.co"          "ext_media_expanded_url" 
## [29] "ext_media_type"          "mentions_user_id"       
## [31] "mentions_screen_name"    "lang"                   
## [33] "quoted_status_id"        "quoted_text"            
## [35] "quoted_created_at"       "quoted_source"          
## [37] "quoted_favorite_count"   "quoted_retweet_count"   
## [39] "quoted_user_id"          "quoted_screen_name"     
## [41] "quoted_name"             "quoted_followers_count" 
## [43] "quoted_friends_count"    "quoted_statuses_count"  
## [45] "quoted_location"         "quoted_description"     
## [47] "quoted_verified"         "retweet_status_id"      
## [49] "retweet_text"            "retweet_created_at"     
## [51] "retweet_source"          "retweet_favorite_count" 
## [53] "retweet_retweet_count"   "retweet_user_id"        
## [55] "retweet_screen_name"     "retweet_name"           
## [57] "retweet_followers_count" "retweet_friends_count"  
## [59] "retweet_statuses_count"  "retweet_location"       
## [61] "retweet_description"     "retweet_verified"       
## [63] "place_url"               "place_name"             
## [65] "place_full_name"         "place_type"             
## [67] "country"                 "country_code"           
## [69] "geo_coords"              "coords_coords"          
## [71] "bbox_coords"             "status_url"             
## [73] "name"                    "location"               
## [75] "description"             "url"                    
## [77] "protected"               "followers_count"        
## [79] "friends_count"           "listed_count"           
## [81] "statuses_count"          "favourites_count"       
## [83] "account_created_at"      "verified"               
## [85] "profile_url"             "profile_expanded_url"   
## [87] "account_lang"            "profile_banner_url"     
## [89] "profile_background_url"  "profile_image_url"

Not only do we get the content of the tweet (in text), we also get information on:

  • who sent it (screen_name)
  • the time/date it was sent (created_at)
  • the exact latitude/longitude coordinates from where the tweet was sent, if the account has geotagging enabled (geo_coords)
  • many, many other things!

Exercise

  1. Take a look at the data set. For each column, try and work out what kind of information it contains, and whether or not it could be useful for any analysis

  2. Try running the code again but for a different search term. Be careful not to run it too many times with a high n argument, otherwise you might go over the limit (remember: 18,000 tweets every 15 minutes).

  3. Read the help vignette for the search_tweets() function (you can do this by typing ?search_tweets). Focus in particular on the description of the q argument: what’s the difference between searching for Tyrion Lannister, Tyrion OR Lannister, Tyrion AND Lannister, and “Tyrion Lannister”?

1.2 Searching by user

You can also collect tweets from an individual account, such as @realDonaldTrump (if you can stomach it). Note that this is restricted to the most recent 3,200 tweets from a single account (even if you set the n argument to something higher, like 10,000).
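A sketch of that call (n = 150 matches the number of tweets referred to in the next section):

trump.tweets <- get_timeline("realDonaldTrump", n = 150)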

If you can’t access the Twitter API, download a copy of the data here: trump-tweets.Rdata

1.2.1 Comparing popularity of users

One fun (and perhaps useful!) thing to do is to compare the popularity of tweets from multiple users. We’ve already got Donald Trump’s 150 most recent tweets saved in the trump.tweets dataframe, but now let’s do the same for @BarackObama:
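As a sketch (obama.tweets is an assumed object name, chosen to match the download file below):

obama.tweets <- get_timeline("BarackObama", n = 150)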

If you can’t access the Twitter API, download a copy of the data here: obama-tweets.Rdata

Now we can combine these two dataframes together using rbind() - note that this requires both dataframes to have the same column names in the same order. In this case, both dataframes are just the direct output of get_timeline(), so they are identical in structure.
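For example:

combined.tweets <- rbind(trump.tweets, obama.tweets)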

Now that we’ve got our combined.tweets dataframe, let’s plot the relationship between the number of times each tweet was ‘favourited’ (in the favorite_count column) and ‘retweeted’ (in the retweet_count column), colour-coding each tweet based on its author. We can use the ggplot2 package for our plotting, which is part of the tidyverse - a really neat way of structuring our code. Install the tidyverse set of packages, if you haven’t already, and then load it in using library():
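For example (the install line only needs to be run once):

install.packages("tidyverse")   # install once
library(tidyverse)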

The following chunk of code says:

take the combined.tweets dataframe, and input this (using the %>% symbol) to the ggplot() function. Set the x-axis values to favorite_count, the y-axis values to retweet_count, and colour-code our data based on the screen_name column. Then plot this data as a scatterplot using geom_point()
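A sketch of that chunk:

combined.tweets %>%
  ggplot(aes(x = favorite_count, y = retweet_count, colour = screen_name)) +
  geom_point()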

Looks like all of the most popular tweets belong to Obama 🥳 However, you might have noticed that the data is quite skewed, with some particularly high values resulting in most of the data being compressed into the bottom-left corner. We can fix this by applying a logarithmic transformation to the x and y values, which expands the lower values and compresses the higher ones, making our figure more readable:
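One way of doing this is with ggplot2’s log-scale axes (the original chunk may have transformed the values differently):

combined.tweets %>%
  ggplot(aes(x = favorite_count, y = retweet_count, colour = screen_name)) +
  geom_point() +
  scale_x_log10() +   # log-transform the x-axis
  scale_y_log10()     # log-transform the y-axis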

Using the tidyverse packages, we can easily summarise the data using some basic descriptive statistics. Let’s say we want to calculate the mean/median number of times each account’s tweets were ‘favourited’, as well as the maximum number of favourites a single tweet received. We can do this using group_by(), which temporarily splits our dataset based on each unique value in a specified variable (in this case screen_name), and summarise(), which allows us to compute some basic summary statistics (such as mean(), median(), and max()):
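A sketch that reproduces the column names shown in the output below:

combined.tweets %>%
  group_by(screen_name) %>%   # split by account
  summarise(mean(favorite_count), median(favorite_count), max(favorite_count))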

## # A tibble: 2 x 4
##   screen_name    `mean(favorite_count… `median(favorite_cou… `max(favorite_coun…
##   <chr>                          <dbl>                 <dbl>               <int>
## 1 BarackObama                  240585.               150051              1397785
## 2 realDonaldTru…                59192.                57042.              167229

Exercise

  1. Try using the get_timelines() function for two different accounts (maybe a celebrity’s, or even your own!) and conduct a similar comparison.

  2. Now look for a relationship between favorite_count or retweet_count and some other variable contained within the dataset, e.g. is_quote (does the tweet quote an existing tweet?), or display_text_width (the number of characters in the tweet). Note that since the former is a categorical variable - not a continuous one - a scatterplot wouldn’t be appropriate. Consider using geom_boxplot() instead of geom_point() for this kind of figure.

1.3 Collect tweets in real-time

Another option for data collection is to stream tweets as they’re sent, in real-time.
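In rtweet this is done with stream_tweets(); with no search query it simply samples the live stream for a set number of seconds (a sketch):

stream_tweets(timeout = 30)   # stream a random sample of tweets for 30 seconds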

It doesn’t normally make much sense to run a completely unrestricted search like this, because it collects tweets from all over the world, in any language, about any topic.

It might make more sense to restrict this search geographically. For example, the following code will collect live tweets sent from Manchester over the next minute. If you leave this running for a long time, you can build a corpus of tweets sent from a particular location:
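A sketch, assuming lookup_coords() can return a bounding box for Manchester (see the note about the Google Maps API key below):

manchester.tweets <- stream_tweets(
  lookup_coords("manchester, uk"),  # geographic bounding box
  timeout = 60                      # stream for one minute
)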

However, a recent update to the Twitter/Google Maps API means that you now have to register for a valid Google Maps API key in order to perform these geographically-restricted searches. We won’t be covering this process in this workshop, but you can read up on how to do it here.

1.4 Saving data

If you want to keep the data you’ve scraped for future analysis, make sure you export it from R. To save your dataframe (assuming you’ve already made a folder called ‘data’ inside your current working directory):
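For example (the filename is just illustrative):

save(trump.tweets, file = "data/trump-tweets.Rdata")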

Then you can always load it in again at a later date, using:
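For example:

load("data/trump-tweets.Rdata")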


2 Mining tweets using FireAnt

If you plan on collecting a lot of data, it might be worthwhile using the FireAnt software instead. FireAnt is a useful piece of software developed by Laurence Anthony, free to download from his website. It provides a graphical user interface (GUI) to access Twitter’s Streaming API, which allows for a user-friendly way to collect tweets sent in real-time. Because you’re collecting tweets as they’re sent, instead of searching back for existing tweets, the limit I mentioned earlier (of 18,000 tweets every 15 minutes) doesn’t apply. You can simply leave the software running for as long as you want (hours, days, weeks) and by the end of it you’ll have a lot of data.

In my experience, if you want to collect a lot of geocoded data from a particular region, FireAnt is the best option.

The data will be saved in .JSON format, which can be read into R as follows:
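A sketch, assuming the export is a file of one-tweet-per-line JSON (the filename is illustrative). rtweet’s parse_stream() converts streaming JSON into the same dataframe format we’ve been using; jsonlite::stream_in() is an alternative if your file isn’t in that format:

fireant.tweets <- parse_stream("data/fireant-tweets.json")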


3 Textual analysis

Now that we’ve run through various ways of collecting tweets, let’s go over some basic analyses you can conduct. We’ll be using the tidyverse package again, which should already be installed and loaded, but we also need the tidytext package to conduct some textual analysis:
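For example (the install line only needs to be run once):

install.packages("tidytext")   # install once
library(tidytext)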

3.1 Word frequency

We can look at the content of these tweets in more detail by calculating the most frequent words.

First off, let’s convert our dataframe of Donald Trump tweets so that each word of each tweet is on its own line. We can use select() to take just two columns (the ID, and the content of each tweet), and then unnest_tokens() to transform our dataframe into a one-word-per-line format.

Thanks to the tidyverse packages, we can connect these commands together using the %>% symbol - this is referred to as a ‘pipe chain’, and it basically means:

take the thing that comes before ‘the pipe’, and input this into the thing that comes after ‘the pipe’

In other words, the following code:

  • takes trump.tweets and inputs it into the select() command, which selects just the status_id and text columns, discarding the rest…
  • … these are then input into unnest_tokens(), which takes the content of text, splits it into a one-word-per-line format, and puts this in a column called word
  • … and saves all of this into a dataframe called trump.words
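Putting that together, a sketch of the chunk:

trump.words <- trump.tweets %>%
  select(status_id, text) %>%   # keep just the tweet ID and tweet text
  unnest_tokens(word, text)     # one word per row, in a column called 'word'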

If we were to execute these commands in a ‘non-tidy’ way, it would look something like this instead:
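Roughly like this, with the function calls nested inside one another:

trump.words <- unnest_tokens(select(trump.tweets, status_id, text), word, text)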

The first, ‘tidy’ method of structuring our code is much more intuitive, without the need to use nested brackets that can sometimes be difficult to interpret. Throughout this workshop we’ll be ‘piping’ commands together in this ‘tidy’ way.

Let’s use head() to look at the first 6 lines, just to make sure our code has worked:
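For example:

head(trump.words)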

## # A tibble: 6 x 2
##   status_id           word   
##   <chr>               <chr>  
## 1 1170782374843564034 looking
## 2 1170782374843564034 forward
## 3 1170782374843564034 to     
## 4 1170782374843564034 being  
## 5 1170782374843564034 in     
## 6 1170782374843564034 the

Looks good! Each line of trump.words contains a single word from each tweet, with a column corresponding to the tweet ID containing that word. Now we’ve got each word of each tweet on its own line, we can simply count the occurrence of each word and plot the most frequent ones.

All of the following commands are piped together using %>% and their output is saved into a new object called trump.count:

  • count() will count how many times each word appears
  • head() will give you the first x number of rows of this data (in the example below the first 30 lines)
  • mutate() allows us to change the word column - which we will do using the reorder() command to make sure we plot the most frequent words at the top
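A sketch of that chunk:

trump.count <- trump.words %>%
  count(word, sort = TRUE) %>%      # count each word, most frequent first
  head(30) %>%                      # keep the 30 most frequent words
  mutate(word = reorder(word, n))   # order by frequency for plotting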

Now we can plot the word frequency using ggplot(). To do this, we take the dataframe trump.count, pipe it into the ggplot() function, inside of which we specify the columns for our x and y axes. Then we need to specify what kind of graph we’d like, in this case geom_col() will plot a bar chart. The final two lines aren’t mandatory, but coord_flip() will flip the axes round and theme_minimal() changes the ggplot theme to make the figure look a bit cleaner.
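As a sketch:

trump.count %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +        # bar chart
  coord_flip() +      # flip the axes
  theme_minimal()     # cleaner theme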

Cool! But there’s a problem here: obviously the most frequent words are things like the, to, and, of etc. which aren’t particularly interesting. To remove these, we can use what’s called a stop list: a list of highly frequent words you want to exclude from the analysis. Luckily, the tidytext package we installed and loaded earlier already provides one of these, called stop_words. The following code adds a few Twitter-specific items to this list, such as hyperlinks (‘https’) and acronyms (‘rt’, i.e. ‘retweet’) that we obviously aren’t interested in.
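A sketch of how that list might be built (beyond ‘https’ and ‘rt’, the extra items ‘t.co’ and ‘amp’ are my own assumptions about what was added):

stop_words_new <- stop_words %>%
  bind_rows(tibble(
    word = c("https", "t.co", "rt", "amp"),   # Twitter-specific items (assumed list)
    lexicon = "twitter"
  ))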

Now we can remake the trump.count object, but with the addition of a new line that excludes certain words from our dataset:

  • filter() can be used to filter out certain rows of data depending on specific criteria that you set
  • %in% is a logical operator - it checks to see if the object that comes before it appears in the object that comes after it, and returns either TRUE or FALSE
  • ! is a negator - it reverses whatever comes after it, so TRUE becomes FALSE and vice versa

Taken together, the second line of code below essentially says: only include rows of data where the word is not in the updated list of stop words
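A sketch:

trump.count <- trump.words %>%
  filter(!word %in% stop_words_new$word) %>%   # drop anything in the stop list
  count(word, sort = TRUE) %>%
  head(30) %>%
  mutate(word = reorder(word, n))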

Now when we re-make the same plot, it shouldn’t include any of the uninteresting function words:

Unsurprisingly, Donald Trump tweets a lot about ‘fake news’, and even ‘Trump’ himself…

Exercise

  1. Now try the same method of analysing word frequency for a different account. You should already have a set of tweets from a different account that you collected earlier using get_timelines() - if not, do it now!

  2. Let’s also make the word frequency plot more colourful. If you want to make all the bars red, for example, you can specify fill = 'red' inside the geom_col() command. Try it out.

  3. We can take it one step further and colour-code the bars based on a variable/column in our dataset. To do this, you just specify the column name for the fill command (without quotation marks), but when you do this you also have to wrap aes() around it. So to colour everything red, it’s geom_col(fill = 'red'), but to colour-code based on the word frequency, it’s geom_col(aes(fill = n)), which makes reference to the n column in the dataframe.

3.2 n-grams

So far we’ve just looked at the frequency of individual words, but of course in language context is very important. For this reason, it’s quite common to investigate frequent collocations of words instead - or n-grams. Let’s test it out by looking at bigrams from Trump’s tweets - i.e. which two words appear together most often?

Since n-gram analysis requires more data than individual word frequency analysis, you might want to first re-run the get_timeline() function from before to collect more tweets from the @realDonaldTrump account:
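For example (n = 3200 is the maximum the API allows; the original chunk may have used a smaller number):

trump.tweets <- get_timeline("realDonaldTrump", n = 3200)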

If you can’t access the Twitter API, download a copy of the data here: trump-tweets-big.Rdata

We can use unnest_tokens() like we did before to get a one-word-per-line format, but this time we include the token = "ngrams" and n = 2 arguments, which tell R to tokenise into bigrams instead of individual words (if we wanted trigrams, we would change n to 3).
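A sketch (the column name bigram matches the output below):

trump.ngrams <- trump.tweets %>%
  select(status_id, text) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)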

Take a look at the first 10 rows to make sure it’s worked:
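For example:

head(trump.ngrams, 10)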

## # A tibble: 10 x 2
##    status_id           bigram         
##    <chr>               <chr>          
##  1 1171120177196544000 we have        
##  2 1171120177196544000 have been      
##  3 1171120177196544000 been serving   
##  4 1171120177196544000 serving as     
##  5 1171120177196544000 as policemen   
##  6 1171120177196544000 policemen in   
##  7 1171120177196544000 in afghanistan 
##  8 1171120177196544000 afghanistan and
##  9 1171120177196544000 and that       
## 10 1171120177196544000 that was

Next up, we need to count the number of occurrences of each bigram. We can do this using count(), as we did for individual lexical frequency earlier, but before we do that we should separate() each bigram into its constituent words and get rid of any that are in our stop list (stop_words_new) - we can do this using filter().
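A sketch of those steps (word1 and word2 are assumed column names):

trump.ngrams.count <- trump.ngrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%   # split each bigram
  filter(!word1 %in% stop_words_new$word,
         !word2 %in% stop_words_new$word) %>%                   # drop stop words
  count(word1, word2, sort = TRUE)                              # count each bigram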

Now let’s plot them! In the following code, we:

  • take the trump.ngrams.count object we’ve just made…
  • input this to mutate(), where we use paste() to rejoin the two words into a single string and save it in a column called bigram
  • use mutate() again to reorder the bigram column (arranged in descending order by frequency)…
  • use head(10) to take the first 10 rows (i.e. the 10 most frequent bigrams)…
  • input this to ggplot(), where we plot each bigram along the x-axis and the frequency itself - i.e. n - along the y-axis…
  • use geom_col() to plot this as a bar chart…
  • and finally use coord_flip() to flip the x and y axes around (it makes for a nicer-looking plot in this case)
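A sketch of that chunk:

trump.ngrams.count %>%
  mutate(bigram = paste(word1, word2)) %>%   # rejoin the two words into one string
  mutate(bigram = reorder(bigram, n)) %>%    # order by frequency
  head(10) %>%                               # the 10 most frequent bigrams
  ggplot(aes(x = bigram, y = n)) +
  geom_col() +
  coord_flip()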

However, it’s more common (and more exciting!) to plot this kind of data as an ngram network instead, where words are clustered by their collocation frequency. To do this, we need to first install and load two new packages:
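Judging by the functions used below, these are igraph and ggraph:

install.packages("igraph")   # install once
install.packages("ggraph")   # install once
library(igraph)
library(ggraph)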

We can now plot a bigram network using the code below. It looks a little scary at first, and in all honesty you don’t strictly need to know what each bit does:

  • filter() is straightforward: we’re only plotting bigrams that appear more than 3 times in the dataset
  • we’ve specified layout = "fr" inside the ggraph() function - this tells R to use the force-directed algorithm developed by Fruchterman and Reingold when positioning the nodes (check the ?layout_tbl_graph_igraph help vignette for a list of other options you could use instead)
  • geom_edge_link() refers to the links between nodes - we’ve set edge_alpha = n so that more frequent bigrams are plotted with darker connecting lines, and we’ve also specified that we want to plot arrowheads using arrow so that we know which word comes first in each bigram
  • geom_node_point() refers to the nodes/words themselves - here you can specify their colour/size (you can find a list of colour names here)
  • geom_node_text() refers to the labels for each node/word - by setting label = name we’re telling R to plot the word label for each node (it wouldn’t be a very informative graph otherwise!) and the vjust and hjust arguments allow us to nudge the position of the labels a tiny bit so that they’re not completely overlapping with the nodes themselves
  • theme_void() changes the ggplot theme to one with a white background (instead of the default grey)
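A sketch of the whole chunk, following the steps above (the node colour and exact arrow/label settings are my own choices):

trump.ngrams.count %>%
  filter(n > 3) %>%                                   # only bigrams occurring more than 3 times
  graph_from_data_frame() %>%                         # convert to a graph object
  ggraph(layout = "fr") +                             # Fruchterman-Reingold layout
  geom_edge_link(aes(edge_alpha = n),                 # darker lines = more frequent bigrams
                 arrow = grid::arrow(length = grid::unit(2, "mm"))) +
  geom_node_point(colour = "lightblue", size = 3) +   # the nodes themselves
  geom_node_text(aes(label = name),                   # word labels
                 vjust = 1, hjust = 1) +              # nudge labels off the nodes
  theme_void()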

Exercise

Try using the same methods as above to plot a bigram network of either:

  1. a single user’s tweets (from get_timelines())
  2. a certain topic (from search_tweets())

3.3 Time-series

We can also investigate how frequently certain words occur over time, using the ts_plot() function to plot a time-series of our data. Let’s try this out with our tweets from @realDonaldTrump, searching for the phrase ‘fake news’. First off, we’ll have to make a new dataframe which includes only those tweets containing this phrase. We can do this using a combination of filter() and str_detect(). We also use mutate() and tolower() to convert the content of each tweet to lower case before performing our search using str_detect():
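A sketch (fakenews.tweets is an assumed object name):

fakenews.tweets <- trump.tweets %>%
  mutate(text = tolower(text)) %>%          # convert tweet text to lower case
  filter(str_detect(text, "fake news"))     # keep only tweets containing the phrase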

Now that we’ve got this dataset, we can feed it to ts_plot(), along with an argument for how fine-grained we’d like the time dimension to be, e.g. minutes, hours, days, weeks etc. Let’s try plotting the frequency per day (strictly speaking you only need the first line of code, but labs() is useful for adding labels to your plot):
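A sketch (the labels inside labs() are just illustrative):

ts_plot(fakenews.tweets, "days") +
  labs(x = "Date", y = "Number of tweets",
       title = "Tweets from @realDonaldTrump containing ‘fake news’")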

Exercise

Can you think of a popular topic that might show temporal patterns (i.e. an increase or decrease over time)? Try it out! Collect some data using either search_tweets() or get_timelines(), then use ts_plot() to plot the frequency over time.