Overview

In this part of the workshop we’ll cover the basics of using R to perform data analysis, before moving on to Twitter-specific topics in Parts 2-4.

If you’re not already familiar with R, it can appear a little daunting at first, and it certainly has a steeper learning curve than traditional spreadsheet software such as Microsoft Excel. The most obvious difference is that R doesn’t have a graphical user interface in the same way that Excel does: instead, all of the data analysis is conducted through typing code. Just like any other programming language, most of the error messages you’re likely to encounter result from incorrect syntax (e.g. erroneous commas, missing out a closing bracket, using uppercase instead of lowercase) or hard-to-spot typos in your code. This can make learning R a frustrating endeavour at times, but I promise it’s worthwhile!


1 Downloading R and RStudio

You’ll need to download and install two pieces of software for this workshop: the R programming language/environment itself, and the RStudio Integrated Development Environment. There are Mac, Windows, and Linux versions, and both are available completely free of charge.

R can be downloaded from the CRAN website (https://cran.r-project.org), and you’ll find download links for RStudio on the Posit website (https://posit.co).

Once you’ve downloaded and installed both, you probably won’t ever need to open R directly - go ahead and open RStudio and let’s get started with a whistle-stop tour of how it all works.


2 The RStudio workspace

The RStudio workspace can be split into four main areas:

  1. Script window: while you can type all of your code directly into the console (see below), it makes more sense to type it into an R script file that you can then save and load back in again at a later date. Open a new empty script by going to File → New File → R Script or by clicking the button at the top left of the RStudio window. Important: to run code from an R script, click anywhere on the desired line, then press Cmd + Enter (Mac) or Ctrl + Enter (Windows/Linux).
  2. Console: this is where the output of your analysis will be printed. You can also type commands directly into here, but they’ll be lost when you close RStudio so make sure all of your important code is going in the R script itself!
  3. Environment: this is where your data ‘objects’ will be listed when you create/load them
  4. Plots/Packages/Help: there are a number of tabs along the top here, but the most important ones are Plots (where your graphs go), Packages (where you can view/install extra packages to add functionality to R), and Help (where you can find lots of helpful vignettes on how to use each command/function in R)

3 Working with data

Let’s start off working with an example dataset called iris, which contains sepal and petal measurements for three species of iris flower. It comes built into R, which means we can refer to it simply by typing its name, without first having to load it from an external data file:
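As a minimal illustration, using nothing beyond base R, typing the name on its own prints the whole dataframe to the Console:

```r
# iris ships with R's built-in datasets package, so there is nothing to load first
iris
```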

We can take a peek at the first six rows by running the head() command:
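A call along these lines should give the six rows printed below:

```r
# head() returns the first six rows of a dataframe by default
head(iris)
```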

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Sometimes it’s also useful just to print the column names to get an idea of how the dataset is structured; you can do this using the colnames() command:
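In its simplest form, that looks like this:

```r
# colnames() returns the column names as a character vector
colnames(iris)
```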

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

We can also find out how many rows the dataset contains with nrow():
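For example:

```r
# nrow() counts the rows; ncol() would do the same for columns
nrow(iris)
```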

## [1] 150

iris is a special kind of object called a dataframe, essentially the equivalent of a spreadsheet in Excel. As we’ve just seen, it’s made up of 5 columns and 150 rows. Another important type of object is a vector, which is basically just an ordered set of values that all share the same type.

Importantly, each column in a dataframe is a vector: a list of values all of the same measurement type. You can refer to a specific column using the $ sign. For example, the following will print out the Sepal.Length column/vector as a list of values:
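A short sketch of the $ syntax, with the dataframe name on the left and the column name on the right:

```r
# dataframe$column pulls out a single column as a vector
iris$Sepal.Length
```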

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

It’s sometimes useful just to print out the unique values contained within a column. To do this, you just need to place a call to a dataframe column, as we’ve just tried above, inside the unique() command. Let’s try it with the Species column (note the capital S, since R is case-sensitive) to see how many different plant types are included in the dataframe:
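Something like the following should do it:

```r
# unique() drops the repeated values and keeps one of each
unique(iris$Species)
```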

## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica

You can do some cool things with numeric vectors. The following will print out the values of the Sepal.Length column, but with 10 added to each value:
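A minimal sketch of this kind of vector arithmetic:

```r
# The 10 is added to every element of the vector in one go
iris$Sepal.Length + 10
```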

##   [1] 15.1 14.9 14.7 14.6 15.0 15.4 14.6 15.0 14.4 14.9 15.4 14.8 14.8 14.3
##  [15] 15.8 15.7 15.4 15.1 15.7 15.1 15.4 15.1 14.6 15.1 14.8 15.0 15.0 15.2
##  [29] 15.2 14.7 14.8 15.4 15.2 15.5 14.9 15.0 15.5 14.9 14.4 15.1 15.0 14.5
##  [43] 14.4 15.0 15.1 14.8 15.1 14.6 15.3 15.0 17.0 16.4 16.9 15.5 16.5 15.7
##  [57] 16.3 14.9 16.6 15.2 15.0 15.9 16.0 16.1 15.6 16.7 15.6 15.8 16.2 15.6
##  [71] 15.9 16.1 16.3 16.1 16.4 16.6 16.8 16.7 16.0 15.7 15.5 15.5 15.8 16.0
##  [85] 15.4 16.0 16.7 16.3 15.6 15.5 15.5 16.1 15.8 15.0 15.6 15.7 15.7 16.2
##  [99] 15.1 15.7 16.3 15.8 17.1 16.3 16.5 17.6 14.9 17.3 16.7 17.2 16.5 16.4
## [113] 16.8 15.7 15.8 16.4 16.5 17.7 17.7 16.0 16.9 15.6 17.7 16.3 16.7 17.2
## [127] 16.2 16.1 16.4 17.2 17.4 17.9 16.4 16.3 16.1 17.7 16.3 16.4 16.0 16.9
## [141] 16.7 16.9 15.8 16.8 16.7 16.7 16.3 16.5 16.2 15.9

This isn’t very useful, but it goes to show how you can perform certain mutations on a vector of numbers. It might be more useful to add together the Sepal.Length and Petal.Length columns to get a measure of combined length. Let’s try it now:
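Adding two columns together works element by element, so a sketch of the combined length would be:

```r
# Row 1 of Sepal.Length is added to row 1 of Petal.Length, and so on
iris$Sepal.Length + iris$Petal.Length
```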

##   [1]  6.5  6.3  6.0  6.1  6.4  7.1  6.0  6.5  5.8  6.4  6.9  6.4  6.2  5.4
##  [15]  7.0  7.2  6.7  6.5  7.4  6.6  7.1  6.6  5.6  6.8  6.7  6.6  6.6  6.7
##  [29]  6.6  6.3  6.4  6.9  6.7  6.9  6.4  6.2  6.8  6.3  5.7  6.6  6.3  5.8
##  [43]  5.7  6.6  7.0  6.2  6.7  6.0  6.8  6.4 11.7 10.9 11.8  9.5 11.1 10.2
##  [57] 11.0  8.2 11.2  9.1  8.5 10.1 10.0 10.8  9.2 11.1 10.1  9.9 10.7  9.5
##  [71] 10.7 10.1 11.2 10.8 10.7 11.0 11.6 11.7 10.5  9.2  9.3  9.2  9.7 11.1
##  [85]  9.9 10.5 11.4 10.7  9.7  9.5  9.9 10.7  9.8  8.3  9.8  9.9  9.9 10.5
##  [99]  8.1  9.8 12.3 10.9 13.0 11.9 12.3 14.2  9.4 13.6 12.5 13.3 11.6 11.7
## [113] 12.3 10.7 10.9 11.7 12.0 14.4 14.6 11.0 12.6 10.5 14.4 11.2 12.4 13.2
## [127] 11.0 11.0 12.0 13.0 13.5 14.3 12.0 11.4 11.7 13.8 11.9 11.9 10.8 12.3
## [141] 12.3 12.0 10.9 12.7 12.4 11.9 11.3 11.7 11.6 11.0

Note that this doesn’t replace the current values in the Sepal.Length column; it just prints the result to the Console. If we wanted to calculate the combined length and then save it to a new column in the dataframe, we would have to add a bit of code before the addition:
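A sketch of that assignment, using the Combined.Length column name described in the text:

```r
# Left of <- is where the result is stored; right of <- is what gets calculated
iris$Combined.Length <- iris$Sepal.Length + iris$Petal.Length

# Check that the new column is now part of the dataframe
head(iris)
```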

The above code basically says create a new column in the iris dataframe and call it Combined.Length, and set the values of this to be the addition of the Sepal.Length and Petal.Length values. The <- symbol is used for object assignment: it saves what’s on the right-hand side to whatever you specify on the left-hand side, which in this case is a new column called Combined.Length in the iris dataframe.

Of course, it’s also useful to calculate basic descriptive statistics on numeric columns. We can calculate the mean value of a numeric vector by using mean():
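For the Sepal.Length column, that would look like this:

```r
# mean() takes a numeric vector and returns a single number
mean(iris$Sepal.Length)
```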

## [1] 5.843333

…and the median by using median() (who would’ve thought!)
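Again using Sepal.Length:

```r
# median() works the same way as mean()
median(iris$Sepal.Length)
```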

## [1] 5.8

3.1 Doing it the ‘tidy’ way

One of the cool things about R is that you can install extra packages to provide additional functionality (there are over 10,000 of them!). While most of them are for very specific types of analysis, the tidyverse set of packages allows us to structure our code in a more intuitive, user-friendly way. As with any new package, you first need to install it:
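The usual way to install from within R is shown below (this needs an internet connection and can take a few minutes):

```r
# Downloads and installs the tidyverse packages from CRAN; only needed once per machine
install.packages("tidyverse")
```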

You only need to install a package once, but you have to load it in using library() each time you start a new R session:
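Loading looks like this:

```r
# Makes the tidyverse functions (select(), mutate(), ggplot() etc.) available in this session
library(tidyverse)
```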

Now we’re good to go. A crucial part of the tidyverse is using the %>% operator - also known as the ‘pipe’ - to ‘chain’ commands together. Essentially, it takes whatever comes before %>% and passes it as the first input to whatever comes after.

In the example below, we take the iris dataframe and input it into the select() command, which returns the same dataframe but with only the specified columns (in this case Petal.Width and Species). Only the first ten rows are shown here:
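A sketch of that pipeline follows. The final head(10) step is an assumption, added to limit the printout to ten rows:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  select(Petal.Width, Species) %>%  # keep only these two columns
  head(10)                          # assumption: show just the first ten rows
```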

##    Petal.Width Species
## 1          0.2  setosa
## 2          0.2  setosa
## 3          0.2  setosa
## 4          0.2  setosa
## 5          0.2  setosa
## 6          0.4  setosa
## 7          0.3  setosa
## 8          0.2  setosa
## 9          0.2  setosa
## 10         0.1  setosa

We can also ‘pull’ out a column as a vector, using pull(). Note that this is the equivalent of running iris$Petal.Width:
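For example:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

# pull() returns the column as a plain vector rather than a one-column dataframe
iris %>%
  pull(Petal.Width)
```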

##   [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4
##  [18] 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2
##  [35] 0.2 0.2 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4
##  [52] 1.5 1.5 1.3 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 1.0
##  [69] 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1 1.0 1.2 1.6 1.5
##  [86] 1.6 1.5 1.3 1.3 1.3 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9
## [103] 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8 2.2 2.3
## [120] 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4 2.3
## [137] 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3 2.5 2.3 1.9 2.0 2.3 1.8

We use mutate() to either edit an existing column, or add a new column (if the specified column name already exists it’ll automatically replace that existing column). In the following example, we ‘mutate’ the iris dataframe to add a new column called Combined.Width - the values in this new column are simply the values of Sepal.Width added to the values of Petal.Width:
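A minimal sketch of that step. Saving the result back to iris with <- is an assumption here; without it, the new column would only be printed, not kept:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

# Add Combined.Width and store the updated dataframe back under the same name
iris <- iris %>%
  mutate(Combined.Width = Sepal.Width + Petal.Width)

head(iris)
```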

What if we’re only interested in specific rows of data? For example, just the virginica species of plants. We can use filter() to output a subset of our data and save it to a new dataframe called iris.subset (you can name it whatever you want, but it’s always good to use sensible names). Note the use of the double equals sign == when performing an “equal to” comparison - this is important, because a single = means something different in R!
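Put together, the filtering step could look like this:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

# Keep only the rows where Species is exactly "virginica"
iris.subset <- iris %>%
  filter(Species == "virginica")

nrow(iris.subset)
```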

We can now calculate the mean of the Sepal.Length column of this new dataset, and then repeat the process for the other types of flower listed in the Species column (namely, setosa and versicolor):
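One way to sketch this by hand. The name iris.setosa is just an illustrative choice:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

# Rebuild the virginica subset so this snippet is self-contained
iris.subset <- iris %>% filter(Species == "virginica")
mean(iris.subset$Sepal.Length)

# Repeat for another species by changing the value passed to filter()
iris.setosa <- iris %>% filter(Species == "setosa")
mean(iris.setosa$Sepal.Length)
```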

But there’s an easier way! With the tidyverse, we can use a combination of group_by() and summarise() to first group our data by the values of a certain column, then perform summary statistics on each sub-group:
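A sketch of the grouped summary, using Petal.Length to match the table below:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  group_by(Species) %>%            # split the rows into one group per species
  summarise(mean(Petal.Length))    # calculate one mean per group
```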

## # A tibble: 3 x 2
##   Species    `mean(Petal.Length)`
##   <fct>                     <dbl>
## 1 setosa                     1.46
## 2 versicolor                 4.26
## 3 virginica                  5.55

What’s cool is that we can add other bits of information to the summary table, such as sd() for the standard deviation, and length() for the number of rows of data in each sub-group:
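Extra statistics are simply added as further arguments to summarise(), separated by commas:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  group_by(Species) %>%
  summarise(mean(Petal.Length),     # mean per species
            sd(Petal.Length),       # standard deviation per species
            length(Petal.Length))   # number of rows per species
```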

## # A tibble: 3 x 4
##   Species    `mean(Petal.Length)` `sd(Petal.Length)` `length(Petal.Length)`
##   <fct>                     <dbl>              <dbl>                  <int>
## 1 setosa                     1.46              0.174                     50
## 2 versicolor                 4.26              0.470                     50
## 3 virginica                  5.55              0.552                     50

4 Data visualisation

It’s really easy to make plots in R. For example, to make a scatterplot of two continuous variables:
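A minimal base R sketch. The choice of Sepal.Length and Petal.Length is only an assumption for illustration, and any two numeric columns would work:

```r
# First argument goes on the x-axis, second on the y-axis
plot(iris$Sepal.Length, iris$Petal.Length)
```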

To make a histogram of a single continuous variable:
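For instance, with Sepal.Length as an illustrative choice of column:

```r
# hist() picks sensible bin widths automatically
hist(iris$Sepal.Length)
```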

To make a boxplot of a single continuous variable:
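Again using Sepal.Length purely as an example column:

```r
# A single numeric vector gives a single box
boxplot(iris$Sepal.Length)
```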

To make a boxplot comparing the distribution of two continuous variables:
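Passing two vectors draws the boxes side by side. The columns and the labels given to names are illustrative assumptions:

```r
# Each vector gets its own box; names sets the labels under the boxes
boxplot(iris$Sepal.Length, iris$Petal.Length,
        names = c("Sepal.Length", "Petal.Length"))
```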

To make a boxplot comparing the distribution of a single continuous variable by the levels of a categorical variable:
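This version uses R’s formula syntax, which reads as “Sepal.Length by Species”. The column choice is again just an example:

```r
# numeric column ~ grouping column, with data telling R where to find them
boxplot(Sepal.Length ~ Species, data = iris)
```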

4.1 Using ggplot

So far so good, but can we do better? ggplot (the package itself is called ggplot2) is essentially a package for making pretty plots, and it’s a very powerful tool. It comes as part of the tidyverse set of packages that we installed and loaded earlier, so we can start using it straight away.

All ggplots start with a call to ggplot(), inside of which we can specify what columns to plot for the x and y axes:
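A minimal sketch of a ggplot scatterplot. The columns mapped to x and y are an assumption for illustration:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +  # aes() says which columns go on which axis
  geom_point()                                       # draw one point per row
```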

Note that you use %>% to ‘pipe’ the dataframe into the first line of the ggplot command, but after that the ggplot code is added together using +.

Now the plot above isn’t particularly useful, since it doesn’t distinguish between the different species of plants in the dataset. Let’s add two new arguments, colour and shape, which we can set based on the value in the Species column:
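Both arguments go inside aes(), alongside x and y. As before, the axis columns are an illustrative choice:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length,
             colour = Species, shape = Species)) +  # one colour and one point shape per species
  geom_point()
```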

This is better, but it’s still quite simple. A fundamental part of ggplot is the idea of layers. For example, we can add a geom_smooth() command to the code above, which fits a smoothed trend line (with a shaded confidence band) to each sub-group of the data:
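Layers are stacked with +, so the extra line simply goes on the end. The axis columns remain an illustrative choice:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length,
             colour = Species, shape = Species)) +
  geom_point() +    # layer 1: the points
  geom_smooth()     # layer 2: a smoothed line per species, drawn on top
```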

Let’s add a few more lines: xlab() and ylab() allow you to change the default axis labels, and theme_bw() changes the ggplot theme from the default grey background (you can find a list of themes here).
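A sketch with those extra lines added. The label text is made up for illustration (the iris measurements are in centimetres):

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length,
             colour = Species, shape = Species)) +
  geom_point() +
  geom_smooth() +
  xlab("Sepal length (cm)") +   # replaces the default x-axis label
  ylab("Petal length (cm)") +   # replaces the default y-axis label
  theme_bw()                    # white background with a thin border
```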

So far we’ve used geom_point() and geom_smooth(), but the list doesn’t end there. The basic syntax of a ggplot is always the same (i.e. specifying what you want on the axes, what to colour-code by etc.), but we can apply a different type of geom layer, such as geom_boxplot():
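For a boxplot, the categorical column goes on the x-axis. Sepal.Length on the y-axis is an illustrative choice:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length, colour = Species)) +
  geom_boxplot() +   # one box per species
  theme_bw()
```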

Or geom_histogram(), where you only need to specify what goes on the x-axis:
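A minimal sketch. The column and the bins = 20 setting are assumptions, and ggplot uses 30 bins if you leave that argument out:

```r
library(tidyverse)  # loaded here so the snippet runs on its own

iris %>%
  ggplot(aes(x = Sepal.Length)) +   # no y needed: the counts are calculated for you
  geom_histogram(bins = 20) +
  theme_bw()
```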

There are lots of others, such as geom_line(), geom_path(), geom_polygon(), geom_label(), geom_text() etc. but they aren’t appropriate to use here (if you’re interested, there are some useful guides available online, such as this one).