R, RStudio, and the tidyverse

Introducing R and RStudio

In today’s class we will work with R, a very powerful tool designed by statisticians for data analysis. Described on its website as “a free software environment for statistical computing and graphics,” R is a programming language that opens a world of possibilities for making graphics and for analyzing and processing data. Indeed, just about anything you may want to do with data can be done with R, from web scraping to making interactive graphics.

RStudio is an “integrated development environment,” or IDE, for R that provides a user-friendly interface.

Launch RStudio, and the screen should look like this:

The main panel to the left is the R Console. Type valid R code here, hit return, and it will be run. See what happens if you run:

print("Hello World!")

The data we will use

Download the data for this class from here, unzip the folder and place it on your desktop. We will use this data over the next two weeks. It contains the following files:

Some words of caution, before we start

The US is currently in the grip of an epidemic of opioid abuse and addiction. Although widespread medical prescription of opioids helped drive addiction, a majority of overdoses now occur through the consumption of drugs purchased illegally.

Opioids have important medical uses, and just because a doctor prescribes large amounts of the drugs doesn’t necessarily mean they are practising irresponsibly. Turning any of the analyses in the next two classes into stories would require a lot of additional reporting, beyond the data work.

As ProPublica explained, in the methods for its stories based on Medicare Part D prescription data:

The data could not tell us everything. We interviewed many high-volume prescribers to better understand their patients and their practices. Some told us their numbers were high because they were credited with prescriptions by others working in the same practice. In addition, providers who primarily work in long-term care facilities or busy clinics with many patients naturally may write more prescriptions.

Reproducibility: Save your scripts

Data journalism should ideally be fully documented and reproducible. R makes this easy, as every operation performed can be saved in a script, and repeated by running that script. Click on the icon at top left and select R Script. A new panel should now open:

Any code we type in here can be run in the console. Hitting Run will run the line of code on which the cursor is sitting. To run multiple lines of code, highlight them and click Run.

Click on the save/disk icon in the script panel and save the blank script to the folder on your desktop containing the data for this week, calling it week5.R.

Set your working directory

Now we can set the working directory to this folder by selecting from the top menu Session>Set Working Directory>To Source File Location. (Doing so means we can load the files in this directory without having to refer to the full path for their location, and anything we save will be written to this folder.)

Notice how this code appears in the console:

setwd("~/Desktop/week5")

Save your data

The panel at top right has three tabs, the first showing the Environment, or all of the “objects” loaded into memory for this R session. Save this as well, and you won’t have to load and process all of the data again if you return to a project later.

Click on the save/disk icon in the Environment panel to save the file as week5.RData. The following code should appear in the Console:

save.image("~/Desktop/week5/week5.RData")

Copy this code into your script, placing it at the end, with a comment explaining what it does:

# save session data
save.image("~/Desktop//week5/week5.RData")

Now if you run your entire script, the last action will always be to save the data in your environment.

Comment your code

Anything that appears on a line after # will be treated as a comment, and will be ignored when the code is run. You can use this to explain what the code does. Get into the habit of commenting your code: Don’t trust yourself to remember!

Some R code basics

For example, is.na is a function, which tests whether values are missing. Functions are followed by parentheses, and act on the data/code inside the parentheses.

Important: Object and variable names in R should not contain spaces.
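As a minimal example of these basics (the object name test_value is just for illustration, not part of the class data), try running the following in the console:

# create an object called test_value containing a missing value
# note: the assignment operator <- and no spaces in the object name
test_value <- NA

# is.na tests whether values are missing; this returns TRUE
is.na(test_value)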

Introducing R packages and the tidyverse

Much of the power of R comes from the thousands of “packages” written by its community of open source contributors. These are optimized for specific statistical, graphical or data-processing tasks. To see what packages are available in the basic distribution of R, select the Packages tab in the panel at bottom right. To find packages for particular tasks, try searching Google using appropriate keywords and the phrase “R package.”

Our goal for today’s class is to get used to processing and analyzing data using a powerful series of R packages known as the tidyverse.

The tidyverse was pioneered by Hadley Wickham, chief scientist at RStudio, but now has many contributors.

Today, we will start by using the following tidyverse packages: readr, to read and write CSV files; dplyr, to process and manipulate data; and lubridate, to work with dates.

To install a package, click on the Install icon in the Packages tab, type its name into the dialog box, and make sure that Install dependencies is checked, as some packages will only run correctly if other packages are also installed. The tidyverse packages can be installed in one go. Click Install and all of the required packages should install:

Notice that the following code appears in the console:

install.packages("tidyverse")

So you can also install packages with code in this format, without using the point-and-click interface.
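For example, if you ever want to install individual packages rather than the whole tidyverse, you can pass their names to install.packages in the same way (the three named here are the ones we load below):

# install individual packages by name
install.packages(c("readr", "dplyr", "lubridate"))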

Each time you start R, it’s a good idea to click on Update in the Packages panel to update all your installed packages to the latest versions.

Installing a package makes it available to you, but to use it in any R session you need to load it. You can do this by checking its box in the Packages panel. However, we will enter the following code into our script, then highlight these lines of code and run them:

# load packages to read and write csv files, process data, and work with dates
library(readr)
library(dplyr)
library(lubridate)

At this point, and at regular intervals, save your script, by clicking the save/disk icon in the script panel, or using the ⌘-S keyboard shortcut.

Load and view data

Load data

You can load data into the current R session by selecting Import Dataset>From Text File... in the Environment tab.

However, we will use the read_csv function from the readr package. Copy the following code into your script and Run:

# load ca medical board disciplinary actions data
ca_discipline <- read_csv("ca_discipline.csv")

Notice that the Environment now contains an object of type tbl_df, a variant of the standard R object for holding tables of data, known as a data frame:
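If you’d like to confirm this from the console (this step is optional), you can check the class of the object; the output should include tbl_df as well as data.frame:

# check the class of the imported object
class(ca_discipline)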

Examine the data

We can View data at any time by clicking on its table icon in the Environment tab (in the Grid view). The following code has the same effect:

View(ca_discipline)

The glimpse function from dplyr will tell you more about the variables in your data, including their data type. Copy this code into your script and Run:

# view structure of data
glimpse(ca_discipline)

This should give the following output in the R Console:

Observations: 7,561
Variables: 10
$ alert_date  <date> 2008-04-18, 2008-04-21, 2008-04-23, 2008-04-28, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...
$ last_name   <chr> "Boyajian", "Cragen", "Chow", "Gravich", "Kabacy", "Aboulhosn", "Harron", "Fitzpatrick", "Adrian", "M...
$ first_name  <chr> "John", "Richard", "Hubert", "Anna", "George", "Kamal", "Raymond", "Christian", "Adrian", "Pamela", "...
$ middle_name <chr> "Arthur", "Darin", "Wing", NA, "E.", "Fouad", "A.", "John", NA, "J.", "Quoc", "M.", "Elisabet", NA, "...
$ name_suffix <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ city        <chr> "Boise", "Temecula", "San Gabriel", "Los Angeles", "Lacey", "Yakima", "Bridgeport", "Las Vegas", "Las...
$ state       <chr> "ID", "CA", "CA", "CA", "WA", "WA", "WV", "NM", "NV", "CA", "CA", "CA", "CA", "CA", "CA", "TN", "CA",...
$ license     <chr> "A25855", "A54872", "G45435", "A40805", "G13766", "CFE40080", "G8415", "G47520", "AFE56237", "G85601"...
$ action_type <chr> "Surrendered", "Surrendered", "Superior Court Order/Restrictions", "Superior Court Order/Restrictions...
$ action_date <date> 2008-04-18, 2008-04-21, 2008-03-10, 2008-04-25, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...

chr means “character,” or a string of text (which can also be treated as a categorical variable); date means a date. While we don’t have these data types here, int means an integer, or whole number; dbl means a number that may include decimal fractions; and POSIXct means a full date and timestamp.

If you run into any trouble importing data with readr, you may need to specify the data types for some columns — in particular for date and time. This link explains how to set data types for individual variables when importing data with readr.
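As a rough sketch of what that looks like, assuming the same file and column names as above, you can pass a col_types argument to read_csv; cols, col_date, and the format string used here are all part of readr, and any columns you don’t list are still guessed automatically:

# re-import the data, explicitly setting the two date columns
ca_discipline <- read_csv("ca_discipline.csv",
                          col_types = cols(
                            alert_date = col_date(format = "%Y-%m-%d"),
                            action_date = col_date(format = "%Y-%m-%d")
                          ))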

To specify an individual column use the name of the data frame and the column name, separated by $. Type this into your script and run:

# print values for alert_date in the ca_discipline data
print(ca_discipline$alert_date)

The output will be the first 1,000 values for that variable.

If you need to change the data type for any variable, use functions such as as.character (convert to text), as.numeric (numbers that may include decimal fractions), as.integer (whole numbers), as.factor (categorical variables), and as.Date (dates).

So this code will convert the alert_date values to text:

# convert alert_date to text
ca_discipline$alert_date <- as.character(ca_discipline$alert_date)
glimpse(ca_discipline)

Notice that the data type for alert_date has now changed:

Observations: 7,561
Variables: 10
$ alert_date  <chr> "2008-04-18", "2008-04-21", "2008-04-23", "2008-04-28", "2008-05-15", "2008-05-15", "2008-06-18", "20...
$ last_name   <chr> "Boyajian", "Cragen", "Chow", "Gravich", "Kabacy", "Aboulhosn", "Harron", "Fitzpatrick", "Adrian", "M...
$ first_name  <chr> "John", "Richard", "Hubert", "Anna", "George", "Kamal", "Raymond", "Christian", "Adrian", "Pamela", "...
$ middle_name <chr> "Arthur", "Darin", "Wing", NA, "E.", "Fouad", "A.", "John", NA, "J.", "Quoc", "M.", "Elisabet", NA, "...
$ name_suffix <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ city        <chr> "Boise", "Temecula", "San Gabriel", "Los Angeles", "Lacey", "Yakima", "Bridgeport", "Las Vegas", "Las...
$ state       <chr> "ID", "CA", "CA", "CA", "WA", "WA", "WV", "NM", "NV", "CA", "CA", "CA", "CA", "CA", "CA", "TN", "CA",...
$ license     <chr> "A25855", "A54872", "G45435", "A40805", "G13766", "CFE40080", "G8415", "G47520", "AFE56237", "G85601"...
$ action_type <chr> "Surrendered", "Surrendered", "Superior Court Order/Restrictions", "Superior Court Order/Restrictions...
$ action_date <date> 2008-04-18, 2008-04-21, 2008-03-10, 2008-04-25, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...

Process and analyze data with dplyr

Now we will use dplyr to process the data, using the basic operations we discussed in week 1: sorting, filtering, grouping and aggregating, and joining data from different tables.

Here are some of the most useful functions in dplyr: select (choose which columns to include), filter (filter the data by given criteria), arrange (sort the data), group_by (group the data by a categorical variable), summarize (aggregate the data, often using functions such as mean, median, and n for counts), mutate (create new columns or modify existing ones), rename (rename columns), and bind_rows (append one data frame to another).

There are also various functions to join data, which we will explore next week.

These functions can be chained together using the “pipe” operator %>%, which makes the output of one line of code the input for the next. This allows you to run through a series of operations in a logical order. I find it helpful to think of %>% as meaning “then.”
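As a minimal illustration, using the data we loaded earlier, the two lines below do exactly the same thing; the pipe simply passes ca_discipline to glimpse as its first argument:

# these two lines are equivalent
glimpse(ca_discipline)
ca_discipline %>% glimpse()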

Now we will use dplyr to turn the alert_date variable back to dates:

# convert alert_date to date using dplyr
ca_discipline <- ca_discipline %>%
  mutate(alert_date = as.Date(alert_date))

This code copies the ca_discipline data frame, and then (%>%) uses dplyr’s mutate function to change the data type for the alert_date variable. Because the copied data frame has the same name, it overwrites the original version.

Filter and sort data

To get used to working with dplyr, we will now start filtering and sorting the data according to the doctors’ locations, and the types of disciplinary actions they faced.

First, let’s look at the types of disciplinary actions in the data:

# look at types of disciplinary actions
types <- ca_discipline %>%
  select(action_type) %>%
  unique()

This code first copies the ca_discipline data frame into a new object called types. Then (%>%) it selects the action_type variable only. Finally, it uses a function called unique to keep only the unique values in that variable, removing any duplicates.
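As an aside, dplyr has its own distinct function that does the same job as unique here; this sketch (with the hypothetical object name types_alt) should give an identical result:

# equivalent approach using dplyr's distinct function
types_alt <- ca_discipline %>%
  select(action_type) %>%
  distinct()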

The new types data frame has one column and 75 rows. If you View it, the first few rows should look like this:

Many of the values for action_type seem to be subtle variations of the same thing. If we were going to analyze all the types in detail, we might need to group some of them together, after speaking to an expert to understand what they all mean. But some of the action types are clear and unambiguous: Revoked is the medical board’s most severe sanction, which cancels a doctor’s license to practise. So let’s first filter the data to look at those actions:

# filter for license revocations only
revoked <- ca_discipline %>%
  filter(action_type == "Revoked")

This code copies the ca_discipline data into a new data frame called revoked and then (%>%) uses the filter function to include only actions in which a doctor’s license was revoked. Notice the use of == to test whether action_type is Revoked.

There should be 446 rows in the filtered data, and the first few should look like this:

Some of the doctors are not actually based in California. Doctors can be licensed to practise in more than one state, and the Medical Board of California typically issues its own sanction if a doctor is disciplined in their home state. So let’s now filter the data to look only at doctors with revoked licenses who were based in California, and sort them by city.

# filter for license revocations by doctors based in California, and sort by city
revoked_ca <- ca_discipline %>%
  filter(action_type == "Revoked"
         & state == "CA") %>%
  arrange(city)

Here, the filter combines two conditions with &. That means that both have to be met for the data to be included. Then arrange sorts the data, which for a text variable will be in alphabetical order. If you wanted to sort in reverse alphabetical order, the code would be arrange(desc(city)), as in the sketch below.
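Here is that reverse sort written out in full; the object name revoked_ca_desc is just for illustration, and the row count mentioned below is unaffected because only the sort order changes:

# filter for license revocations by doctors based in California, sorted by city in reverse alphabetical order
revoked_ca_desc <- ca_discipline %>%
  filter(action_type == "Revoked"
         & state == "CA") %>%
  arrange(desc(city))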

There should be 274 rows in the filtered data, and the first few should look like this:

This code will achieve the same result. You should be able to work out why:

# filter for license revocations by doctors based in California, and sort by city
revoked_ca <- revoked %>%
  filter(state == "CA") %>%
  arrange(city)

Now let’s filter for doctors based in Berkeley or Oakland who have had their licenses revoked:

# doctors in Berkeley or Oakland who have had their licenses revoked 
revoked_oak_berk <- ca_discipline %>%
  filter(action_type == "Revoked"
       & (city == "Oakland" | city == "Berkeley"))

There should be just two doctors:

This code uses | to look for doctors in either Oakland or Berkeley. That part of the filter function is wrapped in parentheses to ensure that it is carried out first.

See what happens if you remove those parentheses, and work out why the result has changed. We will discuss this in class.
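For convenience, here is that variation written out, with a hypothetical object name so it doesn’t overwrite your earlier result; run it and compare the number of rows you get:

# the same filter with the parentheses removed, for comparison
revoked_oak_berk_noparens <- ca_discipline %>%
  filter(action_type == "Revoked"
         & city == "Oakland" | city == "Berkeley")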

Append data using bind_rows

To demonstrate the bind_rows function, we will filter for doctors with revoked licenses in each of the two cities separately, and then append one data frame to the other.

# doctors in Berkeley who had their licenses revoked
revoked_berk <- ca_discipline %>%
  filter(action_type == "Revoked"
       & city == "Berkeley")

# doctors in Oakland who had their licenses revoked
revoked_oak <- ca_discipline %>%
  filter(action_type == "Revoked"
       & city == "Oakland")

# doctors in Berkeley or Oakland who have had their licenses revoked
revoked_oak_berk <- bind_rows(revoked_oak, revoked_berk)

In-class practice with filtering and sorting

Write data to a CSV file

The readr package can also be used to write data from your environment into a CSV file:

# write data to CSV file
write_csv(revoked_oak_berk, "revoked_oak_berk.csv", na = "")

The code na = "" ensures that null values in the data are written as blank cells; otherwise they would contain the letters NA.

Group and summarize data

Next we will group and summarize data by counting disciplinary actions by year and by month. But before doing that, we need to use the lubridate package to extract the year and month from action_date.

# extract year and month from action_date
ca_discipline <- ca_discipline %>%
  mutate(year = year(action_date),
         month = month(action_date))

Now we have two extra columns in the data, giving the year and the month as a number:

The year and month functions are from the lubridate package.

Previously we used dplyr’s mutate function to modify an existing variable. Here we used it to create new variables. You can create or modify multiple variables in the same mutate function, separating each one with a comma. Notice the use of = to make a variable equal to the output of code, within the mutate function.

Now we can calculate the number of license revocations for doctors based in California by year:

# license revocations for doctors based in California, by year
revoked_ca_year <- ca_discipline %>%
  filter(action_type == "Revoked" 
         & state == "CA") %>%
  group_by(year) %>%
  summarize(revocations = n())

This should be the result:

The code first filters the data, as before, then groups by the new variable year using group_by, then summarizes by counting the number of records for each year using summarize. The last function creates a new variable called revocations from n(), which is a count of the rows in the data for each year.
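As an aside, dplyr also provides a count function that rolls the group_by and summarize(n()) steps into one; a sketch of the equivalent query (with the hypothetical object name revoked_ca_year_alt, renaming count’s default n column) looks like this:

# equivalent query using dplyr's count function
revoked_ca_year_alt <- ca_discipline %>%
  filter(action_type == "Revoked"
         & state == "CA") %>%
  count(year) %>%
  rename(revocations = n)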

Looking at this result and the raw ca_discipline data, we can see that we only have partial data for 2008. So if we want to count the number of license revocations by month across all the years, we should first filter out the data for 2008, which would otherwise skew the result.

# license revocations for doctors based in California, by month
revoked_ca_month <- ca_discipline %>%
  filter(action_type == "Revoked" 
         & state == "CA"
         & year >= 2009) %>%
  group_by(month) %>%
  summarize(revocations = n())

This should be the result:

Notice how we used >= to filter for data where the year was 2009 or greater. The following code will achieve the same result:

# license revocations for doctors based in California, by month
revoked_ca_month <- ca_discipline %>%
  filter(action_type == "Revoked" 
         & state == "CA"
         & year != 2008) %>%
  group_by(month) %>%
  summarize(revocations = n())

We can group and summarize by more than one variable at a time. The following code counts the number of actions of all types, by month and year. Again, we will first filter out the incomplete data for 2008.

# disciplinary actions for doctors in California by year and month, from 2009 to 2017
actions_year_month <- ca_discipline %>%
  filter(state == "CA"
         & year >= 2009) %>%
  group_by(year, month) %>%
  summarize(actions = n()) %>%
  arrange(year, month)

The first few rows should look like this:

Notice that both the group_by and arrange functions group and sort the data, respectively, by two variables, separated by commas.

Finally, let’s calculate the mean and median number of disciplinary actions issued per month, over the period 2009 to 2017:

# mean and median actions per month, 2009 to 2017
summary_year_month <- actions_year_month %>%
  ungroup() %>%
  summarize(mean = mean(actions),
            median = median(actions))

This should be the result:

In this code we calculated two summary statistics, mean and median, in the same summarize function, separating each calculation with a comma. First, however, we had to ungroup the grouped actions_year_month data frame: after the earlier group_by and summarize, it is still grouped by year, so without ungroup the summary would be calculated separately for each year rather than across the whole period.
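To see why the ungroup matters, try this sketch of the same calculation without it (the object name summary_by_year is hypothetical); because actions_year_month is still grouped by year, you should get one mean and median per year rather than a single overall summary:

# without ungroup(), the summary is calculated separately for each year
summary_by_year <- actions_year_month %>%
  summarize(mean = mean(actions),
            median = median(actions))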

Think about the similarities and differences between grouping and summarizing data using dplyr and the spreadsheet pivot tables you made in week 3. We will discuss this in class.

In-class practice with filtering, grouping, and summarizing

Closing down properly

Whenever you exit R, get into the habit of saving your script and the data in your environment. Then close your script and any data frames open in View. When you close R, select Don't Save at this prompt:

These actions ensure that RStudio will open cleanly, without remnants from a previous session, when you next launch it.

Exercises/assignment

These exercises are designed for you to practice writing code to load, filter, sort, group, and summarize data.

First open the .RData file from class by clicking the open/load icon in the Environment panel and navigating to the folder with the data. If you do this, you won’t need to reload the ca_discipline data or recreate the year variable, which you will need for the exercises below.

However, if you’d like to practice these things, you are also welcome to start from scratch and include in your script the code that loads the ca_discipline data and creates the year variable.

Now open a new R script, save it into the same folder as your data with the name week5_assignment.R, and set your working directory to this location, as before.

File your R script (week5_assignment.R) with the code to complete these exercises, and the saved CSV file (doctors_all_actions.csv), via bCourses by Weds Feb 21 at 8.00pm.

Further reading

Introduction to dplyr

RStudio Data Wrangling Cheat Sheet
Also introduces the tidyr package, which can manage wide-to-long transformations, and text-to-columns splits, among other data manipulations.

Stack Overflow
For any work involving code, this question-and-answer site is a great resource when you get stuck, to see how others have solved similar problems. Search the site, or browse R questions.