The goal of today's class is to show you some more advanced techniques in R, beyond using dplyr for our basic operations with data:
Sort: Largest to smallest, oldest to newest, alphabetical etc.
Filter: Select a defined subset of the data.
Summarize/Aggregate: Deriving one value from a series of other values to produce a summary statistic. Examples include: count, sum, mean, median, maximum, minimum etc. Often you’ll group data into categories first, and then aggregate by group.
Join: Merging entries from two or more datasets based on common field(s), e.g. unique ID number, last name and first name.
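As a quick refresher, here is roughly how those four operations look in dplyr. This is a new illustration, not part of the class data: the doctors and licenses data frames below are made up.
library(dplyr)
# two tiny made-up data frames, for illustration only
doctors <- data_frame(last_name = c("Smith", "Jones", "Lee"),
                      state = c("CA", "CA", "NY"),
                      actions = c(3, 1, 2))
licenses <- data_frame(last_name = c("Smith", "Jones", "Lee"),
                       license = c("A123", "B456", "C789"))
# sort: most disciplinary actions first
doctors %>%
  arrange(desc(actions))
# filter: California providers only
doctors %>%
  filter(state == "CA")
# summarize/aggregate: group by state, then count providers and sum actions
doctors %>%
  group_by(state) %>%
  summarize(providers = n(),
            total_actions = sum(actions))
# join: merge the two data frames on their common field
doctors %>%
  inner_join(licenses, by = "last_name")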
The examples below have been chosen to reflect some of the problems encountered in students' project work. I don't suggest trying to follow along today; instead, focus on understanding what I show you. You can work through the examples at your leisure, or modify them to apply to your project. To run the code, you will need to install any required packages that are not already listed in your RStudio Packages tab.
The message from these examples is that R is an incredibly powerful tool, capable of performing just about any analysis or data processing task you can imagine. Once you get confident in R, you'll get used to looking for code examples and tutorials to solve the problems you encounter when working with data. That's how I learned everything outlined below.
If you’re not sure what you are doing, however, seek help! The NICAR listserv, the main online discussion forum for data journalists, is a great place to start.
Some of the data will be loaded using web Application Programming Interfaces that are set up to provide data on demand online.
The rest of the data is here; unzip the folder and place it on your desktop. It contains the following folders and files:
ca_discipline
Folder containing multiple CSV files, one for each year, detailing disciplinary alerts and actions issued by the Medical Board of California from 2008 to 2017. Processed from downloads available here. Each file contains the following variables:
alert_date
Date alert issued.
last_name
Last name of doctor/health care provider.
first_name
First name of doctor/health care provider.
middle_name
Middle/other names.
name_suffix
Name suffix (Jr., II etc).
city
City of practice location.
state
State of practice location.
license
California medical license number.
action_type
Type of action.
action_date
Date of action.
ca_tracts
Shapefile of Census tracts in California, from the US Census Bureau.
oakland_eei.csv
Data on educational attainment of children from low-income families (those qualifying for free or reduced price lunch) for schools in Oakland in 2015, originally compiled by Education Cities for the Education Equality Index, and processed/simplified for this class (there is more data available). Contains the following variables:
school_name
Name of the school.
eei_score
School performance, based on tests for all subjects and grades, on a 0-100 scale.
charter
Is the school a charter school? N or Y.
pc_black
pc_white
pc_hispanic
Percent of students from each of these racial/ethnic groups.
enroll_100
Total number of students at the school, in multiples of 100.
pc_free_lunch
Percentage of students on free or reduced price lunch.
As usual, open RStudio, create a new script, save it into the folder with the data for this class, then set your working directory to the same folder.
Often data comes in a series of files, which you need to read into R and combine into a single data frame. This can be achieved using a for loop, a useful programming trick that allows you to iterate through a list of items, performing the same series of actions on each item. To obtain a single data frame with all the Medical Board of California disciplinary actions from 2008 to 2017, we need to read in each file and append it to the same data frame.
This is how to do it:
# required packages
library(readr)
library(dplyr)
# list files
files <- list.files("ca_discipline")
# create empty data frame
ca_discipline <- data_frame()
# loop to read in files, and append them to that data frame
for (f in files) {
  print(f)
  tmp <- read_csv(paste0("ca_discipline/", f), col_types = cols(
    .default = col_character(),
    alert_date = col_datetime(),
    action_date = col_datetime()))
  ca_discipline <- bind_rows(ca_discipline, tmp)
}
# cleanup
rm(tmp,files,f)
For loops in R have this general format:
for (item in list) {
# Do stuff involving each item
}
item can have any name you like; I usually give it the first letter of the name of the list I'm iterating through.
This example uses the function list.files to list all the files in the folder ca_discipline, then makes an empty data frame called ca_discipline using the dplyr function data_frame.
The loop prints each file name, then uses the read_csv function from readr to load each file into a temporary data frame called tmp, which gets overwritten in each iteration of the loop. The final step in each iteration of the loop is to append tmp to the ca_discipline data frame using the dplyr function bind_rows.
The read_csv function is a little more complicated than we've seen before. Because the files are in the ca_discipline folder, the location of each file must be built as "ca_discipline/" plus the file name, using the paste0 function, which concatenates strings of text.
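For example, with a made-up file name, just to show what paste0 does:
# paste0() joins strings with no separator
paste0("ca_discipline/", "alerts_2008.csv")
# [1] "ca_discipline/alerts_2008.csv"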
The read_csv call also defines the default data type for the columns in the data, and gives the data type for any exceptions. Explicitly declaring the data types in a loop like this is a good idea, because any inconsistency in the data types guessed for each file may otherwise cause the bind_rows function to fail. (Other data types include col_integer() for whole numbers and col_double() for numbers that may include decimal fractions.)
The rm function removes objects from your environment.
Often you need to make new columns in your data, particularly when simplifying a categorical variable with many categories into something more manageable. This can be done within a dplyr mutate using the function case_when, which applies conditional statements.
Continuing from the example above, this code simplifies the action_type variable into a new variable type_edit with just three possibilities: Revoked, Surrendered, or Other.
ca_discipline <- ca_discipline %>%
  mutate(type_edit = case_when(action_type == "Revoked" ~ "Revoked",
                               action_type == "Surrendered" ~ "Surrendered",
                               !grepl("Revoked|Surrendered", action_type) ~ "Other"))
View the data to see the new column:
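As a quick check (my addition, not part of the original walkthrough), you can also count the actions in each new category:
# count actions in each new category
ca_discipline %>%
  group_by(type_edit) %>%
  summarize(count = n())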
R can process and analyze geographic data, and then write the results to PostgreSQL for mapping in QGIS and further analysis with PostGIS. To see this in action, we’ll first work with some Census data, which can be loaded using US Census APIs with the tidycensus package, and Census shapefiles, which can be loaded using the tigris package. In addition to dplyr, this example uses rgdal, which reads and writes geographic data and handles its projection; RPostgreSQL, which connects to a PostgreSQL database; and rpostgis, to write geographic data into a PostGIS-enabled database table.
First, load all of the packages required for this example:
# load required packages
library(dplyr)
library(tidycensus)
library(tigris)
library(rgdal)
library(RPostgreSQL)
library(rpostgis)
tidycensus can pull data from the Census itself, run every 10 years, or from the American Community Survey, an annual snapshot of the country that extrapolates from samples, producing estimates for variables (median household income, population for various races and ethnicities, and so on) over the last year, the last 3 years, or the last 5 years, for geographies of various types. The smaller the geographic area and the shorter the time period, the wider the margin of error. So when analyzing data from small geographic areas, such as Census tracts, it's generally wise to use the 5-year estimates.
To pull data from a US Census API, you need a Census API key, which you can request from here.
This code creates a data frame showing all of the variables measured in the 2016 5-year American Community Survey:
# load Census API key
census_api_key("your key goes here", install = TRUE)
# get variables acs_2016 5 year estimates
acs_2016_variables <- load_variables(2016, "acs5", cache = TRUE)
View the data frame and use the search box to find the variables you need. Search for “median household income” and you should see the following:
You will need to read the text under concept to see which variable you want. For the median household income across all households, irrespective of their size, the variable is B19019_001E. The E simply means "estimate", and should be dropped when loading the variable using tidycensus.
The following code gets the median household income for all census tracts in California, as recorded in the 5-year estimates of the 2016 American Community Survey:
# use tidycensus to get data on median household income for all Census tracts in California
ca_tracts <- get_acs(state = "CA",
                     year = 2016,
                     survey = "acs5",
                     geography = "tract",
                     variables = "B19013_001",
                     output = "tidy",
                     geometry = FALSE)
This should return a data frame with 8,057 rows, giving the estimate and margin of error (moe) for each tract. Each tract also has a unique code, GEOID:
You can change the year and survey: use acs3 for 3-year estimates, for example. Available geographies include county and zcta, for zip code tabulation areas; but don't specify a state when using zcta, because zip codes can cross state lines. (get_decennial is the equivalent function for pulling data from the full Census.)
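For example, here is a variation on the call above (not part of the walkthrough) that pulls the same variable for every California county instead of every tract:
# median household income for all California counties, 2016 5-year ACS
ca_counties <- get_acs(state = "CA",
                       year = 2016,
                       survey = "acs5",
                       geography = "county",
                       variables = "B19013_001",
                       output = "tidy",
                       geometry = FALSE)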
Now we can use tigris to load a shapefile for California Census tracts, and produce a rough plot:
# Use tigris to load shapefile for Census tracts in California
ca_tracts_map <- tracts(state = "CA", year = 2016)
# plot
plot(ca_tracts_map)
The plot should look like this:
Often you won’t be able to load shapefiles from an API. But if you have downloaded a shapefile to your computer, you can load it like this:
# read in a shapefile from your computer
ca_tracts_map <- readOGR("ca_tracts_map", "ca_tracts_map")
This code uses the readOGR function from rgdal; the first mention of ca_tracts_map refers to the folder containing the shapefile, the second to the root name of all the files within.
Whether loaded using tigris or from a saved shapefile, the data will be a SpatialPolygonsDataFrame. Inspect the data connected to the map with this code:
# View the data associated with the map
View(ca_tracts_map@data)
@ specifies a particular attribute of a spatial data frame, here its table of associated data.
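The same @ notation works for the other slots of a spatial data frame; for example, this shows the map's projection information:
# another slot: the map's Coordinate Reference System
ca_tracts_map@proj4string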
The data associated with the map also contains GEOID codes for each tract:
Now use dplyr to join the median household income data to the map, based on matching GEOID codes:
# join the median household income data to the map
ca_tracts_map@data <- inner_join(ca_tracts_map@data, ca_tracts, by = "GEOID")
# check the join has worked
View(ca_tracts_map@data)
Having joined the data you can export the shapefile to your computer like this:
# write out a shapefile
writeOGR(ca_tracts_map, "ca_tracts_income", "ca_tracts_income", driver = "ESRI Shapefile")
If you have your Postgres app running, you can also upload the spatial data frame straight to PostgreSQL/PostGIS like this:
# create new database, if required
system("createdb week14")
# set up connection to that database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "week14", host = "localhost")
# reproject the data frame to the standard EPSG:4326
ca_tracts_map <- spTransform(ca_tracts_map, CRS("+init=epsg:4326"))
# insert into a PostGIS-enabled table
pgInsert(con,
name = "ca_tracts_income",
data.obj = ca_tracts_map,
geom = "geom",
overwrite = TRUE,
df.mode = FALSE, partial.match = FALSE, new.id = NULL,
row.names = FALSE, upsert.using = NULL,
alter.names = FALSE, encoding = NULL, return.pgi = FALSE)
The system function, from base R, sends a command to your computer's Terminal, here to create a new database.
dbDriver and dbConnect are functions from RPostgreSQL, used to establish a connection to that database.
PostGIS normally expects tables to be in a standard Coordinate Reference System called EPSG:4326 (this is also the default in QGIS). So before uploading to PostgreSQL/PostGIS, use the spTransform function from rgdal to ensure your data is in this format.
Finally, use the pgInsert function from rpostgis to write the data into a PostGIS-enabled table; name defines the name of the new table, and data.obj is the name of the spatial data frame in your R environment. If you want to append data to an existing table, use overwrite = FALSE (with the code above, which uses overwrite = TRUE, any existing table with the same name would be overwritten).
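When you have finished writing to the database, it is good practice to close the connection; this small addition to the code above uses another RPostgreSQL function:
# close the connection to the database
dbDisconnect(con)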
You may find it much more convenient to load maps into PostgreSQL/PostGIS from R than by the methods you’ve used previously.
Now you can load your map into QGIS as we did previously.
This example uses another API, shows how to turn data with latitude and longitude coordinates into a spatial data frame, and introduces hexagonal binning to analyze the density of points in QGIS.
These are the required packages:
# load required packages
library(dplyr)
library(rgdal)
library(RPostgreSQL)
library(rpostgis)
library(jsonlite)
If continuing from the example above, you only need to load jsonlite, which can read data from JSON into an R data frame.
This page at DataSF contains incidents responded to by the San Francisco Police Department in 2018. The data is updated daily, with the most recently added incidents running two weeks behind the current date.
With frequently updated data like this, it’s a good idea to work with a script that can pull in the latest version of the data, rather than a downloaded snapshot. Then your analysis can be updated by simply running the entire script.
If you click on Export on the 2018 police incidents page, one of the options is to obtain the data using Socrata's SODA API:
(Socrata is a company used by many state and local governments to put their data online.)
Notice that the endpoint is https://data.sfgov.org/resource/956q-2t7k.json. The SODA API includes many options, but the code to load the entire dataset into R is simple:
# get San Francisco 2018 police incident data
incidents <- fromJSON("https://data.sfgov.org/resource/956q-2t7k.json?$limit=50000")
# check the data types for the data returned
glimpse(incidents)
If you don’t specify a limit, only the first 1,000 rows will be returned; use trial and error with the limit to make sure you get all the data.
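One simple check (my suggestion): if the number of rows returned equals the limit you set, you have probably hit the cap and should raise it.
# check how many rows were returned
nrow(incidents)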
This was the result of the glimpse on April 24, 2018:
Observations: 26,688
Variables: 13
$ date <chr> "2018-04-01T00:00:00", "2018-03-19T00:00:00", "2018-03-19T00:00:00", "2018-01...
$ address <chr> "0 Block of ROBINSON DR", "JOHNMUIR DR / LAKE MERCED BL", "800 Block of BRUNS...
$ resolution <chr> "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "ARREST, BOOK...
$ pddistrict <chr> "INGLESIDE", "TARAVAL", "INGLESIDE", "INGLESIDE", "INGLESIDE", "INGLESIDE", "...
$ incidntnum <chr> "180250054", "180208241", "186066502", "180018238", "180047271", "186033852",...
$ x <chr> "-122.42871668187439", "-122.4857362496041", "-122.45048298008807", "-122.420...
$ dayofweek <chr> "Sunday", "Monday", "Monday", "Sunday", "Thursday", "Friday", "Wednesday", "W...
$ y <chr> "37.70792190345863", "37.70815359983144", "37.70817029170297", "37.7083109744...
$ location <data.frame> c("37.70792190345863", "37.70815359983144", "37.70817029170297", "37.7...
$ time <chr> "00:01", "19:15", "00:01", "16:40", "15:15", "00:01", "13:06", "12:00", "08:4...
$ pdid <chr> "18025005405041", "18020824128150", "18606650206244", "18001823804134", "1800...
$ category <chr> "BURGLARY", "VANDALISM", "LARCENY/THEFT", "ASSAULT", "SUSPICIOUS OCC", "NON-C...
$ descript <chr> "BURGLARY OF RESIDENCE, FORCIBLE ENTRY", "MALICIOUS MISCHIEF, VANDALISM", "GR...
location is a data frame nested within this data frame, and every other variable is treated as text. So remove the location column and process the data with dplyr as follows:
# this ensures no loss of decimal places when converting text values for latitude and longitude
options(digits = 17)
# process the data
incidents <- incidents %>%
  select(-location) %>%
  mutate(date = as.Date(date),
         latitude = as.double(y),
         longitude = as.double(x),
         hour = as.integer(substr(time,1,2)))
When using select, putting a dash in front of a column name removes it from the data. So this code removes the nested data frame, converts the dates into true dates, and creates new numeric columns for latitude, longitude, and the hour of the day on the 24-hour clock. The substr function extracts part of a string of text, here the first two characters of the time variable.
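As a quick illustration of what the new hour column makes possible (not a step in the original analysis), you could count incidents by hour of the day:
# count incidents by hour of the day
incidents %>%
  group_by(hour) %>%
  summarize(count = n()) %>%
  arrange(hour)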
Theft from cars is a big issue in San Francisco, but we want to ensure that we include all crime categories and descriptions that might be relevant:
# find categories and descriptions for thefts
theft_types <- incidents %>%
filter(grepl("theft|larceny", category, ignore.case=TRUE)) %>%
select(category, descript) %>%
unique()
From the data returned by this code, it seems we need to filter the data to include petty or grand theft from locked or unlocked autos:
# filter grand or petty theft from locked or unlocked auto
car_breakins <- incidents %>%
filter(grepl("grand|petty", descript, ignore.case = TRUE)
& grepl("unlocked auto|locked auto", descript, ignore.case = TRUE))
# check the crime descriptions resulting from that filter
car_breakin_types <- car_breakins %>%
select(descript) %>%
unique()
View these two data frames to check that the data has filtered correctly.
The temptation is to load the data straight into PostgreSQL for mapping, but let’s first take a look at the addresses that show up most often in the data.
# count car breakins by address
car_breakin_locations <- car_breakins %>%
group_by(address) %>%
summarize(count = n()) %>%
arrange(desc(count))
Here are the top ten addresses:
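To print them in R (assuming the car_breakin_locations data frame created above):
# look at the ten addresses with the most break-ins
head(car_breakin_locations, 10)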
Wow, there seems to be an epidemic of car break-ins on the 800 block of Bryant Street! What could be going on? Let’s take a look at what’s there:
The San Francisco Hall of Justice is at 850 Bryant. Do we think that’s a real hotspot, or is it more likely that incidents without another location are being mapped to the courthouse by default? You’d want to speak to the person at SFPD who manages this data to find out, but I’m going to exclude this suspicious data before loading to PostgreSQL/PostGIS.
# remove incidents on 800 block of Bryant
car_breakins <- car_breakins %>%
filter(!grepl("800 block of bryant", address, ignore.case = TRUE))
Now write the data to PostgreSQL/PostGIS:
# you may need this line if your data frame is a tbl_df
car_breakins <- as.data.frame(car_breakins)
# make a data frame with just the latitude and longitude coordinates
xy <- car_breakins %>%
select(longitude,latitude)
# convert to a spatial points data frame
car_breakins <- SpatialPointsDataFrame(coords = xy,
data = car_breakins,
proj4string = CRS("+init=epsg:4326"))
# set up local connection to the database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "week14", host = "localhost")
# insert into a PostGIS-enabled table
pgInsert(con,
name = "sf_car_breakins",
data.obj = car_breakins,
geom = "geom",
overwrite = TRUE,
df.mode = FALSE, partial.match = FALSE, new.id = NULL,
row.names = FALSE, upsert.using = NULL,
alter.names = FALSE, encoding = NULL, return.pgi = FALSE)
This code converts a regular data frame with latitude and longitude columns into a spatial data frame that can be mapped as points. If you have used readr to read in your data, it will be in tbl_df format, and you will need the first line to convert it to a plain data frame.
The key step is to create a data frame called xy that has the data in the same row order (do not sort!), with just the columns longitude and latitude, in that order. The SpatialPointsDataFrame function then uses this to convert the data to a spatial data frame, setting the Coordinate Reference System to EPSG:4326. (SpatialPointsDataFrame is from the sp package, which loads automatically with rgdal.)
Writing the data to PostgreSQL/PostGIS is exactly as in the previous example.
Now we can switch to QGIS to map hotspots for car break-ins. For this we will use a QGIS plugin called MMQGIS. Install it by selecting Plugins>Manage and Install Plugins... from the top menu. At the dialog box, search for MMQGIS and install the plugin:
Now load the data in the sf_car_breakins table:
The problem with data like this is that points may be layered over the top of one another, which makes it hard to assess where the real hotspots are. One way around this is to impose a hexagonal grid on the map, and then count the points in each cell of the grid.
Select MMQGIS>Create>Create Grid Layer from the top menu and fill in the dialog box like this:
In practice, you will need to experiment with X Spacing (Y Spacing will adjust automatically) to get a grid with cells of reasonable size.
Now Browse... to save the shapefile that will be created into your data folder, in a new folder called grid, and click OK.
This should be the result:
Now select Vector>Analysis Tools>Count points in polygon from the top menu, and click Run at the dialog box:
This will create a new layer, Count, with an Attribute Table containing the count of the points in each grid cell in a column called NUMPOINTS:
Right-click on the Count layer, select Save As... and save as an ESRI Shapefile in a new folder within your data folder.
Now create a new QGIS project and pull in an OpenStreetMap layer using the Tile Map Scale Plugin. Right-click on the layer, select Properties and change its Render type to Singleband gray to obtain a neutral gray basemap:
Select Project>Project properties from the top menu, check Enable 'on the fly' CRS transformation (OTF), and click OK:
This will ensure that subsequent layers will be reprojected to the Web Mercator projection used by the basemap.
To load the gridded car break-in hotspot shapefile, click the Add Vector Layer icon, and browse to the shapefile we just made:
Zoom to this layer, then style it to make the layer semi-transparent, with a Graduated color ramp based on the values in the NUMPOINTS column. The map should now look something like this:
Zoom into the concentrations of darker hexagons. Why do you think these areas are hotspots for car break-ins?
Often data journalists must consider how a number of different variables affect an outcome. You might, for instance, have data on bail amounts set for people charged with crimes, and want to analyze how variables including arrestees' crimes, their age, and their race affect the bail amounts set.
Even if your story focuses on only one variable, such as the effect of race on bail amounts, simply ignoring the effect of other variables is a bad idea: You risk drawing false conclusions because of the problem of "lurking" variables, which we discussed in week 1.
In cases like this, you should consider trying regression analysis, in which you build a statistical model to describe how multiple variables affect the outcome you are analyzing.
The example that follows uses multiple linear regression to analyze educational attainment of students from low-income families (those qualifying for free or reduced price lunch) for schools in Oakland in 2015.
First, load the required packages:
# load required packages
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(forcats)
The only ones we haven’t encountered before are GGally and forcats. GGally is an extension to ggplot2 that creates specific plots that can be useful when performing statistical analyses, including regression; forcats is part of the tidyverse, designed for working with categorical variables; it can be used to arrange the categories of a categorical variable in a specific order, which you may need to do for a regression analysis.
The following code loads the data and then uses the GGally function ggpairs to plot relationships between the variables:
# load data
oakland_eei_2015 <- read_csv("oakland_eei_2015.csv")
# remove the school names, which we don't want to consider
oakland_eei_2015_pairs <- oakland_eei_2015 %>%
select(-school_name)
# visualize the relationship between the remaining variables
ggpairs(oakland_eei_2015_pairs) +
theme_minimal()
This should be the result:
This matrix of charts may seem bewildering at first. The key to understanding it is to read the column and row headers at top and right to see which pairs of variables are being plotted or analyzed.
The diagonal line of charts from top left to bottom right has the same column and row headers, so each of these charts refers to just one variable. For a continuous variable, the chart shows a density plot of that variable's distribution; for a categorical variable, it shows the number of observations in each category. So here, for example, the column chart for charter shows that there are roughly twice as many non-charter schools as charter schools. The density plot for eei_score, our outcome, shows that it has a skew to lower values. (If your outcome variable is highly skewed, you may need to "transform" it to better approximate a normal distribution before running a multiple linear regression, by calculating the base 10 logarithm of each value, like this: variable_transform = log10(variable).)
Above this diagonal line of charts, the matrix shows a correlation coefficient, a number measuring the strength of the correlation between two continuous variables, or a box plot summarizing the distribution for each category, for the relationship between a categorical and a continuous variable. In a box plot, the black line in the middle of the box gives the median value, the box shows the 25% of values above and below that, and the lines extend out to the main range of values. Any "outlier" values deemed to fall outside this range appear as separate points.
Values for correlation coefficients can vary from -1 (a perfect negative correlation), through 0 (no relationship at all), to 1 (a perfect positive correlation). A positive correlation means that the two variables increase together; a negative correlation means that one increases as the other decreases.
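If you want to calculate one of these correlation coefficients directly, rather than reading it off the matrix, base R's cor function will do it; for example, for EEI score and the percentage of black students (the use argument drops rows with missing values):
# correlation between EEI score and percentage of black students
cor(oakland_eei_2015$eei_score, oakland_eei_2015$pc_black, use = "complete.obs")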
We are most interested in how the various potential explanatory variables relate to the eei_score outcome, which is summarized in the top row of the matrix. (Tip: When running regression analysis, organize your data so that the outcome variable is in the first column, so that the plot matrix is drawn in this way.)
The box plots for charter indicate that charter schools tend to have better EEI scores. The correlation coefficients reveal that the two most influential continuous variables seem to be the percentage of black students (pc_black), which has a negative correlation with eei_score, and the size of the school (enroll_100), which has a positive correlation with eei_score.
Below the diagonal, the relationships between the variables are plotted as scatterplots, for two continuous variables, or as histograms for each category, to illustrate the relationship between a categorical and a continuous variable.
When building a regression model, it is tempting to include all possible explanatory variables. But the variables chosen should be independent (i.e. the values for one should not depend on the values for another) and should not be strongly correlated with one another. The three variables for racial/ethnic composition are obviously not independent, because a school that consists mostly of black students, for example, is bound to have relatively few Hispanic or white students. The matrix also shows that the percentage of white students is strongly correlated (correlation coefficient = -0.899) with the percentage of students on free or reduced-price lunches: the more white students, the fewer students tend to be on free or reduced-price lunches.
If choosing only one of the racial/ethnic composition variables, it makes sense to choose the percentage of black students, which seems to be the most important. So let's drop the other two racial/ethnic composition variables, and build a regression model:
oak_fit1 <- lm(eei_score ~ pc_free_lunch + pc_black + charter + enroll_100, data=oakland_eei_2015)
summary(oak_fit1)
This code uses the function lm, for linear model, to run a multiple linear regression. The outcome variable is given first, followed by a tilde (~), then the potential explanatory variables separated by + symbols.
This should be the output of the summary for the regression model:
Call:
lm(formula = eei_score ~ pc_free_lunch + pc_black + charter +
enroll_100, data = oakland_eei_2015)
Residuals:
Min 1Q Median 3Q Max
-46.155 -15.983 -2.347 12.154 47.044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.56627 8.32769 4.391 2.59e-05 ***
pc_free_lunch -0.10045 0.08688 -1.156 0.250040
pc_black -0.30586 0.10212 -2.995 0.003384 **
charterY 19.30876 4.53853 4.254 4.39e-05 ***
enroll_100 2.84883 0.74613 3.818 0.000222 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.41 on 111 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.3195, Adjusted R-squared: 0.295
F-statistic: 13.03 on 4 and 111 DF, p-value: 9.878e-09
The Adjusted R-squared value tells us how well the model explains the values for the outcome variable: A value of 0.295 means that the model explains only 29.5% of the variation in EEI scores. Still, three of the variables we considered seem to have a statistically significant influence on the scores, indicated by the asterisks.
School size, for instance, measured by enroll_100, is significant at a level of p < 0.001. This p-value gives the chance of seeing the observed results purely by chance if there were actually no relationship between school size and EEI scores; 0.001 means a 1 in 1,000 chance.
The percentage of students on free or reduced lunches, however, seems to have no significant relationship with the EEI scores.
The Estimate column measures the size of the effect of each variable in the model, while Std. Error provides a measure of the uncertainty around that estimate. Here, the EEI scores tend to increase by 2.85 ± 0.75 points for each additional 100 students in a school, if the effect of the other variables is held constant. Meanwhile, EEI scores tend to drop by 0.31 ± 0.1 for each additional percentage point of black students in the school.
In a regression model like this, the effect of categorical variables is assessed by comparing each category against a reference category. For charter we have only two categories, not being a charter school (N) or being a charter school (Y). Here, the estimate shows us that not being a charter school is the reference, and being a charter school (charterY) is associated with an increase in EEI scores of 19.31 ± 4.54 points, if the effect of the other variables is held constant.
In practice, you may want to experiment with different combinations of variables, to see how they affect the model's explanatory power. Here, we might try dropping pc_free_lunch, especially as its strong negative correlation with pc_white means it may just be a proxy for the percentage of white students in the school, a variable we already decided to exclude.
oak_fit2 <- lm(eei_score ~ pc_black + charter + enroll_100, data=oakland_eei_2015)
summary(oak_fit2)
This should be the output of the summary:
Call:
lm(formula = eei_score ~ pc_black + charter + enroll_100, data = oakland_eei_2015)
Residuals:
Min 1Q Median 3Q Max
-47.770 -16.144 -2.293 12.326 48.813
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.2924 5.4652 5.360 4.51e-07 ***
pc_black -0.3210 0.1014 -3.165 0.002000 **
charterY 19.0416 4.5395 4.195 5.49e-05 ***
enroll_100 2.8983 0.7460 3.885 0.000174 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.44 on 112 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.3113, Adjusted R-squared: 0.2929
F-statistic: 16.88 on 3 and 112 DF, p-value: 4.111e-09
The model still explains 29.3% of the variation in EEI scores.
After running a regression analysis like this, a statistician will perform a number of tests and draw a series of charts to make sure that none of the assumptions of multiple linear regression have been violated. That's beyond the scope of this class. But if you're running a regression analysis for an important story, I strongly suggest getting expert statistical advice to ensure that your model isn't lying to you!
One simple thing that can go wrong, however, is that apparent relationships between your explanatory variables and the outcome variable may be strongly influenced by a few outliers in the data. This is where the scatterplots in the plot matrix can be helpful. Here, they show that three schools are much larger than all the others, with values of enroll_100 of more than 15; that is, they have more than 1,500 students. Looking at the bottom-but-one scatterplot in the first column of the matrix, we should be concerned that the relationship between school size and EEI scores might be due to the influence of these schools. To test this, we can remove them from the data and run the regression analysis again:
# take out the three outliers for school enrollment
oakland_eei_2015_edit <- oakland_eei_2015 %>%
filter(enroll_100 < 15)
# fit the model again
oak_fit3 <- lm(eei_score ~ pc_black + charter + enroll_100, data = oakland_eei_2015_edit)
summary(oak_fit3)
This should be the output of the summary:
Call:
lm(formula = eei_score ~ pc_black + charter + enroll_100, data = oakland_eei_2015_edit)
Residuals:
Min 1Q Median 3Q Max
-45.122 -15.700 -2.461 11.135 49.081
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.9248 7.6779 3.116 0.00234 **
pc_black -0.2894 0.1072 -2.700 0.00804 **
charterY 19.5028 4.5817 4.257 4.41e-05 ***
enroll_100 4.0856 1.4075 2.903 0.00448 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.55 on 109 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.2997, Adjusted R-squared: 0.2805
F-statistic: 15.55 on 3 and 109 DF, p-value: 1.728e-08
The model has lost a small amount of explanatory power, and the significance level of school size has dropped to p < 0.01, but the overall conclusions are the same.
From this regression analysis, we now have some questions to investigate. Why do students from low-income families seem to be doing better in charter schools, and in larger schools? And why are they doing worse in schools with a larger percentage of black students? These questions can’t be answered from this data. But you might want to explore, through further reporting, the resources available to the schools in Oakland, or other socioeconomic characteristics of students — black students may be disadvantaged in various ways, other than their families’ incomes, for example.
Often, when running a regression analysis, you may need to choose which category of a categorical variable to use as the reference. If considering the effect of race/ethnicity on bail amounts, for instance, you would usually want to use white as the reference, and compare the outcome for other races/ethnicities to the outcome where the individual charged with a crime is white. By default, R will consider categories in alphabetical order, using the first in the list as the reference. But you can override this by explicitly turning a column containing text values into a categorical variable, and then reordering it, like this:
# turn charter into a categorical variable
oakland_eei_2015_edit <- oakland_eei_2015_edit %>%
  mutate(charter = as.factor(charter))
# change from the default alphabetical order
oakland_eei_2015_edit <- oakland_eei_2015_edit %>%
  mutate(charter = fct_relevel(charter, "Y"))
# view the new order
levels(oakland_eei_2015_edit$charter)
In this code, fct_relevel moves Y to the start of the list of values for the categorical variable charter. It works in exactly the same way if you have more than two levels. Now the regression model will consider charter schools as the reference, and compare the effect of not being a charter school.
# fit the model again
oak_fit4 <- lm(eei_score ~ pc_black + charter + enroll_100, data=oakland_eei_2015_edit)
summary(oak_fit4)
This should be the output of the summary:
Call:
lm(formula = eei_score ~ pc_black + charter + enroll_100, data = oakland_eei_2015_edit)
Residuals:
Min 1Q Median 3Q Max
-45.122 -15.700 -2.461 11.135 49.081
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.4276 7.2336 6.004 2.58e-08 ***
pc_black -0.2894 0.1072 -2.700 0.00804 **
charterN -19.5028 4.5817 -4.257 4.41e-05 ***
enroll_100 4.0856 1.4075 2.903 0.00448 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.55 on 109 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.2997, Adjusted R-squared: 0.2805
F-statistic: 15.55 on 3 and 109 DF, p-value: 1.728e-08
The result is the same as before, except that now the Estimate is for charterN, and it has a negative value.
In some cases, your outcome measure may not be a continuous variable, but a binary outcome: In parole board decisions, for example, prisoners may be granted parole, or denied. For this type of analysis, you need to use logistic regression. Here you should code the outcome variable as either 0 or 1. The format for the R code to fit the model is as follows:
# fit a logistic regression model
fit <- glm(outcome ~ variable1 + variable2 + variable3, family = binomial(link = "logit"), data = data_frame)
summary(fit)
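If your outcome is stored as text, you could recode it first; for example (the data frame and column names here are hypothetical, not part of the class data):
# hypothetical example: recode a text outcome as 1 (parole granted) or 0 (denied)
parole_decisions <- parole_decisions %>%
  mutate(outcome = ifelse(decision == "Granted", 1, 0))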
There are many more types of statistical models that can be fitted to data, and the R code follows the same general pattern. For a series of stories in 2017, for example, I used a form of machine learning called a random forest model to identify hidden spy planes in months of flight tracking data, training the model on the characteristics of known government surveillance aircraft. The code for this is here.
Once you get comfortable with R, there are almost endless possibilities for data analysis. Here, for example, are two stories I did this year using the tidytext package for text analysis; the code is here and here.
Stack Overflow
For any work involving code, this question-and-answer site is a great resource when you get stuck, to see how others have solved similar problems. Search the site, or browse R questions.