Instructor: Amanda Hickman
Chenwei Tian & Chloe Lessard
How did we do? Peter is going to grade them, but we can use Google Spreadsheets to check our answers to some questions.
- An agency’s budget was $21 million in 2017. It is $24.5 million in 2018. What was the percentage increase, rounded to one decimal place?
Can we use a spreadsheet formula to calculate this? (Yes. So open Google Sheets ...)
- There were 15,000 violent crimes in 2017 in city A, which had a population of 350,000. There were 9,000 violent crimes in City B, which had a population of 500,000. Complete the following sentence with a number rounded to one decimal place: “City A’s violent crime rate was ... times that of city B.”
What we need to do is normalize the data to the population.
City | Crimes | Population | Rate |
---|---|---|---|
A | 15000 | 350000 | |
B | 9000 | 500000 |
Spreadsheets are the simplest tool in your data toolbox. There are some real limitations that you'll encounter when you try to use spreadsheets for large datasets or complex, reproducible analysis, but some facility with core spreadsheet functions will go a long, long way towards improving your data skills.
Sort, filter, summarize, group and join.
We're going to use Google Sheets in class, but LibreOffice Calc is a great, open source option. Excel is also perfectly good, though it's been a long time since anyone showed me a reason to choose it over LibreOffice.
On a data set the size of ProPublica's Medicaid data, you aren't going to be able to use Google Sheets (which has a max of ~200K cells -- meaning that a four column spreadsheet can only accommodate 50K rows). So as the semester progresses and we move into tools that are way harder to use than spreadsheets are, it's because we can't use a spreadsheet.
Open 311_Cases_Dec2017.csv
. We're going to start by just sniffing around. The best way to get it into a Google Drive spreadsheet is to use File > Import ...
The data is San Francisco's 311 call records, from SF's Open Data Portal -- I used their service to filter out only the cases opened between 12/01/2017 12:00:00 AM and 01/01/2018 12:00:00 AM.
Click in any cell in the header row and look for the 'filter' icon. Click it. It will take a minute but we'll get a little filter icon in each header cell, and then we can start to skim a bit. Try filtering by status, and source.
Notice that it takes a minute to get through this data.
Good reading on 311 data:
http://kycir.org/2018/02/01/behind-the-data-how-we-analyzed-louisville-trash-complaints/
http://kycir.org/2018/01/30/the-next-louisville-what-trash-cans-tell-us-about-poverty/
In groups of 3, pull up your pitches and go over them. Questions to ask:
Notes: Most of you need to... a) get your hands on the actual data and make sure there's something there, b) go do some reading of other reporting on this, or make a few phone calls to figure out what the context is.
I keep a running list of my favorite spreadsheet commands on a course wiki -- it's a great resource.
You went over the basics week 1: =SUM
, =AVERAGE
, =MAX
, =MIN
City | Crimes | Population | Formula | Rate |
---|---|---|---|---|
A | 15000 | 350000 | =(B2/C2)*100000 | 4286 |
B | 9000 | 500000 | =(B3/C3)*100000 | 1800 |
Later this semester, we'll use some much more powerful tools, especially R. And the honest truth is that spreadsheets can be super dangerous. It's hard to keep track of the transformations you're applying when they're hidden in spreadsheet functions. But: you can do a lot in spreadsheets and sometimes they're exactly the right tool.
Data > Pivot Table...
Responsible Agency
and then Category
as rowsCaseID
as value. Supervisor District
as columns.If you wanted to quickly see the opened vs closed case count for each Supervisor's district, how would you do that?
The USGS publishes real time earthquake data, but the locations they include are not something we can sort by.
Start by using Data > Text to Columns ...
to separate out the broad location from the specifics.
Filter for blank cells. We can copy and paste to fill in the blank cells.
Create a pivot table -- use "COUNT of Mag"; sort it by count.
Go back to the spreadsheet and Categorize by magnitude. I made the slightly arbitrary decision that we're going to call everything below 4.5 "Minor", and everything above 6.0 "Strong" -- the USGS actually has a few more grades in their system, but these are good buckets.
=LOOKUP(E2,{0,4.5,6},{"Minor","Moderate","Strong"})
Fill Down: Select F2 through F6 and do ctrld -- that's one "fill down" option. Or grab the blue box and pull down. Or, hover over the little blue square and double click. They're all good options.
Go back to your pivot table and add the "categories" as columns.
The State of California publishes quite a bit of high school graduation data statewide. Open calpad_cohort16_alameda.csv in Google Sheets.
Note that the statewide data comes with a 14 digit code. Take a look at the dictionary and let's figure out how to break out the CDS into county, district, school.
If we want to know how Oakland Technical compares to the rest of Oakland's High Schools, how would you approach that?
Installs: make sure that you have Chrome and Open Refine installed.
Pivot Table Exercise: The complete assignment is in bCourses.
Go back to the Berkeley Collision Data that Peter shared with you Week 1. Use a pivot table to get the total number of fatalities by day of the week for each year. You will probably want to consult the data dictionary.
Then, revise your pivot table to answer the following two questions keeping in mind that a single fatal crash may include multiple fatalities:
Share your spreadsheet with me (abh@berkeley.edu) so I can see your pivot table.
Data Presentation Feb. 8 Cat Schuknecht & James Steinbauer