Instructor: Amanda Hickman
This is a good explanation of the command path issues we were navigating last time we worked in the terminal:
http://www.tech-recipes.com/rx/2621/os_x_change_path_environment_variable/k
Roughly, your command line interpreter, bash
, uses a variable called PATH
to decide where to look for programs to run. If you run echo $PATH
at the command line, you can see a list of all the directories your shell looks in for executable programs. Yours is probably different from mine, but on my primary laptop (which runs Ubuntu, not OSX), my path looks like this:
amanda@mona:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/heroku/bin:/opt/venvs/vdirsyncer-latest/bin/
Note that I manually added a few paths there: /usr/local/heroku/bin
lets me run a handful of specialized heroku commands, and /opt/venvs/vdirsyncer-latest/bin/
lets me run vidirsyncer, which I’m super happy to talk about but is kind of a rabbit hole in this context.
If you want to run a program that isn’t in your PATH
variable, you have to use the “absolute path” — the full path to the program, starting with /
for the root directory. Alternatively, you can edit your PATH
to include the path to the directory that includes that program.
To add shp2pgsql
so you can access it easily, you want to tell your shell that it needs to look in /Applications/Postgres.app/Contents/Versions/10/bin/
for programs.
To do that, you’re going to edit a hidden file in your home directory (probably /Users/yourname/
) called .profile
, to add a line like:
export PATH=/Applications/Postgres.app/Contents/Versions/10/bin/:$PATH
PATH
is the shell variable that contains the array of directories that bash looks in to find executable programs, and we print the contents of that variable with $PATH
. This command exports
or “sets” a new PATH
by concatenating the new directory (/Applications/Postgres.app/Contents/Versions/10/bin/
) that you want to include, with the existing values stored in PATH
.
I was surprised to discover that quite a few of you couldn’t actually load the US zipcode file into postgres via Postico. My suspicion is that you could load it if you use the command line tools, but Postico wasn’t able to load the whole thing into its preview window — queueing it up so you can scroll around in it takes more memory than just loading it.
Use these instructions to find some basic information about your computer. Please complete this survey: https://goo.gl/forms/oIsHQAUuyWREvlrC3 so we can get a handle on the problem.
Charlotte and Susie are up:
https://docs.google.com/spreadsheets/d/11JLkkyWZf3fvVz3aebgMjcZ6mxV-j5Gw7hEpeiPAGY4/edit#gid=0
Quick follow up to a conversation we had two weeks ago about why FLOAT is a synonym for DOUBLE PRECISION, and what that means to begin with. The short answer is “it has something to do with how computers store data” — As for almost any arcane topic in mathematics and computing there is a wikipedia article that goes deep on the finer points. Floating point arithmetic is no exception.
You can see the full list of data types that Postgres supports in the Postgres documentation. You’re almost always going to be using one of only a few: numeric, float, integer, text, char, varchar, and geometry.
format | usage |
---|---|
NUMERIC | any number — this is the preferable format but some systems default to FLOAT if you don’t specify NUMERIC |
FLOAT | a decimal number, synonymous with DOUBLE PRECISION — if you plan to do any math on the number, NUMERIC is preferable |
INTEGER | any whole number |
BIGINT | very large integer (> 2.1B or < -2.1B) |
TEXT | any text — this is an efficient format to use in Postgres but in other SQL databases you may not be able to index, search or sort a TEXT column. It isn’t standard across SQL implementations. |
CHAR | text that is always the same length |
VARCHAR | text of varying length |
GEOMETRY | geographic data (points, lines, polygons) |
We started this two weeks ago and hit some walls that I wasn’t expecting, so let’s try again. Download the Week 13 data bundle so you have everything in one place.
Remember that to do PostGIS queries you will need to make sure you’ve enabled it on the database you’re working in, with CREATE EXTENSION postgis;
— so make sure that you’re working in a database where you’ve enabled postgis.
what are the zipcodes that are overlap with Alameda county. Start by spelling out how you’d approach this. Can you describe in words what you’re trying to do? Write this in a comment at the top of your script.
You’re looking for a way to capture all the shapes in one layer that intersect with a single shape in another layer. The census publishes zipcode boundaries for the whole US, but for some of your computers that was too much data, so I cut it down to just California zipcodes for you. You’ll need the sql to create a table and the data itself is in a separate csv.
Alameda County’s open data portal publishes a county boundary file. I already used shp2pgsql
to convert the shapefile to SQL.
shp2pgsql
to convert the Shapefile into SQL.alameda_county_boundary.sql
, orpsql
at the command line with psql -d week11_postgis -f alameda_county_boundary.sql
(but note that week11_postgis
is my database name. You may not have named your database week11_postgis
.)CREATE INDEX sidx_alameda_geom ON alameda USING gist (geom);
There are a few that sound like they might be what we want: ST_Within, ST_Contains, ST_Intersection , ST_Intersects. Take a look at the documentation: how do these differ?
We’re going to use ST_Intersects to find all the California zipcode shapes that intersect with Alameda county.
SELECT
alameda.geom AS county_geom,
alameda.name AS county,
zipcodes.zcta5ce10 AS zipcode,
zipcodes.geom as zip_geom
FROM
alameda, zipcodes
WHERE
ST_Intersects(zipcodes.geom, alameda.geom) AND alameda.name = 'Alameda County';
Since that seems like it worked, make a table out of that:
CREATE TABLE alameda_zipcodes AS
SELECT zipcodes.* FROM zipcodes, alameda
WHERE ST_Intersects(zipcodes.geom, alameda.geom) AND alameda.name = 'Alameda County';
You’re going to wind up capturing a few zipcodes that only cross the county line in tiny spots. Can you brainstorm some ways to address those?
San Francisco doesn’t publish a handy county file, so I pulled down the TIGER county shapefiles. This is smaller than the zipcode file but it’s still pretty big, so I also ran it through shp2pgsql
with shp2pgsql cb_2017_us_county_20m.shp counties postgres > counties.sql
, and then pruned it by loading it into Postgres and running:
DELETE FROM counties WHERE statefp != '06';
You can load the smaller, California only file from the week 13 data file — it’s counties.sql
.
Start by writing down the steps you need to take, then … take them.
San Francisco publishes car break in data.
Can you …
I did use csvsql to generate a CREATE statement for you.
CREATE TABLE car_breakins (
incidntnum VARCHAR(9) NOT NULL,
category VARCHAR(13) NOT NULL,
descript VARCHAR(28) NOT NULL,
dayofweek VARCHAR(9) NOT NULL,
event_date DATE NOT NULL,
event_time TIME WITHOUT TIME ZONE NOT NULL,
pddistrict VARCHAR(10) NOT NULL,
resolution VARCHAR(21),
address VARCHAR(42) NOT NULL,
x FLOAT NOT NULL,
y FLOAT NOT NULL,
location VARCHAR(41) NOT NULL
);
Bonus questions: where is this data from? Who compiled it? Can you sniff anything out from the Socrata metadata?
Please submit your final data journalism project by Tuesday May 1 at 8pm.
In bCourses, submit a full write-up of your project. This should be written so that someone with no prior knowledge of your project can understand why this is interesting as well as what you have done. It should include:
A clear and direct lede that tells us why we should keep reading, and what the main points of your work are. This should be written for a new audience and should demonstrate the writing skills you have learned in your other classes. Think of this as your chance to sell an editor on this work, by showing that you can write and that you understand this data.
A description of the data you used, and how it was cleaned and processed for analysis. Again, imagine you are writing for an editor who doesn’t know you and needs to know that you have made smart decisions about data to base your reporting on and that you truly understand what you are doing.
The questions you asked of the data, and the tools you used to ask them.
The conclusions you have drawn. Include any charts/maps that are relevant.
Any further analyses you would need to run to turn this project into a story.
The additional reporting, beyond data analysis, that would be required to turn this project into a story.
Share the data and code for your analysis. Code should be commented so that your instructors are able to understand each step of your analysis. You can share this in a zipped folder and email if the data files are not too large. If they are too large to email, you can share via a Dropbox link.
Plan to come to class on May 3 prepared to give a 5-7 minute presentation on your project, and answer questions from other students. Plan, also, to listen thoughtfully to your classroom colleagues and engage them on their work this semester.