Week 13 — April 19, 2018

Instructor: Amanda Hickman

Fixing your path

This is a good explanation of the command path issues we were navigating last time we worked in the terminal:


Roughly, your command line interpreter, bash, uses a variable called PATH to decide where to look for programs to run. If you run echo $PATH at the command line, you can see a list of all the directories your shell looks in for executable programs. Yours is probably different from mine, but on my primary laptop (which runs Ubuntu, not OSX), my path looks like this:

amanda@mona:~$ echo $PATH

Note that I manually added a few paths there: /usr/local/heroku/bin lets me run a handful of specialized heroku commands, and /opt/venvs/vdirsyncer-latest/bin/ lets me run vidirsyncer, which I’m super happy to talk about but is kind of a rabbit hole in this context.

If you want to run a program that isn’t in your PATH variable, you have to use the “absolute path” — the full path to the program, starting with / for the root directory. Alternatively, you can edit your PATH to include the path to the directory that includes that program.

To add shp2pgsql so you can access it easily, you want to tell your shell that it needs to look in /Applications/Postgres.app/Contents/Versions/10/bin/ for programs.

To do that, you’re going to edit a hidden file in your home directory (probably /Users/yourname/) called .profile, to add a line like:

export PATH=/Applications/Postgres.app/Contents/Versions/10/bin/:$PATH

PATH is the shell variable that contains the array of directories that bash looks in to find executable programs, and we print the contents of that variable with $PATH. This command exports or “sets” a new PATH by concatenating the new directory (/Applications/Postgres.app/Contents/Versions/10/bin/) that you want to include, with the existing values stored in PATH.

Do you have enough RAM for this?

I was surprised to discover that quite a few of you couldn’t actually load the US zipcode file into postgres via Postico. My suspicion is that you could load it if you use the command line tools, but Postico wasn’t able to load the whole thing into its preview window — queueing it up so you can scroll around in it takes more memory than just loading it.

Use these instructions to find some basic information about your computer. Please complete this survey: https://goo.gl/forms/oIsHQAUuyWREvlrC3 so we can get a handle on the problem.

Data of the Week

Charlotte and Susie are up:

SQL Vocabulary

Quick follow up to a conversation we had two weeks ago about why FLOAT is a synonym for DOUBLE PRECISION, and what that means to begin with. The short answer is “it has something to do with how computers store data” — As for almost any arcane topic in mathematics and computing there is a wikipedia article that goes deep on the finer points. Floating point arithmetic is no exception.

You can see the full list of data types that Postgres supports in the Postgres documentation. You’re almost always going to be using one of only a few: numeric, float, integer, text, char, varchar, and geometry.

format usage
NUMERIC any number — this is the preferable format but some systems default to FLOAT if you don’t specify NUMERIC
FLOAT a decimal number, synonymous with DOUBLE PRECISION— if you plan to do any math on the number, NUMERIC is preferable
INTEGER any whole number
BIGINT very large integer (> 2.1B or < -2.1B)
TEXT any text — this is an efficient format to use in Postgres but in other SQL databases you may not be able to index, search or sort a TEXT column. It isn’t standard across SQL implementations.
CHAR text that is always the same length
VARCHAR text of varying length
GEOMETRY geographic data (points, lines, polygons)

Hands On PostGIS

We started this two weeks ago and hit some walls that I wasn’t expecting, so let’s try again. Download the Week 13 data bundle so you have everything in one place.

Where are you working?

Remember that to do PostGIS queries you will need to make sure you’ve enabled it on the database you’re working in, with CREATE EXTENSION postgis; — so make sure that you’re working in a database where you’ve enabled postgis.

Find Alameda County zipcodes

what are the zipcodes that are overlap with Alameda county. Start by spelling out how you’d approach this. Can you describe in words what you’re trying to do? Write this in a comment at the top of your script.

You’re looking for a way to capture all the shapes in one layer that intersect with a single shape in another layer. The census publishes zipcode boundaries for the whole US, but for some of your computers that was too much data, so I cut it down to just California zipcodes for you. You’ll need the sql to create a table and the data itself is in a separate csv.

Alameda County’s open data portal publishes a county boundary file. I already used shp2pgsql to convert the shapefile to SQL.

  1. Use shp2pgsql to convert the Shapefile into SQL.
  2. Import the SQL into Postgres. You can either
    a. use Postico’s “Load Query” button to load alameda_county_boundary.sql, or
    b. use psql at the command line with psql -d week11_postgis -f alameda_county_boundary.sql (but note that week11_postgis is my database name. You may not have named your database week11_postgis.)
  3. Create a spatial index. You can either
    a. Load it into QGIS and click “create index” or
    b. run CREATE INDEX sidx_alameda_geom ON alameda USING gist (geom);

Find Your PostGIS command

There are a few that sound like they might be what we want: ST_Within, ST_Contains, ST_Intersection , ST_Intersects. Take a look at the documentation: how do these differ?

We’re going to use ST_Intersects to find all the California zipcode shapes that intersect with Alameda county.

  alameda.geom AS county_geom,
  alameda.name AS county,
  zipcodes.zcta5ce10 AS zipcode,
  zipcodes.geom as zip_geom
  alameda, zipcodes
  ST_Intersects(zipcodes.geom, alameda.geom) AND alameda.name = 'Alameda County';

Since that seems like it worked, make a table out of that:

CREATE TABLE alameda_zipcodes AS
  SELECT zipcodes.* FROM zipcodes, alameda
  WHERE ST_Intersects(zipcodes.geom, alameda.geom) AND alameda.name = 'Alameda County';


You’re going to wind up capturing a few zipcodes that only cross the county line in tiny spots. Can you brainstorm some ways to address those?

Find San Francisco zipcodes

San Francisco doesn’t publish a handy county file, so I pulled down the TIGER county shapefiles. This is smaller than the zipcode file but it’s still pretty big, so I also ran it through shp2pgsql with shp2pgsql cb_2017_us_county_20m.shp counties postgres > counties.sql, and then pruned it by loading it into Postgres and running:

DELETE FROM counties WHERE statefp != '06';

You can load the smaller, California only file from the week 13 data file — it’s counties.sql.

Find all the zipcodes in San Francisco county? What about Fresno county?

Start by writing down the steps you need to take, then … take them.

Create mappable points

San Francisco publishes car break in data.

Can you …

I did use csvsql to generate a CREATE statement for you.

CREATE TABLE car_breakins (
    incidntnum VARCHAR(9) NOT NULL,
    category VARCHAR(13) NOT NULL,
    descript VARCHAR(28) NOT NULL,
    dayofweek VARCHAR(9) NOT NULL,
    event_date DATE NOT NULL,
    pddistrict VARCHAR(10) NOT NULL,
    resolution VARCHAR(21),
    address VARCHAR(42) NOT NULL,
    location VARCHAR(41) NOT NULL

Bonus questions: where is this data from? Who compiled it? Can you sniff anything out from the Socrata metadata?


Please submit your final data journalism project by Tuesday May 1 at 8pm.

Keep Learning