Instructor: Amanda Hickman
This is a good explanation of the command path issues we were navigating last time we worked in the terminal:
Roughly, your command line interpreter, bash, uses a variable called `PATH` to decide where to look for programs to run. If you run `echo $PATH` at the command line, you can see a list of all the directories your shell looks in for executable programs. Yours is probably different from mine, but on my primary laptop (which runs Ubuntu, not OSX), my path looks like this:

```
amanda@mona:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/heroku/bin:/opt/venvs/vdirsyncer-latest/bin/
```
Note that I manually added a few paths there: `/usr/local/heroku/bin` lets me run a handful of specialized heroku commands, and `/opt/venvs/vdirsyncer-latest/bin/` lets me run vdirsyncer, which I’m super happy to talk about but is kind of a rabbit hole in this context.
If you want to run a program that isn’t in one of the directories in your `PATH` variable, you have to use the “absolute path”: the full path to the program, starting with `/` for the root directory. Alternatively, you can edit your `PATH` to include the directory that contains that program.
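You can check whether (and where) your shell finds a given program with `command -v` (or `which`). Here `ls` is just a stand-in example; substitute any program name:

```shell
# Print the full path the shell resolves a command to, by searching PATH.
# `ls` is a stand-in; try `command -v shp2pgsql` to see whether it's on your PATH.
# If the command prints nothing, the program isn't in any PATH directory.
command -v ls
```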
If you want to be able to run `shp2pgsql` easily, you need to tell your shell to look in `/Applications/Postgres.app/Contents/Versions/10/bin/` for programs.
To do that, you’re going to edit a hidden file in your home directory (probably `.profile`) to add a line like:
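Assuming the Postgres.app directory mentioned above, the line looks like this:

```shell
# Prepend Postgres.app's bin directory to the existing PATH
export PATH="/Applications/Postgres.app/Contents/Versions/10/bin:$PATH"
```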
`PATH` is the shell variable that contains the list of directories that bash looks in to find executable programs, and we refer to the contents of that variable with `$PATH`. This command exports, or “sets,” a new `PATH` by concatenating the new directory that you want to include (`/Applications/Postgres.app/Contents/Versions/10/bin/`) with the existing values stored in `$PATH`.
I was surprised to discover that quite a few of you couldn’t actually load the US zipcode file into postgres via Postico. My suspicion is that you could load it if you use the command line tools, but Postico wasn’t able to load the whole thing into its preview window — queueing it up so you can scroll around in it takes more memory than just loading it.
Charlotte and Susie are up:
Quick follow-up to a conversation we had two weeks ago about why `FLOAT` is a synonym for `DOUBLE PRECISION`, and what that means to begin with. The short answer is “it has something to do with how computers store data.” As with almost any arcane topic in mathematics and computing, there is a Wikipedia article that goes deep on the finer points, and floating point arithmetic is no exception.
You can see the full list of data types that Postgres supports in the Postgres documentation. You’re almost always going to be using one of only a few: numeric, float, integer, text, char, varchar, and geometry.
| Type | Description |
|------|-------------|
| `NUMERIC` | any number; this is the preferable format, but some systems default to `FLOAT` |
| `FLOAT` | a decimal number, synonymous with `DOUBLE PRECISION` |
| `INTEGER` | any whole number |
| `BIGINT` | a very large integer (> 2.1B or < -2.1B) |
| `TEXT` | any text; this is an efficient format to use in Postgres, but in other SQL databases you may not be able to index, search, or sort a `TEXT` column |
| `CHAR` | text that is always the same length |
| `VARCHAR` | text of varying length |
| `GEOMETRY` | geographic data (points, lines, polygons) |
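As a sketch of how these types fit together (the table and column names here are invented for illustration), a `CREATE TABLE` statement might look like:

```sql
-- Hypothetical table; names and sizes are made up for illustration
CREATE TABLE example_incidents (
    id        INTEGER,
    zipcode   CHAR(5),        -- always exactly five characters
    category  VARCHAR(50),    -- text up to 50 characters
    notes     TEXT,           -- text of any length
    total     BIGINT,         -- whole numbers beyond +/- 2.1 billion
    rate      NUMERIC(10, 4), -- exact decimal with four decimal places
    x         FLOAT,          -- same as DOUBLE PRECISION
    geom      GEOMETRY        -- requires the PostGIS extension
);
```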
We started this two weeks ago and hit some walls that I wasn’t expecting, so let’s try again. Download the Week 13 data bundle so you have everything in one place.
Remember that to do PostGIS queries you will need to make sure you’ve enabled PostGIS on the database you’re working in, with `CREATE EXTENSION postgis;`.
What zipcodes overlap with Alameda County? Start by spelling out how you’d approach this. Can you describe in words what you’re trying to do? Write this in a comment at the top of your script.
You’re looking for a way to capture all the shapes in one layer that intersect with a single shape in another layer. The census publishes zipcode boundaries for the whole US, but for some of your computers that was too much data, so I cut it down to just California zipcodes for you. You’ll need the SQL to create a table, and the data itself is in a separate CSV.
I used `shp2pgsql` to convert the Shapefile into SQL. You can load it with `psql` at the command line, with `psql -d week11_postgis -f alameda_county_boundary.sql` (but note that `week11_postgis` is my database name; you may not have named your database the same thing). Once the table is loaded, create a spatial index:
CREATE INDEX sidx_alameda_geom ON alameda USING gist (geom);
We’re going to use ST_Intersects to find all the California zipcode shapes that intersect with Alameda county.
```sql
SELECT
    alameda.geom AS county_geom,
    alameda.name AS county,
    zipcodes.zcta5ce10 AS zipcode,
    zipcodes.geom AS zip_geom
FROM alameda, zipcodes
WHERE ST_Intersects(zipcodes.geom, alameda.geom)
  AND alameda.name = 'Alameda County';
```
Since that seems like it worked, make a table out of that:
```sql
CREATE TABLE alameda_zipcodes AS
SELECT zipcodes.*
FROM zipcodes, alameda
WHERE ST_Intersects(zipcodes.geom, alameda.geom)
  AND alameda.name = 'Alameda County';
```
You’re going to wind up capturing a few zipcodes that only cross the county line in tiny spots. Can you brainstorm some ways to address those?
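One possible approach (a sketch, reusing the table and column names from the query above; the 5% threshold is arbitrary) is to keep only zipcodes where a meaningful fraction of the zipcode’s area falls inside the county:

```sql
-- Keep zipcodes where at least 5% of the zipcode's area lies inside the county.
-- The 0.05 threshold is arbitrary; tune it to your data.
SELECT zipcodes.zcta5ce10 AS zipcode
FROM zipcodes, alameda
WHERE ST_Intersects(zipcodes.geom, alameda.geom)
  AND alameda.name = 'Alameda County'
  AND ST_Area(ST_Intersection(zipcodes.geom, alameda.geom))
      / ST_Area(zipcodes.geom) > 0.05;
```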
San Francisco doesn’t publish a handy county file, so I pulled down the TIGER county shapefiles. This is smaller than the zipcode file but it’s still pretty big, so I also ran it through `shp2pgsql cb_2017_us_county_20m.shp counties postgres > counties.sql`, and then pruned it by loading it into Postgres and running:
DELETE FROM counties WHERE statefp != '06';
You can load the smaller, California only file from the week 13 data file — it’s
Start by writing down the steps you need to take, then … take them.
San Francisco publishes car break in data.
Can you …
I did use csvsql to generate a CREATE statement for you.
```sql
CREATE TABLE car_breakins (
    incidntnum VARCHAR(9) NOT NULL,
    category   VARCHAR(13) NOT NULL,
    descript   VARCHAR(28) NOT NULL,
    dayofweek  VARCHAR(9) NOT NULL,
    event_date DATE NOT NULL,
    event_time TIME WITHOUT TIME ZONE NOT NULL,
    pddistrict VARCHAR(10) NOT NULL,
    resolution VARCHAR(21),
    address    VARCHAR(42) NOT NULL,
    x          FLOAT NOT NULL,
    y          FLOAT NOT NULL,
    location   VARCHAR(41) NOT NULL
);
```
Bonus questions: where is this data from? Who compiled it? Can you sniff anything out from the Socrata metadata?
Please submit your final data journalism project by Tuesday May 1 at 8pm.
In bCourses, submit a full write-up of your project. This should be written so that someone with no prior knowledge of your project can understand why this is interesting as well as what you have done. It should include:
A clear and direct lede that tells us why we should keep reading, and what the main points of your work are. This should be written for a new audience and should demonstrate the writing skills you have learned in your other classes. Think of this as your chance to sell an editor on this work, by showing that you can write and that you understand this data.
A description of the data you used, and how it was cleaned and processed for analysis. Again, imagine you are writing for an editor who doesn’t know you and needs to know that you have made smart decisions about data to base your reporting on and that you truly understand what you are doing.
The questions you asked of the data, and the tools you used to ask them.
The conclusions you have drawn. Include any charts/maps that are relevant.
Any further analyses you would need to run to turn this project into a story.
The additional reporting, beyond data analysis, that would be required to turn this project into a story.
Share the data and code for your analysis. Code should be commented so that your instructors can understand each step of your analysis. You can share a zipped folder by email if the data files are not too large; if they are too large to email, you can share via a Dropbox link.
Plan to come to class on May 3 prepared to give a 5-7 minute presentation on your project, and answer questions from other students. Plan, also, to listen thoughtfully to your classroom colleagues and engage them on their work this semester.