DIY Medium Data Analysis: NYC Taxi Trips and Beyond
Todd Schneider
January 20, 2016
todd@toddwschneider.com
Where to find me
toddwschneider.com github.com/toddwschneider @todd_schneider toddsnyder
Things I’ll talk about
- Taxi and Uber data
- Live coding demo
- Medium data analysis tips
Taxi and Uber Data
http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
Citi Bike Data
http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
NYC Taxi and Uber Data
- Taxi & Limousine Commission released
public, trip-level data for over 1.1 billion taxi rides 2009–2015
- Some public Uber data available as
well, thanks to a FOIL request by FiveThirtyEight
- Lots of great work by lots of people
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Why analyze taxi data?
- “Because it’s there”
- More precisely: it’s fun, it’s interesting, and
it’s computationally feasible on a commodity laptop
(Taxi) Analysis Overview
- 1. Get raw data
- 2. Write code to process raw data into
something more useful
- 3. Analyze data
- 4. Write about what you found out
Taxi Analysis Tools
- PostgreSQL
- PostGIS
- R
- Command line
https://github.com/toddwschneider/nyc-taxi-data
Raw data processing goals
- Load flat files of varying file formats into a unified,
persistent PostgreSQL database that we can use to answer questions about the data
- Do some one-time calculations to augment the raw
data
- For taxi data in particular, we want to answer
neighborhood-based questions, so we’ll map latitude/longitude coordinates to NYC census tracts
trips table schema
- trips table will contain one row for each taxi/Uber trip record*
- Most fields come from raw data, with some renaming
- Add fields for census tracts, which we know we're going to use
Processing raw data: The reality
- Raw data is often messy and can require massaging
- Not fun, takes a while, but essential
- Specifically: we have to plan ahead a bit,
anticipate usage patterns, questions we’re going to ask, then decide on schema
Quick peek at some raw data…
Specific issues encountered with raw taxi data
- Some files contain empty lines and unquoted carriage
returns 😑
- Raw data files have different formats even within the
same cab type 😖
- Some files contain extra columns in every row 😡
- Some files contain extra columns in only some rows 😢
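Issues like the empty lines and stray carriage returns above can usually be fixed with standard command-line tools before loading. A minimal sketch, with a made-up sample file (the real cleanup lives in the repo's import script):

```shell
# Hypothetical cleanup: strip carriage returns, then drop empty lines.
# The sample file and its contents are illustrative, not real taxi data.
printf 'a,b\r\n\r\nc,d\r\n' > /tmp/raw_sample.csv
tr -d '\r' < /tmp/raw_sample.csv | sed '/^$/d'
# prints:
# a,b
# c,d
```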
How do we load all of these files into the database?
- One at a time!
- Bash script loops through each raw data file, for
each file it executes code to process data and insert records into trips table
https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh
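The shape of that loop can be sketched as follows. This is a hedged toy version: the real import_trip_data.sh pipes each cleaned file into psql, while this sketch just cleans made-up files into a staging directory so the loop structure is visible. All paths and file names are illustrative.

```shell
# Create some fake raw files, then loop through them one at a time,
# cleaning each (strip \r, drop empty lines) into a staging directory.
mkdir -p /tmp/raw /tmp/staging
printf 'x,y\r\n1,2\r\n' > /tmp/raw/trips_2015-01.csv
printf 'x,y\r\n3,4\r\n' > /tmp/raw/trips_2015-02.csv

for f in /tmp/raw/trips_*.csv; do
  tr -d '\r' < "$f" | sed '/^$/d' > "/tmp/staging/$(basename "$f")"
  # the real script would now load the cleaned file into the trips table
done

ls /tmp/staging
```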
How do we map latitude and longitude to census tracts?
- PostGIS!
- Geographic information system (GIS) for
PostgreSQL
- Can do calculations of the form, “is a point inside a
polygon?”
- Every pickup/drop off is a point, NYC’s census
tracts are polygons
NYC Census Tracts
- 2,166 tracts
- 196 neighborhood
tabulation areas (NTAs)
Shapefiles
- Shapefile format describes geometries like points,
lines, polygons
- Many shapefiles publicly available, e.g. NYC
provides a shapefile that contains definitions for all census tracts and NTAs
- PostGIS includes functionality to import shapefiles
Quick peek at imported shapefile…
PostGIS: ST_Within()
- ST_Within(geom A, geom B) function returns
true if and only if A is entirely within B
- A = pickup or drop off point
- B = NYC census tract polygon
ST_Within() Usage
SELECT
  raw_trips.id AS trip_id,
  nyct2010.gid AS pickup_tract_id
FROM raw_trips, nyct2010
WHERE ST_Within(raw_trips.pickup, nyct2010.geom);
Spatial Indexes
- Problem: determining whether a point is inside an
arbitrary polygon is computationally intensive and slow
- PostGIS spatial indexes to the rescue!
Spatial indexes in a nutshell
[diagram: a census tract polygon and its rectangular bounding box]
Spatial Indexes
- Determining whether a point is inside a rectangle is easy!
- Spatial indexes store rectangular bounding boxes for
polygons, then when determining if a point is inside a polygon, calculate in 2 steps:
- 1. Is the point inside the polygon’s bounding box?
- 2. If so, is the point inside the polygon itself?
- Most of the time the cheap first check will be false, then
we can skip the expensive second step
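The cheap first step can be illustrated with a toy awk snippet. The point and bounding-box coordinates below are invented for illustration; in practice PostGIS performs this check internally via the spatial index.

```shell
# Toy illustration of step 1: test a point against a rectangular
# bounding box before doing any expensive point-in-polygon math.
echo "-73.99 40.75" | awk '{
  # hypothetical bounding box of a census tract
  xmin = -74.00; xmax = -73.98; ymin = 40.74; ymax = 40.76
  if ($1 >= xmin && $1 <= xmax && $2 >= ymin && $2 <= ymax)
    print "inside bbox: worth running the exact point-in-polygon test"
  else
    print "outside bbox: skip the expensive test"
}'
```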
Putting it all together
- Download NYC census tracts shapefile, import
into database, create spatial index
- Download raw taxi/Uber data files and loop
through them, one file at a time
- For each file: fix data issues, load into database,
calculate census tracts with ST_Within()
- Wait 3 days and voilà!
Analysis, a.k.a.“the fun part”
- Ask fun and interesting
questions
- Try to answer them
- Rinse and repeat
Taxi maps
- Question: what does a map of every taxi pickup
and drop off look like?
- Each trip has a pickup and drop off location, plot a
bunch of dots at those locations
- Made entirely in R using ggplot2
Taxi maps
Taxi maps preprocess
- Problem: R can’t fit 1.1 billion rows in memory
- Solution: preprocess data by rounding
lat/long to 4 decimal places (~10 meters), count number of trips at each aggregated point
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
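The same round-and-count idea can be sketched with command-line tools. The coordinates below are invented sample points, not real trips, and the real aggregation runs in Postgres (see the linked SQL):

```shell
# Round lat/long to 4 decimal places (~10 meters), then count how many
# trips fall at each rounded point. Sample coordinates are made up.
printf '40.75123,-73.99341\n40.75124,-73.99342\n40.70001,-74.00001\n' |
awk -F, '{ printf "%.4f,%.4f\n", $1, $2 }' |
sort | uniq -c
```

The first two points collapse to the same rounded location with a count of 2, which is exactly the reduction that makes the data small enough for R.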
Render maps in R
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R
Data reliability
Every other comment on reddit:
Airport traffic
- Question: how long does it take to get to the airport?
- LGA, JFK, and EWR are each their own census
tracts
- Get all trips that dropped off in one of those tracts
- Calculate travel times from neighborhoods to airports
Airport traffic preprocess
CREATE TABLE airport_trips AS
SELECT …,
  CASE dropoff_nyct2010_gid
    WHEN 1840 THEN 'LGA'
    WHEN 2056 THEN 'JFK'
    WHEN 2167 THEN 'EWR'
  END AS airport_code,
  EXTRACT(HOUR FROM pickup_datetime) AS hour
FROM trips
WHERE dropoff_nyct2010_gid IN (1840, 2056, 2167)
  AND EXTRACT(DOW FROM pickup_datetime) NOT IN (0, 6);
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L242-L269
Airport traffic summary
CREATE TABLE airport_trips_summary AS
SELECT
  n.ntacode,
  airport_code,
  hour,
  COUNT(*) AS trips_count,
  percentile_cont(0.1) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct10,
  percentile_cont(0.25) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct25,
  percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct50,
  percentile_cont(0.75) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct75,
  percentile_cont(0.9) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct90
FROM airport_trips t
  INNER JOIN nyct2010 n ON t.pickup_nyct2010_gid = n.gid
WHERE duration_in_minutes IS NOT NULL
GROUP BY airport_code, n.ntacode, hour;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L273-L292
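For a rough feel of what percentile_cont(0.5) computes, here is a command-line analog that sorts a list and picks the middle value. The durations are made-up numbers; the real calculation runs in Postgres as shown above.

```shell
# Sort made-up trip durations (minutes) and print the median.
printf '22\n35\n28\n41\n30\n' | sort -n |
awk '{ a[NR] = $1 } END { print a[int((NR + 1) / 2)] }'
# prints 30
```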
Airport traffic
Bridge & Tunnel analysis
- Where does the B&T crowd hang out?
- Look at taxi pickups around Penn
Station on Saturday evenings, see where they go (Murray Hill)
- Use R’s ggmap package to plot
data overlaid on Google Maps
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R#L425-L460
How popular is Uber?
What census tract has the most growth in taxi activity?
http://toddwschneider.com/data/taxi/northside_animation.gif
More fun stuff in the full post
- On the realism of Die Hard 3
- Relationship between the weather and taxi/Uber
usage
- Neighborhoods with most nightlife according to
taxi activity
- East Hampton privacy concerns
- What time do investment bankers arrive at work?
Live coding interlude
Question: what happens to taxi activity during the Super Bowl?
Medium data analysis tips
What is “medium” data?
No clear answer, but my rough thinking:
- Tiny: fits in spreadsheet
- Small: doesn’t fit in spreadsheet, but fits in RAM
- Medium: too big for RAM, but fits on local hard disk
- Big: too big for local disk, has to be distributed across
many nodes
Personal experience with medium-sized datasets
- Wedding Crunchers: New York Times weddings section
- Rap Stats: corpus of rap lyrics
- Gambletron 2000: realtime betting data for sports and politics
- Loan level mortgage data from Fannie Mae and Freddie Mac
- Taxi/Uber data
- Citi Bike data
Use the right tool for the job
My personal toolkit (yours may vary!):
- PostgreSQL for storing and aggregating data. Geospatial
calculations with PostGIS extension
- R for modeling and plotting
- Command line tools for looping through files, loading data,
text processing on input data with sed, awk, etc.
- Ruby for making API calls, scraping websites, running web
servers, and sometimes using local rails apps to organize relational data
R + PostgreSQL
- The R ↔ Postgres link is invaluable! Use R and
Postgres for the things they’re respectively best at
- Postgres: persisting data in tables, rote number
crunching
- R: calibrating models, plotting
- RPostgreSQL package allows querying Postgres
from within R
Postgres tips
- COPY command is your friend for reading in
raw data from text files
- Favor creating new tables over updating large
tables
- Create indexes after importing data
Tip: pre-aggregate
- Think about how you’re going to access the data, and
consider creating intermediate aggregated tables which can be used as building blocks for later analysis
- Example: number of taxi trips grouped by pickup
census tract and date/time truncated to the hour
- Resulting table is only 30 million rows, easier to work
with than full trips table, and can still answer lots of interesting questions
Pre-aggregating example
CREATE TABLE hourly_pickups AS
SELECT
  date_trunc('hour', pickup_datetime) AS pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid,
  COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY pickup_hour, cab_type_id, pickup_nyct2010_gid;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
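The truncate-and-count idea behind that query can be sketched with command-line tools on made-up rows of `pickup_datetime,tract_id` (the real aggregation is the SQL above):

```shell
# Truncate each timestamp to the hour, then count rows per (hour, tract).
printf '2015-06-01 09:15:00,101\n2015-06-01 09:45:00,101\n2015-06-01 10:05:00,102\n' |
awk -F, '{ split($1, dt, ":"); print dt[1] ":00:00," $2 }' |
sort | uniq -c
```

The two 9 o'clock trips in tract 101 collapse into a single row with a count of 2, which is how a billion trips shrink to ~30 million aggregated rows.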
Miscellany
- Start small, e.g. make sure your loading process runs on a
loop of 2 files before looping through all files
- Add an auto-incrementing primary key with Postgres
serial data type
- Get an external hard drive for raw data
- Use pg_dump to create a compressed version of full
dataset, save it on an external drive and/or S3 for easy restore
- The wheel was invented on Stack Overflow
But wait, there’s more!
“Content marketing”
- Sounds really lame, but is important
- “The default state of the internet is that no one
cares” —Priceonomics
How to get people to read your work
- It has to be interesting. If you’re not excited,
probably nobody else is either
- Most people are distracted, and they read things in
“fast scroll” mode. Optimize for them
- The questions you ask are more important than
the methods you use to answer them
Specific tips
- Write in short paragraphs with straightforward
language
- Use plenty of section headers
- Good ratio of pictures to text
- Avoid the dreaded “wall of text”
More tips
- It’s not a mystery novel, don’t bury the lede: the best
thing should go at the beginning
- If you do include technical detail, assume that a large
portion of your audience will bail when they see it, so probably put it toward the end
- Make sure everything looks okay on mobile devices
- Learn a bit of web development (HTML, CSS,
JavaScript); it can be very helpful with design and interactivity
Promote yourself
- Nobody will do it for you
- Email a bunch of people (journalists, bloggers,
friends) you think might be interested in your work
- Aggregator sites like Hacker News and reddit can
generate lots of traffic
- Find and post in other forums related to your work
- Try to have at least one shareable “nugget”
Above all…
- Have fun!
- Keep an inquisitive mind.