
SLIDE 1

DIY “Medium Data” Analysis: NYC Taxi Trips and Beyond

Todd Schneider
January 20, 2016
todd@toddwschneider.com

SLIDE 2

Where to find me

  • toddwschneider.com
  • github.com/toddwschneider
  • @todd_schneider
  • toddsnyder

SLIDE 3

Things I’ll talk about

  • Taxi and Uber data
  • Live coding demo
  • Medium data analysis tips
SLIDE 4

Taxi and Uber Data

http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

SLIDE 5

Citi Bike Data

http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/

SLIDE 6

NYC Taxi and Uber Data

  • Taxi & Limousine Commission released public, trip-level data for over 1.1 billion taxi rides, 2009–2015
  • Some public Uber data available as well, thanks to a FOIL request by FiveThirtyEight
  • Lots of great work by lots of people

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

SLIDE 7

Why analyze taxi data?

  • “Because it’s there”
  • More precisely: it’s fun, it’s interesting, and it’s computationally feasible on a commodity laptop

SLIDE 8

(Taxi) Analysis Overview

  1. Get raw data
  2. Write code to process raw data into something more useful
  3. Analyze data
  4. Write about what you found out
SLIDE 9

Taxi Analysis Tools

  • PostgreSQL
  • PostGIS
  • R
  • Command line

https://github.com/toddwschneider/nyc-taxi-data

SLIDE 10

Raw data processing goals

  • Load flat files of varying file formats into a unified, persistent PostgreSQL database that we can use to answer questions about the data
  • Do some one-time calculations to augment the raw data
  • For taxi data in particular, we want to answer neighborhood-based questions, so we’ll map latitude/longitude coordinates to NYC census tracts

SLIDE 11
trips table schema

  • trips table will contain one row for each taxi/Uber trip record*
  • Most fields come from raw data, with some renaming
  • Add fields for census tracts, which we know we’re going to use

SLIDE 12

Processing raw data: The reality

  • Raw data is often messy and can require massaging
  • Not fun, takes a while, but is essential
  • Specifically: we have to plan ahead a bit, anticipate usage patterns and the questions we’re going to ask, then decide on a schema

SLIDE 13

Quick peek at some raw data…

SLIDE 14

Specific issues encountered with raw taxi data

  • Some files contain empty lines and unquoted carriage returns 😑
  • Raw data files have different formats, even within the same cab type 😖
  • Some files contain extra columns in every row 😡
  • Some files contain extra columns in only some rows 😢
SLIDE 15

How do we load all of these files into the database?

  • One at a time!
  • A bash script loops through each raw data file; for each file, it executes code to process the data and insert records into the trips table

https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh
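The per-file cleanup could be sketched in Python as follows; the repo itself does this with bash and sed, so `clean_raw_csv` and its trim/pad column handling are illustrative assumptions, not the repo's actual code:

```python
import csv

def clean_raw_csv(text, expected_columns):
    """Illustrative cleanup for one raw file: drop empty lines, flatten
    unquoted carriage returns, and force every row to the expected column
    count before handing the rows to the database."""
    # Normalize line endings; a stray unquoted \r becomes a space
    text = text.replace("\r\n", "\n").replace("\r", " ")
    lines = [ln for ln in text.split("\n") if ln.strip()]  # drop empty lines
    cleaned = []
    for row in csv.reader(lines):
        if len(row) > expected_columns:
            row = row[:expected_columns]                 # extra columns: trim
        else:
            row += [""] * (expected_columns - len(row))  # short rows: pad
        cleaned.append(row)
    return cleaned
```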

SLIDE 16

How do we map latitude and longitude to census tracts?

  • PostGIS!
  • Geographic information system (GIS) for PostgreSQL
  • Can do calculations of the form, “is a point inside a polygon?”
  • Every pickup/drop off is a point, NYC’s census tracts are polygons

SLIDE 17

NYC Census Tracts

  • 2,166 tracts
  • 196 neighborhood tabulation areas (NTAs)

SLIDE 18

Shapefiles

  • Shapefile format describes geometries like points, lines, polygons
  • Many shapefiles publicly available, e.g. NYC provides a shapefile that contains definitions for all census tracts and NTAs
  • PostGIS includes functionality to import shapefiles
SLIDE 19

Quick peek at imported shapefile…

SLIDE 20

PostGIS: ST_Within()

  • ST_Within(geom A, geom B) function returns true if and only if A is entirely within B
  • A = pickup or drop off point
  • B = NYC census tract polygon
SLIDE 21

ST_Within() Usage

SELECT
  raw_trips.id AS trip_id,
  nyct2010.gid AS pickup_tract_id
FROM raw_trips, nyct2010
WHERE ST_Within(raw_trips.pickup, nyct2010.geom);
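For intuition, the kind of test a point-in-polygon query performs can be sketched in Python with the classic ray-casting algorithm; this is an illustrative stand-in, not how PostGIS actually implements ST_Within:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a horizontal ray from (x, y) and count how
    many polygon edges it crosses; an odd count means the point is inside.
    polygon is a list of (x, y) vertex tuples."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Edge straddles the ray's y-level, and the crossing lies to the right
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```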

SLIDE 22

Spatial Indexes

  • Problem: determining whether a point is inside an arbitrary polygon is computationally intensive and slow
  • PostGIS spatial indexes to the rescue!
SLIDE 23

Spatial indexes in a nutshell

[Diagram: a census tract polygon enclosed by its rectangular bounding box]

SLIDE 24

Spatial Indexes

  • Determining whether a point is inside a rectangle is easy!
  • Spatial indexes store rectangular bounding boxes for polygons, then when determining if a point is inside a polygon, calculate in 2 steps:
    1. Is the point inside the polygon’s bounding box?
    2. If so, is the point inside the polygon itself?
  • Most of the time the cheap first check will be false, so we can skip the expensive second step
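The pruning effect of step 1 can be sketched in Python; `bounding_box`, `candidate_tracts`, and the tract names are made up for illustration:

```python
def bounding_box(polygon):
    """The rectangle a spatial index stores for a polygon: min/max of each axis."""
    xs = [px for px, py in polygon]
    ys = [py for px, py in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def candidate_tracts(x, y, tract_bboxes):
    """Step 1: cheap rectangle checks against every tract's bounding box.
    Only the few surviving candidates need the expensive exact
    point-in-polygon test of step 2."""
    return [tract_id
            for tract_id, (xmin, ymin, xmax, ymax) in tract_bboxes.items()
            if xmin <= x <= xmax and ymin <= y <= ymax]
```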

SLIDE 25

Putting it all together

  • Download NYC census tracts shapefile, import into database, create spatial index
  • Download raw taxi/Uber data files and loop through them, one file at a time
  • For each file: fix data issues, load into database, calculate census tracts with ST_Within()
  • Wait 3 days and voilà!
SLIDE 26

Analysis, a.k.a. “the fun part”

  • Ask fun and interesting questions
  • Try to answer them
  • Rinse and repeat
SLIDE 27

Taxi maps

  • Question: what does a map of every taxi pickup and drop off look like?
  • Each trip has a pickup and drop off location; plot a bunch of dots at those locations
  • Made entirely in R using ggplot2
SLIDE 28

Taxi maps

SLIDE 29

Taxi maps preprocess

  • Problem: R can’t fit 1.1 billion rows in memory
  • Solution: preprocess the data by rounding lat/long to 4 decimal places (~10 meters) and counting the number of trips at each aggregated point

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
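The rounding-and-counting step can be sketched in Python (the real preprocessing happens in the linked SQL; the function name here is made up):

```python
from collections import Counter

def aggregate_points(coords, places=4):
    """Round each (lat, lon) to `places` decimals (~10 meters at 4) and count
    trips per rounded point, so billions of dots collapse into a table R can plot."""
    return Counter((round(lat, places), round(lon, places)) for lat, lon in coords)
```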

SLIDE 30

Render maps in R

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R

SLIDE 31

Data reliability

Every other comment on reddit:

SLIDE 32

Airport traffic

  • Question: how long does it take to get to the airport?
  • LGA, JFK, and EWR are each their own census tracts
  • Get all trips that dropped off in one of those tracts
  • Calculate travel times from neighborhoods to airports
SLIDE 33

Airport traffic preprocess

CREATE TABLE airport_trips AS
SELECT
  …,
  CASE dropoff_nyct2010_gid
    WHEN 1840 THEN 'LGA'
    WHEN 2056 THEN 'JFK'
    WHEN 2167 THEN 'EWR'
  END AS airport_code,
  EXTRACT(HOUR FROM pickup_datetime) AS hour
FROM trips
WHERE dropoff_nyct2010_gid IN (1840, 2056, 2167)
  AND EXTRACT(DOW FROM pickup_datetime) NOT IN (0, 6);

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L242-L269

SLIDE 34

Airport traffic summary

CREATE TABLE airport_trips_summary AS
SELECT
  n.ntacode,
  airport_code,
  hour,
  COUNT(*) AS trips_count,
  percentile_cont(0.1) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct10,
  percentile_cont(0.25) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct25,
  percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct50,
  percentile_cont(0.75) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct75,
  percentile_cont(0.9) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct90
FROM airport_trips t
  INNER JOIN nyct2010 n ON t.pickup_nyct2010_gid = n.gid
WHERE duration_in_minutes IS NOT NULL
GROUP BY airport_code, n.ntacode, hour;

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L273-L292
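percentile_cont computes linearly interpolated percentiles; the same calculation for one group's durations can be done in plain Python with statistics.quantiles (an illustrative aside, not part of the repo):

```python
from statistics import quantiles

def travel_time_percentiles(durations_in_minutes):
    """Interpolated percentiles for one (neighborhood, airport, hour) group;
    method="inclusive" uses the same linear interpolation as percentile_cont."""
    cuts = quantiles(durations_in_minutes, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in (10, 25, 50, 75, 90)}
```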

SLIDE 35

Airport traffic

SLIDE 36

Bridge & Tunnel analysis

  • Where does the B&T crowd hang out?
  • Look at taxi pickups around Penn Station on Saturday evenings, see where they go (Murray Hill)
  • Use R’s ggmap package to plot data overlaid on Google Maps

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R#L425-L460

SLIDE 37

How popular is Uber?

SLIDE 38

What census tract has the most growth in taxi activity?

http://toddwschneider.com/data/taxi/northside_animation.gif

SLIDE 39

More fun stuff in the full post

  • On the realism of Die Hard 3
  • Relationship between the weather and taxi/Uber usage
  • Neighborhoods with the most nightlife, according to taxi activity
  • East Hampton privacy concerns
  • What time do investment bankers arrive at work?
SLIDE 40

Live coding interlude

SLIDE 41

Question: what happens to taxi activity during the Super Bowl?

SLIDE 42

Medium data analysis tips

SLIDE 43

What is “medium” data?

No clear answer, but my rough thinking:

  • Tiny: fits in spreadsheet
  • Small: doesn’t fit in spreadsheet, but fits in RAM
  • Medium: too big for RAM, but fits on local hard disk
  • Big: too big for local disk, has to be distributed across many nodes

SLIDE 44

Personal experience with medium-sized datasets

  • Wedding Crunchers: New York Times weddings section
  • Rap Stats: corpus of rap lyrics
  • Gambletron 2000: realtime betting data for sports and politics
  • Loan level mortgage data from Fannie Mae and Freddie Mac
  • Taxi/Uber data
  • Citi Bike data
SLIDE 45

Use the right tool for the job

My personal toolkit (yours may vary!):

  • PostgreSQL for storing and aggregating data; geospatial calculations with the PostGIS extension
  • R for modeling and plotting
  • Command line tools for looping through files, loading data, text processing on input data with sed, awk, etc.
  • Ruby for making API calls, scraping websites, running web servers, and sometimes using local Rails apps to organize relational data

SLIDE 46

R + PostgreSQL

  • The R ↔ Postgres link is invaluable! Use R and Postgres for the things they’re respectively best at
  • Postgres: persisting data in tables, rote number crunching
  • R: calibrating models, plotting
  • RPostgreSQL package allows querying Postgres from within R

SLIDE 47

Postgres tips

  • COPY command is your friend for reading in raw data from text files
  • Favor creating new tables over updating large tables
  • Create indexes after importing data
SLIDE 48

Tip: pre-aggregate

  • Think about how you’re going to access the data, and consider creating intermediate aggregated tables which can be used as building blocks for later analysis
  • Example: number of taxi trips grouped by pickup census tract and date/time truncated to the hour
  • Resulting table is only 30 million rows, easier to work with than the full trips table, and can still answer lots of interesting questions

SLIDE 49

Pre-aggregating example

CREATE TABLE hourly_pickups AS
SELECT
  date_trunc('hour', pickup_datetime) AS pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid,
  COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY pickup_hour, cab_type_id, pickup_nyct2010_gid;

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
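The same truncate-and-count aggregation can be sketched in Python to show what the table holds; the trip tuples and function name are assumptions for illustration:

```python
from collections import Counter
from datetime import datetime

def hourly_pickups(trips):
    """One entry per (pickup hour, cab type, pickup tract) with a trip count.
    `trips` yields (pickup_datetime, cab_type_id, pickup_nyct2010_gid) tuples;
    rows with no tract are dropped, like the WHERE clause above."""
    return Counter(
        (ts.replace(minute=0, second=0, microsecond=0), cab_type_id, gid)
        for ts, cab_type_id, gid in trips
        if gid is not None
    )
```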

SLIDE 50

Miscellany

  • Start small, e.g. make sure your loading process runs on a loop of 2 files before looping through all files
  • Add an auto-incrementing primary key with the Postgres serial data type
  • Get an external hard drive for raw data
  • Use pg_dump to create a compressed version of the full dataset, save it on an external drive and/or S3 for easy restore
  • The wheel was invented on Stack Overflow
SLIDE 51

But wait, there’s more!

SLIDE 52

“Content marketing”

  • Sounds really lame, but is important
  • “The default state of the internet is that no one cares” —Priceonomics

SLIDE 53

How to get people to read your work

  • It has to be interesting. If you’re not excited, probably nobody else is either
  • Most people are distracted, and they read things in “fast scroll” mode. Optimize for them
  • The questions you ask are more important than the methods you use to answer them

SLIDE 54

Specific tips

  • Write in short paragraphs with straightforward language
  • Use plenty of section headers
  • Good ratio of pictures to text
  • Avoid the dreaded “wall of text”
SLIDE 55

More tips

  • It’s not a mystery novel, don’t bury the lede: the best thing should go at the beginning
  • If you do include technical detail, assume that a large portion of your audience will bail when they see it, so probably put it toward the end
  • Make sure everything looks okay on mobile devices
  • Learn a bit of web development (HTML, CSS, JavaScript), it can be very helpful with design and interactivity

SLIDE 56

Promote yourself

  • Nobody will do it for you
  • Email a bunch of people (journalists, bloggers, friends) you think might be interested in your work
  • Aggregator sites like Hacker News and reddit can generate lots of traffic
  • Find and post in other forums related to your work
  • Try to have at least one shareable “nugget”
SLIDE 57

Above all…

  • Have fun!
  • Keep an inquisitive mind. Observe stuff happening around you, ask questions about it, try to answer those questions

SLIDE 58

“Increasingly, the limiting factor of data analysis is not computational horsepower, but human curiosity and creativity” —me

SLIDE 59

Thanks!

todd@toddwschneider.com