DIY Medium Data Analysis: NYC Taxi Trips and Beyond
Todd Schneider
January 20, 2016
todd@toddwschneider.com
Where to find me
toddwschneider.com github.com/toddwschneider @todd_schneider toddsnyder
Things I’ll talk about
- Taxi and Uber data
- Live coding demo
- Medium data analysis tips
Taxi and Uber Data
http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
Citi Bike Data
http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
NYC Taxi and Uber Data
- Taxi & Limousine Commission released
public, trip-level data for over 1.1 billion taxi rides 2009–2015
- Some public Uber data available as
well, thanks to a FOIL request by FiveThirtyEight
- Lots of great work by lots of people
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Why analyze taxi data?
- “Because it’s there”
- More precisely: it’s fun, it’s interesting, and
it’s computationally feasible on a commodity laptop
(Taxi) Analysis Overview
- 1. Get raw data
- 2. Write code to process raw data into
something more useful
- 3. Analyze data
- 4. Write about what you found out
Taxi Analysis Tools
- PostgreSQL
- PostGIS
- R
- Command line
https://github.com/toddwschneider/nyc-taxi-data
Raw data processing goals
- Load flat files of varying file formats into a unified,
persistent PostgreSQL database that we can use to answer questions about the data
- Do some one-time calculations to augment the raw
data
- For taxi data in particular, we want to answer
neighborhood-based questions, so we’ll map latitude/longitude coordinates to NYC census tracts
trips table schema
- trips table will contain one row for each taxi/Uber trip record*
- Most fields come from raw data, with some renaming
- Add fields for census tracts, which we know we're going to use
Processing raw data: The reality
- Raw data is often messy and can require massaging
- Not fun, takes a while, but essential
- Specifically: we have to plan ahead a bit,
anticipate usage patterns, questions we’re going to ask, then decide on schema
Quick peek at some raw data…
Specific issues encountered with raw taxi data
- Some files contain empty lines and unquoted carriage
returns 😑
- Raw data files have different formats even within the
same cab type 😖
- Some files contain extra columns in every row 😡
- Some files contain extra columns in only some rows 😢
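Issues like the empty lines and stray carriage returns above can usually be fixed with standard command-line tools before loading. A minimal sketch, with a made-up sample file (the real cleanup lives in the repo's import script):

```shell
# Hypothetical cleanup: strip carriage returns, then drop empty lines.
# The sample file and its contents are illustrative, not real taxi data.
printf 'a,b\r\n\r\nc,d\r\n' > /tmp/raw_sample.csv
tr -d '\r' < /tmp/raw_sample.csv | sed '/^$/d'
# prints:
# a,b
# c,d
```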
How do we load all of these files into the database?
- One at a time!
- Bash script loops through each raw data file, for
each file it executes code to process data and insert records into trips table
https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh
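The shape of that loop can be sketched as follows. This is a hedged toy version: the real import_trip_data.sh pipes each cleaned file into psql, while this sketch just cleans made-up files into a staging directory so the loop structure is visible. All paths and file names are illustrative.

```shell
# Create some fake raw files, then loop through them one at a time,
# cleaning each (strip \r, drop empty lines) into a staging directory.
mkdir -p /tmp/raw /tmp/staging
printf 'x,y\r\n1,2\r\n' > /tmp/raw/trips_2015-01.csv
printf 'x,y\r\n3,4\r\n' > /tmp/raw/trips_2015-02.csv

for f in /tmp/raw/trips_*.csv; do
  tr -d '\r' < "$f" | sed '/^$/d' > "/tmp/staging/$(basename "$f")"
  # the real script would now load the cleaned file into the trips table
done

ls /tmp/staging
```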
How do we map latitude and longitude to census tracts?
- PostGIS!
- Geographic information system (GIS) for
PostgreSQL
- Can do calculations of the form, “is a point inside a
polygon?”
- Every pickup/drop off is a point, NYC’s census
tracts are polygons
NYC Census Tracts
- 2,166 tracts
- 196 neighborhood
tabulation areas (NTAs)
Shapefiles
- Shapefile format describes geometries like points,
lines, polygons
- Many shapefiles publicly available, e.g. NYC
provides a shapefile that contains definitions for all census tracts and NTAs
- PostGIS includes functionality to import shapefiles
Quick peek at imported shapefile…
PostGIS: ST_Within()
- ST_Within(geom A, geom B) function returns
true if and only if A is entirely within B
- A = pickup or drop off point
- B = NYC census tract polygon
ST_Within() Usage
SELECT
  raw_trips.id AS trip_id,
  nyct2010.gid AS pickup_tract_id
FROM raw_trips, nyct2010
WHERE ST_Within(raw_trips.pickup, nyct2010.geom);
Spatial Indexes
- Problem: determining whether a point is inside an
arbitrary polygon is computationally intensive and slow
- PostGIS spatial indexes to the rescue!
Spatial indexes in a nutshell
[diagram: a census tract polygon and its rectangular bounding box]
Spatial Indexes
- Determining whether a point is inside a rectangle is easy!
- Spatial indexes store rectangular bounding boxes for
polygons, then when determining if a point is inside a polygon, calculate in 2 steps:
- 1. Is the point inside the polygon’s bounding box?
- 2. If so, is the point inside the polygon itself?
- Most of the time the cheap first check will be false, then
we can skip the expensive second step
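The cheap first step can be illustrated with a toy awk snippet. The point and bounding-box coordinates below are invented for illustration; in practice PostGIS performs this check internally via the spatial index.

```shell
# Toy illustration of step 1: test a point against a rectangular
# bounding box before doing any expensive point-in-polygon math.
echo "-73.99 40.75" | awk '{
  # hypothetical bounding box of a census tract
  xmin = -74.00; xmax = -73.98; ymin = 40.74; ymax = 40.76
  if ($1 >= xmin && $1 <= xmax && $2 >= ymin && $2 <= ymax)
    print "inside bbox: worth running the exact point-in-polygon test"
  else
    print "outside bbox: skip the expensive test"
}'
```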
Putting it all together
- Download NYC census tracts shapefile, import
into database, create spatial index
- Download raw taxi/Uber data files and loop
through them, one file at a time
- For each file: fix data issues, load into database,
calculate census tracts with ST_Within()
- Wait 3 days and voilà!
Analysis, a.k.a.“the fun part”
- Ask fun and interesting
questions
- Try to answer them
- Rinse and repeat
Taxi maps
- Question: what does a map of every taxi pickup
and drop off look like?
- Each trip has a pickup and drop off location, plot a
bunch of dots at those locations
- Made entirely in R using ggplot2
Taxi maps
Taxi maps preprocess
- Problem: R can’t fit 1.1 billion rows in memory
- Solution: preprocess data by rounding
lat/long to 4 decimal places (~10 meters), count number of trips at each aggregated point
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
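The same round-and-count idea can be sketched with command-line tools. The coordinates below are invented sample points, not real trips, and the real aggregation runs in Postgres (see the linked SQL):

```shell
# Round lat/long to 4 decimal places (~10 meters), then count how many
# trips fall at each rounded point. Sample coordinates are made up.
printf '40.75123,-73.99341\n40.75124,-73.99342\n40.70001,-74.00001\n' |
awk -F, '{ printf "%.4f,%.4f\n", $1, $2 }' |
sort | uniq -c
```

The first two points collapse to the same rounded location with a count of 2, which is exactly the reduction that makes the data small enough for R.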
Render maps in R
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R
Data reliability
Every other comment on reddit:
Airport traffic
- Question: how long does it take to get to the airport?
- LGA, JFK, and EWR are each their own census
tracts
- Get all trips that dropped off in one of those tracts
- Calculate travel times from neighborhoods to airports
Airport traffic preprocess
CREATE TABLE airport_trips AS
SELECT …,
  CASE dropoff_nyct2010_gid
    WHEN 1840 THEN 'LGA'
    WHEN 2056 THEN 'JFK'
    WHEN 2167 THEN 'EWR'
  END AS airport_code,
  EXTRACT(HOUR FROM pickup_datetime) AS hour
FROM trips
WHERE dropoff_nyct2010_gid IN (1840, 2056, 2167)
  AND EXTRACT(DOW FROM pickup_datetime) NOT IN (0, 6);
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L242-L269
Airport traffic summary
CREATE TABLE airport_trips_summary AS
SELECT
  n.ntacode,
  airport_code,
  hour,
  COUNT(*) AS trips_count,
  percentile_cont(0.1) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct10,
  percentile_cont(0.25) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct25,
  percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct50,
  percentile_cont(0.75) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct75,
  percentile_cont(0.9) WITHIN GROUP (ORDER BY duration_in_minutes) AS pct90
FROM airport_trips t
  INNER JOIN nyct2010 n ON t.pickup_nyct2010_gid = n.gid
WHERE duration_in_minutes IS NOT NULL
GROUP BY airport_code, n.ntacode, hour;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L273-L292
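For a rough feel of what percentile_cont(0.5) computes, here is a command-line analog that sorts a list and picks the middle value. The durations are made-up numbers; the real calculation runs in Postgres as shown above.

```shell
# Sort made-up trip durations (minutes) and print the median.
printf '22\n35\n28\n41\n30\n' | sort -n |
awk '{ a[NR] = $1 } END { print a[int((NR + 1) / 2)] }'
# prints 30
```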
Airport traffic
Bridge & Tunnel analysis
- Where does the B&T crowd hang out?
- Look at taxi pickups around Penn
Station on Saturday evenings, see where they go (Murray Hill)
- Use R’s ggmap package to plot
data overlaid on Google Maps
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R#L425-L460
How popular is Uber?
What census tract has the most growth in taxi activity?
http://toddwschneider.com/data/taxi/northside_animation.gif
More fun stuff in the full post
- On the realism of Die Hard 3
- Relationship between the weather and taxi/Uber
usage
- Neighborhoods with most nightlife according to
taxi activity
- East Hampton privacy concerns
- What time do investment bankers arrive at work?
Live coding interlude
Question: what happens to taxi activity during the Super Bowl?
Medium data analysis tips
What is “medium” data?
No clear answer, but my rough thinking:
- Tiny: fits in spreadsheet
- Small: doesn’t fit in spreadsheet, but fits in RAM
- Medium: too big for RAM, but fits on local hard disk
- Big: too big for local disk, has to be distributed across
many nodes
Personal experience with medium-sized datasets
- Wedding Crunchers: New York Times weddings section
- Rap Stats: corpus of rap lyrics
- Gambletron 2000: realtime betting data for sports and politics
- Loan level mortgage data from Fannie Mae and Freddie Mac
- Taxi/Uber data
- Citi Bike data
Use the right tool for the job
My personal toolkit (yours may vary!):
- PostgreSQL for storing and aggregating data. Geospatial
calculations with PostGIS extension
- R for modeling and plotting
- Command line tools for looping through files, loading data,
text processing on input data with sed, awk, etc.
- Ruby for making API calls, scraping websites, running web
servers, and sometimes using local rails apps to organize relational data
R + PostgreSQL
- The R ↔ Postgres link is invaluable! Use R and
Postgres for the things they’re respectively best at
- Postgres: persisting data in tables, rote number
crunching
- R: calibrating models, plotting
- RPostgreSQL package allows querying Postgres
from within R
Postgres tips
- COPY command is your friend for reading in
raw data from text files
- Favor creating new tables over updating large
tables
- Create indexes after importing data
Tip: pre-aggregate
- Think about how you’re going to access the data, and
consider creating intermediate aggregated tables which can be used as building blocks for later analysis
- Example: number of taxi trips grouped by pickup
census tract and date/time truncated to the hour
- Resulting table is only 30 million rows, easier to work
with than full trips table, and can still answer lots of interesting questions
Pre-aggregating example
CREATE TABLE hourly_pickups AS
SELECT
  date_trunc('hour', pickup_datetime) AS pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid,
  COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY pickup_hour, cab_type_id, pickup_nyct2010_gid;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
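The truncate-and-count idea behind that query can be sketched with command-line tools on made-up rows of `pickup_datetime,tract_id` (the real aggregation is the SQL above):

```shell
# Truncate each timestamp to the hour, then count rows per (hour, tract).
printf '2015-06-01 09:15:00,101\n2015-06-01 09:45:00,101\n2015-06-01 10:05:00,102\n' |
awk -F, '{ split($1, dt, ":"); print dt[1] ":00:00," $2 }' |
sort | uniq -c
```

The two 9 o'clock trips in tract 101 collapse into a single row with a count of 2, which is how a billion trips shrink to ~30 million aggregated rows.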
Miscellany
- Start small, e.g. make sure your loading process runs on a
loop of 2 files before looping through all files
- Add an auto-incrementing primary key with Postgres
serial data type
- Get an external hard drive for raw data
- Use pg_dump to create a compressed version of full
dataset, save it on an external drive and/or S3 for easy restore
- The wheel was invented on Stack Overflow
But wait, there’s more!
“Content marketing”
- Sounds really lame, but is important
- “The default state of the internet is that no one
cares” —Priceonomics
How to get people to read your work
- It has to be interesting. If you’re not excited,
probably nobody else is either
- Most people are distracted, and they read things in
“fast scroll” mode. Optimize for them
- The questions you ask are more important than
the methods you use to answer them
Specific tips
- Write in short paragraphs with straightforward
language
- Use plenty of section headers
- Good ratio of pictures to text
- Avoid the dreaded “wall of text”
More tips
- It’s not a mystery novel, don’t bury the lede: the best
thing should go at the beginning
- If you do include technical detail, assume that a large
portion of your audience will bail when they see it, so probably put it toward the end
- Make sure everything looks okay on mobile devices
- Learn a bit of web development (HTML, CSS,
JavaScript); it can be very helpful with design and interactivity
Promote yourself
- Nobody will do it for you
- Email a bunch of people (journalists, bloggers,
friends) you think might be interested in your work
- Aggregator sites like Hacker News and reddit can
generate lots of traffic
- Find and post in other forums related to your work
- Try to have at least one shareable “nugget”
Above all…
- Have fun!
- Keep an inquisitive mind.