Preparing Flight Delay Data
PARALLEL PROGRAMMING WITH DASK IN PYTHON
Dhavide Aruliah, Director of Training, Anaconda

Case study: Analyzing flight delays

Limitations of Dask DataFrames
Reading data into Dask DataFrames:
- A single file
- Using glob on many files
Limitations:
- Unsupported file formats
- Cleaning files independently
- Nested subdirectories tricky with glob

Sample account data
accounts/Alice.csv:
    date,amount
    2016-01-31,103.15
    2016-02-25,114.17
    2016-03-06,4.03
    2016-05-20,150.48
accounts/Bob.csv:
    date,amount
    2016-01-04,99.68
    2016-02-09,146.41
    2016-02-21,-42.94
    2016-03-14,0.26

Reading/cleaning in a function
    import pandas as pd
    from dask import delayed

    @delayed
    def pipeline(filename, account_name):
        df = pd.read_csv(filename)
        df['account_name'] = account_name
        return df

Using dd.from_delayed()
    delayed_dfs = []
    for account in ['Bob', 'Alice', 'Dave']:
        fname = 'accounts/{}.csv'.format(account)
        delayed_dfs.append(pipeline(fname, account))

    import dask.dataframe as dd
    dask_df = dd.from_delayed(delayed_dfs)
    dask_df['amount'].mean().compute()

    10.56476

Flight delays and weather
Cleaning flight delays:
- Use .replace(): 0 → NaN
Cleaning weather data:
- 'PrecipitationIn': text → numeric
- Add column for airport code

Flight delays data
    df = pd.read_csv('flightdelays-2016-1.csv')
    df.columns

    Index(['FL_DATE', 'UNIQUE_CARRIER', 'FL_NUM', 'ORIGIN', 'ORIGIN_CITY_NAME',
           'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM', 'DEST', 'DEST_CITY_NAME',
           'DEST_STATE_ABR', 'DEST_STATE_NM', 'CRS_DEP_TIME', 'DEP_DELAY',
           'CRS_ARR_TIME', 'ARR_DELAY', 'CANCELLED', 'DIVERTED',
           'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY',
           'LATE_AIRCRAFT_DELAY', 'Unnamed: 22'],
          dtype='object')

Flight delays data
    df['WEATHER_DELAY'].tail()

    89160    NaN
    89161    0.0
    89162    NaN
    89163    NaN
    89164    NaN
    Name: WEATHER_DELAY, dtype: float64

Replacing values
    series

    0    6
    1    0
    2    6
    3    5
    4    7
    dtype: int64

    new_series = series.replace(6, np.nan)
    new_series

    0    NaN
    1    0.0
    2    NaN
    3    5.0
    4    7.0
    dtype: float64
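
A small runnable sketch of the replacement shown above, followed by the 0 → NaN case planned for the delay columns. The `delays` values are made up for illustration, and treating a zero as "no delay recorded" rather than a genuine zero is an assumption about the dataset.

```python
import numpy as np
import pandas as pd

# The series from the slide: replacing 6 with NaN forces a float dtype,
# because NaN cannot be stored in an integer column.
series = pd.Series([6, 0, 6, 5, 7])
new_series = series.replace(6, np.nan)

# Hypothetical delay values (minutes); zeros are treated as missing.
delays = pd.Series([0.0, 12.0, 0.0, 45.0])
cleaned = delays.replace(0, np.nan)

# mean() skips NaN by default, so the two averages differ.
mean_raw = delays.mean()
mean_cleaned = cleaned.mean()

print(new_series)
print(mean_raw, mean_cleaned)
```

The point of the replacement is visible in the two means: with zeros kept, every undelayed flight pulls the average down; with NaN, only flights that actually report a delay contribute.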

Let's practice!

Preparing Weather Data
Dhavide Aruliah, Director of Training, Anaconda

Daily weather data
    import pandas as pd
    df = pd.read_csv('DEN.csv', parse_dates=True, index_col='Date')
    df.columns

    Index(['Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
           'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
           'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn',
           'Mean Sea Level PressureIn', 'Min Sea Level PressureIn',
           'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles',
           'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH',
           'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees'],
          dtype='object')

Daily weather data
    df.loc['March 2016', ['PrecipitationIn','Events']].tail()

               PrecipitationIn             Events
    Date
    2016-03-27            0.00                NaN
    2016-03-28            0.00                NaN
    2016-03-29            0.04  Rain-Thunderstorm
    2016-03-30            0.04          Rain-Snow
    2016-03-31            0.01               Snow

Examining PrecipitationIn & Events columns
    df['PrecipitationIn'][0]

    '0.00'

    type(df['PrecipitationIn'][0])

    str

    df[['PrecipitationIn', 'Events']].info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 366 entries, 0 to 365
    Data columns (total 2 columns):
    PrecipitationIn    366 non-null object
    Events             115 non-null object
    dtypes: object(2)
    memory usage: 5.8+ KB

Converting to numeric values
    series

    0      0
    1      M
    2      2
    3    1.5
    4      E
    dtype: object

    new_series = pd.to_numeric(series, errors='coerce')
    new_series

    0    0.0
    1    NaN
    2    2.0
    3    1.5
    4    NaN
    dtype: float64
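
A short runnable version of the conversion above, with one extra step that is not on the slide: using the NaN mask to list which original strings could not be parsed. That check is a suggested habit, not something the course prescribes.

```python
import pandas as pd

# The text series from the slide: two entries are not numbers.
series = pd.Series(['0', 'M', '2', '1.5', 'E'])

# errors='coerce' turns unparseable entries into NaN instead of raising.
new_series = pd.to_numeric(series, errors='coerce')

# Inspect what was lost in the conversion.
mask = new_series.isna()
n_bad = int(mask.sum())
bad_values = series[mask].tolist()

print(new_series)
print(n_bad, bad_values)
```

Looking at `bad_values` before discarding them helps confirm that the coerced entries really are placeholder codes and not, say, numbers with stray whitespace or units.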

Merging & Persisting DataFrames
Dhavide Aruliah, Director of Training, Anaconda

Merging DataFrames
- Pandas: pd.merge()
- Pandas: pd.DataFrame.merge()
- Dask: dask.dataframe.merge()

Merging example
left_df:
      cat_left  value_left
    0        d           4
    1        d           9
    2        b           1
    3        d           7
    4        c           3
right_df:
      cat_right  value_right
    0         b            9
    1         c            2
    2         f            0
    3         d            8
    4         a            8

Merging example
    left_df.merge(right_df, left_on=['cat_left'],
                  right_on=['cat_right'], how='inner')

      cat_left  value_left cat_right  value_right
    0        d           4         d            8
    1        d           9         d            8
    2        d           7         d            8
    3        b           1         b            9
    4        c           3         c            2
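
The slides show the two inputs and the result but not how the frames are built. A runnable reconstruction, assuming plain pandas DataFrames with the values from the example tables:

```python
import pandas as pd

# Values transcribed from the left_df / right_df example tables.
left_df = pd.DataFrame({'cat_left': list('ddbdc'),
                        'value_left': [4, 9, 1, 7, 3]})
right_df = pd.DataFrame({'cat_right': list('bcfda'),
                         'value_right': [9, 2, 0, 8, 8]})

# Inner join: keep only categories present in both frames.
# 'f' and 'a' (right only) are dropped; 'd' matches three left rows.
merged = left_df.merge(right_df, left_on=['cat_left'],
                       right_on=['cat_right'], how='inner')

# Row order of an inner merge can differ between pandas versions,
# so sort explicitly when a stable order matters.
merged_sorted = (merged.sort_values(['cat_left', 'value_left'])
                       .reset_index(drop=True))
print(merged_sorted)
```

The same five rows come back regardless of version; only their order may differ from the listing above, which is why the sketch sorts before printing.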

Dask DataFrame pipelines
Flight delays & weather setup:
1. Read & clean 12 months of flight delay data
2. Make flight_delay dataframe with dd.from_delayed
3. Read & clean weather daily data from 5 airports
4. Make weather dataframe with dd.from_delayed
5. Merge the two dataframes

Repeated reads & performance
    import dask.dataframe as dd
    df = dd.read_csv('flightdelays-2016-*.csv')
    %time print(df.WEATHER_DELAY.mean().compute())

    2.701183508773752
    CPU times: user 3.35 s, sys: 719 ms, total: 4.07 s
    Wall time: 1.64 s

    %time print(df.WEATHER_DELAY.std().compute())

    21.230502105
    CPU times: user 3.33 s, sys: 706 ms, total: 4.04 s
    Wall time: 1.61 s

Repeated reads & performance
    %time print(df.WEATHER_DELAY.count().compute())

    192563
    CPU times: user 3.36 s, sys: 695 ms, total: 4.06 s
    Wall time: 1.66 s

Using persistence
    %time persisted_df = df.persist()

    CPU times: user 3.32 s, sys: 688 ms, total: 4.01 s
    Wall time: 1.59 s

    %time print(persisted_df.WEATHER_DELAY.mean().compute())

    2.701183508773752
    CPU times: user 15.1 ms, sys: 9.24 ms, total: 24.3 ms
    Wall time: 18.5 ms

Using persistence
    %time print(persisted_df.WEATHER_DELAY.std().compute())

    21.230502105
    CPU times: user 29.6 ms, sys: 12.5 ms, total: 42.1 ms
    Wall time: 29.5 ms

    %time print(persisted_df.WEATHER_DELAY.count().compute())

    192563
    CPU times: user 9.88 ms, sys: 2.98 ms, total: 12.9 ms
    Wall time: 9.43 ms

Final thoughts
Matthew Rocklin & Dhavide Aruli…, Instructors, Anaconda

What you've learned
How to:
- Use Dask data structures and delayed functions
- Set up data analysis pipelines with deferred computation
- ... while working with real-world data!

Next steps
- Deploying Dask on your own cluster
- Integrating with other Python libraries
- Dynamic task scheduling and data management
- https://dask.org/

Congratulations!
