Building Dask Bags & Globbing
PARALLEL PROGRAMMING WITH DASK IN PYTHON
Dhavide Aruliah, Director of Training, Anaconda

Sequences to bags
    nested_containers = [[0, 1, 2, 3], {}, [6.5, 3.14], 'Python', {'version': 3}, '']
    import dask.bag as db
    the_bag = db.from_sequence(nested_containers)
    the_bag.count().compute()
    6
    the_bag.any().compute(), the_bag.all().compute()
    (True, False)
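The `any` and `all` results above follow Python's ordinary truthiness rules: empty containers and empty strings count as false. A minimal plain-Python sketch on the same list, using only the builtins `len`, `any`, and `all` (it mirrors the bag reductions but does not use Dask; the name `falsy` is introduced here for illustration):

```python
nested_containers = [[0, 1, 2, 3], {}, [6.5, 3.14], 'Python', {'version': 3}, '']

# Number of top-level elements, the plain-Python counterpart of counting a bag.
count = len(nested_containers)

# Elements that Python treats as false: the empty dict and the empty string.
falsy = [item for item in nested_containers if not item]

has_any = any(nested_containers)  # at least one element is truthy
has_all = all(nested_containers)  # fails because of the empty elements

print(count, has_any, has_all, falsy)  # 6 True False [{}, '']
```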

Reading text files
    import dask.bag as db
    zen = db.read_text('zen')
    taken = zen.take(1)
    type(taken)
    tuple

Reading text files
    taken
    ('The Zen of Python, by Tim Peters\n',)
    zen.take(3)
    ('The Zen of Python, by Tim Peters\n',
     '\n',
     'Beautiful is better than ugly.\n')

Glob expressions
    import dask.dataframe as dd
    df = dd.read_csv('taxi/*.csv', assume_missing=True)
- taxi/*.csv is a glob expression
- taxi/*.csv matches:
    taxi/yellow_tripdata_2015-01.csv
    taxi/yellow_tripdata_2015-02.csv
    taxi/yellow_tripdata_2015-03.csv
    ...
    taxi/yellow_tripdata_2015-10.csv
    taxi/yellow_tripdata_2015-11.csv
    taxi/yellow_tripdata_2015-12.csv

Using Python's glob module
    %ls
    Alice  Dave  README   a02.txt  a04.txt  b05.txt  b07.txt  b09.txt  b11.t
    Bob    Lisa  a01.txt  a03.txt  a05.txt  b06.txt  b08.txt  b10.txt  taxi
    import glob
    txt_files = glob.glob('*.txt')
    txt_files
    ['a01.txt', 'a02.txt', ... 'b10.txt', 'b11.txt']

More glob patterns
    glob.glob('b*.txt')
    ['b05.txt',
     'b06.txt',
     'b07.txt',
     'b08.txt',
     'b09.txt',
     'b10.txt',
     'b11.txt']
    glob.glob('?0[1-6].txt')
    ['a01.txt',
     'a02.txt',
     'a03.txt',
     'a04.txt',
     'a05.txt',
     'b05.txt',
     'b06.txt']
    glob.glob('b?.txt')
    []

More glob patterns
    glob.glob('??[1-6].txt')
    ['a01.txt',
     'a02.txt',
     'a03.txt',
     'a04.txt',
     'a05.txt',
     'b05.txt',
     'b06.txt',
     'b11.txt']

Permissible glob patterns
- Filename characters (e.g., file-02_tmp.txt)
- Wildcard character *: matches 0 or more characters
- Wildcard character ?: matches exactly 1 character
- Character ranges (e.g., [0-5], [a-m], [A-Z0-9])
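The pattern rules above can be tried without touching the file system. The sketch below uses the standard-library `fnmatch` module, which applies the same wildcard syntax to plain strings; the list `names` is typed in by hand from the directory listing shown earlier, so it is an assumption rather than a real directory scan. Unlike `glob`, `fnmatch` does not treat path separators or hidden files specially.

```python
import fnmatch

# File names copied by hand from the earlier directory listing (plus README).
names = ['a01.txt', 'a02.txt', 'a03.txt', 'a04.txt', 'a05.txt',
         'b05.txt', 'b06.txt', 'b07.txt', 'b08.txt', 'b09.txt',
         'b10.txt', 'b11.txt', 'README']

# '*' matches zero or more characters.
starts_with_b = fnmatch.filter(names, 'b*.txt')

# '?' matches exactly one character, so 'b?.txt' needs a two-character stem.
one_char_after_b = fnmatch.filter(names, 'b?.txt')

# '[1-6]' matches a single character in the range 1 through 6.
zero_then_range = fnmatch.filter(names, '?0[1-6].txt')
any_two_then_range = fnmatch.filter(names, '??[1-6].txt')

print(starts_with_b)
print(one_char_after_b)
print(zero_then_range)
print(any_two_then_range)
```

`b11.txt` appears only in the last result because its second character is `1`, not `0`.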

Let's practice!

Functional Approaches using Dask Bags

Functional programming
- Functions: first-class data
- Higher-order functions: functions as input or output to functions
- Functions replacing loops with:
  - map operations
  - filter operations
  - reduction operations (or aggregations)

Using map
    def squared(x):
        return x ** 2
    squares = map(squared, [1, 2, 3, 4, 5, 6])
    squares
    <map at 0x1037a1b70>
    squares = list(squares)
    squares
    [1, 4, 9, 16, 25, 36]

Using filter
    def is_even(x):
        return x % 2 == 0
    evens = filter(is_even, [1, 2, 3, 4, 5, 6])
    list(evens)
    [2, 4, 6]
    even_squares = filter(is_even, squares)
    list(even_squares)
    [4, 16, 36]
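The slides on map and filter cover two of the three loop replacements listed under functional programming; the third is a reduction. A minimal sketch, assuming the same `squared` and `is_even` helpers and using the standard-library `functools.reduce` (the helper `add` and the variable names are introduced here for illustration):

```python
from functools import reduce

def squared(x):
    return x ** 2

def is_even(x):
    return x % 2 == 0

def add(total, x):
    # Binary function that folds one more value into the running total.
    return total + x

numbers = [1, 2, 3, 4, 5, 6]

# Loop version: square, keep the even results, accumulate a sum.
loop_total = 0
for n in numbers:
    sq = squared(n)
    if is_even(sq):
        loop_total += sq

# Functional version: the same steps as map, filter, then reduce.
functional_total = reduce(add, filter(is_even, map(squared, numbers)), 0)

print(loop_total, functional_total)  # 56 56
```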

Using dask.bag.map
    import dask.bag as db
    numbers = db.from_sequence([1, 2, 3, 4, 5, 6])
    squares = numbers.map(squared)
    squares
    dask.bag<map-squared, npartitions=6>
    result = squares.compute()  # Must fit in memory
    result
    [1, 4, 9, 16, 25, 36]

Using dask.bag.filter
    numbers = db.from_sequence([1, 2, 3, 4, 5, 6])
    evens = numbers.filter(is_even)
    evens.compute()
    [2, 4, 6]
    even_squares = numbers.map(squared).filter(is_even)
    even_squares.compute()
    [4, 16, 36]

Using .str & string methods
    zen = db.read_text('zen.txt')
    uppercase = zen.str.upper()
    uppercase.take(1)
    ('THE ZEN OF PYTHON, BY TIM PETERS\n',)
    def my_upper(string):
        return string.upper()
    my_uppercase = zen.map(my_upper)
    my_uppercase.take(1)
    ('THE ZEN OF PYTHON, BY TIM PETERS\n',)

A bigger example I
    import pandas as pd
    def load(k):
        template = 'yellow_tripdata_2015-{:02d}.csv'
        return pd.read_csv(template.format(k))
    def average(df):
        return df['total_amount'].mean()
    def total(df):
        return df['total_amount'].sum()
    data = db.from_sequence(range(1, 13)).map(load)
    data
    dask.bag<map-loa..., npartitions=12>

A bigger example II
    totals = data.map(total)
    averages = data.map(average)
    totals.compute()
    [1175217.5200009614,
     947282.0900005419,
     956752.3400005258,
     1304602.4800011297,
     1354966.290001166,
     1251511.6500010253,
     1167936.1000008786,
     915174.880000469,
     994643.300000564,
     1273267.4800010026,
     1158279.990000822,
     1166242.130000856]
    averages.compute()
    [14.75051171665384,
     15.463557844570461,
     15.790076907851297,
     15.971334410669527,
     16.477159899324676,
     16.250654434978838,
     16.163639508987067,
     16.164026987891997,
     16.364647910506154,
     16.544750841370114,
     16.385807916489675,
     16.28056690958003]

Reductions (aggregations)
    t_sum, t_min, t_max = totals.sum(), totals.min(), totals.max()
    t_mean, t_std = totals.mean(), totals.std()
    stats = [t_sum, t_min, t_max, t_mean, t_std]
    %time [s.compute() for s in stats]
    CPU times: user 142 ms, sys: 101 ms, total: 243 ms
    Wall time: 4.57 s
    [13665876.250009943,
     915174.880000469,
     1354966.290001166,
     1138823.0208341617,
     144025.81874405374]

Reductions (aggregations)
    import dask
    %time dask.compute(t_sum, t_min, t_max, t_mean, t_std)
    CPU times: user 63.7 ms, sys: 29.1 ms, total: 92.7 ms
    Wall time: 852 ms
    (13665876.250009943,
     915174.880000469,
     1354966.290001166,
     1138823.0208341617,
     144025.81874405374)
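A likely reason for the gap between the two timings is that each separate `compute()` call repeats the shared upstream work (loading the CSV files), while a single `dask.compute(...)` call can share intermediate results among the outputs it is given. The sketch below is a plain-Python analogy of that idea, not Dask itself; `load_totals`, `load_log`, and the three values are made up for illustration.

```python
load_log = []  # records every time the expensive step runs

def load_totals():
    """Stand-in for the expensive step (reading the CSV files in the slides)."""
    load_log.append('load')
    return [3.0, 1.0, 2.0]  # made-up values, not the taxi totals

# One at a time: every statistic repeats the expensive load.
separate = [f(load_totals()) for f in (sum, min, max)]
calls_separate = len(load_log)

# Together: load once, then derive every statistic from the shared result.
load_log.clear()
shared = load_totals()
together = tuple(f(shared) for f in (sum, min, max))
calls_together = len(load_log)

print(calls_separate, calls_together)  # 3 1
```

The results are identical either way; only the amount of repeated work differs.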

Analyzing Congressional Legislation

JSON data files
JavaScript Object Notation:
- stored as plain text
- common web format
- direct mapping to Python lists & dictionaries

Sample JSON file: items.json
    [
      {
        "name": "item1",
        "content": ["a", "b", "c"]
      },
      {
        "name": "item2",
        "content": {"a": 0, "b": 1}
      }
    ]

Using json module
    import json
    with open('items.json') as f:
        items = json.load(f)
    type(items)
    list
    items[0]
    {'content': ['a', 'b', 'c'], 'name': 'item1'}
    items[1]
    {'content': {'a': 0, 'b': 1}, 'name': 'item2'}
    items[1]['content']['b']
    1

JSON Files into Dask Bags
Contents of items-by-line.json:
    {"name": "item1", "content": ["a", "b", "c"]}
    {"name": "item2", "content": {"a": 0, "b": 1}}
Reading it into a bag:
    import dask.bag as db
    items = db.read_text('items-by-line.json')
    items.take(1)  # Note: tuple containing a *string*
    ('{"name": "item1", "content": ["a", "b", "c"]}\n',)

JSON Files into Dask Bags
    dict_items = items.map(json.loads)  # converts strings -> other data
    dict_items.take(2)  # Note: tuple containing dicts
    ({'content': ['a', 'b', 'c'], 'name': 'item1'},
     {'content': {'a': 0, 'b': 1}, 'name': 'item2'})

Plucking values
    type(dict_items.take(2))
    tuple
    dict_items.take(2)[1]['content']  # Chained indexing
    {'a': 0, 'b': 1}
    dict_items.take(1)[0]['name']  # Chained indexing
    'item1'

Plucking values
    contents = dict_items.pluck('content')
    names = dict_items.pluck('name')
    contents
    dask.bag<pluck-5..., npartitions=1>
    names
    dask.bag<pluck-3..., npartitions=1>
    contents.compute()
    [['a', 'b', 'c'], {'a': 0, 'b': 1}]
    names.compute()
    ['item1', 'item2']
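Plucking a key from every record is the same idea as mapping an item lookup over the collection. A plain-Python sketch of that equivalence, assuming the two JSON lines from the items-by-line.json slide are held in a list called `lines` (a name introduced here) instead of being read from disk:

```python
import json
from operator import itemgetter

# The two lines of items-by-line.json, written out by hand.
lines = [
    '{"name": "item1", "content": ["a", "b", "c"]}\n',
    '{"name": "item2", "content": {"a": 0, "b": 1}}\n',
]

# Step 1: parse each line, as items.map(json.loads) does for a bag.
dict_items = [json.loads(line) for line in lines]

# Step 2: look up one key in every record, the plain-Python analogue of pluck.
names = list(map(itemgetter('name'), dict_items))
contents = list(map(itemgetter('content'), dict_items))

print(names)
print(contents)
```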

Congressional legislation metadata
- 23 JSON files
- Metadata about congressional bills
- Up to 1500 pieces of legislation per congress
- Load all into Dask Bag
- Use current_status to count vetoed bills
- Use date info to compute average times

Metadata keys
Selected dictionary keys:
    'bill_type'
    'title_without_number'
    'related_bills'
    'id'
    'titles'
    'display_number'
    'major_actions'
    'current_status_description'
    'link'
    'current_status_date'
    'committee_reports'
    'current_status_label'
    'introduced_date'
    'sponsor'
    'current_status'
    'title'
Warning: not all keys are available for every bill
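Because some keys may be missing, code that counts vetoed bills or averages elapsed times has to tolerate incomplete records. The sketch below shows one way to do that in plain Python with `dict.get`; the three records, the status strings, and the ISO date format are all made up for illustration and are not taken from the real data set. In Dask, `Bag.pluck` accepts a `default` argument that serves a similar purpose.

```python
from datetime import date

# Hypothetical records: values and formats are assumptions, not real metadata.
bills = [
    {'id': 1, 'current_status': 'enacted',
     'introduced_date': '2001-01-03', 'current_status_date': '2001-01-13'},
    {'id': 2, 'current_status': 'vetoed',
     'introduced_date': '2001-02-01', 'current_status_date': '2001-02-21'},
    {'id': 3, 'introduced_date': '2001-03-01'},  # status keys missing
]

# Count vetoed bills; .get returns None instead of raising on a missing key.
n_vetoed = sum(1 for bill in bills if bill.get('current_status') == 'vetoed')

def days_elapsed(bill):
    """Days from introduction to the current status, or None if a date is missing."""
    start = bill.get('introduced_date')
    end = bill.get('current_status_date')
    if start is None or end is None:
        return None
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

# Average only over the bills that carry both dates.
durations = [d for d in map(days_elapsed, bills) if d is not None]
average_days = sum(durations) / len(durations)

print(n_vetoed, durations, average_days)  # 1 [10, 20] 15.0
```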
