Pangeo: A Community-Driven Effort for Big Data Geoscience (PowerPoint PPT Presentation)



SLIDE 1

Pangeo

A community-driven effort for Big Data geoscience

SLIDE 2

Global warming is happening!

SLIDE 3

What Drives Progress in Geoscience?

SLIDE 4

What Drives Progress in Geoscience?

New Ideas

SLIDE 5

What Drives Progress in Geoscience?

New Ideas

$E = \rho_0 |U| \int_{|f|/|U|}^{N/|U|} P_{1D}(k)\,\sqrt{N^2 - |U|^2 k^2}\,\sqrt{|U|^2 k^2 - f^2}\,\mathrm{d}k$

SLIDE 6

What Drives Progress in Geoscience?

New Ideas
New Observations


SLIDE 9

What Drives Progress in Geoscience?

New Ideas
New Observations
New Simulations


SLIDE 12

Credit: NASA JPL / Dimitris Menemenlis


SLIDE 14

Major Science Questions

  • How is energy transferred across scales and dissipated in the ocean?
  • How do mesoscales / submesoscales / tides / internal waves contribute to the transport of heat / salt / dissolved tracers vertically and horizontally?
  • How does abyssal flow navigate complex small-scale topography (e.g. shelf overflows, Indonesian Throughflow, abyssal canyons)?
  • How should we represent these processes in coarse-resolution climate models?

Dozens of high-impact papers are waiting to be written!

SLIDE 15

My Big Data Journey

Timeline, 2013–2018: started at Columbia; wandered the desert; discovered Big Data


SLIDE 17

My Big Data Journey

Timeline, 2013–2018: started at Columbia; wandered the desert; discovered Big Data; discovered xarray!

SLIDE 18

Scientific Python for Data Science

Source: stackoverflow.com

SLIDE 19

Scientific Python for Data Science

Diagram of the scientific Python ecosystem (SciPy, aospy, …)

Credit: Stephan Hoyer, Jake Vanderplas (SciPy 2015)


SLIDE 21

Xarray Dataset: Multidimensional Variables with Coordinates and Metadata

Diagram labels: time, longitude, latitude, elevation, + land_cover
Data variables: used for computation
Coordinates: describe data
Indexes: align data
Attributes: metadata ignored by operations

"netCDF meets pandas.DataFrame"

Credit: Stephan Hoyer

SLIDE 22

xarray makes science easy

import xarray as xr
ds = xr.open_dataset('NOAA_NCDC_ERSST_v3b_SST.nc')
ds

<xarray.Dataset>
Dimensions:  (lat: 89, lon: 180, time: 684)
Coordinates:
  * lat      (lat) float32 -88.0 -86.0 -84.0 -82.0 -80.0 -78.0 -76.0 -74.0 ...
  * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 10.0 12.0 14.0 16.0 18.0 20.0 ...
  * time     (time) datetime64[ns] 1960-01-15 1960-02-15 1960-03-15 ...
Data variables:
    sst      (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
Attributes:
    Conventions: IRIDL
    source: https://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCDC/.ERSST/...

SLIDE 23

xarray: label-based selection

# select and plot data from my birthday
ds.sst.sel(time='1982-08-07', method='nearest').plot()

SLIDE 24

xarray: label-based operations

# zonal and time mean temperature
ds.sst.mean(dim=('time', 'lon')).plot()

SLIDE 25

xarray: grouping and aggregation

sst_clim = sst.groupby('time.month').mean(dim='time')
sst_anom = sst.groupby('time.month') - sst_clim
nino34_index = (sst_anom.sel(lat=slice(-5, 5), lon=slice(190, 240))
                .mean(dim=('lon', 'lat'))
                .rolling(time=3).mean())
nino34_index.plot()
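The climatology/anomaly pattern in this snippet (group by month, average, subtract) can be sketched without xarray. A minimal stdlib-only illustration with toy monthly values:

```python
from collections import defaultdict

# Toy record of (month, value) pairs spanning two years.
records = [(1, 10.0), (2, 12.0), (1, 14.0), (2, 16.0)]

# Group by month and average: the climatology.
groups = defaultdict(list)
for month, value in records:
    groups[month].append(value)
clim = {m: sum(v) / len(v) for m, v in groups.items()}

# Subtract each value's monthly mean: the anomaly.
anomalies = [value - clim[month] for month, value in records]
print(clim)       # {1: 12.0, 2: 14.0}
print(anomalies)  # [-2.0, -2.0, 2.0, 2.0]
```

xarray's `groupby('time.month')` performs exactly this grouping, but vectorized over labeled multidimensional arrays.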

SLIDE 26

xarray

  • label-based indexing and arithmetic
  • interoperability with the core scientific Python packages (e.g., pandas, NumPy, Matplotlib)
  • out-of-core computation on datasets that don't fit into memory (thanks, dask!)
  • wide range of input/output (I/O) options: netCDF, HDF, GeoTIFF, zarr
  • advanced multi-dimensional data manipulation tools such as group-by and resampling

https://github.com/pydata/xarray

SLIDE 27

Legacy Software

NASA Panoply, INGRID

SLIDE 28

dask

Complex computations are represented as a graph of individual tasks; the scheduler optimizes execution of the graph.

ND-arrays are split into chunks that comfortably fit in memory.

https://github.com/dask/dask/
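The idea of a computation expressed as a graph of tasks can be sketched in plain Python. This is a toy single-threaded evaluator, not dask itself (dask's real schedulers add parallelism, caching, and optimization), but the `{key: value or (function, *argument_keys)}` convention matches dask's internal graph format:

```python
from operator import add, mul

# A task graph: each key maps to either a literal value or a tuple
# (function, *argument_keys).
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),        # sum = x + y
    "result": (mul, "sum", "sum"),  # result = sum * sum
}

def execute(graph, key):
    """Recursively evaluate one key of the task graph (a toy scheduler)."""
    task = graph[key]
    if isinstance(task, tuple):
        func, *args = task
        return func(*(execute(graph, a) for a in args))
    return task

print(execute(graph, "result"))  # 9
```

Because each task names its inputs explicitly, a smarter scheduler can see which tasks are independent and run them concurrently.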


SLIDE 30

Example Calculation: Take the Mean!

Serial execution (a loop) over a multidimensional array: read chunk from disk → reduce → store, repeated for each chunk, then a final reduce.


SLIDE 32

Example Calculation: Take the Mean!

Parallel execution (dask graph) over the same array: each chunk is read from disk and reduced independently; the stored partial results are combined in a final reduce.
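The chunked mean described above can be sketched with NumPy alone. Here in-memory slices stand in for chunks read from disk; the per-chunk `(sum, count)` reductions are independent tasks, which is precisely what lets dask execute them in parallel:

```python
import numpy as np

# Stand-in for a large multidimensional array stored in chunks on disk.
data = np.arange(24.0).reshape(4, 6)
chunks = [data[i:i + 1] for i in range(4)]  # four row chunks

# Map step: reduce each chunk to a (sum, count) pair.
partials = [(chunk.sum(), chunk.size) for chunk in chunks]

# Final reduce: combine the partial results into the global mean.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count

print(mean)  # 11.5, identical to data.mean()
```

Only one chunk (plus the small partial results) ever needs to be in memory at a time, which is how the same pattern scales to arrays far larger than RAM.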

SLIDE 33

My Big Data Journey

Timeline, 2013–2018: started at Columbia; wandered the desert; discovered Big Data; discovered xarray!; used xarray on datasets up to ~200 GB; connected with the xarray community; first Pangeo workshop

slide-34
SLIDE 34
  • Foster collaboration around the open source scientific python

ecosystem for ocean / atmosphere / land / climate science.

  • Support the development with domain-specific geoscience

packages.

  • Improve scalability of these tools to to handle petabyte-scale

datasets on HPC and cloud platforms.

21

Pa n g e o P r o j e c t g o a l s

SLIDE 35

My Big Data Journey

Timeline, 2013–2018: started at Columbia; wandered the desert; discovered Big Data; discovered xarray!; used xarray on datasets up to ~200 GB; connected with the fantastic xarray community; first Pangeo workshop; EarthCube proposal awarded

pangeo.pydata.org

SLIDE 36

EarthCube Award Team

Ryan Abernathey, Chiara Lepore, Michael Tippett, Naomi Henderson, Richard Seager

Kevin Paul, Joe Hamman, Ryan May, Davide Del Vento

Matthew Rocklin

SLIDE 37

Other Contributors

Jacob Tomlinson, Niall Roberts, Alberto Arribas: developing and operating a Pangeo environment to support analysis of UK Met Office products
Rich Signell: deploying Pangeo on AWS to support analysis of coastal ocean modeling
Justin Simcock: operating Pangeo in the cloud to support Climate Impact Lab research and analysis
Supporting Pangeo via the SWOT mission and a recently funded ACCESS award to UW / NCAR 🎊
Yuvi Panda, Chris Holdgraf: spending lots of time helping us make things work on the cloud

SLIDE 38

Pangeo Architecture

Jupyter: interactive access to remote systems (Cloud / HPC)
Xarray: provides data structures and an intuitive interface for interacting with datasets
Parallel computing system: allows users to deploy clusters of compute nodes for data processing; Dask tells the nodes what to do
Distributed storage: "Analysis Ready Data" stored on globally available distributed storage

SLIDE 39

Build Your Own Pangeo

Storage formats: cloud-optimized (COG / Zarr / Parquet / etc.)
Data models: ND-arrays, more coming…
Processing mode: interactive, batch, serverless
Compute platform: HPC, cloud, local

SLIDE 40

Pangeo Deployments

NASA Pleiades, NCAR Cheyenne, pangeo.pydata.org

Over 1000 unique users since March
http://pangeo.io/deployments.html

SLIDE 41

pangeo.pydata.org

  • Open to anyone with a GitHub account! …but highly experimental / unstable
  • Deployed on Google Cloud Platform
  • Based on zero-to-jupyterhub-k8s (thanks Yuvi Panda, Chris Holdgraf, et al.!)
  • Customizations to allow users to launch dask clusters interactively
  • Pre-loaded example notebooks
  • Lots of data available in GCS (mostly zarr format)
  • Huge learning experience for everyone involved!

SLIDE 42

pangeo.pydata.org usage stats

since March 2017

SLIDE 43

Climate Data in the Cloud Era

Traditional approach: a data access portal. Clients on the Internet request data from a Data Access Server inside the data center, which serves data granules (netCDF files: file.0001.nc, file.0002.nc, file.0003.nc, file.0004.nc).

SLIDE 44

Climate Data in the Cloud Era

Direct access to cloud object storage: clients on cloud compute instances read chunks (chunk.0.0.0, chunk.0.0.1, chunk.0.0.2, chunk.0.0.3) directly from cloud object storage in the cloud data center, guided by a catalog. Data granules may be netCDF files or something new.

SLIDE 45

File / Block Storage

  • The operating system provides mechanisms to read / write files and directories (e.g. POSIX).
  • Seeking and random access to bytes within files is fast.
  • "Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data."

Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file

SLIDE 46

Object Storage

  • An object is a collection of bytes associated with a unique identifier.
  • Bytes are read and written with HTTP calls.
  • There is significant overhead for each individual operation.
  • Access is at the application level (not OS-dependent).
  • Implemented by S3, GCS, Azure, Ceph, etc.

Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
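The key→bytes model can be made concrete with a toy in-memory object store. This is an illustrative sketch, not a real client: actual services (S3, GCS) expose the same put/get operations over HTTP, which is where the per-request overhead comes from:

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: a flat namespace
    mapping unique string keys to immutable blobs of bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        # Corresponds to an HTTP PUT; objects are written whole.
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        # Corresponds to an HTTP GET; each access pays a fixed
        # per-request overhead on a real service.
        return self._objects[key]

store = ToyObjectStore()
store.put("mybucket/data/chunk.0.0.0", b"\x00\x01\x02\x03")
print(store.get("mybucket/data/chunk.0.0.0"))
```

Note there are no directories and no seek: the unit of access is the whole object, which is why chunked formats like zarr map onto object storage so naturally.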
SLIDE 47

zarr

  • Python library for storage of chunked, compressed ND-arrays
  • Developed by Alistair Miles (Imperial) for genomics research (@alimanfoo)
  • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.)
  • Can store arrays in memory, directories, zip files, or anything implementing the Python mutable mapping interface (dictionary)
  • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage

Zarr Group: group_name (.zgroup, .zattrs)
Zarr Array: array_name (.zarray, .zattrs; chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1)
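The layout above (JSON metadata keys plus chunk keys named by grid position) can be illustrated with a plain dictionary acting as the store, a toy version of the mutable-mapping interface zarr accepts. The array name and chunk contents here are illustrative, not real data:

```python
import json

# A dict acting as a zarr-style store: keys are "paths", values are bytes.
store = {}

# Array metadata lives under a .zarray key (JSON).
meta = {"shape": [4, 6], "chunks": [2, 3], "dtype": "<f8", "zarr_format": 2}
store["array_name/.zarray"] = json.dumps(meta).encode()

# Each chunk is stored under a key naming its position in the chunk grid.
for i in range(meta["shape"][0] // meta["chunks"][0]):      # 2 chunk rows
    for j in range(meta["shape"][1] // meta["chunks"][1]):  # 2 chunk cols
        store[f"array_name/{i}.{j}"] = b"<compressed chunk bytes>"

print(sorted(store))
# ['array_name/.zarray', 'array_name/0.0', 'array_name/0.1',
#  'array_name/1.0', 'array_name/1.1']
```

Because every key is independent, the same layout works unchanged whether the mapping is a dict, a directory of files, or an object-storage bucket.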

SLIDE 48

zarr

Example .zarray file (JSON):

{
  "chunks": [5, 720, 1440],
  "compressor": {"blocksize": 0, "clevel": 3, "cname": "zstd", "id": "blosc", "shuffle": 2},
  "dtype": "<f8",
  "fill_value": "NaN",
  "filters": null,
  "order": "C",
  "shape": [8901, 720, 1440],
  "zarr_format": 2
}

SLIDE 49

zarr

Example .zattrs file (JSON):

{
  "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
  "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
  "coordinates": "crs",
  "grid_mapping": "crs",
  "long_name": "Sea level anomaly",
  "standard_name": "sea_surface_height_above_sea_level",
  "units": "m"
}

SLIDE 50

Xarray + zarr

  • Developed a new xarray backend which allows xarray to read and write directly to a Zarr store (with @jhamman)
  • It was pretty easy! The data models are quite similar
  • Automatic mapping between zarr chunks <-> dask chunks
  • We needed to add a custom, "hidden" attribute (_ARRAY_DIMENSIONS) to give the zarr arrays dimensions
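The _ARRAY_DIMENSIONS convention is just an extra entry in each array's attributes. A minimal sketch of how a reader could recover named dimensions from it, with a plain dict standing in for the zarr attribute file and an assumed shape matching the earlier .zarray example:

```python
import json

# A .zattrs document in the style xarray's zarr backend writes: the
# hidden _ARRAY_DIMENSIONS key attaches dimension names to an
# otherwise dimension-less zarr array.
zattrs = json.loads("""
{
  "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
  "units": "m"
}
""")
zarray_shape = [8901, 720, 1440]  # from the array's .zarray metadata

# A reader pops the hidden attribute and pairs names with the shape.
dims = zattrs.pop("_ARRAY_DIMENSIONS")
sizes = dict(zip(dims, zarray_shape))
print(sizes)   # {'time': 8901, 'latitude': 720, 'longitude': 1440}
print(zattrs)  # remaining user-visible attributes: {'units': 'm'}
```

Popping the attribute keeps it hidden from users while giving xarray the dimension names netCDF stores natively.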

SLIDE 51

Preparing Datasets for zarr Cloud Storage

1. Open the original data files into a single xarray dataset with reasonable chunks:

ds = xr.open_mfdataset('bunch_o_files_*.nc', chunks={'time': 1})

2. Export to zarr:

ds.to_zarr('/path/to/zarr/directory')

or

ds.to_zarr(gcsmap_object)

3. [maybe] Upload to cloud storage:

$ gsutil -m cp -r /path/to/zarr/directory gs://pangeo-data/path

SLIDE 52

Where Is Pangeo Going?

  • Pangeo + Binder! https://github.com/pangeo-data/pangeo-binder
  • Custom JupyterLab extensions (dask dashboards, cluster monitoring, data catalog browsing)
  • User management (home directories, scratch space, etc.)
  • Domain-specific cloud environments: ocean.pangeo.io, atmos.pangeo.io, astro.pangeo.io [?]

SLIDE 53

How to Get Involved

  • Contribute to xarray, dask, zarr, jupyterhub, etc.
  • Access an existing Pangeo deployment on an HPC cluster or cloud resources (e.g. pangeo.pydata.org)
  • Adapt Pangeo elements to meet your project's needs (data portals, etc.) and give feedback via GitHub: github.com/pangeo-data/pangeo

http://pangeo.io