Pangeo
A community-driven effort for Big Data geoscience
2
Global warming is happening!
3
What Drives Progress in Geoscience?
New Ideas
$$E = \frac{\rho_0 |U|}{\pi} \int_{|f|/|U|}^{N/|U|} P_{1D}(k) \sqrt{N^2 - |U|^2 k^2} \, \sqrt{|U|^2 k^2 - f^2} \, dk$$
New Observations
New Simulations
4
Credit: NASA JPL / Dimitris Menemenlis
to the transport of heat / salt / dissolved tracers vertically and horizontally?
shelf overflows, Indonesian Throughflow, abyssal canyons)?
models?
5
dozens of high impact papers are waiting to be written!
6
2013 2014 2015 2016 2017 2018
started at Columbia; wandered the desert; discovered Big Data
7
2013 2014 2015 2016 2017 2018
started at Columbia; wandered the desert; discovered Big Data; discovered xarray!
8
source: stackoverflow.com
aospy
9
Credit: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
Xarray Dataset: Multidimensional Variables with coordinates and metadata
10
[Figure: xarray Dataset diagram; labels: time, longitude, latitude, elevation, land_cover]
Data variables: used for computation
Coordinates: describe data
Indexes: align data
Attributes: metadata ignored by operations
“netCDF meets pandas.DataFrame”
Credit: Stephan Hoyer
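The four parts of the data model above can be seen in a small hand-built Dataset. This is a minimal sketch with synthetic values; the variable name temperature, the grid, and the title attribute are assumptions made for illustration, not taken from the slides.

```python
import numpy as np
import xarray as xr

# Synthetic grid and values (illustrative only).
time = np.array(["2000-01-01", "2000-02-01", "2000-03-01"], dtype="datetime64[ns]")
lat = np.array([-10.0, 0.0, 10.0, 20.0])
lon = np.array([0.0, 90.0, 180.0, 270.0, 359.0])

ds = xr.Dataset(
    # Data variables: used for computation.
    data_vars={
        "temperature": (("time", "lat", "lon"), np.arange(60.0).reshape(3, 4, 5)),
    },
    # Coordinates: describe the data. The 1-D ones also act as indexes.
    coords={
        "time": time,
        "lat": lat,
        "lon": lon,
        "land_cover": (("lat", "lon"), np.zeros((4, 5), dtype="int8")),
    },
    # Attributes: free-form metadata.
    attrs={"title": "synthetic example"},
)

# Reduce over a named dimension instead of a positional axis.
zonal_mean = ds["temperature"].mean(dim="lon")

# Label-based selection through the lat index.
near_equator = ds["temperature"].sel(lat=1.0, method="nearest")
```

Operations refer to dimensions by name (dim="lon") and to positions by label (lat=1.0), which is what the coordinates and indexes are for.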
11
import xarray as xr
ds = xr.open_dataset('NOAA_NCDC_ERSST_v3b_SST.nc')
ds

<xarray.Dataset>
Dimensions:  (lat: 89, lon: 180, time: 684)
Coordinates:
  * lat      (lat) float32 -88.0 -86.0 -84.0 -82.0 -80.0 -78.0 -76.0 -74.0 ...
  * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 10.0 12.0 14.0 16.0 18.0 20.0 ...
  * time     (time) datetime64[ns] 1960-01-15 1960-02-15 1960-03-15 ...
Data variables:
    sst      (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
Attributes:
    Conventions:  IRIDL
    source:       https://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCDC/.ERSST/...
12
# select and plot data from my birthday
ds.sst.sel(time='1982-08-07', method='nearest').plot()
13
# zonal and time mean temperature
ds.sst.mean(dim=('time', 'lon')).plot()
14
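# monthly climatology and anomalies of sea surface temperature
sst = ds.sst
sst_clim = sst.groupby('time.month').mean(dim='time')
sst_anom = sst.groupby('time.month') - sst_clim
# Niño 3.4 index: anomaly averaged over 5S-5N, 190E-240E, then a 3-month running mean
nino34_index = (sst_anom.sel(lat=slice(-5, 5), lon=slice(190, 240))
                        .mean(dim=('lon', 'lat'))
                        .rolling(time=3).mean())
nino34_index.plot()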
pandas, NumPy, Matplotlib)
(thanks dask!)
by and resampling
15
https://github.com/pydata/xarray
16
NASA Panoply INGRID
17
Complex computations represented as a graph of individual tasks. Scheduler optimizes execution of graph.
https://github.com/dask/dask/
ND-Arrays are split into chunks that comfortably fit in memory
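The "graph of individual tasks" idea can be sketched with a plain dictionary and a tiny recursive evaluator. This is a toy written for illustration, loosely modelled on a dictionary-of-tuples graph; the function get and the key names are hypothetical, and it does none of the optimization or parallel scheduling that dask performs.

```python
from operator import add, mul

# Each key maps either to a literal value or to a task tuple
# of the form (function, *arguments). String arguments that
# name another key are dependencies.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),      # depends on x and y
    "scaled": (mul, "sum", 10),  # depends on sum
}


def get(graph, key, cache=None):
    """Evaluate one key, computing every dependency at most once."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Resolve arguments that refer to other tasks; pass literals through.
        resolved = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task
    cache[key] = result
    return result


result = get(graph, "scaled")
```

Because the whole computation is described before anything runs, a scheduler is free to reorder, fuse, or parallelize independent tasks.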
18
[Diagram: serial execution (a loop). For each chunk of the multidimensional array in turn: read chunk from disk, reduce, store. A final reduce combines the stored results.]
19
[Diagram: parallel execution (dask graph). Each chunk of the multidimensional array is read from disk, reduced, and stored independently. A final reduce combines the stored results.]
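The two diagrams can be compared in a few lines of standard-library Python. This is a sketch under stated assumptions: an in-memory list stands in for data on disk, a thread pool stands in for dask workers, and the helper names read_chunk, reduce_chunk, and combine are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1000))  # stand-in for an array stored on disk
chunk_size = 100
starts = range(0, len(data), chunk_size)


def read_chunk(start):
    """Stand-in for reading one chunk from disk."""
    return data[start:start + chunk_size]


def reduce_chunk(chunk):
    """Per-chunk reduction: partial sum and element count."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Final reduction: merge the partial results into a mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Serial execution (a loop): one chunk at a time.
serial_partials = []
for start in starts:
    serial_partials.append(reduce_chunk(read_chunk(start)))
serial_mean = combine(serial_partials)

# Parallel execution: the per-chunk tasks are independent,
# so they can be handed to a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_partials = list(pool.map(lambda s: reduce_chunk(read_chunk(s)), starts))
parallel_mean = combine(parallel_partials)
```

Only one chunk per worker needs to be in memory at a time, and both paths give the same answer because the final reduce only sees the small partial results.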
20
2013 2014 2015 2016 2017 2018
started at Columbia; wandered the desert; discovered Big Data; discovered xarray!; used xarray on datasets up to ~200 GB; connected with xarray community; first Pangeo workshop
21
22
2013 2014 2015 2016 2017 2018
started at Columbia; wandered the desert; discovered Big Data; discovered xarray!; used xarray on datasets up to ~200 GB; connected with fantastic xarray community; first Pangeo workshop; EarthCube proposal awarded
pangeo.pydata.org
23
Ryan Abernathey, Chiara Lepore, Michael Tippet, Naomi Henderson, Richard Seager
Kevin Paul, Joe Hamman, Ryan May, Davide Del Vento
Matthew Rocklin
24
Jacob Tomlinson, Niall Roberts, Alberto Arribas: Developing and operating Pangeo environment to support analysis of UK Met
Rich Signell: Deploying Pangeo on AWS to support analysis of coastal ocean modeling
Justin Simcock: Operating Pangeo in the cloud to support Climate Impact Lab research and analysis
Supporting Pangeo via SWOT mission and recently funded ACCESS award to UW / NCAR 🎊
Yuvi Panda, Chris Holdgraf: Spending lots of time helping us make things work on the cloud
25
Jupyter for interactive access to remote systems
Cloud / HPC
Xarray provides data structures and an intuitive interface for interacting with datasets
A parallel computing system allows users to deploy clusters of compute nodes for data processing. Dask tells the nodes what to do.
Distributed storage
“Analysis Ready Data” stored on globally-available distributed storage.
26
Storage Formats: Cloud Optimized COG/Zarr/Parquet/etc.
Data Models: ND-Arrays; More coming…
Processing Mode: Interactive; Batch; Serverless
Compute Platform: HPC; Cloud; Local
27
NASA Pleiades
pangeo.pydata.org
NCAR Cheyenne
Over 1000 unique users since March
http://pangeo.io/deployments.html
28
29
since March 2017
30
Traditional Approach: A Data Access Portal
Data Access Server
file.0001.nc file.0002.nc file.0003.nc file.0004.nc
Data Granules (netCDF files)
Client Client Client
Data Center Internet
31
Direct Access to Cloud Object Storage
Catalog
chunk.0.0.0 chunk.0.0.1 chunk.0.0.2 chunk.0.0.3
Data Granules (netCDF files or something new) Cloud Object Storage
Client Client Client
Cloud Data Center Cloud Compute Instances
32
Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
mechanism to read / write files and directories (e.g. POSIX).
bytes within files is fast.
a block device, which is a level
responsible for storing and retrieving specified blocks of data”
33
Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
unique identifier
arrays
(@alimanfoo)
python mutable mapping interface (dictionary)
into cloud object storage
34
Zarr Group: group_name
  .zgroup
  .zattrs
  Zarr Array: array_name
    .zarray
    .zattrs
    0.0 0.1 1.0 1.1 2.0 2.1
35
{
  "chunks": [5, 720, 1440],
  "compressor": {
    "blocksize": 0,
    "clevel": 3,
    "cname": "zstd",
    "id": "blosc",
    "shuffle": 2
  },
  "dtype": "<f8",
  "fill_value": "NaN",
  "filters": null,
  "order": "C",
  "shape": [8901, 720, 1440],
  "zarr_format": 2
}
Example .zarray file (json)
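The shape and chunks fields are enough to work out the chunk grid. The sketch below does that arithmetic for the metadata shown above, using only the standard library; chunk_key is a hypothetical helper, and it assumes the dot-separated chunk keys pictured in the array diagram.

```python
import json
import math

# Subset of the .zarray metadata from the slide.
zarray = json.loads("""
{
  "chunks": [5, 720, 1440],
  "dtype": "<f8",
  "shape": [8901, 720, 1440],
  "zarr_format": 2
}
""")

shape = zarray["shape"]
chunks = zarray["chunks"]

# Number of chunks along each dimension (the last chunk may be partial).
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)

# Uncompressed size of one full chunk: "<f8" is an 8-byte float.
itemsize = int(zarray["dtype"][2:])
chunk_nbytes = math.prod(chunks) * itemsize


def chunk_key(index):
    """Key of the chunk that holds the element at the given array index."""
    return ".".join(str(i // c) for i, c in zip(index, chunks))


key = chunk_key((12, 0, 0))
```

Here each chunk spans 5 time steps of the full 720 x 1440 grid, about 41 MB before compression, so a reader that wants one time step fetches exactly one object.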
36
{
  "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
  "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
  "coordinates": "crs",
  "grid_mapping": "crs",
  "long_name": "Sea level anomaly",
  "standard_name": "sea_surface_height_above_sea_level",
  "units": "m"
}
Example .zattrs file (json)
directly to a Zarr store (with @jhamman)
(_ARRAY_DIMENSIONS) to give the zarr arrays dimensions
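How the _ARRAY_DIMENSIONS attribute attaches names to an otherwise unnamed array can be sketched with a plain dictionary standing in for the store. The array name sla and the helper named_shape are hypothetical, and pairing the two example metadata files as one array is an assumption made for illustration.

```python
import json

# A plain dict standing in for a key-value store (a mutable mapping):
# keys are paths, values are JSON documents or chunk bytes.
store = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "sla/.zarray": json.dumps({
        "shape": [8901, 720, 1440],
        "chunks": [5, 720, 1440],
        "dtype": "<f8",
        "zarr_format": 2,
    }),
    "sla/.zattrs": json.dumps({
        "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
        "units": "m",
    }),
}


def named_shape(store, array_name):
    """Pair each dimension name from .zattrs with its size from .zarray."""
    meta = json.loads(store[f"{array_name}/.zarray"])
    attrs = json.loads(store[f"{array_name}/.zattrs"])
    return dict(zip(attrs["_ARRAY_DIMENSIONS"], meta["shape"]))


dims = named_shape(store, "sla")
```

Because everything is reached through string keys, the same lookup works whether the mapping is a dict, a directory, or a bucket in cloud object storage.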
37
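import xarray as xr
# open many netCDF files lazily as one dataset, one chunk per time step
ds = xr.open_mfdataset('bunch_o_files_*.nc', chunks={'time': 1})
# write to a local Zarr directory, then upload it to cloud storage
ds.to_zarr('/path/to/zarr/directory')
$ gsutil -m cp -r /path/to/zarr/directory gs://pangeo-data/path
or
# write directly to a mapping object backed by cloud storage
ds.to_zarr(gcsmap_object)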
38
data catalog browsing)
39
resources (e.g. pangeo.pydata.org)
etc.) and give feedback via GitHub: github.com/pangeo-data/pangeo
40