Data Management, In-Situ Workflows and Extreme Scales


SLIDE 1

Manish Parashar, Ph.D.

Director, Rutgers Discovery Informatics Institute RDI2 Distinguished Professor, Department of Computer Science

Philip Davis, Shaohua Duan, Yubo Qin, Melissa Romanus, Pradeep Subedi, Zhe Wang

ROSS 2018 @ HPDC’18, Tempe, AZ, USA June 12, 2018

Data Management, In-Situ Workflows and Extreme Scales

SLIDE 2

Outline

  • Extreme scale simulation-based science – opportunities and challenges
  • Rethinking the simulations-to-insights pipeline: data staging and in-situ workflows
  • Runtime management for data staging and in-situ workflows
    – Data placement
    – Resilience
  • Conclusion
SLIDE 3

Science / Society Transformed by Compute & Data

  • The scientific process has evolved to include computation & data
  • Nearly every field of discovery is transitioning from "data poor" to "data rich"

Examples: Oceanography (OOI), Biology (sequencing, personalized medicine), crisis management, Fusion (KSTAR), Physics (LHC), Sociology (the Web), Economics (POS terminals), Neuroscience (EEG, fMRI), the Internet of Things, Astronomy (LSST)

SLIDE 4

Moving Aggressively Towards Exascale

  • Create systems that can apply exaflops of computing power to exabytes of data
  • Improve HPC application developer productivity
  • Establish hardware technology for future HPC systems
  • …
SLIDE 5

Moving Aggressively Towards Exascale

SLIDE 6

Moving Aggressively Towards Exascale

Source: Hyperion (IDC) Paints a Bullish Picture of HPC Future By John Russell

SLIDE 7

Extreme Scales => Extreme Challenges

  • Exponential increase in parallelism
    – Extreme core counts, concurrency
  • Diversity in emerging memory and storage technologies
    – New memory technologies
    – Increasing performance gap between memory and disks
  • Growing data volumes, increasing data costs
    – Data access costs vary widely with location
    – Variability and heterogeneity in data movement cost (performance, energy)
  • Increasingly heterogeneous machine architectures
    – Complex CPU + accelerator architectures
    – Proliferation of accelerators
  • Diverse and complex application/user requirements
    – Complex application workflows; complex mapping onto heterogeneous systems
    – Large numbers of domain scientists and non-experts
  • Reliability, energy efficiency, correctness, …
SLIDE 8

Scientific Discovery through Simulations: A Big Data Problem

  • Scientific simulations running on current high-end computing systems generate huge amounts of data!
    – If a single core produces 2MB/minute on average, one of these machines could generate simulation data at ~170TB per hour -> ~4PB per day -> ~1.4EB per year
  • Successful scientific discovery depends on a comprehensive understanding of this enormous simulation data

How do we enable computational scientists to efficiently manage and explore extreme scale data – "find the needles in the haystack"?
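The arithmetic above can be checked in a few lines. The core count used here (~1.4 million) is an illustrative assumption, roughly the size of a leadership-class machine, not a figure from the slide:

```python
# Back-of-the-envelope check of the data-volume estimate above. The core
# count (~1.4 million cores) is an illustrative assumption; decimal units
# are used throughout (1 TB = 1e6 MB, 1 PB = 1e3 TB, 1 EB = 1e3 PB).
CORES = 1_400_000
MB_PER_CORE_PER_MIN = 2

tb_per_hour = CORES * MB_PER_CORE_PER_MIN * 60 / 1e6   # MB/hour -> TB/hour
pb_per_day = tb_per_hour * 24 / 1e3                    # TB/day  -> PB/day
eb_per_year = pb_per_day * 365 / 1e3                   # PB/year -> EB/year

print(f"{tb_per_hour:.0f} TB/hour, {pb_per_day:.1f} PB/day, {eb_per_year:.2f} EB/year")
# -> 168 TB/hour, 4.0 PB/day, 1.47 EB/year
```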

SLIDE 9

Traditional Simulation -> Insight Workflows Break Down

  • Traditional simulation -> insight pipeline
    – Run large-scale simulation workflows on large supercomputers
    – Dump data to parallel disk systems
    – Export data to archives
    – Move data to users' sites – usually selected subsets
    – Perform data manipulations and analysis on mid-size clusters
    – Collect experimental / observational data
    – Move to analysis sites
    – Compare experimental/observational data to validate simulation data

Figure: the traditional data analysis pipeline – simulation machines produce raw data, which flows to storage servers and then to an analysis/visualization cluster.

SLIDE 10
The Cost of Data Movement

  • The energy cost of moving data is a significant concern:
    Energy_move_data = bitrate × length² / cross_section_area_of_wire
  • Moving data between node memory and persistent storage is slow – a growing performance gap!
  • K. Yelick, "Software and Algorithms for Exascale: Ten Ways to Waste an Exascale Computer"
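The scaling relation above can be illustrated numerically. The constants below are arbitrary, so only the relative cost is meaningful:

```python
# Illustrative use of the scaling relation on this slide:
#   Energy_move_data = bitrate * length^2 / wire_cross_section_area
# All constants are arbitrary units; only ratios carry meaning.
def relative_energy(bitrate, length, area):
    """Relative energy to move data over a wire (arbitrary units)."""
    return bitrate * length ** 2 / area

on_chip = relative_energy(bitrate=1.0, length=1e-3, area=1.0)   # ~1 mm wire
off_node = relative_energy(bitrate=1.0, length=1.0, area=1.0)   # ~1 m link

# Quadratic dependence on distance: a 1000x longer path costs about a
# million times more energy, all else being equal, which is why data
# movement (not flops) dominates the energy budget at extreme scale.
print(off_node / on_chip)
```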

SLIDE 11

We Need to Rethink Extreme Scale Simulation Workflows!

The costs of data movement (power and performance) in the traditional data analysis pipeline are increasing and dominating!
  – Reduce data movement
  – Move computation/analytics closer to the data
  – Add value to simulation data along the I/O path
=> In-situ workflows, in-transit processing

SLIDE 12

Some Recent Research Addressing In-Situ

  • Swift/T – workflow coordination; all applications share an MPI context, which is split by an execution wrapper
  • Catalyst and Libsim – embed analysis/viz in simulation processes using time division
  • ADIOS – flexible I/O abstractions for end-to-end data pipelines
  • FlowVR – independent task coordination across processes
  • Decaf – decoupled dataflow middleware for in-situ workflows
  • Bredala – semantic data redistribution of complex data structures for in-situ applications
  • SuperGlue – standardized glue components for HPC workflows
  • Landrush – leverages heterogeneous compute node resources (e.g., GPUs) to run in-situ workflows
  • Damaris – leverages dedicated cores in multicore nodes to offload data management tasks
  • Mercury – RPC and bulk message passing across applications
  • FlexPath – communication between MPI applications using a reliable transport
  • DataSpaces
SLIDE 13

Rethinking the Data Management Pipeline – Hybrid Staging + In-Situ & In-Transit Execution

  • Reduce data movement
  • Move computation/analytics closer to the data source
  • Process, transform data along the data path
slide-14
SLIDE 14

DataSpaces: Extreme Scale Data Staging Service

  • Virtual shared-space programming abstraction
  • Simple API for coordination, interaction and messaging
  • Distributed, associative, in-memory object store
  • Online data indexing, flexible querying
  • Adaptive cross-layer runtime management
  • Hybrid in-situ/in-transit execution
  • Efficient, high-throughput/low-latency asynchronous data transport

Figure: the DataSpaces abstraction

SLIDE 15

The DataSpaces Staging Abstraction

  • In-memory storage distributed across a set of cores/nodes
  • In-staging data processing, querying, sharing and exchange
    – Supports data staging, runtime data coupling, and online data analysis and processing
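As a rough illustration of the shared-space abstraction, a put/get store indexed by variable, version, and bounding box might look like the sketch below. The names and semantics are invented for the example; this is not the real DataSpaces API:

```python
# A toy sketch of a virtual shared-space abstraction: an in-memory,
# associative store where writers put data objects tagged with a variable
# name, a version (time step), and a spatial bounding box, and readers
# query by overlapping region. Illustrative only, not the DataSpaces API.
class SharedSpace:
    def __init__(self):
        self._store = {}  # (var, version) -> list of (bbox, data)

    def put(self, var, version, bbox, data):
        """Insert a data object covering region bbox = (lower, upper) tuples."""
        self._store.setdefault((var, version), []).append((bbox, data))

    def get(self, var, version, bbox):
        """Return objects whose region overlaps the query bounding box."""
        def overlaps(a, b):
            (alb, aub), (blb, bub) = a, b
            # Per-dimension interval overlap test.
            return all(l1 <= u2 and l2 <= u1
                       for (l1, u1), (l2, u2) in zip(zip(alb, aub), zip(blb, bub)))
        return [d for box, d in self._store.get((var, version), [])
                if overlaps(box, bbox)]

space = SharedSpace()
space.put("pressure", 0, ((0, 0), (63, 63)), "block-A")    # writer (simulation)
space.put("pressure", 0, ((64, 0), (127, 63)), "block-B")
hits = space.get("pressure", 0, ((32, 0), (96, 63)))       # reader (analysis)
print(hits)   # both blocks overlap the query region
```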

SLIDE 16

Design Space for Staging

  • Location of the compute resources
    – Same cores as the simulation (in situ)
    – Some (dedicated) cores on the same nodes
    – Some dedicated nodes on the same machine
    – Dedicated nodes on an external resource
  • Data access, placement, and persistence
    – Direct access to simulation data structures
    – Shared memory access via hand-off / copy
    – Shared memory access via non-volatile near-node storage (NVRAM)
    – Data transfer to dedicated nodes or external resources
  • Synchronization and scheduling
    – Execute synchronously with the simulation every nth simulation time step
    – Execute asynchronously

Figure: three staging options – (1) sharing cores with the simulation, (2) using distinct cores on the same node, (3) processing data on remote staging nodes over the network; each node holds CPUs, DRAM, NVRAM, SSD, and hard disk.
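The scheduling choices above can be sketched as a toy driver loop. The simulation and analysis bodies below are placeholders; only the dispatch pattern (inline every nth step vs. hand-off to a staging thread) is the point:

```python
# A minimal sketch of the two scheduling options above: run analysis
# synchronously every nth simulation step, or hand it off asynchronously
# to a staging thread via a queue. Simulation/analysis bodies are stubs.
import queue
import threading

N = 4                      # analyze every nth step
results = []

def analyze(step, data):
    results.append((step, sum(data)))

# --- Option 1: synchronous, inline with the simulation loop ---
for step in range(8):
    data = [step] * 3                 # stand-in for simulation output
    if step % N == 0:
        analyze(step, data)           # simulation blocks here

# --- Option 2: asynchronous, via a staging queue ---
q = queue.Queue()
worker = threading.Thread(
    target=lambda: [analyze(*item) for item in iter(q.get, None)])
worker.start()
for step in range(8):
    data = [step] * 3
    if step % N == 0:
        q.put((step, data))           # hand off a copy; keep computing
q.put(None)                           # sentinel: no more work
worker.join()

print(results)   # both options analyzed steps 0 and 4
```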

SLIDE 17

Extreme Scale Storage Hierarchies: Devices

  • SRAM: latency ~1X
  • DRAM: latency ~10X
  • 3D-RAM: latency ~100X
  • NAND SSDs: latency ~100,000X
  • Disks: latency ~10,000,000X

SLIDE 18

Extreme Scale Storage Hierarchies: Architectures

  • Non-volatile memory attached to nodes or to burst-buffer nodes
  • Storage nodes accessed via a parallel file system (Lustre) or object stores (DAOS)

SLIDE 19

Time-Sensitivity of Data Storage in Scientific Workflows

Credit: Gary Grider LANL

SLIDE 20

Outline

  • Extreme scale simulation-based science – opportunities and challenges
  • Rethinking the simulations-to-insights pipeline: data staging and in-situ workflows
  • Runtime management for data staging and in-situ workflows
    – Data placement
    – Resilience
  • Conclusion
SLIDE 21

In-Staging Data Management

  • Limited DRAM capacity and decreasing bandwidth vs. increasing data sizes – staging needs to use multiple memory levels
  • The effectiveness of staging is sensitive to data placement across the staging cores/nodes and the levels of the memory hierarchy
    – Data access latency can significantly impact the overall performance of the workflows
  • Efficient data placement can be challenging because of the complex and dynamic data exchange/access patterns exhibited by the different components of the workflow, and by different workflows

SLIDE 22

Example: Managing Multi-tiered Data Staging in DataSpaces

  • A multi-tiered data staging approach that leverages both DRAM and SSD to support code coupling and data management in data-intensive simulation workflows
  • An efficient utility-based, application-aware data placement mechanism
    – Application-aware: utilizes temporal and spatial data access attributes
    – Adaptive: places data objects dynamically based on data read patterns
SLIDE 23

Autonomic Data Management

  • Objective: optimize data access by prefetching and appropriately placing target data objects in DRAM prior to a read request from a coupled component
  • Approach: leverage spatial and temporal data read pattern information to dynamically place data objects at different memory hierarchy levels, i.e., SSD or DRAM

Anticipate data read patterns
  • User-provided data read information as 'hints'
  • Application-level data locality
  • Runtime prediction of data read patterns

Quantify utility
  • Data objects predicted to be read in the near future have higher utility
  • Utility quantifies the relative value of data objects in the staging area, for example, based on anticipated data read patterns

Place data objects
  • Data objects with higher utility remain at the DRAM level longer
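A minimal sketch of utility-driven placement, assuming a stand-in utility score (predicted steps until the next read) rather than the runtime's actual predictor:

```python
# A sketch of utility-driven placement across DRAM and SSD tiers: objects
# predicted to be read soonest get the highest utility and stay in DRAM;
# the rest are demoted to SSD. The prediction values are a stand-in for
# the runtime's actual read-pattern predictor.
def place(predictions, dram_slots):
    """predictions: {object: predicted steps until next read}. Lower = hotter."""
    ranked = sorted(predictions, key=predictions.get)   # soonest-read first
    return set(ranked[:dram_slots]), set(ranked[dram_slots:])

predictions = {"d1": 1, "d2": 10, "d3": 2, "d4": 50}
dram, ssd = place(predictions, dram_slots=2)
print(sorted(dram), sorted(ssd))   # -> ['d1', 'd3'] ['d2', 'd4']
```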
SLIDE 24

Acquiring Access/Location Information

  • Determining data access patterns
    ○ User-defined hints about temporal and spatial access
    ○ Anticipated access patterns based on historical data accesses

Figure: an illustration of spatial-temporal data read patterns for a 2D data domain with N time steps; gray regions are data written into the staging area, while yellow and checkered regions are data read by two different applications. A second illustration shows a feature-tracking case: subtle vortical structures identified in a large and complex flow field of turbulent combustion.

SLIDE 25

Determining Data Placement

○ Place data close to computation
  ○ Reduces data access costs
  ○ Example: P4 accesses data d4; P4 is mapped to S1; so d4 is placed on S1
○ Dynamically replicate data to resolve conflicting requirements
  ○ Efficient usage of storage space
  ○ Example: d2 is replicated and placed on both S1 and S2
SLIDE 26

Data Placement for Asynchronous Coupling of Task-based Scientific Workflows

  • The performance of task-based application workflows is sensitive to data placement across the staging cores/nodes
    – Which data elements to place; where to place them
  • Data prioritization – determine which data to place
    – Asynchronous data generation
    – Estimated Execution Time (EET): history-based estimation
  • Resource selection – determine where to place data
    – Computation load: estimated based on EET
    – Data affinity: based on the dataflow graph
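The resource-selection step can be sketched as a weighted score over candidate staging nodes. The weighting scheme and example numbers below are illustrative assumptions, not the paper's exact cost model:

```python
# A sketch of resource selection: pick the staging node minimizing a
# weighted sum of estimated computation load (EET) and data-movement
# cost (modeled as 1 - affinity). Weights and values are illustrative.
def select_node(nodes, alpha=0.5):
    """nodes: {name: (estimated_exec_time, data_affinity)}, affinity in [0, 1]."""
    def cost(name):
        eet, affinity = nodes[name]
        return alpha * eet + (1 - alpha) * (1 - affinity)
    return min(nodes, key=cost)

nodes = {
    "S1": (0.9, 0.9),   # busy, but already holds most of the input data
    "S2": (0.2, 0.1),   # idle, but the data would have to be moved to it
    "S3": (0.5, 0.6),   # balanced load and affinity
}
print(select_node(nodes))   # S3 wins: costs are 0.50, 0.55, 0.45
```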
SLIDE 27

Some Relevant Papers…

  • Adaptive data placement across memory levels
    – T. Jin, et al., "Exploring Data Staging Across Deep Memory Hierarchies for Coupled Data Intensive Simulation Workflows," 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS'15), May 2015.
  • Adaptive data placement/replication across staging nodes
    – Q. Sun, et al., "Adaptive Data Placement For Staging-based Coupled Scientific Workflows," ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'15), Nov. 2015.
  • Data placement for asynchronous coupling of task-based scientific workflows
    – Q. Sun, et al., "In-Staging Data Placement for Asynchronous Coupling of Task-Based Scientific Workflows," 2nd International Workshop ESPM2'16, in conjunction with SC'16, Nov. 2016. (Best Paper)

SLIDE 28


Using Machine Learning for Autonomic Data Management (Submitted to SC’18)

SLIDE 29


Machine Learning Model

  • Artificial neural networks
    – Supervised / unsupervised learning
    – Training the network (determining the weights) takes a long time and needs sufficient input data
  • N-gram based models
    – Use n-gram models to predict not only the access patterns of offsets within a variable, but also which variable will be accessed next
    – Build a hash table for each variable and predict the upcoming sequence of accesses
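A minimal n-gram predictor in this spirit is sketched below; the context length and the keys are illustrative choices, not the exact model in the submitted paper:

```python
# A minimal n-gram access predictor: a hash table maps each length-n
# context of recent accesses to counts of what came next, and the most
# frequent successor is the prefetch candidate. Illustrative sketch only.
from collections import Counter, defaultdict

class NGramPredictor:
    def __init__(self, n=2):
        self.n = n
        self.table = defaultdict(Counter)   # context tuple -> successor counts
        self.history = []

    def observe(self, access):
        if len(self.history) >= self.n:
            ctx = tuple(self.history[-self.n:])
            self.table[ctx][access] += 1    # record "ctx was followed by access"
        self.history.append(access)

    def predict(self):
        ctx = tuple(self.history[-self.n:])
        if ctx not in self.table:
            return None                     # unseen context: nothing to prefetch
        return self.table[ctx].most_common(1)[0][0]

p = NGramPredictor(n=2)
for a in ["A", "B", "C", "A", "B", "C", "A", "B"]:
    p.observe(a)
print(p.predict())   # after context (A, B), "C" is the most frequent successor
```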

SLIDE 30


Some Early Results for N-gram based Prefetching

Figure: cumulative read time for 20 time steps and the percentage of read requests going to disk, comparing in-memory, ML-based, locality-based, and no-prefetching strategies, as the number of cores (Servers+S3D+Analysis) grows: (a) servers and clients increased in the same ratio (256+4096+4096, 512+8192+8192, 1024+16384+16384); (b) clients increased only (256+4096+4096, 256+8192+8192, 256+16384+16384).

Experiments were run on the Titan supercomputer with the S3D workflow with DNS–LES coupling.

Resilience

SLIDE 31


Outline

  • Extreme scale simulation-based science – opportunities and challenges
  • Rethinking the simulations-to-insights pipeline: data staging and in-situ workflows
  • Runtime management for data staging and in-situ workflows
    – Data placement
    – Resilience
  • Conclusion
SLIDE 32

Data Resilience Challenge at Extreme Scale

Node Failure Frequency in Current Systems

  • Histograms of inter-failure arrival time, based on available public records in: D. Tiwari, S. Gupta, S. S. Vazhkudai, "Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems," DSN 2014
  • Metrics: average MTBF (e.g., 8h on Titan); longest period without any failures (e.g., 24h on Titan)
  • Estimated MTBF for an exascale system would be in minutes

SLIDE 33

Data Resilience for Staging-based In-Situ Workflows?

  • Unfortunately, traditional HPC fault-tolerance techniques cannot be directly used to implement resilient data staging services.

Figure: coupled simulations and analytics (A–F), each protected by its own resiliency approach (checkpoint/restart, ULFM, Fenix), while the data objects held in staging memory remain vulnerable to failure.

  • Ideally, data resilience in the staging service should be transparent to workflows and impose acceptable storage/compute overheads.

Figure: impact of checkpointing on staging-based in-situ workflows – execution, checkpoint, and restart times for staged data sizes from 1GB to 8GB, comparing plain execution, CoREC, and checkpointing.

SLIDE 34

CoREC – Combining Replication and Erasure Coding [IPDPS’18]

  • A hybrid approach to data resilience for staging-based workflows
  • Leverages data classification for intelligent decision making
    – Spatial/temporal locality
    – Hot data: replication
    – Cold data: erasure coding
  • Dynamically combines erasure codes and replication

Figure: data objects in staging memory are classified by write frequency and read performance; hot data is N-way replicated, while cold data is grouped and protected with erasure-coded parity objects.
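A toy sketch of the hybrid idea: hot objects are replicated, cold objects are erasure coded. A single XOR parity block stands in for the real erasure code (CoREC uses proper k+m codes) to keep the example short:

```python
# Toy illustration of hybrid redundancy: hot objects are 2-way replicated
# (fast writes and recovery), cold objects get a single XOR parity block
# (lower storage overhead). Single-parity XOR is a stand-in for the real
# erasure code; block contents are small integers for readability.
def protect(data_blocks, hot):
    if hot:
        # Replication: every block stored twice.
        return {"scheme": "replication", "copies": [list(data_blocks)] * 2}
    # Erasure coding: k data blocks + 1 XOR parity block.
    parity = 0
    for b in data_blocks:
        parity ^= b
    return {"scheme": "erasure", "blocks": list(data_blocks), "parity": parity}

def recover_lost_block(enc, lost_index):
    """Rebuild one lost data block by XOR-ing the survivors with the parity."""
    rebuilt = enc["parity"]
    for i, b in enumerate(enc["blocks"]):
        if i != lost_index:
            rebuilt ^= b
    return rebuilt

enc = protect([0b1010, 0b0110, 0b1100], hot=False)
print(recover_lost_block(enc, lost_index=1))   # recovers 0b0110 == 6
```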

SLIDE 35

Modeling the CoREC Approach

  • Time complexity model for CoREC. Notation:
    – $T_r$, $T_e$: time complexity of replication / erasure coding
    – $f_h$, $f_c$: frequency of updates for hot / cold data
    – $P_h$, $P_c$: percentage of hot / cold data, with $P_c = 1 - P_h$
    – $N$: scale of the workload; $m$: classification miss ratio
    – $C_{erasure}$: fully erasure coding; $C_{replica}$: fully replication; $C_{hybrid}$: simple hybrid erasure coding; $C_{CoREC}$: hybrid with varying miss ratios

$$C_{hybrid} = T_r f_h N P_h + T_e f_c N P_c = (T_r f_h - T_e f_c) N P_h + T_e f_c N$$

$$C_{CoREC} = \left( T_r f_h - T_e f_c + (T_e - T_r) f_h m \right) N P_h + T_e f_c N$$

$$C_{CoREC} - C_{hybrid} = (T_e - T_r) f_h m N P_h$$

Figure: relative time complexity as a function of the percentage of hot data $P_h$.

  • Factors affecting the relative time complexity of CoREC:
    – the difference between hot and cold data update frequencies ($f_h - f_c$)
    – the difference between replication and erasure coding complexity ($T_e - T_r$)
    – the scale of the workload ($N$)
    – the accuracy of data classification (miss ratio $m$)
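A quick numeric check of the model above, with arbitrary parameter values: the extra cost of CoREC over the ideal hybrid should equal the misclassification term $(T_e - T_r) f_h m N P_h$.

```python
# Numeric sanity check of the CoREC cost model; parameter values are
# arbitrary. The overhead of CoREC relative to the ideal (miss-free)
# hybrid must equal the misclassification term (Te - Tr) * fh * m * N * Ph.
Tr, Te = 1.0, 4.0      # per-update cost of replication / erasure coding
fh, fc = 10.0, 1.0     # update frequency of hot / cold data
Ph = 0.3               # fraction of hot data (Pc = 1 - Ph)
N = 1000.0             # workload scale
m = 0.1                # classification miss ratio

C_hybrid = Tr * fh * N * Ph + Te * fc * N * (1 - Ph)
C_CoREC = (Tr * fh - Te * fc + (Te - Tr) * fh * m) * N * Ph + Te * fc * N
overhead = C_CoREC - C_hybrid

print(overhead)   # equals (Te - Tr) * fh * m * N * Ph = 900.0
```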

SLIDE 36

Data Classification

  • Hot/cold data: if a data object has been accessed more than a threshold number of times within a certain recent time interval, it is considered hot data; otherwise it is considered cold data.
  • Hot/cold data classification: based on spatial and temporal data locality.

Figure: an illustration of spatial and temporal data write/update patterns for a 2D data domain with N+1 time steps. The red and slashed regions (hot data) indicate data written into the staging area, while the blue regions (cold data) have not been updated since time step i. (a) Single-time-step data locality; (b) multi-time-step data locality.
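The threshold rule above can be sketched directly; the window and threshold values below are illustrative, not the paper's tuned settings:

```python
# Sketch of the hot/cold threshold rule: an object accessed more than
# `threshold` times within the last `window` time steps is hot, otherwise
# cold. Window and threshold values are illustrative.
def classify(access_log, now, window=5, threshold=2):
    """access_log: {obj: [time steps at which the object was accessed]}."""
    hot, cold = set(), set()
    for obj, times in access_log.items():
        recent = [t for t in times if now - window < t <= now]
        (hot if len(recent) > threshold else cold).add(obj)
    return hot, cold

log = {"d1": [6, 8, 9, 10], "d2": [1, 2, 10], "d3": [3]}
hot, cold = classify(log, now=10)
print(sorted(hot), sorted(cold))   # -> ['d1'] ['d2', 'd3']
```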

SLIDE 37

Experimental Evaluation

Result: CoREC reduces the write response time by 7.3%, 14.8%, and 5.4% compared to full erasure coding at the three scales, respectively. CoREC reduces the read response time by up to 40.8% and 37.4% for one and two failures, respectively.

Figure: comparison of the cumulative data read/write response times using the S3D combustion simulation and analysis workflow on Titan (Cray XK7), for disk, DataSpaces, replication, erasure coding, and CoREC (with zero, one, and two failures) at 4480, 8960, and 17920 cores.

Experimental setup:

  Total cores              4480              8960              17920
  Simulation cores         16x16x16 = 4096   32x16x16 = 8192   32x32x16 = 16384
  Staging cores            256               512               1024
  Analysis cores           128               256               512
  Volume size              1024x1024x1024    2048x1024x1024    2048x2048x1024
  Data size (GB)           160               320               640
  No. of replicas          1                 1                 1
  No. of data objects      3                 3                 3
  No. of parity objects    1                 1                 1
  Storage efficiency       67%               67%               67%

SLIDE 38

WDM Fusion Co-design Workflow (ECP)

Figure: the WDM fusion co-design workflow couples the XGC and GENE codes through an interpolator; XGC and GENE outputs pass through reduction stages to visualization components (XGC viz., GENE viz., comparative viz., performance viz.) and to the storage hierarchy (NVRAM, PFS, tape).

SLIDE 39

XGC1 – XGCa Coupled Plasma Fusion Simulation

  • XGCa accelerates XGC1 using a coarse particle simulation
  • Large intermediate turbulence and particle data
    – One-way data exchange of two types of data: particle data (large size, single iteration) and turbulence data (small size, multiple iterations)
    – Data generated in each cycle needs to be cached for further use
  • Loosely- and tightly-coupled variants

Figure: alternating XGC-1 and XGC-a execution phases exchange particle and turbulence data through I/O steps.

SLIDE 40

Exchanging Particle Information (2-Way)

Performance of reading particle data* (sec):

  Approach       Setup 1   Setup 2   Setup 3
  Disk-based     35.792    90.062    425.059
  Server-based   1.865     2.283     3.781
  On-node        0.097     0.131     0.316

* XGC1 and XGCa averages shown together.

  • On-node staging decreases the total time for writing particle data by avg. 99% compared to disk-based and by avg. 93% compared to server-based
  • On-node staging decreases the total time for reading particle data by avg. 98% compared to disk-based and by avg. 92% compared to server-based

SLIDE 41


Summary & Conclusions

  • Complex applications running on high-end systems generate extreme amounts of data that must be managed and analyzed to gain insights
    – Data costs (performance, latency, energy) are quickly dominating
    – Traditional data management/analytics pipelines are breaking down
  • Hybrid data staging, in-situ workflow execution, adaptive data placement, dynamic reliability, etc. can address these challenges
    – They enable users to efficiently intertwine applications, libraries, and middleware for complex analytics
  • Many challenges remain: programming, mapping and scheduling, control and data flow, autonomic runtime management, …
    – The DataSpaces project explores solutions at various levels

SLIDE 42

Thank You!

Manish Parashar Email: parashar@rutgers.edu WWW: dataspaces.org