SLIDE 1

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw

D. E. Shaw Research
SLIDE 2

Motivation

Goal: to model biological processes that occur on the millisecond time scale.

Approach: a specialized, massively parallel supercomputer called Anton (2009 ACM Gordon Bell Award for Special Achievement).

SLIDE 3

Millisecond-scale MD Trajectories

  • Simulation length: 1 × 10⁻³ s
  • Output interval: 10 × 10⁻¹² s
  • Number of frames: (1 × 10⁻³ s) ÷ (10 × 10⁻¹² s) = 100 M frames
  • Biomolecular system: 25 K atoms
  • Position and velocity: 24 bytes/atom
  • Frame size: 25 K atoms × 24 bytes/atom = 0.6 MB/frame
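These figures multiply out consistently. As a quick sanity check, here is the arithmetic in Python (the language HiMach user programs are written in, per a later slide):

```python
# Back-of-the-envelope check of the trajectory numbers above.
SIM_LENGTH_S = 1e-3         # simulation length: 1 x 10^-3 s
OUTPUT_INTERVAL_S = 10e-12  # output interval: 10 x 10^-12 s
N_ATOMS = 25_000            # biomolecular system: 25 K atoms
BYTES_PER_ATOM = 24         # position and velocity: 24 bytes/atom

n_frames = SIM_LENGTH_S / OUTPUT_INTERVAL_S  # 1e8 = 100 M frames
frame_size = N_ATOMS * BYTES_PER_ATOM        # 600,000 bytes = 0.6 MB/frame
total_bytes = n_frames * frame_size          # 6e13 bytes = 60 TB (slide 9)

print(f"{n_frames:.0e} frames, {frame_size / 1e6:.1f} MB/frame, "
      f"{total_bytes / 1e12:.0f} TB total")
```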

SLIDE 4

Part I: How We Analyze Simulation Data in Parallel

SLIDE 5

An MD Trajectory Analysis Example: Ion Permeation

SLIDE 6

A Hypothetical Trajectory

[Figure: positions of Ion A and Ion B plotted over 30 frames]

20,000 atoms in total; two ions of interest

SLIDE 7

Ion State Transition

[Diagram: three ion states (above channel, inside channel, below channel), with transitions into the channel from above and from below]
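A minimal Python sketch of the state machine implied by this diagram; the channel boundaries z_top and z_bottom are invented for illustration, and the state names are paraphrased from the slide:

```python
# Hypothetical classifier: map an ion's z-coordinate to a channel state.
def ion_state(z, z_top=1.5, z_bottom=0.5):  # boundaries are made up
    if z > z_top:
        return "above channel"
    if z < z_bottom:
        return "below channel"
    return "inside channel"

# Label the two transitions of interest from the diagram.
def transition(prev, curr):
    if curr == "inside channel" and prev == "above channel":
        return "into channel from above"
    if curr == "inside channel" and prev == "below channel":
        return "into channel from below"
    return None
```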

SLIDE 8

Typical Sequential Analysis

  • Maintain a main-memory-resident data structure to record states and positions
  • Process frames in ascending simulated-physical-time order
  • Strong inter-frame data dependence: data analysis is tightly coupled with data acquisition
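In code, the sequential pattern looks roughly like the loop below. The frame reader and event recorder are hypothetical stand-ins, and ion_state/transition are the sketches from the ion-state slide:

```python
# Sequential analysis: a main-memory dict carries each ion's state from
# frame to frame, so frame t cannot be processed before frame t-1.
states = {}  # ion_id -> (state, position)

for frame in read_frames_in_time_order(trajectory):  # hypothetical reader
    for ion_id, pos in frame.ion_positions():        # hypothetical accessor
        prev_state, _ = states.get(ion_id, ("above channel", None))
        curr_state = ion_state(pos.z)
        if transition(prev_state, curr_state):
            record_event(ion_id, frame.time)         # hypothetical recorder
        states[ion_id] = (curr_state, pos)
```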

SLIDE 9

Problems with Sequential Analysis

  • Millisecond-scale trajectory size: 60 TB
  • Local disk read bandwidth: 100 MB/s
  • Time to fetch data to memory: 60 TB ÷ 100 MB/s = 6 × 10⁵ s ≈ 1 week
  • Analysis time: varied
  • Time to perform data analysis: weeks

Sequential analysis lacks the computational, memory, and I/O capabilities!

SLIDE 10

A Parallel Data Analysis Model

  • Trajectory definition: specify which frames are to be accessed
  • Stage 1: per-frame data acquisition
  • Stage 2: cross-frame data analysis

Decouple data acquisition from data analysis

SLIDE 11

Trajectory Definition

[Figure: the hypothetical trajectory of Ion A and Ion B with every other frame selected]

Every other frame in the trajectory

SLIDE 12

Per-frame Data Acquisition (Stage 1)

[Figure: the selected frames of the hypothetical trajectory partitioned between processes P0 and P1]

SLIDE 13

Cross-frame Data Analysis (stage 2)

[Figure: the hypothetical trajectory with Ion A's data gathered on P0 and Ion B's on P1]

Analyze ion A on P0 and ion B on P1 in parallel

SLIDE 14

Inspiration: Google’s MapReduce

[Diagram: MapReduce data flow. map() tasks read input files from the Google File System and emit key-value pairs such as K1: {v1i} and K2: {v2i}; the values are grouped by key so that reduce(K1, ...) receives K1: {v1j, v1i, v1k} and reduce(K2, ...) receives K2: {v2k, v2j, v2i}; each reduce task writes an output file]

SLIDE 15

Trajectory Analysis Cast Into MapReduce

  • Per-frame data acquisition (stage 1): map()
  • Cross-frame data analysis (stage 2): reduce()
  • Key-value pairs connect stage 1 and stage 2
  • Keys: categorical identifiers or names
  • Values: include timestamps
  • Example pair: key ion_id_j, value (t_k, x_jk, y_jk, z_jk)
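A hypothetical HiMach-style program for the ion-permeation example (the actual HiMach API may differ). Stage 1 emits one key-value pair per ion per frame; stage 2 replays one ion's samples in time order, reusing the ion_state/transition sketches from earlier:

```python
# Stage 1 (map): per-frame acquisition -- no dependence between frames.
def map_frame(frame):
    for ion_id, (x, y, z) in frame.ion_positions():  # hypothetical accessor
        yield ion_id, (frame.time, x, y, z)          # key: ion, value: sample

# Stage 2 (reduce): cross-frame analysis over one ion's full time series.
def reduce_ion(ion_id, samples):
    events, prev = [], "above channel"
    for t, x, y, z in sorted(samples):  # ascending simulated time
        curr = ion_state(z)
        label = transition(prev, curr)
        if label:
            events.append((t, label))
        prev = curr
    return ion_id, events
```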

SLIDE 16

The HiMach Library

  • A MapReduce-style API that lets users write Python programs to analyze MD trajectories
  • A parallel runtime that automatically executes HiMach user programs in parallel on a Linux cluster
  • Performance on a Linux cluster: two orders of magnitude faster on 512 cores than on a single core

SLIDE 17

Typical Simulation–Analysis Storage Infrastructure

[Diagram: the parallel supercomputer writes trajectory data through an I/O node to file servers; the analysis cluster's nodes, each with local disks, run the parallel analysis programs and read from the file servers]

SLIDE 18

Part II: How We Overcome the I/O Bottleneck in Parallel Analysis

SLIDE 19

Trajectory Characteristics

  • A large number of small frames
  • Write once, read many
  • Distinguishable by unique integer sequence numbers
  • Amenable to out-of-order parallel access in the map phase

SLIDE 20

Our Main Idea

  • At simulation time, actively cache frames in the local disks of the analysis nodes as the frames become available
  • At analysis time, fetch data from the local disk caches in parallel

SLIDE 21

Limitations

  • Requires large aggregate disk capacity on the analysis cluster
  • Assumes a relatively low average simulation data output rate

SLIDE 22

An Example

[Diagram: two analysis nodes cache frames in local /bodhi directories, with an NFS server behind them. Analysis node 0 holds some of sim0's and sim1's frames (e.g. f0 and f2); analysis node 1 holds others (e.g. f1 and f3). For each simulation, a node's local bitmap (1 0 1 0) and the remote bitmap (0 1 0 1) combine into the merged bitmap (1 1 1 1)]
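The bitmap bookkeeping in this example reduces to a bitwise OR; a minimal sketch for the four frames of one simulation:

```python
# Node 0 caches frames f0 and f2; node 1 caches f1 and f3.
local_bitmap = [1, 0, 1, 0]   # what this node holds
remote_bitmap = [0, 1, 0, 1]  # what the other node reports

merged_bitmap = [a | b for a, b in zip(local_bitmap, remote_bitmap)]
assert merged_bitmap == [1, 1, 1, 1]  # every frame is cached somewhere;
                                      # a 0 here would mean "fetch from NFS"
```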

SLIDE 23

The Zazen Protocol

How to guarantee that each frame is read by one and only one node in the face of node failure and recovery?

SLIDE 24

The Zazen Protocol

  • Execute a distributed consensus protocol before performing actual disk I/O
  • Assign data-retrieval tasks in a location-aware manner
  • Read data from local disks if the data are already cached
  • Fetch missing data from file servers
  • No metadata servers to keep a record of who has what
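A sketch of a location-aware assignment rule that yields the one-and-only-one-reader guarantee; the lowest-rank tie-breaking and the round-robin rule for uncached frames are illustrative assumptions, not necessarily the paper's exact choices:

```python
def assign_frames(owners, my_rank, n_ranks):
    """Decide which frames this node reads. owners[frame_id] is the
    sorted list of ranks caching that frame (empty if uncached), as
    learned from the consensus step."""
    local_reads, server_fetches = [], []
    for frame_id, ranks in enumerate(owners):
        if ranks:
            if ranks[0] == my_rank:  # lowest-ranked cacher reads locally
                local_reads.append(frame_id)
        elif frame_id % n_ranks == my_rank:  # uncached: spread round-robin
            server_fetches.append(frame_id)  # fetch from the file servers
    return local_reads, server_fetches
```

Because every node applies the same deterministic rule to the same merged information, each frame is claimed by exactly one node, with no metadata server involved.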

SLIDE 25

The Zazen Protocol (cont’d)

  • Bitmaps: a compact structure for recording the presence or absence of a cached copy
  • All-to-all reduction: an efficient mechanism for inter-processor collective communication (an MPI library was used in practice)
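With MPI, the bitmap merge is a single collective operation. A sketch using mpi4py and NumPy (the slide says an MPI library was used, though not necessarily through this interface; n_frames and cached_frame_ids are hypothetical inputs from a local disk scan):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# One slot per frame: 1 if this node caches the frame, else 0.
local_bitmap = np.zeros(n_frames, dtype=np.uint8)
local_bitmap[list(cached_frame_ids)] = 1

# Bitwise-OR all-reduce: afterwards every node holds the merged bitmap.
merged_bitmap = np.empty_like(local_bitmap)
comm.Allreduce(local_bitmap, merged_bitmap, op=MPI.BOR)
```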

SLIDE 26

Implementation

  • The Bodhi library
  • The Bodhi server
  • The Zazen protocol

[Diagram: the parallel supercomputer writes through an I/O node running the Bodhi library to the file servers; each analysis node in the Zazen cluster runs a Bodhi server and the Bodhi library, and the parallel analysis programs (HiMach jobs) coordinate via the Zazen protocol]

SLIDE 27

Performance Evaluation

SLIDE 28

Experiment Setup

  • A Linux cluster with 100 nodes
  • Two Intel Xeon 2.33 GHz quad-core processors per node
  • Four 500 GB 7,200-RPM SATA disks per node, organized in RAID 0
  • 16 GB physical memory per node
  • CentOS 4.6 with Linux kernel 2.6.26
  • Nodes connected to a Gigabit Ethernet core switch
  • Common access to NFS directories exported by a number of enterprise storage servers

SLIDE 29

Fixed-Problem-Size Scalability

[Figure: execution time (s) vs. number of nodes, 1 to 128]

Execution time of the Zazen protocol to assign the I/O tasks of reading 1 billion frames

SLIDE 30

Fixed-Cluster-Size Scalability

[Figure: execution time (s) vs. number of frames, log-log scale]

Execution time of the Zazen protocol on 100 nodes assigning different numbers of frames

SLIDE 31

Efficiency I: Achieving Better I/O BW

[Figure: two panels of aggregate read bandwidth (GB/s) vs. application read processes per node (1–8), for 1-GB, 256-MB, 64-MB, and 2-MB files: one panel with one Bodhi daemon per user process, the other with one Bodhi daemon per analysis node]

SLIDE 32

Efficiency II: Comparison w. NFS/PFS

NFS (v3) on separate enterprise storage servers

  • Dual quad-core 2.8 GHz Opteron processors, 16 GB memory, 48 SATA disks organized in RAID 6
  • Four 1 GigE connections to the core switch of the 100-node cluster

PVFS2 (2.8.1) on the same 100 analysis nodes

  • I/O (data) server and metadata server on all nodes
  • File I/O performed via the PVFS2 Linux kernel interface

Hadoop/HDFS (0.19.1) on the same 100 nodes

  • Data stored via HDFS's C library interface; block sizes set equal to file sizes; three replicas per file
  • Data accessed via a read-only Hadoop MapReduce Java program (with a number of best-effort optimizations)

SLIDE 33

Efficiency II: Outperforming NFS/PFS

[Figure: aggregate read bandwidth (GB/s) for 2-MB, 64-MB, 256-MB, and 1-GB files under NFS, PVFS2, Hadoop/HDFS, and Zazen]

I/O bandwidth of reading files of different sizes

SLIDE 34

Efficiency II: Outperforming NFS/PFS

Time to read one terabyte of data

[Figure: time (s, log scale) to read one terabyte, for 2-MB, 64-MB, 256-MB, and 1-GB files under NFS, PVFS2, Hadoop/HDFS, and Zazen]

SLIDE 35

Read Performance under Writes (1 GB/s)

[Figure: read performance, normalized to the no-write case, for read file sizes of 2 MB, 64 MB, 256 MB, and 1 GB while 1-GB, 256-MB, 64-MB, or 2-MB files are written concurrently at 1 GB/s]

SLIDE 36

End-to-End Performance

  • A HiMach analysis program called water residence, run on 100 nodes
  • 2.5 million small frame files (430 KB each)

[Figure: execution time (s, log scale) vs. application processes per node (1–8), comparing NFS, Zazen, and in-memory data]

SLIDE 37

Robustness

  • Worst-case execution time is T(1 + δ(B/b))
  • The water-residence program was re-executed with varying numbers of nodes powered off
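Our reading of the symbols, which the slide does not spell out: T is the failure-free running time, δ the fraction of failed nodes, and B/b the ratio of local-disk to file-server read bandwidth. A tiny sketch with made-up numbers:

```python
def worst_case_time(T, delta, B, b):
    # Worst-case bound from the slide: T * (1 + delta * (B / b)).
    return T * (1 + delta * (B / b))

# Illustrative values only: 10% of nodes fail, and local disks are 5x
# faster in aggregate than the file servers that back them up.
print(worst_case_time(T=400.0, delta=0.10, B=5.0, b=1.0))  # -> 600.0
```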

[Figure: running time (s) vs. node failure rate (0%–50%): theoretical worst case vs. actual running time]

SLIDE 38

Summary

Zazen accelerates order-independent, parallel data access by (1) actively caching simulation output and (2) executing an efficient distributed consensus protocol.

  • Simple and robust
  • Scalable to a large number of nodes
  • Much higher performance than NFS/PFS
  • Applicable to a certain class of time-dependent simulation datasets