Large-Scale Adaptive Mesh Simulations Through Non-Volatile Byte-Addressable Memory



SLIDE 1

Large-Scale Adaptive Mesh Simulations Through Non-Volatile Byte-Addressable Memory

Bao Nguyen, Hua Tan, Xuechen Zhang, Kei Davis*

SLIDE 2

Octree Meshing is Widely Used in HPC Simulation

Example simulations: droplet breakup, micro-boiling, droplet ejection.

SLIDE 3

Quad/Octree-Based Adaptive Meshing

Figure panels: domain decomposition; quad/octree representation in DRAM (root R, octants 1-10).

Because models span ever larger length and time scales, DRAM demand is significant even on supercomputers.

SLIDE 4

Per-core DRAM Capacity is Shrinking on Supercomputers

  • Jaguar: 2.7-4 GB/core
  • Titan: 2 GB/core

The cause is the associated capital cost and power consumption.

SLIDE 5

Using Non-Volatile Byte-addressable Memory for Meshing

                       Flash        DRAM         NVBM
Speed                  Low          High         High*
Cost                   Decreasing   Increasing   Decreasing
Non-Volatility         Yes          No           Yes
Byte-Addressability    No           Yes          Yes
Power                  Low          High         Low

SLIDE 6

Existing Applications were Not Designed for NVBM

  • In-core algorithms: linear octree [SC’07], parallel octree [SC’05], etc. But they save snapshots on storage systems for failure recovery, so I/O can become the bottleneck.
  • Out-of-core algorithms: Etree [SC’04], visualization [TVCG’97], etc. But they were designed for slow non-volatile media, e.g., SSDs and HDDs.

Can we support in-NVBM octree meshing that bypasses slow I/O buses?

SLIDE 7

Challenge I: NVBM Writes Incur Higher Latency

NVBM write latency is 2.5X that of DRAM.

Figure labels: DRAM, NVBM.

Meshing operations (e.g., refinement) are write-intensive.

SLIDE 8

Challenge II: Existing Octrees Are Not Durable for NVBM

A failure during a pointer write may leave the pointer linked to an undefined region in NVBM.

Figure panels: after a normal pointer write (octants 7, 8, 9, 10, 11); after a failed pointer write (octants 7, 8, 9, 10, X).

SLIDE 9

Challenge III: Difficult to Handle Special Pointers

Figure: an octree (root R, octants 1-10) spanning DRAM and NVBM, with the special pointers labeled.

Handling special pointers introduces extra complexity for application developers.

SLIDE 10

Design Objectives of Persistent-Merged Octree

  • In-NVBM meshing & storage
  • Hiding write latency to NVBM
  • Orthogonal persistence

Combined, these objectives define the persistent-merged octree (PM-octree).

SLIDE 11

PM-Octree Design: A Multi-Version Data Structure

Version    State        Location
V_{i-1}    Persistent   NVBM
V_i        Volatile     DRAM + NVBM

The persistent version provides the desired durability.

SLIDE 12

PM-Octree Design: Octant Sharing between Versions

Figure: versions V_{i-1} and V_i over the C1 tree in NVBM.

Observation: many spatial domains do not change between adjacent time steps.

Sharing octants between versions reduces memory usage by up to 1.9X.

SLIDE 13

PM-Octree Design: Partitioned Data Structure

Figure: an octree (root R, octants 1-10) partitioned into a C1 tree in NVBM and a C0 tree in DRAM; version labels V_i and V_i^D.

The partitioned structure makes effective use of both DRAM and NVBM.

SLIDE 14

PM-Octree Design: Dynamic Layout Transformation

Layout transformation is periodically executed to hide NVBM write latency.

Figure panels: placement of the octree (root R, octants 1-10) across NVBM and DRAM before and after a layout transformation; version labels V_i and V_i^D.

SLIDE 15

Putting Together the Components of PM-Octree

A multi-version data structure for both in-memory meshing and storage.

Figure labels: NVBM (C1 trees, versions V_{i-1} and V_i); DRAM (C0 tree, V_i^D).

It provides near-instantaneous failure recovery because data is accessed over the memory bus.

SLIDE 16

Basic Operation: Octant Insertion

Figure panels: before inserting octant 11 (version V_{i-1}, root R, octants 1-10); after inserting octant 11 (versions V_{i-1} and V_i, with new nodes R', 9' and 11). Additional labels: u, u'.

SLIDE 17

Basic Operation: Octant Update

Figure panels: before updating octant 10 (versions V_{i-1} and V_i, with nodes R', 9' and 11); after updating octant 10 (the same tree with the additional node 10'). Additional labels: u, u'.
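The primed labels in the insertion and update figures (R', 9', 10') suggest that a change copies the octants on the path from the root to the modified octant and leaves everything else shared with the previous version, a technique usually called path copying. The sketch below illustrates that reading only; it is not the authors' implementation, it ignores NVBM placement entirely, and `octant`, `octant_new` and `version_update` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define NCHILD 8  /* an octant has up to eight children */

typedef struct octant {
    int value;                     /* placeholder payload */
    struct octant *child[NCHILD];  /* NULL where no child exists */
} octant;

/* Allocate an octant with no children (hypothetical helper). */
static octant *octant_new(int value) {
    octant *o = calloc(1, sizeof *o);
    if (o) o->value = value;
    return o;
}

/*
 * Return the root of a new version in which the octant reached by
 * following path[0..depth-1] holds new_value. If that octant does not
 * exist yet it is created, which models insertion. Only octants on the
 * path are copied; all other subtrees are shared with the old version,
 * and the old version is never modified.
 */
static octant *version_update(const octant *old, const int *path,
                              int depth, int new_value) {
    octant *copy = octant_new(0);
    if (!copy) return NULL;
    if (old) *copy = *old;  /* copies the child pointers, so subtrees are shared */
    if (depth == 0) {
        copy->value = new_value;
        return copy;
    }
    assert(path[0] >= 0 && path[0] < NCHILD);
    const octant *next = old ? old->child[path[0]] : NULL;
    octant *sub = version_update(next, path + 1, depth - 1, new_value);
    if (!sub) {             /* allocation failed further down the path */
        free(copy);
        return NULL;
    }
    copy->child[path[0]] = sub;
    return copy;
}
```

Each call allocates one octant per level of the path, so the cost of a new version is proportional to the tree depth rather than to the tree size. Both roots stay valid afterwards, which is the property a persistent previous version relies on.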

SLIDE 18

PM-Octree Design: Orthogonal Persistence

Routine                                Description
pmoctree *pm_create(octree *tree)      Create a new PM-octree; return a pointer to V_i
void pm_persistent(pmoctree *tree)     Create a persistent version of the octree
pmoctree *pm_restore(void)             Restore a PM-octree; return a pointer to V_i
void pm_delete(pmoctree *tree)         Delete all octants on NVBM and DRAM

We integrated it with the Gerris flow solver.
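To show how an application might call the four routines, here is a small usage sketch. The prototypes follow the table, but everything else is an assumption made for illustration: the struct layouts, the in-DRAM stand-in bodies, the static `nvbm_region` object that plays the role of memory surviving a failure, and the `run_steps` driver are all hypothetical and say nothing about how PM-octree is actually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder types: the slides treat both as opaque. */
typedef struct octree { int n_octants; } octree;
typedef struct pmoctree {
    octree working;     /* stands in for the volatile version V_i */
    octree persistent;  /* stands in for the persistent version V_{i-1} */
    int in_use;
} pmoctree;

/* One static object stands in for the NVBM region that survives a failure. */
static pmoctree nvbm_region;

/* Stand-in: wrap an existing octree and persist its initial state. */
pmoctree *pm_create(octree *tree) {
    assert(tree != NULL);
    nvbm_region.working = *tree;
    nvbm_region.persistent = *tree;
    nvbm_region.in_use = 1;
    return &nvbm_region;
}

/* Stand-in: the current working version becomes the persistent one. */
void pm_persistent(pmoctree *tree) {
    tree->persistent = tree->working;
}

/* Stand-in: drop volatile changes and resume from the persistent version. */
pmoctree *pm_restore(void) {
    if (!nvbm_region.in_use) return NULL;
    nvbm_region.working = nvbm_region.persistent;
    return &nvbm_region;
}

/* Stand-in: release the PM-octree. */
void pm_delete(pmoctree *tree) {
    tree->in_use = 0;
}

/*
 * Hypothetical driver: each step refines one leaf (adding 8 octants) and
 * every `interval` steps the tree is persisted. A failure is then
 * simulated by calling pm_restore; the octant count seen after recovery
 * is returned.
 */
int run_steps(int steps, int interval) {
    octree initial = { 1 };
    pmoctree *pm = pm_create(&initial);
    for (int step = 1; step <= steps; step++) {
        pm->working.n_octants += 8;
        if (step % interval == 0) pm_persistent(pm);
    }
    pm = pm_restore();  /* simulated failure and recovery */
    int recovered = pm->working.n_octants;
    pm_delete(pm);
    return recovered;
}
```

In this stand-in, work done after the last `pm_persistent` call is lost on recovery, so the persist interval trades recovery loss against the cost of creating persistent versions.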

SLIDE 19

Experimental Setting

  • Hardware
      • Titan at ORNL
      • Emulation of NVBM using DRAM
  • Simulation
      • Droplet rotation and ejection

Latency (ns)    DRAM    NVBM
Read              60     100
Write             60     150
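The latency table can be turned into a simple per-access cost model, which also shows where the 2.5X write-latency figure comes from (150 ns / 60 ns). The sketch below is only that arithmetic; the slides do not describe how the emulation on Titan was implemented, and the names `medium`, `modeled_time_ns` and `emulation_delay_ns` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { MEDIUM_DRAM = 0, MEDIUM_NVBM = 1 } medium;

typedef struct {
    uint64_t read_ns;
    uint64_t write_ns;
} latency;

/* Read and write latencies taken from the experimental-setting table. */
static const latency LATENCY[] = {
    [MEDIUM_DRAM] = { 60, 60 },
    [MEDIUM_NVBM] = { 100, 150 },
};

/* Modeled time for a given number of reads and writes on one medium. */
static uint64_t modeled_time_ns(medium m, uint64_t reads, uint64_t writes) {
    assert(m == MEDIUM_DRAM || m == MEDIUM_NVBM);
    return reads * LATENCY[m].read_ns + writes * LATENCY[m].write_ns;
}

/*
 * Extra time the same access mix would take on NVBM compared with DRAM,
 * i.e. the delay an emulator running on DRAM would have to account for.
 */
static uint64_t emulation_delay_ns(uint64_t reads, uint64_t writes) {
    return modeled_time_ns(MEDIUM_NVBM, reads, writes)
         - modeled_time_ns(MEDIUM_DRAM, reads, writes);
}
```

Under this model a write-heavy phase such as refinement pays 90 ns extra per write but only 40 ns extra per read, which is why keeping write-intensive work in DRAM matters more than keeping reads there.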

SLIDE 20

Comparison of Meshing Methods

Method name           Objects in DRAM   Objects in NVBM   Interface
In-core-octree        Octants           Snapshot          File System
Out-of-core-octree    Cache             Octant record     File System
PM-octree             Octants           Octants           Memory

SLIDE 21

Weak Scaling

  • 1.2M to 1077M elements
  • 1 to 1000 PEs
  • Number of elements on each PE: ~1 million

The execution time of PM-octree grows logarithmically with problem size.

SLIDE 22

Execution Time Breakdown with Weak Scaling

Tree-partitioning overhead prevents optimal speedup from being achieved.

SLIDE 23

Strong Scaling

  • Problem size is 150 million elements
  • 240 to 1000 PEs

The scalability of PM-octree is similar to that of in-core-octree.

SLIDE 24

Execution Time Breakdown with Strong Scaling

No scalability issue is apparent: no major fluctuation is observed in the breakdown.

SLIDE 25

Failure Recovery

  • PM-octree guarantees data consistency after failures.
  • PM-octree reduces the failure recovery time by up to 20X.

SLIDE 26

Conclusions

  • PM-octree effectively extends memory capacity using NVBM.
  • It scales as well as in-core algorithms.
  • It significantly reduces recovery time.
  • It provides an easy-to-program interface.

SLIDE 27

Acknowledgments

Xuechen Zhang (xuechen.zhang@wsu.edu)

Bao Nguyen, Hua Tan

SLIDE 28

Basic Operation: Octant Merging

Figure panels: before merging C0; after merging C0. Legend: C1 (NVBM subtree), C0 (DRAM subtree); versions V_{i-1} and V_i.

SLIDE 29

Basic Operations: Persistent

Figure panels: before persistent (roots R and R'; versions V_{i-1} and V_i); after persistent (roots R and R'; versions V_i and V_{i+1}).

SLIDE 30

Dynamic Layout Transformation

Execution time is reduced by 25% while the number of writes is reduced by up to 30%.

SLIDE 31

Impact of DRAM Size

DRAM size influences both the merging frequency and the execution time.