Large-Scale Adaptive Mesh Simulations Through Non-Volatile Byte-Addressable Memory
Bao Nguyen, Hua Tan, Xuechen Zhang, Kei Davis*
Octree Meshing is Widely Used in HPC Simulation
[Figure: examples of droplet breakup, micro-boiling, and droplet ejection]
Quad/Octree-Based Adaptive Meshing
[Figure: quad/octree representation in DRAM (root R, octants 1 to 10) and the corresponding domain decomposition]
Because models span ever larger length and time scales, their DRAM demand is significant even on supercomputers.
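To make the quad/octree representation concrete, here is a minimal C sketch of a quadtree (the 2D analogue of an octree) held in DRAM. The struct layout, the names `quad_new`, `quad_refine`, and `quad_count_leaves`, and the point-based refinement criterion are hypothetical illustrations, not the data structure used by the authors or by Gerris.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical quadtree cell covering [x, x+size) x [y, y+size).
   A leaf has all four children NULL; an internal cell has all four set. */
typedef struct quad {
    double x, y, size;
    int level;
    struct quad *child[4];
} quad;

static quad *quad_new(double x, double y, double size, int level) {
    quad *q = calloc(1, sizeof *q);   /* children start out NULL */
    if (!q) abort();
    q->x = x; q->y = y; q->size = size; q->level = level;
    return q;
}

/* One refinement step: split a leaf into four equal children. */
static void quad_split(quad *q) {
    double h = q->size / 2;
    for (int i = 0; i < 4; i++)
        q->child[i] = quad_new(q->x + (i & 1) * h, q->y + (i >> 1) * h,
                               h, q->level + 1);
}

/* Refine every cell containing (px, py) down to max_level. This stands in
   for a solver's error-based refinement criterion (an assumption). */
static void quad_refine(quad *q, double px, double py, int max_level) {
    int inside = px >= q->x && px < q->x + q->size &&
                 py >= q->y && py < q->y + q->size;
    if (!inside || q->level >= max_level) return;
    if (!q->child[0]) quad_split(q);
    for (int i = 0; i < 4; i++) quad_refine(q->child[i], px, py, max_level);
}

/* Leaves are the mesh elements the solver actually computes on. */
static int quad_count_leaves(const quad *q) {
    if (!q->child[0]) return 1;
    int n = 0;
    for (int i = 0; i < 4; i++) n += quad_count_leaves(q->child[i]);
    return n;
}

static void quad_free(quad *q) {
    if (!q) return;
    for (int i = 0; i < 4; i++) quad_free(q->child[i]);
    free(q);
}
```

Refining around one point to level 3 splits three nested cells, and each split turns one leaf into four, so the mesh grows from 1 to 10 leaves. Every split allocates and writes new cells, which is why refinement is write-heavy.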
Per-core DRAM Capacity is Shrinking on Supercomputers
Jaguar: 2.7 to 4 GB/core. Titan: 2 GB/core.
Per-core capacity is shrinking because of the associated capital cost and power consumption.
Using Non-Volatile Byte-addressable Memory for Meshing
                      Flash        DRAM         NVBM
Speed                 Low          High         High*
Cost                  Decreasing   Increasing   Decreasing
Non-volatility        Yes          No           Yes
Byte-addressability   No           Yes          Yes
Power                 Low          High         Low
Existing Applications Were Not Designed for NVBM
In-core algorithms: linear octree [SC’07], parallel octree [SC’05], etc. However, they save snapshots to storage systems for failure recovery, so I/O can become the bottleneck.
Out-of-core algorithms: Etree [SC’04], visualization [TVCG’97], etc. However, they were designed for slow non-volatile media such as SSDs and HDDs.
Can we support in-NVBM octree meshing bypassing slow I/O buses?
Challenge I: NVBM Writes Incur Higher Latency
NVBM write latency is 2.5X that of DRAM.
[Figure: write latency of DRAM vs. NVBM]
Meshing operations (e.g., refinement) are write-intensive.
Challenge II: Existing Octrees Are Not Durable for NVBM
A failure may cause a pointer to link to an undefined region in NVBM.
[Figure: after a normal pointer write, the links reach octants 7, 8, 9, 10, 11; after a failed pointer write, the last link (X) is undefined]
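A common remedy for this failure mode is to order writes so that a new octant is fully written and made durable before any pointer to it becomes durable. The C sketch below shows that ordering under stated assumptions. `persist()` is a hypothetical placeholder: because this sketch runs in ordinary DRAM (much as the experiments later emulate NVBM with DRAM), it only issues a C11 memory fence, whereas on real NVBM it would also flush the affected cache lines (for example with x86 `clwb` or `clflush`, followed by `sfence`). The `octant` layout and `link_child` are illustrative names, not PM-octree's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical NVBM-resident octant. Child links are atomic, so updating a
   link is a single aligned pointer store that cannot be torn. */
typedef struct octant {
    int id;
    double value;
    _Atomic(struct octant *) child[8];
} octant;

static void octant_init(octant *o, int id, double value) {
    o->id = id;
    o->value = value;
    for (int i = 0; i < 8; i++) atomic_init(&o->child[i], NULL);
}

/* Placeholder for "make [addr, addr + len) durable". In this DRAM-only
   sketch it is just a fence; real NVBM code would flush each cache line in
   the range before fencing. */
static void persist(const void *addr, size_t len) {
    (void)addr;
    (void)len;
    atomic_thread_fence(memory_order_seq_cst);
}

/* Ordered link: (1) initialize the new octant, (2) persist it,
   (3) publish the pointer with one atomic store, (4) persist the pointer.
   With a real persist(), a failure before step 3 leaves the parent's old
   link intact; the new octant is at worst leaked, never half-visible. */
static octant *link_child(octant *parent, int slot, int id, double value) {
    octant *c = malloc(sizeof *c);
    if (!c) return NULL;
    octant_init(c, id, value);                                          /* 1 */
    persist(c, sizeof *c);                                              /* 2 */
    atomic_store_explicit(&parent->child[slot], c, memory_order_release); /* 3 */
    persist(&parent->child[slot], sizeof parent->child[slot]);          /* 4 */
    return c;
}
```

The key design choice is that the only store that changes what a reader can reach is the single pointer store in step 3, and it happens after everything it points to is already durable.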
Challenge III: Difficult to Handle Special Pointers
[Figure: an octree (root R, octants 1 to 10) spread across DRAM and NVBM, with special pointers marked]
Handling special pointers introduces extra complexity for application developers.
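One reason pointers into NVBM can need special handling is that a persistent region may be mapped at a different virtual address after a restart, so raw addresses stored inside it become meaningless. A common workaround, sketched below as an illustration rather than the scheme PM-octree necessarily uses, stores links as offsets from the region base and translates them on every dereference. The `nvptr` type, `nvnode`, `nv_ref`, and `nv_deref` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical self-relative link: an offset from the start of the
   persistent region. Offset 0 is reserved to mean "null". */
typedef uint64_t nvptr;

typedef struct {
    int id;
    nvptr next;   /* link to another node inside the same region */
} nvnode;

/* Offset -> address under the region's current mapping. */
static void *nv_deref(char *base, nvptr p) {
    return p ? base + p : NULL;
}

/* Address -> offset, so the stored link survives a remap. */
static nvptr nv_ref(char *base, const void *addr) {
    return addr ? (nvptr)((const char *)addr - base) : 0;
}
```

Every access now has to go through `nv_deref` with the right base, and DRAM pointers and NVBM offsets cannot be mixed freely; that bookkeeping is the kind of burden the slide attributes to application developers.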
Design Objectives of Persistent-Merged Octree
- In-NVBM meshing & storage
- Hiding write latency to NVBM
- Orthogonal persistence
Together, these objectives define the persistent-merged octree (PM-octree).
PM-Octree Design: A Multi-Version Data Structure
[Figure: version V_{i-1} is persistent and resides in NVBM; version V_i is volatile and resides in DRAM + NVBM]
The persistent version provides the desired durability.
PM-Octree Design: Octant Sharing between Versions
[Figure: versions V_{i-1} and V_i sharing octants of the C1 tree in NVBM]
Observation: many spatial domains do not change between adjacent time steps.
Sharing octants between versions reduces memory usage by up to 1.9X.
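A back-of-the-envelope count shows why sharing unchanged octants saves memory. Assume, for illustration only, a complete octree of depth d in which exactly one leaf changes between V_{i-1} and V_i. Copying the whole tree for the new version costs N(d) octants, whereas copying only the root-to-leaf path and sharing everything else costs C_path(d):

```latex
N(d) = \sum_{k=0}^{d} 8^{k} = \frac{8^{d+1} - 1}{7},
\qquad
C_{\text{path}}(d) = d + 1,
\qquad
\text{e.g. } d = 3:\; N(3) = \frac{4096 - 1}{7} = 585,\; C_{\text{path}}(3) = 4.
```

Real simulations change many octants per time step, so the measured savings (up to 1.9X) are far below this single-change best case.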
PM-Octree Design: Partitioned Data Structure
[Figure: C1 tree in NVBM and C0 tree in DRAM (root R, octants 1 to 10), labeled V_i and V_i^D]
The partitioned structure makes effective use of both DRAM and NVBM.
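The C0/C1 split resembles a two-component log-structured design: recent changes go to a small C0 component in DRAM, the bulk of the mesh lives in the C1 component on NVBM, and C0 is periodically merged into C1. The sketch below models each component as a flat table of octant records keyed by a hypothetical integer ID, so that writes land in DRAM, lookups check C0 before C1, and a merge folds C0 into C1. The table representation, `CAP`, and all function names are assumptions for illustration, not the authors' tree code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 64   /* hypothetical capacity of each component */

typedef struct { int id; double value; } octant_rec;

typedef struct {
    octant_rec rec[CAP];
    size_t n;
} component;

/* c0 models the DRAM-resident part, c1 the NVBM-resident part. */
typedef struct { component c0, c1; } pm_partitioned;

static octant_rec *comp_find(component *c, int id) {
    for (size_t i = 0; i < c->n; i++)
        if (c->rec[i].id == id) return &c->rec[i];
    return NULL;
}

/* Writes always go to C0, keeping NVBM off the write-heavy path. */
static bool pm_put(pm_partitioned *t, int id, double value) {
    octant_rec *r = comp_find(&t->c0, id);
    if (r) { r->value = value; return true; }
    if (t->c0.n == CAP) return false;   /* C0 full: a merge is due */
    t->c0.rec[t->c0.n++] = (octant_rec){ id, value };
    return true;
}

/* Reads prefer the newer copy in C0 and fall back to C1. */
static const octant_rec *pm_get(pm_partitioned *t, int id) {
    octant_rec *r = comp_find(&t->c0, id);
    return r ? r : comp_find(&t->c1, id);
}

/* Merge: copy every C0 record into C1 (overwriting older values), then
   empty C0. On real NVBM each C1 write would also have to be persisted. */
static bool pm_merge(pm_partitioned *t) {
    for (size_t i = 0; i < t->c0.n; i++) {
        octant_rec *dst = comp_find(&t->c1, t->c0.rec[i].id);
        if (!dst) {
            if (t->c1.n == CAP) return false;
            dst = &t->c1.rec[t->c1.n++];
        }
        *dst = t->c0.rec[i];
    }
    t->c0.n = 0;
    return true;
}
```

In this model the size of C0 directly sets how often merges happen, which is one way to read the later observation that DRAM size affects merging frequency and execution time.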
PM-Octree Design: Dynamic Layout Transformation
Layout transformation is executed periodically to hide NVBM write latency.
[Figure: placement of octants R and 1 to 10 across NVBM and DRAM (V_i, V_i^D) before and after the transformation]
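The slide does not spell out the transformation policy. One plausible policy, assumed here purely for illustration, is to count writes per octant and, at the end of each period, place the most frequently written octants in DRAM and the rest in NVBM, so that future writes mostly avoid NVBM. The `octant_meta` record, `dram_slots`, and `transform_layout` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { IN_NVBM, IN_DRAM } location;

/* Hypothetical per-octant bookkeeping for the layout policy. */
typedef struct {
    int id;
    unsigned writes;    /* writes observed since the last transformation */
    location where;
} octant_meta;

/* qsort comparator: larger write counts sort first. */
static int by_writes_desc(const void *a, const void *b) {
    unsigned wa = ((const octant_meta *)a)->writes;
    unsigned wb = ((const octant_meta *)b)->writes;
    return (wa < wb) - (wa > wb);
}

/* Periodic transformation: the dram_slots hottest octants go to DRAM,
   all others to NVBM, and counters reset for the next period.
   This only reorders metadata; moving octant payloads and fixing up
   pointers, which a real implementation needs, is omitted. */
static void transform_layout(octant_meta *m, size_t n, size_t dram_slots) {
    qsort(m, n, sizeof *m, by_writes_desc);
    for (size_t i = 0; i < n; i++) {
        m[i].where = i < dram_slots ? IN_DRAM : IN_NVBM;
        m[i].writes = 0;
    }
}
```

Running the transformation periodically rather than on every write amortizes its cost, at the price of reacting to changes in the write pattern only once per period.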
Putting Together the Components of PM-Octree
A multi-version data structure for both in-memory meshing and storage.
[Figure: versions V_{i-1}, V_i, and V_i^D across the C1 tree in NVBM and the C0 tree in DRAM]
It provides near-instantaneous failure recovery by accessing NVBM directly over the memory bus.
Basic Operation: Octant Insertion
[Figure: the octree before and after inserting octant 11; afterward, the new version V_i contains copies 9' and R' alongside V_{i-1}]
Basic Operation: Octant Update
[Figure: the octree before and after updating octant 10; the update adds copy 10' to V_i, which already contains 9' and R']
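The insertion and update figures show path copying: only octants on the path to a change get a primed copy (9', 10', R'), everything else is shared with V_{i-1}, and an octant already copied into V_i (R' in the update example) is modified in place rather than copied again. The sketch below reproduces that behavior with a per-octant version stamp. The `ver` field, encoding a path as child indices, the `allocations` counter, and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical multi-version octant; `ver` is the version that owns it. */
typedef struct node {
    int ver;
    double value;
    struct node *child[8];
} node;

static int allocations = 0;   /* counts octants created, to expose sharing */

static node *node_new(int ver, double value) {
    node *n = calloc(1, sizeof *n);
    if (!n) abort();
    n->ver = ver;
    n->value = value;
    allocations++;
    return n;
}

/* Copy-on-write: an octant owned by an older version is copied, with its
   child pointers shared; one already owned by `ver` is reused in place. */
static node *cow(node *n, int ver) {
    if (n->ver == ver) return n;
    node *c = node_new(ver, n->value);
    memcpy(c->child, n->child, sizeof c->child);
    return c;
}

/* Make root -> path[0] -> ... -> path[depth-1] writable in `ver` and return
   the last octant on that path. *root becomes the root of version `ver`. */
static node *cow_path(node **root, int ver, const int *path, int depth) {
    node *cur = *root = cow(*root, ver);
    for (int k = 0; k < depth; k++) {
        node *next = cow(cur->child[path[k]], ver);
        cur->child[path[k]] = next;
        cur = next;
    }
    return cur;
}

/* Update an existing octant's value in version `ver`. */
static void pm_update(node **root, int ver, const int *path, int depth,
                      double v) {
    cow_path(root, ver, path, depth)->value = v;
}

/* Insert a new octant in `slot` under the octant reached by `path`. */
static node *pm_insert(node **root, int ver, const int *path, int depth,
                       int slot, double v) {
    node *parent = cow_path(root, ver, path, depth);
    return parent->child[slot] = node_new(ver, v);
}
```

For example, with a version-0 root holding children A and B, updating B in version 1 allocates two octants (R' and B') and leaves A shared; a later update of A in version 1 allocates only A', because R' is already owned by version 1.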
PM-Octree Design: Orthogonal Persistence
Routine                                   Description
pmoctree *pm_create(octree *tree)         Create a new PM-octree; return a pointer to V_i.
void pm_persistent(pmoctree *tree)        Create a persistent version of the octree.
pmoctree *pm_restore(void)                Restore a PM-octree; return a pointer to V_i.
void pm_delete(pmoctree *tree)            Delete all octants in NVBM and DRAM.
We integrated PM-octree with the Gerris flow solver.
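To illustrate what orthogonal persistence means for application code, the sketch below gives toy bodies for the four routines, using the signatures from the table. Everything else is hypothetical: the contents of `octree` and `pmoctree`, the static `nvbm_image` standing in for NVBM, and the `run_steps` driver. In particular, this toy `pm_persistent` copies the whole structure, whereas per the design slides the real PM-octree creates a new version that shares unchanged octants.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; only the four signatures come from the slide. */
typedef struct octree { int step; double field; } octree;
typedef struct pmoctree { octree working; } pmoctree;   /* volatile V_i */

static octree nvbm_image;      /* emulated persistent version in "NVBM" */
static int nvbm_valid = 0;

pmoctree *pm_create(octree *tree) {
    pmoctree *t = malloc(sizeof *t);
    if (!t) return NULL;
    t->working = *tree;
    return t;
}

void pm_persistent(pmoctree *tree) {    /* make the current V_i durable */
    nvbm_image = tree->working;
    nvbm_valid = 1;
}

pmoctree *pm_restore(void) {            /* rebuild V_i after a failure */
    if (!nvbm_valid) return NULL;
    return pm_create(&nvbm_image);
}

void pm_delete(pmoctree *tree) {        /* drop the DRAM and NVBM parts */
    free(tree);
    nvbm_valid = 0;
}

/* Application view: the solver adds one pm_persistent call per time step
   and performs no explicit file I/O for checkpointing. */
static void run_steps(pmoctree *t, int steps) {
    for (int s = 0; s < steps; s++) {
        t->working.step++;
        t->working.field += 0.5;        /* stand-in for a solver update */
        pm_persistent(t);
    }
}
```

After a crash, the application calls `pm_restore()` and resumes from the last persistent version instead of reading a snapshot back from the file system.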
Experimental Setting
- Hardware
  - Titan at ORNL
  - Emulation of NVBM using DRAM
- Simulation
  - Droplet rotation and ejection

                      DRAM   NVBM
Read latency (ns)     60     100
Write latency (ns)    60     150
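A simple way to reason about these latencies is an access-cost model that charges each read and write the per-access latency in the table. The sketch below assumes that model; it ignores caches, bandwidth, and parallelism and is not the authors' emulation method. `mem_model`, `charge_read`, `charge_write`, and `modeled_ns` are hypothetical names.

```c
#include <assert.h>

/* Per-access latencies from the table above, in nanoseconds. */
enum { DRAM_READ_NS = 60, DRAM_WRITE_NS = 60,
       NVBM_READ_NS = 100, NVBM_WRITE_NS = 150 };

typedef struct {
    unsigned long long reads, writes;   /* access counts */
    unsigned read_ns, write_ns;         /* latency charged per access */
} mem_model;

static void charge_read(mem_model *m)  { m->reads++; }
static void charge_write(mem_model *m) { m->writes++; }

/* Modeled time spent in memory accesses, in nanoseconds. */
static unsigned long long modeled_ns(const mem_model *m) {
    return m->reads * m->read_ns + m->writes * m->write_ns;
}
```

Under this model, 1,000 writes cost 60,000 ns on DRAM and 150,000 ns on NVBM, a ratio of 150 / 60 = 2.5, which is the write-latency gap that motivates keeping write-heavy meshing work in DRAM.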
Comparison of Meshing Methods
Method name          Objects in DRAM   Objects in NVBM   Interface
In-core-octree       Octants           Snapshot          File system
Out-of-core-octree   Cache             Octant records    File system
PM-octree            Octants           Octants           Memory
Weak Scaling
- 1.2M to 1077M elements
- 1 to 1000 PEs
- Number of elements on each PE: ~1 million
The execution time of PM-octree grows logarithmically with problem size.
Execution Time Breakdown with Weak Scaling
Tree partitioning overhead prevents PM-octree from achieving optimal speedup.
Strong Scaling
- Problem size: 150 million elements
- 240 to 1000 PEs
The scalability of PM-octree is similar to that of in-core-octree.
Execution Time Breakdown with Strong Scaling
There is no scalability issue, as no major fluctuation is observed in the breakdown.
Failure Recovery
PM-octree guarantees data consistency after failures and reduces failure recovery time by up to 20X.
Conclusions
- PM-octree effectively extends memory capacity using NVBM.
- It scales as well as in-core algorithms.
- It significantly reduces recovery time.
- It provides an easy-to-program interface.
Acknowledgments
Xuechen Zhang, xuechen.zhang@wsu.edu
Bao Nguyen, Hua Tan
Basic Operation: Octant Merging
[Figure: before and after merging C0; versions V_{i-1} and V_i, each with a C1 subtree in NVBM and a C0 subtree in DRAM]
Basic Operation: Persistent
[Figure: roots R and R' with versions V_{i-1} and V_i before the persistent operation; versions V_i and V_{i+1} after it]
Dynamic Layout Transformation
Execution time is reduced by 25%, while the number of writes is reduced by up to 30%.
Impact of DRAM Size
Different DRAM sizes change both the merging frequency and the execution time.