SLIDE 1

TSUBAME2.0, 2.5 towards 3.0 for Convergence of Extreme Computing and Big Data

Satoshi Matsuoka Professor Global Scientific Information and Computing (GSIC) Center Tokyo Institute of Technology Fellow, Association for Computing Machinery (ACM)

HP-CAST SC2014 Presentation, New Orleans, USA, 2014/11/14

SLIDE 2


[System scaling figure (32nm/40nm process): >400GB/s mem BW, 80Gbps NW BW, ~1KW max at the node level; >1.6TB/s and >12TB/s mem BW, 35KW max at intermediate levels; >600TB/s mem BW, 220Tbps NW bisection BW, 1.4MW max for the full system.]

TSUBAME2.0 Nov. 1, 2010 “The Greenest Production Supercomputer in the World”

  • GPU-centric (> 4000) high performance & low power
  • Small footprint (~200m2 or 2000 sq.ft), low TCO
  • High bandwidth memory, optical network, SSD storage…

TSUBAME 2.0 New Development

SLIDE 3

TSUBAME2.0⇒2.5 Thin Node Upgrade (Fall 2013)

HP SL390G7 (developed for TSUBAME2.0, modified for 2.5)

GPU: NVIDIA Kepler K20X x3, 1310 GFlops / 6 GByte mem (per GPU)

CPU: Intel Westmere-EP 2.93GHz x2; multi I/O chips, 72 PCI-e lanes (16 x 4 + 4 x 2) --- 3 GPUs + 2 IB QDR. Memory: 54 or 96 GB DDR3-1333. SSD: 60GB x2 or 120GB x2

Thin node: Infiniband QDR x2 (80Gbps). Productized as HP ProLiant SL390s, modified for TSUBAME2.5.

Per-node peak perf. 4.08 TFlops, ~800GB/s mem BW, 80Gbps NW, ~1KW max. GPU upgrade: NVIDIA Fermi M2050 (1039/515 GFlops SFP/DFP) → NVIDIA Kepler K20X (3950/1310 GFlops SFP/DFP).

SLIDE 4

Phase-field simulation for dendritic solidification [Shimokawabe, Aoki et al.], Gordon Bell 2011 Winner

  • Peta-scale phase-field simulations can simulate multiple dendritic growth during solidification, required for the evaluation of new materials.

  • 2011 ACM Gordon Bell Prize Special Achievements in Scalability and Time-to-Solution

Weak scaling on TSUBAME (single precision); mesh size (1 GPU + 4 CPU cores): 4096 x 162 x 130

TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), 4,096 x 6,480 x 13,000

TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), 4,096 x 5,022 x 16,640

Developing lightweight strengthening materials by controlling microstructure → low-carbon society

SLIDE 5

Application performance, TSUBAME2.0 → TSUBAME2.5 (boost ratio):

  • Top500/Linpack, 4131 GPUs (PFlops): 1.192 → 2.843 (2.39x)
  • Green500/Linpack, 4131 GPUs (GFlops/W): 0.958 → 3.068 (3.20x)
  • Semi-definite programming / nonlinear optimization, 4080 GPUs (PFlops): 1.019 → 1.713 (1.68x)
  • Gordon Bell dendrite stencil, 3968 GPUs (PFlops): 2.000 → 3.444 (1.72x)
  • LBM LES whole-city airflow, 3968 GPUs (PFlops): 0.592 → 1.142 (1.93x)
  • Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day): 3.44 → 11.39 (3.31x)
  • GHOSTM genome homology search, 1 GPU (sec): 19361 → 10785 (1.80x)
  • MEGADOC protein docking, 1 node / 3 GPUs (vs. 1 CPU core): 37.11 → 83.49 (2.25x)

SLIDE 6

TSUBAME2.0=>2.5 Power Improvement

2013/11 Green500: #6 in the world

  • Along with TSUBAME-KFC (#1)
  • 2014/6: #9

18% power reduction (incl. cooling) from 2012/12 to 2013/12
SLIDE 7

Comparing K Computer to TSUBAME2.5: Perf ≒, Cost <<

K Computer (2011): 11.4 Petaflops SFP/DFP, $1400mil / 6 years (incl. power), ~x30 the cost

TSUBAME2.0 (2010) → TSUBAME2.5 (2013): 17.1 Petaflops SFP, 5.76 Petaflops DFP, $45mil / 6 years (incl. power)

SLIDE 8

TSUBAME2 vs. K Technological Comparisons

(TSUBAME2 Deploying State-of-Art Tech.)

TSUBAME2.5 / BG/Q Sequoia / K Computer:

  • Single precision FP: 17.1 Petaflops / 20.1 Petaflops / 11.3 Petaflops
  • Green500 (MFLOPS/W), Nov. 2013: 3,068.71 (6th) / 2,176.58 (26th) / 830.18 (123rd)
  • Operational power (incl. cooling): ~0.8MW / 5~6MW? / 10~11MW
  • Hardware architecture: many-core (GPU) + multi-core hetero / multi-core homo / multi-core homo
  • Maximum HW threads: >1 billion / ~6 million / ~700,000
  • Memory technology: GDDR5+DDR3 / DDR3 / DDR3
  • Network technology: Luxtera silicon photonics / standard optics / copper
  • Non-volatile memory / SSD: SSD flash on all nodes, ~250TBytes / none / none
  • Power management: node/system active power cap / rack-level measurement only / rack-level measurement only
  • Virtualization: KVM (G & V queues, resource segregation) / none / none

SLIDE 9

TSUBAME3.0: Leadership “Template” Machine

  • Under design: deployment 2016H2~H3
  • High computational power: ~20 Petaflops, ~5 Petabyte/s Mem BW
  • Ultra high density: ~0.6 Petaflops DFP/rack (x10 TSUBAME2.0)
  • Ultra power efficient: 10 Gigaflops/W (x10 TSUBAME2.0, TSUBAME-KFC)

– Latest power control, efficient liquid cooling, energy recovery

  • Ultra high-bandwidth network: over 1 Petabit/s bisection, new topology?

– Bigger capacity than the entire global Internet (several 100Tbps)

  • Deep memory hierarchy and ultra high-bandwidth I/O with NVM

– Petabytes of NVM, several Terabytes/s BW, several 100 million IOPS – Next generation “scientific big data” support

  • Advanced power-aware resource management, high-resiliency SW/HW co-design, VM & container-based dynamic deployment…

SLIDE 10

Focused Research Towards TSUBAME3.0 and Beyond, towards Exa

  • Software and algorithms for the new memory hierarchy – pushing the envelope of low power vs. capacity; Communication and Synchronization Reducing Algorithms (CSRA)

  • Post-petascale networks – topology, routing algorithms, placement algorithms… (SC14 paper, Tue 14:00-14:30, “Fail in Place Network…”)

  • Green computing: power-aware APIs, fine-grained resource scheduling
  • Scientific “Extreme” Big Data – GPU Hadoop acceleration, large graphs, search/sort, deep learning
  • Fault tolerance – group-based hierarchical checkpointing, fault prediction, hybrid algorithms
  • Post-petascale programming – OpenACC extensions and other many-core programming substrates
  • Performance analysis and modeling – for CSRA algorithms, for big data, for the deep memory hierarchy, for fault tolerance, …

SLIDE 11

TSUBAME-KFC

Towards TSUBAME3.0 and Beyond: oil-immersive cooling, #1 Green500 at SC13, ISC14, … (paper @ ICPADS14)

SLIDE 12

Extreme Big Data Examples

Rates and volumes are immense

Social NW – large graph processing

  • Facebook
    – ~1 billion users
    – Average 130 friends
    – 30 billion pieces of content shared per month
  • Twitter
    – 500 million active users
    – 340 million tweets per day
  • Internet
    – 300 million new websites per year
    – 48 hours of video uploaded to YouTube per minute
    – 30,000 YouTube videos played per second

Genomics – advanced sequence matching: sequencing data (bp)/$ becomes x4000 per 5 years, c.f. HPC x33 in 5 years [Lincoln Stein, Genome Biology, vol. 11(5), 2010]; impact of new-generation sequencers.

Social simulation:

  • Applications
    – Target area: Planet (Open Street Map)
    – 7 billion people
  • Input data
    – Road network for Planet: 300GB (XML)
    – Trip data for 7 billion people: 10KB (1 trip) x 7 billion = 70TB
    – Real-time streaming data (e.g., social sensors, physical data)
  • Simulated output for 1 iteration
    – 700TB

Weather – real-time large data assimilation:

  ① 30-sec ensemble forecast simulations: 2 PFLOP, producing ensemble forecasts (200GB)
  ② Ensemble data assimilation: 2 PFLOP; observation inputs: Himawari 500MB/2.5min and phased array radar 1GB/30sec/2 radars, each through quality control and data processing; ensemble analyses (200GB) and analysis data (2GB)
  ③ 30-min forecast simulation: 1.2 PFLOP, producing a 30-min forecast (2GB)

Repeat every 30 sec.

NOT simply mining terabytes of silo data: peta~zettabytes of data, ultra high-BW data streams, highly unstructured and irregular data, and complex correlations between data from multiple sources. Extreme capacity, bandwidth, and compute are all required.

SLIDE 13

Graph500 “Big Data” Benchmark

Kronecker graph, BSP problem; quadrant probabilities A: 0.57, B: 0.19, C: 0.19, D: 0.05
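As an illustration of how such graphs are generated, here is a minimal Python sketch of R-MAT/Kronecker edge sampling with the quadrant probabilities above (the actual Graph500 reference generator is more elaborate; this is only a sketch):

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05):
    """Sample one edge of a 2^scale-vertex Kronecker (R-MAT) graph:
    at each of `scale` levels, pick one quadrant of the adjacency
    matrix with probabilities a, b, c, d and descend into it."""
    src = dst = 0
    for _ in range(scale):
        r = random.random()
        src <<= 1
        dst <<= 1
        if r < a:            # top-left quadrant: both bits 0
            pass
        elif r < a + b:      # top-right: destination bit 1
            dst |= 1
        elif r < a + b + c:  # bottom-left: source bit 1
            src |= 1
        else:                # bottom-right: both bits 1
            src |= 1
            dst |= 1
    return src, dst

# Graph500 uses edgefactor 16: 16 * 2^SCALE edges in total
edges = [rmat_edge(scale=10) for _ in range(16 * 2**10)]
```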

November 15, 2010, “Graph 500 Takes Aim at a New Kind of HPC”, Richard Murphy (Sandia NL => Micron): “I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of the list.”

The 8th Graph500 List (June2014): K Computer #1, TSUBAME2 #12

Koji Ueno, Tokyo Institute of Technology / RIKEN AICS

No.1: RIKEN Advanced Institute for Computational Science (AICS)’s K computer is ranked No.1 in the Graph500 ranking of supercomputers with 17977.1 GE/s on Scale 40, in the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014. Congratulations from the Graph500 Executive Committee.

No.12: Global Scientific Information and Computing Center, Tokyo Institute of Technology’s TSUBAME 2.5 is ranked No.12 in the Graph500 ranking of supercomputers with 1280.43 GE/s on Scale 36, in the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014. Congratulations from the Graph500 Executive Committee.

#1 K Computer #12 TSUBAME2

Reality: Top500 supercomputers dominate; no cloud IDCs at all

SLIDE 14

A Major Northern Japanese Cloud Datacenter (2013)

[Datacenter network figure: the Internet ← 2x Juniper MX480 (10GbE uplinks, LACP) ← 2 zone switches (Juniper EX8208, Virtual Chassis) ← Juniper EX4200 switches, one per zone of 700 nodes.]

8 zones, total 5600 nodes; injection 1Gbps/node, bisection 160Gbps

Advanced silicon photonics: 40G single CMOS die, 1490nm DFB, 100km fiber

Supercomputer: Tokyo Tech. TSUBAME2.0, #4 in the Top500 (2010): ~1500 nodes (compute & storage), full-bisection multi-rail optical network; injection 80Gbps/node, bisection 220Tbps.

The supercomputer’s bisection is >> the cloud datacenter’s (x1000!) and ~= the entire global Internet average data BW (~200 Tbps, source: Cisco).

SLIDE 15

JST-CREST “Extreme Big Data” Project (2013-2018)

Supercomputers: compute & batch-oriented, more fragile. Cloud IDCs: very low BW & efficiency, but highly available and resilient. → Convergent architecture (phases 1~4): large-capacity NVM, high-bisection NW.

[Convergent node figure: PCB with TSV interposer; a high-powered main CPU plus low-power CPUs, each stacked with DRAM x3 and NVM/Flash x3; 2Tbps HBM, 4~6 HBM channels, 1.5TB/s DRAM & NVM BW, 30PB/s I/O BW possible, 1 Yottabyte / year.]

EBD System Software (incl. EBD Object System)

Co-design applications: large-scale metagenomics; massive sensors and data assimilation in weather prediction; ultra-large-scale graphs and social infrastructures

Exascale big data HPC co-design → future non-silo extreme big data scientific apps

Co-designed EBD data stores: Graph Store, EBD Bag, KVS, EBD KVS Cartesian Plane

Given a top-class supercomputer, how fast can we accelerate next-generation big data c.f. clouds? What are the issues regarding architectural, algorithmic, and system software evolution? Use of GPUs?

SLIDE 16

Towards Extreme-scale Big Data Machine Convergence

  • Computation

– Increase in Parallelism, Heterogeneity, Density, BW

  • Multi-core, many-core processors
  • Heterogeneous processors
  • Deep hierarchical memory/storage architecture
    – NVM (Non-Volatile Memory), SCM (Storage Class Memory): FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.
    – Next-gen HDDs (SMR), tapes (LTFS), cloud

Problems: network, locality, productivity, FT, algorithms, power, storage hierarchy, I/O, heterogeneity, scalability

SLIDE 17

EBD “convergent” system architecture: a 100,000-fold target, spanning cloud datacenters and supercomputers. Layered stack (bottom to top):

  • Big Data & SC HW layer: TSUBAME 3.0, TSUBAME-GoldenBox; NVM (FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.); HPC storage; web object storage; interconnect (InfiniBand, 100GbE); network (SINET5); intercloud / grid (HPCI); cloud datacenter
  • System SW layer: EBD file system, EBD data object, EBD burst I/O buffer, EBD network topology and routing
  • Basic algorithms layer: EBD abstract data models (distributed array, key value, sparse data model, tree, etc.); EBD algorithm kernels (search/sort, matching, graph traversals, etc.); Graph Store, EBD Bag, EBD KVS, Cartesian Plane
  • Programming layer: MapReduce for EBD; workflow/scripting languages for EBD; SQL for EBD; graph framework; message passing (MPI, X10) for EBD; PGAS/global array for EBD
  • Applications: large-scale genomic correlation; data assimilation in large-scale sensors and exascale atmospherics; large-scale graphs and social infrastructure apps

Participating groups: Akiyama, Suzumura, Miyoshi, Matsuoka, Tatebe, Koibuchi.

SLIDE 18

The Graph500 – June 2014: K Computer #1. Tokyo Tech [EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], RIKEN AICS

  • November 2013 list: rank 4, 5524.12 GTEPS, top-down only implementation
  • June 2014 list: rank 1, 17977.05 GTEPS, efficient hybrid implementation

[Chart: elapsed time (ms) split into communication and computation, for 64 nodes (Scale 30, 1236 MTEPS/node) and 65536 nodes (Scale 40, 274 MTEPS/node); problem size is weak scaling. At 65536 nodes, 73% of total execution time is spent waiting in communication.]

SLIDE 19

Out-of-core GPU-MapReduce for Large-scale Graph Processing [Cluster 2014]

[Pipeline figure: per-chunk processing; the CPU streams chunks via memcpy (H2D, D2H) while the GPU runs Map, Sort, Shuffle, Scan, and Reduce operations on each chunk after initialization.]

Problem: GPU memory capacity limits scalable large-scale graph processing. Emergence of large-scale graphs:

  • SNS, road networks, smart grids, etc.
  • Millions to trillions of vertices/edges

→ Need for fast graph processing on supercomputers

[Chart: weak scaling on TSUBAME2.5, performance (MEdges/sec) vs. number of compute nodes, for 1 CPU (S23/node), 1 GPU (S23/node), 2 CPUs (S24/node), 2 GPUs (S24/node), and 3 GPUs (S24/node); 3 GPUs reach 2.10x over 2 CPUs.]

Experimental results: performance improvement over CPUs

  • Map: 1.41x, Reduce: 1.49x, Sort: 4.95x speedup
  • Overlaps communication effectively

Proposal: out-of-core GPU memory management on MapReduce (see the sketch below)

  • Stream-based GPU MapReduce
  • Out-of-core GPU sorting
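A minimal host-side sketch of the stream-based, chunked processing idea (illustrative only; the paper’s implementation runs map/sort on the GPU via CUDA, which this Python stand-in does not):

```python
def stream_mapreduce(records, map_fn, reduce_fn, chunk=1 << 16):
    """Out-of-core MapReduce sketch: process records in chunks sized
    to fit GPU memory; in the real system each chunk is copied H2D,
    mapped and sorted on the GPU, then copied back D2H."""
    groups = {}
    for i in range(0, len(records), chunk):
        for rec in records[i:i + chunk]:               # one streamed chunk
            for key, val in map_fn(rec):               # "map" phase
                groups.setdefault(key, []).append(val) # "shuffle"
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Word count as a usage example
text = ["a b a", "b c"]
out = stream_mapreduce(text,
                       lambda line: [(w, 1) for w in line.split()],
                       lambda k, vs: sum(vs))
print(out)  # {'a': 2, 'b': 2, 'c': 1}
```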

EBD Programming Framework

SLIDE 20

GPU-HykSort [IEEE BigData2014]

Motivation

The effectiveness of sorting on large-scale GPU-based heterogeneous systems remains unclear:

  • Appropriate selection of the phases to be offloaded to the GPU is required
  • Handling GPU memory overflow is required

Approach

Offload local sort, the most time-consuming phase, to GPU accelerators

GPU-HykSort achieves a 2.2x performance improvement with a 50GB/s CPU-GPU interconnect.

[Charts: weak-scaling performance, keys/second (billions) vs. # of processes (2 processes per node), for HykSort 1 thread, HykSort 6 threads, and HykSort GPU + 6 threads; performance prediction up to ~1024 nodes / ~2048 GPUs (1.4x, 3.6x, 0.25 TB/s data points).]

Implementation: split the unsorted array into chunks; for each iteration (Iter 1..4), transfer an unsorted chunk to GPU memory, sort the chunk on the GPU, and transfer the sorted chunk back to DRAM; finally merge the sorted chunks into a single sorted array. Across processes 0..3: local sort, select splitters, data transfer, merge.
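To make the chunked local sort concrete, a small sketch of the scheme just described (np.sort stands in for the GPU sort kernel; this is not the GPU-HykSort code):

```python
import numpy as np
from heapq import merge

def out_of_core_local_sort(arr, chunk=1 << 20):
    """Split an array larger than GPU memory into chunks, sort each
    chunk "on the GPU" (np.sort as a stand-in for a CUDA sort after
    an H2D transfer), then k-way merge the sorted runs on the host."""
    runs = [np.sort(np.asarray(arr[i:i + chunk]))
            for i in range(0, len(arr), chunk)]
    return np.fromiter(merge(*runs), dtype=runs[0].dtype, count=len(arr))

data = np.random.randint(0, 10**9, size=5_000_000)
assert np.array_equal(out_of_core_local_sort(data), np.sort(data))
```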

EBD Algorithm Kernels

SLIDE 21

Efficient Parallel Sorting Algorithm for Variable-Length Keys

Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. Efficient String Sorting on Multi- and Many-Core Architectures. In Proceedings of the IEEE 3rd International Congress on Big Data, Anchorage, USA, August 2014. Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. MSD Radix String Sort on GPU: Longer Keys, Shorter Alphabets. In Proceedings of the 142nd Joint Research Meeting on High Performance Computing (HOKKE-21).

  • Comparison-based sorts are inefficient for long/variable-length keys (like strings, e.g. “apple, apricot, banana, kiwi”)
  • Better way: examine individual characters (based on the MSD radix sort algorithm, sketched below)
  • Hybrid parallelization scheme: combining data-parallel and task-parallel stages
  • 70 M keys/second sorting throughput on 100-byte strings
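A compact sketch of the MSD radix idea for strings (the hybrid data/task parallelism and the GPU mapping from the papers are omitted):

```python
def msd_radix_sort(strings, depth=0):
    """Bucket strings by the character at position `depth`, then
    recurse into each bucket; keys are examined character by
    character instead of by whole-key comparisons."""
    if len(strings) <= 1:
        return list(strings)
    exhausted, buckets = [], {}
    for s in strings:
        if len(s) <= depth:
            exhausted.append(s)        # shorter keys sort first
        else:
            buckets.setdefault(s[depth], []).append(s)
    out = exhausted
    for ch in sorted(buckets):         # independent buckets: task parallel
        out += msd_radix_sort(buckets[ch], depth + 1)
    return out

print(msd_radix_sort(["banana", "apple", "apricot", "kiwi"]))
# ['apple', 'apricot', 'banana', 'kiwi']
```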

EBD Algorithm Kernels

SLIDE 22

Large Scale Graph Processing Using NVM

  • 1. Hybrid-BFS (Beamer ’11)
  • 2. Proposal: DRAM holds the highly accessed graph data (loaded before BFS); NVM holds the full-size graph
  • 3. Experiment: Intel Xeon E5-2690 x2 CPUs, 256 GB DRAM, NVM: EBD-I/O 2TB x2 (mSATA SSD x8 on a RAID card, RAID 0)

[Chart: median GigaTEPS (giga traversed edges per second) vs. SCALE (# of vertices = 2^SCALE, 23 to 31), DRAM + EBD-I/O vs. DRAM only; DRAM only reaches 4.1 GTEPS before hitting its capacity limit, while DRAM + EBD-I/O sustains 3.8 GTEPS beyond it.]

Result: 4 times larger graphs with 6.9% performance degradation.

[BigData2014]

Hybrid-BFS switches between the top-down and bottom-up approaches, based on the # of frontiers n_frontier, the # of all vertices n_all, and parameters α, β (see the sketch below).
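A minimal sketch of the direction-optimizing heuristic (after Beamer ’11); the α/β thresholds are the parameters named above, and the exact switching formula here is one common formulation, not necessarily the paper’s:

```python
def hybrid_bfs(adj, root, alpha=15.0, beta=18.0):
    """Per level, choose top-down or bottom-up traversal: switch to
    bottom-up when the frontier's edges exceed the unexplored edges
    divided by alpha; switch back when the frontier shrinks below n/beta."""
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier, bottom_up = {root}, False
    while frontier:
        m_frontier = sum(len(adj[v]) for v in frontier)
        m_unexplored = sum(len(adj[u]) for u in range(n) if parent[u] == -1)
        if not bottom_up and m_frontier > m_unexplored / alpha:
            bottom_up = True
        elif bottom_up and len(frontier) < n / beta:
            bottom_up = False
        nxt = set()
        if bottom_up:   # unvisited vertices look backwards for a parent
            for u in range(n):
                if parent[u] == -1:
                    for v in adj[u]:
                        if v in frontier:
                            parent[u] = v
                            nxt.add(u)
                            break
        else:           # classic top-down frontier expansion
            for v in frontier:
                for u in adj[v]:
                    if parent[u] == -1:
                        parent[u] = v
                        nxt.add(u)
        frontier = nxt
    return parent
```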

Tokyo Institute of Technology’s GraphCREST-Custom #1 is ranked No.3 in the Big Data category of the Green Graph 500 ranking of supercomputers with 35.21 MTEPS/W on Scale 31, in the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair.

Ranked 3rd in Green Graph500 (June 2014)

EBD Algorithm Kernels

SLIDE 23

Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era

JST-CREST project, 2012-2018, PI Toshio Endo

Comm/BW-reducing algorithms + system software for memory hierarchy management + HPC architecture with hybrid memory devices (HMC, HBM at O(GB/s), Flash, next-gen NVM).

Target: realizing extremely fast & big simulations of {O(100PF/s) or O(10PB/s)} & O(10PB) around 2018

SLIDE 24

Supporting Larger domains than GPU device memory for Stencil Simulations

Caution: simply “swapping out” to larger host memory is disastrously slow, as the PCIe traffic is too large! The keys are “communication avoiding & locality improvement” algorithms.

[TSUBAME2.5 node figure: on the GPU card, the GPU cores and a 1.5MB L2$ reach 6GB of GPU memory at 250GB/s; PCIe at 8GB/s links the card to 54GB of host memory and the CPU cores.]

SLIDE 25

Temporal Blocking (TB) for Comm. Avoiding

  • Performs multiple updates on a small block before proceeding to the next block
    – Originally proposed to improve cache locality [Kowarschik 04] [Datta 08]
  • s-step updates at once [figure: steps 1-4 over simulated time]

Redundant computation is introduced due to the data dependency with neighbors, by using a “larger halo” (steps 1-4). The redundancy can be removed when blocks are computed sequentially [Demmel 12].

Multi-level TB to reduce both (see the sketch below):

  • PCIe traffic
  • device memory traffic
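A 1-D toy version of temporal blocking, assuming a 3-point averaging stencil with fixed endpoints (the real codes block in 3-D and also across PCIe transfers):

```python
import numpy as np

def stencil_tb(u, block=256, s=4):
    """Apply s time steps of a 3-point average, one block at a time.
    Each block is read once with a halo of width s (the "larger halo"),
    updated s times locally, then its interior is written back."""
    n = len(u)
    out = u.copy()
    for start in range(0, n, block):
        lo, hi = max(0, start - s), min(n, start + block + s)
        w = u[lo:hi].copy()                # block + halo, loaded once
        for _ in range(s):                 # s updates without re-reading u
            w[1:-1] = (w[:-2] + w[1:-1] + w[2:]) / 3.0
        end = min(start + block, n)
        out[start:end] = w[start - lo:start - lo + (end - start)]
    return out

# Reference: s global sweeps, touching all of u from slow memory each step
def stencil_naive(u, s=4):
    v = u.copy()
    for _ in range(s):
        v[1:-1] = (v[:-2] + v[1:-1] + v[2:]) / 3.0
    return v

u = np.random.rand(2048)
assert np.allclose(stencil_tb(u), stencil_naive(u))
```

The redundant work is the halo updates; the payoff is that each block is moved across the slow link (PCIe or device memory) once per s steps instead of once per step.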
SLIDE 26

[Chart: single-GPU performance, speed (GFlops) vs. size of each dimension (up to ~2000), for Common, Naïve, Basic-TB, and Opt-TB versions.]

  • With optimized TB, a 10x larger domain size (52GB vs. 5.3GB) is successfully used with little overhead!!! A step towards extremely fast & big simulations (3D 7-point stencil on a K20X GPU with 6GB GPU memory).

SLIDE 27

Problem: Programming Cost

  • Communication-reducing algorithms efficiently support larger domains
  • Programming cost is the issue
    – Complex loop structure, complex border handling
  • Reduce programming cost by using system software that supports the memory hierarchy
    – HHRT (Hybrid Hierarchical Runtime)
    – Physis DSL, by Maruyama, RIKEN

SLIDE 28

Memory Hierarchy Management with Runtime Libraries

  • HHRT supports memory swapping between GPU and host memory at the granularity of processes
  • Similar to NVIDIA UVM, but works well with communication-reducing algorithms
  • HHRT (Hybrid Hierarchical Runtime) targets GPU supercomputers and MPI+CUDA user applications
  • HHRT provides MPI- and CUDA-compatible APIs
  • # of MPI processes > # of GPUs: several processes share a GPU (toy model below)
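A toy model of the swapping policy (the real HHRT interposes on MPI/CUDA calls and moves actual device buffers; everything here, including names, is illustrative):

```python
class HHRTSketch:
    """Oversubscribed GPU: more MPI processes than device memory can
    hold.  When a process blocks (e.g., in an MPI wait), its device
    data may be evicted to host memory so a runnable process fits."""
    def __init__(self, gpu_capacity_mb):
        self.cap = gpu_capacity_mb
        self.resident = {}                 # pid -> MB on device
        self.blocked = set()               # pids waiting in MPI

    def request(self, pid, mb):
        """Called when `pid` needs `mb` MB resident to run a kernel."""
        while sum(self.resident.values()) + mb > self.cap:
            victims = [p for p in self.resident if p in self.blocked]
            if not victims:
                raise MemoryError("no blocked process to evict")
            self.swap_out(victims[0])
        self.resident[pid] = self.resident.get(pid, 0) + mb

    def swap_out(self, pid):               # device-to-host copy in HHRT
        print(f"evict pid {pid}: {self.resident.pop(pid)} MB -> host")

rt = HHRTSketch(gpu_capacity_mb=6000)      # a K20X-sized device
rt.request(0, 4000)
rt.blocked.add(0)                          # pid 0 enters an MPI wait
rt.request(1, 4000)                        # forces pid 0 to swap out
```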

SLIDE 29

HHRT Comm. Reducing Results

Larger & faster: 3D 7-point stencil on a single K20X GPU. [Chart: speed (GFlops) vs. problem size (GB) for Hand-TB, NoTB, and HHRT-TB.] Efficient execution beyond GPU memory with moderate programming cost.

Weak scalability on TSUBAME2.5: Small = 3.4GB per GPU, Large = “16GB” per GPU (>6GB!). [Chart: speed (TFlops) vs. the number of GPUs, Small vs. Large.] 14 TFlops with a 3TB problem.

SLIDE 30

Where do we go from here? TSUBAME-KFC and TSUBAME EBD (green and extreme big data) → TSUBAME3.0 (2016) → TSUBAME4.0 (2021~, post-CMOS Moore?)

SLIDE 31

TSUBAME4 (2021~2022): K-in-a-Box (Golden Box), a BD/EC convergent architecture

Versus K: 1/500 size, 1/150 power, 1/500 cost, x5 DRAM+NVM memory. 10 Petaflops, 10 Petabyte hierarchical memory (K: 1.5PB), 10K nodes, 50GB/s interconnect (200-300Tbps bisection BW). (Conceptually similar to HP “The Machine”.)

Datacenter in a box: large datacenters will become “Jurassic”

SLIDE 32

TSUBAME4 (2020-): DRAM+NVM+CPU with 3D/2.5D die stacking

  • The ultimate convergence of BD and EC

[Node figure: PCB and TSV interposer with an optical SW & launch pad; low-power CPUs, each stacked with DRAM x3 and NVM/Flash x3; 2Tbps HBM, 4~6 HBM channels, 2TB/s DRAM & NVM BW; direct chip-chip interconnect with DWDM optics.]

SLIDE 33

GoldenBox “Proto1” (NVIDIA K1-based) at Tokyo Tech. SC14 Booth #1857 (also Wed. morning plenary talk)

  • 36-node Tegra K1, ~11TFlops SFP
  • ~700GB/s BW
  • 100-700 Watts
  • Integrated mSATA SSD, ~7GB/s I/O
  • Ultra dense, oil-immersive cooling
  • Same SW stack as TSUBAME2

2022: x10 Flops, x10 Mem Bandwidth, silicon photonics, x10 NVM, x10 node density, with new device and packaging technologies

SLIDE 34

OUR GOALS

Network Performance Visualization [EuroMPI/Asia 2014 Poster]: MPI abstractions spanning the application layer, MPI layer, and hardware layer, with a performance-analysis process view.

  ① Portably expose MPI’s internal performance
  ② Non-intrusively profile low-level metrics
  ③ Flexible hardware-centric performance analysis

[Chart: MPI_Alltoall comm. latency (%) vs. message size (1 to 32,768 bytes); avg. 2.06%.]

Overhead of our profiler (named ibprof) on the NAS Parallel FT benchmark: runtime overhead is less than 0.02% (12.1919s -> 12.1935s).

[Figure: network visualization of TSUBAME 2.5 running the Graph500 benchmark on 512 nodes.]
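The real ibprof interposes below MPI, at the low-level InfiniBand metrics; purely to illustrate the general timing-interposition idea, here is an mpi4py wrapper one could call in place of Alltoall (names are ours, not ibprof’s):

```python
import numpy as np
from mpi4py import MPI

alltoall_times = []  # per-call latencies collected by the "profiler"

def timed_alltoall(comm, sendbuf, recvbuf):
    """Run MPI Alltoall and record its wall-clock latency."""
    t0 = MPI.Wtime()
    comm.Alltoall(sendbuf, recvbuf)
    alltoall_times.append(MPI.Wtime() - t0)

comm = MPI.COMM_WORLD
send = np.arange(comm.size, dtype='i')   # one element per rank
recv = np.empty(comm.size, dtype='i')
timed_alltoall(comm, send, recv)
```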

EBD Interconnect

SLIDE 35

Design of nonblocking B+Tree for NVM-KV [Tatebe Group, Jabri]

 Take advantage of NVM’s new capabilities: atomic writes, huge sparse address space, direct access to the NVM device natively as a KVS
 Enable range-query support for a KVS running natively on NVM, like the FusionIO ioDrive
 Provide optional persistence for the B+Tree structure, and also snapshots

NVM-BPTree is a key-value store (KVS) running natively over non-volatile memory (NVM), like flash, and supporting range queries: an in-memory B+Tree over an OpenNVM-like key-value store interface to the NVM (Fusion-io flash device).

  • Fusion-io SDK 0.4 and ioDrive 160GB SLC
  • Key size: fixed at 40 bytes
  • Value size: ranging from 1 up to 1024 sectors (512B)
  • NVM-BPTree does not impact performance compared with the original KVS
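To illustrate the range-query capability layered on a KVS, a small in-memory sketch (a sorted array stands in for the B+Tree; the NVM mapping, atomic writes, nonblocking concurrency, and persistence are not modeled):

```python
import bisect

class RangeKVS:
    """KVS keeping keys ordered so range queries are possible, the
    capability NVM-BPTree adds on top of a flat flash KVS."""
    def __init__(self):
        self.keys, self.vals = [], []

    def put(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = val                 # overwrite in place
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, val)

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return list(zip(self.keys[i:j], self.vals[i:j]))

kvs = RangeKVS()
for k in ["k03", "k01", "k02", "k10"]:
    kvs.put(k, k.upper())
print(kvs.range("k01", "k04"))
# [('k01', 'K01'), ('k02', 'K02'), ('k03', 'K03')]
```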

EBD NVM System Software

SLIDE 36

LLNL-PRES-654744

Extreme-scale I/O for burst buffers; extreme-scale C/R modeling

EBD I/O and C/R modeling for extreme scale [CCGrid2014 Best Paper]

[Figure: multi-level checkpoint model, a run being a sequence of level-1 and level-2 checkpoints, with the notation: $t$ = checkpoint interval; $c_k$ = level-$k$ checkpoint time; $r_k$ = level-$k$ recovery time; $p_0(T)$, $t_0(T)$ = probability of no failure during a duration $T$ and its expected time; $p_i(T)$, $t_i(T)$ = probability of a level-$i$ failure during $T$ and its expected time. The durations considered are $t + c_k$ (no failure) and $r_k$ (failure).]

Burst buffer hardware: mSATA SSD x8 behind one Adaptec RAID card.

[Chart: read/write throughput (GB/sec) vs. # of processes (2-16): peak, local, IBIO, and NFS, for both read and write.]

IBIO write (four IBIO clients, one IBIO server): the IBIO client on each compute node (1-4) sends chunk buffers to the IBIO server thread on the burst buffer node, which passes them via file descriptors fd1-fd4 to writer threads that write file1-file4 to storage. Application → IBIO client on the compute node; IBIO server + write threads on the burst buffer node.

Model: $L_i = C_i + E_i$; $O_i = C_i + E_i$ (sync.) or $I_i$ (async.)

$C_i$ or $R_i$ = (C/R data size per node $\times$ # of C/R nodes per $S_i$) / (write perf. $w_i$ or read perf. $r_i$)

Storage model: a hierarchy $H_N = \{m_1, m_2, \ldots, m_N\}$, where each level-$i$ storage $S_i$ is shared by $m_i$ units of $H_{i-1}$ ($i = 0$ is the compute node itself, $i > 0$ is shared storage).
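For example, the $C_i$/$R_i$ formula can be evaluated directly; a small sketch with hypothetical numbers (not the paper’s measured values):

```python
def level_time(data_per_node_gb, nodes_per_store, bw_gb_s):
    """C_i or R_i = (C/R data size per node * # of nodes sharing a
    level-i store) / (write or read bandwidth of that store)."""
    return data_per_node_gb * nodes_per_store / bw_gb_s

# Hypothetical: 32 GB/node; level 1 = node-local SSD (1 node/store,
# 1 GB/s write), level 2 = shared PFS (1024 nodes, 50 GB/s aggregate)
c1 = level_time(32, 1, 1.0)       # 32 s   per level-1 checkpoint
c2 = level_time(32, 1024, 50.0)   # ~655 s per level-2 checkpoint
print(c1, c2)
```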

EBD NVM System Software

SLIDE 37

Cloud-based I/O burst buffer architecture, in collaboration talks with Amazon EC2.

[Figure: System 1 and System 2, each with compute nodes and I/O nodes on a LAN, connect over the WAN to I/O bursting buffer nodes (each I/O node holding a buffer queue) in front of Storage 1.]

[Charts: throughput (MB/s), read/write experiment vs. simulation results, with and without I/O nodes; one-client performance vs. # of nodes (1-8), and multi-client performance with # of I/O nodes = # of clients (1-4). Up to 7x improvement.]

Main idea: use several compute nodes in the public cloud as I/O nodes

  • Buffer I/O data in the main memory of the I/O nodes; all I/O nodes maintain an in-memory buffer queue
  • Dynamic burst buffer: the # of I/O nodes can be decided dynamically
  • Takes advantage of the high throughput of the LAN inside the public cloud (see the sketch below)
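A toy model of the buffer-queue idea (illustrative only; the real system runs dedicated I/O node processes, not Python threads):

```python
import queue
import threading

class IONode:
    """Cloud I/O node sketch: absorb writes into an in-memory queue at
    LAN speed and drain them to slow backend storage in the background."""
    def __init__(self, backend_write):
        self.q = queue.Queue()
        t = threading.Thread(target=self._drain, args=(backend_write,),
                             daemon=True)
        t.start()

    def write(self, chunk):          # client-visible fast path
        self.q.put(chunk)

    def _drain(self, backend_write):
        while True:                  # background flush to storage
            backend_write(self.q.get())

node = IONode(backend_write=lambda c: None)  # stand-in for storage/WAN
for block in (b"x" * 4096 for _ in range(1000)):
    node.write(block)                # returns at memory speed
```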

[Chart captions: one-client performance and multi-client performance; 1.7x and 8x improvements, cloud vs. supercomputer.]