Deep Learning Acceleration of Progress Toward Delivery of Fusion Energy - PowerPoint PPT Presentation



SLIDE 1

Deep Learning Acceleration of Progress

Toward Delivery of Fusion Energy

William M. Tang Princeton University/Princeton Plasma Physics Laboratory (PPPL) GPU TECHNOLOGY CONFERENCE -- GTC-2017 San Jose, California May 10, 2017

Co-authors: Julian Kates-Harbeck, Alexey Svyatkovskiy, Kyle Felker, Eliot Feibush, Michael Churchill

SLIDE 2

CNN's "MOONSHOTS for 21st CENTURY" (hosted by Fareed Zakaria) – five segments, broadcast in Spring 2015 on CNN, exploring "exciting futuristic endeavors in science & technology" in the 21st century:

(1) Human Mission to Mars
(2) 3D Printing of a Human Heart
(3) Creating a Star on Earth: Quest for Fusion Energy
(4) Hypersonic Aviation
(5) Mapping the Human Brain

CNN Moonshots Series: "Creating a Star on Earth" → "takes a fascinating look at how harnessing the energy of nuclear fusion reactions may create a virtually limitless energy source."

SLIDE 3

Application Domain: MAGNETIC FUSION ENERGY (MFE)

ITER: ~$25B facility located in France, involving 7 governments representing over half of the world's population
→ dramatic next step for Magnetic Fusion Energy (MFE), producing a sustained burning plasma

  • Today: 10 MW(th) for 1 second with gain ~1
  • ITER: 500 MW(th) for >400 seconds with gain >10

[Figure: "Tokamak" device, showing magnets, plasma, and magnetic field]

SLIDE 4

SITUATION ANALYSIS

Most critical problem for MFE: avoid/mitigate large-scale major disruptions

  • Approach: use big-data-driven statistical/machine-learning (ML) predictions for the occurrence of disruptions in the EUROFUSION facility "Joint European Torus" (JET)
  • Current Status: ~8 years of R&D results (led by JET) using Support Vector Machine (SVM) ML on zero-D time-trace data executed on CPU clusters, yielding reported success rates in the mid-80% range for JET 30 ms before disruptions, BUT >95% with a false alarm rate <3% is actually needed for ITER (Reference: P. de Vries, et al. (2015))
  • Princeton Team Goals include:
    (i) improve physics fidelity via development of new ML multi-D, time-dependent software, including better classifiers;
    (ii) develop "portable" (cross-machine) predictive software beyond JET to other devices and eventually ITER; and
    (iii) enhance execution speed of disruption analysis for very large datasets
→ development & deployment of advanced ML software via Deep Learning Recurrent Neural Networks

SLIDE 5

Plasma Disruption Characteristics

Large-scale macroscopic instabilities:

  • Loss of confinement – ends the fusion reaction
  • Intense radiation – damaging concentration in small areas
  • Current quench – produces high magnetic forces

Time Scale: milliseconds (ms) → need at least 30 ms warning to mitigate → accurate/rapid prediction is necessary

Consequences: more severe with higher volume-to-surface-area ratio → ITER cannot tolerate disruptions at maximum current!

Present-Day Approaches: hypothesis-based first-principles simulations; simple statistical/threshold models with regression analysis; and "shallow machine learning" (e.g., small NNs, SVMs, Random Forests, ...)

SLIDE 6

Higher Dimensional Signals

  • At each timestep: arrays instead of scalars
  • All as a function of ρ (normalized flux surface, running from ρ = 0 at the plasma core to ρ = 1 at the edge)
  • Examples:
    – 1D current profiles
    – 1D electron temperature profiles
    – 1D radiation profiles

[Figure adapted from: Mazon, Didier, Christel Fenzi, and Roland Sabot. "As hot as it gets." Nature Physics 12.1 (2016): 14-17.]

SLIDE 7

Challenges & Opportunities

Signal Normalization & Outlier Detection

  • All signals placed on an appropriate numerical scale ~ O(1)
  • Rescale signals from different experimental systems (tokamaks) such that the same "meaning" of a signal on the various machines gets mapped to the same numerical value after re-scaling

Approaches:
  • Physics-based (e.g., density divided by the empirical "Greenwald Density Limit")
  • Data-based (e.g., all signals divided by their standard deviation)

Challenge: need rapid training time to determine the best approach from these options
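The data-based rescaling described above can be sketched as follows (a minimal illustration, not the FRNN preprocessing code; the signal names and values are made up):

```python
import numpy as np

def normalize_signals(signals):
    """Data-based normalization: divide each signal by its standard
    deviation so all signals land on a comparable ~O(1) scale."""
    normalized = {}
    for name, values in signals.items():
        std = np.std(values)
        # Guard against constant signals (std == 0).
        normalized[name] = values / std if std > 0 else values
    return normalized

# Toy example: two signals on wildly different raw scales.
raw = {
    "plasma_current": np.array([1.2e6, 1.5e6, 0.9e6]),  # amperes
    "density": np.array([3.0e19, 4.0e19, 2.5e19]),      # m^-3
}
scaled = normalize_signals(raw)  # every signal now has std == 1
```

A physics-based alternative would replace the standard deviation with a machine-specific quantity such as the Greenwald density limit.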

SLIDE 8

DEEP LEARNING RECURRENT NEURAL NETS (RNN) APPROACH
Julian Kates-Harbeck, DOE CSGF Fellow from Harvard U.

→ Rapid development of new GPU-compatible predictive software, with results benchmarked against those from extensive SVM analysis

Most promising approach to analysis of higher-dimensional signals: Deep Learning RNN with rapid training
1D Targets: (i) radial temperature profiles; (ii) density profiles; & (iii) radiation profiles

DL RNN Benefits:
  • Captures more physics to improve predictive accuracy
  • Rapid progress toward addressing the challenges of more data and longer training time → modern HPC training (e.g., via GPUs & MPI)
  • Neural networks able to efficiently extract salient physics features from higher-D data
  • Associated timely improvements in the accuracy of ML/DL predictions
SLIDE 9

CLASSIFICATION

  • Binary Classification Problem:
    ○ Shots are Disruptive (D) or Non-Disruptive (ND)
  • Supervised ML techniques:
    ○ Physics domain scientists combine a knowledge base of observationally validated information with advanced statistical/ML predictive methods. Shots can be labeled D/ND retrospectively.
  • Machine Learning (ML) Methods Engaged:
    ○ Basic SVM approach initiated by the JET team, leading to the APODIS software
    → enabled efficient, rapid progress toward development & deployment at PPPL of new Deep Learning Recurrent Neural Net (stacked LSTM) software
  • Approach: (i) examine appropriately normalized data; (ii) use a training set to generate a model; (iii) use the trained model to classify new samples
    → Targeted multi-D data analysis requires new signal representations
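Steps (i)-(iii) above can be sketched with a toy supervised classifier (a nearest-centroid stand-in, purely illustrative; the actual methods used are SVMs and stacked LSTMs, and the features here are made up):

```python
import numpy as np

def train(features, labels):
    """Step (ii): build a model from the labeled training set by
    storing the centroid of each class (D = disruptive, ND = not)."""
    return {c: features[labels == c].mean(axis=0) for c in ("D", "ND")}

def classify(model, sample):
    """Step (iii): label a new sample by its nearest class centroid."""
    return min(model, key=lambda c: np.linalg.norm(sample - model[c]))

# Step (i): toy, already-normalized zero-D features per shot.
X = np.array([[1.0, 1.0], [0.9, 1.1], [-1.0, -1.0], [-1.1, -0.9]])
y = np.array(["D", "D", "ND", "ND"])
model = train(X, y)
label = classify(model, np.array([0.8, 0.9]))  # "D"
```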

SLIDE 10

Machine Learning Workflow

1. Identify signals
2. Normalization: all data placed on an appropriate numerical scale ~ O(1), e.g., data-based with all signals divided by their standard deviation. Princeton/PPPL DL predictions now advancing to multi-D time-trace signals (beyond zero-D)
3. Preprocessing and feature extraction: measured sequential data arranged in patches of equal length for training
4. Classifiers – train model, hyperparameter tuning:
   • All available data analyzed
   • Train LSTM (Long Short-Term Memory network) iteratively
   • Evaluate using ROC (Receiver Operating Characteristic) and cross-validation loss for every epoch (one pass through the entire dataset per iteration)
5. Use model for prediction: apply the ML/DL software to new data

SLIDE 11

JET Disruption Data

  # Shots                 Disruptive   Nondisruptive   Totals
  Carbon Wall             324          4029            4353
  Beryllium Wall (ILW)    185          1036            1221
  Totals                  509          5065            5574

Sample of 7 signals of zero-D time traces:

  Signal                            Data Size (GB)
  Plasma Current                    1.8
  Mode Lock Amplitude               1.8
  Plasma Density                    7.8
  Radiated Power                    30.0
  Total Input Power                 3.0
  d/dt Stored Diamagnetic Energy    2.9
  Plasma Internal Inductance        3.0

JET produces ~1 terabyte (TB) of data per day; ~55 GB of data is collected from each JET shot
➔ Well over 350 TB in total, with multi-dimensional data yet to be analyzed

SLIDE 12

Deep Recurrent Neural Networks (RNNs): Basic Description

  • "Deep"
    ○ Hierarchical representation of complex data, building up salient features automatically
    ○ Obviates the need for hand tuning, feature engineering, and feature selection
  • "Recurrent"
    ○ Natural notion of time and memory → i.e., at every time-step, the output depends on:
      ■ the last internal state s(t-1) (recurrence!)
      ■ the current input x(t)
    ○ The internal state can act as memory ("memory/context") and accumulate information about what has happened in the past

Image adapted from: colah.github.io
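The recurrence described above can be sketched as a vanilla RNN cell (a minimal illustration; FRNN actually uses stacked LSTM cells, and the dimensions and weights here are made up):

```python
import numpy as np

def rnn_step(s_prev, x, W_s, W_x, b):
    """One recurrent time-step: the new internal state depends on the
    previous state s(t-1) and the current input x(t)."""
    return np.tanh(W_s @ s_prev + W_x @ x + b)

# Toy dimensions (hypothetical, for illustration only).
state_dim, input_dim = 4, 7  # e.g., 7 zero-D signals per time-step
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim)) * 0.1
W_x = rng.normal(size=(state_dim, input_dim)) * 0.1
b = np.zeros(state_dim)

s = np.zeros(state_dim)                      # initial internal state ("memory")
sequence = rng.normal(size=(10, input_dim))  # 10 time-steps of signals
for x in sequence:
    s = rnn_step(s, x, W_s, W_x, b)  # state accumulates past information
```

An LSTM replaces the single tanh update with gated updates, which is what lets the network retain information over the long shot sequences described later.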

SLIDE 13

FRNN ("Fusion Recurrent Neural Net") Code Performance (ROC Plot)

Example operating points:
  • True Positives: 93.5%, False Positives: 7.5%
  • True Positives: 90.0%, False Positives: 5.0%

Performance Tradeoff: tune True Positives (good: correctly caught disruption) vs. False Positives (bad: safe shot incorrectly labeled disruptive).

RNN Data:
  • Testing on 1200 shots from JET ILW campaigns (C28-C30)
  • All shots used; no signal filtering or removal of shots

JET SVM* work:
  • 990 shots from the same campaigns
  • Filtering of signals, ad hoc removal of shots with abnormal signals
  • TP 80 to 90%, FP 5%

*Vega, Jesús, et al. "Results of the JET real-time disruption predictor in the ITER-like wall campaigns." Fusion Engineering and Design 88.6 (2013): 1228-1231.
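Each point on the ROC plot corresponds to one alarm threshold; sweeping the threshold trades true positives against false positives. A minimal sketch (toy scores and labels, not JET results):

```python
def roc_point(scores, labels, threshold):
    """Compute one (FP rate, TP rate) point on the ROC curve.
    labels: True for disruptive shots, False for non-disruptive.
    scores: the model's disruption alarm score per shot."""
    tp = sum(s >= threshold and l for s, l in zip(scores, labels))
    fp = sum(s >= threshold and not l for s, l in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

# Toy scores/labels (made up for illustration).
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
labels = [True, True, False, True, False, False]
fpr, tpr = roc_point(scores, labels, threshold=0.5)  # (0.0, 1.0) here
```

Lowering the threshold catches more disruptions (higher TP rate) at the cost of more false alarms (higher FP rate), which is exactly the tradeoff tuned in the FRNN ROC plot.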

SLIDE 14

RNNs: HPC Innovations Engaged

GPU training:
  • Neural networks use dense tensor manipulations, making efficient use of GPU FLOPS
  • Over 10x speedup relative to multicore-node (CPU) training

Distributed training via MPI – linear scaling:
  • Key benchmark is "time to accuracy": we can train a model that achieves the same results nearly N times faster with N GPUs
  • Scalable to 100s or 1000s of GPUs on Leadership Class Facilities
  • TBs of data and more
  • Example: best-model training time on the full dataset (~40 GB, 4500 shots) of 0D signals:
    ○ SVM (JET): >24 hrs
    ○ RNN (20 GPUs): ~40 min

SLIDE 15

Fusion Recurrent Neural Net (FRNN) Description

  • Python deep learning code for disruption prediction in fusion (tokamak) experiments
    – Reference: https://github.com/PPPLDeepLearning/plasma-python
  • Implements distributed data-parallel synchronous RNN training
    – TensorFlow & Theano backends with MPI for communication
    – The FRNN code workflow is characteristic of typical distributed deep learning software
  • Core modules:
    – Models: Python classes necessary to construct, train, and optimize deep RNN models
    – Pre-process: arrange data into patches for stateful training; normalize
    – Primitives: Python objects for key plasma physics abstractions
    – Utils: a set of auxiliary functions for pre-processing, performance evaluation, and learning-curve analysis

SLIDE 16

[Diagram: a master holds the global weights W; each Worker (GPU) holds a replica Model(W) and its own data batches; gradients are combined via MPI_REDUCE and updated parameters distributed via MPI_BCAST]

Distributed Training

  • Keep N model replicas
  • Each computes gradient steps on a mini-batch, with different subsets of the data
  • The gradients are synchronized (averaged) and the updates are made to a global set of parameters
  • The global parameters are broadcast to the N models
  • Efficient communication using a custom MPI implementation
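The synchronization step above can be sketched without actual MPI by simulating the N workers in one process (a toy illustration of synchronous data-parallel SGD; the real FRNN code performs the equivalent of MPI_REDUCE/MPI_BCAST across GPU nodes):

```python
import numpy as np

def synchronous_sgd_step(params, worker_batches, grad_fn, lr=0.1):
    """One synchronous data-parallel step: each 'worker' computes a
    gradient on its own mini-batch, the gradients are averaged
    (the MPI_REDUCE role), and the updated global parameters are
    returned to every worker (the MPI_BCAST role)."""
    grads = [grad_fn(params, batch) for batch in worker_batches]
    mean_grad = np.mean(grads, axis=0)  # synchronize (average) gradients
    return params - lr * mean_grad      # update the global parameters

# Toy problem: fit the mean of the data by gradient descent.
def grad_fn(params, batch):
    return 2.0 * (params - batch.mean(axis=0))  # grad of ||p - mean||^2

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, size=(4, 8, 2))  # 4 workers, 8 samples each
params = np.zeros(2)
for _ in range(200):
    params = synchronous_sgd_step(params, list(data), grad_fn)
# params converges to the global data mean.
```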

SLIDE 17

Scaling Summary

Parallel Efficiency:
  • Runtime: computation time
  • Communication: each batch of data requires time for synchronization
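One common way to express this tradeoff (an illustrative model, not taken from the slides): parallel efficiency is the fraction of each batch's wall time spent computing rather than synchronizing.

```python
def parallel_efficiency(t_compute, t_comm):
    """Fraction of per-batch wall time spent on useful computation.
    Efficiency drops as per-batch communication (gradient
    synchronization) grows relative to computation."""
    return t_compute / (t_compute + t_comm)

# Hypothetical numbers: 90 ms of compute, 10 ms of synchronization.
eff = parallel_efficiency(90.0, 10.0)  # 0.9
```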

SLIDE 18

FRNN Scaling Results on GPUs (Part 1)

  • Tests on Princeton University's "Tiger" cluster
    – 200 Tesla P100 PCIe GPUs, Intel Omni-Path fabric
    – TensorFlow+MPI, CUDA 8, cuDNN 5.1

Average Time per Epoch:

  GPU type / backend    T_epoch [s]
  K20X (Theano)         1352
  P100 (Theano)         694
  P100 (TensorFlow)     515

SLIDE 19

FRNN Scaling Results on GPUs (Part 2)

  • Tests on the OLCF Titan Cray supercomputer
    – OLCF DD Award: enabled scaling studies on Titan, currently up to 6000 GPUs
    – Titan has ~18.7K Tesla K20X Kepler GPUs in total
    – TensorFlow+MPI (using Singularity containers), CUDA 7.5, cuDNN 5.1

SLIDE 20
Challenges & Opportunities Summary

  • Hyperparameter Tuning: currently exploring many approaches to optimize performance
  • Higher Dimensional Signals: demand processing much more data to train models
  • Signal Normalization: identifying and testing various approaches to determine the best choices

→ All share a common need for rapid training time; HPC engagement (e.g., via GPUs and MPI) is key

SLIDE 21

CURRENT PERSPECTIVE

Forecasting disruptions using machine learning is an important application of a general idea:
→ Use multi-outcome prediction to distinguish disruption types/scenarios
→ Eventually move from prediction to active control (including reinforcement learning and synthetic diagnostics)
→ Increasingly large and diverse datasets require building scalable systems to take advantage of leadership-class computing facilities

SLIDE 22

Fusion Deep Learning (FRNN) Technical Summary

  • FRNN → a distributed data-parallel approach to train deep neural networks (stacked LSTMs)
  • A replica of the model is kept on each "worker" → processing different mini-batches of the training dataset in parallel
  • Results on each worker are combined after each epoch using MPI
  • Model parameters are synchronized via parameter averaging → with the learning rate adjusted after each epoch to improve convergence
  • Stochastic gradient descent (SGD) used for large-scale optimization, with parallelization via mini-batch training to reduce communication cost

→ Challenge: scaling studies to examine whether the convergence rate saturates/decreases with increasing mini-batch size (to thousands of GPUs)
→ Targeted large HPC systems with P100s for performance scaling studies: (1) "Piz Daint" Cray XC50 @ CSCS (Switzerland) with >4K GPUs; (2) "SATURN V" @ NVIDIA with ~1K GPUs; (3) "TSUBAME 3" @ TITECH with ~3K GPUs; & (4) "SummitDev" @ OLCF

SLIDE 23
Fusion Big Data ML/DL Application Summary

  • Fusion Energy Mission:
    – Accelerate demonstration of the scientific & technical feasibility of delivering fusion power
    – The most critical associated problem is to avoid/mitigate large-scale major disruptions
  • ML Relevance to HPC:
    – Rapid advances in the development of predictive methods via large-data-driven "machine-learning" statistical methods
    – Approach Focus: Deep Learning / Recurrent Neural Nets (RNNs)
    – Significance: exciting alternative predictive approach to "hypothesis-driven/first-principles" exascale predictive methods
    – Complementarity: exascale HPC needed to introduce/establish supervised ML classifiers with associated features
  • Associated Challenge:
    → Improvements over zero-D SVM-based machine learning needed to achieve a >95% success rate with <5% false positives at least 30 ms before disruptions -- with portability of the software to ITER via enhanced physics fidelity (capturing multi-D) and improved execution time enabled by access to advanced HPC hardware (e.g., large GPU systems)

SLIDE 24

EXTRA SLIDES

SLIDE 25

FRNN Scaling Results on GPUs: Theano Backend (Back-up Slide)

  • Tests on the OLCF Titan Cray supercomputer
    – Theano backend with MPI communication
    – Successful runs up to 1200 K20X Kepler GPUs
    – Graph compilation time becomes a bottleneck beyond 1.2K GPUs

SLIDE 26

Recurrent Neural Networks (RNNs)

Common theme: sequential data

Example tasks: image classification, image captioning, sentiment analysis, machine translation, time-series prediction, disruption forecasting

[Diagram: input, hidden, and output layers of an RNN]

SLIDE 27

Patching Sequence Data (Part 1)

  • The length of shots in, e.g., the JET data varies by orders of magnitude:
    – Minimum length: 1400 time-steps
    – Mean length: ~27,000 time-steps
    – Max length: ~40,000 time-steps
  • Data-parallel synchronous training demands that the amounts of data passed to each model replica be about the same size
  • A patch is a subset of a shot's time/signal profile having a fixed length
  • Patch size is approximately equal to the minimum shot length
    – Note: the patch size is the largest number less than or equal to the minimum shot length that is divisible by the LSTM model length

[Figure: histogram of shot lengths in time-steps]
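The note above amounts to rounding the minimum shot length down to a multiple of the LSTM model length, which can be sketched as (the 1400-step minimum is from the JET data; the LSTM model length used here is hypothetical):

```python
def patch_size(min_shot_length, lstm_model_length):
    """Largest multiple of the LSTM model length that is <= the
    minimum shot length, so every patch splits evenly into
    LSTM-length chunks."""
    return (min_shot_length // lstm_model_length) * lstm_model_length

# With JET's minimum shot length of 1400 time-steps and a
# hypothetical LSTM model length of 128 time-steps:
size = patch_size(1400, 128)  # 1280
```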

SLIDE 28

Patching Sequence Data (Part 2)

  • Shape of data: (batch size, number of time-steps, dimension of data)
    – Batch size: 256, a tunable parameter
    – Number of time-steps: the LSTM model length, tunable
    – Dimension of data: e.g., 7 zero-D signals → a 7-dimensional signal vector at each time-step
  • The i-th and the (batch size + i)-th examples are consecutive in time
    → the RNN internal state is not reset unless we start a new chunk
  • Instead of padding to the maximum length, we take a patch-sized subset of the shot from the end
  • On the evaluation step, we use zero-padding to the maximum shot sequence length
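Taking a patch from the end of a shot, and stacking patches into the (batch size, time-steps, dimension) training shape, can be sketched as (toy batch size; the real code uses batch size 256 with 7 zero-D signals):

```python
import numpy as np

def patch_from_end(shot, patch_length):
    """Return the last `patch_length` time-steps of a shot's
    (time-steps, signal-dim) array, instead of padding to the max."""
    return shot[-patch_length:]

# Toy shot: 1400 time-steps of a 7-dimensional zero-D signal vector.
shot = np.arange(1400 * 7, dtype=float).reshape(1400, 7)
patch = patch_from_end(shot, 1280)  # shape (1280, 7)

# Stacking patches gives the training shape
# (batch size, number of time-steps, dimension of data):
batch = np.stack([patch, patch])    # shape (2, 1280, 7)
```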