Deep Learning Acceleration of Progress Toward Delivery of Fusion Energy - PowerPoint PPT Presentation



SLIDE 1

Deep Learning Acceleration of Progress

Toward Delivery of Fusion Energy

William M. Tang Princeton University/Princeton Plasma Physics Laboratory (PPPL) GPU TECHNOLOGY CONFERENCE -- GTC-2017 San Jose, California May 10, 2017

Co-authors: Julian Kates-Harbeck, Alexey Svyatkovskiy, Kyle Felker, Eliot Feibush, Michael Churchill

SLIDE 2

CNN's "MOONSHOTS for 21st CENTURY" (hosted by Fareed Zakaria) – five segments, broadcast in Spring 2015 on CNN, exploring "exciting futuristic endeavors in science & technology" in the 21st century:

(1) Human Mission to Mars
(2) 3D Printing of a Human Heart
(3) Creating a Star on Earth: Quest for Fusion Energy
(4) Hypersonic Aviation
(5) Mapping the Human Brain

CNN Moonshots Series: "Creating a Star on Earth" → "takes a fascinating look at how harnessing the energy of nuclear fusion reactions may create a virtually limitless energy source."

SLIDE 3

Application Domain: MAGNETIC FUSION ENERGY (MFE)

ITER: ~$25B facility located in France, involving 7 governments representing over half of the world's population
→ dramatic next step for Magnetic Fusion Energy (MFE), producing a sustained burning plasma

  • Today: 10 MW(th) for 1 second with gain ~1
  • ITER: 500 MW(th) for >400 seconds with gain >10

[Figure: "Tokamak" device, showing magnets, plasma, and magnetic field]

SLIDE 4

SITUATION ANALYSIS

Most critical problem for MFE: avoid/mitigate large-scale major disruptions

  • Approach: use big-data-driven statistical/machine-learning (ML) predictions for the occurrence of disruptions in the EUROFUSION facility "Joint European Torus" (JET)
  • Current Status: ~8 years of R&D results (led by JET) using Support Vector Machine (SVM) ML on zero-D time-trace data executed on CPU clusters, yielding reported success rates in the mid-80% range for JET 30 ms before disruptions, BUT >95% with a false alarm rate <3% is actually needed for ITER (Reference: P. de Vries, et al. (2015))
  • Princeton Team Goals include:
    (i) improve physics fidelity via development of new ML multi-D, time-dependent software, including better classifiers;
    (ii) develop "portable" (cross-machine) predictive software beyond JET to other devices and eventually ITER; and
    (iii) enhance execution speed of disruption analysis for very large datasets
→ development & deployment of advanced ML software via Deep Learning Recurrent Neural Networks

SLIDE 5

Plasma Disruption Characteristics

Large-scale macroscopic instabilities:

  • Loss of confinement – ends the fusion reaction
  • Intense radiation – damaging concentration in small areas
  • Current quench – produces high magnetic forces

Time Scale: milliseconds (ms) → need at least 30 ms warning to mitigate → accurate/rapid prediction is necessary

Consequences: more severe with higher volume-to-surface-area ratio → ITER cannot tolerate disruptions at maximum current!

Present-Day Approaches: hypothesis-based first-principles simulations; simple statistical/threshold models with regression analysis; and "shallow machine learning" (e.g., small NNs, SVMs, Random Forests, ...)

SLIDE 6

Higher Dimensional Signals

  • At each timestep: arrays instead of scalars
  • All as a function of ρ (normalized flux surface, running from ρ = 0 at the plasma core to ρ = 1 at the edge)
  • Examples:
    – 1D current profiles
    – 1D electron temperature profiles
    – 1D radiation profiles

[Figure adapted from: Mazon, Didier, Christel Fenzi, and Roland Sabot. "As hot as it gets." Nature Physics 12.1 (2016): 14-17.]

SLIDE 7

Challenges & Opportunities

Signal Normalization & Outlier Detection

  • All signals placed on an appropriate numerical scale ~ O(1)
  • Rescale signals from different experimental systems (tokamaks) such that the same "meaning" of a signal on the various machines gets mapped to the same numerical value after re-scaling

Approaches:
  • Physics-based (e.g., density divided by the empirical "Greenwald Density Limit")
  • Data-based (e.g., all signals divided by their standard deviation)

Challenge: need rapid training time to determine the best approach from these options
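The data-based rescaling described above can be sketched as follows (a minimal illustration, not the FRNN preprocessing code; the signal names and values are made up):

```python
import numpy as np

def normalize_signals(signals):
    """Data-based normalization: divide each signal by its standard
    deviation so all signals land on a comparable ~O(1) scale."""
    normalized = {}
    for name, values in signals.items():
        std = np.std(values)
        # Guard against constant signals (std == 0).
        normalized[name] = values / std if std > 0 else values
    return normalized

# Toy example: two signals on wildly different raw scales.
raw = {
    "plasma_current": np.array([1.2e6, 1.5e6, 0.9e6]),  # amperes
    "density": np.array([3.0e19, 4.0e19, 2.5e19]),      # m^-3
}
scaled = normalize_signals(raw)  # every signal now has std == 1
```

A physics-based alternative would replace the standard deviation with a machine-specific quantity such as the Greenwald density limit.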

SLIDE 8

DEEP LEARNING RECURRENT NEURAL NETS (RNN) APPROACH
Julian Kates-Harbeck, DOE CSGF Fellow from Harvard U.

→ Rapid development of new GPU-compatible predictive software, with results benchmarked against those from extensive SVM analysis

Most promising approach to analysis of higher-dimensional signals: Deep Learning RNN with rapid training
1D Targets: (i) radial temperature profiles; (ii) density profiles; & (iii) radiation profiles

DL RNN Benefits:
  • Captures more physics to improve predictive accuracy
  • Rapid progress toward addressing the challenges of more data and longer training time → modern HPC training (e.g., via GPUs & MPI)
  • Neural networks able to efficiently extract salient physics features from higher-D data
  • Associated timely improvements in the accuracy of ML/DL predictions
SLIDE 9

CLASSIFICATION

  • Binary Classification Problem:
    ○ Shots are Disruptive (D) or Non-Disruptive (ND)
  • Supervised ML techniques:
    ○ Physics domain scientists combine a knowledge base of observationally validated information with advanced statistical/ML predictive methods. Shots can be labeled D/ND retrospectively.
  • Machine Learning (ML) Methods Engaged:
    ○ Basic SVM approach initiated by the JET team, leading to the APODIS software
    → enabled efficient, rapid progress toward development & deployment at PPPL of new Deep Learning Recurrent Neural Net (stacked LSTM) software
  • Approach: (i) examine appropriately normalized data; (ii) use a training set to generate a model; (iii) use the trained model to classify new samples
    → Targeted multi-D data analysis requires new signal representations
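Steps (i)-(iii) above can be sketched with a toy supervised classifier (a nearest-centroid stand-in, purely illustrative; the actual methods used are SVMs and stacked LSTMs, and the features here are made up):

```python
import numpy as np

def train(features, labels):
    """Step (ii): build a model from the labeled training set by
    storing the centroid of each class (D = disruptive, ND = not)."""
    return {c: features[labels == c].mean(axis=0) for c in ("D", "ND")}

def classify(model, sample):
    """Step (iii): label a new sample by its nearest class centroid."""
    return min(model, key=lambda c: np.linalg.norm(sample - model[c]))

# Step (i): toy, already-normalized zero-D features per shot.
X = np.array([[1.0, 1.0], [0.9, 1.1], [-1.0, -1.0], [-1.1, -0.9]])
y = np.array(["D", "D", "ND", "ND"])
model = train(X, y)
label = classify(model, np.array([0.8, 0.9]))  # "D"
```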

SLIDE 10

Machine Learning Workflow

1. Identify signals
2. Normalization: all data placed on an appropriate numerical scale ~ O(1), e.g., data-based with all signals divided by their standard deviation. Princeton/PPPL DL predictions now advancing to multi-D time-trace signals (beyond zero-D)
3. Preprocessing and feature extraction: measured sequential data arranged in patches of equal length for training
4. Classifiers – train model, hyperparameter tuning:
   • All available data analyzed
   • Train LSTM (Long Short-Term Memory network) iteratively
   • Evaluate using ROC (Receiver Operating Characteristic) and cross-validation loss for every epoch (one pass through the entire dataset per iteration)
5. Use model for prediction: apply the ML/DL software to new data

SLIDE 11

JET Disruption Data

  # Shots                 Disruptive   Nondisruptive   Totals
  Carbon Wall             324          4029            4353
  Beryllium Wall (ILW)    185          1036            1221
  Totals                  509          5065            5574

Sample of 7 signals of zero-D time traces:

  Signal                            Data Size (GB)
  Plasma Current                    1.8
  Mode Lock Amplitude               1.8
  Plasma Density                    7.8
  Radiated Power                    30.0
  Total Input Power                 3.0
  d/dt Stored Diamagnetic Energy    2.9
  Plasma Internal Inductance        3.0

JET produces ~1 terabyte (TB) of data per day; ~55 GB of data is collected from each JET shot
➔ Well over 350 TB in total, with multi-dimensional data yet to be analyzed

SLIDE 12

Deep Recurrent Neural Networks (RNNs): Basic Description

  • "Deep"
    ○ Hierarchical representation of complex data, building up salient features automatically
    ○ Obviates the need for hand tuning, feature engineering, and feature selection
  • "Recurrent"
    ○ Natural notion of time and memory → i.e., at every time-step, the output depends on:
      ■ the last internal state s(t-1) (recurrence!)
      ■ the current input x(t)
    ○ The internal state can act as memory ("memory/context") and accumulate information about what has happened in the past

Image adapted from: colah.github.io
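The recurrence described above can be sketched as a vanilla RNN cell (a minimal illustration; FRNN actually uses stacked LSTM cells, and the dimensions and weights here are made up):

```python
import numpy as np

def rnn_step(s_prev, x, W_s, W_x, b):
    """One recurrent time-step: the new internal state depends on the
    previous state s(t-1) and the current input x(t)."""
    return np.tanh(W_s @ s_prev + W_x @ x + b)

# Toy dimensions (hypothetical, for illustration only).
state_dim, input_dim = 4, 7  # e.g., 7 zero-D signals per time-step
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim)) * 0.1
W_x = rng.normal(size=(state_dim, input_dim)) * 0.1
b = np.zeros(state_dim)

s = np.zeros(state_dim)                      # initial internal state ("memory")
sequence = rng.normal(size=(10, input_dim))  # 10 time-steps of signals
for x in sequence:
    s = rnn_step(s, x, W_s, W_x, b)  # state accumulates past information
```

An LSTM replaces the single tanh update with gated updates, which is what lets the network retain information over the long shot sequences described later.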

SLIDE 13

FRNN ("Fusion Recurrent Neural Net") Code Performance (ROC Plot)

Example operating points:
  • True Positives: 93.5%, False Positives: 7.5%
  • True Positives: 90.0%, False Positives: 5.0%

Performance Tradeoff: tune True Positives (good: correctly caught disruption) vs. False Positives (bad: safe shot incorrectly labeled disruptive).

RNN Data:
  • Testing on 1200 shots from JET ILW campaigns (C28-C30)
  • All shots used; no signal filtering or removal of shots

JET SVM* work:
  • 990 shots from the same campaigns
  • Filtering of signals, ad hoc removal of shots with abnormal signals
  • TP 80 to 90%, FP 5%

*Vega, Jesús, et al. "Results of the JET real-time disruption predictor in the ITER-like wall campaigns." Fusion Engineering and Design 88.6 (2013): 1228-1231.
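Each point on the ROC plot corresponds to one alarm threshold; sweeping the threshold trades true positives against false positives. A minimal sketch (toy scores and labels, not JET results):

```python
def roc_point(scores, labels, threshold):
    """Compute one (FP rate, TP rate) point on the ROC curve.
    labels: True for disruptive shots, False for non-disruptive.
    scores: the model's disruption alarm score per shot."""
    tp = sum(s >= threshold and l for s, l in zip(scores, labels))
    fp = sum(s >= threshold and not l for s, l in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

# Toy scores/labels (made up for illustration).
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
labels = [True, True, False, True, False, False]
fpr, tpr = roc_point(scores, labels, threshold=0.5)  # (0.0, 1.0) here
```

Lowering the threshold catches more disruptions (higher TP rate) at the cost of more false alarms (higher FP rate), which is exactly the tradeoff tuned in the FRNN ROC plot.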

SLIDE 14

RNNs: HPC Innovations Engaged

GPU training:
  • Neural networks use dense tensor manipulations, making efficient use of GPU FLOPS
  • Over 10x speedup relative to multicore-node (CPU) training

Distributed training via MPI – linear scaling:
  • Key benchmark is "time to accuracy": we can train a model that achieves the same results nearly N times faster with N GPUs
  • Scalable to 100s or 1000s of GPUs on Leadership Class Facilities
  • TBs of data and more
  • Example: best-model training time on the full dataset (~40 GB, 4500 shots) of 0D signals:
    ○ SVM (JET): >24 hrs
    ○ RNN (20 GPUs): ~40 min

SLIDE 15

Fusion Recurrent Neural Net (FRNN) Description

  • Python deep learning code for disruption prediction in fusion (tokamak) experiments
    – Reference: https://github.com/PPPLDeepLearning/plasma-python
  • Implements distributed data-parallel synchronous RNN training
    – TensorFlow & Theano backends with MPI for communication
    – The FRNN code workflow is characteristic of typical distributed deep learning software
  • Core modules:
    – Models: Python classes necessary to construct, train, and optimize deep RNN models
    – Pre-process: arrange data into patches for stateful training; normalize
    – Primitives: Python objects for key plasma physics abstractions
    – Utils: a set of auxiliary functions for pre-processing, performance evaluation, and learning-curve analysis

SLIDE 16

[Diagram: a master holds the global weights W; each Worker (GPU) holds a replica Model(W) and its own data batches; gradients are combined via MPI_REDUCE and updated parameters distributed via MPI_BCAST]

Distributed Training

  • Keep N model replicas
  • Each computes gradient steps on a mini-batch, with different subsets of the data
  • The gradients are synchronized (averaged) and the updates are made to a global set of parameters
  • The global parameters are broadcast to the N models
  • Efficient communication using a custom MPI implementation
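The synchronization step above can be sketched without actual MPI by simulating the N workers in one process (a toy illustration of synchronous data-parallel SGD; the real FRNN code performs the equivalent of MPI_REDUCE/MPI_BCAST across GPU nodes):

```python
import numpy as np

def synchronous_sgd_step(params, worker_batches, grad_fn, lr=0.1):
    """One synchronous data-parallel step: each 'worker' computes a
    gradient on its own mini-batch, the gradients are averaged
    (the MPI_REDUCE role), and the updated global parameters are
    returned to every worker (the MPI_BCAST role)."""
    grads = [grad_fn(params, batch) for batch in worker_batches]
    mean_grad = np.mean(grads, axis=0)  # synchronize (average) gradients
    return params - lr * mean_grad      # update the global parameters

# Toy problem: fit the mean of the data by gradient descent.
def grad_fn(params, batch):
    return 2.0 * (params - batch.mean(axis=0))  # grad of ||p - mean||^2

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, size=(4, 8, 2))  # 4 workers, 8 samples each
params = np.zeros(2)
for _ in range(200):
    params = synchronous_sgd_step(params, list(data), grad_fn)
# params converges to the global data mean.
```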

SLIDE 17

Scaling Summary

Parallel Efficiency:
  • Runtime: computation time
  • Communication: each batch of data requires time for synchronization
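One common way to express this tradeoff (an illustrative model, not taken from the slides): parallel efficiency is the fraction of each batch's wall time spent computing rather than synchronizing.

```python
def parallel_efficiency(t_compute, t_comm):
    """Fraction of per-batch wall time spent on useful computation.
    Efficiency drops as per-batch communication (gradient
    synchronization) grows relative to computation."""
    return t_compute / (t_compute + t_comm)

# Hypothetical numbers: 90 ms of compute, 10 ms of synchronization.
eff = parallel_efficiency(90.0, 10.0)  # 0.9
```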

SLIDE 18

FRNN Scaling Results on GPUs (Part 1)

  • Tests on Princeton University's "Tiger" cluster
    – 200 Tesla P100 PCIe GPUs, Intel Omni-Path fabric
    – TensorFlow+MPI, CUDA 8, cuDNN 5.1

Average Time per Epoch:

  GPU type / backend    T_epoch [s]
  K20X (Theano)         1352
  P100 (Theano)         694
  P100 (TensorFlow)     515

SLIDE 19

FRNN Scaling Results on GPUs (Part 2)

  • Tests on the OLCF Titan Cray supercomputer
    – OLCF DD Award: enabled scaling studies on Titan, currently up to 6000 GPUs
    – Titan has ~18.7K Tesla K20X Kepler GPUs in total
    – TensorFlow+MPI (using Singularity containers), CUDA 7.5, cuDNN 5.1

SLIDE 20
Challenges & Opportunities Summary

  • Hyperparameter Tuning: currently exploring many approaches to optimize performance
  • Higher Dimensional Signals: demand processing much more data to train models
  • Signal Normalization: identifying and testing various approaches to determine the best choices

→ All share a common need for rapid training time; HPC engagement (e.g., via GPUs and MPI) is key

SLIDE 21

CURRENT PERSPECTIVE

Forecasting disruptions using machine learning is an important application of a general idea:
→ Use multi-outcome prediction to distinguish disruption types/scenarios
→ Eventually move from prediction to active control (including reinforcement learning and synthetic diagnostics)
→ Increasingly large and diverse datasets require building scalable systems to take advantage of leadership-class computing facilities

SLIDE 22

Fusion Deep Learning (FRNN) Technical Summary

  • FRNN → a distributed data-parallel approach to train deep neural networks (stacked LSTMs)
  • A replica of the model is kept on each "worker" → processing different mini-batches of the training dataset in parallel
  • Results on each worker are combined after each epoch using MPI
  • Model parameters are synchronized via parameter averaging → with the learning rate adjusted after each epoch to improve convergence
  • Stochastic gradient descent (SGD) used for large-scale optimization, with parallelization via mini-batch training to reduce communication cost

→ Challenge: scaling studies to examine whether the convergence rate saturates/decreases with increasing mini-batch size (to thousands of GPUs)
→ Targeted large HPC systems with P100s for performance scaling studies: (1) "Piz Daint" Cray XC50 @ CSCS (Switzerland) with >4K GPUs; (2) "SATURN V" @ NVIDIA with ~1K GPUs; (3) "TSUBAME 3" @ TITECH with ~3K GPUs; & (4) "SummitDev" @ OLCF

SLIDE 23
Fusion Big Data ML/DL Application Summary

  • Fusion Energy Mission:
    – Accelerate demonstration of the scientific & technical feasibility of delivering fusion power
    – The most critical associated problem is to avoid/mitigate large-scale major disruptions
  • ML Relevance to HPC:
    – Rapid advances in the development of predictive methods via large-data-driven "machine-learning" statistical methods
    – Approach Focus: Deep Learning / Recurrent Neural Nets (RNNs)
    – Significance: exciting alternative predictive approach to "hypothesis-driven/first-principles" exascale predictive methods
    – Complementarity: exascale HPC needed to introduce/establish supervised ML classifiers with associated features
  • Associated Challenge:
    → Improvements over zero-D SVM-based machine learning needed to achieve a >95% success rate with <5% false positives at least 30 ms before disruptions -- with portability of the software to ITER via enhanced physics fidelity (capturing multi-D) and improved execution time enabled by access to advanced HPC hardware (e.g., large GPU systems)

SLIDE 24

EXTRA SLIDES

SLIDE 25

FRNN Scaling Results on GPUs: Theano Backend (Back-up Slide)

  • Tests on the OLCF Titan Cray supercomputer
    – Theano backend with MPI communication
    – Successful runs up to 1200 K20X Kepler GPUs
    – Graph compilation time becomes a bottleneck beyond 1.2K GPUs

SLIDE 26

Recurrent Neural Networks (RNNs)

Common theme: sequential data

Example tasks: image classification, image captioning, sentiment analysis, machine translation, time-series prediction, disruption forecasting

[Diagram: input, hidden, and output layers of an RNN]

SLIDE 27

Patching Sequence Data (Part 1)

  • The length of shots in, e.g., the JET data varies by orders of magnitude:
    – Minimum length: 1400 time-steps
    – Mean length: ~27,000 time-steps
    – Max length: ~40,000 time-steps
  • Data-parallel synchronous training demands that the amounts of data passed to each model replica be about the same size
  • A patch is a subset of a shot's time/signal profile having a fixed length
  • Patch size is approximately equal to the minimum shot length
    – Note: the patch size is the largest number less than or equal to the minimum shot length that is divisible by the LSTM model length

[Figure: histogram of shot lengths in time-steps]
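The note above amounts to rounding the minimum shot length down to a multiple of the LSTM model length, which can be sketched as (the 1400-step minimum is from the JET data; the LSTM model length used here is hypothetical):

```python
def patch_size(min_shot_length, lstm_model_length):
    """Largest multiple of the LSTM model length that is <= the
    minimum shot length, so every patch splits evenly into
    LSTM-length chunks."""
    return (min_shot_length // lstm_model_length) * lstm_model_length

# With JET's minimum shot length of 1400 time-steps and a
# hypothetical LSTM model length of 128 time-steps:
size = patch_size(1400, 128)  # 1280
```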

SLIDE 28

Patching Sequence Data (Part 2)

  • Shape of data: (batch size, number of time-steps, dimension of data)
    – Batch size: 256, a tunable parameter
    – Number of time-steps: the LSTM model length, tunable
    – Dimension of data: e.g., 7 zero-D signals → a 7-dimensional signal vector at each time-step
  • The i-th and the (batch size + i)-th examples are consecutive in time
    → the RNN internal state is not reset unless we start a new chunk
  • Instead of padding to the maximum length, we take a patch-sized subset of the shot from the end
  • On the evaluation step, we use zero-padding to the maximum shot sequence length
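Taking a patch from the end of a shot, and stacking patches into the (batch size, time-steps, dimension) training shape, can be sketched as (toy batch size; the real code uses batch size 256 with 7 zero-D signals):

```python
import numpy as np

def patch_from_end(shot, patch_length):
    """Return the last `patch_length` time-steps of a shot's
    (time-steps, signal-dim) array, instead of padding to the max."""
    return shot[-patch_length:]

# Toy shot: 1400 time-steps of a 7-dimensional zero-D signal vector.
shot = np.arange(1400 * 7, dtype=float).reshape(1400, 7)
patch = patch_from_end(shot, 1280)  # shape (1280, 7)

# Stacking patches gives the training shape
# (batch size, number of time-steps, dimension of data):
batch = np.stack([patch, patch])    # shape (2, 1280, 7)
```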