The Frontier of Define-by-Run Deep Learning Frameworks - PowerPoint PPT Presentation



SLIDE 1

The Frontier of Define-by-Run Deep Learning Frameworks

GTC 2019 @ San Jose. Mar. 20, 2019. Seiya Tokui, Preferred Networks, Inc.

S9380

SLIDE 2

Deep Learning Framework for fast iterative research/development

SLIDE 3

Define-by-Run frameworks

[Framework logos; one annotated "by default from 2.0"]

SLIDE 4

x = numpy.array(…)
h1 = layer1(x, W1)
h2 = layer2(h1, W2)
loss = loss_func(h2)
loss.backward()
W1.array -= lr * W1.grad
W2.array -= lr * W2.grad

Write forward prop as a plain Python script. Variables hold how they were computed. Use it to compute the gradient.
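The idea that "variables hold how they were computed" can be sketched in pure NumPy (illustrative class and function names, not Chainer's actual implementation): each operation records its inputs and a local backward rule, and backward() walks the recorded graph in reverse.

```python
import numpy as np

class Var:
    """A value that remembers how it was computed (define-by-run)."""
    def __init__(self, array, parents=(), backward_fn=None):
        self.array = array
        self.grad = np.zeros_like(array)
        self.parents = parents          # Vars this one was computed from
        self.backward_fn = backward_fn  # distributes grad to the parents

    def backward(self):
        # Seed with dL/dL = 1 and traverse the recorded graph in reverse.
        self.grad = np.ones_like(self.array)
        stack = [self]
        while stack:
            v = stack.pop()
            if v.backward_fn is not None:
                v.backward_fn(v.grad)
                stack.extend(v.parents)

def mul(a, b):
    def bw(g):
        a.grad += g * b.array
        b.grad += g * a.array
    return Var(a.array * b.array, (a, b), bw)

def add(a, b):
    def bw(g):
        a.grad += g
        b.grad += g
    return Var(a.array + b.array, (a, b), bw)

# y = x*w + b; gradients are recovered by walking the recorded graph
x = Var(np.array([2.0]))
w = Var(np.array([3.0]))
b = Var(np.array([1.0]))
y = add(mul(x, w), b)
y.backward()
print(w.grad)  # dy/dw = x = [2.]
```

Real frameworks add broadcasting, graph deduplication, and many more ops, but the recording mechanism is the same shape.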

SLIDE 5

Deep learning framework

  • Optimized for the Define-by-Run API design

SLIDE 6

✓ Model description
✓ Distributed training
✓ Serialization, export, …

Everything is optimized for Define-by-Run style programming

SLIDE 7

class Linear(chainer.Link):
    def __init__(self, n_in, n_out):
        super().__init__()
        with self.init_scope():
            self.W = chainer.Parameter(I.HeNormal(), (n_in, n_out))
            self.b = chainer.Parameter(0, (n_out,))

    def forward(self, x):
        return x @ self.W + self.b

Tie parameters to the forward code using OOP.
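For readers without Chainer installed, the same pattern can be mimicked in plain NumPy (a hypothetical class, not Chainer's API): the object owns its parameters as ordinary attributes, so the OOP structure itself ties them to the forward code.

```python
import numpy as np

class Linear:
    """Plain-NumPy analogue of the Chainer Link above: parameters are
    attributes of the object, used directly by its forward method."""
    def __init__(self, n_in, n_out, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        # He-normal initialization, mirroring I.HeNormal() on the slide
        self.W = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        return x @ self.W + self.b

layer = Linear(784, 200)
y = layer.forward(np.zeros((32, 784)))
print(y.shape)  # (32, 200)
```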

SLIDE 8

class MLP(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = Linear(784, 200)
            self.l2 = Linear(200, 100)
            self.l3 = Linear(100, 10)

    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

Object structure = composition of NN fragments

SLIDE 9

for batch in iterator:            # fetch the next minibatch
    x, t = converter(batch)       # concat, transfer to the device
    loss = loss_fun(x, t)         # forward prop
    loss.backward()               # backprop
    optimizer.update()            # update parameters
    model.cleargrads()            # cleanup gradients

Every part is plain, customizable Python code
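A runnable version of the same loop in plain NumPy, with a made-up linear model and manual gradients standing in for loss_fun and backward() (every name here is illustrative):

```python
import numpy as np

# Toy setup: fit W to targets generated by y = x @ true_w
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
W = np.zeros(2)
lr = 0.1

def iterator(n_batches=200, batch_size=16):
    for _ in range(n_batches):              # fetch the next minibatch
        x = rng.normal(size=(batch_size, 2))
        yield x, x @ true_w

for batch in iterator():
    x, t = batch                            # "converter": already arrays here
    y = x @ W                               # forward prop
    grad = 2 * x.T @ (y - t) / len(x)       # backprop (manual for this model)
    W -= lr * grad                          # update parameters
    # gradients are recomputed fresh each iteration (the cleargrads step)

print(np.round(W, 3))
```

Because each piece is ordinary Python, any step (sampling, gradient computation, update rule) can be swapped out independently.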

SLIDE 10

Fast GPU computation

SLIDE 11

SLIDE 12

import numpy as np

def logsumexp(x):
    x_max = x.max(axis=1, keepdims=True)
    x0 = x - x_max
    lse = np.log(np.exp(x0).sum(axis=1))
    lse += x_max.ravel()
    return lse

x = np.array([...], dtype=np.float32)
print(logsumexp(x))
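The slide's logsumexp can be exercised with concrete data (note that keepdims is the correct NumPy keyword, and the subtracted max must be added back in a shape-compatible way). A runnable NumPy version with made-up inputs:

```python
import numpy as np

def logsumexp(x):
    # subtract the per-row max so exp() cannot overflow
    x_max = x.max(axis=1, keepdims=True)
    lse = np.log(np.exp(x - x_max).sum(axis=1))
    return lse + x_max.ravel()

# sample input (made up): a naive log(exp(x).sum()) overflows on row 0
x = np.array([[1000.0, 1000.0],
              [0.0, np.log(3.0)]])
print(logsumexp(x))  # [1000 + log 2, log 4]
```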

SLIDE 13

import cupy as cp

def logsumexp(x):
    x_max = x.max(axis=1, keepdims=True)
    x0 = x - x_max
    lse = cp.log(cp.exp(x0).sum(axis=1))
    lse += x_max.ravel()
    return lse

x = cp.array([...], dtype=cp.float32)
print(logsumexp(x))

SLIDE 14

import cupy as cp, numpy as np

def logsumexp(x):
    x_max = x.max(axis=1, keepdims=True)
    x0 = x - x_max
    lse = np.log(np.exp(x0).sum(axis=1))
    lse += x_max.ravel()
    return lse

x = cp.array([...], dtype=np.float32)
print(logsumexp(x))

SLIDE 15

✓ cuDNN support (conv, pooling, LSTM, …)
✓ Easy custom kernel compiled at runtime
✓ FP16 support

SLIDE 16

[Diagram] Training-loop stages and the features that accelerate them:
- load & make minibatch: DALI, multiprocessing
- forward / backward: float16 mode, TensorCore
- parameter update: distributed training

SLIDE 17

Mixed precision training

> TensorCore support: automatically available

> Techniques for mixed precision training:

  • optimizer.set_loss_scale(scale)
  • optimizer.use_fp32_update()

> mixed16 mode (coming soon): CHAINER_DTYPE=mixed16
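What loss scaling buys can be seen in a few lines of NumPy (a conceptual sketch of the idea, not Chainer's implementation of set_loss_scale / use_fp32_update): scaling the loss keeps tiny gradients representable in fp16, and the update divides the scale back out on an fp32 master copy.

```python
import numpy as np

scale = 1024.0
lr = np.float32(1.0)

g_true = 1e-8                     # a gradient too small for fp16
assert np.float16(g_true) == 0.0  # unscaled, it underflows to zero

# Multiplying the loss by `scale` multiplies every gradient by `scale`,
# keeping it representable while backprop runs in fp16:
g_fp16 = np.float16(g_true * scale)

# The update runs in fp32 on a master copy and divides the scale out:
w32 = np.float32(0.0)
w32 = w32 - lr * np.float32(g_fp16) / np.float32(scale)
print(w32)  # small negative value close to -1e-8, not zero
```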

SLIDE 18

Distributed training

SLIDE 19

[Diagram] Four processes (process 0 on node 0/GPU 0, process 1 on node 0/GPU 1, process 2 on node 1/GPU 0, process 3 on node 1/GPU 1) each run Forward → Backward → Optimize, synchronizing gradients with an ALL-REDUCE step between Backward and Optimize.

SLIDE 20

Data parallelism

comm = chainermn.create_communicator()
device = comm.intra_rank  # use this device

optimizer = chainermn.create_multi_node_optimizer(…, comm)

Scaled to V100x512 environment (https://arxiv.org/abs/1809.00778)
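What the multi-node optimizer adds is an all-reduce of gradients before each update. Its effect can be simulated in a single process with NumPy (no MPI; all names made up): averaging per-worker gradients over equal shards equals the gradient on the full batch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
t = rng.normal(size=64)
W = rng.normal(size=3)

def grad(xb, tb, W):
    # gradient of mean squared error for a linear model
    return 2 * xb.T @ (xb @ W - tb) / len(xb)

# Each of 4 "processes" computes the gradient on its own equal shard...
shards = [(x[i::4], t[i::4]) for i in range(4)]
local_grads = [grad(xb, tb, W) for xb, tb in shards]

# ...then ALL-REDUCE averages them across the processes.
allreduced = np.mean(local_grads, axis=0)

# The result equals the single-process gradient on the full batch.
print(np.allclose(allreduced, grad(x, t, W)))  # True
```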

SLIDE 21

Model parallelism

> Each node computes a different part of the network (the model itself is parallelized)
> MPI communication primitives with backprop

# rank 0
phi = send(x, comm, rank=1)
h = recv(comm, rank=1, delegate_variable=phi)

# rank 1
x = recv(comm, rank=0)
h = f(x)
phi = send(h, comm, rank=0)

SLIDE 22

SLIDE 23

Model parallelism

> send returns a pseudo variable φ. It simulates the topology of the full computational graph
> Collective communication routines, e.g. bcast, scatter, allgather, etc., are also available

# rank 0
phi = send(x, comm, rank=1)
h = recv(comm, rank=1, delegate_variable=phi)
loss(h).backward()

# rank 1
x = recv(comm, rank=0)
h = f(x)
phi = send(h, comm, rank=0)
phi.backward()
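The chaining that the pseudo variable enables can be mimicked in one process with NumPy (illustrative only, no MPI): rank 1 owns a model fragment f, rank 0 owns the loss, and the gradient crosses the boundary the way phi.backward() continues backprop through rank 1's graph.

```python
import numpy as np

def f(x):               # rank 1's model fragment
    return x ** 2

def f_backward(x, gh):  # d(loss)/dx, given the boundary gradient d(loss)/dh
    return 2.0 * x * gh

x = np.array([1.0, 2.0, 3.0])

# forward: rank 0 "sends" x, rank 1 computes h = f(x) and "sends" it back
h = f(x)
loss = h.sum()

# backward: rank 0 backprops its fragment (d loss / dh = 1 for sum), the
# boundary gradient is handed to rank 1, which continues through f
gh = np.ones_like(h)
gx = f_backward(x, gh)
print(gx)  # dL/dx = 2x = [2. 4. 6.]
```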

SLIDE 24

Domain specific add-on packages

SLIDE 25

✓ Support standard computer vision tasks

classification, object detection, semantic/instance segmentation

✓ Simple, unified interface

easy to use and compose, optimized for computer vision workloads

✓ Guaranteed reproduction

every method implemented is confirmed to reproduce the same performance as the original paper

SLIDE 26

✓ Wide range of Deep RL methods covered

DQN, Categorical DQN, IQN, DDPG, A3C, ACER, NSQ, PCL, PPO, TRPO

✓ Clean API and abstraction

easy to combine multiple orthogonal design choices, e.g. discrete/continuous actions, recurrent models, async training, ...

✓ Environment support

compatible with OpenAI Gym interface

SLIDE 27

Chainer Chemistry, ChainerUI

SLIDE 28

What is needed for modern deep learning frameworks?

SLIDE 29

Speed: faster trial-and-error, larger scale

Environment support: quick adoption of new hardware/environments

Quick deployment: quick application of research outcomes

SLIDE 30

ChainerX

included in Chainer v6 beta1

SLIDE 31

ChainerX = NumPy-like ndarray + autograd

  • in C++ w/ a thin binding layer

= far less host-side overhead

  • with pluggable device backends

= open to quickly add a new device support

  • with pure C++ API

= available for Python-free native apps

Speed Environment Support Quick Deployment

SLIDE 32

[Diagram] ChainerX software stack:
- High level API (Chainer): existing code using Chainer
- ChainerX Python API: low-overhead computation written in Python
- ChainerX (with C++ API): portable code with much less overhead, in C++
- Pluggable backends: Native backend, CUDA backend, custom backend, …
- NumPy / CuPy (used by Chainer alongside ChainerX)

SLIDE 33

ChainerX Python API: the chainerx namespace

import chainerx as chx

x = chx.ones((2, 3), dtype=chx.float32, device='cuda:0')
y = (x + 1).require_grad()
z = chx.exp(y).sum()
z.backward()

> NumPy compatible API
> NN specific functions: conv, batch_norm, …
> Device support
> require_grad() to make an array differentiable

SLIDE 34

Chainer on ChainerX

arr = chx.ones((2, 3), dtype=chx.float32)
x = chainer.Variable(arr)
y = model(x)
y.backward()

> Wraps chx.ndarray with Variable
> FunctionNode falls back to NumPy/CuPy for the computation
> Uses the ChainerX (C++) computational graph, with lower overhead in backprop
SLIDE 35

ChainerX C++ API

chainerx::Array x = chainerx::ones(
    {2, 3}, chainerx::Dtype::kFloat32, chainerx::GetDevice("cuda:0"));
chainerx::Array y = (x + 1).RequireGrad();
chainerx::Array z = chainerx::Exp(y).Sum();
chainerx::Backward(z);

> Has an almost one-to-one mapping to the Python API
> Runs without a CPython environment

SLIDE 36

Host logic overhead

Framework/API        | Time per iteration (fwd+bwd+update, msec)
Chainer on NumPy     | 14.48
Chainer on ChainerX  |  7.54
ChainerX Python      |  1.88
PyTorch              |  2.45

SLIDE 37

ChainerX: Roadmap

- v6 (May 2019): ChainerX basic ops; integration into Chainer
- v7 (Nov 2019): wide coverage of ops; ready for most users; C++ API made more accessible
- future (2020+): easier deploy; wider coverage of "compiled models"

SLIDE 38

Chainer Compiler  https://github.com/pfnet-research/chainer-compiler

Python (Chainer) → ONNX+ → ChainerX VM
- Tracing (ONNX-Chainer), translation (Chainer to ONNX)
- Execution with ChainerX; vendor-specific graph formats, native binary
- Graph-based optimization, graph-based autodiff, dynamic shape, control flows

SLIDE 39

SLIDE 40

> Pioneering define-by-run API design
> Being made faster and more portable with ChainerX and Chainer Compiler

WE ARE HIRING!

@ChainerOfficial on Twitter https://bit.ly/join-chainer-slack https://preferred-networks.jp/en/jobs