Simple and Efficient Learning with Automatic Operation Batching (PowerPoint Presentation)

SLIDE 1
Simple and Efficient Learning with Automatic Operation Batching

Graham Neubig http://dynet.io/autobatch/

joint work w/ Yoav Goldberg and Chris Dyer

in https://github.com/neubig/howtocode-2017

SLIDE 2

Neural Networks w/ Complicated Structures

Words, phrases, sentences: e.g. the parse tree of "Alice gave a message to Bob" (nodes NP, VP, PP, S)

Dynamic decisions: e.g. actions a=1, a=1, a=2 chosen at run time

SLIDE 3

Neural Net Programming Paradigms

SLIDE 4

What is Necessary for Neural Network Training?

  • define computation
  • add data
  • calculate result (forward)
  • calculate gradients (backward)
  • update parameters
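The five steps above can be sketched framework-free for a one-parameter linear model; every name here is illustrative, not part of any library.

```python
# Minimal sketch of the five training steps for y = w * x.
def train():
    w = 0.0                            # define computation: one parameter
    data = [(1.0, 2.0), (2.0, 4.0)]    # add data: (x, y) pairs
    lr = 0.1
    for _ in range(100):
        for x, y in data:
            pred = w * x               # calculate result (forward)
            loss = (pred - y) ** 2
            grad = 2 * (pred - y) * x  # calculate gradients (backward)
            w -= lr * grad             # update parameters
    return w

print(train())  # converges toward w = 2.0
```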
SLIDE 5

Paradigm 1: Static Graphs
 (Tensorflow, Theano)

  • define
  • for each data point:
  • add data
  • forward
  • backward
  • update
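The static-graph flow can be illustrated in plain Python (a stand-in for TensorFlow 1.x / Theano, not their actual APIs): the graph is built once as symbolic nodes, then each data point is fed into the same fixed graph.

```python
# Illustrative static-graph flavor: define once, feed data repeatedly.
class Placeholder:
    def __init__(self):
        self.value = None

class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b

def run(node):
    # forward: evaluate the fixed graph with whatever data was fed
    if isinstance(node, Placeholder):
        return node.value
    return run(node.a) * run(node.b)

# define (once)
x, w = Placeholder(), Placeholder()
y = Mul(w, x)

# for each data point: add data, then forward
w.value = 3.0
results = []
for xi in [1.0, 2.0]:
    x.value = xi
    results.append(run(y))
print(results)  # [3.0, 6.0]
```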
SLIDE 6

Advantages/Disadvantages

  • Static Graphs
  • Advantages:
  • Can be optimized at definition time
  • Easy to feed data to GPUs, etc., via data iterators
  • Disadvantages:
  • Difficult to implement nets with varying structure (trees, graphs, flow control)
  • Need to learn a big API that implements flow control in the "graph" language

SLIDE 7

Paradigm 2: 
 Dynamic+Eager Evaluation
 (PyTorch, Chainer)

  • for each data point:
  • define/add data/forward
  • backward
  • update
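The defining property of the eager style can be shown in plain Python (a stand-in for PyTorch/Chainer, not their APIs): every operation runs the moment its line executes, so intermediate values can be inspected immediately.

```python
# Illustrative eager flavor: define/add data/forward happen together.
def eager_forward(w, x, y):
    pred = w * x               # computed right now, not deferred
    print("pred =", pred)      # easy debugging: the value already exists
    loss = (pred - y) ** 2     # an error here would surface on this line
    return loss

print(eager_forward(3.0, 2.0, 5.0))  # pred = 6.0, then loss 1.0
```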
SLIDE 8

Advantages/Disadvantages

  • Dynamic+Eager Evaluation
  • Advantages:
  • Easy to implement nets with varying structure; API is closer to standard Python/C++
  • Easy to debug because errors occur immediately
  • Disadvantages:
  • Cannot be optimized at definition time
  • Hard to serialize graphs w/o program logic, decide device placement, etc.

SLIDE 9

Paradigm 3: 
 Dynamic+Lazy Evaluation (DyNet)

  • for each data point:
  • define/add data
  • forward
  • backward
  • update
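The lazy style can be illustrated the same way (plain Python standing in for DyNet, not its API): operations only record graph nodes, and nothing is computed until forward() is called.

```python
# Illustrative lazy flavor: build the graph per data point, run it later.
class Expr:
    def __init__(self, op, args, value=None):
        self.op, self.args, self.value = op, args, value
    def __mul__(self, other):
        return Expr("mul", [self, other])  # define: just record the node
    def forward(self):
        if self.op == "const":
            return self.value
        a, b = (arg.forward() for arg in self.args)
        return a * b

w = Expr("const", [], 3.0)   # parameter
x = Expr("const", [], 2.0)   # add data
y = w * x                    # no multiplication has happened yet
print(y.forward())           # forward: the whole graph runs here -> 6.0
```

Because the full graph is known before execution, an optimizer gets a chance to rearrange it, which is exactly the hook autobatching uses.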
SLIDE 10

Advantages/Disadvantages

  • Dynamic+Lazy Evaluation
  • Advantages:
  • Easy to implement nets with varying structure; API is closer to standard Python/C++
  • Can be optimized at definition time (this presentation!)
  • Disadvantages:
  • Harder to debug because errors do not occur immediately
  • Still hard to serialize graphs w/o program logic, decide device placement, etc.

SLIDE 11

Efficiency Tricks:
 Operation Batching

SLIDE 12

Efficiency Tricks:
 Mini-batching

  • On modern hardware, 10 operations of size 1 are much slower than 1 operation of size 10
  • Mini-batching combines smaller operations into one big one
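The point can be made concrete with numpy (used purely for illustration): ten matrix-vector products can be fused into a single matrix-matrix product, which hardware executes far more efficiently, with identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 100))
xs = [rng.standard_normal(100) for _ in range(10)]

# 10 operations of size 1
ys_loop = [W @ x for x in xs]

# 1 operation of size 10: stack the inputs, multiply once
X = np.stack(xs, axis=1)   # shape (100, 10)
Y = W @ X                  # one big matmul

assert np.allclose(np.stack(ys_loop, axis=1), Y)
```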

SLIDE 13

Minibatching

SLIDE 14

Manual Mini-batching

  • DyNet has special mini-batch operations for lookup and loss functions; everything else is automatic
  • You need to:
  • Group sentences into a mini-batch (optionally, for efficiency, group sentences by length)
  • Select the "t"-th word in each sentence, and send the words to the lookup and loss functions
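The "select the t-th word in each sentence" step can be sketched with plain Python lists (in DyNet, these per-position word groups are what feed the batched lookup and loss operations):

```python
# Illustrative sketch: column t of a mini-batch of sentences.
sentences = [["i", "hate", "this", "movie"],
             ["i", "love", "that", "movie"]]

def words_at(sentences, t):
    """One word per sentence: the t-th position across the batch."""
    return [sent[t] for sent in sentences]

print(words_at(sentences, 1))  # ['hate', 'love']
```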

SLIDE 15

Example Task: Sentiment

Classify each sentence into {very good, good, neutral, bad, very bad}:
  "I hate this movie"
  "I love this movie"
  "I do n't hate this movie"

SLIDE 16

Continuous Bag of Words (CBOW)

scores = W * (lookup("I") + lookup("hate") + lookup("this") + lookup("movie")) + bias
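The CBOW scorer on this slide, sketched in numpy (a stand-in for the DyNet version; all shapes and names are illustrative): sum the word embeddings, then an affine transform gives the label scores.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"i": 0, "hate": 1, "this": 2, "movie": 3, "love": 4, "that": 5}
E = rng.standard_normal((len(vocab), 8))   # word embeddings (lookup table)
W = rng.standard_normal((5, 8))            # 5 sentiment classes
b = rng.standard_normal(5)                 # bias

def cbow_scores(words):
    h = sum(E[vocab[w]] for w in words)    # lookup + sum
    return W @ h + b                       # scores

print(cbow_scores(["i", "hate", "this", "movie"]).shape)  # (5,)
```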

SLIDE 17

Batching CBOW

Batch the lookups across sentences: e.g. "I hate this movie" and "I love that movie" are looked up and summed together, one batched operation per word position.
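Batched CBOW can be sketched in numpy: look up the t-th word of every sentence at once and accumulate, so each step is one batched operation (in DyNet this is the role of the batched lookup). Shapes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"i": 0, "hate": 1, "this": 2, "movie": 3, "love": 4, "that": 5}
E = rng.standard_normal((len(vocab), 8))

batch = [["i", "hate", "this", "movie"],
         ["i", "love", "that", "movie"]]

H = np.zeros((len(batch), 8))
for t in range(4):
    ids = [vocab[s[t]] for s in batch]   # t-th word of each sentence
    H += E[ids]                          # one batched lookup + add

# same result as summing each sentence separately
ref = np.stack([sum(E[vocab[w]] for w in s) for s in batch])
assert np.allclose(H, ref)
```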

SLIDE 18

Mini-batched Code Example
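A mini-batched loss can be sketched in numpy (a stand-in only; DyNet itself provides batched loss operations over expression graphs): score the whole batch, then average the negative log-likelihood of the correct labels.

```python
import numpy as np

def batched_nll(scores, labels):
    """scores: (batch, classes); labels: (batch,). Mean neg-log-likelihood."""
    scores = scores - scores.max(axis=1, keepdims=True)  # numeric stability
    logZ = np.log(np.exp(scores).sum(axis=1))
    return np.mean(logZ - scores[np.arange(len(labels)), labels])

scores = np.array([[2.0, 0.0], [0.0, 2.0]])
print(batched_nll(scores, np.array([0, 1])))  # small: correct labels score high
```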

SLIDE 19

Mini-batching Sequences

this is an example </s>
this is another  </s> </s>   ← the extra </s> is padding

Loss calculation mask (one row per sentence, 0 on padding):
  1 1 1 1 1
  1 1 1 1 0

Multiply each word's loss by its mask entry, then take the sum.
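The masked loss computation sketched in numpy (loss values here are made up for illustration): padding positions are zeroed by the mask, so whatever the network computed there contributes nothing.

```python
import numpy as np

# per-word losses, shape (time, batch); second sentence has a padded step
losses = np.array([[0.5, 0.2],
                   [0.1, 0.3],
                   [0.4, 0.1],
                   [0.2, 0.6],
                   [0.3, 9.9]])   # 9.9 is garbage from the padding position
mask = np.array([[1, 1],
                 [1, 1],
                 [1, 1],
                 [1, 1],
                 [1, 0]])

total = (losses * mask).sum()    # padding contributes nothing
print(total)
```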
SLIDE 20

Bi-directional LSTM

Run one LSTM left-to-right over "I hate this movie" and another right-to-left, concatenate the two, then scores = W * concat + bias.

SLIDE 21

Tree-structured RNN/LSTM

Compose the words of "I hate this movie" bottom-up with RNN/LSTM units following the parse tree, then scores = W * root + bias.

SLIDE 22

And What About These?

Words, phrases, sentences: e.g. the parse tree of "Alice gave a message to Bob" (nodes NP, VP, PP, S)

Dynamic decisions: e.g. actions a=1, a=1, a=2 chosen at run time

SLIDE 23

Automatic Operation Batching

SLIDE 24

Automatic Mini-batching!

  • Innovated by TensorFlow Fold (faster than unbatched, but implementation relatively complicated)
  • DyNet Autobatch (basically effortless implementation)
SLIDE 25

Programming Paradigm

for minibatch in training_data:
    loss_values = []
    for x, y in minibatch:
        loss_values.append(calculate_loss(x, y))
    loss_sum = sum(loss_values)
    loss_sum.forward()    # batching occurs here
    loss_sum.backward()
    trainer.update()

Just write a for loop! Batching occurs at the forward() call.

SLIDE 26

Under the Hood

  • Each node has a "profile"; nodes with the same profile are batchable
  • Batch and execute nodes whose dependencies are all satisfied
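A toy sketch of this scheduling idea (not DyNet's actual C++ implementation): repeatedly find nodes whose dependencies are done, group them by profile, and execute one profile's whole group as a batch.

```python
from collections import defaultdict

# node: (id, profile, deps) -- a tiny hand-built graph for illustration
nodes = [(0, "lookup", []), (1, "lookup", []),
         (2, "affine", [0]), (3, "affine", [1]),
         (4, "sum", [2, 3])]

done, schedule = set(), []
while len(done) < len(nodes):
    ready = defaultdict(list)
    for nid, prof, deps in nodes:
        if nid not in done and all(d in done for d in deps):
            ready[prof].append(nid)
    prof, batch = next(iter(ready.items()))  # pick a profile, run its group
    schedule.append((prof, batch))
    done.update(batch)

print(schedule)  # e.g. [('lookup', [0, 1]), ('affine', [2, 3]), ('sum', [4])]
```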
SLIDE 27

Challenges

  • This goes in your training loop: it must be blazing fast!
  • DyNet's C++ implementation is highly optimized
  • Profiles stored as hashes
  • Minimize memory allocation overhead
SLIDE 28

Synthetic Experiments

  • Fixed-length RNN → ideal case for manual batching
  • How close can we get?
SLIDE 29

Real NLP Tasks

  • Variable-length RNN, RNN w/ character embeddings, Tree LSTM, dependency parser

SLIDE 30

Let’s Try it Out!

http://dynet.io/autobatch/
https://github.com/neubig/howtocode-2017