SLIDE 1

Parallelizable StackLSTM

Shuoyang Ding Philipp Koehn NAACL 2019 Structured Prediction Workshop Minneapolis, MN, United States June 7th, 2019

SLIDE 2

Outline

  • What is StackLSTM?
  • Parallelization Problem
  • Homogenizing Computation
  • Experiments

SLIDE 3

What is StackLSTM?

SLIDE 4

A Partial Tree

SLIDE 5

Good Edge?

SLIDE 6

Good Edge?

SLIDE 7

LSTM?

SLIDE 8

:(

SLIDE 9

StackLSTM

  • An LSTM whose states are stored in a stack
  • Computation is conditioned on the stack operation

Dyer et al. (2015) Ballesteros et al. (2017)
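The two bullets above can be sketched as a small stack of hidden states (a hypothetical minimal interface; `toy_cell` is a stand-in for a real LSTM cell, and toy scalar states replace vectors):

```python
def toy_cell(x, h):
    # stand-in for the LSTM forward computation (a real cell applies gates to vectors)
    return x + h

class StackLSTM:
    def __init__(self, h0=0.0):
        self.stack = [h0]       # hidden states live in a stack; bottom = initial state

    def push(self, x):
        h_top = self.stack[-1]  # computation is conditioned on the stack top
        self.stack.append(toy_cell(x, h_top))

    def pop(self):
        self.stack.pop()        # expose the state from before the last push

    def top(self):
        return self.stack[-1]

s = StackLSTM()
s.push(1.0)     # top: 0.0 -> 1.0
s.push(2.0)     # top: 1.0 -> 3.0
s.pop()         # top back to 1.0
print(s.top())  # 1.0
```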

SLIDE 10

StackLSTM

SLIDE 11

Push ,

SLIDE 12

Pop

SLIDE 13

Push 61

SLIDE 14

Push years

SLIDE 15

Push old

SLIDE 16

Pop

SLIDE 17

Pop

SLIDE 18

Pop

SLIDE 19

Push ,

SLIDE 20

Pop

SLIDE 21

Push will

SLIDE 22

Push join

SLIDE 23

:)

SLIDE 24

Parallelization Problem

SLIDE 25

LSTM

SLIDE 26

LSTM

SLIDE 27

Batched LSTM

SLIDE 28

Batched… StackLSTM?

SLIDE 29

:(

SLIDE 30

Wouldn’t it be nice if…

SLIDE 31

Homogenizing Computation

SLIDE 32

Push

  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + 1;
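The four Push steps can be written directly against a preallocated hidden-state buffer and a stack-top pointer (a sketch with toy scalar states; `cell` is a hypothetical stand-in for the LSTM forward computation):

```python
def cell(x, h_prev):
    # stand-in for the LSTM forward computation
    return x + h_prev

def push(h, p, x):
    h_top = h[p]            # 1. read the stack top hidden state h_{p(t)}
    h_new = cell(x, h_top)  # 2. LSTM forward with x(t) and h_{p(t)}
    h[p + 1] = h_new        # 3. write new hidden state to h_{p(t)+1}
    return p + 1            # 4. update stack top pointer: p(t+1) = p(t) + 1

h = [0.0] * 8  # preallocated buffer; h[0] is the initial state
p = 0
p = push(h, p, 5.0)
print(p, h[p])  # pointer now 1, h[1] == 5.0
```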

SLIDE 33

Push

SLIDE 34

Push

SLIDE 35

Push

SLIDE 36

Push

SLIDE 37

Pop

  • update stack top pointer p(t+1) = p(t) - 1;

SLIDE 38

Pop

SLIDE 39

Observation 1

Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + 1;

Pop:
  • update stack top pointer p(t+1) = p(t) - 1;

SLIDE 40

Observation 1

Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + op;

Pop:
  • update stack top pointer p(t+1) = p(t) + op;

Use op = +1 for Push and op = -1 for Pop.
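With op encoding the transition, the pointer update becomes one homogeneous expression shared by Push and Pop (a minimal sketch):

```python
def update_pointer(p, op):
    # p(t+1) = p(t) + op, with op = +1 for Push and op = -1 for Pop
    return p + op

p = 0
for op in (+1, +1, -1, +1, -1, -1):  # a mixed transition sequence
    p = update_pointer(p, op)
print(p)  # three pushes and three pops return to the start: 0
```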
SLIDE 41

Observation 1

The computation performed for the Pop operation is a subset of that performed for the Push operation.

SLIDE 42

Observation 2

Is it safe to perform the remaining Push computations for Pop as well?

SLIDE 43

Observation 2

Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + op;

Pop:
  • update stack top pointer p(t+1) = p(t) + op;

SLIDE 44

Observation 2

  • update stack top pointer p(t+1) = p(t) + op;
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • update stack top pointer p(t+1) = p(t) + op;
  • write new hidden state to h_{p(t)+1};

SLIDE 45

Observation 2

A write will always happen before the stack top pointer advances.

SLIDE 46

Observation 2

If one wants to write anything to a position higher than the current stack top pointer…

SLIDE 47

Observation 2

If one wants to write anything to a position higher than the current stack top pointer… Just do it!

SLIDE 48

Observation 2

Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + op;

Pop:
  • update stack top pointer p(t+1) = p(t) + op;

SLIDE 49

Observation 2

Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + op;

Pop:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + op;

SLIDE 50

Done!

  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t)+1};
  • update stack top pointer p(t+1) = p(t) + op;
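A sketch of the homogenized step: Push and Pop now execute an identical sequence of reads, computes, and writes, differing only in the value of op, so a batch of sequences taking different actions can run the same code path in lockstep (toy scalar states; `cell` is a hypothetical stand-in for the LSTM forward computation):

```python
def cell(x, h_prev):
    # stand-in for the LSTM forward computation
    return x + h_prev

def step(h, p, x, op):
    h_new = cell(x, h[p])  # read the stack top and run the LSTM forward
    h[p + 1] = h_new       # always write above the top (harmless for Pop:
                           # the pointer retreats, so the slot is never read)
    return p + op          # p(t+1) = p(t) + op, the only op-dependent part

def batched_step(hs, ps, xs, ops):
    # identical work per batch element; on a GPU this becomes one batched kernel
    return [step(h, p, x, op) for h, p, x, op in zip(hs, ps, xs, ops)]

# a batch of two parsers: one pushing, one popping at this time step
hs = [[0.0] * 8, [0.0, 4.0, 0.0, 0.0]]
ps = [0, 1]
ps = batched_step(hs, ps, [2.0, 1.0], [+1, -1])
print(ps)  # [1, 0]
```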

SLIDE 51

Experiments

SLIDE 52

Benchmark

  • Transition-based dependency parsing on the Stanford Dependency Treebank
  • PyTorch, single K80 GPU

SLIDE 53

Hyperparameters

  • Largely following Dyer et al. (2015) and Ballesteros et al. (2017), except:
  • Adam w/ ReduceLROnPlateau and warmup
  • Arc-Hybrid w/o composition function
  • Slightly larger models (200 hidden, 200 state, 48 action embedding) perform better

SLIDE 54

Speed

SLIDE 55

Speed

SLIDE 56

Performance

[Figure: parsing accuracy (91–93) vs. batch size (8–256), comparing Ours with Ballesteros 2017]

SLIDE 57

Conclusion

SLIDE 58

Conclusion

  • We propose a parallelization scheme for the StackLSTM architecture.
  • Together with a different optimizer, we are able to train parsers of comparable performance within 1 hour.

Paper, code, slides: https://github.com/shuoyangd/hoolock