SLIDE 1

Exploiting Randomness in Neural Networks

Daniele Di Sarli Mauriana Pesaresi seminars - 2020

SLIDE 2

Recurrent Neural Network

SLIDE 3

Backpropagation Through Time

error = (output − expected)²

∂error / ∂W
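The two formulas above (squared error, and its gradient with respect to the recurrent weights) can be checked on a toy example. The following sketch is my illustration, not from the talk: a scalar RNN with a single recurrent weight w, whose BPTT gradient is compared against a finite-difference estimate.

```python
import math

# Scalar RNN: h_t = tanh(w*h_{t-1} + x_t), error = (h_T - target)^2.

def states(w, xs):
    hs = [0.0]                          # h_0 = 0
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def error(w, xs, target):
    return (states(w, xs)[-1] - target) ** 2

def bptt_grad(w, xs, target):
    hs = states(w, xs)
    dh = 0.0                            # dh_t/dw, accumulated forward in time
    for t in range(len(xs)):
        dh = (1 - hs[t + 1] ** 2) * (hs[t] + w * dh)
    return 2 * (hs[-1] - target) * dh   # chain rule: d(error)/dw

w, xs, target = 0.8, [0.5, -1.0, 0.3], 0.2
g_bptt = bptt_grad(w, xs, target)
eps = 1e-6
g_num = (error(w + eps, xs, target) - error(w - eps, xs, target)) / (2 * eps)
# g_bptt and g_num agree to numerical precision.
```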

SLIDE 4

…, 3, 2, 1.5, 0.75, 1, -2.3, 4, … → PATTERNS, INTERACTIONS → PREDICTION

SLIDE 5

Reservoir Readout

SLIDE 6

Reservoir Readout

SLIDE 7

Reservoir Readout

Echo State Network

SLIDE 8

[Figure: four panels (a)–(d); content not recoverable from the transcript]

SLIDE 9

Cover’s theorem
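The slide names Cover's theorem without stating it. For context: of the 2^n dichotomies of n points in general position in d dimensions, exactly C(n, d) = 2 · Σ_{k=0}^{d−1} binom(n−1, k) are linearly separable, so a high-dimensional nonlinear projection (like a reservoir) makes a linear readout far more likely to suffice. A small sketch of the count (my illustration, not from the slides):

```python
from math import comb

def cover_count(n, d):
    """Number of linearly separable dichotomies of n points in general
    position in d dimensions (Cover's function-counting theorem)."""
    return 2 * sum(comb(n - 1, k) for k in range(d))

# While n <= d every dichotomy is separable; beyond that the fraction drops.
for n in (2, 4, 8, 16):
    print(n, cover_count(n, 3) / 2 ** n)
```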

SLIDE 10

Echo State Property
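A quick numerical illustration of the property (my sketch, not the talk's code): two reservoirs started from different initial states and driven by the same input converge to the same trajectory, i.e. the state forgets its initial condition. Here contractivity is enforced by bounding the largest singular value of the recurrent matrix, which together with the 1-Lipschitz tanh guarantees contraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)     # largest singular value = 0.9 < 1
w_in = rng.standard_normal(n)

def evolve(h, inputs):
    for u in inputs:
        h = np.tanh(W @ h + w_in * u)
    return h

inputs = rng.standard_normal(200)
h_a = evolve(rng.standard_normal(n), inputs)   # two arbitrary initial states,
h_b = evolve(rng.standard_normal(n), inputs)   # same input sequence
gap = np.linalg.norm(h_a - h_b)
# gap ≈ 0: the distance shrinks by at least 0.9 per step, so after 200
# steps the two trajectories have merged.
```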

SLIDE 11

Echo State Network starter pack

  • 1. Randomly initialize the weights (sparse)
  • 2. Rescale the weights to guarantee contractivity of the state transition function (=> ESP)
  • 3. Feed data, collect states
  • 4. Compute optimal linear regression parameters
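The four steps can be sketched in a few lines of NumPy (a minimal sketch: the sizes, sparsity, ridge term, and the one-step-ahead sine-prediction task are illustrative assumptions, not the talk's actual setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n, washout, ridge = 100, 50, 1e-6

# 1. Randomly initialize the weights (sparse)
W = rng.standard_normal((n, n)) * (rng.random((n, n)) < 0.1)
w_in = rng.standard_normal(n)

# 2. Rescale to make the state transition contractive (=> ESP)
W *= 0.9 / np.linalg.norm(W, 2)

# 3. Feed data, collect states (toy task: predict sin one step ahead)
u = np.sin(np.arange(500) * 0.2)
h, states = np.zeros(n), []
for x in u[:-1]:
    h = np.tanh(W @ h + w_in * x)
    states.append(h)
X = np.array(states[washout:])          # discard the initial transient
y = u[washout + 1:]

# 4. Compute optimal linear regression parameters (ridge regression)
w_out = np.linalg.solve(X.T @ X + ridge * np.eye(n), X.T @ y)
mse = np.mean((X @ w_out - y) ** 2)     # small: the linear readout fits
```

Note that training touches only `w_out`; the recurrent weights stay at their random initialization.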
SLIDE 12

«RC […] provides explanations of why biological brains can carry out accurate computations with an “inaccurate” and noisy physical substrate» — Lukoševičius et al.

In the primary visual cortex, «computations are performed by complex dynamical systems while information about results of these computations is read out by simple linear classifiers.» — Nikolić et al.

SLIDE 13

My work

SLIDE 14

From: Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Conference of the Association for Computational Linguistics.

[Bar chart: CO₂ emissions (lbs), scale 100,000–700,000; bars include “Transformer w/ neural arch. search” and “Car, avg incl. fuel, 1 lifetime”]

Natural Language Processing models: LSTM, GRU, Transformer, BERT

SLIDE 15

Text Classification pipeline

‘My input sentence’ → input sequence (word embeddings) → RNN → sentence embedding (0.76, 0.35, … -0.02) → linear classifier

TRAINING

SLIDE 16

Text Classification pipeline

‘My input sentence’ → input sequence (word embeddings) → ESN → sentence embedding (0.76, 0.35, … -0.02) → linear classifier

TRAINING
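A runnable sketch of this pipeline on stand-in data (my illustration: a random two-token “vocabulary”, majority-token labels, and illustrative sizes, none of which are the talk's actual data or hyperparameters). The key point carried over from the slide: the reservoir is fixed and random, and training touches only the linear classifier.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, T, m = 100, 8, 7, 200            # reservoir, embed dim, seq len, samples

W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)        # contractive => Echo State Property
W_in = rng.standard_normal((n, d))
emb = rng.standard_normal((2, d))      # fixed random word embeddings

def embed_sentence(tokens):
    """Run the token sequence through the fixed reservoir; the final
    state is the sentence embedding."""
    h = np.zeros(n)
    for t in tokens:
        h = np.tanh(W @ h + W_in @ emb[t])
    return h

tokens = rng.integers(0, 2, (m, T))
labels = np.where(tokens.sum(axis=1) > T / 2, 1.0, -1.0)   # majority token
X = np.array([embed_sentence(s) for s in tokens])

# TRAINING touches only the readout: a least-squares linear classifier.
w_out = np.linalg.lstsq(X, labels, rcond=None)[0]
acc = np.mean(np.sign(X @ w_out) == labels)
```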

SLIDE 17

Question Classification

What was the name of the first Russian astronaut to do a spacewalk? → HUMAN
What's the tallest building in New York City? → LOCATION
… also ABBREVIATION, ENTITY, DESCRIPTION, and NUMERIC VALUE

SLIDE 18

Improvements are needed

  • Bidirectional
  • Attention
  • Multi-ring

What's the tallest building in New York City?

SLIDE 21

Results

[Bar chart: Accuracy, scale 88–100. Models (labels reconstructed from garbled axis text): Ada…, CNN, Bi-LSTM, Paragraph Vector, Transformer + CNN, Bi-GRU, Bi-ESN, Bi-ESN (ensemble), Bi-ESN-Att (ours). Annotations: “< 1.6M params” for ours; “200M+ params, heavy transfer learning”.]

SLIDE 22

Results

[Bar chart: Training time, axis 100–600, for Bi-GRU, Bi-ESN, Bi-ESN (ensemble), Bi-ESN-Att; annotated “6 sec” and “7.5 min”.]

[Same accuracy chart as on SLIDE 21.]

SLIDE 23

How old was the youngest president of the United States?
When was Ulysses S. Grant born?
Who invented the instant Polaroid camera?
What is nepotism?
Where is the Mason/Dixon line?
What is the capital of Zimbabwe?
What are Canada's two territories?

SLIDE 24

Wrap up

  • A path towards efficient, effective ML models must be taken
  • A deeper understanding and exploitation of the architectural properties of RNN models can help towards that goal
  • The analysis is preliminary, but WIP results are encouraging
SLIDE 25

References

1. Di Sarli, D., Gallicchio, C., & Micheli, A. (2019). Question Classification with Untrained Recurrent Embeddings. In International Conference of the Italian Association for Artificial Intelligence.
2. Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science.
3. Lukoševičius, M., & Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review.
4. Nikolić, D., Haeusler, S., Singer, W., & Maass, W. (2007). Temporal dynamics of information content carried by neurons in the primary visual cortex. In Advances in Neural Information Processing Systems.