SLIDE 1

Seminar C2NLU, Schloß Dagstuhl, Wadern, Germany, 24 January 2017

From Bayes Decision Rule to Neural Networks for Human Language Technology

Hermann Ney

  • T. Alkhouli, P. Bahar, K. Irie, J.-T. Peter

Human Language Technology and Pattern Recognition, RWTH Aachen University, Aachen, Germany
IEEE Distinguished Lecturer 2016/17

SLIDE 2

Human Language Technology (HLT)

Automatic Speech Recognition (ASR)

we want to preserve this great idea

Statistical Machine Translation (SMT)

wir wollen diese große Idee erhalten → we want to preserve this great idea

Handwriting Recognition (Text Image Recognition)

we want to preserve this great idea

tasks: – speech recognition – machine translation – handwriting recognition – sign language

SLIDE 3

Human Language Technology: Speech and Language

characteristic properties:

  • well-defined ’classification’ tasks:

– due to 5000-year history of (written!) language – well-defined goal: characters or words (= full forms) of the language

  • easy task for humans (in native language!)
  • hard task for computers

(as the last 50 years have shown!)

unifying view:

  • formal task: input string → output string
  • output string: string of words/characters in a natural language
  • models of context and dependencies: strings in input and output

– within input and output string – across input and output string

  • abstract view of language understanding (?):

mapping: natural language → formal language

SLIDE 4

ASR: what is the problem?
– ambiguities at all levels
– interdependencies of decisions

approach:
– score hypotheses
– probabilistic framework
– statistical decision theory (CMU and IBM 1975; Bahl & Jelinek+ 1983)

various terminologies:
– pattern recognition
– statistical learning
– connectionism
– machine learning

important: string context!

[Figure: ASR architecture — speech signal → acoustic analysis → search as interaction of knowledge sources (language model, pronunciation lexicon, phoneme models), producing phoneme, word, and sentence hypotheses (segmentation and classification, word boundary detection and lexical access, syntactic and semantic analysis) → recognized sentence.]

SLIDE 5

machine translation

  • interaction between

three models (or knowledge sources): – alignment model – lexicon model – language model

  • handle interdependencies,

ambiguities and conflicts by the Bayes decision rule, as for speech recognition

[Figure: SMT architecture — sentence in source language → generation as interaction of knowledge sources (alignment model, bilingual lexicon, language model), producing alignment, word+position, and sentence hypotheses (lexical choice, word position re-ordering, syntactic and semantic analysis) → sentence in target language.]

SLIDE 6

Bayes Decision Rule

1) performance measure or loss function (e.g. edit or Levenshtein distance) between correct output sequence W and hypothesized output sequence W̃: L[W, W̃]
2) probabilistic dependence pr(W|X) between input string X = x_1 ... x_t ... x_T and output string W = w_1 ... w_n ... w_N (e.g. empirical distribution of a representative sample)
3) optimum performance: the Bayes decision rule minimizes the expected loss:

  X → Ŵ(X) := arg min_{W̃} Σ_W pr(W|X) · L[W, W̃]

  • Under these two conditions:
    – L[W, W̃] satisfies the triangle inequality
    – max_W {pr(W|X)} > 0.5
we obtain the MAP rule (MAP = maximum a posteriori) [Schlüter & Nussbaum+ 12]:

  X → Ŵ(X) := arg max_W pr(W|X)

  • Since [Bahl & Jelinek+ 83], this simplified Bayes decision rule is widely used

for speech recognition, handwriting recognition, machine translation, ...
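To make the distinction concrete, here is a minimal Python sketch contrasting the full Bayes rule under a Levenshtein loss with the simplified MAP rule. The candidate sequences and posterior values are invented toy numbers, and the minimization is restricted to the candidate set rather than all sequences; this is an illustration, not any actual decoder.

```python
# Toy illustration: Bayes decision with Levenshtein loss vs. the MAP rule.
# The posterior pr(W|X) below is a made-up example, not real model output.

def levenshtein(a, b):
    """Edit distance between two word sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                            # deletion
                          d[i][j - 1] + 1,                            # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))   # substitution
    return d[len(a)][len(b)]

# Hypothetical posterior over candidate output sequences W for one input X.
posterior = {
    ("we", "want", "to", "preserve", "this", "great", "idea"): 0.40,
    ("we", "want", "to", "reserve", "this", "great", "idea"):  0.35,
    ("we", "went", "to", "preserve", "this", "great", "idea"): 0.25,
}

# MAP rule: maximize the posterior probability.
map_hyp = max(posterior, key=posterior.get)

# Bayes rule with loss: minimize the expected Levenshtein distance
# (here only over the candidate set, for illustration).
bayes_hyp = min(posterior,
                key=lambda w_tilde: sum(p * levenshtein(w, w_tilde)
                                        for w, p in posterior.items()))

print("MAP hypothesis:  ", " ".join(map_hyp))
print("Bayes hypothesis:", " ".join(bayes_hyp))
```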

SLIDE 7

Statistical Approach to HLT Tasks

[Figure: block diagram of the statistical approach — probabilistic models, performance measure (loss function), training criterion, training data, optimization (efficient algorithm), parameter estimates, Bayes decision rule (efficient algorithm) applied to test data, output, evaluation.]

SLIDE 8

Statistical Approach and Machine Learning

four ingredients:

  • performance measure: error measure (e.g. edit distance)

we have to decide how to judge the quality of the system output (ASR: edit distance; SMT: edit distance + block movements)

  • probabilistic models with suitable structures:

to capture the dependencies within and between input and output strings – elementary observations: Gaussian mixtures, log-linear models, support vector machines (SVM), multi-layer perceptron (MLP), ... – strings: n-gram Markov chains, CRF, Hidden Markov models (HMM), recurrent neural nets (RNN), LSTM-RNN, CTC, ANN-based models of attention, ...

  • training criterion:

– ideally should be linked to the performance criterion (end-to-end training)
two important issues:
– what is a suitable training criterion?
– what is a suitable optimization strategy?

  • Bayes decision rule:

to generate the output word sequence
– combinatorial problem (efficient algorithms)
– should exploit the structure of the models
examples: dynamic programming and beam search, A∗ and heuristic search, ... (a minimal beam-search sketch follows this list)
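The sketch below is a minimal beam search over position-wise label log-probabilities. The scores are toy values, positions are treated as independent, and no language-model context is carried along, all of which real decoders do differently; it only illustrates the pruning idea.

```python
import math

def beam_search(step_log_probs, beam_size=3):
    """Minimal beam search over per-position label log-probabilities.

    step_log_probs: list of dicts, one per output position,
                    mapping label -> log probability (toy scores).
    Returns the best label sequence and its total log probability.
    Real decoders also carry model context (e.g. LM state) in each hypothesis.
    """
    beam = [((), 0.0)]                       # (partial sequence, score)
    for dist in step_log_probs:
        candidates = [(seq + (label,), score + lp)
                      for seq, score in beam
                      for label, lp in dist.items()]
        # keep only the beam_size best partial hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0]

# Hypothetical position-wise posteriors (would come from an acoustic/translation model).
steps = [
    {"we": math.log(0.7), "he": math.log(0.3)},
    {"want": math.log(0.6), "went": math.log(0.4)},
    {"to": math.log(0.9), "two": math.log(0.1)},
]
print(beam_search(steps))
```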

SLIDE 9

HLT and Neural Networks

  • acoustic modelling
  • language modelling (for ASR and SMT)
  • machine translation
SLIDE 10

History: ANN in Acoustic Modelling

  • 1988 [Waibel & Hanazawa+ 88]:

phoneme recognition using time-delay neural networks (using CNNs!)

  • 1989 [Bridle 89]:

softmax operation for probability normalization in output layer

  • 1990 [Bourlard & Wellekens 90]:

– for squared error criterion, ANN outputs can be interpreted as class posterior probabilities (rediscovered: Patterson & Womack 1966) – they advocated the use of MLP outputs to replace the emission probabilities in HMMs

  • 1993 [Haffner 93]: sum over label-sequence posterior probabilities in hybrid HMMs
  • 1994 [Robinson 94]: recurrent neural network

– competitive results on the WSJ task
– his work remained a singularity in ASR

first clear improvements over the state of the art:
– 2008, handwriting: Graves, using LSTM-RNN and CTC
– 2011, speech: Hinton & Li Deng, using deep FF MLP and hybrid HMM
– more ...

SLIDE 11

What is Different Now after 25 Years?

feedforward neural network (FF-NN; multi-layer perceptron, MLP):
– operations: matrix · vector
– nonlinear activation function
ANN outputs: probability estimates

comparison for ASR: today vs. 1989-1994:

  • number of hidden layers:

10 (or more) rather than 2-3

  • number of output nodes (phonetic labels):

5000 rather than 50

  • optimization strategy:

practical experience and heuristics, e.g. layer-by-layer pretraining

  • much more computing power
  • overall result:

– huge improvement by ANNs
– WER is (nearly) halved !!

SLIDE 12

Recurrent Neural Network: String Processing

principle for string processing over time t = 1, ..., T:
– introduce a memory (or context) component to keep track of the history
– result: there are two types of input: memory h_{t-1} and observation x_t

  • extensions:

– bidirectional variant [Schuster & Paliwal 1997] – feedback of output labels – long short-term memory [Hochreiter & Schmidhuber 97; Gers & Schraudolph+ 02] – deep hidden layers

SLIDE 13

Recurrent Neural Network: Details of Long Short-Term Memory

[Figure: LSTM cell — tanh net input controlled by input gate, forget gate, and output gate.]

ingredients:
– separate memory vector c_t in addition to h_t
– use of gates to control the information flow
– (additional) effect: make backpropagation more robust

SLIDE 14

Acoustic Modelling: HMM and ANN (CTC: similar [Graves & Fernandez+ 06])

– why HMM? mechanism for time alignment (or dynamic time warping)
– critical bottleneck: the emission probability model requires density estimation!
– hybrid approach: replace the HMM emission probability by label posterior probabilities, i.e. by the ANN output after suitable re-scaling

[Figure: HMM time alignment — states of the word "ALEX" along the time axis.]
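A minimal sketch of the usual re-scaling in the hybrid approach: the network posteriors are divided by the label priors to obtain scaled likelihoods usable as emission scores. The posterior and prior values below are toy numbers, not output of a real acoustic model.

```python
import numpy as np

# Hybrid HMM/ANN: the network outputs label posteriors p(s | x_t);
# as HMM emission scores they are re-scaled by the label priors, giving
# scaled likelihoods p(x_t | s) ~ p(s | x_t) / p(s) (up to a factor p(x_t)).
# The numbers below are toy values for illustration only.

ann_posteriors = np.array([0.70, 0.20, 0.10])   # p(s | x_t) for 3 HMM labels
label_priors   = np.array([0.50, 0.30, 0.20])   # p(s), e.g. from training alignments

scaled_likelihoods = ann_posteriors / label_priors
log_emission_scores = np.log(scaled_likelihoods)  # used inside Viterbi time alignment
print(log_emission_scores)
```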

SLIDE 15

Acoustic Modelling: Improvements by ANNs

QUAERO English Eval 2013 (competitive system):

Language Model   PP     Acoustic Model     WER[%]
Count Fourgram   131.2  Gaussian Mixture   19.2
                        deep MLP           10.7
                        LSTM-RNN           10.4
+ LSTM-RNN        92.0  Gaussian Mixture   16.5
                        deep MLP            9.3
                        LSTM-RNN            9.3

acoustic models:
– acoustic input features: optimized for the model
– sequence discriminative training (MMI/MPE), not (yet) for LSTM-RNN (end-to-end training)

remarks:
– overall improvements by ANNs: 50% relative (same amount of training data!)
– lion’s share of the improvement: acoustic model

SLIDE 16

Language Modeling and Artificial Neural Networks

  • why a separate language model?
  • we need a model to approximate the true posterior distribution p(w_1^N | x_1^T):

separation of the prior probability p(w_1^N) of the word sequence w_1^N = w_1 ... w_n ... w_N in the posterior probability used in the Bayes decision rule:

  p(w_1^N | x_1^T) = p(w_1^N) · p(x_1^T | w_1^N) / Σ_{Ñ, w̃_1^Ñ} p(w̃_1^Ñ) · p(x_1^T | w̃_1^Ñ)

– advantage: huge amounts of training data for p(w_1^N) without annotation
– extension: from generative to log-linear modelling:

  p(w_1^N | x_1^T) = q_α(w_1^N) · q_β(w_1^N | x_1^T) / Σ_{Ñ, w̃_1^Ñ} q_α(w̃_1^Ñ) · q_β(w̃_1^Ñ | x_1^T)

  • note about the prior p(w_1^N) or q(w_1^N): pure SYMBOLIC processing
  • ANNs help here too!
SLIDE 17

Language Modeling and Artificial Neural Networks History:

  • 1989 [Nakamura & Shikano 89]:

English word category prediction based on neural networks

  • 1993 [Castano & Vidal+ 93]:

Inference of stochastic regular languages through simple recurrent networks

  • 2000 [Bengio & Ducharme+ 00]:

A neural probabilistic language model

  • 2007 [Schwenk 07]: Continuous space language models

2007 [Schwenk & Costa-jussa+ 07]: Smooth bilingual n-gram translation (!)

  • 2010 [Mikolov & Karafiat+ 10]:

Recurrent neural network based language model

  • 2012 RWTH Aachen [Sundermeyer & Schlüter+ 12]:

LSTM recurrent neural networks for language modeling

today: ANNs in language (and translation!) show competitive results.

SLIDE 18

ANNs in Language Modelling

goal of language modelling: compute the prior p(w_1^N) of a word sequence w_1^N
– how plausible is this word sequence w_1^N (independently of the observation x_1^T!)?
– measure of language model quality: perplexity PP, i.e. the effective vocabulary size:

  log PP = −1/N · Σ_{n=1}^{N} log p(w_n | w_1^{n-1})

results on QUAERO English (as before):
– vocabulary size: 150k words
– training text: 50M words
– test set: 39k words

perplexity PP on test data:

approach                      PP
baseline: count model         163.7
10-gram MLP                   136.5
RNN                           125.2
LSTM-RNN                      107.8
10-gram MLP with 2 layers     130.9
LSTM-RNN with 2 layers        100.5

important result: improvement of PP by 40%

SLIDE 19

Interpolated Language Models: Perplexity and WER

  • linear interpolation of TWO models: count model + ANN model (a small interpolation sketch follows below)
  • recognition experiments:

due to unlimited history, RNN LMs require re-design of ASR search

  • perplexity and word error rate on test data:

Models                         PP      WER[%]
count model                    131.2   12.4
+ 10-gram MLP                  112.5   11.5
+ Recurrent NN                 108.1   11.1
+ LSTM-RNN                      96.7   10.8
+ 10-gram MLP with 2 layers    110.2   11.3
+ LSTM-RNN with 2 layers        92.0   10.4

  • experimental result:

– significant improvements by the ANN language models
– best improvement in perplexity: 30% reduction (from 131 to 92)
– empirical observation: power law between perplexity and WER (cube to square root) [Klakow & Peters 02]
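The sketch below shows linear interpolation of a count model and an ANN model at the probability level. The probabilities are toy values, and the interpolation weight would normally be tuned on held-out data; this is an illustration of the combination, not the actual system setup.

```python
import math

def interpolate(p_count, p_ann, lam=0.5):
    """Linear interpolation of two LM probabilities for the same word and history.
    lam is the weight of the count model; in practice it is tuned on held-out data."""
    return lam * p_count + (1.0 - lam) * p_ann

# Toy per-word probabilities from a count model and an ANN model on a short string.
count_probs = [0.10, 0.020, 0.30]
ann_probs   = [0.12, 0.050, 0.25]
mix = [interpolate(pc, pa, lam=0.4) for pc, pa in zip(count_probs, ann_probs)]
log_pp = -sum(math.log(p) for p in mix) / len(mix)
print(math.exp(log_pp))    # perplexity of the interpolated model on the toy string
```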

SLIDE 20

Extended Range: Perplexity vs. Word Error Rate

empirical power law: WER = α · PP^β  [Klakow & Peters 02]

[Figure: log-log plot of word error rate (%) vs. perplexity for the count-based, + feedforward, + RNN, and + LSTM language models, with the power-law regression line.]

SLIDE 21

Local Word Error Rate vs. Local Perplexity (3-word window, 20 bins)

concept: perplexity and WER measured locally rather than globally
empirical power law: WER = α · PP^β

[Figure: log-log plot of word error rate (%) vs. local perplexity (window of +/- 1 word) for the count and + LSTM models, with the power-law regression line.]
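A small sketch of how the power law WER = α · PP^β can be fitted by linear regression in log-log space, using the (PP, WER) pairs from the interpolation table two slides earlier; the fit itself is only illustrative.

```python
import numpy as np

# Fit the empirical power law WER = alpha * PP^beta [Klakow & Peters 02]
# by linear regression in log-log space, using the (PP, WER) pairs of the
# interpolated-LM table above.
pp  = np.array([131.2, 112.5, 108.1, 96.7, 110.2, 92.0])
wer = np.array([12.4, 11.5, 11.1, 10.8, 11.3, 10.4])

beta, log_alpha = np.polyfit(np.log(pp), np.log(wer), 1)
alpha = np.exp(log_alpha)
print(f"WER ~ {alpha:.3f} * PP^{beta:.2f}")   # beta near 0.5 matches the 'square root' end
```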

SLIDE 22

ANN and SMT

history of ANN-based approaches to SMT:

  • 1997 [Neco & Forcada 97]:

asynchronous translations with recurrent neural nets.

  • 1997 [Castano & Casacuberta 97, Castano & Casacuberta+ 97]:

machine translation using neural networks and finite-state models

  • 2007 [Schwenk & Costa-jussa+ 07]:

smooth bilingual n-gram translation

  • 2012 [Le & Allauzen+ 12, Schwenk 12]:

continuous space translation models with neural networks

  • 2014 [Devlin & Zbib+ 14]:

fast and robust neural networks for SMT

  • 2014 [Sundermeyer & Alkhouli+ 14]:

recurrent bi-directional LSTM-RNN for SMT

  • 2015 [Bahdanau & Cho+ 15]:

joint learning to align and translate

SLIDE 23

How to Use ANN-based Lexicon and Alignment Models?

we consider the translation from source sentence f_1^J = f_1 ... f_j ... f_J to target sentence e_1^I = e_1 ... e_i ... e_I

  • alignments:

from target to source positions
reason: more convenient for decoding

  • maximum approximation

with coverage constraint

  • stand-alone decoder
  • alternative method:

rescoring of N-best lists

  • training (ideally):

re-alignments rather than GIZA++

[Figure: alignment matrix over source positions j and target positions i, showing target positions i−1 and i.]

SLIDE 24

ANN-based Translation Modelling
[RWTH: Alkhouli, Guta, Peter, Sundermeyer: EMNLP 14, EMNLP 15, WMT 15, ACL 16, WMT 16]

translation: from source sentence f_1^J = f_1 ... f_j ... f_J to target sentence e_1^I = e_1 ... e_i ... e_I

alignments from target position i to source position j: i → j = b_i
reason: more convenient for decoding

[Figure: alignment matrix over source positions j and target positions i, showing target positions i−1 and i.]

we re-write the translation probability p(e_1^I | f_1^J):

  p(e_1^I | f_1^J) = Σ_{b_1^I} p(e_1^I, b_1^I | f_1^J)
                   = Σ_{b_1^I} Π_i p(b_i, e_i | b_1^{i-1}, e_1^{i-1}, f_1^J) = ...
                   = Σ_{b_1^I} Π_i [ p(b_i | b_1^{i-1}, e_1^{i-1}, f_1^J) · p(e_i | b_0^i, e_1^{i-1}, f_1^J) ]

  • with extended lexicon model p(e_i | ...) and extended alignment model p(b_i | ...)
SLIDE 25

Modelling Assumptions: Extended Lexicon

compare the extended lexicon model with the IBM-style lexicon:

  p(e_i | b_0^i, e_1^{i-1}, f_1^J)   vs.   p(e_i | f_{b_i})

simplifying assumptions about the dependencies for the lexicon model:

  • on the target predecessor e_{i-1} and on the source words f_{b_i−m}^{b_i+m} around source position j = b_i:

  p(e_i | e_{i-1}, f_{b_i−m}^{b_i+m})

    – extended lexicon: lexicon table modelled by an FF-NN

  • on the full history of target words e_1^{i-1} and of source words f_{b_1}^{b_i}:

  p(e_i | f_{b_i}, {e_{i'}, f_{b_{i'}}}_{i'=0}^{i-1})

    – extended lexicon: lexicon table modelled by an RNN (e.g. LSTM)

  • alternative for a limited history of source and target words: count-based lexicon model (with smoothing)

similar concept for the extended alignment model

SLIDE 26

Extended Lexicon: Illustration

alignments: i → j = b_i and j → i = a_j

SLIDE 27

Experimental Results

three translation tasks:

  • Ge → En: WMT 2015
  • En → Ro: WMT 2016
  • Ch → En: BOLT project

conditions: – no synthetic data – no additional LM data

SLIDE 28

Extended Lexicon and Alignment Model (Ongoing Work)

ANN for lexicon and alignment model:

                                De→En (WMT 2015)    En→Ro (WMT 2016)    Ch→En (BOLT)
Model                           BLEU[%]  TER[%]     BLEU[%]  TER[%]     BLEU[%]  TER[%]
phrase-based system             28.1     53.2       24.5     59.3       17.9     67.7
FF-NN: source and target        23.1     59.4       23.3     60.4       15.8     70.3
LSTM-RNN: source only           22.9     57.0       22.8     59.7       –        –
LSTM-RNN: source and target     24.0     57.2       24.1     59.1       16.9     69.7

observations:
– best results: LSTM-RNN with source and target words
– phrase-based system: better (?) with additional LM data

future steps:
– subword units: underway
– joint training of models and alignment

SLIDE 29

Attention-based NN MT [Bahdanau & Cho+ 15]

GRU: gated recurrent unit (similar to LSTM-RNN)

[Figure: attention-based encoder-decoder — source words f_j are encoded into bidirectional states h_j; for each target position i, attention weights α(j|i), j = 1, ..., J, form a context vector c_i that feeds the decoder state s_i and the output word y_i.]
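A minimal numpy sketch of the attention step in the figure: scores from the previous decoder state and the encoder states are softmax-normalized into α(j|i), and the context vector is their weighted sum. The additive scoring form follows [Bahdanau & Cho+ 15], but the dimensions, weights, and the omission of the GRU decoder itself are simplifications for illustration.

```python
import numpy as np

def attention_context(s_prev, H, Wa, Ua, va):
    """Additive attention, simplified:
    score(i, j) = va . tanh(Wa s_prev + Ua h_j), softmax-normalized over j;
    context c_i = sum_j alpha(j|i) * h_j.  Shapes and weights are toy values."""
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # alpha(j|i), j = 1..J
    c_i = alpha @ H                           # weighted sum of encoder states
    return alpha, c_i

rng = np.random.default_rng(1)
J, d_h, d_s, d_a = 4, 3, 3, 5                 # source length and toy dimensions
H = rng.normal(size=(J, d_h))                 # bidirectional encoder states h_j
s_prev = rng.normal(size=d_s)                 # previous decoder state s_{i-1}
Wa = rng.normal(size=(d_a, d_s))
Ua = rng.normal(size=(d_a, d_h))
va = rng.normal(size=d_a)
alpha, c = attention_context(s_prev, H, Wa, Ua, va)
print(alpha, c)
```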

SLIDE 30

Attention-Based Neural MT: LSTM vs. GRU (Ongoing Work)

remarks about the baseline system and software:
– starting point: software by Montreal based on Blocks
– extension by RWTH: number categories + UNK → BLEU +1.0%
– extension by RWTH: GRU units are replaced by LSTM units

        De→En (WMT 2015)    En→Ro (WMT 2016)    Ch→En (BOLT)
Model   BLEU[%]  TER[%]     BLEU[%]  TER[%]     BLEU[%]  TER[%]
LSTM    27.6     54.6       –        –          18.8     68.2
GRU     27.0     55.0       –        –          18.8     68.0

SLIDE 31

Attention-Based Neural MT (Ongoing Work)

                                 De→En (WMT 2015)    En→Ro (WMT 2016)    Ch→En (BOLT)
Model                            BLEU[%]  TER[%]     BLEU[%]  TER[%]     BLEU[%]  TER[%]
baseline: whole words            25.1     56.4       22.6     60.6       17.3     69.7
subword units (byte pair enc.)   27.6     54.6       –        –          18.8     68.2
+ alignment feedback             27.9     54.1       –        –          19.5     67.0
+ guided alignment (GIZA++)      –        –          –        –          19.7     66.3
+ dropout                        –        –          –        –          20.2     65.1
+ alignment feedback             –        –          –        –          20.5     64.8
ensemble of 4 best               28.8     53.3       –        –          21.7     63.7

result: significant improvement over the baseline system

SLIDE 32

Comparison: Best Systems

                                    De→En (WMT 2015)    En→Ro (WMT 2016)    Ch→En (BOLT)
                                    BLEU[%]  TER[%]     BLEU[%]  TER[%]     BLEU[%]  TER[%]
phrase-based system                 28.1     53.2       24.5     59.3       17.9     67.7
extended lexicon and alignment      24.0     57.2       24.1     59.1       16.9     69.7
attention-based MT: single system   –        –          –        –          20.5     64.8
attention-based MT: ensemble        28.8     53.3       –        –          21.7     63.7

remarks:

  • best system overall: attention-based MT

specifically for Ch → En

  • extended lexicon and alignment:

– subword units – joint training of models and alignment (end-to-end) – how to replace maximum by sum ?

  • system combinations ?
SLIDE 33

Statistical Approach and Machine Learning for HLT

  • deep neural networks: new age of ML ?
  • only one example of probabilistic models
  • spirit of end-to-end design: holistic view

– decision process: Bayes decision rule – training procedure: open questions

  • two questions in end-to-end training:

– suitable training criterion: link to performance criterion – numerical optimization strategy

  • end-to-end training in ASR and SMT: sequence discriminative training

bottleneck: optimization strategy

  • specific future challenges for ANNs in general:

– complex mathematical models that are hard to analyze – question: can we find suitable mathematical approximations with more explicit descriptions of the dependencies and level interactions and of the performance criterion?

  • characters vs. whole words:

combination of both concepts

SLIDE 34

END

SLIDE 35

REFERENCES

SLIDE 36

References

[Bahdanau & Cho+ 15] D. Bahdanau, K. Cho, Y. Bengio: Neural machine translation by jointly learning to align and translate. Int. Conf. on Learning Representations (ICLR), San Diego, CA, May 2015.

[Bahl & Jelinek+ 83] L. R. Bahl, F. Jelinek, R. L. Mercer: A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 5, pp. 179-190, March 1983.

[Bahl & Brown+ 86] L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer: Maximum mutual information estimation of hidden Markov parameters for speech recognition. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Tokyo, pp. 49-52, April 1986.

[Beck & Schlüter+ 15] E. Beck, R. Schlüter, H. Ney: Error Bounds for Context Reduction and Feature Omission. Interspeech, Dresden, Germany, Sep. 2015.

[Bengio & Ducharme+ 00] Y. Bengio, R. Ducharme, P. Vincent: A neural probabilistic language model. Advances in Neural Information Processing Systems (NIPS), pp. 933-938, Denver, CO, USA, Nov. 2000.

[Botros & Irie+ 15] R. Botros, K. Irie, M. Sundermeyer, H. Ney: On Efficient Training of Word Classes and Their Application to Recurrent Neural Network Language Models. Interspeech, pp. 1443-1447, Dresden, Germany, Sep. 2015.

[Bourlard & Wellekens 90] H. Bourlard, C. J. Wellekens: Links between Markov Models and Multilayer Perceptrons. In D. S. Touretzky (ed.): Advances in Neural Information Processing Systems I, Morgan Kaufmann, San Mateo, CA, pp. 502-507, 1989.

[Bridle 89] J. S. Bridle: Probabilistic Interpretation of Feedforward Classification Network Outputs with Relationships to Statistical Pattern Recognition. In F. Fogelman-Soulie, J. Herault (eds.): Neuro-computing: Algorithms, Architectures and Applications, NATO ASI Series in Systems and Computer Science, Springer, New York, 1989.

SLIDE 37

[Brown & Della Pietra+ 93] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, R. L. Mercer: Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, Vol. 19.2, pp. 263-311, June 1993.

[Castano & Vidal+ 93] M. A. Castano, E. Vidal, F. Casacuberta: Inference of stochastic regular languages through simple recurrent networks. IEE Colloquium on Grammatical Inference: Theory, Applications and Alternatives, pp. 16/1-6, Colchester, UK, April 1993.

[Castano & Casacuberta 97] M. Castano, F. Casacuberta: A connectionist approach to machine translation. European Conf. on Speech Communication and Technology (Eurospeech), pp. 91-94, Rhodes, Greece, Sep. 1997.

[Castano & Casacuberta+ 97] M. Castano, F. Casacuberta, E. Vidal: Machine translation using neural networks and finite-state models. Int. Conf. on Theoretical and Methodological Issues in Machine Translation (TMI), pp. 160-167, Santa Fe, NM, USA, July 1997.

[Dahl & Yu+ 12] G. E. Dahl, D. Yu, L. Deng, A. Acero: Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. on Audio, Speech and Language Processing, Vol. 20, No. 1, pp. 30-42, Jan. 2012.

[Devlin & Zbib+ 14] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, J. Makhoul: Fast and Robust Neural Network Joint Models for Statistical Machine Translation. Annual Meeting of the ACL, pp. 1370-1380, Baltimore, MD, June 2014.

[Forcada & Carrasco 05] M. L. Forcada, R. C. Carrasco: Learning the initial state of a second-order recurrent neural network during regular language inference. Neural Computation, Vol. 7, No. 5, pp. 923-930, Sep. 2005.

[Fritsch & Finke+ 97] J. Fritsch, M. Finke, A. Waibel: Adaptively Growing Hierarchical Mixtures of Experts. Advances in Neural Information Processing Systems 9 (NIPS), MIT Press, pp. 459-465, 1997.

[Gers & Schmidhuber+ 00] F. A. Gers, J. Schmidhuber, F. Cummins: Learning to forget: Continual prediction with LSTM. Neural Computation, Vol. 12, No. 10, pp. 2451-2471, 2000.

[Gers & Schraudolph+ 02] F. A. Gers, N. N. Schraudolph, J. Schmidhuber: Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, Vol. 3, pp. 115-143, 2002.

SLIDE 38

[Graves & Fernandez+ 06] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber: Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Int. Conf. on Machine Learning, Pittsburgh, USA, pp. 369-376, 2006.

[Haffner 93] P. Haffner: Connectionist Speech Recognition with a Global MMI Algorithm. 3rd Europ. Conf. on Speech Communication and Technology (Eurospeech), Berlin, Germany, Sep. 1993.

[Hermansky & Ellis+ 00] H. Hermansky, D. W. Ellis, S. Sharma: Tandem connectionist feature extraction for conventional HMM systems. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1635-1638, Istanbul, Turkey, June 2000.

[Hinton & Osindero+ 06] G. E. Hinton, S. Osindero, Y. Teh: A fast learning algorithm for deep belief nets. Neural Computation, Vol. 18, No. 7, pp. 1527-1554, July 2006.

[Hochreiter & Schmidhuber 97] S. Hochreiter, J. Schmidhuber: Long short-term memory. Neural Computation, Vol. 9, No. 8, pp. 1735-1780, Nov. 1997.

[Ivakhnenko 71] A. G. Ivakhnenko: Polynomial theory of complex systems. IEEE Trans. on Systems, Man and Cybernetics, Vol. 1, No. 4, pp. 364-378, Oct. 1971.

[Klakow & Peters 02] D. Klakow, J. Peters: Testing the correlation of word error rate and perplexity. Speech Communication, pp. 19-28, 2002.

[Koehn & Och+ 03] P. Koehn, F. J. Och, D. Marcu: Statistical Phrase-Based Translation. HLT-NAACL 2003, pp. 48-54, Edmonton, Canada, May-June 2003.

[Le & Allauzen+ 12] H. S. Le, A. Allauzen, F. Yvon: Continuous space translation models with neural networks. NAACL-HLT 2012, pp. 39-48, Montreal, QC, Canada, June 2012.

[LeCun & Bengio+ 94] Y. LeCun, Y. Bengio: Word-level training of a handwritten word recognizer based on convolutional neural networks. Int. Conf. on Pattern Recognition, Jerusalem, Israel, pp. 88-92, Oct. 1994.

[Mikolov & Karafiat+ 10] T. Mikolov, M. Karafiat, L. Burget, J. Černocký, S. Khudanpur: Recurrent neural network based language model. Interspeech, pp. 1045-1048, Makuhari, Chiba, Japan, Sep. 2010.

SLIDE 39

[Nakamura & Shikano 89] M. Nakamura, K. Shikano: A Study of English Word Category Prediction Based on Neural Networks. ICASSP 89, pp. 731-734, Glasgow, UK, May 1989.

[Neco & Forcada 97] R. P. Neco, M. L. Forcada: Asynchronous translations with recurrent neural nets. IEEE Int. Conf. on Neural Networks, pp. 2535-2540, June 1997.

[Ney 03] H. Ney: On the Relationship between Classification Error Bounds and Training Criteria in Statistical Pattern Recognition. First Iberian Conf. on Pattern Recognition and Image Analysis, Puerto de Andratx, Spain, Springer LNCS Vol. 2652, pp. 636-645, June 2003.

[Och & Ney 03] F. J. Och, H. Ney: A Systematic Comparison of Various Alignment Models. Computational Linguistics, Vol. 29, No. 1, pp. 19-51, March 2003.

[Och & Ney 04] F. J. Och, H. Ney: The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, Vol. 30, No. 4, pp. 417-449, Dec. 2004.

[Och & Tillmann+ 99] F. J. Och, C. Tillmann, H. Ney: Improved Alignment Models for Statistical Machine Translation. Joint ACL/SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, pp. 20-28, June 1999.

[Povey & Woodland 02] D. Povey, P. C. Woodland: Minimum phone error and I-smoothing for improved discriminative training. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 105-108, Orlando, FL, May 2002.

[Printz & Olsen 02] H. Printz, P. A. Olsen: Theory and practice of acoustic confusability. Computer Speech and Language, pp. 131-164, Jan. 2002.

[Robinson 94] A. J. Robinson: An Application of Recurrent Nets to Phone Probability Estimation. IEEE Trans. on Neural Networks, Vol. 5, No. 2, pp. 298-305, March 1994.

[Schlüter & Nussbaum+ 12] R. Schlüter, M. Nussbaum-Thom, H. Ney: Does the Cost Function Matter in Bayes Decision Rule? IEEE Trans. on Pattern Analysis and Machine Intelligence, No. 2, pp. 292-301, Feb. 2012.

SLIDE 40

[Schlüter & Nussbaum-Thom+ 13] R. Schlüter, M. Nußbaum-Thom, E. Beck, T. Alkhouli, H. Ney: Novel Tight Classification Error Bounds under Mismatch Conditions based on f-Divergence. IEEE Information Theory Workshop, pp. 432-436, Sevilla, Spain, Sep. 2013.

[Schuster & Paliwal 97] M. Schuster, K. K. Paliwal: Bidirectional Recurrent Neural Networks. IEEE Trans. on Signal Processing, Vol. 45, No. 11, pp. 2673-2681, Nov. 1997.

[Schwenk 07] H. Schwenk: Continuous space language models. Computer Speech and Language, Vol. 21, No. 3, pp. 492-518, July 2007.

[Schwenk 12] H. Schwenk: Continuous Space Translation Models for Phrase-Based Statistical Machine Translation. 24th Int. Conf. on Computational Linguistics (COLING), Mumbai, India, pp. 1071-1080, Dec. 2012.

[Schwenk & Costa-jussa+ 07] H. Schwenk, M. R. Costa-jussa, J. A. R. Fonollosa: Smooth bilingual n-gram translation. Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 430-438, Prague, June 2007.

[Schwenk & Déchelotte+ 06] H. Schwenk, D. Déchelotte, J. L. Gauvain: Continuous Space Language Models for Statistical Machine Translation. COLING/ACL 2006, pp. 723-730, Sydney, Australia, July 2006.

[Solla & Levin+ 88] S. A. Solla, E. Levin, M. Fleisher: Accelerated Learning in Layered Neural Networks. Complex Systems, Vol. 2, pp. 625-639, 1988.

[Sundermeyer & Alkhouli+ 14] M. Sundermeyer, T. Alkhouli, J. Wuebker, H. Ney: Translation Modeling with Bidirectional Recurrent Neural Networks. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 14-25, Doha, Qatar, Oct. 2014.

[Sundermeyer & Ney+ 15] M. Sundermeyer, H. Ney, R. Schlüter: From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Trans. on Audio, Speech, and Language Processing, Vol. 23, No. 3, pp. 13-25, March 2015.

[Sundermeyer & Schlüter+ 12] M. Sundermeyer, R. Schlüter, H. Ney: LSTM neural networks for language modeling. Interspeech, pp. 194-197, Portland, OR, USA, Sep. 2012.
SLIDE 41

[Utgoff & Stracuzzi 02] P. E. Utgoff, D. J. Stracuzzi: Many-layered learning. Neural Computation, Vol. 14, No. 10, pp. 2497-2539, Oct. 2002.

[Vaswani & Zhao+ 13] A. Vaswani, Y. Zhao, V. Fossum, D. Chiang: Decoding with Large-Scale Neural Language Models Improves Translation. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 1387-1392, Seattle, WA, USA, Oct. 2013.

[Vogel & Ney+ 96] S. Vogel, H. Ney, C. Tillmann: HMM-based word alignment in statistical translation. Int. Conf. on Computational Linguistics (COLING), pp. 836-841, Copenhagen, Denmark, Aug. 1996.

[Waibel & Hanazawa+ 88] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. J. Lang: Phoneme Recognition: Neural Networks vs. Hidden Markov Models. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New York, NY, pp. 107-110, April 1988.

[Zens & Och+ 02] R. Zens, F. J. Och, H. Ney: Phrase-Based Statistical Machine Translation. 25th Annual German Conf. on AI, pp. 18-32, LNAI, Springer, 2002.
SLIDE 42

END
