SLIDE 1

Robustness is dead! Long live robustness!

Michael L. Seltzer

Microsoft Research REVERB 2014 | May 10, 2014 Collaborators: Dong Yu, Yan Huang, Frank Seide, Jinyu Li, Jui-Ting Huang

SLIDE 2

Golden age of speech recognition

  • More investment, more languages, and more data than ever before

REVERB 2014 2

SLIDE 3

Golden age of robustness?

  • Yes!
  • Many large scale deployments in challenging environments

SLIDE 4

Golden age of robustness?

  • No!
  • No overlap in software, tools, systems means no common ground

[Diagram: the Robust ASR and LVCSR communities shown with no overlap]

SLIDE 5

Finding common ground…

  • DNNs + Software Tools + GPUs have democratized the field
  • Diplomacy through back propagation!
  • Whoo-hoo!
  • Anyone can get state of the art ASR with one machine and free software
  • Lowest barrier to entry in memory (recall Aurora 2)
  • Uh-oh!
  • DNN systems achieve excellent performance without noise robustness
  • Aurora 4, voice search, CHiME, AMI meetings
  • What to make of all this? Is robustness dead?

SLIDE 6

This talk

  • A look at DNN acoustic models with an emphasis on issues of robustness
  • What is a DNN?
  • Why do they work?
  • When do they fail?
  • How can they be improved?
  • Goals:
  • Emphasize analysis over intensive comparisons and system descriptions
  • Open up the black box a bit
  • Show what DNNs do well so we can improve what they don’t

SLIDE 7

A quick overview of deep neural networks

  • Catchy name for an MLP with “many” hidden layers
  • In: a context window of frames
  • Out: senone posterior probabilities
  • Trained with back propagation to maximize the conditional likelihood at the frame or sequence level
  • Optimization is important & difficult; pre-training helps
  • At runtime, convert posteriors to scaled likelihoods and decode as usual
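The pipeline on this slide can be sketched end to end. Everything below (layer sizes, random weights, flat priors, senone count) is an illustrative stand-in, not the actual system:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Input: a context window of 2*5+1 = 11 frames of 40-dim features,
# flattened into a single 440-dim input vector.
frames = rng.standard_normal((11, 40))
x = frames.reshape(-1)

# "Many" hidden layers: toy random weights standing in for a trained net.
layer_dims = [440, 256, 256, 256]
n_senones = 100
Ws = [rng.standard_normal((d_out, d_in)) * 0.05
      for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:])]
W_out = rng.standard_normal((n_senones, layer_dims[-1])) * 0.05

# Forward propagation through the sigmoid hidden layers.
h = x
for W in Ws:
    h = sigmoid(W @ h)

# Output layer: senone posterior probabilities p(s | x).
posteriors = softmax(W_out @ h)

# At runtime: divide by senone priors p(s) to get scaled likelihoods,
# usually in the log domain for the decoder.
priors = np.full(n_senones, 1.0 / n_senones)   # flat priors for the sketch
scaled_loglik = np.log(posteriors + 1e-10) - np.log(priors)

print(posteriors.sum())   # ~1.0: a valid posterior distribution
```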

SLIDE 8

Deep Neural Networks raise all boats…

  • All tasks improve

[Chart: ASR tasks arranged by amount of data and vocabulary size — TIMIT, WSJ, Broadcast News, Aurora 2, Meetings (ICSI, AMI), Switchboard, Voice Search, Aurora 4]

SLIDE 9

Deep Neural Networks raise all boats…

  • All phonemes improve

[Huang 2014]

SLIDE 10

Deep Neural Networks raise all boats…

  • All SNRs improve

[Huang 2014]

SLIDE 11

The power of depth

# Layers x # Neurons | SWBD WER (%) [300 hrs] | Aurora 4 WER (%) [10 hrs]
1 x 2k               | 24.2                   | –
3 x 2k               | 18.4                   | 14.2
5 x 2k               | 17.2                   | 13.8
7 x 2k               | 17.1                   | 13.7
9 x 2k               | 17.0                   | 13.9

  • Accuracy increases with depth

SLIDE 12

The power of depth

# Layers x # Neurons | SWBD WER (%) [300 hrs] | Aurora 4 WER (%) [10 hrs]
1 x 2k               | 24.2                   | –
3 x 2k               | 18.4                   | 14.2
5 x 2k               | 17.2                   | 13.8
7 x 2k               | 17.1                   | 13.7
9 x 2k               | 17.0                   | 13.9
1 x 16k              | 22.1                   | –

  • Depth is not just a way to add parameters

SLIDE 13

Why have DNNs been successful?

  • Many simple nonlinearities combine to form arbitrarily complex nonlinearities
  • A single classifier shares all parameters and internal representations
  • Joint feature learning & classifier design
  • Unlike tandem or bottleneck systems
  • Features at higher layers are more invariant and discriminative than at lower layers

[Diagram: the DNN viewed as nonlinear feature extraction (hidden layers) feeding a log-linear classifier (top layer)]

SLIDE 14

How is invariance achieved?

  • How do DNNs achieve invariance in the representation?
  • Consider forward propagation: h_{l+1} = σ(W_l h_l) = g(h_l)
  • What happens to a small perturbation ε_l on h_l as it propagates to h_{l+1}?

SLIDE 15

How is invariance achieved?

  • Forward propagation: h_{l+1} = σ(W_l h_l) = g(h_l)
  • A perturbation ε_l on the input to layer l propagates as

ε_{l+1} = σ(W_l (h_l + ε_l)) − σ(W_l h_l)

  • With the first-order approximation g(h + ε) − g(h) ≈ g′(h) ε:

ε_{l+1} ≈ σ′(W_l h_l) ∘ (W_l ε_l)

  • For the sigmoid, σ′ = σ(1 − σ), which gives the bound

‖ε_{l+1}‖ ≤ ‖diag(h_{l+1} ∘ (1 − h_{l+1})) W_l‖ · ‖ε_l‖
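The propagation and the bound are easy to verify numerically. A quick sanity check with random stand-in weights (not from any trained model):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid layer with small weights, as in large trained networks.
W = rng.standard_normal((256, 256)) * 0.05
h = rng.standard_normal(256)

# Exact propagation of a small input perturbation eps.
eps = rng.standard_normal(256) * 1e-3
h_next = sigmoid(W @ h)
eps_next = sigmoid(W @ (h + eps)) - h_next

# First-order approximation: eps_next ≈ diag(h_next ∘ (1 - h_next)) W eps
J = np.diag(h_next * (1.0 - h_next)) @ W
eps_lin = J @ eps

# The bound: ||eps_next|| <= ||diag(h_next ∘ (1 - h_next)) W|| * ||eps||,
# with the spectral norm as the matrix norm.
gain = np.linalg.norm(J, 2)
print(np.linalg.norm(eps_next), gain * np.linalg.norm(eps))
```

With these small weights the gain comes out below 1, so the perturbation shrinks from one layer to the next, which is the invariance argument made on the following slides.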

SLIDE 16

How is invariance achieved?

‖ε_{l+1}‖ ≤ ‖diag(h_{l+1} ∘ (1 − h_{l+1})) W_l‖ · ‖ε_l‖

  • The first term, h(1 − h), is always ≤ 0.25
  • Much smaller when units saturate
  • Higher layers are more sparse

[Chart: fraction of hidden units saturated (h > 0.99 or h < 0.01) at layers 1–6; saturation grows with depth]

SLIDE 17

How is invariance achieved?

‖ε_{l+1}‖ ≤ ‖diag(h_{l+1} ∘ (1 − h_{l+1})) W_l‖ · ‖ε_l‖

  • The first term, h(1 − h), is always ≤ 0.25
  • Much smaller when units saturate
  • Higher layers are more sparse
  • For large networks, most weights are very small
  • SWBD: 98% of weights < 0.5

[Chart: fraction of hidden units saturated (h > 0.99 or h < 0.01) at layers 1–6]
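The saturation statistic in the chart is simple to compute. The toy network below is illustrative only (random weights whose scale grows with depth, which pushes deeper layers toward saturation); it is not a trained model:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 6-layer sigmoid network; the growing weight scale makes deeper
# layers saturate more, mimicking the trend shown on the slide.
dims = [440, 256, 256, 256, 256, 256, 256]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) * (0.05 * (i + 1))
      for i in range(6)]

X = rng.standard_normal((1000, 440))   # 1000 random "frames"

fracs = []                             # fraction of saturated units per layer
h = X
for W in Ws:
    h = sigmoid(h @ W.T)
    fracs.append(float(np.mean((h > 0.99) | (h < 0.01))))

for layer, f in enumerate(fracs, start=1):
    print(f"layer {layer}: {f:.1%} saturated")
```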

SLIDE 18

How is invariance achieved?

  • The “ε gain” is < 1 on average
  • Variation shrinks from one layer to the next
  • The maximum gain is > 1
  • Enlarges ε near decision boundaries
  • More discriminative
  • For input “close” to the training data, each layer improves invariance
  • Increases robustness

[Chart: average and maximum of ‖diag(h_{l+1} ∘ (1 − h_{l+1})) W_l‖ at layers 1–6, SWB dev set; the average stays below 1]

SLIDE 19

Visualizing invariance with t-SNE

  • A way to visualize high dimensional data in a low dimensional space
  • Preserves neighbor relations
  • Use to examine input and internal representations of a DNN
  • Consider a parallel pair of utterances:
  • a noise-free utterance recorded with a close-talking microphone
  • the same utterance corrupted by restaurant noise at 10 dB SNR

[van der Maaten 2008]

SLIDE 20

Visualizing invariance with t-SNE

  • Features

SLIDE 21

Visualizing invariance with t-SNE

  • 1st layer

SLIDE 22

Visualizing invariance with t-SNE

  • 3rd layer

SLIDE 23

Visualizing invariance with t-SNE

  • 6th layer

[t-SNE plot; the silence cluster is labeled]

SLIDE 24

Invariance improves robustness

  • DNNs are robust to small variations of the training data
  • Explicitly normalizing these variations is less important/effective
  • Network is already doing it
  • Removing too much variability from the data may hinder generalization

Preprocessing Technique   | Task        | DNN Relative Imp.
VTLN (speaker)            | SWBD        | <1% [Seide 2011]
C-MMSE (noise)            | Aurora 4/VS | <0% [Seltzer 2013]
IBM/IRM masking (noise)   | Aurora 4    | <0% [Sim 2014]

SLIDE 25

The end of robustness?

  • “The unreasonable effectiveness of data” [Halevy 2009]
  • In DNN terms: with more data, a new sample is more likely to lie within ε of a training example

“The more training data used, the greater the chance that a new sample can be trivially related to samples in the training data, thereby lessening the need for any complex reasoning that may be beneficial in the cases of sparse training data.” [Brill 2002]

SLIDE 26

(Un)fortunately, data is not a panacea

  • Even in the best cases, performance gaps persist
  • Noisy is 2X WER of Clean (Aurora 4, VS)
  • Unseen environments 2X WER of seen noises with MST (Aurora 4)
  • Farfield is 2X WER of Close-talk (Meetings)
  • Some scenarios cannot support large training sets
  • Low resource languages
  • Mismatch is sometimes unavoidable
  • New devices, environments
  • Sometimes modeling assumptions are wrong
  • Speech separation, reverberation

SLIDE 27

Robustness: the triumphant return!

  • Systems still need to be more robust to variability
  • speaker, environment, device
  • Guiding principles:
  • Exposure to variability is good (multi-condition training)
  • Limiting variability can harm performance
  • Close relationship to desired objective function is desirable

SLIDE 28

Approach 1: Decoupled Preprocessing

  • Processing independent of downstream activity
  • Pro: simple
  • Con: removes variability
  • Biggest success: beamforming [Swietojanski 2013]
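Beamforming is the textbook example of decoupled preprocessing. Below is a generic delay-and-sum sketch (not the specific method of [Swietojanski 2013]); the array geometry, delays, and signals are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
fs = 16000
t = np.arange(0, 0.1, 1 / fs)

# A "speech" tone arriving at a 4-mic linear array with integer-sample
# delays, plus uncorrelated noise at each microphone.
source = np.sin(2 * np.pi * 440 * t)
delays = [0, 2, 4, 6]                      # steering delays in samples
mics = []
for d in delays:
    x = np.roll(source, d) + 0.5 * rng.standard_normal(len(t))
    mics.append(x)

# Delay-and-sum: undo each mic's delay, then average. Coherent speech
# adds constructively; uncorrelated noise partially cancels.
aligned = [np.roll(x, -d) for x, d in zip(mics, delays)]
beamformed = np.mean(aligned, axis=0)

def snr(sig, ref):
    # SNR in dB of a signal against the clean reference.
    noise = sig - ref
    return 10 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

print(snr(mics[0], source), snr(beamformed, source))
```

Averaging the four aligned channels cuts the uncorrelated noise power by roughly a factor of four, i.e. about a 6 dB SNR gain here, while leaving the speech untouched — the "removes variability" trade-off is mild, which may be why it remains the biggest success of this approach.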

[Diagram: a standalone preprocessing stage in front of the recognizer]

SLIDE 29

Approach 2: Integrated Preprocessing

  • Treat preprocessing as initial “layers” of model
  • Optimize parameters with back propagation
  • Examples: Mask estimation [Narayanan 2014], Mel optimization [Sainath 2013]
  • Pro: should be “optimal” for the model
  • Con: expensive, hard to “move the needle”

[Diagram: preprocessing layers folded into the model and trained jointly via back-propagation]

SLIDE 30

Approach 3: Augmented information

  • Augment the model with informative side information

  • Nodes (input, hidden, output)
  • Objective function
  • Pros:
  • preserves variability
  • adds knowledge
  • operates on representation
  • Con:
  • No physical model

[Diagram: knowledge and auxiliary information injected into the network]

SLIDE 31

Example: co-channel speech separation

  • Create multi-style training data
  • Train 2 DNNs
  • Use frame-level SNR to assign labels
  • Jointly decode both hypotheses
  • Add a trained adaptive penalty to discourage frequent speaker switching

  • Speech Separation Challenge:
  • IBM Superhuman: 21.6% WER
  • Proposed: 20.0% WER
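The frame-labeling step can be sketched with synthetic energies. The signals and the 0 dB threshold below are made-up illustrations of the idea, not the exact rule from [Weng 2014]:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two "speakers" as framewise energy envelopes (entirely synthetic).
n_frames = 200
e1 = np.abs(rng.standard_normal(n_frames))   # speaker 1 frame energy
e2 = np.abs(rng.standard_normal(n_frames))   # speaker 2 frame energy

# Frame-level SNR of speaker 1 relative to speaker 2 in the mixture.
frame_snr = 10 * np.log10(e1 / (e2 + 1e-10) + 1e-10)

# Label each frame by its dominant speaker; each label stream would
# supervise one of the two DNNs.
labels = np.where(frame_snr > 0, 1, 2)

print(np.mean(labels == 1))   # roughly half the frames for iid energies
```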

[Diagram: joint decoding of hypotheses hyp 1 and hyp 2]

[Weng 2014]

SLIDE 32

Example 2: noise aware training/adaptation

  • Similar motivation to noise-adaptive training of GMM acoustic models
  • Give the network cues about the source of variability [Seltzer 2013]
  • Preserve variability in the training data
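The input augmentation at the heart of noise-aware training is a one-liner. Below is a minimal sketch in the spirit of [Seltzer 2013]; the synthetic frames and the first-10-frames noise estimate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# An utterance of 40-dim log-mel frames (synthetic stand-in).
frames = rng.standard_normal((300, 40))

# Noise-aware training: estimate the noise from the first frames of the
# utterance and append that estimate to every input frame, giving the
# network an explicit cue about the acoustic environment.
noise_est = frames[:10].mean(axis=0)             # crude noise estimate
aug = np.hstack([frames, np.tile(noise_est, (len(frames), 1))])

print(frames.shape, aug.shape)   # (300, 40) (300, 80)
```

The same mechanism carries over to speaker adaptation by appending an i-vector instead of a noise estimate, as noted on the following slides.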

SLIDE 33

Example 2: noise aware training/adaptation

  • Similar motivation to noise-adaptive training of GMM acoustic models
  • Give the network cues about the source of variability [Seltzer 2013]
  • Preserve variability in the training data
  • Works for speaker adaptation [Saon 2013]

SLIDE 34

Example 2: noise aware training/adaptation

  • Similar motivation to noise-adaptive training of GMM acoustic models
  • Give the network cues about the source of variability [Seltzer 2013]
  • Preserve variability in the training data
  • Works for speaker adaptation [Saon 2013]
  • …and noise adaptation [Li 2014]

SLIDE 35

Example 2: noise aware training/adaptation

  • Similar motivation to noise-adaptive training of GMM acoustic models
  • Give the network cues about the source of variability [Seltzer 2013]
  • Preserve variability in the training data
  • Works for speaker adaptation [Saon 2013]
  • …and noise adaptation [Li 2014]
  • …at all layers [Xue 2014]

SLIDE 36

Summary

  • DNNs have had a dramatic impact on speech recognition
  • DNNs are incredibly robust to unwanted variability including noise
  • Robustness is achieved through feature invariance
  • Invariance is achieved through the combination of large training sets and deep networks
  • Several areas where performance still suffers and there are opportunities for improvement
  • (At least) three architectures for incorporating robustness into DNNs
  • It’s still early days…lots of exciting work to do!

SLIDE 37

Conclusion

  • Is robustness dead?

The reports of my death have been greatly exaggerated.

  • M. Twain
  • Long live robustness!

SLIDE 38

Thank you!

SLIDE 39

References

  • D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, F. Seide, “Feature learning in deep neural networks – studies on speech recognition tasks,” in Proc. ICLR, 2013
  • F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. ASRU, 2011
  • Y. Huang, D. Yu, C. Liu, and Y. Gong, “A comparative analytic study on the Gaussian mixture and context-dependent deep neural network hidden Markov models,” in submission

  • M. L. Seltzer, D. Yu, and Y. Wang, “An investigation of noise robustness of deep neural networks,” in Proc. ICASSP, 2013
  • L. J. P. van der Maaten and G.E. Hinton, “Visualizing high-dimensional data using t-SNE,” Journal of Machine Learning Research 9(Nov):2579-2605, 2008
  • C. Weng, D. Yu, M. L. Seltzer, and J. Droppo, “Single-channel mixed speech recognition using deep neural networks,” in Proc. ICASSP, 2014
  • J. Li, J.-T. Huang, and Y. Gong, “Factorized adaptation for deep neural network,” in Proc. ICASSP, 2014
  • E. Brill, J. Lin, M. Banko, S. Dumais and A. Ng, “Data-intensive question answering,” in Proc. TREC, 2001
  • A. Halevy, P. Norvig, F. Pereira, “The unreasonable effectiveness of data,” Intelligent Systems, IEEE , vol.24, no.2, pp.8-12, Mar-Apr 2009
  • A. Narayanan and D. Wang, “Joint noise adaptive training for robust automatic speech recognition,” in Proc. ICASSP, 2014
  • T. N. Sainath, B. Kingsbury, A. Mohamed, and B. Ramabhadran, “Learning filter banks within a deep neural network framework,” in Proc. ASRU, 2013.
  • B. Li and K. C. Sim, “An ideal hidden-activation mask for deep neural networks based noise-robust speech recognition,” in Proc. ICASSP, 2014
  • S. Xue, O. Abdel-Hamid, H. Jiang, and L. Dai, “Direct adaptation of hybrid DNN/HMM model for fast speaker adaptation in LVCSR based on speaker code,” in Proc. ICASSP, 2014

  • G. Saon, H. Soltau, M. Picheny, and D. Nahamoo, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, 2013.
  • P Swietojanski, A Ghoshal, and S Renals, “Hybrid acoustic models for distant and multichannel large vocabulary speech recognition,” in Proc. ASRU, 2013
