SLIDE 1

Confidence Estimation for Black Box Automatic Speech Recognition Systems using Lattice Recurrent Neural Networks

ICASSP 2020

  • A. Kastanos⋆, A. Ragni⋆†, M.J.F. Gales⋆

April 15, 2020

⋆ Dept of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK
† Dept of Computer Science, University of Sheffield, 211 Portobello, Sheffield S1 4DP, UK

SLIDE 2

Introduction

Figure 1: Overview of a black-box ASR system (audio input passes through a black-box ASR, yielding the timed one-best sequence "quick brown fox" at t0..t3 with word confidences c0=0.85, c1=0.1, c2=0.9)

  • Cloud-based ASR solutions are becoming the norm
  • Increasing complexity of ASR
  • Fewer companies can afford to build their own systems
  • The internal states of black-box systems are inaccessible
  • Word-based confidence scores are an indication of reliability

SLIDE 3

Speech Recognition and Confidence Scores

Figure 2: One-best word sequence "quick brown fox" (times t0..t3) with word-level confidence scores c0=0.85, c1=0.1, c2=0.9

How do we typically obtain confidence scores?

  • Word posterior probabilities: known to be overly confident [1]
  • Decision-tree mapping of posteriors: requires calibration
  • Can we do better?
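The decision-tree style mapping can be illustrated with a simple histogram-binning calibrator. This is a hypothetical sketch of the general idea (bin the posteriors, replace each by the empirical accuracy in its bin), not the exact mapping used in the paper; all names are illustrative.

```python
# Illustrative piecewise-constant (histogram-binning) calibration of
# overconfident word posteriors into confidence scores.

def fit_histogram_calibration(posteriors, correct, n_bins=4):
    """For each posterior bin, estimate P(word correct | posterior in bin)."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(posteriors, correct):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append(c)
    # Fall back to the bin midpoint when a bin is empty
    return [sum(b) / len(b) if b else (i + 0.5) / n_bins
            for i, b in enumerate(bins)]

def calibrate(posterior, mapping):
    """Map a raw posterior to the calibrated confidence of its bin."""
    n_bins = len(mapping)
    idx = min(int(posterior * n_bins), n_bins - 1)
    return mapping[idx]
```

With this scheme, posteriors near 1.0 that are frequently wrong get mapped down to the empirical accuracy observed in their bin.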

SLIDE 4

Deep Learning for Confidence Estimation

Figure 3: Bi-directional RNN for confidence prediction on one-best sequences (inputs x_{i-1}, x_i, x_{i+1} feed forward and backward RNN units; their hidden states h_{i-1}, h_i, h_{i+1} are concatenated to predict confidences c_{i-1}, c_i, c_{i+1})

  • Bi-directional RNN to predict if each word is correct
  • What kind of features are available?
  • What if we have access to complicated structures?
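The bi-directional confidence predictor can be sketched in plain Python. This is a minimal scalar tanh RNN with fixed illustrative weights, just to show the forward/backward pass and concatenation; a real system would use learned LSTM or GRU parameters and vector states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_pass(xs, w_x=0.5, w_h=0.3):
    """Single-direction scalar tanh RNN over a feature sequence."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def birnn_confidences(xs, w_out=(1.0, 1.0), b_out=0.0):
    """Run forward and backward passes, concatenate the states, and
    apply a sigmoid output layer: one confidence per word."""
    fwd = rnn_pass(xs)
    bwd = rnn_pass(xs[::-1])[::-1]
    return [sigmoid(w_out[0] * f + w_out[1] * b + b_out)
            for f, b in zip(fwd, bwd)]
```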

SLIDE 5

Features

Audio Input Black-Box ASR t0 t1 t2 t3 quick brown fox c0=0.85 c1=0.1 c2=0.9 Acoustic Model Lexicon Language Model Figure 4: Detailed look at ASR features

Can we extract these features?

  • Sub-word level information
  • Competing hypotheses
  • Lattice features

SLIDE 6

Sub-word Unit Encoder

Figure 5: Word confidence classifier (the bi-directional RNN of Figure 3)

Figure 6: Sub-word feature extractor (the grapheme embeddings g_i^(j-1), g_i^(j), g_i^(j+1) of word i are fed to forward and backward RNN units, producing states z_i^(j) in each direction; the concatenated states are pooled by an attention mechanism attn(·) into a fixed-size feature g̃_i)

  • Given a lexicon, we can extract grapheme features
  • fox → { f, o, x }
  • Convert a variable length grapheme sequence into a fixed size
  • Deep learning to aggregate features
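The attention pooling attn(·) from Figure 6, which turns a variable-length grapheme sequence into the fixed-size feature g̃_i, might be sketched as follows. The dot-product scoring and dimensions are illustrative assumptions, not the paper's exact parameterisation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(grapheme_vecs, query):
    """Attention-weighted sum of per-grapheme vectors; the result has a
    fixed size regardless of how many graphemes the word has."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in grapheme_vecs]
    alphas = softmax(scores)
    dim = len(grapheme_vecs[0])
    return [sum(a * vec[d] for a, vec in zip(alphas, grapheme_vecs))
            for d in range(dim)]
```

Words of any length ("fox", "quick", ...) are thus reduced to one vector of the same dimensionality, which can be concatenated onto the word-level input features.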

SLIDE 7

Alternative Hypothesis Representations

An intermediate step in generating a one-best sequence is the generation of lattices.

Figure 7: Lattice (competing hypotheses such as quick/quit/weak/fast, brown/crown/young, and fox)

From lattices, we can obtain confusion networks by clustering arcs.

Figure 8: Confusion network (time slots t0..t3 with competing words per slot: quick/quit, brown/young/crown, fox)

How do we handle non-sequential models?
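The clustering step from lattice to confusion network can be sketched as time-overlap grouping. Real confusion-network construction also uses arc posteriors and phonetic similarity; this simplified illustration assigns each arc to the slot it overlaps most.

```python
def cluster_arcs(arcs, slots):
    """Assign each lattice arc (word, start, end) to the time slot
    with the greatest temporal overlap. Slots are (start, end) pairs."""
    def overlap(a, s):
        return max(0.0, min(a[2], s[1]) - max(a[1], s[0]))
    network = [set() for _ in slots]
    for arc in arcs:
        best = max(range(len(slots)), key=lambda i: overlap(arc, slots[i]))
        network[best].add(arc[0])
    return network
```

For the lattice of Figure 7, this groups quick/quit into the first slot, brown/young into the second, and fox into the third, mirroring Figure 8.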

SLIDE 8

Lattice Recurrent Neural Networks

A generalisation of bi-directional RNNs to handle multiple incoming arcs:

Figure 9: Lattice in which red nodes have multiple incoming arcs, while blue nodes have only one

Attention to learn relative importance [2]:

    →h_i = Σ_{j ∈ →N_i} α_j →h_j

Figure 10: Arc merging mechanism as implemented by LatticeRNN [3] (the hidden states h_1, ..., h_{N_i} of the incoming arcs are combined into a single state h_i^s, which is fed with input x_i to the RNN unit)
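The attention combination of incoming arc states can be sketched as a softmax-weighted sum. In the actual model the attention scores come from a learned scoring function; here they are given as inputs, which is an illustrative simplification.

```python
import math

def merge_incoming(hidden_states, scores):
    """Combine the hidden states of all incoming arcs at a lattice node
    into one state, weighting each by softmax-normalised attention
    scores (the scoring function itself is learned in practice)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(a * h[d] for a, h in zip(alphas, hidden_states))
            for d in range(dim)]
```

Nodes with a single incoming arc (the blue nodes of Figure 9) reduce to the identity, so this mechanism strictly generalises the bi-directional RNN over one-best sequences.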

SLIDE 9

Extracting Lattice Features

Figure 11: Arc matching (each word in the confusion network is matched to the corresponding arc in the lattice)

  • Match each confusion network arc to the corresponding lattice arc
  • What kind of features could we extract?
  • Acoustic and Language model scores
  • Lattice embeddings
  • Hypothesis density
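Matching a confusion-network word back to its lattice arc, so that acoustic and language model scores can be attached to the confidence features, can be sketched as matching on word identity plus time overlap. The exact matching criterion in the paper may differ; this is an illustrative sketch.

```python
def match_arc(word, start, end, lattice_arcs):
    """Return the lattice arc with the same word label and the largest
    time overlap, or None if no label matches.
    Arcs are (word, start, end, features) tuples."""
    def overlap(a):
        return max(0.0, min(end, a[2]) - max(start, a[1]))
    candidates = [a for a in lattice_arcs if a[0] == word]
    return max(candidates, key=overlap) if candidates else None
```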

SLIDE 10

Experiments (One-best)

Large gains are obtained by introducing additional information.

    Features                NCE      AUC
    word     words          0.0358   0.7496
             + duration     0.0541   0.7670
             + posteriors   0.2765   0.9033
             + mapping      0.2911   0.9121
    sub-word + embedding    0.2936   0.9127
             + duration     0.2944   0.9129
             + encoder      0.2978   0.9139

Table 1: Impact of word and sub-word features. IARPA BABEL Georgian (25 hours).
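The NCE (normalised cross entropy) metric in these tables measures how much the confidence scores reduce uncertainty relative to always predicting the overall word accuracy. A sketch of the standard definition (1 is perfect, 0 matches the constant baseline, negative is worse than it):

```python
import math

def nce(confidences, correct):
    """Normalised cross entropy of word confidences against 0/1
    correctness labels."""
    n = len(correct)
    p_c = sum(correct) / n  # overall fraction of correct words
    # Entropy of the constant baseline that always predicts p_c
    h_max = -sum(correct) * math.log2(p_c) \
            - (n - sum(correct)) * math.log2(1 - p_c)
    # Cross entropy of the actual confidence scores
    h_conf = -sum(math.log2(c) if ok else math.log2(1 - c)
                  for c, ok in zip(confidences, correct))
    return (h_max - h_conf) / h_max
```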

SLIDE 11

Experiments (Confusion Networks)

Significant gains from alternative hypotheses and basic lattice features.

    Features        NCE      AUC
    word (all)      0.2911   0.9121
    + confusions    0.2934   0.9201
    + sub-word      0.2998   0.9228
    + lattice       0.3004   0.9231

Table 2: Impact of competing hypothesis information. IARPA BABEL Georgian (25 hours).

SLIDE 12

Conclusion

  • Prevalence of black-box ASR
  • Limited ability to assess transcription reliability
  • Confidence estimates can be improved by providing available information

  • Deep learning approach for incorporating sub-word features
  • Deep learning framework for introducing lattice features

SLIDE 13

References

  • [1] G. Evermann and P.C. Woodland, "Posterior probability decoding, confidence estimation and system combination," 2000.
  • [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [3] Q. Li, P. M. Ness, A. Ragni, and M. J. F. Gales, "Bi-directional lattice recurrent neural networks for confidence estimation," in ICASSP, 2019.

SLIDE 14

Thank you

Figure 12: Source code: https://github.com/alecokas/BiLatticeRNN-Confidence