SLIDE 1

Confidence Estimation for Black Box Automatic Speech Recognition Systems using Lattice Recurrent Neural Networks

ICASSP 2020

  • A. Kastanos⋆, A. Ragni⋆†, M.J.F. Gales⋆

April 15, 2020

⋆ Dept of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK
† Dept of Computer Science, University of Sheffield, 211 Portobello, Sheffield S1 4DP, UK

SLIDE 2

Introduction

Figure 1: Overview of a black-box ASR system (audio input passes through a black-box ASR, yielding the timed one-best sequence "quick brown fox" at t0..t3 with word confidences c0=0.85, c1=0.1, c2=0.9)

  • Cloud-based ASR solutions are becoming the norm
  • Increasing complexity of ASR
  • Fewer companies can afford to build their own systems
  • The internal states of black-box systems are inaccessible
  • Word-based confidence scores are an indication of reliability

SLIDE 3

Speech Recognition and Confidence Scores

Figure 2: One-best word sequence "quick brown fox" (times t0..t3) with word-level confidence scores c0=0.85, c1=0.1, c2=0.9

How do we typically obtain confidence scores?

  • Word posterior probabilities: known to be overly confident [1]
  • Decision-tree mapping of posteriors: requires calibration
  • Can we do better?
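The decision-tree style mapping can be illustrated with a simple histogram-binning calibrator. This is a hypothetical sketch of the general idea (bin the posteriors, replace each by the empirical accuracy in its bin), not the exact mapping used in the paper; all names are illustrative.

```python
# Illustrative piecewise-constant (histogram-binning) calibration of
# overconfident word posteriors into confidence scores.

def fit_histogram_calibration(posteriors, correct, n_bins=4):
    """For each posterior bin, estimate P(word correct | posterior in bin)."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(posteriors, correct):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append(c)
    # Fall back to the bin midpoint when a bin is empty
    return [sum(b) / len(b) if b else (i + 0.5) / n_bins
            for i, b in enumerate(bins)]

def calibrate(posterior, mapping):
    """Map a raw posterior to the calibrated confidence of its bin."""
    n_bins = len(mapping)
    idx = min(int(posterior * n_bins), n_bins - 1)
    return mapping[idx]
```

With this scheme, posteriors near 1.0 that are frequently wrong get mapped down to the empirical accuracy observed in their bin.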

SLIDE 4

Deep Learning for Confidence Estimation

Figure 3: Bi-directional RNN for confidence prediction on one-best sequences (inputs x_{i-1}, x_i, x_{i+1} feed forward and backward RNN units; their hidden states h_{i-1}, h_i, h_{i+1} are concatenated to predict confidences c_{i-1}, c_i, c_{i+1})

  • Bi-directional RNN to predict if each word is correct
  • What kind of features are available?
  • What if we have access to complicated structures?
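The bi-directional confidence predictor can be sketched in plain Python. This is a minimal scalar tanh RNN with fixed illustrative weights, just to show the forward/backward pass and concatenation; a real system would use learned LSTM or GRU parameters and vector states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_pass(xs, w_x=0.5, w_h=0.3):
    """Single-direction scalar tanh RNN over a feature sequence."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def birnn_confidences(xs, w_out=(1.0, 1.0), b_out=0.0):
    """Run forward and backward passes, concatenate the states, and
    apply a sigmoid output layer: one confidence per word."""
    fwd = rnn_pass(xs)
    bwd = rnn_pass(xs[::-1])[::-1]
    return [sigmoid(w_out[0] * f + w_out[1] * b + b_out)
            for f, b in zip(fwd, bwd)]
```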

SLIDE 5

Features

Audio Input Black-Box ASR t0 t1 t2 t3 quick brown fox c0=0.85 c1=0.1 c2=0.9 Acoustic Model Lexicon Language Model Figure 4: Detailed look at ASR features

Can we extract these features?

  • Sub-word level information
  • Competing hypotheses
  • Lattice features

SLIDE 6

Sub-word Unit Encoder

Figure 5: Word confidence classifier (the bi-directional RNN of Figure 3)

Figure 6: Sub-word feature extractor (the grapheme embeddings g_i^(j-1), g_i^(j), g_i^(j+1) of word i are fed to forward and backward RNN units, producing states z_i^(j) in each direction; the concatenated states are pooled by an attention mechanism attn(·) into a fixed-size feature g̃_i)

  • Given a lexicon, we can extract grapheme features
  • fox → { f, o, x }
  • Convert a variable length grapheme sequence into a fixed size
  • Deep learning to aggregate features
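The attention pooling attn(·) from Figure 6, which turns a variable-length grapheme sequence into the fixed-size feature g̃_i, might be sketched as follows. The dot-product scoring and dimensions are illustrative assumptions, not the paper's exact parameterisation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(grapheme_vecs, query):
    """Attention-weighted sum of per-grapheme vectors; the result has a
    fixed size regardless of how many graphemes the word has."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in grapheme_vecs]
    alphas = softmax(scores)
    dim = len(grapheme_vecs[0])
    return [sum(a * vec[d] for a, vec in zip(alphas, grapheme_vecs))
            for d in range(dim)]
```

Words of any length ("fox", "quick", ...) are thus reduced to one vector of the same dimensionality, which can be concatenated onto the word-level input features.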

SLIDE 7

Alternative Hypothesis Representations

An intermediate step in generating a one-best sequence is the generation of lattices.

Figure 7: Lattice (competing hypotheses such as quick/quit/weak/fast, brown/crown/young, and fox)

From lattices, we can obtain confusion networks by clustering arcs.

Figure 8: Confusion network (time slots t0..t3 with competing words per slot: quick/quit, brown/young/crown, fox)

How do we handle non-sequential models?
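The clustering step from lattice to confusion network can be sketched as time-overlap grouping. Real confusion-network construction also uses arc posteriors and phonetic similarity; this simplified illustration assigns each arc to the slot it overlaps most.

```python
def cluster_arcs(arcs, slots):
    """Assign each lattice arc (word, start, end) to the time slot
    with the greatest temporal overlap. Slots are (start, end) pairs."""
    def overlap(a, s):
        return max(0.0, min(a[2], s[1]) - max(a[1], s[0]))
    network = [set() for _ in slots]
    for arc in arcs:
        best = max(range(len(slots)), key=lambda i: overlap(arc, slots[i]))
        network[best].add(arc[0])
    return network
```

For the lattice of Figure 7, this groups quick/quit into the first slot, brown/young into the second, and fox into the third, mirroring Figure 8.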

SLIDE 8

Lattice Recurrent Neural Networks

A generalisation of bi-directional RNNs to handle multiple incoming arcs:

Figure 9: Lattice in which red nodes have multiple incoming arcs, while blue nodes have only one

Attention to learn relative importance [2]:

    →h_i = Σ_{j ∈ →N_i} α_j →h_j

Figure 10: Arc merging mechanism as implemented by LatticeRNN [3] (the hidden states h_1, ..., h_{N_i} of the incoming arcs are combined into a single state h_i^s, which is fed with input x_i to the RNN unit)
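The attention combination of incoming arc states can be sketched as a softmax-weighted sum. In the actual model the attention scores come from a learned scoring function; here they are given as inputs, which is an illustrative simplification.

```python
import math

def merge_incoming(hidden_states, scores):
    """Combine the hidden states of all incoming arcs at a lattice node
    into one state, weighting each by softmax-normalised attention
    scores (the scoring function itself is learned in practice)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(a * h[d] for a, h in zip(alphas, hidden_states))
            for d in range(dim)]
```

Nodes with a single incoming arc (the blue nodes of Figure 9) reduce to the identity, so this mechanism strictly generalises the bi-directional RNN over one-best sequences.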

SLIDE 9

Extracting Lattice Features

Figure 11: Arc matching (each word in the confusion network is matched to the corresponding arc in the lattice)

  • Match each confusion network arc to the corresponding lattice arc
  • What kind of features could we extract?
  • Acoustic and Language model scores
  • Lattice embeddings
  • Hypothesis density
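Matching a confusion-network word back to its lattice arc, so that acoustic and language model scores can be attached to the confidence features, can be sketched as matching on word identity plus time overlap. The exact matching criterion in the paper may differ; this is an illustrative sketch.

```python
def match_arc(word, start, end, lattice_arcs):
    """Return the lattice arc with the same word label and the largest
    time overlap, or None if no label matches.
    Arcs are (word, start, end, features) tuples."""
    def overlap(a):
        return max(0.0, min(end, a[2]) - max(start, a[1]))
    candidates = [a for a in lattice_arcs if a[0] == word]
    return max(candidates, key=overlap) if candidates else None
```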

SLIDE 10

Experiments (One-best)

Large gains are obtained by introducing additional information.

    Features                NCE      AUC
    word     words          0.0358   0.7496
             + duration     0.0541   0.7670
             + posteriors   0.2765   0.9033
             + mapping      0.2911   0.9121
    sub-word + embedding    0.2936   0.9127
             + duration     0.2944   0.9129
             + encoder      0.2978   0.9139

Table 1: Impact of word and sub-word features. IARPA BABEL Georgian (25 hours).
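The NCE (normalised cross entropy) metric in these tables measures how much the confidence scores reduce uncertainty relative to always predicting the overall word accuracy. A sketch of the standard definition (1 is perfect, 0 matches the constant baseline, negative is worse than it):

```python
import math

def nce(confidences, correct):
    """Normalised cross entropy of word confidences against 0/1
    correctness labels."""
    n = len(correct)
    p_c = sum(correct) / n  # overall fraction of correct words
    # Entropy of the constant baseline that always predicts p_c
    h_max = -sum(correct) * math.log2(p_c) \
            - (n - sum(correct)) * math.log2(1 - p_c)
    # Cross entropy of the actual confidence scores
    h_conf = -sum(math.log2(c) if ok else math.log2(1 - c)
                  for c, ok in zip(confidences, correct))
    return (h_max - h_conf) / h_max
```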

SLIDE 11

Experiments (Confusion Networks)

Significant gains from alternative hypotheses and basic lattice features.

    Features        NCE      AUC
    word (all)      0.2911   0.9121
    + confusions    0.2934   0.9201
    + sub-word      0.2998   0.9228
    + lattice       0.3004   0.9231

Table 2: Impact of competing hypothesis information. IARPA BABEL Georgian (25 hours).

SLIDE 12

Conclusion

  • Prevalence of black-box ASR
  • Limited ability to assess transcription reliability
  • Confidence estimates can be improved by providing available information

  • Deep learning approach for incorporating sub-word features
  • Deep learning framework for introducing lattice features

SLIDE 13

References

  • [1] G. Evermann and P.C. Woodland, "Posterior probability decoding, confidence estimation and system combination," 2000.
  • [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [3] Q. Li, P. M. Ness, A. Ragni, and M. J. F. Gales, "Bi-directional lattice recurrent neural networks for confidence estimation," in ICASSP, 2019.

SLIDE 14

Thank you

Figure 12: Source code: https://github.com/alecokas/BiLatticeRNN-Confidence