SLIDE 1

29-30 January 2003 M4 Meeting, Sheffield 1

Meetings Research at ICSI

Barbara Peskin

reporting on work of:

Don Baron, Sonali Bhagat, Hannah Carvey, Rajdip Dhillon, Dan Ellis, David Gelbart, Adam Janin, Ashley Krupski, Nelson Morgan, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, Chuck Wooters

International Computer Science Institute, Berkeley, CA

SLIDE 2

Overview

  • Automatic Speech Recognition (ASR) Research

– Baseline performance
– Language modeling exploration
– Far-field acoustics
– Speech activity detection

  • Sentence Segmentation & Disfluency Detection
  • Dialogue Acts: Annotation & Automatic Modeling
SLIDE 3

ASR Research: Baselines

Meeting data formed a track of NIST’s RT-02 evaluation

  • Eval data (and limited dev) was available from 4 sites

– test on 10-minute excerpts from 2 meetings from each site
– only 5 transcribed meetings for dev (not all sites represented)
– evaluation included both close-talking and table-top recordings
– close-talking test used hand-segmented turns; far-field used automatic chopping

  • We used a Switchboard-trained recognizer from SRI

– no Meeting data was used to train the models!
– waveforms were downsampled to 8 kHz (for telephone bandwidth)
– recognizer used gender-dependent models, feature normalization, VTLN, speaker adaptation (MLLR) and speaker-adaptive training (SAT), bigram lattice generation with trigram expansion, then interpolated class 4-gram LM N-best rescoring, … (fairly standard Hub 5 evaluation system)

SLIDE 4

Baselines (cont’d)

word error rates (WER) on Meeting track of RT-02:

  • Performance on close-talking mics quite comparable to SWB
  • Table just shows bottom-line numbers, but incremental improvements at each recognition stage parallel those on SWB
  • Overall, far-field WERs about twice as high as close-talking
  • CMU data worst for close-talking (they used lapel mics, not headsets), but the difference disappears on far-field

Data source ⇒        ICSI   CMU    LDC    NIST   all    SWB
close-talking mic    25.9   47.9   36.8   35.2   36.0   30.2
table-top mic *      53.6   64.5   69.7   61.6   61.6   n/a

* table-top mic system was somewhat simplified (bigram LM, etc.)
  – insufficient gains from full system to justify added complexity

SLIDE 5

A Language Modeling Experiment

Problem:

RT-02 recognizer does not speak the Meetings language (many OOV words, unknown n-grams, etc.)

Experiment:

– train Meeting LM on 270k words of data from 28 ICSI meetings (excluding RT-02’s dev & eval meetings)
– include all words from these meetings in recognizer’s vocabulary (~1200 new words)
– interpolate Meeting LM with SWB-trained LM
– choose interpolation weights by minimizing perplexity on 2 ICSI RT-02 dev meetings
– test on 2 ICSI eval meetings using simplified recognition protocol

         SWB LM   Interpolated LM
WER      30.6%    28.4%
OOV       1.5%     0.5%
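The weight-tuning step can be sketched in a few lines. The unigram tables and dev tokens below are toy stand-ins invented for illustration; the real system interpolated full SWB- and Meeting-trained n-gram LMs and tuned on the 2 ICSI dev meetings:

```python
import math

# Toy unigram LMs standing in for the SWB- and Meeting-trained models
# (all probabilities here are made up for illustration).
swb_lm     = {"the": 0.7, "meeting": 0.1, "agenda": 0.1, "phone": 0.1}
meeting_lm = {"the": 0.1, "meeting": 0.5, "agenda": 0.3, "phone": 0.1}

def interpolate(lm_a, lm_b, lam):
    """Linear interpolation: p(w) = lam*p_a(w) + (1-lam)*p_b(w)."""
    return {w: lam * lm_a[w] + (1 - lam) * lm_b[w] for w in lm_a}

def perplexity(lm, tokens):
    """Perplexity of a probability table on a token sequence."""
    log_prob = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Stand-in for held-out dev data (the 2 ICSI RT-02 dev meetings).
dev_tokens = ["the", "meeting", "the", "agenda", "the"]

# Grid-search the weight on the Meeting LM that minimizes dev perplexity.
best_lam = min((l / 100 for l in range(101)),
               key=lambda l: perplexity(interpolate(meeting_lm, swb_lm, l),
                                        dev_tokens))
```

With each LM better on some of the dev tokens, the optimum lands strictly between 0 and 1, and the interpolated model beats both components.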

SLIDE 6

Far-Field Acoustics

  • Far-field performance was improved by applying Wiener filtering techniques developed for the Aurora program

– On RT-02 dev set, WER dropped 64.1% → 61.7%

  • Systematically addressed far-field acoustics using Digits Task

– Model as convolutive distortion (reverb) followed by additive distortion (background noise)
– For additive noise: used Wiener filtering approach, as above
– For reverb: used long-term log spectral subtraction (similar to CMS but with a longer window)
– See [D. Gelbart & N. Morgan, ICSLP-2002] for details

  • Also explored PZM (high-quality) vs “PDA” (cheap mic) performance

– “PDA” performance much worse, but the above techniques greatly reduced the difference
– Error rates roughly comparable after processing as above

WER on Mtg Digits   baseline   noise reducn   log spec subtr   both
near                 4.1%       3.6%           3.1%             2.7%
far                 26.3%      24.8%           8.2%             7.2%
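The two processing stages (Wiener-style noise reduction for additive noise, long-term log spectral subtraction for reverb) can be sketched as below; the array shapes, window length, and gain flooring are illustrative assumptions, not the Gelbart & Morgan implementation:

```python
import numpy as np

def noise_reduce(power_spec, noise_est, floor=0.01):
    """Wiener-style noise reduction sketch: attenuate each frequency bin
    by an SNR-based gain H = SNR/(SNR+1), with flooring to avoid
    musical noise.  power_spec: (n_frames, n_bins); noise_est: (n_bins,)."""
    snr = np.maximum(power_spec / noise_est - 1.0, floor)
    gain = snr / (snr + 1.0)
    return gain * power_spec

def long_term_log_spec_subtraction(log_spec, window=100):
    """Long-term log spectral subtraction sketch: subtract a long-window
    running mean of the log spectrum from each frame, like cepstral mean
    subtraction but with a longer window, so slowly varying convolutive
    distortion (channel, reverberation) cancels.
    log_spec: (n_frames, n_bins) log magnitude spectra."""
    out = np.empty_like(log_spec)
    for t in range(len(log_spec)):
        lo = max(0, t - window + 1)
        out[t] = log_spec[t] - log_spec[lo:t + 1].mean(axis=0)
    return out
```

A fixed channel response adds a constant offset in the log spectral domain, which the running-mean subtraction removes exactly.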

SLIDE 7

Speech Activity Detection

Detecting regions of speech activity is a challenge for Meeting data, even on close-talking channels (due to cross-talk, etc.)

  • Standard echo cancellation techniques ineffective (due to head movement)
  • We devised an algorithm which performs SAD on each close-talking channel, using information from all recorded channels

– First, detect speech region candidates on each channel separately, using a standard two-class HMM with minimum duration constraints
– Then compute cross-correlations between channels and threshold them to suppress detections due to cross-talk
– Key feature is normalization of energy features on each channel, not only by the channel minimum but also by the average across all channels

  • Greatly reduces error rates

– Frame error rate for speech/nonspeech detection: 18.6% → 13.7% → 12.0%
– WER for SWB-trained recognizer: within 10% (rel.) of the hand-segmented result (cf. unsegmented waveforms: 75% higher, largely due to cross-talk insertions)

Note: details can be found in [T. Pfau, D. Ellis, and A. Stolcke, ASRU-2001].
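A toy version of the two energy normalizations plus cross-talk suppression; a simple threshold stands in for the two-class HMM with duration constraints, and all constants are invented for illustration:

```python
import numpy as np

def speech_activity(frame_energy_db, thresh=6.0, xtalk_margin=3.0):
    """frame_energy_db: (n_channels, n_frames) log energies, one row per
    close-talking channel.  Sketch of the two normalizations the slide
    describes: each channel is normalized by its own minimum (channel
    floor) and compared against the per-frame average across channels;
    frames where another channel is much stronger are suppressed as
    cross-talk.  Thresholds here are toy values."""
    e = np.asarray(frame_energy_db, dtype=float)
    # channel-min normalization: energy above each channel's noise floor
    above_floor = e - e.min(axis=1, keepdims=True)
    # cross-channel normalization: energy relative to the frame average
    rel = e - e.mean(axis=0, keepdims=True)
    candidate = above_floor > thresh
    # suppress cross-talk: drop frames clearly dominated by another channel
    dominated = rel < (rel.max(axis=0, keepdims=True) - xtalk_margin)
    return candidate & ~dominated
```

In a two-channel example where speaker A's speech bleeds into B's microphone, the bleed clears the energy threshold but is suppressed by the cross-channel comparison.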

SLIDE 8

“Hidden Event” Modeling

  • Detect events implicit in the speech stream (e.g. sentence and topic breaks, disfluency locations, …) using prosodic & lexical cues

  • Developed by Shriberg, Stolcke, et al. at SRI (for topic and sentence segmentation of Broadcast News and Switchboard)

  • 3 main ingredients

– Hidden event language model built from n-grams over words and event labels
– Prosodic model built from features (phone & pause durations, pitch, energy) extracted within a window around each interword boundary; classifies via decision trees
– Model combination using an HMM defined from the hidden event LM and incorporating observation likelihoods for states from prosodic decision tree posteriors
  • Meetings work used parallel feature databases (true words, ASR output) to detect sentence boundaries and disfluencies

– for true words: LM better than prosody
– for recognized words: prosody better than LM
– combining models always helps, even when one is much better

Note: details can be found in [D. Baron, E. Shriberg, and A. Stolcke, ICSLP-2002].
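The model combination can be sketched as a small Viterbi decode over boundary events: transitions come from the hidden-event LM, and prosodic decision-tree posteriors become scaled state likelihoods by dividing out the event priors (the usual hybrid-HMM trick). The event inventory and all numbers below are invented for illustration:

```python
import numpy as np

EVENTS = ["no-boundary", "sentence-boundary"]   # toy event inventory

def viterbi_events(lm_trans, prosody_post, prior):
    """lm_trans[i, j]    = P(event j follows event i), from the hidden-event LM
    prosody_post[t, j] = tree posterior P(event j | prosodic features at t)
    prior[j]           = prior probability of event j
    Returns the most likely event sequence over the T interword boundaries."""
    likelihood = prosody_post / prior            # posterior/prior = scaled likelihood
    log_a = np.log(lm_trans)
    delta = np.log(prior) + np.log(likelihood[0])
    back = np.zeros((len(likelihood), len(prior)), dtype=int)
    for t in range(1, len(likelihood)):
        scores = delta[:, None] + log_a          # scores[i, j]: best path into j via i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(likelihood[t])
    path = [int(delta.argmax())]
    for t in range(len(likelihood) - 1, 0, -1):  # backtrace
        path.append(int(back[t, path[-1]]))
    return [EVENTS[i] for i in reversed(path)]
```

With a boundary-averse LM and a strong prosodic cue at the middle boundary, the decode places a single sentence boundary there.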

SLIDE 9

Dialogue Acts

To understand what’s going on in a meeting, we need more than the words ⇒ DAs tell the role of an utterance in the discourse; use them to spot topic shifts, floor grabbing, agreement / disagreement, etc.

e.g.  Yeah. (as backchannel)
      Yeah. (as response)
      Yeah? (as question)
      Yeah! (as exclamation)

  • Hand labeling now with goal of automatic labeling later
  • Using set of 58 tags refined for this work, based on SWB-DAMSL conventions
  • Using cues from both prosody and words
  • Currently more than 20 meetings (over 20 hours of speech) hand labeled
  • Started work on automatic modeling (in collaboration with SRI)

A draft of the DA spec is available at our Meetings website: http://www.icsi.berkeley.edu/Speech/mr/

SLIDE 10

Summary

  • Meetings support an amazing range of speech & language research (nearly “ASR complete”)

  • We are just starting to tap some of the possibilities, including

– Automatically transcribing natural, spontaneous multi-party speech
– Enriching language models to handle new / specialized topics
– Detecting speech activity, segmenting the speech stream, labeling talkers
– Dealing with far-field acoustics
– Moving beyond the words to model
  • hidden events such as sentence breaks and disfluencies
  • dialogue acts and discourse structure
  • We look forward to continued collaboration with the M4 community to tackle the challenges posed by Meeting data