Meetings Research at ICSI
Barbara Peskin
29-30 January 2003, M4 Meeting, Sheffield

reporting on work of:
Don Baron, Sonali Bhagat, Hannah Carvey, Rajdip Dhillon, Dan Ellis, David Gelbart, Adam Janin, Ashley Krupski, Nelson Morgan, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, Chuck Wooters
– test on 10-minute excerpts from 2 meetings from each site
– evaluation included both close-talking and table-top recordings
– close-talking test used hand-segmented turns; far-field used automatic chopping
– no Meeting data was used to train the models!
– waveforms were downsampled to 8 kHz (for telephone bandwidth), as sketched below
– recognizer used gender-dependent models, feature normalization, VTLN, speaker adaptation (MLLR) and speaker-adaptive training (SAT), bigram lattice generation with trigram expansion, then interpolated class 4-gram LM N-best rescoring, … (fairly standard Hub 5 evaluation system)
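Purely as an illustration of the downsampling step, a minimal Python sketch using scipy and soundfile; the filename and the 16 kHz source rate are assumptions, not the evaluation's actual settings.

```python
# Hypothetical downsampling to telephone bandwidth; filename and 16 kHz
# source rate are assumed for illustration.
import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("meeting_chan0.wav")     # e.g. a 16 kHz recording
assert rate == 16000
audio_8k = resample_poly(audio, up=1, down=2)  # anti-aliased 16 kHz -> 8 kHz
sf.write("meeting_chan0_8k.wav", audio_8k, samplerate=8000)
```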
– insufficient gains from full system to justify added complexity
– train Meeting LM on 270k words of data from 28 ICSI meetings (excluding RT-02’s dev & eval meetings)
– include all words from these meetings in recognizer’s vocabulary (~1200 new words)
– interpolate Meeting LM with SWB-trained LM
– choose interpolation weights by minimizing perplexity on 2 ICSI RT-02 dev meetings (sketched below)
– test on 2 ICSI eval meetings using simplified recognition protocol
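A minimal sketch of the weight selection: linearly interpolate the two LMs' word probabilities and grid-search the mixture weight for lowest dev-set perplexity. The `p_meeting` / `p_swb` callables are hypothetical stand-ins for full n-gram models; a real system would use its LM toolkit's mixture-model support.

```python
# Sketch of perplexity-based interpolation-weight selection. p_meeting and
# p_swb are assumed callables returning P(word | context) under each LM.
import math

def perplexity(dev_pairs, lam, p_meeting, p_swb):
    """Perplexity of the mixture  p = lam*p_meeting + (1-lam)*p_swb."""
    log_prob = 0.0
    for context, word in dev_pairs:              # (history, next-word) pairs
        p = lam * p_meeting(word, context) + (1 - lam) * p_swb(word, context)
        log_prob += math.log(max(p, 1e-12))      # floor guards degenerate lam
    return math.exp(-log_prob / len(dev_pairs))

def best_weight(dev_pairs, p_meeting, p_swb, steps=20):
    """Grid-search the mixture weight on held-out dev data."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda lam: perplexity(dev_pairs, lam, p_meeting, p_swb))
```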
– Model as convolutive distortion (reverb) followed by additive distortion (background noise)
– For additive noise: used Wiener filtering approach, as above
– For reverb: used long-term log spectral subtraction (similar to CMS but with a longer window); see the sketch below
– See [D. Gelbart & N. Morgan, ICSLP-2002] for details
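A rough sketch of the long-term log spectral subtraction idea, operating on a precomputed magnitude spectrogram; the window length and sliding-mean implementation are illustrative assumptions, not the settings of the Gelbart & Morgan paper.

```python
# Long-term log spectral subtraction sketch: like CMS, but the mean is taken
# over the log STFT magnitudes with a much longer window. Parameters assumed.
import numpy as np

def log_spectral_subtraction(stft_mag, win_frames=200):
    """stft_mag: (frames, bins) magnitude spectrogram.
    Subtract a sliding long-term mean (per frequency bin) in the log domain."""
    log_mag = np.log(stft_mag + 1e-10)           # floor to avoid log(0)
    kernel = np.ones(win_frames) / win_frames    # long-window moving average
    long_mean = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, log_mag)
    return np.exp(log_mag - long_mean)           # back to normalized magnitudes
```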
– “PDA” performance much worse, but above techniques greatly reduced the difference
– error rates roughly comparable after processing as above
[Chart: WER on Meeting Digits for near- and far-field microphones, comparing baseline, noise reduction, log spectral subtraction, and both combined]
– First, detect speech region candidates on each channel separately, using a standard two-class HMM with minimum duration constraints
– Then compute cross-correlations between channels and threshold them to suppress detections due to cross-talk
– Key feature is normalization of energy features on each channel not only by the channel minimum but also by the average across all channels (both cues sketched below)
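A toy sketch of those two cues under assumed array shapes; the correlation threshold is a placeholder, not a value from the Pfau, Ellis & Stolcke system.

```python
# Sketches of (a) the cross-channel energy normalization and (b) the
# cross-correlation test for cross-talk. Shapes and threshold are assumed.
import numpy as np

def normalize_energies(log_energy):
    """log_energy: (channels, frames) per-frame log energies.
    Normalize each channel by its own minimum (channel floor) and by the
    per-frame mean across channels, so audio that is loud on *every*
    channel (likely cross-talk) is de-emphasized."""
    floored = log_energy - log_energy.min(axis=1, keepdims=True)
    return floored - floored.mean(axis=0, keepdims=True)

def is_crosstalk(seg_a, seg_b, threshold=0.7):
    """Flag a candidate segment on channel A as cross-talk when its waveform
    correlates strongly with the same time span on channel B."""
    return abs(np.corrcoef(seg_a, seg_b)[0, 1]) > threshold
```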
– Frame error rate for speech/nonspeech detection: 18.6% → 13.7% → 12.0%
– WER for SWB-trained recognizer: within 10% (rel) of hand-segmented result (cf. unsegmented waveforms: 75% higher, largely due to cross-talk insertions)
Note: details can be found in [T. Pfau, D. Ellis, and A. Stolcke, ASRU-2001].
– Hidden event language model built from n-grams over words and event labels
– Prosodic model built from features (phone & pause durations, pitch, energy) extracted within a window around each interword boundary; classifies via decision trees
– Model combination using an HMM defined from the hidden event LM and incorporating the prosodic model’s predictions as local likelihoods (see the sketch below)
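A simplified sketch of that combination: a two-state boundary/no-boundary HMM whose transition scores come from the hidden-event LM and whose observation scores are the prosodic posteriors divided by the class priors (the usual posterior-to-likelihood conversion). All array contents here are placeholders, not values from the actual system.

```python
# Viterbi decode over boundary events, combining hidden-event LM transition
# log-probs with prosodic-tree posteriors. Inputs are assumed placeholders.
import numpy as np

def viterbi_events(lm_trans, pros_post, priors):
    """lm_trans : (T, 2, 2) LM transition log-probs, indexed [t, prev, cur]
    pros_post : (T, 2) prosodic posteriors per interword boundary
    priors    : (2,) class priors (posterior/prior = scaled likelihood)
    Returns the best event sequence (0 = no boundary, 1 = boundary)."""
    obs = np.log(pros_post / priors)
    T = len(obs)
    score = obs[0].copy()
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + lm_trans[t]   # score[prev] + trans[prev, cur]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + obs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```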
– for true words: LM better than prosody
– for recognized words: prosody better than LM
– combining models always helps, even when one is much better
Note: details can be found in [D. Baron, E. Shriberg, and A. Stolcke, ICSLP-2002].
– Automatically transcribing natural, spontaneous multi-party speech
– Enriching language models to handle new / specialized topics
– Detecting speech activity, segmenting speech stream, labeling talkers
– Dealing with far-field acoustics
– Moving beyond the words to model