Microphone Array Processing - M4 Progress Report - Iain McCowan



SLIDE 1

Microphone Array Processing

M4 Progress Report Iain McCowan January 28, 2003

SLIDE 2

Objective and Aims

  • Objective
  • to demonstrate viability and advantages of

microphone arrays for speech acquisition in meetings

  • Aims

1. measurement and analysis of speaker turns
2. benchmark microphone array against close-talking microphones for speech recognition
3. precise tracking of people

SLIDE 3

Progress in the past 6 months

  • 1. measurement and analysis of speaker turns
  • location based speaker segmentation
  • 2. speech recognition evaluation
  • comparison between lapel, array and single

distant microphone on small vocab task

  • 3. audio-visual speaker tracking (Daniel)
SLIDE 4

measuring and analysing speaker turns

speaker turn segmentation is important for

  • selecting audio for playback
  • speaker recognition
  • speaker adaptation for recognition
  • segmenting speech transcriptions

also...

  • analysis of speaker turns could be useful to detect higher

level dialogue actions (monologues, general discussion, ...)

but traditional techniques struggle in meetings

  • multiple speakers, significant proportion of overlapping

speech (~15% of words)

SLIDE 5

location based speaker segmentation

Assumptions

  • distinct source locations can be associated with distinct

speakers

  • speech sounds dominate others in meetings

Proposed Technique

  • Measurement : source location of principal sounds

represented by microphone pair time delays as features (vector with 1 value per microphone pair)

  • Model : GMM distribution characterising centroid of known

speaker location (set manually from vector of theoretical delay values)

  • System : incorporate GMM’s into minimum duration HMM

for all (4) speakers, segment using Viterbi decoding.

  • to appear in Lathoud, ICASSP 03
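The slide names the delay features but not the estimator used to obtain them; a common choice for microphone-pair time delays is GCC-PHAT cross-correlation, sketched below. The function name and details are illustrative, not taken from the report.

```python
import numpy as np

def gcc_phat_delay(x, y, fs):
    """Estimate how many seconds y lags x, using GCC-PHAT.

    Returns one delay value; stacking one such value per microphone
    pair gives the per-frame feature vector described above.
    """
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)
    R /= np.abs(R) + 1e-12             # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs                  # delay in seconds (integer-sample resolution)
```

For a circular 8-element array there are 28 microphone pairs, so each frame would yield a 28-dimensional delay vector.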
SLIDE 6

location based speaker segmentation

experiments

  • data: 20 minutes, including 5 minutes from each of 4

distinct speaker locations, spliced together to give segments of between 5 and 20 seconds.

  • evaluation:
  • frame accuracy (FA, % of correctly labeled frames)
  • precision, recall, F-measure
  • results

system     FA      precision   recall   F
location   99.1%   0.98        0.98     0.98
LPCC       88.3%   0.81        0.73     0.77
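The frame-level measures above can be computed directly from reference and hypothesis label sequences; a minimal sketch (names are illustrative, and this is not the evaluation code used in the paper):

```python
def frame_metrics(ref, hyp, target):
    """Frame accuracy over all labels, plus precision/recall/F for one class.

    ref, hyp: per-frame label sequences of equal length.
    target:   the speaker (class) being scored.
    """
    assert len(ref) == len(hyp)
    fa = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
    tp = sum(1 for r, h in zip(ref, hyp) if r == target and h == target)
    fp = sum(1 for r, h in zip(ref, hyp) if r != target and h == target)
    fn = sum(1 for r, h in zip(ref, hyp) if r == target and h != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return fa, precision, recall, f
```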

SLIDE 7

location based speaker segmentation

extension to dual speaker overlap segments

  • manually constructed HMM of alternating short

speaker turns. 6 additional classes in the HMM (+4 individual speaker classes)

  • data: same but each speaker change had 5-15

seconds of overlap

  • results

test set     FA               precision   recall   F
no overlap   99.1%            0.98        0.98     0.98
overlap      94.1% (85.5%)    0.94        0.86     0.90
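One standard way to realise the minimum-duration HMM used in these systems is to expand each speaker state into min_dur sub-states and run Viterbi on the expanded graph. The sketch below assumes per-frame GMM log-likelihoods as input; the switch penalty and all names are illustrative, not the actual M4 decoder:

```python
import numpy as np

def viterbi_min_duration(logprob, min_dur, switch_penalty=-1.0):
    """Viterbi decoding where every state must be held >= min_dur frames.

    logprob: (T, S) per-frame log-likelihoods, e.g. from one GMM per
             known speaker location.  Returns the best state per frame.
    Each state is expanded into min_dur sub-states; a new state can
    only be entered from the final sub-state of another state.
    """
    T, S = logprob.shape
    D = min_dur
    NEG = -1e30
    delta = np.full((S, D), NEG)       # best score ending in (state, sub-state)
    delta[:, 0] = logprob[0]
    back = np.zeros((T, S, D, 2), dtype=int)
    for t in range(1, T):
        new = np.full((S, D), NEG)
        for s in range(S):
            for d in range(D):
                cands = []
                if d == 0:
                    # enter s from the final sub-state of a different state
                    for s2 in range(S):
                        if s2 != s:
                            cands.append((delta[s2, D - 1] + switch_penalty, s2, D - 1))
                else:
                    cands.append((delta[s, d - 1], s, d - 1))  # advance within s
                if d == D - 1:
                    cands.append((delta[s, D - 1], s, D - 1))  # dwell once min_dur met
                score, ps, pd = max(cands)
                new[s, d] = score + logprob[t, s]
                back[t, s, d] = (ps, pd)
        delta = new
    # end only in a final sub-state so the last segment also meets min_dur
    s = int(np.argmax(delta[:, D - 1]))
    d = D - 1
    path = [s]
    for t in range(T - 1, 0, -1):
        s, d = back[t, s, d]
        path.append(int(s))
    return path[::-1]
```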

SLIDE 8

measuring and analysing speaker turns

extensions...

  • current system measures activity of each speaker location
  • simpler detection of overlap
  • ongoing work
  • remove limiting assumptions
  • remove a priori knowledge of locations
  • automatic clustering of locations
  • allow a many-to-many relationship between speakers and locations
  • couple with speaker clustering/identification based on standard LPCC

features

  • not all sounds are speech
  • classify detected segments as speech/non-speech
  • analysis of speaker turns
  • recognition of higher-level structure, such as overlap, dialogues,

monologues, discussions, etc...

  • to be discussed more in meeting segmentation work tomorrow...
SLIDE 9

speech recognition evaluation

  • data collection
  • re-recorded Numbers 95 corpus in meeting room, across a

circular 8-element microphone array, and 3 lapel microphones

  • loud-speakers used (lapel microphones attached to

material just below speaker)

  • scenarios

1. single speaker (~ 20dB)
2. one overlapping speaker (2 different locations) (~ 0dB)
3. two overlapping speakers (~ -3dB)

  • will be made publicly available in conjunction with OGI
SLIDE 10

SLIDE 11

speech recognition evaluation

  • GMM/HMM recognition system (HTK)
  • in each case, adaptation from clean models using development set
  • first results (baseline on clean test set 6.3% WER)
  • for single speaker in normal conditions
  • lapel microphone and microphone array give 7% word error rate
  • single table-top microphone gives 10% word error rate
  • with a competing speaker (overlapping speech) at same level
  • lapel microphone gives 27% word error rate
  • microphone array gives 19% word error rate
  • single table-top microphone gives 60% word error rate
  • two competing speakers
  • lapel 35%, array 26%, single table-top 74%
  • indicates that array can be as good as, or better than lapel

microphones for speech recognition

  • but, comparing with unenhanced lapel at this point...
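The WER figures above follow the standard edit-distance definition (substitutions, deletions and insertions against the reference word sequence). For reference, a minimal implementation; in practice HTK computes this during scoring:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = i                   # i deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = j                   # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(r)
```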
SLIDE 12

speech recognition evaluation

SLIDE 13

speech recognition

for more details see Moore, ICASSP 03

future work

  • benchmark against lapel microphones on large

vocabulary task (M4 data, ICSI data???)

  • additional (post-beamforming) enhancement in case of

detected overlapping speech (dual channel techniques)
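These comparisons use the beamformed array output. The report does not specify the beamformer here, but the simplest variant, delay-and-sum, can be sketched as follows (integer-sample circular shifts for brevity; a real implementation would use fractional-delay filtering and zero padding):

```python
import numpy as np

def delay_and_sum(channels, delays, fs):
    """Align each microphone channel by its steering delay and average.

    channels: list of equal-length 1-D arrays (one per microphone)
    delays:   per-channel arrival delays in seconds toward the source
    """
    out = np.zeros(len(channels[0]))
    for ch, tau in zip(channels, delays):
        out += np.roll(ch, -int(round(tau * fs)))  # undo the arrival delay
    return out / len(channels)
```

Coherent signals from the steered direction add in phase while noise and interference from other directions are attenuated, which is why the array degrades more gracefully than a single distant microphone under overlap.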

SLIDE 14

audio-visual speaker tracking

use audio source localisation to help a visual

tracker to

  • initialise
  • recover from tracking errors / visual occlusion

see Daniel’s presentation...

SLIDE 15

summary

microphone arrays proving to be useful devices for

recording and analysing meetings

  • facilitate accurate speaker turn segmentation (esp. multi-

speaker overlap)

  • comparable speech recognition performance to

(unenhanced) lapel microphones in ideal case, better in case of noise (eg overlap speech)

  • accurate tracking of speakers

also...

  • developing prototype stand-alone real-time system (8

inputs, 8 outputs, Analog Devices TigerSHARC, Firewire o/p)
SLIDE 16

relevance to M4 partners

collaboration

  • sharing of location-based speech activity features to

facilitate multi-modal research

  • provide ‘mixed’ single audio channel as alternative to simple

addition of lapel channels

comparison

  • array vs close-talking microphone speaker turn

segmentation

  • provide array output signal for comparison with lapels on

(eventual) large vocabulary recognition system

  • compare with close-talking microphone enhancement (??)

during overlap segments on Numbers task