Microphone Array Processing - M4 Progress Report - Iain McCowan



SLIDE 1

Microphone Array Processing

M4 Progress Report Iain McCowan January 28, 2003

SLIDE 2

Objective and Aims

  • Objective
  • to demonstrate viability and advantages of

microphone arrays for speech acquisition in meetings

  • Aims

1. measurement and analysis of speaker turns
2. benchmark microphone array against close-talking microphones for speech recognition
3. precise tracking of people

SLIDE 3

Progress in the past 6 months

  • 1. measurement and analysis of speaker turns
  • location based speaker segmentation
  • 2. speech recognition evaluation
  • comparison between lapel, array and single

distant microphone on small vocab task

  • 3. audio-visual speaker tracking (Daniel)
SLIDE 4

measuring and analysing speaker turns

speaker turn segmentation is important for

  • selecting audio for playback
  • speaker recognition
  • speaker adaptation for recognition
  • segmenting speech transcriptions

also...

  • analysis of speaker turns could be useful to detect higher

level dialogue actions (monologues, general discussion, ...)

but traditional techniques struggle in meetings

  • multiple speakers, significant proportion of overlapping

speech (~15% of words)

SLIDE 5

location based speaker segmentation

Assumptions

  • distinct source locations can be associated with distinct

speakers

  • speech sounds dominate others in meetings

Proposed Technique

  • Measurement : source location of principal sounds

represented by microphone pair time delays as features (vector with 1 value per microphone pair)

  • Model : GMM distribution characterising centroid of known

speaker location (set manually from vector of theoretical delay values)

  • System : incorporate GMM’s into minimum duration HMM

for all (4) speakers, segment using Viterbi decoding.

  • to appear in Lathoud, ICASSP 03
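The slide names the delay features but not the estimator used to obtain them; a common choice for microphone-pair time delays is GCC-PHAT cross-correlation, sketched below. The function name and details are illustrative, not taken from the report.

```python
import numpy as np

def gcc_phat_delay(x, y, fs):
    """Estimate how many seconds y lags x, using GCC-PHAT.

    Returns one delay value; stacking one such value per microphone
    pair gives the per-frame feature vector described above.
    """
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)
    R /= np.abs(R) + 1e-12             # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs                  # delay in seconds (integer-sample resolution)
```

For a circular 8-element array there are 28 microphone pairs, so each frame would yield a 28-dimensional delay vector.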
SLIDE 6

location based speaker segmentation

experiments

  • data: 20 minutes, including 5 minutes from each of 4

distinct speaker locations, spliced together to give segments of between 5 and 20 seconds.

  • evaluation:
  • frame accuracy (FA, % of correctly labeled frames)
  • precision, recall, F-measure
  • results

system     FA      precision   recall   F
location   99.1%   0.98        0.98     0.98
LPCC       88.3%   0.81        0.73     0.77
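The frame-level measures above can be computed directly from reference and hypothesis label sequences; a minimal sketch (names are illustrative, and this is not the evaluation code used in the paper):

```python
def frame_metrics(ref, hyp, target):
    """Frame accuracy over all labels, plus precision/recall/F for one class.

    ref, hyp: per-frame label sequences of equal length.
    target:   the speaker (class) being scored.
    """
    assert len(ref) == len(hyp)
    fa = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
    tp = sum(1 for r, h in zip(ref, hyp) if r == target and h == target)
    fp = sum(1 for r, h in zip(ref, hyp) if r != target and h == target)
    fn = sum(1 for r, h in zip(ref, hyp) if r == target and h != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return fa, precision, recall, f
```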

SLIDE 7

location based speaker segmentation

extension to dual speaker overlap segments

  • manually constructed HMM of alternating short

speaker turns. 6 additional classes in the HMM (+4 individual speaker classes)

  • data: same but each speaker change had 5-15

seconds of overlap

  • results

test set     FA               precision   recall   F
no overlap   99.1%            0.98        0.98     0.98
overlap      94.1% (85.5%)    0.94        0.86     0.90
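One standard way to realise the minimum-duration HMM used in these systems is to expand each speaker state into min_dur sub-states and run Viterbi on the expanded graph. The sketch below assumes per-frame GMM log-likelihoods as input; the switch penalty and all names are illustrative, not the actual M4 decoder:

```python
import numpy as np

def viterbi_min_duration(logprob, min_dur, switch_penalty=-1.0):
    """Viterbi decoding where every state must be held >= min_dur frames.

    logprob: (T, S) per-frame log-likelihoods, e.g. from one GMM per
             known speaker location.  Returns the best state per frame.
    Each state is expanded into min_dur sub-states; a new state can
    only be entered from the final sub-state of another state.
    """
    T, S = logprob.shape
    D = min_dur
    NEG = -1e30
    delta = np.full((S, D), NEG)       # best score ending in (state, sub-state)
    delta[:, 0] = logprob[0]
    back = np.zeros((T, S, D, 2), dtype=int)
    for t in range(1, T):
        new = np.full((S, D), NEG)
        for s in range(S):
            for d in range(D):
                cands = []
                if d == 0:
                    # enter s from the final sub-state of a different state
                    for s2 in range(S):
                        if s2 != s:
                            cands.append((delta[s2, D - 1] + switch_penalty, s2, D - 1))
                else:
                    cands.append((delta[s, d - 1], s, d - 1))  # advance within s
                if d == D - 1:
                    cands.append((delta[s, D - 1], s, D - 1))  # dwell once min_dur met
                score, ps, pd = max(cands)
                new[s, d] = score + logprob[t, s]
                back[t, s, d] = (ps, pd)
        delta = new
    # end only in a final sub-state so the last segment also meets min_dur
    s = int(np.argmax(delta[:, D - 1]))
    d = D - 1
    path = [s]
    for t in range(T - 1, 0, -1):
        s, d = back[t, s, d]
        path.append(int(s))
    return path[::-1]
```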

SLIDE 8

measuring and analysing speaker turns

extensions...

  • current system measures activity of each speaker location
  • simpler detection of overlap
  • ongoing work
  • remove limiting assumptions
  • remove a priori knowledge of locations
  • automatic clustering of locations
  • allow a many-to-many relationship between speakers and locations
  • couple with speaker clustering/identification based on standard LPCC

features

  • not all sounds are speech
  • classify detected segments as speech/non-speech
  • analysis of speaker turns
  • recognition of higher-level structure, such as overlap, dialogues,

monologues, discussions, etc...

  • to be discussed more in meeting segmentation work tomorrow...
SLIDE 9

speech recognition evaluation

  • data collection
  • re-recorded Numbers 95 corpus in meeting room, across a

circular 8-element microphone array, and 3 lapel microphones

  • loud-speakers used (lapel microphones attached to

material just below speaker)

  • scenarios

1. single speaker (~ 20dB)
2. one overlapping speaker (2 different locations) (~ 0dB)
3. two overlapping speakers (~ -3dB)

  • will be made publicly available in conjunction with OGI
SLIDE 10

SLIDE 11

speech recognition evaluation

  • GMM/HMM recognition system (HTK)
  • in each case, adaptation from clean models using development set
  • first results (baseline on clean test set 6.3% WER)
  • for single speaker in normal conditions
  • lapel microphone and microphone array give 7% word error rate
  • single table-top microphone gives 10% word error rate
  • with a competing speaker (overlapping speech) at same level
  • lapel microphone gives 27% word error rate
  • microphone array gives 19% word error rate
  • single table-top microphone gives 60% word error rate
  • two competing speakers
  • lapel 35%, array 26%, single table-top 74%
  • indicates that array can be as good as, or better than lapel

microphones for speech recognition

  • but, comparing with unenhanced lapel at this point...
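The WER figures above follow the standard edit-distance definition (substitutions, deletions and insertions against the reference word sequence). For reference, a minimal implementation; in practice HTK computes this during scoring:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = i                   # i deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = j                   # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(r)
```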
SLIDE 12

speech recognition evaluation

SLIDE 13

speech recognition

for more details see Moore, ICASSP 03

future work

  • benchmark against lapel microphones on large

vocabulary task (M4 data, ICSI data???)

  • additional (post-beamforming) enhancement in case of

detected overlapping speech (dual channel techniques)
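These comparisons use the beamformed array output. The report does not specify the beamformer here, but the simplest variant, delay-and-sum, can be sketched as follows (integer-sample circular shifts for brevity; a real implementation would use fractional-delay filtering and zero padding):

```python
import numpy as np

def delay_and_sum(channels, delays, fs):
    """Align each microphone channel by its steering delay and average.

    channels: list of equal-length 1-D arrays (one per microphone)
    delays:   per-channel arrival delays in seconds toward the source
    """
    out = np.zeros(len(channels[0]))
    for ch, tau in zip(channels, delays):
        out += np.roll(ch, -int(round(tau * fs)))  # undo the arrival delay
    return out / len(channels)
```

Coherent signals from the steered direction add in phase while noise and interference from other directions are attenuated, which is why the array degrades more gracefully than a single distant microphone under overlap.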

SLIDE 14

audio-visual speaker tracking

use audio source localisation to help a visual

tracker to

  • initialise
  • recover from tracking errors / visual occlusion

see Daniel’s presentation...

SLIDE 15

summary

microphone arrays proving to be useful devices for

recording and analysing meetings

  • facilitate accurate speaker turn segmentation (esp. multi-

speaker overlap)

  • comparable speech recognition performance to

(unenhanced) lapel microphones in ideal case, better in case of noise (eg overlap speech)

  • accurate tracking of speakers

also...

  • developing prototype stand-alone real-time system (8

inputs, 8 outputs, Analog Devices TigerSHARC, Firewire o/p)
SLIDE 16

relevance to M4 partners

collaboration

  • sharing of location-based speech activity features to

facilitate multi-modal research

  • provide ‘mixed’ single audio channel as alternative to simple

addition of lapel channels

comparison

  • array vs close-talking microphone speaker turn

segmentation

  • provide array output signal for comparison with lapels on

(eventual) large vocabulary recognition system

  • compare with close-talking microphone enhancement (??)

during overlap segments on Numbers task