SLIDE 1

Auditory System For a Mobile Robot

PhD Thesis

Jean-Marc Valin
Department of Electrical Engineering and Computer Engineering
Université de Sherbrooke, Québec, Canada
Jean-Marc.Valin@USherbrooke.ca

SLIDE 2

Motivations

  • Robots need information about their environment in order to be intelligent
  • Artificial vision has been popular for a long time, but artificial audition is new
  • Robust audition is essential for human-robot interaction (cocktail party effect)

SLIDE 3

Approaches To Artificial Audition

  • Single microphone
    – Human-robot interaction
    – Unreliable
  • Two microphones (binaural audition)
    – Imitate the human auditory system
    – Limited localisation and separation
  • Microphone array audition
    – More information available
    – Simpler processing

SLIDE 4

Objectives

  • Localise and track simultaneous moving sound sources
  • Separate sound sources
  • Perform automatic speech recognition
  • Remain within robotics constraints
    – Complexity, algorithmic delay
    – Robustness to noise and reverberation
    – Weight/space/adaptability
    – Moving sources, moving robot

SLIDE 5

Experimental Setup

  • Eight microphones on the Spartacus robot
  • Two configurations: cube (C1) and shell (C2)
  • Noisy conditions
  • Two environments
  • Reverberation time
    – Lab (E1): 350 ms
    – Hall (E2): 1 s

SLIDE 6

Sound Source Localisation

SLIDE 7

Approaches to Sound Source Localisation

  • Binaural
    – Interaural phase difference (delay)
    – Interaural intensity difference
  • Microphone array
    – Estimation through TDOAs
    – Subspace methods (MUSIC)
    – Direct search (steered beamformer)
  • Post-processing
    – Kalman filtering
    – Particle filtering

SLIDE 8

Steered Beamformer

  • Delay-and-sum beamformer
  • Maximise output energy
  • Frequency-domain computation (sketch below)
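A minimal sketch of the frequency-domain delay-and-sum beamformer, assuming rfft-style spectra and NumPy; the function and variable names are illustrative, not the thesis code. Delaying a microphone is a per-bin phase shift, so the steered output energy is cheap to evaluate for any candidate direction.

```python
import numpy as np

def steered_energy(X, delays):
    """Output energy of a delay-and-sum beamformer steered by `delays`.

    X      : (M, K) complex rfft spectra of one frame (M microphones, K bins)
    delays : (M,) steering delays in samples that time-align the microphones
             for the candidate direction
    """
    M, K = X.shape
    N = 2 * (K - 1)                               # FFT length behind the rfft spectra
    k = np.arange(K)
    shifts = np.exp(-2j * np.pi * np.outer(delays, k) / N)
    Y = (X * shifts).sum(axis=0)                  # delay-and-sum in the frequency domain
    return float(np.sum(np.abs(Y) ** 2))          # energy to maximise over directions
```

Expanding the squared sum into pairwise cross-correlation terms is what the spectral weighting and the lookup-and-sum search on the following slides operate on.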
SLIDE 9

Spectral Weighting

  • Normal cross-correlation peaks are very wide
  • PHAse Transform (PHAT) has narrow peaks
  • Apply weighting (sketch below)
    – Weight according to noise and reverberation
    – Models the precedence effect
      • Sensitivity is decreased after a loud sound
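A sketch of the weighted cross-correlation for one microphone pair, computed through the inverse FFT of the whitened cross-spectrum; `zeta` stands in for the noise- and reverberation-dependent weighting described above and is illustrative.

```python
import numpy as np

def weighted_cross_correlation(Xi, Xj, zeta=1.0):
    """Cross-correlation of a microphone pair with spectral weighting.

    Xi, Xj : (K,) complex rfft spectra of the two microphones
    zeta   : per-bin weight; 1.0 gives the plain PHAT (whitened) correlation.
             In the thesis the weight reflects estimated noise and reverberation
             and models the precedence effect (reduced sensitivity after a loud sound).
    """
    cross = Xi * np.conj(Xj)                    # cross-spectrum of the pair
    whitened = cross / (np.abs(cross) + 1e-12)  # PHAT: unit magnitude, narrow peaks
    return np.fft.irfft(zeta * whitened)        # correlation as a function of lag (samples)
```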
SLIDE 10

Direction Search

  • Finding the directions with highest energy
  • Fixed number of sources Q = 4
  • Lookup-and-sum algorithm (sketch below)
  • 25 times less complex
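A sketch of the lookup-and-sum search: every direction on a precomputed grid stores the integer TDOA lag it implies for each microphone pair, so its energy is a sum of table lookups into the weighted cross-correlations. Returning the Q best grid points directly is a simplification of the thesis search; names are illustrative.

```python
import numpy as np

def search_directions(corr, lags, Q=4):
    """Lookup-and-sum direction search.

    corr : (P, L) weighted cross-correlations of the P microphone pairs
    lags : (D, P) precomputed integer TDOA lag of each of the D grid
           directions for each pair (the lookup table)
    Q    : fixed number of sources to report
    """
    P = corr.shape[0]
    # Energy of every grid direction: sum the correlation values at its lags.
    # Negative lags index from the end, matching the circular irfft output.
    energy = corr[np.arange(P), lags].sum(axis=1)
    best = np.argsort(energy)[-Q:][::-1]          # the Q highest-energy directions
    return best, energy[best]
```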

SLIDE 11

Post-Processing: Particle Filtering

  • Need to track sources over time
  • Steered beamformer output is noisy
  • Representing the pdf as particles
  • One set of (1000) particles per source
  • State = [position, speed] (sketch below)
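A sketch of one source's particle set, assuming directions are unit vectors: each particle carries a position and a speed, and the prediction step (step 1 on the next slide) applies damped, randomly excited motion. The class name and the damping/excitation constants are illustrative.

```python
import numpy as np

class ParticleSet:
    """Particle approximation of one tracked source's direction pdf."""

    def __init__(self, n=1000):
        self.pos = np.random.randn(n, 3)                          # directions on the unit sphere
        self.pos /= np.linalg.norm(self.pos, axis=1, keepdims=True)
        self.vel = np.zeros((n, 3))                               # the "speed" part of the state
        self.w = np.full(n, 1.0 / n)                              # particle weights

    def predict(self, dt=0.04, damping=0.9, excitation=0.05):
        """Propagate state = [position, speed] with damped random motion."""
        self.vel = damping * self.vel + excitation * np.random.randn(*self.vel.shape)
        self.pos = self.pos + dt * self.vel
        self.pos /= np.linalg.norm(self.pos, axis=1, keepdims=True)  # stay on the sphere
```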
SLIDE 12

Particle Filtering Steps

1) Prediction
2) Instantaneous probability estimation
   – As a function of the steered beamformer energy

SLIDE 13

Particle Filtering Steps (cont.)

3) Source-observation assignment
   – Need to know which observation is related to which tracked source
   – For each observation q, compute:
     • the probability that q is a false alarm
     • the probability that q is tracked source j
     • the probability that q is a new source

SLIDE 14

Particle Filtering Steps (cont.)

4) Particle weight update
   – Merging past and present information
   – Taking the source-observation assignment into account
5) Addition or removal of sources
6) Estimation of source positions
   – Weighted mean of the particle positions
7) Resampling (steps 4, 6 and 7 are sketched below)
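A simplified sketch of steps 4), 6) and 7) for one source: the new weights combine the observation likelihood with the probability that the observation was really assigned to this source, the position estimate is the weighted mean of the particles, and resampling draws particles in proportion to their weights. This plain multinomial resampler and the function names are illustrative, not the thesis equations.

```python
import numpy as np

def update_weights(w, likelihood, p_assigned):
    """Step 4: merge past weights with the current observation (simplified).

    likelihood : per-particle likelihood of the assigned observation
    p_assigned : probability that the observation really belongs to this source
    """
    w = w * (p_assigned * likelihood + (1.0 - p_assigned))
    return w / w.sum()

def estimate_position(pos, w):
    """Step 6: weighted mean of the particle positions, back on the unit sphere."""
    mean = (w[:, None] * pos).sum(axis=0)
    return mean / np.linalg.norm(mean)

def resample(pos, vel, w):
    """Step 7: draw particles with probability proportional to their weights."""
    idx = np.random.choice(len(w), size=len(w), p=w)
    return pos[idx], vel[idx], np.full(len(w), 1.0 / len(w))
```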

SLIDE 15

Localisation Results (E1)

[Plots: detection accuracy vs. distance; localisation accuracy]

SLIDE 16

Tracking Results

Two sources crossing with C2

  • Video

[Trajectory plots for E1 and E2]

SLIDE 17

Tracking Results (cont.)

Four moving sources with C2

[Trajectory plots for E1 and E2]

SLIDE 18

Sound Source Separation & Speech Recognition

SLIDE 19

Overview of Sound Source Separation

  • Frequency-domain processing
    – Simple, low complexity
  • Linear source separation
  • Non-linear post-filter

[Block diagram: the sources S_m(k,l) reach the microphones as X_n(k,l); geometric source separation followed by the post-filter produces the separated outputs Y_m(k,l) ≈ S_m(k,l); tracking information drives the separation]

SLIDE 20

Geometric Source Separation

  • Frequency-domain formulation
  • Constrained optimization
    – Minimize the correlation of the outputs
    – Subject to a geometric constraint on the tracked source directions
  • Modifications to the original GSS algorithm (sketch below)
    – Instantaneous computation of correlations
    – Regularisation
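A sketch of one GSS adaptation step at a single frequency bin, reflecting the two modifications above: the output correlation is estimated from the current frame only, and the correlation gradient is scaled by the input energy to keep adaptation stable. The geometric term pushes W(k)A(k) toward the identity, with A(k) built from the tracked source directions. Step size, scaling and names are illustrative; the cost functions follow the geometric source separation framework the thesis adapts.

```python
import numpy as np

def gss_step(W, x, A, mu=0.01):
    """One instantaneous GSS update at one frequency bin.

    W : (M, N) separation matrix (M sources, N microphones)
    x : (N,)   microphone spectra at this bin for the current frame
    A : (N, M) steering matrix built from the tracked source directions
    """
    x = x[:, None]
    y = W @ x                                    # separated outputs for this frame
    Ryy = y @ y.conj().T                         # instantaneous output correlation
    off = Ryy - np.diag(np.diag(Ryy))            # cross-talk terms to minimise
    grad_sep = 2.0 * off @ y @ x.conj().T        # gradient of the decorrelation cost
    grad_geo = (W @ A - np.eye(W.shape[0])) @ A.conj().T   # geometric constraint W A = I
    energy = float(np.abs(x.conj().T @ x))       # input energy used to normalise the step
    W = W - mu * (grad_sep / (energy ** 2 + 1e-12) + grad_geo)
    return W, y[:, 0]
```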

SLIDE 21

Multi-Source Post-Filter

SLIDE 22

Interference Estimation

  • Source separation leaks
    – Incomplete adaptation
    – Inaccuracy in localization
    – Reverberation/diffraction
    – Imperfect microphones
  • Estimation from the other separated sources (sketch below)
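A sketch of the leakage estimate used by the multi-source post-filter: the interference seen by one output is modelled as a fixed fraction of the energy of the other separated outputs, since the linear separation only attenuates competing sources by a limited amount. The leakage factor value and names are illustrative.

```python
import numpy as np

def interference_estimate(Y_power, m, leak=0.1):
    """Interference power seen by separated output m at every frequency bin.

    Y_power : (M, K) power spectra of the M separated outputs for one frame
    m       : index of the output being post-filtered
    leak    : assumed residual leakage factor (about -10 dB here, illustrative)
    """
    others = np.delete(Y_power, m, axis=0)    # the competing separated sources
    return leak * others.sum(axis=0)          # their summed power, scaled by the leakage
```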
SLIDE 23

Reverberation Estimation

  • Exponential decay model (sketch below)
  • Example: 500 Hz frequency bin
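A sketch of the exponential-decay reverberation model: each frame, the per-bin reverberation estimate decays geometrically and is re-excited by the previous frame's output, so a loud sound leaves a decaying tail in the estimate. The decay and gain constants are illustrative placeholders tied to the room's reverberation time.

```python
import numpy as np

def reverb_estimate(prev_rev, prev_out_power, gamma=0.97, delta=10.0):
    """Recursive exponential-decay estimate of the reverberation power.

    prev_rev       : (K,) reverberation estimate at the previous frame
    prev_out_power : (K,) power of the previous frame's post-filtered output
    gamma          : per-frame decay, tied to the reverberation time (illustrative)
    delta          : reverberant-to-direct gain divisor (illustrative)
    """
    return gamma * prev_rev + ((1.0 - gamma) / delta) * prev_out_power
```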

SLIDE 24

Results (SNR)

[Chart: SNR (dB) of Sources 1–3 for the input, delay-and-sum, GSS, GSS + single-source post-filter, and GSS + multi-source post-filter]

  • Three speakers
  • C2 (shell), E1 (lab)
SLIDE 25

Speech Recognition Accuracy (Nuance)

[Chart: word correct (%) for the Right, Front and Left sources with GSS only, the post-filter without dereverberation, and the proposed system; E2, C2, 3 speakers]

  • Proposed post-filter reduces errors by 50%
  • Reverberation removal helps in E2 only
  • No significant difference between C1 and C2
  • Digit recognition
    – 3 speakers: 83%
    – 2 speakers: 90%


SLIDE 26

Man vs. Machine

[Chart: word correct (%) for Listeners 1–5 and the proposed system]

  • How does a human compare?
  • Is it fair?

– Yes and no!

SLIDE 27

Real-Time Application

  • Video from AAAI conference
SLIDE 28

Speech Recognition With Missing Feature Theory

  • Speech is transformed into features (~12)
  • Not all features are reliable
  • MFT = ignore unreliable features (sketch below)
    – Compute a missing feature mask
    – Use the mask to compute probabilities
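A sketch of how a missing feature mask can be used during recognition: with a diagonal-covariance Gaussian mixture acoustic model, only the features marked reliable contribute to each component's log-likelihood (marginalisation). The GMM setting and names are illustrative; the thesis integrates this inside the recogniser itself.

```python
import numpy as np

def masked_log_likelihood(x, mask, means, variances, log_weights):
    """Log-likelihood of one feature frame under a diagonal GMM, using only
    the features the mask marks as reliable.

    x, mask          : (D,) feature frame and boolean reliability mask
    means, variances : (C, D) per-component Gaussian parameters
    log_weights      : (C,) log mixture weights
    """
    d = x[None, :] - means
    per_dim = -0.5 * (np.log(2 * np.pi * variances) + d ** 2 / variances)
    per_comp = log_weights + (per_dim * mask[None, :]).sum(axis=1)  # drop unreliable dims
    return np.logaddexp.reduce(per_comp)                            # log-sum over components
```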

SLIDE 29

Missing Feature Mask

  • Interference: unreliable
  • Stationary noise: reliable

[Mask display: black = reliable, white = unreliable]

SLIDE 30

Results (MFT)

  • Japanese isolated word recognition (SIG2 robot, CTK)
    – 3 simultaneous sources
    – 200-word vocabulary
    – 30, 60, 90 degrees separation

[Chart: word correct (%) for the Right, Front and Left sources with GSS, GSS + post-filter, and GSS + post-filter + MFT]

SLIDE 31

Summary of the System

SLIDE 32

Conclusion

  • What have we achieved?
    – Localisation and tracking of sound sources
    – Separation of multiple sources
    – Robust basis for human-robot interaction
  • What are the main innovations?
    – Frequency-domain steered beamformer
    – Particle filtering source-observation assignment
    – Separation post-filtering for multiple sources and reverberation
    – Integration with missing feature theory

SLIDE 33

Where From Here?

  • Future work
    – Complete dialogue system
    – Echo cancellation for the robot's own voice
    – Use human-inspired techniques
    – Environmental sound recognition
    – Embedded implementation
  • Other applications
    – Video-conference: automatically follow the speaker with a camera
    – Automatic transcription

SLIDE 34

Questions? Comments?