"Robust Automatic Speech Recognition through on-line Semi Blind Source Extraction" - PowerPoint PPT Presentation

Francesco Nesta, Marco Matassoni
{nesta, matassoni}@fbk.eu
Fondazione Bruno Kessler-Irst, Trento (Italy)


  1. "Robust Automatic Speech Recognition through on-line Semi Blind Source Extraction"
Francesco Nesta, Marco Matassoni
{nesta, matassoni}@fbk.eu
Fondazione Bruno Kessler-Irst, Trento (Italy)
For contacts: http://shine.fbk.eu/people/nesta, nesta@fbk.eu

  2. Introduction
• Robust ASR aims to mimic the natural human ability to understand and recognize speech in adverse conditions, such as speech contaminated by multiple competing interfering source signals. In this work we approach robustness on the CHiME challenge data following two key directions:
  - Source signal enhancement through statistical independence and multichannel data → Semi-Blind Source Extraction
  - Features better matching human auditory perception → Gammatone features

  3. BSS vs BSE
Blind Source Separation (BSS)
• Mixtures are modeled as x(k,l) = H(k) s(k,l), where k is the frequency-bin index, l is the frame index, s(k,l) is a vector of N sources and x(k,l) is a vector of M mixtures (i.e. microphone signals).
• If N = M, the source signals are recovered as y(k,l) = W(k) x(k,l) ≈ s(k,l), with W(k) = H(k)^-1, i.e. W(k)H(k) = I (up to order and scaling ambiguity).
• In the real world N > M and may rapidly change over time!
Blind Source Extraction (BSE)
• The blind source extraction paradigm has been proposed to overcome these limitations [Takahashi, Saruwatari et al., 2008].
• Mixtures are modeled as x(k,l) = s^t(k,l) + n(k,l) = h^t(k) s^t(k,l) + n(k,l), where h^t(k) s^t(k,l) is the image at the microphones of the target source and n(k,l) is the image at the microphones of the sum of the interfering sources.
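As a sanity check of the BSS model above, a minimal NumPy sketch (with an invented random mixing matrix H(k) and random sources) shows that when N = M the mixtures are exactly undone by W(k) = H(k)^-1:

```python
import numpy as np

# BSS model in one frequency bin k: x(k,l) = H(k) s(k,l).
# With N = M, applying W(k) = H(k)^-1 recovers the sources
# (in general only up to permutation and scaling; here exactly,
# since we invert the true H).
rng = np.random.default_rng(0)

M = N = 2          # two microphones, two sources
L = 100            # number of time frames
s = rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))  # source STFT frames
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))  # mixing matrix H(k)

x = H @ s                # mixtures: x(k,l) = H(k) s(k,l)
W = np.linalg.inv(H)     # ideal demixing matrix
y = W @ x                # y(k,l) = W(k) x(k,l) ≈ s(k,l)

print(np.allclose(y, s))
```

In practice H(k) is unknown and W(k) must be estimated blindly (e.g. by ICA), which is where the permutation and scaling ambiguities arise.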

  4. Limits of BSE
• We may estimate the noise in the mixture as
  ñ(k,l) = w(k)^T x(k,l) = w(k)^T [h^t(k) s^t(k,l) + n(k,l)],   with w(k)^T h^t(k) = 0.
• If the target source is always active and dominant, w(k)^T is one of the rows of W(k), e.g. estimated through ICA.
• Once estimates of the noise ñ(k,l) and of the target source s̃^t(k,l) are obtained, the signals are filtered through a non-linear time-varying filter (e.g. Wiener filter, spectral subtraction, ...), based on the estimated power spectral densities of the target source and noise signals (BSS/ICA followed by postfiltering).
Main issues:
1. Due to the scaling ambiguity, ñ(k,l) is a time-varying distorted approximation of n(k,l).
2. The target source cannot be estimated with a single linear demixing.
Incorrect estimation of the power spectral densities of target and noise sources generates distortions in the recovered output signals!
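The noise-estimation idea can be sketched for a hypothetical 2-microphone case: any w(k) orthogonal to the target mixing vector h^t(k) cancels the target image, leaving a (scaled) noise estimate. The specific values of h, the noise level, and the choice of w are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 200
h = np.array([1.0 + 0.5j, 0.3 - 0.2j])   # assumed target mixing vector h^t(k)
s = rng.standard_normal(L) + 1j * rng.standard_normal(L)                    # target source
n = 0.1 * (rng.standard_normal((2, L)) + 1j * rng.standard_normal((2, L)))  # noise images

x = np.outer(h, s) + n                    # x(k,l) = h^t(k) s^t(k,l) + n(k,l)

w = np.array([-h[1], h[0]])               # w^T h = -h1*h0 + h0*h1 = 0 (null on the target)
n_est = w @ x                             # noise estimate ñ(k,l): target is cancelled

print(abs(w @ h))                         # orthogonality check
```

Note that n_est equals w^T n, i.e. a linearly transformed (scaled/mixed) version of the true noise, which is exactly the scaling-ambiguity issue listed above.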

  5. On-line Semi-Blind Source Extraction (SBSE)
BSE is extended with a twofold modification:
• Frequency-domain mixtures are modeled as the sum of the signals of the target source and of the M-1 most dominant interfering sources.
• The intermittent activity of the (unknown) interfering sources is modeled by a time-varying mixing matrix H(k,l), of size M x N(k,l), which leads to a better estimation of the noise components in each frequency bin and time frame: y(k,l) = W(k,l) x(k,l).
To better estimate the target source components, a semi-blind source separation (SBSS) is realized: it nests prior knowledge of the target mixing vector h^t(k) directly into the adaptation structure of ICA.
Assumption: the mixing matrix of the target is estimated beforehand, in the signal chunks where the target dominates the interfering sources.
Note: in a real-world application, different strategies can be adopted to estimate the mixing matrix (e.g., as done in the demo presented at Interspeech 2011, a parallel batch off-line ICA can be applied to larger signal segments to supervise the on-line SBSS).

  6. SBSS as a constrained ICA adaptation
• To guarantee that the first output is related to the target source, the ICA adaptation needs to be constrained, imposing
  W^-1(k,l) = [h^t(k) | ...].
• This can be obtained with the update
  W_prior(k) = [h^t(k) | I_2..M]^-1
  x̃(k,l) = W_prior(k) x(k,l)
  y(k,l) = W(k,l) x̃(k,l)
  ΔW(k,l) = {I - φ[y(k,l)] y(k,l)^H} W(k,l)
  ΔW_constr(k,l) = [μ ΔW_1(k,l) | ΔW_2..M(k,l)]
  W(k,l+1) = W(k,l) + η ΔW_constr(k,l)
• If μ = 0, a hard constraint is imposed (e.g. equivalent to SBSS applied to MCAEC [Nesta et al., 2009/2011]); if μ = 1, no constraint is imposed.
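The constrained natural-gradient update above can be sketched in a few lines. The score function φ(y) = y/|y| and the step sizes are assumptions (the slides do not specify them; this choice is common for complex super-Gaussian sources):

```python
import numpy as np

def sbss_update(W, x_tilde, mu=0.5, eta=0.01):
    """One on-line constrained ICA step (sketch, not the authors' exact code).

    W       : current demixing matrix W(k,l), shape (M, M), complex
    x_tilde : pre-transformed frame x~(k,l) = W_prior(k) x(k,l), shape (M,)
    mu      : constraint strength (0 = hard constraint, 1 = unconstrained)
    eta     : learning rate
    """
    M = W.shape[0]
    y = W @ x_tilde                                   # y(k,l) = W(k,l) x~(k,l)
    phi = y / (np.abs(y) + 1e-12)                     # assumed score function φ(y) = y/|y|
    dW = (np.eye(M) - np.outer(phi, y.conj())) @ W    # ΔW = {I - φ[y] y^H} W
    dW[:, 0] *= mu                                    # ΔW_constr = [μ ΔW_1 | ΔW_2..M]
    return W + eta * dW                               # W(k,l+1) = W(k,l) + η ΔW_constr

# With μ = 0 the first (target-related) column is frozen by the hard constraint:
W0 = np.eye(2, dtype=complex)
x_t = np.array([1.0 + 1.0j, 0.5 - 0.2j])
W1 = sbss_update(W0, x_t, mu=0.0)
print(np.allclose(W1[:, 0], W0[:, 0]))
```

With 0 < μ < 1 the constraint is only partially released, trading target fidelity for adaptation speed on the interfering components.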

  7. Permutation and scaling ambiguity
Permutation
• If μ = 0, the hard constraint avoids the permutation problem of frequency-domain BSS (provided the mixing-matrix prior is accurate).
• If the constraint is partially released, permutations need to be fixed (e.g. through the GSCT).
Scaling
• The scaling ambiguity can be solved through the Minimal Distortion Principle (MDP) only if N(k,l) = M. If N(k,l) > M and W(k) approaches singularity, the MDP may considerably overestimate the residual noise components not suppressed by the linear demixing.
• A simple solution: a non-linear clipping limiting the overall filtering to unit gain,
  ỹ_m(k,l) ← min(|ỹ_m(k,l)|, |x_m(k,l)|) · ỹ_m(k,l) / |ỹ_m(k,l)|,
  where x_m(k,l) is the signal recorded at the m-th microphone and ỹ_m(k,l) is the projected-back image of the source signal at the m-th microphone.
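The clipping rule above amounts to bounding the output magnitude by the microphone-signal magnitude while keeping the output phase; a minimal sketch (the eps guarding the division is an implementation detail, not from the slides):

```python
import numpy as np

def clip_unit_gain(y_m, x_m, eps=1e-12):
    """Limit |y~_m(k,l)| to |x_m(k,l)| per bin, preserving the phase of y~_m."""
    mag = np.minimum(np.abs(y_m), np.abs(x_m))   # min(|y~_m|, |x_m|)
    return mag * y_m / (np.abs(y_m) + eps)       # re-apply the phase of y~_m

y = np.array([2.0 + 0.0j, 0.5j])   # projected-back outputs (illustrative values)
x = np.array([1.0 + 0.0j, 1.0 + 0.0j])   # microphone signals
print(clip_unit_gain(y, x))   # first bin clipped to magnitude 1, second untouched
```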

  8. Channel-wise Wiener filter postfiltering
• The constrained SBSS can only enhance the target source signal by a linear time-varying demixing.
• A postfiltering step is therefore used to further enhance the source of interest through a channel-wise adaptive Wiener filtering:
  s̃^t_m(k,l) = P^t_m(k,l) / (P^t_m(k,l) + P^r_m(k,l)) · x_m(k,l),
  where P^t_m(k,l) ≈ E[|s^t_m(k,l)|^2] is the PSD of the target source and P^r_m(k,l) ≈ E[|y^2_m(k,l)|^2] is the PSD of the noise.
• For the 2-channel case, the target PSD is estimated by spectral subtraction:
  |s̃^t_m(k,l)|^2 = |ŝ^1_m(k,l)|^2 if ŝ^1_m(k,l) > 0, and 0 otherwise, with
  ŝ^1_m(k,l) = y^1_m(k,l) - C_m(k,l) y^2_m(k,l) + o(k,l)   (o(k,l): over-subtraction compensation)
  C_m(k,l) = E[|y^1_m(k,l)| |y^2_m(k,l)|] / E[|y^2_m(k,l)|^2]
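The Wiener gain itself is a one-liner once the PSDs are available; a minimal sketch, assuming P^t_m and P^r_m have already been estimated (the per-bin values below are invented for illustration):

```python
import numpy as np

def wiener_postfilter(x_m, P_t, P_r, eps=1e-12):
    """Channel-wise Wiener filtering: scale x_m by P_t / (P_t + P_r) per bin."""
    gain = P_t / (P_t + P_r + eps)
    return gain * x_m

x_m = np.array([1.0 + 1.0j, 2.0 + 0.0j])   # microphone-channel STFT bins
P_t = np.array([1.0, 0.0])                 # assumed target PSD estimates
P_r = np.array([1.0, 4.0])                 # assumed noise PSD estimates
out = wiener_postfilter(x_m, P_t, P_r)
print(out)   # gain 0.5 on the first bin, 0 on the second
```

Bins where the target PSD estimate is zero (the "otherwise" branch of the spectral subtraction above) are fully suppressed, as in the second bin here.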

  9. SBSE architecture
The processing chain consists of three stages (window length / shift refer to the STFT analysis):
• BSS (batch): x(t) → framing and STFT (win length = 16384, win shift = 1024) → batch RR-ICA → estimate of the target mixing vector h_1(k).
• SBSS (on-line): x(t) → STFT (win length = 16384, win shift = 1024) → on-line constrained ICA supervised by h_1(k) → y_1(k,l), y_2(k,l) → overlap-add → y_1(t), y_2(t).
• Extraction: x(t), y_1(t), y_2(t) → STFT (win length = 1024, win shift = 64) → Wiener filter and beamformer → s_1(k,l) → overlap-add → s_1(t) → to ASR.
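The two STFT framings in the diagram can be sketched with a simple slicing routine; the Hann analysis window is an assumption (the slides only give window length and shift):

```python
import numpy as np

def stft_frames(x, win_len, win_shift):
    """Slice x into overlapping windowed frames and return their real FFTs."""
    win = np.hanning(win_len)                       # assumed analysis window
    n_frames = 1 + (len(x) - win_len) // win_shift  # full frames only
    frames = [np.fft.rfft(win * x[i * win_shift : i * win_shift + win_len])
              for i in range(n_frames)]
    return np.array(frames)

x = np.random.default_rng(2).standard_normal(16384 * 3)  # dummy signal
X_bss = stft_frames(x, 16384, 1024)   # BSS / SBSS stage framing
X_wf = stft_frames(x, 1024, 64)       # Wiener-filter stage framing
print(X_bss.shape, X_wf.shape)
```

The long window (16384 samples) gives the narrow frequency bins the ICA stages need, while the short window (1024 samples) lets the Wiener postfilter track fast spectral changes.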

  10. Integration with Robust ASR (1/2)
Acoustic features based on Gammatone analysis:
• a linear approximation of the physiologically motivated processing performed by the cochlea
• a bank of bandpass filters, whose impulse response is defined by
  g_c(t) = a t^(n-1) cos(2π f_c t + φ) e^(-2π b_c t)
  (n is the filter order; c indexes the channels)
• filter center frequencies f_c and bandwidths b_c are derived from the filter's Equivalent Rectangular Bandwidth (ERB)
• output of the Gammatone filter: x_c(t) = x(t) * g_c(t), where g_c(t) is the impulse response of the c-th filter.
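A minimal one-channel sketch of the gammatone filtering; the amplitude a, order n = 4, and the 1.019·ERB bandwidth rule are assumed values (only the general form of g_c(t) is given on the slide):

```python
import numpy as np

def gammatone_ir(fc, fs, n=4, duration=0.05, a=1.0, phi=0.0):
    """Impulse response g_c(t) = a t^(n-1) cos(2π fc t + φ) e^(-2π b t)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # Glasberg & Moore ERB scale
    b = 1.019 * erb                            # assumed bandwidth rule b_c = 1.019 ERB(fc)
    return a * t**(n - 1) * np.cos(2 * np.pi * fc * t + phi) * np.exp(-2 * np.pi * b * t)

def gammatone_filter(x, fc, fs):
    """Channel output x_c(t) = x(t) * g_c(t) (convolution)."""
    return np.convolve(x, gammatone_ir(fc, fs))

fs = 16000
g = gammatone_ir(1000.0, fs)   # 50 ms impulse response at fc = 1 kHz
print(g.shape)
```

A full front end would apply this for a bank of channels with f_c spaced on the ERB scale, then derive frame-level features from the channel envelopes.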
