

slide-1
SLIDE 1

ANIL ALEXANDER (1), OSCAR FORTH (1) AND DONALD TUNSTALL (2)

(1) Oxford Wave Research Ltd, United Kingdom

{anil|oscar}@oxfordwaveresearch.com

(2) Digital Audio Corporation, USA

dtunstall@dacaudio.com

Audio Engineering Society 46th Conference on Audio Forensics, Denver, Colorado, June 14-16, 2012

MUSIC AND NOISE FINGERPRINTING AND REFERENCE CANCELLATION APPLIED TO FORENSIC AUDIO ENHANCEMENT

slide-2
SLIDE 2

Introduction

  • In surveillance audio recordings, it is common to come across:
      • Interfering music or a television playing in the background in locations like pubs, cafes, cars, etc.
      • Other speakers in the background who mask the speech of interest
      • Target speakers who turn on their music players or their televisions as they begin to speak, especially when they suspect they are being monitored, in order to mask their speech.
  • The loud music or background noise drowns out the words or makes the speech of the speakers hard to decipher and transcribe.

slide-3
SLIDE 3

Research Questions

Is it possible to reduce or remove:

I - interfering music from non-contemporaneous reference material and to bring the voice of the speaker to the forefront?

II - background noises, and speech of other speakers, music, etc. from contemporaneous recordings made in the same acoustic environment to bring the voice of the main speaker to the forefront?

slide-4
SLIDE 4

Example (1,2): Car or Hotel Room

Hotel Room
Noise sources: radio, television, music player

In a Car
Noise sources: road noise, car radio, other passengers

slide-5
SLIDE 5

Example (3): Pub/Hall with Music

Noise Sources: Television, Jukebox, Radio, Bar Noise, Other Speakers

slide-6
SLIDE 6

Research Question (I)

Is it possible to reduce or remove interfering music from non-contemporaneous reference material and to bring the voice of the speaker to the forefront? (Alexander and Forth, 2011)

slide-7
SLIDE 7

Why is this difficult?

“Is it possible to reduce or remove interfering music and to bring the voice of the speaker to the forefront?”

  • Straightforward subtraction of the audio will not remove the music, as the effects of the room are not considered.
  • Cancellation is sensitive to clipping and compression.
  • It often has to be applied on a single channel of audio (without simultaneous reference recordings).
  • The exact song that is playing has to be identified and perfectly time-aligned, which is time and labour intensive.
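The first point can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: the "music" and "speech" are synthetic noise signals, the room is a made-up response with three paths, and the sample rate is arbitrary. It compares subtracting the raw reference with subtracting the reference after it has passed through the room response.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000                                  # assumed sample rate (Hz)
music = rng.standard_normal(fs)            # stand-in for the clean reference track
speech = 0.1 * rng.standard_normal(fs)     # stand-in for the target speech

# Hypothetical room response: a direct path plus two weaker, later reflections.
room = np.zeros(64)
room[[5, 23, 47]] = [0.8, 0.4, 0.2]

# What the microphone records: speech plus the music as heard through the room.
music_in_room = np.convolve(music, room)[: len(music)]
mic = speech + music_in_room

def power_db(x):
    """Mean power of a signal in dB."""
    return 10 * np.log10(np.mean(x ** 2))

naive = mic - music            # subtract the raw reference track
ideal = mic - music_in_room    # subtract the reference after the room response

print(f"music in the microphone signal:      {power_db(mic - speech):6.1f} dB")
print(f"music left after naive subtraction:  {power_db(naive - speech):6.1f} dB")
print(f"error after room-aware subtraction:  {np.max(np.abs(ideal - speech)):.2e}")
```

In this toy case the naive subtraction leaves more music energy than it started with, because the raw track does not line up with any of the room paths. The room response is not known in practice, which is why an adaptive filter is needed to estimate it.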

slide-8
SLIDE 8

Reducing Background Music

Tasks involved:

  • Identifying the music/song being played
  • Aligning the tracks to the exact moment in time, within the file being analysed, that the song or music begins
  • Applying a noise- and distortion-robust echo cancellation algorithm to remove or reduce the music while mostly leaving the target speech intact.

slide-9
SLIDE 9

Automatic Music Identification

  • Commercial applications of acoustic fingerprinting include identifying tunes, songs, videos, advertisements and radio broadcasts, and anti-piracy initiatives.
  • Recent proliferation of music identification systems such as Shazam™.
  • A short segment of audio (noisy, distorted or otherwise poor) is sent to an internet-based recognition server for identification.
  • The server compares features of this recording to a pre-indexed database of songs.
  • It selects the most probable candidate(s) for the song.

slide-10
SLIDE 10

Noise-Robust Audio Fingerprinting

[Figure: query audio and the matched reference audio. Match: 1-05 The Road To Hell (Part 2) at 179.744 sec]

  • Attributes for a ‘fingerprint’ [Wang (2003)]:
      • Temporally localized
      • Translation invariant
      • Robust
      • Sufficiently entropic
  • Spectral peak pairs are thus temporally localized and robust to noise and transmission distortions

slide-11
SLIDE 11

Landmark-based Audio Fingerprinting Algorithm (1)

  • ‘Peaks’ are chosen based on having higher energy than their neighbours
  • Spectrogram is reduced into a ‘constellation map’ containing spectral peaks.
  • Pairs of peaks are selected as landmark ‘hashes’ that provide reference anchor points in time and frequency.
  • Landmark hash extraction is performed on query audio.
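The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation or Ellis's MATLAB code: the function names, the frame size, the neighbourhood radius, the fan-out and the target-zone limits (`max_dt`, `max_df`) are all assumptions chosen for the example, and the query is a synthetic signal.

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=256):
    """Magnitude spectrogram, shape (frequency bins, frames), Hann-windowed."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[k * hop : k * hop + n_fft] * window for k in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def constellation(spec, radius=5):
    """Keep (frame, bin) points above the mean level that are the maximum of their neighbourhood."""
    n_bins, n_frames = spec.shape
    floor = spec.mean()
    peaks = []
    for t in range(n_frames):
        for f in range(n_bins):
            patch = spec[max(0, f - radius) : f + radius + 1,
                         max(0, t - radius) : t + radius + 1]
            if spec[f, t] > floor and spec[f, t] == patch.max():
                peaks.append((t, f))
    return peaks

def landmark_hashes(peaks, fan_out=3, max_dt=32, max_df=32):
    """Pair each anchor peak with up to fan_out later peaks; return [((f1, f2, dt), t1), ...]."""
    peaks = sorted(peaks)                      # sorted by frame, then bin
    landmarks = []
    for k, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[k + 1 :]:
            dt = t2 - t1
            if dt > max_dt:                    # later peaks are even further away
                break
            if dt >= 1 and abs(f2 - f1) <= max_df:
                landmarks.append(((f1, f2, dt), t1))
                paired += 1
                if paired == fan_out:
                    break
    return landmarks

# Synthetic query: one second of two tones plus noise (assumed 8 kHz sample rate).
fs = 8000
time = np.arange(fs) / fs
rng = np.random.default_rng(1)
query = (np.sin(2 * np.pi * 440 * time) + np.sin(2 * np.pi * 1250 * time)
         + 0.3 * rng.standard_normal(fs))

peaks = constellation(spectrogram(query))
landmarks = landmark_hashes(peaks)
print(len(peaks), "peaks,", len(landmarks), "landmark hashes")
```

Each hash stores only the two peak frequencies and their time difference, so it does not depend on where in the recording the pair occurs; the anchor time `t1` is kept alongside it for alignment.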

slide-12
SLIDE 12

Landmark-based Audio Fingerprinting Algorithm (2)

[Figure: landmark hashes of the query audio compared with those of the reference audio containing music or noise. Labels: ∆t, time of match (t), landmark hashes, matching hashes]

  • Constellation maps are then compared to obtain the position in time when some of the hashes match, between the query and reference audio.
  • The file with the largest number of hash matches is selected as the reference audio file.
  • An accurate estimate of the time of match is also returned by this algorithm.
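A minimal sketch of this matching step follows. It assumes each recording has already been reduced to a list of `(hash, time)` landmarks, and it counts only matches that agree on a single time offset. That offset-voting detail follows the general approach described by Wang (2003) and is an assumption here, as are the function names and the toy data.

```python
from collections import Counter, defaultdict

def best_offset(query, reference):
    """Return (offset, votes) for the reference-minus-query offset shared by the most matching hashes."""
    index = defaultdict(list)
    for h, t_ref in reference:
        index[h].append(t_ref)
    votes = Counter()
    for h, t_query in query:
        for t_ref in index.get(h, ()):
            votes[t_ref - t_query] += 1
    if not votes:
        return None, 0
    return votes.most_common(1)[0]

def identify(query, database):
    """Select the reference with the most consistent hash matches and its time of match."""
    results = {name: best_offset(query, landmarks) for name, landmarks in database.items()}
    name = max(results, key=lambda n: results[n][1])
    offset, votes = results[name]
    return name, offset, votes

# Toy landmarks: ((f1, f2, dt), t1), with times in frames. Entirely made up.
song_a = [((10, 20, 3), 0), ((11, 25, 2), 4), ((30, 12, 5), 9), ((7, 9, 1), 15)]
song_b = [((10, 20, 3), 2), ((50, 60, 4), 6)]
query = [((11, 25, 2), 1), ((30, 12, 5), 6), ((99, 1, 1), 7)]

print(identify(query, {"song_a": song_a, "song_b": song_b}))   # ('song_a', 3, 2)
```

The returned offset is in frames; multiplying it by the hop duration gives the time of match within the reference file.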

slide-13
SLIDE 13

Landmark-based Audio Fingerprinting Algorithm (3)

[Figure: query audio and the matched reference audio. Match: 1-05 The Road To Hell (Part 2) at 179.744 sec]

Ellis (2009) Robust Landmark-Based Audio Fingerprinting

slide-14
SLIDE 14

Result Example - Time Domain

[Figure: three time-domain waveforms over 0 to 4 s (x-axis: Time (s)). Panels: Original Signal (Speech and Music); Identified Music Signal; Resulting Speech (Original - Music)]

Marked reduction in the noise floor

slide-15
SLIDE 15

Result Example - Frequency Domain

slide-16
SLIDE 16

Echo Cancellation (1)

  • Echo cancellation addresses a similar problem:
      • playback from the speakers and simultaneous recordings from the microphones
      • the playback should not ‘seep in’ to the recording in the microphone
  • An acoustic echo canceller could provide a good solution to the problem
  • Echo cancellation algorithms are generally LMS (Least Mean Square) based; either time domain or frequency domain approaches can be used
  • In this application we use an echo canceller software module (compliant with the ITU-T G.167 and G.168 specifications) using the Intel Performance Primitives (IPP) library and the DAC CARDINAL.

slide-17
SLIDE 17

LMS-based Echo Cancellation

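[Figure: block diagram of the canceller. Primary input: speech + noise/music (S+N). Reference input: identified time-aligned noise/music (N’), filtered by the electronic response estimate to give N”. The estimate is subtracted from the primary input at a summing junction (+/-). Output (residual): speech + residual noise/music (S + N’ - N”)]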

slide-18
SLIDE 18

LMS / NLMS Coefficient Update

Each FIR coefficient h, index n, is updated each sample interval i as follows:

$$h_n(i+1) = h_n(i) + \Delta h_n(i)$$

The update increment, $\Delta h_n(i)$, is computed by the LMS algorithm as follows:

$$\Delta h_n(i) = \mu \cdot e(i) \cdot x(i-n)$$

NLMS uses a slightly different µ value, as follows:

$$\Delta h_n(i) = \mu' \cdot e(i) \cdot x(i-n)$$

where $\mu'$ is the specified µ value (or “adapt rate”), scaled inversely to the average input signal power
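The update rule can be sketched as a short NumPy routine. This is a sketch under stated assumptions, not the G.167/G.168 module used by the authors: e(i) is taken to be the residual output and x(i - n) the time-aligned reference delayed by n samples, and µ is normalised by the summed squared input over the filter window, which is the usual NLMS choice (the slide only says "scaled inversely to the average input signal power"). The signals and the room response in the demonstration are synthetic.

```python
import numpy as np

def nlms_cancel(primary, reference, n_taps=64, mu=0.5, eps=1e-8):
    """Cancel the reference from the primary using an NLMS-adapted FIR filter.

    Returns the residual e(i) and the final coefficients h_n.
    """
    h = np.zeros(n_taps)
    x_buf = np.zeros(n_taps)                 # x_buf[n] holds x(i - n)
    residual = np.zeros(len(primary))
    for i in range(len(primary)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = reference[i]
        estimate = h @ x_buf                 # response estimate N''
        e = primary[i] - estimate            # residual e(i)
        mu_prime = mu / (eps + x_buf @ x_buf)   # mu scaled inversely to input power (assumed form)
        h += mu_prime * e * x_buf            # h_n(i+1) = h_n(i) + mu' * e(i) * x(i - n)
        residual[i] = e
    return residual, h

# Demonstration with synthetic signals and a hypothetical two-path room response.
rng = np.random.default_rng(0)
n = 8000
music = rng.standard_normal(n)               # stand-in for the identified, aligned reference
speech = 0.1 * rng.standard_normal(n)        # stand-in for the target speech
room = np.zeros(32)
room[[5, 23]] = [0.8, 0.4]
primary = speech + np.convolve(music, room)[:n]

residual, h = nlms_cancel(primary, music)

tail = slice(n // 2, None)                   # measure after the filter has adapted
before = np.mean((primary - speech)[tail] ** 2)
after = np.mean((residual - speech)[tail] ** 2)
print(f"music reduction: {10 * np.log10(before / after):.1f} dB")
print("largest coefficient at tap", int(np.argmax(np.abs(h))))
```

The largest coefficient settles at the tap matching the direct-path delay of the simulated room, which is the "big spike" referred to later in the deck.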

slide-19
SLIDE 19

Electronic Response Estimate

  • FIR filter coefficients represent an electronic simulation of the room’s acoustical environment
  • Filter must have a sufficient number of taps, N, to account not only for the direct acoustic path (A), but also for the longest significant reverberation path (B)
  • At a 16000 Hz sample rate, the required N for the example in the figure would be 0.070 s * 16000/s = 1120 taps
  • We typically estimate the minimum required filter length in milliseconds as 5 times the largest dimension of the room in feet

[Figure: room diagram with a labelled dimension of 15’. A: Direct Path (13’). B: Longest significant path (70’). Sound: 1 ft ~ 1 msec]
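The arithmetic on this slide can be written as two small helper functions. The function names are made up for illustration, and treating the 15 ft label in the figure as the room's largest dimension is an assumption.

```python
import math

def taps_for_path(path_ft, sample_rate_hz):
    """Taps needed to span an acoustic path, using the slide's 1 ft ~ 1 msec approximation."""
    delay_ms = path_ft * 1.0
    return math.ceil(delay_ms * sample_rate_hz / 1000)

def taps_for_room(largest_dimension_ft, sample_rate_hz):
    """Rule of thumb from the slide: filter length in ms = 5 x largest room dimension in feet."""
    length_ms = 5 * largest_dimension_ft
    return math.ceil(length_ms * sample_rate_hz / 1000)

print(taps_for_path(70, 16000))   # 1120, the 70 ft path at 16 kHz
print(taps_for_room(15, 16000))   # 1200, i.e. 75 ms for a 15 ft room
```

Under that assumption the rule of thumb gives 75 ms, slightly more than the 70 ms needed for the longest significant path in the example.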

slide-20
SLIDE 20

Time Alignment Drift

  • If there is a speed differential between the primary and reference tracks, the time alignment will “drift” as the processing progresses
  • This can be observed in the FIR coefficient response as a movement of the “big spike” (the large coefficient associated with the direct path signal correlation), either to the right or the left
  • If the drift is sufficiently fast (e.g. more than 1-2 coefficients every 5-10 seconds), the LMS algorithm will never be able to converge the FIR coefficients to an optimal solution
  • Also, should the spike drift beyond either the beginning or the end of the filter, all cancellation will be lost
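To get a feel for these numbers, the drift rate can be related to the speed differential between the two tracks. The sketch below assumes a constant speed differential expressed in parts per million (ppm) and the 16000 Hz sample rate used in the earlier example; the function names and the 500-tap margin are illustrative only. At that sample rate, 1 coefficient per 10 seconds corresponds to 6.25 ppm and 2 coefficients per 5 seconds to 25 ppm.

```python
def drift_taps_per_second(speed_ppm, sample_rate_hz):
    """Rate at which the direct-path spike moves through the filter, in coefficients per second."""
    return speed_ppm * 1e-6 * sample_rate_hz

def seconds_until_spike_leaves(margin_taps, speed_ppm, sample_rate_hz):
    """Time for the spike to drift across margin_taps coefficients and out of the filter."""
    return margin_taps / drift_taps_per_second(speed_ppm, sample_rate_hz)

fs = 16000   # assumed sample rate (Hz)
for ppm in (1, 10, 100):
    rate = drift_taps_per_second(ppm, fs)
    print(f"{ppm:>3} ppm: {rate * 10:.2f} coefficients per 10 s, "
          f"{seconds_until_spike_leaves(500, ppm, fs):.0f} s to cross 500 taps")
```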

slide-21
SLIDE 21

Research Question (II)

  • “Is it possible to reduce or remove, from contemporaneous recordings made in the same acoustic environment, interfering music, background noises, and speech of other speakers, to bring the voice of the main speaker to the forefront?”
  • Will having two microphones in the same environment allow for effective cancellation?

slide-22
SLIDE 22

Applying ‘Audio Fingerprinting’ to Background Noise

  • Having two perfectly time-aligned microphones in the same acoustic environment can greatly help in bringing out the voice of one speaker over the other
  • Rarely happens in practice
  • Aligning noise is a more difficult problem, as sufficient spectral peaks may not be available in both recordings.
  • By applying less stringent criteria for matching, we can accurately time-align audio from the two independent recorders in the same acoustic environment.

slide-23
SLIDE 23

Applications to Noise Identification

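[Figure: block diagram of the canceller. Primary input: speech + noise/music (S+N). Reference input: identified time-aligned noise/music (N’), filtered by the electronic response estimate to give N”. The estimate is subtracted from the primary input at a summing junction (+/-). Output (residual): speech + residual noise/music (S + N’ - N”)]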

slide-24
SLIDE 24

Scenarios

  • Scenario 1: Two independent recordings using two smartphones in the same acoustic environment
  • Scenario 2: Two fixed microphones in the same acoustic environment
  • Scenario 3: White noise interference

slide-25
SLIDE 25

Scenario 1: Two independent recordings using two smartphones

  • Two mobile phones, an iPhone 4S and an iPhone 3GS, were used to record a conversation between two speakers in the same acoustic environment.
  • Two independent devices (not synchronized to each other in any way).
  • A smaller number of hashes was observed, but it was sufficient for time alignment

Queried test audio (iPhone 4S) matched and time-aligned against a reference recording (iPhone 3GS)

slide-26
SLIDE 26

Scenario 2: Two fixed microphones in the same acoustic environment

  • Interfering noise was a television broadcast (2 speakers in a room)
  • Relatively small number of matching hashes as compared with the music
  • Land-marking experiments gave sufficient matches to time-align the two files correctly

slide-27
SLIDE 27

Scenario 3: White Noise Interference

[Figure: three time-domain waveforms over 0 to 60 s (x-axis: Time (s)). Panels: Original speech and white noise; Identified and aligned reference white noise; Resulting speech (Original - white noise)]

  • White noise as the interfering source.
  • Exceedingly difficult to find any distinctive spectral peaks
  • The number of matching hashes was significantly smaller than that observed with either music or regular noise
  • However, we were able to identify a very small number of matching hashes that were sufficient to allow time-alignment.
  • Reference cancellation applied using this time-alignment showed significant improvement in intelligibility.

slide-28
SLIDE 28

Limitations

  • This method is not applicable to:
      • Badly clipped recordings
      • Compressed recordings
      • Recordings where there is a ‘drift’ or stretch in the playback time of the music (more applicable to analogue recordings)
  • Note: What is extracted may still not be of sufficient quality for forensic voice comparison

slide-29
SLIDE 29

Conclusions

  • A combination of audio fingerprinting and echo cancellation can be used to reduce the effect of interfering radio and television noises.
  • This approach could be extended to non-music speech sources by using two independent recordings in the same recording environment
  • A significant improvement in intelligibility is obtained, which could benefit forensic audio enhancement and transcription.

slide-30
SLIDE 30

References

  • A. Wang (2003) “An Industrial-Strength Audio Search Algorithm”, Proc. 2003 ISMIR International Symposium on Music Information Retrieval, Baltimore, MD, Oct. 2003.
  • J. Benesty, D. Morgan and M. Sondhi (1997) “A better understanding and an improved solution to the problems of stereophonic acoustic echo cancellation”, Proc. ICASSP ’97, 303
  • D. P. W. Ellis (2009) “Robust Landmark-Based Audio Fingerprinting”, http://labrosa.ee.columbia.edu/matlab/fingerprint/
  • A. Alexander and O. Forth (2011) “‘No, thank you, for the music’: An application of audio fingerprinting and automatic music signal cancellation for forensic audio enhancement”, International Association of Forensic Phonetics and Acoustics Conference 2011, Vienna, Austria, July 2011