EE E6820: Speech & Audio Processing & Recognition
Lecture 7: Audio Compression & Coding
1 Information, compression & quantization
2 Speech coding
3 Wide bandwidth audio coding
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
E6820 SAPR - Dan Ellis L07 - Coding 2002-03-25 - 1

1 Compression & Quantization
• How big is audio data? What is the bitrate?
- Fs frames/second (e.g. 8000 or 44100)
  x C samples/frame (e.g. 1 or 2 channels)
  x B bits/sample (e.g. 8 or 16)
  → Fs·C·B bits/second (e.g. 64 Kbps or 1.4 Mbps)
[Figure: bits/frame (8 to 32) against frames/sec (8000 to 44100): CD Audio 1.4 Mbps, Telephony 64 Kbps, Mobile ≤ 13 Kbps]
• How to reduce?
- lower sampling rate → less bandwidth (muffled)
- lower channel count → no stereo image
- lower sample size → quantization noise
• Or: use data compression
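The Fs·C·B product above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `bitrate` is made up for illustration.

```python
def bitrate(frames_per_sec, samples_per_frame, bits_per_sample):
    """Raw PCM bitrate in bits/second: Fs * C * B."""
    return frames_per_sec * samples_per_frame * bits_per_sample

telephony = bitrate(8000, 1, 8)      # 64000 bits/s, i.e. 64 Kbps
cd_audio = bitrate(44100, 2, 16)     # 1411200 bits/s, i.e. about 1.4 Mbps
print(telephony, cd_audio)
```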

Data compression: Redundancy vs. Irrelevance
• Two main principles in compression:
- remove redundant information
- remove irrelevant information
• Redundant information is implicit in remainder
- e.g. signal bandlimited to 20 kHz, but sampled at 80 kHz → can recover every other sample by interpolation
[Figure: samples against time. In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples]
• Irrelevant information is unique, unnecessary
- e.g. recording a microphone signal at 80 kHz sampling rate
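The redundancy example can be illustrated numerically. The sketch below assumes a periodic test signal (so that a finite DFT is exact) whose components all lie below a quarter of the sampling rate, which is the 20 kHz / 80 kHz situation; it drops every other sample and rebuilds the full-rate signal by bandlimited interpolation. The helper names `dft` and `interpolate_2x` and the test signal are illustrative choices, not part of the lecture.

```python
import cmath
import math

def dft(x):
    """Plain O(M^2) discrete Fourier transform."""
    M = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
            for k in range(M)]

def interpolate_2x(even):
    """Rebuild 2*M samples from M kept samples, assuming no energy at or above bin M/2."""
    M = len(even)
    Y = dft(even)
    out = []
    for n in range(2 * M):
        acc = 0j
        # Use only bins -(M/2 - 1) .. (M/2 - 1); negative k wraps to the top of Y.
        for k in range(-(M // 2) + 1, M // 2):
            acc += Y[k % M] * cmath.exp(2j * math.pi * k * n / (2 * M))
        out.append((acc / M).real)
    return out

N = 64  # one period; components at bins 1, 5 and 11 are all below N/4 = 16
x = [math.cos(2 * math.pi * 1 * n / N)
     + 0.5 * math.sin(2 * math.pi * 5 * n / N)
     + 0.25 * math.cos(2 * math.pi * 11 * n / N + 0.3)
     for n in range(N)]

kept = x[0::2]                    # the "blue" samples
recovered = interpolate_2x(kept)  # includes the dropped "red" samples
worst = max(abs(a - b) for a, b in zip(recovered, x))
print(worst)                      # rounding-level error only
```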

Irrelevant data in audio coding
• For coding of audio signals, irrelevant means perceptually insignificant
- an empirical property
• Compact Disc standard is adequate:
- 44.1 kHz sampling for 20 kHz bandwidth
- 16 bit linear samples for ~ 96 dB peak SNR
• Reflects limits of human sensitivity:
- 20 kHz bandwidth, 100 dB intensity
- sinusoid phase, detail of noise structure
- dynamic properties (hard to characterize)
• Problem: separating salient & irrelevant

Quantization
• Represent waveform with discrete levels
[Figure: x[n] and Q[x[n]] against sample index, with error e[n] = x[n] - Q[x[n]]; staircase transfer curve Q[x] against x]
• Equivalent to adding error e[n]:
  $x[n] = Q[x[n]] + e[n]$
• e[n] ~ uncorrelated, uniform white noise
[Figure: p(e[n]), uniform between -D/2 and +D/2]
  variance $\sigma_e^2 = D^2 / 12$

Quantization noise (Q-noise)
• Uncorrelated noise has flat spectrum
• With a B bit word and a quantization step D
- max signal range (x) = $-(2^{B-1}) \cdot D$ .. $(2^{B-1} - 1) \cdot D$
- quantization noise (e) = $-D/2$ .. $D/2$
→ Best signal-to-noise ratio (power)
  $\mathrm{SNR} = E[x^2] / E[e^2] = (2^B)^2$
  .. or, in dB, $20 \log_{10}(2) \cdot B \approx 6 \cdot B$ dB
[Figure: spectrum of a signal quantized at 7 bits, level / dB (0 to -80) against freq / Hz (0 to 7000)]
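The D²/12 noise variance and the roughly 6 dB-per-bit rule can be checked empirically. The following is a rough sketch, assuming a full-scale, uniformly distributed test signal on [-1, 1) and a rounding quantizer without clipping; `quantize` and `measure` are illustrative names.

```python
import math
import random

def quantize(x, step):
    """Uniform quantizer: round x to the nearest multiple of step."""
    return step * round(x / step)

def measure(bits, n=50_000, seed=0):
    """Return (step, noise power, SNR in dB) for a full-scale uniform test signal."""
    rng = random.Random(seed)
    step = 2.0 / 2 ** bits               # 2^B levels across the range [-1, 1)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    es = [x - quantize(x, step) for x in xs]
    signal_power = sum(x * x for x in xs) / n
    noise_power = sum(e * e for e in es) / n
    return step, noise_power, 10 * math.log10(signal_power / noise_power)

step, noise_power, snr_db = measure(bits=7)
print(noise_power, step ** 2 / 12)        # measured noise power vs. D^2/12
print(snr_db, 20 * math.log10(2) * 7)     # measured SNR vs. about 6 dB per bit
```

For a signal that does not fill the quantizer range, the measured SNR would fall below this best-case figure.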

Redundant information
• Redundancy removal is lossless
• Signal correlation implies redundant information
- e.g. if x[n] = x[n-1] + v[n]
  → x[n] has a greater amplitude range → more bits than v[n]
- sending v[n] = x[n] - x[n-1] can reduce amplitude, hence bitrate
[Figure: x[n] - x[n-1]]
- ‘white noise’ sequence has no redundancy
• Problem: separating unique & redundant
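A small sketch of the differencing idea, assuming an integer random-walk signal built from a small-amplitude v[n] (the signal and the names `difference` and `integrate` are made up for illustration). Sending the differences is lossless, and their amplitude is bounded by that of v[n] even though x[n] itself wanders over a much wider range.

```python
import random
from itertools import accumulate

def difference(x):
    """v[n] = x[n] - x[n-1], with the first sample sent as-is."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]

def integrate(v):
    """Inverse of difference(): running sum rebuilds x[n]."""
    return list(accumulate(v))

rng = random.Random(0)
v = [rng.randint(-3, 3) for _ in range(1000)]   # small-amplitude innovation
x = integrate(v)                                # correlated signal x[n] = x[n-1] + v[n]

d = difference(x)
print(max(abs(s) for s in x), max(abs(s) for s in d))   # peak of x vs. peak of differences
print(integrate(d) == x)                                 # lossless round trip
```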

Optimal coding
• Shannon information: An unlikely occurrence is more ‘informative’
  p(A) = 0.5, p(B) = 0.5: ABBBBAAABBABBABBABB (A, B equiprobable)
  p(A) = 0.9, p(B) = 0.1: AAAAABBAAAAAABAAAAB (A is expected; B is ‘big news’)
• Information in bits $I = -\log_2(\text{probability})$
- clearly works when all possibilities equiprobable
• Opt. bitrate → token length = entropy $H = E[I]$
- i.e. equal-length tokens are equally likely
• How to achieve this?
- transform signal to have uniform pdf
- nonuniform quantization for equiprobable tokens
- variable-length tokens → Huffman coding
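As a quick numerical check of I = -log2(p) and H = E[I] for the two A/B sources above, here is a minimal sketch (the function names are illustrative):

```python
from math import log2

def information_bits(p):
    """Shannon information of an event with probability p."""
    return -log2(p)

def entropy_bits(probs):
    """Entropy H = E[I]: the average information per token, in bits."""
    return sum(p * information_bits(p) for p in probs if p > 0)

equiprobable = entropy_bits([0.5, 0.5])   # 1 bit per token
skewed = entropy_bits([0.9, 0.1])         # about 0.47 bits per token
print(equiprobable, skewed)
print(information_bits(0.9), information_bits(0.1))   # B carries far more information than A
```

The skewed source needs less than half a bit per token on average, which fixed one-bit tokens cannot exploit.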

Quantization for optimum bitrate
• Quantization should reflect pdf of signal:
[Figure: p(x = x0) and cumulative p(x < x0) (0 to 1.0) against x (-0.02 to 0.025), mapping x to x']
- cumulative pdf p(x < x0) maps to uniform x'
- or: nonuniform quantization bins
• Or, codeword length per Shannon $-\log_2(p(x))$:
[Figure: p(x), Shannon info / bits (0 to 8) and codewords against x (-0.02 to 0.03); codewords range from 0xx (shortest) to 111111111xx (longest)]
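One way to sketch the "nonuniform quantization bins" option is to place bin edges at the empirical quantiles of the signal, which is equivalent to quantizing the cumulative distribution uniformly. The Gaussian test data, the 8-level choice and the helper names below are assumptions for illustration only.

```python
import bisect
import random

def equiprobable_edges(samples, levels):
    """Bin edges at the empirical quantiles, so each bin is (nearly) equally likely."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[k * n // levels] for k in range(1, levels)]

def quantize(x, edges):
    """Index of the bin containing x (0 .. len(edges))."""
    return bisect.bisect_right(edges, x)

rng = random.Random(0)
samples = [rng.gauss(0.0, 0.005) for _ in range(8000)]   # peaky pdf around zero
edges = equiprobable_edges(samples, 8)

counts = [0] * 8
for x in samples:
    counts[quantize(x, edges)] += 1
print(counts)                                   # every bin used equally often
print([round(e, 4) for e in edges])             # edges crowd together where p(x) is high
```

With equally likely bins, fixed-length 3-bit tokens are already well matched to the source.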

Huffman coding
• Variable-length bit sequence tokens → can code unequally probable events
• Tree-structure for unambiguous decoding:
  p = 0.5     0
  p = 0.25    10
  p = 0.0625  1100
  p = 0.0625  1101
  p = 0.0625  1110
  p = 0.0625  1111
  Example bitstream: 1011001101000001001100010011100001110
• Can build tables to approximate arbitrary distributions
• Eliminates redundancy .. within limits
• Problem: very probable events → short tokens
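The code table on this slide can be exercised directly. The sketch below derives Huffman codeword lengths from the six probabilities with a standard heap-based merge, then decodes the slide's example bitstream with the slide's own codewords. The symbol names `a` to `f` are placeholders, since the slide does not name the events.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword length in bits for each symbol, by repeatedly merging the two rarest nodes."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    tie = len(heap)                       # unique tie-breaker so lists are never compared
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:           # every symbol under the new node gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
        tie += 1
    return lengths

def decode(bits, code):
    """Prefix-free decoding: emit a symbol as soon as the bits read so far match a codeword."""
    inverse = {word: sym for sym, word in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    if word:
        raise ValueError("trailing bits do not form a codeword")
    return out

probs = {"a": 0.5, "b": 0.25, "c": 0.0625, "d": 0.0625, "e": 0.0625, "f": 0.0625}
code = {"a": "0", "b": "10", "c": "1100", "d": "1101", "e": "1110", "f": "1111"}

lengths = huffman_lengths(probs)
average = sum(probs[s] * lengths[s] for s in probs)
entropy = -sum(p * log2(p) for p in probs.values())
print(lengths, average, entropy)

bits = "1011001101000001001100010011100001110"
symbols = decode(bits, code)
print("".join(symbols))
```

Because every probability here is a power of 1/2, the average codeword length equals the entropy; for other distributions Huffman coding only gets within one bit per token of it.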

Vector Quantization
• Quantize mutually dependent values in joint space:
[Figure: scatter of x1 (-2 to 3) against x2 (-6 to 4) with quantizer cells in the joint space]
• May help even if values are largely independent
- larger space {x1, x2} is easier for Huffman
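A minimal sketch of the encoding step of a vector quantizer: each pair (x1, x2) is replaced by the index of its nearest codebook vector. The four-entry codebook below is hand-picked along an assumed correlation direction purely for illustration; in practice a codebook would be trained on data.

```python
def nearest_codeword(point, codebook):
    """Index of the codebook vector closest to point (squared Euclidean distance)."""
    def dist2(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))

# Hypothetical 2-bit codebook for correlated pairs (x1, x2)
codebook = [(-4.0, -1.5), (-1.0, -0.5), (1.0, 0.5), (3.0, 1.5)]

index = nearest_codeword((0.8, 0.2), codebook)
print(index, codebook[index])
```

Here one 2-bit index stands for both values, whereas quantizing x1 and x2 separately would spend bits on combinations that correlated data rarely produces.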

Compression & Representation
• As always, success depends on representation
• Appropriate domain may be ‘naturally’ bandlimited
- e.g. vocal-tract-shape coefficients
- can reduce sampling rate without data loss
• In right domain, irrelevance may be easier to get at
- e.g. STFT to separate magnitude and phase

Aside: Coding standards
• Coding is only useful when recipient knows the code!
• Standardization efforts are important
• Federal Standards: Low bit-rate secure voice:
- FS1015e: LPC-10 2.4 Kbps
- FS1016: 4.8 Kbps CELP
• ITU G.series
- G.726 ADPCM
- G.729 Low delay CELP
• MPEG
- MPEG-Audio layers 1,2,3
- MPEG 2 Advanced Audio Codec
- MPEG 4 Synthetic-Natural Hybrid Codec
• etc ...

Outline
1 Information, compression & quantization
2 Speech coding
- General principles
- CELP & friends
3 Wide bandwidth audio coding

2 Speech coding
• Standard voice channel:
- analog: 4 kHz slot (~ 40 dB SNR)
- digital: 64 Kbps = 8 bit µ-law x 8 kHz
• How to compress?
  Redundant - signal assumed to be a single voice, not any possible waveform
  Irrelevant - need code only enough for intelligibility, speaker identification (c/w analog channel)
• Specifically, source-filter decomposition
- vocal tract & fund. frequency change slowly
• Applications:
- live communications
- offline storage
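For readers unfamiliar with the "8 bit µ-law" channel, the sketch below uses the continuous µ-law companding curve with µ = 255 and a plain rounding quantizer. This is an illustrative approximation, not the exact telephone-standard codec, which uses a piecewise-linear version of the curve. All function names are made up for the example.

```python
import math

MU = 255.0   # assumed companding constant for 8-bit µ-law

def mu_compress(x):
    """Map x in [-1, 1] to [-1, 1], expanding small amplitudes logarithmically."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def mu_law_roundtrip(x, levels=127):
    code = round(mu_compress(x) * levels)    # signed code in -127..127, fits in 8 bits
    return mu_expand(code / levels)

def linear_roundtrip(x, levels=127):
    return round(x * levels) / levels        # same word size, uniform steps

quiet = 0.01
print(abs(mu_law_roundtrip(quiet) - quiet))   # small error for a quiet sample
print(abs(linear_roundtrip(quiet) - quiet))   # much larger error with uniform steps
print(8 * 8000)                               # 64000 bits/s, the 64 Kbps channel
```

The logarithmic curve spends more of the 8-bit codes on small amplitudes, where speech spends most of its time.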

Channel Vocoder (1940s-1960s)
• Basic source-filter decomposition
- filterbank breaks into spectral bands
- transmit slowly-changing energy in each band
[Figure: block diagram. Encoder: input → bandpass filters 1..N → smoothed energy → downsample & encode → E1..EN; voicing analysis → V/UV, pitch. Decoder: pulse generator / noise source → bandpass filters → output]
- 10-20 bands, perceptually spaced
• Downsampling?
• Excitation?
- pitch / noise model
- or: baseband + ‘flattening’...

LPC encoding
• The classic source-filter model
[Figure: block diagram. Encoder: input s[n] → LPC analysis → filter coefficients {a_i} (envelope |1/A(e^{jω})| against f) and residual e[n], each → represent & encode. Decoder: excitation generator → ê[n] → all-pole filter → output ŝ[n]]
  $H(z) = \dfrac{1}{1 - \sum_i a_i z^{-i}}$
• Compression gains:
- filter parameters are ~slowly changing
- excitation can be represented many ways
[Figure: timeline, filter parameters updated every 20 ms, excitation/pitch parameters every 5 ms]
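The relationship between the residual and the all-pole filter H(z) = 1 / (1 - Σ a_i z^-i) can be sketched as a pair of inverse filters. The coefficients are taken as given here (estimating them from a signal is a separate step not shown), and the 2nd-order values and pulse-train excitation are hypothetical.

```python
def lpc_residual(s, a):
    """Analysis (inverse) filter: e[n] = s[n] - sum_i a_i * s[n-i]."""
    p = len(a)
    return [s[n] - sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
            for n in range(len(s))]

def lpc_synthesize(e, a):
    """Synthesis (all-pole) filter: s[n] = e[n] + sum_i a_i * s[n-i]."""
    p = len(a)
    out = []
    for n in range(len(e)):
        prediction = sum(a[i] * out[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        out.append(e[n] + prediction)
    return out

a = [1.2, -0.8]    # hypothetical coefficients; pole pair inside the unit circle (|pole|^2 = 0.8)
excitation = [1.0 if n % 20 == 0 else 0.0 for n in range(80)]   # pitch-like pulse train

signal = lpc_synthesize(excitation, a)     # decoder side: excitation through the all-pole filter
residual = lpc_residual(signal, a)         # encoder side: recovers the excitation
print(max(abs(r - e) for r, e in zip(residual, excitation)))
```

The residual is much simpler than the signal it came from, which is why the excitation can be represented compactly.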

Encoding LPC filter parameters
• For ‘communications quality’:
- 8 kHz sampling (4 kHz bandwidth)
- ~10th order LPC (up to 5 pole pairs)
- update every 20-30 ms → 300 - 500 param/s
• Representation & quantization
- {a_i} - poor distribution, can’t interpolate
- reflection coefficients {k_i}: guaranteed stable
- LSPs - lovely!
[Figure: distributions of a_i and k_i over -2 to 2, and of LSP frequencies f_Li over 0 to 0.5]
• Bit allocation (filter):
- GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps
- FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps
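The parameter and bit rates quoted above follow from simple per-frame arithmetic. The sketch below uses only the figures on the slide; since the slide gives bits per parameter as a range, it computes the lower and upper bounds that bracket the quoted filter bitrates rather than assuming an exact allocation. Function names are illustrative.

```python
def per_second(count_per_frame, frame_ms):
    """Convert a per-frame count into a per-second rate."""
    return count_per_frame * 1000 / frame_ms

def rate_bounds(n_params, min_bits, max_bits, frame_ms):
    """Lowest and highest filter bitrate (bits/s) for the given bits-per-parameter range."""
    return (per_second(n_params * min_bits, frame_ms),
            per_second(n_params * max_bits, frame_ms))

print(per_second(10, 20), per_second(10, 30))   # 10th-order LPC: 500 down to about 333 param/s

gsm = rate_bounds(8, 3, 6, 20)        # 8 LARs every 20 ms
fs1016 = rate_bounds(10, 3, 4, 30)    # 10 LSPs every 30 ms
print(gsm, fs1016)                    # ranges containing 1.8 Kbps and 1.1 Kbps respectively
```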
