Superhuman Speech Analysis? Getting Broader, Deeper & Faster.



SLIDE 1

Björn Schuller

Superhuman Speech Analysis?

Getting Broader, Deeper & Faster.

Björn W. Schuller

Head GLAM, Imperial College London · Chair EIHW, University of Augsburg · CEO audEERING

SLIDE 2

Superhuman?

SLIDE 3

  • Human: ASR

Misses ~1–2 words in 20 → 5–10% “Word Error Rate” (WER) → ~16 words in a 1-minute conversation

  • Machine: ASR

Switchboard: 2.4k calls (260 hrs), 543 speakers
WER over the years: 1995: 43% (IBM); 2004: 15.2% (IBM); 2016: 8% (IBM), 6.3% (Microsoft); 2017: 5.5% (IBM), 5.1% (Microsoft/IBM)
Human: 5.9% WER (single transcriber), 5.1% WER (multiple professional transcribers)
AM: CNN-BLSTM; LM: entire history of a dialogue session

Superhuman? ASR.

Linguistic Data Consortium, 1993/1997.
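The WER figures quoted above are the word-level edit distance between hypothesis and reference, normalised by the reference length; a minimal pure-Python sketch (the function name is illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion in six words → 1/6
```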

SLIDE 4

  • Speech Analysis (CP): Objective Tasks

Alcohol intoxication: 16 speakers from ALC, 47 listeners: 71.7% UAR (human); Interspeech 2011 Challenge, full ALC: 72.2% UAR (system fusion); agglomeration (Weninger et al., 2011): >80%. Also: heart rate, skin conductance, health state, …

  • Speech Analysis (CP): Subjective Tasks

Ground Truth? Emotion, Personality, Likability, …?

Superhuman? Paralings.

Schiel: “Perception of Alcoholic Intoxication in Speech”, Interspeech, 2011.
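The %UAR measure used here and throughout the following slides is unweighted average recall: per-class recall averaged without class-frequency weighting, so a majority-class guesser cannot score well on imbalanced data. A minimal sketch:

```python
def uar(y_true, y_pred):
    """Unweighted Average Recall: mean of per-class recalls."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(classes)

# Imbalanced example: always guessing the majority class gets 50% UAR, not 90% accuracy
y_true = ["sober"] * 9 + ["intoxicated"]
y_pred = ["sober"] * 10
print(uar(y_true, y_pred))  # 0.5
```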

SLIDE 5

Human Performance?

“The Perception of noisified non-sense speech in the noise”, Interspeech, 2017.

SLIDE 6

Rett & ASC.

  • Rett & ASC Early Diagnosis

16 hours of home videos; 6–12 / 10 months; vocal cues: e.g., inspiratory vocalisation

“A Novel Way to Measure and Predict Development: A Heuristic Approach to Facilitate the Early Detection of Neurodevelopmental Disorders", Current Neurology and Neuroscience Reports, 2017.

“Earlier Identification of Children with Autism Spectrum Disorder: An Automatic Vocalisation-based Approach”, Interspeech, 2017.

%UA: Rett Syndrome 76.5; ASC 75.0

SLIDE 7

Getting Broader.

SLIDE 8

Speech Analysis

Broad Paralings: Personality Recognition, Emotion Recognition, Health Classification, Gender Recognition, Speaker ID & Verification, Speech Recognition, Language Understanding, Language ID, Sentiment Analysis

Deep Paralings: Speaker Diarisation

SLIDE 9

INTERSPEECH ComParE — Paralings.

Task            #Classes  %UA/*AUC/+CC
Personality     5x2       70.4
Likability      2         68.7
H&N Cancer      2         76.8
Intoxication    2         72.2
Sleepiness      2         72.5
Age             4         53.6
Gender          3         85.7
Interest        [-1,1]    42.8+
Emotion         5         44.0
Negativity      2         71.2
Addressee       2         70.6
Cold            2         72.0
Snoring         4         70.5
Deception       2         72.1
Sincerity       [0,1]     65.4+
Native Lang.    11        82.2
Nativeness      [0,1]     43.3+
Parkinson’s     [0,100]   54.0+
Eating          7         62.7
Cognitive Load  3         61.6
Physical Load   2         71.9
Social Signals  2x2       92.7*
Conflict        2         85.9
Emotion         12        46.1
Autism          4         69.4

2018:
Task               #Classes  %UA/*AUC/+CC
Affect: Atypical   [-1,1]    ?
Affect: Self-Ass.  [-1,1]    ?
Crying             3         ?
Heart Beats        3         ?

SLIDE 10

Broad Paralings.

Heart Rate           8.4   (*MAE)
Skin Conductance     .908  (+CC)
Facial Action Units  65.0  (%UA)
Eye-Contact          67.4  (%UA)

  • Pseudo Multimodality

SLIDE 11

  • Multiple-Targets
  • 1 Voice

Broad Paralings.

[Vocal tract diagram: nasal cavity, oral cavity, velum, teeth, lips, palate, tongue, jaw, pharynx, glottis, supra-glottal and sub-glottal systems]

Drunk, Angry, Has a Cold, Neurotic, Tired, … Has Parkinson‘s, … Is Older

“Multi-task Deep Neural Network with Shared Hidden Layers: Breaking down the Wall between Emotion Representations”, ICASSP, 2017.

SLIDE 12

  • Cross-Task Self-Labelling

Broad Paralings.

“Semi-Autonomous Data Enrichment Based on Cross-Task Labelling of Missing Targets for Holistic Speech Analysis”, ICASSP, 2016.

%UA            Base  CTL
Extraversion   71.7  +1.8
Agreeableness  58.6  +4.5
Neuroticism    63.3  +3.0
Likability     57.2  +2.9

SLIDE 13

Deep Paralings.

perceived emotion, (degree of) acting, (degree of) intentionality, (degree of) prototypicality, (degree of) discrepancy, … felt emotion, …

“Reading the Author and Speaker: Towards a Holistic and Deep Approach on Automatic Assessment of What is in One's Words”, CICLing, 2017.

SLIDE 14

Getting Deeper.

SLIDE 15

Deep Recurrent Nets.

Arousal CC: HMM 83.5; HMM+LSTM-RNN 87.2; (LSTM-RNN) 96.3

“A Combined LSTM-RNN-HMM Approach to Meeting Event Segmentation and Recognition”, ICASSP, 2006. “Abandoning Emotion Classes – Towards Continuous Emotion Recognition with Modelling of Long-Range Dependencies”, Interspeech, 2008.

SLIDE 16

Deep Recurrent Nets.

“Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks”, ICASSP, 2009. “Deep neural networks for acoustic emotion recognition: raising the benchmarks”, ICASSP, 2011.

SLIDE 17

Deep Recurrent Nets.

SLIDE 18

  • Normalisation Layers

→ ensure normalisation of the input also for higher layers

  • Batch Normalisation

input of each neuron normalised over a “batch” (such as 50 instances); allows for higher learning rates, reduces overfitting; only in forward networks

$n$: batch size, $b_j$: activation of a neuron in step $j$ of the batch ($1 \le j \le n$)
batch mean: $\nu_C = \frac{1}{n}\sum_{j=1}^{n} b_j$
batch variance: $\tau_C^2 = \frac{1}{n}\sum_{j=1}^{n} (b_j - \nu_C)^2$
normalised activation: $\hat{b}_j = \frac{b_j - \nu_C}{\tau_C}$
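A NumPy sketch of these batch statistics (a small ε is added to the variance for numerical stability, as is common practice; the learned scale/shift parameters and inference-time running statistics are omitted):

```python
import numpy as np

def batch_normalise(b, eps=1e-5):
    """Normalise activations b (shape: batch x neurons) per neuron over the batch."""
    nu = b.mean(axis=0)     # batch mean  nu_C
    tau2 = b.var(axis=0)    # batch variance  tau_C^2
    return (b - nu) / np.sqrt(tau2 + eps)

rng = np.random.default_rng(0)
acts = rng.normal(loc=3.0, scale=2.0, size=(50, 4))  # a "batch" of 50 instances
norm = batch_normalise(acts)
print(norm.mean(axis=0).round(6))  # ~0 per neuron
print(norm.std(axis=0).round(3))   # ~1 per neuron
```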

Convolutional Neural Nets.

SLIDE 19

End-to-End.

  • CNN + LSTM RNN

“Adieu Features? End-to-End Speech Emotion Recognition using a Deep Convolutional Recurrent Network”, ICASSP, 2016.

Learned representations correlate with: energy range (.77), loudness (.73), F0 mean (.71)

CCC on RECOLA  Arousal  Valence
ComParE+LSTM   .382     .187
e2e (2016)     .686     .261
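The CCC values in these tables are Lin's concordance correlation coefficient which, unlike Pearson's r, also penalises scale and offset differences between prediction and gold standard; a NumPy sketch:

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient (Lin):
    2*cov(x,y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

gold = np.array([0.1, 0.4, 0.35, 0.8])
pred = gold + 0.1                  # perfectly correlated but shifted
print(round(ccc(gold, pred), 3))   # < 1.0 despite Pearson r = 1
```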

SLIDE 20

End-to-End.

  • CNN + LSTM → CLSTM?

“Convolutional RNN: an enhanced model for extracting features from sequential data”, IJCNN, 2016.

CCC on RECOLA  Arousal  Valence
ComParE+LSTM   .382     .187
e2e (2016)     .686     .261

SLIDE 21

  • Reconstruction Error

RE of an auto-encoder as additional input feature; either Low-Level Descriptors (LLD) or statistical functionals; Deep BLSTM-RNN
[Diagram: Model 1 (auto-encoder) reconstructs X`t from Xt; Model 2 predicts Yt from Xt and the reconstruction error ||Xt − X`t||]

“Reconstruction-error-based Learning for Continuous Emotion Recognition in Speech”, ICASSP, 2017.

Learning by Errors.

CCC on RECOLA         Arousal  Valence
ComParE+LSTM          .382     .187
e2e (2016)            .686     .261
Reconstruction Error  .729     .360
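The paper pairs a deep BLSTM auto-encoder (Model 1) with a recognition net (Model 2); purely as an illustration of the idea, here is a simplified stand-in where a linear auto-encoder (PCA) supplies the per-instance reconstruction error as one extra input feature:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))        # stand-in for LLD/functional features

# "Model 1": linear auto-encoder via PCA with k components
k = 3
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k]                             # encoder/decoder weights
X_rec = Xc @ W.T @ W + X.mean(axis=0)  # reconstruction X'

# per-instance reconstruction error ||x - x'|| as one additional feature
re = np.linalg.norm(X - X_rec, axis=1, keepdims=True)

# "Model 2" would then be trained on the augmented input
X_aug = np.hstack([X, re])
print(X_aug.shape)  # (200, 11)
```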

SLIDE 22

[Diagram: Model 1 and Model 2 with delayed outputs Yt−m1 and Yt−m2, final prediction Yt from input Xt]

“Prediction-based Learning for Continuous Emotion Recognition in Speech”, ICASSP, 2017.

Prediction-based.

  • Tandem Learning

concatenate two models for combined strengths

CCC on RECOLA         Arousal  Valence
ComParE+LSTM          .382     .187
e2e (2016)            .686     .261
Reconstruction Error  .729     .360
Prediction-based      .744     .377

SLIDE 23

End-to-End.

  • CNN + LSTM RNN

“Affect Recognition by Bridging the Gap between End-2-End Deep Learning and Conventional Features”, submitted.

CCC on RECOLA         Arousal  Valence
ComParE+LSTM          .382     .187
e2e (2016)            .686     .261
Reconstruction Error  .729     .360
Prediction-based      .744     .377
BoAW                  .753     .430
e2e (submitted)       .787     .440

SLIDE 24

Adversarial Nets.

CCC on RECOLA         Arousal  Valence
ComParE+LSTM          .382     .187
e2e (2016)            .686     .261
Reconstruction Error  .729     .360
Prediction-based      .744     .377
BoAW                  .753     .430
e2e (submitted)       .787     .440
CAN (submitted)       .737     .455

“Towards Conditional Adversarial Networks for Recognition of Emotion in Speech”, submitted.

  • Conditional Adversarial Nets
SLIDE 25

Co-Learning Trust.

  • Multi-task Learning of Subjective / Uncertain Ground Truth

Example: Arousal / Valence (SEWA data of AVEC 2017) Perception uncertainty (K ratings):

“From Hard to Soft: Towards more Human-like Emotion Recognition by Modelling the Perception Uncertainty”, ACM Multimedia, 2017.
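Deriving the two targets for such multi-task training can be as simple as taking mean and spread over the K ratings; a minimal sketch (the ratings and the exact uncertainty measure here are illustrative, not from the paper):

```python
import numpy as np

# hypothetical arousal ratings from K = 5 annotators for 3 instances
ratings = np.array([[0.2, 0.3, 0.25, 0.2, 0.3],
                    [0.7, 0.9, 0.5, 0.8, 0.6],
                    [-0.1, 0.0, -0.05, 0.0, -0.1]])

gold = ratings.mean(axis=1)        # main-task target: mean rating ("gold standard")
uncertainty = ratings.std(axis=1)  # auxiliary target: inter-rater spread

for g, u in zip(gold, uncertainty):
    print(f"target {g:+.2f}  uncertainty {u:.2f}")
# a multi-task net predicts both; the second instance carries the most disagreement
```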

SLIDE 26

Co-Learning Trust.

“From Hard to Soft: Towards more Human-like Emotion Recognition by Modelling the Perception Uncertainty”, ACM Multimedia, 2017.

CCC on SEWA            Arousal  Valence
Single                 .234     .267
Multiple (+conf)       .275     .292
Single (A/V)           .386     .478
Multiple (+conf, A/V)  .450     .515

SLIDE 27

  • VOTE Snoring Classification

V: Velum (soft palate), O: Oropharyngeal, T: Tongue base, E: Epiglottis

Audio = Images?

%UA
CNN+LSTM     40.3
Functionals  58.8

“Classification of the Excitation Location of Snore Sounds in the Upper Airway by Acoustic Multi-Feature Analysis", IEEE Transactions on Biomedical Engineering, 2017.

SLIDE 28

  • VOTE Snoring Classification

Audio = Images?

%UA
CNN+LSTM     40.3
Functionals  58.8
CNN+GRU      63.8

“A CNN-GRU Approach to Capture Time-Frequency Pattern Interdependence for Snore Sound Classification", submitted.

SLIDE 29

  • VOTE Snoring Classification

Audio = Images?

%UA
CNN+LSTM     40.3
Functionals  58.8
CNN+GRU      63.8
Deep Spec    67.0

“Snore sound classification using image-based deep spectrum features", Interspeech, 2017.

SLIDE 30

  • Wavelets vs STFT via VGG16

Audio = Images?

“Deep Sequential Image Features for Acoustic Scene Classification", DCASE, 2018.

SLIDE 31

  • Wavelets vs STFT via VGG16

Audio = Images?

DCASE 2017  %WA
STFT        76.5
STFT+bump   79.8
STFT+morse  76.9
All         80.9

“Deep Sequential Image Features for Acoustic Scene Classification", DCASE, 2018.

SLIDE 32

  • Emotion with Image Nets

IS Emotion Challenge task – 2 classes

Speech = Images?

“An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech", ACM Multimedia, 2017.

SLIDE 33

  • auDeep

Toolkit @ //github.com/auDeep/auDeep

Speech = Images?

“auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks", arxiv.org, 2017.

SLIDE 34

Getting Faster.

SLIDE 35

  • 2.0 Yet?

0–1 years: 1–100 hrs, ASA (~10 hrs); 2–3 years: ~1,000 hrs; 10–x years: ~10,000 hrs, ASR (2,000+ hrs) → Recognise states/traits independent of person, content, language, cultural background, acoustic disturbances at human parity?

Data?

R. Moore, “A Comparison of the Data Requirements of Automatic Speech Recognition Systems and Human Listeners”, 2003.

SLIDE 36

Rapid Data?

“Efficient Data Exploration for Automatic Speech Analysis: Challenges and State of the Art”, IEEE Signal Processing Magazine, 2017.

SLIDE 37

Rapid Data.

  • YouTube?

300 h/min of videos; 3k videos for new tasks → only 3 h/task

“CAST a Database: Rapid Targeted Large-Scale Audio-Visual Data Acquisition via Small-World Modelling of Social Media Platforms”, ACII 2017.

%UA           openSMILE  oXBOW  CNN Freezing
              70.2       67.5   57.0
Intoxication  64.7       72.6   66.8
Screaming     89.2       97.0   89.2
Threatening   73.8       67.0   71.9
Coughing      95.4       97.6   95.4
Sneezing      79.2       79.8   85.2

SLIDE 38

Rapid Data.

  • Intelligent Labelling

0) Transfer Learning 1) Dynamic Active Learning 2) Semi-Supervised Learning

“Cooperative Learning and its Application to Emotion Recognition from Speech”, IEEE Transactions on Audio, Speech & Language Processing, 2015.
[Diagram: labelled data → train model → classify unlabelled data → select by confidence/information → add newly labelled data]
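The loop of the diagram (train, classify unlabelled data, add the confidently self-labelled instances) can be sketched with a toy nearest-centroid model; the classifier and the confidence threshold are illustrative stand-ins, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
# toy 2-class data: small labelled pool, large unlabelled pool
X_lab = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unl = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])

for it in range(3):  # a few cooperative-learning rounds
    centroids = np.array([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_unl[:, None] - centroids[None], axis=2)
    pred = dists.argmin(axis=1)
    margin = np.abs(dists[:, 0] - dists[:, 1])    # crude confidence measure
    confident = margin > 2.0                      # only trust clear cases
    X_lab = np.vstack([X_lab, X_unl[confident]])  # add newly "labelled" data
    y_lab = np.concatenate([y_lab, pred[confident]])
    X_unl = X_unl[~confident]
    print(f"round {it}: labelled pool {len(y_lab)}, unlabelled left {len(X_unl)}")
```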

SLIDE 39

  • TL: Universum Autoencoders

Jointly minimise reconstruction error & universum (unlabelled dataset) learning loss; Whispered → (transfer) → normal; GeWEC (4-class) + unlabelled: ABC

“Universum Autoencoder-based Domain Adaptation for Speech Emotion Recognition”, Signal Processing Letters, 2017.

Rapid Data: TL.

SLIDE 40

Rapid Data: AL.

“Trustability-based Dynamic Active Learning for Intelligent Crowdsourcing Applications”, submitted.

SLIDE 41

Rapid Data: AL.

“Trustability-based Dynamic Active Learning for Intelligent Crowdsourcing Applications”, submitted.

SLIDE 42

Rapid Data: AL+CS.

“iHEARu-PLAY: Introducing a game for crowdsourced data collection for affective computing”, WASA, 2015.

SLIDE 43

Rapid Data: SSL.

  • AEs for SSL

Supervised learning: keep only the relevant info. Unsupervised AEs: keep all info for reconstruction, w/o (left) or w/ (right) skip compensation.

“Semi-Supervised Autoencoders for Speech Emotion Recognition”, IEEE Transactions on Audio Speech and Language Processing, 2017.

SLIDE 44

Fast Transmission.

[Diagram: bag-of-words pipeline — LLDs of the media stream are vector-quantised against a dictionary (word indices over time, e.g. 2 2 2 1 1 1 1 3 3 1 4 4 5 4), then summarised as a histogram of word counts: 1:5, 2:3, 3:2, 4:3, 5:1, …]

  • openXBOW

“openXBOW – Introducing the Passau Open-Source Crossmodal Bag-of-Words Toolkit”, Journal of Machine Learning Research, 2017.

from top: Gender, Health State, Emotion, Intoxication, Age
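The quantise-then-count pipeline above can be sketched in NumPy: assign each LLD frame to its nearest dictionary word, then summarise the whole stream as a fixed-length histogram (the random codebook is purely illustrative; openXBOW learns or samples its dictionary):

```python
import numpy as np

rng = np.random.default_rng(3)
lld = rng.normal(size=(14, 6))      # 14 frames of 6 LLDs
codebook = rng.normal(size=(5, 6))  # dictionary of 5 "audio words"

# vector quantisation: nearest codebook entry per frame
dists = np.linalg.norm(lld[:, None] - codebook[None], axis=2)
words = dists.argmin(axis=1)        # word-index sequence over time, as on the slide

# histogram: fixed-length bag-of-words representation of the whole stream
bow = np.bincount(words, minlength=len(codebook))
print(words)
print(bow, bow.sum())               # counts over the 5 words; sums to 14 frames
```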

SLIDE 45

  • GPU feature extraction

“GPU-based Training of Autoencoders for Bird Sound Data Processing”, IEEE ICCE-TW, 2017.

Fast Processing.

SLIDE 46

  • Parallel…

Fast Processing.

“Big Data Multimedia Mining: Feature Extraction facing Volume, Velocity, and Variety”, to appear.

SLIDE 47

Fast Processing.

  • Parallelisation

“Big Data Multimedia Mining: Feature Extraction facing Volume, Velocity, and Variety”, to appear.

SLIDE 48

Application?

SLIDE 49

  • Robustness against Paralinguistics?
  • Example: Alcohol Intoxication

Negative influence; improvement by multi-condition training; larger effect for female speakers (Geiger, Zhang, Schuller, Rigoll, AES 2014)

[Conditions: sober vs. alcoholised; spontaneous vs. command & control speech]

Speaker Verification.

UBM  Tgt  True  Imp.  EER
S    S    S     S      8.1
S    S    A     S     12.9
S    S    A     A     12.3
S    A    S     S     10.9
S    A    A     S      8.1
S    A    A     A      7.9
(S: sober, A: alcoholised)

“On the Influence of Alcohol Intoxication on Speaker Recognition,” AES, 2014.

SLIDE 50

  • Paralings for Diarisation

SEWA database

Diarisation.

“A Paralinguistic Approach To Holistic Speaker Diarisation”, ACM Multimedia, 2017.

System     Miss  SpErr
LIUM        6.3   39.0
sensAI     15.2   23.4
Paralings   6.3   38.0

SLIDE 51

  • Superhuman in several objective tasks
  • More (independent) perception studies for subjective tasks!
  • Increased realism and performance (up to 2x)
  • Good progress by improved Deep Architectures
  • Still many low-hanging fruits!

So?

SLIDE 52

  • Tighter Coupling w/ Synthesis
  • Embedding in Dialogues
  • Reinforcement Learning
  • NPU-optimised Solutions

Vision. Thank You.

SLIDE 53

Events.

SLIDE 54

Events.

SLIDE 55

Books.

SLIDE 56

Abstract & CV

Human performance often appears as a glass ceiling when it comes to automatic speech and speaker analysis. In some tasks, such as health monitoring, however, automatic analysis has successfully started to break this ceiling. The field has benefited from more than a decade of deep neural learning approaches such as recurrent LSTM nets and deep RBMs by now; recently, however, a further major boost could be witnessed. This includes the injection of convolutional layers for end-to-end learning, as well as active and autoencoder-based transfer learning and generative adversarial network topologies to better cope with the ever-present bottleneck of severe data scarcity in the field. At the same time, multi-task learning allowed broadening the tasks handled in parallel and including the often-met uncertainty in the gold standard due to subjective labels such as emotion or perceived personality of speakers. This talk highlights the named and further latest trends, such as increasingly deeper nets and the usage of deep image nets for speech analysis, on the road to 'holistic' superhuman speech analysis 'seeing the whole picture' of the person behind a voice. At the same time, increasing efficiency is shown for an ever 'bigger' data and increasingly mobile application world that requires fast and resource-aware processing. The exploitation in ASR and SLU is featured throughout.

Björn W. Schuller heads Imperial College London's (UK) Group on Language Audio & Music (GLAM), is a CEO of audEERING, and a Full Professor in Computer Science at the University of Augsburg/Germany. He further holds a Visiting Professorship at the Harbin Institute of Technology/China. He received his diploma, doctoral, and habilitation degrees in EE/IT from TUM in Munich/Germany. Previous positions of his include Visiting Professor, Associate, and Scientist at VGTU/Lithuania, the University of Geneva/Switzerland, Joanneum Research/Austria, Marche Polytechnic University/Italy, and CNRS-LIMSI/France. His 650+ technical publications (15,000+ citations, h-index 59) focus on machine intelligence for audio and signal analysis. He is the Editor-in-Chief of the IEEE Transactions on Affective Computing, a General Chair of ACII 2019, and a Technical Chair of Interspeech 2019, among various further roles.