

SLIDE 1

Tradeoff Between Quality And Quantity Of Raters To Characterize Expressive Speech

Alec Burmania, Mohammed Abdelwahab, and Carlos Busso

Multimodal Signal Processing (MSP) Lab, Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas

SLIDE 2

Labels from expressive speech

Emotional databases rely on labels for classification, usually obtained via perceptual evaluations.

Lab setting:
  • (+) Allows researchers close control over subjects
  • (−) Expensive
  • (−) Small demographic distribution
  • (−) Smaller corpus size

Crowdsourcing:
  • (+) Can solve some of the above issues
  • (+) Widely tested and used in perceptual evaluations
  • (−) Raises issues with rater reliability

SLIDE 3

Labels from expressive speech

How do we balance quality and quantity in perceptual evaluations? How many labels are enough? Crowdsourcing makes these decisions important. How does this affect classification?

The design space spans a continuum from many evaluators with low individual quality to few evaluators with high individual quality.

SLIDE 4

Effective Reliability

Rosenthal et al. [1] propose the Spearman-Brown effective reliability framework for behavioral studies. It interprets reliability as a function of quality and quantity; we use kappa (κ) as the quality metric and the number of raters (n) as the quantity.

Effective reliability (%) as a function of mean reliability (κ) and number of raters (n):

Mean reliability (κ):   0.42  0.45  0.48  0.51  0.54  0.57  0.60
n = 5 raters:            78    80    82    84    85    87    88
n = 10 raters:           88    89    90    91    92    93    94
n = 15 raters:           92    92    93    94    95    95    96
n = 20 raters:           94    94    95    95    96    96    97

Effective Reliability:  R = nκ / (1 + (n − 1)κ)


[1] Jinni A. Harrigan, Robert Rosenthal, and Klaus R. Scherer, The New Handbook of Methods in Nonverbal Behavior Research, Oxford University Press, 2005.

SLIDE 5

MSP-IMPROV Corpus

(Figure: an example scene.)

Recordings of 12 subjects improvising scenes in pairs (>9 hours, 8,438 turns) [2]. Actors are assigned the context of a scene that they are supposed to act out. The corpus was collected to obtain sentences with fixed lexical content expressed under different emotions.

Data sets:
  • Target – recorded sentences with fixed lexical content (648)
  • Improvisation – the scenes used to elicit the target sentences
  • Interaction – interactions between scenes

[2] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi, and Emily Mower Provost, "MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception," IEEE Transactions on Affective Computing, vol. to appear, 2015.

SLIDE 6

MSP-IMPROV Corpus

Example target sentence: "How can I not?"
  • Anger – Lazy friend asks you to skip class
  • Happiness – Accepting a job offer
  • Sadness – Taking extra help when you are failing classes
  • Neutral – Using a coupon at the store

SLIDE 7

MSP-IMPROV Corpus


SLIDE 8

Perceptual Evaluation

Idea: can we verify whether a worker is spamming even though we lack ground-truth labels for most of the corpus? We focus on a five-class problem (Angry, Sad, Neutral, Happy, Other).

(Diagram: the evaluation runs in two phases. Phase A collects a reference set with gold-standard labels. Phase B interleaves reference items with the data to be labeled (online quality assessment), so each worker's performance can be traced in real time.)
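A minimal sketch of the real-time tracking idea, assuming the simplest possible quality signal (the fraction of interleaved reference items a worker labels consistently with the gold standard); the study itself traces the angular-agreement metric introduced on the next slide, and all names below are illustrative:

```python
def flag_spammer(responses, gold, min_agreement=0.5):
    """Online check: compare a worker's answers on the interleaved
    reference items against the gold standard and flag low agreement.

    responses: dict item_id -> label submitted by the worker
    gold:      dict item_id -> gold-standard label (reference set only)
    """
    checked = [item for item in responses if item in gold]
    if not checked:
        return False  # no reference item seen yet
    agreement = sum(responses[i] == gold[i] for i in checked) / len(checked)
    return agreement < min_agreement

# Re-run the check after every submission, so a spamming worker can be
# stopped early instead of labeling the whole corpus.
gold = {"ref1": "happy", "ref2": "angry"}
print(flag_spammer({"ref1": "sad", "ref2": "sad", "x17": "happy"}, gold))  # True
```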


[3] Alec Burmania, Srinivas Parthasarathy, and Carlos Busso, "Increasing the reliability of crowdsourcing evaluations using online quality assessment," IEEE Transactions on Affective Computing, vol. To appear, 2015.

SLIDE 9

Metric: Angular Agreement

Assign the categories (angry, sad, happy, neutral, other) as the axes of a 5D space (v). We calculate the leave-one-worker-out (LOWO) inter-evaluator agreement as an average angle over the N evaluated sentences:

θ = (1/N) Σ_{j=1..N} arccos( (V(j) · V_j) / (‖V(j)‖ ‖V_j‖) )

(Figure: vote histograms over Angry/Sad/Neutral/Happy/Other before and after the evaluated rater's vote is added, e.g., the Angry count going from 2 to 2+1.)

Assume the rater we are evaluating chooses angry: we then recalculate the agreement as above with that vote included and find the difference, Δθ.

SLIDE 10

Average Difference

(Figure: average difference Δθ, traced against the gold-standard reference set.)

SLIDE 11

Performance Averaged over first two sets


SLIDE 12

First Group of Evaluators Removed


SLIDE 13


SLIDE 14


SLIDE 15


This is still an issue!

SLIDE 16

Offline Filtering Process

Because we have each rater's quality at every checkpoint, we can filter out results that fall below a given threshold. This gives us target sets with, on average, more than 20 evaluations per sentence, so we can build sets with different levels of inter-evaluator agreement. We choose angular agreement as our metric (useful for minority emotions).

(Diagram: the online QA data passes through a real-time processing step at each reference checkpoint, followed by a threshold post-processing step; we can control the threshold to produce sets of varying quality.)
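A minimal sketch of the post-processing step, assuming a per-rater Δθ value is already available at each checkpoint; the data layout and the rater-level (rather than evaluation-level) filtering are simplifying assumptions:

```python
def filter_raters(delta_theta, threshold_deg):
    """Keep only raters whose worst checkpoint Δθ stays within the threshold.

    delta_theta: dict rater_id -> list of Δθ values (degrees), one per
                 reference checkpoint seen by that rater.
    """
    return {r for r, values in delta_theta.items() if max(values) <= threshold_deg}

# Tighter thresholds keep fewer, but more reliable, raters.
scores = {"w1": [3.0, 4.5, 2.0], "w2": [10.0, 30.0, 8.0], "w3": [20.0, 22.0, 18.0]}
print(sorted(filter_raters(scores, 25)))  # ['w1', 'w3']
print(sorted(filter_raters(scores, 5)))   # ['w1']
```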

SLIDE 17


SLIDE 18

Secondary Post-processing threshold (Δθ)

Δθ = 25°


SLIDE 19

Δθ = 5°


SLIDE 20

Rater Quality

Number of sentences (# sent) and agreement (κ) for each Δθ filter and number of raters:

Δθ (°)    5 raters        10 raters       15 raters       20 raters       25 raters
          # sent  κ       # sent  κ       # sent  κ       # sent  κ       # sent  κ
  5       638  0.572      525  0.558      246  0.515       52  0.488        –    –
 10       643  0.532      615  0.522      466  0.501      207  0.459       26  0.455
 15       648  0.501      643  0.495      570  0.483      351  0.443      112  0.402
 20       648  0.469      648  0.471      619  0.463      510  0.451      182  0.414
 25       648  0.452      648  0.450      643  0.450      561  0.440      247  0.416
 30       648  0.438      648  0.433      648  0.436      609  0.431      298  0.410
 35       648  0.425      648  0.433      648  0.426      619  0.424      346  0.403
 40       648  0.420      648  0.427      648  0.425      629  0.423      356  0.402
 90       648  0.422      648  0.419      648  0.422      629  0.419      381  0.409

Annotations on the slide: increasing agreement due to the filter; constant sample size; decreasing number of samples meeting the size criteria.


SLIDE 21

Experimental Setup

Let's choose four scenarios that trade off quality and quantity, and assess their effective reliabilities and classification performance (a Spearman-Brown check of the reliabilities follows below):
  • Case 1: high quality, low quantity – 5° filter, 5 raters (κ = 0.572)
  • Case 2: moderate quality, moderate quantity – 25° filter, 15 raters (κ = 0.450)
  • Case 3: low quality, low quantity – no filter, 5 raters (κ = 0.422)
  • Case 4: low quality, high quantity – no filter, 20 raters (κ = 0.419)

(Diagram: the four cases placed on a quality vs. quantity plane.)
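As a check with the Spearman-Brown formula from slide 4, R = nκ / (1 + (n − 1)κ): Case 1 gives 5·0.572 / (1 + 4·0.572) ≈ 0.87, Case 2 gives 15·0.450 / (1 + 14·0.450) ≈ 0.92, Case 3 ≈ 0.78, and Case 4 ≈ 0.94, matching the effective reliabilities reported on the results slide.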

SLIDE 22

Classification

Five-class problem (Angry, Sad, Neutral, Happy, Other); turns without majority-vote agreement are excluded. Acoustic features: the INTERSPEECH 2013 (IS 2013) feature set extracted with openSMILE.

Pipeline: feature extraction (D = 6373) → CAE feature selection (D = 1000) → forward feature selection (D = 50) → SVM classifier, evaluated with six-fold speaker-independent (6F-SI) cross-validation. A minimal sketch of a comparable setup follows below.

(Diagram: the four cases on the quality vs. quantity plane, as before.)
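A minimal scikit-learn sketch of a comparable setup, assuming the openSMILE features are already extracted into a matrix X with one row per turn, and letting a single univariate selection stage stand in for the CAE and forward-selection stages (all data here are random placeholders):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GroupKFold, cross_val_score

# Placeholder data standing in for openSMILE IS 2013 features (D = 6373):
# y holds the 5-class labels, `speakers` the speaker id of each turn.
rng = np.random.default_rng(0)
X = rng.standard_normal((240, 6373)).astype(np.float32)
y = rng.integers(0, 5, size=240)
speakers = rng.integers(0, 12, size=240)

# SelectKBest stands in for the slide's CAE + forward selection (6373 -> 1000 -> 50).
clf = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=50),
    SVC(kernel="rbf"),
)

# Six-fold speaker-independent (6F-SI) cross-validation via GroupKFold.
scores = cross_val_score(clf, X, y, groups=speakers, cv=GroupKFold(n_splits=6))
print("mean accuracy:", scores.mean())
```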

SLIDE 23

Results

Classification performance on the 514 turns common to all cases:

Case      # Turns   Acc. (%)   Pre. (%)   Rec. (%)   F-score (%)
Case 1    514       47.39      46.53      47.39      46.96
Case 2    514       48.23      47.42      48.23      47.82
Case 3    514       47.07      46.62      47.07      46.84
Case 4    514       47.88      47.17      47.88      47.52

(Diagram: the four cases on the quality vs. quantity plane, as before.)

Case      Effective Reliability (%)   Reliability Rank   F-score Rank
Case 1    87                          3                  3
Case 2    92                          2                  1
Case 3    78                          4                  4
Case 4    94                          1                  2

SLIDE 24

Discussion

Relatively small differences appear in the labels (<10%); the "wisdom of the crowd" seems to be useful for emotion. Cost: the desired accuracy may be a function of cost. Is it worth 4x the cost for a minor improvement? What is the cost of quality?

(Figure: cost vs. quality.)

Label differences between cases:

          Case 1   Case 2   Case 3   Case 4
Case 1       –        26       40       32
Case 2       –         –       32       10
Case 3       –         –        –       36
Case 4       –         –        –        –

SLIDE 25

What does this mean?

We can establish a rough crowdsourcing framework for emotion:
  • Run a test collection for reliability
  • Establish a reliability target and a cost target
  • Data collection
  • Repeat as needed

SLIDE 26

Questions?

Interested in the MSP-IMPROV database? Come visit us at msp.utdallas.edu and click "Resources".


SLIDE 27

References


[1] Jinni A. Harrigan, Robert Rosenthal, and Klaus R. Scherer, The New Handbook of Methods in Nonverbal Behavior Research, Oxford University Press, 2005.

[2] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi, and Emily Mower Provost, "MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception," IEEE Transactions on Affective Computing, vol. to appear, 2015.

[3] Alec Burmania, Srinivas Parthasarathy, and Carlos Busso, "Increasing the reliability of crowdsourcing evaluations using online quality assessment," IEEE Transactions on Affective Computing, vol. to appear, 2015.