

  1. Outliers Detection vs. Control Questions to Ensure Reliable Results in Crowdsourcing: A Speech Quality Assessment Case Study. Rafael Zequeira Jiménez, Laura Fernández Gallardo, Sebastian Möller. Quality and Usability Lab, Technische Universität Berlin. HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop

  2. Crowdsourcing

  3. Motivation: Speech quality is important for the Quality of Experience (QoE) in:
- audio books
- virtual or robotic conversational agents
The collected ratings can be used to train AI systems to predict speech quality automatically.

  4. Motivation: Speech quality experiments have traditionally been conducted in the laboratory:
- Professional audio equipment
- Soundproof room
- Limited number of participants

  5. Crowdsourcing Study
o Conducted a speech quality assessment experiment
o Crowd-workers were presented with 20 speech stimuli
o Opinions about overall quality were gathered on a 5-point scale

  6. Speech Material:
o Database number 501 from ITU-T Rec. P.863
o 4 German speakers were recorded per condition
o 200 speech stimuli (9 s long on average)
o 50 degradation conditions:
  o narrow- & wide-band
  o temporal clipping
  o signal-correlated noise
  o combinations of these degradations
o The database contains quality ratings for the 200 stimuli made by 24 different native German listeners, in accordance with ITU-T Rec. P.800

  7. Study Conditions:
o Address the study to native Germans
o Collect 24 ratings per stimulus from different listeners
o Experiment in accordance with ITU-T Rec. P.800

  8. Crowdsourcing Platform:
o German-based crowdsourcing (CS) platform
o Reported 1 million global users in September 2017
o Most of its users are from German-speaking countries

  9. Crowdsourcing Experiment
o Screening task to recruit German listeners
o Speech quality assessment task:
  o Qualification phase
  o Speech quality assessment

  10. Crowdsourcing Experiment
Qualification:
• consent request
• introduction
• use of headphones
• environment record
• audio/math trapping question (up to 15 s)
Speech Quality Assessment:
• 20 stimuli to rate
• 5 stimuli as an anchor
• 2 trapping questions


  12. Results
o 87 workers participated in the study
o 8 workers failed the Qualification phase
o 53 unique listeners:
  o 60.4% males
  o 96.2% native Germans
  o provided 4840 ratings
o The collected ratings account for 24 to 26 assessments per file, each from different listeners

  13. Crowdsourcing vs. Laboratory
o Spearman's rank-order correlation: rho = 0.864 (p < 0.001)
  o Monotonic relationship between Lab- and CS-MOS
o Root Mean Square Error: RMSE = 0.474
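The two comparison metrics on this slide can be sketched in plain Python. The MOS values below are invented for illustration, not the study's data:

```python
import math

def _ranks(values):
    """Assign 1..n ranks, averaging ranks over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) *
                    sum((b - my) ** 2 for b in ry))
    return num / den

def rmse(x, y):
    """Root mean square error between two score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Illustrative per-condition MOS values (not the study's data):
lab_mos = [4.2, 3.1, 2.5, 1.8, 3.9]
cs_mos = [4.0, 3.8, 2.2, 1.9, 3.7]
print(spearman_rho(lab_mos, cs_mos))  # 0.9 for this toy data
print(rmse(lab_mos, cs_mos))
```

In practice `scipy.stats.spearmanr` does the same computation; the hand-rolled version is shown only to make the rank-based definition explicit.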

  14. Filtering ratings from unreliable workers
Work in [1] and [2] recommends:
o the use of trapping questions, to catch inattentive users
o when a user fails, all of their ratings are discarded
This approach was effective in [1] and slightly improved the results in [2].
[1] B. Naderi, T. Polzehl, I. Wechsung, F. Köster, and S. Möller, "Effect of Trapping Questions on the Reliability of Speech Quality Judgments in a Crowdsourcing Paradigm," in Interspeech, 2015, pp. 2799–2803.
[2] R. Zequeira Jiménez, L. Fernández Gallardo, and S. Möller, "Scoring Voice Likability using Pair-Comparison: Laboratory vs. Crowdsourcing Approach," in Ninth International Conference on Quality of Multimedia Experience (QoMEX), 2017, pp. 1–3.

  15. Filtering ratings from unreliable workers
A worker is unreliable or untrustworthy when:
o s/he fails the trapping questions in the speech quality assessment task (SQAT)
o s/he fails the Qualification more than once


  17. Filtering ratings from unreliable workers
o Discarded 320 ratings in total from W4, W5, and W7
o W6 did not conduct the SQAT
Method: "filtering by trapping question" (F-TQ)
o Spearman's rank-order correlation on 4520 ratings: rho = 0.862 (p < 0.001)
o When discarding the ratings of all unreliable workers (F-TQ'): rho = 0.854 (p < 0.001)
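The F-TQ rule amounts to a simple filter over the rating records. The (worker, stimulus, score) layout and the ratings below are made up for illustration; only the idea of dropping a failed worker's ratings comes from the slides:

```python
# Hypothetical rating records: (worker_id, stimulus_id, score on 1..5).
ratings = [
    ("W1", "s01", 4), ("W1", "s02", 3),
    ("W4", "s01", 1), ("W4", "s02", 5),
    ("W2", "s01", 4), ("W2", "s02", 2),
]

# Workers who answered a trapping question wrongly (invented example set).
failed_trapping = {"W4"}

# F-TQ: drop every rating submitted by a worker who failed a trapping question.
kept = [r for r in ratings if r[0] not in failed_trapping]
print(len(kept))  # 4 of the 6 toy ratings survive
```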

  18. Outlier Detection
Outliers:
o ratings lying more than 1.5 × the interquartile range (IQR) beyond the quartiles
o depicted by circles
Extreme outliers:
o ratings at 3.0 × IQR or beyond
o depicted by asterisks
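The IQR rule can be sketched as follows. The quartile interpolation method and the sample scores are assumptions for illustration, not taken from the study:

```python
def quartiles(values):
    """Q1 and Q3 by linear interpolation between closest ranks."""
    s = sorted(values)
    def pct(p):
        k = (len(s) - 1) * p
        f = int(k)
        c = min(f + 1, len(s) - 1)
        return s[f] + (s[c] - s[f]) * (k - f)
    return pct(0.25), pct(0.75)

def classify_outliers(values):
    """Split ratings into mild (beyond 1.5*IQR) and extreme (beyond 3.0*IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3.0 * iqr or v > q3 + 3.0 * iqr:
            extreme.append(v)       # beyond the extreme fence
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)          # beyond the mild fence only
    return mild, extreme

# Illustrative 5-point-scale ratings for one stimulus:
scores = [1, 2, 4, 4, 4, 4, 4, 5, 5, 5]
print(classify_outliers(scores))  # ([2], [1]) for this toy data
```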

  19. Outlier Detection
o Discarded 122 ratings identified as extreme outliers
Method: "filtering by outlier detection 1" (F-OD1)
o Spearman's rank-order correlation: rho = 0.863 (p < 0.001)
o Still no better than the coefficient obtained when no data was discarded

  20. Outlier Detection 2
o Discarded 1480 ratings from 12 workers whose ratings were outliers or extreme outliers three times or more [5]
Method: "filtering by outlier detection 2" (F-OD2)
o Spearman's rank-order correlation: rho = 0.867 (p < 0.001)
[5] B. Naderi, Motivation of Workers on Microtask Crowdsourcing Platforms. Springer, 2018.
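A minimal sketch of F-OD2, assuming each rating already carries an outlier flag from the IQR analysis. The worker IDs, scores, and flags below are invented:

```python
from collections import Counter

# Hypothetical per-rating records: (worker, stimulus, score, flagged_as_outlier).
ratings = [
    ("W1", "s01", 4, False), ("W1", "s02", 3, False),
    ("W9", "s01", 1, True),  ("W9", "s02", 5, True),
    ("W9", "s03", 1, True),  ("W9", "s04", 3, False),
    ("W2", "s01", 4, False),
]

# Count outlier flags per worker.
flag_counts = Counter(w for w, _, _, flagged in ratings if flagged)

# F-OD2: a worker flagged three times or more loses all of their ratings,
# including the ones that were not themselves outliers.
unreliable = {w for w, n in flag_counts.items() if n >= 3}
kept = [r for r in ratings if r[0] not in unreliable]
print(len(kept))  # 3 of the 7 toy ratings survive
```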

  21. Alternative Approach
o Applied F-OD1 and F-OD2, discarding 1529 ratings in total
o Then identified the outlier ratings made by the workers who failed the trapping questions, and removed 17 more ratings
Method: F-TQ-OD
o Spearman's rank-order correlation on 3294 ratings: rho = 0.868 (p < 0.001)

  22. Results Overview

Approach   Ratings discarded   rho      RMSE
(none)     0                   0.864*   0.474
F-TQ       320                 0.862*   0.476
F-TQ'      780                 0.854*   0.480
F-OD1      122                 0.863*   0.477
F-OD2      1480                0.867*   0.474
F-TQ-OD    1546                0.868*   0.479

*p < 0.001
