Quality Control - part 1
Crowdsourcing and Human Computation Instructor: Chris Callison-Burch Website: crowdsourcing-class.org
Classification System for Human Computation: Motivation, Quality Control, Aggregation, Human Skill
Figure 1. Partners agreeing on an image. Neither of them can
Player 1 guesses: purse
Player 1 guesses: bag
Player 1 guesses: brown
Player 2 guesses: handbag
Player 2 guesses: purse
Success! Agreement on “purse”
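The agreement step in the figure can be sketched in a few lines of Python. This is only an illustration, not the game's actual implementation: the function name `first_agreement`, the event format, the interleaving order of the guesses, and the `taboo` parameter are all assumptions made for the example.

```python
from typing import Iterable, Optional, Set, Tuple


def first_agreement(
    events: Iterable[Tuple[int, str]],
    taboo: Optional[Set[str]] = None,
) -> Optional[str]:
    """Return the first word that both players have typed, or None.

    events: chronological (player, guess) pairs, with player 1 or 2.
    taboo:  hypothetical set of words that are not allowed as labels.
    """
    taboo = taboo or set()
    seen = {1: set(), 2: set()}  # guesses typed so far by each player
    for player, guess in events:
        word = guess.strip().lower()
        if word in taboo:
            continue  # taboo words never count toward agreement
        partner = 2 if player == 1 else 1
        if word in seen[partner]:
            return word  # the partner already typed this word
        seen[player].add(word)
    return None


# Guesses from the figure; the interleaving order is assumed.
events = [
    (1, "purse"),
    (2, "handbag"),
    (1, "bag"),
    (1, "brown"),
    (2, "purse"),
]
print(first_agreement(events))
```

Neither player sees the other's guesses; the match is detected only by comparing each new guess against what the partner has already typed.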
“think like each other”
2.5 minutes
15 images
difficult image
generated on their partner’s screen; a pair cannot pass on an image until both have passed
upon words (up to 6)
and guarantee that images get several different labels
ESP game, generating 1,271,451 labels for 293,760 different images
hours a day would label all images on Google (425,000,000 images) with 1 label each in 31 days
every image in Google’s index
“better”
don’t agree on
players
(permissive = 1 pair agrees, strict = 40 agree)
random that have at least 5 labels
and agreed on labels
anything to do with the image?
with this image)?
Figure 4. An image with all its labels: Dog Leash German Shepard Standing Canine
[Figure: rating scales for Anger, Disgust, Fear, Joy, Sadness, Surprise]
sim(lad, boy) > sim(rooster, noon)
meaning for a word
and chief operating officer...
university 2) head of a country (other than the U.S.) 3) head of the U.S., President of the United States
before or after another verb?
two fireballs go down to the water, and there was smoke coming up from that.
up?
Headline | Expert | Non-expert
Beware of peanut butter pathogens | 37 | 15
Experts offer advice on salmonella | 23 | 10
Indonesian with bird flu dies | 45 | 39
Thousands tested after Russian H5N1 | 71 | 80
Roots of autism more complex than thought | 15 | 20
Largest ever autism study identifies two genetic culprits | 12 | 22
$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\tfrac{1}{2}\, n (n-1)}$$
Example of a concordant pair (both raters score the first headline higher):
Headline | Expert | Non-expert
Beware of peanut butter pathogens | 37 | 15
Experts offer advice on salmonella | 23 | 10
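Example of a discordant pair (the expert scores the first headline higher, the non-expert scores it lower):
Headline | Expert | Non-expert
Experts offer advice on salmonella | 23 | 10
Largest ever autism study identifies two genetic culprits | 12 | 22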
$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\tfrac{1}{2}\, n (n-1)} = \frac{11 - 4}{15} \approx 0.47$$
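The pair counting behind Kendall's τ can be checked with a short script. This is a minimal sketch using only the six headline scores as they appear in this transcript; the function name `kendall_tau` is an assumption for illustration. With these six rows the sketch counts 10 concordant and 5 discordant pairs (τ ≈ 0.33), which differs from the 11 and 4 shown on the slide, so the slide's figure may rest on values not preserved here.

```python
from itertools import combinations

# (headline, expert score, non-expert score) as transcribed above
ratings = [
    ("Beware of peanut butter pathogens", 37, 15),
    ("Experts offer advice on salmonella", 23, 10),
    ("Indonesian with bird flu dies", 45, 39),
    ("Thousands tested after Russian H5N1", 71, 80),
    ("Roots of autism more complex than thought", 15, 20),
    ("Largest ever autism study identifies two genetic culprits", 12, 22),
]


def kendall_tau(xs, ys):
    """Return (concordant, discordant, tau) for two equal-length score lists.

    A pair of items is concordant when both raters order it the same way
    and discordant when they order it in opposite ways. Ties count as neither.
    """
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    tau = (concordant - discordant) / (0.5 * n * (n - 1))
    return concordant, discordant, tau


expert = [row[1] for row in ratings]
non_expert = [row[2] for row in ratings]
concordant, discordant, tau = kendall_tau(expert, non_expert)
print(concordant, discordant, round(tau, 2))
```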
expert values against expert values
annotations from different experts – this lets us establish a topline
multiple Turkers for each judgment
Task | Labels
Affect Recognition | 7000
Word Similarity | 300
Recognizing Textual Entailment | 8000
Word Sense Disambiguation | 1770
Temporal Ordering | 4620
Total | 21,690
[Figure: correlation vs. number of annotators (2 to 10), one panel each for anger, disgust, fear, joy, sadness, surprise]
[Figure: accuracy vs. number of annotations]
instead weight it
well they do on gold standard data
alongside data without labels
well, and less for those who perform poorly
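The idea of weighting votes by performance on gold standard data can be sketched as follows. This is an illustration under stated assumptions, not the method used in the lecture's source study: labels are assumed to be binary, each vote is weighted by the log-odds of the worker's smoothed gold accuracy, and the names `worker_accuracy`, `naive_vote`, `calibrated_vote` and all of the data are hypothetical.

```python
import math
from collections import defaultdict


def worker_accuracy(gold, answers, smoothing=1.0):
    """Estimate each worker's accuracy on gold items (add-one smoothed)."""
    accuracy = {}
    for worker, labels in answers.items():
        scored = [(item, lab) for item, lab in labels.items() if item in gold]
        correct = sum(1 for item, lab in scored if lab == gold[item])
        accuracy[worker] = (correct + smoothing) / (len(scored) + 2 * smoothing)
    return accuracy


def naive_vote(votes):
    """Unweighted majority vote over {worker: label}."""
    counts = defaultdict(float)
    for label in votes.values():
        counts[label] += 1
    return max(sorted(counts), key=counts.get)


def calibrated_vote(votes, accuracy):
    """Vote weighted by log-odds of gold accuracy (assumes binary labels)."""
    scores = defaultdict(float)
    for worker, label in votes.items():
        p = accuracy.get(worker, 0.5)  # unknown workers get zero weight
        scores[label] += math.log(p / (1 - p))
    return max(sorted(scores), key=scores.get)


# Hypothetical gold items and worker answers, made up for this example.
gold = {"g1": True, "g2": False, "g3": True, "g4": False}
answers = {
    "a": {"g1": True, "g2": False, "g3": True, "g4": False},   # 4 of 4 correct
    "b": {"g1": True, "g2": True, "g3": False, "g4": True},    # 1 of 4 correct
    "c": {"g1": False, "g2": False, "g3": False, "g4": True},  # 1 of 4 correct
}
accuracy = worker_accuracy(gold, answers)

# Votes on a new item that has no gold label.
votes = {"a": True, "b": False, "c": False}
print(naive_vote(votes), calibrated_vote(votes, accuracy))
```

In this toy case the two unreliable workers outvote the reliable one under naive voting, while the calibrated vote follows the worker who did well on the gold items.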
[Figure: accuracy vs. number of annotators for RTE and before/after, comparing gold calibrated and naive voting]
measure accuracy of subjective responses
multiple choice questions – things like content generation / free text responses can’t be calibrated in the same way
standard data by experts, requires multiple Workers to do each item
windshields on a production line
unit during a year and a half
started in the hourly rate and switched to the per-unit scheme increased by 20%
schemes can elicit improved performance
filling out a survey
2, 3, 4
could end participation at any point and get paid for what they did
words in a set as you can:
BUILDING, CHAIR, COMPLETE, GREEN, LAMP, MASTER, MUSIC, PLANT, STAPLE, STEREO, STRIVE, SUCCEED, TURTLE
are in the puzzle!
either on a per-grid basis or a per-word basis, or not told anything
quality = fraction of words found per puzzle
23,440 words
like intrinsic motivation, do so, since the quality of work will be the same
might be in your best interest to pay as little as possible. Your work will be done more slowly, but quality will be similar.