SLIDE 1

Crowdsourcing for Information Retrieval Experimentation and Evaluation

Omar Alonso

Microsoft

CLEF 2011, 20 September 2011

SLIDE 2

Disclaimer

The views, opinions, positions, or strategies expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.

SLIDE 3

Introduction

  • Crowdsourcing is hot
  • Lots of interest in the research community

    – Articles showing good results
    – Workshops and tutorials (ECIR’10, SIGIR’10, NAACL’10, WSDM’11, WWW’11, SIGIR’11, etc.)
    – HCOMP
    – CrowdConf 2011

  • Large companies leveraging crowdsourcing
  • Start-ups
  • Venture capital investment

SLIDE 4

Crowdsourcing

  • Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.
  • The application of Open Source principles to fields outside of software.
  • Most successful story: Wikipedia

SLIDE 5

Personal thoughts …

SLIDE 6

HUMAN COMPUTATION

SLIDE 7

Human computation

  • Not a new idea
  • Computers before computers
  • You are a human computer

SLIDE 8

Some definitions

  • Human computation is a computation that is performed by a human.
  • A human computation system is a system that organizes human efforts to carry out computation.
  • Crowdsourcing is a tool that a human computation system can use to distribute tasks.

SLIDE 9

Examples

  • ESP game
  • CAPTCHA: 200M solved every day
  • reCAPTCHA: 750M solved to date

SLIDE 10

Crowdsourcing today

  • Outsource micro-tasks
  • Power law
  • Attention
  • Incentives
  • Diversity

SLIDE 11

MTurk

  • Amazon Mechanical Turk (AMT, MTurk, www.mturk.com)
  • Crowdsourcing platform
  • On-demand workforce
  • “Artificial artificial intelligence”: get humans to do the hard part
  • Named after a faux automaton of the 18th century

SLIDE 12

MTurk – How it works

  • Requesters create “Human Intelligence Tasks” (HITs) via a web services API or the dashboard.
  • Workers (sometimes called “Turkers”) log in, choose HITs, and perform them.
  • Requesters assess results and pay per HIT satisfactorily completed.
  • Currently >200,000 workers from 100 countries; millions of HITs completed (see the API sketch below).
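To make the requester workflow concrete, here is a minimal sketch of creating and then collecting a HIT programmatically. It uses the modern boto3 MTurk client rather than the 2011-era SOAP/REST API, and the title, reward, timing values, and question file are illustrative assumptions rather than the setup used in the talk.

```python
# Minimal sketch of the requester workflow with the boto3 MTurk client
# (assumes AWS credentials are configured; all parameter values are illustrative).
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint: lets you test HITs without paying real workers.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = open("relevance_question.xml").read()  # hypothetical question payload

hit = mturk.create_hit(
    Title="Judge the relevance of a document to a query",
    Description="Read a short document and say whether it answers the query.",
    Keywords="relevance, search, judgment",
    Reward="0.04",                    # USD per assignment
    MaxAssignments=5,                 # number of workers per item
    LifetimeInSeconds=3 * 24 * 3600,  # how long the HIT stays listed
    AssignmentDurationInSeconds=600,  # time a worker has to finish
    Question=question_xml,
)
print("Created HIT", hit["HIT"]["HITId"])

# Later: retrieve submitted assignments for review and payment.
results = mturk.list_assignments_for_hit(HITId=hit["HIT"]["HITId"])
for a in results["Assignments"]:
    print(a["WorkerId"], a["AssignmentStatus"])
```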

SLIDE 13

Why is this interesting?

  • Easy to prototype and test new experiments
  • Cheap and fast
  • No need to set up infrastructure
  • Introduce experimentation early in the cycle
  • In the context of IR, implement and experiment as you go
  • For new ideas, this is very helpful

SLIDE 14

Caveats and clarifications

  • Trust and reliability
  • Wisdom of the crowd, revisited
  • Adjust expectations
  • Crowdsourcing is another data point for your analysis
  • Complementary to other experiments

SLIDE 15

Why now?

  • The Web
  • Use humans as processors in a distributed system
  • Address problems that computers aren’t good at
  • Scale
  • Reach

SLIDE 16

INFORMATION RETRIEVAL AND CROWDSOURCING

SLIDE 17

Evaluation

  • Relevance is hard to evaluate

    – Highly subjective
    – Expensive to measure

  • Click-through data
  • Professional editorial work
  • Verticals

SLIDE 18

You have a new idea

  • Novel IR technique
  • Don’t have access to click data
  • Can’t hire editors
  • How to test new ideas?

SLIDE 19

Crowdsourcing and relevance evaluation

  • Subject pool access: no need to come into the lab
  • Diversity
  • Low cost
  • Agile

SLIDE 20

Examples

  • NLP
  • Machine Translation
  • Relevance assessment and evaluation
  • Spelling correction
  • NER
  • Image tagging

SLIDE 21

Pedal to the metal

  • You read the papers
  • You tell your boss (or advisor) that crowdsourcing is the way to go
  • You now need to produce hundreds of thousands of labels per month
  • Easy, right?

SLIDE 22

Ask the right questions

  • Instructions are key
  • Workers are not IR experts, so don’t assume the same understanding of terminology
  • Show examples
  • Hire a technical writer
  • Prepare to iterate

SLIDE 23

UX design

  • Time to apply all those usability concepts
  • Need to grab attention
  • Generic tips

    – Experiment should be self-contained.
    – Keep it short and simple.
    – Be very clear with the task.
    – Engage with the worker. Avoid boring stuff.
    – Always ask for feedback (open-ended question) in an input box.

  • Localization

SLIDE 24

TREC assessment example

  • Form with a closed question (binary relevance) and an open-ended question (user feedback); a sketch follows below
  • Clear title, useful keywords
  • Workers need to find your task
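A sketch of what such a form might look like, written as the HTML body that would go inside MTurk’s HTMLQuestion payload. The wording, field names, and layout are illustrative assumptions, not the exact TREC assessment form shown in the talk.

```python
# Illustrative HIT form: one closed (binary relevance) question plus an
# optional open-ended feedback box. In a real HIT this HTML is wrapped in
# MTurk's HTMLQuestion XML and must also include the assignmentId field and
# the external-submit action that MTurk requires.
FORM_TEMPLATE = """
<p><b>Query:</b> {query}</p>
<div style="border:1px solid #ccc; padding:8px; max-height:300px; overflow:auto">
  {document_html}
</div>
<p>Is this document relevant to the query?</p>
<label><input type="radio" name="relevant" value="yes"> Yes</label>
<label><input type="radio" name="relevant" value="no"> No</label>
<p>Optional: briefly tell us why you chose that answer.</p>
<textarea name="feedback" rows="3" cols="60"></textarea>
"""

def render_hit(query: str, document_html: str) -> str:
    """Fill the template for one query-document pair."""
    return FORM_TEMPLATE.format(query=query, document_html=document_html)
```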
SLIDE 25

Payments

  • How much is a HIT?
  • Delicate balance

    – Too little: no interest
    – Too much: attract spammers

  • Heuristics

    – Start with something and wait to see if there is interest or feedback (“I’ll do this for X amount”)
    – Payment based on user effort. Example: $0.04 (2 cents to answer a yes/no question, 2 cents if you provide feedback, which is not mandatory); see the cost sketch below

  • Bonus
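Before launching, it helps to do a back-of-the-envelope cost check for a batch. A small sketch, assuming a flat platform fee on top of worker payments; the 20% fee rate is an assumption for illustration, not a figure from the talk.

```python
# Rough cost estimate for a labeling batch (fee percentage is an assumption).
def batch_cost(num_items: int, workers_per_item: int, reward_usd: float,
               fee_rate: float = 0.20) -> float:
    """Total cost = items x redundancy x reward, plus the platform fee."""
    payments = num_items * workers_per_item * reward_usd
    return payments * (1 + fee_rate)

# Example: 1,000 query-document pairs, 5 judgments each, $0.04 per judgment.
print(f"${batch_cost(1000, 5, 0.04):,.2f}")  # -> $240.00
```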

SLIDE 26

Managing crowds

SLIDE 27

Quality control

  • Extremely important part of the experiment
  • Approach it as “overall” quality – not just for workers
  • Bi-directional channel
    – You may think the worker is doing a bad job.
    – The same worker may think you are a lousy requester.
  • Test with a gold standard

SLIDE 28

Quality control - II

  • Approval rate
  • Qualification test
    – Problems: slows down the experiment; difficult to “test” relevance
    – Solution: create questions on topics so the user gets familiar before starting the assessment
  • Still not a guarantee of a good outcome
  • Interject gold answers in the experiment
  • Identify workers that always disagree with the majority (see the sketch below)
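A minimal sketch of both checks mentioned above: scoring each worker against the interleaved gold answers, and flagging workers who disagree with the per-item majority unusually often. The data layout and thresholds are assumptions for illustration.

```python
# judgments: {(worker_id, item_id): label}; gold: {item_id: label} for the
# interleaved gold items only. Thresholds are illustrative.
from collections import Counter, defaultdict

def worker_gold_accuracy(judgments, gold):
    """Fraction of gold items each worker labeled correctly."""
    hits, total = defaultdict(int), defaultdict(int)
    for (worker, item), label in judgments.items():
        if item in gold:
            total[worker] += 1
            hits[worker] += int(label == gold[item])
    return {w: hits[w] / total[w] for w in total}

def majority_disagreement(judgments):
    """Fraction of a worker's labels that differ from the item's majority label."""
    by_item = defaultdict(list)
    for (worker, item), label in judgments.items():
        by_item[item].append((worker, label))
    disagree, seen = defaultdict(int), defaultdict(int)
    for item, votes in by_item.items():
        majority_label, _ = Counter(l for _, l in votes).most_common(1)[0]
        for worker, label in votes:
            seen[worker] += 1
            disagree[worker] += int(label != majority_label)
    return {w: disagree[w] / seen[w] for w in seen}

def suspicious_workers(judgments, gold, min_gold_acc=0.7, max_disagree=0.6):
    """Workers to review: poor on gold, or almost always against the majority."""
    acc = worker_gold_accuracy(judgments, gold)
    dis = majority_disagreement(judgments)
    return sorted(w for w in dis
                  if acc.get(w, 1.0) < min_gold_acc or dis[w] > max_disagree)
```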

SLIDE 29

Methods for measuring agreement

  • Inter-rater agreement level
    – Agreement between judges
    – Agreement between judges and the gold set
  • Some statistics (see the kappa sketch below)
    – Cohen’s kappa (2 raters)
    – Fleiss’ kappa (any number of raters)
    – Krippendorff’s alpha
  • Gray areas
    – 2 workers say “relevant” and 3 say “not relevant”
    – 2-tier system
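For the two-rater case, Cohen’s kappa can be computed directly from the two label sequences. A small sketch; the label names and example data are made up for illustration.

```python
# Cohen's kappa for two raters judging the same items (illustrative data).
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

a = ["rel", "rel", "not", "rel", "not", "not", "rel", "not"]
b = ["rel", "not", "not", "rel", "not", "rel", "rel", "not"]
print(round(cohens_kappa(a, b), 3))  # -> 0.5
```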

SLIDE 30

Time to re-visit things …

  • Crowdsourcing offers flexibility to design and experiment
  • Need to be creative
  • Test different things
  • Let’s dissect items that look trivial

SLIDE 31

The standard template

  • Assuming a lab setting
    – Show a document
    – Question: “Is this document relevant to the query?”
  • Can we do better?
  • GWAP
  • Barry & Schamber
    – Depth/scope/specificity
    – Accuracy/validity
    – Clarity
    – Recency

SLIDE 32

Content quality

  • People like to work on things that they like
  • TREC ad hoc vs. INEX
    – TREC experiments took twice as long to complete
    – INEX (Wikipedia), TREC (LA Times, FBIS)
  • Topics
    – INEX: Olympic games, movies, salad recipes, etc.
    – TREC: cosmic events, Schengen agreement, etc.
  • Content and judgments according to modern times
    – Airport security docs are pre-9/11
    – Antarctic exploration (global warming)
  • Document length
  • Randomize content
  • Avoid worker fatigue

SLIDE 33

Scales and labels

  • Binary
  • Ternary
  • Likert
    – Strongly disagree, disagree, neither agree nor disagree, agree, strongly agree
  • DCG paper (see the sketch below)
    – Irrelevant, marginally, fairly, highly
  • Other examples
    – Perfect, excellent, good, fair, bad
    – Highly relevant, relevant, related, not relevant
    – 0..10 (0 == irrelevant, 10 == relevant)
    – Not at all, to some extent, very much so, don’t know (David Brent)
  • Usability factors
    – Provide clear, concise labels that use plain language
    – Terminology has to be familiar to assessors
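To see how a graded scale like the one above feeds into evaluation, here is a small sketch that maps the labels to numeric gains and computes DCG for a ranked list. The gain values, the log2(rank+1) discount, and the example ranking are assumptions for illustration, not taken from the original paper.

```python
import math

# Map graded labels to gains (values are illustrative assumptions).
GAIN = {"irrelevant": 0, "marginally": 1, "fairly": 2, "highly": 3}

def dcg(labels):
    """DCG with the common log2(rank+1) discount over an ordered result list."""
    return sum(GAIN[l] / math.log2(rank + 1)
               for rank, l in enumerate(labels, start=1))

ranking = ["highly", "marginally", "irrelevant", "fairly"]
print(round(dcg(ranking), 3))  # -> 4.492
```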

SLIDE 34

The human side

  • As a worker
    – I hate it when instructions are not clear
    – I’m not a spammer, I just don’t get what you want
    – Boring task
    – Good pay is ideal but not the only condition for engagement
  • As a requester
    – Attrition rate
    – Balancing act: a task that produces the right results and is appealing to workers
    – I want your honest answer for the task
    – I want qualified workers, and I want the system to do some of that for me
  • Managing crowds and tasks is a daily activity and more difficult than managing computers

SLIDE 35

Difficulty of the task

  • Some topics may be more difficult
  • Ask workers
  • TREC example

SLIDE 36

Relevance justification

  • Why settle for a label?
  • Let workers justify answers
  • INEX: 22% of assignments with comments
  • TREC: 10% of assignments with comments
  • Must be optional

SLIDE 37

Development & testing

SLIDE 38

Development framework

  • Incremental approach
  • Measure, evaluate, and adjust as you go
  • Suitable for repeatable tasks

SLIDE 39

Experiment in production

  • Ad-hoc experimentation vs. ongoing metrics
  • Lots of tasks on the system at any moment
  • Need to grab attention
  • Importance of experiment metadata
  • Scalability
    – Scale on data first, then on workers
    – Size of batch
    – Cost of a deletion
  • When to schedule (see the sketch below)
    – Split a large task into batches and have a single batch in the system
    – Always review feedback from batch n before uploading batch n+1
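A minimal sketch of the batching discipline described above: split the items, keep a single batch live at a time, and pause for a human review of the feedback before uploading the next batch. The functions upload_batch, wait_until_done, collect_feedback, and review are hypothetical hooks for whatever platform calls you use; they are not part of any specific API.

```python
# Illustrative batching loop: one batch in the system at a time, with a
# review gate between batches. upload_batch / wait_until_done /
# collect_feedback / review are hypothetical hooks for your platform.
def run_in_batches(items, batch_size, upload_batch, wait_until_done,
                   collect_feedback, review):
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for n, batch in enumerate(batches, start=1):
        batch_id = upload_batch(batch)         # only this batch is live
        wait_until_done(batch_id)
        feedback = collect_feedback(batch_id)  # workers' open-ended comments
        if not review(n, feedback):            # human decision before batch n+1
            print(f"Stopping after batch {n} to fix instructions.")
            break
```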

SLIDE 40

Advanced applications

  • Training sets for machine learning
  • Active learning
  • Adaptive quality control
  • Automatic generation of black/white lists

SLIDE 41

What’s next?

  • Are we going 100% crowdsourcing?
  • Memory hierarchy
    – Cache, main memory, disk, tape
  • People
    – Experts, editors, workers
  • Task routing problem
    – Not all human computers are created equal
    – Push: workers are passive receivers
    – Pull: workers are active seekers
  • Cost and difficulty
    – {klm}, {family hotels amsterdam}
    – {greek philosophers}, {dining philosophers}

SLIDE 42

FUTURE TRENDS

SLIDE 43

Multiple areas

  • Social/behavioral science
  • Human factors
  • Algorithms
  • Economics
  • Distributed systems
  • Statistics

SLIDE 44

Things that need work

  • UX and guidelines
    – Help the worker
    – Cost of interaction
  • Scheduling and refresh rate
  • Exposure effect
  • Sometimes we just don’t agree
  • How crowdsourcable is your task?

SLIDE 45

Mechanical Turk

  • Advantages
    – Speed of experimentation
    – Price
    – Diversity
    – Payments
  • Disadvantages
    – Crowdsourcing != MTurk
    – Lots of problems and missing features
    – Spam
    – Worker and task quality
    – No analytics
    – Need to build tools around it

SLIDE 46

Problems - IR

  • Methodology
  • Cost models
  • Metrics
  • Re-visit how we do IR evaluation

SLIDE 47

Problems – crowds, clouds and algorithms

  • Infrastructure
    – Current platforms are very rudimentary
    – No tools for data analysis
  • CrowdFlower, oDesk, SamaSource, TurKit, Soylent
  • Dealing with uncertainty
  • Programming crowds
  • Combining CPU + HPU

SLIDE 48

Conclusions

  • Crowdsourcing for relevance evaluation works
  • Fast turnaround, easy to experiment, a few dollars to test
  • But you have to design the experiments carefully
  • Usability considerations
  • Worker quality
  • User feedback extremely useful
  • Need to question all aspects of relevance evaluation
  • Be creative
  • Lots of opportunities to improve current platforms

SLIDE 49

Questions

  • What is the best way to perform human computation?
  • What is the best way to combine CPU with HPU for solving problems?
  • What are the desirable integration points for a computation that involves CPU and HPU?

SLIDE 50

Thank you
