SLIDE 1

Crowdsourcing for Information Retrieval Experimentation and Evaluation

Omar Alonso

Microsoft

CLEF 2011, 20 September 2011

SLIDE 2

Disclaimer

The views, opinions, positions, or strategies expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.

SLIDE 3

Introduction

  • Crowdsourcing is hot
  • Lots of interest in the research community

    – Articles showing good results
    – Workshops and tutorials (ECIR’10, SIGIR’10, NAACL’10, WSDM’11, WWW’11, SIGIR’11, etc.)
    – HCOMP
    – CrowdConf 2011

  • Large companies leveraging crowdsourcing
  • Start-ups
  • Venture capital investment

SLIDE 4

Crowdsourcing

  • Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.
  • The application of Open Source principles to fields outside of software.
  • Most successful story: Wikipedia

SLIDE 5

Personal thoughts …

SLIDE 6

HUMAN COMPUTATION

SLIDE 7

Human computation

  • Not a new idea
  • Computers before computers
  • You are a human computer

SLIDE 8

Some definitions

  • Human computation is a computation that is performed by a human.
  • A human computation system is a system that organizes human efforts to carry out computation.
  • Crowdsourcing is a tool that a human computation system can use to distribute tasks.

SLIDE 9

Examples

  • ESP game
  • CAPTCHA: 200M solved every day
  • reCAPTCHA: 750M solved to date

SLIDE 10

Crowdsourcing today

  • Outsource micro-tasks
  • Power law
  • Attention
  • Incentives
  • Diversity

SLIDE 11

MTurk

  • Amazon Mechanical Turk (AMT, MTurk, www.mturk.com)
  • Crowdsourcing platform
  • On-demand workforce
  • “Artificial artificial intelligence”: get humans to do the hard part
  • Named after a faux automaton of the 18th century

SLIDE 12

MTurk – How it works

  • Requesters create “Human Intelligence Tasks” (HITs) via a web services API or the dashboard.
  • Workers (sometimes called “Turkers”) log in, choose HITs, and perform them.
  • Requesters assess results and pay per HIT satisfactorily completed.
  • Currently >200,000 workers from 100 countries; millions of HITs completed (see the API sketch below).
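To make the requester workflow concrete, here is a minimal sketch of creating and then collecting a HIT programmatically. It uses the modern boto3 MTurk client rather than the 2011-era SOAP/REST API, and the title, reward, timing values, and question file are illustrative assumptions rather than the setup used in the talk.

```python
# Minimal sketch of the requester workflow with the boto3 MTurk client
# (assumes AWS credentials are configured; all parameter values are illustrative).
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint: lets you test HITs without paying real workers.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = open("relevance_question.xml").read()  # hypothetical question payload

hit = mturk.create_hit(
    Title="Judge the relevance of a document to a query",
    Description="Read a short document and say whether it answers the query.",
    Keywords="relevance, search, judgment",
    Reward="0.04",                    # USD per assignment
    MaxAssignments=5,                 # number of workers per item
    LifetimeInSeconds=3 * 24 * 3600,  # how long the HIT stays listed
    AssignmentDurationInSeconds=600,  # time a worker has to finish
    Question=question_xml,
)
print("Created HIT", hit["HIT"]["HITId"])

# Later: retrieve submitted assignments for review and payment.
results = mturk.list_assignments_for_hit(HITId=hit["HIT"]["HITId"])
for a in results["Assignments"]:
    print(a["WorkerId"], a["AssignmentStatus"])
```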

SLIDE 13

Why is this interesting?

  • Easy to prototype and test new experiments
  • Cheap and fast
  • No need to set up infrastructure
  • Introduce experimentation early in the cycle
  • In the context of IR, implement and experiment as you go
  • For new ideas, this is very helpful

SLIDE 14

Caveats and clarifications

  • Trust and reliability
  • Wisdom of the crowd, revisited
  • Adjust expectations
  • Crowdsourcing is another data point for your analysis
  • Complementary to other experiments

SLIDE 15

Why now?

  • The Web
  • Use humans as processors in a distributed system
  • Address problems that computers aren’t good at
  • Scale
  • Reach

SLIDE 16

INFORMATION RETRIEVAL AND CROWDSOURCING

SLIDE 17

Evaluation

  • Relevance is hard to evaluate

    – Highly subjective
    – Expensive to measure

  • Click-through data
  • Professional editorial work
  • Verticals

SLIDE 18

You have a new idea

  • Novel IR technique
  • Don’t have access to click data
  • Can’t hire editors
  • How to test new ideas?

SLIDE 19

Crowdsourcing and relevance evaluation

  • Subject pool access: no need to come into the lab
  • Diversity
  • Low cost
  • Agile

SLIDE 20

Examples

  • NLP
  • Machine Translation
  • Relevance assessment and evaluation
  • Spelling correction
  • NER
  • Image tagging

SLIDE 21

Pedal to the metal

  • You read the papers
  • You tell your boss (or advisor) that crowdsourcing is the way to go
  • You now need to produce hundreds of thousands of labels per month
  • Easy, right?

SLIDE 22

Ask the right questions

  • Instructions are key
  • Workers are not IR experts, so don’t assume the same understanding of terminology
  • Show examples
  • Hire a technical writer
  • Prepare to iterate

SLIDE 23

UX design

  • Time to apply all those usability concepts
  • Need to grab attention
  • Generic tips

    – Experiment should be self-contained.
    – Keep it short and simple.
    – Be very clear with the task.
    – Engage with the worker. Avoid boring stuff.
    – Always ask for feedback (open-ended question) in an input box.

  • Localization

SLIDE 24

TREC assessment example

  • Form with a closed question (binary relevance) and an open-ended question (user feedback); a sketch follows below
  • Clear title, useful keywords
  • Workers need to find your task
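A sketch of what such a form might look like, written as the HTML body that would go inside MTurk’s HTMLQuestion payload. The wording, field names, and layout are illustrative assumptions, not the exact TREC assessment form shown in the talk.

```python
# Illustrative HIT form: one closed (binary relevance) question plus an
# optional open-ended feedback box. In a real HIT this HTML is wrapped in
# MTurk's HTMLQuestion XML and must also include the assignmentId field and
# the external-submit action that MTurk requires.
FORM_TEMPLATE = """
<p><b>Query:</b> {query}</p>
<div style="border:1px solid #ccc; padding:8px; max-height:300px; overflow:auto">
  {document_html}
</div>
<p>Is this document relevant to the query?</p>
<label><input type="radio" name="relevant" value="yes"> Yes</label>
<label><input type="radio" name="relevant" value="no"> No</label>
<p>Optional: briefly tell us why you chose that answer.</p>
<textarea name="feedback" rows="3" cols="60"></textarea>
"""

def render_hit(query: str, document_html: str) -> str:
    """Fill the template for one query-document pair."""
    return FORM_TEMPLATE.format(query=query, document_html=document_html)
```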
SLIDE 25

Payments

  • How much is a HIT?
  • Delicate balance

    – Too little: no interest
    – Too much: attract spammers

  • Heuristics

    – Start with something and wait to see if there is interest or feedback (“I’ll do this for X amount”)
    – Payment based on user effort. Example: $0.04 (2 cents to answer a yes/no question, 2 cents if you provide feedback, which is not mandatory); see the cost sketch below

  • Bonus
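Before launching, it helps to do a back-of-the-envelope cost check for a batch. A small sketch, assuming a flat platform fee on top of worker payments; the 20% fee rate is an assumption for illustration, not a figure from the talk.

```python
# Rough cost estimate for a labeling batch (fee percentage is an assumption).
def batch_cost(num_items: int, workers_per_item: int, reward_usd: float,
               fee_rate: float = 0.20) -> float:
    """Total cost = items x redundancy x reward, plus the platform fee."""
    payments = num_items * workers_per_item * reward_usd
    return payments * (1 + fee_rate)

# Example: 1,000 query-document pairs, 5 judgments each, $0.04 per judgment.
print(f"${batch_cost(1000, 5, 0.04):,.2f}")  # -> $240.00
```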

SLIDE 26

Managing crowds

SLIDE 27

Quality control

  • Extremely important part of the experiment
  • Approach it as “overall” quality – not just for workers
  • Bi-directional channel
    – You may think the worker is doing a bad job.
    – The same worker may think you are a lousy requester.
  • Test with a gold standard

SLIDE 28

Quality control - II

  • Approval rate
  • Qualification test
    – Problems: slows down the experiment; difficult to “test” relevance
    – Solution: create questions on topics so the user gets familiar before starting the assessment
  • Still not a guarantee of a good outcome
  • Interject gold answers in the experiment
  • Identify workers that always disagree with the majority (see the sketch below)
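A minimal sketch of both checks mentioned above: scoring each worker against the interleaved gold answers, and flagging workers who disagree with the per-item majority unusually often. The data layout and thresholds are assumptions for illustration.

```python
# judgments: {(worker_id, item_id): label}; gold: {item_id: label} for the
# interleaved gold items only. Thresholds are illustrative.
from collections import Counter, defaultdict

def worker_gold_accuracy(judgments, gold):
    """Fraction of gold items each worker labeled correctly."""
    hits, total = defaultdict(int), defaultdict(int)
    for (worker, item), label in judgments.items():
        if item in gold:
            total[worker] += 1
            hits[worker] += int(label == gold[item])
    return {w: hits[w] / total[w] for w in total}

def majority_disagreement(judgments):
    """Fraction of a worker's labels that differ from the item's majority label."""
    by_item = defaultdict(list)
    for (worker, item), label in judgments.items():
        by_item[item].append((worker, label))
    disagree, seen = defaultdict(int), defaultdict(int)
    for item, votes in by_item.items():
        majority_label, _ = Counter(l for _, l in votes).most_common(1)[0]
        for worker, label in votes:
            seen[worker] += 1
            disagree[worker] += int(label != majority_label)
    return {w: disagree[w] / seen[w] for w in seen}

def suspicious_workers(judgments, gold, min_gold_acc=0.7, max_disagree=0.6):
    """Workers to review: poor on gold, or almost always against the majority."""
    acc = worker_gold_accuracy(judgments, gold)
    dis = majority_disagreement(judgments)
    return sorted(w for w in dis
                  if acc.get(w, 1.0) < min_gold_acc or dis[w] > max_disagree)
```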

SLIDE 29

Methods for measuring agreement

  • Inter-rater agreement level
    – Agreement between judges
    – Agreement between judges and the gold set
  • Some statistics (see the kappa sketch below)
    – Cohen’s kappa (2 raters)
    – Fleiss’ kappa (any number of raters)
    – Krippendorff’s alpha
  • Gray areas
    – 2 workers say “relevant” and 3 say “not relevant”
    – 2-tier system
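For the two-rater case, Cohen’s kappa can be computed directly from the two label sequences. A small sketch; the label names and example data are made up for illustration.

```python
# Cohen's kappa for two raters judging the same items (illustrative data).
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

a = ["rel", "rel", "not", "rel", "not", "not", "rel", "not"]
b = ["rel", "not", "not", "rel", "not", "rel", "rel", "not"]
print(round(cohens_kappa(a, b), 3))  # -> 0.5
```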

SLIDE 30

Time to re-visit things …

  • Crowdsourcing offers flexibility to design and experiment
  • Need to be creative
  • Test different things
  • Let’s dissect items that look trivial

SLIDE 31

The standard template

  • Assuming a lab setting
    – Show a document
    – Question: “Is this document relevant to the query?”
  • Can we do better?
  • GWAP
  • Barry & Schamber
    – Depth/scope/specificity
    – Accuracy/validity
    – Clarity
    – Recency

SLIDE 32

Content quality

  • People like to work on things that they like
  • TREC ad hoc vs. INEX
    – TREC experiments took twice as long to complete
    – INEX (Wikipedia), TREC (LA Times, FBIS)
  • Topics
    – INEX: Olympic games, movies, salad recipes, etc.
    – TREC: cosmic events, Schengen agreement, etc.
  • Content and judgments according to modern times
    – Airport security docs are pre-9/11
    – Antarctic exploration (global warming)
  • Document length
  • Randomize content
  • Avoid worker fatigue

SLIDE 33

Scales and labels

  • Binary
  • Ternary
  • Likert
    – Strongly disagree, disagree, neither agree nor disagree, agree, strongly agree
  • DCG paper (see the sketch below)
    – Irrelevant, marginally, fairly, highly
  • Other examples
    – Perfect, excellent, good, fair, bad
    – Highly relevant, relevant, related, not relevant
    – 0..10 (0 == irrelevant, 10 == relevant)
    – Not at all, to some extent, very much so, don’t know (David Brent)
  • Usability factors
    – Provide clear, concise labels that use plain language
    – Terminology has to be familiar to assessors
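To see how a graded scale like the one above feeds into evaluation, here is a small sketch that maps the labels to numeric gains and computes DCG for a ranked list. The gain values, the log2(rank+1) discount, and the example ranking are assumptions for illustration, not taken from the original paper.

```python
import math

# Map graded labels to gains (values are illustrative assumptions).
GAIN = {"irrelevant": 0, "marginally": 1, "fairly": 2, "highly": 3}

def dcg(labels):
    """DCG with the common log2(rank+1) discount over an ordered result list."""
    return sum(GAIN[l] / math.log2(rank + 1)
               for rank, l in enumerate(labels, start=1))

ranking = ["highly", "marginally", "irrelevant", "fairly"]
print(round(dcg(ranking), 3))  # -> 4.492
```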

SLIDE 34

The human side

  • As a worker
    – I hate it when instructions are not clear
    – I’m not a spammer, I just don’t get what you want
    – Boring task
    – Good pay is ideal but not the only condition for engagement
  • As a requester
    – Attrition rate
    – Balancing act: a task that produces the right results and is appealing to workers
    – I want your honest answer for the task
    – I want qualified workers, and I want the system to do some of that for me
  • Managing crowds and tasks is a daily activity and more difficult than managing computers

SLIDE 35

Difficulty of the task

  • Some topics may be more difficult
  • Ask workers
  • TREC example

SLIDE 36

Relevance justification

  • Why settle for a label?
  • Let workers justify answers
  • INEX: 22% of assignments with comments
  • TREC: 10% of assignments with comments
  • Must be optional

SLIDE 37

Development & testing

SLIDE 38

Development framework

  • Incremental approach
  • Measure, evaluate, and adjust as you go
  • Suitable for repeatable tasks

SLIDE 39

Experiment in production

  • Ad-hoc experimentation vs. ongoing metrics
  • Lots of tasks on the system at any moment
  • Need to grab attention
  • Importance of experiment metadata
  • Scalability
    – Scale on data first, then on workers
    – Size of batch
    – Cost of a deletion
  • When to schedule (see the sketch below)
    – Split a large task into batches and have a single batch in the system
    – Always review feedback from batch n before uploading batch n+1
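A minimal sketch of the batching discipline described above: split the items, keep a single batch live at a time, and pause for a human review of the feedback before uploading the next batch. The functions upload_batch, wait_until_done, collect_feedback, and review are hypothetical hooks for whatever platform calls you use; they are not part of any specific API.

```python
# Illustrative batching loop: one batch in the system at a time, with a
# review gate between batches. upload_batch / wait_until_done /
# collect_feedback / review are hypothetical hooks for your platform.
def run_in_batches(items, batch_size, upload_batch, wait_until_done,
                   collect_feedback, review):
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for n, batch in enumerate(batches, start=1):
        batch_id = upload_batch(batch)         # only this batch is live
        wait_until_done(batch_id)
        feedback = collect_feedback(batch_id)  # workers' open-ended comments
        if not review(n, feedback):            # human decision before batch n+1
            print(f"Stopping after batch {n} to fix instructions.")
            break
```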

SLIDE 40

Advanced applications

  • Training sets for machine learning
  • Active learning
  • Adaptive quality control
  • Automatic generation of black/white lists

SLIDE 41

What’s next?

  • Are we going 100% crowdsourcing?
  • Memory hierarchy
    – Cache, main memory, disk, tape
  • People
    – Experts, editors, workers
  • Task routing problem
    – Not all human computers are created equal
    – Push: workers are passive receivers
    – Pull: workers are active seekers
  • Cost and difficulty
    – {klm}, {family hotels amsterdam}
    – {greek philosophers}, {dining philosophers}

SLIDE 42

FUTURE TRENDS

SLIDE 43

Multiple areas

  • Social/behavioral science
  • Human factors
  • Algorithms
  • Economics
  • Distributed systems
  • Statistics

SLIDE 44

Things that need work

  • UX and guidelines
    – Help the worker
    – Cost of interaction
  • Scheduling and refresh rate
  • Exposure effect
  • Sometimes we just don’t agree
  • How crowdsourcable is your task?

SLIDE 45

Mechanical Turk

  • Advantages
    – Speed of experimentation
    – Price
    – Diversity
    – Payments
  • Disadvantages
    – Crowdsourcing != MTurk
    – Lots of problems and missing features
    – Spam
    – Worker and task quality
    – No analytics
    – Need to build tools around it

SLIDE 46

Problems - IR

  • Methodology
  • Cost models
  • Metrics
  • Re-visit how we do IR evaluation

SLIDE 47

Problems – crowds, clouds and algorithms

  • Infrastructure
    – Current platforms are very rudimentary
    – No tools for data analysis
  • CrowdFlower, oDesk, SamaSource, TurKit, Soylent
  • Dealing with uncertainty
  • Programming crowds
  • Combining CPU + HPU

SLIDE 48

Conclusions

  • Crowdsourcing for relevance evaluation works
  • Fast turnaround, easy to experiment, a few dollars to test
  • But you have to design the experiments carefully
  • Usability considerations
  • Worker quality
  • User feedback extremely useful
  • Need to question all aspects of relevance evaluation
  • Be creative
  • Lots of opportunities to improve current platforms

SLIDE 49

Questions

  • What is the best way to perform human computation?
  • What is the best way to combine CPU with HPU for solving problems?
  • What are the desirable integration points for a computation that involves CPU and HPU?

SLIDE 50

Thank you
