Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings (PowerPoint PPT Presentation)



slide-1
SLIDE 1

Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings

An T. Nguyen¹* · Matthew Halpern¹ · Byron C. Wallace² · Matthew Lease¹

¹University of Texas at Austin  ²Northeastern University

HCOMP 2016

*Presenter

slide-4
SLIDE 4

Probabilistic Modeling

A popular approach to improving label quality.

Dawid & Skene (1979):

◮ Model true labels as hidden variables.
◮ Model worker qualities as parameters.
◮ Estimation: the EM algorithm.

Extensions:

◮ Bayesian inference (Kim & Ghahramani 2012)
◮ Worker communities (Venanzi et al. 2014)
◮ Instance features (Kamar et al. 2015)
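As a concrete reference point, the Dawid & Skene EM loop can be sketched as below. This is illustrative Python, not code from any of the cited papers; the function name `dawid_skene` and the assumption of a complete label matrix are mine.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Minimal Dawid & Skene (1979) EM sketch for categorical labels.

    labels: int array of shape (n_items, n_workers), each entry a class
    in 0..n_classes-1 (assumes every worker labels every item).
    Returns the posterior over true labels, shape (n_items, n_classes).
    """
    n_items, n_workers = labels.shape
    # Initialize posteriors with per-item label proportions (soft majority vote).
    post = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for l in labels[i]:
            post[i, l] += 1
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices.
        prior = post.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for j in range(n_workers):
            for i in range(n_items):
                conf[j, :, labels[i, j]] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over each item's true label, in log space.
        log_post = np.log(prior + 1e-9) + np.zeros((n_items, n_classes))
        for i in range(n_items):
            for j in range(n_workers):
                log_post[i] += np.log(conf[j, :, labels[i, j]])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post
```

The extensions on this slide change pieces of this loop: a Bayesian treatment puts priors on the confusion matrices, and feature-based variants make the prior over true labels depend on instance features.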


slide-6
SLIDE 6

Probabilistic Modeling

Common assumption: a single true label per instance (i.e. an objective task).

Subjective tasks?

◮ No single true label.
◮ A gold standard may not be appropriate (Sen et al., CSCW 2015).

slide-7
SLIDE 7

Video Rating Task

Data:

◮ User interactions on a smartphone.
◮ Varying hardware configurations (CPU frequency, cores, GPU).

Task:

◮ Watch a short video.
◮ Rate user satisfaction from 1 to 5.
◮ 370 videos, ≈ 50 AMT ratings each.


slide-10
SLIDE 10

General Setting

For each instance:

◮ No single true label ...
  (i.e. no instance-level gold standard)
◮ ... but a true distribution over labels.
  (i.e. a gold standard on the instance's label distribution)

Our data: instances are videos; the gold standard is each video's distribution of ratings.

Two tasks:

◮ Predict that distribution.
◮ Detect unreliable workers.


slide-15
SLIDE 15

Model

Intuition:

1. Unreliable workers tend to give unreliable ratings.
2. Unreliable ratings are independent of instances.
   (e.g. rating videos without watching them)

Assumptions:

1. Worker j has a parameter θj governing how often their labels are reliable vs. unreliable.
2. Rating labels are samples from Normal(µ, σ):
   ◮ Unreliable: µ, σ fixed.
   ◮ Reliable: µ, σ vary with the instance.


slide-19
SLIDE 19

Model

(i indexes instances, j indexes workers)

Reliable indicator:  Zij ∼ Ber(θj)
Labels:              Lij | Zij = 0 ∼ N(3, s)
                     Lij | Zij = 1 ∼ N(µi, σi²)
Features → µ, σ:     µi = wᵀxi,  σi = exp(vᵀxi)
Prior:               θj ∼ Beta(A, B)

[Plate diagram: θj ∼ Beta(A, B) per worker; Zij ∼ Ber(θj); Lij ∼ Normal, with fixed (3, s) for the unreliable component and (xi, w, v) determining the reliable one; plates over instances and workers.]
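The generative story on this slide can be sketched as forward sampling. This is a minimal Python sketch, not the paper's code; the function name `sample_labels`, the feature matrix `X`, and the default `s = 1.0` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_labels(X, w, v, theta, s=1.0, rng=rng):
    """Forward-sample ratings from the slide's generative model.

    For each instance i and worker j:
      Z_ij ~ Bernoulli(theta_j)            reliable indicator
      L_ij | Z_ij = 0 ~ N(3, s)            unreliable: fixed mean 3
      L_ij | Z_ij = 1 ~ N(mu_i, sigma_i)   reliable: instance-dependent
    with mu_i = w . x_i and sigma_i = exp(v . x_i).
    """
    mu = X @ w                  # instance means
    sigma = np.exp(X @ v)       # instance std devs (positive by construction)
    n_items, n_workers = X.shape[0], theta.shape[0]
    Z = rng.random((n_items, n_workers)) < theta  # True = reliable
    reliable = rng.normal(mu[:, None], sigma[:, None], (n_items, n_workers))
    unreliable = rng.normal(3.0, s, (n_items, n_workers))
    return np.where(Z, reliable, unreliable), Z
```

Note that the `exp` link for σ is what lets v be optimized unconstrained while keeping every standard deviation positive.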


slide-21
SLIDE 21

Learning

(For the model without a prior on θ)

EM algorithm, iterate:

E-step: infer the posterior over Zij (analytic solution).
M-step: optimize the parameters w, v, and θ (BFGS).


slide-24
SLIDE 24

Learning

(For the Bayesian model, with a prior on θ)

Closed-form EM is not possible.

Mean-field: approximate the posterior p(z, θ) by

q(z, θ) = ∏ij q(Zij) · ∏j q(θj)

Minimize KL(q‖p) using coordinate descent.
(Similar to the LDA topic model; details in the paper.)
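One coordinate-descent sweep of this mean-field scheme can be sketched as below. This is my reconstruction from the slide's factorization plus the standard Beta-Bernoulli conjugate updates; the paper's exact updates may differ, and the finite-difference `digamma` is a stdlib-only stand-in for the special function.

```python
import math
import numpy as np

def digamma(x, h=1e-6):
    # Crude finite-difference digamma via lgamma; adequate for a sketch.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def meanfield_step(L, mu, sigma, A, B, qz, s=1.0):
    """One sweep of q(z, theta) = prod_ij q(Z_ij) prod_j q(theta_j).

    qz: current q(Z_ij = 1), shape (n_items, n_workers).
    Returns updated qz and the Beta(a_j, b_j) parameters of each q(theta_j).
    """
    # Update q(theta_j): Beta prior plus expected reliable/unreliable counts.
    a = A + qz.sum(axis=0)
    b = B + (1.0 - qz).sum(axis=0)
    # Update q(Z_ij) using E_q[log theta_j] = psi(a_j) - psi(a_j + b_j).
    e_log_th = np.array([digamma(aj) - digamma(aj + bj) for aj, bj in zip(a, b)])
    e_log_1mth = np.array([digamma(bj) - digamma(aj + bj) for aj, bj in zip(a, b)])

    def logpdf(x, m, sd):
        return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - m) / sd) ** 2

    log1 = e_log_th[None, :] + logpdf(L, mu[:, None], sigma[:, None])
    log0 = e_log_1mth[None, :] + logpdf(L, 3.0, s)
    m = np.maximum(log1, log0)
    p1 = np.exp(log1 - m)
    qz = p1 / (p1 + np.exp(log0 - m))
    return qz, a, b
```

Iterating this sweep to convergence is the coordinate descent on KL(q‖p) the slide refers to; w and v would still be optimized in an outer step.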


slide-27
SLIDE 27

Evaluation

Difficulty: the task is subjective, so we don't know who is truly reliable.

Solution:

◮ Assume all labels in the data are reliable.
◮ Select p% of workers at random.
◮ Change q% of their labels to 'unreliable labels'.
◮ p, q are evaluation parameters:
  p ∈ {0, 5, 10, 15, 20}, q ∈ {20, 40, 60, 80, 100}
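The perturbation protocol can be sketched as below. This is an assumption-laden stand-in: the paper collects real spammer-style labels on AMT (next slide), while here they are replaced by uniform 1-5 ratings so the sketch is self-contained.

```python
import numpy as np

def inject_unreliable(L, p, q, rng=None):
    """Pick p% of workers at random and replace q% of their labels.

    L: (n_items, n_workers) rating matrix.
    Returns a perturbed copy and the set of perturbed worker indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_items, n_workers = L.shape
    out = L.copy()
    n_bad = round(n_workers * p / 100)
    bad = rng.choice(n_workers, size=n_bad, replace=False)
    for j in bad:
        n_flip = round(n_items * q / 100)
        rows = rng.choice(n_items, size=n_flip, replace=False)
        # Stand-in 'unreliable labels': uniform ratings in 1..5.
        out[rows, j] = rng.integers(1, 6, size=n_flip)
    return out, set(bad.tolist())
```

Sweeping p and q over the grids on this slide then gives the evaluation conditions.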


slide-30
SLIDE 30

Evaluation

Distribution of 'unreliable labels':

AMT task:

◮ Pretend to be a spammer.
◮ Give ratings without watching the video.

Recall our model:

◮ Unreliable labels ∼ N(3, s).
◮ i.e. we don't cheat.


slide-32
SLIDE 32

Baselines

Predict the rating distribution (mean & variance):

◮ Two linear regression models ...
◮ ... one for the mean, one for the variance.

Detect unreliable workers: Average Deviation (AD)

◮ For each label: deviation from the instance's mean rating.
◮ For each worker: average these deviations.
◮ High AD → unreliable.
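The Average Deviation baseline is a few lines. A sketch, assuming a complete rating matrix (the real data has ≈ 50 ratings per video from varying worker subsets):

```python
import numpy as np

def average_deviation(L):
    """AD score per worker: mean absolute deviation of each worker's
    labels from the instance-level mean rating. High AD -> unreliable.

    L: (n_items, n_workers) rating matrix.
    """
    dev = np.abs(L - L.mean(axis=1, keepdims=True))  # deviation per label
    return dev.mean(axis=0)                          # average per worker
```

Its weakness, relative to the model, is that it treats every deviation as evidence of unreliability, even on instances where ratings genuinely disagree.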

slide-33
SLIDE 33

Results (varying unreliable workers)

(Baselines: LR2 = the two linear-regression models, AD = Average Deviation; NEW = our model, B-NEW = our Bayesian model.)


slide-36
SLIDE 36

Observations

◮ The Bayesian model (B-NEW) is better at prediction ...
◮ ... but worse at detecting unreliable workers.

Prior on worker parameter θ:

◮ Reduces overfitting of w, v.
◮ Creates a bias on workers.

Other experiments:

◮ Varying unreliable ratings, training-data size, number of workers.
◮ Similar results (in the paper).


slide-40
SLIDE 40

Discussion

◮ Subjective tasks: common, but little prior work.
◮ Our method improves both prediction and detection.

Extensions:

◮ Improve recommender systems.
◮ Other subjective tasks.
◮ More realistic evaluation.
◮ Better learning for the Bayesian model.

Data + code on GitHub.
Acknowledgments: reviewers, workers, NSF (and Angry Birds).

Questions?