SLIDE 1

Learning Human Preferences and Perceptions From Data

Robert Nowak, University of Wisconsin, MIDAS, March 2017

SLIDE 2

  • preference learning: http://pulse.media.mit.edu
  • metric learning: https://edpsych.education.wisc.edu/
  • rating systems: www.newyorker.com/cartoons/vote

SLIDE 3

Minimizing Human Effort

[pipeline: raw unlabeled data → human labeling → machine learning → predictive model]

“Active machine learning”: the machine decides which data people should label next.

  • optimize machine learning algorithms to minimize the need for human feedback
SLIDE 4

Help health experts train machines to interpret electronic health records. Help scientists adaptively select experiments to determine which genes are the most important.

SLIDE 5

nextml.org

Lalit Jain, Kevin Jamieson

SLIDE 6

Bob Mankoff, Cartoon Editor, The New Yorker

“Flawless execution!”

  • n ≈ 5000 captions submitted each week
  • the contest is crowdsourced to volunteers who rate captions
  • goal: identify the funniest caption
  • 50+ weeks of experiments
SLIDE 7

www.newyorker.com/cartoons/vote. Each week: 10-20K participants, 500-1000K ratings.

SLIDE 8

Ratings and Confidence Intervals

[figure: captions 1, 2, 3, …, n−1, n, each with an average rating and a confidence interval; more ratings ⇒ smaller intervals]
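The interval in the figure can be made concrete. Below is a minimal sketch, assuming ratings lie on the contest's 1-3 scale and using a Hoeffding-style bound; the talk does not specify the exact bound, so treat this as an illustration only.

```python
import math

def rating_confidence_interval(ratings, delta=0.05, lo=1.0, hi=3.0):
    """Hoeffding confidence interval for a caption's mean rating.

    ratings: observed ratings in [lo, hi] (1 = not funny, ..., 3 = funny)
    delta:   failure probability; the true mean lies inside w.p. >= 1 - delta
    """
    n = len(ratings)
    mean = sum(ratings) / n
    width = (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean - width, mean + width

# more ratings => a smaller interval around the same mean
print(rating_confidence_interval([1, 2, 3, 2, 2]))
print(rating_confidence_interval([1, 2, 3, 2, 2] * 20))
```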

SLIDE 9

Non-Adaptively Collecting Ratings

Keep collecting an equal number of ratings for each caption (1, 2, 3, …, n−1, n) until there is a statistically significant winner.

SLIDE 10

Adaptively Collecting Ratings: Successive Elimination Algorithm

Using the same logic, we can stop rating captions whose upper confidence bounds are less than the greatest lower confidence bound. Successive elimination focuses the rating process on the top captions (see the sketch below).
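A minimal sketch of the successive elimination idea, assuming a hypothetical oracle `get_rating(i)` that returns one more crowd rating (in [1, 3]) for caption i; the confidence radius below is a generic Hoeffding/union-bound choice, not necessarily the one used in the contest.

```python
import math

def successive_elimination(n_captions, get_rating, delta=0.05, max_rounds=10_000):
    """Rate all still-active captions each round; drop any caption whose upper
    confidence bound falls below the largest lower confidence bound."""
    active = set(range(n_captions))
    counts = [0] * n_captions
    sums = [0.0] * n_captions

    def radius(i):
        # Hoeffding-style radius for ratings in [1, 3], with a crude union bound
        return 2.0 * math.sqrt(math.log(4.0 * n_captions * counts[i] ** 2 / delta)
                               / (2.0 * counts[i]))

    for _ in range(max_rounds):
        if len(active) <= 1:
            break
        for i in active:
            sums[i] += get_rating(i)
            counts[i] += 1
        means = {i: sums[i] / counts[i] for i in active}
        best_lower = max(means[i] - radius(i) for i in active)
        active = {i for i in active if means[i] + radius(i) >= best_lower}

    return max(active, key=lambda i: sums[i] / counts[i])
```

The non-adaptive scheme on the previous slide corresponds to never shrinking the active set.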

SLIDE 11

Adaptively Collecting Ratings: Upper Confidence Bound Algorithm

Guaranteed to identify the funniest caption, according to the crowd’s collective wisdom, as quickly as possible.
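For comparison, a sketch of a generic upper-confidence-bound sampling rule: always request the next rating for the caption whose upper bound is currently largest. The contest experiments used lil'UCB (Jamieson et al., 2014); this simplified version only illustrates the sampling principle, and `get_rating(i)` is again a hypothetical rating oracle.

```python
import math

def ucb_best_caption(n_captions, get_rating, budget):
    """Spend a fixed budget of ratings, always rating the caption with the
    largest upper confidence bound, then return the best empirical mean."""
    counts = [0] * n_captions
    sums = [0.0] * n_captions
    for i in range(n_captions):          # one initial rating per caption
        sums[i] += get_rating(i)
        counts[i] += 1
    for t in range(n_captions, budget):
        ucb = [sums[i] / counts[i] + 2.0 * math.sqrt(2.0 * math.log(t + 1) / counts[i])
               for i in range(n_captions)]
        best = max(range(n_captions), key=lambda i: ucb[i])
        sums[best] += get_rating(best)
        counts[best] += 1
    return max(range(n_captions), key=lambda i: sums[i] / counts[i])
```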

SLIDE 12

Adaptively Collecting Ratings: Upper Confidence Bound Algorithm

Virtually no gap between theory and practice in this application.

SLIDE 13

Ranking Accuracy

[plot: Prob(best caption in top 10) vs. # ratings collected (1000s), for conventional crowdsourcing and UCB; UCB needs 4 to 5 times fewer ratings]

SLIDE 14

What we learned

Rating scale: 1 = not funny, 2 = somewhat funny, 3 = funny. Almost all captions have low average ratings, and rating variances are typically much less than 1. Sharper confidence interval bounds based on KL divergences can exploit this fact: another 2X to 3X improvement!

[plot: Prob(best caption in top 10) vs. # ratings collected]
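One way to get sharper intervals for low-mean captions is a Chernoff/KL-style bound in place of Hoeffding. The sketch below inverts the Bernoulli KL divergence by binary search for ratings rescaled to [0, 1]; it illustrates the idea only and is not necessarily the exact bound used in the contest.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_upper_bound(mean, n, delta=0.05, iters=50):
    """Largest q >= mean with n * KL(mean, q) <= log(1/delta), found by bisection.

    `mean` is an empirical mean rescaled to [0, 1], e.g. (rating - 1) / 2 on a 1-3 scale.
    """
    level = math.log(1.0 / delta) / n
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# low-mean captions get a much tighter upper bound than Hoeffding would give
print(kl_upper_bound(0.05, n=200))
```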

SLIDE 15

Learning Models of Perceived Similarities

Tim Rogers

Is the emotion in i more like that in j or in k? Suppose the comparisons are consistent with a low-dimensional embedding; e.g., each item is associated with a point x ∈ R^d and comparisons are of the form ‖x_i − x_j‖² < ‖x_i − x_k‖²?

SLIDE 16

Ordinal Embedding (aka Nonmetric Multidimensional Scaling)

From ordinal information

dist(A,B) > dist(A,C), dist(A,B) < dist(A,D), dist(B,C) > dist(B,D), dist(A,D) > dist(D,E), …

to a metric representation [embedding of the points A, B, C, D, E].

SLIDE 17

Metric Learning

From ordinal information

dist(A,B) > dist(A,C), dist(A,B) < dist(A,D), dist(B,C) > dist(B,D), dist(A,D) > dist(D,E), …

plus “raw” feature information x_A, x_B, x_C, x_D, x_E, to a metric representation

d(A, B) = (x_A − x_B)^T K (x_A − x_B)

where the kernel matrix K defines the metric representation.
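For concreteness, the learned distance is just a Mahalanobis-style quadratic form. A minimal sketch with made-up feature vectors (the vectors and the example K are illustrative, not data from the talk):

```python
import numpy as np

def metric_distance_sq(x_a, x_b, K):
    """d(A, B) = (x_A - x_B)^T K (x_A - x_B) for a positive semidefinite K."""
    diff = np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float)
    return float(diff @ K @ diff)

x_a, x_b = [6.0, 12.0, 6.0], [2.0, 6.0, 1.0]
print(metric_distance_sq(x_a, x_b, np.eye(3)))                 # plain squared Euclidean distance
print(metric_distance_sq(x_a, x_b, np.diag([1.0, 0.5, 0.0])))  # a K that ignores the third feature
```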

SLIDE 18

Applications of Metric Learning

Learned metrics can be used to

  1. improve the performance of metric-based algorithms such as k-nearest neighbors, clustering, or ranking algorithms
  2. understand and/or visualize human perceptions, reasoning, and preferences

  • computer vision: classification, recognition, tracking
  • information retrieval: search engines, NLP, image search
  • bioinformatics: sequence analysis, string matching
  • cognitive science: perception, reasoning, learning
SLIDE 19

Educational Science

Martina Rau, Blake Mason, Lalit Jain

Identifying the visual features that students focus on (and miss) informs the design of tutoring systems.

SLIDE 20

[figure: visual representations of the items being compared; features: # carbon, # hydrogen, # oxygen]

SLIDE 21

Low-Dimensional Metrics

x_1, x_2, … ∈ R^3 (features: # carbon, # hydrogen, # oxygen), embedded into R^2: distances D_ij = (x_i − x_j)^T K (x_i − x_j) where K is rank 2.

Visualization: a two-dimensional metric representation is easy to interpret (see the sketch after the references).

Davis, Jason V., et al. “Information-theoretic metric learning.” Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
P. Jain, B. Kulis, and I. Dhillon. “Inductive regularized learning of kernel functions.” Advances in Neural Information Processing Systems (NIPS), 2010.
Kunapuli, Gautam, and Jude Shavlik. “Mirror descent for metric learning: A unified approach.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2012.
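A rank-2 kernel matrix K gives the two-dimensional view directly: factor K ≈ LᵀL with L ∈ R^{2×3} and plot Lx_i. A sketch assuming some already-learned PSD matrix K (the K and feature vectors below are illustrative):

```python
import numpy as np

def low_dim_view(K, X, dim=2):
    """Map rows of X into `dim` dimensions so that squared Euclidean distances in the
    projection equal (x_i - x_j)^T K (x_i - x_j) whenever rank(K) <= dim."""
    vals, vecs = np.linalg.eigh(K)                      # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dim]
    L = np.sqrt(np.clip(vals[top], 0.0, None))[:, None] * vecs[:, top].T   # dim x d
    return X @ L.T                                      # n x dim coordinates

# illustrative rank-2 metric over (# carbon, # hydrogen, # oxygen)
K = np.array([[1.0, 0.2, 0.0],
              [0.2, 0.5, 0.0],
              [0.0, 0.0, 0.0]])   # zero row/column: no weight on oxygen
X = np.array([[6.0, 12.0, 6.0],
              [2.0, 6.0, 1.0],
              [3.0, 8.0, 1.0]])
print(low_dim_view(K, X))
```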

SLIDE 22

Sparse Metrics

x_1, x_2, … ∈ R^3 (features: # carbon, # hydrogen, # oxygen), embedded into 2 of the 3 dimensions ⇒ K has 2 nonzero rows/columns.

Implication: students don’t pay much attention to the number of oxygen atoms.

Ying, Yiming, Kaizhu Huang, and Colin Campbell. “Sparse metric learning via smooth optimization.” Advances in Neural Information Processing Systems, 2009.
Rosales, Rómer, and Glenn Fung. “Learning sparse metrics via linear programming.” Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.
Atzmon, Yuval, Uri Shalit, and Gal Chechik. “Learning sparse metrics, one feature at a time.” Journal of Machine Learning Research 1 (2015): 1-48.

SLIDE 23

Ordinal Embedding

From ordinal information

dist(A,B) > dist(A,C), dist(A,B) < dist(A,D), dist(B,C) > dist(B,D), …

to a low-dimensional metric representation.

Ordinal embedding is a special case of metric learning, starting with the trivial embedding of items to indicator vectors: x_A = (1, 0, 0, …)^T, x_B = (0, 1, 0, …)^T, x_C = (0, 0, 1, …)^T, ⋯

SLIDE 24

Classic Papers

SLIDE 25

Recent Methods Papers

New embedding algorithms: classically formulated as a non-convex optimization. Modern papers propose new algorithms, convex relaxations, and regularization methods, but do not mathematically characterize embedding accuracy.

[3] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David J. Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics, pages 11-18, 2007.
[4] Brian McFee and Gert Lanckriet. Learning multi-modal similarity. The Journal of Machine Learning Research, 12:491-523, 2011.
[5] Omer Tamuz, Ce Liu, Ohad Shamir, Adam Kalai, and Serge J. Belongie. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 673-680, 2011.
[6] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1-6. IEEE, 2012.
[7] Eric Heim, Matthew Berger, Lee Seversky, and Milos Hauskrecht. Active perceptual similarity modeling with auxiliary information. arXiv preprint arXiv:1511.02254, 2015.

Schultz, M., & Joachims, T. (2004). Learning a distance metric from relative comparisons. Advances in Neural Information Processing Systems (NIPS), 41.

SLIDE 26

Two Open Problems

Identifiability: if the ordinal constraints do agree with some ground-truth metric, is it uniquely identified by them? Can two or more very different embeddings represent the same ordinal constraints?

[figure: two candidate embeddings of the points A, B, C, D, E]

Sample complexity: how many ordinal constraints are needed to determine the underlying metric representation? With n = 100 items, there are about n³/2 ≈ 500,000 distinct triplet constraints.

SLIDE 27

A generative model maps a metric representation (an embedding of A, B, C, D, E) to ordinal information:

dist(A,B) > dist(A,C), dist(A,B) < dist(A,D), dist(B,C) > dist(B,D), dist(A,D) > dist(D,E), …

Does an inverse exist? This problem has been studied for 50+ years, but the fundamental question of whether an embedding can be truly recovered from distance/similarity comparisons had not been answered.

SLIDE 28

Is shoe A more like B or C? Researchers at the Air Force Research Lab gathered comparative judgments for n = 85 shoes. Judgments for all 296,310 triplets were collected via Amazon Mechanical Turk.

[plot: hold-out prediction error vs. # training samples / n]

This embedding predicts reasonably well, but are there many fundamentally different embeddings that are equally good?

SLIDE 29

Ordinal Embedding

Latent embedding: X = [x_1 ⋯ x_n] ∈ R^{d×n}

Euclidean distance matrix: D* with entries D*_ij = ‖x_i − x_j‖², equivalently D*_ij = (e_i − e_j)^T K (e_i − e_j) with K = X^T X.

Triplet comparisons: {D*_ij < D*_ik}

Find a distance matrix D that agrees with the triplets, then factorize it to obtain X.
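These identities are easy to check numerically. The sketch below builds D from X via the Gram matrix K = XᵀX, and factorizes D back into an embedding with classical MDS (double centering plus an eigendecomposition), which recovers X up to a rigid transformation. The five example points match the next slide, assuming its fifth point is the origin.

```python
import numpy as np

def distance_matrix(X):
    """Squared-distance matrix D_ij = ||x_i - x_j||^2 from the columns of X (d x n)."""
    G = X.T @ X                          # Gram matrix K = X^T X
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2.0 * G

def embed_from_distances(D, d):
    """Classical MDS: factor D into a d x n embedding (unique up to rigid motions)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    G = -0.5 * J @ D @ J                 # centered Gram matrix
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(vals)[::-1][:d]
    return np.sqrt(np.clip(vals[top], 0.0, None))[:, None] * vecs[:, top].T

X = np.array([[1.0, 0.0, -1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, -1.0, 0.0]])     # columns are the points
D = distance_matrix(X)
X_hat = embed_from_distances(D, d=2)
assert np.allclose(distance_matrix(X_hat), D)  # same distances, possibly rotated/reflected points
```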

SLIDE 30

Example

The five points (1,0), (0,1), (−1,0), (0,−1), (0,0) have squared-distance matrix

D = [ 0 2 4 2 1
      2 0 2 4 1
      4 2 0 2 1
      2 4 2 0 1
      1 1 1 1 0 ]

D defines the embedding up to rigid transformations (translations, rotations, reflections).

SLIDE 31

Embedding via Optimization

Given constraints: let S denote the set of constraint triplets, and for each t = (i, j, k) ∈ S let

y_t = +1 if D*_ij < D*_ik, and y_t = −1 if D*_ij > D*_ik.

For any distance matrix D, let D’s predictions be ŷ_t(D) = +1 if D_ij < D_ik, and −1 if D_ij > D_ik.

Let D̂ be the distance matrix that minimizes the number of mistakes:

min_D Σ_{t∈S} 1{ ŷ_t(D) ≠ y_t }

SLIDE 32

Modeling Noise and Errors

Inconsistencies and errors in human judgments can be modeled probabilistically: comparative judgments provide noisy “one-bit” measurements of the differences of distances D*_ij − D*_ik.

[plot: P(y_ijk = 1) as an increasing function of D*_ik − D*_ij, passing through 1/2 at 0]

SLIDE 33

Problem Formulation

Ordinal Embedding Problem. Consider n points x_1, x_2, …, x_n in d-dimensional Euclidean space with distance matrix D* (latent embedding X = [x_1 ⋯ x_n] ∈ R^{d×n}, D*_ij = ‖x_i − x_j‖²). Let S denote a collection of triplets, and for each t = (i, j, k) ∈ S observe an independent random variable

y_t = +1 with probability f(D*_ik − D*_ij)   (“i is probably closer to j”)
y_t = −1 with probability 1 − f(D*_ik − D*_ij)   (“i is probably closer to k”)

where the link function f : R → [0, 1] is monotonic increasing. Estimate D* from {y_t}.
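A small sketch of this generative model, using the logistic link f(x) = 1 / (1 + e^{-x}) as one concrete monotonic choice (a logistic model also appears in the simulation at the end of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_triplet_label(D_star, i, j, k):
    """Observe y_t = +1 w.p. f(D*_ik - D*_ij), else -1, for the triplet t = (i, j, k)."""
    p = logistic(D_star[i, k] - D_star[i, j])
    return 1 if rng.random() < p else -1

def sample_triplets(D_star, num):
    """Draw `num` random triplets t = (i, j, k) with their noisy labels y_t."""
    n = D_star.shape[0]
    out = []
    for _ in range(num):
        i, j, k = rng.choice(n, size=3, replace=False)
        out.append(((i, j, k), sample_triplet_label(D_star, i, j, k)))
    return out
```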

SLIDE 34

Intuition

Can we uniquely recover D* from the differences D*_ij − D*_ik? Suppose we ask m people “is i closer to j or k?” and average the results:

p̂_ijk = (1/m) Σ_{ℓ=1}^m y_ijk^(ℓ)

For large m, p̂_ijk ≈ f(D*_ik − D*_ij), so f⁻¹(p̂_ijk) ≈ D*_ik − D*_ij.

SLIDE 35

Problem with Difference Measurements

The differences D*_ij − D*_ik are invariant to the average distance. Example: the differences of distances in D and D′ below are exactly the same. Are both D and D′ valid distance matrices corresponding to different embeddings?

D = [ 0 2 4 2 1
      2 0 2 4 1
      4 2 0 2 1
      2 4 2 0 1
      1 1 1 1 0 ]

D′ = [ 0    2+α  4+α  2+α  1+α
       2+α  0    2+α  4+α  1+α
       4+α  2+α  0    2+α  1+α
       2+α  4+α  2+α  0    1+α
       1+α  1+α  1+α  1+α  0   ]

SLIDE 36

Problem with Difference Measurements

D = [ 0 1 1
      1 0 1
      1 1 0 ]

D′ = [ 0 2 2
       2 0 2
       2 2 0 ]

Different distances and embeddings, but the same differences of distances (all zero)!

SLIDE 37

Rigidity from Differences of Distances

[figure: two embeddings of the points A, B, C, D, E] Example: these two embeddings cannot have the same differences of distances. Luckily, if we have a sufficient number of points, then the differences of distances uniquely determine the embedding.

SLIDE 38

Orthogonal Representation

Let J be the n × n matrix with ones off the diagonal and zeros on the diagonal. Any distance matrix D has an orthogonal decomposition

D = C + μ_D J

where C is the “centered” distance matrix (measurable from differences of distances) and μ_D is the average non-zero distance, so μ_D J is the “mean” distance matrix (unobservable). Key question: when is D uniquely determined by C?

SLIDE 39

Example

For the five-point example (points (1,0), (0,1), (−1,0), (0,−1), (0,0)), μ_D = 2 and D = C + 2 J:

D = [ 0 2 4 2 1
      2 0 2 4 1
      4 2 0 2 1
      2 4 2 0 1
      1 1 1 1 0 ]

C = [  0  0  2  0 −1
       0  0  0  2 −1
       2  0  0  0 −1
       0  2  0  0 −1
      −1 −1 −1 −1  0 ]

SLIDE 40

Recovery of D from C

Example: the differences of distances in D and D′ = D + αJ (from Slide 35) are exactly the same, but the distances in D′ can be represented by a 2-dimensional embedding if and only if α = 0.

[plot: “energy outside 2 dimensions” as a function of α]

SLIDE 41

Recovery of D from C

A remarkable fact. Lemma 1: For any n × n distance matrix D (of n points in d dimensions) with n > d + 2,

D = C + λ₂(C) J

where λ₂(C) is the second-largest eigenvalue of C. So if the number of points is larger than the embedding dimension plus 2, then D is uniquely determined by a nonlinear function of C.

SLIDE 42

Example

For the five points (1,0), (0,1), (−1,0), (0,−1), (0,0): n = 5 > d + 2 = 4, λ₂(C) = 2, and D = C + λ₂(C) J:

C = [  0  0  2  0 −1
       0  0  0  2 −1
       2  0  0  0 −1
       0  2  0  0 −1
      −1 −1 −1 −1  0 ]

D = [ 0 2 4 2 1
      2 0 2 4 1
      4 2 0 2 1
      2 4 2 0 1
      1 1 1 1 0 ]

SLIDE 43

Question 1: Identifiability

Question: Can D* be uniquely identified from triplet comparisons?

Fact 1: p̂_ijk = (1/m) Σ_{ℓ=1}^m y_ijk^(ℓ) ≈ f(D*_ik − D*_ij), so f⁻¹(p̂_ijk) ≈ D*_ik − D*_ij ⇒ C*.

Fact 2: D* = C* + λ₂(C*) J.

Answer: If we have n > d + 2 items and a sufficiently large number of triplet comparisons, then we can accurately estimate all differences of distances, or equivalently the centered distance matrix C*, and hence (by Fact 2) D* itself.

SLIDE 44

Question 2: Sample Complexity

Question: How many triplet comparisons are sufficient to accurately estimate C*?

Answer: O(dn log n) triplet comparisons are sufficient.

Intuition: the latent embedding X has d × n unknowns.

SLIDE 45

Optimization Constraints

Recall D_ij = (e_i − e_j)^T K (e_i − e_j), with K = X^T X. If the points are embedded into d dimensions, then the rank of K is at most d: rank(K) ≤ d. Also, if we have n points in d dimensions, then K is n × n and the sum of its squared entries is on the order of n²: ‖K‖_F² ≲ n².

SLIDE 46

Optimization Constraints

Key assumptions:

  1. d-dimensional embedding ⇒ rank(K) ≤ d
  2. n items to embed ⇒ ‖K‖_F ≤ n

The non-convex constraint set is contained in a convex one:

{D : rank(K) ≤ d and ‖K‖_F ≤ n} ⊂ {D : trace(K) ≤ √d · n}

so we should search/optimize over the set D_{d,n} = {D : trace(K) ≤ √d · n}, where D_ij = (e_i − e_j)^T K (e_i − e_j) and K = X^T X.

SLIDE 47

Learning via Prediction

Ideally we would minimize the number of mistakes,

min_D Σ_{t∈S} 1{ ŷ_t(D) ≠ y_t },

but instead we minimize the logistic loss L_ijk(D), which is convex and easy to minimize and is a surrogate for the ideal 0-1 loss.

[plot: ideal 0-1 loss and logistic loss L_ijk(D) as a function of the margin; “D_ij < D_ik agrees with y_t” on one side, “disagrees” on the other]

Theorem 1. Let S denote a set of M triplet comparisons, selected uniformly at random, let D_{d,n} = {D : trace(K) ≤ √d · n}, and let D̂ be the solution to the convex optimization

min_{D ∈ D_{d,n}} Σ_{(i,j,k)∈S} L_ijk(D).

Then

(1/n²) Σ_{i,j=1}^n |D̂_ij − D*_ij|² = O( dn log n / M ).
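A sketch of the convex program in Theorem 1, parameterizing D through the Gram matrix K (so D_ij = K_ii + K_jj − 2K_ij) and running projected gradient descent onto {K ⪰ 0 : trace(K) ≤ √d · n}. This is a simplified illustration with a fixed step size, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def project_psd_trace(K, tau):
    """Euclidean projection of a symmetric matrix onto {K PSD, trace(K) <= tau}."""
    vals, vecs = np.linalg.eigh((K + K.T) / 2.0)
    vals = np.clip(vals, 0.0, None)
    if vals.sum() > tau:                             # project eigenvalues onto the simplex of radius tau
        u = np.sort(vals)[::-1]
        css = np.cumsum(u) - tau
        rho = np.max(np.nonzero(u * np.arange(1, len(u) + 1) > css)[0])
        vals = np.clip(vals - css[rho] / (rho + 1.0), 0.0, None)
    return (vecs * vals) @ vecs.T

def ordinal_embed(n, d, labeled_triplets, iters=300, step=0.05):
    """Projected gradient descent on the logistic triplet loss over
    {K PSD, trace(K) <= sqrt(d) * n}; y_t = +1 means D_ij < D_ik is expected."""
    tau = np.sqrt(d) * n
    K = np.zeros((n, n))
    for _ in range(iters):
        diag = np.diag(K)
        dist = diag[:, None] + diag[None, :] - 2.0 * K   # D_ij computed from K
        grad = np.zeros((n, n))
        for (i, j, k), y in labeled_triplets:
            m = dist[i, k] - dist[i, j]                  # margin: y * m > 0 means agreement
            c = -y * sigmoid(-y * m)                     # derivative of log(1 + exp(-y * m)) in m
            for (a, b), s in (((i, k), c), ((i, j), -c)):
                grad[a, a] += s
                grad[b, b] += s
                grad[a, b] -= s
                grad[b, a] -= s
        K = project_psd_trace(K - step * grad / len(labeled_triplets), tau)
    return K    # factorize K = X^T X (e.g. by eigendecomposition) to get an embedding
```

Combined with the hypothetical `sample_triplets` sketch from Slide 33, the returned K̂ can be factorized as in the classical-MDS sketch to obtain estimated coordinates.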

SLIDE 48

Minimizing Human Effort

[pipeline: set of items → human judgments → machine learning → learned embedding]

The crowd’s effort need only scale linearly with the number of items to be embedded:

average squared distance error = (1/n²) Σ_{i,j=1}^n |D̂_ij − D*_ij|² = O( √( dn log n / # of comparisons ) ).

SLIDE 49

Two Closed Problems

Identifiability: if the ordinal constraints do agree with some ground-truth metric, is it uniquely identified by them? YES.

Sample complexity: how many ordinal constraints are needed to determine the underlying metric representation? O(nd log n).

New embedding algorithms: better theoretical understanding led to the development of better algorithms.

[simulation: n = 64 points in d = 2 dimensions with E‖x_i‖² = 1 and a logistic model]

SLIDE 50

Thanks!

http://www.newyorker.com/cartoons/vote

Jamieson, K.G., Malloy, M., Nowak, R.D. and Bubeck, S., 2014. lil’UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. In COLT (Vol. 35, pp. 423-439).

Jamieson, K.G., Jain, L., Fernandez, C., Glattard, N.J. and Nowak, R., 2015. NEXT: A system for real-world development, evaluation, and application of active learning. In Advances in Neural Information Processing Systems (pp. 2656-2664).

Jain, L., Jamieson, K.G. and Nowak, R., 2016. Finite Sample Prediction and Recovery Bounds for Ordinal Embedding. In Advances in Neural Information Processing Systems (pp. 2703-2711).

Rau, M.A., Mason, B. and Nowak, R., 2016. How to model implicit knowledge? Similarity learning methods to assess perceptions of visual representations. In Proceedings of the 9th International Conference on Educational Data Mining.