SLIDE 1

Ranking observations with latent information and binary feedback

Nicolas Vayatis

École Normale Supérieure de Cachan

Workshop on The Mathematics of Ranking at AIM - Palo Alto, August 2010

SLIDE 2

1. Statistical Issues in Machine Learning
2. Prediction of Preferences
3. Other Criteria for Ranking Error

SLIDE 3

Statistical Issues in Machine Learning

SLIDE 4

Generalization ability of decision rules

- Class $\mathcal{G}$ of candidate decision rules
- Risk functional $L$, the "objective" criterion
- Past data $D_n$ with sample size $n$
- A method/algorithm outputs an empirical estimate $\hat g_n \in \mathcal{G}$

Main questions:

- Strong Bayes-risk consistency:
  $$L(\hat g_n) \xrightarrow{\text{a.s.}} L^* = \inf_g L(g)\,, \quad n \to \infty\,?$$
- Rate of this convergence?

SLIDE 5

An example - Binary classification with i.i.d. data

Data $D_n = \{(X_i, Y_i) : i = 1, \dots, n\}$, i.i.d. copies of $(X, Y) \in \mathcal{X} \times \{-1, +1\}$.

Empirical Risk Minimization (ERM) principle:
$$\hat g_n = \arg\min_{g \in \mathcal{G}} L_n(g)\,, \qquad L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{g(X_i) \neq Y_i\}$$

First-order analysis: with probability at least $1 - \delta$,
$$L(\hat g_n) - \inf_{g \in \mathcal{G}} L(g) \le 2\, \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \big| L_n(g) - L(g) \big| \Big] + c\, \sqrt{\frac{\log(1/\delta)}{n}}$$

Tools: empirical process techniques, concentration inequalities.
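To make the ERM principle concrete, here is a minimal Python sketch (an illustration, not part of the talk): it minimizes the empirical 0-1 risk over a small finite class of threshold rules on simulated data. The data-generating model and the stump class are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative): X uniform on [0, 1], P(Y=+1 | X=x) = eta(x).
n = 500
X = rng.uniform(0, 1, size=n)
eta = 0.2 + 0.6 * (X > 0.5)
Y = np.where(rng.uniform(size=n) < eta, 1, -1)

# Finite class G of decision stumps g_t(x) = sign(x - t).
thresholds = np.linspace(0, 1, 101)

def empirical_risk(t):
    """L_n(g_t) = (1/n) * sum_i I{g_t(X_i) != Y_i}."""
    g = np.where(X > t, 1, -1)
    return np.mean(g != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]  # the empirical risk minimizer g_n
print(f"ERM threshold: {t_hat:.2f}  empirical risk: {risks.min():.3f}")
```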

SLIDE 6

Complexity Control

Vapnik-Chervonenkis inequality:
$$\mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \big| L_n(g) - L(g) \big| \Big] \le c\, \sqrt{\frac{V}{n}}$$
where $V$ is the VC dimension of the class $\mathcal{G}$.

Rademacher average:
$$R_n(\mathcal{G}) = \frac{1}{n}\, \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \sum_{i=1}^{n} \epsilon_i\, \mathbb{I}\{g(X_i) \neq Y_i\} \Big]$$
where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. sign variables.
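For a finite class, the Rademacher average can be estimated by Monte Carlo. A sketch under the same illustrative assumptions as above (the stump class and simulated data are ours, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative stump class as before; labels here are pure noise.
n = 500
X = rng.uniform(0, 1, size=n)
Y = rng.choice([-1, 1], size=n)
thresholds = np.linspace(0, 1, 101)

# losses[t, i] = I{g_t(X_i) != Y_i}, one row per stump in the finite class.
losses = (np.where(X[None, :] > thresholds[:, None], 1, -1) != Y[None, :]).astype(float)

def rademacher_average(losses, n_mc=2000):
    """Monte Carlo estimate of R_n(G) = (1/n) E[sup_g sum_i eps_i I{g(X_i) != Y_i}]."""
    n = losses.shape[1]
    total = 0.0
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)  # i.i.d. sign variables
        total += np.max(losses @ eps)          # sup over the finite class
    return total / (n_mc * n)

print(f"Estimated R_n(G): {rademacher_average(losses):.4f}")
```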
SLIDE 7

Variance control

Second-order analysis: Talagrand's inequality. With probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} \big( P(f) - P_n(f) \big) \le 2\, \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \big( P(f) - P_n(f) \big) \Big] + \sqrt{\frac{2\, \big( \sup_{f \in \mathcal{F}} \mathrm{Var}(f) \big) \log(1/\delta)}{n}} + \frac{c \log(1/\delta)}{n}$$

Variance control assumption: for the excess-loss function $f_g$ attached to each rule $g$,
$$\mathrm{Var}(f_g) \le C\, \big( L(g) - L^* \big)^{\alpha}\,, \quad \forall g\,, \text{ with } \alpha \in (0, 1]\,.$$

Fast rates of convergence: excess risk in $n^{-1/(2-\alpha)}$.

SLIDE 8

Prediction of Preferences

Joint work with Stéphan Clémençon (Telecom ParisTech) and Gábor Lugosi (Pompeu Fabra)

SLIDE 9

Setup

- $(X, Y)$ random pair with unknown distribution $P$ over $\mathcal{X} \times \mathbb{R}$
- $(X, Y)$, $(X', Y')$ i.i.d., and $Y$, $Y'$ may not be observed
- Preference label $R = R(Y, Y') \in \mathbb{R}$, with $R(Y, Y') = -R(Y', Y)$
- $R > 0$ means "$X$ is better than $X'$"
- Decision rule: $r : \mathcal{X} \times \mathcal{X} \to \{-1, 0, 1\}$
- Prediction error = classification error with pairs of observations:
  $$L(r) = P\big\{ R \cdot r(X, X') < 0 \big\}$$

Same as before?
SLIDE 10

Empirical Ranking Risk Minimization

- Latent data $D_n = \{(X_i, Y_i) : i = 1, \dots, n\}$ i.i.d.
- Observed data: $\{(X_i, X_j, R_{i,j}) : i, j = 1, \dots, n\}$, with $R_{i,j} = R(Y_i, Y_j)$
- Empirical criterion for ranking:
  $$L_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} \mathbb{I}\big\{ R_{i,j} \cdot r(X_i, X_j) < 0 \big\}$$
- General definition of a U-statistic (fixed $f$):
  $$U_n(f) = \frac{1}{n(n-1)} \sum_{i \neq j} f(Z_i, Z_j)\,, \quad \text{where } Z_1, \dots, Z_n \text{ are i.i.d.}$$
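A minimal sketch of this empirical ranking risk, under illustrative assumptions: latent real-valued $Y$'s generate the preferences $R_{i,j} = Y_i - Y_j$, and a scoring function $s$ induces the pairwise rule $r(x, x') = \mathrm{sgn}(s(x) - s(x'))$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Latent data: Y_i real-valued; only pairwise preferences R_ij are observed.
n = 200
X = rng.normal(size=n)
Y = X + 0.5 * rng.normal(size=n)         # latent values (assumed unobserved)

def empirical_ranking_risk(score, X, Y):
    """L_n(r) = (1/(n(n-1))) sum_{i != j} I{R_ij * r(X_i, X_j) < 0},
    with r(x, x') = sgn(score(x) - score(x')) and R_ij = Y_i - Y_j."""
    s = score(X)
    n = len(X)
    R = Y[:, None] - Y[None, :]          # preference labels R_ij
    r = np.sign(s[:, None] - s[None, :]) # ranking rule on each pair
    bad = (R * r < 0)
    np.fill_diagonal(bad, False)         # sum over i != j only
    return bad.sum() / (n * (n - 1))

print(f"Ranking risk of s(x)=x : {empirical_ranking_risk(lambda x: x, X, Y):.3f}")
print(f"Ranking risk of s(x)=-x: {empirical_ranking_risk(lambda x: -x, X, Y):.3f}")
```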

SLIDE 11

Structure of U-Statistics - First representation

Assume $f$ symmetric. Average of "sums-of-i.i.d." blocks:
$$U_n(f) = \frac{1}{n!} \sum_{\pi} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} f\big( Z_{\pi(i)}, Z_{\pi(\lfloor n/2 \rfloor + i)} \big)$$
where $\pi$ ranges over the permutations of $\{1, \dots, n\}$.

Lemma

Let $\psi$ be convex and increasing and $\mathcal{F}$ a class of functions. Then, for any fixed permutation $\pi$:
$$\mathbb{E}\, \psi\Big( \sup_{f \in \mathcal{F}} U_n(f) \Big) \le \mathbb{E}\, \psi\Big( \sup_{f \in \mathcal{F}} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} f\big( Z_{\pi(i)}, Z_{\pi(\lfloor n/2 \rfloor + i)} \big) \Big)$$
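A quick numerical sanity check of this representation (illustrative: the kernel $f(a, b) = (a - b)^2$ and the Monte Carlo over random permutations are our assumptions): averaging the block estimates over permutations should recover the full U-statistic.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 30
Z = rng.normal(size=n)
f = lambda a, b: (a - b) ** 2  # a symmetric kernel (illustrative choice)

# Full U-statistic: average of f over all pairs i != j.
U_n = np.mean([f(Z[i], Z[j]) for i in range(n) for j in range(n) if i != j])

# First representation: average over permutations pi of the block estimate
# (1/m) * sum_{i<=m} f(Z_pi(i), Z_pi(m+i)), with m = floor(n/2).
m = n // 2
block_means = [
    np.mean(f(Z[pi[:m]], Z[pi[m:2 * m]]))
    for pi in (rng.permutation(n) for _ in range(20000))
]
print(f"U_n = {U_n:.4f}   permutation average = {np.mean(block_means):.4f}")
```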

SLIDE 12

Consequences of the first representation

- Back to classification with $\lfloor n/2 \rfloor$ i.i.d. pairs
- Enough for first-order analysis (including ERM and CRM)
- But it overestimates the variance: the noise assumption becomes too restrictive, and no fast rates are obtained in the general case!

SLIDE 13

Structure of U-Statistics - Second representation

Hoeffding's decomposition:
$$U_n(f) = \mathbb{E}(U_n(f)) + 2\, T_n(f) + W_n(f)$$
with:

- $T_n(f) = \frac{1}{n} \sum_{i=1}^{n} h(Z_i)$ (an empirical average of i.i.d. terms), where $h(z) = \mathbb{E} f(Z_1, z) - \mathbb{E}(U_n(f))$
- $W_n(f)$: a degenerate U-statistic (remainder term)

A degenerate U-statistic $W_n$ with kernel $\tilde h$ is such that: $\mathbb{E}\big( \tilde h(Z_1, Z_2) \mid Z_1 \big) = 0$ a.s.

Remark: here one needs to observe the individual labels $Y$, $Y'$!
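A numerical illustration of the decomposition, for a toy case where the projection $h$ is available in closed form (the kernel and distribution are our assumptions, not from the talk): with $Z \sim \mathcal{N}(0,1)$ and $f(a,b) = (a-b)^2$, one gets $\mathbb{E} U_n(f) = 2$ and $h(z) = z^2 - 1$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy case (our assumption): Z_i ~ N(0,1), f(a,b) = (a-b)^2.
# Then E U_n(f) = 2 and the projection is h(z) = E f(Z1, z) - E U_n = z^2 - 1.
n = 1000
Z = rng.normal(size=n)

F = (Z[:, None] - Z[None, :]) ** 2
np.fill_diagonal(F, 0.0)
U_n = F.sum() / (n * (n - 1))      # the U-statistic

T_n = np.mean(Z ** 2 - 1)          # leading term: empirical average of h(Z_i)
W_n = U_n - 2.0 - 2 * T_n          # degenerate remainder, O_P(1/n)

print(f"U_n - E U_n = {U_n - 2.0:+.5f}")
print(f"2 T_n       = {2 * T_n:+.5f}")
print(f"W_n         = {W_n:+.5f}  (typically an order of magnitude smaller)")
```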

SLIDE 14

Insights for rates-of-convergence results

Leading term $T_n$ is an empirical process:

- handled by Talagrand's concentration inequality
- involves "standard" complexity measures

⇒ Variance control involves the function $h$

Exponential inequality for degenerate U-processes:

- VC classes: exponential inequality by Arcones and Giné (AoP, 1993)
- general case: a new moment inequality

⇒ Additional complexity measures

SLIDE 15

Fast Rates - Notations

Kernel:
$$q_r\big((x, y), (x', y')\big) = \mathbb{I}\{(y - y') \cdot r(x, x') < 0\} - \mathbb{I}\{(y - y') \cdot r^*(x, x') < 0\}$$

U-process indexed by ranking rules $r \in \mathcal{R}$:
$$\Lambda_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} q_r\big((X_i, Y_i), (X_j, Y_j)\big)$$

Excess risk: $\Lambda(r) = L(r) - L^* = \mathbb{E}\big\{ q_r((X, Y), (X', Y')) \big\}$

Key quantity: $h_r(x, y) = \mathbb{E}\big\{ q_r((x, y), (X', Y')) \big\} - \Lambda(r)$

SLIDE 16

Result on Fast Rates - VC Case

Assume that:

- the class $\mathcal{R}$ of ranking rules has finite VC dimension $V$;
- for all $r \in \mathcal{R}$,
  $$\mathrm{Var}(h_r(X, Y)) \le c\, \big( L(r) - L^* \big)^{\alpha} \qquad \text{(V)}$$
  with some constants $c > 0$ and $\alpha \in [0, 1]$.

Then, with probability larger than $1 - \delta$:
$$L(r_n) - L^* \le 2\, \Big( \inf_{r \in \mathcal{R}} L(r) - L^* \Big) + C\, \left( \frac{V \log(n/\delta)}{n} \right)^{1/(2-\alpha)}$$

SLIDE 17

Comments

Question

What is a sufficient condition for Assumption (V):
$$\forall r \in \mathcal{R}, \quad \mathrm{Var}(h_r(X, Y)) \le c\, \big( L(r) - L^* \big)^{\alpha} \;?$$

Goal

Formulate noise assumptions on the regression function $\mathbb{E}\{ Y \mid X = x \}$.

SLIDE 18

Example 1 - Bipartite Ranking

Binary labels $Y, Y' \in \{-1, +1\}$
Posterior probability: $\eta(x) = P\{ Y = +1 \mid X = x \}$

Noise Assumption (NA)

There exist constants $c > 0$ and $\alpha \in [0, 1]$ such that:
$$\forall x \in \mathcal{X}, \quad \mathbb{E}\big( |\eta(x) - \eta(X)|^{-\alpha} \big) \le c\,.$$

Sufficient condition for (NA) with $\alpha < 1$: $\eta(X)$ absolutely continuous on $[0, 1]$ with bounded density.

SLIDE 19

Example 2 - Regression Data

$Y = m(X) + \sigma(X) \cdot N$, where $N \sim \mathcal{N}(0, 1)$ and $\mathbb{E}(N \mid X) = 0$

Key quantity:
$$\Delta(X, X') = \frac{m(X) - m(X')}{\sqrt{\sigma^2(X) + \sigma^2(X')}}$$

Noise Assumption (NA)

There exist constants $c > 0$ and $\alpha \in [0, 1]$ such that:
$$\forall x \in \mathcal{X}, \quad \mathbb{E}\big( |\Delta(x, X)|^{-\alpha} \big) \le c\,.$$

Sufficient condition for (NA) with $\alpha < 1$: $m(X)$ has a bounded density and $\sigma(X)$ is bounded over $\mathcal{X}$.

SLIDE 20

Remainder Term

Degenerate U-process

Consider a class $\mathcal{F}$ of degenerate kernels, and
$$\widetilde W_n = \sup_{f \in \mathcal{F}} \sum_{i,j} f(Z_i, Z_j)$$

SLIDE 21

Additional Complexity Measures

Let $\epsilon_1, \dots, \epsilon_n$ be i.i.d. Rademacher random variables.

Complexity measures:
$$Z_\epsilon = \sup_{f \in \mathcal{F}} \sum_{i,j} \epsilon_i \epsilon_j\, f(Z_i, Z_j) \qquad (1)$$
$$U_\epsilon = \sup_{f \in \mathcal{F}}\, \sup_{\alpha : \|\alpha\|_2 \le 1} \sum_{i,j} \epsilon_i \alpha_j\, f(Z_i, Z_j) \qquad (2)$$
$$M_\epsilon = \sup_{f \in \mathcal{F}}\, \max_{k = 1, \dots, n} \sum_{i=1}^{n} \epsilon_i\, f(Z_i, Z_k) \qquad (3)$$

SLIDE 22

Moment Inequality

Theorem

If $\widetilde W_n$ is a degenerate U-process, then there exists a universal constant $C > 0$ such that for all $n$ and all $q \ge 2$,
$$\big( \mathbb{E}\, \widetilde W_n^{\,q} \big)^{1/q} \le C\, \Big( \mathbb{E} Z_\epsilon + q^{1/2}\, \mathbb{E} U_\epsilon + q\, (\mathbb{E} M_\epsilon + n) + q^{3/2} n^{1/2} + q^2 \Big)$$

Main tools: symmetrization, decoupling and concentration inequalities.
Related work: Adamczak (AoP, 2006), Arcones and Giné (AoP, 1993), Giné, Latała and Zinn (HDP II, 2000), Houdré and Reynaud-Bouret (SIA, 2003), Major (PTRF, 2006).

SLIDE 23

Control of the Degenerate Part

Corollary

With probability $1 - \delta$,
$$\frac{\widetilde W_n}{n^2} \le C\, \left( \frac{\mathbb{E} Z_\epsilon}{n^2} + \frac{\mathbb{E} U_\epsilon\, \sqrt{\log(1/\delta)}}{n^2} + \frac{\mathbb{E} M_\epsilon\, \log(1/\delta)}{n^2} + \frac{\log(1/\delta)}{n} \right)$$

Special case: $\mathcal{F}$ is a VC class. Then
$$\mathbb{E} Z_\epsilon \le C n V\,, \qquad \mathbb{E} U_\epsilon \le C n \sqrt{V}\,, \qquad \mathbb{E} M_\epsilon \le C \sqrt{V n}\,,$$
hence, with probability $1 - \delta$,
$$\frac{\widetilde W_n}{n^2} \le \frac{C}{n}\, \big( V + \log(1/\delta) \big)\,.$$

SLIDE 24

Other Criteria for Ranking Error

AUC and beyond: focus on the top of the list. Joint work with Stéphan Clémençon (Telecom ParisTech).

SLIDE 25

Global performance measures: ROC Curve

For a given scoring rule $s : \mathcal{X} \to \mathbb{R}$ and threshold $t \in \mathbb{R}$:

- True positive rate: $\beta_s(t) = P\{ s(X) \ge t \mid Y = +1 \}$
- False positive rate: $\alpha_s(t) = P\{ s(X) \ge t \mid Y = -1 \}$
- ROC curve: $(s, t) \mapsto \big( \alpha_s(t), \beta_s(t) \big)$, plus continuous extension
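A minimal sketch of the empirical ROC curve and AUC for a scoring rule (the data-generating model $\eta(x) = x$ and all names are illustrative assumptions); the pairwise form of the AUC anticipates the connection made two slides below.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy bipartite data (our assumption): eta(x) = x, score s(x) = x.
n = 2000
X = rng.uniform(0, 1, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)
s = X

pos, neg = s[Y == 1], s[Y == -1]
grid = np.sort(s)
beta = np.array([np.mean(pos >= t) for t in grid])   # true positive rates
alpha = np.array([np.mean(neg >= t) for t in grid])  # false positive rates

# AUC two ways: area under the empirical ROC curve (trapezoidal rule),
# and the pairwise form P{s(X) > s(X') | Y=+1, Y'=-1}.
order = np.argsort(alpha)
auc_curve = np.sum(0.5 * (beta[order][1:] + beta[order][:-1]) * np.diff(alpha[order]))
auc_pairs = np.mean(pos[:, None] > neg[None, :])
print(f"AUC from curve: {auc_curve:.3f}   AUC from pairs: {auc_pairs:.3f}")
```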
SLIDE 26

Optimality, Metrics for ROC Curves

By the Neyman-Pearson lemma, optimal scoring rules are in
$$\mathcal{S}^* = \{ T \circ \eta : T \text{ strictly increasing} \}$$
Optimal ROC curve: $\alpha \in [0, 1] \mapsto \mathrm{ROC}^*(\alpha) = \beta_\eta \circ \alpha_\eta^{-1}(\alpha)$

$L_1$ metric on ROC curves:
$$d_1(s, \eta) = \int_0^1 \big( \mathrm{ROC}^*(\alpha) - \mathrm{ROC}_s(\alpha) \big)\, d\alpha = \mathrm{AUC}(\eta) - \mathrm{AUC}(s)$$

What about stronger metrics?
$$d_\infty(s, \eta) = \sup_{\alpha \in [0, 1]} \big( \mathrm{ROC}^*(\alpha) - \mathrm{ROC}_s(\alpha) \big)$$

SLIDE 27

Connection to the AUC criterion

Consider a real-valued scoring rule $s : \mathcal{X} \to \mathbb{R}$ and $(X, Y)$, $(X', Y')$ i.i.d. copies.
$$\mathrm{AUC}(s) = \int_0^1 \mathrm{ROC}_s(\alpha)\, d\alpha = P\{ s(X) \ge s(X') \mid Y > Y' \}$$
Ranking rule: $r(X, X') = 2\, \mathbb{I}\{ s(X) > s(X') \} - 1$
Ranking error and AUC: with $p = P\{ Y = +1 \}$,
$$\mathrm{AUC}(s) = 1 - \frac{1}{2 p (1 - p)}\, L(r)$$
Maximization of the AUC = minimization of the ranking error.
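A quick numerical check of this identity on simulated bipartite data (an illustration; ties are ignored and the two sides agree only up to $O(1/n)$ finite-sample effects).

```python
import numpy as np

rng = np.random.default_rng(6)

n = 1000
X = rng.uniform(0, 1, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)   # eta(x) = x (illustrative)
s = X

pos, neg = s[Y == 1], s[Y == -1]
auc = np.mean(pos[:, None] > neg[None, :])

# Empirical ranking error L_n(r) over all pairs i != j, with r = sgn(s - s').
R = Y[:, None] - Y[None, :]
r = np.sign(s[:, None] - s[None, :])
bad = (R * r < 0)
np.fill_diagonal(bad, False)
L_r = bad.sum() / (n * (n - 1))

p = np.mean(Y == 1)
print(f"AUC directly       : {auc:.3f}")
print(f"1 - L(r)/(2p(1-p)) : {1 - L_r / (2 * p * (1 - p)):.3f}")
```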

SLIDE 28

Beyond the AUC - Truncation of the ROC curve

- Focus on the "best" instances
- Question: where to cut off the ROC curve?
- Constraint: fix $u \in (0, 1)$ to be the rate of "best" X's
- Best instances according to scoring function $s$ at rate $u$:
  $$C_{s,u} = \{ x \in \mathcal{X} \mid s(x) > Q(s, u) \}$$
  where $Q(s, u)$ is the $(1 - u)$-quantile of $s(X)$
- Mass constraint property: $\mu(C_{s,u}) = P\{ X \in C_{s,u} \} = u$
- Invariance property: if $T$ is nondecreasing, then $C_{T \circ s, u} = C_{s,u}$

SLIDE 29

Reparameterization of the ROC curve

- True positive rate at level $u$: $\beta(s, u) = P\{ s(X) \ge Q(s, u) \mid Y = +1 \}$
- False positive rate at level $u$: $\alpha(s, u) = P\{ s(X) \ge Q(s, u) \mid Y = -1 \}$
- Control line at level $u$: $u = p\, \beta(s, u) + (1 - p)\, \alpha(s, u)$, with $p = P\{ Y = +1 \}$

SLIDE 30

Partial AUC

[Figure: ROC curve and partial AUC; axes: false positive rate α, true positive rate β.]

Definition (Partial AUC)

For a scoring function $s$ and a rate $u$ of best instances:
$$\mathrm{PartAUC}(s, u) = \int_0^{\alpha(s,u)} \beta(s, t)\, dt\,.$$

SLIDE 31

Partial AUC is not consistent!

[Figure: ROC curve and partial AUC; axes: false positive rate α, true positive rate β.]

For any scoring function $s$, we have for all $u \in (0, 1)$:
$$\beta(s, u) \le \beta(\eta, u)\,, \qquad \alpha(s, u) \ge \alpha(\eta, u)$$

SLIDE 32

Correction - Local AUC

Local AUC vs. Partial AUC

Set $u \in (0, 1)$. For any scoring function $s$:
$$\mathrm{LocAUC}(s, u) = \mathrm{PartAUC}(s, u) + \beta(s, u)\, \big( 1 - \alpha(s, u) \big)\,.$$

Double goal:

- Find the best instances: $C^*_u = \{ x \in \mathcal{X} \mid \eta(x) > Q(\eta, u) \}$
- Rank them with a scoring function
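A sketch computing $Q(s,u)$, $\alpha(s,u)$, $\beta(s,u)$, PartAUC and LocAUC empirically on simulated data (all choices below, including $\eta(x) = x$ and $u = 0.2$, are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(7)

n = 5000
X = rng.uniform(0, 1, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)   # eta(x) = x (illustrative)
s, u = X, 0.2                                  # focus on the top 20% of instances

q = np.quantile(s, 1 - u)                      # Q(s, u): (1-u)-quantile of s(X)
pos, neg = s[Y == 1], s[Y == -1]
beta_u = np.mean(pos >= q)                     # beta(s, u)
alpha_u = np.mean(neg >= q)                    # alpha(s, u)

# PartAUC(s, u): area under the empirical ROC over [0, alpha(s, u)],
# obtained from thresholds t >= Q(s, u) (trapezoidal rule).
grid = np.linspace(q, s.max(), 200)
alpha = np.array([np.mean(neg >= t) for t in grid])
beta = np.array([np.mean(pos >= t) for t in grid])
order = np.argsort(alpha)
part_auc = np.sum(0.5 * (beta[order][1:] + beta[order][:-1]) * np.diff(alpha[order]))

loc_auc = part_auc + beta_u * (1 - alpha_u)
print(f"Q(s,u)={q:.3f}  alpha={alpha_u:.3f}  beta={beta_u:.3f}  LocAUC={loc_auc:.3f}")
```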

SLIDE 33

A Subproblem - ERM for Finding the Best Instances

Take sets of the form $C_{s,u} = \{ x \in \mathcal{X} \mid s(x) > Q(s, u) \}$, where $s$ is a positive real-valued scoring function.

Empirical risk:
$$L_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\big\{ Y_i \cdot \big( s(X_i) - Q(s, u) \big) < 0 \big\}\,.$$

Conditions for consistency and (fast) rates:

- behavior of $\eta$ around $Q(\eta, u)$
- class of scoring functions neither too flat nor too steep

Result: fastest rate in $n^{-2/3}$.

SLIDE 34

Typical Scoring Functions

SLIDE 35

Signed rank statistics

Take $Z_1, \dots, Z_n$ i.i.d., a score-generating function $\Phi : [0, 1] \to [0, 1]$, and let $R_i^+ = \mathrm{rank}(|Z_i|)$.

Definition

The statistic
$$\sum_{i=1}^{n} \Phi\left( \frac{R_i^+}{n + 1} \right) \mathrm{sgn}(Z_i)$$
is a linear signed rank statistic.
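A minimal sketch of a linear signed rank statistic ($\Phi(x) = x$ gives a Wilcoxon-type statistic; the shifted-normal sample is an illustrative assumption).

```python
import numpy as np

rng = np.random.default_rng(8)

def signed_rank_statistic(Z, phi=lambda x: x):
    """Linear signed rank statistic: sum_i phi(R_i^+ / (n+1)) * sgn(Z_i),
    where R_i^+ is the rank of |Z_i|. phi(x) = x gives a Wilcoxon-type statistic."""
    n = len(Z)
    ranks = np.argsort(np.argsort(np.abs(Z))) + 1   # R_i^+ in {1, ..., n}
    return np.sum(phi(ranks / (n + 1)) * np.sign(Z))

Z = rng.normal(loc=0.3, size=100)   # a shifted sample (illustrative)
print(f"Signed rank statistic: {signed_rank_statistic(Z):.2f}")
```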

SLIDE 36

Structure of the empirical risk

Notations:
$$K(s, u) = \mathbb{E}\big( Y\, \mathbb{I}\{ s(X) \le Q(s, u) \} \big)\,, \qquad \hat K_n(s, u) = \frac{1}{n} \sum_{i=1}^{n} Y_i\, \mathbb{I}\big\{ s(X_i) \le \hat Q(s, u) \big\}$$

We have:
$$L(s) = 1 - p + K(s, u)\,, \qquad \hat L_n(s) = \frac{n_-}{n} + \hat K_n(s, u)\,, \quad \text{where } n_- = \sum_{i=1}^{n} \mathbb{I}\{ Y_i = -1 \}\,.$$

Observe

Set $Z_i = Y_i\, s(X_i)$. For fixed $s$ and $u$, the statistic $\hat K_n(s, u)$ is a linear signed rank statistic.

SLIDE 37

Hoeffding-type decomposition

Notations:
$$Z_n(s, u) = \frac{1}{n} \sum_{i=1}^{n} \big( Y_i - K'(s, u) \big)\, \mathbb{I}\{ s(X_i) \le Q(s, u) \} - K(s, u) + u\, K'(s, u)\,,$$
where $K'(s, u) = \partial_u K(s, u)$.

Proposition

We have, for all $s$ and $u \in [0, 1]$:
$$\hat K_n(s, u) = K(s, u) + Z_n(s, u) + \Lambda_n(s)\,, \qquad \text{with } \Lambda_n(s) = O_P(n^{-1}) \text{ as } n \to \infty\,.$$

SLIDE 38

General ROC summaries

Score-generating function $\Phi : [0, 1] \to [0, 1]$, increasing.

Empirical performance functional:
$$W_n(s) = \sum_{i=1}^{n} \mathbb{I}\{ Y_i = +1 \}\, \Phi\left( \frac{\mathrm{rank}(s(X_i))}{n + 1} \right)$$

Choices of $\Phi$:

- $\Phi(x) = x$ ⇒ AUC
- $\Phi(x) = x\, \mathbb{I}\{ x \ge 1 - u \}$ ⇒ Local AUC
- $\Phi(x) = c(x)\, \mathbb{I}\{ x \ge k/(n + 1) \}$ ⇒ DCG
- smooth $\Phi$'s

Most ranking criteria are conditional linear rank statistics.
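A sketch of the empirical functional $W_n(s)$ for two choices of $\Phi$, on the same kind of simulated bipartite data as before (illustrative assumptions throughout).

```python
import numpy as np

rng = np.random.default_rng(9)

def W_n(s, Y, phi):
    """Empirical functional W_n(s) = sum_{i: Y_i = +1} phi(rank(s(X_i)) / (n+1))."""
    n = len(s)
    ranks = np.argsort(np.argsort(s)) + 1   # ranks of the scores, in {1, ..., n}
    return np.sum(phi(ranks[Y == 1] / (n + 1)))

n = 1000
X = rng.uniform(0, 1, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)   # eta(x) = x (illustrative)
s, u = X, 0.2

print("AUC-type summary, Phi(x) = x          :", round(W_n(s, Y, lambda x: x), 1))
print("Local-AUC-type, Phi(x) = x I{x >= 1-u}:", round(W_n(s, Y, lambda x: x * (x >= 1 - u)), 1))
```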

SLIDE 39

This talk:

- Statistical theory for learning summaries of the optimal ROC curve
- Analysis of higher-order statistics
- Orthogonal decompositions and control of the remainder term
- Generic form for risk criteria in ranking

Not in this talk:

- Approximation and estimation schemes for the optimal ROC curve
- Design of scoring/ranking algorithms based on decision trees
- Aggregation of ranking trees (involves rank aggregation techniques)
- Application to multivariate homogeneity tests
- R implementation of TreeRank available!

SLIDE 40

Today, Yves Meyer received the Carl Friedrich Gauss Prize in Hyderabad.

SLIDE 41

References

- S. Clémençon, M. Depecker, and N. Vayatis (2010). Adaptive partitioning schemes for bipartite ranking. Machine Learning Journal. To appear.
- S. Clémençon and N. Vayatis (2010). Overlaying classifiers: a practical approach for optimal scoring. Constructive Approximation. To appear.
- S. Clémençon, M. Depecker, and N. Vayatis (2009). AUC maximization and the two-sample problem. Proceedings of NIPS'09, Advances in Neural Information Processing Systems 22, pp. 360-368, MIT Press.
- S. Clémençon and N. Vayatis (2009). Tree-based ranking methods. IEEE Transactions on Information Theory.
- S. Clémençon and N. Vayatis (2008). Empirical performance maximization for linear rank statistics. Proceedings of NIPS'08, MIT Press.
- S. Clémençon, G. Lugosi, and N. Vayatis (2008). Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36(2):844-874.
- S. Clémençon and N. Vayatis (2007). Ranking the best instances. Journal of Machine Learning Research, 8(Dec):2671-2699.