Ranking observations with latent information and binary feedback
Nicolas Vayatis, École Normale Supérieure de Cachan
Workshop on The Mathematics of Ranking at AIM - Palo Alto, August 2010

Outline:
1 Statistical Issues in Machine Learning
2 Prediction of Preferences
3 Other Criteria for Ranking Error
Statistical issues in Machine Learning
Generalization ability of decision rules
Class G of candidate decision rules
Risk functional L, the "objective" criterion
Past data Dn with sample size n
Method/Algorithm outputs an empirical estimate gn ∈ G
Main questions:
◮ Strong Bayes-risk consistency: L(gn) → L* = inf_g L(g) almost surely, as n → ∞ ?
◮ Rate of this convergence?
An example - Binary classification with i.i.d. data
Data Dn = {(Xi, Yi) : i = 1, . . . , n}, i.i.d. copies of (X, Y) ∈ X × {−1, +1}
Empirical Risk Minimization principle:
    gn = argmin_{g ∈ G} Ln(g) := (1/n) Σ_{i=1}^n I{g(Xi) ≠ Yi}
First-order analysis: with probability at least 1 − δ,
    L(gn) − inf_{g ∈ G} L(g) ≤ 2 E[ sup_{g ∈ G} |Ln(g) − L(g)| ] + c √(log(1/δ)/n)
Tools: empirical processes techniques, concentration inequalities
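As an illustration of the ERM principle above, here is a minimal numerical sketch on a toy finite class of one-dimensional threshold rules; the data and the class are invented for the example and do not come from the talk.

```python
import numpy as np

# Toy sample: negatives below 0.5, positives above (illustrative data)
X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
Y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
n = len(X)

# Finite class G of threshold rules g_t(x) = sign(x - t)
thresholds = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def empirical_risk(t):
    """Ln(g_t) = (1/n) sum_i I{g_t(Xi) != Yi}."""
    preds = np.where(X > t, 1, -1)
    return np.mean(preds != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]   # the empirical risk minimizer gn
best_risk = risks.min()                # here 0: the sample is separable at t = 0.5
```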
Complexity Control
Vapnik-Chervonenkis inequality:
    E[ sup_{g ∈ G} |Ln(g) − L(g)| ] ≤ c √(V/n)
where V is the VC dimension of the class G.
Rademacher average:
    Rn(G) = (1/n) E[ sup_{g ∈ G} Σ_{i=1}^n εi I{Yi ≠ g(Xi)} ]
where ε1, . . . , εn are i.i.d. sign variables.
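For a fixed sample, the Rademacher average can be approximated by Monte Carlo over the sign variables; a sketch for a hypothetical finite class of threshold rules (class and data are illustrative assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed sample and a finite class of threshold rules (illustrative)
X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
Y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
thresholds = np.linspace(0.0, 1.0, 21)
n = len(X)

# Loss indicators I{Yi != g_t(Xi)} for every rule: shape (num_rules, n)
losses = np.stack([(np.where(X > t, 1, -1) != Y).astype(float)
                   for t in thresholds])

# Monte Carlo estimate of Rn(G) = (1/n) E sup_g sum_i eps_i * I{Yi != g(Xi)}
n_mc = 2000
total = 0.0
for _ in range(n_mc):
    eps = rng.choice([-1.0, 1.0], size=n)   # i.i.d. sign variables
    total += np.max(losses @ eps) / n       # sup over the finite class
rademacher = total / n_mc
```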
Variance control
Second-order analysis: Talagrand's inequality
    sup_{f ∈ F} (P(f) − Pn(f)) ≤ 2 E[ sup_{f ∈ F} (P(f) − Pn(f)) ] + . . .
    . . . + √( 2 (sup_{f ∈ F} Var(f)) log(1/δ) / n ) + c log(1/δ)/n
Variance control assumption:
    Var(f_g) ≤ C (L(g) − L*)^α , ∀g, with α ∈ (0, 1].
Fast rates of convergence: excess risk in n^{−1/(2−α)}
Prediction of Preferences
Joint work with Stéphan Clémençon (Telecom ParisTech) and Gábor Lugosi (Pompeu Fabra)
Setup
(X, Y) random pair with unknown distribution P over X × R
(X, Y), (X′, Y′) i.i.d., and Y, Y′ may not be observed
Preference label R = R(Y, Y′) ∈ R, with R(Y, Y′) = −R(Y′, Y)
R > 0 means "X is better than X′"
Decision rule: r : X × X → {−1, 0, 1}
Prediction error = classification error with pairs of observations:
    L(r) = P{ R · r(X, X′) < 0 }
Same as before?
Empirical Ranking Risk Minimization
Latent data Dn = {(Xi, Yi) : i = 1, . . . , n} i.i.d.
Observed data: {(Xi, Xj, Ri,j) : i, j = 1, . . . , n}, with Ri,j = R(Yi, Yj)
Empirical criterion for ranking:
    Ln(r) = 1/(n(n−1)) Σ_{i≠j} I{Ri,j · r(Xi, Xj) < 0}
General definition of a U-statistic (fixed f):
    Un(f) = 1/(n(n−1)) Σ_{i≠j} f(Zi, Zj) , where Z1, . . . , Zn i.i.d.
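A direct, if quadratic-time, computation of the empirical ranking risk as a U-statistic. The toy example below assumes a real-valued latent Y with R(Y, Y′) = sign(Y − Y′), one admissible choice among many, and for simplicity evaluates the rule directly on the latent values:

```python
import numpy as np

def ranking_risk(items, R, r):
    """Ln(r) = 1/(n(n-1)) sum_{i != j} I{R[i, j] * r(items[i], items[j]) < 0},
    where R[i, j] = R(Yi, Yj) is the observed preference label."""
    n = len(items)
    errors = sum(1 for i in range(n) for j in range(n)
                 if i != j and R[i, j] * r(items[i], items[j]) < 0)
    return errors / (n * (n - 1))

# Toy latent values and preferences R(Yi, Yj) = sign(Yi - Yj) (illustrative)
Y = np.array([0.2, 0.9, 0.5, 0.1])
R = np.sign(Y[:, None] - Y[None, :])

perfect = lambda x, xp: np.sign(x - xp)        # orders pairs exactly like Y
reversed_rule = lambda x, xp: np.sign(xp - x)  # gets every strict pair wrong

risk_perfect = ranking_risk(Y, R, perfect)      # 0.0
risk_reversed = ranking_risk(Y, R, reversed_rule)  # 1.0
```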
Structure of U-Statistics - First representation
Assume f symmetric. Average of 'sums-of-i.i.d.' blocks:
    Un(f) = (1/n!) Σ_π (1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} f( Zπ(i), Zπ(⌊n/2⌋+i) )
where π ranges over the permutations of {1, . . . , n}.
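For a small sample, this representation can be checked by brute force over all n! permutations; a sketch with an arbitrary symmetric kernel:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n = 4
Z = rng.normal(size=n)
f = lambda a, b: a * b   # any symmetric kernel will do

# Direct U-statistic: 1/(n(n-1)) sum_{i != j} f(Zi, Zj)
u_direct = sum(f(Z[i], Z[j]) for i in range(n)
               for j in range(n) if i != j) / (n * (n - 1))

# Average over all n! permutations of 'sums-of-i.i.d.' blocks of size m = n//2
m = n // 2
block_avg = 0.0
for pi in itertools.permutations(range(n)):
    block_avg += sum(f(Z[pi[i]], Z[pi[m + i]]) for i in range(m)) / m
block_avg /= math.factorial(n)

# The two quantities coincide (up to floating-point error)
```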
Lemma
Let ψ be convex and increasing, and let F be a class of functions. Then:
    E ψ( sup_{f ∈ F} Un(f) ) ≤ E ψ( sup_{f ∈ F} (1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} f( Zi, Z⌊n/2⌋+i ) )
Consequences of the first representation
Back to classification with ⌊n/2⌋ i.i.d. pairs
◮ Enough for first-order analysis (including ERM and CRM)
◮ Overestimates the variance
◮ Noise assumption too restrictive!
◮ No fast rates in the general case!
Structure of U-Statistics - Second representation
Hoeffding's decomposition:
    Un(f) = E(Un(f)) + 2 Tn(f) + Wn(f)
with
◮ Tn(f) = (1/n) Σ_{i=1}^n h(Zi)  (empirical average of i.i.d. terms), where h(z) = E f(Z1, z) − E(Un(f))
◮ Wn(f) = degenerate U-statistic (remainder term)
A degenerate U-statistic Wn with kernel h̃ is such that: E( h̃(Z1, Z2) | Z1 ) = 0 a.s.
Remark: Need here to observe individual labels Y, Y′!
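A Monte Carlo sketch of this decomposition for a kernel with known moments (f(z, z′) = zz′ with Z uniform on [0, 1], so E Un(f) = 1/4 and h(z) = z/2 − 1/4; all choices here are illustrative): the projection term 2Tn dominates, while the degenerate remainder Wn has variance of order 1/n².

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
theta = 0.25                        # E Un(f) = (E Z)^2 for f(z, z') = z z'

T_vals, W_vals = [], []
for _ in range(reps):
    Z = rng.uniform(size=n)
    s = Z.sum()
    # Un(f) = 1/(n(n-1)) sum_{i != j} Zi Zj = (s^2 - sum Zi^2) / (n(n-1))
    Un = (s * s - np.sum(Z * Z)) / (n * (n - 1))
    Tn = np.mean(Z / 2 - 0.25)      # h(z) = E f(Z1, z) - theta = z/2 - 1/4
    W_vals.append(Un - theta - 2 * Tn)   # degenerate remainder Wn
    T_vals.append(2 * Tn)

var_T = np.var(T_vals)   # about Var(Z)/n = 1/(12 n)
var_W = np.var(W_vals)   # order 1/n^2: asymptotically negligible
```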
Insights for rates-of-convergence results
Leading term Tn is an empirical process
◮ handled by Talagrand's concentration inequality
◮ involves "standard" complexity measures
⇒ variance control involves the function h
Exponential inequality for degenerate U-processes
◮ VC classes - exponential inequality by Arcones and Giné (AoP, 1993)
◮ general case - a new moment inequality
⇒ additional complexity measures
Fast Rates - Notations
Kernel:
    qr((x, y), (x′, y′)) = I{(y − y′) · r(x, x′) < 0} − I{(y − y′) · r*(x, x′) < 0}
U-process indexed by ranking rule r ∈ R:
    Λn(r) = 1/(n(n−1)) Σ_{i≠j} qr((Xi, Yi), (Xj, Yj))
Excess risk: Λ(r) = L(r) − L* = E{qr((X, Y), (X′, Y′))}
Key quantity: hr(x, y) = E{qr((x, y), (X′, Y′))} − Λ(r)
Result on Fast Rates - VC Case
Assume that:
◮ the class R of ranking rules has finite VC dimension V
◮ for all r ∈ R, Var(hr(X, Y)) ≤ c (L(r) − L*)^α   (V)
with some constants c > 0 and α ∈ [0, 1].
Then, with probability larger than 1 − δ:
    L(rn) − L* ≤ 2 ( inf_{r ∈ R} L(r) − L* ) + C ( V log(n/δ) / n )^{1/(2−α)}
Comments
Question
Sufficient condition for Assumption (V): ∀r ∈ R, Var(hr(X, Y)) ≤ c (L(r) − L*)^α ?
Goal
Formulate noise assumptions on the regression function E{Y | X = x}
Example 1 - Bipartite Ranking
Binary labels Y, Y′ ∈ {−1, +1}
Posterior probability: η(x) = P{Y = +1 | X = x}
Noise Assumption (NA)
There exist constants c > 0 and α ∈ [0, 1] such that:
    ∀x ∈ X, E( |η(x) − η(X)|^{−α} ) ≤ c .
Sufficient condition for (NA) with α < 1:
η(X) absolutely continuous on [0, 1] with bounded density
Example 2 - Regression Data
Y = m(X) + σ(X) · N, where N ∼ N(0, 1), E(N | X) = 0
Key quantity:
    ∆(X, X′) = ( m(X) − m(X′) ) / √( σ²(X) + σ²(X′) )
Noise Assumption (NA)
There exist constants c > 0 and α ∈ [0, 1] such that:
    ∀x ∈ X, E( |∆(x, X)|^{−α} ) ≤ c .
Sufficient condition for (NA) with α < 1:
m(X) has a bounded density and σ(X) is bounded over X.
Remainder Term
Degenerate U-process
Consider F a class of degenerate kernels, and
    W̃n = sup_{f ∈ F} Σ_{i≠j} f(Zi, Zj)
Additional Complexity Measures
ε1, . . . , εn i.i.d. Rademacher random variables
Complexity measures:
(1) Zε = sup_{f ∈ F} Σ_{i,j} εi εj f(Zi, Zj)
(2) Uε = sup_{f ∈ F} sup_{α : ‖α‖2 ≤ 1} Σ_{i,j} εi αj f(Zi, Zj)
(3) Mε = sup_{f ∈ F} max_{k=1,...,n} Σ_{i=1}^n εi f(Zi, Zk)
Moment Inequality
Theorem
If W̃n is a degenerate U-process, then there exists a universal constant C > 0 such that for all n and q ≥ 2,
    ( E W̃n^q )^{1/q} ≤ C ( E Zε + q^{1/2} E Uε + q (E Mε + n) + q^{3/2} n^{1/2} + q² )
Main tools: symmetrization, decoupling and concentration inequalities
Related work: Adamczak (AoP, 2006), Arcones and Giné (AoP, 1993), Giné, Latała and Zinn (HDP II, 2000), Houdré and Reynaud-Bouret (SIA, 2003), Major (PTRF, 2006)
Control of the Degenerate Part
Corollary
With probability 1 − δ,
    W̃n ≤ C ( E Zε / n² + E Uε √(log(1/δ)) / n² + E Mε log(1/δ) / n² + log(1/δ) / n )
Special case - F is a VC class:
    E Zε ≤ C n V ,  E Uε ≤ C n √V ,  E Mε ≤ C √(V n)
Hence, with probability 1 − δ,
    W̃n ≤ (C/n) ( V + log(1/δ) )
Other Criteria for Ranking Error
AUC and beyond - Focus on the top of the list. Joint work with Stéphan Clémençon (Telecom ParisTech)
Global performance measures: ROC Curve
For a given scoring rule s : X → R and threshold t ∈ R:
True positive rate: βs(t) = P{s(X) ≥ t | Y = +1}
False positive rate: αs(t) = P{s(X) ≥ t | Y = −1}
ROC curve: (s, t) → ( αs(t), βs(t) ), plus continuous extension
Optimality, Metrics for ROC Curves
By the Neyman-Pearson lemma, optimal scoring rules are in
    S* = {T ◦ η : T strictly increasing}
Optimal ROC curve: α ∈ [0, 1] → ROC*(α) = βη ◦ αη^{−1}(α)
L1 metric on ROC curves:
    d1(s, η) = ∫₀¹ (ROC*(α) − ROCs(α)) dα = AUC(η) − AUC(s)
What about stronger metrics?
    d∞(s, η) = sup_{α ∈ [0,1]} (ROC*(α) − ROCs(α))
Connection to the AUC criterion
Consider a real-valued scoring rule s : X → R and (X, Y), (X′, Y′) i.i.d. copies:
    AUC(s) = ∫₀¹ ROCs(α) dα = P{s(X) ≥ s(X′) | Y > Y′}
Ranking rule: r(X, X′) = 2 I{s(X) > s(X′)} − 1
Ranking error and AUC: with p = P{Y = +1},
    AUC(s) = 1 − L(r) / (2p(1 − p))
Maximization of the AUC = minimization of the ranking error
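The pairwise-probability form of the AUC can be checked numerically against the Mann-Whitney rank formula (assuming distinct scores); both helper functions below are illustrative:

```python
import numpy as np

def auc_pairwise(scores, labels):
    """Empirical AUC(s) = P{s(X) > s(X') | Y = +1, Y' = -1} (distinct scores)."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    return np.mean(pos[:, None] > neg[None, :])

def auc_ranks(scores, labels):
    """Mann-Whitney identity: AUC = (R+ - n+(n+ + 1)/2) / (n+ n-),
    with R+ the sum of the ranks of the positives among all scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(np.sum(labels == 1))
    n_neg = len(scores) - n_pos
    r_plus = ranks[labels == 1].sum()
    return (r_plus - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
labels = rng.choice([-1, 1], size=50)
scores = rng.normal(size=50) + labels   # informative scores, a.s. distinct

a1 = auc_pairwise(scores, labels)
a2 = auc_ranks(scores, labels)
```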
Beyond the AUC - Truncation of the ROC curve
Focus on the "best" instances
Question: cut-off point on the ROC curve?
Constraint: fix u ∈ (0, 1) to be the rate of "best" X's
Best instances according to scoring function s at rate u:
    Cs,u = {x ∈ X | s(x) > Q(s, u)}
where Q(s, u) is the (1 − u)-quantile of s(X)
◮ Mass constraint property: µ(Cs,u) = P{X ∈ Cs,u} = u
◮ Invariance property: if T nondecreasing, then CT◦s,u = Cs,u
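The invariance property is easy to verify empirically: transforming the scores by a strictly increasing T (here exp, an arbitrary choice) leaves the set of best instances unchanged. The helper below is an illustrative sketch using the empirical quantile.

```python
import numpy as np

def best_instances(scores, u):
    """C_{s,u}: instances whose score exceeds the empirical (1-u)-quantile."""
    q = np.quantile(scores, 1 - u)
    return scores > q

rng = np.random.default_rng(0)
s = rng.normal(size=200)
u = 0.1

C1 = best_instances(s, u)
# Invariance: composing s with a strictly increasing T keeps the set intact
C2 = best_instances(np.exp(s), u)   # T = exp is strictly increasing
```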
Reparameterization of the ROC curve
True positive rate at level u: β(s, u) = P{s(X) ≥ Q(s, u) | Y = +1}
False positive rate at level u: α(s, u) = P{s(X) ≥ Q(s, u) | Y = −1}
Control line at level u: u = p β(s, u) + (1 − p) α(s, u), with p = P{Y = +1}
Partial AUC
[Figure: ROC curve and partial AUC (true positive rate β vs. false positive rate α)]
Definition (Partial AUC)
For a scoring function s and a rate u of best instances:
    PartAUC(s, u) = ∫₀^{α(s,u)} ROCs(t) dt .
Partial AUC is not consistent!
[Figure: ROC curve and partial AUC (true positive rate β vs. false positive rate α)]
For any scoring function s, we have for all u ∈ (0, 1):
    β(s, u) ≤ β(η, u)  and  α(s, u) ≥ α(η, u)
Correction - Local AUC
Local AUC vs. Partial AUC
Set u ∈ (0, 1). For any scoring function s:
    LocAUC(s, u) = PartAUC(s, u) + β(s, u) (1 − α(s, u)) .
Double goal:
◮ Find the best instances: C*_u = {x ∈ X | η(x) > Q(η, u)}
◮ Rank them with a scoring function
A Subproblem - ERM for Finding the Best Instances
Take sets of the form Cs,u = {x ∈ X | s(x) > Q(s, u)}, where s is a positive real-valued scoring function
Empirical risk:
    Ln(s) = (1/n) Σ_{i=1}^n I{Yi · (s(Xi) − Q(s, u)) < 0} .
Conditions for consistency and (fast) rates:
◮ behavior of η around Q(η, u)
◮ class of scoring functions neither too flat nor too steep
Result: fastest rate in n^{−2/3}
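The empirical risk above is straightforward to evaluate once the empirical quantile is plugged in; a sketch on invented, perfectly separated data (risk 0) and on the reversed scores (risk 1/2):

```python
import numpy as np

def best_instance_risk(scores, labels, u):
    """Ln(s) = (1/n) sum_i I{ Yi * (s(Xi) - Q(s, u)) < 0 },
    with Q(s, u) the empirical (1 - u)-quantile of the scores."""
    q = np.quantile(scores, 1 - u)
    return np.mean(labels * (scores - q) < 0)

# Invented data: the top 25% of scores are exactly the positives
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9])
labels = np.array([-1, -1, -1, -1, -1, -1, 1, 1])

risk = best_instance_risk(scores, labels, u=0.25)       # perfect s: risk 0
risk_bad = best_instance_risk(-scores, labels, u=0.25)  # reversed s: risk 1/2
```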
Typical Scoring Functions
Signed rank statistics
Take Z1, . . . , Zn i.i.d., Φ : [0, 1] → [0, 1] a score-generating function, and R+_i = rank(|Zi|)
Definition
The statistic
    Σ_{i=1}^n Φ( R+_i / (n + 1) ) sgn(Zi)
is a linear signed rank statistic.
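A direct implementation of a linear signed rank statistic; the tiny sample and the choice Φ(x) = x are illustrative.

```python
import numpy as np

def signed_rank_statistic(Z, phi):
    """Linear signed rank statistic: sum_i phi(R+_i / (n+1)) * sgn(Z_i),
    where R+_i is the rank of |Z_i| (ties are assumed absent)."""
    n = len(Z)
    order = np.argsort(np.abs(Z))
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    return np.sum(phi(ranks / (n + 1)) * np.sign(Z))

# Example with phi(x) = x on a tiny sample:
# |Z| ranks are [2, 3, 4, 1], so the statistic is 0.4 - 0.6 + 0.8 - 0.2 = 0.4
Z = np.array([0.5, -1.0, 2.0, -0.2])
stat = signed_rank_statistic(Z, lambda x: x)
```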
Structure of the empirical risk
Notations:
    K(s, u) = E( Y I{s(X) ≤ Q(s, u)} )
    K̂n(s, u) = (1/n) Σ_{i=1}^n Yi I{s(Xi) ≤ Q̂(s, u)}
We have:
    L(s) = 1 − p + K(s, u)
    L̂n(s) = n−/n + K̂n(s, u) , where n− = Σ_{i=1}^n I{Yi = −1}
Observe
Set Zi = Yi s(Xi). For fixed s and u, the statistic K̂n(s, u) is a linear signed rank statistic.
Hoeffding’s-type decomposition
Notations:
    Zn(s, u) = (1/n) Σ_{i=1}^n ( Yi − K′(s, u) ) I{s(Xi) ≤ Q(s, u)} − K(s, u) + u K′(s, u) ,
where K′(s, u) = ∂K(s, u)/∂u.
Proposition
We have, for all s and u ∈ [0, 1]:
    K̂n(s, u) = K(s, u) + Zn(s, u) + Λn(s) , with Λn(s) = OP(n^{−1}) as n → ∞ .
General ROC summaries
Score-generating function Φ : [0, 1] → [0, 1], increasing
Empirical performance functional:
    Wn(s) = Σ_{i=1}^n I{Yi = +1} Φ( rank(s(Xi)) / (n + 1) )
Choices of Φ:
◮ Φ(x) = x ⇒ AUC
◮ Φ(x) = x I{x ≥ 1 − u} ⇒ Local AUC
◮ Φ(x) = c(x) I{x ≥ k/(n + 1)} ⇒ DCG
◮ smooth Φ's
Most ranking criteria are conditional linear rank statistics
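For Φ(x) = x, the functional Wn is an affine function of the empirical AUC (via the Mann-Whitney identity, assuming distinct scores); a numerical check with invented data:

```python
import numpy as np

def W_n(scores, labels, phi):
    """Wn(s) = sum_i I{Yi = +1} * phi( rank(s(Xi)) / (n + 1) )."""
    n = len(scores)
    order = np.argsort(scores)
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    return np.sum(phi(ranks / (n + 1)) * (labels == 1))

def auc_pairwise(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    return np.mean(pos[:, None] > neg[None, :])

rng = np.random.default_rng(0)
labels = rng.choice([-1, 1], size=40)
scores = rng.normal(size=40) + labels   # distinct scores almost surely

n_pos = int(np.sum(labels == 1))
n_neg = 40 - n_pos

# Mann-Whitney identity: sum of positive ranks = n+ n- AUC + n+(n+ + 1)/2
w = W_n(scores, labels, lambda x: x)
w_from_auc = (n_pos * n_neg * auc_pairwise(scores, labels)
              + n_pos * (n_pos + 1) / 2) / (40 + 1)
```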
This talk:
◮ Statistical theory for learning summaries of the optimal ROC curve
◮ Analysis of higher-order statistics
◮ Orthogonal decompositions and control of the remainder term
◮ Generic form for risk criteria in ranking
Not in this talk:
◮ Approximation and estimation schemes for the optimal ROC curve
◮ Design of scoring/ranking algorithms based on decision trees
◮ Aggregation of ranking trees involves rank aggregation techniques
◮ Application to multivariate homogeneity tests
◮ R implementation of TreeRank available!
Today, Yves Meyer received the Carl Friedrich Gauss Prize in Hyderabad.
References
- S. Clémençon, M. Depecker, and N. Vayatis (2010). Adaptive partitioning schemes for bipartite ranking. Machine Learning Journal. To appear.
- S. Clémençon and N. Vayatis (2010). Overlaying classifiers: a practical approach for optimal scoring. Constructive Approximation. To appear.
- S. Clémençon, M. Depecker, and N. Vayatis (2009). AUC maximization and the two-sample problem. Proceedings of NIPS'09, Advances in Neural Information Processing Systems 22, pp. 360-368, MIT Press.
- S. Clémençon and N. Vayatis (2009). Tree-based ranking methods. IEEE Transactions on Information Theory.
- S. Clémençon and N. Vayatis (2008). Empirical performance maximization for linear rank statistics. Proceedings of NIPS'08, MIT Press.
- S. Clémençon, G. Lugosi, and N. Vayatis (2008). Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, vol. 36(2):844-874.
- S. Clémençon and N. Vayatis (2007). Ranking the best instances. Journal of Machine Learning Research, 8(Dec):2671-2699.