

SLIDE 1

Online isotonic regression

Wojciech Kotłowski, DA2PL 2018, Poznań University of Technology

1 / 59

SLIDE 2

Outline

1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions

2 / 59


SLIDE 4

Motivation I – house pricing

Assess the selling price of a house based on its attributes.

4 / 59

SLIDE 5

Motivation I – house pricing

Den Bosch data set

[Scatter plot: price vs. area]

5 / 59

SLIDE 6

Motivation I – house pricing

Fitting linear function

[Scatter plot: price vs. area, with fitted linear function]

6 / 59

SLIDE 7

Motivation I – house pricing

Fitting isotonic¹ function

[Scatter plot: price vs. area, with fitted isotonic function]

¹ isotonic – non-decreasing, order-preserving

7 / 59

SLIDE 8

Motivation II – predicting good probabilities

Predictions of SVM classifier (German credit)

[Scatter plot: binary labels vs. classifier score]

Can we turn score values into conditional probabilities P(y|x)?

8 / 59

SLIDE 9

Motivation II – predicting good probabilities

Fitting isotonic function to the labels [Zadrozny & Elkan, 2002]

[Scatter plot: labels/probabilities vs. score, with fitted isotonic function]

9 / 59

SLIDE 10

Motivation II – predicting good probabilities

Calibration plots (reliability curve)

[Calibration plot; y-axis: fraction of positives]

Perfectly calibrated; Logistic (0.099); SVM (0.163); SVM + Isotonic (0.100)

(generated by a script from scikit-learn.org)

10 / 59

SLIDE 11

Motivation II – predicting good probabilities

Calibration plots (reliability curve)

[Calibration plot; y-axis: fraction of positives]

Perfectly calibrated; Logistic (0.099); Naive Bayes (0.118); Naive Bayes + Isotonic (0.098)

(generated by a script from scikit-learn.org)

10 / 59

SLIDE 12

Outline

1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions

11 / 59

SLIDE 13

Isotonic regression

Definition: Fit an isotonic (monotonically increasing) function to the data.

Extensively studied in statistics [Ayer et al., 55; Brunk, 55; Robertson et al., 98].

Numerous applications:
• Biology, medicine, psychology, etc.
• Multicriteria decision support.
• Hypothesis tests under order constraints.
• Multidimensional scaling.
• Machine learning: probability calibration, ROC analysis.

12 / 59

SLIDE 14

Isotonic regression

Definition: Given data {(x_t, y_t)}_{t=1}^T ⊂ R × R, find an isotonic (nondecreasing) f* : R → R which minimizes the squared error over the labels:

  min_f ∑_{t=1}^T (y_t − f(x_t))²   subject to   x_t ≥ x_q ⟹ f(x_t) ≥ f(x_q),  q, t ∈ {1, …, T}.

The optimal solution f* is called the isotonic regression function. Only the values f(x_t), t = 1, …, T, matter.

13 / 59
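To make the definition concrete, here is a minimal sketch using scikit-learn's IsotonicRegression (the numbers are made up for illustration and are not from the talk):

```python
# Minimal sketch (illustrative data): least-squares isotonic fit as defined above.
import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.8, 0.2, 0.6, 0.9, 0.7])

iso = IsotonicRegression(increasing=True)
f_star = iso.fit_transform(x, y)   # values f*(x_t) at the data points
print(f_star)                      # [0.5 0.5 0.6 0.8 0.8] -- nondecreasing
```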

SLIDE 15

Isotonic regression example

(source: scikit-learn.org)

14 / 59

SLIDE 16

Properties of isotonic regression

• Depends on instances (x) only through their order relation.
• Only defined at the points {x_1, …, x_T}; often extended to R by linear interpolation.
• Piecewise constant (splits the data into level sets).
• Self-averaging property: the value of f* on a given level set equals the average of the labels in that level set. For any v:

    v = (1 / |S_v|) ∑_{t ∈ S_v} y_t,   where S_v = {t : f*(x_t) = v}.

• When y ∈ {0, 1}, produces calibrated (empirical) probabilities: E_emp[y | f* = v] = v.

15 / 59
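A tiny worked example of the self-averaging property (numbers chosen for illustration, not from the talk): for labels y = (0.8, 0.2, 0.6) at increasing x, the isotonic regression is f* = (0.5, 0.5, 0.6). The first two points violate the order, form one level set and receive the average (0.8 + 0.2)/2 = 0.5, while the third point is left unchanged.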

SLIDE 17

Pool Adjacent Violators Algorithm (PAVA)

• Iterative merging of data points into blocks until no violators of the isotonic constraints remain.
• The value assigned to each block is the average of the labels in that block.
• The final assignment to blocks corresponds to the level sets of the isotonic regression.
• Works in linear O(T) time, but requires the data to be sorted.

16 / 59
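A minimal PAVA sketch in Python (an illustration, not the speaker's code); it assumes the points are already sorted by x, as required above:

```python
# Minimal PAVA sketch (assumes the data are already sorted by x).
# Each block stores [sum of labels, count]; blocks are merged while their averages violate the order.
def pava(y):
    blocks = []
    for label in y:
        blocks.append([label, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []                             # expand block averages back to one value per point
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

print(pava([0.8, 0.2, 0.6, 0.9, 0.7]))      # [0.5, 0.5, 0.6, 0.8, 0.8]
```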

SLIDES 18–19

Generalized isotonic regression

Definition: Given data {(x_t, y_t)}_{t=1}^T ⊂ R × R, find an isotonic f* : R → R which minimizes

  min_{isotonic f} ∑_{t=1}^T ∆(y_t, f(x_t)).

Squared loss (y_t − f(x_t))² replaced with a general loss ∆(y_t, f(x_t)).

Theorem [Robertson et al., 1998]: All loss functions of the form

  ∆(y, z) = Ψ(y) − Ψ(z) − Ψ′(z)(y − z)

for some strictly convex Ψ result in the same isotonic regression function f*.

17 / 59

SLIDE 20

Generalized isotonic regression – examples

∆(y, z) = Ψ(y) − Ψ(z) − Ψ′(z)(y − z)

• Squared function Ψ(y) = y²: ∆(y, z) = y² − z² − 2z(y − z) = (y − z)² (squared loss).
• Negative entropy Ψ(y) = y log y + (1 − y) log(1 − y), y ∈ [0, 1]: ∆(y, z) = −y log z − (1 − y) log(1 − z), up to a term depending only on y (cross-entropy / relative entropy).
• Negative logarithm Ψ(y) = −log y, y > 0: ∆(y, z) = y/z − log(y/z) − 1 (Itakura–Saito distance / Burg entropy).

18 / 59

SLIDE 21

Outline

1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions

19 / 59

SLIDE 22

Online learning framework

• A theoretical framework for the analysis of online algorithms.
• The learning process is by its very nature incremental.
• Avoids stochastic (e.g., i.i.d.) assumptions on the data sequence; designs algorithms which work well for any data.
• Meaningful performance guarantees based on observed quantities: regret bounds.

20 / 59

SLIDE 23

Online learning framework

[Diagram of one round: the learner (strategy f_t : X → Y) receives a new instance (x_t, ?), makes the prediction ŷ_t = f_t(x_t), receives the feedback y_t, suffers loss ℓ(ŷ_t, y_t), and moves to round t + 1.]

21 / 59

SLIDE 24

Online learning framework

Set of strategies (actions) F; known loss function ℓ. Learner starts with some initial strategy (action) f_1. For t = 1, 2, …:

1 Learner observes instance x_t.
2 Learner predicts with ŷ_t = f_t(x_t).
3 The environment reveals outcome y_t.
4 Learner suffers loss ℓ(ŷ_t, y_t).
5 Learner updates its strategy f_t → f_{t+1}.

22 / 59

SLIDE 25

Online learning framework

The goal of the learner is to be close to the best f in hindsight.

Cumulative loss of the learner:   L_T = ∑_{t=1}^T ℓ(ŷ_t, y_t).

Cumulative loss of the best strategy f in hindsight:   L*_T = min_{f ∈ F} ∑_{t=1}^T ℓ(y_t, f(x_t)).

Regret of the learner:   regret_T = L_T − L*_T.

The goal is to minimize regret over all possible data sequences.

23 / 59
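A toy sketch of the loop above with squared loss (my illustration; the learner and the comparator class are deliberately simplistic — the learner predicts the running mean of the labels seen so far, and F is taken to be the constant predictions in [0, 1]):

```python
# Toy sketch of the online protocol and regret bookkeeping (squared loss).
import numpy as np

def run_online(xs, ys):
    seen = []
    learner_loss = 0.0
    for x_t, y_t in zip(xs, ys):
        y_hat = np.mean(seen) if seen else 0.5   # prediction before the label is revealed
        learner_loss += (y_t - y_hat) ** 2       # loss suffered at trial t
        seen.append(y_t)                         # feedback, then t -> t + 1
    best_constant = np.mean(ys)                  # best fixed strategy in hindsight (constant comparator)
    best_loss = float(np.sum((np.array(ys) - best_constant) ** 2))
    return learner_loss - best_loss              # regret_T

print(run_online([1, 2, 3, 4], [0.1, 0.4, 0.5, 0.9]))
```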

SLIDE 26

Outline

1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions

24 / 59

SLIDES 27–37

Online isotonic regression

[Animation on points x_1 < x_2 < … < x_8 with labels in [0, 1]: the environment picks a yet unlabeled point (first x_5, then x_1, …), the learner predicts ŷ at that point, the true label y is revealed, and the learner suffers loss = (ŷ − y)².]

25 / 59

SLIDES 38–40

Online isotonic regression

The protocol. Given: x_1 < x_2 < … < x_T. At trial t = 1, …, T:
• Environment chooses a yet unlabeled point x_{i_t}.
• Learner predicts ŷ_{i_t} ∈ [0, 1].
• Environment reveals label y_{i_t} ∈ [0, 1].
• Learner suffers squared loss (ŷ_{i_t} − y_{i_t})².

Strategies = isotonic functions:   F = {f : f(x_1) ≤ f(x_2) ≤ … ≤ f(x_T)}

  regret_T = ∑_{t=1}^T (ŷ_{i_t} − y_{i_t})² − min_{f ∈ F} ∑_{t=1}^T (y_{i_t} − f(x_{i_t}))²

26 / 59
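A minimal sketch of this regret computation (made-up labels and arrival order, a deliberately naive learner that always predicts 0.5, and scikit-learn's isotonic fit as the hindsight comparator):

```python
# Sketch of regret_T against the best isotonic function in hindsight.
import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0.0, 0.2, 0.1, 0.7, 0.6, 1.0])      # labels in [0, 1]
order = [4, 0, 5, 2, 1, 3]                         # order in which the environment reveals the points

learner_loss = sum((y[i] - 0.5) ** 2 for i in order)   # naive learner: always predicts 0.5
f_star = IsotonicRegression().fit_transform(x, y)      # best isotonic function in hindsight
best_loss = float(np.sum((y - f_star) ** 2))
print("regret_T =", learner_loss - best_loss)
```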

SLIDE 41

Online isotonic regression

  F = {f : f(x_1) ≤ f(x_2) ≤ … ≤ f(x_T)}

  regret_T = ∑_{t=1}^T (ŷ_{i_t} − y_{i_t})² − min_{f ∈ F} ∑_{t=1}^T (y_{i_t} − f(x_{i_t}))²

The cumulative loss of the learner should not be much larger than the loss of the (optimal) isotonic regression function in hindsight. Only the order x_1 < … < x_T matters, not the values.

27 / 59

SLIDES 42–55

The adversary is too powerful!

Every algorithm will have Ω(T) regret.

[Animation: the adversary reveals points x_1, x_2, x_3, … one at a time; after seeing the learner's prediction ŷ it picks a label far from it, so the learner suffers loss ≥ 1/4 at every trial, while the revealed labels remain consistent with some isotonic function.]

Algorithm's loss ≥ 1/4 per trial; loss of the best isotonic function = 0.

28 / 59

SLIDE 56

Outline

1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions

29 / 59

SLIDE 57

Fixed design

• Data x_1, …, x_T is known in advance to the learner.
• We will show that in such a model, efficient online algorithms exist.

K., Koolen, Malek: Online Isotonic Regression. Proc. of Conference on Learning Theory (COLT), pp. 1165–1189, 2016.

30 / 59

SLIDE 58

Off-the-shelf online algorithms

Algorithm                     | General bound      | Bound for online IR
Stochastic Gradient Descent   | G_2 D_2 √T         | T
Exponentiated Gradient        | G_∞ D_1 √(T log d) | √(T log T)
Follow the Leader             | G_2 D_2 d log T    | T² log T
Exponential Weights           | d log T            | T log T

These bounds are tight (up to logarithmic factors).

31 / 59

SLIDE 59

Exponential Weights (Bayes) with uniform prior

Let f = (f_1, …, f_T) denote the values of f at (x_1, …, x_T).

  π(f) = const,   for all f : f_1 ≤ … ≤ f_T,

  P(f | y_{i_1}, …, y_{i_t}) ∝ π(f) e^{−½ loss_{1…t}(f)},

  ŷ_{i_{t+1}} = ∫ f_{i_{t+1}} P(f | y_{i_1}, …, y_{i_t}) df   (= posterior mean).

32 / 59

SLIDES 60–64

Exponential Weights with uniform prior does not learn

[Plots of the prior mean and of the posterior mean after t = 10, 20, 50, 100 revealed labels, illustrating that the posterior mean does not adapt to the data.]

33–37 / 59

SLIDE 65

The algorithm

Exponential Weights on a covering net

  F_K = { f : f_t = k_t / K,  k_t ∈ {0, 1, …, K},  f_1 ≤ … ≤ f_T },

with π(f) uniform on F_K.

• Efficient implementation by dynamic programming: O(Kt) at trial t.
• Speed-up to O(K) if the data is revealed in isotonic order.

38 / 59
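A brute-force sketch of what the algorithm computes (the revealed labels below are made up): it enumerates F_K explicitly and outputs the posterior-mean prediction; the dynamic program mentioned on the slide computes the same quantity in O(Kt) time per trial.

```python
# Brute-force sketch of Exponential Weights on the covering net F_K (illustrative data).
from itertools import combinations_with_replacement
import numpy as np

T, K = 5, 4
net = [np.array(f) / K                      # all nondecreasing sequences on the grid {0, 1/K, ..., 1}
       for f in combinations_with_replacement(range(K + 1), T)]

labeled = {0: 0.1, 3: 0.8}                  # labels revealed so far: point index -> label (made up)
query = 2                                   # next point chosen by the environment

log_w = np.array([-0.5 * sum((y - f[i]) ** 2 for i, y in labeled.items())
                  for f in net])            # log posterior weights under the uniform prior
w = np.exp(log_w - log_w.max())
w /= w.sum()
y_hat = sum(wi * f[query] for wi, f in zip(w, net))   # posterior mean = prediction
print(y_hat)
```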

SLIDES 66–70

Covering net

A finite set of isotonic functions on a discrete grid of y values.

[Grid figure: y levels 0.1, 0.2, …, 1 over points x_1, …, x_12, with several example isotonic functions from the net highlighted.]

There are O(T^K) functions in F_K.

39 / 59
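A counting aside not spelled out on the slide: by stars and bars, the net has exactly |F_K| = C(T + K, K) elements (a nondecreasing grid sequence is a multiset of T values chosen from the K + 1 levels). For fixed K this is O(T^K), and in general log |F_K| = O(K log T), which is the quantity used in the regret decomposition a few slides below.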

SLIDES 71–74

Performance of the algorithm

Regret bound: when K = Θ(T^{1/3} log^{−1/3} T),

  Regret = O(T^{1/3} log^{2/3} T).

Matching lower bound Ω(T^{1/3}) (up to a log factor).

Proof idea:

  Regret = [ Loss(alg) − min_{f ∈ F_K} Loss(f) ] + [ min_{f ∈ F_K} Loss(f) − min_{isotonic f} Loss(f) ],

where the first term is at most 2 log |F_K| = O(K log T) and the second is at most T / (4K²).

40 / 59
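Back-of-the-envelope balancing of the two terms (not shown on the slide): the bound is of order K log T + T / (4K²); setting the derivative in K to zero gives K = Θ((T / log T)^{1/3}), for which both K log T and T / (4K²) are of order T^{1/3} log^{2/3} T, matching the stated regret bound.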

SLIDES 75–79

Performance of the algorithm

[Plots of the prior mean and of the posterior mean of the covering-net algorithm after t = 10, 20, 50, 100 revealed labels, illustrating how the posterior mean adapts to the data.]

41–45 / 59

SLIDES 80–81

Other loss functions

Cross-entropy loss ℓ(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ):
• The same bound O(T^{1/3} log^{2/3} T).
• Covering net F_K obtained by non-uniform discretization.

Absolute loss ℓ(ŷ, y) = |ŷ − y|:
• O(√(T log T)) obtained by Exponentiated Gradient.
• Matching lower bound Ω(√T) (up to a log factor).

46 / 59

SLIDE 82

Outline

1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions

47 / 59

SLIDES 83–84

Random permutation model

• A more realistic scenario for generating x_1, …, x_T which allows the data to be unknown in advance.
• The data are chosen adversarially before the game begins, but are then presented to the learner in a random order.
• Motivation: the data-gathering process is independent of the underlying data-generation mechanism. Still a very weak assumption.
• Evaluation: regret averaged over all permutations of the data: E_σ[regret_T].

K., Koolen, Malek: Random Permutation Online Isotonic Regression. NIPS, pp. 4180–4189, 2017.

48 / 59

SLIDES 85–86

Leave-one-out loss

Definition: Given t labeled points {(x_i, y_i)}_{i=1}^t, for i = 1, …, t:
• Take out the i-th point and give the remaining t − 1 points to the learner as training data.
• The learner predicts ŷ_i on x_i and receives loss ℓ(ŷ_i, y_i).

Evaluate the learner by

  ℓoo_t = (1/t) ∑_{i=1}^t ℓ(ŷ_i, y_i).

No sequential structure in the definition.

Theorem: If ℓoo_t ≤ g(t) for all t, then E_σ[regret_T] ≤ ∑_{t=1}^T g(t).

49 / 59

SLIDE 87

Fixed design to random permutation conversion

• Any algorithm for fixed design can be used in the random permutation setup by re-running it from scratch in each trial.
• We have shown that:  ℓoo_t ≤ (1/t) E_σ[fixed-design-regret_t].
• We thus get an optimal algorithm (Exponential Weights on a grid) with O(T^{−2/3}) leave-one-out loss “for free”, but it is complicated.
• Can we get simpler algorithms to work in this setup?

50 / 59

SLIDES 88–90

Follow the Leader (FTL) algorithm

Definition: Given the past t − 1 data points, compute the optimal (loss-minimizing) function f* and predict on a new instance x according to f*(x).

FTL is undefined for isotonic regression:

  x       −3     −1      2      3
  y       0.2    (new)   0.7    1
  f*(x)   0.2    ??      0.7    1

51 / 59

SLIDES 91–94

Forward Algorithm (FA)

Definition: Given the past t − 1 data points and a new instance x, take any guess y′ ∈ [0, 1] of the new label and predict according to the optimal function f* on the past data including the new point (x, y′).

  x       −3     −1       2      3
  y       0.2    y′ = 1   0.7    1
  f*(x)   0.2    0.85     0.85   1

Various popular prediction algorithms for IR fall into this framework (including linear interpolation [Zadrozny & Elkan, 2002] and many others [Vovk et al., 2015]).

52 / 59
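A minimal sketch of a forward algorithm using scikit-learn's isotonic fit (my illustration; with the guess y′ = 1 it reproduces the numbers in the table above):

```python
# Sketch of a forward algorithm: append the new point with a guessed label y',
# run offline isotonic regression, and predict the fitted value at the new point.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def forward_predict(x_train, y_train, x_new, guess=1.0):
    xs = np.append(x_train, x_new)
    ys = np.append(y_train, guess)                 # include the guessed point (x_new, y')
    fitted = IsotonicRegression().fit_transform(xs, ys)
    return fitted[-1]                              # prediction = f*(x_new)

# The slide's example: past data (-3, 0.2), (2, 0.7), (3, 1); new point x = -1, guess y' = 1.
print(forward_predict(np.array([-3.0, 2.0, 3.0]), np.array([0.2, 0.7, 1.0]), -1.0))   # -> 0.85
```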

SLIDES 95–98

Forward Algorithm (FA)

Two extreme FAs: guess-1 and guess-0, denoted f*_1 and f*_0.

The prediction of any FA always lies between them: f*_0(x) ≤ f*(x) ≤ f*_1(x).

[Plot: labeled points (x_1, y_1), …, (x_8, y_8) with the new point x_4; the curves f*_0 and f*_1 bracket the range in which every FA predicts.]

53 / 59

SLIDE 99

Performance of FA

Theorem: For squared loss, every forward algorithm has

  ℓoo_t = O( √(log t / t) ).

The bound is suboptimal, but only a factor of O(t^{1/6}) off. For cross-entropy loss, the same bound holds, but a more careful choice of the guess must be made.

54 / 59

SLIDE 100

Outline

1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions

55 / 59

SLIDE 101

Conclusions

• Two models for online isotonic regression: fixed design and random permutation.
• Optimal algorithm in both models: Exponential Weights (Bayes) on a grid.
• In the random permutation model, a class of forward algorithms with good bounds on the leave-one-out loss.
• Open problem: extend the analysis of these algorithms to the partial-order case.

56 / 59

SLIDE 102

Bibliography

Statistics

• M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26(4):641–647, 1955.
• H. D. Brunk. Maximum likelihood estimates of monotone parameters. Annals of Mathematical Statistics, 26(4):607–616, 1955.
• J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
• R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67:140–147, 1972.
• T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1998.
• Sara Van de Geer. Estimating a regression function. Annals of Statistics, 18:907–924, 1990.
• Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2):528–555, 2002.
• Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32:1–24, 2009.

57 / 59

SLIDE 103

Bibliography

Machine Learning

• Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.
• Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, volume 119, pages 625–632. ACM, 2005.
• Tom Fawcett and Alexandru Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, 2007.
• Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In NIPS, pages 892–900, 2015.
• Aditya Krishna Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In ICML, 2012.
• Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all ℓp-norms. In NIPS, 2015.
• Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
• T. Moon, A. Smola, Y. Chang, and Z. Zheng. IntervalRank: Isotonic regression with listwise and pairwise constraints. In WSDM, pages 151–160. ACM, 2010.
• Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, pages 927–935, 2011.

58 / 59

SLIDE 104

Bibliography

Online isotonic regression

• Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In COLT, pages 1232–1264, 2014.
• Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In COLT, pages 764–796, 2015.
• Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Online isotonic regression. In COLT, pages 1165–1189, 2016.
• Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Random permutation online isotonic regression. In Neural Information Processing Systems (NIPS), pages 4180–4189, 2017.

59 / 59