SLIDE 1

Selective sampling algorithms for cost-sensitive multiclass prediction

Alekh Agarwal

Microsoft Research

SLIDE 2

Why active learning?

Standard setting - receive randomly sampled examples

SLIDE 3

Why active learning?

Standard setting - receive randomly sampled examples
Not all data points are equally informative!

SLIDE 4

Why active learning?

Standard setting - receive randomly sampled examples
Not all data points are equally informative!
Labelled data points are expensive, unlabelled cheap

Object recognition - images need human labelling
Protein interaction prediction - lab test for each protein pair
Web ranking - human editors to label relevant pages

SLIDE 5

What is active learning?

Sequentially query points with label uncertainty

SLIDE 6

What is active learning?

Sequentially query points with label uncertainty

Like random search vs. binary search

SLIDE 7

What is active learning?

Sequentially query points with label uncertainty

Like random search vs. binary search
Example: sampling near the decision boundary for linear separators

SLIDE 8

Online selective sampling paradigm

[Diagram: examples $x_1, x_2, \ldots, x_t$ stream into the algorithm, which predicts $\hat y_t$; if $Z_t = 1$ the label $y_t$ is observed, if $Z_t = 0$ it is not.]

Filter examples online, querying only a subset of labels. Examples are not revisited

SLIDE 9

Prior work

Bulk of work in the binary setting
Agnostic active learning

Atlas, Balcan, Beygelzimer, Cohn, Dasgupta, Hanneke, Hsu, Ladner, Langford, . . .

Linear selective sampling: Cesa-Bianchi, Gentile and co-authors

SLIDE 10

Prior work

Bulk of work in the binary setting
Agnostic active learning

Atlas, Balcan, Beygelzimer, Cohn, Dasgupta, Hanneke, Hsu, Ladner, Langford, . . .

Linear selective sampling: Cesa-Bianchi, Gentile and co-authors
Empirical work in the multiclass setting: Jain and Kapoor (2009), Joshi et al. (2012), . . .
Relatively little theoretical work

SLIDE 11

This talk

Efficient algorithm in a multiclass GLM setting
Analysis of regret and label complexity
Sharp rates under a Tsybakov-type noise condition
Regret ranges from $O(1/\sqrt{N_T})$ (noisy) to $O(\exp(-c_0 N_T))$ (hard-margin)

SLIDE 12

This talk

Efficient algorithm in a multiclass GLM setting
Analysis of regret and label complexity
Sharp rates under a Tsybakov-type noise condition
Regret ranges from $O(1/\sqrt{N_T})$ (noisy) to $O(\exp(-c_0 N_T))$ (hard-margin)
Safety guarantee under model mismatch
Numerical simulations

SLIDE 13

Multiclass prediction

$x \in \mathbb{R}^d$, $y \in \{1, 2, \ldots, K\}$
Only one label per example

[Images: horse, cat, dog.]

SLIDE 14

Multiclass prediction

$x \in \mathbb{R}^d$, $y \in \{1, 2, \ldots, K\}$
Only one label per example
Cost matrix $C \in \mathbb{R}^{K \times K}$
Penalty $C_{ij}$ for predicting label $j$ when the true label is $i$

[Images: horse, cat, dog.]

Example cost matrix (rows: true label, columns: predicted label; the blank diagonal is read as 0):

        Cat   Dog   Horse
Cat      0     1     10
Dog      1     0     10
Horse   10    10      0

SLIDE 15

Structured cost matrices

Often have block- or tree-structured cost matrices in applications
[Heatmaps of three example cost matrices: 0/1, block-structured, and tree-structured.]
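A minimal NumPy sketch (an editorial illustration, not from the talk) of how such cost matrices might be constructed; the function names and the within/across cost values are hypothetical:

```python
import numpy as np

def zero_one_cost(K):
    """0/1 cost matrix: unit cost for any misclassification."""
    return np.ones((K, K)) - np.eye(K)

def block_cost(block_sizes, within=0.1, across=1.0):
    """Block-structured costs: confusing two classes in the same block
    (e.g. two animal classes) is cheaper than confusing across blocks."""
    K = sum(block_sizes)
    C = np.full((K, K), across)
    start = 0
    for size in block_sizes:
        C[start:start + size, start:start + size] = within
        start += size
    np.fill_diagonal(C, 0.0)  # correct predictions cost nothing
    return C

C_block = block_cost([5, 5])  # 10 classes in two blocks of 5
```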

SLIDE 16

Multiclass GLM

Weight matrix $W^* \in \mathbb{R}^{K \times d}$
Convex function $\Phi : \mathbb{R}^K \to \mathbb{R}$
Definition (Multiclass GLM). For every $x \in \mathbb{R}^d$, the class-conditional probabilities follow the model $P(Y = i \mid W^*, x) = (\nabla\Phi(W^*x))_i$

[Diagram: $W^*$ ($K \times d$) times $x$ ($d$) gives $W^*x$ ($K$); applying $\nabla\Phi$ yields $P(Y \mid W^*, x)$.]

SLIDE 17

Multiclass GLM intuition

Binary case: $K = 2$. $\Phi$ is convex $\iff$ the link function is monotone increasing. E.g. logistic, linear, . . .

[Plot: a monotone link — $P(y = 1 \mid w, x)$ as a function of $w^\top x$.]

SLIDE 18

Example: multiclass logistic

Define $\Phi(v) = \log\left(\sum_{i=1}^K \exp(v_i)\right)$
Obtain $(\nabla\Phi(v))_i = \exp(v_i) \big/ \sum_{j=1}^K \exp(v_j)$
Yields the multinomial logit noise model
$$P(Y = i \mid W, x) = \frac{\exp(x^\top W_i)}{\sum_{j=1}^K \exp(x^\top W_j)}.$$
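A minimal NumPy sketch of this link: $\nabla\Phi$ is the softmax map, and sampling from the GLM draws a label from it (`grad_phi` and `sample_label` are illustrative names, not from the talk):

```python
import numpy as np

def grad_phi(v):
    """Softmax: gradient of Phi(v) = log(sum_i exp(v_i)).
    Coordinate i is the class-conditional probability P(Y = i)."""
    v = v - v.max()          # shift for numerical stability
    p = np.exp(v)
    return p / p.sum()

def sample_label(W, x, rng):
    """Draw a label from P(Y = i | W, x) = (grad Phi(Wx))_i."""
    probs = grad_phi(W @ x)
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
W_star = rng.normal(size=(5, 20))   # K = 5 classes, d = 20 features
x = rng.normal(size=20)
y = sample_label(W_star, x, rng)
```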

SLIDE 19

Loss function

Given $\Phi$, define the loss $\ell(Wx, y) = \Phi(Wx) - y^\top Wx$ (with $y$ one-hot).
Convex since $\Phi$ is convex
Fisher consistent: $W^*$ minimizes $\mathbb{E}[\ell(Wx, y) \mid W^*, x]$ for each $x$

SLIDE 20

Loss function

Given $\Phi$, define the loss $\ell(Wx, y) = \Phi(Wx) - y^\top Wx$ (with $y$ one-hot).
Convex since $\Phi$ is convex
Fisher consistent: $W^*$ minimizes $\mathbb{E}[\ell(Wx, y) \mid W^*, x]$ for each $x$:
$$\mathbb{E}[\nabla\ell(Wx, y) \mid W^*, x] = \mathbb{E}[\nabla\Phi(Wx) \mid W^*, x] - \mathbb{E}[\nabla(y^\top Wx) \mid W^*, x] = \nabla\Phi(Wx)\,x^\top - \mathbb{E}[y \mid W^*, x]\,x^\top = \nabla\Phi(Wx)\,x^\top - \nabla\Phi(W^*x)\,x^\top,$$
which vanishes at $W = W^*$.
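A sketch of the loss and its gradient for the multiclass-logistic $\Phi$, mirroring the identity above with the one-hot vector $e_y$ in place of $\mathbb{E}[y \mid W^*, x]$ (helper names are illustrative):

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def glm_loss(W, x, y):
    """ell(Wx, y) = Phi(Wx) - y^T Wx with y a one-hot label index;
    for the multiclass-logistic Phi this is the cross-entropy loss."""
    v = W @ x
    return logsumexp(v) - v[y]

def glm_loss_grad(W, x, y):
    """Gradient in W: (grad Phi(Wx) - e_y) x^T."""
    v = W @ x
    p = np.exp(v - v.max()); p /= p.sum()   # grad Phi(Wx)
    e_y = np.zeros(W.shape[0]); e_y[y] = 1.0
    return np.outer(p - e_y, x)
```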

SLIDE 21

Score function

Given a cost matrix $C$ and $\Phi$, define
$$S^x_W(i) = -\sum_{j=1}^K \underbrace{C(j, i)}_{\text{cost of predicting } i}\; \underbrace{(\nabla\Phi(Wx))_j}_{\text{probability of } j}.$$
Negative expected cost of predicting $i$, when $W$ is the true weight matrix
Maximum score $\iff$ minimum expected cost

SLIDE 22

Score function

Given a cost matrix $C$ and $\Phi$, define
$$S^x_W(i) = -\sum_{j=1}^K \underbrace{C(j, i)}_{\text{cost of predicting } i}\; \underbrace{(\nabla\Phi(Wx))_j}_{\text{probability of } j}.$$
Negative expected cost of predicting $i$, when $W$ is the true weight matrix
Maximum score $\iff$ minimum expected cost
The Bayes predictor predicts $\arg\max_{i=1,\ldots,K} S^x_{W^*}(i)$
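A sketch of the score and the induced cost-sensitive prediction, again assuming the multiclass-logistic $\Phi$ (function names are illustrative):

```python
import numpy as np

def scores(W, x, C):
    """S^x_W(i) = -sum_j C(j, i) (grad Phi(Wx))_j: the negative
    expected cost of predicting i under the model W."""
    v = W @ x
    p = np.exp(v - v.max()); p /= p.sum()   # (grad Phi(Wx))_j
    return -C.T @ p                          # entry i: -sum_j C[j, i] p[j]

def bayes_predict(W, x, C):
    """Maximum score <=> minimum expected cost."""
    return int(np.argmax(scores(W, x, C)))
```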

SLIDE 23

CS-Selectron algorithm with general query function

Input: query function $Q$, cost matrix $C$, parameter $\gamma > 0$
Initialize: $W_1 = 0$, $M_1 = (\gamma/\gamma_\ell)\, I$
For $t = 1, 2, \ldots, T$:

SLIDE 24

CS-Selectron algorithm with general query function

Input: query function $Q$, cost matrix $C$, parameter $\gamma > 0$
Initialize: $W_1 = 0$, $M_1 = (\gamma/\gamma_\ell)\, I$
For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$

SLIDE 25

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$

SLIDE 26

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$
If $Q(x_t, \mathcal{H}_t) = 1$, then

SLIDE 27

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$
If $Q(x_t, \mathcal{H}_t) = 1$, then
Query and observe label $y_t$

SLIDE 28

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$
If $Q(x_t, \mathcal{H}_t) = 1$, then
Query and observe label $y_t$
Update $W_t$, $M_t$ and $\mathcal{H}_t$: set $Z_t = 1$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{y_t\}$, $M_{t+1} = M_t + x_t x_t^\top$, and
$$W_{t+1} = \arg\min_{W \in \mathcal{W}} \left\{ \sum_{s=1}^{t} Z_s\, \ell(Wx_s, y_s) + \gamma \|W\|_F^2 \right\}.$$
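A minimal sketch of the whole loop, reusing the `bayes_predict` and `glm_loss_grad` helpers sketched earlier. The exact regularized arg-min on this slide is approximated by a few gradient steps, and `stream` / `label_oracle` are hypothetical stand-ins for the data source and the labeller:

```python
import numpy as np

def cs_selectron(stream, K, d, C, query_fn, gamma=1.0, gamma_ell=0.25,
                 fit_steps=50, lr=0.1):
    """CS-Selectron sketch: predict by maximum score, query selectively,
    and refit W on queried examples only (examples are never revisited)."""
    W = np.zeros((K, d))
    M = (gamma / gamma_ell) * np.eye(d)
    queried = []                                  # (x_s, y_s) with Z_s = 1
    for x_t, label_oracle in stream:
        y_hat = bayes_predict(W, x_t, C)          # arg max_i S^{x_t}_{W_t}(i)
        if query_fn(x_t, M):
            y_t = label_oracle()                  # pay for this label
            queried.append((x_t, y_t))
            M = M + np.outer(x_t, x_t)
            for _ in range(fit_steps):            # approximate the arg-min
                g = sum(glm_loss_grad(W, xs, ys) for xs, ys in queried)
                W = W - lr * (g + 2 * gamma * W)  # ridge-regularized step
        yield y_hat
```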

SLIDE 29

Algorithm intuition

Low-regret algorithm on queried examples
Update ensures $\|W_t - W^*\|_{M_t}$ is small
Query function ensures low regret on rounds with no queries

SLIDE 30

Query function: BBQ$_\epsilon$ rule

Variant of Cesa-Bianchi et al. (2009):
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ \eta_\epsilon \|x_t\|^2_{M_t^{-1}} \ge \epsilon^2 \right\}$$

SLIDE 31

Query function: BBQ$_\epsilon$ rule

Variant of Cesa-Bianchi et al. (2009):
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ \eta_\epsilon \|x_t\|^2_{M_t^{-1}} \ge \epsilon^2 \right\}$$
Note: $\|W^*x_t - W_t x_t\|_2 \le \|W^* - W_t\|_{M_t}\, \|x_t\|_{M_t^{-1}}$

Queries points with large confidence intervals on the predictions

[Diagram: confidence ellipsoid of $M_t$ around $x_t$; a wide ellipsoid gives $Q(x_t, \mathcal{H}_t) = 1$, a narrow one gives $Q(x_t, \mathcal{H}_t) = 0$.]
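A sketch of this query rule; `eta_eps` and `eps` correspond to the parameters $\eta_\epsilon$ and $\epsilon$ on the slide:

```python
import numpy as np

def bbq_query(x_t, M_t, eta_eps=1.0, eps=0.1):
    """BBQ_eps rule: query while the squared Mahalanobis norm
    x_t^T M_t^{-1} x_t is large, i.e. while the confidence interval
    along x_t is still wide."""
    conf = x_t @ np.linalg.solve(M_t, x_t)   # ||x_t||^2 in the M_t^{-1} norm
    return eta_eps * conf >= eps ** 2
```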

SLIDE 32

Theoretical results: assumptions

Assumption. The function $\Phi(\cdot)$ is $\gamma_\ell$-strongly convex, that is, for all $u, v \in S \subseteq \mathbb{R}^K$,
$$\Phi(u) \ge \Phi(v) + \langle \nabla\Phi(v), u - v \rangle + \frac{\gamma_\ell}{2} \|u - v\|_2^2.$$

SLIDE 33

Theoretical results: assumptions

Assumption. The function $\Phi(\cdot)$ is $\gamma_\ell$-strongly convex, that is, for all $u, v \in S \subseteq \mathbb{R}^K$,
$$\Phi(u) \ge \Phi(v) + \langle \nabla\Phi(v), u - v \rangle + \frac{\gamma_\ell}{2} \|u - v\|_2^2.$$
Assumption. The function $\Phi(\cdot)$ is $\gamma_u$-smooth, that is, for all $u, v \in S \subseteq \mathbb{R}^K$,
$$\Phi(u) \le \Phi(v) + \langle \nabla\Phi(v), u - v \rangle + \frac{\gamma_u}{2} \|u - v\|_2^2.$$

SLIDE 34

Theoretical results: assumptions

[Figure: illustration of the function $\Phi$.]

SLIDE 35

Theoretical results: assumptions

Assumption. $\forall x \in \mathcal{X}$, we have $\|x\|_2 \le R$, and $\forall W \in \mathcal{W}$, we have $\|W_i\|_2 \le \omega$ for all $i = 1, 2, \ldots, K$.

SLIDE 36

Theoretical results: setup

Bound the label complexity $N_T$ and the regret
$$R_T = \sum_{t=1}^T \left( \mathbb{E}_t[C(Y_t, \hat y_t)] - \mathbb{E}_t[C(Y_t, y^*_t)] \right)$$

SLIDE 37

Theoretical results: setup

Bound the label complexity $N_T$ and the regret
$$R_T = \sum_{t=1}^T \left( \mathbb{E}_t[C(Y_t, \hat y_t)] - \mathbb{E}_t[C(Y_t, y^*_t)] \right)$$
Determined by the fraction of hard examples
$$T_\epsilon = \left\{ 1 \le t \le T : S^{x_t}_{W^*}(y^*_t) - S^{x_t}_{W^*}(y'_t) \le \epsilon \right\}.$$

SLIDE 38

Main results: BBQ$_\epsilon$ rule

Theorem (BBQ$_\epsilon$ query rule). With probability at least $1 - 2\delta$,
$$R_T = O\left( \epsilon T_\epsilon + \psi(C, \Phi)\, \frac{d}{\epsilon} \log\frac{1}{\delta} \right),$$
and the label complexity is at most
$$N_T = O\left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 \epsilon^2} \log\frac{1}{\delta} \right).$$
Result holds for an arbitrary sequence $x_t$

SLIDE 39

Query function: DGS rule

BBQ$_\epsilon$ doesn't use the labels at all for querying!
Variant of Dekel et al. (2010)
Define
$$y^*_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W^*}(i), \qquad y'_t = \arg\max_{i \ne y^*_t} S^{x_t}_{W^*}(i),$$
$$\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i), \qquad y''_t = \arg\max_{i \ne \hat y_t} S^{x_t}_{W_t}(i).$$

SLIDE 40

Query function: DGS rule

BBQ$_\epsilon$ doesn't use the labels at all for querying!
Variant of Dekel et al. (2010)
Define
$$y^*_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W^*}(i), \qquad y'_t = \arg\max_{i \ne y^*_t} S^{x_t}_{W^*}(i),$$
$$\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i), \qquad y''_t = \arg\max_{i \ne \hat y_t} S^{x_t}_{W_t}(i).$$
Set
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ S^{x_t}_{W_t}(\hat y_t) - S^{x_t}_{W_t}(y''_t) \le 2\eta_{\mathrm{DGS}}\, \|x_t\|_{M_t^{-1}} \right\}$$

SLIDE 41

Query function: DGS rule

BBQ$_\epsilon$ doesn't use the labels at all for querying!
Variant of Dekel et al. (2010)
Define
$$y^*_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W^*}(i), \qquad y'_t = \arg\max_{i \ne y^*_t} S^{x_t}_{W^*}(i),$$
$$\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i), \qquad y''_t = \arg\max_{i \ne \hat y_t} S^{x_t}_{W_t}(i).$$
Set
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ S^{x_t}_{W_t}(\hat y_t) - S^{x_t}_{W_t}(y''_t) \le 2\eta_{\mathrm{DGS}}\, \|x_t\|_{M_t^{-1}} \right\}$$
Note: $|S^{x_t}_{W_t}(i) - S^{x_t}_{W^*}(i)| \le \eta_{\mathrm{DGS}}\, \|x_t\|_{M_t^{-1}}$
[Diagram: confidence intervals of width $\eta \|x_t\|_{M_t^{-1}}$ around $S^{x_t}_{W_t}(\hat y_t)$ and $S^{x_t}_{W_t}(y''_t)$; when $S^{x_t}_{W_t}(\hat y_t) - S^{x_t}_{W_t}(y''_t) > 2\eta \|x_t\|_{M_t^{-1}}$, the ordering of $S^{x_t}_{W^*}(\hat y_t)$ and $S^{x_t}_{W^*}(y''_t)$ is preserved, so no query is needed.]
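A sketch of the DGS rule, reusing the `scores` helper sketched earlier; it queries exactly when the empirical gap between the two best scores is within the confidence width:

```python
import numpy as np

def dgs_query(x_t, M_t, W_t, C, eta_dgs=1.0):
    """DGS rule: query when the gap between the top two empirical scores,
    S^{x_t}_{W_t}(yhat_t) - S^{x_t}_{W_t}(y''_t), is at most
    2 * eta * ||x_t||_{M_t^{-1}}."""
    s = np.sort(scores(W_t, x_t, C))          # ascending scores
    gap = s[-1] - s[-2]                       # best minus second best
    width = np.sqrt(x_t @ np.linalg.solve(M_t, x_t))
    return gap <= 2 * eta_dgs * width
```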

SLIDE 42

Main results: DGS rule

Theorem (DGS query rule). With probability at least $1 - 2\delta$,
$$R_T = O\left( \inf_{\epsilon > 0} \left\{ \epsilon T_\epsilon + \frac{\gamma_u^2 d}{\gamma_\ell^2 \epsilon} \log\frac{1}{\delta} \right\} \right),$$
and for any $\epsilon > 0$, the label complexity is at most
$$N_T = O\left( T_\epsilon + \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 \epsilon^2} \right).$$
Can optimize over $\epsilon$ for the best bound

SLIDE 43

Multiclass Tsybakov noise condition

Specialize to 0/1 costs for ease of presentation, and i.i.d. $x_t$
Assumption (Multiclass Tsybakov noise condition). There exist $\epsilon_0 > 0$, $\alpha > 0$ and some $c$ such that the distribution $P$ over $\mathbb{R}^d$ satisfies, for all $0 \le \epsilon \le \epsilon_0$,
$$P\left( (\nabla\Phi(W^*X))_{y^*(X)} - (\nabla\Phi(W^*X))_{y'(X)} \le \epsilon \right) \le c\, \epsilon^\alpha.$$

SLIDE 44

Multiclass Tsybakov noise condition

Specialize to 0/1 costs for ease of presentation, and i.i.d. $x_t$
Assumption (Multiclass Tsybakov noise condition). There exist $\epsilon_0 > 0$, $\alpha > 0$ and some $c$ such that the distribution $P$ over $\mathbb{R}^d$ satisfies, for all $0 \le \epsilon \le \epsilon_0$,
$$P\left( (\nabla\Phi(W^*X))_{y^*(X)} - (\nabla\Phi(W^*X))_{y'(X)} \le \epsilon \right) \le c\, \epsilon^\alpha.$$
Ensures separation between class-conditional probabilities, controls $T_\epsilon$.
Pictorial illustration for the binary case:
[Figure: $P(y \mid x, w^*)$ as a function of $x^\top w^*$, with a region of width $2\epsilon_0$ around the crossover.]

SLIDE 45

Results for low noise

Corollary. Under the multiclass Tsybakov condition, the BBQ$_\epsilon$ rule yields, with probability at least $1 - 2\delta$,
$$\frac{R_T}{T} = O\left( \left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 N_T} \right)^{\frac{1+\alpha}{2}} \right).$$

SLIDE 46

Results for low noise

Corollary. Under the multiclass Tsybakov condition, the BBQ$_\epsilon$ rule yields, with probability at least $1 - 2\delta$,
$$\frac{R_T}{T} = O\left( \left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 N_T} \right)^{\frac{1+\alpha}{2}} \right).$$
Similar result for the DGS rule
$1/\sqrt{N_T}$ when $\alpha = 0$, and $\exp(-c_0 N_T)$ as $\alpha \to \infty$

SLIDE 47

Results for low noise

Corollary. Under the multiclass Tsybakov condition, the BBQ$_\epsilon$ rule yields, with probability at least $1 - 2\delta$,
$$\frac{R_T}{T} = O\left( \left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 N_T} \right)^{\frac{1+\alpha}{2}} \right).$$
Similar result for the DGS rule
$1/\sqrt{N_T}$ when $\alpha = 0$, and $\exp(-c_0 N_T)$ as $\alpha \to \infty$
$R_T = \Omega\left(N_T^{-(1+\alpha)/2}\right)$ under the noise condition $\Rightarrow$ optimality

SLIDE 48

Numerical simulations

Synthetic mixture-of-Gaussians data in $\mathbb{R}^{1000}$
Evaluated BBQ, DGS, Random and Passive
0/1 cost matrix, multiclass logistic loss

[Plots: ratio of active to passive regret versus number of queries, for $K = 5$ and $K = 10$; curves for Passive, Random, BBQ and DGS.]

SLIDE 49

Model mismatch

[Plot: regret ratio versus number of queries under model mismatch; curves for Passive, Random, BBQ and DGS.]

SLIDE 50

Model mismatch

[Plot: regret ratio versus number of queries under model mismatch; curves for Passive, Random, BBQ and DGS.]

An additional safety guarantee in the paper ensures the algorithms are never worse than random sampling under model mismatch

SLIDE 51

Conclusions

Efficient active learning algorithm for cost-sensitive multiclass GLMs
Bounds on regret and label complexity
Generalization of the binary Tsybakov noise condition to the multiclass case
Optimal regret as a function of the number of queries under the noise condition
Applications to communication-efficient distributed learning

SLIDE 52

Para-active learning

Sift for informative examples in parallel
Update the model on selected examples

SLIDE 53

Thank You
