SLIDE 1

Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case

Alina Beygelzimer (Yahoo!), David Pal (Yahoo!), Balazs Szorenyi (Yahoo!), Devanathan Thiruvenkatachari (NYU), Chen-Yu Wei (USC), Chicheng Zhang (Microsoft)

SLIDES 2-8

Bandit multiclass classification

For t = 1, 2, . . . , T:

  • 1. Example (x_t, y_t) is chosen, where x_t ∈ R^d is the feature (shown to the learner) and y_t ∈ [K] is the label (hidden).

  • 2. Predict class label ŷ_t ∈ [K].

  • 3. Observe feedback z_t = 1[ŷ_t ≠ y_t] ∈ {0, 1}.

Goal: minimize the total number of mistakes ∑_{t=1}^T z_t.
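To make the interaction protocol concrete, here is a minimal simulation sketch in Python; the RandomLearner baseline and its predict/update interface are illustrative assumptions, not the algorithm from this talk.

```python
import numpy as np

class RandomLearner:
    """Illustrative baseline (not from the talk): ignores x and guesses uniformly."""
    def __init__(self, K, seed=0):
        self.K = K
        self.rng = np.random.default_rng(seed)

    def predict(self, x):
        return int(self.rng.integers(self.K))    # predicted label y_hat in {0, ..., K-1}

    def update(self, x, y_hat, z):
        pass                                     # bandit feedback z = 1[y_hat != y]; nothing learned here

def run_protocol(learner, X, y):
    """Bandit multiclass protocol: the learner sees x_t, predicts y_hat_t,
    and only observes z_t = 1[y_hat_t != y_t]; the true label y_t stays hidden."""
    mistakes = 0
    for x_t, y_t in zip(X, y):
        y_hat = learner.predict(x_t)
        z_t = int(y_hat != y_t)                  # the only feedback the learner receives
        learner.update(x_t, y_hat, z_t)
        mistakes += z_t
    return mistakes

# Toy run on random data (illustrative only).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(3, size=100)
print(run_protocol(RandomLearner(K=3), X, y))    # roughly 2/3 of rounds are mistakes
```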

SLIDES 9-10

Challenge: efficient algorithms in the separable setting

Definition

A dataset is called γ-linearly separable if there exist w_1, . . . , w_K such that

⟨w_y, x⟩ ≥ ⟨w_{y′}, x⟩ + γ   for all y′ ≠ y and all (x, y) in the dataset

(with the constraint ∑_{i=1}^K ‖w_i‖² ≤ 1).

[Figure: three classes in the plane, separated pairwise by the decision boundaries ⟨w_1 − w_2, x⟩ = 0, ⟨w_1 − w_3, x⟩ = 0, and ⟨w_2 − w_3, x⟩ = 0.]
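As a quick sanity check on the definition, here is a small Python helper that tests whether given weight vectors w_1, . . . , w_K certify γ-linear separability of a dataset; the function name and tolerance are illustrative choices, not from the paper.

```python
import numpy as np

def is_gamma_separable(X, y, W, gamma):
    """Check whether W = [w_1, ..., w_K] (shape K x d) certifies gamma-linear
    separability of (X, y): sum_i ||w_i||^2 <= 1 and, for every example,
    <w_y, x> >= <w_{y'}, x> + gamma for all y' != y.
    (Function name and tolerance are illustrative, not from the paper.)"""
    if np.sum(W ** 2) > 1.0 + 1e-9:
        return False                              # violates the norm constraint
    scores = X @ W.T                              # scores[t, i] = <w_i, x_t>
    for t, label in enumerate(y):
        others = np.delete(scores[t], label)
        if scores[t, label] < others.max() + gamma:
            return False                          # correct class does not win by gamma
    return True

# Toy usage: two classes separated along the first coordinate.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([0, 1])
W = np.array([[0.7, 0.0], [-0.7, 0.0]])           # sum of squared norms = 0.98 <= 1
print(is_gamma_separable(X, y, W, gamma=0.5))     # True: each example wins by margin 1.4
```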

SLIDES 11-15

Related work

Algorithm                  | Mistake Bound                          | Efficient?
Minimax algorithm [DH13]   | O(K/γ²)                                | No
Banditron [KSST08]¹        | O(√(TK/γ²))                            | Yes
This work                  | 2^{O(min(K log²(1/γ), √(1/γ) log K))}  | Yes

Contribution: first efficient algorithm that breaks the √T barrier.

¹See also [HK11, BOZ17, FKL+18, ...], which have similar guarantees.

SLIDES 16-19

Algorithm

(One-versus-rest approach) Maintain one binary classifier per class; on round t, each classifier i is asked whether x_t belongs to class i.

If ≥ 1 of them responds YES:

  • ŷ_t ← any one of those YES labels

If all of them respond NO:

  • ŷ_t ← uniform from {1, . . . , K}

E[#mistakes(alg)] ≤ K · ∑_i #mistakes(i)
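The decision rule above can be sketched in a few lines of Python; the responds_yes interface and the toy stub classifier are hypothetical placeholders for the binary learners described on the next slide, and the bandit-feedback update step is omitted.

```python
import numpy as np

def ovr_predict(classifiers, x, rng):
    """One-versus-rest prediction rule from the slide: ask each binary classifier
    whether x belongs to its class; predict one of the YES labels if any,
    otherwise guess uniformly at random.  `responds_yes` is a hypothetical
    interface; updating the classifiers from bandit feedback is not shown."""
    yes_labels = [i for i, clf in enumerate(classifiers) if clf.responds_yes(x)]
    if yes_labels:                                 # >= 1 classifier responds YES
        return int(rng.choice(yes_labels))         # any one of the YES labels
    return int(rng.integers(len(classifiers)))     # all respond NO: uniform guess

class ArgmaxStub:
    """Toy stand-in for binary classifier i: says YES iff coordinate i of x is largest."""
    def __init__(self, i):
        self.i = i
    def responds_yes(self, x):
        return int(np.argmax(x)) == self.i

rng = np.random.default_rng(0)
clfs = [ArgmaxStub(i) for i in range(3)]
print(ovr_predict(clfs, np.array([0.2, 0.9, 0.1]), rng))   # -> 1
```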

SLIDES 20-22

Algorithm

◮ Each non-linear binary classifier learns the support of class i, which lies in an intersection of K − 1 halfspaces with a margin [KS04].

[Figure: the three-class example from the separability slide, with boundaries ⟨w_1 − w_2, x⟩ = 0, ⟨w_1 − w_3, x⟩ = 0, and ⟨w_2 − w_3, x⟩ = 0; each class region is an intersection of halfspaces.]

◮ Choice: kernel Perceptron with the rational kernel [SSSS11]: K(x, x′) = 1 / (1 − ½⟨x, x′⟩).
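To make the last bullet concrete, here is a minimal sketch of a kernelized binary Perceptron with this rational kernel, assuming inputs are scaled to ‖x‖ ≤ 1 so the kernel is well defined; it follows the textbook kernel Perceptron and is not necessarily the exact variant analyzed in the paper.

```python
import numpy as np

def rational_kernel(x, xp):
    """K(x, x') = 1 / (1 - <x, x'>/2); well defined when ||x||, ||x'|| <= 1."""
    return 1.0 / (1.0 - 0.5 * np.dot(x, xp))

class KernelPerceptron:
    """Standard kernelized binary Perceptron (labels in {-1, +1}); stores the
    mistaken examples as support vectors and predicts sign(sum_j y_j K(x_j, x))."""
    def __init__(self, kernel=rational_kernel):
        self.kernel = kernel
        self.support = []          # examples on which a mistake was made
        self.alphas = []           # their labels (+1 / -1), acting as coefficients

    def decision(self, x):
        return sum(a * self.kernel(s, x) for a, s in zip(self.alphas, self.support))

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1

    def update(self, x, y):
        """Perceptron step: add (x, y) as a support vector after a mistake."""
        if self.predict(x) != y:
            self.support.append(x)
            self.alphas.append(y)
            return 1               # a mistake (and an update) occurred
        return 0

# Toy usage: learn "is the first coordinate positive?" on unit-norm inputs.
rng = np.random.default_rng(0)
clf, mistakes = KernelPerceptron(), 0
for _ in range(200):
    x = rng.normal(size=3)
    x /= np.linalg.norm(x)         # keep ||x|| <= 1 so the kernel is valid
    y = 1 if x[0] > 0 else -1
    mistakes += clf.update(x, y)
print("mistakes:", mistakes)
```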

  • Thu. Poster#158