SLIDE 1
Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case
Alina Beygelzimer (Yahoo!), David Pal (Yahoo!), Balazs Szorenyi (Yahoo!), Devanathan Thiruvenkatachari (NYU), Chen-Yu Wei (USC), Chicheng Zhang (Microsoft)
SLIDE 2
SLIDE 8
Bandit multiclass classification
For t = 1, 2, . . . , T:
- 1. Example (xt, yt) is chosen, where xt ∈ R^d is the feature (shown to the learner) and yt ∈ [K] is the label (hidden).
- 2. Predict a class label ŷt ∈ [K].
- 3. Observe feedback zt = 1[ŷt ≠ yt] ∈ {0, 1}.
Goal: minimize the total number of mistakes, Σ_{t=1}^T zt.
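As a concrete illustration, the interaction protocol above can be simulated with a small driver loop. This is a sketch only: `run_bandit_multiclass`, the `predict`/`update` callbacks, and the toy data stream are hypothetical stand-ins, not part of the talk.

```python
import random

def run_bandit_multiclass(examples, predict, update):
    """Simulate the bandit protocol on a list of (x, y) pairs.

    `predict` maps a feature vector to a label; `update` receives
    (x, y_hat, z) -- the true label y is never shown to the learner.
    Returns the total number of mistakes sum_t z_t."""
    mistakes = 0
    for x, y in examples:
        y_hat = predict(x)            # step 2: predict a class label
        z = 1 if y_hat != y else 0    # step 3: bandit feedback z_t
        update(x, y_hat, z)           # learner sees only (x, y_hat, z)
        mistakes += z
    return mistakes

# Usage: a learner that guesses uniformly at random over K = 2 classes.
random.seed(0)
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)] * 50
total = run_bandit_multiclass(data,
                              predict=lambda x: random.randrange(2),
                              update=lambda x, yh, z: None)
```

The point of the simulator is that `update` receives only the triple (xt, ŷt, zt), which is exactly the bandit feedback restriction.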
SLIDE 9
Challenge: efficient algorithms in the separable setting
Definition
A dataset is called γ-linearly separable if there exist w1, . . . , wK such that ⟨wy, x⟩ ≥ ⟨wy′, x⟩ + γ for all y′ ≠ y and all (x, y) in the dataset (with the constraint Σ_{i=1}^K ‖wi‖² ≤ 1).
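The definition can be checked numerically: compute the smallest multiclass margin ⟨wy, x⟩ − max_{y′ ≠ y} ⟨wy′, x⟩ over the dataset. The helper below is a sketch with hypothetical names, not code from the talk.

```python
import numpy as np

def multiclass_margin(W, X, y):
    """Smallest value of <w_y, x> - max_{y' != y} <w_{y'}, x> over the
    dataset; the dataset is gamma-linearly separable via W iff this is
    >= gamma while sum_i ||w_i||^2 <= 1 (W has shape (K, d))."""
    margins = []
    for x, yi in zip(X, y):
        scores = W @ x                       # <w_1, x>, ..., <w_K, x>
        rival = np.delete(scores, yi).max()  # best competing class
        margins.append(scores[yi] - rival)
    return min(margins)

# Usage: two orthogonal classes, weights scaled so sum ||w_i||^2 = 1.
W = np.array([[1.0, 0.0], [0.0, 1.0]]) / np.sqrt(2)
X = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
y = [0, 1]
gamma = multiclass_margin(W, X, y)   # = 1/sqrt(2)
```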
SLIDE 10
Challenge: efficient algorithms in the separable setting
(Same definition, now illustrated by a figure: Class 1, Class 2, Class 3 separated by the pairwise decision boundaries ⟨w1 − w2, x⟩ = 0, ⟨w1 − w3, x⟩ = 0, ⟨w2 − w3, x⟩ = 0.)
SLIDE 15
Related work

Algorithm                  Mistake Bound                            Efficient?
Minimax algorithm [DH13]   O(K/γ²)                                  No
Banditron [KSST08]¹        O(√(TK/γ²))                              Yes
This work                  2^{O(min(K log²(1/γ), √(1/γ)·log K))}    Yes

Contribution: first efficient algorithm that breaks the √T barrier (the mistake bound does not grow with T).

¹See also [HK11, BOZ17, FKL+18, . . .], which have similar guarantees.
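Which of the two terms inside the min is active depends on how K compares with 1/γ. A quick numeric check (illustration only; constants and log bases are chosen arbitrarily, so only the comparison is meaningful):

```python
import math

def bound_exponents(K, gamma):
    """The two exponents inside the min of the mistake bound
    2^{O(min(K log^2(1/gamma), sqrt(1/gamma) log K))}.
    Constants and log bases are ignored; illustration only."""
    first = K * math.log(1.0 / gamma) ** 2
    second = math.sqrt(1.0 / gamma) * math.log(K)
    return first, second

# Few classes, tiny margin: the K log^2(1/gamma) term is smaller.
a1, b1 = bound_exponents(K=3, gamma=1e-6)
# Many classes, moderate margin: the sqrt(1/gamma) log K term is smaller.
a2, b2 = bound_exponents(K=100, gamma=0.1)
```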
SLIDE 16
Algorithm
(One-versus-rest approach)
SLIDE 19
Algorithm
(One-versus-rest approach: query one binary classifier per class on xt.)
If ≥ 1 of them responds YES:
- ŷt ← any one of those YES labels
If all of them respond NO:
- ŷt ← uniform from {1, . . . , K}
E[#mistakes(alg)] ≤ K · Σ_i #mistakes(i)
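The prediction rule on this slide is easy to state in code. A minimal sketch, with hypothetical names; `responses[i]` stands for the YES/NO answer of the binary classifier for class i:

```python
import random

def ovr_predict(responses, rng=random):
    """One-versus-rest prediction rule: `responses[i]` is True iff the
    binary classifier for class i says YES on the current example.
    Returns a predicted label in {0, ..., K-1}."""
    yes = [i for i, says_yes in enumerate(responses) if says_yes]
    if yes:                                  # >= 1 YES: any YES label works
        return yes[0]
    return rng.randrange(len(responses))     # all NO: guess uniformly
```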
SLIDE 22
Algorithm
◮ Each non-linear binary classifier learns the support of class i, which lies in an intersection of K − 1 halfspaces with a margin [KS04].
[Figure: Class 1, Class 2, Class 3 with pairwise boundaries ⟨w1 − w2, x⟩ = 0, ⟨w1 − w3, x⟩ = 0, ⟨w2 − w3, x⟩ = 0.]
◮ Choice: kernel Perceptron with the rational kernel [SSSS11]: K(x, x′) = 1 / (1 − ⟨x, x′⟩/2).
◮ Thu. Poster #158
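A kernel Perceptron with the rational kernel above might look as follows. This is a sketch of the plain binary subroutine only, not the full bandit algorithm from the talk, and the names are hypothetical; note the kernel is well-defined when examples lie in the unit ball, since then ⟨x, x′⟩/2 < 1.

```python
import numpy as np

def rational_kernel(x, xp):
    """K(x, x') = 1 / (1 - <x, x'>/2); well-defined when ||x||, ||x'|| <= 1."""
    return 1.0 / (1.0 - 0.5 * np.dot(x, xp))

def kernel_perceptron(stream, kernel=rational_kernel):
    """Online binary kernel Perceptron with labels in {-1, +1}.
    Keeps the examples from mistake rounds as support vectors."""
    support = []      # (label, x) pairs collected on mistakes
    mistakes = 0
    for x, y in stream:
        score = sum(yi * kernel(xi, x) for yi, xi in support)
        y_hat = 1 if score > 0 else -1
        if y_hat != y:                # mistake: add a support vector
            support.append((y, x))
            mistakes += 1
    return mistakes, support

# Usage: a trivially separable stream inside the unit ball.
stream = [(np.array([0.9, 0.0]), 1), (np.array([-0.9, 0.0]), -1)] * 10
mistakes, support = kernel_perceptron(stream)
```

On this stream the learner errs once per class and then predicts correctly, so only two support vectors are ever stored.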