Machine Learning
Online Learning
Some slides based on lectures from Dan Roth, Avrim Blum and others
Big picture
– Last lecture: Linear models
– This lecture: How good is a learning algorithm?
  – Online learning: Perceptron, Winnow
  – PAC, Empirical Risk Minimization: Support Vector Machines
The online learning protocol
Loop forever:
– Maintain a current hypothesis h_i
– Receive one example x
– Predict h_i(x)
– Receive the feedback and update the hypothesis
We only need to define how prediction and update behave. Can such a simple scheme work? How do we quantify what “work” means?
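As a concrete illustration, here is a minimal Python sketch of this loop; the OnlineLearner class and its perceptron-style update are assumptions made for the example, not part of the lecture:

```python
import numpy as np

class OnlineLearner:
    """A perceptron-style learner, used only to make the loop concrete."""
    def __init__(self, n):
        self.w = np.zeros(n)              # the current hypothesis h_i

    def predict(self, x):
        return 1 if self.w @ x >= 0 else 0

    def update(self, x, y):
        if self.predict(x) != y:          # mistake-driven update
            self.w += (2 * y - 1) * x     # move w toward/away from x

def run_online(learner, stream):
    """Loop forever: get one example x, predict h_i(x), then use the
    feedback to update the hypothesis."""
    mistakes = 0
    for x, y in stream:                   # stream yields (example, label)
        mistakes += int(learner.predict(x) != y)
        learner.update(x, y)
    return mistakes
```

Any algorithm that fills in predict and update fits the protocol; the driver loop itself never changes.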
The mistake bound model
– Instance space: 𝒳 (dimensionality n)
– Target: f : 𝒳 → {0, 1}, f ∈ C, the concept class (parameterized by n)
– The learner is given x ∈ 𝒳, randomly chosen
– The learner predicts h(x) and is then given f(x) ⟵ the feedback
– M_A(f, S): the number of mistakes algorithm A makes on a sequence S of examples for the target function f
– M_A(C) = max_{f ∈ C, S} M_A(f, S): the maximum possible number of mistakes made by A for any target function in C and any sequence S of examples
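For tiny, finite classes these quantities can be computed by brute force. A sketch under that assumption (the make_learner factory, the concept pool C, and the instance pool X are illustrative names):

```python
from itertools import product

def count_mistakes(make_learner, f, S):
    """M_A(f, S): mistakes the algorithm makes on sequence S for target f."""
    learner = make_learner()
    mistakes = 0
    for x in S:
        mistakes += int(learner.predict(x) != f(x))
        learner.update(x, f(x))        # the feedback is the true label f(x)
    return mistakes

def worst_case_mistakes(make_learner, C, X, length):
    """Approximate M_A(C) = max over f in C and sequences S, restricted
    here to sequences of a fixed length drawn from the instance pool X."""
    return max(count_mistakes(make_learner, f, S)
               for f in C
               for S in product(X, repeat=length))
```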
How do we quantify the success of learning? How well will the learner perform on previously unseen data?
Criticisms of the mistake bound model:
– Too simple
– Global behavior: it is not clear when the mistakes will be made

In defense of the model:
– Simple
– Many issues arise already in this setting
– It admits a generic conversion to other learning models (online-to-batch conversion)
The CON algorithm
In the i-th stage of the algorithm:
– C_i = all concepts in C consistent with all i − 1 previously seen examples
– Choose some f ∈ C_i at random and use it to predict the next example
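A sketch of CON for a finite concept class, with concepts represented as plain Python functions (an illustrative choice):

```python
import random

def con(C, stream):
    """Sketch of CON: predict with an arbitrary concept that is still
    consistent with everything seen so far.  C is a finite list of
    functions mapping an example to 0 or 1."""
    C_i = list(C)                            # stage 1: nothing seen yet
    mistakes = 0
    for x, y in stream:                      # y = f(x) is the feedback
        h = random.choice(C_i)               # any consistent concept
        mistakes += int(h(x) != y)
        C_i = [g for g in C_i if g(x) == y]  # drop inconsistent concepts
    return mistakes
```

Each mistake eliminates at least the concept that made the wrong prediction, so CON makes at most |C| − 1 mistakes; for interesting classes, though, |C| can be exponential in n.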
Is this a mistake bound algorithm? That depends on what C is. Can we do better than CON?
The optimal mistake bound: for the most difficult concept in the class and the most difficult sequence of examples, the optimal mistake bound algorithm is the one that makes the fewest mistakes.
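In the notation introduced earlier, this can be written as (the name Opt(C) is a common convention, not the deck's):

```latex
\mathrm{Opt}(C) \;=\; \min_{A}\, M_A(C) \;=\; \min_{A}\, \max_{f \in C,\; S}\, M_A(f, S)
```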
The Halving algorithm predicts with the majority vote of all concepts still consistent with the examples seen so far. Every mistake eliminates at least half of the remaining concepts, so it makes at most log_2 |C| mistakes. For k-conjunctions (or k-disjunctions) over n variables, |C| ≈ 2^k n^k, giving a mistake bound of O(k log n).
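A sketch of the Halving algorithm over a finite class, reusing the concepts-as-functions convention from the CON sketch:

```python
def halving(C, stream):
    """Sketch of the Halving algorithm: predict with the majority vote of
    the concepts still consistent with the data.  Every mistake wipes out
    at least half of them, so mistakes <= log2(|C|)."""
    C_i = list(C)
    mistakes = 0
    for x, y in stream:
        votes = sum(g(x) for g in C_i)               # concepts voting 1
        prediction = int(2 * votes >= len(C_i))      # majority (ties -> 1)
        mistakes += int(prediction != y)
        C_i = [g for g in C_i if g(x) == y]          # keep the consistent
    return mistakes
```

If the majority is wrong, at least half of C_i voted wrong and is eliminated, which is exactly where the log_2 |C| bound comes from.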
The Halving algorithm is not efficient: maintaining and voting over all consistent concepts can take exponential time. Elimination is an efficient algorithm that realizes the mistake bound of the Halving algorithm.
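A sketch of Elimination for monotone disjunctions, one standard instantiation (the restriction to the monotone case is an assumption made here for simplicity):

```python
def elimination(n, stream):
    """Sketch of Elimination for monotone disjunctions over n Boolean
    attributes: start with h = x_1 OR ... OR x_n and delete every
    variable that is on in a negative example."""
    active = set(range(n))                 # indices still in the disjunction
    mistakes = 0
    for x, y in stream:                    # x: 0/1 vector, y = f(x)
        prediction = int(any(x[i] for i in active))
        mistakes += int(prediction != y)
        if y == 0:                         # negative example: every on-bit
            active -= {i for i in active if x[i]}   # must leave the hypothesis
    return mistakes
```

Each mistake is a false positive that removes at least one variable, so there are at most n mistakes; since there are 2^n monotone disjunctions, this matches Halving's log_2 |C| = n bound while spending only O(n) time per example.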
Why can't we always be this efficient?
– Theorem [Haussler 1988]: Given a sample over n attributes that is consistent with a conjunctive concept, it is NP-hard to find a pure conjunctive hypothesis that is both consistent with the sample and has the minimum number of attributes. The same holds for disjunctions.
– Proof idea: given a collection of sets that cover X, define a set of examples so that learning the best (dis/con)junction implies finding a minimal cover.
The takeaway: there is a tradeoff between the mistake bound and computation time.