CS 6316 Machine Learning
Introduction to Learning Theory
Yangfeng Ji
Department of Computer Science University of Virginia
Overview
1. A Toy Example
2. A Formal Model
3. Empirical Risk Minimization
4. Finite Hypothesis Classes
5. PAC
Figure: Training examples in the (x1, x2) feature space (instances marked with +).
◮ A triangle
◮ A rectangle
◮ A circle
Figure: Visualizing the Circle hypothesis class, a circle in the (x1, x2) plane parameterized by a center and a radius.
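A circle hypothesis labels a point positive when it falls inside the circle and negative otherwise. A minimal sketch in Python (the function name and parameterization below are my own, not from the slides):

```python
import numpy as np

def circle_hypothesis(center, radius):
    """Return a classifier h that predicts +1 inside the circle and -1 outside.

    Hypothetical parameterization: `center` is a point (c1, c2) in the (x1, x2)
    plane and `radius` is a nonnegative scalar.
    """
    center = np.asarray(center, dtype=float)

    def h(x):
        x = np.atleast_2d(x)                        # accept a single point or a batch
        dist = np.linalg.norm(x - center, axis=1)   # Euclidean distance to the center
        return np.where(dist <= radius, 1, -1)

    return h

# Example: a circle centered at (1.0, 1.0) with radius 0.5
h = circle_hypothesis(center=[1.0, 1.0], radius=0.5)
print(h([[1.1, 1.2], [3.0, 3.0]]))   # -> [ 1 -1]
```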
¹ A precise definition will be provided later in this lecture.
◮ Assume we know the underlying data distribution
◮ Assume at least one of the classifiers we pick will completely solve the problem
Figure: 1K examples generated with the previous process.
The true risk of a hypothesis $h$ is $L_{\mathcal{D},f}(h) = \mathbb{P}_{x \sim \mathcal{D}}[h(x) \neq f(x)]$, where $x$ is an instance drawn from the distribution $\mathcal{D}$
◮ $h(x) \neq f(x)$: the prediction from hypothesis $h$ does not match the labeling function output
◮ $L_{\mathcal{D},f}(h)$: the error of $h$ is measured with respect to $\mathcal{D}$ and $f$
$p(y=+1) = p(y=-1) = \frac{1}{2}, \quad p(x \mid y=+1) = \mathcal{N}(x; 2, 1), \quad p(x \mid y=-1) = \mathcal{N}(x; -2, 1)$ (4)
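A minimal sketch of sampling labeled examples from the distribution in Eq. (4); the helper name and its interface are assumptions made for illustration:

```python
import numpy as np

def sample_data(m, seed=None):
    """Draw m labeled examples following Eq. (4):
    p(y=+1) = p(y=-1) = 1/2,  p(x | y=+1) = N(x; 2, 1),  p(x | y=-1) = N(x; -2, 1)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=m)          # uniform class prior
    x = rng.normal(loc=2.0 * y, scale=1.0)   # class-conditional Gaussians with unit variance
    return x, y

x, y = sample_data(1000, seed=0)             # e.g., the 1K examples shown in the figures
```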
Figure: Class-conditional densities $p(x \mid y=+1)$ and $p(x \mid y=-1)$ over $x \in [-6, 6]$.
Figure: 1K examples generated with the previous process.
$h_S \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h)$
² Labeling function, in the context of our discussion.
³ Please refer to the demo code for more examples.
$\mathcal{H}_f = \{h_i : i \in [10]\}$, where each $h_i$ predicts $1$ if $\frac{i}{10} \le x \le 1$ and $0$ otherwise.
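Reading the class as ten interval hypotheses on [0, 1] (an assumption based on the fragmentary slide), a sketch could be:

```python
def make_hypothesis(i):
    """h_i predicts 1 when i/10 <= x <= 1 and 0 otherwise (assumed reading of the slide)."""
    def h(x):
        return 1 if i / 10 <= x <= 1 else 0
    return h

# The finite class H_f = {h_i : i in [10]}
H = {i: make_hypothesis(i) for i in range(1, 11)}
print(H[3](0.5), H[7](0.5))   # -> 1 0
```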
◮ try the hypotheses one by one and find the best (see the sketch below)
◮ time complexity: $O(|\mathcal{H}_f|)$
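A minimal sketch of this exhaustive search, assuming the training set S is a list of (x, y) pairs and the hypotheses are callables such as the class H sketched above:

```python
def empirical_risk(h, S):
    """L_S(h): the fraction of training examples in S = [(x, y), ...] that h misclassifies."""
    return sum(h(x) != y for x, y in S) / len(S)

def erm_finite(hypotheses, S):
    """Try the hypotheses one by one and keep the one with the smallest empirical
    risk: O(|H_f|) risk evaluations over the training set."""
    return min(hypotheses, key=lambda h: empirical_risk(h, S))

# Usage with the class sketched above (S is a hypothetical training set):
# h_S = erm_finite(H.values(), S)
```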
has both positive and negative instances
⁴ Run the demo code about ten times; you may be able to see this happen once.
◮ Missing critical information about the data distribution $\mathcal{D}$
◮ $L_S(h_S) = 0$, but $L_D(h_S) > 0$ (see the sketch below)
◮ The realizability assumption is unrealistic for practical applications
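As an illustration of the second bullet (not an example from the slides), a hypothesis that simply memorizes the training set drives the empirical risk to zero while saying nothing about fresh samples from D:

```python
def memorizer(S):
    """Return a hypothesis that repeats the stored label for any x seen in the
    training set S and predicts -1 everywhere else (illustrative only)."""
    table = dict(S)                 # memorize every (x, y) pair

    def h(x):
        return table.get(x, -1)

    return h

# By construction L_S(h_S) = 0, yet on a continuous distribution D a fresh sample
# almost never hits the lookup table, so L_D(h_S) can be large.
```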
$p(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1}(1 - x)^{\beta-1}$ (16)
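A quick check of Eq. (16) using scipy's Beta distribution (the parameter values below are only examples):

```python
import numpy as np
from scipy.stats import beta

# Eq. (16): p(x; alpha, beta) = x**(alpha - 1) * (1 - x)**(beta - 1) / B(alpha, beta)
a, b = 4.0, 1.0
xs = np.linspace(0.0, 1.0, 5)
print(beta.pdf(xs, a, b))        # density values on [0, 1]
print(beta.rvs(a, b, size=3))    # a few random draws from Beta(4, 1)
```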
◮ Because the training set is randomly generated, which can be non-representative
◮ Because we can only have a finite number of training examples, even though the training set is representative
◮ Revise $\mathcal{D}$ as a joint distribution over $\mathcal{X} \times \mathcal{Y}$
◮ Revise the true risk of a prediction rule $h$ to be $L_{\mathcal{D}}(h) = \mathbb{P}_{(x,y)\sim\mathcal{D}}[h(x) \neq y]$ (27)
◮ The empirical risk remains the same: $L_S(h) = \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}$ (28) (see the sketch below)
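A minimal sketch of Eqs. (27) and (28); `sample_from_D` is a hypothetical helper that draws labeled pairs from the joint distribution:

```python
import numpy as np

def empirical_risk(h, xs, ys):
    """Eq. (28): L_S(h) = |{i in [m] : h(x_i) != y_i}| / m."""
    return np.mean(h(xs) != ys)

def true_risk_mc(h, sample_from_D, n=100_000, seed=None):
    """Monte Carlo estimate of Eq. (27), L_D(h) = P_{(x,y)~D}[h(x) != y].
    `sample_from_D(n, seed)` is assumed to return n labeled pairs drawn from D."""
    xs, ys = sample_from_D(n, seed)
    return np.mean(h(xs) != ys)
```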
$L_{\mathcal{D}}(h_S) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$

Under the realizability assumption, $\min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') = 0$, so the guarantee reduces to $L_{\mathcal{D}}(h_S) \le \epsilon$.
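The PAC guarantee is a statement about the random draw of the training set S: with probability at least 1 − δ over S, the ERM output satisfies the inequality above. A rough simulation of that probability, built from hypothetical helpers in the spirit of the earlier sketches:

```python
import numpy as np

def pac_success_rate(erm, sample_from_D, true_risk, best_risk, m, eps, trials=1000, seed=0):
    """Estimate P[ L_D(h_S) <= min_{h' in H} L_D(h') + eps ] over random training sets
    of size m.  All callables are assumed helpers, not APIs from the slides."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        S = sample_from_D(m, rng)            # draw a fresh training set S ~ D^m
        h_S = erm(S)                         # run ERM on S
        hits += true_risk(h_S) <= best_risk + eps
    return hits / trials                     # should be at least 1 - delta for large enough m
```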
$\mathcal{D} = \frac{1}{2}B(x; 4, 1) + \frac{1}{2}B(x; 1, 4)$ (34)
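A minimal sketch of sampling from the mixture in Eq. (34); the helper name is an assumption:

```python
import numpy as np

def sample_mixture(m, seed=None):
    """Draw m points from Eq. (34): D = 1/2 * Beta(x; 4, 1) + 1/2 * Beta(x; 1, 4)."""
    rng = np.random.default_rng(seed)
    z = rng.random(m) < 0.5            # pick a mixture component for each point
    a = np.where(z, 4.0, 1.0)          # Beta(4, 1) when z is True
    b = np.where(z, 1.0, 4.0)          # Beta(1, 4) otherwise
    return rng.beta(a, b)

print(sample_mixture(5, seed=0))
```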