

SLIDE 1

Active Learning and Optimized Information Gathering

Lecture 7 – Learning Theory
CS 101.2, Andreas Krause

SLIDE 2

Announcements

Project proposal: due tomorrow, 1/27
Homework 1: due Thursday, 1/29 (any time is OK)

Office hours (come to office hours before your presentation!):
Andreas: Monday 3:00pm-4:30pm, 260 Jorgensen
Ryan: Wednesday 4:00pm-6:00pm, 109 Moore

SLIDE 3

Recap: Bandit Problems

Bandit problems:
Online optimization under limited feedback
Exploration-exploitation dilemma
Algorithms with low regret: ε-greedy, UCB1 (see the sketch below)
Payoffs can be probabilistic or adversarial (oblivious / adaptive)
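Since this is a recap, a minimal sketch of UCB1's arm-selection rule may help; the bookkeeping variables (pull counts, empirical mean payoffs, current round t) are my naming, not the lecture's.

```python
import math

def ucb1_pick(counts, means, t):
    """UCB1 arm selection: play each arm once, then pick the arm
    maximizing (empirical mean) + sqrt(2 ln t / n_i), i.e. be
    optimistic in the face of uncertainty."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # every arm gets one initial pull
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
```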

SLIDE 4

More complex bandits

Bandits with many arms:
Online linear optimization (online shortest paths, …)
X-armed bandits (Lipschitz mean payoff function)
Gaussian process optimization (Bayesian assumptions about mean payoffs)

Bandits with state:
Contextual bandits
Reinforcement learning

Key tool: optimism in the face of uncertainty ☺

SLIDE 5

Course outline

1. Online decision making
2. Statistical active learning
3. Combinatorial approaches

SLIDE 6

Spam or Ham?

Labels are expensive (we need to ask an expert). Which labels should we obtain to maximize classification accuracy?

[Figure: emails plotted by features x1 and x2, labeled Spam or Ham; the classifier is label = sign(w0 + w1 x1 + w2 x2), a linear separator]
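To make the separator concrete, here is a minimal sketch; the weights w0, w1, w2 are assumed given (in practice they would be learned from labeled emails).

```python
def classify(x1, x2, w0, w1, w2):
    """Linear separator from the slide: label = sign(w0 + w1*x1 + w2*x2).
    Returns +1 (e.g. 'spam') or -1 (e.g. 'ham')."""
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else -1
```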

SLIDE 7

Outline

Background in learning theory
Sample complexity
Key challenges
Heuristics for active learning
Principled algorithms for active learning

SLIDE 8

Credit scoring

[Table: "Credit score" vs. "Defaulted?": a handful of scores (36, 42, 50, 70, 82) with 0/1 default labels; the label for a new score is unknown (???)]

Want a decision rule that performs well on unseen examples (generalization).

SLIDE 9

More general: Concept learning

Set X of instances
True concept c: X → {0,1}
Hypothesis h: X → {0,1}
Hypothesis space H = {h1, …, hn, …}
Want to pick a good hypothesis (one that agrees with the true concept on most instances)

SLIDE 10

Example: Binary thresholds

Input domain: X = {1, 2, …, 100}
True concept c:
c(x) = +1 if x ≥ t
c(x) = −1 if x < t

[Figure: number line from 1 to 100; instances below threshold t labeled −, instances at or above t labeled +]

SLIDE 11

How good is a hypothesis?

Set X of instances, concept c: X → {0,1}
Hypothesis h: X → {0,1}, H = {h1, …, hn, …}
Distribution PX over X
errortrue(h) = Px∼PX( h(x) ≠ c(x) )
Want h* = argminh∈H errortrue(h)
Can't compute errortrue(h)!

SLIDE 12

Concept learning

Data set D = {(x1,y1), …, (xN,yN)}, xi ∈ X, yi ∈ {0,1}
Assume the xi are drawn independently from PX and yi = c(xi)
Also assume c ∈ H
h is consistent with D if h(xi) = yi for all i
More data ⇒ fewer consistent hypotheses

Learning strategy (a concrete sketch follows below):
Collect "enough" data
Output a consistent hypothesis h
Hope that errortrue(h) is small
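For the binary-threshold class from the earlier example, the "output a consistent hypothesis" step can be made concrete. A minimal sketch, assuming labels in {+1, −1} and hypotheses h_t(x) = +1 iff x ≥ t:

```python
def consistent_thresholds(data, thresholds):
    """Return every threshold t whose hypothesis h_t(x) = +1 iff x >= t
    agrees with all labeled examples; more data leaves fewer survivors."""
    def h(t, x):
        return 1 if x >= t else -1
    return [t for t in thresholds if all(h(t, x) == y for x, y in data)]

# e.g. consistent_thresholds([(20, -1), (70, 1)], range(1, 101))
# returns every t in {21, ..., 70}
```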

SLIDE 13

Sample complexity

Let ε > 0. How many samples do we need so that all consistent hypotheses have error < ε?
Def: h ∈ H is bad if errortrue(h) > ε.
Suppose h ∈ H is bad. Let x ∼ PX, y = c(x). Then:
P( h(x) = y ) ≤ 1 − ε

SLIDE 14

Sample complexity

P( h bad and "survives" 1 data point ) ≤ 1 − ε
P( h bad and "survives" n data points ) ≤ (1 − ε)^n
P( ≥ 1 bad h remains after n data points ) ≤ |H| (1 − ε)^n ≤ |H| exp(−ε n)
(union bound over at most |H| bad hypotheses, and 1 − ε ≤ exp(−ε))

SLIDE 15

Probability of bad hypothesis

SLIDE 16

Sample complexity for finite hypothesis spaces [Haussler ’88]

Theorem: Suppose
|H| < ∞,
the data set D with |D| = n is drawn i.i.d. from PX (no noise),
0 < ε < 1.

Then for any h ∈ H consistent with D:
P( errortrue(h) > ε ) ≤ |H| exp(−ε n)

"PAC bound" (probably approximately correct)

SLIDE 17

How can we use this result?

P( errortrue(h) ≥ ε ) ≤ |H| exp(−ε n) = δ

Possibilities (a sketch of all three follows below):
Given δ and n, solve for ε
Given ε and δ, solve for n
(Given ε and n, solve for δ)
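All three uses are one-liners. A minimal sketch (the function names are mine, not from the lecture):

```python
import math

def n_needed(eps, delta, H_size):
    """Samples so that |H| exp(-eps n) <= delta:
    n >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def eps_guaranteed(n, delta, H_size):
    """Error bound achieved with confidence 1 - delta after n samples:
    eps = (ln|H| + ln(1/delta)) / n."""
    return (math.log(H_size) + math.log(1 / delta)) / n

def delta_achieved(n, eps, H_size):
    """Failure probability for given n and eps: delta = |H| exp(-eps n)."""
    return H_size * math.exp(-eps * n)
```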

SLIDE 18

Example: Credit scoring

X = {1, 2, …, 1000}
H = binary thresholds on X, so |H| = 1001 (one hypothesis per threshold position, counting the all-negative one)
Want error ≤ 0.01 with probability .999
Need n ≥ 1382 samples
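A quick self-contained check reproduces the slide's figure, assuming |H| = 1001:

```python
import math

# eps = 0.01 (error), delta = 0.001 (failure probability), |H| = 1001
n = math.ceil((math.log(1001) + math.log(1 / 0.001)) / 0.01)
print(n)  # -> 1382
```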

SLIDE 19

Limitations

How do we find a consistent hypothesis?
What if |H| = ∞?
What if there's noise in the data (or c ∉ H)?

SLIDE 20

Credit scoring

[Table: "Credit score" vs. "Defaulted?": scores 36, 44, 48, 52, 70, 81 with interleaved 0/1 labels; the label for a new score is unknown (???)]

No binary threshold function explains this data with 0 error.

SLIDE 21

Noisy data

Sets of instances X and labels Y = {0,1}
Suppose (X,Y) ∼ PXY
Hypothesis space H
errortrue(h) = Ex,y[ |h(x) − y| ]
Want to find argminh∈H errortrue(h)

SLIDE 22

Learning from noisy data

Suppose D = {(x1,y1), …, (xn,yn)} where (xi,yi) ∼ PXY
errortrain(h) = (1/n) ∑i |h(xi) − yi|

Learning strategy with noisy data (a sketch follows below):
Collect "enough" data
Output h' = argminh∈H errortrain(h)
Hope that errortrue(h') ≈ minh∈H errortrue(h)
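For a small hypothesis class the argmin can be taken by brute force. A minimal sketch for binary thresholds with labels in {0,1}, as on the previous slide:

```python
def erm_threshold(data, thresholds):
    """Empirical risk minimization: return the threshold t minimizing
    errortrain(h_t) = (1/n) * sum_i |h_t(x_i) - y_i|, where
    h_t(x) = 1 iff x >= t and labels y_i are in {0, 1}."""
    def train_error(t):
        return sum(abs((1 if x >= t else 0) - y) for x, y in data) / len(data)
    return min(thresholds, key=train_error)
```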

SLIDE 23

Estimating error

How many samples do we need to accurately estimate the true error?
Data set D = {(x1,y1), …, (xn,yn)} where (xi,yi) ∼ PXY
zi = |h(xi) − yi| ∈ {0,1}
The zi are i.i.d. samples of the Bernoulli RV Z = |h(X) − Y|
errortrain(h) = (1/n) ∑i zi
errortrue(h) = E[Z]
How many samples s.t. |errortrain(h) − errortrue(h)| is small?

SLIDE 24

Estimating error

How many samples do we need to accurately estimate the true error? Applying the Chernoff-Hoeffding bound (the two-sided version carries a factor of 2):
P( |errortrue(h) − errortrain(h)| ≥ ε ) ≤ 2 exp(−2 n ε²)
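A Monte Carlo check (my construction, not from the slides) makes the bound tangible: for a fixed h, the per-example mistakes zi are i.i.d. Bernoulli(errortrue(h)), so the deviation probability can be simulated directly.

```python
import math
import random

def deviation_prob(err_true, n, eps, trials=100_000):
    """Estimate P(|errortrain - errortrue| >= eps) by simulating n i.i.d.
    Bernoulli(err_true) per-example mistakes z_i, as on the previous slide."""
    hits = 0
    for _ in range(trials):
        err_train = sum(random.random() < err_true for _ in range(n)) / n
        if abs(err_train - err_true) >= eps:
            hits += 1
    return hits / trials

print(deviation_prob(0.3, 200, 0.1))      # empirical deviation probability
print(2 * math.exp(-2 * 200 * 0.1 ** 2))  # Hoeffding bound, about 0.037
```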

SLIDE 25

Sample complexity with noise

Call h ∈ H bad if errortrue(h) > errortrain(h) + ε
P( a bad h "survives" n training examples ) ≤ exp(−2 n ε²)
P( ≥ 1 bad h remains after n examples ) ≤ |H| exp(−2 n ε²)   (union bound)

SLIDE 26

PAC Bound for noisy data

Theorem: Suppose
|H| < ∞,
the data set D with |D| = n is drawn i.i.d. from PXY,
0 < ε < 1.

Then, with probability at least 1 − δ, for any h ∈ H it holds that
errortrue(h) ≤ errortrain(h) + √( ln(|H|/δ) / (2n) )

SLIDE 27

PAC Bounds: Noise vs. no noise

Want error ≤ ε with probability 1 − δ.
No noise: n ≥ 1/ε ( log |H| + log 1/δ )
Noise: n ≥ 1/ε² ( log |H| + log 1/δ )
(a numerical comparison follows below)
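The 1/ε versus 1/ε² gap is dramatic for small ε. A minimal comparison sketch; note it keeps the constant 1/2 that the Hoeffding-based derivation gives and that the slide drops:

```python
import math

def n_noiseless(eps, delta, H_size):
    # n >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def n_noisy(eps, delta, H_size):
    # n >= (1/(2 eps^2)) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

# eps = 0.01, delta = 0.001, |H| = 1000:
# n_noiseless -> 1382 samples, n_noisy -> 69078 samples (a 50x blow-up)
```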

SLIDE 28

Limitations

How do we find a consistent hypothesis?
What if |H| = ∞?
What if there's noise in the data (or c ∉ H)?

SLIDE 29

Credit scoring

[Table: "Credit score" vs. "Defaulted?" with continuous scores (36.1200, 44.3141, 48.7983, 52.3847, 70.1111, 81.3321) and 0/1 labels; the label for a new score is unknown (???)]

Want to classify a continuous instance space: |H| = ∞

SLIDE 30

Large hypothesis spaces

Idea: the labels of a few data points imply the labels of many unlabeled data points.

SLIDE 31

How many points can be arbitrarily classified using binary thresholds? (Answer: 1)

SLIDE 32

How many points can be arbitrarily classified using linear separators in 1D? (Answer: 2)

SLIDE 33

How many points can be arbitrarily classified using linear separators in 2D? (Answer: 3, for points in general position)

SLIDE 34

VC dimension

Let S ⊆ X be a set of instances.
A dichotomy is a nontrivial partition of S = S1 ∪ S0.
S is shattered by hypothesis space H if for every dichotomy there exists a consistent hypothesis h (i.e., h(x) = 1 if x ∈ S1 and h(x) = 0 if x ∈ S0).
The VC (Vapnik-Chervonenkis) dimension VC(H) of H is the size of the largest set S shattered by H (possibly ∞).
VC(H) ≤ log2 |H|, since shattering d points requires at least 2^d distinct hypotheses. (A brute-force shattering check follows below.)
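For small cases, shattering can be checked by brute force. A minimal sketch (my code) confirming that binary thresholds shatter one point but not two, i.e. VC = 1, consistent with VC(H) ≤ log2 |H|:

```python
from itertools import product

def shatters(points, hypotheses):
    """True iff every +/-1 labeling of the points is realized by some h."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(lab in realized for lab in product([1, -1], repeat=len(points)))

# Binary thresholds h_t(x) = +1 iff x >= t, with t in {1, ..., 101}:
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(1, 102)]
print(shatters([50], thresholds))      # True
print(shatters([30, 60], thresholds))  # False: no h labels 30 as + and 60 as -
```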

SLIDE 35

VC Generalization bound

Bound for finite hypothesis spaces:
errortrue(h) ≤ errortrain(h) + √( ln(|H|/δ) / (2n) )

VC-dimension based bound:
errortrue(h) ≤ errortrain(h) + √( ( VC(H) (ln(2n/VC(H)) + 1) + ln(4/δ) ) / n )

SLIDE 36

Applications

The VC dimension lets us prove generalization bounds for large hypothesis spaces with structure. For many popular hypothesis classes, the VC dimension is known:

Binary thresholds
Linear classifiers
Decision trees
Neural networks

SLIDE 37

Passive learning protocol

Data source PXY (produces inputs xi and labels yi)
Data set Dn = {(x1,y1), …, (xn,yn)}
Learner outputs hypothesis h
errortrue(h) = Ex,y[ |h(x) − y| ]

SLIDE 38

From passive to active learning

[Figure: emails plotted by features x1 and x2, labeled Spam or Ham]

Some labels are "more informative" than others.

SLIDE 39

Statistical passive/active learning protocol

Data source PX (produces inputs xi)
Data set Dn = {(x1,y1), …, (xn,yn)}
Learner outputs hypothesis h
errortrue(h) = Ex∼PX[ h(x) ≠ c(x) ]
An active learner assembles Dn by selectively obtaining labels

SLIDE 40

Passive learning

Input domain: X = [0,1]
True concept c:
c(x) = +1 if x ≥ t
c(x) = −1 if x < t
Passive learning: acquire all labels yi ∈ {+,−}

[Figure: interval [0,1] with unknown threshold t]

SLIDE 41

Active learning

Input domain: X = [0,1]
True concept c:
c(x) = +1 if x ≥ t
c(x) = −1 if x < t
Passive learning: acquire all labels yi ∈ {+,−}
Active learning: decide which labels to obtain (a binary-search sketch follows below)

[Figure: interval [0,1] with unknown threshold t]
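For this threshold class the active learner is just binary search: every queried label halves the interval that can still contain t. A minimal sketch, where query_label is an assumed labeling oracle returning c(x):

```python
def active_learn_threshold(query_label, lo=0.0, hi=1.0, eps=0.01):
    """Locate the threshold t in [0,1] to precision eps using only
    O(log 1/eps) label queries (vs. Omega(1/eps) labels passively)."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if query_label(mid) == 1:   # c(mid) = +1 means t <= mid
            hi = mid
        else:                       # c(mid) = -1 means t > mid
            lo = mid
    return (lo + hi) / 2

# e.g. active_learn_threshold(lambda x: 1 if x >= 0.37 else -1)
# returns about 0.37 after only ceil(log2(1/0.01)) = 7 queries
```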

SLIDE 42

Comparison

Active learning can exponentially reduce the number of required labels!

Labels needed to learn with classification error ε:
Passive learning: Ω(1/ε)
Active learning: O(log 1/ε)

SLIDE 43

Key challenges

The PAC bounds we've seen so far crucially depend on i.i.d. data! Actively assembling the data set introduces bias!

If we're not careful, active learning can do worse than passive learning!

SLIDE 44

What you need to know

Concepts, hypotheses
PAC bounds (probably approximately correct):
for the noiseless ("realizable") case
for the noisy ("unrealizable") case
VC dimension
Active learning protocol