Active Learning and Optimized Information Gathering
Lecture 7: Learning Theory
CS 101.2
Andreas Krause
Announcements
Project proposal: Due tomorrow 1/27
Homework 1: Due Thursday 1/29 (any time that day is ok)
Office hours: Come to office hours before your presentation!
Andreas: Monday 3pm-4:30pm, 260 Jorgensen
Ryan: Wednesday 4:00-6:00pm, 109 Moore
Recap Bandit Problems
Bandit problems
Online optimization under limited feedback
Exploration-Exploitation dilemma
Algorithms with low regret: ε-greedy, UCB1 (sketch below)
Payoffs can be probabilistic or adversarial (oblivious / adaptive)
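As a refresher, here is a minimal UCB1 sketch in Python; the Bernoulli arms, their means, and the horizon are made-up illustration values, not from the lecture.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: after trying each arm once, always pull the arm maximizing
    empirical mean + sqrt(2 ln t / n_pulls)."""
    counts = [0] * n_arms    # number of pulls per arm
    sums = [0.0] * n_arms    # cumulative payoff per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1      # initialization: pull every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Made-up Bernoulli arms with unknown means:
means = [0.3, 0.5, 0.7]
pull = lambda a: 1.0 if random.random() < means[a] else 0.0
print(ucb1(pull, n_arms=3, horizon=10000))  # best arm (index 2) dominates
```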
More complex bandits
Bandits with many arms
Online linear optimization (online shortest paths, …)
X-armed bandits (Lipschitz mean payoff function)
Gaussian process optimization (Bayesian assumptions about mean payoffs)
Bandits with state
Contextual bandits
Reinforcement learning
Key tool: Optimism in the face of uncertainty ☺
Course outline
1. Online decision making
2. Statistical active learning
3. Combinatorial approaches
Spam or Ham?
Labels are expensive (need to ask an expert)
Which labels should we obtain to maximize classification accuracy?
[Figure: spam and ham emails as points in feature space (x1, x2), separated by a line]
label = sign(w0 + w1 x1 + w2 x2) (linear separator)
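To make the decision rule concrete, a tiny sketch of the separator above; the weights w0, w1, w2 are made-up values.

```python
# Made-up weights for the rule label = sign(w0 + w1*x1 + w2*x2)
w0, w1, w2 = -1.0, 0.8, 0.5

def classify(x1, x2):
    """Return +1 (spam) or -1 (ham) per the linear separator."""
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else -1

print(classify(2.0, 1.0))  # +1: on the spam side
print(classify(0.1, 0.2))  # -1: on the ham side
```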
Outline
Background in learning theory
Sample complexity
Key challenges
Heuristics for active learning
Principled algorithms for active learning
Credit scoring
[Table: credit scores with binary "Defaulted?" labels, plus one new score whose label is marked "???"]
Want decision rule that performs well for unseen examples (generalization)
More general: Concept learning
Set X of instances
True concept c: X → {0,1}
Hypothesis h: X → {0,1}
Hypothesis space H = {h1, …, hn, …}
Want to pick a good hypothesis (one that agrees with the true concept on most instances)
Example: Binary thresholds
Input domain: X = {1, 2, …, 100}
True concept c: c(x) = +1 if x ≥ t, c(x) = -1 if x < t
[Figure: number line from 1 to 100 with threshold t; instances below t are labeled -, instances above are labeled +]
How good is a hypothesis?
Set X of instances, concept c: X → {0,1}
Hypothesis h: X → {0,1}, H = {h1, …, hn, …}
Distribution PX over X
errortrue(h) = Px∼PX( h(x) ≠ c(x) )
Want h* = argminh∈H errortrue(h)
Can’t compute errortrue(h)!
Concept learning
Data set D = {(x1,y1),…,(xN,yN)}, xi ∈ X, yi ∈ {0,1}
Assume xi drawn independently from PX; yi = c(xi)
Also assume c ∈ H
h consistent with D ⟺ h(xi) = yi for all i
More data ⇒ fewer consistent hypotheses
Learning strategy (sketch below):
Collect “enough” data
Output consistent hypothesis h
Hope that errortrue(h) is small
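For the binary-threshold class from the earlier example, outputting a consistent hypothesis is straightforward; a minimal sketch with made-up data (the helper name consistent_threshold is hypothetical):

```python
def consistent_threshold(data):
    """Return t such that h(x) = 1 iff x >= t agrees with every
    labeled example (x, y), or None if no threshold is consistent."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    t = min(pos) if pos else max(neg, default=0) + 1
    if any(x >= t for x in neg):
        return None   # a 0-labeled point sits above every valid threshold
    return t

data = [(12, 0), (30, 0), (47, 1), (80, 1)]  # made-up labeled scores
print(consistent_threshold(data))  # 47 (any t in (30, 47] is consistent)
```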
Sample complexity
Let ε > 0
How many samples do we need s.t. all consistent hypotheses have errortrue < ε?
Def: h ∈ H is bad if errortrue(h) > ε
Suppose h ∈ H is bad. Let x ∼ PX, y = c(x). Then:
P( h(x) = y ) ≤ 1 - ε
Sample complexity
P( h bad and “survives” 1 data point ) ≤ 1 - ε
P( h bad and “survives” n data points ) ≤ (1 - ε)^n
P( remains ≥ 1 bad h after n data points ) ≤ |H| (1 - ε)^n ≤ |H| exp(-ε n)   (union bound)
Probability of bad hypothesis
Sample complexity for finite hypothesis spaces [Haussler ’88]
Theorem: Suppose
|H| < ∞,
Data set |D| = n drawn i.i.d. from PX (no noise),
0 < ε < 1
Then for any h ∈ H consistent with D:
P( errortrue(h) > ε ) ≤ |H| exp(-ε n)
“PAC-bound” (probably approximately correct)
How can we use this result?
P( errortrue(h) ≥ ε ) ≤ |H| exp(-ε n) = δ
Possibilities:
Given δ and n, solve for ε
Given ε and δ, solve for n (see the sketch below)
(Given ε and n, solve for δ)
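All three conversions just rearrange |H| exp(-ε n) = δ; a minimal sketch in Python (the function names are made up):

```python
import math

def eps_given_n(H_size, n, delta):
    """Given δ and n: ε = (ln|H| + ln(1/δ)) / n."""
    return (math.log(H_size) + math.log(1 / delta)) / n

def n_given_eps(H_size, eps, delta):
    """Given ε and δ: n ≥ (ln|H| + ln(1/δ)) / ε."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def delta_given_eps_n(H_size, eps, n):
    """Given ε and n: δ = |H| exp(-ε n)."""
    return H_size * math.exp(-eps * n)

# Illustration matching the credit-scoring example on the next slide:
print(n_given_eps(1000, eps=0.01, delta=0.001))  # -> 1382
```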
Example: Credit scoring
X = {1, 2, …, 1000}
H = binary thresholds on X, so |H| ≈ 1000 (one hypothesis per threshold)
Want error ≤ 0.01 with probability .999 (ε = 0.01, δ = 0.001)
Need n ≥ 1/ε ( log |H| + log 1/δ ) ≈ 1382 samples
Limitations
How do we find a consistent hypothesis?
What if |H| = ∞?
What if there’s noise in the data? (or c ∉ H)
Credit scoring
[Table: credit scores with conflicting binary “Defaulted?” labels, plus one new score marked “???”]
No binary threshold function explains this data with 0 error
Noisy data
Sets of instances X and labels Y = {0,1}
Suppose (X,Y) ∼ PXY
Hypothesis space H
errortrue(h) = Ex,y[ |h(x) − y| ]
Want to find argminh∈H errortrue(h)
Learning from noisy data
Suppose D = {(x1,y1),…,(xn,yn)} where (xi,yi) ∼ PX,Y
errortrain(h) = (1/n) ∑i |h(xi) − yi|
Learning strategy with noisy data (sketch below):
Collect “enough” data
Output h’ = argminh∈H errortrain(h)
Hope that errortrue(h’) ≈ minh∈H errortrue(h)
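For binary thresholds, the argmin over errortrain can be found by scanning candidate thresholds; a sketch with made-up noisy data (the helper name erm_threshold is hypothetical):

```python
def erm_threshold(data):
    """Empirical risk minimization over binary thresholds: return the
    t minimizing errortrain on (x, y) pairs with y in {0, 1}."""
    xs = sorted({x for x, _ in data})
    candidates = xs + [max(xs) + 1]  # includes the all-0 hypothesis
    def err_train(t):
        return sum(int(x >= t) != y for x, y in data) / len(data)
    return min(candidates, key=err_train)

# Made-up noisy data: no threshold achieves zero training error
data = [(36, 1), (44, 0), (48, 1), (52, 1), (70, 1), (81, 0)]
print(erm_threshold(data))  # threshold with the smallest training error
```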
Estimating error
How many samples do we need to accurately estimate the true error?
Data set D = {(x1,y1),…,(xn,yn)} where (xi,yi) ∼ PX,Y
zi = |h(xi) − yi| ∈ {0,1}
zi are i.i.d. samples from the Bernoulli RV Z = |h(X) − Y|
errortrain(h) = (1/n) ∑i zi (empirical mean of Z)
errortrue(h) = E[Z] (true mean of Z)
How many samples s.t. |errortrain(h) − errortrue(h)| is small?
Estimating error
How many samples do we need to accurately estimate the true error?
Applying the Chernoff-Hoeffding bound:
P( |errortrue(h) − errortrain(h)| ≥ ε ) ≤ 2 exp(-2n ε²)
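A quick Monte Carlo sanity check of this bound, with a made-up true error rate p = 0.3; the empirical deviation frequency should fall below the Hoeffding bound:

```python
import math
import random

random.seed(0)
p, n, eps, trials = 0.3, 500, 0.05, 20000  # made-up error rate p

deviations = 0
for _ in range(trials):
    # errortrain(h): mean of n i.i.d. Bernoulli(p) losses z_i
    err_train = sum(random.random() < p for _ in range(n)) / n
    deviations += abs(err_train - p) >= eps

print(deviations / trials)            # empirical frequency of deviation ≥ ε
print(2 * math.exp(-2 * n * eps**2))  # Hoeffding bound (~0.164)
```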
Sample complexity with noise
Call h ∈ H bad if errortrue(h) > errortrain(h) + ε
P( h bad “survives” n training examples ) ≤ exp(-2n ε²)
P( remains ≥ 1 bad h after n examples ) ≤ |H| exp(-2n ε²)   (union bound)
PAC Bound for noisy data
Theorem: Suppose
|H| < ∞,
Data set |D| = n drawn i.i.d. from PXY,
0 < ε < 1
Then, with probability at least 1-δ, for any h ∈ H it holds that
errortrue(h) ≤ errortrain(h) + √( (log |H| + log(1/δ)) / (2n) )
PAC Bounds: Noise vs. no noise
Want error ≤ ε with probability 1-δ
No noise: n ≥ 1/ε ( log |H| + log 1/δ )
Noise: n ≥ 1/ε² ( log |H| + log 1/δ )
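To see the gap numerically, a small sketch evaluating both bounds for an illustrative |H| = 1000 and δ = 0.001 (values chosen for illustration only):

```python
import math

def n_noiseless(H, eps, delta):
    # realizable case: n ≥ (1/ε)(ln|H| + ln(1/δ))
    return math.ceil((math.log(H) + math.log(1 / delta)) / eps)

def n_noisy(H, eps, delta):
    # agnostic case: n ≥ (1/ε²)(ln|H| + ln(1/δ))
    return math.ceil((math.log(H) + math.log(1 / delta)) / eps ** 2)

for eps in (0.1, 0.01):
    print(eps, n_noiseless(1000, eps, 0.001), n_noisy(1000, eps, 0.001))
# ε = 0.01: ~1.4k samples without noise vs ~140k with noise
```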
Limitations
How do we find a consistent hypothesis?
What if |H| = ∞?
What if there’s noise in the data? (or c ∉ H)
Credit scoring
[Table: real-valued credit scores (44.3141, 81.3321, …) with binary “Defaulted?” labels, plus one new score marked “???”]
Want to classify continuous instance space: |H| = ∞
Large hypothesis spaces
Idea: Labels of few data points imply labels of many unlabeled data points
How many points can be arbitrarily classified using binary thresholds?
How many points can be arbitrarily classified using linear separators? (1D)
How many points can be arbitrarily classified using linear separators? (2D)
VC dimension
Let S ⊆ X be a set of instances
A dichotomy is a partition of S into S = S1 ∪ S0
S is shattered by hypothesis space H if for every dichotomy there exists a consistent hypothesis h (i.e., h(x) = 1 if x ∈ S1 and h(x) = 0 if x ∈ S0)
The VC (Vapnik-Chervonenkis) dimension VC(H) of H is the size of the largest set S shattered by H (possibly ∞)
For finite H: VC(H) ≤ log2 |H|
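The definition can be checked by brute force on small classes; a sketch (the helper name shatters is hypothetical) confirming that binary thresholds shatter one point but not two, i.e. their VC dimension is 1:

```python
def shatters(points, hypotheses):
    """True iff the hypothesis set realizes every labeling of points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Binary thresholds on a small grid: h_t(x) = 1 iff x >= t
thresholds = [lambda x, t=t: int(x >= t) for t in range(0, 12)]
print(shatters([5], thresholds))     # True: one point is shattered
print(shatters([3, 7], thresholds))  # False: labeling 3->1, 7->0 impossible
# Hence VC(binary thresholds) = 1
```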
VC Generalization bound
Bound for finite hypothesis spaces:
errortrue(h) ≤ errortrain(h) + √( (log |H| + log(1/δ)) / (2n) )
VC-dimension based bound:
errortrue(h) ≤ errortrain(h) + √( (VC(H) (log(2n/VC(H)) + 1) + log(4/δ)) / n )
Applications
Allows proving generalization bounds for large hypothesis spaces with structure.
The VC dimension is known for many popular hypothesis classes:
Binary thresholds
Linear classifiers
Decision trees
Neural networks
Passive learning protocol
Data source PX,Y (produces inputs xi and labels yi)
Data set Dn = {(x1,y1),…,(xn,yn)}
Learner outputs hypothesis h
errortrue(h) = Ex,y[ |h(x) − y| ]
From passive to active learning
[Figure: spam and ham examples in feature space; points near the separator are more ambiguous]
Some labels are “more informative” than others
Statistical passive/active learning protocol
Data source PX (produces inputs xi)
Data set Dn = {(x1,y1),…,(xn,yn)}
Learner outputs hypothesis h
errortrue(h) = Ex∼P[ h(x) ≠ c(x) ]
Active learner assembles Dn by selectively obtaining labels
Passive learning
Input domain: D = [0,1]
True concept c: c(x) = +1 if x ≥ t, c(x) = -1 if x < t
Passive learning: Acquire all labels yi ∈ {+,-}
[Figure: interval from 0 to 1 with threshold t marked]
Active learning
Input domain: D = [0,1]
True concept c: c(x) = +1 if x ≥ t, c(x) = -1 if x < t
Passive learning: Acquire all labels yi ∈ {+,-}
Active learning: Decide which labels to obtain (see the binary-search sketch below)
[Figure: interval from 0 to 1 with threshold t marked]
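For this one-dimensional threshold class, the active strategy is exactly binary search on the interval; a minimal sketch assuming a label oracle and a made-up hidden threshold t = 0.73:

```python
def active_learn_threshold(label, eps):
    """Binary search for t in [0,1] to accuracy eps using
    O(log 1/eps) label queries; invariant: t in (lo, hi]."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if label(mid) == +1:   # mid >= t: move the right boundary left
            hi = mid
        else:                  # mid < t: move the left boundary right
            lo = mid
    return (lo + hi) / 2, queries

t = 0.73                                # made-up hidden threshold
label = lambda x: +1 if x >= t else -1  # the label oracle
print(active_learn_threshold(label, eps=0.001))  # (~0.73, 10 queries)
```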
Comparison
Active learning can exponentially reduce the number of required labels!
Labels needed to learn with classification error ε:
Passive learning: Ω(1/ε)
Active learning: O(log 1/ε)
Key challenges
The PAC bounds we’ve seen so far crucially depend on i.i.d. data!
Actively assembling the data set introduces sampling bias!
If we’re not careful, active learning can do worse!
What you need to know
Concepts, hypotheses
PAC bounds (probably approximately correct)
For the noiseless (“realizable”) case
For the noisy (“unrealizable”) case