Active Learning and Optimized Information Gathering
Lecture 8: Active Learning
CS 101.2, Andreas Krause
Announcements
Homework 1: Due today
Office hours: come to office hours before your presentation!
Andreas: Monday 3:00pm-4:30pm, 260 Jorgensen
Ryan: Wednesday 4:00pm-6:00pm, 109 Moore
Outline
Background in learning theory
Sample complexity
Key challenges
Heuristics for active learning
Principled algorithms for active learning
Spam or Ham?
Labels are expensive (need to ask an expert). Which labels should we obtain to maximize classification accuracy?
[Figure: emails plotted by features x1 and x2; Spam and Ham classes separated by a linear separator, label = sign(w0 + w1 x1 + w2 x2)]
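As an illustrative aside (not from the slides), a minimal sketch of the linear rule above; the weights w0, w1, w2 are hypothetical:

```python
import numpy as np

# Hypothetical weights for the rule label = sign(w0 + w1*x1 + w2*x2);
# here +1 stands for Spam and -1 for Ham (illustrative only).
w0, w1, w2 = -1.0, 2.0, 0.5

def classify(x1, x2):
    return np.sign(w0 + w1 * x1 + w2 * x2)

print(classify(1.0, 0.5))   # +1 -> Spam
print(classify(0.0, 0.5))   # -1 -> Ham
```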
Recap: Concept learning
Set X of instances, with distribution P_X
True concept c: X → {0,1}
Data set D = {(x1,y1),…,(xn,yn)}, xi ~ P_X, yi = c(xi)
Hypothesis h: X → {0,1} from H = {h1, …, hn, …}
Assume c ∈ H (c is also called the “target hypothesis”)
error_true(h) = E_{x~P_X} |c(x) − h(x)|
error_train(h) = (1/n) ∑_i |c(xi) − h(xi)|
If n is large enough, error_true(h) ≈ error_train(h) for all h
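To make the error definitions concrete, a minimal sketch estimating error_train for a hypothetical threshold hypothesis; the concept c and distribution P_X are illustrative stand-ins:

```python
import numpy as np

# Hypothetical threshold hypothesis h_t(x) = 1 if x >= t, else 0.
def h(x, t=0.5):
    return (x >= t).astype(int)

rng = np.random.default_rng(0)
xs = rng.random(1000)          # x_i ~ P_X (here: uniform on [0,1])
ys = h(xs, t=0.3)              # labels from the true concept c = h_{0.3}

# error_train(h) = (1/n) * sum_i |c(x_i) - h(x_i)|
error_train = np.mean(np.abs(ys - h(xs, t=0.5)))
print(error_train)             # ≈ error_true = P_X([0.3, 0.5)) = 0.2 for large n
```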
Recap: PAC Bounds
How many samples n do we need to get error ≤ ε with probability 1−δ?
No noise: n ≥ 1/ε ( log |H| + log 1/δ )
Noise: n ≥ 1/ε² ( log |H| + log 1/δ )
Requires that the data is i.i.d.!
Today: mainly the no-noise case (more next week)
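A minimal sketch plugging numbers into these bounds; natural logarithms and the stated constants are assumed:

```python
import math

def pac_sample_bound(H_size, eps, delta, noise=False):
    """Samples sufficient for error <= eps w.p. 1 - delta.

    Uses n >= (1/eps) (log|H| + log(1/delta)); with noisy labels
    the 1/eps factor becomes 1/eps^2.
    """
    factor = 1.0 / eps**2 if noise else 1.0 / eps
    return math.ceil(factor * (math.log(H_size) + math.log(1.0 / delta)))

# e.g. |H| = 2**20 hypotheses, eps = 0.1, delta = 0.05:
print(pac_sample_bound(2**20, 0.1, 0.05))        # no noise
print(pac_sample_bound(2**20, 0.1, 0.05, True))  # noisy labels
```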
Statistical passive/active learning protocol
Data source P_X (produces inputs xi) → data set Dn = {(x1,y1),…,(xn,yn)}
Learner outputs hypothesis h
error_true(h) = E_{x~P_X}[h(x) ≠ c(x)]
An active learner assembles the data set by selectively obtaining labels
The data set is NOT sampled i.i.d.!
Example: Uncertainty sampling
Budget of m labels
Draw n unlabeled examples
Repeat until we’ve picked m labels:
Assign each unlabeled example an “uncertainty score”
Greedily pick the most uncertain example
One of the most commonly used classes of heuristics!
Uncertainty sampling for linear separators
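The figure for this slide is not reproduced here; as a stand-in, a minimal one-round sketch of margin-based uncertainty sampling for a hypothetical linear separator, taking distance to the decision boundary as the uncertainty score. Real uncertainty sampling would retrain after each queried label:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))          # unlabeled pool
w, w0 = np.array([1.0, -2.0]), 0.5     # current separator (hypothetical)

def margin_uncertainty(X, w, w0):
    # Smaller distance to the hyperplane w.x + w0 = 0 => more uncertain.
    return -np.abs(X @ w + w0) / np.linalg.norm(w)

m = 10
scores = margin_uncertainty(X, w, w0)
query_idx = np.argsort(scores)[-m:]    # m most uncertain points to label
print(query_idx)
```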
Active learning bias
If we can pick at most m = n/2 labels, then with overwhelmingly high probability, uncertainty sampling picks points such that there remains a hypothesis with error > 0.1!
With standard passive learning, error → 0 as n → ∞
Wish list for active learning
Minimum requirement
Consistency: Generalization error should go to 0 asymptotically
We’d like more than that:
Fallback guarantee: Convergence rate of error of active learning “at least as good” as passive learning
What we’re really after
Rate improvement: Error of active learning decreases much faster than for passive learning
From passive to active
Passive PAC learning:
1. Collect a data set D of n ≥ 1/ε ( log |H| + log 1/δ ) data points and their labels i.i.d. from P_X
2. Output a consistent hypothesis h
3. With probability at least 1−δ, error_true(h) ≤ ε
Key idea:
Sample n unlabeled data points DX = {x1,…,xn} i.i.d.
Actively query labels until all hypotheses consistent with these labels agree on the labels of all unlabeled data (see the sketch below for binary thresholds)
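A minimal sketch of this key idea for binary thresholds; the oracle label() and the true threshold are hypothetical, and we assume the threshold lies strictly inside the pool's range:

```python
import numpy as np

# Binary thresholds h_t(x) = 1{x >= t}; label() stands in for the expert.
rng = np.random.default_rng(2)
xs = np.sort(rng.random(1000))          # unlabeled pool DX, sorted
label = lambda x: int(x >= 0.37)        # expensive oracle, t* = 0.37

# Assuming label(xs[0]) == 0 and label(xs[-1]) == 1, each query halves the
# region where consistent hypotheses still disagree, so O(log n) labels
# pin down the labels of all n points.
lo, hi, n_queries = 0, len(xs) - 1, 0
while lo + 1 < hi:
    mid = (lo + hi) // 2
    n_queries += 1
    if label(xs[mid]) == 1:
        hi = mid                        # threshold is at or left of mid
    else:
        lo = mid                        # threshold is right of mid
print(n_queries)                        # ~log2(1000) ≈ 10, not 1000
```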
Why might this work?
Formalization: “Relevant” hypothesis
Data set D = {(x1,y1),…,(xn,yn)}, hypothesis space H
Input data: DX = {x1,…,xn}
Relevant hypotheses: H’(DX) = H’ = restriction of H to DX
Formally: H’ = {h’: DX → {0,1} | ∃ h ∈ H s.t. ∀ x ∈ DX: h’(x) = h(x)}
Example: Threshold functions
For thresholds h_t(x) = 1 iff x ≥ t, H is infinite, but on n (sorted) inputs H’ contains only n+1 distinct labelings.
Version space
Input data DX = {x1,…,xn}
Partially labeled: have L = {(xi1,yi1),…,(xim,yim)}
The (relevant) version space is the set of all relevant hypotheses consistent with the labels L
Formally: V(DX, L) = V = {h’ ∈ H’(DX): h’(xij) = yij for 1 ≤ j ≤ m}
Why useful? Partial labels L imply all remaining labels for DX ⟺ |V| = 1
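A minimal sketch of computing V(DX, L) for a finite relevant hypothesis class, with hypotheses encoded as label tuples over DX; the encoding is illustrative, not from the slides:

```python
def version_space(H_prime, L):
    """V(DX, L): relevant hypotheses consistent with the partial labels L.

    H_prime: list of labelings, each a tuple over the pool DX.
    L: dict mapping pool index -> observed label.
    """
    return [h for h in H_prime
            if all(h[i] == y for i, y in L.items())]

# Hypothetical pool of 4 points with H' = the n+1 = 5 threshold labelings.
H_prime = [tuple(int(j >= k) for j in range(4)) for k in range(5)]
print(version_space(H_prime, {1: 0, 2: 1}))  # -> [(0, 0, 1, 1)], so |V| = 1
```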
Example: Binary thresholds
Pool-based active learning with fallback
1. Collect n ≥ 1/ε ( log |H| + log 1/δ ) unlabeled data points DX from P_X
2. Actively request labels L until there remains a single hypothesis h’ ∈ H’ consistent with these labels (i.e., |V(DX, L)| = 1)
3. Output any hypothesis h ∈ H consistent with the obtained labels. With probability ≥ 1−δ, error_true(h) ≤ ε
We get PAC guarantees for active learning: bounds on #labels for a fixed error ε carry over from passive to active
→ Fallback guarantee
Wish list for active learning
Minimum requirement
Consistency: Generalization error should go to 0 asymptotically
We’d like more than that:
Fallback guarantee: Convergence rate of error of active learning “at least as good” as passive learning
What we’re really after
Rate improvement: Error of active learning decreases much faster than for passive learning
Example: Threshold functions
(Binary search on the sorted pool identifies the single consistent hypothesis with O(log n) label queries.)
Generalizing binary search [Dasgupta ’04]
Want to shrink the version space (number of consistent hypotheses) as quickly as possible. General (greedy) approach:
For each unlabeled instance xi compute
v_{i,1} = |V(DX, L ∪ {(xi,1)})|, v_{i,0} = |V(DX, L ∪ {(xi,0)})|, v_i = min {v_{i,1}, v_{i,0}}
Obtain the label yi for xi where i = argmax_j {v_j}
Ideal case
Is it always possible to halve the version space?
Typical case much more benign
Query trees
A query tree is a rooted, labeled tree on the relevant hypotheses H’
Each internal node is labeled with an input x ∈ DX
Each edge is labeled with 0 or 1
Each path from the root to a hypothesis h’ ∈ H’ is a labeling L such that V(DX, L) = {h’}
Want query trees of minimum height
Example: Threshold functions
Example: linear separators (2D)
Number of labels needed to identify hypothesis
Depends on the target hypothesis!
Binary thresholds (on n inputs DX): the optimal query tree needs O(log n) labels! ☺
Linear separators in 2D (on n inputs DX): for some hypotheses, even the optimal tree needs n labels; on average, the optimal query tree needs O(log n) labels! ☺
→ Average-case analysis of active learning
Average case query tree learning
Query tree T
Cost(T) = (1/|H’|) ∑_{h’ ∈ H’} depth(h’, T)
Want T* = argmin_T Cost(T)
There is a superexponential number of query trees, so finding the optimal one is hard
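Given a tree, Cost(T) is straightforward to compute; a minimal sketch, assuming a hypothetical tuple encoding ('leaf', h) / ('node', i, zero_subtree, one_subtree):

```python
def cost(tree):
    """Cost(T) = average depth of the hypothesis leaves of a query tree.

    tree is either ('leaf', h) or ('node', i, zero_subtree, one_subtree),
    an illustrative encoding used only for these sketches.
    """
    def leaf_depths(t, d=0):
        if t[0] == 'leaf':
            return [d]
        return leaf_depths(t[2], d + 1) + leaf_depths(t[3], d + 1)

    depths = leaf_depths(tree)
    return sum(depths) / len(depths)
```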
Greedy construction of query trees [Dasgupta ’04]
Algorithm GreedyTree(DX, L):
If V(DX, L) = {h’}: return Leaf(h’)
Else:
For each unlabeled instance xi compute
v_{i,1} = |V(DX, L ∪ {(xi,1)})| and v_{i,0} = |V(DX, L ∪ {(xi,0)})|
v_i = min {v_{i,1}, v_{i,0}}
Let i = argmax_j {v_j}
LeftSubTree = GreedyTree(DX, L ∪ {(xi,1)})
RightSubTree = GreedyTree(DX, L ∪ {(xi,0)})
return node labeled xi with children LeftSubTree (edge 1) and RightSubTree (edge 0)
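A minimal runnable sketch of GreedyTree, encoding each relevant hypothesis as a label tuple over DX and reusing the tree encoding (and cost function) from the sketch above; all names are illustrative:

```python
def greedy_tree(H_prime, L=None):
    """Greedily build a query tree over relevant hypotheses H'.

    H_prime: list of label tuples over DX; L: dict index -> label.
    """
    L = L or {}
    # Version space: relevant hypotheses consistent with the labels so far.
    V = [h for h in H_prime if all(h[i] == y for i, y in L.items())]
    if len(V) == 1:
        return ('leaf', V[0])
    # Pick the query whose worst-case answer shrinks V the most,
    # i.e. maximize v_i = min{v_{i,0}, v_{i,1}}.
    best_i, best_v = None, -1
    for i in range(len(H_prime[0])):
        if i in L:
            continue
        v1 = sum(h[i] == 1 for h in V)
        v = min(v1, len(V) - v1)
        if v > best_v:
            best_i, best_v = i, v
    return ('node', best_i,
            greedy_tree(H_prime, {**L, best_i: 0}),
            greedy_tree(H_prime, {**L, best_i: 1}))

# Binary thresholds on 8 points: H' has 9 relevant hypotheses.
H_prime = [tuple(int(j >= k) for j in range(8)) for k in range(9)]
print(cost(greedy_tree(H_prime)))   # roughly log2(9) ≈ 3.2 labels on average
```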
Near-optimality of greedy tree [Dasgupta ’04]
Theorem: Let T* = argmin_T Cost(T). Then GreedyTree constructs a query tree T such that
Cost(T) = O(log |H’|) · Cost(T*)
Limitations of this algorithm
Often computationally intractable: finding the “most-disagreeing” hypothesis is difficult
No-noise assumption
We will see how to relax these assumptions in the talks next week.
Bayesian or not Bayesian?
Greedy querying needs at most a factor O(log |H’|) more queries than the optimal query tree on average
Assumes a (uniform) prior distribution on hypotheses
If our assumption is wrong, the generalization bound still holds! (but we might need more labels)
Can also do a purely Bayesian analysis:
Query by Committee algorithm [Freund et al ’97]
Assumes that Nature draws hypotheses from a known prior distribution
Query by Committee
Assume a prior distribution on hypotheses
Sample a “committee” of 2k hypotheses from the prior distribution
Search for an input such that k “members” assign label 1 and k “members” assign label 0, and query that label (“maximal disagreement”)
Theorem [Freund et al ’97]: For linear separators in R^d, where both the coefficients w and the data X are drawn uniformly from the unit sphere, QBC requires exponentially fewer labels than passive learning to achieve the same error
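A minimal single-round sketch of the committee vote; the full algorithm re-samples the committee from the posterior (the version space) after every label, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 3, 4

def sample_committee(size, d):
    # Prior: homogeneous linear separators w drawn uniformly from the sphere.
    W = rng.normal(size=(size, d))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def qbc_query(X_pool, committee):
    # Pick the pool point the committee disagrees on most evenly
    # (ideally k members vote for label 1 and k for label 0).
    votes = (X_pool @ committee.T >= 0).sum(axis=1)   # votes for label 1
    disagreement = np.minimum(votes, len(committee) - votes)
    return np.argmax(disagreement)

X_pool = rng.normal(size=(500, d))
X_pool /= np.linalg.norm(X_pool, axis=1, keepdims=True)
committee = sample_committee(2 * k, d)
print(qbc_query(X_pool, committee))   # index of the point to query next
```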
Example: Threshold functions
Wish list for active learning
Minimum requirement
Consistency: Generalization error should go to 0 asymptotically
We’d like more than that:
Fallback guarantee: Convergence rate of error of active learning “at least as good” as passive learning
What we’re really after
Rate improvement: Error of active learning decreases much faster than for passive learning
Beyond pool-based analysis
Pool-based active learning is just one convenient analysis technique
It gets around the active learning bias by generalizing from a pool drawn i.i.d. at random
In the pool-based analysis, there are examples where active learning does not outperform passive learning
Exciting recent theoretical results show that, with a more involved analysis, active learning always helps (asymptotically) [Balcan, Hanneke, Wortman COLT ’08]
There are also other active learning paradigms, e.g., active querying (constructing rather than selecting inputs)