

SLIDE 1

Active Learning and Optimized Information Gathering

Lecture 8 – Active Learning

CS 101.2 Andreas Krause

SLIDE 2

Announcements

Homework 1: Due today

Office hours
Come to office hours before your presentation!
Andreas: Monday 3pm-4:30pm, 260 Jorgensen
Ryan: Wednesday 4:00-6:00pm, 109 Moore

SLIDE 3

Outline

Background in learning theory
Sample complexity
Key challenges
Heuristics for active learning
Principled algorithms for active learning

SLIDE 4

Spam or Ham?

Labels are expensive (need to ask an expert)
Which labels should we obtain to maximize classification accuracy?

[Figure: spam vs. ham examples plotted on features x1 and x2; label = sign(w0 + w1 x1 + w2 x2) (linear separator)]

SLIDE 5

Recap: Concept learning

Set X of instances, with distribution PX
True concept c: X → {0,1}
Data set D = {(x1,y1),…,(xn,yn)}, xi ∼ PX, yi = c(xi)
Hypothesis h: X → {0,1} from H = {h1, …, hn, …}
Assume c ∈ H (c is also called the "target hypothesis")
errortrue(h) = EX |c(x) - h(x)|
errortrain(h) = (1/n) ∑i |c(xi) - h(xi)|
If n is large enough, errortrue(h) ≈ errortrain(h) for all h
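A minimal numerical sketch of why errortrain tracks errortrue for large n (hypothetical setup: X = [0,1] with uniform PX, threshold concepts at 0.30 and 0.35; not part of the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    c = lambda x: (x >= 0.30).astype(int)    # true concept c (target hypothesis)
    h = lambda x: (x >= 0.35).astype(int)    # some hypothesis h from H

    n = 10_000
    x = rng.uniform(0.0, 1.0, size=n)        # x_i ~ P_X, i.i.d.
    y = c(x)                                 # y_i = c(x_i), no label noise

    error_train = np.mean(np.abs(y - h(x)))  # (1/n) * sum_i |c(x_i) - h(x_i)|
    error_true = 0.35 - 0.30                 # E_X |c(x) - h(x)| for uniform P_X
    print(f"error_train ~ {error_train:.4f}, error_true = {error_true:.4f}")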

SLIDE 6

Recap: PAC Bounds

How many samples n do we need to get error ≤ ε with probability 1-δ?

No noise: n ≥ 1/ε ( log |H| + log 1/δ )
Noise: n ≥ 1/ε² ( log |H| + log 1/δ )

Requires that data is i.i.d.!

Today: Mainly no-noise case (more next week)
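As a quick sanity check, a small sketch that plugs numbers into these bounds (constants are suppressed in the slide's statement, so treat the output as an order-of-magnitude estimate; the function name and example values are illustrative):

    import math

    def pac_samples(eps, delta, num_hypotheses, noise=False):
        # n >= 1/eps (log|H| + log 1/delta); 1/eps^2 instead of 1/eps with noise
        rate = 1.0 / eps**2 if noise else 1.0 / eps
        return math.ceil(rate * (math.log(num_hypotheses) + math.log(1.0 / delta)))

    print(pac_samples(0.1, 0.05, 1000))              # no-noise case
    print(pac_samples(0.1, 0.05, 1000, noise=True))  # noisy case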

SLIDE 7

Statistical passive/active learning protocol

Data source PX (produces inputs xi)
Data set Dn = {(x1,y1),…,(xn,yn)}
Learner outputs hypothesis h
errortrue(h) = Ex~P[h(x) ≠ c(x)]
Active learner assembles the data set by selectively obtaining labels
Data set NOT sampled i.i.d.!!

SLIDE 8

Example: Uncertainty sampling

Budget of m labels
Draw n unlabeled examples
Repeat until we've picked m labels:
  Assign each unlabeled data point an "uncertainty score"
  Greedily pick (and query) the most uncertain example

One of the most commonly used classes of heuristics!
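A minimal sketch of this heuristic for the linear-separator example from earlier (the perceptron refit, the margin-based uncertainty score, and the synthetic pool are illustrative choices, not prescribed by the slides):

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical pool: n unlabeled points in 2D, true labels from a linear separator
    n, m = 200, 20                                    # pool size, label budget
    X = rng.uniform(-1, 1, size=(n, 2))
    w_true = np.array([0.2, 1.0, -1.0])               # (w0, w1, w2); unknown to the learner
    true_label = lambda A: (np.c_[np.ones(len(A)), A] @ w_true >= 0).astype(int)

    labeled_idx = list(rng.choice(n, size=2, replace=False))  # seed with a couple of labels
    w = np.zeros(3)

    for _ in range(m - len(labeled_idx)):
        # Refit the separator on the labeled data (simple perceptron passes; illustrative only)
        Xl = np.c_[np.ones(len(labeled_idx)), X[labeled_idx]]
        yl = true_label(X[labeled_idx])
        for _ in range(100):
            pred = (Xl @ w >= 0).astype(int)
            w += Xl.T @ (yl - pred) * 0.1

        # Uncertainty score: small distance to the current decision boundary = uncertain
        margins = np.abs(np.c_[np.ones(n), X] @ w) / (np.linalg.norm(w[1:]) + 1e-12)
        margins[labeled_idx] = np.inf                  # don't re-pick labeled points
        labeled_idx.append(int(np.argmin(margins)))    # query the most uncertain example

    print("queried indices:", labeled_idx)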

SLIDE 9

Uncertainty sampling for linear separators

SLIDE 10

Active learning bias

SLIDE 11

Active learning bias

If we can pick at most m = n/2 labels, then with overwhelmingly high probability, uncertainty sampling picks points such that there remains a hypothesis with error > 0.1!!!

With standard passive learning, error → 0 as n → ∞

SLIDE 12

Wish list for active learning

Minimum requirement

Consistency: Generalization error should go to 0 asymptotically

We’d like more than that:

Fallback guarantee: Convergence rate of error of active learning “at least as good” as passive learning

What we’re really after

Rate improvement: Error of active learning decreases much faster than for passive learning

SLIDE 13

From passive to active

Passive PAC learning

1. Collect data set D of n ≥ 1/ε ( log |H| + log 1/δ ) data points and their labels i.i.d. from PX
2. Output a consistent hypothesis h
3. With probability at least 1-δ, errortrue(h) ≤ ε

Key idea

Sample n unlabeled data points DX = {x1,…,xn} i.i.d.
Actively query labels until all hypotheses consistent with these labels agree on the labels of all unlabeled data

SLIDE 14

Why might this work?

SLIDE 15

Formalization: “Relevant” hypothesis

Data set D = {(x1,y1),…,(xn,yn)}, hypothesis space H
Input data: DX = {x1,…,xn}
Relevant hypotheses: H'(DX) = H' = restriction of H to DX
Formally: H' = {h': DX → {0,1} | ∃ h ∈ H s.t. ∀ x ∈ DX: h'(x) = h(x)}

SLIDE 16

Example: Threshold functions


SLIDE 18

Version space

Input data DX = {x1,…,xn}
Partially labeled: we have L = {(xi1,yi1),…,(xim,yim)}
The (relevant) version space is the set of all relevant hypotheses consistent with the labels L
Formally: V(DX,L) = V = {h' ∈ H'(DX): h'(xij) = yij for 1 ≤ j ≤ m}

Why useful?
Partial labels L imply all remaining labels for DX ⇔ |V| = 1

SLIDE 19

Example: Binary thresholds
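A small runnable sketch of this example (hypothetical pool and target threshold): querying labels by binary search shrinks the version space of binary thresholds until |V| = 1, at which point all remaining labels are implied:

    import numpy as np

    rng = np.random.default_rng(2)

    # Pool of n unlabeled points; binary-threshold hypotheses h_t(x) = 1[x >= t]
    xs = np.sort(rng.uniform(0, 1, size=16))
    t_true = 0.62                                          # hidden target threshold (illustrative)
    label = lambda x: int(x >= t_true)

    # Relevant hypotheses H'(D_X): on n sorted points there are only n+1 distinct labelings
    thresholds = np.concatenate(([0.0], (xs[:-1] + xs[1:]) / 2, [1.0]))
    H_prime = [(xs >= t).astype(int) for t in thresholds]

    def version_space(labels):  # labels: dict index -> observed label
        return [h for h in H_prime if all(h[i] == y for i, y in labels.items())]

    # Binary search on the sorted pool: each query roughly halves the version space
    labels, lo, hi = {}, 0, len(xs) - 1
    while len(version_space(labels)) > 1:
        mid = (lo + hi) // 2
        labels[mid] = label(xs[mid])                       # query one label
        lo, hi = (mid + 1, hi) if labels[mid] == 0 else (lo, mid - 1)
        print(f"queried x[{mid}], |V| = {len(version_space(labels))}")

    (h_star,) = version_space(labels)                      # |V| = 1: remaining labels implied
    print("labels queried:", len(labels), "of", len(xs), "- identified labeling:", h_star)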

SLIDE 20

Pool-based active learning with fallback

1. Collect n ≥ 1/ε ( log |H| + log 1/δ ) unlabeled data points DX from PX
2. Actively request labels L until there remains a single hypothesis h' ∈ H' that's consistent with these labels (i.e., |V(H',L)| = 1)
3. Output any hypothesis h ∈ H consistent with the obtained labels. With probability ≥ 1-δ, errortrue(h) ≤ ε

Get PAC guarantees for active learning
Bounds on #labels for a fixed error ε carry over from passive to active ⇒ fallback guarantee

SLIDE 21

Wish list for active learning

Minimum requirement

Consistency: Generalization error should go to 0 asymptotically

We’d like more than that:

Fallback guarantee: Convergence rate of error of active learning “at least as good” as passive learning

What we’re really after

Rate improvement: Error of active learning decreases much faster than for passive learning

SLIDE 22

Pool-based active learning with fallback

1. Collect n ≥ 1/ε ( log |H| + log 1/δ ) unlabeled data points DX from PX
2. Actively request labels L until there remains a single hypothesis h' ∈ H' that's consistent with these labels (i.e., |V(H',L)| = 1)
3. Output any hypothesis h ∈ H consistent with the obtained labels. With probability ≥ 1-δ, errortrue(h) ≤ ε

SLIDE 23

Example: Threshold functions

SLIDE 24

Generalizing binary search [Dasgupta ’04]

Want to shrink the version space (number of consistent hypotheses) as quickly as possible. General (greedy) approach:

For each unlabeled instance xi compute
  vi,1 = |V(DX, L ∪ {(xi,1)})| and vi,0 = |V(DX, L ∪ {(xi,0)})|
  vi = min {vi,1, vi,0}
Obtain label yi for xi, where i = argmaxj {vj}

SLIDE 25

Ideal case

SLIDE 26

Is it always possible to halve the version space?

SLIDE 27

Typical case much more benign

SLIDE 28

Query trees

A query tree is a rooted, labeled tree on the relevant hypotheses H'
Each node is labeled with an input x ∈ DX
Each edge is labeled with 0 or 1
Each path from the root to a hypothesis h' ∈ H' is a labeling L such that V(DX,L) = {h'}
Want query trees of minimum height

SLIDE 29

Example: Threshold functions

SLIDE 30

Example: Linear separators (2D)

SLIDE 31

Number of labels needed to identify hypothesis

Depends on the target hypothesis!

Binary thresholds (on n inputs DX)

Optimal query tree needs O(log n) labels! ☺

For linear separators in 2D (on n inputs DX)

For some hypotheses, even the optimal tree needs n labels
On average, the optimal query tree needs O(log n) labels! ☺

Average-case analysis of active learning

SLIDE 32

Average case query tree learning

Query tree T
Cost(T) = 1/|H'| ∑h'∈H' depth(h',T)
Want T* = argminT Cost(T)
Superexponential number of query trees; finding the optimal one is hard

SLIDE 33

Greedy construction of query trees [Dasgupta ’04]

Algorithm GreedyTree(DX, L):
  V' = V(DX, L)
  If V' = {h'}: return Leaf(h')
  Else:
    For each unlabeled instance xi compute
      vi,1 = |V(DX, L ∪ {(xi,1)})| and vi,0 = |V(DX, L ∪ {(xi,0)})|
      vi = min {vi,1, vi,0}
    Let i = argmaxj {vj}
    LeftSubTree = GreedyTree(DX, L ∪ {(xi,1)})
    RightSubTree = GreedyTree(DX, L ∪ {(xi,0)})
    Return Node xi with children LeftSubTree (1) and RightSubTree (0)
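A runnable sketch of GreedyTree for the binary-threshold class (the representation of relevant hypotheses as label vectors on DX, the Node structure, and the tiny example pool are illustrative choices, not part of the slides):

    from dataclasses import dataclass
    from typing import Dict, List, Optional, Tuple

    Labeling = Dict[int, int]          # index into D_X -> label in {0, 1}
    Hypothesis = Tuple[int, ...]       # a relevant hypothesis = its labels on all of D_X

    @dataclass
    class Node:
        leaf: Optional[Hypothesis] = None          # set iff this is a leaf
        query: Optional[int] = None                # index of the queried input x_i
        child: Optional[Dict[int, "Node"]] = None  # children for observed label 0 / 1

    def version_space(H_prime: List[Hypothesis], L: Labeling) -> List[Hypothesis]:
        return [h for h in H_prime if all(h[i] == y for i, y in L.items())]

    def greedy_tree(H_prime: List[Hypothesis], L: Labeling) -> Node:
        V = version_space(H_prime, L)
        if len(V) == 1:
            return Node(leaf=V[0])
        # v_i = min(|V with x_i labeled 1|, |V with x_i labeled 0|); pick i maximizing v_i
        unlabeled = [i for i in range(len(H_prime[0])) if i not in L]
        def v(i: int) -> int:
            v1 = sum(h[i] == 1 for h in V)
            return min(v1, len(V) - v1)
        i = max(unlabeled, key=v)
        return Node(query=i, child={b: greedy_tree(H_prime, {**L, i: b}) for b in (0, 1)})

    # Example: binary thresholds on 8 sorted inputs -> 9 relevant hypotheses
    n = 8
    H_prime = [tuple(int(j >= k) for j in range(n)) for k in range(n + 1)]
    tree = greedy_tree(H_prime, {})

    def depth(node: Node, h: Hypothesis) -> int:   # number of queries to identify h
        return 0 if node.leaf is not None else 1 + depth(node.child[h[node.query]], h)

    print("average cost:", sum(depth(tree, h) for h in H_prime) / len(H_prime))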

SLIDE 34

Near-optimality of greedy tree [Dasgupta ’04]

Theorem: Let T* = argminT Cost(T). Then GreedyTree constructs a query tree T such that

Cost(T) = O(log |H'|) · Cost(T*)

SLIDE 35

Limitations of this algorithm

Often computationally intractable

Finding “most-disagreeing” hypothesis is difficult

No-noise assumption

Will see how we can relax these assumptions in the talks next week.

SLIDE 36

Bayesian or not Bayesian?

Greedy querying needs at most a factor of O(log |H'|) more queries than the optimal query tree, on average
Assumes a prior distribution (uniform) on hypotheses
If this assumption is wrong, the generalization bound still holds! (but we might need more labels)

Can also do a pure Bayesian analysis:

Query by Committee algorithm [Freund et al. '97]
Assumes that Nature draws the hypothesis from a known prior distribution

SLIDE 37

Query by Committee

Assume a prior distribution on hypotheses
Sample a "committee" of 2k hypotheses drawn from the prior distribution
Search for an input such that k "members" assign label 1 and k "members" assign label 0, and query that label ("maximal disagreement")

Theorem [Freund et al. '97]: For linear separators in Rd, where both the coefficients w and the data X are drawn uniformly from the unit sphere, QBC requires exponentially fewer labels than passive learning to achieve the same error
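A rough sketch of the QBC selection step for linear separators through the origin (the rejection-style prior sampling, the synthetic pool on the unit circle, and all parameter values are illustrative, not from Freund et al.):

    import numpy as np

    rng = np.random.default_rng(3)

    # Pool of unlabeled points on the unit circle; hypotheses are homogeneous linear separators
    n, k, budget = 300, 8, 10
    theta = rng.uniform(0, 2 * np.pi, size=n)
    X = np.c_[np.cos(theta), np.sin(theta)]
    w_true = rng.normal(size=2); w_true /= np.linalg.norm(w_true)   # hidden target separator
    oracle = lambda x: int(x @ w_true >= 0)

    L = {}                                                          # queried labels: index -> label
    for _ in range(budget):
        # Committee of 2k hypotheses from the (uniform) prior, kept only if consistent with L
        # (simple rejection sampling; fine for a toy example, slow in general)
        committee = []
        while len(committee) < 2 * k:
            w = rng.normal(size=2); w /= np.linalg.norm(w)
            if all(int(X[i] @ w >= 0) == y for i, y in L.items()):
                committee.append(w)
        votes = np.array([[int(x @ w >= 0) for w in committee] for x in X]).sum(axis=1)
        # Maximal disagreement: as close as possible to a k-vs-k split among 2k members
        disagreement = -np.abs(votes - k).astype(float)
        for j in L:
            disagreement[j] = -np.inf                               # don't re-query
        i = int(np.argmax(disagreement))
        L[i] = oracle(X[i])

    print("queried", len(L), "labels:", sorted(L))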

SLIDE 38

Example: Threshold functions

SLIDE 39

Wish list for active learning

Minimum requirement

Consistency: Generalization error should go to 0 asymptotically

We’d like more than that:

Fallback guarantee: Convergence rate of error of active learning “at least as good” as passive learning

What we’re really after

Rate improvement: Error of active learning decreases much faster than for passive learning

SLIDE 40

Beyond pool-based analysis

Pool-based active learning is just one convenient analysis technique

It gets around active learning bias by generalizing from a pool drawn i.i.d. at random
In the pool-based analysis, there are examples where active learning does not outperform passive learning

Exciting recent theoretical results show that, using a more involved analysis, active learning always helps (asymptotically) [Balcan, Hanneke, Wortman COLT '08]

Also other active learning paradigms:

E.g.: Active querying (constructing rather than selecting inputs)

SLIDE 41

What you need to know

Uncertainty sampling
Active learning bias
Pool-based active learning scheme
Relevant hypotheses and version spaces
Generalized binary search algorithm