SLIDE 1

Introduction to Machine Learning

Active Learning

Barnabás Póczos

SLIDE 2

Credits

Some of the slides are taken from Nina Balcan.

SLIDE 3

Classic Supervised Learning Paradigm is Insufficient Nowadays

Modern applications: massive amounts of raw data. Only a tiny fraction can be annotated by human experts.

Examples: billions of webpages, images, sensor measurements.

SLIDE 4

Modern ML: New Learning Approaches

Modern applications: massive amounts of raw data. Example: the Large Synoptic Survey Telescope produces 15 terabytes of data … every night.

We need techniques that minimize the need for expert/human intervention => Active Learning.

SLIDE 5

Contents

  • Active Learning Intro
    ▪ Batch Active Learning vs. Selective Sampling Active Learning
    ▪ Exponential improvement on # of labels
    ▪ Sampling bias: Active Learning can hurt performance
  • Active Learning with SVM
  • Gaussian Processes
    ▪ Regression
    ▪ Properties of Multivariate Gaussian distributions
    ▪ Ridge regression
    ▪ GP = Bayesian Ridge Regression + Kernel trick
  • Active Learning with Gaussian Processes

SLIDE 6

Additional resources

  • Two Faces of Active Learning. Sanjoy Dasgupta. 2011.
  • Active Learning. Burr Settles. 2012.
  • Active Learning. Balcan and Urner. Encyclopedia of Algorithms. 2015.

SLIDE 7

Batch Active Learning

[Diagram: the learning algorithm receives a set of unlabeled examples from the data source, repeatedly sends a request for the label of an example to the expert and receives a label back, and finally outputs a classifier w.r.t. the underlying data distribution D.]

  • Underlying data distribution D.
  • Learner can choose specific examples to be labeled.
  • Goal: use fewer labeled examples [pick informative examples to be labeled].

SLIDE 8

Selective Sampling Active Learning

[Diagram: unlabeled examples y1, y2, y3, ... arrive one at a time from the data source; for each, the learning algorithm either requests its label from the expert (e.g., receiving label z1 for y1 and z3 for y3) or lets it go, and finally outputs a classifier w.r.t. D.]

  • Underlying data distribution D.
  • Selective sampling AL (online AL): stream of unlabeled examples; when each arrives, make a decision to ask for its label or not.
  • Goal: use fewer labeled examples [pick informative examples to be labeled].

SLIDE 9

What Makes a Good Active Learning Algorithm?

  • Need to choose the label requests carefully, to get informative labels.
  • Guaranteed to output a relatively good classifier for most learning problems.
  • Doesn't make too many label requests; hopefully far fewer than passive learning.

SLIDE 10

Can adaptive querying really do better than passive/random sampling?

  • YES! (sometimes)
  • We often need far fewer labels for active learning than for passive learning.
  • This is predicted by theory and has been observed in practice.

SLIDE 11

Can adaptive querying help? [CAL92, Dasgupta04]

  • Threshold functions on the real line: hw(x) = 1(x ≥ w), C = {hw : w ∈ R}.
  • How can we recover the correct labels with ≪ N queries?
  • Active Algorithm:
    ▪ Get N unlabeled examples.
    ▪ Do binary search (query at the half)! Just need O(log N) labels.
    ▪ Output a classifier consistent with the N inferred labels.
  • With N = O(1/ϵ) we are guaranteed to get a classifier of error ≤ ϵ.
  • Passive supervised: Ω(1/ϵ) labels to find an ϵ-accurate threshold. Active: only O(log 1/ϵ) labels.
  • Exponential improvement. (A sketch of the binary-search learner follows below.)
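To make the label-complexity claim concrete, here is a minimal sketch of the binary-search active learner for 1D thresholds. The oracle, the threshold value, and all names below are illustrative assumptions, not from the slides.

```python
# Minimal sketch: binary-search active learning for 1D thresholds h_w(x) = 1(x >= w).
import numpy as np

def oracle(x, true_w=0.37):
    """Hypothetical expert: labels according to h_w(x) = 1(x >= w)."""
    return int(x >= true_w)

def learn_threshold(unlabeled, query):
    """Recover all N labels with O(log N) queries via binary search."""
    xs = np.sort(unlabeled)
    lo, hi = 0, len(xs)          # invariant: xs[:lo] labeled 0, xs[hi:] labeled 1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query(xs[mid]) == 1:
            hi = mid              # threshold is at or left of xs[mid]
        else:
            lo = mid + 1          # threshold is right of xs[mid]
    w_hat = xs[lo] if lo < len(xs) else float("inf")  # first point labeled 1
    return w_hat, queries

rng = np.random.default_rng(0)
pool = rng.uniform(size=10_000)   # N = O(1/eps) unlabeled examples
w_hat, queries = learn_threshold(pool, oracle)
print(f"w_hat={w_hat:.4f} after only {queries} label queries (N={pool.size})")
```

On 10,000 pooled points this recovers the threshold with roughly log2(10,000) ≈ 14 label queries, instead of labeling the whole pool.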

SLIDE 12

Active SVM

Uncertainty sampling in SVMs is common and quite useful in practice.

Active SVM Algorithm:
  • At any time during the algorithm, we have a "current guess" wt of the separator: the max-margin separator of all labeled points so far.
  • Request the label of the example closest to the current separator.

E.g., [Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010; Schohn & Cohn, ICML 2000]

SLIDE 13

Active SVM

Active SVM seems to be quite useful in practice.

Algorithm (batch version):
Input: Su = {x1, ..., xmu} drawn i.i.d. from the underlying source D.
Start: query for the labels of a few random points.
For t = 2, ...:
  • Find wt, the max-margin separator of all labeled points so far.
  • Request the label of the example closest to the current separator, i.e., the unlabeled xu minimizing |wt · xu| (highest uncertainty).

[Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]

(A code sketch of this loop follows below.)
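A minimal sketch of the batch uncertainty-sampling loop, assuming scikit-learn's linear SVC as the max-margin separator and a labeling oracle `y_oracle`; both are illustrative choices, not the authors' code.

```python
# Sketch of batch active SVM / uncertainty sampling (illustrative, not the authors' code).
import numpy as np
from sklearn.svm import SVC

def active_svm(X_pool, y_oracle, n_seed=5, n_queries=30, seed=0):
    """X_pool: (m, d) unlabeled pool; y_oracle: (m,) labels revealed only when queried."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))
    # assumes the random seed set happens to contain both classes
    for _ in range(n_queries):
        clf = SVC(kernel="linear", C=1.0).fit(X_pool[labeled], y_oracle[labeled])
        dist = np.abs(clf.decision_function(X_pool))  # distance to current separator
        dist[labeled] = np.inf                        # never re-query a labeled point
        labeled.append(int(np.argmin(dist)))          # query the most uncertain example
    final = SVC(kernel="linear", C=1.0).fit(X_pool[labeled], y_oracle[labeled])
    return final, labeled
```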

SLIDE 14

DANGER!!!

  • Uncertainty sampling works sometimes….
  • However, we need to be very, very careful!
  • Myopic, greedy techniques can suffer from sampling bias: the active learning algorithm samples from a different (x, y) distribution than the true data.
  • The bias is created by the querying strategy; as time goes on, the sample becomes less and less representative of the true data source.

[Dasgupta10]

SLIDE 15

DANGER!!!

  • Main tension: we want to choose informative points, but we also want to guarantee that the classifier we output does well on true random examples from the underlying distribution.
  • Observed in practice too!
SLIDE 16

Other Interesting Active Learning Techniques Used in Practice

It is an interesting open question to analyze under what conditions they are successful.

SLIDE 17

Density-Based Sampling

Centroid of the largest unsampled cluster. [Jaime G. Carbonell]

SLIDE 18

Uncertainty Sampling

Closest to the decision boundary (Active SVM). [Jaime G. Carbonell]

SLIDE 19

Maximal Diversity Sampling

Maximally distant from labeled x's. [Jaime G. Carbonell]

SLIDE 20

Ensemble-Based Possibilities

Uncertainty + diversity criteria; density + uncertainty criteria. [Jaime G. Carbonell] (A small scoring sketch follows below.)
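One simple way to combine such criteria is a weighted score per unlabeled point; the weights, the density proxy, and the standardization below are assumptions for illustration, not from the slides.

```python
# Illustrative ensemble scoring: uncertainty + density (weights are arbitrary choices).
import numpy as np

def ensemble_scores(margin_dist, pairwise_dist, alpha=0.5):
    """margin_dist[i]: |distance of x_i to the decision boundary| (smaller = more uncertain).
    pairwise_dist[i, j]: distance between unlabeled x_i and x_j (used as a density proxy)."""
    uncertainty = -margin_dist                 # closer to the boundary => higher score
    density = -pairwise_dist.mean(axis=1)      # points in dense regions => higher score
    z = lambda v: (v - v.mean()) / (v.std() + 1e-12)   # put criteria on a comparable scale
    return alpha * z(uncertainty) + (1 - alpha) * z(density)

# at each round, query the argmax of ensemble_scores(...)
```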

SLIDE 21

What You Should Know so far

  • Active learning can be really helpful and can provide exponential improvements in label complexity (both theoretically and practically)!
  • Need to be very careful due to sampling bias.
  • Common heuristics exist (e.g., those based on uncertainty sampling).

SLIDE 22

Gaussian Processes for Regression

SLIDE 23

Additional resources

http://www.gaussianprocess.org/
Some of these slides are taken from D. Lizotte, R. Parr, and C. Guestrin.

SLIDE 24

Additional resources

  • Nonmyopic Active Learning of Gaussian Processes: An Exploration-Exploitation Approach. A. Krause and C. Guestrin, ICML 2007.
  • Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A. Krause, A. Singh, and C. Guestrin, Journal of Machine Learning Research 9 (2008).
  • Bayesian Active Learning for Posterior Estimation. K. Kandasamy, J. Schneider, and B. Póczos, International Joint Conference on Artificial Intelligence (IJCAI), 2015.

SLIDE 25

Why GPs for Regression?

Regression methods: linear regression, multilayer perceptron, ridge regression, support vector regression, kNN regression, etc.

Motivation: all the above regression methods give point estimates. We would like a method that also provides confidence along with the estimate.

Application in Active Learning: such a method can be used for active learning: query the next point (and its label) where the uncertainty is the highest, as sketched below.
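As a concrete illustration of that loop (the regressor, kernel, and all names are assumptions for this sketch, not from the slides), one can fit a GP regressor and query the candidate with the largest predictive standard deviation:

```python
# Illustrative sketch: GP-based active learning by maximum predictive uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gp_active_learning(X_cand, f_oracle, n_init=3, n_queries=10, seed=0):
    """X_cand: (m, d) candidate inputs; f_oracle(x) returns the (noisy) target at x."""
    rng = np.random.default_rng(seed)
    idx = list(rng.choice(len(X_cand), size=n_init, replace=False))
    y = [f_oracle(X_cand[i]) for i in idx]
    gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-3))
    for _ in range(n_queries):
        gp.fit(X_cand[idx], np.array(y))
        _, std = gp.predict(X_cand, return_std=True)
        std[idx] = -np.inf                 # do not re-query labeled points
        j = int(np.argmax(std))            # highest-uncertainty candidate
        idx.append(j)
        y.append(f_oracle(X_cand[j]))
    return gp, idx
```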

SLIDE 26

Why GPs for Regression?

GPs can answer the following questions:
  • Here is where the function will most likely be (the expected function).
  • Here are some examples of what it might look like (samples from the posterior distribution, e.g., the blue, red, and green functions).
  • Here is a prediction of what you'll see if you evaluate your function at x', with confidence.

SLIDE 27

Properties of Multivariate Gaussian Distributions

SLIDE 28

1D Gaussian Distribution

Parameters:
  • Mean, μ
  • Variance, σ²
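The density itself appeared as an image on the slide; the standard 1D Gaussian density is:

```latex
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```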
SLIDE 29

Multivariate Gaussian

SLIDE 30

Multivariate Gaussian

A 2-dimensional Gaussian is defined by
  • a mean vector μ = [μ_1, μ_2]
  • a covariance matrix
        Σ = [ σ²_11   σ²_12 ]
            [ σ²_21   σ²_22 ]
    where σ²_ij = E[(x_i − μ_i)(x_j − μ_j)] is the (co)variance.

Note: Σ is symmetric and "positive semi-definite": for all x, xᵀ Σ x ≥ 0.
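A quick numerical check of this definition (the particular μ and Σ are just an example): draw samples and recover the mean and covariance empirically.

```python
# Illustrative check: sample from a 2D Gaussian and recover mu and Sigma empirically.
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])          # symmetric, positive semi-definite

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=100_000)

print(X.mean(axis=0))                   # ~ [0, 0]
print(np.cov(X, rowvar=False))          # ~ Sigma, since Sigma_ij = E[(x_i-mu_i)(x_j-mu_j)]
```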

SLIDE 31

Multivariate Gaussian examples

μ = (0, 0),    Σ = [ 1    0.8 ]
                   [ 0.8  1   ]

SLIDE 32

Multivariate Gaussian examples

μ = (0, 0),    Σ = [ 1    0.8 ]
                   [ 0.8  1   ]

SLIDE 33

Marginal distributions of Gaussians are Gaussian Given: The marginal distribution is:

               

bb ba ab aa b a b a x

x x ) , ( ), , (   

Useful Properties of Gaussians

SLIDE 34

Marginal distributions of Gaussians are Gaussian

SLIDE 35

Block Matrix Inversion

Definition (Schur complements) and Theorem (block matrix inversion); the standard statements are sketched below.
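The slide's formulas were images; the standard statements they refer to are, for a partitioned matrix M:

```latex
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad
M/D := A - B D^{-1} C, \qquad
M/A := D - C A^{-1} B
```

and, when the relevant blocks are invertible,

```latex
M^{-1} = \begin{pmatrix}
(M/D)^{-1} & -(M/D)^{-1} B D^{-1} \\
-D^{-1} C (M/D)^{-1} & \; D^{-1} + D^{-1} C (M/D)^{-1} B D^{-1}
\end{pmatrix}
```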

SLIDE 36

Useful Properties of Gaussians

Conditional distributions of Gaussians are Gaussian.

Notation: partition x = (x_a, x_b) with mean (μ_a, μ_b), covariance blocks Σ_aa, Σ_ab, Σ_ba, Σ_bb, and precision blocks Λ_aa, Λ_ab, Λ_ba, Λ_bb, where Λ = Σ⁻¹.

Conditional distribution:
    x_a | x_b ~ N( μ_a + Σ_ab Σ_bb⁻¹ (x_b − μ_b),  Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba ).

SLIDE 37

Higher Dimensions

  • Visualizing > 3-dimensional Gaussian random variables is… difficult.
  • Means and variances of the marginal variables are practical, but then we don't see correlations between those variables.
  • Visualizing an 8-dimensional Gaussian variable f: plot the indices 1, ..., 8 on the horizontal axis and show each marginal mean with an error bar. Marginals are Gaussian, e.g., f(6) ~ N(µ(6), σ²(6)).

SLIDE 38

Yet Higher Dimensions

Why stop there?

SLIDE 39

Getting Ridiculous

Why stop there?

SLIDE 40

Gaussian Process

Definition of GP:
  • A probability distribution indexed by an arbitrary set (integers, reals, finite-dimensional vectors, etc.).
  • Each element (indexed by x) is a Gaussian distribution over the reals with mean µ(x).
  • These distributions are dependent/correlated as defined by k(x, z).
  • Any finite subset of indices defines a multivariate Gaussian distribution.

SLIDE 41

Gaussian Process

Distribution over functions….

If our regression model is a GP, then it is no longer a point estimate: it can provide regression estimates with confidence.

The domain (index set) of the functions can be pretty much anything:
  • Reals
  • Real vectors
  • Graphs
  • Strings
  • Sets
SLIDE 42

Bayesian Updates for GPs

  • How can we do regression and learn the GP from data?
  • We will be Bayesians today:
    ▪ Start with a GP prior
    ▪ Get some data
    ▪ Compute a posterior
SLIDE 43

Samples from the prior distribution

Picture is taken from Rasmussen and Williams
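A minimal sketch of how such prior samples are generated; the squared-exponential kernel and the small jitter term are assumptions for the illustration, not taken from the slides.

```python
# Sketch: draw sample functions from a zero-mean GP prior with an RBF kernel.
import numpy as np

def rbf_kernel(X, Z, length_scale=1.0):
    d2 = (X[:, None] - Z[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

x = np.linspace(-5, 5, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(x.size)     # jitter for numerical stability
L = np.linalg.cholesky(K)                        # K = L L^T

rng = np.random.default_rng(0)
samples = L @ rng.standard_normal((x.size, 3))   # three sample functions f ~ GP(0, k)
```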

SLIDE 44

Samples from the posterior distribution

Picture is taken from Rasmussen and Williams

SLIDE 45

Prior

Zero mean Gaussians with covariance k(x,z)

SLIDE 46

Data

SLIDE 47

Posterior

SLIDE 48

Ridge Regression

Linear regression and ridge regression (standard forms sketched below).

The Gaussian Process is a Bayesian generalization of the kernelized ridge regression.
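The formulas on this slide were images; the standard forms (with the inputs x_i collected as columns of X and λ > 0 the regularization weight) are:

```latex
\text{Linear regression:}\quad
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \bigl(y_i - w^{\top} x_i\bigr)^{2}
```

```latex
\text{Ridge regression:}\quad
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \bigl(y_i - w^{\top} x_i\bigr)^{2} + \lambda \|w\|_{2}^{2}
\;\;\Rightarrow\;\;
\hat{w} = \bigl(X X^{\top} + \lambda I\bigr)^{-1} X y
```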
SLIDE 49

Weight Space View

GP = Bayesian ridge regression in feature space + Kernel trick to carry out computations

The training data

SLIDE 50

Bayesian Analysis of Linear Regression with Gaussian noise

Linear regression: f(x) = xᵀw.
Linear regression with noise: y = f(x) + ε,  ε ~ N(0, σ_n²).

SLIDE 51

Bayesian Analysis of Linear Regression with Gaussian noise

The likelihood: p(y | X, w) = N(Xᵀw, σ_n² I).

SLIDE 52

Bayesian Analysis of Linear Regression with Gaussian noise

The prior: w ~ N(0, Σ_p).
Now we can calculate the posterior: p(w | X, y) ∝ p(y | X, w) p(w).

SLIDE 53

Bayesian Analysis of Linear Regression with Gaussian noise

After "completing the square" we obtain a Gaussian posterior (sketched below).

MAP estimation ⇒ Ridge Regression.
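The completed-square posterior was an image on the slide; in the standard notation (prior w ~ N(0, Σ_p), noise variance σ_n², inputs as columns of X, as in Rasmussen and Williams) it reads:

```latex
p(w \mid X, y) \;\propto\; \exp\!\Bigl(-\tfrac{1}{2}\,(w-\bar{w})^{\top} A \,(w-\bar{w})\Bigr),
\qquad
A = \sigma_n^{-2} X X^{\top} + \Sigma_p^{-1},
\qquad
\bar{w} = \sigma_n^{-2} A^{-1} X y
```

Taking the MAP estimate w = w̄ with Σ_p = (σ_n²/λ) I recovers exactly the ridge-regression solution above.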

SLIDE 54

Bayesian Analysis of Linear Regression with Gaussian noise

This posterior covariance matrix doesn't depend on the observations y: a strange property of Gaussian Processes.

SLIDE 55

Projections of Inputs into Feature Space

Bayesian linear regression suffers from limited expressiveness. To overcome the problem ⇒ go to a feature space and do linear regression there:
  a) explicit features
  b) implicit features (kernels)

SLIDE 56

Explicit Features

Linear regression in the feature space

SLIDE 57

Explicit Features

The predictive distribution after the feature map:
Reminder: this is what we had without feature maps:

SLIDE 58

Explicit Features

Shorthands:
The predictive distribution after the feature map:

SLIDE 59

Explicit Features

The predictive distribution after the feature map (*): a problem with (*) is that it needs an N×N matrix inversion...

Theorem: (*) can be rewritten (a standard form is sketched below).
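The (*) equations were images; in the standard feature-space notation (Φ the matrix of training features φ(x_i), φ_* = φ(x_*), prior covariance Σ_p, noise σ_n², as in Rasmussen and Williams), the rewritten predictive distribution only requires inverting the n×n matrix K + σ_n² I, with K = Φᵀ Σ_p Φ:

```latex
\bar{f}_* = \phi_*^{\top} \Sigma_p \Phi \,(K + \sigma_n^{2} I)^{-1} y,
\qquad
\mathbb{V}[f_*] = \phi_*^{\top} \Sigma_p \phi_*
 - \phi_*^{\top} \Sigma_p \Phi \,(K + \sigma_n^{2} I)^{-1} \Phi^{\top} \Sigma_p \phi_*
```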

SLIDE 60

Proofs

  • Mean expression. We need:
  • Variance expression. We need:

Lemma:
Matrix inversion lemma:

SLIDE 61

From Explicit to Implicit Features

SLIDE 62

From Explicit to Implicit Features

The feature space always appears in the form of:

Lemma:

No need to know the explicit N-dimensional features; their inner product is enough.
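The lemma's equation was an image; the implicit-feature (kernel) form it refers to is the weighted inner product

```latex
k(x, x') \;=\; \phi(x)^{\top} \,\Sigma_p\, \phi(x')
\;=\; \psi(x)^{\top} \psi(x'),
\qquad \psi(x) := \Sigma_p^{1/2} \phi(x)
```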

SLIDE 63

GP pseudo code

Inputs: training inputs X, targets y, covariance function k, noise level σ_n², test input x*.

SLIDE 64

GP pseudo code (continued)

Outputs: predictive mean, predictive variance, log marginal likelihood.
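The pseudo-code boxes did not survive extraction; a minimal sketch in the spirit of the standard Cholesky-based GP regression algorithm (Rasmussen and Williams, Algorithm 2.1; the function and variable names here are assumptions) would be:

```python
# Sketch of GP regression prediction (Cholesky-based, after Rasmussen & Williams Alg. 2.1).
import numpy as np

def gp_predict(X, y, X_star, kernel, noise_var):
    """Inputs: training inputs X, targets y, test inputs X_star, covariance function
    kernel(A, B), and noise variance noise_var.
    Outputs: predictive mean, predictive variance, log marginal likelihood."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                                # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # alpha = K^{-1} y
    K_s = kernel(X, X_star)
    mean = K_s.T @ alpha                                     # predictive mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(kernel(X_star, X_star)) - np.sum(v**2, axis=0)   # predictive variance
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X) * np.log(2 * np.pi))            # log marginal likelihood
    return mean, var, log_ml
```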

SLIDE 65

Results

SLIDE 66

Results using Netlab, Sin function

SLIDE 67

Results using Netlab, Sin function

Increased # of training points

SLIDE 68

Results using Netlab, Sin function

Increased noise

SLIDE 69

Results using Netlab, Sinc function

SLIDE 70

Applications: Sensor placement

Temperature modeling with GP

Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A.Krause, A. Singh, and C. Guestrin, Journal of Machine Learning Research (2008)

SLIDE 71

Applications: Sensor placement

An example of placements chosen using entropy and mutual information criteria on temperature data. Diamonds indicate the positions chosen using entropy; squares the positions chosen using MI.

Entropy criterion:
Mutual information criterion:

SLIDE 72

What You Should Know

  • Properties of the Multivariate Gaussian distribution
  • Gaussian process = Bayesian Ridge Regression
  • GP Algorithm
  • GP application in active learning

SLIDE 73

Thanks for the Attention! ☺