SLIDE 1

Active Learning

SPiNCOM reading group

Sep. 30th, 2016

Dimitris Berberidis

SLIDE 2

A toy example: Alien fruits

• Consider alien fruits of various shapes
• Train a classifier to distinguish safe fruits from dangerous ones
• Passive learning: training data are given by uniform sampling and labeling
• Our setting:

  • Obtaining labels costly
  • Unlabeled instances easily available

SLIDE 3

A toy example: Alien fruits

• What if we sample fruits smartly instead of randomly?
• The decision boundary can then be identified using far fewer labeled samples

SLIDE 4

Active learning

• General goal: for a given budget of labeled training data, maximize the learner’s accuracy by actively selecting which instances (feature vectors) to label (“query”)
• Active learning (AL) scenarios considered:
  • Query synthesis: first to be considered, often not applicable
  • Selective sampling: ideal for online settings with streaming data
  • Pool-based sampling: more general, OUR FOCUS
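The pool-based loop can be sketched on a toy 1-D version of the fruit example. Everything below (the threshold hypothesis class, the simulated oracle, the budget, the nearest-to-boundary query rule) is an illustrative assumption, not the deck's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D pool: "fruits" with a single shape feature; true label = feature > 0.
pool = rng.uniform(-1.0, 1.0, size=100)

def oracle(x):
    """Costly labeling oracle (simulated here)."""
    return int(x > 0.0)

def fit_threshold(xs, ys):
    """Fit a 1-D threshold classifier: midpoint between the class extremes."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        return 0.0
    return 0.5 * (min(pos) + max(neg))

labeled_x, labeled_y = [], []
budget = 10

# Seed with one random query, then repeatedly query the pool point closest
# to the current decision threshold (an uncertainty-sampling heuristic).
idx = rng.integers(len(pool))
labeled_x.append(pool[idx]); labeled_y.append(oracle(pool[idx]))
pool = np.delete(pool, idx)

for _ in range(budget - 1):
    theta = fit_threshold(labeled_x, labeled_y)
    idx = int(np.argmin(np.abs(pool - theta)))   # most uncertain instance
    labeled_x.append(pool[idx]); labeled_y.append(oracle(pool[idx]))
    pool = np.delete(pool, idx)

theta = fit_threshold(labeled_x, labeled_y)
print(f"estimated threshold: {theta:.3f}")       # true boundary is 0.0
```

With the same budget spent on uniform random queries, the labeled points would mostly land far from the boundary, which is the contrast the alien-fruit slides make.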

SLIDE 5

Roadmap

• Uncertainty sampling
• Searching the hypothesis space
  • Query by disagreement
  • Query by committee
• Expected error minimization
  • Expected error reduction
  • Variance reduction
  • Batch queries and submodularity
• Cluster-based AL
• AL + semi-supervised learning
• A unified view
• Conclusions

Burr Settles, “Active Learning”, Synthesis Lectures on AI and ML, 2012.

SLIDE 6

Uncertainty sampling

• Most popular AL method: intuitive, easy to implement
• Support vector classifier: uncertain about points close to the decision boundary

SLIDE 7

Measures of uncertainty

• Uncertainty of the label $y$ as modeled by $P_\theta(y \mid x)$ (e.g. the sigmoid output for logistic regression)
• Least confident: $x^* = \arg\max_x \, 1 - P_\theta(\hat{y} \mid x)$, where $\hat{y} = \arg\max_y P_\theta(y \mid x)$
• Least margin: $x^* = \arg\min_x \, P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$, where $\hat{y}_1, \hat{y}_2$ are the two most probable labels
• Highest entropy: $x^* = \arg\max_x -\sum_y P_\theta(y \mid x) \log P_\theta(y \mid x)$
• Limitation: utility scores are based on the output of a single (possibly bad) hypothesis
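The three scores can be computed directly from a matrix of posterior label probabilities; the numbers below are made up for illustration, and show that the measures need not agree on which instance to query:

```python
import numpy as np

def least_confident(P):
    """1 - P(y_hat | x) per row; higher = more uncertain."""
    return 1.0 - P.max(axis=1)

def margin(P):
    """Gap between the two most probable labels; lower = more uncertain."""
    part = np.sort(P, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(P):
    """Shannon entropy of the label distribution; higher = more uncertain."""
    return -(P * np.log(P + 1e-12)).sum(axis=1)

# Posterior probabilities P(y | x) for 3 candidate instances, 3 classes.
P = np.array([[0.10, 0.80, 0.10],    # fairly confident
              [0.40, 0.40, 0.20],    # top two labels tied
              [0.34, 0.33, 0.33]])   # close to uniform
query_lc  = int(np.argmax(least_confident(P)))
query_m   = int(np.argmin(margin(P)))
query_ent = int(np.argmax(entropy(P)))
print(query_lc, query_m, query_ent)   # → 2 1 2
```

The margin criterion singles out instance 1 (a two-way tie), while least-confident and entropy pick the near-uniform instance 2.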

SLIDE 8

Searching through the hypothesis space

• Version space $\mathcal{V}$: subset of all hypotheses consistent with the training data
• Duality: instance points in the input space $\mathcal{X}$ correspond to hyperplanes in the hypothesis space $\mathcal{H}$
  • Max-margin methods (e.g. SVMs) lead to hypotheses in the center of $\mathcal{V}$
  • Labeling instances close to the decision hyperplane approx. bisects $\mathcal{V}$
  • Instances that greatly reduce the volume of $\mathcal{V}$ are of interest

SLIDE 9

Query by disagreement

• One of the oldest AL algorithms [Cohn et al., ‘94]
• “Store” the version space implicitly with the following trick: an instance lies in the region of disagreement iff each of its two candidate labels is consistent with the labeled data
• Limitations: too complex, and all controversial instances are treated equally
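A minimal sketch of that disagreement test for 1-D threshold classifiers $h(x) = \mathbb{1}[x > t]$; the hypothesis class and the data are illustrative assumptions, not the algorithm as stated on the slide:

```python
def consistent(xs, ys, forced_x=None, forced_y=None):
    """Does any threshold t classify all (x, y) pairs correctly?"""
    pairs = list(zip(xs, ys))
    if forced_x is not None:
        pairs.append((forced_x, forced_y))
    pos = [x for x, y in pairs if y == 1]
    neg = [x for x, y in pairs if y == 0]
    # A consistent threshold exists iff all negatives lie below all positives.
    return (not pos) or (not neg) or max(neg) < min(pos)

def in_disagreement_region(xs, ys, x):
    """x is controversial iff BOTH candidate labels fit the labeled data,
    i.e. the version space contains hypotheses that disagree on x."""
    return consistent(xs, ys, x, 1) and consistent(xs, ys, x, 0)

labeled_x = [-0.8, -0.3, 0.4, 0.9]
labeled_y = [0, 0, 1, 1]           # true threshold lies somewhere in (-0.3, 0.4)
print(in_disagreement_region(labeled_x, labeled_y, 0.0))    # inside the gap
print(in_disagreement_region(labeled_x, labeled_y, -0.7))   # clearly negative
```

No explicit enumeration of the (infinite) version space is needed, which is the point of the trick.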

SLIDE 10

Query by committee

• Independently train a committee of hypotheses $\{\theta^{(1)}, \dots, \theta^{(C)}\}$
• Label the instance that is most controversial among committee members:
  • Vote entropy: $x^* = \arg\max_x -\sum_y \frac{V(y)}{C} \log \frac{V(y)}{C}$, where $V(y)$ counts the committee votes for label $y$
  • Soft vote entropy: $x^* = \arg\max_x -\sum_y P_C(y \mid x) \log P_C(y \mid x)$, with consensus $P_C(y \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y \mid x)$
  • KL divergence: $x^* = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} D_{\mathrm{KL}}\big(P_{\theta^{(c)}}(y \mid x) \,\|\, P_C(y \mid x)\big)$
• Key difference: vote entropy (VE) cannot distinguish between case (a) and (b), while the soft measures can
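Hard-vote entropy is easy to sketch in numpy; the committee votes below are fabricated for illustration:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Hard-vote entropy; votes has one row per instance, one column per
    committee member, entries are predicted class labels."""
    C = votes.shape[1]
    H = np.zeros(votes.shape[0])
    for y in range(n_classes):
        v = (votes == y).sum(axis=1) / C       # vote fraction V(y)/C
        H -= v * np.log(v + 1e-12)
    return H

# Committee of 4 classifiers voting on 3 candidate instances (3 classes).
votes = np.array([[0, 0, 0, 0],    # unanimous -> zero entropy
                  [0, 0, 1, 1],    # 2-2 split
                  [0, 1, 2, 2]])   # most spread-out disagreement
H = vote_entropy(votes, n_classes=3)
query = int(np.argmax(H))
print(query)   # → 2, the most controversial instance
```

The soft variant would replace the vote fractions with the averaged predicted distributions $P_C(y \mid x)$, which is what lets it separate cases that hard votes cannot.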

SLIDE 11

Information theoretic interpretation

• Ideally, maximize the mutual information between the label r.v. $y$ and the model parameters $\theta$: $x^* = \arg\max_x I(y; \theta \mid x, \mathcal{L})$
• The problem can be reformulated in a more convenient form: $I(y; \theta) = H(y) - \mathbb{E}_{\theta}\left[ H(y \mid \theta) \right]$
  • Uncertainty sampling focuses on maximizing only the first term $H(y)$
  • QBC approximates $H(y)$ with the consensus entropy and the second term with the committee average $\frac{1}{C} \sum_{c=1}^{C} H(y \mid \theta^{(c)})$
• Another alternative formulation (recall KL-based QBC): $I(y; \theta) = \mathbb{E}_{\theta}\left[ D_{\mathrm{KL}}\big(P(y \mid \theta) \,\|\, P(y)\big) \right]$, which directly measures disagreement

SLIDE 12

Bound on label complexity

• Label complexity for passive learning (assume the realizable case): to achieve expected error rate $\mathbb{E}[\mathrm{err}(h)] \le \epsilon$, one needs on the order of $m = O\!\left( \frac{d}{\epsilon} \log \frac{1}{\epsilon} \right)$ samples, where $\epsilon$ is the expected error rate and the VC dimension $d$ measures the complexity of $\mathcal{H}$
• QBD achieves logarithmically lower label complexity, $O\!\left( \theta\, d \log \frac{1}{\epsilon} \right)$, if the disagreement coefficient $\theta$ does not explode
• Disagreement coefficient $\theta$: quantifies how fast the region of disagreement shrinks
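A quick plug-in comparison makes the gap concrete (the numbers are illustrative, not from the slides): take VC dimension $d = 10$ and target error $\epsilon = 10^{-2}$.

```latex
\[
m_{\text{passive}} = O\!\left(\tfrac{d}{\epsilon}\log\tfrac{1}{\epsilon}\right)
 \approx \tfrac{10}{10^{-2}} \ln 10^{2} \approx 4.6 \times 10^{3},
\qquad
m_{\text{QBD}} = O\!\left(\theta\, d \log\tfrac{1}{\epsilon}\right)
 \approx \theta \cdot 10 \cdot \ln 10^{2} \approx 46\,\theta,
\]
\[
\text{so for bounded } \theta \text{ the label count drops by roughly a factor } 1/\epsilon .
\]
```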

SLIDE 13

Alien fruit example: A problematic case

• Generally: both uncertainty sampling and QBD may suffer high generalization error
• Candidate queries A and B both bisect the version space (appear equally informative)
  • However, the generalization error depends on the (ignored) distribution of the input

SLIDE 14

Expected error reduction

• Ideally, select the query that minimizes the expected generalization error of the retrained model
• Less stringent objective, the expected log-loss: $x^* = \arg\min_x \sum_y P_\theta(y \mid x) \left( - \sum_{u=1}^{U} \sum_{y'} P_{\theta^{+(x,y)}}(y' \mid x_u) \log P_{\theta^{+(x,y)}}(y' \mid x_u) \right)$, where $\theta^{+(x,y)}$ denotes the model retrained with $(x, y)$ added to the training set
• (Extremely) high complexity: the model must be retrained for each candidate query and each possible label
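The retraining loop can be made concrete with a toy 1-D logistic model. The model, the gradient-descent fit, and the pool are illustrative assumptions; note the nested retraining (one fit per candidate instance and per candidate label), which is exactly the complexity bottleneck mentioned above:

```python
import numpy as np

def fit_logreg(X, y, steps=200, lr=0.5):
    """Tiny 1-D logistic regression (weight + bias) fit by gradient descent."""
    w = b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * X + b)))
        g = p - y                      # gradient of the log-loss w.r.t. w*x + b
        w -= lr * (g * X).mean()
        b -= lr * g.mean()
    return w, b

def predict(w, b, X):
    return 1.0 / (1.0 + np.exp(-(w * X + b)))

def expected_pool_entropy(Xl, yl, U, x):
    """Expected (over the unknown label of x) total predictive entropy on the
    pool U after retraining with (x, y) added -- the expected log-loss score."""
    w, b = fit_logreg(Xl, yl)
    p1 = predict(w, b, np.array([x]))[0]          # current P(y = 1 | x)
    score = 0.0
    for y, py in ((1.0, p1), (0.0, 1.0 - p1)):
        w2, b2 = fit_logreg(np.append(Xl, x), np.append(yl, y))   # retrain
        p = predict(w2, b2, U)
        H = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
        score += py * H.sum()
    return score

Xl = np.array([-1.0, -0.5, 0.5, 1.0])             # labeled instances
yl = np.array([0.0, 0.0, 1.0, 1.0])
U = np.linspace(-1.0, 1.0, 21)                    # unlabeled pool
scores = np.array([expected_pool_entropy(Xl, yl, U, x) for x in U])
query = float(U[np.argmin(scores)])               # minimize expected future entropy
```

Even on this tiny example the score requires $2\,|U|$ full retrains, one per (candidate, label) pair.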

SLIDE 15

Variance reduction

• Focus on minimizing the variance of predictions on the unlabeled data
  • Design-of-experiments approach (typically for regression)
• The learner’s expected error can be decomposed as noise + bias + variance: $\mathbb{E}\left[ (\hat{y} - y)^2 \right] = \underbrace{\mathbb{E}\left[ (y - \mathbb{E}[y \mid x])^2 \right]}_{\text{noise}} + \underbrace{\left( \mathbb{E}[\hat{y}] - \mathbb{E}[y \mid x] \right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[ (\hat{y} - \mathbb{E}[\hat{y}])^2 \right]}_{\text{variance}}$
• Noise is independent of the training data and bias is due to the model class (e.g. a linear model), so querying can only reduce the variance term
• Question: can we minimize variance without retraining?

SLIDE 16

Optimal experimental design

• Fisher information matrix (FIM) of the model: $\mathbf{I}(\theta) = \mathbb{E}\left[ \nabla_\theta \log P_\theta(y \mid x)\, \nabla_\theta \log P_\theta(y \mid x)^{\top} \right]$, the expected outer product of Fisher scores
• The covariance of the parameter estimates is lower bounded by the inverse FIM (Cramér–Rao bound)
• A-optimal design: select queries that minimize $\operatorname{tr}\!\left( \mathbf{I}(\theta)^{-1} \right)$
• Can easily be adapted to minimize the variance of predictions instead (Fisher information ratio)
• The FIM is additive across samples, so it can be efficiently updated after each query using the Woodbury matrix identity
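One way to see the additivity-plus-Woodbury point: for linear regression with unit noise the FIM is $\mathbf{M} = \sum_i x_i x_i^\top$, and its inverse updates by the rank-one (Sherman–Morrison) special case of the Woodbury identity. The ridge initialization and the Gaussian pool below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
pool = rng.normal(size=(50, d))        # candidate instances (feature vectors)

def sherman_morrison(M_inv, x):
    """Inverse of (M + x x^T) from M^{-1}: rank-one case of Woodbury."""
    Mx = M_inv @ x
    return M_inv - np.outer(Mx, Mx) / (1.0 + x @ Mx)

# Start from a small ridge term so the inverse exists before any queries.
M_inv = np.eye(d) / 1e-3
hist = []
for _ in range(5):
    # A-optimality: query the instance that most reduces tr(M^{-1}),
    # the bound on the total variance of the parameter estimates.
    traces = [np.trace(sherman_morrison(M_inv, x)) for x in pool]
    idx = int(np.argmin(traces))
    M_inv = sherman_morrison(M_inv, pool[idx])
    pool = np.delete(pool, idx, axis=0)
    hist.append(np.trace(M_inv))

print(hist)   # total-variance bound shrinks with every query
```

Each candidate is scored in $O(d^2)$ instead of the $O(d^3)$ a fresh inversion would cost.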

SLIDE 17

Batch queries and submodularity

• Query a batch of instances at a time
  • Not necessarily the individually best ones
  • Key is to avoid correlated instances
• Maximizing the variance difference can be submodular
• Submodularity (diminishing returns) for functions over sets: for all $A \subseteq B$ and $x \notin B$, $f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)$
• A greedy approach on a monotone submodular function guarantees $f(S_{\mathrm{greedy}}) \ge \left(1 - \frac{1}{e}\right) f(S^{*})$
• For linear regression the FIM is independent of $\theta$ (offline computation!)
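A sketch of greedy batch selection on a monotone submodular surrogate, here $f(S) = \log\det\!\big(\mathbf{I} + \sum_{i \in S} x_i x_i^\top\big)$ (a D-optimal-style choice used for illustration, not necessarily the slide's exact objective). The pool is rigged so that the greedy rule demonstrably skips the near-duplicate instance:

```python
import numpy as np

d = 2
# Pool with two nearly identical directions plus one orthogonal one.
pool = np.array([[1.00, 0.00],
                 [0.99, 0.01],
                 [0.00, 1.00]])

def f(S):
    """Monotone submodular set function: logdet(I + sum_{i in S} x_i x_i^T)."""
    M = np.eye(d) + sum(np.outer(pool[i], pool[i]) for i in S)
    return np.linalg.slogdet(M)[1]

# Greedy batch selection: at each step add the instance with the largest
# marginal gain f(S + {i}) - f(S).
S = []
for _ in range(2):
    cands = [i for i in range(len(pool)) if i not in S]
    gains = [f(S + [i]) - f(S) for i in cands]
    S.append(cands[int(np.argmax(gains))])

print(S)   # the two uncorrelated directions, not the near-duplicate
```

Instance 1 is individually almost as good as instance 0, but its marginal gain collapses once instance 0 is in the batch, which is the "avoid correlated instances" point above.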

SLIDE 18

Density-weighted methods

• Back to classification: error and variance reduction are less sensitive to outliers, but costly
• Pathological case: the least confident (most uncertain) instance is an outlier
  • B is in fact more informative than A
• Information density heuristic: weight an information utility score $\phi(x)$ (e.g. entropy) by average similarity to the pool (e.g. based on Euclidean distance): $x^* = \arg\max_x \phi(x) \left( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}(x, x_u) \right)^{\beta}$
  • Instances more representative of the input distribution are promoted
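The heuristic can be sketched with a Gaussian similarity kernel (one possible choice of $\mathrm{sim}$; the pool and the utility scores below are fabricated so that plain uncertainty would pick the outlier):

```python
import numpy as np

def information_density(phi, X, beta=1.0):
    """Density-weighted utility: phi(x) * (mean similarity to the pool)^beta.
    Similarity here is a Gaussian kernel on Euclidean distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = np.exp(-D ** 2)
    density = sim.mean(axis=1)        # average similarity, self included
    return phi * density ** beta

# Pool: a dense cluster near the origin plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
phi = np.array([0.5, 0.5, 0.5, 0.9])  # raw uncertainty favors the outlier
scores = information_density(phi, X)
query = int(np.argmax(scores))
print(query)   # → 0: a cluster point wins despite lower raw uncertainty
```

Raising `beta` strengthens the density term; `beta = 0` recovers plain uncertainty sampling.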

SLIDE 19

Hierarchical cluster-based AL

• Assist AL by clustering the input space
  • Obtain data and find an initial coarse clustering
  • Query instances from different clusters
  • Iteratively refine clusters so that they become more “pure”
  • Focus querying on the more impure clusters
• Working assumption: cluster structure is correlated with label structure
  • If not, the above algorithm degrades to random sampling

SLIDE 20

Active and semi-supervised learning

• The two approaches are complementary:
  • AL minimizes labeling effort by querying the most informative instances
  • Semi-supervised learning exploits the latent structure of unlabeled data to improve accuracy
• Self-training is complementary to uncertainty sampling [Yarowsky, ‘95]
• Co-training is complementary to QBD [Blum and Mitchell, ‘98]
• Entropy regularization is complementary to error reduction with log-loss

SLIDE 21

Unified view (I)

• Ideal: maximize the total gain in information provided by the query
• Since the true label is unknown, one resorts to the expectation over $P_\theta(y \mid x)$
• Approximations of this objective lead to the uncertainty sampling heuristic

SLIDE 22

Unified view (II)

• A different approximation splits off a term that depends only on the current state of $\theta$ and is unchanged for all queries
• Log-loss minimization and variance reduction target the resulting measure
• A further approximation of it is given by the density-weighted methods

SLIDE 23


Overview

SLIDE 24

Practical considerations

• Real labeling costs
  • Cost of annotating a specific query
  • Cost of prediction
• Skewed label distributions (class imbalance)
• Unreliable oracles (e.g. labels given by human experts)
• When AL is used, the training data are biased toward the model class
  • If unsure about the model, random sampling may be preferable
• Multi-task AL (multiple labels per instance)

SLIDE 25

Conclusions

• AL allows for sample (label) complexity reduction
  • Simple heuristics: uncertainty sampling, QBD, QBC, cluster-based AL
  • High-complexity near-optimal methods: expected error/variance reduction
  • Encompasses optimal experimental design
  • Linked to semi-supervised learning
  • Information-theoretic interpretations
• Possible research directions
  • Use of AL methods in learning over graphs (GSP, classification over graphs)
  • Use of MCMC and IS to approximate the posterior in complex models (e.g. BMRF)