Bayesian Batch Active Learning as Sparse Subset Approximation


SLIDE 1

Bayesian Batch Active Learning as Sparse Subset Approximation

Research Talk October 2019

Robert Pinsler Jonathan Gordon Eric Nalisnick José Miguel Hernández-Lobato

SLIDE 2

Introduction

  • Acquiring labels for supervised learning can be costly and time-consuming
  • In such settings, active learning (AL) enables data-efficient model training by intelligently selecting points for which labels should be requested

SLIDE 3

Introduction

[Figure: pool-based active learning (AL): the model selects queries from the unlabeled pool set, an oracle provides labels, and the labeled training set is used to train the model]

SLIDE 4

Introduction

Sequential AL loop

[Figure: sequential AL loop: train model → select data point → query single data point → update model]

SLIDE 5

Introduction

Batch AL approaches:

  • scale to large datasets and models
  • enable parallel data acquisition
  • (ideally) trade off diversity and representativeness

How to construct such a batch?

SLIDE 6

Bayesian Batch Active Learning

Bayesian approach: Choose the set of points that maximally reduces uncertainty over the parameter posterior

  • NP-hard, but greedy approximations exist: MaxEnt, BALD
  • Naïve batch strategy: Select the b best points according to the acquisition function
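
For reference, the two greedy acquisition functions named above, in their standard form (a sketch; the notation below is ours, not the slides'):

```latex
% Standard forms of the two acquisition functions (our notation):
% predictive entropy (MaxEnt) and expected information gain (BALD).
a_{\mathrm{MaxEnt}}(x) = \mathbb{H}\!\left[y \mid x, \mathcal{D}\right], \qquad
a_{\mathrm{BALD}}(x) = \mathbb{H}\!\left[y \mid x, \mathcal{D}\right]
  - \mathbb{E}_{p(\theta \mid \mathcal{D})}\!\left[\mathbb{H}\!\left[y \mid x, \theta\right]\right].
```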
SLIDE 7

Bayesian Batch Active Learning

[Figure: points selected by MaxEnt and BALD; budget is wasted on selecting nearby points]

SLIDE 8

Bayesian Batch Active Learning

Idea: Re-cast batch construction as optimizing a sparse subset approximation to the complete data posterior

SLIDE 9

Bayesian Batch Active Learning

[Figure: points selected by MaxEnt, BALD, and Ours]

Idea: Re-cast batch construction as optimizing a sparse subset approximation to the complete data posterior

SLIDE 10
Related Work: Bayesian coresets

  • Coreset: Summarize data by a sparse, weighted subset
  • Bayesian coreset: Approximate the posterior by a sparse, weighted subset

Idea: Re-cast batch construction as optimizing a sparse subset approximation to the complete data posterior. We take inspiration from Bayesian coresets.

SLIDE 11
Related Work: Bayesian coresets

  • Batch AL with Bayesian coresets: Batch = Bayesian coreset

SLIDE 12

Batch Construction as Sparse Subset Approximation

Choose a batch such that the updated posterior best approximates the complete data posterior
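
Written out, the matching objective is plausibly the following (our reconstruction of the formula missing from the transcript; the notation is assumed: D_0 is the current labeled set, D_pool the pool, D' the batch):

```latex
% Reconstruction of the batch-selection target (assumed notation):
% pick a batch D' whose posterior matches the full-pool posterior.
\mathcal{D}' \subseteq \mathcal{D}_{\mathrm{pool}}, \quad |\mathcal{D}'| \le b,
\qquad \text{such that} \qquad
p(\theta \mid \mathcal{D}_0 \cup \mathcal{D}')
\;\approx\;
p(\theta \mid \mathcal{D}_0 \cup \mathcal{D}_{\mathrm{pool}}).
```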

SLIDE 13

Batch Construction as Sparse Subset Approximation

Problem: we don't know the labels of the points in the pool set before querying them

SLIDE 14

Batch Construction as Sparse Subset Approximation

Since the labels are unknown, take the expectation w.r.t. the current predictive posterior distribution:
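
The slide's formula did not survive extraction; the standard way to write this expectation (our reconstruction, notation assumed) is:

```latex
% Expected log-likelihood of pool point x_n under the current
% predictive posterior (reconstructed; not copied from the slide):
\mathcal{L}_n(\theta)
  = \mathbb{E}_{p(y \mid x_n, \mathcal{D}_0)}
    \left[ \log p(y \mid x_n, \theta) \right].
```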



SLIDE 18

Batch Construction as Sparse Subset Approximation: Hilbert coresets

  • Considers the directionality of the residual error → adaptively construct the batch while accounting for similarity between data points (induced by the norm)
  • Still intractable!
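
The Hilbert-coreset formulation itself was only on the slide images; a plausible reconstruction (assumed notation, with the full-pool sum written as the sum of the expected log-likelihoods above, and batch size b) is:

```latex
% Sparse approximation of the full expected log-likelihood in a
% Hilbert space (reconstructed, not from the transcript):
\min_{w} \;
\left\| \mathcal{L} - \mathcal{L}(w) \right\|^2,
\qquad
\mathcal{L} = \sum_{n=1}^{N} \mathcal{L}_n, \quad
\mathcal{L}(w) = \sum_{n=1}^{N} w_n \mathcal{L}_n,
\qquad
w_n \in \{0, 1\}, \;\; \sum_{n=1}^{N} w_n \le b.
```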

SLIDE 20

Batch Construction as Sparse Subset Approximation: Frank-Wolfe optimization

1. Relax constraints

SLIDE 21

Batch Construction as Sparse Subset Approximation: Frank-Wolfe optimization

2. Apply the Frank-Wolfe algorithm
  • Geometrically motivated convex optimization algorithm
  • Iteratively selects the vector most aligned with the residual error
  • Corresponds to adding at most one data point to the batch in every iteration

SLIDE 22

Batch Construction as Sparse Subset Approximation: Frank-Wolfe optimization

3. Project the continuous weights back to the feasible space (i.e. binarize them)

(A code sketch of the full relax-optimize-project procedure follows below.)
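A minimal Python sketch of these three steps, assuming each pool point n has already been embedded as a vector (a row of `L`) whose Euclidean norm plays the role of the Hilbert norm; this is an illustration under those assumptions, not the authors' implementation:

```python
import numpy as np

def acs_fw(L, budget):
    """Sketch of batch construction via Frank-Wolfe.

    L: (N, J) array; row n is an embedding of the expected
       log-likelihood of pool point n (assumed given).
    Returns the indices of the selected batch (size <= budget).
    """
    N = L.shape[0]
    sigma = np.linalg.norm(L, axis=1)       # per-point norms ||L_n||
    target = L.sum(axis=0)                  # vector for the full pool
    w = np.zeros(N)                         # relaxed (continuous) weights

    for _ in range(budget):
        residual = target - L.T @ w         # current approximation error
        # step 2: pick the direction most aligned with the residual
        n = int(np.argmax(L @ residual / np.maximum(sigma, 1e-12)))
        vertex = np.zeros(N)
        vertex[n] = sigma.sum() / sigma[n]  # vertex of the relaxed polytope
        # exact line search for the step size gamma in [0, 1]
        d = L.T @ (vertex - w)
        gamma = 0.0 if d @ d == 0 else float(np.clip(d @ residual / (d @ d), 0.0, 1.0))
        w += gamma * (vertex - w)           # Frank-Wolfe update

    # step 3: binarize; every point with non-zero weight enters the batch
    return np.flatnonzero(w > 0)
```

Each iteration can move weight onto at most one new point, so after `budget` iterations at most `budget` points have non-zero weight.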

SLIDE 23

Batch Construction as Sparse Subset Approximation: Frank-Wolfe optimization

Which norm is appropriate?

SLIDE 24

Batch Construction as Sparse Subset Approximation: Choice of Inner Products

The norm is induced by an inner product, e.g.

1. Weighted Fisher inner product
  + Leads to simple, interpretable expressions for linear models
  - Requires taking gradients w.r.t. parameters
  - Scales quadratically with pool set size
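
A plausible form of this inner product (our reconstruction from the standard construction; the paper's exact definition may differ):

```latex
% Weighted Fisher inner product between expected log-likelihoods
% (reconstructed; \hat\pi denotes the current posterior approximation):
\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat\pi, F}
  = \mathbb{E}_{\hat\pi(\theta)}
    \left[ \nabla_\theta \mathcal{L}_n(\theta)^{\top}
           \nabla_\theta \mathcal{L}_m(\theta) \right].
```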
SLIDE 25

Batch Construction as Sparse Subset Approximation: Choice of Inner Products

Example: Linear regression

SLIDE 26

Batch Construction as Sparse Subset Approximation: Choice of Inner Products

1. Weighted Fisher inner product (continued)
  • Connections to BALD, leverage scores and influence functions
  • Probit regression also yields an interpretable closed-form solution

SLIDE 27

Batch Construction as Sparse Subset Approximation: Choice of Inner Products

2. Weighted Euclidean inner product
  + Only requires tractable likelihood computations
  + Scalable to large pool sets (linearly) and to complex, non-linear models through random projections
  - No gradient information utilized
SLIDE 28

Batch Construction as Sparse Subset Approximation: Choice of Inner Products

2. Weighted Euclidean inner product: J-dimensional random projection in Euclidean space
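
One way such a projection can look in code (a sketch under our assumptions: `expected_loglik` and the posterior samples are placeholders, and the scaling follows the usual Monte-Carlo construction):

```python
import numpy as np

def project_pool(expected_loglik, theta_samples, X_pool):
    """Hypothetical J-dimensional Euclidean projection of the pool.

    expected_loglik(x, theta): expected log-likelihood of point x under
        the current predictive posterior, evaluated at parameters theta.
    theta_samples: J parameter samples from the current posterior.
    Returns an (N, J) array usable as `L` in the Frank-Wolfe sketch.
    """
    J = len(theta_samples)
    feats = np.array([[expected_loglik(x, th) for th in theta_samples]
                      for x in X_pool])
    return feats / np.sqrt(J)   # Monte-Carlo scaling
```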

SLIDE 29

Experimental Setup

(i) Does our approach avoid correlated queries? → closed form
(ii) Is our method competitive in the small-data regime? → closed form
(iii) Does our method scale to large datasets and models? → projections

SLIDE 30

Experimental Setup

Model: Neural Linear
  • Deterministic feature extractor (e.g. ConvNet)
  • Stochastic fully connected layer
  • Exact inference (regression); mean-field VI (classification)
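
For the regression case, exact inference over the stochastic layer is just Bayesian linear regression on the extracted features; a minimal sketch (assuming fixed features `Phi` from the extractor, prior precision `alpha`, noise precision `beta`):

```python
import numpy as np

def neural_linear_posterior(Phi, y, alpha=1.0, beta=1.0):
    """Exact posterior over the output-layer weights (regression case).

    Phi: (N, D) features from the deterministic extractor (assumed given).
    Prior w ~ N(0, alpha^{-1} I); Gaussian noise with precision beta.
    Returns the posterior mean and covariance of the weights.
    """
    D = Phi.shape[1]
    S_inv = alpha * np.eye(D) + beta * Phi.T @ Phi   # posterior precision
    S = np.linalg.inv(S_inv)                         # posterior covariance
    m = beta * S @ Phi.T @ y                         # posterior mean
    return m, S
```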

SLIDE 31

Experiments: Probit Regression

Does our approach avoid correlated queries?

[Figure: points queried by BALD vs. ACS-FW]

SLIDE 32

[Figure build continues: BALD vs. ACS-FW] No change

SLIDE 33

[Figure build continues] Rotates in data space

SLIDE 34

[Figure build continues] And again...

SLIDE 35

[Figure build concludes] ACS-FW queries a diverse batch of points

SLIDE 36

Experiments: Regression

Is our method competitive in the small-data regime?

SLIDE 37

Experiments: Regression

Is our method competitive in the small-data regime?

Competitive on small data, even more beneficial for larger N

SLIDE 38

Experiments: Regression

Does our method scale to large datasets and models?

SLIDE 39

Experiments: Classification

Does our method scale to large datasets and models?

Enables efficient AL at scale, without any sacrifice in performance

SLIDE 40

Conclusion

Introduced a novel Bayesian batch AL approach

  • Based on sparse subset approximations
  • Produces diverse batches, enabling efficient AL at scale
  • Yields interpretable closed-form solutions
  • Generalizes to arbitrary models using random projections

Future Work

  • Leverage the Frank-Wolfe weights in a more principled way
  • Investigate interactions with other approximate inference methods
  • Apply to continual learning