Conditioning by adaptive sampling for robust design, David Brookes (presentation slides)



SLIDE 1

Conditioning by adaptive sampling for robust design

David Brookes, Biophysics Graduate Group, University of California, Berkeley

Jennifer Listgarten, EECS and Center for Computational Biology, University of California, Berkeley

SLIDE 6

Motivating problem: design protein sequences

  • Proteins are made up of sequences of amino acids (20 possibilities)
  • Huge variety of proteins whose function we would like to improve

Proteins that fluoresce … that act as drugs … that fix carbon from the atmosphere … that deliver gene-editing tools to tissues

SLIDE 7

How to map sequence to function?

A law of molecular biology: Sequence → Structure → Function (ex: fluorescence)

Hughes A, Mort M, Carlisle F, et al. B04 Alternative Splicing In Htt. Journal of Neurology, Neurosurgery & Psychiatry 2014;85:A10. http://www.rcsb.org/structure/6FWW

SLIDE 8

Bypassing the structure relationships

A law of molecular biology: Sequence → Structure → Function

High-throughput experiments (& ML) map sequence to function directly

SLIDE 9

Can we solve the inverse problem?

A law of molecular biology: Sequence → Structure → Function

Design problem: Given a model, find sequences with desired function

SLIDE 12

Why is protein design difficult?

  • Huge, rugged search space

⟹ size scales as $20^L$ for sequences of length $L$ (the figure compares it with the number of atoms in the universe and grains of sand on earth)

  • Discrete search space (no gradients)
  • Uncertainty in predictor

https://livingthing.danmackinlay.name/gaussian_processes.html
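To make the scale comparison concrete, the short sketch below finds the sequence length at which the number of possible sequences first exceeds each reference quantity. The two reference counts are rough, commonly quoted order-of-magnitude figures and are assumptions here, not values taken from the slides; the function name is hypothetical.

```python
# Rough order-of-magnitude reference counts (assumptions, not from the slides).
ATOMS_IN_UNIVERSE = 10**80
GRAINS_OF_SAND = 75 * 10**17  # about 7.5e18


def min_length_exceeding(target: int, alphabet_size: int = 20) -> int:
    """Smallest L such that alphabet_size**L > target (exact integer arithmetic)."""
    length = 0
    size = 1
    while size <= target:
        size *= alphabet_size
        length += 1
    return length


print(min_length_exceeding(GRAINS_OF_SAND))     # prints 15
print(min_length_exceeding(ATOMS_IN_UNIVERSE))  # prints 62
```

Under these assumed counts, even a 62-residue sequence, far shorter than most natural proteins, already has more possible variants than the reference atom count.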

SLIDE 14

Possible solution: model-based optimization (MBO)

Idea: replace the standard (hard) objective, an optimization over the space of sequences, with a potentially easier one, an optimization over a model over sequence space
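In symbols, this replacement is commonly written as below. This is a sketch under assumptions: $f$ denotes the function to be maximized, $\mathcal{X}$ the space of sequences, and $p(x \mid \theta)$ a model over sequence space with parameters $\theta$. The exact form shown on the slide may differ.

```latex
\max_{x \in \mathcal{X}} f(x)
\quad\longrightarrow\quad
\max_{\theta} \; \mathbb{E}_{p(x \mid \theta)}\left[ f(x) \right]
```

The right-hand problem searches over continuous model parameters rather than over discrete sequences, which is what makes it potentially easier.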

SLIDE 16

Possible solution: model-based optimization (MBO)

Idea: replace the standard (hard) objective with a potentially easier one

Solution approach is to iterate:

  1. Sample from “search model” $p(x \mid \theta)$
  2. Evaluate samples on $f(x)$
  3. Adjust $\theta$ so the model favors sequences with large function evals

  ✓ Model can sample broad areas of sequence space
  ✓ Does not require gradients of $f$
  ✓ Can incorporate uncertainty
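The three-step loop can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the search model is a position-wise categorical distribution (not a VAE or HMM), the objective `f` counts matches to a hypothetical `TARGET` string, and step 3 refits the model to the top-scoring fraction of samples, which is a cross-entropy-method style update rather than the update used in the talk.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids
LENGTH = 8
TARGET = "ACDEFGHI"  # hypothetical optimum, used only by the toy objective


def f(seq):
    """Toy objective: number of positions that match TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET))


def sample(theta, rng):
    """Draw one sequence from the position-wise categorical search model."""
    return "".join(rng.choices(ALPHABET, weights=probs)[0] for probs in theta)


def refit(elites, pseudocount=0.1):
    """Smoothed maximum-likelihood refit of each position's distribution."""
    theta = []
    for pos in range(LENGTH):
        counts = [pseudocount + sum(seq[pos] == a for seq in elites) for a in ALPHABET]
        total = sum(counts)
        theta.append([c / total for c in counts])
    return theta


def run_mbo(iterations=20, n_samples=200, elite_frac=0.2, seed=0):
    rng = random.Random(seed)
    # Start from a uniform distribution at every position.
    theta = [[1.0 / len(ALPHABET)] * len(ALPHABET) for _ in range(LENGTH)]
    history = []
    for _ in range(iterations):
        samples = [sample(theta, rng) for _ in range(n_samples)]  # step 1: sample
        ranked = sorted(samples, key=f, reverse=True)             # step 2: evaluate
        elites = ranked[: int(elite_frac * n_samples)]
        theta = refit(elites)                                     # step 3: adjust theta
        history.append(sum(f(s) for s in samples) / n_samples)
    return theta, history


theta, history = run_mbo()
print(f"mean f(x): first iteration {history[0]:.2f}, last iteration {history[-1]:.2f}")
```

No gradient of `f` is used anywhere: the model only needs function values at sampled sequences.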

SLIDE 20

First attempt at MBO for protein design: Design by Adaptive Sampling (DbAS)

Our aim is to solve the MBO objective, where

  • $p(x \mid \theta)$ is the search model (VAE, HMM…)
  • $S$ is the desired set of property values

→ e.g. fluorescence > $\alpha$

  • $P(S \mid x)$ is a stochastic predictive model (“oracle”) that maps sequences to property
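With these three ingredients, one natural way to write the objective is as the log-probability of the desired set under the search model. The form below is a reconstruction under that assumption, using only the symbols defined above. The sum runs over all sequences $x$.

```latex
\theta^{*}
= \operatorname*{argmax}_{\theta} \; \log P(S \mid \theta)
= \operatorname*{argmax}_{\theta} \; \log \sum_{x} P(S \mid x)\, p(x \mid \theta)
= \operatorname*{argmax}_{\theta} \; \log \mathbb{E}_{p(x \mid \theta)}\left[ P(S \mid x) \right]
```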

SLIDE 24

Design by Adaptive Sampling (cont.)

Two issues:

  • 1. $\theta$ is in the expectation distribution. → maximize a lower bound
  • 2. MC estimates for rare events. → anneal a sequence of relaxations: $S^{(t)} \to S$, where $S^{(t)} \supset S^{(t+1)}$
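One way to obtain such a nested sequence, sketched here as an assumption rather than the authors' exact schedule, is to define $S^{(t)} = \{y \ge \gamma_t\}$ with a threshold that never decreases. Each iteration takes a quantile of the current samples' oracle values, keeps the running maximum, and caps it at the final target. The function name, the quantile `q`, and the toy batches are hypothetical.

```python
def next_threshold(prev_gamma, oracle_values, q=0.8, target_gamma=float("inf")):
    """Return a threshold >= prev_gamma, capped at the final target threshold."""
    ranked = sorted(oracle_values)
    idx = min(int(q * len(ranked)), len(ranked) - 1)
    quantile = ranked[idx]
    return min(max(prev_gamma, quantile), target_gamma)


gamma = float("-inf")
batches = [
    [0.1, 0.4, 0.2, 0.9, 0.5],
    [0.3, 0.2, 0.1, 0.4, 0.2],
    [1.5, 1.2, 0.8, 2.0, 1.1],
]
thresholds = []
for values in batches:
    gamma = next_threshold(gamma, values, q=0.8, target_gamma=1.0)
    thresholds.append(gamma)
print(thresholds)
```

Because the thresholds are non-decreasing, each set $\{y \ge \gamma_{t+1}\}$ is contained in $\{y \ge \gamma_t\}$, so early iterations face a less rare event than the final target.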

SLIDE 26

Design by Adaptive Sampling (cont.)

Two issues:

  • 1. $\theta$ is in the expectation distribution. → maximize a lower bound (≥)
  • 2. MC estimates for rare events. → anneal and MC

Assumes oracle is unbiased and has good uncertainty estimates
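The two remedies can be written out as follows. This is a sketch that assumes the objective is $\log \mathbb{E}_{p(x \mid \theta)}[P(S \mid x)]$ and applies Jensen's inequality with the distribution $P(S \mid x)\, p(x \mid \theta^{(t)}) / Z^{(t)}$, where $Z^{(t)} = \mathbb{E}_{p(x \mid \theta^{(t)})}[P(S \mid x)]$. The constant does not depend on $\theta$, and $M$ is the number of Monte Carlo samples. The slide's own equations may be arranged differently.

```latex
\begin{aligned}
\log \mathbb{E}_{p(x \mid \theta)}\left[ P(S \mid x) \right]
&\ge \frac{1}{Z^{(t)}} \, \mathbb{E}_{p(x \mid \theta^{(t)})}\left[ P(S \mid x) \log p(x \mid \theta) \right] + \text{const}, \\
\theta^{(t+1)}
&= \operatorname*{argmax}_{\theta} \sum_{i=1}^{M} P(S^{(t)} \mid x_i) \log p(x_i \mid \theta),
\qquad x_i \sim p(x \mid \theta^{(t)}).
\end{aligned}
```

The first line removes $\theta$ from the sampling distribution. The second replaces $S$ with the relaxed set $S^{(t)}$ and the expectation with a sample average, which gives a weighted maximum-likelihood update.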

SLIDE 29

How pathological oracles lead you astray

Figure panels: many training examples (acceptable); fewer training examples (pathological)

SLIDE 30

How pathological oracles lead you astray

Figure panels: acceptable; pathological

Idea: estimate training distribution of $x$ conditioned on high values of oracle

SLIDE 32

Fixing pathological oracles w/ conditioning

Idea: estimate training distribution of $x$ conditioned on high values of oracle

Don’t have access to training distribution, but can build a model $p(x \mid \theta^{(0)})$ to approximate it
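Read as Bayes' rule, the conditioning idea treats the model of the training distribution as a prior and the oracle as a likelihood. The expression below is a sketch under that assumption, with $S$ the desired set of property values and $P(S \mid x)$ the oracle.

```latex
p(x \mid S)
= \frac{P(S \mid x)\, p(x \mid \theta^{(0)})}{P(S)},
\qquad
P(S) = \sum_{x} P(S \mid x)\, p(x \mid \theta^{(0)})
```

A sequence far from the training data has small $p(x \mid \theta^{(0)})$, so it receives little mass in $p(x \mid S)$ even when the oracle is confidently wrong about it.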

SLIDE 33

Conditioning by Adaptive Sampling (CbAS)

Previous formulation: maximize a lower bound (≥), then anneal and MC

New formulation: conditions on $p(x \mid \theta^{(0)})$, which models the training distribution

SLIDE 35

Conditioning by Adaptive Sampling (CbAS)

Previous formulation: maximize a lower bound (≥), then anneal and MC

New formulation (=):

  • Can’t anneal when sampling dist. doesn’t change!

SLIDE 36

Conditioning by Adaptive Sampling (CbAS)

Previous formulation: maximize a lower bound (≥), then anneal and MC

New formulation: rewritten (=) with an importance sampling proposal dist.

SLIDE 37

Conditioning by Adaptive Sampling (CbAS)

Previous formulation: maximize a lower bound (≥), then anneal and MC

New formulation: rewrite with importance sampling (=), then anneal and MC
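The importance sampling step can be sketched as follows, assuming the new formulation is an expectation under the fixed training-distribution model $p(x \mid \theta^{(0)})$ and that the current search model $p(x \mid \theta^{(t)})$ serves as the proposal. The identity holds when the proposal is nonzero wherever the integrand is nonzero, and $M$ is the number of samples.

```latex
\begin{aligned}
\mathbb{E}_{p(x \mid \theta^{(0)})}\left[ P(S \mid x) \log p(x \mid \theta) \right]
&= \mathbb{E}_{p(x \mid \theta^{(t)})}\left[ \frac{p(x \mid \theta^{(0)})}{p(x \mid \theta^{(t)})} \, P(S \mid x) \log p(x \mid \theta) \right], \\
\theta^{(t+1)}
&= \operatorname*{argmax}_{\theta} \sum_{i=1}^{M}
\frac{p(x_i \mid \theta^{(0)})}{p(x_i \mid \theta^{(t)})} \, P(S^{(t)} \mid x_i) \log p(x_i \mid \theta),
\qquad x_i \sim p(x \mid \theta^{(t)}).
\end{aligned}
```

Sampling from a proposal that changes every iteration is what makes annealing possible again. The only difference from the DbAS update is the density ratio in front of the oracle term.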

SLIDE 40

Testing is fundamentally different

  • We don’t trust our oracle and generally can’t query the ground truth
  • We can’t hold out a test set of good sequences
      • Near-zero chance of any of these sequences being found by the method
  • We can’t use some canonical test function as the oracle
      • In our problem it is untrustworthy

SLIDE 43

Testing strategy

  • Simulate a ground truth based on real data

→ “Ground truth” is a GP mean function

  • Ground truth values are sampled from the GP for given sequences
  • Use these input-output pairs to train oracles
  • Coerce training set so these oracles exhibit pathologies
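The data-generation side of this strategy can be sketched in a few lines. The code below makes several labeled assumptions: `ground_truth_mean` is an arbitrary stand-in for the GP mean function (it is not a Gaussian process), noisy draws around it stand in for sampling from the GP, and the training set is coerced by keeping only the lowest-valued fraction of sequences. The slides do not say how the coercion was done, so that choice is hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def ground_truth_mean(seq):
    """Hypothetical stand-in for the GP mean function (not a real GP)."""
    return sum((ALPHABET.index(a) % 5) * 0.1 for a in seq)


def sample_ground_truth(seq, rng, noise_sd=0.05):
    """Noisy observation around the mean, standing in for a draw from the GP."""
    return rng.gauss(ground_truth_mean(seq), noise_sd)


def make_training_set(n, length, keep_below_quantile, rng):
    """Sample (sequence, value) pairs, then keep only the lowest-valued fraction."""
    seqs = ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n)]
    pairs = [(s, sample_ground_truth(s, rng)) for s in seqs]
    pairs.sort(key=lambda p: p[1])
    cutoff = int(keep_below_quantile * n)
    return pairs[:cutoff]


rng = random.Random(0)
train = make_training_set(n=1000, length=10, keep_below_quantile=0.2, rng=rng)
print(len(train), "training pairs; largest training value:", round(train[-1][1], 3))
```

An oracle fit to `train` never sees high-valued sequences, so its predictions in the region a design method wants to reach are extrapolations, which is the kind of pathology the test is meant to provoke.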

SLIDE 45

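Results

Model-based optimizations

Use weighted ML updates with weights:

  • CbAS: $\frac{p(x \mid \theta^{(0)})}{p(x \mid \theta^{(t)})} \, P(S^{(t)} \mid x)$
  • DbAS: $P(S^{(t)} \mid x)$
  • RWR: exponential of the function value $f(x)$ times a scale factor
  • CEM-PI: indicator $\mathbb{1}$ that the probability of improvement $\mathrm{PI}(x)$ clears the current threshold
  • FB-VAE: indicator $\mathbb{1}$ that $f(x)$ clears the current threshold, w/ additional considerations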
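The difference between the CbAS and DbAS weights is easiest to see numerically. The sketch below is illustrative only: the helper names are hypothetical, the oracle is assumed to give a Gaussian predictive distribution so that $P(S^{(t)} \mid x)$ is a normal tail probability, and the log-densities are made-up numbers.

```python
import math


def normal_tail_prob(mean, sd, gamma):
    """P(y >= gamma) for y ~ Normal(mean, sd^2); assumes a Gaussian oracle."""
    return 0.5 * math.erfc((gamma - mean) / (sd * math.sqrt(2.0)))


def dbas_weight(prob_in_set):
    """DbAS weight: P(S^(t) | x)."""
    return prob_in_set


def cbas_weight(log_p0, log_pt, prob_in_set):
    """CbAS weight: p(x | theta^(0)) / p(x | theta^(t)) * P(S^(t) | x)."""
    return math.exp(log_p0 - log_pt) * prob_in_set


# Sample A: far from the training distribution, oracle (over)confident.
prob_a = normal_tail_prob(mean=3.0, sd=0.5, gamma=2.0)
# Sample B: close to the training distribution, oracle less sure.
prob_b = normal_tail_prob(mean=2.0, sd=0.5, gamma=2.0)

print("DbAS:", dbas_weight(prob_a), dbas_weight(prob_b))
print("CbAS:", cbas_weight(-30.0, -10.0, prob_a), cbas_weight(-12.0, -10.0, prob_b))
```

Under DbAS the far-away sample A gets the larger weight. Under CbAS the density ratio shrinks A's weight by many orders of magnitude, so B dominates the update.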

SLIDE 46

Results

Method groups compared: model-based optimizations; gradient descent on latent spaces

SLIDE 48

Results

What does each bar represent?

Figure legend: Oracle, Ground truth. Example sequences: Seq1 … Seq9. Bar markers: Mean; Greater than $i$th percentile.

SLIDE 50

Wrap-up

  • Introduced a new model-based optimization method that is robust to pathological oracles
  • Specifically targeted for discrete design problems
  • Ongoing work to move beyond proof-of-principle:
      • Collaboration with wet-lab to perform end-to-end validation

SLIDE 51

Thanks!
