Conditioning by adaptive sampling for robust design
David Brookes, Biophysics Graduate Group, University of California, Berkeley
Jennifer Listgarten, EECS and Center for Computational Biology, University of California, Berkeley
Motivating problem:
Proteins that fluoresce
… that act as drugs
… that fix carbon from the atmosphere
… that deliver gene-editing tools to tissues
A law of molecular biology:
Hughes A, Mort M, Carlisle F, et al. B04 Alternative Splicing In Htt. Journal of Neurology, Neurosurgery & Psychiatry 2014;85:A10. http://www.rcsb.org/structure/6FWW
ex: fluorescence
High-throughput experiments (& ML)
Design problem: Given a model, find sequences with desired function
⟹ size scales as 20^L for sequences of length L
Scale comparison: atoms in universe, grains of sand on earth
https://livingthing.danmackinlay.name/gaussian_processes.html
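The growth of sequence space with 20 amino acids per position can be made concrete with a few lines of arithmetic. The sketch below is only illustrative: the two reference quantities are rough order-of-magnitude estimates chosen here as assumptions, not figures taken from the slides.

```python
# Rough scale comparison for protein sequence space (20 amino acids per position).
# Both reference constants are order-of-magnitude assumptions for illustration.
ATOMS_IN_UNIVERSE = 10**80
GRAINS_OF_SAND = 75 * 10**17  # about 7.5e18


def sequence_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def smallest_length_exceeding(bound: int, alphabet_size: int = 20) -> int:
    """Smallest sequence length whose sequence space is larger than `bound`."""
    length = 0
    while sequence_space_size(length, alphabet_size) <= bound:
        length += 1
    return length


for length in (10, 50, 100, 300):
    digits = len(str(sequence_space_size(length)))
    print(f"L = {length:3d}: 20^L has {digits} decimal digits")

print("exceeds grains of sand at L =", smallest_length_exceeding(GRAINS_OF_SAND))
print("exceeds atoms in universe at L =", smallest_length_exceeding(ATOMS_IN_UNIVERSE))
```

Python integers are arbitrary precision, so the exact counts can be computed without overflow even for lengths typical of real proteins.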
Idea: replace the standard (hard) objective (a search over, e.g., the space of sequences) with a potentially easier one (a search over a model over sequence space)
Solution approach is to iterate:
1. Sample from the “search model” p(x | θ)
2. Evaluate the samples on f(x)
3. Adjust θ so that the model favors sequences with large function evaluations
✓ Model can sample broad areas of sequence space
✓ Does not require gradients of f
✓ Can incorporate uncertainty
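The three-step loop above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the search model is a factorized categorical distribution over integer-encoded sequences, and step 3 uses a cross-entropy-method-style refit on the top-scoring samples, which is one simple choice and not necessarily the update used in the talk.

```python
import numpy as np


def sample(theta, n, rng):
    """Draw n sequences from a factorized categorical search model p(x | theta)."""
    length, alphabet = theta.shape
    columns = [rng.choice(alphabet, size=n, p=theta[i]) for i in range(length)]
    return np.stack(columns, axis=1)


def fit(samples, alphabet, pseudocount=1.0):
    """Smoothed maximum-likelihood refit of the per-position probabilities."""
    length = samples.shape[1]
    counts = np.full((length, alphabet), pseudocount)
    for i in range(length):
        counts[i] += np.bincount(samples[:, i], minlength=alphabet)
    return counts / counts.sum(axis=1, keepdims=True)


def optimize(f, length, alphabet=20, n_samples=500, elite_frac=0.2,
             n_iters=30, seed=0):
    """Iterate: sample from the model, evaluate f, refit the model on the best samples."""
    rng = np.random.default_rng(seed)
    theta = np.full((length, alphabet), 1.0 / alphabet)  # start from uniform
    for _ in range(n_iters):
        x = sample(theta, n_samples, rng)                 # 1. sample
        scores = np.array([f(s) for s in x])              # 2. evaluate
        cutoff = np.quantile(scores, 1.0 - elite_frac)
        theta = fit(x[scores >= cutoff], alphabet)        # 3. adjust theta
    return theta


if __name__ == "__main__":
    # Hypothetical objective: number of positions matching a hidden target.
    target = np.arange(8) % 20
    theta = optimize(lambda s: int((s == target).sum()), length=8)
    print("most likely sequence:", theta.argmax(axis=1))
```

Note that `f` is only ever evaluated, never differentiated, which is the second property listed above.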
Our aim is to solve the MBO objective, where:
→ the desired property is, e.g., fluorescence > 𝛽
→ the oracle is a model that maps sequences to property
≥ (lower bound): anneal and use Monte Carlo (MC)
Figure: Acceptable (many training examples) vs. Pathological (fewer training examples)
Idea: estimate the training distribution of x conditioned on high values of the oracle
We don’t have access to the training distribution, but can build a model p(x | θ^(0)) to approximate it
Previous formulation: ≥ (lower bound), then anneal and MC
New formulation: p(x | θ^(0)) models the training distribution
Can’t anneal when sampling
= rewrite using an importance sampling proposal dist.
≥ (lower bound), then anneal and MC
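The importance sampling step relies on a general identity: an expectation under one distribution can be estimated from samples of another by reweighting each sample with the density ratio. The sketch below checks this numerically on 1-D Gaussians, which are a stand-in assumption for the sequence models. Here the "target" plays the role of the training-distribution model and the "proposal" plays the role of the current search model.

```python
import math

import numpy as np


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated elementwise."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(g, target, proposal, n, rng):
    """Estimate E_target[g(x)] using samples drawn from the proposal.

    `target` and `proposal` are (mean, std) pairs of 1-D Gaussians (toy assumption).
    """
    x = rng.normal(proposal[0], proposal[1], size=n)
    weights = normal_pdf(x, *target) / normal_pdf(x, *proposal)  # density ratio
    return float(np.mean(weights * g(x)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    estimate = importance_sampling_estimate(
        lambda x: (x > 1.0).astype(float),  # indicator of a "desired" region
        target=(0.0, 1.0),
        proposal=(1.0, 1.5),
        n=200_000,
        rng=rng,
    )
    exact = 0.5 * (1.0 - math.erf(1.0 / math.sqrt(2.0)))  # P(x > 1) under N(0, 1)
    print(f"importance sampling: {estimate:.4f}, exact: {exact:.4f}")
```

The proposal is chosen wider than the target so the density ratio stays bounded, which keeps the variance of the estimate small.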
can’t query the ground truth
sequences being found by the method
Test set
function as the oracle
→ “Ground truth” is a GP mean function
the GP for given sequences
Figure: ground truth GP, oracles, training data
pathologies
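A simulated setup of this kind, with a GP mean function standing in for the ground truth and noisy evaluations of it serving as oracle training data, can be sketched as follows. This is a minimal illustration under loud assumptions: inputs are 1-D real numbers rather than sequences, the kernel is a squared-exponential, and the anchor points, noise level, and helper names are all hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior_mean(x_train, y_train, x_query, noise=1e-6, length_scale=1.0):
    """GP regression posterior mean at x_query given (x_train, y_train)."""
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train, length_scale)
    return k_qt @ np.linalg.solve(k_tt, y_train)


def sample_oracle_training_data(ground_truth, n, noise_std, rng, low=0.0, high=3.0):
    """Draw inputs uniformly and label them with noisy ground-truth values."""
    x = rng.uniform(low, high, size=n)
    y = ground_truth(x) + rng.normal(0.0, noise_std, size=n)
    return x, y


if __name__ == "__main__":
    # Hypothetical anchor data that defines the simulated ground truth.
    anchors_x = np.array([0.0, 1.0, 2.0, 3.0])
    anchors_y = np.array([0.0, 1.0, 0.5, -1.0])

    def ground_truth(x):
        return gp_posterior_mean(anchors_x, anchors_y, x)

    rng = np.random.default_rng(0)
    x, y = sample_oracle_training_data(ground_truth, n=50, noise_std=0.1, rng=rng)
    print("first training pairs:", list(zip(x[:3].round(2), y[:3].round(2))))
```

Because the ground truth is an explicit function here, any candidate found by a design method can be scored against it, which is not possible with a real, unqueryable ground truth.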
Model-based optimizations
Use weighted ML updates with weights:
p(x | θ^(0)) / p(x | θ^(t)) · P(S | x)
P(S | x)
𝑓< = >
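A weighted maximum-likelihood update can be sketched for a simple model class. The code below assumes a factorized categorical model over integer-encoded sequences and assumes weights of the form p(x | θ^(0)) / p(x | θ^(t)) · P(S | x), with the values of P(S | x) supplied by the caller. The function names are hypothetical and this is not presented as the authors' implementation.

```python
import numpy as np


def log_prob(theta, x):
    """log p(x | theta) for a factorized categorical model.

    theta has shape (length, alphabet); x has shape (n, length) with integer letters.
    """
    positions = np.arange(theta.shape[0])
    return np.log(theta[positions, x]).sum(axis=1)


def cbas_style_weights(x, theta_0, theta_t, prob_s_given_x):
    """Density ratio between the initial and current models, times P(S | x)."""
    return np.exp(log_prob(theta_0, x) - log_prob(theta_t, x)) * prob_s_given_x


def weighted_ml_update(x, weights, alphabet, pseudocount=1e-3):
    """Weighted ML refit: weighted letter frequencies at each position."""
    length = x.shape[1]
    counts = np.full((length, alphabet), pseudocount)
    for i in range(length):
        counts[i] += np.bincount(x[:, i], weights=weights, minlength=alphabet)
    return counts / counts.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    x = np.array([[0, 1], [0, 0], [1, 1]])
    theta_0 = np.array([[0.5, 0.5], [0.5, 0.5]])
    theta_t = np.array([[0.8, 0.2], [0.3, 0.7]])
    prob_s = np.array([0.1, 0.5, 0.9])  # hypothetical oracle probabilities
    w = cbas_style_weights(x, theta_0, theta_t, prob_s)
    print(weighted_ml_update(x, w, alphabet=2))
```

Dropping the density ratio and keeping only P(S | x) gives the simpler weighting listed second above.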
considerations
Gradient descent on latent spaces
What does each bar represent?
Figure: bars for Oracle and Ground truth over Seq1 to Seq9; summary labels: Greater than ith percentile, Mean
Funding: