Improving Molecular Design by Stochastic Iterative Target Augmentation



slide-1
SLIDE 1

Improving Molecular Design by Stochastic Iterative Target Augmentation

Kevin Yang, Wengong Jin, Kyle Swanson, Regina Barzilay, Tommi Jaakkola

slide-2
SLIDE 2

15-Second Overview

Data augmentation approach: improve molecular optimization SOTA by > 10%

Broadly useful for structured generation tasks, e.g. program synthesis (shown later)

slide-3
SLIDE 3

Context: Pharmaceutical Drug Discovery

Suppose we have a promising drug candidate, e.g., for COVID-19

slide-4
SLIDE 4

Context: Pharmaceutical Drug Discovery

Suppose we have a promising drug candidate, e.g., for COVID-19

We want to make it more potent (higher property score)

[Figure: the molecule we have vs. the molecule we want]

slide-5
SLIDE 5

Task: Molecular Optimization

“Translate” input molecule to a similar molecule with better property score.

slide-6
SLIDE 6

Task: Molecular Optimization

“Translate” input molecule to a similar molecule with better property score.

Dataset: collection of input-target pairs

slide-7
SLIDE 7

Why is Molecular Optimization Hard?

slide-8
SLIDE 8

Why is Molecular Optimization Hard?

Real-world ground truth evaluation: lab assay

slide-9
SLIDE 9

Why is Molecular Optimization Hard?

Real-world ground truth evaluation: lab assay

  • Slow + expensive!

slide-10
SLIDE 10

Why is Molecular Optimization Hard?

Real-world ground truth evaluation: lab assay

  • Slow + expensive!

Key Problem: Small Datasets

slide-11
SLIDE 11

Stochastic Iterative Target Augmentation

Data augmentation meta-algorithm on top of existing model

slide-12
SLIDE 12

Results: Molecular Optimization

  • Over 10% absolute gain over SOTA on both datasets

slide-13
SLIDE 13

Results: Program Synthesis

slide-14
SLIDE 14

Stochastic Iterative Target Augmentation

Data augmentation meta-algorithm on top of existing model

  • Sample input-output pairs from generator

[Diagram: new “data”, some good, some bad]

slide-15
SLIDE 15

Stochastic Iterative Target Augmentation

Data augmentation meta-algorithm on top of existing model

  • Sample input-output pairs from generator

How to filter for only the good pairs?

[Diagram: new “data” (some good, some bad) → ? → filtered good “data” only]

slide-16
SLIDE 16

Idea: Filter with Property Predictor

Predict the property score

slide-17
SLIDE 17

Idea: Filter with Property Predictor

Predict the property score: this is easier than generation!

slide-18
SLIDE 18

Idea: Filter with Property Predictor

Predict the property score: this is easier than generation!

Program synthesis analogue: hard to write a program, easier to run test cases

slide-19
SLIDE 19

Stochastic Iterative Target Augmentation

Data augmentation meta-algorithm on top of existing model

  • Sample input-output pairs from generator
  • Filter with property predictor, add good pairs to training data

[Diagram: new “data” (some good, some bad) → property predictor → filtered good “data” only]
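The filtering step can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: `filter_pairs`, the threshold rule, the similarity check, and the toy predictor are all assumptions introduced here, with strings standing in for molecules.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (input, candidate target); strings stand in for molecules


def filter_pairs(
    candidates: List[Pair],
    predict_property: Callable[[str], float],
    is_similar: Callable[[str, str], bool],
    threshold: float,
) -> List[Pair]:
    """Keep only pairs whose target is similar to the input and scores well.

    The threshold rule is an assumption; the slides only say that a
    property predictor decides which sampled pairs are good.
    """
    return [
        (x, y)
        for x, y in candidates
        if is_similar(x, y) and predict_property(y) >= threshold
    ]


# Toy stand-ins (hypothetical): property = number of "O" characters,
# similar = lengths differ by at most 2.
def toy_predictor(mol: str) -> float:
    return float(mol.count("O"))


def toy_similar(a: str, b: str) -> bool:
    return abs(len(a) - len(b)) <= 2


candidates = [("CCO", "CCOO"), ("CCO", "CCC"), ("CCO", "OOOOOOOO")]
kept = filter_pairs(candidates, toy_predictor, toy_similar, threshold=2.0)
print(kept)
```

Only the first candidate survives here: the second scores too low and the third is too far from its input.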

slide-20
SLIDE 20

Stochastic Iterative Target Augmentation

Data augmentation meta-algorithm on top of existing model

  • Sample input-output pairs from generator
  • Filter with property predictor, add good pairs to training data
  • Train generator, repeat
slide-21
SLIDE 21

Outline

  • Setup + Evaluation
  • Detailed Method
  • More Empirical Analysis
  • Program Synthesis Experiments + Results

slide-22
SLIDE 22

Outline

  • Setup + Evaluation
  • Detailed Method
  • More Empirical Analysis
  • Program Synthesis Experiments + Results

slide-23
SLIDE 23

Real World Molecular Optimization

Real-world ground truth evaluation: lab assay

  • Slow + expensive! (→ small datasets)

slide-24
SLIDE 24

Real World Molecular Optimization

Real-world ground truth evaluation: lab assay

  • Slow + expensive! (→ small datasets)
  • Only use at final test time

slide-25
SLIDE 25

Real World Molecular Optimization

Real-world ground truth evaluation: lab assay

  • Slow + expensive! (→ small datasets)
  • Only use at final test time

Use fast + cheap in silico (i.e., computational) predictor for model validation

[Diagram: lab assay (test time only) → data used to train → in silico predictor (can use anytime)]

slide-26
SLIDE 26

Evaluation Setup

(Lab assay, in silico predictor) become (in silico predictor, proxy predictor)

[Diagram: lab assay → data used to train → in silico predictor (test time only) → data used to train → proxy predictor (can use anytime)]

slide-27
SLIDE 27

Evaluation Setup

(Lab assay, in silico predictor) become (in silico predictor, proxy predictor)

  • Just train proxy on property values of molecular optimization training pairs

[Diagram: lab assay → data used to train → in silico predictor (test time only) → data used to train → proxy predictor (can use anytime)]

slide-28
SLIDE 28

Metric

“Success” if even 1 of 20 tries passes the ground truth evaluator

slide-29
SLIDE 29

Metric

“Success” if even 1 of 20 tries passes the ground truth evaluator

Molecular optimization is hard...
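One way to read this metric is as a per-input "any of 20" success rate. The sketch below assumes that reading; `success_rate`, `passes`, and the toy evaluator are hypothetical names introduced for illustration, not taken from the slides.

```python
from typing import Callable, Dict, List


def success_rate(
    outputs_per_input: Dict[str, List[str]],
    passes: Callable[[str, str], bool],
    tries: int = 20,
) -> float:
    """Fraction of inputs for which at least one of the first `tries`
    generated outputs passes the ground truth evaluator."""
    if not outputs_per_input:
        return 0.0
    successes = sum(
        1
        for x, outputs in outputs_per_input.items()
        if any(passes(x, y) for y in outputs[:tries])
    )
    return successes / len(outputs_per_input)


# Toy evaluator (assumption): an output passes if it has more "O" than its input.
def toy_passes(x: str, y: str) -> bool:
    return y.count("O") > x.count("O")


outputs = {
    "CCO": ["CCC", "CCOO"],  # second try passes, so this input is a success
    "CO": ["C", "CC"],       # no try passes
}
rate = success_rate(outputs, toy_passes)
print(rate)
```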

slide-30
SLIDE 30

Outline

  • Setup + Evaluation
  • Detailed Method
  • More Empirical Analysis
  • Program Synthesis Experiments + Results

slide-31
SLIDE 31

Stochastic Iterative Target Augmentation

Target augmentation: Augment the set of correct targets for a given input.

Goal: [figure not recovered; annotated “Somehow”]

slide-32
SLIDE 32

Stochastic Iterative Target Augmentation

1. Given inputs, sample input-target pairs from current generative model

Target augmentation: Augment the set of correct targets for a given input.

slide-33
SLIDE 33

Stochastic Iterative Target Augmentation

1. Given inputs, sample input-target pairs from current generative model
2. Filter candidate input-output pairs using property predictor

Target augmentation: Augment the set of correct targets for a given input.

slide-34
SLIDE 34

Stochastic Iterative Target Augmentation

1. Given inputs, sample input-target pairs from current generative model
2. Filter candidate input-output pairs using property predictor
3. Add good pairs to training data, train model, repeat
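The three steps of the method (sample, filter, retrain) can be sketched as a loop. This is an illustrative toy under stated assumptions, not the authors' code: `ToyGenerator`, `augment`, and `is_good` are hypothetical, integers stand in for molecules, and the sketch accumulates good pairs across rounds, which the slides do not specify.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[int, int]  # (input, target); integers stand in for molecules


class ToyGenerator:
    """Hypothetical stand-in for the generative model being wrapped."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.offsets: List[int] = [0]

    def fit(self, pairs: List[Pair]) -> None:
        # "Training" here just memorizes the observed input-to-target offsets.
        self.offsets = [y - x for x, y in pairs] or [0]

    def sample(self, x: int) -> int:
        # Stochastic output: a learned offset plus a little noise.
        return x + self.rng.choice(self.offsets) + self.rng.randint(-2, 2)


def augment(
    generator: ToyGenerator,
    train_pairs: List[Pair],
    is_good: Callable[[int, int], bool],
    iterations: int = 3,
    samples_per_input: int = 10,
) -> List[Pair]:
    data = list(train_pairs)
    inputs = sorted({x for x, _ in train_pairs})
    generator.fit(data)
    for _round in range(iterations):
        for x in inputs:
            for _draw in range(samples_per_input):
                pair = (x, generator.sample(x))       # 1. sample a candidate pair
                if is_good(*pair) and pair not in data:
                    data.append(pair)                 # 2. filter, 3. add good pairs
        generator.fit(data)                           # 3. retrain, then repeat
    return data


# Toy filter (assumption): target is better (larger) but still close to the input.
def is_good(x: int, y: int) -> bool:
    return 0 < y - x <= 5


train_pairs = [(0, 1), (10, 12)]
augmented = augment(ToyGenerator(seed=0), train_pairs, is_good)
print(len(train_pairs), "->", len(augmented), "training pairs")
```

In a real setting `is_good` would wrap the property predictor and `fit` would be a full training pass of the underlying model.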

slide-35
SLIDE 35

Results: Molecular Optimization

  • Over 10% absolute gain over SOTA on both datasets

slide-36
SLIDE 36

Observations

  • View as Stochastic EM
slide-37
SLIDE 37

Observations

  • View as Stochastic EM
  • Why iterative? Better generator → easier to find new correct targets
slide-38
SLIDE 38

Observations

  • View as Stochastic EM
  • Why iterative? Better generator → easier to find new correct targets
  • May as well use proxy to filter samples at test time too
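
Test-time filtering with the proxy could look roughly like the rejection-sampling sketch below. The draw budget, the function name `generate_with_proxy_filter`, and the toy sampler are assumptions; the slides only say that the proxy can filter samples at test time as well.

```python
import itertools
from typing import Callable, List


def generate_with_proxy_filter(
    sample: Callable[[int], int],
    proxy_ok: Callable[[int, int], bool],
    x: int,
    tries: int = 20,
    max_draws: int = 200,
) -> List[int]:
    """Draw from the generator, keeping only outputs the proxy accepts,
    until `tries` outputs are collected or the draw budget runs out."""
    outputs: List[int] = []
    for _ in range(max_draws):
        y = sample(x)
        if proxy_ok(x, y):
            outputs.append(y)
            if len(outputs) == tries:
                break
    return outputs


# Deterministic toy sampler (assumption): cycles through offsets -3..3.
counter = itertools.count()


def toy_sample(x: int) -> int:
    return x + next(counter) % 7 - 3


# Toy proxy (assumption): accept outputs that are larger but still close.
def toy_proxy_ok(x: int, y: int) -> bool:
    return 0 < y - x <= 5


outputs = generate_with_proxy_filter(toy_sample, toy_proxy_ok, x=10, tries=4)
print(outputs)
```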
slide-39
SLIDE 39

Outline

  • Setup + Evaluation
  • Detailed Method
  • More Empirical Analysis
  • Program Synthesis Experiments + Results

slide-40
SLIDE 40

Fréchet ChemNet Distance Analysis

  • FCD (embedding distance) is the molecular analogue to the Fréchet Inception Distance (FID) in images. Lower is better.
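
The slides name the metric without defining it. For reference, a Fréchet distance of this kind is commonly computed by fitting Gaussians to network activations and comparing their means and covariances. In the expression below, the subscripts g (generated molecules) and r (reference molecules) are notation introduced here, not taken from the slides:

```latex
\mathrm{FCD} = \lVert \mu_g - \mu_r \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r - 2\left( \Sigma_g \Sigma_r \right)^{1/2} \right)
```

The distance is zero when the two fitted Gaussians coincide, which is why lower values indicate generated molecules that look more like the reference set.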

slide-41
SLIDE 41

Improved Diversity

Diversity: average distance between different correct outputs for the same input
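
Read literally, this definition is an average over pairs of distinct correct outputs for one input. A minimal sketch follows; the Jaccard distance on character sets is only a toy stand-in for whatever molecular distance the authors used, and both function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, List


def diversity(correct_outputs: List[str], distance: Callable[[str, str], float]) -> float:
    """Average pairwise distance between distinct correct outputs for one input."""
    distinct = sorted(set(correct_outputs))
    pairs = list(combinations(distinct, 2))
    if not pairs:
        return 0.0  # fewer than two distinct outputs: no diversity to measure
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def jaccard_distance(a: str, b: str) -> float:
    """Toy distance (assumption): 1 minus Jaccard similarity of character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


score = diversity(["CCO", "CCN", "CCO"], jaccard_distance)
print(round(score, 4))
```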

slide-42
SLIDE 42

Robustness to Predictor Quality

Far-left point is the oracle (ground truth); second from left is the learned proxy predictor. The blue line indicates baseline performance.

slide-43
SLIDE 43

Outline

  • Setup + Evaluation
  • Detailed Method
  • More Empirical Analysis
  • Program Synthesis Experiments + Results

slide-44
SLIDE 44

Program Synthesis Task: Karel Dataset

Inputs: Test Cases

Outputs: Programs

Evaluate correctness using held-out test cases
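
In this setting the filter is simply "run the candidate on the test cases". The sketch below illustrates that idea with plain Python functions standing in for Karel programs; `passes_all`, the candidate programs, and the test cases are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[int, int]  # (input, expected output)


def passes_all(program: Callable[[int], int], test_cases: List[TestCase]) -> bool:
    """A candidate is kept only if it reproduces every given test case."""
    for given, expected in test_cases:
        try:
            if program(given) != expected:
                return False
        except Exception:
            return False  # a crashing candidate counts as incorrect
    return True


# Hypothetical candidate programs sampled from a generator.
candidate_programs: Dict[str, Callable[[int], int]] = {
    "double": lambda n: n * 2,
    "square": lambda n: n * n,   # agrees on (2, 4) but not on (3, 6)
    "crash": lambda n: n // 0,
}

given_cases = [(2, 4), (3, 6)]   # test cases provided as the input
held_out_cases = [(5, 10)]       # held-out cases used only for evaluation

kept_programs = [
    name for name, prog in candidate_programs.items() if passes_all(prog, given_cases)
]
print(kept_programs)
print(passes_all(candidate_programs["double"], held_out_cases))
```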

slide-45
SLIDE 45

Program Synthesis Target Augmentation

slide-46
SLIDE 46

Results: Program Synthesis

slide-47
SLIDE 47

Summary

Data augmentation meta-algorithm for improving performance on structured generation tasks

slide-48
SLIDE 48

Summary

Data augmentation meta-algorithm for improving performance on structured generation tasks

Significantly improves over SOTA in molecular optimization: > 10%

slide-49
SLIDE 49

Summary

Data augmentation meta-algorithm for improving performance on structured generation tasks

Significantly improves over SOTA in molecular optimization: > 10%

Applicable to other domains: program synthesis

slide-50
SLIDE 50

Summary

Data augmentation meta-algorithm for improving performance on structured generation tasks

Significantly improves over SOTA in molecular optimization: > 10%

Applicable to other domains: program synthesis

Thanks for Watching!