

slide-1
SLIDE 1

Feature Generation for Drug Discovery Learning

Using Persistent Homology to Create Moduli Spaces of Chemical Compounds

Anthony Bak

slide-9
SLIDE 9

Problem Context

We want to:

◮ Create new drugs to solve disease
◮ Find new compounds to run in drug trials
◮ Run experiments to test the inhibition properties of compounds
◮ Find a set of compounds small enough to try ← Here is our step
◮ Sort through all known compounds to come up with a likely collection of compounds ← Maybe with enough compute power we could do this.

This process is called virtual screening.

slide-13
SLIDE 13

Meta Goals

◮ Solve the problem
◮ Use the solution to illustrate new mathematical tools, e.g. persistent homology
◮ The tools illustrate what may be some unexpected mathematical concepts (functoriality, rings of algebraic functions, etc.) being applied in a data-driven (not model-driven) context.
◮ Some mathematical limitations of current methods are discussed

slide-17
SLIDE 17

Why do virtual screening at all?

◮ High throughput screening (HTS)
    ◮ Physical screening of large numbers of potential drugs.
    ◮ Very expensive
◮ Virtual screening
    ◮ Computational
    ◮ Typically based on biochemical knowledge
    ◮ Drastically reduces the cost of HTS
    ◮ Typical goal for a database of millions of compounds is to select 90% of the potential inhibitors with about 10% of the total compounds.

Many different methods:

◮ QSAR (quantitative structure-activity relationship)
◮ Pharmacophore models (points in 3D space, with radii, representing specific types of chemical interaction)
◮ Typically, no insight into the space of compounds being examined

Goal: To find the set of relevant bioactive compounds
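The "90% of the inhibitors with 10% of the compounds" goal can be checked with a small recall calculation. The sketch below is only an illustration: the function name `recall_at_fraction`, the scores and the activity labels are all hypothetical, and any ranking method could supply the scores.

```python
def recall_at_fraction(scores, is_active, fraction):
    """Fraction of all actives found in the top `fraction` of the ranked list."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("need at least one active compound")
    # Rank compounds from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_selected = max(1, round(fraction * len(scores)))
    found = sum(is_active[i] for i in order[:n_selected])
    return found / total_actives


# Toy data (hypothetical): 10 compounds, 2 of them active.
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.4, 0.05, 0.6, 0.15, 0.25]
is_active = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(recall_at_fraction(scores, is_active, 0.1))  # 0.5
```

A screen meeting the stated goal would return at least 0.9 for `fraction=0.1`.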

slide-18
SLIDE 18

Our Example: Dihydrofolate reductase (DHFR)

◮ Tetrahydrofolate (THF) is an important precursor in the biosynthesis of purines, thymidylate, and several important amino acids.
◮ DHFR turns dihydrofolate (DHF) into tetrahydrofolate (THF).
◮ Dihydrofolate is easily available. The reaction catalyzed by DHFR is the only source you have for THF.

slide-21
SLIDE 21

Why DHFR

DHFR inhibitors are a class of drugs that stop DHFR from working. Why do we care?

◮ Cancer (e.g. methotrexate)
    ◮ DNA is made from purines (adenine and guanine) and pyrimidines (thymine and cytosine).
    ◮ Stopping DHFR → no new DNA → cells cannot divide
    ◮ Everything dies, but cancer is growing most quickly, so (hopefully) it dies first.
◮ Bacteria (e.g. trimethoprim)
    ◮ Bacterial DHFR has a similar, but different, structure.
    ◮ Some DHFR inhibitors only bind bacterial DHFR, not human.
◮ Malaria (e.g. pyrimethamine)
    ◮ Some DHFR inhibitors only bind malarial DHFR.

slide-22
SLIDE 22

Problem Complexity

The multi-species DHFR activity makes our problem more complicated

◮ We need to separate out compounds not just by bioactivity but by per-species bioactivity.
◮ You don’t want a drug targeting E. coli to also function as a cancer drug that stops human cellular reproduction
◮ Ditto for other species (pneumonia, malaria, etc.), so that we can have precise targeting

slide-23
SLIDE 23

Structure-based DHFR drug design

Methotrexate, a DHFR-inhibitor, is the first historical example of successful anticancer structure-based drug design.

slide-24
SLIDE 24

Structure-based DHFR drug design

For comparison, a chemically similar molecule that does not inhibit DHFR:

slide-29
SLIDE 29

Structure-based DHFR drug design

Structure-based drug design is hard

◮ Design required significant biological and biochemical experiments and knowledge as well as years of work.
◮ Methotrexate, designed in the late 1940s and early 1950s, is still used today as an anticancer drug.
◮ Typical side effects: hair loss, ulcers, etc. Drugs can have bad side effects but if they’re the only option...
◮ Decades later, the first crystal structure of methotrexate bound to DHFR was found. It binds upside down in the binding pocket when compared to THF!

Yikes!

slide-30
SLIDE 30

Feature Engineering using Topology

slide-35
SLIDE 35

Overview

The method:

◮ For each compound we calculate a set of topological invariants (barcodes)
◮ From a metric on barcodes we make a finite metric space
◮ We hope that compounds with certain biochemical properties are localized in this space
◮ We think that with the right collection of barcodes they are.

Philosophy: We don’t use the barcodes to study individual compounds but to say how they differ from each other. Their relative differences and the global structure of the space allow us to make inferences.
slide-36
SLIDE 36

Dataset

To get our test dataset we wanted to get all likely compounds:

◮ Based on experience we knew that each compound needed at least one aromatic group and a hydrophobic piece.
◮ We searched a database of drug-like compounds for all matching compounds
◮ We did a literature search for all known DHFR-based drugs (across all species)
◮ We combined these datasets into a single dataset with 4000 compounds

slide-37
SLIDE 37

Persistent Homology: Notation and Language

Given a filtration of a space $X_0 \subseteq X_1 \subseteq X_2 \subseteq \dots \subseteq X$, we have maps of homology $H_i(X_l) \to H_i(X_k)$ whenever $l < k$. This situation is classified by a barcode.

◮ For us, filtrations will be created by sub- and super-level sets of functions

Intuition: We track when a homology class is born and when it dies according to some choice of parameter.

slide-42
SLIDE 42

Persistent Homology: Rips Filtration

◮ Start with points and distances between them
◮ Add edges in order of increasing pairwise distances between points
◮ Add higher-dimensional simplices when all their faces are already included.
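The three steps above can be sketched in a few lines. This is a minimal illustration rather than the implementation used for the talk (which credits Dionysus): the function name `rips_filtration` and the toy square are assumptions, and the brute-force enumeration is only practical for small point sets.

```python
from itertools import combinations
from math import dist


def rips_filtration(points, max_dim=2):
    """Return (value, simplex) pairs sorted by filtration value, then dimension."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    simplices = []
    for k in range(1, max_dim + 2):  # k = number of vertices in the simplex
        for simplex in combinations(range(n), k):
            # A simplex enters once all of its edges are present,
            # i.e. at the length of its longest edge (0 for a single vertex).
            value = max((d[i][j] for i, j in combinations(simplex, 2)), default=0.0)
            simplices.append((value, simplex))
    # Ties are broken by dimension so faces always precede the simplices they bound.
    simplices.sort(key=lambda s: (s[0], len(s[1]), s[1]))
    return simplices


# Toy example: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
for value, simplex in rips_filtration(square):
    print(f"{value:.3f}", simplex)
```

For the square, the four sides appear at distance 1, and the two diagonals and all four triangles appear together at distance √2.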

slide-46
SLIDE 46

Persistent Homology: Barcodes

◮ Start with 9 points. Each contributes a bar in dimension 0.
◮ Adding the shortest edges kills two components. The rest continue.
◮ We now have only one component but have created two cycles in dimension 1.
◮ These two cycles die when the faces (triangles) are filled in.

[Figure: H0 and H1 barcodes]
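The dimension-0 part of this story (components merging as edges are added) can be computed with a union-find structure. The sketch below is an illustrative assumption about how one might do it by hand, covering H0 only; the name `h0_barcode` and the three-point example are hypothetical.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Dimension-0 bars (birth, death) of the Rips filtration of a point cloud."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Process edges from shortest to longest, as in the Rips filtration.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, length))  # one component dies when this edge appears
    if n:
        bars.append((0.0, inf))  # the last surviving component never dies
    return bars


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(h0_barcode(points))  # [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```

Every point is born at 0, and each merge ends exactly one bar, so n points always give n bars in dimension 0.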

slide-47
SLIDE 47

Persistent Homology: Functions

Fix an underlying complex

slide-48
SLIDE 48

Persistent Homology: Functions

Assign each vertex a real value. This is the filtration function

slide-51
SLIDE 51

Persistent Homology: Functions

Extend the function to the higher-dimensional simplices in a linear fashion.

[Figure: the complex with its vertices labeled by filtration values 1 and 2]

slide-52
SLIDE 52

Persistent Homology: Functions

Track the cycles as they are created and die.

[Figure: H0 and H1 barcodes for the function filtration]
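For a function filtration, the bookkeeping in dimension 0 can be sketched on a graph. The code assumes one common discretization, the lower-star filtration, in which an edge enters at the larger of its two vertex values; the slides only say the function is extended "in a linear fashion", so this choice is an assumption. The name `sublevel_h0` and the toy path graph are hypothetical.

```python
from math import inf


def sublevel_h0(values, edges):
    """H0 bars of the sublevel-set (lower-star) filtration of a graph."""
    order = sorted(range(len(values)), key=lambda v: values[v])
    parent = {}
    birth = {}  # root vertex -> value at which its component was born

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    neighbours = {v: [] for v in range(len(values))}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)

    bars = []
    for v in order:  # add vertices from lowest to highest value
        parent[v] = v
        birth[v] = values[v]
        for w in neighbours[v]:
            if w not in parent:
                continue  # w has not entered the filtration yet
            rv, rw = find(v), find(w)
            if rv == rw:
                continue
            # Elder rule: the component born later dies at the current value.
            young, old = (rv, rw) if birth[rv] > birth[rw] else (rw, rv)
            if birth[young] < values[v]:  # skip zero-length bars
                bars.append((birth[young], values[v]))
            parent[young] = old
    roots = {find(v) for v in parent}
    bars.extend((birth[r], inf) for r in roots)
    return sorted(bars)


# Path graph 0 - 1 - 2 with a high value in the middle: two "ends".
print(sublevel_h0([1.0, 3.0, 2.0], [(0, 1), (1, 2)]))  # [(1.0, inf), (2.0, 3.0)]
```

The two bars correspond to the two local minima of the function; the younger one dies when the middle vertex joins the ends.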

slide-60
SLIDE 60

Barcode Interpretability and Engineering

◮ Different filtrations tell you different things about the structure.
◮ For the Rips filtration we learn about connected components and the size of the voids

[Figure: H0 and H1 barcodes for the Rips filtration]

◮ For the function example we learned that there are two ends and cycles away from the center

[Figure: H0 and H1 barcodes for the function filtration]

◮ You can interpret the barcodes and you can engineer them to answer different questions.

slide-64
SLIDE 64

Chemical Compounds as Finite Metric Spaces

A chemical compound is a finite metric space

◮ Points are 3D atomic coordinates
◮ We considered two natural metrics
    ◮ The embedded (Euclidean) metric
    ◮ The graph metric given by bonds and their lengths

A function on a compound is a real value for each atom.

◮ The most important property is that the functions have intrinsic meaning for either the chemistry or the geometry.
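The two metrics can be compared on a toy example. The sketch below assumes a compound is given as a list of 3D coordinates plus a list of bonded index pairs; the function names and the bent three-atom chain are hypothetical, and the graph metric is read here as shortest-path length along bonds.

```python
from math import dist, inf


def euclidean_metric(coords):
    """Pairwise straight-line distances between atoms."""
    n = len(coords)
    return [[dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def bond_graph_metric(coords, bonds):
    """Shortest-path distance along bonds, each bond weighted by its length."""
    n = len(coords)
    d = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i, j in bonds:
        d[i][j] = d[j][i] = dist(coords[i], coords[j])
    # Floyd-Warshall all-pairs shortest paths (fine for molecule-sized graphs).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Hypothetical bent chain of three atoms with unit-length bonds.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
bonds = [(0, 1), (1, 2)]
print(euclidean_metric(coords)[0][2])          # straight line between the end atoms
print(bond_graph_metric(coords, bonds)[0][2])  # 2.0, path along the two bonds
```

The graph metric is never smaller than the Euclidean one, and the gap grows where the molecule folds back on itself.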

slide-72
SLIDE 72

Filter functions

We want a rich set of filter functions to capture the bio-chemical properties of a compound.

◮ Geometric
    ◮ α-complex filtration
    ◮ Rips complex
    ◮ Eccentricity
◮ Chemical
    ◮ Atomic mass
    ◮ Partial charge

Note: Even the geometric functions already capture a lot of chemical information since they are based on chemical bonds.
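Two of these per-atom functions are simple enough to sketch. The slides do not say which variant of eccentricity was used; the code assumes one common definition, the mean distance from an atom to all the others. The small mass table and the collinear example are likewise illustrative assumptions.

```python
from math import dist

# Standard atomic weights for a few common elements (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def eccentricity(coords):
    """Mean distance from each atom to every other atom (assumed definition)."""
    n = len(coords)
    if n < 2:
        return [0.0] * n
    return [sum(dist(p, q) for q in coords) / (n - 1) for p in coords]


def atomic_mass(elements):
    """Filter function assigning each atom its atomic mass."""
    return [ATOMIC_MASS[e] for e in elements]


# Three collinear atoms: the middle one is the least eccentric.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(eccentricity(coords))          # [1.5, 1.0, 1.5]
print(atomic_mass(["C", "O", "C"]))  # [12.011, 15.999, 12.011]
```

Either list can serve as the vertex values of a function filtration on the compound.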

slide-76
SLIDE 76

Some Parameter choices

For all of the filter functions except the α-complex and the Rips complex, a choice of scale is needed to build the underlying complex being filtered.

◮ We generally choose a selection of scales. Typical choices are multiples of the carbon-carbon bond length: 1, 2, 4, 6, 8.
◮ These are ’slices’ along one direction of the multifiltration

For metrics using the bond graph we need to choose a dimension cutoff for Betti numbers.

◮ We generally choose 2 or 3 for calculation resource reasons

For the Rips filtration we need to choose a maximum distance parameter.

◮ We generally choose 6 or 8 times the carbon-carbon bond length for calculation resource reasons

slide-77
SLIDE 77

Barcode Zoo

There’s a combinatorial explosion of parameters and resulting barcodes

◮ We end up with hundreds of barcodes for each compound.

To match the language of computational chemistry, we might call them topological fingerprints.
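A rough count shows how quickly the zoo grows. The enumeration below is a sketch under stated assumptions: the parameter names are hypothetical, the base unit is taken to be a C-C single bond of roughly 1.54 Å, and only the function-based filters are included.

```python
from itertools import product

# Assumption: C-C single bond length of roughly 1.54 angstroms as the base unit.
CC_BOND = 1.54

METRICS = ("euclidean", "bond_graph")
FILTERS = ("eccentricity", "atomic_mass", "partial_charge")
SCALES = tuple(m * CC_BOND for m in (1, 2, 4, 6, 8))
HOMOLOGY_DIMS = (0, 1, 2)


def function_filter_configs():
    """One configuration per (metric, filter, scale, homology dimension)."""
    return [
        {"metric": m, "filter": f, "scale": s, "homology_dim": k}
        for m, f, s, k in product(METRICS, FILTERS, SCALES, HOMOLOGY_DIMS)
    ]


configs = function_filter_configs()
print(len(configs))  # 2 * 3 * 5 * 3 = 90
```

Adding the Rips and α-complex filtrations and further filter functions on top of this grid plausibly reaches the hundreds of barcodes per compound mentioned above.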

slide-80
SLIDE 80

Compounds to Metric Spaces

For each barcode we use either the bottleneck or the Wasserstein distance to form a metric space of compounds.

Bottleneck: $B(B_1, B_2) = \inf_{m: B_1 \to B_2} \left( \sup_{b \in B_1} \| b - m(b) \|_\infty \right)$

Wasserstein: $W(B_1, B_2) = \inf_{m: B_1 \to B_2} \left( \sum_{b \in B_1} \| b - m(b) \|_\infty^q \right)^{1/q}$

where $m$ is a matching between the diagrams.

◮ Both are kinds of "edit" distances. For bottleneck we take the largest edit in the best matching, while for Wasserstein we sum the edits for each bar.
◮ We allow bars to be matched to ’zero’ as well to make this work
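Both distances can be computed by brute force for very small barcodes, which makes the definitions concrete. The sketch below is an illustration under assumptions, not a practical algorithm: it tries every matching (factorial cost), handles finite bars only, and models "matching to zero" by pairing a bar with its nearest point on the diagonal. All names are hypothetical.

```python
from itertools import permutations


def _diag(p):
    """Closest diagonal point to the bar p = (birth, death) in the sup norm."""
    m = (p[0] + p[1]) / 2
    return (m, m)


def matching_distances(b1, b2, q=2):
    """Return (bottleneck, q-Wasserstein) distances between two small barcodes."""
    # Augment each side with diagonal points so that any bar may be matched
    # to "zero". Entries are (point, is_diagonal).
    left = [(p, False) for p in b1] + [(_diag(p), True) for p in b2]
    right = [(p, False) for p in b2] + [(_diag(p), True) for p in b1]

    def cost(a, b):
        if a[1] and b[1]:
            return 0.0  # diagonal matched to diagonal is free
        (x1, y1), (x2, y2) = a[0], b[0]
        return max(abs(x1 - x2), abs(y1 - y2))  # sup-norm edit

    bottleneck = wasserstein = float("inf")
    for perm in permutations(range(len(right))):
        costs = [cost(left[i], right[j]) for i, j in enumerate(perm)]
        bottleneck = min(bottleneck, max(costs, default=0.0))
        wasserstein = min(wasserstein, sum(c ** q for c in costs) ** (1 / q))
    return bottleneck, wasserstein


b1 = [(0.0, 4.0)]
b2 = [(0.0, 5.0), (1.0, 1.5)]
print(matching_distances(b1, b2))
```

In this example the long bars are matched to each other (an edit of 1) and the short bar in `b2` is matched to zero (an edit of 0.25), so the bottleneck distance is 1 while the Wasserstein distance also accounts for the small edit.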

slide-84
SLIDE 84

Matching Distances

The distances are an intuitive way to understand how two barcodes differ.

[Figure: barcodes B1 and B2 and a matching between their bars]

slide-85
SLIDE 85

Visualization and Discovery: Ayasdi Mapper

Known Human Inhibitors

slide-86
SLIDE 86

Visualization and Discovery: Ayasdi Mapper

Known E. Coli Inhibitors

slide-92
SLIDE 92

Machine Learning: Functions and SVM

The space of barcodes forms an algebraic variety

◮ We know the ring of functions. Writing $(x_i, y_i)$ for a (birth, death) point in a barcode, some examples are:

$\sum_i (y_i - x_i)$

$\sum_i (y_i - x_i)^2$

$\sum_i (y_i - x_i)(x_i + y_i)$

Note that $x_i + y_i$ alone is not in the ring, since we want functions to be zero on diagrams with only length-zero bars.

◮ We can use this to embed our compound space into Euclidean space and then have access to many standard machine learning algorithms, e.g. SVM classification.
◮ We do this using all polynomials up to a fixed degree.
◮ Now use a standard support vector machine.
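The embedding step can be sketched as a feature map from a barcode to a fixed-length vector. The family used below, sums of $(x_i + y_i)^a (y_i - x_i)^b$ with $b \ge 1$ and $a + b$ bounded by the degree, is an assumption chosen because it reproduces the three example functions at degree 2; the talk's exact polynomial basis is not given. The name `barcode_features` is hypothetical.

```python
def barcode_features(bars, max_degree=2):
    """Evaluate sum_i (x_i + y_i)^a * (y_i - x_i)^b for b >= 1, a + b <= max_degree.

    Requiring b >= 1 makes every feature vanish on length-zero bars.
    """
    features = {}
    for b in range(1, max_degree + 1):
        for a in range(0, max_degree - b + 1):
            features[(a, b)] = sum((x + y) ** a * (y - x) ** b for x, y in bars)
    return features


bars = [(0.0, 2.0), (1.0, 4.0)]
print(barcode_features(bars))  # {(0, 1): 5.0, (1, 1): 19.0, (0, 2): 13.0}
```

Concatenating these values across all of a compound's barcodes gives a Euclidean vector that a standard SVM implementation can take as input.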

slide-93
SLIDE 93

SVM for Classification

SVM confusion matrix for E. coli, human, C. albicans and P. carinii DHFR inhibitors:

101 2 3 71 3 256 17 1 25 299

⇒ This result is comparable to state-of-the-art computational chemistry fingerprint and simulation-based methods.

slide-95
SLIDE 95

Summary

Overview

◮ From a set of chemical compounds calculate a rich set of barcodes
◮ Use a barcode metric to form the compounds into a metric space
◮ Understand the structure of the resulting space, e.g. via known drugs.

Computational topology

◮ Achieves state-of-the-art accuracy for classification
◮ Provides a previously inaccessible global view of the space

slide-98
SLIDE 98

Improvements

Math:

◮ Multidimensional persistence: Ideally we would do all filters simultaneously.
    ◮ Fewer parameters to choose arbitrarily.
    ◮ Understand how the different filtrations interact.
◮ Optimization of barcode combinations: What do we do with the barcode zoo?

Computer Science:

◮ Faster, more memory-efficient persistent homology calculations

Chemistry:

◮ More domain-specific filters, e.g. color filtrations.
◮ Weighted versions of the filters we have

slide-99
SLIDE 99

Acknowledgements

◮ Joint work with Michael G. Lerner, Earlham College Department of Physics and Astronomy.
◮ Sponsoring institutions: American Institute of Mathematics, Stanford University, Ayasdi, National Institutes of Health, Earlham College.
◮ Calculation of persistent homology done with Dionysus (http://www.mrzv.org/software/dionysus/). Thanks Dmitriy Morozov!