Trend in Computer-Aided Materials Discovery - PowerPoint PPT Presentation


SLIDE 1

SLIDE 2

Contents

  • Trend in Computer-Aided Materials Discovery
  • High-Throughput Computational Screening & Exhaustive Enumeration
  • Deep-Learning-based Evolutionary Design
  • Deep-Learning-based Inverse Design
  • Efficacy of Computer-Aided Materials Discovery
SLIDE 3

Trend in Computer-Aided Materials Discovery

  • Conventional Trial-and-Error (high cost): iterative experiments
  • [ 1st Gen. ] Simulation (low throughput): pre-validation
  • [ 2nd Gen. ] Virtual screening (low hit-rate): high throughput
  • [ 3rd Gen. ] Targeted design (high hit-rate): right solutions with minimum effort

  • For accelerated materials discovery: Machine Learning, First-principles Quantum Chemistry, High-performance Computing → Intelligence, Efficiency, Rationalization

SLIDE 4

Trend in Computer-Aided Materials Discovery

  • Prediction of materials properties based on machine learning

– Build-up of a Materials vs. Property DB → Materials Informatics

Timeline of machine learning in chemistry:
– QSAR* (’62, Hansch & Fujita) → ANN** in chemistry (’71) → SMILES*** (’87, Weininger): Cheminformatics
– Graph Kernels (’05 @ UC Irvine), Bayesian Modeling (’09 @ MIT): introduction stage of machine learning (kernel methods, Bayesian approaches)
– Deep Learning (’16 @ Stanford, ’18 @ Harvard): development stage

Process of machine learning @ materials research: Descriptor → Training → Analysis
– Descriptors: SMILES strings (e.g., CC(C)NCC(O)COC1=CC(CC2=CC=CC=C2)=C(CC(N)=O)C=C1), fingerprints (e.g., 011100011111101010010100100000101010001001010…), descriptor vectors, graphs, images

* QSAR: Quantitative Structure-Activity Relationship  ** ANN: Artificial Neural Network  *** SMILES: Simplified Molecular-Input Line-Entry System

SLIDE 5

Trend in Computer-Aided Materials Discovery

  • Materials design based on machine learning

– Inverse QSAR → Inverse Design

Timeline:
– Inverse QSAR (late ’80s~)
– Genetic Algorithms (’92 @ Purdue): evolutionary
– Exhaustive Generation (’12 @ Tokyo): combinatorial
– Inverse Design (’16 @ SAIT), SMILES Autoencoder (’16 @ Harvard), GAN* for molecules (’17 @ Harvard): deep learning / generative models (autoencoder); focus on autonomous molecular generation

*GAN: Generative Adversarial Network

SLIDE 6

Trend in Computer-Aided Materials Discovery

  • In-silico technologies for materials discovery

[ In ] Targets → [ Out ] Target molecules

Materials discovery methodologies: HTCS (High-Throughput Computational Screening) + Molecular Enumeration; Inverse Design; Evolutionary Design
Elemental technologies: Automated Simulation + DB + Machine Learning (Materials Informatics)
SLIDE 7

High-Throughput Computational Screening & Exhaustive Enumeration

“Landscape of phosphorescent light-emitting energies of homoleptic Ir(III)-complexes predicted by a graph-based enumeration and deep learning”, GI01.02.02, 2018 MRS Fall Meeting

SLIDE 8

High-Throughput Computational Screening

  • Property prediction with high-performance computing for large-scale exploration of materials candidates

Workflow: Seed Fragments → Combination → Candidate Pool (large numbers of candidates) → Simulation → Database → Verification → Target Materials

SLIDE 9

High-Throughput Computational Screening

  • ML (Machine Learning)-assisted HTCS for higher efficiency

Workflow: Seed Fragments → Combination → Candidate Pool (large numbers of candidates) → (1) Simulation + ML → Database → Verification → Target Materials
(2) Prioritizing calculations based on active learning
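The prioritization loop in (2) can be sketched in plain Python. This is a toy stand-in, not the talk's pipeline: the random fingerprints, the hidden linear "property", and the k-nearest-neighbour surrogate are all illustrative assumptions in place of the real DFT simulations and DNN. The point is the structure: label a few seeds, then let a cheap surrogate decide which candidate is simulated next.

```python
import random

random.seed(0)

# Toy stand-in for an expensive quantum-chemistry simulation: the "true"
# property is a hidden linear function of the fingerprint bits.
N_BITS = 32
WEIGHTS = [random.uniform(-1, 1) for _ in range(N_BITS)]

def simulate(fp):
    """Expensive ground-truth evaluation (DFT in the real workflow)."""
    return sum(w * b for w, b in zip(WEIGHTS, fp))

def knn_predict(fp, labeled, k=3):
    """Cheap surrogate: k-nearest-neighbour regression on Hamming distance."""
    nearest = sorted(
        (sum(a != b for a, b in zip(fp, x)), y) for x, y in labeled
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Candidate pool from combinatorial enumeration (random fingerprints here).
pool = [tuple(random.randint(0, 1) for _ in range(N_BITS)) for _ in range(500)]

# Seed the database with a few simulated molecules, then let active
# learning prioritize which candidate gets the expensive simulation next.
labeled = [(fp, simulate(fp)) for fp in pool[:5]]
unlabeled = set(pool[5:])

BUDGET = 15
for _ in range(BUDGET):
    best = max(unlabeled, key=lambda fp: knn_predict(fp, labeled))
    labeled.append((best, simulate(best)))   # expensive step only runs here
    unlabeled.discard(best)

best_found = max(y for _, y in labeled)
print(f"simulated {len(labeled)} of {len(pool)} candidates; best = {best_found:.3f}")
```

With this loop only 20 of the 500 candidates are ever "simulated"; the acquisition rule (greedy on the surrogate's prediction) is the simplest choice, and uncertainty-based rules are common alternatives.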

SLIDE 10

High-Throughput Computational Screening

  • Exhaustive enumeration based on graph theory

– “Graphs”
  • Mathematical structures used to model pairwise relations between objects.
  • Made up of nodes and edges.
  • In chemistry, graphs are used to model molecules, where nodes represent atoms and edges represent bonds.

※ Exhaustive enumeration: systematic enumeration of all possible molecules for optimal solution search
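The graph enumeration behind this approach can be reproduced at small scale in plain Python: brute force over all edge subsets, keep the connected graphs, and deduplicate by a canonical form over node relabelings. This recovers the 2 + 6 + 21 = 29 connected non-isomorphic graphs on 3 to 5 nodes that the following slides build on. (Brute force is only an illustrative choice here; dedicated tools such as nauty's geng scale far beyond this.)

```python
from itertools import combinations, permutations

def connected(n, edges):
    """Check connectivity of the graph on nodes 0..n-1 via depth-first search."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def canonical(n, edges):
    """Canonical form: lexicographically smallest edge list over all relabelings."""
    best = None
    for p in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((p[a], p[b]))) for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

def enumerate_connected_graphs(n):
    all_edges = list(combinations(range(n), 2))
    seen = set()
    for k in range(n - 1, len(all_edges) + 1):  # need >= n-1 edges to be connected
        for edges in combinations(all_edges, k):
            if connected(n, edges):
                seen.add(canonical(n, edges))
    return seen

counts = {n: len(enumerate_connected_graphs(n)) for n in (3, 4, 5)}
print(counts, "total:", sum(counts.values()))  # {3: 2, 4: 6, 5: 21} total: 29
```

The total of 29 matches the "29 graphs" cited on the approach slide for 3- to 5-ring structures.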

SLIDE 11

High-Throughput Computational Screening

  • Complete list of non-isomorphic graphs (with graph IDs, the number of edges, and the number of edges at each node)

http://www.cadaeic.net/graphpics.htm
SLIDE 12

High-Throughput Computational Screening

  • Landscape of phosphorescent light-emitting energies of homoleptic Ir(III)-complex core structures

– Ir(III)-complexes
  • Widely used as phosphorescent OLED dopants.
  • Figuring out the full landscape of emission colors is important for discovering high-performing molecules in target color regions.

New J. Chem., 39, 246 (2015); ACS Appl. Mater. Interfaces, 10, 1888–1896 (2018); Organic Electronics, 63, 244–249 (2018)

SLIDE 13

High-Throughput Computational Screening

  • Approach

– Consider the nodes in a graph as rings and the edges as ring-connections.
– Limit the total number of rings to between 3 and 5.
– Exclude the non-planar type (5-21) and structures invalid as dopants.
→ Only 11 graphs are valid among the total of 29 graphs.

SLIDE 14

High-Throughput Computational Screening

  • Enumeration procedure
  1. Graphs
  2. Skeletons (total 405)
  3. Set iridium positions
  4. Substitute some carbon atoms with nitrogen atoms

– For 5- and 6-membered rings.
– Substitute some carbons of each molecule with nitrogen atoms (max. five).
→ Total 9,919,469 (~10M) core structures

SLIDE 15

High-Throughput Computational Screening

  • Property prediction

– Trained a deep-neural-network model with simulated T1 data
  • Input: ECFP (Extended-Connectivity FingerPrints) of molecular structures
  • Output: T1 energy (phosphorescent light-emitting wavelength)

[Plot: mean absolute error of the DNN's T1 prediction (eV) vs. size of the training dataset, 10K–80K]

With 80k training data, the average prediction error was less than 0.1 eV.

80k / 10M = 0.8% → by simulating the properties of only 0.8% of the molecules, we can fully scan the chemical space of 10M!
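The "label 0.8%, predict the rest" economics can be illustrated with a scaled-down toy: the random fingerprints, the noisy linear ground truth, and the least-squares model below are illustrative stand-ins for the ECFP inputs and the DNN, chosen only to show that a surrogate fitted on a small labeled fraction can screen the remaining space cheaply.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the enumerated chemical space: random binary
# fingerprints whose "T1" is a noisy linear function of the bits.
n_space, n_bits = 20000, 64          # scaled-down space (10M in the talk)
X = rng.integers(0, 2, size=(n_space, n_bits)).astype(float)
true_w = rng.normal(size=n_bits)
y = X @ true_w + rng.normal(scale=0.05, size=n_space)

# Label only a small fraction by "simulation", as in the 80k / 10M = 0.8% setup.
n_train = int(0.008 * n_space)
idx = rng.choice(n_space, size=n_train, replace=False)

# Linear least-squares as a minimal surrogate for the DNN.
w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

# Predict the property of every unlabeled candidate at negligible cost.
mask = np.ones(n_space, bool)
mask[idx] = False
mae = np.abs(X[mask] @ w - y[mask]).mean()
print(f"labeled {n_train} of {n_space}; MAE on the rest = {mae:.3f}")
```

Because the toy ground truth really is linear, the surrogate's error here approaches the injected noise level; the talk's point is the same cost structure with a DNN and DFT labels.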

SLIDE 16

High-Throughput Computational Screening

  • Results

– Distribution of T1 values
– Blue-color-emitting materials are rare compared with red and green

[Histogram: number of molecules (×100,000) vs. predicted T1 (eV); Red (18.4%), Green (4.3%), Blue (0.4%)]

SLIDE 17

Conclusions

  • In materials discovery, deep-learning-based HTCS is a good alternative to the conventional trial-and-error approach.

  • Moreover, exhaustive enumeration makes it possible to systematically explore the whole chemical space.

  • With the proposed exhaustive enumeration method based on graph theory and deep learning, the whole landscape of 10M phosphorescent Ir-dopants could be scanned at just 0.8% of the computational cost of a pure simulation-based approach.

SLIDE 18

Deep-Learning-based Evolutionary Design

“Evolutionary design of organic molecules based on deep learning and genetic algorithm”, COMP, ACS Fall 2018 National Meeting

SLIDE 19

Evolutionary Design

  • A generic population-based metaheuristic optimization technique
  • Uses bio-inspired operators to reach near-optimal solutions; mutation, crossover, and selection in the case of a genetic algorithm

Workflow: Initial population → Calculate fitness → Selection → Crossover → Mutation → New population → constraints satisfied? (No: iterate / Yes: done)

[Fitness-landscape illustration: average fitness vs. generation; https://en.wikipedia.org/wiki/Fitness_landscape]
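The workflow above can be sketched as a minimal genetic algorithm. The fitness function here is a toy (count of set bits, standing in for a predicted property score), and the operator choices (tournament selection, single-point crossover, per-bit mutation, two-individual elitism) are illustrative defaults, not the talk's exact configuration.

```python
import random

random.seed(1)

N_BITS, POP, GENS = 40, 60, 80

def fitness(ind):
    """Toy fitness standing in for a property score: count of set bits."""
    return sum(ind)

def select(pop):
    """Tournament selection: the fitter of two random individuals wins."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Single-point crossover of two parent bit strings."""
    cut = random.randrange(1, N_BITS)
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.02):
    """Flip each bit with a small probability."""
    return [b ^ (random.random() < rate) for b in ind]

# Initial population -> fitness -> selection -> crossover -> mutation -> new population
pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:2]                      # elitism: keep the two best unchanged
    pop = elite + [mutate(crossover(select(pop), select(pop)))
                   for _ in range(POP - 2)]

best = max(pop, key=fitness)
print("best fitness:", fitness(best), "/", N_BITS)
```

Elitism makes the best fitness monotone over generations, which is why the average-fitness curve on the slide climbs toward the optimum.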

SLIDE 20

Deep-Learning-Based Evolutionary Design

  • Proposed approach (conventional → proposed, with expectations)
– Molecular descriptor: graph or ASCII string → bit string (ECFP*); prevents heuristic bias
– Molecular evolution: heuristic → random (crossover → mutation); secures chemical validity
– Fitness evaluation: simple assessment → DNN/RNN; versatile evaluation is possible

Workflow: Seed molecule (ECFP) → Evolution (crossover → mutation, n=50) → Decoding to SMILES (RNN) → Inspection of chemical validity → Fitness evaluation (DNN) → Selection → iterate → Best-fit molecule (parent bit strings are recombined by crossover and flipped by mutation; DB)

*ECFP (Extended-Connectivity FingerPrint), DNN (Deep Neural Network), RNN (Recurrent Neural Network), SMILES (Simplified Molecular-Input Line-Entry System)

SLIDE 21

Deep-Learning-Based Evolutionary Design

  • Deep learning models
  • [DNN] 3 hidden layers, 500 hidden units in each layer; input: ECFP*, output: properties
  • [RNN] 3 hidden layers, 500 long short-term memory units; input: ECFP*, output: SMILES

– RNN decoding: starting from <start>, the model emits one token per step (y1=‘CCC’, y2=‘CCC’, y3=‘CC(’, …, yT=‘)=O’, then <end>); y = (‘CCC’, ‘CCC’, ‘CC(’, …, ‘)=O’) → ‘CCCC(N)=O’

*ECFP (dimension=5,000, neighbor size=6)

SLIDE 22

Deep Learning-Based Evolutionary Design

  • Validation test
  • Design target: change the S1 (light-absorbing wavelength) of seed molecules
  • Training data: M.W. 200~600 g/mol from PubChem (10,000~50,000 molecules)

※1. No. of test data = No. of training data / 10
※2. Chemical validity is evaluated with RDKit; no. of test data = 5,000

[Parity plots, DNN vs. DFT: HOMO (eV) R=0.945; LUMO (eV) R=0.955; S1 (eV) R=0.973]

No. of training data | S1 (R, MAE) | HOMO (R, MAE) | LUMO (R, MAE) | Success rate of decoding※2 (RNN)
① 50,000 | 0.973, 0.198 | 0.945, 0.172 | 0.955, 0.209 | 86.7%
② 30,000 | 0.930, 0.228 | 0.934, 0.191 | 0.945, 0.224 | 85.3%
③ 10,000 | 0.913, 0.278 | 0.885, 0.244 | 0.917, 0.287 | 83.2%

SLIDE 23

Deep Learning-Based Evolutionary Design

  • Evolution toward the increase and decrease of S1 (eV)
  • Seed: 50 randomly selected molecules (3.8 < S1 < 4.2)
  • Number of training data = 10k, 30k, 50k

[Plot: average rate of S1 change (%) vs. generation (100–500), toward the increase and toward the decrease of S1, for 10,000, 30,000, and 50,000 training data]
[Histogram: S1 distribution in the training data (50k), centered near 4.0 eV]

SLIDE 24

Deep Learning-Based Evolutionary Design

  • Evolution under constraints on HOMO and LUMO (eV)
  • Seed: 50 randomly selected molecules (3.8 < S1 < 4.2)
  • Number of training data = 50k
  • Constraint: -7.0 < HOMO < -5.0, LUMO < 0.0

[Plot: average rate of S1 change (%) vs. generation (100–500), with and without the constraint, toward the increase and toward the decrease of S1]
[Histograms: HOMO and LUMO distributions in the training data (50k), with the -7.0 eV, -5.0 eV, and 0 eV constraint boundaries marked]

SLIDE 25

Deep Learning-Based Evolutionary Design

  • Examples of evolved molecules (No. of training data = 50k)
  • Constraint (eV): -7.0 < HOMO < -5.0, LUMO < 0.0
SLIDE 26

Conclusions

  • A fully data-driven evolutionary molecular design based on deep-learning models (DNN & RNN) was proposed; it automatically evolved seed molecules toward the target without any pre-defined chemical rules.

  • Unlike HTCS, the closed-loop evolutionary workflow guided by deep learning automatically derived target molecules and found rational design paths by elucidating the relationship between structural features and their effect on molecular properties.

SLIDE 27

Deep-Learning-based Inverse Design

npj Comput. Mater., 4, 67 (2018)

SLIDE 28

Deep-Learning-Based Inverse Design

  • Paradigm shift of ML in computer-aided materials discovery

Passive role (screening: predict materials properties):
  • Efficient screening based on property prediction
  • Highly depends on the explicit knowledge of chemists

Active role (artificial intelligence for materials design: propose target materials from target properties):
  • Proposes candidates via automated design
  • Provides implicit knowledge from data

(Database → candidate pool)

SLIDE 29

Deep-Learning-Based Inverse Design

  • Implementation of the inverse-design model

Input: molecular descriptor (x; ECFP format) → encoder → hidden factor z = e(x) (fixed-length vector)
z → DNN f(z) → molecular property (t)
z → RNN decoder d(z) → molecular structure identifier (y; SMILES format)

e(·): encoding function; f(·): property prediction function; d(·): decoding function; z: encoded vector of the molecular descriptor
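The data flow x → z → (t, y) can be sketched with untrained random weights; this shows only the shapes and wiring of e(·), f(·), and d(·), not a trained model. Apart from the ECFP dimension of 5,000 from the talk, all dimensions, the tanh nonlinearity, and the toy recurrent decoder are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (the talk specifies ECFP dimension = 5,000).
D_X, D_Z, D_T, VOCAB, T_MAX = 5000, 256, 1, 40, 60

# Untrained random weights: this is a wiring sketch, not a trained model.
W_e = rng.normal(scale=0.01, size=(D_X, D_Z))    # encoder e(.)
W_f = rng.normal(scale=0.01, size=(D_Z, D_T))    # property head f(.)
W_h = rng.normal(scale=0.01, size=(D_Z, D_Z))    # recurrent weight of d(.)
W_o = rng.normal(scale=0.01, size=(D_Z, VOCAB))  # token output of d(.)

def e(x):
    """z = e(x): encode the descriptor into a fixed-length hidden factor."""
    return np.tanh(x @ W_e)

def f(z):
    """t = f(z): predict the molecular property from the hidden factor."""
    return z @ W_f

def d(z):
    """y = d(z): greedy token decoding with a toy recurrent cell."""
    h, tokens = z, []
    for _ in range(T_MAX):
        h = np.tanh(h @ W_h + z)           # condition every step on z
        tokens.append(int(np.argmax(h @ W_o)))
    return tokens

x = (rng.random(D_X) < 0.02).astype(float)  # sparse ECFP-like bit vector
z = e(x)
print(z.shape, f(z).shape, len(d(z)))       # → (256,) (1,) 60
```

Because property prediction and structure decoding branch from the same z, optimizing z against f and then decoding with d is what turns the predictor into a design tool.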

SLIDE 30

Deep-Learning-Based Inverse Design

  • Inverse design of light-absorbing organic molecules (1/2)
  • Training DB
‒ 50k molecules sampled from PubChem (M.W. 200~600)
‒ DFT calculations for S1

[Panels a–c: distribution of λmax (nm) of the inverse-designed molecules for each target*]

Target λmax = 200–300 nm: hit rate 82.6%
Target λmax = 300–400 nm: hit rate 64.8%
Target λmax = 400–500 nm: hit rate 45.6%

*Simulation values for the 500 molecules in each target

※ About 10% of the designed molecules were found in PubChem even though they were not included in the randomly selected training library.

SLIDE 31

Deep-Learning-Based Inverse Design

  • Inverse design of light-absorbing organic molecules (2/2)

Examples of inverse-designed molecules that share moieties with well-known dye materials:
  • a. Anthraquinone derivative (λmax=433.4 nm)
  • b. Azobenzene derivative (λmax=527.5 nm)
  • c. Isoindoline derivative (λmax=434.4 nm)
  • d. Squaraine derivative (λmax=503.5 nm)

SLIDE 32

Deep-Learning-Based Inverse Design

  • Inverse design of hosts for blue phosphorescent OLEDs (1/3)
  • Target: T1 ≥ 3.00 eV
  • Training DB
‒ In-house library of 6,000 molecules by combinatorial enumeration (with nine linker (L) and fifty-seven terminal (R) fragments frequently employed in OLED hosts; symmetric R-L-R & R-R type enumeration).
‒ Property labeling with DFT calculations.

[Plot: distribution of simulated T1 (eV) energy levels for the generated 3,205 molecules; a: untargeted inverse design (mean=2.94, std=0.15), b: targeted inverse design, T1 ≥ 3.00 eV (mean=3.02, std=0.10), c: training library, 3,497 molecules (mean=2.92, std=0.13)]

Fractions of hosts satisfying the target (T1 ≥ 3.00 eV): 36.2% for a, 58.7% for b, 26.9% for c

SLIDE 33

Deep-Learning-Based Inverse Design

  • Inverse design of hosts for blue phosphorescent OLEDs (2/3)

Examples of inverse-designed host materials, nine molecules a1–a3, b1–b3, c1–c3, each with ML-predicted vs. DFT T1 values (e.g., ML 3.13 / DFT 3.16 eV, ML 3.11 / DFT 3.12 eV, ML 3.12 / DFT 3.12 eV, …):
  • a. Asymmetric molecules with the given fragments in the training library
  • b. Symmetric molecules where new fragments were introduced
  • c. Asymmetric molecules where new fragments were introduced

Experimental characterization (eV):
a1: HOMO -5.98, LUMO -2.43, S1 3.56, T1 3.06, ΔEST 0.55
b1: HOMO -5.96, LUMO -2.14, S1 3.64, T1 2.93, ΔEST 1.01
c1: HOMO -6.07, LUMO -2.65, S1 3.38, T1 2.97, ΔEST 0.46

SLIDE 34

Deep-Learning-Based Inverse Design

  • Inverse design of hosts for blue phosphorescent OLEDs (3/3)

The connection rules of the inverse-designed molecules (total 3,205 host molecules):
– R-L-R (3,010): R-Lsym-R (1,931) = R1-Lsym-R1 (403) + R1-Lsym-R2 (1,528); R-Lasym-R (1,079) = R1-Lasym-R1 (636) + R1-Lasym-R2 (443)
– R-R (190): R1-R1 (16) + R1-R2 (174)
– R (1); L-(R1,R2,R3) ★ (4)

  • L: linker fragment; R: terminal fragment; Lsym: symmetric linker; Lasym: asymmetric linker

SLIDE 35

Conclusions

  • A fully data-driven inverse design method successfully extracted the latent materials design rules and proposed target molecular structures without any external intervention.

  • The inverse design model successfully proposed new candidates by modifying the assembly rules and creating new fragments.

SLIDE 36

Efficacy of Computer-Aided Materials Discovery

Simulation-based screening (HTCS for a pre-defined chemical space):
– 1st trial: 1M candidates, QC simulations take 1.5 years → fail to find the target structure
– 2nd trial: 1M candidates, QC simulations take 1.5 years → fail to find the target structure
– 3rd trial: 1M candidates, QC simulations take 1.5 years → succeed in finding the right structure
→ Total TAT took 4.5 years

Inverse design (full search, 5M candidates):
– [Step 1] Building the training dataset: needs QC simulations for only 50k molecules (27 days)
– [Step 2] Deep-learning model training with GPU (3 days)
– [Step 3] QC simulations for the proposed molecules (1 day)
→ Total TAT takes 1 month

→ More than 50X speed-up (4.5 years vs. 1 month)

* QC simulation tool: Turbomole. Total computational resources = 10,000 CPUs; with 10-CPU computing per molecule, the simulation requires about 13 hrs.

“The inverse design learns by itself the molecular design rules inherent in the libraries and can reduce the effort of researchers and the total time to reach the goal.”
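The claimed speed-up follows directly from the turnaround times on this slide; a quick arithmetic check (using an assumed 365-day year and 31-day month to convert units) confirms the "more than 50X" figure:

```python
# Rough turnaround-time (TAT) comparison from the slide, expressed as a check.
# Figures: 3 screening trials of 1.5 years each vs. a ~1-month inverse-design
# pipeline (27 days labeling + 3 days training + 1 day QC verification).
screening_days = 3 * 1.5 * 365   # 4.5 years of simulation-based screening
inverse_days = 27 + 3 + 1        # steps 1-3 of the inverse-design workflow
speedup = screening_days / inverse_days
print(f"{speedup:.0f}x")         # ≈ 53x, i.e. "more than 50X"
```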

SLIDE 37

Prospects for AI-based Materials Development

Modules: Design, Simulation, Synthesis, Analytical Chemistry
Closed loop: Design → Synthesis → Analysis → Training, with an AI for materials design proposing target materials from target properties and a database

Multiscale simulation of energies and electronic properties: DFT simulation (< 10^2 atoms) → neural network (MD potential) → meso-scale simulation (~10^4 atoms)

SLIDE 38

ysuk.choi@samsung.com