Contents
1
- Trend in Computer-Aided Materials Discovery
- High-Throughput Computational Screening & Exhaustive Enumeration
- Deep-Learning-based Evolutionary Design
- Deep-Learning-based Inverse Design
- Efficacy of Computer-Aided Materials Discovery
Trend in Computer-Aided Materials Discovery
2
[Figure] Evolution of materials discovery approaches:
- Conventional trial-and-error (high cost): iterative experiments
- [ 1st Gen. ] Simulation (low throughput): pre-validation
- [ 2nd Gen. ] Virtual screening (low hit-rate): high throughput
- [ 3rd Gen. ] Targeted design (high hit-rate): right solutions with minimum effort
- For accelerated materials discovery
Enabling technologies: Machine Learning, First-principles Quantum Chemistry, High-performance Computing
→ Intelligence, Efficiency, Rationalization
- Prediction of material properties based on machine learning
– Build-up of a materials-vs.-property DB → Materials Informatics
Trend in Computer-Aided Materials Discovery
3
[Timeline] QSAR* ('62, Hansch & Fujita) → ANN** in chemistry ('71) → SMILES*** ('87, Weininger) → graph kernels ('05 @ UC Irvine) → Bayesian modeling ('09 @ MIT) → deep learning ('16 @ Stanford, '18 @ Harvard)
Stages: cheminformatics → introduction stage of machine learning (kernel methods, Bayesian approaches) → development stage (deep learning)
[Figure] Descriptors used for training and analysis:
- SMILES: CC(C)NCC(O)COC1=CC(CC2=CC=CC=C2)=C(CC(N)=O)C=C1
- Fingerprint: 011100011111101010010100100000101010001001010…
- Descriptor vectors, graphs, images
Process of Machine Learning @ Materials Research
* QSAR: Quantitative Structure-Activity Relationship ** ANN: Artificial Neural Network *** SMILES: Simplified Molecular-Input Line-Entry System
Trend in Computer-Aided Materials Discovery
4
- Materials design based on machine learning
– Inverse QSAR → Inverse Design
[Timeline] Inverse QSAR (late '80s~) → genetic algorithms ('92 @ Purdue) → exhaustive generation ('12 @ Tokyo) → SMILES autoencoder ('16 @ Harvard) → inverse design ('16 @ SAIT) → GAN* for molecules ('17 @ Harvard)
Approaches: combinatorial → evolutionary → autoencoder / deep learning & generative models; focus on autonomous molecular generation
*GAN: Generative Adversarial Network
[Figure] Materials discovery methodologies and their elemental technologies:
- [ In ] Targets → [ Out ] Molecules
- Methodologies: HTCS (High-Throughput Computational Screening), Inverse Design, Evolutionary Design
- Elemental technologies: automated simulation, DB, machine learning (materials informatics), molecular enumeration
Trend in Computer-Aided Materials Discovery
5
- In-silico technologies for materials discovery → target molecules
High-Throughput Computational Screening & Exhaustive Enumeration
6 “Landscape of phosphorescent light-emitting energies of homoleptic Ir(III)- complexes predicted by a graph-based enumeration and deep learning”, GI01.02.02, 2018 MRS fall meeting
High-Throughput Computational Screening
7
- Property prediction with high-performance computing for large-scale exploration of materials candidates
[Workflow] Seed fragments → combination → candidate pool (large amounts of candidates) → simulation → database → verification → target materials
High-Throughput Computational Screening
8
- ML (Machine Learning)-assisted HTCS for higher efficiency
[Workflow] Seed fragments → combination → candidate pool (large amounts of candidates) → (1) simulation + ML, (2) calculations prioritized by active learning → database → verification → target materials
High-Throughput Computational Screening
9
- Exhaustive enumeration based on graph theory
– "Graphs"
- Mathematical structures used to model pairwise relations between objects.
- Made up of nodes and edges.
- In chemistry, a graph is used to model a molecule, where nodes represent atoms and edges represent bonds.
※ Exhaustive enumeration: systematic enumeration of all possible molecules for optimal-solution search
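As a minimal sketch (illustrative, not from the slides), a molecular graph can be stored as an adjacency list, with nodes as atoms and edges as bonds; ethanol is used here as an assumed example:

```python
# Minimal sketch: a molecule as a graph, nodes = atoms, edges = bonds.
# Ethanol (SMILES "CCO") as an illustrative example.
atoms = {0: "C", 1: "C", 2: "O"}   # node id -> element
bonds = [(0, 1), (1, 2)]           # edges (single bonds)

# Build an adjacency list from the edge set.
adjacency = {i: [] for i in atoms}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

# Node degree = number of bonds at each atom.
degrees = {i: len(neigh) for i, neigh in adjacency.items()}
print(degrees)  # {0: 1, 1: 2, 2: 1}
```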
High-Throughput Computational Screening
10
- Complete list of non-isomorphic graphs
http://www.cadaeic.net/graphpics.htm
[Figure] Graphs indexed by ID, number of edges, and number of edges at each node (degree sequence)
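Listing non-isomorphic graphs can be done by brute force on small node counts; the sketch below (an assumed method, not the authors' code) tries every edge subset and keeps one representative per isomorphism class. The slides' count of 29 graphs for 3–5 rings is consistent with the connected non-isomorphic graphs on 3, 4, and 5 nodes (2 + 6 + 21):

```python
from itertools import combinations, permutations

# Brute-force sketch: enumerate non-isomorphic simple graphs on n nodes by
# trying every edge subset and discarding graphs isomorphic to one already kept.
def nonisomorphic_graphs(n):
    nodes = range(n)
    possible_edges = list(combinations(nodes, 2))
    kept = []
    for k in range(len(possible_edges) + 1):
        for edges in combinations(possible_edges, k):
            edge_set = set(frozenset(e) for e in edges)
            # Isomorphic iff some node relabeling maps edge_set onto a kept graph.
            if not any(
                len(g) == len(edge_set) and any(
                    set(frozenset((p[a], p[b]))
                        for a, b in (tuple(e) for e in edge_set)) == g
                    for p in permutations(nodes)
                )
                for g in kept
            ):
                kept.append(edge_set)
    return kept

# 4 non-isomorphic simple graphs on 3 nodes (0, 1, 2, or 3 edges).
print(len(nonisomorphic_graphs(3)))  # 4
```

This check-everything approach is exponential and only practical for the small node counts used here (3–5 rings).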
High-Throughput Computational Screening
11
- Landscape of phosphorescent light-emitting energies of
homoleptic Ir(III)-complex core structures
– Ir(III)-complexes
- Widely used as phosphorescent OLED dopants.
- Figuring out the full landscape of emission color is important for
discovering high-performing molecules in target color regions.
New J. Chem., 39, 246 (2015) ACS Appl. Mater. Interfaces, 10, 1888–1896 (2018) Organic Electronics, 63, 244–249 (2018)
High-Throughput Computational Screening
12
- Approach
– Consider the nodes in a graph as rings and the edges as ring-connections.
– Limit the total number of rings to between 3 and 5.
– Exclude non-planar types (5-21) and structures invalid as dopants.
→ Only 11 graphs are valid among the 29 total graphs.
High-Throughput Computational Screening
13
- 1. Graphs
- 2. Skeletons
- 3. Set Iridium positions
- 4. Substitute some carbon atoms with nitrogen atoms
- Enumeration
– For 5- and 6-membered rings (405 skeletons in total).
– Substitute some carbons of each molecule with nitrogen atoms (max. five).
→ 9,919,469 (~10M) core structures in total
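Step 4 above (carbon-to-nitrogen substitution) can be sketched as a combinatorial enumeration; the flat-ring model and function name below are illustrative assumptions, not the authors' implementation:

```python
from itertools import combinations

# Illustrative sketch: enumerate nitrogen substitutions on the carbon
# positions of a ring, up to a maximum number of substitutions.
def nitrogen_substitutions(n_carbons, max_n):
    """Yield each variant as a tuple of atom symbols."""
    positions = range(n_carbons)
    for k in range(max_n + 1):
        for subs in combinations(positions, k):
            yield tuple("N" if i in subs else "C" for i in positions)

# A single 6-membered carbon ring with up to 3 N substitutions:
variants = list(nitrogen_substitutions(6, 3))
# C(6,0) + C(6,1) + C(6,2) + C(6,3) = 1 + 6 + 15 + 20 = 42 variants
print(len(variants))  # 42
```

Applying this per-skeleton expansion (with up to five substitutions and symmetry handling) across the 405 skeletons is what blows the pool up to the ~10M core structures.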
High-Throughput Computational Screening
14
- Property prediction
– Trained a deep-neural-network model with simulated T1 data
- Input: ECFP(Extended Connectivity FingerPrints) of molecular structures
- Outputs: T1 energy (phosphorescent light-emitting wavelength)
[Figure] Mean absolute error of the DNN-predicted T1 (eV) vs. size of the training dataset (10K–80K)
With 80k training data, the average prediction error was less than 0.1 eV.
80k / 10M = 0.8%
By simulating the properties of only 0.8% of the molecules, we can fully scan the chemical space of 10M!
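The setup above (fingerprint bits in, a scalar property out) can be sketched with a tiny one-hidden-layer network on synthetic data. Everything below is an illustrative stand-in: random bit vectors instead of ECFPs, a linear synthetic target instead of simulated T1, and a far smaller network than the paper's:

```python
import numpy as np

# Synthetic "fingerprint -> property" data (stand-in for ECFP -> T1).
rng = np.random.default_rng(0)
n_samples, n_bits = 512, 64
X = rng.integers(0, 2, size=(n_samples, n_bits)).astype(float)
true_w = rng.normal(size=n_bits)
y = X @ true_w / np.sqrt(n_bits)

# One hidden layer with ReLU, trained by full-batch gradient descent on MSE.
h, lr = 32, 0.05
W1 = rng.normal(scale=0.1, size=(n_bits, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=h);           b2 = 0.0

def forward(X):
    a = np.maximum(X @ W1 + b1, 0.0)   # ReLU activations
    return a, a @ W2 + b2

_, pred0 = forward(X)
mse_start = np.mean((pred0 - y) ** 2)
for _ in range(500):
    a, pred = forward(X)
    err = pred - y                      # gradient of MSE w.r.t. predictions
    gW2 = a.T @ err / n_samples
    gb2 = err.mean()
    da = np.outer(err, W2) * (a > 0)    # backprop through ReLU
    gW1 = X.T @ da / n_samples
    gb1 = da.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
_, pred = forward(X)
mse_end = np.mean((pred - y) ** 2)
print(mse_end < mse_start)  # True: training reduced the loss
```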
High-Throughput Computational Screening
15
- Results
– Distribution of T1 values
– Blue-emitting materials are rare compared with red and green
[Figure] Histogram of predicted T1 (eV) over the 10M core structures (number of molecules ×100,000 per bin): Blue (0.4%), Green (4.3%), Red (18.4%)
Conclusions
16
- In materials discovery, deep-learning-based HTCS is a good alternative to the conventional trial-and-error approach.
- Moreover, exhaustive enumeration makes it possible to systematically explore the whole chemical space.
- With the proposed exhaustive enumeration method based on graph theory and deep learning, the whole landscape of 10M phosphorescent Ir-dopants could be scanned at just 0.8% of the computational cost of a purely simulation-based approach.
Deep-Learning-based Evolutionary Design
17 “Evolutionary design of organic molecules based on deep learning and genetic algorithm”, COMP, ACS Fall 2018 National Meeting
Evolutionary Design
18
- A generic population-based metaheuristic optimization technique
- Uses bio-inspired operators to reach near-optimal solutions
(mutation, crossover, and selection in the case of a genetic algorithm)
[Figure] Fitness landscape (average fitness vs. generation): https://en.wikipedia.org/wiki/Fitness_landscape
[Flowchart] Initial population → calculate fitness → selection → mutation → crossover → new population → satisfy constraints? (No: iterate / Yes: done)
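The genetic-algorithm loop in the flowchart can be sketched on bit strings (matching the ECFP representation used later). The fitness function and all parameters below are toy assumptions, not the authors' setup:

```python
import random

# Toy genetic algorithm: evolve bit strings toward all-ones using
# selection, crossover, and mutation, as in the flowchart.
random.seed(0)
N_BITS, POP, GENS = 32, 40, 60

def fitness(ind):                 # stand-in for a property evaluation
    return sum(ind)

def crossover(a, b):              # single-point crossover
    cut = random.randrange(1, N_BITS)
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.02):       # bit-flip mutation
    return [1 - g if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
start_best = max(map(fitness, pop))
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[: POP // 2]     # truncation selection (elitist: kept as-is)
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children      # new population
best = max(map(fitness, pop))
print(best >= start_best)  # True: elitism keeps best fitness non-decreasing
```

Because the top half of each generation is carried over unchanged, the best fitness can never decrease, which mirrors the "calculate fitness → selection" step of the diagram.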
Deep-Learning-Based Evolutionary Design
19
- Proposed approach

               | Molecular descriptor     | Molecular evolution    | Fitness evaluation
Conventional   | Graph or ASCII string    | Heuristic              | Simple assessment
Proposed       | Bit string (ECFP)        | Random                 | DNN
Expectations   | Secure chemical validity | Prevent heuristic bias | Versatile evaluation is possible

*ECFP (Extended Connectivity FingerPrint), DNN (Deep Neural Network), RNN (Recurrent Neural Network), SMILES (Simplified Molecular-Input Line-Entry System)

[Workflow] Seed molecule (ECFP) → evolution (crossover → mutation, n=50) → decoding to SMILES (RNN) → inspection of chemical validity → fitness evaluation (DNN) → selection → iterate → best-fit molecule → DB
[Figure] Parent bit strings are recombined by crossover, then altered by bit-flip mutation
Deep Learning-Based Evolutionary Design
20
[Figure] RNN decoder unrolled over time steps t=1…T+1: given the input ECFP*, the model emits one token per step (e.g., y1=‘CCC’, y2=‘CCC’, …, yT=‘)=O’), starting from <start> and stopping at <end>; the token sequence y = (‘CCC’, ‘CCC’, ‘CC(’, …, ‘)=O’) is joined into the SMILES string ‘CCCC(N)=O’.
*ECFP (dimension=5,000, neighbor size=6)
- Deep learning models
- [DNN] 3 hidden layers, 500 hidden units in each layer; input: ECFP*, output: properties
- [RNN] 3 hidden layers, 500 long short-term memory units; input: ECFP*, output: SMILES
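The token-by-token assembly with a stop symbol can be sketched with a stub standing in for the trained RNN (`next_token` below replays a fixed character sequence and is purely hypothetical; a real decoder conditions on the ECFP and the emitted prefix):

```python
# Sketch of stop-symbol-terminated decoding; the stub emits one fixed
# character per step instead of a learned distribution.
def next_token(prefix,
               sequence=("C", "C", "C", "C", "(", "N", ")", "=", "O", "<end>")):
    return sequence[len(prefix)]

tokens = []
while True:
    tok = next_token(tokens)
    if tok == "<end>":            # stop symbol, as in the slide
        break
    tokens.append(tok)
smiles = "".join(tokens)
print(smiles)  # CCCC(N)=O
```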
Deep Learning-Based Evolutionary Design
21
- Validation test
- Design target: change the S1 (light-absorbing wavelength) of seed molecules
- Training data: M.W. 200~600 g/mol from PubChem (10,000~50,000 molecules)
※1. No. of test data = no. of training data / 10  ※2. Chemical validity evaluated with RDKit
[Figure] DNN-predicted vs. DFT values (no. of test data = 5,000): HOMO (eV, R=0.945), LUMO (eV, R=0.955), S1 (eV, R=0.973)
No. of training data | DNN prediction accuracy※1 (R, MAE): S1 | HOMO | LUMO | RNN decoding success rate※2
① 50,000 | 0.973, 0.198 | 0.945, 0.172 | 0.955, 0.209 | 86.7%
② 30,000 | 0.930, 0.228 | 0.934, 0.191 | 0.945, 0.224 | 85.3%
③ 10,000 | 0.913, 0.278 | 0.885, 0.244 | 0.917, 0.287 | 83.2%
Deep Learning-Based Evolutionary Design
22
- Evolution toward the increase and decrease of S1 (eV)
- Seed: randomly selected 50 molecules (3.8<S1<4.2)
- Number of training data = 10k, 30k, 50k
[Figure] Average rate of S1 change (%) vs. generation (up to 500), toward the increase and the decrease of S1, for training-data sizes of 10,000, 30,000, and 50,000 (curves with and without constraint); inset: S1 distribution in the training data (50k), centered near 4.0 eV
Deep Learning-Based Evolutionary Design
23
- Evolution under the constraint of HOMO and LUMO (eV)
- Seed: randomly selected 50 molecules (3.8<S1<4.2)
- Number of training data = 50k
- Constraint: -7.0 < HOMO < -5.0, LUMO < 0.0
[Figure] Evolution trajectories (①–④) of HOMO and LUMO levels vs. generation, toward the increase and the decrease of S1, with the constraint window (-7 eV < HOMO < -5 eV, LUMO < 0 eV) marked; insets: HOMO and LUMO distributions in the training data (50k)
Deep Learning-Based Evolutionary Design
24
- Examples of evolved molecules (no. of training data = 50k)
- Constraint (eV): -7.0 < HOMO < -5.0, LUMO < 0.0
Conclusions
25
- A fully data-driven evolutionary molecular design based on deep-learning models (DNN & RNN) was proposed; it automatically evolved seed molecules toward targets without any pre-defined chemical rules.
- Unlike HTCS, the closed-loop evolutionary workflow guided by deep learning automatically derived target molecules and found rational design paths by elucidating the relationship between structural features and their effect on molecular properties.
Deep-Learning-based Inverse Design
26 npj Comput. Mater., 4, 67, 2018
Deep-Learning-Based Inverse Design
27
- Paradigm shift of ML in computer-aided materials discovery
[Figure] Artificial intelligence for materials design: database / candidate pool → predict materials properties and screen (passive role) vs. target properties → propose target materials (active role)
- Passive role: efficient screening based on property prediction; highly dependent on the explicit knowledge of chemists
- Active role: proposes candidates via automated design; provides implicit knowledge extracted from data
Deep-Learning-Based Inverse Design
28
- Implementation of inverse-design model
z = e(x): encoding function (DNN encoder); x = molecular descriptor (ECFP format)
f(z): property prediction function → molecular property (t)
d(z): decoding function (RNN decoder) → molecular structure identifier (y; SMILES format)
z: encoded vector of the molecular descriptor (fixed-length hidden factor)
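The three-function structure above can be sketched with stubs (the bodies below are arbitrary placeholders, not the trained networks), which makes the data flow x → z → (t, y) concrete:

```python
# Structural sketch of the inverse-design model: encoder e(.), property
# predictor f(.), decoder d(.). All three bodies are illustrative stubs.
def e(x):                 # encoder: descriptor -> fixed-length latent vector
    return [sum(x[i::4]) for i in range(4)]

def f(z):                 # property prediction on the latent code
    return 0.5 * sum(z)

def d(z):                 # decoder: latent code -> SMILES-like identifier
    return "C" * max(1, int(round(sum(z))))

x = [1, 0, 1, 1, 0, 1, 0, 0]   # toy ECFP-like bit vector
z = e(x)                       # z = e(x)
t = f(z)                       # predicted property
y = d(z)                       # decoded structure identifier
print(z, t, y)
```

Inverse design then amounts to searching or conditioning in z-space so that f(z) hits the target property, and reading out molecules through d(z).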
Deep-Learning-Based Inverse Design
29
- Inverse design of light-absorbing organic molecules (1/2)
- Training DB
‒ 50k molecules sampled from PubChem (M.W. 200~600)
‒ DFT calculations for S1
[Figure] Distribution of λmax (nm) of the inverse-designed molecules* for three targets (a, b, c):
- a. Target λmax = 200–300 nm → hit rate 82.6%
- b. Target λmax = 300–400 nm → hit rate 64.8%
- c. Target λmax = 400–500 nm → hit rate 45.6%
*Simulation values for the 500 molecules in each target
※ About 10% of the designed molecules were found in PubChem even though they were not included in the randomly selected training library.
Deep-Learning-Based Inverse Design
30
- Inverse design of light-absorbing organic molecules (2/2)
Examples of inverse-designed molecules that share moieties with well-known dye materials:
- a. Anthraquinone derivative (λmax=433.4 nm)
- b. Azobenzene derivative (λmax=527.5 nm)
- c. Isoindoline derivative (λmax=434.4 nm)
- d. Squaraine derivative (λmax=503.5 nm)
Deep-Learning-Based Inverse Design
31
- Inverse design of hosts for blue phosphorescent OLED (1/3)
- Target: T1 ≥ 3.00 eV
- Training DB
‒ In-house library of 6,000 molecules by combinatorial enumeration (with nine linker (L) and fifty-seven terminal (R) fragments frequently employed in OLED hosts; symmetric R-L-R & R-R type enumeration).
‒ Property labeling with DFT calculations.
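The R-L-R / R-R enumeration pattern can be sketched as follows; the fragment lists are tiny placeholders, not the actual 9 linkers and 57 terminals:

```python
from itertools import combinations_with_replacement

# Sketch of the combinatorial enumeration behind the training library:
# symmetric R-L-R and R-R type molecules from fragment lists.
linkers = ["L1", "L2", "L3"]
terminals = ["R1", "R2", "R3", "R4"]

# Symmetric R-L-R: (Ra, L, Rb) and (Rb, L, Ra) are the same molecule,
# so use unordered terminal pairs.
rlr = [(ra, l, rb)
       for l in linkers
       for ra, rb in combinations_with_replacement(terminals, 2)]

# R-R type: two terminals joined directly, again unordered.
rr = list(combinations_with_replacement(terminals, 2))

# 3 linkers x 10 unordered terminal pairs = 30 R-L-R; 10 R-R.
print(len(rlr), len(rr))  # 30 10
```

Scaling the same pattern to 9 linkers and 57 terminals gives the size of library used in the paper (thousands of molecules) before DFT labeling.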
[Figure] Distribution of simulated T1 (eV) energy levels for the generated 3,205 molecules vs. the training library; fractions of hosts satisfying the target (T1 ≥ 3.00 eV):
- a. Untargeted inverse design: mean=2.94, std=0.15 → 36.2% satisfy the target
- b. Targeted inverse design (T1 ≥ 3.00 eV): mean=3.02, std=0.10 → 58.7% satisfy the target
- c. Training library: mean=2.92, std=0.13 → 26.9% satisfy the target (3,497 molecules)
Deep-Learning-Based Inverse Design
32
- Inverse design of hosts for blue phosphorescent OLED (2/3)
[Figure] Examples of inverse-designed host materials a1–a3, b1–b3, c1–c3, each with ML-predicted vs. DFT T1 (eV), e.g., T1 ML 3.13 eV / DFT 3.16 eV:
- a. Asymmetric molecules with the given fragments in the training library
- b. Symmetric molecules where new fragments were introduced
- c. Asymmetric molecules where new fragments were introduced

DFT results (eV) | HOMO | LUMO | S1 | T1 | ΔEST
a1 | -5.98 | -2.43 | 3.56 | 3.06 | 0.55
b1 | -5.96 | -2.14 | 3.64 | 2.93 | 1.01
c1 | -6.07 | -2.65 | 3.38 | 2.97 | 0.46
Deep-Learning-Based Inverse Design
33
- Inverse design of hosts for blue phosphorescent OLED (3/3)
The connection rules of the inverse-designed molecules (linker + terminal1/2/3):
Total host molecules (3,205)
├─ R-L-R (3,010)
│  ├─ R-Lsym-R (1,931): R1-Lsym-R1 (403), R1-Lsym-R2 (1,528)
│  └─ R-Lasym-R (1,079): R1-Lasym-R1 (636), R1-Lasym-R2 (443)
├─ R-R (190): R1-R1 (16), R1-R2 (174)
├─ L-(R1,R2,R3) ★ (4)
└─ R (1)
- L: linker fragment, R: terminal fragment, Lsym: symmetric linker, Lasym: asymmetric linker
Conclusions
34
- A fully data-driven inverse design method successfully extracted
the latent materials design rules and proposed target molecular structures without any external intervention.
- The inverse design model successfully proposed new candidates by modifying the assembly rules and creating new fragments.
Efficacy of Computer-Aided Materials Discovery
35
Simulation-based screening (HTCS for a pre-defined chemical space):
- 1st trial: 1M candidates; QC simulations take 1.5 years; fail to find the target structure
- 2nd trial: 1M candidates; QC simulations take 1.5 years; fail to find the target structure
- 3rd trial: 1M candidates; QC simulations take 1.5 years; succeed in finding the right structure
→ Total TAT took 4.5 years

Inverse design (full search over a 5M chemical space; total TAT takes 1 month):
- [Step 1] Building the training dataset: QC simulations for only 50K molecules (27 days)
- [Step 2] Deep-learning model training with GPU (3 days)
- [Step 3] QC simulations for the proposed molecules (1 day)
→ More than 50X speed-up (4.5 years vs. 1 month)

* QC simulation tool: Turbomole. Total computational resources = 10,000 CPUs; with 10 CPUs per molecule, each simulation takes about 13 hrs.

“The inverse design learns by itself the molecular design rules inherent in the libraries and can reduce the effort of researchers and the total time to reach the goal.”
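The Step-1 estimate can be checked from the stated numbers (50K molecules, 10 CPUs and ~13 hours per molecule, a 10,000-CPU cluster):

```python
# Back-of-the-envelope check of the "27 days" figure in Step 1.
n_molecules = 50_000
cpus_per_molecule = 10
hours_per_molecule = 13
total_cpus = 10_000

molecules_in_parallel = total_cpus // cpus_per_molecule   # 1,000 at a time
batches = n_molecules / molecules_in_parallel             # 50 batches
days = batches * hours_per_molecule / 24
print(round(days, 1))  # 27.1, matching the ~27 days quoted in the slide
```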
Prospects for AI-based Materials Development
36
[Figure] Closed loop of design → synthesis → analysis → training:
- Design / Analytical Chemistry / Simulation / Synthesis
- Neural network (MD potential) for energy: DFT simulation (<100 atoms) → meso-scale simulation (~10^4 atoms)
- Electronic properties
- Artificial intelligence for materials design: target properties + database → propose target materials