More efficient representations of compounds for machine learning - PowerPoint PPT Presentation

More efficient representations of compounds for machine learning models Bing Huang and Anatole von Lilienfeld Institute of Physical Chemistry and National Centre for Computational Design and Discovery of Novel Materials (MARVEL) Department of Chemistry University of Basel Switzerland

1/34 Breaking the hex!
Figure: accuracy (low to high) versus time (from seconds to >> 1 hour). Methods shown: Force Field, semi-MO, DFT, CCSD(T), traditional ML, GAML. Annotations: ~1 s, ~1 cpu.

2/34 Machine Learning - basics
❑ Given a data set {X_0; Y}, learn f: X -> Y on the training set (X is obtained from X_0 by feature abstraction), then infer Y for a new X_0' (test). For molecules, X_0: {Z, R}, Y: E.
❑ Kernel ridge regression: a covariance (kernel) with the length-scale of the data set, plus a noise level. N parameters are regressed for N molecules, plus 2 global parameters.
M. Rupp, et al., PRL, 2012
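
A minimal sketch of the kernel ridge regression step on this slide, under stated assumptions: the Gaussian kernel, the toy 1-D data, and the values of `sigma` (length-scale) and `lam` (noise level) are illustrative choices, not taken from the talk. It shows the N regression coefficients `alpha` plus the 2 global parameters.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Covariance k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def krr_train(X_train, y_train, sigma, lam):
    """Solve (K + lam * I) alpha = y for the N regression coefficients."""
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def krr_predict(X_new, X_train, alpha, sigma):
    """Prediction is a kernel-weighted sum over the training points."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Toy data (an assumption): 1-D inputs standing in for molecular representations.
X_train = np.linspace(0.0, 3.0, 20).reshape(-1, 1)
y_train = np.sin(2.0 * X_train[:, 0])

alpha = krr_train(X_train, y_train, sigma=0.5, lam=1e-6)
X_test = np.array([[0.33], [1.71], [2.5]])
y_pred = krr_predict(X_test, X_train, alpha, sigma=0.5)
```

In practice `sigma` and `lam` would be chosen by cross-validation rather than fixed by hand.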

3/34 Machine Learning - basics
❑ More training data gives better results, provided X (referred to as M hereafter) is proper.
❑ The representation (M) is central to ML.
Figure: error versus training set size N for a poor, a decent, and a good ML model (at large N).
Huang, OAvL, JCP Comm (2016); arxiv.org/abs/1608.06194
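
Learning curves such as the one on this slide are often read on log-log axes. The sketch below assumes a power-law form, error ≈ a * N^(-b), which the slide itself does not state. The function name `fit_power_law` and the numbers are made up for illustration and are not results from the talk.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares line through (log N, log error); returns (a, b) for error ~ a * N**(-b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

# Synthetic learning curve (an assumption, for illustration only).
sizes = [100, 200, 400, 800, 1600]
errors = [10.0 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, errors)
```

Here `a` plays the role of the learning-curve offset and `b` the rate at which the error falls with N.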

4/34 Learning a simple 1-D function
For KRR, f and a*f + b are identical as representations (rpsts).
Figure labels: M, M_i, target; uniqueness; similarity.
Lack of uniqueness → absurd results → noise in training.
OAvL et al., IJQC (2013)

5/34 Learning a "complicated" 1-D function
target: Y = (x-1)(x-2)(x+3)
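
A hedged sketch of why uniqueness matters for this target. It trains a Gaussian-kernel KRR model on Y = (x-1)(x-2)(x+3) twice: once with the unique representation M = x, and once with M = x^2, which maps x and -x to the same value. The interval, kernel, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def target(x):
    return (x - 1.0) * (x - 2.0) * (x + 3.0)

def krr_mae(rep, x_train, x_test, sigma=1.0, lam=1e-6):
    """Train Gaussian-kernel KRR on rep(x_train) and return the test MAE."""
    m_train, m_test = rep(x_train), rep(x_test)

    def kernel(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

    K = kernel(m_train, m_train) + lam * np.eye(len(m_train))
    alpha = np.linalg.solve(K, target(x_train))
    return float(np.mean(np.abs(kernel(m_test, m_train) @ alpha - target(x_test))))

x_train = np.linspace(-3.0, 3.0, 40)
x_test = np.linspace(-2.9, 2.9, 101)

mae_unique = krr_mae(lambda x: x, x_train, x_test)          # M = x
mae_nonunique = krr_mae(lambda x: x ** 2, x_train, x_test)  # M = x^2: x and -x collide
```

With M = x^2 the model receives identical inputs with different targets, so the odd part of Y acts as noise it cannot fit.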

6/34 General guidelines for designing M
❑ If you know f well (exact form unknown), use it as M.
❑ Otherwise, you had better know how f behaves: trial and error; use one monotonic part of f, referred to as g; the best g minimizes ||g - f||_2.
e.g., Morse potential V(x) = -100*(2*exp(-(x-1.4)) - exp(-2(x-1.4)))
Performance: V(x) > 1/x > exp(-(x-1.4)) > x > 1/x^6 >> -(x-1.4)^2
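
One possible way to put a number on "the best g minimizes ||g - f||_2" for the Morse example is sketched below. Because f and a*f + b count as the same representation for KRR, the sketch measures each candidate g after its best affine fit to V. That choice, the grid on x > 1.4, and the helper name `affine_residual` are assumptions. The residual is only a rough proxy and need not reproduce the performance ranking on the slide.

```python
import math

def morse(x):
    """Morse potential from the slide: V(x) = -100 * (2 e^{-(x-1.4)} - e^{-2(x-1.4)})."""
    u = math.exp(-(x - 1.4))
    return -100.0 * (2.0 * u - u * u)

def affine_residual(g_vals, f_vals):
    """min over (a, b) of ||a*g + b - f||_2, via closed-form simple linear regression."""
    n = len(g_vals)
    g_mean = sum(g_vals) / n
    f_mean = sum(f_vals) / n
    var_g = sum((g - g_mean) ** 2 for g in g_vals)
    cov = sum((g - g_mean) * (f - f_mean) for g, f in zip(g_vals, f_vals))
    a = cov / var_g
    b = f_mean - a * g_mean
    return math.sqrt(sum((a * g + b - f) ** 2 for g, f in zip(g_vals, f_vals)))

# Grid on x > 1.4 (an assumption), i.e. one monotonic branch of V.
xs = [1.5 + 0.05 * i for i in range(60)]
f_vals = [morse(x) for x in xs]

candidates = {
    "V(x)": morse,
    "1/x": lambda x: 1.0 / x,
    "exp(-(x-1.4))": lambda x: math.exp(-(x - 1.4)),
    "x": lambda x: x,
    "1/x^6": lambda x: x ** -6,
    "-(x-1.4)^2": lambda x: -(x - 1.4) ** 2,
}
residuals = {name: affine_residual([g(x) for x in xs], f_vals)
             for name, g in candidates.items()}
```

Using the target itself as g gives a residual of essentially zero, which is the limiting case named in the first guideline.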

7/34 Representing molecules
X_0: {Z, R}
Prerequisites: rotation and translation invariant; index permutation invariant; unique (1 rpst ~ 2 molecules is not allowed).
Figure: a glimpse of Chemical Compound Space (CCS). J-L. Reymond et al., ACS Chem. Neuro. (2012)

8/34 Representing molecules
Ψ/ρ/V-based: straightforward!! E = E[Ψ] = E[ρ] = E[V_ext]
Labels: impractical; atom/electron density: practical, accessible.
Projection to CCS: infinite-dimension basis set, M: [c_1, c_2, …, c_n]

9/34 Learning a 1-D functional
System: noninteracting fermions in 1D; property: kinetic energy.
Figure labels: extent of variation; a density in the training set; new density.
MAE < 1.0 kcal/mol
J. C. Snyder, et al., PRL (2012)

10/34 Fingerprint representations
Two panels, each a projection to 1-D: remove rotation; frequency domain; η-grid dependence & substitution; 1-D discrete array.
Works well for Al_n systems. GDB-9 dataset.
OAvL et al., IJQC (2015); V. Botu, et al., IJQC (2015)

11/34 Representing molecules
Why are fingerprint rpsts bad for molecules, but good for Al_n-like systems?
Ψ/ρ/V-based: ||g - f||_2 is large for molecules, small for Al_n.

11/34 Representing molecules
Ψ/ρ/V-based versus E-based: many-body expansion (MBE) of the total energy.
CCS: dimension is significantly reduced!!!
M: [{E(1)}, {E(12)}, {E(123)}, …]
BH, OAvL, JCP Comm., 2016
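
For reference, the many-body expansion behind this representation can be written out in its standard textbook form. The atom indices I, J, K are notation added here; the slide only lists the terms E(1), E(12), E(123).

```latex
E \;=\; \sum_{I} E^{(I)}
   \;+\; \sum_{I<J} E^{(IJ)}
   \;+\; \sum_{I<J<K} E^{(IJK)}
   \;+\; \dots
```

The representation M then collects the one-, two-, and three-body contributions as separate sets instead of summing them.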

12/34 Coulomb matrix (CM) sorted CM random CM “CM”, M. Rupp, et al. , PRL, 2012
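
A short sketch of the Coulomb matrix and its sorted variant, using the commonly quoted definition (diagonal 0.5 * Z_i^2.4, off-diagonal Z_i * Z_j / |R_i - R_j|). The HCN-like coordinates are approximate, illustrative values, and units are simply assumed to be consistent. The random CM variant is not shown.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """M_ii = 0.5 * Z_i**2.4, M_ij = Z_i * Z_j / |R_i - R_j|."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder, avoids division by zero on the diagonal
    M = np.outer(Z, Z) / dist
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    return M

def sorted_coulomb_matrix(Z, R):
    """Permute atoms by descending row norm so the result ignores the input atom order."""
    M = coulomb_matrix(Z, R)
    order = np.argsort(-np.linalg.norm(M, axis=1), kind="stable")
    return M[order][:, order]

# Approximate linear H-C-N geometry (illustrative values only).
Z_hcn = [1, 6, 7]
R_hcn = [[0.0, 0.0, 0.0], [1.06, 0.0, 0.0], [2.21, 0.0, 0.0]]
M_sorted = sorted_coulomb_matrix(Z_hcn, R_hcn)
```

Sorting removes the dependence on atom indexing, one of the prerequisites listed for a molecular representation.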

13/34 Bag of Bonds (BoB) much better than CM, why?? K. Hansen, et al ., JPCL , 2015
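
A simplified sketch of the Bag of Bonds idea: the pairwise Coulomb terms are grouped into bags by element pair, each bag is sorted, and the bags are zero-padded to a fixed size. Only pair terms are included here, and the function names, the `bag_sizes` argument, and the water-like coordinates are assumptions for illustration.

```python
import math
from collections import defaultdict

def bag_of_bonds(Z, R):
    """Group Z_i * Z_j / |R_i - R_j| by element pair; sort each bag in descending order."""
    bags = defaultdict(list)
    for i in range(len(Z)):
        for j in range(i + 1, len(Z)):
            key = tuple(sorted((Z[i], Z[j])))
            bags[key].append(Z[i] * Z[j] / math.dist(R[i], R[j]))
    return {key: sorted(vals, reverse=True) for key, vals in bags.items()}

def to_vector(bags, bag_sizes):
    """Concatenate bags in a fixed key order, zero-padding each bag to its given size."""
    vec = []
    for key in sorted(bag_sizes):
        vals = bags.get(key, [])
        vec.extend(vals + [0.0] * (bag_sizes[key] - len(vals)))
    return vec

# Water-like geometry (approximate, illustrative values only).
Z_h2o = [8, 1, 1]
R_h2o = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
bags = bag_of_bonds(Z_h2o, R_h2o)
vec = to_vector(bags, {(1, 1): 3, (1, 8): 2})
```

The padded bag sizes would normally be set to the largest bag of each type found in the data set.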

14/34 non-uniqueness issue homometric molecules same set of interatomic distance pairs OAvL, et al., IJQC, 2015

14/34 non-uniqueness issue LJ: Lennard-Jones 2-body vdW potential ATM: Axilrod-Teller-Muto 3-body vdW potential homometric molecules same set of interatomic distance pairs 2-body interaction is not enough! f (scaling of s ) BH, OAvL , JCP comm ., 2016
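
The point that 2-body terms cannot separate homometric structures can be checked on a toy case. The sketch below uses two collinear point sets that share the same multiset of pairwise distances; they are 1-D point sets chosen for illustration, not the molecules from the talk. The Lennard-Jones and Axilrod-Teller-Muto parameters (`eps`, `sigma`, `c9`) are set to 1 as an assumption.

```python
import itertools

# Two collinear point sets with identical pairwise-distance multisets (homometric).
A = [0.0, 1.0, 4.0, 10.0, 12.0, 17.0]
B = [0.0, 1.0, 8.0, 11.0, 13.0, 17.0]

def pair_distances(xs):
    return sorted(abs(a - b) for a, b in itertools.combinations(xs, 2))

def lennard_jones(xs, eps=1.0, sigma=1.0):
    """Sum of 2-body terms 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    return sum(4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
               for r in pair_distances(xs))

def axilrod_teller_muto(xs, c9=1.0):
    """Sum of 3-body terms c9 * (1 + 3 cos(a) cos(b) cos(c)) / (r_ab * r_bc * r_ca)**3."""
    total = 0.0
    for p, q, s in itertools.combinations(xs, 3):
        r_pq, r_qs, r_ps = abs(p - q), abs(q - s), abs(p - s)
        # Interior angles from the law of cosines.
        cos_p = (r_pq ** 2 + r_ps ** 2 - r_qs ** 2) / (2 * r_pq * r_ps)
        cos_q = (r_pq ** 2 + r_qs ** 2 - r_ps ** 2) / (2 * r_pq * r_qs)
        cos_s = (r_ps ** 2 + r_qs ** 2 - r_pq ** 2) / (2 * r_ps * r_qs)
        total += c9 * (1.0 + 3.0 * cos_p * cos_q * cos_s) / (r_pq * r_qs * r_ps) ** 3
    return total

lj_A, lj_B = lennard_jones(A), lennard_jones(B)
atm_A, atm_B = axilrod_teller_muto(A), axilrod_teller_muto(B)
```

Any sum over pair distances gives the same value for A and B, while the 3-body sum differs, so a representation needs at least 3-body terms to tell such structures apart.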

15/34 BAML: bags of Universal force field (UFF) contributions
Figure labels: M_A, M_B, M_T; E_IJ (Morse/LJ).
BH, OAvL, JCP Comm., 2016

16/34 BAML
Bags: BoA, V(Z); BoP, V(r); BoT, V(θ); BoQ, V(Φ).
Figure panels: r(C-H), r(C-C), r(C-N); θ(C-C-H), θ(C-C-C), θ(C-C-N); Φ(H-C-C-H), Φ(C-C-C-C), Φ(C-C-C-N). Element labels: H, C, N, O, P, S.

17/34 BAML
Database: 6k isomers (C7H10O2). Figure: error versus training set size.
BH, OAvL, JCP Comm., 2016

18/34 BAML
6k isomers (C7H10O2); 3 outliers. Figure: error versus training set size.
BH, OAvL, JCP Comm., 2016

19/34 BAML
6k isomers (C7H10O2) and QM9 (134k). Figures: error versus training set size for each data set.
BH, OAvL, JCP Comm., 2016

20/34 BAML Comparison
QM7b database (size: 7211), MAE (5k out-of-sample)
Property              BAML   BoB    SOAP(a)  CM(b)   Accuracy(b)
E (PBE0)/eV           0.05   0.08   0.04     0.16    0.15, 0.23, 0.09-0.22
α (PBE0)/Å^3          0.07   0.09   0.05     0.11    0.05-0.27, 0.04-0.14
HOMO (GW)/eV          0.10   0.15   0.12     0.16    -
LUMO (GW)/eV          0.11   0.16   0.12     0.16    -
IP (ZINDO)/eV         0.15   0.20   0.19     0.17    0.20, 0.15
EA (ZINDO)/eV         0.07   0.17   0.13     0.11    0.16, 0.11
E 1st* (ZINDO)/eV     0.13   0.21   0.18     0.13    0.18, 0.21
(a) S. De, et al., PCCP, 2016
(b) G. Montavon, et al., NJP, 2013
BH, OAvL, JCP Comm., 2016

21/34 HDAD: Histogram of Distances, Angles and Dihedral angles
BoP: V(r) = r; BoT: V(θ) = θ; BoQ: V(Φ) = Φ
Shortcoming: force prediction
Figure panels: r(C-H), r(C-C), r(C-N) and d(C-H), d(C-C), d(C-N). Element labels: H, C, N, O, P.
F. Faber, et al., 2017, arxiv.org/abs/1702.05532

22/34 HDAD F. Faber, et al. , 2017, arxiv.org/abs/1702.05532

23/34 Why is BAML worse than HDAD?
* empirical force field terms fail to describe reality in many cases
* uniqueness might also be an issue
* e.g., a slightly deviated Morse potential may cause a uniqueness issue
Bear in mind once again:
* be cautious about using the target function as a representation!

24/34 Improving the physics
E(2) = Z_i Z_j / R^n
QM7b dataset (size: 7211); property: enthalpy (H)
Coulomb force: good as a rpst for bonding, bad for dispersion
London force: good for dispersion, decent for bonding
As a compromise, London wins!!
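
A small sketch of the two-body term E(2) = Z_i Z_j / R^n with the exponent left as a parameter: n = 1 gives the Coulomb-like form and n = 6 the London-like form. The function name and the two-carbon example are illustrative assumptions.

```python
import math

def two_body_terms(Z, R, n):
    """Pairwise terms Z_i * Z_j / R_ij**n; n=1 is Coulomb-like, n=6 is London-like."""
    return [
        Z[i] * Z[j] / math.dist(R[i], R[j]) ** n
        for i in range(len(Z))
        for j in range(i + 1, len(Z))
    ]

# Two carbon atoms at a shorter and at a doubled separation (illustrative values).
near = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
far = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

coulomb_ratio = two_body_terms([6, 6], far, 1)[0] / two_body_terms([6, 6], near, 1)[0]
london_ratio = two_body_terms([6, 6], far, 6)[0] / two_body_terms([6, 6], near, 6)[0]
```

Doubling the distance halves the n = 1 term but shrinks the n = 6 term by a factor of 64, so the London-like form is far more short-ranged.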

25/34 Improving the physics
Atoms + London + Axilrod-Teller-Muto (LATM)
HB, OAvL, to be submitted (2017)

26/34 Extending the E-based approach
6k isomers. Build the rpst from a decomposition of any extensive property, e.g., a polarizability model.
Only ONE-body!!
BH, OAvL, JCP Comm., 2016

27/34 Categorizing M III I II global representations Ψ / ρ / V -based continuous discrete atomic representations E -based

28/34 Go Atomic covariance P. R. Bartok et al , IJQC (2015)

29/34 Smooth Overlap of Atomic Positions (SOAP)
Projection to a basis set.
A serious problem of SOAP: for large r_cutoff, how can two very different atoms around the centre be distinguished? It works best with a small r_cutoff!!
Application: simple crystals so far.
SOAP can be fixed for molecules by RE-Match, but the glory of being an atomic rpst is lost.
P. R. Bartok et al., PRB (2013); S. De, et al., PCCP, 2016

30/34 aLATM MBE-based approach: more natural to define atomic rpst 1. includes 2-, 3-body interactions 2. both decay with r HB, OAvL, to be submitted (2017)

33/34 Why is aLATM bad at larger N?
Random sampling versus rational sampling.
N. J. Browning, et al., JPCL (2017)

31/34 Categorizing M molecules & crystals III I II IV global representations Ψ / ρ / V -based non- continuous alchemical discrete atomic representations E -based electronic alchemical representations (Holy grail)

32/34 Go Electronic
2 M Elpasolite ABC2D6 crystals overall. Figure labels: period, group.
F. Faber, et al., PRL (2016)

34/34 Conclusions and Outlook
1. Almost all rpsts in the literature were categorised.
2. Two general approaches for the rational design of rpsts:
   a. Schrödinger equation: ρ/Ψ/V_ext
   b. many-body expansion
3. Two general principles for the rational design of rpsts:
   a. uniqueness (necessary for convergence)
   b. similarity to the target, which reduces the offset of the learning curve (LC)
4. MBE-based rpsts (e.g., BAML, LATM, HDAD) offer:
   a. meaning
   b. simplicity
   c. accuracy
5. They are generally better than ρ/Ψ/V_ext-based approaches.
6. There is great potential for electronic rpsts to beat everything else.

Acknowledgements: Prof. Dr. O. Anatole von Lilienfeld
