More efficient representations of compounds for machine learning models



slide-1
SLIDE 1

More efficient representations of compounds for machine learning models

Bing Huang and Anatole von Lilienfeld

Institute of Physical Chemistry and National Centre for Computational Design and Discovery of Novel Materials (MARVEL) Department of Chemistry University of Basel Switzerland

slide-2
SLIDE 2

1/34

Breaking the hex!

[Figure: accuracy (low to high) versus computational time for traditional methods and ML. Method labels: force field, semi-MO, DFT, CCSD(T), GAML. Time labels: ~1 CPU second, >> 1 hour, ~1 s.]

slide-3
SLIDE 3

2/34

Machine Learning - basics

❑ Given a data set {X0; Y}, learn f: X -> Y and then infer Y for a new X0’
❑ Kernel ridge regression (M. Rupp, et al., PRL, 2012)
  • N parameters to be regressed for N molecules
  • + 2 global parameters: length-scale of the data set, noise-level
  • for molecules, X0: {Z, R}, Y: E

[Figure labels: training, test, covariance, feature abstraction]
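The kernel ridge regression setup on this slide (one regression coefficient per training molecule, plus a length-scale and a noise level) can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel and a toy 1-D target; the names `gaussian_kernel`, `krr_train` and `krr_predict` and the values of `sigma` and `lam` are hypothetical choices, not taken from the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_train(M_train, y_train, sigma, lam):
    """Solve (K + lam*I) alpha = y: one coefficient per training sample."""
    K = gaussian_kernel(M_train, M_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

def krr_predict(M_new, M_train, alpha, sigma):
    """Prediction is a kernel-weighted sum over the training samples."""
    return gaussian_kernel(M_new, M_train, sigma) @ alpha

# Toy data (assumption): a smooth 1-D target stands in for molecular energies.
M_train = np.linspace(-3.0, 3.0, 40)[:, None]
y_train = np.sin(M_train[:, 0])

sigma, lam = 1.0, 1e-6   # the two global parameters: length-scale, noise level
alpha = krr_train(M_train, y_train, sigma, lam)

M_test = np.linspace(-2.5, 2.5, 11)[:, None]
y_pred = krr_predict(M_test, M_train, alpha, sigma)
max_err = np.max(np.abs(y_pred - np.sin(M_test[:, 0])))
print(f"max test error: {max_err:.2e}")
```

For molecules, each row of `M_train` would hold a representation M built from {Z, R} and `y_train` the energies E.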

slide-4
SLIDE 4

3/34

Machine Learning - basics

❑ More training data gives better results, provided X is represented properly (the representation is referred to as M hereafter)
❑ The representation (M) is central to ML

[Figure: learning curves, error versus training set size N, for a poor, a decent and a good ML model; behaviour at large N]

Huang, OAvL, JCP Comm (2016); arxiv.org/abs/1608.06194
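Learning curves like the ones on this slide are commonly read on a log-log scale, where a power-law decay of the error with N appears as a straight line. The sketch below fits such a line; the data are synthetic and the values `a_true` and `b_true` are made up for illustration, not results from the talk.

```python
import numpy as np

# Synthetic learning curve (assumption): error = a * N**(-b), no noise.
N = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
a_true, b_true = 50.0, 0.5
errors = a_true * N ** (-b_true)

# On a log-log scale the curve is linear: log(error) = log(a) - b*log(N).
slope, intercept = np.polyfit(np.log(N), np.log(errors), 1)
b_fit = -slope              # decay rate of the learning curve
a_fit = np.exp(intercept)   # offset of the learning curve

print(f"offset a = {a_fit:.2f}, rate b = {b_fit:.2f}")
```

In this picture, the offset `a_fit` is the quantity that a better representation lowers, while the slope says how quickly more data helps.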

slide-5
SLIDE 5

Learning a simple 1-D function

[Figure axis labels: Mi, M]

For KRR, f and a*f + b are identical as representations (rpsts)

Two criteria: target similarity, uniqueness

OAvL et al, IJQC (2013)

lack of uniqueness → absurd results → noise in training
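The statement that f and a*f + b are equivalent representations for KRR can be checked numerically for a Gaussian kernel: the shift b cancels in every difference, and the scale a is absorbed by the kernel length-scale. The sketch below assumes a Gaussian kernel and an arbitrary 1-D representation; `f`, `a`, `b` and `sigma` are hypothetical values.

```python
import numpy as np

def gaussian_kernel_1d(m, sigma):
    """Gaussian kernel matrix for a 1-D representation m."""
    d = m[:, None] - m[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

x = np.linspace(0.5, 3.0, 8)
f = x ** 2            # hypothetical representation M = f(x)
a, b = -3.0, 7.0      # arbitrary affine transformation
sigma = 0.8

K_f = gaussian_kernel_1d(f, sigma)
# Rescaling the length-scale by |a| gives back exactly the same kernel matrix.
K_af = gaussian_kernel_1d(a * f + b, abs(a) * sigma)

same = np.allclose(K_f, K_af)
print("kernel matrices identical:", same)
```

Since the length-scale is optimized anyway, the two representations lead to the same model.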

4/34

slide-6
SLIDE 6

5/34

target: Y = (x-1)(x-2)(x+3)

Learning a "complicated" 1-D function

slide-7
SLIDE 7

6/34

General guidelines for designing M

❑ If you know f well (exact form unknown), use it as M
❑ Otherwise you had better know how f behaves: use one monotonic part of f, referred to as g
  • trial and error
  • the best g minimizes ||g – f||^2

e.g., Morse potential V(x) = -100*(2*exp(-(x-1.4)) – exp(-2(x-1.4)))

Performance: V(x) > 1/x > exp(-(x-1.4)) > x > 1/x^6 >> -(x-1.4)^2
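One way to put a number on "the best g minimizes ||g – f||^2" is to compare each candidate g with the Morse potential after the best affine fit a*g + b, since an affine transformation of a representation does not matter for KRR. The sketch below is an assumption-laden illustration: the grid `x`, the helper `affine_residual` and the use of an affine fit are choices made here, and the resulting order is not claimed to reproduce the performance ranking on the slide.

```python
import numpy as np

def morse(x):
    """Morse potential from the slide."""
    return -100.0 * (2.0 * np.exp(-(x - 1.4)) - np.exp(-2.0 * (x - 1.4)))

def affine_residual(g, f):
    """Smallest ||a*g + b - f||^2 over a and b (linear least squares)."""
    A = np.column_stack([g, np.ones_like(g)])
    coef, *_ = np.linalg.lstsq(A, f, rcond=None)
    return float(np.sum((A @ coef - f) ** 2))

x = np.linspace(0.8, 4.0, 200)   # assumed range of x, not given on the slide
f = morse(x)

candidates = {
    "V(x)": f,
    "1/x": 1.0 / x,
    "exp(-(x-1.4))": np.exp(-(x - 1.4)),
    "x": x,
    "1/x^6": x ** -6.0,
    "-(x-1.4)^2": -(x - 1.4) ** 2,
}
residuals = {name: affine_residual(g, f) for name, g in candidates.items()}

for name, r in sorted(residuals.items(), key=lambda kv: kv[1]):
    print(f"{name:>15s}  {r:.3e}")
```

Restricting `x` to one monotonic branch of the potential, as the slide suggests, changes the residuals and may change their order.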

slide-8
SLIDE 8

7/34

Representing molecules

J-L. Reymond et al, ACS Chem. Neuro. (2012)

X0: {Z, R}

a glimpse of Chemical Compound Space (CCS)

prerequisites:
  • rotation and translation invariant
  • index permutation invariant
  • unique (one rpst must not correspond to two molecules)

slide-9
SLIDE 9

8/34

Representing molecules Ψ/ρ/V-based

impractical

E = E[Ψ] = E[ρ] = E[V_ext]

practical, accessible
projection to basis set, M: [c_1, c_2, …, c_n]
atom/electron density
CCS: infinite dimension
straightforward!!

slide-10
SLIDE 10

9/34

Learning a 1-D functional

  • J. C. Snyder, et al, PRL (2012)

noninteracting fermions in 1D property: kinetic energy

a density in the training set new density extent of variation

MAE < 1.0 kcal/mol

slide-11
SLIDE 11

fingerprint representations

OAvL et al, IJQC (2015)

  • V. Botu, et al., IJQC (2015)

projection to 1-D frequency domain & substitution
projection to 1-D η-grid
1-D discrete array
GDB-9 dataset
works well for Al_n systems
remove rotation dependence

10/34

slide-12
SLIDE 12

11/34

Representing molecules Ψ/ρ/V-based

Why are fingerprint rpsts bad for molecules, but good for Al_n-like systems?

||g – f||^2 is large for molecules, small for Al_n

slide-13
SLIDE 13

11/34

Representing molecules Ψ/ρ/V-based E-based

many body expansion (MBE) of total energy

M: [ {E(1)}, {E(12)}, {E(123)}, …]

BH, OAvL, JCP comm., 2016

CCS: dimension is significantly reduced!!!
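For reference, the many-body expansion behind this representation can be written out as below. The index convention (capital letters for atoms, superscripts for the body order) is a notational assumption made here to match the list M: [ {E(1)}, {E(12)}, {E(123)}, … ] above.

```latex
E \;=\; \sum_{I} E^{(1)}_{I}
  \;+\; \sum_{I<J} E^{(2)}_{IJ}
  \;+\; \sum_{I<J<K} E^{(3)}_{IJK}
  \;+\; \dots
```

The representation M collects the terms of each order into one group, so its size grows with the number of atoms, pairs and triples rather than with the dimension of a density or wavefunction.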

slide-14
SLIDE 14

Coulomb matrix (CM)

“CM”, M. Rupp, et al., PRL, 2012

sorted CM random CM
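A minimal sketch of the Coulomb matrix and its sorted variant follows, using the definition from Rupp et al. (diagonal 0.5*Z^2.4, off-diagonal Z_i*Z_j/|R_i - R_j|). The example geometry is made up for illustration, units are left unspecified, and the function names are hypothetical. The random CM (sorting after adding noise to the row norms) is not shown.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: 0.5*Z_i**2.4 on the diagonal, Z_i*Z_j/R_ij elsewhere."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def sorted_coulomb_matrix(Z, R):
    """Reorder rows and columns by decreasing row norm (permutation invariant)."""
    M = coulomb_matrix(Z, R)
    order = np.argsort(-np.linalg.norm(M, axis=1), kind="stable")
    return M[order][:, order]

# Made-up three-atom geometry (O, H, H); coordinates are illustrative only.
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.3, 1.1, 0.0]]

M = coulomb_matrix(Z, R)
perm = [2, 0, 1]   # relabel the atoms
M_sorted = sorted_coulomb_matrix(Z, R)
M_sorted_perm = sorted_coulomb_matrix([Z[i] for i in perm], [R[i] for i in perm])
print(np.allclose(M_sorted, M_sorted_perm))
```

Sorting removes the dependence on atom indexing, one of the prerequisites listed earlier, at the cost of discontinuities when two row norms cross.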

12/34

slide-15
SLIDE 15

Bag of Bonds (BoB)

much better than CM, why??

  • K. Hansen, et al., JPCL, 2015
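The Bag of Bonds idea can be sketched as follows: the off-diagonal Coulomb terms are grouped by element pair, each bag is sorted, and bags are zero-padded to a fixed length so that all molecules share one vector layout. This is a simplified sketch with pair bags only; the function name, the `bag_sizes` argument and the example geometry are assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def bag_of_bonds(Z, R, bag_sizes):
    """Concatenate sorted, zero-padded bags of Z_i*Z_j/R_ij per element pair.

    bag_sizes maps a sorted element pair, e.g. (1, 8), to its fixed bag length;
    it must be at least as large as the number of such pairs in the molecule.
    """
    R = np.asarray(R, dtype=float)
    bags = defaultdict(list)
    for i, j in combinations(range(len(Z)), 2):
        key = tuple(sorted((Z[i], Z[j])))
        bags[key].append(Z[i] * Z[j] / np.linalg.norm(R[i] - R[j]))
    parts = []
    for key in sorted(bag_sizes):
        values = sorted(bags.get(key, []), reverse=True)
        padded = np.zeros(bag_sizes[key])
        padded[:len(values)] = values
        parts.append(padded)
    return np.concatenate(parts)

# Made-up three-atom geometry (O, H, H); coordinates are illustrative only.
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.3, 1.1, 0.0]]
bag_sizes = {(1, 1): 1, (1, 8): 2}

v = bag_of_bonds(Z, R, bag_sizes)
perm = [1, 2, 0]
v_perm = bag_of_bonds([Z[i] for i in perm], [R[i] for i in perm], bag_sizes)
print(v, np.allclose(v, v_perm))
```

Because each entry sits in a bag for its own element pair, like is always compared with like, which is one plausible reading of why BoB outperforms the CM.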

13/34

slide-16
SLIDE 16

homometric molecules

14/34

same set of interatomic distance pairs

OAvL, et al., IJQC, 2015

non-uniqueness issue

slide-17
SLIDE 17

homometric molecules

f (scaling of s)

14/34

same set of interatomic distance pairs

LJ: Lennard-Jones 2-body vdW potential ATM: Axilrod-Teller-Muto 3-body vdW potential

non-uniqueness issue

BH, OAvL, JCP comm., 2016

slide-18
SLIDE 18

homometric molecules

14/34

same set of interatomic distance pairs

LJ: Lennard-Jones 2-body vdW potential ATM: Axilrod-Teller-Muto 3-body vdW potential

2-body interaction is not enough!

non-uniqueness issue

BH, OAvL, JCP comm., 2016

f (scaling of s)
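The non-uniqueness issue can be reproduced with a toy example. The two 1-D point sets below are a known homometric pair: they share the same multiset of pairwise distances but are not related by translation or reflection. They are not the molecules shown on the slide, and the triple descriptor (the three sorted distances of every triple) is a hypothetical stand-in for a 3-body term.

```python
from itertools import combinations

# Toy homometric pair in 1-D (assumption: not the structures from the talk).
A = [0, 1, 4, 10, 12, 17]
B = [0, 1, 8, 11, 13, 17]

def pair_distances(points):
    """Sorted list of all pairwise distances (2-body information)."""
    return sorted(abs(p - q) for p, q in combinations(points, 2))

def triple_descriptors(points):
    """Sorted list of the three distances within every triple (3-body information)."""
    return sorted(
        tuple(sorted((abs(p - q), abs(q - r), abs(p - r))))
        for p, q, r in combinations(points, 3)
    )

same_pairs = pair_distances(A) == pair_distances(B)
same_triples = triple_descriptors(A) == triple_descriptors(B)
print("same 2-body information:", same_pairs)
print("same 3-body information:", same_triples)
```

Any representation built only from the set of pair distances maps A and B to the same vector; adding 3-body information separates them.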

slide-19
SLIDE 19

bags of Universal force field (UFF) contributions

[Figure labels: MB, MA, MT; E_IJ; Morse/LJ]

15/34

BAML

BH, OAvL, JCP comm., 2016

slide-20
SLIDE 20

BAML

[Figure: BAML representation panels; axis ticks omitted. Element labels: H, C, N, O, P, S.
BoA: V(Z)
BoP: r(C-H), r(C-C), r(C-N); V(r)
BoT: θ(C-C-H), θ(C-C-C), θ(C-C-N); V(θ)
BoQ: Φ(H-C-C-H), Φ(C-C-C-C), Φ(C-C-C-N); V(Φ)]

16/34

slide-21
SLIDE 21

database: 6k isomers (C7H10O2)

BAML

[Figure: error versus training set size]

BH, OAvL, JCP comm., 2016

17/34

slide-22
SLIDE 22

6k isomers (C7H10O2)

[Figure: error versus training set size]

18/34

BAML

BH, OAvL, JCP comm., 2016

slide-23
SLIDE 23

6k isomers (C7H10O2)

[Figure: error versus training set size]

3 outliers

18/34

BAML

BH, OAvL, JCP comm., 2016

slide-24
SLIDE 24

6k isomers (C7H10O2) and QM9 (134k)

[Figure: error versus training set size, one panel per data set]

19/34

BAML

BH, OAvL, JCP comm., 2016

slide-25
SLIDE 25

Comparison

QM7b database (size: 7211)

Error: MAE (5k out-of-sample)

Property            BAML   BoB    SOAP(a)  CM(b)  accuracy(b)
E (PBE0)/eV         0.05   0.08   0.04     0.16   0.15, 0.23, 0.09-0.22
α (PBE0)/Å^3        0.07   0.09   0.05     0.11   0.05-0.27, 0.04-0.14
HOMO (GW)/eV        0.10   0.15   0.12     0.16   -
LUMO (GW)/eV        0.11   0.16   0.12     0.16   -
IP (ZINDO)/eV       0.15   0.20   0.19     0.17   0.20, 0.15
EA (ZINDO)/eV       0.07   0.17   0.13     0.11   0.16, 0.11
E1st* (ZINDO)/eV    0.13   0.21   0.18     0.13   0.18, 0.21

(a) S. De, et al., PCCP, 2016
(b) G. Montavon, et al., NJP, 2013

20/34

BAML

BH, OAvL, JCP comm., 2016

slide-26
SLIDE 26

21/34

HDAD

[Figure: HDAD representation panels for BoP, BoT and BoQ; axis ticks omitted. Pair panels: r(C-H), r(C-C), r(C-N). Element labels: H, C, N, O, P.]

V(r) = r, V(θ) = θ, V(Φ) = Φ

Histogram of Distances, Histogram of Angles, Histogram of Dihedral angles

  • F. Faber, et al., 2017, arxiv.org/abs/1702.05532

shortcoming: force prediction
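The histogram idea can be sketched with plain geometry: collect all interatomic distances and all angles, then bin them. This is a heavily simplified sketch, not the HDAD construction of Faber et al.; it ignores element types, omits dihedral angles, and the bin count, the cutoff `r_max` and all function names are assumptions.

```python
import numpy as np
from itertools import combinations

def distances(R):
    """All pairwise interatomic distances."""
    return np.array([np.linalg.norm(R[i] - R[j])
                     for i, j in combinations(range(len(R)), 2)])

def angles(R):
    """Angle i-j-k at every centre atom j, for every pair of neighbours (i, k)."""
    out = []
    for j in range(len(R)):
        others = [i for i in range(len(R)) if i != j]
        for i, k in combinations(others, 2):
            u, v = R[i] - R[j], R[k] - R[j]
            c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            out.append(np.arccos(np.clip(c, -1.0, 1.0)))
    return np.array(out)

def histogram_representation(R, r_max=5.0, n_bins=20):
    """Concatenated histograms of distances and angles."""
    R = np.asarray(R, dtype=float)
    h_d, _ = np.histogram(distances(R), bins=n_bins, range=(0.0, r_max))
    h_a, _ = np.histogram(angles(R), bins=n_bins, range=(0.0, np.pi))
    return np.concatenate([h_d, h_a])

# Made-up right-angled triangle of three atoms.
R = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rep = histogram_representation(R)
angle_sum = angles(np.asarray(R, dtype=float)).sum()
print(rep, angle_sum)
```

Hard binning makes the representation piecewise constant in the coordinates, which is one way to see the "force prediction" shortcoming noted above: its derivative with respect to atomic positions carries no information.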

slide-27
SLIDE 27

22/34

HDAD

  • F. Faber, et al., 2017, arxiv.org/abs/1702.05532
slide-28
SLIDE 28

23/34

Why is BAML worse than HDAD?

  • empirical force-field terms fail to describe reality in many cases
  • uniqueness might also be an issue
  • e.g., a slightly deviated Morse potential may cause a uniqueness issue
  • be cautious about using the target function as a representation!

Bear in mind once again:

slide-29
SLIDE 29

24/34

QM7b dataset (size:7211) property: enthalpy (H)

Improving the physics

London force: good for dispersion, decent for bonding; as a compromise, London wins!!
Coulomb force: good as a rpst for bonding, bad for dispersion
E(2) = Z_i Z_j / R^n

slide-30
SLIDE 30

Atoms + London + Axilrod-Teller-Muto (LATM)

BH, OAvL, to be submitted (2017)

Improving the physics
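The two ingredients named on this slide can be sketched with textbook functional forms: a pair term Z_i*Z_j/R^n (n = 6 for a London-like decay, n = 1 for Coulomb) and the Axilrod-Teller-Muto geometric factor (1 + 3 cos θ_i cos θ_j cos θ_k)/(R_ij R_jk R_ik)^3 for triples. The Z_i*Z_j*Z_k prefactor on the 3-body term and the function names are assumptions made for illustration; the actual LATM definition is not given in the slides.

```python
import numpy as np
from itertools import combinations

def two_body(Z, R, n=6):
    """Pair terms Z_i*Z_j / R_ij**n (n=1: Coulomb-like, n=6: London-like)."""
    R = np.asarray(R, dtype=float)
    return [Z[i] * Z[j] / np.linalg.norm(R[i] - R[j]) ** n
            for i, j in combinations(range(len(Z)), 2)]

def atm_three_body(Z, R):
    """Axilrod-Teller-Muto terms; the Z_i*Z_j*Z_k prefactor is an assumption."""
    R = np.asarray(R, dtype=float)
    out = []
    for i, j, k in combinations(range(len(Z)), 3):
        rij, rik, rjk = R[j] - R[i], R[k] - R[i], R[k] - R[j]
        dij, dik, djk = (np.linalg.norm(v) for v in (rij, rik, rjk))
        cos_i = np.dot(rij, rik) / (dij * dik)     # angle at atom i
        cos_j = np.dot(-rij, rjk) / (dij * djk)    # angle at atom j
        cos_k = np.dot(-rik, -rjk) / (dik * djk)   # angle at atom k
        geom = 1.0 + 3.0 * cos_i * cos_j * cos_k
        out.append(Z[i] * Z[j] * Z[k] * geom / (dij * dik * djk) ** 3)
    return out

# Equilateral triangle with unit sides and Z = 1: every cosine is 0.5.
Z = [1, 1, 1]
R = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, np.sqrt(3.0) / 2.0, 0.0]]
pairs = two_body(Z, R, n=6)
triples = atm_three_body(Z, R)
print(pairs, triples)
```

Both kinds of terms decay with distance, and the 3-body term carries the angular information that pair terms alone cannot provide.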

25/34

slide-31
SLIDE 31

build rpst based on decomposition of any extensive property: e.g., polarizability model

extending E-based approach

6k isomers

Only ONE-body!!

BH, OAvL, JCP comm., 2016

26/34

slide-32
SLIDE 32

27/34

Categorizing M

[Diagram: categories I, II, III; labels: global representations, atomic representations; Ψ/ρ/V-based, E-based; discrete, continuous]

slide-33
SLIDE 33

Go Atomic

  • A. P. Bartók et al, IJQC (2015)

covariance

28/34

slide-34
SLIDE 34

29/34

Smooth Overlap of Atomic Positions (SOAP)

  • A. P. Bartók et al, PRB (2013)

projection to basis set

A serious problem of SOAP: for large r_cutoff, how to distinguish two very different atoms around centre?

application: simple crystals so far

Fix SOAP for molecules by RE-Match, glory lost as an atomic rpst

  • S. De, et al., PCCP, 2016

works best with a small r_cutoff !!

slide-35
SLIDE 35

30/34

aLATM

BH, OAvL, to be submitted (2017)

MBE-based approach: more natural to define atomic rpst

  • 1. includes 2-, 3-body interactions
  • 2. both decay with r
slide-36
SLIDE 36

33/34

Rational sampling Random sampling

Why is aLATM bad at larger N?

  • N. J. Browning, et al., JPCL (2017)
slide-37
SLIDE 37

31/34

Categorizing M

[Diagram: categories I, II, III, IV; labels: global representations, atomic representations, electronic representations (Holy grail); Ψ/ρ/V-based, E-based; alchemical, non-alchemical; discrete, continuous; molecules & crystals]

slide-38
SLIDE 38

Go Electronic

  • F. Faber, et al, PRL (2016)
  • Overall 2M elpasolite ABC2D6 crystals

32/34

period group

slide-39
SLIDE 39

Conclusions and Outlook

  • 1. Almost all rpsts in literature were categorised
  • 2. Two general approaches for rational design of rpst
  • a. Schrödinger equation: ρ/Ψ/V_ext
  • b. many-body expansion
  • 3. Two general principles for rational design of rpst
  • a. uniqueness (necessary for convergence)
  • b. similarity to target reduces off-set of LC
  • 4. MBE-based rpsts (e.g., BAML, LATM, HDAD) offer
  • a. Meaning
  • b. Simplicity
  • c. Accuracy
  • 5. and are generally better than the ρ/Ψ/V_ext-based approach
  • 6. There is great potential for electronic rpst to beat everything else

34/34

slide-40
SLIDE 40

Acknowledgements:

  • Prof. Dr. O. Anatole von Lilienfeld