
Differentiate everything: A lesson from deep learning. Lei Wang (王磊), Institute of Physics, CAS. https://wangleiphy.github.io


slide-1
SLIDE 1

Lei Wang (王磊)

https://wangleiphy.github.io

Differentiate everything: A lesson from deep learning

Institute of Physics, CAS

slide-2
SLIDE 2

Deep Learning, Quantum Many-Body Computation, Quantum Computing

Differentiable Programming

slide-3
SLIDE 3 https://medium.com/@karpathy/software-2-0-a64152b37c35

Andrej Karpathy

Director of AI at Tesla. Previously Research Scientist at OpenAI and PhD student at Stanford. I like to train deep neural nets on large datasets.

Diagram: Input + Program → Computer → Output (traditional programming); Input + Output → Computer → Program (Software 2.0). Writing Software 2.0 by gradient search in the program space.

Differentiable Programming

Traditional Machine Learning

slide-4
SLIDE 4 https://medium.com/@karpathy/software-2-0-a64152b37c35

Benefits of Software 2.0
  • Computationally homogeneous
  • Simple to bake into silicon
  • Constant running time
  • Constant memory usage
  • Highly portable & agile
  • Modules can meld into an optimal whole
  • Better than humans


Writing software 2.0 by gradient search in the program space

Differentiable Programming

slide-5
SLIDE 5

Demo: Inverse Schrödinger Problem

Given the ground-state density, how do we design the potential V(x)?

$$\left[-\tfrac{1}{2}\,\tfrac{\partial^2}{\partial x^2} + V(x)\right]\Psi(x) = E\,\Psi(x)$$

https://github.com/QuantumBFS/SSSS/blob/master/1_deep_learning/schrodinger.py https://math.mit.edu/~stevenj/18.336/adjoint.pdf
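A minimal sketch of one way to do this (not the linked schrodinger.py; PyTorch, the grid size, the harmonic target density, and the optimizer settings are illustrative assumptions): discretize the Hamiltonian, differentiate straight through a dense eigensolver, and fit the potential by gradient descent.

import torch

N, L = 128, 10.0                                   # grid points and box length (illustrative)
x = torch.linspace(-L / 2, L / 2, N)
dx = x[1] - x[0]

# Kinetic energy -1/2 d^2/dx^2 as a finite-difference Laplacian
lap = (torch.diag(torch.ones(N - 1), 1) + torch.diag(torch.ones(N - 1), -1)
       - 2 * torch.eye(N)) / dx**2
T = -0.5 * lap

# Target ground-state density, generated here from a harmonic potential for the demo
with torch.no_grad():
    _, vecs = torch.linalg.eigh(T + torch.diag(0.5 * x**2))
    rho_target = vecs[:, 0] ** 2

V = torch.zeros(N, requires_grad=True)             # potential to be learned
opt = torch.optim.Adam([V], lr=0.1)

for step in range(500):
    _, vecs = torch.linalg.eigh(T + torch.diag(V)) # eigh is differentiable in PyTorch
    rho = vecs[:, 0] ** 2                          # ground-state density
    loss = ((rho - rho_target) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()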

slide-6
SLIDE 6

What is under the hood?

slide-7
SLIDE 7

What is deep learning?

Compose differentiable components into a program (e.g. a neural network), then optimize it with gradients.

slide-8
SLIDE 8

Automatic differentiation on a computation graph: the "comb graph" of a feedforward network, with data $x_1 \to x_2 \to x_3 \to \mathcal{L}$ (the loss) and weights $\theta_1, \theta_2$ feeding $x_2, x_3$.

Define the adjoint variable $\bar{x} = \partial\mathcal{L}/\partial x$. Seed $\bar{\mathcal{L}} = 1$ and pull the adjoint back through the graph:
$$\bar{x}_3 = \bar{\mathcal{L}}\,\frac{\partial \mathcal{L}}{\partial x_3},\qquad \bar{x}_2 = \bar{x}_3\,\frac{\partial x_3}{\partial x_2},\qquad \bar{\theta}_2 = \bar{x}_3\,\frac{\partial x_3}{\partial \theta_2},\qquad \bar{\theta}_1 = \bar{x}_2\,\frac{\partial x_2}{\partial \theta_1}.$$

slide-9
SLIDE 9

Automatic differentiation on a computation graph, in general a directed acyclic graph: message passing for the adjoint at each node, with $\bar{\mathcal{L}} = 1$,
$$\bar{x}_i = \sum_{j:\ \text{child of}\ i} \bar{x}_j\,\frac{\partial x_j}{\partial x_i}.$$
For example, if $x_1$ has children $x_2$ and $x_3$ on the path from $\theta$ to $\mathcal{L}$, then
$$\bar{x}_1 = \bar{x}_2\,\frac{\partial x_2}{\partial x_1} + \bar{x}_3\,\frac{\partial x_3}{\partial x_1}.$$
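A toy illustration of this adjoint message passing (a minimal sketch; the Node class and helper functions are made up for the example, not taken from any AD library):

import math

class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # list of (parent_node, local partial dx_self/dx_parent)
        self.adjoint = 0.0          # xbar = dL/dx, accumulated from children

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def backward(loss):
    loss.adjoint = 1.0              # seed: Lbar = 1
    order, seen = [], set()
    def topo(n):                    # children end up after their parents
        if id(n) in seen: return
        seen.add(id(n))
        for p, _ in n.parents: topo(p)
        order.append(n)
    topo(loss)
    for node in reversed(order):    # pull the adjoint back through the graph
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial   # xbar_i += xbar_j * dx_j/dx_i

# Usage: L = sin(x) * x has dL/dx = cos(x)*x + sin(x)
x = Node(1.5)
L = mul(sin(x), x)
backward(L)
print(x.adjoint, math.cos(1.5) * 1.5 + math.sin(1.5))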

slide-10
SLIDE 10
Advantages of automatic differentiation
  • Accurate to machine precision
  • Same computational complexity as the function evaluation (Baur–Strassen theorem ’83)
  • Supports higher-order gradients
slide-11
SLIDE 11

Applications of AD

  • Computing forces: Sorella and Capriotti, J. Chem. Phys. ’10
  • Variational Hartree-Fock: Tamayo-Mendoza et al., ACS Cent. Sci. ’18
  • Quantum optimal control: Leung et al., PRA ’17

[Figure: forward (evolution) and backward (gradient) passes through the control sequence]

slide-12
SLIDE 12

More Applications…

  • Structural optimization: Hoyer et al., 1909.04240

[Figure: neural reparameterization pipeline, from CNN parameterization and design constraints through the physics (displacement) to the objective function (compliance), with a forward pass and gradients]

  • Protein folding: Ingraham et al., ICLR ’19

slide-13
SLIDE 13

McGreivy et al 2009.00196

Coil design in fusion reactors (stellarator)

slide-14
SLIDE 14

Computation graph: coil parameters → total cost. McGreivy et al., 2009.00196

Differentiable programming is more than training neural networks

slide-15
SLIDE 15

https://colab.research.google.com/github/google/jax/blob/master/notebooks/autodiff_cookbook.ipynb

Views of AD: black magic box → chain rule → functional differential geometry

Differentiating a general computer program (rather than just a neural network) calls for a deeper understanding of the technique

slide-16
SLIDE 16

Reverse versus forward mode

Reverse mode AD: Vector-Jacobian Product of primitives

  • Backtrace the computation graph
  • Needs to store intermediate results
  • Efficient for graphs with large fan-in

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial x_n}\,\frac{\partial x_n}{\partial x_{n-1}}\cdots\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial \theta}$$
(accumulated from the left, i.e. starting from the loss)

Backpropagation = Reverse mode AD applied to neural networks

slide-17
SLIDE 17

Reverse versus forward mode

Forward mode AD: Jacobian-Vector Product of primitives

  • Computed along with the function evaluation, in the same order
  • No storage overhead
  • Efficient for graphs with large fan-out

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial x_n}\,\frac{\partial x_n}{\partial x_{n-1}}\cdots\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial \theta}$$
(accumulated from the right, i.e. starting from the parameters)

Less efficient for scalar output, but useful for higher-order derivatives
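A small sketch contrasting the two modes in JAX (the function f and the vector choices are illustrative): forward mode pushes a tangent through the graph as a Jacobian-vector product, reverse mode pulls a cotangent back as a vector-Jacobian product.

import jax
import jax.numpy as jnp

def f(theta):                      # a toy map R^3 -> R^2
    return jnp.array([jnp.sin(theta[0]) * theta[1],
                      theta[1] * theta[2] ** 2])

theta = jnp.array([0.3, 1.2, -0.7])

# Forward mode: Jacobian-vector product, one column of the Jacobian per call
tangent = jnp.array([1.0, 0.0, 0.0])
y, jvp_out = jax.jvp(f, (theta,), (tangent,))      # J @ tangent

# Reverse mode: vector-Jacobian product, one row of the Jacobian per call
y, vjp_fun = jax.vjp(f, theta)
cotangent = jnp.array([1.0, 0.0])
(vjp_out,) = vjp_fun(cotangent)                     # cotangent @ J

print(jvp_out, vjp_out)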

slide-18
SLIDE 18

How to think about AD?

  • AD is modular, and one can control its granularity
  • Benefits of writing customized primitives (see the sketch below):
      • Reducing memory usage
      • Increasing numerical stability
      • Calling external libraries written agnostically to AD (or even a quantum processor)

https://github.com/PennyLaneAI/pennylane
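As a rough illustration of a customized primitive (assuming JAX; the softplus example and its hand-written VJP rule are my own, not from the talk), one can replace the traced backward pass with a numerically stable rule:

import jax
import jax.numpy as jnp

@jax.custom_vjp
def softplus(x):
    return jnp.logaddexp(0.0, x)          # stable forward: log(1 + e^x)

def softplus_fwd(x):
    return jnp.logaddexp(0.0, x), x       # save x for the backward pass

def softplus_bwd(x, g):
    return (g * jax.nn.sigmoid(x),)       # d/dx log(1 + e^x) = sigmoid(x)

softplus.defvjp(softplus_fwd, softplus_bwd)

print(jax.grad(softplus)(30.0))           # ~1.0, no overflow in the gradient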

slide-19
SLIDE 19

Examples of primitives

Loops, conditionals, sort, and permutations are also differentiable

~200 functions to cover most of numpy in HIPS/autograd

https://github.com/HIPS/autograd/blob/master/autograd/numpy/numpy_vjps.py
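For instance, a tiny sketch (assuming JAX; the top-2 function is illustrative) of a gradient flowing through sorting, where the backward pass simply routes each adjoint entry back through the sorting permutation:

import jax
import jax.numpy as jnp

def top2_sum(x):
    return jnp.sum(jnp.sort(x)[-2:])      # sum of the two largest entries

x = jnp.array([0.1, 3.0, -1.0, 2.0])
print(jax.grad(top2_sum)(x))              # [0., 1., 0., 1.]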

slide-20
SLIDE 20

Differentiable programming tools

HIPS/autograd

SciML

slide-21
SLIDE 21

Differentiable Scientific Computing

  • Many scientific computations (FFT, eigendecomposition, SVD!) are differentiable
  • Differentiable ray tracers and differentiable fluid simulations
  • Differentiable Monte Carlo / tensor networks / functional RG / dynamical mean-field theory / density functional theory / Hartree-Fock / coupled cluster / Gutzwiller / molecular dynamics …
  • ODE integrators are differentiable with O(1) memory

Differentiate through domain-specific computational processes to solve learning, control, optimization, and inverse problems.

slide-22
SLIDE 22

Inverse Schrödinger problem / differentiable eigensolver:

potential $V$ → Hamiltonian $H$ → matrix diagonalization → wavefunction $\Psi$

Useful for inverse Kohn-Sham problem, Jensen & Wasserman ‘17

slide-23
SLIDE 23

Differentiable Eigensolver

$$H\Psi = \Psi E$$

Forward mode: what happens if $H \to H + dH$? Perturbation theory.
Reverse mode: given $\partial\mathcal{L}/\partial\Psi$ and $\partial\mathcal{L}/\partial E$, how should I change $H$? Inverse perturbation theory!

Hamiltonian engineering via differentiable programming

https://github.com/wangleiphy/DL4CSRC/tree/master/2-ising See also Fujita et al, PRB ‘18
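A minimal numerical check (assuming PyTorch; the 6×6 random Hamiltonian is illustrative) that differentiating through the eigensolver reproduces first-order perturbation theory: the gradient of the ground-state energy is $\partial E_0/\partial H_{ij} = \Psi_{0,i}\Psi_{0,j}$ (Hellmann-Feynman).

import torch

H0 = torch.randn(6, 6, dtype=torch.float64)
H = (0.5 * (H0 + H0.T)).requires_grad_(True)      # symmetric Hamiltonian

E, psi = torch.linalg.eigh(H)
E[0].backward()                                    # adjoint of the ground-state energy

# Hellmann-Feynman: dE0/dH = |psi0><psi0|
print(torch.allclose(H.grad, torch.outer(psi[:, 0], psi[:, 0]).detach()))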

slide-24
SLIDE 24

Dynamical systems: the principle of least action (optics, (quantum) mechanics, field theory, …)
$$S = \int \mathcal{L}(q_\theta, \dot{q}_\theta, t)\,dt$$
and classical and quantum control
$$\frac{dx}{dt} = f_\theta(x, t)$$

Differentiable ODE integrators

“Neural ODE” Chen et al, 1806.07366
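A minimal sketch of differentiating through an ODE integrator (assuming PyTorch; the decay equation, RK4 stepper, and parameter are illustrative). A plain RK4 loop is differentiable by ordinary reverse mode, which stores every step; the O(1)-memory adjoint trick of the Neural ODE paper is not shown here.

import torch

theta = torch.tensor(0.5, requires_grad=True)   # decay-rate parameter (illustrative)

def f(x, t):
    return -theta * x                           # dx/dt = f_theta(x, t)

def rk4(x, t0, t1, steps=100):
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(x, t)
        k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(x + h * k3, t + h)
        x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + h
    return x

x1 = rk4(torch.tensor(1.0), 0.0, 2.0)
x1.backward()                                    # d x(2) / d theta through the integrator
print(x1.item(), theta.grad.item())              # ~exp(-1), ~-2*exp(-1)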

slide-25
SLIDE 25

Quantum optimal control

  • No gradient: not scalable
  • Forward mode: slow
  • Reverse mode with discretized steps: piecewise-constant assumption

$$i\,\frac{dU}{dt} = H U$$

https://qucontrol.github.io/krotov/v1.0.0/11_other_methods.html

Differentiable programming (Neural ODE) for unified, flexible, and efficient quantum control

slide-26
SLIDE 26

Dynamical systems: the principle of least action (optics, (quantum) mechanics, field theory, …)
$$S = \int \mathcal{L}(q_\theta, \dot{q}_\theta, t)\,dt$$
and classical and quantum control
$$\frac{dx}{dt} = f_\theta(x, t)$$

Differentiable ODE integrators

“Neural ODE” Chen et al, 1806.07366

slide-27
SLIDE 27

Differentiable functional optimization

The brachistochrone problem (Johann Bernoulli, 1696): minimize the travel time
$$T = \int_{x_0}^{x_1} \sqrt{\frac{1 + (dy/dx)^2}{2g\,(y_0 - y)}}\; dx$$
with $y_0$ the starting height.

https://github.com/QuantumBFS/SSSS/tree/master/1_deep_learning/brachistochrone
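A minimal sketch of the corresponding functional optimization (not the linked notebook; PyTorch, the discretization, endpoints, and optimizer settings are illustrative): parametrize the curve by its values on a grid and descend the gradient of the discretized travel time.

import torch

n, g = 64, 9.8
x = torch.linspace(0.0, 1.0, n)
y_inner = torch.linspace(0.0, -1.0, n)[1:-1].clone().requires_grad_(True)
opt = torch.optim.Adam([y_inner], lr=1e-2)

for step in range(2000):
    y = torch.cat([torch.zeros(1), y_inner, torch.full((1,), -1.0)])   # fixed endpoints (0,0) -> (1,-1)
    dx, dy = x[1:] - x[:-1], y[1:] - y[:-1]
    drop = torch.clamp(-0.5 * (y[1:] + y[:-1]), min=1e-6)              # depth below the release point
    v = torch.sqrt(2 * g * drop)                                       # speed from energy conservation
    T = torch.sum(torch.sqrt(dx**2 + dy**2) / v)                       # travel time of the polyline
    opt.zero_grad(); T.backward(); opt.step()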

slide-28
SLIDE 28

Differentiable Programming Tensor Networks

Liao, Liu, LW, Xiang, 1903.09650, PRX ‘19 https://github.com/wangleiphy/tensorgrad

slide-29
SLIDE 29

Ψ

"Tensor network is the 21st century's matrix"

Neural networks and Probabilistic graphical models

—Mario Szegedy

Quantum circuit architecture, parametrization, and simulation

slide-30
SLIDE 30

[Plots vs. inverse temperature β, compared with the exact solution: free energy $-\frac{1}{\beta}\ln Z$, energy density $-\partial\ln Z/\partial\beta$, and specific heat $\beta^2\,\partial^2\ln Z/\partial\beta^2$]
  • Differentiate through tensor renormalization group

Computation graph: inverse temperature β → (contraction and truncated SVD, repeated × depth) → free energy ln Z (Levin, Nave, PRL ’07)

Compute physical observables as gradients of the tensor network contraction
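As a toy version of this pattern (assuming PyTorch; a 1D Ising chain via its transfer matrix rather than the 2D tensor renormalization group of the slide), the energy follows from $-\partial\ln Z/\partial\beta$ by automatic differentiation, and create_graph=True leaves room for a second derivative such as the specific heat:

import torch

N = 100                                                       # chain length, periodic boundary
beta = torch.tensor(0.4, dtype=torch.float64, requires_grad=True)

s = torch.tensor([1.0, -1.0], dtype=torch.float64)
T = torch.exp(beta * s[:, None] * s[None, :])                 # transfer matrix T[s, s'] = exp(beta s s')

M = torch.eye(2, dtype=torch.float64)
for _ in range(N):
    M = M @ T
lnZ = torch.log(torch.trace(M))                               # Z = tr(T^N)

energy = -torch.autograd.grad(lnZ, beta, create_graph=True)[0] / N
print(energy.item())                                          # ~ -tanh(beta) for a long chain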

slide-31
SLIDE 31

Differentiable spin glass solver

Optimal energy = (tropical) tensor network contraction of the couplings & fields
Optimal configuration = ∂(optimal energy) / ∂(fields)

Liu, LW, Zhang, 2008.06888
https://github.com/TensorBFS/TropicalTensors.jl

slide-32
SLIDE 32

[Diagram: the gradient as a sum of tensor network contractions]

[Plot: energy relative error vs. bond dimension D, comparing simple update, full update, Corboz [34], Vanderstraeten [35], and the present work]

now, w/ differentiable programming

Liao, Liu, LW, Xiang, PRX ‘19

before…

Differentiable iPEPS optimization

https://github.com/wangleiphy/tensorgrad
One week on a single GPU (Nvidia P100); best variational energy to date

Vanderstraeten et al, PRB ‘16
slide-33
SLIDE 33

[Plot: energy relative error vs. bond dimension D, comparing simple update, full update, Corboz [34], Vanderstraeten [35], and the present work]

Finite size, neural network: Carleo & Troyer, Science ’17 (10×10 cluster)

Infinite size, tensor network: Liao, Liu, LW, Xiang, PRX ’19

Further progress for challenging physical problems: frustrated magnets, fermions, thermodynamics …

Chen et al. ’19, Xie et al. ’20, Tang et al. ’20, …

Differentiable iPEPS optimization

slide-34
SLIDE 34

Differentiable Programming Quantum Circuits

neural networks — graphical models — tensor networks — quantum circuits

slide-35
SLIDE 35

Variational quantum algorithms

Quantum circuit as a variational ansatz: minimize the energy expectation $\langle H\rangle$ over the circuit parameters $\theta$ (Peruzzo et al., Nat. Comm. ’13)

[Circuit diagram with parametrized gates $\theta_1, \dots, \theta_6$]

slide-36
SLIDE 36

Scanning a single variational parameter vs. stochastic perturbation of 30 variational parameters

[Plots: energy (hartree) vs. iteration k, with the exact value and the final experimental result; scans of the variational parameters in radians]

Optimization with analytical gradients is essential in higher dimensions (PRX ’16, Nature ’17)

Optimize variational quantum circuits

slide-37
SLIDE 37

Parametrized gates of the form $e^{-i\frac{\theta}{2}\Sigma}$ with $\Sigma^2 = 1$, e.g. $\Sigma$ = X, Y, Z, CNOT, SWAP, …

$$\nabla\langle H\rangle_\theta = \left(\langle H\rangle_{\theta+\pi/2} - \langle H\rangle_{\theta-\pi/2}\right)/2$$

Li et al, PRL ’17, Mitarai et al, PRA ’18 Schuld et al, PRA ’19, Crooks, ’19…

Differentiable¹ quantum circuits

Measure the gradient on a real device; same complexity as forward mode automatic differentiation
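A minimal numerical check of this shift rule on one qubit (plain NumPy; the RY-rotation example is illustrative, no hardware involved): for $\langle Z\rangle$ after $R_y(\theta)|0\rangle$ the rule reproduces the analytic derivative.

import numpy as np

Z = np.diag([1.0, -1.0])

def expval(theta):
    # state RY(theta)|0> = (cos(theta/2), sin(theta/2)), so <Z> = cos(theta)
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return psi @ Z @ psi

theta = 0.7
shift_grad = (expval(theta + np.pi / 2) - expval(theta - np.pi / 2)) / 2
print(shift_grad, -np.sin(theta))   # both ~ -0.644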

slide-38
SLIDE 38

Differentiable² quantum circuits

Compute the gradient in classical simulations. Unfortunately, forward mode is slow and reverse mode is memory consuming.

slide-39
SLIDE 39

Quantum circuit computation graph

$|x_0\rangle \xrightarrow{U_1} |x_1\rangle \xrightarrow{U_2} |x_2\rangle \rightarrow \cdots \xrightarrow{U_N} |x_N\rangle$

The same "comb graph" as a feedforward neural network, with quantum states as the data and unitaries as the weights, except that quantum computing is reversible.

O(1) memory AD for reversible neural nets: Gomez et al., 1707.04585; Chen et al., 1806.07366

slide-40
SLIDE 40

Reversible AD for variational quantum circuits*

Forward: $|y\rangle = U\,|x\rangle$
Backward: "uncompute" the state, $|x\rangle = U^\dagger\,|y\rangle$, and pull back the adjoint of the matrix-vector multiply, $\overline{|x\rangle} = U^\dagger\,\overline{|y\rangle}$

All are in-place operations without caching

*GRAPE-type algorithm on the level of circuits
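A rough sketch of this uncompute-based backward pass (plain NumPy; the random unitaries, the diagonal observable, and the complex-gradient convention $\bar{x} = \partial\mathcal{L}/\partial x^*$ are illustrative assumptions):

import numpy as np

def random_unitary(n, rng):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
n, depth = 8, 5
Us = [random_unitary(n, rng) for _ in range(depth)]

# forward: only the final state is kept
x = np.zeros(n, dtype=complex); x[0] = 1.0
for U in Us:
    x = U @ x

# loss L = <x|O|x> for a diagonal observable; adjoint xbar = dL/dx* = O x
O = np.diag(np.arange(n, dtype=float))
xbar = O @ x

# backward: uncompute the state with U^dagger and pull back the adjoint, in place
grads = []
for U in reversed(Us):
    x = U.conj().T @ x                       # uncompute the input of this layer
    grads.append(np.outer(xbar, x.conj()))   # dL/dU* = xbar x_in^dagger
    xbar = U.conj().T @ xbar                 # adjoint of the matrix-vector multiply
grads.reverse()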

slide-41
SLIDE 41

Train a 10,000 layer, 300,000 parameter circuit on a laptop

https://yaoquantum.org/

Listing 9: 10000-layer VQE

julia> using Yao, YaoExtensions

julia> n = 10; depth = 10000;

julia> circuit = dispatch!(variational_circuit(n, depth), :random);

julia> gatecount(circuit)
Dict{Type{#s54} where #s54<:AbstractBlock,Int64} with 3 entries:
  RotationGate{1,Float64,ZGate} => 200000
  RotationGate{1,Float64,XGate} => 100010
  ControlBlock{10,XGate,1,1}    => 100000

julia> nparameters(circuit)
300010

julia> h = heisenberg(n);

julia> for i = 1:100
           _, grad = expect'(h, zero_state(n) => circuit)
           dispatch!(-, circuit, 1e-3 * grad)
           println("Step $i, energy = $(expect(h, zero_state(n) => circuit))")
       end
slide-42
SLIDE 42

https://github.com/QuantumBFS/Yao.jl

Features:
  • Reversible AD engine for quantum circuits
  • Batch parallelization with GPU acceleration
  • Quantum block intermediate representation

Xiu-Zhe Luo (IOP, CAS → Waterloo & PI) Jin-Guo Liu (IOP, CAS → QuEra Computing & Harvard)

Yao.jl: Extensible, Efficient Framework for Quantum Algorithm Design

Luo, Liu, Zhang and LW, 1912.10877

slide-43
SLIDE 43

Thank you!

Collaborators: Jin-Guo Liu (QuEra & Harvard), Xiu-Zhe Luo (Waterloo & PI), Hai-Jun Liao (IOP, CAS), Pan Zhang (ITP, CAS), Tao Xiang (IOP, CAS)