
Differentiate everything: A lesson from deep learning. Lei Wang (王磊), Institute of Physics, CAS. https://wangleiphy.github.io


slide-1
SLIDE 1

Lei Wang (王磊)

https://wangleiphy.github.io

Differentiate everything: A lesson from deep learning

Institute of Physics, CAS

slide-2
SLIDE 2

Deep Learning, Quantum Many-Body Computation, Quantum Computing

Differentiable Programming

slide-3
SLIDE 3 https://medium.com/@karpathy/software-2-0-a64152b37c35

Andrej Karpathy

Director of AI at Tesla. Previously Research Scientist at OpenAI and PhD student at Stanford. I like to train deep neural nets on large datasets.

Diagram: Input + Program → Computer → Output (traditional programming); Input + Output → Computer → Program (Software 2.0). Writing Software 2.0 by gradient search in the program space.

Differentiable Programming

Traditional Machine Learning

slide-4
SLIDE 4 https://medium.com/@karpathy/software-2-0-a64152b37c35

Benefits of Software 2.0
  • Computationally homogeneous
  • Simple to bake into silicon
  • Constant running time
  • Constant memory usage
  • Highly portable & agile
  • Modules can meld into an optimal whole
  • Better than humans


Writing software 2.0 by gradient search in the program space

Differentiable Programming

slide-5
SLIDE 5

Demo: Inverse Schrödinger Problem

Given the ground-state density, how do we design the potential V(x)?

$$\left[-\tfrac{1}{2}\,\tfrac{\partial^2}{\partial x^2} + V(x)\right]\Psi(x) = E\,\Psi(x)$$

https://github.com/QuantumBFS/SSSS/blob/master/1_deep_learning/schrodinger.py https://math.mit.edu/~stevenj/18.336/adjoint.pdf
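A minimal sketch of one way to do this (not the linked schrodinger.py; PyTorch, the grid size, the harmonic target density, and the optimizer settings are illustrative assumptions): discretize the Hamiltonian, differentiate straight through a dense eigensolver, and fit the potential by gradient descent.

import torch

N, L = 128, 10.0                                   # grid points and box length (illustrative)
x = torch.linspace(-L / 2, L / 2, N)
dx = x[1] - x[0]

# Kinetic energy -1/2 d^2/dx^2 as a finite-difference Laplacian
lap = (torch.diag(torch.ones(N - 1), 1) + torch.diag(torch.ones(N - 1), -1)
       - 2 * torch.eye(N)) / dx**2
T = -0.5 * lap

# Target ground-state density, generated here from a harmonic potential for the demo
with torch.no_grad():
    _, vecs = torch.linalg.eigh(T + torch.diag(0.5 * x**2))
    rho_target = vecs[:, 0] ** 2

V = torch.zeros(N, requires_grad=True)             # potential to be learned
opt = torch.optim.Adam([V], lr=0.1)

for step in range(500):
    _, vecs = torch.linalg.eigh(T + torch.diag(V)) # eigh is differentiable in PyTorch
    rho = vecs[:, 0] ** 2                          # ground-state density
    loss = ((rho - rho_target) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()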

slide-6
SLIDE 6

What is under the hood?

slide-7
SLIDE 7

What is deep learning?

Compose differentiable components into a program (e.g. a neural network), then optimize it with gradients.

slide-8
SLIDE 8

Automatic differentiation on a computation graph: the "comb graph" of a feedforward network, with data $x_1 \to x_2 \to x_3 \to \mathcal{L}$ (the loss) and weights $\theta_1, \theta_2$ feeding $x_2, x_3$.

Define the adjoint variable $\bar{x} = \partial\mathcal{L}/\partial x$. Seed $\bar{\mathcal{L}} = 1$ and pull the adjoint back through the graph:
$$\bar{x}_3 = \bar{\mathcal{L}}\,\frac{\partial \mathcal{L}}{\partial x_3},\qquad \bar{x}_2 = \bar{x}_3\,\frac{\partial x_3}{\partial x_2},\qquad \bar{\theta}_2 = \bar{x}_3\,\frac{\partial x_3}{\partial \theta_2},\qquad \bar{\theta}_1 = \bar{x}_2\,\frac{\partial x_2}{\partial \theta_1}.$$

slide-9
SLIDE 9

Automatic differentiation on a computation graph, in general a directed acyclic graph: message passing for the adjoint at each node, with $\bar{\mathcal{L}} = 1$,
$$\bar{x}_i = \sum_{j:\ \text{child of}\ i} \bar{x}_j\,\frac{\partial x_j}{\partial x_i}.$$
For example, if $x_1$ has children $x_2$ and $x_3$ on the path from $\theta$ to $\mathcal{L}$, then
$$\bar{x}_1 = \bar{x}_2\,\frac{\partial x_2}{\partial x_1} + \bar{x}_3\,\frac{\partial x_3}{\partial x_1}.$$
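A toy illustration of this adjoint message passing (a minimal sketch; the Node class and helper functions are made up for the example, not taken from any AD library):

import math

class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # list of (parent_node, local partial dx_self/dx_parent)
        self.adjoint = 0.0          # xbar = dL/dx, accumulated from children

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def backward(loss):
    loss.adjoint = 1.0              # seed: Lbar = 1
    order, seen = [], set()
    def topo(n):                    # children end up after their parents
        if id(n) in seen: return
        seen.add(id(n))
        for p, _ in n.parents: topo(p)
        order.append(n)
    topo(loss)
    for node in reversed(order):    # pull the adjoint back through the graph
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial   # xbar_i += xbar_j * dx_j/dx_i

# Usage: L = sin(x) * x has dL/dx = cos(x)*x + sin(x)
x = Node(1.5)
L = mul(sin(x), x)
backward(L)
print(x.adjoint, math.cos(1.5) * 1.5 + math.sin(1.5))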

slide-10
SLIDE 10
Advantages of automatic differentiation
  • Accurate to machine precision
  • Same computational complexity as the function evaluation (Baur–Strassen theorem ’83)
  • Supports higher-order gradients
slide-11
SLIDE 11

Applications of AD

  • Computing forces: Sorella and Capriotti, J. Chem. Phys. ’10
  • Variational Hartree-Fock: Tamayo-Mendoza et al., ACS Cent. Sci. ’18
  • Quantum optimal control: Leung et al., PRA ’17

[Figure: forward (evolution) and backward (gradient) passes through the control sequence]

slide-12
SLIDE 12

More Applications…

  • Structural optimization: Hoyer et al., 1909.04240

[Figure: neural reparameterization pipeline, from CNN parameterization and design constraints through the physics (displacement) to the objective function (compliance), with a forward pass and gradients]

  • Protein folding: Ingraham et al., ICLR ’19

slide-13
SLIDE 13

McGreivy et al 2009.00196

Coil design in fusion reactors (stellarator)

slide-14
SLIDE 14

Computation graph: coil parameters → total cost. McGreivy et al., 2009.00196

Differentiable programming is more than training neural networks

slide-15
SLIDE 15

https://colab.research.google.com/github/google/jax/blob/master/notebooks/autodiff_cookbook.ipynb

Views of AD: black magic box → chain rule → functional differential geometry

Differentiating a general computer program (rather than just a neural network) calls for a deeper understanding of the technique

slide-16
SLIDE 16

Reverse versus forward mode

Reverse mode AD: Vector-Jacobian Product of primitives

  • Backtrace the computation graph
  • Needs to store intermediate results
  • Efficient for graphs with large fan-in

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial x_n}\,\frac{\partial x_n}{\partial x_{n-1}}\cdots\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial \theta}$$
(accumulated from the left, i.e. starting from the loss)

Backpropagation = Reverse mode AD applied to neural networks

slide-17
SLIDE 17

Reverse versus forward mode

Forward mode AD: Jacobian-Vector Product of primitives

  • Computed along with the function evaluation, in the same order
  • No storage overhead
  • Efficient for graphs with large fan-out

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial x_n}\,\frac{\partial x_n}{\partial x_{n-1}}\cdots\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial \theta}$$
(accumulated from the right, i.e. starting from the parameters)

Less efficient for scalar output, but useful for higher-order derivatives
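A small sketch contrasting the two modes in JAX (the function f and the vector choices are illustrative): forward mode pushes a tangent through the graph as a Jacobian-vector product, reverse mode pulls a cotangent back as a vector-Jacobian product.

import jax
import jax.numpy as jnp

def f(theta):                      # a toy map R^3 -> R^2
    return jnp.array([jnp.sin(theta[0]) * theta[1],
                      theta[1] * theta[2] ** 2])

theta = jnp.array([0.3, 1.2, -0.7])

# Forward mode: Jacobian-vector product, one column of the Jacobian per call
tangent = jnp.array([1.0, 0.0, 0.0])
y, jvp_out = jax.jvp(f, (theta,), (tangent,))      # J @ tangent

# Reverse mode: vector-Jacobian product, one row of the Jacobian per call
y, vjp_fun = jax.vjp(f, theta)
cotangent = jnp.array([1.0, 0.0])
(vjp_out,) = vjp_fun(cotangent)                     # cotangent @ J

print(jvp_out, vjp_out)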

slide-18
SLIDE 18

How to think about AD?

  • AD is modular, and one can control its granularity
  • Benefits of writing customized primitives (see the sketch below):
      • Reducing memory usage
      • Increasing numerical stability
      • Calling external libraries written agnostically to AD (or even a quantum processor)

https://github.com/PennyLaneAI/pennylane
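As a rough illustration of a customized primitive (assuming JAX; the softplus example and its hand-written VJP rule are my own, not from the talk), one can replace the traced backward pass with a numerically stable rule:

import jax
import jax.numpy as jnp

@jax.custom_vjp
def softplus(x):
    return jnp.logaddexp(0.0, x)          # stable forward: log(1 + e^x)

def softplus_fwd(x):
    return jnp.logaddexp(0.0, x), x       # save x for the backward pass

def softplus_bwd(x, g):
    return (g * jax.nn.sigmoid(x),)       # d/dx log(1 + e^x) = sigmoid(x)

softplus.defvjp(softplus_fwd, softplus_bwd)

print(jax.grad(softplus)(30.0))           # ~1.0, no overflow in the gradient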

slide-19
SLIDE 19

Examples of primitives

Loops, conditionals, sort, and permutations are also differentiable

~200 functions to cover most of numpy in HIPS/autograd

https://github.com/HIPS/autograd/blob/master/autograd/numpy/numpy_vjps.py
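For instance, a tiny sketch (assuming JAX; the top-2 function is illustrative) of a gradient flowing through sorting, where the backward pass simply routes each adjoint entry back through the sorting permutation:

import jax
import jax.numpy as jnp

def top2_sum(x):
    return jnp.sum(jnp.sort(x)[-2:])      # sum of the two largest entries

x = jnp.array([0.1, 3.0, -1.0, 2.0])
print(jax.grad(top2_sum)(x))              # [0., 1., 0., 1.]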

slide-20
SLIDE 20

Differentiable programming tools

HIPS/autograd

SciML

slide-21
SLIDE 21

Differentiable Scientific Computing

  • Many scientific computations (FFT, eigendecomposition, SVD!) are differentiable
  • Differentiable ray tracers and differentiable fluid simulations
  • Differentiable Monte Carlo / tensor networks / functional RG / dynamical mean-field theory / density functional theory / Hartree-Fock / coupled cluster / Gutzwiller / molecular dynamics …
  • ODE integrators are differentiable with O(1) memory

Differentiate through domain-specific computational processes to solve learning, control, optimization, and inverse problems.

slide-22
SLIDE 22

Inverse Schrödinger problem / differentiable eigensolver:

potential $V$ → Hamiltonian $H$ → matrix diagonalization → wavefunction $\Psi$

Useful for inverse Kohn-Sham problem, Jensen & Wasserman ‘17

slide-23
SLIDE 23

Differentiable Eigensolver

$$H\Psi = \Psi E$$

Forward mode: what happens if $H \to H + dH$? Perturbation theory.
Reverse mode: given $\partial\mathcal{L}/\partial\Psi$ and $\partial\mathcal{L}/\partial E$, how should I change $H$? Inverse perturbation theory!

Hamiltonian engineering via differentiable programming

https://github.com/wangleiphy/DL4CSRC/tree/master/2-ising See also Fujita et al, PRB ‘18
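A minimal numerical check (assuming PyTorch; the 6×6 random Hamiltonian is illustrative) that differentiating through the eigensolver reproduces first-order perturbation theory: the gradient of the ground-state energy is $\partial E_0/\partial H_{ij} = \Psi_{0,i}\Psi_{0,j}$ (Hellmann-Feynman).

import torch

H0 = torch.randn(6, 6, dtype=torch.float64)
H = (0.5 * (H0 + H0.T)).requires_grad_(True)      # symmetric Hamiltonian

E, psi = torch.linalg.eigh(H)
E[0].backward()                                    # adjoint of the ground-state energy

# Hellmann-Feynman: dE0/dH = |psi0><psi0|
print(torch.allclose(H.grad, torch.outer(psi[:, 0], psi[:, 0]).detach()))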

slide-24
SLIDE 24

Dynamical systems: the principle of least action (optics, (quantum) mechanics, field theory, …)
$$S = \int \mathcal{L}(q_\theta, \dot{q}_\theta, t)\,dt$$
and classical and quantum control
$$\frac{dx}{dt} = f_\theta(x, t)$$

Differentiable ODE integrators

“Neural ODE” Chen et al, 1806.07366
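A minimal sketch of differentiating through an ODE integrator (assuming PyTorch; the decay equation, RK4 stepper, and parameter are illustrative). A plain RK4 loop is differentiable by ordinary reverse mode, which stores every step; the O(1)-memory adjoint trick of the Neural ODE paper is not shown here.

import torch

theta = torch.tensor(0.5, requires_grad=True)   # decay-rate parameter (illustrative)

def f(x, t):
    return -theta * x                           # dx/dt = f_theta(x, t)

def rk4(x, t0, t1, steps=100):
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(x, t)
        k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(x + h * k3, t + h)
        x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + h
    return x

x1 = rk4(torch.tensor(1.0), 0.0, 2.0)
x1.backward()                                    # d x(2) / d theta through the integrator
print(x1.item(), theta.grad.item())              # ~exp(-1), ~-2*exp(-1)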

slide-25
SLIDE 25

Quantum optimal control

  • No gradient: not scalable
  • Forward mode: slow
  • Reverse mode with discretized steps: piecewise-constant assumption

$$i\,\frac{dU}{dt} = H U$$

https://qucontrol.github.io/krotov/v1.0.0/11_other_methods.html

Differentiable programming (Neural ODE) for unified, flexible, and efficient quantum control

slide-26
SLIDE 26

Dynamical systems: the principle of least action (optics, (quantum) mechanics, field theory, …)
$$S = \int \mathcal{L}(q_\theta, \dot{q}_\theta, t)\,dt$$
and classical and quantum control
$$\frac{dx}{dt} = f_\theta(x, t)$$

Differentiable ODE integrators

“Neural ODE” Chen et al, 1806.07366

slide-27
SLIDE 27

Differentiable functional optimization

The brachistochrone problem (Johann Bernoulli, 1696): minimize the travel time
$$T = \int_{x_0}^{x_1} \sqrt{\frac{1 + (dy/dx)^2}{2g\,(y_0 - y)}}\; dx$$
with $y_0$ the starting height.

https://github.com/QuantumBFS/SSSS/tree/master/1_deep_learning/brachistochrone
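A minimal sketch of the corresponding functional optimization (not the linked notebook; PyTorch, the discretization, endpoints, and optimizer settings are illustrative): parametrize the curve by its values on a grid and descend the gradient of the discretized travel time.

import torch

n, g = 64, 9.8
x = torch.linspace(0.0, 1.0, n)
y_inner = torch.linspace(0.0, -1.0, n)[1:-1].clone().requires_grad_(True)
opt = torch.optim.Adam([y_inner], lr=1e-2)

for step in range(2000):
    y = torch.cat([torch.zeros(1), y_inner, torch.full((1,), -1.0)])   # fixed endpoints (0,0) -> (1,-1)
    dx, dy = x[1:] - x[:-1], y[1:] - y[:-1]
    drop = torch.clamp(-0.5 * (y[1:] + y[:-1]), min=1e-6)              # depth below the release point
    v = torch.sqrt(2 * g * drop)                                       # speed from energy conservation
    T = torch.sum(torch.sqrt(dx**2 + dy**2) / v)                       # travel time of the polyline
    opt.zero_grad(); T.backward(); opt.step()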

slide-28
SLIDE 28

Differentiable Programming Tensor Networks

Liao, Liu, LW, Xiang, 1903.09650, PRX ‘19 https://github.com/wangleiphy/tensorgrad

slide-29
SLIDE 29

Ψ

"Tensor network is the 21st century's matrix"

Neural networks and Probabilistic graphical models

—Mario Szegedy

Quantum circuit architecture, parametrization, and simulation

slide-30
SLIDE 30

[Plots vs. inverse temperature β, compared with the exact solution: free energy $-\frac{1}{\beta}\ln Z$, energy density $-\partial\ln Z/\partial\beta$, and specific heat $\beta^2\,\partial^2\ln Z/\partial\beta^2$]
  • Differentiate through tensor renormalization group

Computation graph: inverse temperature β → (contraction and truncated SVD, repeated × depth) → free energy ln Z (Levin, Nave, PRL ’07)

Compute physical observables as gradients of the tensor network contraction
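As a toy version of this pattern (assuming PyTorch; a 1D Ising chain via its transfer matrix rather than the 2D tensor renormalization group of the slide), the energy follows from $-\partial\ln Z/\partial\beta$ by automatic differentiation, and create_graph=True leaves room for a second derivative such as the specific heat:

import torch

N = 100                                                       # chain length, periodic boundary
beta = torch.tensor(0.4, dtype=torch.float64, requires_grad=True)

s = torch.tensor([1.0, -1.0], dtype=torch.float64)
T = torch.exp(beta * s[:, None] * s[None, :])                 # transfer matrix T[s, s'] = exp(beta s s')

M = torch.eye(2, dtype=torch.float64)
for _ in range(N):
    M = M @ T
lnZ = torch.log(torch.trace(M))                               # Z = tr(T^N)

energy = -torch.autograd.grad(lnZ, beta, create_graph=True)[0] / N
print(energy.item())                                          # ~ -tanh(beta) for a long chain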

slide-31
SLIDE 31

Differentiable spin glass solver

Optimal energy = (tropical) tensor network contraction of the couplings & fields
Optimal configuration = ∂(optimal energy) / ∂(fields)

Liu, LW, Zhang, 2008.06888
https://github.com/TensorBFS/TropicalTensors.jl

slide-32
SLIDE 32

[Diagram: the gradient as a sum of tensor network contractions]

[Plot: energy relative error vs. bond dimension D, comparing simple update, full update, Corboz [34], Vanderstraeten [35], and the present work]

now, w/ differentiable programming

Liao, Liu, LW, Xiang, PRX ‘19

before…

Differentiable iPEPS optimization

https://github.com/wangleiphy/tensorgrad
One week on a single GPU (Nvidia P100); best variational energy to date

Vanderstraeten et al, PRB ‘16
slide-33
SLIDE 33

[Plot: energy relative error vs. bond dimension D, comparing simple update, full update, Corboz [34], Vanderstraeten [35], and the present work]

Finite size, neural network: Carleo & Troyer, Science ’17 (10×10 cluster)

Infinite size, tensor network: Liao, Liu, LW, Xiang, PRX ’19

Further progress for challenging physical problems: frustrated magnets, fermions, thermodynamics …

Chen et al. ’19, Xie et al. ’20, Tang et al. ’20, …

Differentiable iPEPS optimization

slide-34
SLIDE 34

Differentiable Programming Quantum Circuits

neural networks — graphical models — tensor networks — quantum circuits

slide-35
SLIDE 35

Variational quantum algorithms

Quantum circuit as a variational ansatz: minimize the energy expectation $\langle H\rangle$ over the circuit parameters $\theta$ (Peruzzo et al., Nat. Comm. ’13)

[Circuit diagram with parametrized gates $\theta_1, \dots, \theta_6$]

slide-36
SLIDE 36

Scanning a single variational parameter vs. stochastic perturbation of 30 variational parameters

[Plots: energy (hartree) vs. iteration k, with the exact value and the final experimental result; scans of the variational parameters in radians]

Optimization with analytical gradients is essential in higher dimensions (PRX ’16, Nature ’17)

Optimize variational quantum circuits

slide-37
SLIDE 37

Parametrized gates of the form $e^{-i\frac{\theta}{2}\Sigma}$ with $\Sigma^2 = 1$, e.g. $\Sigma$ = X, Y, Z, CNOT, SWAP, …

$$\nabla\langle H\rangle_\theta = \left(\langle H\rangle_{\theta+\pi/2} - \langle H\rangle_{\theta-\pi/2}\right)/2$$

Li et al, PRL ’17, Mitarai et al, PRA ’18 Schuld et al, PRA ’19, Crooks, ’19…

Differentiable¹ quantum circuits

Measure the gradient on a real device; same complexity as forward mode automatic differentiation
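A minimal numerical check of this shift rule on one qubit (plain NumPy; the RY-rotation example is illustrative, no hardware involved): for $\langle Z\rangle$ after $R_y(\theta)|0\rangle$ the rule reproduces the analytic derivative.

import numpy as np

Z = np.diag([1.0, -1.0])

def expval(theta):
    # state RY(theta)|0> = (cos(theta/2), sin(theta/2)), so <Z> = cos(theta)
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return psi @ Z @ psi

theta = 0.7
shift_grad = (expval(theta + np.pi / 2) - expval(theta - np.pi / 2)) / 2
print(shift_grad, -np.sin(theta))   # both ~ -0.644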

slide-38
SLIDE 38

Differentiable² quantum circuits

Compute the gradient in classical simulations. Unfortunately, forward mode is slow and reverse mode is memory consuming.

slide-39
SLIDE 39

Quantum circuit computation graph

$|x_0\rangle \xrightarrow{U_1} |x_1\rangle \xrightarrow{U_2} |x_2\rangle \rightarrow \cdots \xrightarrow{U_N} |x_N\rangle$

The same "comb graph" as a feedforward neural network, with quantum states as the data and unitaries as the weights, except that quantum computing is reversible.

O(1) memory AD for reversible neural nets: Gomez et al., 1707.04585; Chen et al., 1806.07366

slide-40
SLIDE 40

Reversible AD for variational quantum circuits*

Forward: $|y\rangle = U\,|x\rangle$
Backward: "uncompute" the state, $|x\rangle = U^\dagger\,|y\rangle$, and pull back the adjoint of the matrix-vector multiply, $\overline{|x\rangle} = U^\dagger\,\overline{|y\rangle}$

All are in-place operations without caching

*GRAPE-type algorithm on the level of circuits
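A rough sketch of this uncompute-based backward pass (plain NumPy; the random unitaries, the diagonal observable, and the complex-gradient convention $\bar{x} = \partial\mathcal{L}/\partial x^*$ are illustrative assumptions):

import numpy as np

def random_unitary(n, rng):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
n, depth = 8, 5
Us = [random_unitary(n, rng) for _ in range(depth)]

# forward: only the final state is kept
x = np.zeros(n, dtype=complex); x[0] = 1.0
for U in Us:
    x = U @ x

# loss L = <x|O|x> for a diagonal observable; adjoint xbar = dL/dx* = O x
O = np.diag(np.arange(n, dtype=float))
xbar = O @ x

# backward: uncompute the state with U^dagger and pull back the adjoint, in place
grads = []
for U in reversed(Us):
    x = U.conj().T @ x                       # uncompute the input of this layer
    grads.append(np.outer(xbar, x.conj()))   # dL/dU* = xbar x_in^dagger
    xbar = U.conj().T @ xbar                 # adjoint of the matrix-vector multiply
grads.reverse()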

slide-41
SLIDE 41

Train a 10,000 layer, 300,000 parameter circuit on a laptop

https://yaoquantum.org/

Listing 9: 10000-layer VQE

julia> using Yao, YaoExtensions

julia> n = 10; depth = 10000;

julia> circuit = dispatch!(variational_circuit(n, depth), :random);

julia> gatecount(circuit)
Dict{Type{#s54} where #s54<:AbstractBlock,Int64} with 3 entries:
  RotationGate{1,Float64,ZGate} => 200000
  RotationGate{1,Float64,XGate} => 100010
  ControlBlock{10,XGate,1,1}    => 100000

julia> nparameters(circuit)
300010

julia> h = heisenberg(n);

julia> for i = 1:100
           _, grad = expect'(h, zero_state(n) => circuit)
           dispatch!(-, circuit, 1e-3 * grad)
           println("Step $i, energy = $(expect(h, zero_state(n) => circuit))")
       end
slide-42
SLIDE 42

https://github.com/QuantumBFS/Yao.jl

Features:
  • Reversible AD engine for quantum circuits
  • Batch parallelization with GPU acceleration
  • Quantum block intermediate representation

Xiu-Zhe Luo (IOP, CAS → Waterloo & PI) Jin-Guo Liu (IOP, CAS → QuEra Computing & Harvard)

Yao.jl: Extensible, Efficient Framework for Quantum Algorithm Design

Luo, Liu, Zhang and LW, 1912.10877

slide-43
SLIDE 43

Thank you!

Collaborators: Jin-Guo Liu (QuEra & Harvard), Xiu-Zhe Luo (Waterloo & PI), Hai-Jun Liao (IOP, CAS), Pan Zhang (ITP, CAS), Tao Xiang (IOP, CAS)