SLIDE 1

Machine Learning from a Continuous Viewpoint Weinan E Princeton University Joint work with: Chao Ma, Lei Wu https://arxiv.org/pdf/1912.12777.pdf

August 3, 2020 1 / 31

SLIDE 2

Examples

Standard ML problems for which we are given the dataset:

1. Supervised learning: Given $S = \{(x_j,\, y_j = f^*(x_j)),\; j \in [n]\}$, learn $f^*$.

Objective: minimize the population risk over the "hypothesis space":
$$R(f) = \mathbb{E}_{x\sim\mu}\,(f(x) - f^*(x))^2$$

2. Dimension reduction: Given $S = \{x_j,\; j \in [n]\} \subset \mathbb{R}^D$ sampled from $\mu$, find a mapping $\Phi: \mathbb{R}^D \to \mathbb{R}^d$ ($d \ll D$) that best preserves the important features of $\mu$.

"Auto-encoder": minimize the reconstruction error $x - \tilde{x}$, where $\tilde{x} = \Psi(z)$, $z = \Phi(x)$:
$$R(\Phi, \Psi) = \mathbb{E}_{x\sim\mu}\,\|x - \Psi(z)\|^2 = \mathbb{E}_{x\sim\mu}\,\|x - \Psi(\Phi(x))\|^2$$

SLIDE 3

Non-standard ML problems, where no dataset is given beforehand:

1. Ground state of a quantum many-body problem:

Let $H = -\frac{\hbar^2}{2m}\Delta + V$ be the Hamiltonian operator of the system. Minimize
$$I(\phi) = \frac{(\phi, H\phi)}{(\phi, \phi)} = \mathbb{E}_{x\sim\mu_\phi}\,\frac{\phi(x)\,H\phi(x)}{\phi(x)^2}, \qquad \mu_\phi(dx) = \frac{1}{(\phi,\phi)}\,\phi^2(x)\,dx,$$
subject to the constraint imposed by the Pauli exclusion principle.

2. Stochastic control problems:
$$s_{t+1} = s_t + b_t(s_t, a_t) + \xi_{t+1},$$
where $s_t$ is the state at time $t$, $a_t$ the control at time $t$, and $\xi_t$ i.i.d. noise. Minimize
$$L(\{a_t\}_{t=0}^{T-1}) = \mathbb{E}_{\{\xi_t\}}\left[\sum_{t=0}^{T-1} c_t(s_t, a_t(s_t)) + c_T(s_T)\right].$$
Look for a feedback control: $a_t = F(t, s_t)$, $t = 0, 1, \cdots, T-1$.

SLIDE 4

Remark: High dimensionality

Benchmark: high-dimensional integration
$$I(g) = \int_{X=[0,1]^d} g(x)\,d\mu, \qquad I_m(g) = \frac{1}{m}\sum_j g(x_j).$$

Grid-based quadrature rules:
$$I(g) - I_m(g) \sim \frac{C(g)}{m^{\alpha/d}}.$$
The appearance of $1/d$ in the exponent of $m$ is the curse of dimensionality (CoD)! If we want $m^{-\alpha/d} = 0.1$, then $m = 10^{d/\alpha} = 10^d$ if $\alpha = 1$.

Monte Carlo: $\{x_j,\; j \in [m]\}$ is uniformly distributed in $X$:
$$\mathbb{E}\,(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \qquad \mathrm{var}(g) = \int_X g^2(x)\,dx - \left(\int_X g(x)\,dx\right)^2.$$
However, $\mathrm{var}(g)$ can be very large in high dimension. Variance reduction!
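The dimension-independent Monte Carlo rate is easy to check numerically. A minimal sketch (the integrand $g$, dimension, and sample size are illustrative choices, not from the slides): a separable integrand on $[0,1]^{10}$ with exact integral 1, where the error fluctuates at scale $\sqrt{\mathrm{var}(g)/m}$ regardless of $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 200_000

def g(x):
    # separable test integrand on [0,1]^d with exact integral I(g) = 1
    return np.prod(2.0 * x, axis=-1)

x = rng.random((m, d))        # x_j uniform on X = [0,1]^d
I_m = g(x).mean()             # I_m(g) = (1/m) sum_j g(x_j)
err = abs(I_m - 1.0)          # fluctuates at scale sqrt(var(g)/m), independent of d
```

A grid-based rule with the same budget would place fewer than 4 points per coordinate direction in $d = 10$.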

SLIDE 5

Overall strategy:

Formulate a “nice” continuous problem, then discretize to get concrete models/algorithms.

For PDEs, "nice" = well-posed. For calculus of variations problems, "nice" = convex, lower semi-continuous. For machine learning, "nice" = the variational problem has a simple landscape.

SLIDE 6

How do we represent a function? An illustrative example

Traditional approach, via the Fourier transform:
$$f(x) = \int_{\mathbb{R}^d} a(\omega)\, e^{i(\omega, x)}\,d\omega, \qquad f_m(x) = \frac{1}{m}\sum_j a(\omega_j)\, e^{i(\omega_j, x)},$$
where $\{\omega_j\}$ is a fixed grid, e.g. uniform. Then
$$\|f - f_m\|_{L^2(X)} \le C_0\, m^{-\alpha/d}\, \|f\|_{H^\alpha(X)}.$$

"New" approach: let $\pi$ be a probability distribution and
$$f(x) = \int_{\mathbb{R}^d} a(\omega)\, e^{i(\omega, x)}\,\pi(d\omega) = \mathbb{E}_{\omega\sim\pi}\, a(\omega)\, e^{i(\omega, x)}.$$
Let $\{\omega_j\}$ be an i.i.d. sample of $\pi$ and
$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a(\omega_j)\, e^{i(\omega_j, x)}, \qquad \mathbb{E}\,|f(x) - f_m(x)|^2 = m^{-1}\,\mathrm{var}(f).$$
Note that
$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j\,\sigma(\omega_j^T x)$$
is a two-layer neural network with activation function $\sigma(z) = e^{iz}$.
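This construction is easy to test when the expectation is known in closed form. A small sketch, assuming $a(\omega) \equiv 1$ and $\pi = N(0, I)$ (illustrative choices), so that $f(x) = \mathbb{E}\, e^{i(\omega, x)}$ is the Gaussian characteristic function $e^{-|x|^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 100_000
omega = rng.standard_normal((m, d))       # i.i.d. sample of pi = N(0, I)

def f_m(x):
    # f_m(x) = (1/m) sum_j e^{i (omega_j, x)}, Monte Carlo estimate of f(x)
    return np.exp(1j * (omega @ x)).mean()

x = np.full(d, 0.3)
exact = np.exp(-0.5 * (x @ x))            # f(x) = E e^{i(omega,x)} = e^{-|x|^2/2}
err = abs(f_m(x) - exact)                 # O(1/sqrt(m)), no curse of dimensionality
```

The same code works unchanged in any dimension $d$; only $\mathrm{var}(f)$, not $d$, enters the error.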

SLIDE 7

Integral transform-based representation

Let $\sigma$ be a scalar nonlinear function (activation function), e.g. $\sigma = \mathrm{ReLU}$. Consider functions represented in the form
$$f(x; \theta) = \int_{\mathbb{R}^d} a(w)\,\sigma(w^T x)\,\pi(dw) = \mathbb{E}_{w\sim\pi}\, a(w)\,\sigma(w^T x) = \mathbb{E}_{(a,w)\sim\rho}\, a\,\sigma(w^T x),$$
where $\theta$ is the parameter to be optimized:
- $\theta = a(\cdot)$ corresponds to a feature-based model;
- $\theta = \rho$ corresponds to a two-layer neural network-like model;
- $\theta = (a(\cdot), \pi(\cdot))$ is a new model.

SLIDE 8

Discretize

Fourier method: $\pi \sim \frac{1}{N}\sum_j \delta_{\omega_j}$, where $\{\omega_j\}$ lives on a uniform lattice. Optimize $a(\cdot)$:
$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_j a(w_j)\,\sigma(w_j^T x).$$

Neural network-based method: $\rho \sim \frac{1}{N}\sum_j \delta_{(a_j, \omega_j)}$ ($\{\omega_j\}$ is also optimized):
$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_j a_j\,\sigma(w_j^T x),$$
then optimize (say, using L-BFGS). This is more in line with traditional numerical analysis (e.g. nonlinear finite elements or meshless methods).

SLIDE 9

For truly large datasets, we need to use stochastic algorithms

The objective functions are all expressed as expectations:
$$R(\theta) = \mathbb{E}_{x\sim\mu}\,(f(x;\theta) - f^*(x))^2, \qquad R(\theta_1, \theta_2) = \mathbb{E}_{x\sim\mu}\,\|x - \Psi(\Phi(x;\theta_1);\theta_2)\|^2, \qquad I(\theta) = \mathbb{E}_{x\sim\mu_\theta}\,\frac{\phi(x;\theta)\,H\phi(x;\theta)}{\phi(x;\theta)^2}.$$
Gradient descent (GD) can be readily converted to stochastic gradient descent (SGD). Let $F(\theta) = \mathbb{E}_{x\sim\mu}\, g(\theta, x)$ be the objective function:
$$\mathrm{GD}: \quad \theta_{k+1} = \theta_k - \eta\,\nabla_\theta\, \mathbb{E}_{x\sim\mu}\, g(\theta_k, x),$$
$$\mathrm{SGD}: \quad \theta_{k+1} = \theta_k - \eta\,\nabla_\theta\, g(\theta_k, x_k),$$
where $\{x_k\}$ are i.i.d. random samples.
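In code, the GD-to-SGD conversion just swaps the full expectation for one fresh sample per step. A sketch on a linear least-squares objective $R(\theta) = \mathbb{E}_{x\sim\mu}(\theta^T x - \theta_*^T x)^2$ with $\mu = N(0, I)$; the target $\theta_*$, step size, and iteration count are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
theta_star = np.array([1.0, -2.0, 0.5])   # assumed target parameters

theta = np.zeros(d)
eta = 0.05
for _ in range(20_000):
    x_k = rng.standard_normal(d)                      # x_k ~ mu, one fresh i.i.d. sample
    grad = 2.0 * ((theta - theta_star) @ x_k) * x_k   # grad_theta g(theta_k, x_k)
    theta -= eta * grad                               # SGD step
```

Each step touches one sample, so the per-step cost is independent of the dataset size; that is the point for truly large datasets.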

SLIDE 10

Optimization: Defining gradient flows

"Free energy":
$$R(f) = \mathbb{E}_{x\sim\mu}\,(f(x) - f^*(x))^2, \qquad f(x) = \int a(w)\,\sigma(w^T x)\,\pi(dw) = \mathbb{E}_{w\sim\pi}\, a(w)\,\sigma(w^T x).$$
Follow Halperin and Hohenberg (1977):
- $a$ is non-conserved: use "model A" dynamics (Allen-Cahn):
$$\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}.$$
- $\pi$ is conserved (a probability density): use "model B" dynamics (Cahn-Hilliard):
$$\frac{\partial \pi}{\partial t} + \nabla\cdot J = 0, \qquad J = \pi v, \quad v = -\nabla V, \quad V = \frac{\delta R}{\delta \pi}.$$

SLIDE 11

Gradient flow for the feature-based model

Fix $\pi$, optimize $a$:
$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t)\, K(w, \tilde{w})\,\pi(d\tilde{w}) + \tilde{f}(w),$$
$$K(w, \tilde{w}) = \mathbb{E}_x[\sigma(w^T x)\,\sigma(\tilde{w}^T x)], \qquad \tilde{f}(w) = \mathbb{E}_x[f^*(x)\,\sigma(w^T x)].$$
This is an integral equation with a symmetric positive definite kernel.

Decay estimates due to convexity: let $f^*(x) = \mathbb{E}_{w\sim\pi}\, a^*(w)\,\sigma(w^T x)$ and
$$I(t) = \frac{1}{2}\|a(\cdot, t) - a^*(\cdot)\|^2 + t\,(R(a(t)) - R(a^*)).$$
Then
$$\frac{dI}{dt} \le 0, \qquad R(a(t)) \le \frac{C_0}{t}.$$

SLIDE 12

Conservative gradient flow

Optimize $\rho$: $f(x) = \mathbb{E}_{u\sim\rho}\,\phi(x, u)$. Example: $u = (a, w)$, $\phi(x, u) = a\,\sigma(w^T x)$.
$$\partial_t\rho = \nabla\cdot(\rho\nabla V), \qquad V(u) = \frac{\delta R}{\delta\rho}(u) = \mathbb{E}_x[(f(x) - f^*(x))\,\phi(x, u)] = \int K(u, \tilde{u})\,\rho(d\tilde{u}) - \tilde{f}(u).$$
This is the mean-field equation derived by Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), and Sirignano and Spiliopoulos (2018) by studying the continuum limit of two-layer neural networks. It does not satisfy displacement convexity.

SLIDE 13

Mixture model

Optimize $(a, \pi)$ ($a$ = non-conserved, $\pi$ = conserved):
$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t)\, K(w, \tilde{w})\,\pi(d\tilde{w}, t) + \tilde{f}(w),$$
$$\partial_t\pi = \nabla\cdot(\pi\nabla V), \qquad V(w) = \frac{\delta R}{\delta\pi}(w).$$

SLIDE 14

Discretizing the gradient flows

Two steps:
- Discretize the population risk (into the empirical risk) using data.
- Discretize the gradient flow:
  - particle method: the dynamic version of Monte Carlo
  - smoothed particle method: analog of the vortex blob method
  - spectral method: very effective in low dimensions

We will see that the gradient descent algorithm (GD) for random feature and neural network models is simply the particle method discretization of the gradient flows discussed before.

SLIDE 15

Particle method for the feature-based model

$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w) = -\int a(\tilde{w}, t)\, K(w, \tilde{w})\,\pi(d\tilde{w}) + \tilde{f}(w).$$
Discretize with $\pi(dw) \sim \frac{1}{m}\sum_j \delta_{w_j}$, $a(w_j, t) \sim a_j(t)$:
$$\frac{d}{dt} a_j(t) = -\frac{1}{m}\sum_k K(w_j, w_k)\, a_k(t) + \tilde{f}(w_j).$$
This is exactly the GD for the random feature model
$$f(x) \sim f_m(x) = \frac{1}{m}\sum_j a_j\,\sigma(w_j^T x).$$
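The identity can be checked directly: on an empirical risk, the right-hand side of the particle ODE matches the negative gradient of the random feature loss up to a constant factor $2/m$, which only rescales time. A sketch with assumed sizes and target:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, n = 2, 50, 400
relu = lambda z: np.maximum(z, 0.0)

w = rng.standard_normal((m, d))
x = rng.standard_normal((n, d))
fstar = np.sin(x[:, 0])                       # illustrative target f*

S = relu(x @ w.T)                             # S[i, j] = sigma(w_j^T x_i)
K = S.T @ S / n                               # empirical K(w_j, w_k)
ftil = S.T @ fstar / n                        # empirical f~(w_j)

a = rng.standard_normal(m)
rhs = -(K @ a) / m + ftil                     # particle ODE right-hand side

# gradient of the empirical risk of f_m(x) = (1/m) sum_j a_j sigma(w_j^T x)
fm = S @ a / m
grad = (2.0 / (n * m)) * (S.T @ (fm - fstar)) # dR_n/da_j
```

The assertion `-grad == (2/m) * rhs` holds exactly (up to floating point), so integrating the particle ODE and running GD trace the same trajectory on different clocks.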

SLIDE 16

Particle method for the conservative flow

$$\partial_t\rho = \nabla\cdot(\rho\nabla V) \qquad (1)$$
Particle method discretization: $\rho(t, du) \sim \frac{1}{m}\sum_j \delta_{u_j(t)}$. Define the loss function
$$I(u_1, \cdots, u_m) = R(f_m), \qquad f_m(x) = \frac{1}{m}\sum_j \phi(x, u_j).$$

Lemma: Given initial data $\{u_j^0,\; j \in [m]\}$, the solution of (1) with initial data $\rho(0) = \frac{1}{m}\sum_{j=1}^m \delta_{u_j^0}$ is given by
$$\rho(t) = \frac{1}{m}\sum_{j=1}^m \delta_{u_j(t)},$$
where the particles $\{u_j(\cdot),\; j \in [m]\}$ solve
$$\frac{du_j}{dt} = -\nabla_{u_j} I(u_1, \cdots, u_m), \qquad u_j(0) = u_j^0, \quad j \in [m].$$
This is exactly the GD dynamics for two-layer neural networks.
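The lemma says that running GD on the particle positions $u_j = (a_j, w_j)$ is the particle discretization of the conservative flow. A small sketch: full-batch GD on a two-layer ReLU network in the $1/m$ scaling; the risk should decrease for a small enough step (sizes, target, and step size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, n = 2, 64, 256
x = rng.standard_normal((n, d))
fstar = np.tanh(2.0 * x[:, 0])            # illustrative target f*

a = 0.1 * rng.standard_normal(m)          # particles u_j = (a_j, w_j)
w = rng.standard_normal((m, d))

def risk(a, w):
    fm = np.maximum(x @ w.T, 0.0) @ a / m
    return np.mean((fm - fstar) ** 2)

eta, r0 = 0.5, risk(a, w)
for _ in range(500):
    z = x @ w.T
    s = np.maximum(z, 0.0)
    res = s @ a / m - fstar                        # f_m(x_i) - f*(x_i)
    ga = (2.0 / (n * m)) * (s.T @ res)             # dI/da_j
    gw = (2.0 / (n * m)) * a[:, None] * (((z > 0) * res[:, None]).T @ x)  # dI/dw_j
    a, w = a - eta * ga, w - eta * gw
r1 = risk(a, w)
```

Each row of `(a, w)` is one particle; GD moves all of them simultaneously, which is exactly the discretized transport of $\rho$.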

SLIDE 17

Comparison with conventional NN models

Continuous viewpoint (in this case the same as mean-field):
$$f_m(x) = \frac{1}{m}\sum_j a_j\,\sigma(w_j^T x).$$
Conventional NN models:
$$f_m(x) = \sum_j a_j\,\sigma(w_j^T x).$$

Figure: test errors as a function of network width $\log_{10}(m)$ and sample size $\log_{10}(n)$. (Left) continuous viewpoint; (Right) conventional NN models. The target function is a single neuron $f^*(x) = \sigma(e_1^T x)$.

SLIDE 18

Flow-based representation

Continuous dynamical systems viewpoint (E (2017), Haber and Ruthotto (2017), "Neural ODEs" (Chen et al., 2018)):
$$\frac{dz}{d\tau} = g(\tau, z), \qquad z(0) = x.$$
The flow-map at time 1: $x \mapsto z(x, 1)$. Trial functions: $f(x) = \alpha^T z(x, 1)$. We will take $\alpha = \mathbf{1}$ for simplicity.

SLIDE 19

How do we choose the form of g?

The correct form of $g$ is given by (E, Ma and Wu, 2019):
$$g(\tau, z) = \mathbb{E}_{w\sim\pi_\tau}\, a(w, \tau)\,\sigma(w^T z),$$
where $\{\pi_\tau\}$ is a family of probability distributions, i.e.
$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\, a(w, \tau)\,\sigma(w^T z) = \mathbb{E}_{(a,w)\sim\rho_\tau}\, a\,\sigma(w^T z).$$
Discretize: we obtain the "residual neural network" (ResNet) model:
$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M a_{j,l}\,\sigma(z_l^T w_{j,l}), \qquad l = 1, 2, \cdots, L-1,$$
$$z_0 = V\tilde{x}, \qquad f_L(x) = \mathbf{1}^T z_L.$$
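A forward-pass sketch of this discretization. The layer parameters here are random placeholders, and $V$ is taken to be the identity, so this only illustrates the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(6)
d, L, M = 4, 20, 10

a = 0.1 * rng.standard_normal((L, M, d))  # a_{j,l} in R^d (placeholder values)
w = rng.standard_normal((L, M, d))        # w_{j,l} in R^d (placeholder values)

def f_L(x):
    z = x.copy()                          # z_0 = x, i.e. V = I for simplicity
    for l in range(L):
        # z_{l+1} = z_l + (1/(L M)) sum_j a_{j,l} sigma(w_{j,l}^T z_l)
        z = z + (np.maximum(w[l] @ z, 0.0)[:, None] * a[l]).sum(axis=0) / (L * M)
    return np.ones(d) @ z                 # f_L(x) = 1^T z_L

y = f_L(rng.standard_normal(d))
```

The $1/(LM)$ factor is what makes the depth limit $L \to \infty$ an ODE rather than a blow-up.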

SLIDE 20

Compositional law of large numbers

Consider the following compositional scheme:
$$z_{0,L}(x) = x, \qquad z_{l+1,L}(x) = z_{l,L}(x) + \frac{1}{LM}\sum_{k=1}^M a_{l,k}\,\sigma(w_{l,k}^T z_{l,L}(x)),$$
where the $(a_{l,k}, w_{l,k})$ are pairs of vectors sampled i.i.d. from a distribution $\rho$.

Theorem (E, Ma and Wu 2019). Assume that
$$\mathbb{E}_\rho\, \big\||a|\,|w^T|\big\|_F^2 < \infty,$$
where for a matrix or vector $A$, $|A|$ means taking the element-wise absolute value of $A$. Define $z$ by
$$z(x, 0) = x, \qquad \frac{d}{d\tau} z(x, \tau) = \mathbb{E}_{(a,w)\sim\rho}\, a\,\sigma(w^T z(x, \tau)).$$
Then $z_{L,L}(x) \to z(x, 1)$ almost surely as $L \to +\infty$.
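The theorem can be illustrated in one dimension. Assume (for this sketch only) $\sigma = \tanh$ and a $\rho$ with $a \equiv 1$ and $w \sim N(0,1)$; then $\mathbb{E}_{(a,w)\sim\rho}[a\,\sigma(wz)] = 0$ by symmetry, so the limiting ODE is $dz/d\tau = 0$ and $z(x, 1) = x$. The deep random composition should therefore stay near $x$, with fluctuations of order $(LM)^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(7)
L, M = 2000, 50
x = 0.7

z = x                                      # z_{0,L}(x) = x
for l in range(L):
    w_l = rng.standard_normal(M)           # (a, w) ~ rho with a = 1, w ~ N(0, 1)
    z = z + np.tanh(w_l * z).sum() / (L * M)   # z_{l+1,L} = z_{l,L} + (1/LM) sum_k ...

dev = abs(z - x)                           # should vanish as L -> infinity
```

Increasing $L$ (with the $1/(LM)$ scaling kept) shrinks `dev`, which is the compositional law of large numbers at work.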

SLIDE 21

The optimal control problem

In a slightly more general form:
$$\frac{dz}{d\tau} = \mathbb{E}_{u\sim\rho_\tau}\,\phi(z, u), \qquad z(x, 0) = x,$$
where $z$ is the state and $\rho_\tau$ the control at time $\tau$. The objective is to minimize over $\{\rho_\tau\}$:
$$R(\{\rho_\tau\}) = \mathbb{E}_{x\sim\mu}\,(f(x) - f^*(x))^2 = \int_{\mathbb{R}^d}(f(x) - f^*(x))^2\,d\mu,$$
where $f(x) = \mathbf{1}^T z(x, 1)$.

SLIDE 22

Pontryagin’s maximum principle

Define the Hamiltonian $H: \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as $H(z, p, \mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z, u)]$. The solutions of the control problem must satisfy
$$\rho_\tau = \mathrm{argmax}_\rho\; \mathbb{E}_x\big[H\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho\big)\big], \qquad \forall \tau \in [0, 1],$$
and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations
$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_p H = \mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\phi(z_\tau^{t,x}, u)\big],$$
$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_z H = -\mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\nabla_z^T\phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}\big],$$
with the boundary conditions
$$z_0^{t,x} = x, \qquad p_1^{t,x} = 2\,(f(x; \rho(\cdot;\, t)) - f^*(x))\,\mathbf{1}.$$

SLIDE 23

Gradient flow for flow-based models

Define the Hamiltonian $H: \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as $H(z, p, \mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z, u)]$. The gradient flow for $\{\rho_\tau\}$ is given by
$$\partial_t\rho_\tau(u, t) = \nabla\cdot\big(\rho_\tau(u, t)\,\nabla V(u; \rho)\big), \qquad \forall\tau \in [0, 1],$$
where
$$V(u; \rho) = \mathbb{E}_x\left[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;\, t)\big)\right],$$
and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations
$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_p H = \mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\phi(z_\tau^{t,x}, u)\big],$$
$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_z H = -\mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\nabla_z^T\phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}\big],$$
with the boundary conditions
$$z_0^{t,x} = x, \qquad p_1^{t,x} = 2\,(f(x; \rho(\cdot;\, t)) - f^*(x))\,\mathbf{1}.$$

SLIDE 24

Discretize the gradient flow

Forward Euler for the flow in the $\tau$ variable, step size $1/L$; particle method for the GD dynamics, $M$ samples in each layer:
$$z_{l+1}^{t,x} = z_l^{t,x} + \frac{1}{LM}\sum_{j=1}^M \phi\big(z_l^{t,x}, u_l^j(t)\big), \qquad l = 0, \ldots, L-1,$$
$$p_l^{t,x} = p_{l+1}^{t,x} + \frac{1}{LM}\sum_{j=1}^M \nabla_z\phi\big(z_{l+1}^{t,x}, u_{l+1}^j(t)\big)\, p_{l+1}^{t,x}, \qquad l = 0, \ldots, L-1,$$
$$\frac{du_l^j(t)}{dt} = -\mathbb{E}_x\big[\nabla_u^T\phi\big(z_l^{t,x}, u_l^j(t)\big)\, p_l^{t,x}\big].$$
This recovers the gradient descent algorithm (with back-propagation) for the ResNet
$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M \phi(z_l, u_l^j).$$
SLIDE 25

Max principle-based training method

Qianxiao Li, Long Chen, Cheng Tai and Weinan E (2017), basic "method of successive approximation" (MSA):

Initialize $\theta^0 \in U$. For $k = 0, 1, 2, \cdots$:
- Solve
$$\frac{dz_\tau^k}{d\tau} = \nabla_p H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad z_0^k = Vx;$$
- Solve
$$\frac{dp_\tau^k}{d\tau} = -\nabla_z H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad p_1^k = 2\,(f(x; \theta^k) - f^*(x))\,\mathbf{1};$$
- Set
$$\theta_\tau^{k+1} = \mathrm{argmax}_{\theta\in\Theta}\; H(z_\tau^k, p_\tau^k, \theta), \qquad \text{for each } \tau \in [0, 1].$$

Extended MSA: augment the Hamiltonian with penalty terms,
$$\tilde{H}(z, p, \theta, v, q) := H(z, p, \theta) - \frac{1}{2}\rho\,\|v - f(z, \theta)\|^2 - \frac{1}{2}\rho\,\|q + \nabla_z H(z, p, \theta)\|^2.$$

SLIDE 26

SLIDE 27

Comparison between GD and maximum principle

Maximum principle:
$$\rho_\tau = \mathrm{argmax}_\rho\; \mathbb{E}_x\big[H\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho\big)\big], \qquad \forall\tau\in[0,1].$$
GD:
$$\partial_t\rho_\tau(u, t) = \nabla\cdot\left(\rho_\tau(u, t)\,\nabla\,\mathbb{E}_x\left[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;\, t)\big)\right]\right), \qquad \forall\tau\in[0,1].$$
Hybrid: introduce a different time scale for the optimization step: one forward/backward propagation every $k$ steps of optimization.
- $k = 1$: the usual GD or SGD
- $k = \infty$: the maximum principle

SLIDE 28

“Mean-field” vs. “continuous”

They sometimes give rise to the same models (e.g. two-layer NNs), but they are DIFFERENT viewpoints:
- mean-field: discrete → continuous, by taking the limit (more like interacting particles in statistical physics)
- continuous formulation: continuous → discrete, by discretization (more like the usual numerical analysis situation)

The "continuous" formulation tries to formulate the "first principles" of ML. It allows us to think "outside the box" about ML: alternative discretizations (e.g. spectral) and alternative models.

SLIDE 29

Flow-based random feature model

$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\, a(w, \tau)\,\sigma(w^T z).$$
Random feature model: fix $\{\pi_\tau\}$, optimize over $\{a(\cdot, \tau)\}$.

SLIDE 30

Summary

- The continuous formulation tries to formulate the "first principles" of ML.
- It gives rise to new variational and PDE-like problems.
- The performance of these models/algorithms is more stable with respect to the choice of hyper-parameters (no "phase transition").
- It gives rise to new models and new algorithms, better suited to specific applications.
- It allows us to think "outside the box" about ML.

SLIDE 31

The crucial point is the representation of functions as some form of expectation:
- integral transform-based
- flow-based
- other representations?

In a way, neural networks are very natural.
