Machine Learning from a Continuous Viewpoint Weinan E Princeton University Joint work with: Chao Ma, Lei Wu https://arxiv.org/pdf/1912.12777.pdf
August 3, 2020 1 / 31
Standard ML problems, for which we are given a dataset:

1. Supervised learning: Given $S = \{(x_j, y_j = f^*(x_j)),\ j \in [n]\}$, learn $f^*$. Objective: minimize the population risk over the "hypothesis space":
$$R(f) = \mathbb{E}_{x\sim\mu}(f(x) - f^*(x))^2.$$

2. Dimension reduction: Given $S = \{x_j,\ j\in[n]\} \subset \mathbb{R}^D$ sampled from $\mu$, find a mapping $\Phi: \mathbb{R}^D \to \mathbb{R}^d$ ($d \ll D$) that best preserves the important features of $\mu$. "Auto-encoder": minimize the reconstruction error $x - \tilde{x}$, where $\tilde{x} = \Psi(z)$, $z = \Phi(x)$:
$$R(\Phi,\Psi) = \mathbb{E}_{x\sim\mu}(x - \Psi(z))^2 = \mathbb{E}_{x\sim\mu}(x - \Psi(\Phi(x)))^2.$$
Non-standard ML problems, with no dataset given beforehand:

1. Ground state of a quantum many-body problem: Let $H = -\frac{\hbar^2}{2m}\Delta + V$ be the Hamiltonian operator of the system. Minimize the Rayleigh quotient
$$I(\phi) = \frac{(\phi, H\phi)}{(\phi,\phi)} = \mathbb{E}_{x\sim\mu_\phi}\frac{\phi(x)\,H\phi(x)}{\phi(x)^2}, \qquad \mu_\phi(dx) = \frac{1}{(\phi,\phi)}\phi^2(x)\,dx,$$
subject to the constraint imposed by the Pauli exclusion principle.

2. Stochastic control problems:
$$s_{t+1} = s_t + b_t(s_t, a_t) + \xi_{t+1},$$
where $s_t$ is the state at time $t$, $a_t$ the control at time $t$, and $\xi_t$ i.i.d. noise. Minimize
$$L(\{a_t\}_{t=0}^{T-1}) = \mathbb{E}_{\{\xi_t\}}\Big[\sum_{t=0}^{T-1} c_t(s_t, a_t(s_t)) + c_T(s_T)\Big].$$
Look for a feedback control: $a_t = F(t, s_t)$, $t = 0, 1, \cdots, T-1$.
Benchmark: high-dimensional integration
$$I(g) = \int_X g(x)\,dx, \qquad I_m(g) = \frac{1}{m}\sum_{j=1}^m g(x_j).$$
Grid-based quadrature rules:
$$I(g) - I_m(g) \sim \frac{C(g)}{m^{\alpha/d}}.$$
The appearance of $1/d$ in the exponent of $m$ is the curse of dimensionality (CoD)! If we want $m^{-\alpha/d} = 0.1$, then $m = 10^{d/\alpha} = 10^d$ if $\alpha = 1$.
Monte Carlo: $\{x_j,\ j\in[m]\}$ uniformly distributed in $X$:
$$\mathbb{E}(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \qquad \mathrm{var}(g) = \int_X g^2(x)\,dx - \Big(\int_X g(x)\,dx\Big)^2.$$
However, $\mathrm{var}(g)$ can be very large in high dimension. Hence: variance reduction!
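To make the contrast concrete, here is a minimal Monte Carlo sketch (the integrand, dimension, and sample size are my own illustrative choices, not from the slides): the error scales like $m^{-1/2}$ even in $d=20$, where any grid-based rule would already be hopeless.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative integrand on X = [0,1]^d (an assumption, not from the slides):
# g(x) = prod_i (1 + (x_i - 1/2)/d), whose exact integral over [0,1]^d is 1.
def g(x, d):
    return np.prod(1.0 + (x - 0.5) / d, axis=-1)

d, m = 20, 10_000
x = rng.random((m, d))        # m uniform samples in [0,1]^d
I_m = g(x, d).mean()          # Monte Carlo estimate of I(g)

# The RMS error behaves like sqrt(var(g)/m): no 1/d in the exponent of m.
print(abs(I_m - 1.0))
```

Even a grid with only 2 points per dimension would need $2^{20} \approx 10^6$ evaluations here, while $10^4$ random samples already give roughly three correct digits.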
For PDEs, "nice" = well-posed. For calculus of variations problems, "nice" = convex and lower semi-continuous. For machine learning, "nice" = the variational problem has a simple landscape.
Traditional approach via the Fourier representation:
$$f(x) = \int a(\omega)e^{i(\omega,x)}\,d\omega, \qquad f_m(x) = \frac{1}{m}\sum_{j=1}^m a(\omega_j)e^{i(\omega_j,x)},$$
where $\{\omega_j\}$ is a fixed grid, e.g. uniform:
$$\|f - f_m\|_{L^2(X)} \le C_0\, m^{-\alpha/d}\,\|f\|_{H^\alpha(X)}.$$
"New" approach: let $\pi$ be a probability distribution and
$$f(x) = \int a(\omega)e^{i(\omega,x)}\,\pi(d\omega).$$
Let $\{\omega_j\}$ be an i.i.d. sample of $\pi$. Then
$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a(\omega_j)e^{i(\omega_j,x)}, \qquad \mathbb{E}|f(x) - f_m(x)|^2 = \frac{\mathrm{var}(f)}{m}.$$
Note that
$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j\sigma(\omega_j^Tx)$$
is a two-layer neural network with activation function $\sigma(z) = e^{iz}$.
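A small numerical sketch of the "new" approach, under hypothetical choices not made on the slides: take $\pi = N(0, I_d)$ and $a(\omega) \equiv 1$, so that $f(x) = \mathbb{E}_{\omega\sim\pi}e^{i(\omega,x)}$ is the Gaussian characteristic function $e^{-|x|^2/2}$, which gives an exact reference value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: pi = N(0, I_d), a(w) = 1. Then f(x) = E_{w~pi} e^{i(w,x)}
# equals exp(-|x|^2 / 2) exactly (Gaussian characteristic function).
d, m = 10, 50_000
w = rng.standard_normal((m, d))            # i.i.d. sample of pi

def f_m(x):
    # two-layer network with activation sigma(z) = exp(iz)
    return np.mean(np.exp(1j * (w @ x)))

x = rng.standard_normal(d) / np.sqrt(d)
exact = np.exp(-(x @ x) / 2)
print(abs(f_m(x) - exact))                 # O(m^{-1/2}), no dependence on d
```

The observed error is of size $\sqrt{\mathrm{var}/m}$, consistent with the Monte Carlo rate on the previous slide.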
Let $\sigma$ be a scalar nonlinear (activation) function, e.g. $\sigma = $ ReLU. Consider functions represented in the form
$$f(x;\theta) = \mathbb{E}_{w\sim\pi}\,a(w)\sigma(w^Tx) = \mathbb{E}_{(a,w)\sim\rho}\,a\,\sigma(w^Tx),$$
where $\theta$ is the parameter to be optimized:
- $\theta = a(\cdot)$ corresponds to a feature-based model;
- $\theta = \rho$ corresponds to a two-layer neural-network-like model;
- $\theta = (a(\cdot), \pi(\cdot))$ is a new model.
Fourier (random feature) method: $\pi \sim \frac{1}{m}\sum_{j=1}^m\delta_{w_j}$,
$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_{j=1}^m a(w_j)\sigma(w_j^Tx).$$
Neural network-based method: $\rho \sim \frac{1}{m}\sum_{j=1}^m\delta_{(a_j,w_j)}$,
$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j\sigma(w_j^Tx).$$
Then optimize (say, using L-BFGS); this is more in line with traditional numerical analysis (e.g. nonlinear finite element or meshless methods).
The risk functionals for the examples above:
$$R(\theta) = \mathbb{E}_{x\sim\mu}(f(x;\theta) - f^*(x))^2, \qquad R(\theta_1,\theta_2) = \mathbb{E}_{x\sim\mu}(x - \Psi(\Phi(x;\theta_1);\theta_2))^2, \qquad I(\theta) = \mathbb{E}_{x\sim\mu_\theta}\frac{\phi(x;\theta)\,H\phi(x;\theta)}{\phi(x;\theta)^2}.$$
Gradient descent (GD) can be readily converted to stochastic gradient descent (SGD). Let $F(\theta) = \mathbb{E}_{x\sim\mu}\,g(\theta,x)$ be the objective function:
$$\text{GD}: \ \theta_{k+1} = \theta_k - \eta\,\nabla_\theta\mathbb{E}_{x\sim\mu}\,g(\theta_k, x), \qquad \text{SGD}: \ \theta_{k+1} = \theta_k - \eta\,\nabla_\theta g(\theta_k, x_k),$$
where $\{x_k\}$ are i.i.d. random samples.
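A minimal SGD sketch on a toy objective (the objective and distribution are illustrative choices, not from the slides): each step uses one fresh sample $x_k$ in place of the full expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: g(theta, x) = (theta - x)^2 with x ~ N(1, 1), so
# F(theta) = E g(theta, x) is minimized at theta* = E[x] = 1.
def grad_g(theta, x):
    return 2.0 * (theta - x)

eta, theta = 0.01, 0.0
for k in range(5000):
    xk = rng.normal(1.0, 1.0)     # one i.i.d. sample per step
    theta -= eta * grad_g(theta, xk)

print(theta)                      # fluctuates around theta* = 1
```

With a constant step size $\eta$, the iterate converges to a neighborhood of the minimizer of size $O(\sqrt{\eta})$; decaying $\eta$ would remove the residual fluctuation.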
"Free energy":
$$R(f) = \mathbb{E}_{x\sim\mu}(f(x) - f^*(x))^2, \qquad f(x) = \mathbb{E}_{w\sim\pi}\,a(w)\sigma(w^Tx).$$
Follow Halperin and Hohenberg (1977):
- $a$ is non-conserved: use "model A" dynamics (Allen-Cahn):
$$\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}.$$
- $\pi$ is conserved (a probability density): use "model B" dynamics (Cahn-Hilliard):
$$\frac{\partial\pi}{\partial t} + \nabla\cdot J = 0, \qquad J = \pi v, \quad v = -\nabla V, \quad V = \frac{\delta R}{\delta\pi}.$$
Fix $\pi$, optimize $a$:
$$\partial_t a(w,t) = -\frac{\delta R}{\delta a}(w,t) = -\int a(\tilde w, t)K(w,\tilde w)\,\pi(d\tilde w) + \tilde f(w),$$
$$K(w,\tilde w) = \mathbb{E}_x[\sigma(w^Tx)\sigma(\tilde w^Tx)], \qquad \tilde f(w) = \mathbb{E}_x[f^*(x)\sigma(w^Tx)].$$
This is an integral equation with a symmetric positive definite kernel. Decay estimates follow from convexity: let $f^*(x) = \mathbb{E}_{w\sim\pi}\,a^*(w)\sigma(w^Tx)$ and
$$I(t) = \frac{1}{2}\|a(\cdot,t) - a^*(\cdot)\|^2 + t\,\big(R(a(t)) - R(a^*)\big).$$
Then
$$\frac{dI}{dt} \le 0, \qquad R(a(t)) \le \frac{C_0}{t}.$$
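The decay of the risk along this flow can be checked numerically. Below is a sketch where $w$-space is discretized by a fixed sample of $\pi$ and the expectations over $x$ by a sample of $\mu$; all sizes, distributions, and the step size are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize the flow  da/dt = -K a + f~  (fixed pi) and watch the risk
# decay monotonically, consistent with dI/dt <= 0.  Target is realizable.
relu = lambda z: np.maximum(z, 0.0)
d, n_w, n_x = 5, 50, 2000
w = rng.standard_normal((n_w, d))            # fixed sample of pi
x = rng.standard_normal((n_x, d))            # sample of mu
S = relu(x @ w.T)                            # sigma(w_j^T x_i)
a_star = rng.standard_normal(n_w)
y = S @ a_star / n_w                         # f*(x_i), realizable target

K = S.T @ S / n_x                            # K(w_j, w_k) = E_x[sigma sigma]
f_tilde = S.T @ y / n_x                      # f~(w_j) = E_x[f* sigma]

a = np.zeros(n_w)
dt = n_w / np.linalg.eigvalsh(K).max()       # stable explicit step
risks = []
for _ in range(2000):
    risks.append(np.mean((S @ a / n_w - y) ** 2))
    a += dt * (-(K @ a) / n_w + f_tilde)     # forward Euler on the flow

print(risks[0], risks[-1])                   # monotone decay of the risk
```

Because the problem is quadratic in $a$ with a positive semi-definite kernel, every stable explicit step decreases the risk, matching the $C_0/t$ estimate qualitatively.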
Optimize $\rho$: $f(x) = \mathbb{E}_{u\sim\rho}\,\phi(x,u)$. Example: $u = (a,w)$, $\phi(x,u) = a\sigma(w^Tx)$.
$$\partial_t\rho = \nabla\cdot(\rho\nabla V), \qquad V(u) = \frac{\delta R}{\delta\rho}(u) = \mathbb{E}_x[(f(x) - f^*(x))\phi(x,u)] = \int K(u,\tilde u)\,\rho(d\tilde u) - \tilde f(u),$$
with $K(u,\tilde u) = \mathbb{E}_x[\phi(x,u)\phi(x,\tilde u)]$ and $\tilde f(u) = \mathbb{E}_x[f^*(x)\phi(x,u)]$.
This is the mean-field equation derived by Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), and Sirignano and Spiliopoulos (2018) by studying the continuum limit of two-layer neural networks. It does not satisfy displacement convexity.
Optimize $(a,\pi)$ ($a$ non-conserved, $\pi$ conserved):
$$\partial_t a(w,t) = -\frac{\delta R}{\delta a}(w,t) = -\int a(\tilde w,t)K(w,\tilde w)\,\pi(d\tilde w, t) + \tilde f(w),$$
$$\partial_t\pi = \nabla\cdot(\pi\nabla\tilde V), \qquad \tilde V(w) = \frac{\delta R}{\delta\pi}(w).$$
Two discretization steps:
- Discretizing the population risk (into the empirical risk) using data.
- Discretizing the gradient flow:
  - particle method: the dynamic version of Monte Carlo;
  - smoothed particle method: an analog of the vortex blob method;
  - spectral method: very effective in low dimensions.

We will see that the gradient descent algorithm (GD) for random feature and neural network models is simply the particle-method discretization of the gradient flows discussed before.
Start from
$$\partial_t a(w,t) = -\frac{\delta R}{\delta a}(w,t) = -\int a(\tilde w,t)K(w,\tilde w)\,\pi(d\tilde w) + \tilde f(w)$$
and take
$$\pi(dw) \sim \frac{1}{m}\sum_{j=1}^m\delta_{w_j}, \qquad a(w_j,t) \sim a_j(t).$$
Discretized version:
$$\frac{d}{dt}a_j(t) = -\frac{1}{m}\sum_{k=1}^m K(w_j,w_k)\,a_k(t) + \tilde f(w_j).$$
This is exactly the GD for the random feature model
$$f(x) \sim f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j\sigma(w_j^Tx).$$
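The identification can be verified directly: in a sketch with illustrative sizes and empirical expectations, the particle-method right-hand side $-\frac{1}{m}\sum_k K(w_j,w_k)a_k + \tilde f(w_j)$ is an exact negative multiple of the gradient of the random feature risk, so integrating the ODE is gradient descent on $R(a)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Check: the discretized flow equals -(m/2) * grad R(a) for the model
# f_m(x) = (1/m) sum_j a_j sigma(w_j^T x).  Setup is illustrative.
relu = lambda z: np.maximum(z, 0.0)
d, m, n_x = 4, 8, 500
w = rng.standard_normal((m, d))
x = rng.standard_normal((n_x, d))
S = relu(x @ w.T)                     # sigma(w_j^T x_i)
y = np.tanh(x[:, 0])                  # some target values f*(x_i)
a = rng.standard_normal(m)

K = S.T @ S / n_x                     # K(w_j, w_k), empirical E_x
f_tilde = S.T @ y / n_x               # f~(w_j), empirical E_x
rhs = -(K @ a) / m + f_tilde          # particle-discretized flow

res = S @ a / m - y                   # f_m(x_i) - f*(x_i)
grad_R = 2.0 * S.T @ res / (n_x * m)  # dR/da_j, empirical expectation

print(np.max(np.abs(rhs + (m / 2.0) * grad_R)))   # ~ 0 up to round-off
```

The factor $m/2$ is just a time rescaling; the flow and GD on $R$ trace the same trajectories.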
$$\partial_t\rho = \nabla\cdot(\rho\nabla V) \tag{1}$$
Particle method discretization:
$$\rho(t, du) \sim \frac{1}{m}\sum_{j=1}^m\delta_{u_j(t)}.$$
Define the loss function
$$I(u_1,\cdots,u_m) = R(f_m), \qquad f_m(x) = \frac{1}{m}\sum_{j=1}^m\phi(x,u_j).$$
Lemma: Given a set of initial data $\{u_j^0,\ j\in[m]\}$, the solution of (1) with initial data $\rho(0) = \frac{1}{m}\sum_{j=1}^m\delta_{u_j^0}$ is given by $\rho(t) = \frac{1}{m}\sum_{j=1}^m\delta_{u_j(t)}$, where the particles $\{u_j(\cdot),\ j\in[m]\}$ solve
$$\frac{du_j}{dt} = -\nabla_{u_j}I(u_1,\cdots,u_m), \qquad u_j(0) = u_j^0, \quad j\in[m].$$
This is exactly the GD dynamics for two-layer neural networks.
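A sketch of this particle dynamics in action (target, sizes, and step size are illustrative choices): the particles $u_j = (a_j, w_j)$ follow plain gradient descent on the two-layer net, and the risk decreases along the flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Particles u_j = (a_j, w_j) follow du_j/dt = -grad_{u_j} I(u_1,...,u_m),
# i.e. GD on f_m(x) = (1/m) sum_j a_j relu(w_j^T x).  Illustrative setup.
relu = lambda z: np.maximum(z, 0.0)
d, m, n_x = 3, 64, 1000
x = rng.standard_normal((n_x, d))
y = relu(x[:, 0] - x[:, 1])                     # target f*(x)

a = rng.standard_normal(m)
w = rng.standard_normal((m, d))
eta, losses = 0.5, []
for _ in range(2000):
    pre = x @ w.T                               # w_j^T x_i
    act = relu(pre)
    res = act @ a / m - y                       # f_m(x_i) - f*(x_i)
    losses.append(np.mean(res ** 2))
    grad_a = 2.0 * act.T @ res / (n_x * m)
    grad_w = 2.0 * ((res[:, None] * (pre > 0)) * a).T @ x / (n_x * m)
    a -= eta * grad_a
    w -= eta * grad_w

print(losses[0], losses[-1])                    # the risk decreases
```

Unlike the fixed-$\pi$ case, both components of each particle move, which is exactly the continuum limit studied in the mean-field papers cited above.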
Continuous viewpoint (in this case the same as mean-field):
$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j\sigma(w_j^Tx).$$
Conventional NN models:
$$f_m(x) = \sum_{j=1}^m a_j\sigma(w_j^Tx).$$
[Figure: heat maps of test errors as functions of $\log_{10}(m)$ and $\log_{10}(n)$ for the two parametrizations.] Figure: (Left) continuous viewpoint; (Right) conventional NN models. The target function is a single neuron $f^*(x) = \sigma(e_1^Tx)$.
Continuous dynamical system viewpoint (E (2017), Haber and Ruthotto (2017), "neural ODEs" (Chen et al., 2018)):
$$\frac{dz}{d\tau} = g(\tau, z), \qquad z(0) = x.$$
The flow map at time 1 is $x \mapsto z(x,1)$. Trial functions: $f = \alpha^Tz(1)$. We will take $\alpha = \mathbf{1}$ for simplicity.
The correct form of $g$ is given by (E, Ma and Wu, 2019):
$$g(\tau, z) = \mathbb{E}_{w\sim\pi_\tau}\,a(w,\tau)\sigma(w^Tz),$$
where $\{\pi_\tau\}$ is a family of probability distributions, i.e.
$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\,a(w,\tau)\sigma(w^Tz) = \mathbb{E}_{(a,w)\sim\rho_\tau}\,a\,\sigma(w^Tz).$$
Discretize: we obtain the "residual neural network" (ResNet) model:
$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M a_{j,l}\,\sigma(z_l^Tw_{j,l}), \qquad l = 0, 1, \cdots, L-1,$$
$$z_0 = V\tilde x, \qquad f_L(x) = \mathbf{1}^Tz_L.$$
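A numerical sketch of this discretization, under a hypothetical coupled law $\rho$ chosen so the limit is known in closed form: taking $a_{j,l} = w_{j,l} \sim N(0, I_d)$ gives $\mathbb{E}[a\,\sigma(w^Tz)] = z/2$ for $\sigma = $ ReLU, so the limiting flow is $dz/d\tau = z/2$ and $z(x,1) = e^{1/2}x$. (These choices, and $V = I$, are my assumptions for the sake of an exact reference.)

```python
import numpy as np

rng = np.random.default_rng(0)

# ResNet as forward Euler for dz/dtau = E[a sigma(w^T z)], with the assumed
# coupling a_{j,l} = w_{j,l} ~ N(0, I_d), so z(x,1) = exp(1/2) x exactly.
relu = lambda z: np.maximum(z, 0.0)
d = 4
x = rng.standard_normal(d)

def resnet_output(L, M=200, seed=1):
    r = np.random.default_rng(seed)
    z = x.copy()                                  # z_0 = x  (V = I assumed)
    for _ in range(L):
        w = r.standard_normal((M, d))             # w_{j,l} i.i.d. from rho
        z = z + (w.T @ relu(w @ z)) / (L * M)     # Euler step, a_{j,l} = w_{j,l}
    return np.ones(d) @ z                         # f_L(x) = 1^T z_L

print(resnet_output(50), resnet_output(400), np.exp(0.5) * x.sum())
```

As $L$ grows, the network output approaches $\mathbf{1}^Tz(x,1) = e^{1/2}\,\mathbf{1}^Tx$, illustrating the convergence statement on the next slide.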
Consider the following compositional scheme:
$$z_{0,L}(x) = x, \qquad z_{l+1,L}(x) = z_{l,L}(x) + \frac{1}{LM}\sum_{k=1}^M a_{l,k}\,\sigma(w_{l,k}^Tz_{l,L}(x)),$$
where the $(a_{l,k}, w_{l,k})$ are pairs of vectors i.i.d. sampled from a distribution $\rho$.

Theorem (E, Ma and Wu 2019). Assume
$$\mathbb{E}_\rho\,\big\|\,|a|\,|w^T|\,\big\|_F^2 < \infty,$$
where for a matrix or vector $A$, $|A|$ means taking the element-wise absolute value of $A$. Define $z$ by
$$z(x,0) = x, \qquad \frac{d}{d\tau}z(x,\tau) = \mathbb{E}_{(a,w)\sim\rho}\,a\,\sigma(w^Tz(x,\tau)).$$
Then $z_{L,L}(x) \to z(x,1)$ almost surely as $L \to +\infty$.
In a slightly more general form:
$$\frac{dz}{d\tau} = \mathbb{E}_{u\sim\rho_\tau}\,\phi(z,u), \qquad z(x,0) = x,$$
where $z$ is the state and $\rho_\tau$ the control at time $\tau$. The objective is to minimize over $\{\rho_\tau\}$:
$$R(\{\rho_\tau\}) = \mathbb{E}_{x\sim\mu}(f(x) - f^*(x))^2, \qquad f(x) = \mathbf{1}^Tz(x,1).$$
Define the Hamiltonian $H: \mathbb{R}^d\times\mathbb{R}^d\times\mathcal{P}_2(\Omega) \to \mathbb{R}$ as
$$H(z,p,\mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z,u)].$$
Solutions of the control problem must satisfy the maximum principle
$$\rho_\tau = \operatorname*{argmax}_\rho\ \mathbb{E}_x\big[H(z_\tau^{t,x}, p_\tau^{t,x}, \rho)\big],$$
where, for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations
$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_pH = \mathbb{E}_{u\sim\rho_\tau(\cdot;t)}\big[\phi(z_\tau^{t,x}, u)\big],$$
$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_zH = -\mathbb{E}_{u\sim\rho_\tau(\cdot;t)}\big[\nabla_z^T\phi(z_\tau^{t,x}, u)\,p_\tau^{t,x}\big],$$
with $f(x) = \mathbf{1}^Tz(x,1)$ and boundary conditions
$$z_0^{t,x} = x, \qquad p_1^{t,x} = 2\big(f(x;\rho(\cdot;t)) - f^*(x)\big)\mathbf{1}.$$
With the Hamiltonian $H(z,p,\mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z,u)]$ as before, the gradient flow for $\{\rho_\tau\}$ is given by
$$\partial_t\rho_\tau(u,t) = \nabla\cdot\big(\rho_\tau(u,t)\nabla V(u;\rho)\big), \qquad \forall\tau\in[0,1],$$
where
$$V(u;\rho) = \mathbb{E}_x\Big[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;t)\big)\Big],$$
and, for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the same forward/backward equations
$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_pH = \mathbb{E}_{u\sim\rho_\tau(\cdot;t)}\big[\phi(z_\tau^{t,x}, u)\big],$$
$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_zH = -\mathbb{E}_{u\sim\rho_\tau(\cdot;t)}\big[\nabla_z^T\phi(z_\tau^{t,x}, u)\,p_\tau^{t,x}\big],$$
with boundary conditions
$$z_0^{t,x} = x, \qquad p_1^{t,x} = 2\big(f(x;\rho(\cdot;t)) - f^*(x)\big)\mathbf{1}.$$
Discretization: forward Euler for the flow in the $\tau$ variable with step size $1/L$, and the particle method for the GD dynamics with $M$ samples in each layer:
$$z_{l+1}^{t,x} = z_l^{t,x} + \frac{1}{LM}\sum_{j=1}^M\phi(z_l^{t,x}, u_l^j(t)), \qquad l = 0,\dots,L-1,$$
$$p_l^{t,x} = p_{l+1}^{t,x} + \frac{1}{LM}\sum_{j=1}^M\nabla_z\phi(z_{l+1}^{t,x}, u_{l+1}^j(t))\,p_{l+1}^{t,x}, \qquad l = 0,\dots,L-1,$$
$$\frac{du_l^j(t)}{dt} = -\mathbb{E}_x\big[\nabla_u^T\phi(z_l^{t,x}, u_l^j(t))\,p_l^{t,x}\big].$$
This recovers the gradient descent algorithm (with back-propagation) for the ResNet
$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M\phi(z_l, u_l^j).$$
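The forward/backward system above can be sketched for a tiny network and checked against a finite difference. Everything below is illustrative: the sizes, the stand-in terminal loss $(\mathbf{1}^Tz_L)^2$, and $\phi(z,(a,w)) = a\,\mathrm{relu}(w^Tz)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny discrete forward/backward (adjoint) system for
# z_{l+1} = z_l + (1/(L*M)) sum_j phi(z_l, u_l^j), phi(z,(a,w)) = a relu(w.z),
# with stand-in loss (1^T z_L)^2.  Gradient in A is checked by finite diff.
relu = lambda z: np.maximum(z, 0.0)
d, M, L = 3, 2, 4
x = rng.standard_normal(d)
A = rng.standard_normal((L, M, d))               # a_l^j (rows of A[l])
W = rng.standard_normal((L, M, d))               # w_l^j

def forward(A):
    zs = [x]
    for l in range(L):
        z = zs[-1]
        zs.append(z + A[l].T @ relu(W[l] @ z) / (L * M))
    return zs

def loss(A):
    return (np.ones(d) @ forward(A)[-1]) ** 2

zs = forward(A)
p = 2.0 * (np.ones(d) @ zs[-1]) * np.ones(d)     # terminal condition p_L
gA = np.zeros_like(A)
for l in reversed(range(L)):
    z = zs[l]
    gA[l] = np.outer(relu(W[l] @ z), p) / (L * M)      # dLoss / dA_l
    mask = (W[l] @ z > 0).astype(float)                # relu'(w_l^j . z_l)
    p = p + W[l].T @ (mask * (A[l] @ p)) / (L * M)     # p_l from p_{l+1}

# finite-difference check of one back-propagated entry
eps = 1e-6
A2 = A.copy(); A2[1, 0, 2] += eps
fd = (loss(A2) - loss(A)) / eps
print(abs(fd - gA[1, 0, 2]))      # small, up to finite-difference error
```

The adjoint recursion for $p$ is exactly the transpose-Jacobian sweep of back-propagation; the parameter gradient then couples $z_l$ and $p_{l+1}$ as in the equations above.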
Qianxiao Li, Long Chen, Cheng Tai and Weinan E (2017), basic "method of successive approximation" (MSA):
- Initialize: $\theta^0 \in U$.
- For $k = 0, 1, 2, \cdots$:
  - Solve
$$\frac{dz_\tau^k}{d\tau} = \nabla_pH(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad z_0^k = Vx.$$
  - Solve
$$\frac{dp_\tau^k}{d\tau} = -\nabla_zH(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad p_1^k = 2(f(x;\theta^k) - f^*(x))\mathbf{1}.$$
  - Set
$$\theta_\tau^{k+1} = \operatorname*{argmax}_{\theta\in\Theta}H(z_\tau^k, p_\tau^k, \theta) \quad \text{for each } \tau\in[0,1].$$

Extended MSA:
$$\tilde H(z,p,\theta,v,q) := H(z,p,\theta) - \frac{1}{2}\rho\,\|v - f(z,\theta)\|^2 - \frac{1}{2}\rho\,\|q + \nabla_zH(z,p,\theta)\|^2.$$
Maximum principle:
$$\rho_\tau = \operatorname*{argmax}_\rho\ \mathbb{E}_x\big[H(z_\tau^{t,x}, p_\tau^{t,x}, \rho)\big].$$
GD:
$$\partial_t\rho_\tau(u,t) = \nabla\cdot\Big(\rho_\tau(u,t)\,\nabla\,\mathbb{E}_x\Big[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;t)\big)\Big]\Big).$$
Hybrid: introduce a different time scale for the optimization step, with one forward/backward propagation every $k$ steps of optimization:
- $k = 1$: the usual GD or SGD;
- $k = \infty$: the maximum principle.
They sometimes give rise to the same models (e.g. two-layer NNs), but they are DIFFERENT viewpoints:
- mean-field: discrete → continuous, obtained by taking a limit (more like interacting particles in statistical physics);
- continuous formulation: continuous → discrete, obtained by discretization (more like the usual numerical analysis situation).

The "continuous" formulation tries to formulate the "first principles" of ML. It allows us to think "outside the box" about ML:
- alternative discretizations (spectral);
- alternative models.
$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\,a(w,\tau)\sigma(w^Tz).$$
Random feature model: fix $\{\pi_\tau\}$, optimize over $\{a(\cdot,\tau)\}$.
The continuous formulation tries to formulate the "first principles" of ML.
- It gives rise to new variational and PDE-like problems.
- The performance of these models/algorithms is more stable with respect to the choice of hyper-parameters (no "phase transition").
- It gives rise to new models and new algorithms, better suited to specific applications.
- It allows us to think "outside the box" about ML.