SLIDE 1

Machine Learning from a Continuous Viewpoint Weinan E Princeton University Joint work with: Chao Ma, Lei Wu https://arxiv.org/pdf/1912.12777.pdf

August 3, 2020 1 / 31

SLIDE 2

Examples

Standard ML problems for which we are given the dataset:

1. Supervised learning: Given $S = \{(x_j,\, y_j = f^*(x_j)),\; j \in [n]\}$, learn $f^*$.

Objective: minimize the population risk over the "hypothesis space":
$$R(f) = \mathbb{E}_{x\sim\mu}\,(f(x) - f^*(x))^2$$

2. Dimension reduction: Given $S = \{x_j,\; j \in [n]\} \subset \mathbb{R}^D$ sampled from $\mu$, find a mapping $\Phi: \mathbb{R}^D \to \mathbb{R}^d$ ($d \ll D$) that best preserves the important features of $\mu$.

"Auto-encoder": minimize the reconstruction error $x - \tilde{x}$, where $\tilde{x} = \Psi(z)$, $z = \Phi(x)$:
$$R(\Phi, \Psi) = \mathbb{E}_{x\sim\mu}\,\|x - \Psi(z)\|^2 = \mathbb{E}_{x\sim\mu}\,\|x - \Psi(\Phi(x))\|^2$$

SLIDE 3

Non-standard ML problems, where no dataset is given beforehand:

1. Ground state of a quantum many-body problem:

Let $H = -\frac{\hbar^2}{2m}\Delta + V$ be the Hamiltonian operator of the system. Minimize
$$I(\phi) = \frac{(\phi, H\phi)}{(\phi, \phi)} = \mathbb{E}_{x\sim\mu_\phi}\,\frac{\phi(x)\,H\phi(x)}{\phi(x)^2}, \qquad \mu_\phi(dx) = \frac{1}{(\phi,\phi)}\,\phi^2(x)\,dx,$$
subject to the constraint imposed by the Pauli exclusion principle.

2. Stochastic control problems:
$$s_{t+1} = s_t + b_t(s_t, a_t) + \xi_{t+1},$$
where $s_t$ is the state at time $t$, $a_t$ the control at time $t$, and $\xi_t$ i.i.d. noise. Minimize
$$L(\{a_t\}_{t=0}^{T-1}) = \mathbb{E}_{\{\xi_t\}}\left[\sum_{t=0}^{T-1} c_t(s_t, a_t(s_t)) + c_T(s_T)\right].$$
Look for a feedback control: $a_t = F(t, s_t)$, $t = 0, 1, \cdots, T-1$.

SLIDE 4

Remark: High dimensionality

Benchmark: high-dimensional integration
$$I(g) = \int_{X=[0,1]^d} g(x)\,d\mu, \qquad I_m(g) = \frac{1}{m}\sum_j g(x_j).$$

Grid-based quadrature rules:
$$I(g) - I_m(g) \sim \frac{C(g)}{m^{\alpha/d}}.$$
The appearance of $1/d$ in the exponent of $m$ is the curse of dimensionality (CoD)! If we want $m^{-\alpha/d} = 0.1$, then $m = 10^{d/\alpha} = 10^d$ if $\alpha = 1$.

Monte Carlo: $\{x_j,\; j \in [m]\}$ is uniformly distributed in $X$:
$$\mathbb{E}\,(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \qquad \mathrm{var}(g) = \int_X g^2(x)\,dx - \left(\int_X g(x)\,dx\right)^2.$$
However, $\mathrm{var}(g)$ can be very large in high dimension. Variance reduction!
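The dimension-independent Monte Carlo rate is easy to check numerically. A minimal sketch (the integrand $g$, dimension, and sample size are illustrative choices, not from the slides): a separable integrand on $[0,1]^{10}$ with exact integral 1, where the error fluctuates at scale $\sqrt{\mathrm{var}(g)/m}$ regardless of $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 200_000

def g(x):
    # separable test integrand on [0,1]^d with exact integral I(g) = 1
    return np.prod(2.0 * x, axis=-1)

x = rng.random((m, d))        # x_j uniform on X = [0,1]^d
I_m = g(x).mean()             # I_m(g) = (1/m) sum_j g(x_j)
err = abs(I_m - 1.0)          # fluctuates at scale sqrt(var(g)/m), independent of d
```

A grid-based rule with the same budget would place fewer than 4 points per coordinate direction in $d = 10$.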

SLIDE 5

Overall strategy:

Formulate a “nice” continuous problem, then discretize to get concrete models/algorithms.

For PDEs, "nice" = well-posed. For calculus of variations problems, "nice" = convex, lower semi-continuous. For machine learning, "nice" = the variational problem has a simple landscape.

SLIDE 6

How do we represent a function? An illustrative example

Traditional approach, via the Fourier transform:
$$f(x) = \int_{\mathbb{R}^d} a(\omega)\, e^{i(\omega, x)}\,d\omega, \qquad f_m(x) = \frac{1}{m}\sum_j a(\omega_j)\, e^{i(\omega_j, x)},$$
where $\{\omega_j\}$ is a fixed grid, e.g. uniform. Then
$$\|f - f_m\|_{L^2(X)} \le C_0\, m^{-\alpha/d}\, \|f\|_{H^\alpha(X)}.$$

"New" approach: let $\pi$ be a probability distribution and
$$f(x) = \int_{\mathbb{R}^d} a(\omega)\, e^{i(\omega, x)}\,\pi(d\omega) = \mathbb{E}_{\omega\sim\pi}\, a(\omega)\, e^{i(\omega, x)}.$$
Let $\{\omega_j\}$ be an i.i.d. sample of $\pi$ and
$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a(\omega_j)\, e^{i(\omega_j, x)}, \qquad \mathbb{E}\,|f(x) - f_m(x)|^2 = m^{-1}\,\mathrm{var}(f).$$
Note that
$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j\,\sigma(\omega_j^T x)$$
is a two-layer neural network with activation function $\sigma(z) = e^{iz}$.
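This construction is easy to test when the expectation is known in closed form. A small sketch, assuming $a(\omega) \equiv 1$ and $\pi = N(0, I)$ (illustrative choices), so that $f(x) = \mathbb{E}\, e^{i(\omega, x)}$ is the Gaussian characteristic function $e^{-|x|^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 100_000
omega = rng.standard_normal((m, d))       # i.i.d. sample of pi = N(0, I)

def f_m(x):
    # f_m(x) = (1/m) sum_j e^{i (omega_j, x)}, Monte Carlo estimate of f(x)
    return np.exp(1j * (omega @ x)).mean()

x = np.full(d, 0.3)
exact = np.exp(-0.5 * (x @ x))            # f(x) = E e^{i(omega,x)} = e^{-|x|^2/2}
err = abs(f_m(x) - exact)                 # O(1/sqrt(m)), no curse of dimensionality
```

The same code works unchanged in any dimension $d$; only $\mathrm{var}(f)$, not $d$, enters the error.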

SLIDE 7

Integral transform-based representation

Let $\sigma$ be a scalar nonlinear function (activation function), e.g. $\sigma = \mathrm{ReLU}$. Consider functions represented in the form
$$f(x; \theta) = \int_{\mathbb{R}^d} a(w)\,\sigma(w^T x)\,\pi(dw) = \mathbb{E}_{w\sim\pi}\, a(w)\,\sigma(w^T x) = \mathbb{E}_{(a,w)\sim\rho}\, a\,\sigma(w^T x),$$
where $\theta$ is the parameter to be optimized:
- $\theta = a(\cdot)$ corresponds to a feature-based model;
- $\theta = \rho$ corresponds to a two-layer neural network-like model;
- $\theta = (a(\cdot), \pi(\cdot))$ is a new model.

SLIDE 8

Discretize

Fourier method: $\pi \sim \frac{1}{N}\sum_j \delta_{\omega_j}$, where $\{\omega_j\}$ lives on a uniform lattice. Optimize $a(\cdot)$:
$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_j a(w_j)\,\sigma(w_j^T x).$$

Neural network-based method: $\rho \sim \frac{1}{N}\sum_j \delta_{(a_j, \omega_j)}$ ($\{\omega_j\}$ is also optimized):
$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_j a_j\,\sigma(w_j^T x),$$
then optimize (say, using L-BFGS). This is more in line with traditional numerical analysis (e.g. nonlinear finite elements or meshless methods).

SLIDE 9

For truly large datasets, we need to use stochastic algorithms

The objective functions are all expressed as expectations:
$$R(\theta) = \mathbb{E}_{x\sim\mu}\,(f(x;\theta) - f^*(x))^2, \qquad R(\theta_1, \theta_2) = \mathbb{E}_{x\sim\mu}\,\|x - \Psi(\Phi(x;\theta_1);\theta_2)\|^2, \qquad I(\theta) = \mathbb{E}_{x\sim\mu_\theta}\,\frac{\phi(x;\theta)\,H\phi(x;\theta)}{\phi(x;\theta)^2}.$$
Gradient descent (GD) can be readily converted to stochastic gradient descent (SGD). Let $F(\theta) = \mathbb{E}_{x\sim\mu}\, g(\theta, x)$ be the objective function:
$$\mathrm{GD}: \quad \theta_{k+1} = \theta_k - \eta\,\nabla_\theta\, \mathbb{E}_{x\sim\mu}\, g(\theta_k, x),$$
$$\mathrm{SGD}: \quad \theta_{k+1} = \theta_k - \eta\,\nabla_\theta\, g(\theta_k, x_k),$$
where $\{x_k\}$ are i.i.d. random samples.
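In code, the GD-to-SGD conversion just swaps the full expectation for one fresh sample per step. A sketch on a linear least-squares objective $R(\theta) = \mathbb{E}_{x\sim\mu}(\theta^T x - \theta_*^T x)^2$ with $\mu = N(0, I)$; the target $\theta_*$, step size, and iteration count are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
theta_star = np.array([1.0, -2.0, 0.5])   # assumed target parameters

theta = np.zeros(d)
eta = 0.05
for _ in range(20_000):
    x_k = rng.standard_normal(d)                      # x_k ~ mu, one fresh i.i.d. sample
    grad = 2.0 * ((theta - theta_star) @ x_k) * x_k   # grad_theta g(theta_k, x_k)
    theta -= eta * grad                               # SGD step
```

Each step touches one sample, so the per-step cost is independent of the dataset size; that is the point for truly large datasets.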

SLIDE 10

Optimization: Defining gradient flows

"Free energy":
$$R(f) = \mathbb{E}_{x\sim\mu}\,(f(x) - f^*(x))^2, \qquad f(x) = \int a(w)\,\sigma(w^T x)\,\pi(dw) = \mathbb{E}_{w\sim\pi}\, a(w)\,\sigma(w^T x).$$
Follow Halperin and Hohenberg (1977):
- $a$ is non-conserved: use "model A" dynamics (Allen-Cahn):
$$\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}.$$
- $\pi$ is conserved (a probability density): use "model B" dynamics (Cahn-Hilliard):
$$\frac{\partial \pi}{\partial t} + \nabla\cdot J = 0, \qquad J = \pi v, \quad v = -\nabla V, \quad V = \frac{\delta R}{\delta \pi}.$$

SLIDE 11

Gradient flow for the feature-based model

Fix $\pi$, optimize $a$:
$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t)\, K(w, \tilde{w})\,\pi(d\tilde{w}) + \tilde{f}(w),$$
$$K(w, \tilde{w}) = \mathbb{E}_x[\sigma(w^T x)\,\sigma(\tilde{w}^T x)], \qquad \tilde{f}(w) = \mathbb{E}_x[f^*(x)\,\sigma(w^T x)].$$
This is an integral equation with a symmetric positive definite kernel.

Decay estimates due to convexity: let $f^*(x) = \mathbb{E}_{w\sim\pi}\, a^*(w)\,\sigma(w^T x)$ and
$$I(t) = \frac{1}{2}\|a(\cdot, t) - a^*(\cdot)\|^2 + t\,(R(a(t)) - R(a^*)).$$
Then
$$\frac{dI}{dt} \le 0, \qquad R(a(t)) \le \frac{C_0}{t}.$$

SLIDE 12

Conservative gradient flow

Optimize $\rho$: $f(x) = \mathbb{E}_{u\sim\rho}\,\phi(x, u)$. Example: $u = (a, w)$, $\phi(x, u) = a\,\sigma(w^T x)$.
$$\partial_t\rho = \nabla\cdot(\rho\nabla V), \qquad V(u) = \frac{\delta R}{\delta\rho}(u) = \mathbb{E}_x[(f(x) - f^*(x))\,\phi(x, u)] = \int K(u, \tilde{u})\,\rho(d\tilde{u}) - \tilde{f}(u).$$
This is the mean-field equation derived by Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), and Sirignano and Spiliopoulos (2018) by studying the continuum limit of two-layer neural networks. It does not satisfy displacement convexity.

SLIDE 13

Mixture model

Optimize $(a, \pi)$ ($a$ = non-conserved, $\pi$ = conserved):
$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t)\, K(w, \tilde{w})\,\pi(d\tilde{w}, t) + \tilde{f}(w),$$
$$\partial_t\pi = \nabla\cdot(\pi\nabla V), \qquad V(w) = \frac{\delta R}{\delta\pi}(w).$$

SLIDE 14

Discretizing the gradient flows

Two steps:
- Discretize the population risk (into the empirical risk) using data.
- Discretize the gradient flow:
  - particle method: the dynamic version of Monte Carlo
  - smoothed particle method: analog of the vortex blob method
  - spectral method: very effective in low dimensions

We will see that the gradient descent algorithm (GD) for random feature and neural network models is simply the particle method discretization of the gradient flows discussed before.

SLIDE 15

Particle method for the feature-based model

$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w) = -\int a(\tilde{w}, t)\, K(w, \tilde{w})\,\pi(d\tilde{w}) + \tilde{f}(w).$$
Discretize with $\pi(dw) \sim \frac{1}{m}\sum_j \delta_{w_j}$, $a(w_j, t) \sim a_j(t)$:
$$\frac{d}{dt} a_j(t) = -\frac{1}{m}\sum_k K(w_j, w_k)\, a_k(t) + \tilde{f}(w_j).$$
This is exactly the GD for the random feature model
$$f(x) \sim f_m(x) = \frac{1}{m}\sum_j a_j\,\sigma(w_j^T x).$$
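The identity can be checked directly: on an empirical risk, the right-hand side of the particle ODE matches the negative gradient of the random feature loss up to a constant factor $2/m$, which only rescales time. A sketch with assumed sizes and target:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, n = 2, 50, 400
relu = lambda z: np.maximum(z, 0.0)

w = rng.standard_normal((m, d))
x = rng.standard_normal((n, d))
fstar = np.sin(x[:, 0])                       # illustrative target f*

S = relu(x @ w.T)                             # S[i, j] = sigma(w_j^T x_i)
K = S.T @ S / n                               # empirical K(w_j, w_k)
ftil = S.T @ fstar / n                        # empirical f~(w_j)

a = rng.standard_normal(m)
rhs = -(K @ a) / m + ftil                     # particle ODE right-hand side

# gradient of the empirical risk of f_m(x) = (1/m) sum_j a_j sigma(w_j^T x)
fm = S @ a / m
grad = (2.0 / (n * m)) * (S.T @ (fm - fstar)) # dR_n/da_j
```

The assertion `-grad == (2/m) * rhs` holds exactly (up to floating point), so integrating the particle ODE and running GD trace the same trajectory on different clocks.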

SLIDE 16

Particle method for the conservative flow

$$\partial_t\rho = \nabla\cdot(\rho\nabla V) \qquad (1)$$
Particle method discretization: $\rho(t, du) \sim \frac{1}{m}\sum_j \delta_{u_j(t)}$. Define the loss function
$$I(u_1, \cdots, u_m) = R(f_m), \qquad f_m(x) = \frac{1}{m}\sum_j \phi(x, u_j).$$

Lemma: Given initial data $\{u_j^0,\; j \in [m]\}$, the solution of (1) with initial data $\rho(0) = \frac{1}{m}\sum_{j=1}^m \delta_{u_j^0}$ is given by
$$\rho(t) = \frac{1}{m}\sum_{j=1}^m \delta_{u_j(t)},$$
where the particles $\{u_j(\cdot),\; j \in [m]\}$ solve
$$\frac{du_j}{dt} = -\nabla_{u_j} I(u_1, \cdots, u_m), \qquad u_j(0) = u_j^0, \quad j \in [m].$$
This is exactly the GD dynamics for two-layer neural networks.
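The lemma says that running GD on the particle positions $u_j = (a_j, w_j)$ is the particle discretization of the conservative flow. A small sketch: full-batch GD on a two-layer ReLU network in the $1/m$ scaling; the risk should decrease for a small enough step (sizes, target, and step size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, n = 2, 64, 256
x = rng.standard_normal((n, d))
fstar = np.tanh(2.0 * x[:, 0])            # illustrative target f*

a = 0.1 * rng.standard_normal(m)          # particles u_j = (a_j, w_j)
w = rng.standard_normal((m, d))

def risk(a, w):
    fm = np.maximum(x @ w.T, 0.0) @ a / m
    return np.mean((fm - fstar) ** 2)

eta, r0 = 0.5, risk(a, w)
for _ in range(500):
    z = x @ w.T
    s = np.maximum(z, 0.0)
    res = s @ a / m - fstar                        # f_m(x_i) - f*(x_i)
    ga = (2.0 / (n * m)) * (s.T @ res)             # dI/da_j
    gw = (2.0 / (n * m)) * a[:, None] * (((z > 0) * res[:, None]).T @ x)  # dI/dw_j
    a, w = a - eta * ga, w - eta * gw
r1 = risk(a, w)
```

Each row of `(a, w)` is one particle; GD moves all of them simultaneously, which is exactly the discretized transport of $\rho$.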

SLIDE 17

Comparison with conventional NN models

Continuous viewpoint (in this case the same as mean-field):
$$f_m(x) = \frac{1}{m}\sum_j a_j\,\sigma(w_j^T x).$$
Conventional NN models:
$$f_m(x) = \sum_j a_j\,\sigma(w_j^T x).$$

Figure: test errors as a function of network width $\log_{10}(m)$ and sample size $\log_{10}(n)$. (Left) continuous viewpoint; (Right) conventional NN models. The target function is a single neuron $f^*(x) = \sigma(e_1^T x)$.

SLIDE 18

Flow-based representation

Continuous dynamical systems viewpoint (E (2017), Haber and Ruthotto (2017), "Neural ODEs" (Chen et al., 2018)):
$$\frac{dz}{d\tau} = g(\tau, z), \qquad z(0) = x.$$
The flow-map at time 1: $x \mapsto z(x, 1)$. Trial functions: $f(x) = \alpha^T z(x, 1)$. We will take $\alpha = \mathbf{1}$ for simplicity.

SLIDE 19

How do we choose the form of g?

The correct form of $g$ is given by (E, Ma and Wu, 2019):
$$g(\tau, z) = \mathbb{E}_{w\sim\pi_\tau}\, a(w, \tau)\,\sigma(w^T z),$$
where $\{\pi_\tau\}$ is a family of probability distributions, i.e.
$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\, a(w, \tau)\,\sigma(w^T z) = \mathbb{E}_{(a,w)\sim\rho_\tau}\, a\,\sigma(w^T z).$$
Discretize: we obtain the "residual neural network" (ResNet) model:
$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M a_{j,l}\,\sigma(z_l^T w_{j,l}), \qquad l = 1, 2, \cdots, L-1,$$
$$z_0 = V\tilde{x}, \qquad f_L(x) = \mathbf{1}^T z_L.$$
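A forward-pass sketch of this discretization. The layer parameters here are random placeholders, and $V$ is taken to be the identity, so this only illustrates the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(6)
d, L, M = 4, 20, 10

a = 0.1 * rng.standard_normal((L, M, d))  # a_{j,l} in R^d (placeholder values)
w = rng.standard_normal((L, M, d))        # w_{j,l} in R^d (placeholder values)

def f_L(x):
    z = x.copy()                          # z_0 = x, i.e. V = I for simplicity
    for l in range(L):
        # z_{l+1} = z_l + (1/(L M)) sum_j a_{j,l} sigma(w_{j,l}^T z_l)
        z = z + (np.maximum(w[l] @ z, 0.0)[:, None] * a[l]).sum(axis=0) / (L * M)
    return np.ones(d) @ z                 # f_L(x) = 1^T z_L

y = f_L(rng.standard_normal(d))
```

The $1/(LM)$ factor is what makes the depth limit $L \to \infty$ an ODE rather than a blow-up.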

SLIDE 20

Compositional law of large numbers

Consider the following compositional scheme:
$$z_{0,L}(x) = x, \qquad z_{l+1,L}(x) = z_{l,L}(x) + \frac{1}{LM}\sum_{k=1}^M a_{l,k}\,\sigma(w_{l,k}^T z_{l,L}(x)),$$
where the $(a_{l,k}, w_{l,k})$ are pairs of vectors sampled i.i.d. from a distribution $\rho$.

Theorem (E, Ma and Wu 2019). Assume that
$$\mathbb{E}_\rho\, \big\||a|\,|w^T|\big\|_F^2 < \infty,$$
where for a matrix or vector $A$, $|A|$ means taking the element-wise absolute value of $A$. Define $z$ by
$$z(x, 0) = x, \qquad \frac{d}{d\tau} z(x, \tau) = \mathbb{E}_{(a,w)\sim\rho}\, a\,\sigma(w^T z(x, \tau)).$$
Then $z_{L,L}(x) \to z(x, 1)$ almost surely as $L \to +\infty$.
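The theorem can be illustrated in one dimension. Assume (for this sketch only) $\sigma = \tanh$ and a $\rho$ with $a \equiv 1$ and $w \sim N(0,1)$; then $\mathbb{E}_{(a,w)\sim\rho}[a\,\sigma(wz)] = 0$ by symmetry, so the limiting ODE is $dz/d\tau = 0$ and $z(x, 1) = x$. The deep random composition should therefore stay near $x$, with fluctuations of order $(LM)^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(7)
L, M = 2000, 50
x = 0.7

z = x                                      # z_{0,L}(x) = x
for l in range(L):
    w_l = rng.standard_normal(M)           # (a, w) ~ rho with a = 1, w ~ N(0, 1)
    z = z + np.tanh(w_l * z).sum() / (L * M)   # z_{l+1,L} = z_{l,L} + (1/LM) sum_k ...

dev = abs(z - x)                           # should vanish as L -> infinity
```

Increasing $L$ (with the $1/(LM)$ scaling kept) shrinks `dev`, which is the compositional law of large numbers at work.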

SLIDE 21

The optimal control problem

In a slightly more general form:
$$\frac{dz}{d\tau} = \mathbb{E}_{u\sim\rho_\tau}\,\phi(z, u), \qquad z(x, 0) = x,$$
where $z$ is the state and $\rho_\tau$ the control at time $\tau$. The objective is to minimize over $\{\rho_\tau\}$:
$$R(\{\rho_\tau\}) = \mathbb{E}_{x\sim\mu}\,(f(x) - f^*(x))^2 = \int_{\mathbb{R}^d}(f(x) - f^*(x))^2\,d\mu,$$
where $f(x) = \mathbf{1}^T z(x, 1)$.

SLIDE 22

Pontryagin’s maximum principle

Define the Hamiltonian $H: \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as $H(z, p, \mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z, u)]$. The solutions of the control problem must satisfy
$$\rho_\tau = \mathrm{argmax}_\rho\; \mathbb{E}_x\big[H\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho\big)\big], \qquad \forall \tau \in [0, 1],$$
and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations
$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_p H = \mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\phi(z_\tau^{t,x}, u)\big],$$
$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_z H = -\mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\nabla_z^T\phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}\big],$$
with the boundary conditions
$$z_0^{t,x} = x, \qquad p_1^{t,x} = 2\,(f(x; \rho(\cdot;\, t)) - f^*(x))\,\mathbf{1}.$$

SLIDE 23

Gradient flow for flow-based models

Define the Hamiltonian $H: \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as $H(z, p, \mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z, u)]$. The gradient flow for $\{\rho_\tau\}$ is given by
$$\partial_t\rho_\tau(u, t) = \nabla\cdot\big(\rho_\tau(u, t)\,\nabla V(u; \rho)\big), \qquad \forall\tau \in [0, 1],$$
where
$$V(u; \rho) = \mathbb{E}_x\left[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;\, t)\big)\right],$$
and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations
$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_p H = \mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\phi(z_\tau^{t,x}, u)\big],$$
$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_z H = -\mathbb{E}_{u\sim\rho_\tau(\cdot;\, t)}\big[\nabla_z^T\phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}\big],$$
with the boundary conditions
$$z_0^{t,x} = x, \qquad p_1^{t,x} = 2\,(f(x; \rho(\cdot;\, t)) - f^*(x))\,\mathbf{1}.$$

SLIDE 24

Discretize the gradient flow

Forward Euler for the flow in the $\tau$ variable, step size $1/L$; particle method for the GD dynamics, $M$ samples in each layer:
$$z_{l+1}^{t,x} = z_l^{t,x} + \frac{1}{LM}\sum_{j=1}^M \phi\big(z_l^{t,x}, u_l^j(t)\big), \qquad l = 0, \ldots, L-1,$$
$$p_l^{t,x} = p_{l+1}^{t,x} + \frac{1}{LM}\sum_{j=1}^M \nabla_z\phi\big(z_{l+1}^{t,x}, u_{l+1}^j(t)\big)\, p_{l+1}^{t,x}, \qquad l = 0, \ldots, L-1,$$
$$\frac{du_l^j(t)}{dt} = -\mathbb{E}_x\big[\nabla_u^T\phi\big(z_l^{t,x}, u_l^j(t)\big)\, p_l^{t,x}\big].$$
This recovers the gradient descent algorithm (with back-propagation) for the ResNet
$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M \phi(z_l, u_l^j).$$
SLIDE 25

Max principle-based training method

Qianxiao Li, Long Chen, Cheng Tai and Weinan E (2017), basic "method of successive approximation" (MSA):

Initialize $\theta^0 \in U$. For $k = 0, 1, 2, \cdots$:
- Solve
$$\frac{dz_\tau^k}{d\tau} = \nabla_p H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad z_0^k = Vx;$$
- Solve
$$\frac{dp_\tau^k}{d\tau} = -\nabla_z H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad p_1^k = 2\,(f(x; \theta^k) - f^*(x))\,\mathbf{1};$$
- Set
$$\theta_\tau^{k+1} = \mathrm{argmax}_{\theta\in\Theta}\; H(z_\tau^k, p_\tau^k, \theta), \qquad \text{for each } \tau \in [0, 1].$$

Extended MSA: augment the Hamiltonian with penalty terms,
$$\tilde{H}(z, p, \theta, v, q) := H(z, p, \theta) - \frac{1}{2}\rho\,\|v - f(z, \theta)\|^2 - \frac{1}{2}\rho\,\|q + \nabla_z H(z, p, \theta)\|^2.$$

SLIDE 26

SLIDE 27

Comparison between GD and maximum principle

Maximum principle:
$$\rho_\tau = \mathrm{argmax}_\rho\; \mathbb{E}_x\big[H\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho\big)\big], \qquad \forall\tau\in[0,1].$$
GD:
$$\partial_t\rho_\tau(u, t) = \nabla\cdot\left(\rho_\tau(u, t)\,\nabla\,\mathbb{E}_x\left[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;\, t)\big)\right]\right), \qquad \forall\tau\in[0,1].$$
Hybrid: introduce a different time scale for the optimization step: one forward/backward propagation every $k$ steps of optimization.
- $k = 1$: the usual GD or SGD
- $k = \infty$: the maximum principle

SLIDE 28

“Mean-field” vs. “continuous”

They sometimes give rise to the same models (e.g. two-layer NNs), but they are DIFFERENT viewpoints:
- mean-field: discrete → continuous, by taking the limit (more like interacting particles in statistical physics)
- continuous formulation: continuous → discrete, by discretization (more like the usual numerical analysis situation)

The "continuous" formulation tries to formulate the "first principles" of ML. It allows us to think "outside the box" about ML: alternative discretizations (e.g. spectral) and alternative models.

SLIDE 29

Flow-based random feature model

$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\, a(w, \tau)\,\sigma(w^T z).$$
Random feature model: fix $\{\pi_\tau\}$, optimize over $\{a(\cdot, \tau)\}$.

SLIDE 30

Summary

- The continuous formulation tries to formulate the "first principles" of ML.
- It gives rise to new variational and PDE-like problems.
- The performance of these models/algorithms is more stable with respect to the choice of hyper-parameters (no "phase transition").
- It gives rise to new models and new algorithms, better suited to specific applications.
- It allows us to think "outside the box" about ML.

SLIDE 31

The crucial point is the representation of functions as some form of expectation:
- integral transform-based
- flow-based
- other representations?

In a way, neural networks are very natural.
