

SLIDE 1

Large Graph Limits of Learning Algorithms

Andrew M Stuart

Computing and Mathematical Sciences, Caltech

Andrea Bertozzi, Michael Luo (UCLA) and Kostas Zygalakis (Edinburgh); Matt Dunlop (Caltech), Dejan Slepčev (CMU) and Matt Thorpe (CMU)

SLIDE 2

References

• X Zhu, Z Ghahramani and J Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, ICML, 2003. [Harmonic Functions]
• C Rasmussen and C Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. [Probit]
• AL Bertozzi and A Flenner, Diffuse interface models on graphs for classification of high dimensional data, SIAM MMS, 2012. [Ginzburg-Landau]
• MA Iglesias, Y Lu and AM Stuart, Bayesian level set method for geometric inverse problems, Interfaces and Free Boundaries, 2016. [Level Set]
• AL Bertozzi, M Luo, AM Stuart and K Zygalakis, Uncertainty quantification in the classification of high dimensional data, https://arxiv.org/abs/1703.08816, 2017. [Probit on a graph]
• N Garcia-Trillos and D Slepčev, A variational approach to the consistency of spectral clustering, ACHA, 2017.
• M Dunlop, D Slepčev, AM Stuart and M Thorpe, Large data and zero noise limits of graph based semi-supervised learning algorithms, in preparation, 2017.
• N Garcia-Trillos and D Sanz-Alonso, Continuum limit of posteriors in graph Bayesian inverse problems, https://arxiv.org/abs/1706.07193, 2017.

SLIDE 3

Talk Overview

• Learning and Inverse Problems
• Optimization
• Theoretical Properties
• Probability
• Conclusions

SLIDE 4

Talk Overview

• Learning and Inverse Problems
• Optimization
• Theoretical Properties
• Probability
• Conclusions

SLIDE 5

Regression

Let $D \subset \mathbb{R}^d$ be a bounded open set. Let $D' \subset D$.

Ill-Posed Inverse Problem

Find $u : D \to \mathbb{R}$ given $y(x) = u(x)$, $x \in D'$. Strong prior information needed.

SLIDE 6

Classification

Let $D \subset \mathbb{R}^d$ be a bounded open set. Let $D' \subset D$.

Ill-Posed Inverse Problem

Find $u : D \to \mathbb{R}$ given $y(x) = \mathrm{sign}\big(u(x)\big)$, $x \in D'$. Strong prior information needed.

SLIDE 7

Figure: $y = \mathrm{sign}(u)$. Red: $+1$. Blue: $-1$. Yellow: no information.

SLIDE 8

Figure: Reconstruction of the function $u$ on $D$.

SLIDE 9

Talk Overview

• Learning and Inverse Problems
• Optimization
• Theoretical Properties
• Probability
• Conclusions

SLIDE 10

Graph Laplacian

Similarity graph $G$ with $n$ vertices $Z = \{1, \dots, n\}$. Weighted adjacency matrix $W = \{w_{j,k}\}$,
$$w_{j,k} = \eta_\varepsilon(x_j - x_k).$$
Diagonal $D = \mathrm{diag}\{d_{jj}\}$, $d_{jj} = \sum_{k \in Z} w_{j,k}$.
$$L = s_n(D - W) \quad \text{(unnormalized)}; \qquad L' = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} \quad \text{(normalized)}.$$

Spectral Properties: $L$ is positive semi-definite:
$$\langle u, Lu \rangle_{\mathbb{R}^n} \propto \sum_{j \sim k} w_{j,k} |u_j - u_k|^2.$$
$L q_j = \lambda_j q_j$; fully connected $\Rightarrow \lambda_1 > \lambda_0 = 0$. Fiedler vector: $q_1$.
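As a concrete illustration, here is a minimal sketch of this construction, assuming a Gaussian choice of kernel $\eta_\varepsilon$ and taking $s_n = 1$ (the scaling is discussed later):

```python
import numpy as np

def graph_laplacian(X, eps, normalized=False):
    """Graph Laplacian from feature vectors X (n x d), Gaussian kernel."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * eps**2))        # w_jk = eta_eps(x_j - x_k)
    np.fill_diagonal(W, 0.0)              # no self-loops
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                  # unnormalized: L = D - W
    if normalized:
        Dinv = np.diag(deg ** -0.5)
        L = Dinv @ L @ Dinv               # L' = D^{-1/2} L D^{-1/2}
    return L

X = np.random.randn(200, 2)               # toy features
L = graph_laplacian(X, eps=0.5)
lam, Q = np.linalg.eigh(L)                # lam[0] ≈ 0 if fully connected
fiedler = Q[:, 1]                         # Fiedler vector q_1
```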


SLIDE 11

Problem Statement (Optimization)

Semi-Supervised Learning

Input:
• Unlabelled data: $x_j \in \mathbb{R}^d$, $j \in Z := \{1, \dots, n\}$.
• Labelled data: $y_j \in \{\pm 1\}$, $j \in Z' \subseteq Z$.

Output:
• Labels: $y_j \in \{\pm 1\}$, $j \in Z$.

Classification is based on $\mathrm{sign}(u)$, with $u$ the optimizer of
$$J(u; y) = \tfrac{1}{2}\langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi(u; y).$$
• $u$ is an $\mathbb{R}$-valued function on the graph nodes.
• $C = (L + \tau^2 I)^{-\alpha}$ comes from the unlabelled data: $w_{j,k} = \eta_\varepsilon(x_j - x_k)$.
• $\Phi(u; y)$ links the real-valued $u$ to the binary-valued labels $y$.
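The quadratic term can be evaluated through the eigendecomposition of $L$; a minimal sketch, assuming the `graph_laplacian` helper above and illustrative parameters `tau`, `alpha`:

```python
import numpy as np

def prior_term(u, L, tau, alpha):
    """Evaluate (1/2) <u, C^{-1} u> with C = (L + tau^2 I)^{-alpha}."""
    lam, Q = np.linalg.eigh(L)                            # L q_j = lambda_j q_j
    Cinv_u = Q @ (((lam + tau**2) ** alpha) * (Q.T @ u))  # apply C^{-1} spectrally
    return 0.5 * (u @ Cinv_u)
```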


SLIDE 12

Example: Voting Records

U.S. House of Representatives, 1984, 16 key votes. Each representative has an associated feature vector $x_j \in \mathbb{R}^{16}$, e.g. $x_j = (1, -1, 0, \dots, 1)^T$; $1$ is "yes", $-1$ is "no" and $0$ is abstain/no-show. Hence $d = 16$ and $n = 435$.

Figure: Fiedler vector and spectrum (normalized case).

SLIDE 13

Probit

Rasmussen and Williams, 2006 (MIT Press); Bertozzi, Luo, Stuart and Zygalakis, 2017 (arXiv).

Probit Model

$$J_p^{(n)}(u; y) = \tfrac{1}{2}\langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi_p^{(n)}(u; y).$$

Here $C = (L + \tau^2 I)^{-\alpha}$,
$$\Phi_p^{(n)}(u; y) := -\sum_{j \in Z'} \log \Psi(y_j u_j; \gamma) \quad \text{and} \quad \Psi(v; \gamma) = \frac{1}{\sqrt{2\pi\gamma^2}} \int_{-\infty}^{v} \exp\big(-t^2/2\gamma^2\big)\, dt.$$
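Since $\Psi(v; \gamma)$ is the $N(0, \gamma^2)$ CDF, the misfit can be written with the standard normal log-CDF. A sketch in the same illustrative setup as above (`labelled` indexes $Z'$; `logcdf` avoids underflow for large negative arguments):

```python
import numpy as np
from scipy.stats import norm

def probit_misfit(u, y, labelled, gamma):
    """Phi_p(u; y) = -sum_{j in Z'} log Psi(y_j u_j; gamma)."""
    v = y[labelled] * u[labelled]
    return -np.sum(norm.logcdf(v / gamma))   # Psi(v; gamma) = Phi(v / gamma)
```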

SLIDE 14

Level Set

Iglesias, Lu and Stuart, 2016. (IFB)

Level Set Model

$$J_{ls}^{(n)}(u; y) = \tfrac{1}{2}\langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi_{ls}^{(n)}(u; y).$$

Here $C = (L + \tau^2 I)^{-\alpha}$, and
$$\Phi_{ls}^{(n)}(u; y) := \frac{1}{2\gamma^2} \sum_{j \in Z'} \big| y_j - \mathrm{sign}(u_j) \big|^2.$$
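The corresponding misfit, in the same sketch setup:

```python
import numpy as np

def level_set_misfit(u, y, labelled, gamma):
    """Phi_ls(u; y) = (1 / (2 gamma^2)) sum_{j in Z'} |y_j - sign(u_j)|^2."""
    r = y[labelled] - np.sign(u[labelled])
    return (r @ r) / (2 * gamma**2)
```

Note that this misfit sees $u$ only through $\mathrm{sign}(u)$, which is the source of the non-attainment result in Theorem 1 below.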

SLIDE 15

Talk Overview

• Learning and Inverse Problems
• Optimization
• Theoretical Properties
• Probability
• Conclusions

SLIDE 16

Infimization

Recall that both optimization problems have the form
$$J^{(n)}(u; y) = \tfrac{1}{2}\langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi^{(n)}(u; y).$$
Indeed,
$$\Phi_p^{(n)}(u; y) := -\sum_{j \in Z'} \log \Psi(y_j u_j; \gamma) \quad \text{and} \quad \Phi_{ls}^{(n)}(u; y) := \frac{1}{2\gamma^2} \sum_{j \in Z'} \big| y_j - \mathrm{sign}(u_j) \big|^2.$$

Theorem 1

Probit: $J_p$ is convex. Level Set: $J_{ls}$ does not attain its infimum. (Intuitively, shrinking $u \mapsto tu$ with $t \downarrow 0$ leaves $\mathrm{sign}(u)$, and hence $\Phi_{ls}$, unchanged while strictly decreasing the quadratic term.)

SLIDE 17

Limit Theorem for the Dirichlet Energy

Garcia-Trillos and Slepčev, 2016. (ACHA)

Unlabelled data $\{x_j\}$ sampled i.i.d. from density $\rho$ supported on bounded $D \subset \mathbb{R}^d$. Let
$$\mathcal{L} u = -\frac{1}{\rho} \nabla \cdot \big( \rho^2 \nabla u \big), \ x \in D; \qquad \frac{\partial u}{\partial n} = 0, \ x \in \partial D.$$

Theorem 2

Let $s_n = \frac{2}{C(\eta) n \varepsilon^2}$. Then, under connectivity conditions on $\varepsilon = \varepsilon(n)$ in $\eta_\varepsilon$, the scaled Dirichlet energy $\Gamma$-converges in the $TL^2$ metric:
$$\frac{1}{n} \langle u, Lu \rangle_{\mathbb{R}^n} \to \langle u, \mathcal{L} u \rangle_{L^2_\rho}$$
as $n \to \infty$.
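A quick empirical sanity check of this scaling, as a sketch only: uniform $\rho$ on $[0,1]^2$, Gaussian kernel, a smooth test function, and the constant $C(\eta)$ omitted; the printed values should roughly stabilize as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [500, 1000, 2000]:
    X = rng.uniform(size=(n, 2))                     # rho uniform on [0,1]^2
    eps = n ** (-1 / 4)                              # within the connectivity regime
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * eps**2)) / eps**2          # eta_eps = eps^{-d} eta(|.|/eps), d = 2
    np.fill_diagonal(W, 0.0)
    L0 = np.diag(W.sum(axis=1)) - W                  # unscaled D - W
    u = np.sin(np.pi * X[:, 0])                      # smooth test function
    s_n = 2 / (n * eps**2)                           # s_n, up to the constant C(eta)
    print(n, s_n * (u @ (L0 @ u)) / n)               # (1/n) <u, s_n (D - W) u>
```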

SLIDE 18

Sketch Proof: Quadratic Forms on Graphs

Discrete Dirichlet Energy

$$\langle u, Lu \rangle_{\mathbb{R}^n} \propto \sum_{j \sim k} w_{j,k} |u_j - u_k|^2.$$

Figure: Connectivity stencils for the orange node: PDE, data, localized data.

SLIDE 19

Sketch Proof: Limits of Quadratic Forms on Graphs

Garcia-Trillos and Slepčev, 2016. (ACHA)

$\{x_j\}_{j=1}^n$ i.i.d. from density $\rho$ on $D \subset \mathbb{R}^d$;
$$w_{jk} = \eta_\varepsilon(x_j - x_k), \qquad \eta_\varepsilon = \frac{1}{\varepsilon^d}\, \eta\Big(\frac{|\cdot|}{\varepsilon}\Big).$$

Limiting Discrete Dirichlet Energy

$$\langle u, Lu \rangle_{\mathbb{R}^n} \propto \frac{1}{n^2 \varepsilon^2} \sum_{j \sim k} \eta_\varepsilon(x_j - x_k)\, \big| u(x_j) - u(x_k) \big|^2;$$
$$n \to \infty: \quad \approx \int_D \int_D \eta_\varepsilon(x - y)\, \Big| \frac{u(x) - u(y)}{\varepsilon} \Big|^2 \rho(x)\rho(y)\, dx\, dy;$$
$$\varepsilon \to 0: \quad \approx C(\eta) \int_D |\nabla u(x)|^2 \rho(x)^2\, dx \propto \langle u, \mathcal{L} u \rangle_{L^2_\rho}.$$

SLIDE 20

Limit Theorem for Probit

M Dunlop, D Slepčev, AM Stuart and M Thorpe, in preparation, 2017.

Let $D^\pm$ be two disjoint bounded subsets of $D$; define $D' = D^+ \cup D^-$ and $y(x) = +1$, $x \in D^+$; $y(x) = -1$, $x \in D^-$. For $\alpha > 0$, define $\mathcal{C} = (\mathcal{L} + \tau^2 I)^{-\alpha}$; recall that on the graph $C = (L + \tau^2 I)^{-\alpha}$.

Theorem 3

Let $s_n = \frac{2}{C(\eta) n \varepsilon^2}$. Then, under connectivity conditions on $\varepsilon = \varepsilon(n)$, the scaled probit objective function $\Gamma$-converges in the $TL^2$ metric:
$$\frac{1}{n} J_p^{(n)}(u; y) \to J_p(u; y)$$
as $n \to \infty$, where
$$J_p(u; y) = \tfrac{1}{2}\big\langle u, \mathcal{C}^{-1} u \big\rangle_{L^2_\rho} + \Phi_p(u; y), \qquad \Phi_p(u; y) := -\int_{D'} \log \Psi\big(y(x)\, u(x); \gamma\big)\, \rho(x)\, dx.$$

SLIDE 21

Talk Overview

• Learning and Inverse Problems
• Optimization
• Theoretical Properties
• Probability
• Conclusions

SLIDE 22

Problem Statement (Bayesian Formulation)

Semi-Supervised Learning

Input:
• Unlabelled data: $x_j \in \mathbb{R}^d$, $j \in Z := \{1, \dots, n\}$ (prior).
• Labelled data: $y_j \in \{\pm 1\}$, $j \in Z' \subseteq Z$ (likelihood).

Output:
• Labels: $y_j \in \{\pm 1\}$, $j \in Z$ (posterior).

Connection between probability and optimization:
$$J^{(n)}(u; y) = \tfrac{1}{2}\langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi^{(n)}(u; y),$$
$$\mathbb{P}(u|y) \propto \exp\big(-J^{(n)}(u; y)\big) \propto \exp\big(-\Phi^{(n)}(u; y)\big) \times N(0, C) \propto \mathbb{P}(y|u) \times \mathbb{P}(u).$$
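In code, the unnormalized log-posterior is simply the negative objective. A sketch combining the illustrative helpers defined earlier (`prior_term`, `probit_misfit`):

```python
def log_posterior(u, y, labelled, L, tau, alpha, gamma):
    """log P(u | y) up to an additive constant, i.e. -J^(n)(u; y)."""
    return -(prior_term(u, L, tau, alpha)
             + probit_misfit(u, y, labelled, gamma))
```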

SLIDE 23

Example of Underlying Gaussian (Voting Records)

Figure: Two-point correlation of $\mathrm{sign}(u)$ for three Democrats.

SLIDE 24

Probit (Continuum Limit)

Let $\alpha > \frac{d}{2}$.

Probit Probabilistic Model

Prior: Gaussian $\mathbb{P}(du) = N(0, \mathcal{C})$.
Posterior: $\mathbb{P}^\gamma(du|y) \propto \exp\big(-\Phi_p(u; y)\big)\, \mathbb{P}(du)$, where
$$\Phi_p(u; y) := -\int_{D'} \log \Psi\big(y(x)\, u(x); \gamma\big)\, \rho(x)\, dx.$$

SLIDE 25

Level Set (Continuum Limit)

Let $\alpha > \frac{d}{2}$.

Level Set Probabilistic Model

Prior: Gaussian $\mathbb{P}(du) = N(0, \mathcal{C})$.
Posterior: $\mathbb{P}^\gamma(du|y) \propto \exp\big(-\Phi_{ls}(u; y)\big)\, \mathbb{P}(du)$, where
$$\Phi_{ls}(u; y) := \int_{D'} \frac{1}{2\gamma^2} \big| y(x) - \mathrm{sign}\big(u(x)\big) \big|^2 \rho(x)\, dx.$$

SLIDE 26

Connecting Probit, Level Set and Regression

M Dunlop, D Slepčev, AM Stuart and M Thorpe, in preparation, 2017.

Theorem 4

Let $\alpha > \frac{d}{2}$. We have $\mathbb{P}^\gamma(u|y) \Rightarrow \mathbb{P}(u|y)$ as $\gamma \to 0$, where
$$\mathbb{P}(du|y) \propto \mathbf{1}_A(u)\, \mathbb{P}(du), \qquad \mathbb{P}(du) = N(0, \mathcal{C}), \qquad A = \{u : \mathrm{sign}\big(u(x)\big) = y(x), \ x \in D'\}.$$
Compare with regression (Zhu, Ghahramani and Lafferty, 2003, ICML): $A_0 = \{u : u(x) = y(x), \ x \in D'\}$.

SLIDE 27

Example (PDE Two Moons – Unlabelled Data)

Figure: Sampling density ρ of unlabelled data.


SLIDE 28

Example (PDE Two Moons – Label Data)

Figure: Labelled Data.


SLIDE 29

Example (PDE Two Moons – Fiedler Vector of L)

Figure: Fiedler Vector.


SLIDE 30

Example (PDE Two Moons – Posterior Labelling)

Figure: Posterior mean of u and sign(u).


SLIDE 31

Example (One Data Point Makes All The Difference)

Figure: Sampling density, Label Data 1, Label Data 2.


SLIDE 32

Talk Overview

• Learning and Inverse Problems
• Optimization
• Theoretical Properties
• Probability
• Conclusions

SLIDE 33

Summary

• Single optimization framework for classification algorithms.
• Single Bayesian framework for classification algorithms.
• Comparison of related optimization problems.
• Probit and level set have the same small-noise limit.
• This limit generalizes previous regression-based methods.
• Fast-mixing MCMC algorithms.
• Fast approximations per MCMC step.
• The infinite data limit identifies appropriate parameter choices.
• The infinite data limit leads to (S)PDEs and conditioned Gaussian measures.

SLIDE 34

References

• X Zhu, Z Ghahramani and J Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, ICML, 2003. [Harmonic Functions]
• C Rasmussen and C Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. [Probit]
• AL Bertozzi and A Flenner, Diffuse interface models on graphs for classification of high dimensional data, SIAM MMS, 2012. [Ginzburg-Landau]
• MA Iglesias, Y Lu and AM Stuart, Bayesian level set method for geometric inverse problems, Interfaces and Free Boundaries, 2016. [Level Set]
• AL Bertozzi, M Luo, AM Stuart and K Zygalakis, Uncertainty quantification in the classification of high dimensional data, https://arxiv.org/abs/1703.08816, 2017. [Probit on a graph]
• N Garcia-Trillos and D Slepčev, A variational approach to the consistency of spectral clustering, ACHA, 2017.
• M Dunlop, D Slepčev, AM Stuart and M Thorpe, Large data and zero noise limits of graph based semi-supervised learning algorithms, in preparation, 2017.
• N Garcia-Trillos and D Sanz-Alonso, Continuum limit of posteriors in graph Bayesian inverse problems, https://arxiv.org/abs/1706.07193, 2017.

SLIDE 35

pCN

$$\alpha(u, v) = \min\big\{1, \exp\big(\Phi(u) - \Phi(v)\big)\big\}.$$

The preconditioned Crank-Nicolson (pCN) Method

1: while $k < M$ do
2:   $v^{(k)} = \sqrt{1 - \beta^2}\, u^{(k)} + \beta \xi^{(k)}$, where $\xi^{(k)} \sim N(0, C)$.
3:   Accept: $u^{(k+1)} = v^{(k)}$ with probability $\alpha(u^{(k)}, v^{(k)})$; otherwise
4:   Reject: $u^{(k+1)} = u^{(k)}$.
5: end while

Why pCN? For a given acceptance probability, $\beta$ is independent of $N = |Z|$. Can exploit approximation of the graph Laplacian (Nyström) and …
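A minimal runnable sketch of pCN, assuming a generic misfit `phi` and drawing $\xi \sim N(0, C)$ via a Cholesky factor of a supplied covariance matrix; all names are illustrative:

```python
import numpy as np

def pcn(phi, C, u0, beta, M, rng=None):
    """pCN sampling for the measure with density prop. to exp(-phi(u)) w.r.t. N(0, C)."""
    rng = rng or np.random.default_rng()
    R = np.linalg.cholesky(C)                 # xi ~ N(0, C) as R @ z, z ~ N(0, I)
    u, samples = np.asarray(u0, dtype=float), []
    for _ in range(M):
        xi = R @ rng.standard_normal(len(u))
        v = np.sqrt(1 - beta**2) * u + beta * xi      # pCN proposal
        if np.log(rng.uniform()) < phi(u) - phi(v):   # alpha(u, v)
            u = v                                     # accept
        samples.append(u.copy())                      # else keep u (reject)
    return np.array(samples)
```

Because the proposal preserves the Gaussian prior exactly, the acceptance probability involves only the misfit $\Phi$, which is why $\beta$ need not shrink as the number of nodes grows.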

SLIDE 36

Example of UQ (Two Moons)

Recall that $d = 10^2$ and $N = 2 \times 10^3$.

Figure: Average label posterior variance vs. $\sigma$, the feature-vector noise.

SLIDE 37

Example of UQ (MNIST)

Here $d = 784$ and $N = 4000$.

Figure: "Low confidence" vs. "high confidence" nodes in the MNIST49 graph.

SLIDE 38

Saturation of Spectra in Applications

Karhunen-Loève: if $L q_j = \lambda_j q_j$, then $u \sim N(0, C)$ is
$$u = c^{\frac{1}{2}} \sum_{j=1}^{N-1} (\lambda_j + \tau^2)^{-\frac{\alpha}{2}} q_j z_j, \qquad z_j \sim N(0, 1) \ \text{i.i.d.} \tag{1}$$
The spectrum of the graph Laplacian often saturates as $j \to N - 1$.
Spectral projection $\Leftrightarrow$ $\lambda_k := \infty$, $k \geq \ell$. Spectral approximation: set $\lambda_k$ to some $\bar{\lambda} < \infty$.

Figure: Two Moons, Hyperspectral, Voting Records.
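A sketch of drawing prior samples via this Karhunen-Loève expansion, with both truncation options; here $c = 1$, and `ell`, `lam_bar` stand in for the truncation parameters $\ell$ and $\bar{\lambda}$:

```python
import numpy as np

def sample_prior(L, tau, alpha, ell=None, lam_bar=None, rng=None):
    """Draw u ~ N(0, (L + tau^2 I)^{-alpha}) via the KL expansion (1)."""
    rng = rng or np.random.default_rng()
    lam, Q = np.linalg.eigh(L)                  # L q_j = lambda_j q_j
    coeff = (lam + tau**2) ** (-alpha / 2)
    coeff[0] = 0.0                              # sum in (1) starts at j = 1
    if ell is not None:
        if lam_bar is None:
            coeff[ell:] = 0.0                   # spectral projection: lambda_k = infinity
        else:
            coeff[ell:] = (lam_bar + tau**2) ** (-alpha / 2)  # spectral approximation
    z = rng.standard_normal(len(lam))
    return Q @ (coeff * z)
```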

SLIDE 39

Example of UQ (Voting)

Recall that $d = 16$ and $N = 435$. Mean absolute error: projection 0.1577, approximation 0.0261.

Figure: Mean label posterior. Compare full (black), spectral approximation (red) and spectral projection (blue).

SLIDE 40

Example of UQ (Hyperspectral)

Here $d = 129$ and $N \approx 3 \times 10^5$. Use Nyström.

Figure: Spectral approximation. Uncertain classification in red.