slide-1
SLIDE 1

Large-Scale Machine Learning

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 1 / 67

slide-2
SLIDE 2

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 2 / 67

slide-3
SLIDE 3

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 3 / 67

slide-4
SLIDE 4

The Big Data Era

Today, more and more of our activities are recorded by ubiquitous computing devices

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 4 / 67

slide-5
SLIDE 5

The Big Data Era

Today, more and more of our activities are recorded by ubiquitous computing devices Networked computers make it easy to centralize these records and curate them into a big dataset

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 4 / 67

slide-6
SLIDE 6

The Big Data Era

Today, more and more of our activities are recorded by ubiquitous computing devices
Networked computers make it easy to centralize these records and curate them into a big dataset
Large-scale machine learning techniques solve problems by leveraging the a posteriori knowledge learned from the big data

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 4 / 67

slide-7
SLIDE 7

Characteristics of Big Data

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 5 / 67

slide-8
SLIDE 8

Challenges of Large-Scale ML

Variety and veracity

Feature engineering gets even harder

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 6 / 67

slide-9
SLIDE 9

Challenges of Large-Scale ML

Variety and veracity

Feature engineering gets even harder Multi-task/transfer learning

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 6 / 67

slide-10
SLIDE 10

Challenges of Large-Scale ML

Variety and veracity

Feature engineering gets even harder Multi-task/transfer learning

Volume

Large D: curse of dimensionality Large N: training efficiency

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 6 / 67

slide-11
SLIDE 11

Challenges of Large-Scale ML

Variety and veracity

Feature engineering gets even harder
Multi-task/transfer learning

Volume

Large D: curse of dimensionality
Large N: training efficiency

Velocity

Online learning

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 6 / 67

slide-12
SLIDE 12

Advantages of Deep Learning

Neural Networks (NNs) that go deep

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 7 / 67

slide-13
SLIDE 13

Advantages of Deep Learning

Neural Networks (NNs) that go deep Automatic feature engineering

A kind of representation learning

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 7 / 67

slide-14
SLIDE 14

Advantages of Deep Learning

Neural Networks (NNs) that go deep Automatic feature engineering

A kind of representation learning

Exponential gain of expressiveness

Counters the curse of dimensionality

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 7 / 67

slide-15
SLIDE 15

Advantages of Deep Learning

Neural Networks (NNs) that go deep Automatic feature engineering

A kind of representation learning

Exponential gain of expressiveness

Counters the curse of dimensionality

Memory and GPU friendliness

SGD GPU-based parallelism

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 7 / 67

slide-16
SLIDE 16

Advantages of Deep Learning

Neural Networks (NNs) that go deep
Automatic feature engineering

A kind of representation learning

Exponential gain of expressiveness

Counters the curse of dimensionality

Memory and GPU friendliness

SGD
GPU-based parallelism

Supporting online & transfer learning

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 7 / 67

slide-17
SLIDE 17

Is Deep Learning a Panacea?

I have big data, so I have to use deep learning

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 8 / 67

slide-18
SLIDE 18

Is Deep Learning a Panacea?

I have big data, so I have to use deep learning
Wrong! No free lunch theorem: there is no single ML algorithm that outperforms others in every domain

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 8 / 67

slide-19
SLIDE 19

Is Deep Learning a Panacea?

I have big data, so I have to use deep learning
Wrong! No free lunch theorem: there is no single ML algorithm that outperforms others in every domain

Deep learning is more useful when the function f to learn is complex (nonlinear in the input) and has compositional patterns

E.g., image recognition, natural language processing

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 8 / 67

slide-20
SLIDE 20

Is Deep Learning a Panacea?

I have big data, so I have to use deep learning
Wrong! No free lunch theorem: there is no single ML algorithm that outperforms others in every domain

Deep learning is more useful when the function f to learn is complex (nonlinear in the input) and has compositional patterns

E.g., image recognition, natural language processing

For simple (linear) f, there are specialized large-scale ML techniques (e.g., LIBLINEAR [7]) that are much more efficient

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 8 / 67

slide-21
SLIDE 21

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 9 / 67

slide-22
SLIDE 22

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 10 / 67

slide-23
SLIDE 23

Representation Learning

Gray boxes are learned automatically

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 11 / 67

slide-24
SLIDE 24

Representation Learning

Gray boxes are learned automatically
Deep learning maps the most abstract (deepest) features to the output

Usually, a simple linear function suffices

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 11 / 67

slide-25
SLIDE 25

Representation Learning

Gray boxes are learned automatically
Deep learning maps the most abstract (deepest) features to the output

Usually, a simple linear function suffices

In deep learning, features/representations are distributed

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 11 / 67

slide-26
SLIDE 26

Distributed Representations of Data

In deep learning, we assume that x’s were generated by compositions of factors, potentially at multiple levels in a hierarchy

E.g., layer 3: face = 0.3 [corner] + 0.7 [circle] + 0 [curve]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 12 / 67

slide-27
SLIDE 27

Distributed Representations of Data

In deep learning, we assume that x’s were generated by compositions of factors, potentially at multiple levels in a hierarchy

E.g., layer 3: face = 0.3 [corner] + 0.7 [circle] + 0 [curve] [.] a predefined non-linear function Weights (arrows) learned from training examples

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 12 / 67

slide-28
SLIDE 28

Distributed Representations of Data

In deep learning, we assume that x’s were generated by compositions of factors, potentially at multiple levels in a hierarchy

E.g., layer 3: face = 0.3 [corner] + 0.7 [circle] + 0 [curve] [.] a predefined non-linear function Weights (arrows) learned from training examples

Given x, the factors at the same level output a layer of features of x

Layer 2: 1, 2, 0.5 for [corner], [circle], and [curve] respectively

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 12 / 67

slide-29
SLIDE 29

Distributed Representations of Data

In deep learning, we assume that x’s were generated by compositions of factors, potentially at multiple levels in a hierarchy

E.g., layer 3: face = 0.3 [corner] + 0.7 [circle] + 0 [curve] [.] a predefined non-linear function Weights (arrows) learned from training examples

Given x, factors at the same level

  • utput a layer of features of x

Layer 2: 1, 2, 0.5 for [corner], [circle], and [curve] respectively

To be fed into the factors in the next (deeper) level

Face = 0.3 * 1 + 0.7 * 2
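A minimal sketch of this weighted composition (the ReLU non-linearity and the specific numbers below are illustrative assumptions, not taken from the figure):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)  # an assumed predefined non-linear function [.]

# hypothetical layer-2 factor outputs for a given x: [corner], [circle], [curve]
layer2_features = np.array([1.0, 2.0, 0.5])

# a hypothetical layer-3 factor "face" reuses the layer-2 features with learned weights
face_weights = np.array([0.3, 0.7, 0.0])                 # weights on [corner], [circle], [curve]
face_activation = relu(face_weights @ layer2_features)   # 0.3*1 + 0.7*2 = 1.7

print(face_activation)  # the deepest features would then feed a simple (often linear) output layer
```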

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 12 / 67

slide-30
SLIDE 30

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 13 / 67

slide-31
SLIDE 31

Curse of Dimensionality

Most classic nonlinear ML models find θ by assuming function smoothness: if x ∼ x(i) ∈ X, then f(x;w) ∼ f(x(i);w)

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 14 / 67

slide-32
SLIDE 32

Curse of Dimensionality

Most classic nonlinear ML models find θ by assuming function smoothness: if $x \approx x^{(i)} \in \mathbb{X}$, then $f(x; w) \approx f(x^{(i)}; w)$
E.g., the non-parametric methods predict the label $\hat{y}$ of $x$ by simply interpolating the labels of the examples $x^{(i)}$'s close to $x$:
$\hat{y} = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$, where $k(x^{(i)}, x) = \exp(-\gamma\|x^{(i)} - x\|^2)$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 14 / 67

slide-33
SLIDE 33

Curse of Dimensionality

Most classic nonlinear ML models find θ by assuming function smoothness: if $x \approx x^{(i)} \in \mathbb{X}$, then $f(x; w) \approx f(x^{(i)}; w)$
E.g., the non-parametric methods predict the label $\hat{y}$ of $x$ by simply interpolating the labels of the examples $x^{(i)}$'s close to $x$:
$\hat{y} = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$, where $k(x^{(i)}, x) = \exp(-\gamma\|x^{(i)} - x\|^2)$
Supposing f is smooth within a bin, we need exponentially more examples to get a good interpolation as D increases
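A toy sketch of this kind of kernel interpolation (the 1-D data, γ, and the simple choice of coefficients αi are assumptions for illustration, not the slides' setup):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# toy 1-D training examples (assumed)
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 0.0])

def predict(x, alphas, b=0.0, gamma=1.0):
    # y_hat = sum_i alpha_i * y_i * k(x_i, x) + b
    weights = np.array([rbf(xi, x, gamma) for xi in X_train])
    return np.sum(alphas * y_train * weights) + b

alphas = np.ones(len(X_train))   # assumed coefficients; a kernel method would actually learn these
print(predict(np.array([0.9]), alphas))
```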

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 14 / 67

slide-34
SLIDE 34

Exponential Gains from Depth I

Functions representable with a deep rectifier NN require an exponential number of hidden units in a shallow NN [13]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 15 / 67

slide-35
SLIDE 35

Exponential Gains from Depth I

Functions representable with a deep rectifier NN require an exponential number of hidden units in a shallow NN [13] In deep learning, a deep factor is defined by “reusing” the shallow ones

Face = 0.3 [corner] + 0.7 [circle]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 15 / 67

slide-36
SLIDE 36

Exponential Gains from Depth I

Functions representable with a deep rectifier NN require an exponential number of hidden units in a shallow NN [13]
In deep learning, a deep factor is defined by “reusing” the shallow ones

Face = 0.3 [corner] + 0.7 [circle]

With a shallow structure, a deep factor needs to be replaced by exponentially many factors

Face = 0.3 [0.5 [vertical] + 0.5 [horizontal] ] + 0.7 [ ... ]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 15 / 67

slide-37
SLIDE 37

Exponential Gains from Depth II

Another example: an NN with absolute value rectification units Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value) A single fold in a deep layer creates an exponentially large number of piecewise linear regions in input space

No need to see examples in each linear regions in input space

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 16 / 67

slide-38
SLIDE 38

Exponential Gains from Depth II

Another example: an NN with absolute value rectification units
Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value)
A single fold in a deep layer creates an exponentially large number of piecewise linear regions in input space

No need to see examples in each linear region of the input space

This exponential gain counters the exponential challenges posed by the curse of dimensionality
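A minimal sketch of the folding intuition, assuming a 1-D input and the fold g(x) = |2x − 1|: composing the same fold l times yields on the order of 2^l linear pieces, without ever observing an example in each piece.

```python
import numpy as np

def fold(x):
    return np.abs(2.0 * x - 1.0)   # one absolute-value "fold" of [0, 1] onto itself

def count_linear_pieces(depth):
    # put grid points exactly on the fold breakpoints (multiples of 1/2**depth)
    xs = np.linspace(0.0, 1.0, 2 ** depth * 256 + 1)
    ys = xs.copy()
    for _ in range(depth):
        ys = fold(ys)              # compose the same fold `depth` times
    slopes = np.diff(ys) / np.diff(xs)
    return int(1 + np.sum(np.abs(np.diff(slopes)) > 1e-6))

for depth in range(1, 7):
    print(depth, count_linear_pieces(depth))   # 2, 4, 8, 16, 32, 64: exponential in depth
```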

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 16 / 67

slide-39
SLIDE 39

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 17 / 67

slide-40
SLIDE 40

Stochastic Gradient Descent

Gradient Descent (GD)
w(0) ← a random vector;
Repeat until convergence {
  w(t+1) ← w(t) − η∇wCN(w(t); X);
}

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 18 / 67

slide-41
SLIDE 41

Stochastic Gradient Descent

Gradient Descent (GD)
w(0) ← a random vector;
Repeat until convergence {
  w(t+1) ← w(t) − η∇wCN(w(t); X);
}
Needs to scan the entire dataset to descend (many I/Os)

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 18 / 67

slide-42
SLIDE 42

Stochastic Gradient Descent

Gradient Descent (GD)
w(0) ← a random vector;
Repeat until convergence {
  w(t+1) ← w(t) − η∇wCN(w(t); X);
}
Needs to scan the entire dataset to descend (many I/Os)
(Mini-Batched) Stochastic Gradient Descent (SGD)
w(0) ← a random vector;
Repeat until convergence {
  Randomly partition the training set X into mini-batches {X(j)}j;
  w(t+1) ← w(t) − η∇wC(w(t); X(j));
}

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 18 / 67

slide-43
SLIDE 43

Stochastic Gradient Descent

Gradient Descent (GD)
w(0) ← a random vector;
Repeat until convergence {
  w(t+1) ← w(t) − η∇wCN(w(t); X);
}
Needs to scan the entire dataset to descend (many I/Os)
(Mini-Batched) Stochastic Gradient Descent (SGD)
w(0) ← a random vector;
Repeat until convergence {
  Randomly partition the training set X into mini-batches {X(j)}j;
  w(t+1) ← w(t) − η∇wC(w(t); X(j));
}
No I/O if the next mini-batch can be prefetched
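A minimal mini-batch SGD sketch for linear regression (the synthetic data, squared loss, learning rate, and batch size are assumptions for illustration; prefetching and I/O handling are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

def grad(w, Xb, yb):
    # gradient of the mean squared error 0.5*||Xb w - yb||^2 / |batch|
    return Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(D)                      # w(0) <- an arbitrary (here zero) vector
eta, batch_size = 0.1, 64
for epoch in range(10):
    perm = rng.permutation(N)        # randomly partition X into mini-batches
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        w -= eta * grad(w, X[idx], y[idx])   # w(t+1) <- w(t) - eta * mini-batch gradient

print(np.linalg.norm(w - w_true))    # should be small after a few epochs
```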

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 18 / 67

slide-44
SLIDE 44

GD vs. SGD

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 19 / 67

slide-45
SLIDE 45

GD vs. SGD

Is SGD really a better algorithm?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 19 / 67

slide-46
SLIDE 46

Yes, If You Have Big Data

Performance is limited by training time

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 20 / 67

slide-47
SLIDE 47

Asymptotic Analysis [4]

                              GD                               SGD
Time per iteration            N                                1
#Iterations to opt. error ρ   log(1/ρ)                         1/ρ
Time to opt. error ρ          N log(1/ρ)                       1/ρ
Time to excess error ε        (1/ε^(1/α)) log(1/ε),            1/ε
                              where α ∈ [1/2, 1]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 21 / 67

slide-48
SLIDE 48

Parallelizing SGD

Data Parallelism: every core/GPU trains the full model given partitioned data.
Model Parallelism: every core/GPU trains a partitioned model given full data.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 22 / 67

slide-49
SLIDE 49

Parallelizing SGD

Data Parallelism: every core/GPU trains the full model given partitioned data.
Model Parallelism: every core/GPU trains a partitioned model given full data.
The effectiveness depends on the application and the available hardware

E.g., CPU/GPU speed, communication latency, bandwidth, etc.
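A toy sketch of data parallelism with synchronous gradient averaging (the worker count, model, and data are assumptions; real systems run the workers on separate GPUs and use all-reduce rather than a Python loop):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, n_workers = 8_192, 5, 4
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

shards = np.array_split(np.arange(N), n_workers)   # each worker holds one data partition
w = np.zeros(D)                                    # every worker trains the *full* model
eta = 0.1

for step in range(200):
    grads = []
    for shard in shards:                           # in practice these run in parallel
        idx = rng.choice(shard, size=64, replace=False)
        Xb, yb = X[idx], y[idx]
        grads.append(Xb.T @ (Xb @ w - yb) / len(yb))
    w -= eta * np.mean(grads, axis=0)              # "all-reduce": average the workers' gradients

print(np.round(w, 3))
```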

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 22 / 67

slide-50
SLIDE 50

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 23 / 67

slide-51
SLIDE 51

Online Learning

So far, we assume that the training data X comes at once What if data come sequentially?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 24 / 67

slide-52
SLIDE 52

Online Learning

So far, we assume that the training data X comes at once What if data come sequentially? Online learning: to update model when new data arrive

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 24 / 67

slide-53
SLIDE 53

Online Learning

So far, we have assumed that the training data X come all at once
What if data come sequentially?
Online learning: update the model when new data arrive
This is already supported by SGD

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 24 / 67

slide-54
SLIDE 54

Multi-Task and Transfer Learning

Multi-task learning: to learn a single model for multiple tasks
Transfer learning: to reuse the knowledge learned from one task to help another

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 25 / 67

slide-55
SLIDE 55

Multi-Task and Transfer Learning

Multi-task learning: to learn a single model for multiple tasks

Via shared layers

Transfer learning: to reuse the knowledge learned from one task to help another

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 25 / 67

slide-56
SLIDE 56

Multi-Task and Transfer Learning

Multi-task learning: to learn a single model for multiple tasks

Via shared layers

Transfer learning: to reuse the knowledge learned from one task to help another

Via pretrained layers (whose weights may be further fine-tuned with a smaller learning rate)

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 25 / 67

slide-57
SLIDE 57

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 26 / 67

slide-58
SLIDE 58

Learning Theory

How to learn a function fN from N examples X that is close to the true function f ∗?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 27 / 67

slide-59
SLIDE 59

Learning Theory

How to learn a function $f_N$ from $N$ examples $\mathbb{X}$ that is close to the true function $f^*$?
Empirical risk: $C_N[f_N] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}(f_N(x^{(i)}), y^{(i)})$
Expected risk: $C[f_N] = \int \mathrm{loss}(f_N(x), y)\,dP(x, y)$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 27 / 67

slide-60
SLIDE 60

Learning Theory

How to learn a function $f_N$ from $N$ examples $\mathbb{X}$ that is close to the true function $f^*$?
Empirical risk: $C_N[f_N] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}(f_N(x^{(i)}), y^{(i)})$
Expected risk: $C[f_N] = \int \mathrm{loss}(f_N(x), y)\,dP(x, y)$
Let $f^* = \arg\min_f C[f]$ be the true function (our goal)

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 27 / 67

slide-61
SLIDE 61

Learning Theory

How to learn a function $f_N$ from $N$ examples $\mathbb{X}$ that is close to the true function $f^*$?
Empirical risk: $C_N[f_N] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}(f_N(x^{(i)}), y^{(i)})$
Expected risk: $C[f_N] = \int \mathrm{loss}(f_N(x), y)\,dP(x, y)$
Let $f^* = \arg\min_f C[f]$ be the true function (our goal)
Since we are seeking a function in a model (hypothesis space) $\mathcal{F}$, this is the best we can have: $f^*_{\mathcal{F}} = \arg\min_{f\in\mathcal{F}} C[f]$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 27 / 67

slide-62
SLIDE 62

Learning Theory

How to learn a function $f_N$ from $N$ examples $\mathbb{X}$ that is close to the true function $f^*$?
Empirical risk: $C_N[f_N] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}(f_N(x^{(i)}), y^{(i)})$
Expected risk: $C[f_N] = \int \mathrm{loss}(f_N(x), y)\,dP(x, y)$
Let $f^* = \arg\min_f C[f]$ be the true function (our goal)
Since we are seeking a function in a model (hypothesis space) $\mathcal{F}$, this is the best we can have: $f^*_{\mathcal{F}} = \arg\min_{f\in\mathcal{F}} C[f]$
But we only minimize errors on limited examples in our objective, so we only have $f_N = \arg\min_{f\in\mathcal{F}} C_N[f]$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 27 / 67

slide-63
SLIDE 63

Learning Theory

How to learn a function $f_N$ from $N$ examples $\mathbb{X}$ that is close to the true function $f^*$?
Empirical risk: $C_N[f_N] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}(f_N(x^{(i)}), y^{(i)})$
Expected risk: $C[f_N] = \int \mathrm{loss}(f_N(x), y)\,dP(x, y)$
Let $f^* = \arg\min_f C[f]$ be the true function (our goal)
Since we are seeking a function in a model (hypothesis space) $\mathcal{F}$, this is the best we can have: $f^*_{\mathcal{F}} = \arg\min_{f\in\mathcal{F}} C[f]$
But we only minimize errors on limited examples in our objective, so we only have $f_N = \arg\min_{f\in\mathcal{F}} C_N[f]$
The excess error $E = C[f_N] - C[f^*]$:
$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}}$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 27 / 67

slide-64
SLIDE 64

Excess Error

Wait, we may not have enough training time, so we stop iterating early and obtain $\tilde{f}_N$, where $C_N[\tilde{f}_N] \le C_N[f_N] + \rho$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 28 / 67

slide-65
SLIDE 65

Excess Error

Wait, we may not have enough training time, so we stop iterating early and obtain $\tilde{f}_N$, where $C_N[\tilde{f}_N] \le C_N[f_N] + \rho$
The excess error becomes $E = C[\tilde{f}_N] - C[f^*]$:
$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 28 / 67

slide-66
SLIDE 66

Excess Error

Wait, we may not have enough training time, so we stop iterating early and obtain $\tilde{f}_N$, where $C_N[\tilde{f}_N] \le C_N[f_N] + \rho$
The excess error becomes $E = C[\tilde{f}_N] - C[f^*]$:
$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Approximation error $E_{\mathrm{app}}$: reduced by choosing a larger model

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 28 / 67

slide-67
SLIDE 67

Excess Error

Wait, we may not have enough training time, so we stop iterating early and obtain $\tilde{f}_N$, where $C_N[\tilde{f}_N] \le C_N[f_N] + \rho$
The excess error becomes $E = C[\tilde{f}_N] - C[f^*]$:
$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Approximation error $E_{\mathrm{app}}$: reduced by choosing a larger model
Estimation error $E_{\mathrm{est}}$: reduced by
1 Increasing $N$, or
2 Choosing a smaller model [5, 12, 15]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 28 / 67

slide-68
SLIDE 68

Excess Error

Wait, we may not have enough training time, so we stop iterating early and obtain $\tilde{f}_N$, where $C_N[\tilde{f}_N] \le C_N[f_N] + \rho$
The excess error becomes $E = C[\tilde{f}_N] - C[f^*]$:
$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Approximation error $E_{\mathrm{app}}$: reduced by choosing a larger model
Estimation error $E_{\mathrm{est}}$: reduced by
1 Increasing $N$, or
2 Choosing a smaller model [5, 12, 15]
Optimization error $E_{\mathrm{opt}}$: reduced by
1 Running the optimization algorithm longer (with smaller $\rho$)
2 Choosing a more efficient optimization algorithm

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 28 / 67

slide-69
SLIDE 69

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-70
SLIDE 70

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Mainly constrained by N

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-71
SLIDE 71

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Mainly constrained by N
Computing time is not an issue, so $E_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-72
SLIDE 72

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Mainly constrained by N
Computing time is not an issue, so $E_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$
Size of the hypothesis space is important to balance the trade-off between $E_{\mathrm{app}}$ and $E_{\mathrm{est}}$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-73
SLIDE 73

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Mainly constrained by N
Computing time is not an issue, so $E_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$
Size of the hypothesis space is important to balance the trade-off between $E_{\mathrm{app}}$ and $E_{\mathrm{est}}$

Large-scale ML tasks:

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-74
SLIDE 74

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Mainly constrained by N
Computing time is not an issue, so $E_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$
Size of the hypothesis space is important to balance the trade-off between $E_{\mathrm{app}}$ and $E_{\mathrm{est}}$

Large-scale ML tasks:

Mainly constrained by time (significant $E_{\mathrm{opt}}$), so SGD is preferred

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-75
SLIDE 75

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Mainly constrained by N
Computing time is not an issue, so $E_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$
Size of the hypothesis space is important to balance the trade-off between $E_{\mathrm{app}}$ and $E_{\mathrm{est}}$

Large-scale ML tasks:

Mainly constrained by time (significant $E_{\mathrm{opt}}$), so SGD is preferred
$N$ is large, so $E_{\mathrm{est}}$ can be reduced

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-76
SLIDE 76

Minimizing Excess Error

$E = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{E_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{E_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{E_{\mathrm{opt}}}$

Small-scale ML tasks:

Mainly constrained by N
Computing time is not an issue, so $E_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$
Size of the hypothesis space is important to balance the trade-off between $E_{\mathrm{app}}$ and $E_{\mathrm{est}}$

Large-scale ML tasks:

Mainly constrained by time (significant $E_{\mathrm{opt}}$), so SGD is preferred
$N$ is large, so $E_{\mathrm{est}}$ can be reduced
A large model is preferred to reduce $E_{\mathrm{app}}$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 67

slide-77
SLIDE 77

Big Data + Big Models

  • 9. COTS HPC unsupervised convolutional network [6]
  • 10. GoogLeNet [14]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 30 / 67

slide-78
SLIDE 78

Big Data + Big Models

  • 9. COTS HPC unsupervised convolutional network [6]
  • 10. GoogLeNet [14]

With domain-specific architecture such as convolutional NNs (CNNs) and recurrent NNs (RNNs)

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 30 / 67

slide-79
SLIDE 79

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 31 / 67

slide-80
SLIDE 80

Over-Parametrized NNs

Let $D^{(l)}$ be the output dimension (“width”) of a layer $f^{(l)}(\cdot;\theta^{(l)})$ of an NN

Input/output dimensions: $(x, y) \in \mathbb{R}^{D^{(0)}} \times \mathbb{R}^{D^{(L)}}$
Call $D = \min(D^{(0)}, \cdots, D^{(L)})$ the network width

From the statistical learning theory point of view, the larger the $D$, the worse the generalizability

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 32 / 67

slide-81
SLIDE 81

Over-Parametrized NNs

Let $D^{(l)}$ be the output dimension (“width”) of a layer $f^{(l)}(\cdot;\theta^{(l)})$ of an NN

Input/output dimensions: $(x, y) \in \mathbb{R}^{D^{(0)}} \times \mathbb{R}^{D^{(L)}}$
Call $D = \min(D^{(0)}, \cdots, D^{(L)})$ the network width

From the statistical learning theory point of view, the larger the $D$, the worse the generalizability
However, as $D$ grows, the generalizability actually increases [20]; i.e., over-parametrization leads to better performance

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 32 / 67

slide-82
SLIDE 82

Over-Parametrized NNs

Let $D^{(l)}$ be the output dimension (“width”) of a layer $f^{(l)}(\cdot;\theta^{(l)})$ of an NN

Input/output dimensions: $(x, y) \in \mathbb{R}^{D^{(0)}} \times \mathbb{R}^{D^{(L)}}$
Call $D = \min(D^{(0)}, \cdots, D^{(L)})$ the network width

From the statistical learning theory point of view, the larger the $D$, the worse the generalizability
However, as $D$ grows, the generalizability actually increases [20]; i.e., over-parametrization leads to better performance

Why such a paradox?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 32 / 67

slide-83
SLIDE 83

Wide-and-Deep NNs as Gaussian Processes

Recent studies [10, 9, 11] show that a wide NN of any depth can be approximated by a Gaussian process (GP)

Either before, during, or after training

Recall that a GP is a non-parametric model whose complexity depends only on the size of the training set $|\mathbb{X}|$ and the hyperparameters of the kernel function $k(\cdot,\cdot)$:
$\begin{bmatrix} y_N \\ y_M \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m_N \\ m_M \end{bmatrix}, \begin{bmatrix} K_{N,N} & K_{N,M} \\ K_{M,N} & K_{M,M} \end{bmatrix} \right)$
with Bayesian inference for test points $X'$:
$P(y_M \mid X', \mathbb{X}) = \mathcal{N}\big(K_{M,N}K_{N,N}^{-1}y_N,\; K_{M,M} - K_{M,N}K_{N,N}^{-1}K_{N,M}\big)$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 33 / 67

slide-84
SLIDE 84

Wide-and-Deep NNs as Gaussian Processes

Recent studies [10, 9, 11] show that a wide NN of any depth can be approximated by a Gaussian process (GP)

Either before, during, or after training

Recall that a GP is a non-parametric model whose complexity depends only on the size of the training set $|\mathbb{X}|$ and the hyperparameters of the kernel function $k(\cdot,\cdot)$:
$\begin{bmatrix} y_N \\ y_M \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m_N \\ m_M \end{bmatrix}, \begin{bmatrix} K_{N,N} & K_{N,M} \\ K_{M,N} & K_{M,M} \end{bmatrix} \right)$
with Bayesian inference for test points $X'$:
$P(y_M \mid X', \mathbb{X}) = \mathcal{N}\big(K_{M,N}K_{N,N}^{-1}y_N,\; K_{M,M} - K_{M,N}K_{N,N}^{-1}K_{N,M}\big)$

Therefore, wide-and-deep NNs do not overfit as one may expect

The width $D$, once it becomes large, does not reflect the true model complexity
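A small sketch of the GP inference formula above (the RBF kernel, toy data, and noise jitter are assumptions for illustration):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.sin(X_train).ravel()
X_test = np.array([[1.5], [2.5]])

K_nn = rbf_kernel(X_train, X_train) + 1e-8 * np.eye(len(X_train))  # jitter for stability
K_mn = rbf_kernel(X_test, X_train)
K_mm = rbf_kernel(X_test, X_test)

K_nn_inv = np.linalg.inv(K_nn)
mean = K_mn @ K_nn_inv @ y_train              # K_{M,N} K_{N,N}^{-1} y_N
cov = K_mm - K_mn @ K_nn_inv @ K_mn.T         # K_{M,M} - K_{M,N} K_{N,N}^{-1} K_{N,M}
print(mean, np.diag(cov))
```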

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 33 / 67

slide-85
SLIDE 85

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 34 / 67

slide-86
SLIDE 86

Example: NN for Regression

For simplicity, we consider an L-layer NN $f(\cdot;\theta)$ for the regression problem: $f(x;\theta) = a^{(L)}$, where $a^{(l)} = \phi^{(l)}(W^{(l)\top}a^{(l-1)} + b^{(l)})$ for $l = 1,\ldots,L$, and

the activation functions $\phi^{(1)}(\cdot) = \cdots = \phi^{(L-1)}(\cdot) \equiv \phi(\cdot)$, and $\phi^{(L)}(\cdot)$ is an identity function
$a^{(0)} = x$, and $\hat{y} = a^{(L)} = z^{(L)} \in \mathbb{R}$ is the mean of a Gaussian
$\theta^{(l)} = \mathrm{vec}(W^{(l)}, b^{(l)})$ and $\theta = \mathrm{vec}(\theta^{(1)}, \cdots, \theta^{(L)})$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 35 / 67

slide-87
SLIDE 87

Example: NN for Regression

For simplicity, we consider an L-layer NN $f(\cdot;\theta)$ for the regression problem: $f(x;\theta) = a^{(L)}$, where $a^{(l)} = \phi^{(l)}(W^{(l)\top}a^{(l-1)} + b^{(l)})$ for $l = 1,\ldots,L$, and

the activation functions $\phi^{(1)}(\cdot) = \cdots = \phi^{(L-1)}(\cdot) \equiv \phi(\cdot)$, and $\phi^{(L)}(\cdot)$ is an identity function
$a^{(0)} = x$, and $\hat{y} = a^{(L)} = z^{(L)} \in \mathbb{R}$ is the mean of a Gaussian
$\theta^{(l)} = \mathrm{vec}(W^{(l)}, b^{(l)})$ and $\theta = \mathrm{vec}(\theta^{(1)}, \cdots, \theta^{(L)})$

Let $\hat{y}_N = [f(x^{(1)};\theta), \cdots, f(x^{(N)};\theta)]^\top \in \mathbb{R}^N$ be the predictions for the points in the training set $\mathbb{X} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N} = \{X_N \in \mathbb{R}^{N\times D^{(0)}}, y_N \in \mathbb{R}^N\}$

Maximum-likelihood estimation:
$\arg\max_\theta P(\mathbb{X}\mid\theta) = \arg\min_\theta C(\hat{y}_N, y_N) = \arg\min_\theta \tfrac{1}{2}\|\hat{y}_N - y_N\|^2$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 35 / 67

slide-88
SLIDE 88

Weight Initialization and Normalization

$a^{(l)} = \phi^{(l)}(W^{(l)\top}a^{(l-1)} + b^{(l)})$
Common initialization: $W^{(l)}_{i,j} \sim \mathcal{N}(0, \sigma_w^2)$ and $b^{(l)}_i \sim \mathcal{N}(0, \sigma_b^2)$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 36 / 67

slide-89
SLIDE 89

Weight Initialization and Normalization

$a^{(l)} = \phi^{(l)}(W^{(l)\top}a^{(l-1)} + b^{(l)})$
Common initialization: $W^{(l)}_{i,j} \sim \mathcal{N}(0, \sigma_w^2)$ and $b^{(l)}_i \sim \mathcal{N}(0, \sigma_b^2)$

To normalize the forward and backward gradient signals w.r.t. the layer width $D^{(l)}$, we can define an equivalent NN: $a^{(l)} = \phi^{(l)}(W^{(l)\top}a^{(l-1)} + b^{(l)})$, where $W^{(l)}_{i,j} = \frac{\sigma_w}{\sqrt{D^{(l-1)}}}\,\omega^{(l)}_{i,j}$, $b^{(l)}_i = \sigma_b\,\beta^{(l)}_i$, and $\omega^{(l)}_{i,j}, \beta^{(l)}_i \sim \mathcal{N}(0, 1)$
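A sketch of this normalized parameterization for a fully-connected NN (the depth, widths, σw, σb, and ReLU activation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, sigma_b = 1.5, 0.1
widths = [3, 1024, 1024, 1]          # D(0), D(1), D(2), D(L); wide hidden layers

def init_params():
    # omega and beta are standard normal; the sigma_w / sqrt(D^(l-1)) scaling is applied in forward()
    return [(rng.normal(size=(widths[l], widths[l + 1])), rng.normal(size=widths[l + 1]))
            for l in range(len(widths) - 1)]

def forward(x, params):
    a = x
    for l, (omega, beta) in enumerate(params):
        z = sigma_w / np.sqrt(omega.shape[0]) * (a @ omega) + sigma_b * beta
        a = z if l == len(params) - 1 else np.maximum(z, 0.0)   # identity at the output layer
    return a

x = rng.normal(size=(1, 3))
print(forward(x, init_params()))
```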

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 36 / 67

slide-90
SLIDE 90

Distribution of ˆ y

Given an x, what is the distribution of its prediction ˆ y?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 37 / 67

slide-91
SLIDE 91

Distribution of ˆ y

Given an x, what is the distribution of its prediction $\hat{y}$?
Recall that $\hat{y} = z^{(L)} = w^{(L)\top}a^{(L-1)} + b^{(L)} = \frac{\sigma_w}{\sqrt{D^{(L-1)}}}\sum_j \omega^{(L)}_j\,\phi(z^{(L-1)}_j) + \sigma_b\,\beta^{(L)}$
Since the $\omega^{(L)}_j$'s and $\beta^{(L)}$ are Gaussian random variables with zero means, their sum $\hat{y}$ is also a zero-mean Gaussian
Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 37 / 67

slide-92
SLIDE 92

Distribution of ˆ y

Given an x, what is the distribution of its prediction $\hat{y}$?
Recall that $\hat{y} = z^{(L)} = w^{(L)\top}a^{(L-1)} + b^{(L)} = \frac{\sigma_w}{\sqrt{D^{(L-1)}}}\sum_j \omega^{(L)}_j\,\phi(z^{(L-1)}_j) + \sigma_b\,\beta^{(L)}$
Since the $\omega^{(L)}_j$'s and $\beta^{(L)}$ are Gaussian random variables with zero means, their sum $\hat{y}$ is also a zero-mean Gaussian
Now consider the predictions $\hat{y}_N = [\hat{y}(x^{(1)}), \cdots, \hat{y}(x^{(N)})]^\top \in \mathbb{R}^N$ for $N$ points; we have
$\begin{bmatrix}\hat{y}(x^{(1)}) \\ \vdots \\ \hat{y}(x^{(N)})\end{bmatrix} = \frac{\sigma_w}{\sqrt{D^{(L-1)}}}\sum_j \omega^{(L)}_j \begin{bmatrix}\phi(z^{(L-1)}_j(x^{(1)})) \\ \vdots \\ \phi(z^{(L-1)}_j(x^{(N)}))\end{bmatrix} + \sigma_b\,\beta^{(L)}\,1_N$
As $D^{(L-1)} \to \infty$, by the multidimensional Central Limit Theorem, $\hat{y}_N$ is a multivariate Gaussian with mean $0_N$ and covariance $\Sigma$
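A quick empirical check of this claim (the one-hidden-layer ReLU net, width, and σ values are assumptions): over many random initializations, the output for a fixed x behaves like a zero-mean Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, sigma_b, D0, D1 = 1.5, 0.1, 3, 4096
x = rng.normal(size=D0)

def sample_output():
    omega1, beta1 = rng.normal(size=(D0, D1)), rng.normal(size=D1)
    omega2, beta2 = rng.normal(size=D1), rng.normal()
    z1 = sigma_w / np.sqrt(D0) * (x @ omega1) + sigma_b * beta1
    a1 = np.maximum(z1, 0.0)
    return sigma_w / np.sqrt(D1) * (a1 @ omega2) + sigma_b * beta2

ys = np.array([sample_output() for _ in range(5000)])
print(ys.mean(), ys.std())            # mean ~ 0
z = (ys - ys.mean()) / ys.std()
print((z**3).mean(), (z**4).mean())   # skewness ~ 0, kurtosis ~ 3, as for a Gaussian
```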

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 37 / 67

slide-93
SLIDE 93

Wide-and-Deep NN as a Gaussian Process

The covariance $\Sigma$ completely describes the behavior of our NN $\hat{y}(\cdot) = f(\cdot)$ over the $N$ points
Furthermore, we will show that $\Sigma$ can be described by a deterministic kernel function $k^{(L)}(\cdot,\cdot)$, independent of a particular initialization, such that
$\Sigma = \begin{bmatrix} k^{(L)}(x^{(1)},x^{(1)}) & \cdots & k^{(L)}(x^{(1)},x^{(N)}) \\ \vdots & \ddots & \vdots \\ k^{(L)}(x^{(N)},x^{(1)}) & \cdots & k^{(L)}(x^{(N)},x^{(N)}) \end{bmatrix} \equiv K^{(L)}_{N,N}$
This implies that the NN corresponds to a GP, called the NN-GP:
$\begin{bmatrix} \hat{y}_N \\ \hat{y}_M \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0_N \\ 0_M \end{bmatrix}, \begin{bmatrix} K^{(L)}_{N,N} & K^{(L)}_{N,M} \\ K^{(L)}_{M,N} & K^{(L)}_{M,M} \end{bmatrix} \right)$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 38 / 67

slide-94
SLIDE 94

Wide-and-Deep NN as a Gaussian Process

The covariance $\Sigma$ completely describes the behavior of our NN $\hat{y}(\cdot) = f(\cdot)$ over the $N$ points
Furthermore, we will show that $\Sigma$ can be described by a deterministic kernel function $k^{(L)}(\cdot,\cdot)$, independent of a particular initialization, such that
$\Sigma = \begin{bmatrix} k^{(L)}(x^{(1)},x^{(1)}) & \cdots & k^{(L)}(x^{(1)},x^{(N)}) \\ \vdots & \ddots & \vdots \\ k^{(L)}(x^{(N)},x^{(1)}) & \cdots & k^{(L)}(x^{(N)},x^{(N)}) \end{bmatrix} \equiv K^{(L)}_{N,N}$
This implies that the NN corresponds to a GP, called the NN-GP:
$\begin{bmatrix} \hat{y}_N \\ \hat{y}_M \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0_N \\ 0_M \end{bmatrix}, \begin{bmatrix} K^{(L)}_{N,N} & K^{(L)}_{N,M} \\ K^{(L)}_{M,N} & K^{(L)}_{M,M} \end{bmatrix} \right)$

What is the $k^{(L)}(\cdot,\cdot)$?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 38 / 67

slide-95
SLIDE 95

Deriving k(1)(·, ·)

We use induction to show that $z^{(1)}_i(\cdot), z^{(2)}_i(\cdot), \cdots, z^{(L)}(\cdot) = \hat{y}(\cdot)$ are GPs, governed by kernels $k^{(1)}(\cdot,\cdot), \cdots, k^{(L)}(\cdot,\cdot)$ that are independent of $i$, respectively

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 39 / 67

slide-96
SLIDE 96

Deriving k(1)(·, ·)

We use induction to show that $z^{(1)}_i(\cdot), z^{(2)}_i(\cdot), \cdots, z^{(L)}(\cdot) = \hat{y}(\cdot)$ are GPs, governed by kernels $k^{(1)}(\cdot,\cdot), \cdots, k^{(L)}(\cdot,\cdot)$ that are independent of $i$, respectively
Consider $z^{(1)}_i(x) = \frac{\sigma_w}{\sqrt{D^{(0)}}}\sum_j \omega^{(1)}_{j,i}\,x_j + \sigma_b\,\beta^{(1)}_i$, a zero-mean Gaussian
As $D^{(0)} \to \infty$, we have $[z^{(1)}_i(x^{(1)}), \cdots, z^{(1)}_i(x^{(N)})]^\top \sim \mathcal{N}(0_N, K^{(1)}_{N,N})$ by the multidimensional Central Limit Theorem, where
$k^{(1)}(x,x') = \mathrm{Cov}[z^{(1)}_i(x), z^{(1)}_i(x')] = \mathbb{E}_{\omega^{(1)}_{:,i},\,\beta^{(1)}_i}\big[z^{(1)}_i(x)\,z^{(1)}_i(x')\big]$
$= \frac{\sigma_w^2}{D^{(0)}}\mathbb{E}\Big[\sum_{j,k}\omega^{(1)}_{j,i}\omega^{(1)}_{k,i}\,x_j x'_k\Big] + \frac{\sigma_w\sigma_b}{\sqrt{D^{(0)}}}\mathbb{E}\Big[\beta^{(1)}_i\sum_j\omega^{(1)}_{j,i}x_j\Big] + \frac{\sigma_w\sigma_b}{\sqrt{D^{(0)}}}\mathbb{E}\Big[\beta^{(1)}_i\sum_j\omega^{(1)}_{j,i}x'_j\Big] + \sigma_b^2\,\mathbb{E}\big[\beta^{(1)}_i\beta^{(1)}_i\big]$
$= \frac{\sigma_w^2}{D^{(0)}}\sum_j\mathbb{E}\big[\omega^{(1)}_{j,i}\omega^{(1)}_{j,i}\big]\,x_j x'_j + \sigma_b^2\,\mathbb{E}\big[\beta^{(1)}_i\beta^{(1)}_i\big] = \frac{\sigma_w^2}{D^{(0)}}\,x^\top x' + \sigma_b^2,$
which is independent of $i$
Note that $z^{(1)}_i(\cdot)$ and $z^{(1)}_j(\cdot)$ are independent of each other, $\forall i \neq j$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 39 / 67

slide-97
SLIDE 97

Deriving k(l)(·, ·)

Given that $D^{(0)} \to \infty, \cdots, D^{(l-2)} \to \infty$ and

$[z^{(l-1)}_i(x^{(1)}), \cdots, z^{(l-1)}_i(x^{(N)})]^\top \sim \mathcal{N}(0_N, K^{(l-1)}_{N,N})$

$z^{(l-1)}_i(\cdot)$ and $z^{(l-1)}_j(\cdot)$ are independent of each other, $\forall i \neq j$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 40 / 67

slide-98
SLIDE 98

Deriving k(l)(·, ·)

Given that $D^{(0)} \to \infty, \cdots, D^{(l-2)} \to \infty$ and

$[z^{(l-1)}_i(x^{(1)}), \cdots, z^{(l-1)}_i(x^{(N)})]^\top \sim \mathcal{N}(0_N, K^{(l-1)}_{N,N})$

$z^{(l-1)}_i(\cdot)$ and $z^{(l-1)}_j(\cdot)$ are independent of each other, $\forall i \neq j$

Consider $z^{(l)}_i(x) = \frac{\sigma_w}{\sqrt{D^{(l-1)}}}\sum_j\omega^{(l)}_{j,i}\,\phi(z^{(l-1)}_j(x)) + \sigma_b\,\beta^{(l)}_i$, a zero-mean Gaussian
As $D^{(l-1)} \to \infty$, we have $[z^{(l)}_i(x^{(1)}), \cdots, z^{(l)}_i(x^{(N)})]^\top \sim \mathcal{N}(0_N, K^{(l)}_{N,N})$ by the multidimensional Central Limit Theorem, where
$k^{(l)}(x,x') = \mathrm{Cov}[z^{(l)}_i(x), z^{(l)}_i(x')] = \mathbb{E}_{\omega^{(l)}_{:,i},\,\beta^{(l)}_i,\,z^{(l-1)}(\cdot)}\big[z^{(l)}_i(x)\,z^{(l)}_i(x')\big]$
$= \frac{\sigma_w^2}{D^{(l-1)}}\mathbb{E}\Big[\sum_{j,k}\omega^{(l)}_{j,i}\omega^{(l)}_{k,i}\,\phi(z^{(l-1)}_j(x))\,\phi(z^{(l-1)}_k(x'))\Big] + \sigma_b^2\,\mathbb{E}\big[\beta^{(l)}_i\beta^{(l)}_i\big] + \frac{\sigma_w\sigma_b}{\sqrt{D^{(l-1)}}}\Big(\mathbb{E}\Big[\beta^{(l)}_i\sum_j\omega^{(l)}_{j,i}\phi(z^{(l-1)}_j(x))\Big] + \mathbb{E}\Big[\beta^{(l)}_i\sum_j\omega^{(l)}_{j,i}\phi(z^{(l-1)}_j(x'))\Big]\Big)$
$= \frac{\sigma_w^2}{D^{(l-1)}}\sum_j\mathbb{E}\big[\omega^{(l)}_{j,i}\omega^{(l)}_{j,i}\big]\,\mathbb{E}\big[\phi(z^{(l-1)}_j(x))\,\phi(z^{(l-1)}_j(x'))\big] + \sigma_b^2\,\mathbb{E}\big[\beta^{(l)}_i\beta^{(l)}_i\big]$
$= \sigma_w^2\,\mathbb{E}_{(z^{(l-1)}_i(x),\,z^{(l-1)}_i(x'))\sim\mathcal{N}(0_2,\,K^{(l-1)}_{2,2})}\big[\phi(z^{(l-1)}_i(x))\,\phi(z^{(l-1)}_i(x'))\big] + \sigma_b^2,$
where
$K^{(l-1)}_{2,2} = \begin{bmatrix} k^{(l-1)}(x,x) & k^{(l-1)}(x,x') \\ k^{(l-1)}(x,x') & k^{(l-1)}(x',x') \end{bmatrix}$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 40 / 67

slide-99
SLIDE 99

Evaluating K(l)

For certain activation functions φ(·), such as tanh and ReLU, $k^{(l)}(x,x')$ has a closed form [10]
For other φ(·)'s, Markov Chain Monte Carlo (MCMC) sampling is required to evaluate $k^{(l)}(x,x')$
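For a φ(·) without a closed form, a plain Monte Carlo sketch of the kernel recursion (the tanh activation, sample count, and σ values are assumptions; the bivariate expectation here is estimated by direct sampling rather than MCMC):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w2, sigma_b2 = 1.6, 0.1
phi = np.tanh            # assumed activation; swap in any phi lacking a closed-form kernel

def k1(x, xp):
    return sigma_w2 / len(x) * (x @ xp) + sigma_b2

def next_diag(kxx, n=200_000):
    z = rng.normal(0.0, np.sqrt(kxx), size=n)
    return sigma_w2 * np.mean(phi(z) ** 2) + sigma_b2

def next_cross(kxx, kxxp, kxpxp, n=200_000):
    # Monte Carlo estimate of sigma_w^2 * E[phi(z) phi(z')] + sigma_b^2,
    # with (z, z') ~ N(0, K^(l-1)_{2,2})
    cov = np.array([[kxx, kxxp], [kxxp, kxpxp]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return sigma_w2 * np.mean(phi(z[:, 0]) * phi(z[:, 1])) + sigma_b2

x, xp = np.array([1.0, 0.5]), np.array([0.2, -0.3])
kxx, kxxp, kxpxp = k1(x, x), k1(x, xp), k1(xp, xp)
for layer in range(2, 5):   # propagate the kernel recursion through layers 2..4
    kxx, kxxp, kxpxp = next_diag(kxx), next_cross(kxx, kxxp, kxpxp), next_diag(kxpxp)
    print(layer, kxxp)
```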

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 41 / 67

slide-100
SLIDE 100

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 42 / 67

slide-101
SLIDE 101

Weight Dynamics

Observation: the weights of a wide NN do not change much during gradient descent
Why?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 43 / 67

slide-102
SLIDE 102

Weight Dynamics

Observation: the weights of a wide NN do not change much during gradient descent
Why? A small change in a large number of neurons is enough to significantly change the output
This allows us to approximate an NN $f(\cdot;\theta)$ w.r.t. the weights using a first-order Taylor expansion

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 43 / 67

slide-103
SLIDE 103

Linearization of f(·; θ)

Let $\theta^{(t)}$ be the parameters of the NN at the $t$-th step of gradient descent

$\hat{y}^{(t)}_N = [f(x^{(1)};\theta^{(t)}), \cdots, f(x^{(N)};\theta^{(t)})]^\top$ are the predictions over the training points

Since $\theta^{(t)}$ is close to $\theta^{(0)}$ at any time $t$, we can approximate $f(\cdot;\theta^{(t)})$ using the first-order Taylor expansion w.r.t. $\theta^{(t)}$ around $\theta^{(0)}$:
$f(x;\theta^{(t)}) \approx \bar{f}(x;\theta^{(t)}) = f(x;\theta^{(0)}) + \nabla_\theta f(x;\theta^{(0)})^\top(\theta^{(t)} - \theta^{(0)})$

$\bar{f}$ is still non-linear in terms of $x$
Let $\bar{y}^{(t)}_N = [\bar{f}(x^{(1)};\theta^{(t)}), \cdots, \bar{f}(x^{(N)};\theta^{(t)})]^\top$ be the predictions of $\bar{f}$ at time $t$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 44 / 67

slide-104
SLIDE 104

Weight and Prediction Dynamics

$f(x;\theta^{(t)}) \approx \bar{f}(x;\theta^{(t)}) = f(x;\theta^{(0)}) + \nabla_\theta f(x;\theta^{(0)})^\top(\theta^{(t)} - \theta^{(0)})$
Gradient descent with learning rate $\eta$ makes the following changes:
$\theta^{(t+1)} - \theta^{(t)} \approx -\eta\,\nabla_\theta C(\bar{y}^{(t)}_N, y_N) = -\eta\,\nabla_\theta\bar{y}^{(t)}_N\,\nabla_{\bar{y}^{(t)}_N}C(\bar{y}^{(t)}_N, y_N) = -\eta\,\nabla_\theta\hat{y}^{(0)}_N\,\nabla_{\bar{y}^{(t)}_N}C(\bar{y}^{(t)}_N, y_N)$
and
$\bar{y}^{(t+1)}_N - \bar{y}^{(t)}_N = \nabla_\theta\hat{y}^{(0)\top}_N(\theta^{(t+1)} - \theta^{(t)}) \approx -\eta\,\underbrace{\nabla_\theta\hat{y}^{(0)\top}_N}_{N\times D}\,\underbrace{\nabla_\theta\hat{y}^{(0)}_N}_{D\times N}\,\nabla_{\bar{y}^{(t)}_N}C(\bar{y}^{(t)}_N, y_N),$
where $T^{(0)}_{N,N} \equiv \nabla_\theta\hat{y}^{(0)\top}_N\nabla_\theta\hat{y}^{(0)}_N \in \mathbb{R}^{N\times N}$ is called the Neural Tangent Kernel (NTK) matrix

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 45 / 67

slide-105
SLIDE 105

Prediction Dynamics in Regression

In regression, where $C(\bar{y}^{(t)}_N, y_N) = \frac{1}{2}\|\bar{y}^{(t)}_N - y_N\|^2$, we have
$\bar{y}^{(t+1)}_N - \bar{y}^{(t)}_N \approx -\eta\,T^{(0)}_{N,N}\,\nabla_{\bar{y}^{(t)}_N}C(\bar{y}^{(t)}_N, y_N) = -\eta\,T^{(0)}_{N,N}(\bar{y}^{(t)}_N - y_N)$
With a sufficiently small learning rate $\eta$, we can think of $t$ as continuous time and each GD step as $\Delta t$, where
$\lim_{\Delta t\to 0}\frac{\bar{y}^{(t+\Delta t)}_N - \bar{y}^{(t)}_N}{\Delta t} = \frac{\partial\bar{y}^{(t)}_N}{\partial t} \approx -\eta\,T^{(0)}_{N,N}(\bar{y}^{(t)}_N - y_N)$
Letting $u(t) = \bar{y}^{(t)}_N - y_N$, we have an ordinary differential equation:
$\frac{\partial\bar{y}^{(t)}_N}{\partial t} \approx -\eta\,T^{(0)}_{N,N}(\bar{y}^{(t)}_N - y_N) \;\Rightarrow\; \frac{\partial u(t)}{\partial t} \approx -\eta\,T^{(0)}_{N,N}\,u(t)$
Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 46 / 67

slide-106
SLIDE 106

Prediction Dynamics in Regression

In regression, where $C(\bar{y}^{(t)}_N, y_N) = \frac{1}{2}\|\bar{y}^{(t)}_N - y_N\|^2$, we have
$\bar{y}^{(t+1)}_N - \bar{y}^{(t)}_N \approx -\eta\,T^{(0)}_{N,N}\,\nabla_{\bar{y}^{(t)}_N}C(\bar{y}^{(t)}_N, y_N) = -\eta\,T^{(0)}_{N,N}(\bar{y}^{(t)}_N - y_N)$
With a sufficiently small learning rate $\eta$, we can think of $t$ as continuous time and each GD step as $\Delta t$, where
$\lim_{\Delta t\to 0}\frac{\bar{y}^{(t+\Delta t)}_N - \bar{y}^{(t)}_N}{\Delta t} = \frac{\partial\bar{y}^{(t)}_N}{\partial t} \approx -\eta\,T^{(0)}_{N,N}(\bar{y}^{(t)}_N - y_N)$
Letting $u(t) = \bar{y}^{(t)}_N - y_N$, we have an ordinary differential equation:
$\frac{\partial\bar{y}^{(t)}_N}{\partial t} \approx -\eta\,T^{(0)}_{N,N}(\bar{y}^{(t)}_N - y_N) \;\Rightarrow\; \frac{\partial u(t)}{\partial t} \approx -\eta\,T^{(0)}_{N,N}\,u(t)$
Therefore, $u(t) = e^{-\eta T^{(0)}_{N,N}t}\,u(0)$
Recall that $e^{At} = \frac{1}{0!}I + \frac{t}{1!}A + \frac{t^2}{2!}A^2 + \cdots$ for a symmetric $A$
So $\frac{\partial e^{At}}{\partial t} = \frac{1}{0!}A + \frac{t}{1!}A^2 + \cdots = (\frac{1}{0!}I + \frac{t}{1!}A + \cdots)A = Ae^{At}$
This implies that $\bar{y}^{(t)}_N = e^{-\eta T^{(0)}_{N,N}t}\,\bar{y}^{(0)}_N + (I - e^{-\eta T^{(0)}_{N,N}t})\,y_N = e^{-\eta T^{(0)}_{N,N}t}\,\hat{y}^{(0)}_N + (I - e^{-\eta T^{(0)}_{N,N}t})\,y_N$
Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 46 / 67

slide-107
SLIDE 107

Weight Dynamics in Regression

By the definition of $\bar{y}^{(t)}_N$, we also have $\bar{y}^{(t)}_N = \hat{y}^{(0)}_N + \nabla_\theta\hat{y}^{(0)\top}_N(\theta^{(t)} - \theta^{(0)})$
Solving for $\theta^{(t)}$ in
$e^{-\eta T^{(0)}_{N,N}t}\,\hat{y}^{(0)}_N + (I - e^{-\eta T^{(0)}_{N,N}t})\,y_N = \hat{y}^{(0)}_N + \nabla_\theta\hat{y}^{(0)\top}_N(\theta^{(t)} - \theta^{(0)}),$
we have
$\theta^{(t)} = \theta^{(0)} - \nabla_\theta\hat{y}^{(0)}_N\,T^{(0)-1}_{N,N}(I - e^{-\eta T^{(0)}_{N,N}t})(\hat{y}^{(0)}_N - y_N)$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 47 / 67

slide-108
SLIDE 108

Predictions of Trained NN

Substituting $\theta^{(t)}$ into $\bar{y}^{(t)}_N = \hat{y}^{(0)}_N + \nabla_\theta\hat{y}^{(0)\top}_N(\theta^{(t)} - \theta^{(0)})$, we have that:
For an arbitrary (training or test) point $x'$, the prediction of the trained NN is
$f(x';\theta^{(t)}) \approx \bar{f}(x';\theta^{(t)}) = p^\top\begin{bmatrix}\hat{y}^{(0)}_N \\ \hat{y}'^{(0)}\end{bmatrix} + q,$
where
$p = [-T^{(0)}_{1',N}T^{(0)-1}_{N,N}(I - e^{-\eta T^{(0)}_{N,N}t}),\; 1]^\top \in \mathbb{R}^{N+1}$
$q = T^{(0)}_{1',N}T^{(0)-1}_{N,N}(I - e^{-\eta T^{(0)}_{N,N}t})\,y_N$
$T^{(0)}_{N,N} = \nabla_\theta\hat{y}^{(0)\top}_N\nabla_\theta\hat{y}^{(0)}_N \in \mathbb{R}^{N\times N}$ is the NTK matrix for $X_N$
$T^{(0)}_{1',N} = \nabla_\theta\hat{y}'^{(0)\top}\nabla_\theta\hat{y}^{(0)}_N \in \mathbb{R}^{1\times N}$ is the NTK matrix between $x'$ and $X_N$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 48 / 67

slide-109
SLIDE 109

Predictions of Trained NN

Substituting $\theta^{(t)}$ into $\bar{y}^{(t)}_N = \hat{y}^{(0)}_N + \nabla_\theta\hat{y}^{(0)\top}_N(\theta^{(t)} - \theta^{(0)})$, we have that:
For an arbitrary (training or test) point $x'$, the prediction of the trained NN is
$f(x';\theta^{(t)}) \approx \bar{f}(x';\theta^{(t)}) = p^\top\begin{bmatrix}\hat{y}^{(0)}_N \\ \hat{y}'^{(0)}\end{bmatrix} + q,$
where
$p = [-T^{(0)}_{1',N}T^{(0)-1}_{N,N}(I - e^{-\eta T^{(0)}_{N,N}t}),\; 1]^\top \in \mathbb{R}^{N+1}$
$q = T^{(0)}_{1',N}T^{(0)-1}_{N,N}(I - e^{-\eta T^{(0)}_{N,N}t})\,y_N$
$T^{(0)}_{N,N} = \nabla_\theta\hat{y}^{(0)\top}_N\nabla_\theta\hat{y}^{(0)}_N \in \mathbb{R}^{N\times N}$ is the NTK matrix for $X_N$
$T^{(0)}_{1',N} = \nabla_\theta\hat{y}'^{(0)\top}\nabla_\theta\hat{y}^{(0)}_N \in \mathbb{R}^{1\times N}$ is the NTK matrix between $x'$ and $X_N$

No actual training needed!
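A sketch of this "no training needed" prediction using the empirical NTK of a small one-hidden-layer ReLU net (the network, data, and learning rate are assumptions; a practical implementation would use a dedicated library such as neural-tangents):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
D0, H, N, eta, t = 2, 512, 5, 1.0, 50.0
W1, b1 = rng.normal(size=(D0, H)) / np.sqrt(D0), rng.normal(size=H)
w2, b2 = rng.normal(size=H) / np.sqrt(H), rng.normal()

def f_and_grad(x):
    z1 = x @ W1 + b1
    a1 = np.maximum(z1, 0.0)
    y = a1 @ w2 + b2
    da1 = w2 * (z1 > 0)                       # dy/dz1 (= dy/db1)
    gW1 = np.outer(x, da1)                    # dy/dW1
    return y, np.concatenate([gW1.ravel(), da1, a1, [1.0]])  # gradient w.r.t. all parameters

X = rng.normal(size=(N, D0)); y_N = rng.normal(size=N); x_test = rng.normal(size=D0)
outs, grads = zip(*(f_and_grad(x) for x in X))
y_hat0, J = np.array(outs), np.stack(grads)   # J: N x |theta|
y_test0, j_test = f_and_grad(x_test)

T_NN = J @ J.T                                # empirical NTK matrix T^(0)_{N,N}
T_1N = j_test @ J.T                           # T^(0)_{1',N}
M = np.linalg.inv(T_NN) @ (np.eye(N) - expm(-eta * T_NN * t))
pred = y_test0 + T_1N @ M @ (y_N - y_hat0)    # equals p^T [y_hat0; y_test0] + q
print(pred)
```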

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 48 / 67

slide-110
SLIDE 110

Gradient Descent as an Affine Transformation

Theorem (NTK in infinite width)
As the NN's width goes to infinity, $T^{(0)}_{N,N}$ and $T^{(0)}_{1',N}$ converge to $T_{N,N}$ and $T_{1',N}$, which can be described by a deterministic kernel function $\tau^{(L)}(\cdot,\cdot)$ independent of a particular initialization [9, 11]. That is, each element $T_{i,j} = \tau^{(L)}(x^{(i)}, x^{(j)})$
$\tau^{(L)}(\cdot,\cdot)$ depends only on the network structure and the hyperparameters of the initial weights
$\tau^{(L)}(\cdot,\cdot)$ has a closed form for certain activation functions $\phi(\cdot)$, including erf and ReLU

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 49 / 67

slide-111
SLIDE 111

Gradient Descent as an Affine Transformation

Theorem (NTK in infinite width)
As the NN's width goes to infinity, $T^{(0)}_{N,N}$ and $T^{(0)}_{1',N}$ converge to $T_{N,N}$ and $T_{1',N}$, which can be described by a deterministic kernel function $\tau^{(L)}(\cdot,\cdot)$ independent of a particular initialization [9, 11]. That is, each element $T_{i,j} = \tau^{(L)}(x^{(i)}, x^{(j)})$
$\tau^{(L)}(\cdot,\cdot)$ depends only on the network structure and the hyperparameters of the initial weights
$\tau^{(L)}(\cdot,\cdot)$ has a closed form for certain activation functions $\phi(\cdot)$, including erf and ReLU
$f(x';\theta^{(t)}) \approx p^\top\begin{bmatrix}\hat{y}^{(0)} \\ \hat{y}'^{(0)}\end{bmatrix} + q$ is an affine transformation of the random vector $\begin{bmatrix}\hat{y}^{(0)} \\ \hat{y}'^{(0)}\end{bmatrix}$
Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 49 / 67

slide-112
SLIDE 112

NTK in Infinite Width

Consider the pre-activations $z^{(1)}_i(\cdot), z^{(2)}_i(\cdot), \cdots, z^{(L)}(\cdot) = \hat{y}^{(0)}(\cdot)$ at the different layers at time 0
Let $\nabla_{\theta^{(\le 1)}}z^{(1)}_i(\cdot), \nabla_{\theta^{(\le 2)}}z^{(2)}_i(\cdot), \cdots, \nabla_{\theta^{(\le L)}}z^{(L)}(\cdot) = \nabla_\theta\hat{y}^{(0)}(\cdot)$ be the corresponding derivatives, where $\theta^{(\le l)} \equiv \mathrm{vec}(\theta^{(1)}, \cdots, \theta^{(l)})$

We use induction to show that, as $D \to \infty$, we have
$\nabla_{\theta^{(\le l)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(\le l)}}z^{(l)}_i(x') = \tau^{(l)}(x,x') = k^{(l)}(x,x') + \sigma_w^2\,\tau^{(l-1)}(x,x')\,\mathbb{E}_{(z^{(l-1)}_i(x),\,z^{(l-1)}_i(x'))\sim\mathcal{N}(0_2,\,K^{(l-1)}_{2,2})}\big[\phi'(z^{(l-1)}_i(x))\,\phi'(z^{(l-1)}_i(x'))\big]$
at any layer $l$, with $\tau^{(1)}(x,x') = k^{(1)}(x,x')$

$\tau^{(l)}(\cdot,\cdot)$ is independent of $i$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 50 / 67

slide-113
SLIDE 113

Deriving τ(1)(·, ·)

At the first layer, we have
$\nabla_{\theta^{(\le 1)}}z^{(1)}_i(x)^\top\nabla_{\theta^{(\le 1)}}z^{(1)}_i(x') = \frac{\sigma_w^2}{D^{(0)}}\,x^\top x' + \sigma_b^2 = k^{(1)}(x,x')$
as $D^{(0)} \to \infty$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 51 / 67

slide-114
SLIDE 114

Deriving τ(1)(·, ·)

At the first layer, we have
$\nabla_{\theta^{(\le 1)}}z^{(1)}_i(x)^\top\nabla_{\theta^{(\le 1)}}z^{(1)}_i(x') = \frac{\sigma_w^2}{D^{(0)}}\,x^\top x' + \sigma_b^2 = k^{(1)}(x,x')$
as $D^{(0)} \to \infty$
Now, assume that when $D^{(0)} \to \infty, \cdots, D^{(l-2)} \to \infty$,
$\nabla_{\theta^{(\le l-1)}}z^{(l-1)}_i(x)^\top\nabla_{\theta^{(\le l-1)}}z^{(l-1)}_i(x') = \tau^{(l-1)}(x,x')$ holds

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 51 / 67

slide-115
SLIDE 115

Deriving τ(l)(·, ·) I

At the $l$-th layer, we have
$\nabla_{\theta^{(\le l)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(\le l)}}z^{(l)}_i(x') = [\nabla_{\theta^{(l)}}z^{(l)}_i(x),\,\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x)]\,[\nabla_{\theta^{(l)}}z^{(l)}_i(x'),\,\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x')]^\top$
$= \nabla_{\theta^{(l)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(l)}}z^{(l)}_i(x') + \nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x')$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 52 / 67

slide-116
SLIDE 116

Deriving τ(l)(·, ·) I

At the $l$-th layer, we have
$\nabla_{\theta^{(\le l)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(\le l)}}z^{(l)}_i(x') = [\nabla_{\theta^{(l)}}z^{(l)}_i(x),\,\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x)]\,[\nabla_{\theta^{(l)}}z^{(l)}_i(x'),\,\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x')]^\top$
$= \nabla_{\theta^{(l)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(l)}}z^{(l)}_i(x') + \nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x')$

As $D^{(l-1)} \to \infty$, the first term
$\nabla_{\theta^{(l)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(l)}}z^{(l)}_i(x') = \frac{\sigma_w^2}{D^{(l-1)}}\sum_j\phi(z^{(l-1)}_j(x))\,\phi(z^{(l-1)}_j(x')) + \sigma_b^2$
converges to
$\sigma_w^2\,\mathbb{E}_{(z^{(l-1)}_i(x),\,z^{(l-1)}_i(x'))\sim\mathcal{N}(0_2,\,K^{(l-1)}_{2,2})}\big[\phi(z^{(l-1)}_i(x))\,\phi(z^{(l-1)}_i(x'))\big] + \sigma_b^2 = k^{(l)}(x,x')$
because the $z^{(l-1)}_i(\cdot)$ and $z^{(l-1)}_j(\cdot)$ are i.i.d.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 52 / 67

slide-117
SLIDE 117

Deriving τ(l)(·, ·) II

Consider the second term:
$\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x)^\top\nabla_{\theta^{(\le l-1)}}z^{(l)}_i(x')$
$= \nabla_{z^{(l-1)}(x)}z^{(l)}_i(x)^\top\,\nabla_{\theta^{(\le l-1)}}z^{(l-1)}(x)^\top\nabla_{\theta^{(\le l-1)}}z^{(l-1)}(x')\,\nabla_{z^{(l-1)}(x')}z^{(l)}_i(x')$
$= \nabla_{z^{(l-1)}(x)}z^{(l)}_i(x)^\top\begin{bmatrix}\tau^{(l-1)}(x,x') & & \\ & \ddots & \\ & & \tau^{(l-1)}(x,x')\end{bmatrix}\nabla_{z^{(l-1)}(x')}z^{(l)}_i(x')$
$= \tau^{(l-1)}(x,x')\sum_j\frac{\partial z^{(l)}_i(x)}{\partial z^{(l-1)}_j(x)}\cdot\frac{\partial z^{(l)}_i(x')}{\partial z^{(l-1)}_j(x')} = \tau^{(l-1)}(x,x')\,\frac{\sigma_w^2}{D^{(l-1)}}\sum_j\omega^{(l)}_{j,i}\omega^{(l)}_{j,i}\,\phi'(z^{(l-1)}_j(x))\,\phi'(z^{(l-1)}_j(x'))$
As $D^{(l-1)} \to \infty$, it becomes
$\sigma_w^2\,\tau^{(l-1)}(x,x')\,\mathbb{E}_{(z^{(l-1)}_i(x),\,z^{(l-1)}_i(x'))\sim\mathcal{N}(0_2,\,K^{(l-1)}_{2,2})}\big[\phi'(z^{(l-1)}_i(x))\,\phi'(z^{(l-1)}_i(x'))\big]$
Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 53 / 67

slide-118
SLIDE 118

Outline

1

When ML Meets Big Data

2

Advantages of Deep Learning Representation Learning Exponential Gain of Expressiveness Memory and GPU Friendliness Online & Transfer Learning

3

Learning Theory Revisited Generalizability and Over-Parametrization Wide-and-Deep NN is a Gaussian Process before Training* Gradient Descent is an Affine Transformation* Wide-and-Deep NN is a Gaussian Process after Training*

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 54 / 67

slide-119
SLIDE 119

Wide-and-Deep NN as a Gaussian Process I

As $D \to \infty$, a randomly initialized NN has a corresponding NN-GP:
$\begin{bmatrix}\hat{y}_N \\ \hat{y}_M\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}0_N \\ 0_M\end{bmatrix}, \begin{bmatrix}K^{(L)}_{N,N} & K^{(L)}_{N,M} \\ K^{(L)}_{M,N} & K^{(L)}_{M,M}\end{bmatrix}\right)$

As $D \to \infty$, GD-based training is an affine transformation:
$f(x';\theta^{(t)}) \approx \bar{f}(x';\theta^{(t)}) = p^\top\begin{bmatrix}\hat{y}^{(0)}_N \\ \hat{y}'^{(0)}\end{bmatrix} + q,$
where
$p = [-T_{1',N}T^{-1}_{N,N}(I - e^{-\eta T_{N,N}t}),\; 1]^\top \in \mathbb{R}^{N+1}$
$q = T_{1',N}T^{-1}_{N,N}(I - e^{-\eta T_{N,N}t})\,y_N$
$T_{N,N}$ and $T_{1',N}$ are the NTK matrices

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 55 / 67

slide-120
SLIDE 120

Wide-and-Deep NN as a Gaussian Process II

Therefore, as $D \to \infty$, the trained NN still corresponds to a GP, called the NTK-GP, whose predictions for $M$ test points are
$\begin{bmatrix}\hat{y}_N \\ \hat{y}_M\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}Ay_N \\ By_N\end{bmatrix}, C^\top\begin{bmatrix}K^{(L)}_{N,N} & K^{(L)}_{N,M} \\ K^{(L)}_{M,N} & K^{(L)}_{M,M}\end{bmatrix}C\right),$
where
$A = (I - e^{-\eta T_{N,N}t}) \in \mathbb{R}^{N\times N}$
$B = T_{M,N}T^{-1}_{N,N}(I - e^{-\eta T_{N,N}t}) \in \mathbb{R}^{M\times N}$
$C = \begin{bmatrix}I_{N,N} - A & O_{N,M} \\ -B & I_{M,M}\end{bmatrix} \in \mathbb{R}^{(N+M)\times(N+M)}$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 56 / 67

slide-121
SLIDE 121

Mean Predictions of NTK-GP

Prior (unconditioned) mean predictions for the training set: $\hat{y}_N = Ay_N = (I - e^{-\eta T_{N,N}t})\,y_N$

As $t \to \infty$, $\hat{y}_N$ always approaches the true labels $y_N$
This explains why the SGD-based training of large NNs seldom encounters significant obstacles such as local minima [8]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 57 / 67

slide-122
SLIDE 122

Mean Predictions of NTK-GP

Prior (unconditioned) mean predictions for the training set: $\hat{y}_N = Ay_N = (I - e^{-\eta T_{N,N}t})\,y_N$

As $t \to \infty$, $\hat{y}_N$ always approaches the true labels $y_N$
This explains why the SGD-based training of large NNs seldom encounters significant obstacles such as local minima [8]

Prior mean predictions for the test set: $\hat{y}_M = By_N = T_{M,N}T^{-1}_{N,N}(I - e^{-\eta T_{N,N}t})\,y_N$

As $t \to \infty$, we have $\hat{y}_M = T_{M,N}T^{-1}_{N,N}\,y_N$

The weight hyperparameters are important because they determine $T_{M,N}T^{-1}_{N,N}$

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 57 / 67

slide-123
SLIDE 123

Analytic vs. Real Predictions

Wide residual network [19] trained by SGD with momentum on MSE loss on CIFAR-10

The first two panels show the output dynamics for a randomly selected subset of training and test points

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 58 / 67

slide-124
SLIDE 124

Remarks

Wide-and-deep NNs can be approximated by a class of GPs

Either before, during, or after training

Therefore, complexity of wide-and-deep NNs grows with N, not |θ|

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 59 / 67

slide-125
SLIDE 125

Remarks

Wide-and-deep NNs can be approximated by a class of GPs

Either before, during, or after training

Therefore, complexity of wide-and-deep NNs grows with N, not |θ| Applicable to other architectures including CNN [2, 16], RNN [17, 1], and any architecture [18]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 59 / 67

slide-126
SLIDE 126

Limitations

Approximation holds only when the NNs have:

Infinite width
Small learning rate: $\eta < \frac{2}{\lambda_{\max} + \lambda_{\min}}$, where $\lambda_{\max/\min}$ is the max/min eigenvalue of $T_{N,N}$ [17]
Proper initialization (to be discussed next)

The prior NTK-GP inference $\hat{y}_{\text{NTK-GP}} = T_{M,N}T^{-1}_{N,N}y_N$ is inconsistent with the Bayesian inference of the NN-GP, $\hat{y}_{\text{NN-GP}} = K_{M,N}K^{-1}_{N,N}y_N$

SGD introduces bias [3]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 60 / 67

slide-127
SLIDE 127

Reference I

[1] Sina Alemohammad, Zichao Wang, Randall Balestriero, and Richard Baraniuk. The recurrent neural tangent kernel. arXiv preprint arXiv:2006.10246, 2020. [2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019. [3] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pages 12893–12904, 2019.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 61 / 67

slide-128
SLIDE 128

Reference II

[4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. [5] Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. Ph.D. thesis, Ecole Polytechnique, Palaiseau, France, 2002. [6] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. Deep learning with cots hpc systems. In Proceedings of The 30th International Conference on Machine Learning, pages 1337–1345, 2013.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 62 / 67

slide-129
SLIDE 129

Reference III

[7] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008. [8] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014. [9] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 63 / 67

slide-130
SLIDE 130

Reference IV

[10] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017. [11] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pages 8572–8583, 2019. [12] Pascal Massart. Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 9, pages 245–303, 2000.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 64 / 67

slide-131
SLIDE 131

Reference V

[13] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014. [14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. [15] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 65 / 67

slide-132
SLIDE 132

Reference VI

[16] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019. [17] Greg Yang. Wide feedforward or recurrent neural networks of any architecture are gaussian processes. In Advances in Neural Information Processing Systems, pages 9951–9960, 2019. [18] Greg Yang. Tensor programs ii: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 66 / 67

slide-133
SLIDE 133

Reference VII

[19] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016. [20] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 67 / 67