slide-1
SLIDE 1

Statistical Learning Theory and Applications

9.520/6.860 in Fall 2017

Class Times: Monday and Wednesday 1pm-2:30pm in 46-3310 Units: 3-0-9 H,G

Web site: http://www.mit.edu/~9.520/

Email Contact: 9.520@mit.edu

slide-2
SLIDE 2

9.520: Statistical Learning Theory and Applications

2

  • Course focuses on regularization techniques for supervised learning.
  • Support Vector Machines, manifold learning, sparsity, batch and online supervised learning, feature selection, structured prediction, multitask learning.

  • Optimization theory critical for machine learning (first-order methods, proximal/splitting techniques).
  • Focus on deep learning and its theory, building on the first part of the class.

The goal of this class is to provide the theoretical knowledge and the basic intuitions underlying it, which are needed to effectively use and develop machine learning solutions to a variety of problems.

slide-3
SLIDE 3

Mathcamps

  • Functional analysis (~45mins)
  • Probability (~45mins)

Class http://www.mit.edu/~9.520/

Functional Analysis:

Linear and Euclidean spaces, scalar product, orthogonality, orthonormal bases, norms and semi-norms, Cauchy sequences and complete spaces, Hilbert spaces, function spaces and linear functionals, Riesz representation theorem, convex functions, functional calculus.

Probability Theory:

Random Variables (and related concepts), Law of Large Numbers, Probabilistic Convergence, Concentration Inequalities.

Linear Algebra

Basic notions and definitions: matrix and vector norms; positive, symmetric, and invertible matrices; linear systems; condition number.

slide-4
SLIDE 4

Class http://www.mit.edu/~9.520/: big picture

  • Classes 2-9 are the core: foundations + regularization
  • Classes 10-20 are state-of-the-art topics for research in — and applications of — ML
  • Classes 21-25 review very recent developments in the theory of multilayer networks (DCLNs)

[Diagram: braces grouping the class topics into Shallow Networks and Deep Networks.]

slide-5
SLIDE 5

Today’s hand-wavy overview

  • Motivations for this course: a golden age for new AI, the key role of Machine Learning, CBMM

  • A bit of history: Statistical Learning Theory, Neuroscience
  • A bit of ML history: applications
  • Deep Learning
slide-6
SLIDE 6

Fourth CBMM Summer School, 2017

CBMM

slide-7
SLIDE 7

We aim to make progress in understanding intelligence, that is, in understanding how the brain makes the mind, how the brain works, and how to build intelligent machines. We believe that the science of intelligence will enable better engineering of intelligence.

CBMM’s focus is the Science and the Engineering of Intelligence. Key recent advances in the engineering of intelligence have their roots in basic research on the brain.

slide-8
SLIDE 8

The problem of (human) intelligence is one of the great problems in science, probably the greatest. Research on intelligence:

  • a great intellectual mission: understand the brain, reproduce it in machines
  • will help develop intelligent machines

The problem of intelligence: how it arises in the brain and how to replicate it in machines

slide-9
SLIDE 9

[Diagram: the Science + Technology of Intelligence is interdisciplinary, spanning Machine Learning, Computer Science, Cognitive Science, Neuroscience, and Computational Neuroscience.]

slide-10
SLIDE 10

MIT Research, Education & Diversity Partners

Boyden, Desimone, DiCarlo, Kanwisher, Katz, McDermott, Poggio, Rosasco, Sassanfar, Saxe, Schulz, Tegmark, Tenenbaum, Ullman, Wilson, Winston

Harvard

Blum, Gershman, Kreiman, Livingstone, Nakayama, Sompolinsky, Spelke

Hunter College

Chodorow, Epstein, Sakas, Zeigler

Universidad Central del Caribe (UCC)

Jorquera

UMass Boston

Blaser, Ciaramitaro, Pomplun, Shukla

Howard U.

Chouika, Manaye, Rwebangira, Salmani

Queens College

Brumberg

Stanford U.

Goodman

Johns Hopkins U.

Yuille

Allen Institute

Koch

Rockefeller U.

Freiwald

Wellesley College

Hildreth, Wiest, Wilmer

UPR– Río Piedras

Garcia-Arraras, Maldonado-Vlaar, Megret, Ordóñez, Ortiz-Zuazaga

UPR – Mayagüez

Santiago, Vega-Riveros

University of Central Florida

McNair Program

slide-11
SLIDE 11

Google DeepMind

Academic and Corporate Partners

IIT

Cingolani

A*star

Chuan Poh Lim

Hebrew U.

Weiss

MPI

Bülthoff

Genoa U.

Verri, Rosasco

Weizmann

Ullman

City U. HK

Smale

IBM, Honda, Microsoft, Boston Dynamics, Orcam, NVIDIA, Rethink Robotics, Siemens, Philips, GE, Schlumberger, Mobileye, Intel

slide-12
SLIDE 12

[Chart: CBMM participants over Years 1-4, broken down into Faculty, Research Scientists, Postdocs, Grad Students, EITs (began 2016), Staff/Other, and Total; vertical axis 20-140.]

CBMM Participants

slide-13
SLIDE 13

Collaboration

  • Of all the things that your STC does, what works best to foster inter-institutional collaboration?

slide-14
SLIDE 14

Education

slide-15
SLIDE 15

CBMM Summer Course at Woods Hole: Our flagship initiative

Brains, Minds & Machines Summer Course

An intensive three-week course gives advanced students a “deep” introduction to the problem of intelligence

A community of scholars bridging computer science and neuroscience is being formed: first reunion of summer school alumni Aug. 26-27 in Woods Hole, MA

slide-16
SLIDE 16

Fourth CBMM Summer School, 2017

Recent Achievements in AI

slide-17
SLIDE 17

Intelligence in games: the beginning

slide-18
SLIDE 18
slide-19
SLIDE 19

Recent progress in AI

slide-20
SLIDE 20
  • AlphaGo
  • Mobileye

The 2 best examples of the success of new ML

slide-21
SLIDE 21
slide-22
SLIDE 22
slide-23
SLIDE 23

Real Engineering: Mobileye

slide-24
SLIDE 24

Third Annual NSF Site Visit, June 8 – 9, 2016

History

slide-25
SLIDE 25

Fourth CBMM Summer School, 2017

Inspiration from Neuroscience

slide-26
SLIDE 26

NSF Site Visit, May 15-16, 2017

Background: State-of-the-art Machines (“Deep Learning”) Have Emerged From the Brain’s Visual Processing Architecture

[Figure: Brains/Minds (the ventral visual stream; Desimone & Ungerleider 1989; Van Essen, Movshon) alongside Machines (state-of-the-art ResNets).]

What’s the engineering of the future?

slide-27
SLIDE 27

Fourth CBMM Summer School, 2017

The Problem of Intelligence is NOT solved as yet….

slide-28
SLIDE 28

Fourth CBMM Summer School, 2017

The Problems of Intelligence and CBMM. Intelligence is not solved, not as a scientific problem, not as an engineering problem. Research is needed:

  • for the sake of basic science
  • for the engineering of tomorrow

[Screenshot of Mark Zuckerberg’s post on his 2016 personal challenge to build a simple AI (“Jarvis”, as in Iron Man) to run his home: the state of the art for vision, hearing, language and face recognition is impressively powerful, but these are all variants of the same fundamental pattern-recognition techniques, and building a system that could learn completely new skills would require a fundamental breakthrough in AI.]

slide-29
SLIDE 29

Fourth CBMM Summer School, 2017

For the solution I bet we will need Neuroscience

(suggestion: attend 6.861/9.523)

slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32

The Science of Intelligence

The science of intelligence was at the roots of today’s engineering success. We need to make another basic effort, leveraging the old and new science of intelligence: neuroscience, cognitive science, learning theory.

slide-33
SLIDE 33

Today’s overview

  • Motivations for this course: a golden age for new AI, the key role of Machine Learning, CBMM

Summary: I told you about the present great success of ML, its connections with neuroscience, and its limitations for full AI. I then told you that we need to connect to neuroscience if we want to realize real AI, in addition to understanding our brain. BTW, even without this extension, the next few years will be a golden age for ML applications. The connection to neuroscience is what we do at CBMM and in the CBMM Summer School: this is an advertisement.

slide-34
SLIDE 34

Summary of today’s overview

  • Motivations for this course: a golden age for new AI, the key role of Machine Learning, CBMM

  • A bit of history: Statistical Learning Theory
  • A bit of history: applications
  • Deep Learning
slide-35
SLIDE 35

[Diagram: INPUT x → f → OUTPUT y]

Given a set of ℓ examples (data) (x1, y1), ..., (xℓ, yℓ). Question: find a function f such that f(x) is a good predictor of y for a future input x (fitting the data is not enough!)

Statistical Learning Theory: supervised learning (~1980-2010)

slide-36
SLIDE 36

[Plot: y versus x, showing the data from f, the approximation of f, and the function f itself.]

Generalization:

estimating the value of the function where there are no data (good generalization means predicting the function well; what matters is that the empirical or validation error be a good proxy for the prediction error)

Statistical Learning Theory: prediction, not description
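To make the proxy idea concrete, here is a minimal numpy sketch (mine, not from the slides): an over-flexible fit can drive the empirical error to nearly zero while the error on held-out data, the stand-in for the prediction error, stays large.

# Minimal sketch (not from the slides): the empirical error of an over-flexible
# fit can be near zero while its error on held-out data, the proxy for the
# prediction error, is large.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(15)   # data from f plus noise
x_val = np.sort(rng.uniform(-1, 1, 200))
y_val = np.sin(3 * x_val)

for degree in (3, 14):                       # moderate vs. interpolating polynomial
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, validation MSE {val_err:.4f}")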

slide-37
SLIDE 37

[Figure: example data such as (92,10,…), (41,11,…), (19,3,…), (1,13,…), (4,24,…), (7,33,…), (4,71,…), illustrating Regression and Classification.]

Statistical Learning Theory: supervised learning

slide-38
SLIDE 38

There is an unknown probability distribution on the product space Z = X × Y, written µ(z) = µ(x, y). We assume that X is a compact domain in Euclidean space and Y a bounded subset of ℝ. The training set S = {(x1, y1), ..., (xn, yn)} = {z1, ..., zn} consists of n samples drawn i.i.d. from µ. H is the hypothesis space, a space of functions f : X → Y. A learning algorithm is a map L : Z^n → H that looks at S and selects from H a function fS : x → y such that fS(x) ≈ y in a predictive way.

Statistical Learning Theory: supervised learning
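A minimal sketch of this abstract setup (my illustration, not part of the slides): the hypothesis space H is taken to be linear functions and the learning algorithm L is least squares, so that L maps the training set S to a function f_S.

# Hedged sketch: a learning algorithm is a map that takes the training set
# S = {(x_i, y_i)} and returns a function f_S in the hypothesis space H.
# Here H is the space of linear functions and L is least squares.
import numpy as np

def learning_algorithm(S):
    """Map L : Z^n -> H.  S is a list of (x, y) pairs with x a 1-D array."""
    X = np.array([x for x, _ in S])            # n x d design matrix
    y = np.array([y for _, y in S])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # empirical risk minimizer in H
    return lambda x: np.asarray(x) @ w         # the selected function f_S

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(50)
f_S = learning_algorithm(list(zip(X, y)))
print("prediction on a new x:", f_S(rng.standard_normal(3)))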

slide-39
SLIDE 39

Statistical Learning Theory

slide-40
SLIDE 40

Conditions for generalization and well-posedness in learning theory have deep, almost philosophical, implications: they can be regarded as equivalent conditions that guarantee a theory to be predictive and scientific

  • theory must be chosen from a small hypothesis set (~ Occam’s razor, VC dimension, …)
  • theory should not change much with new data...most of the time (stability)

Statistical Learning Theory: foundational theorems

slide-41
SLIDE 41

Classical algorithm: Regularization in RKHS (e.g. kernel machines)

min over f in H of (1/n) Σ_{i=1..n} V(f(xi), yi) + λ ‖f‖²_H, which implies (representer theorem) f(x) = Σ_{i=1..n} ci K(x, xi).

Remark (for later use):

Classical kernel machines — such as SVMs — correspond to shallow networks

[Diagram: a one-hidden-layer network with inputs X1, …, Xℓ and a single output f.]
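A minimal sketch of the classical algorithm (assumptions mine: Gaussian kernel, square loss, synthetic 1-D data): Tikhonov regularization in an RKHS, with the representer-theorem form f(x) = Σ ci K(x, xi) computed by solving a linear system.

# Minimal sketch (assumptions: Gaussian kernel, square loss) of regularization
# in an RKHS.  The minimizer of (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2
# has the form f(x) = sum_i c_i K(x, x_i) with c = (K + lam*n*I)^{-1} y,
# a "shallow network" with one unit per training point.
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (40, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 1e-2
K = gaussian_kernel(X, X)
c = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)   # coefficients c_i

f = lambda Xnew: gaussian_kernel(Xnew, X) @ c               # f(x) = sum_i c_i K(x, x_i)
print("training MSE:", np.mean((f(X) - y) ** 2))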

slide-42
SLIDE 42

Summary of today’s overview

  • A bit of history: Statistical Learning Theory

Summary: I told you about learning theory and the concern about predictivity and avoiding overfitting. I told you about kernel machines and shallow networks. We will learn a lot about RKHS. Much of this is needed for an eventual theory of deep learning.

slide-43
SLIDE 43

Summary of today’s overview

  • Motivations for this course: a golden age for new AI, the key role of Machine Learning, CBMM

  • A bit of history: Statistical Learning Theory, Neuroscience
  • A bit of history: old applications
  • Deep Learning
slide-44
SLIDE 44

[Diagram: LEARNING THEORY + ALGORITHMS (theorems on foundations of learning, predictive algorithms) alongside COMPUTATIONAL NEUROSCIENCE: models + experiments (how visual cortex works).]

Sung & Poggio 1995; also Kanade & Baluja…

Learning

slide-45
SLIDE 45

[Diagram: LEARNING THEORY + ALGORITHMS / COMPUTATIONAL NEUROSCIENCE schematic, as above.]

Sung & Poggio 1995

Engineering of Learning

slide-46
SLIDE 46
slide-47
SLIDE 47

[Diagram: LEARNING THEORY + ALGORITHMS / COMPUTATIONAL NEUROSCIENCE schematic, as above.]

Face detection has been available in digital cameras for a few years now

Engineering of Learning

slide-48
SLIDE 48

[Diagram: LEARNING THEORY + ALGORITHMS / COMPUTATIONAL NEUROSCIENCE schematic, as above.]

Papageorgiou & Poggio, 1997, 2000; also Kanade & Schneiderman

Engineering of Learning

People detection

slide-49
SLIDE 49

[Diagram: LEARNING THEORY + ALGORITHMS / COMPUTATIONAL NEUROSCIENCE schematic, as above.]

Papageorgiou & Poggio, 1997, 2000; also Kanade & Schneiderman

Engineering of Learning

Pedestrian detection

slide-50
SLIDE 50

50

Some other examples of past ML applications from my lab

Computer Vision

  • Face detection
  • Pedestrian detection
  • Scene understanding
  • Video categorization
  • Video compression
  • Pose estimation

  • Graphics
  • Speech recognition
  • Speech synthesis
  • Decoding the Neural Code
  • Bioinformatics
  • Text Classification
  • Artificial Markets
  • Stock option pricing
  • ….

slide-51
SLIDE 51

Decoding the neural code: Matrix-like read-out from the brain

Hung, Kreiman, Poggio, DiCarlo. Science 2005

slide-52
SLIDE 52

New feature selection SVM:

Only 38 training examples, 7100 features

AML vs ALL: 40 genes 34/34 correct, 0 rejects. 5 genes 31/31 correct, 3 rejects of which 1 is an error.

Pomeroy, S.L., P. Tamayo, M. Gaasenbeek, L.M. Sturia, M. Angelo, M.E. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J.C. Allen, D. Zagzag, M.M. Olson, T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander and T.R. Golub. Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression, Nature, 2002.

Learning: bioinformatics
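A hedged sketch of this small-n, large-p setting on synthetic data (38 examples, 7100 features, labels that depend on a small subset of them); recursive feature elimination with a linear SVM stands in for the feature-selection SVM referred to in the slide.

# Sketch of the 38-example, 7100-feature setting with synthetic data; RFE with
# a linear SVM is a stand-in, not the method used in the cited work.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
n, p, informative = 38, 7100, 40
X = rng.standard_normal((n, p))
y = (X[:, :informative].sum(axis=1) > 0).astype(int)        # labels depend on 40 "genes"

selector = RFE(LinearSVC(max_iter=10000), n_features_to_select=40, step=0.5)
selector.fit(X, y)
picked = np.flatnonzero(selector.support_)
print("selected features:", picked[:10], "... (%d total)" % picked.size)
print("training accuracy with 40 features:", selector.score(X, y))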

slide-53
SLIDE 53

⇒ Bear (0° view)

⇒ Bear (45° view)

Learning: image analysis

slide-54
SLIDE 54

UNCONVENTIONAL GRAPHICS

Θ = 0° view ⇒ Θ = 45° view ⇒

Learning: image synthesis

slide-55
SLIDE 55

A - more in a moment. Tony Ezzat, Geiger, Poggio, SIGGRAPH 2002

Mary101

Extending the same basic learning techniques (in 2D): Trainable Videorealistic Face Animation
 (voice is real, video is synthetic)

slide-56
SLIDE 56

[System diagram: Phone Stream, Trajectory Synthesis, MMM, Phonetic Models, Image Prototypes.]

  • 1. Learning: the system learns, from 4 mins of video, the face appearance (Morphable Model) and the speech dynamics of the person.
  • 2. Run Time: for any speech input, the system provides as output a synthetic video stream.

slide-57
SLIDE 57
slide-58
SLIDE 58

B-Dido

slide-59
SLIDE 59

C-Hikaru

slide-60
SLIDE 60

D-Denglijun

slide-61
SLIDE 61

E-Marylin

slide-62
SLIDE 62

62

slide-63
SLIDE 63

Fourth CBMM Summer School, 2017

slide-64
SLIDE 64

G-Katie

slide-65
SLIDE 65

H-Rehema

slide-66
SLIDE 66

I-Rehemax

slide-67
SLIDE 67

L-real-synth

A Turing test: what is real and what is synthetic?

slide-68
SLIDE 68

Tony Ezzat, Geiger, Poggio, SIGGRAPH 2002

A Turing test: what is real and what is synthetic?

slide-69
SLIDE 69

Opportunity for a good project!

slide-70
SLIDE 70

Summary of today’s overview

  • A bit of history: old applications

Summary: I told you about old applications of ML, mainly kernel machines. I wanted to give you a feeling for how broadly powerful the supervised learning approach is: you can apply it to visual recognition, to decoding neural data, to medical diagnosis, to finance, even to graphics. I also wanted to make you aware that ML does not start with deep learning and certainly does not finish with it.

slide-71
SLIDE 71

Today’s overview

  • Motivations for this course: a golden age for new AI, the key role of Machine Learning, CBMM

  • A bit of history: Statistical Learning Theory, Neuroscience
  • A bit of history: old applications
  • Deep Learning, theory questions:
  • why depth works
  • why deep networks do not overfit
  • the challenge of sampling complexity
slide-72
SLIDE 72

72

slide-73
SLIDE 73

73

slide-74
SLIDE 74

74

slide-75
SLIDE 75

75

slide-76
SLIDE 76
slide-77
SLIDE 77

Deep nets: a theory is needed

slide-78
SLIDE 78
slide-79
SLIDE 79

79

slide-80
SLIDE 80

Deep nets architecture and SGD training

slide-81
SLIDE 81

81

slide-82
SLIDE 82

Summary of today’s overview

  • Motivations for this course: a golden age for new AI, the key role of Machine Learning, CBMM

  • A bit of history: Statistical Learning Theory, Neuroscience
  • A bit of history: old applications
  • Deep Learning, theory questions
  • why depth works
  • why deep networks do not overfit
  • the challenge of sampling complexity
slide-83
SLIDE 83

Approximation theory: when and why are deep networks better — no curse of dimensionality — than shallow networks?

Optimization: what is the landscape of the empirical risk?

Generalization by SGD: how can overparametrized networks generalize?

DLNNs: three main scientific questions

Work with Hrushikesh Mhaskar, Lorenzo Rosasco, Fabio Anselmi, Chiyuan Zhang, Qianli Liao, Sasha Rakhlin, Noah G, Xavier B

slide-84
SLIDE 84

84

slide-85
SLIDE 85

Opportunity for theory projects!

slide-86
SLIDE 86

When is deep better than shallow

Theorem (informal statement)

g(x) = Σ_{i=1..r} ci (⟨wi, x⟩ + bi)_+

Suppose that a function f of d variables is compositional. Both shallow and deep networks can approximate f equally well. The number of parameters of the shallow network depends exponentially on d, as O(ε^−d), whereas for the deep network the dependence is dimension independent, i.e. O(ε^−2).

f(x1, x2, ..., x8) = g3(g21(g11(x1, x2), g12(x3, x4)), g22(g11(x5, x6), g12(x7, x8)))

Mhaskar, Poggio, Liao, 2016

Theory I:
 Why and when are deep networks better than shallow networks?
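Purely illustrative arithmetic (the values of ε and d are mine, not from the slide) showing what the two bounds mean in practice:

# Illustrative arithmetic only: shallow bound O(eps^-d) vs. deep bounds
# O(eps^-2) per constituent function, O(d * eps^-2) overall.
eps, d = 0.1, 8
print("shallow:", eps ** (-d))            # ~1e8 parameters
print("deep, per node:", eps ** (-2))     # ~1e2 parameters
print("deep, overall:", d * eps ** (-2))  # ~8e2 parameters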

slide-87
SLIDE 87

Deep and shallow networks: universality

Cybenko, Girosi, ….

φ(x) = Σ_{i=1..r} ci (⟨wi, x⟩ + bi)_+
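A small numpy sketch of the shallow network in the formula above, one hidden layer of ReLU units; here the wi, bi are random and only the ci are fit by least squares, a simplification rather than the construction used in the universality results.

# Shallow one-hidden-layer ReLU network: phi(x) = sum_i c_i (<w_i, x> + b_i)_+ .
# Random w_i, b_i; only the output weights c_i are fit (a simplification).
import numpy as np

def shallow_net(X, W, b, c):
    return np.maximum(X @ W.T + b, 0.0) @ c        # (<w_i,x> + b_i)_+ then sum c_i

rng = np.random.default_rng(0)
d, r, n = 2, 200, 500
X = rng.uniform(-1, 1, (n, d))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])      # target function of d variables

W = rng.standard_normal((r, d))
b = rng.standard_normal(r)
H = np.maximum(X @ W.T + b, 0.0)                   # hidden-unit activations
c, *_ = np.linalg.lstsq(H, y, rcond=None)          # fit the output weights
print("approximation MSE:", np.mean((shallow_net(X, W, b, c) - y) ** 2))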

slide-88
SLIDE 88

When is deep better than shallow

Both shallow and deep networks can approximate a function of d variables equally well. The number of parameters in both cases depends exponentially on d, as O(ε^−d).

y = f(x1, x2, ..., x8)

Mhaskar, Poggio, Liao, 2016

Curse of dimensionality

slide-89
SLIDE 89

When is deep better than shallow

When can the curse of dimensionality be avoided

slide-90
SLIDE 90

When is deep better than shallow

f(x1, x2, ..., x8) = g3(g21(g11(x1, x2), g12(x3, x4)), g22(g11(x5, x6), g12(x7, x8)))

Generic functions

Mhaskar, Poggio, Liao, 2016

f (x1,x2,...,x8)

Compositional functions

slide-91
SLIDE 91

91

Microstructure of compositionality

[Figure: the target function and the approximating function/network.]

slide-92
SLIDE 92

When is deep better than shallow

Theorem (informal statement)

Suppose that a function f of d variables is hierarchically, locally compositional. Both shallow and deep networks can approximate f equally well. The number of parameters of the shallow network depends exponentially on d, as O(ε^−d), whereas for the deep network it is O(d·ε^−2).

f(x1, x2, ..., x8) = g3(g21(g11(x1, x2), g12(x3, x4)), g22(g11(x5, x6), g12(x7, x8)))

Mhaskar, Poggio, Liao, 2016

Hierarchically local compositionality
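A small sketch of the binary-tree structure in the formula above: every constituent function depends on only two variables, so a deep network can mirror the graph with one small module per node. The particular constituent functions below are illustrative choices of mine, not taken from the slides.

# Hierarchically local compositional function: each g looks at only 2 inputs.
import numpy as np

g11 = lambda a, b: np.tanh(a + 2 * b)      # constituent functions of 2 variables
g12 = lambda a, b: np.tanh(a - b)
g21 = lambda a, b: np.tanh(0.5 * a * b)
g22 = lambda a, b: np.tanh(a + b)
g3  = lambda a, b: a - b

def f(x):                                  # f(x1,...,x8), a function of d = 8 variables
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return g3(g21(g11(x1, x2), g12(x3, x4)),
              g22(g11(x5, x6), g12(x7, x8)))

x = np.random.default_rng(0).uniform(-1, 1, 8)
print("f(x) =", f(x))   # a deep net matching f needs one small module per g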

slide-93
SLIDE 93

Locality of the constituent functions is key, not weight sharing: CIFAR

slide-94
SLIDE 94

When is deep better than shallow

Open problem: why are compositional functions important for perception?

Which one of these reasons: Physics? Neuroscience? <=== Evolution?

slide-95
SLIDE 95

Opportunity for theory projects!

slide-96
SLIDE 96

When is deep better than shallow

Theorem (informal statement)

Liao, Poggio, 2017

Theory II:
 What is the Landscape of the empirical risk?

Replacing the ReLUs with a univariate polynomial approximation, the Bezout theorem implies that the system of polynomial equations corresponding to zero empirical error has a very large number of degenerate solutions. The global zero-minimizers correspond to flat minima in many dimensions (generically, unlike local minima). Thus SGD is biased towards finding global minima of the empirical risk.

slide-97
SLIDE 97

When is deep better than shallow

Results

  • SGD finds with very high probability large volume, flat zero-minimizers;
  • Flat minimizers correspond to degenerate zero-minimizers and thus to global minimizers;
  • SGD minimizers select minima that correspond to small norm solutions and “good” expected error;

Theory III:

How can the underconstrained solutions found by SGD generalize?

Poggio, Rakhlin, Golowich, Zhang, Liao, 2017
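A minimal linear analogy (my choice of setting, not the slides' deep networks) for the "small norm solutions and good expected error" point: with more weights than data there are infinitely many interpolating solutions, and the minimum-norm one, which the pseudo-inverse returns and which gradient descent initialized at zero converges to, predicts much better than an arbitrary interpolant.

# Linear analogy: many exact fits exist when W >> n; the minimum-norm one
# generalizes far better than an arbitrary one.
import numpy as np

rng = np.random.default_rng(0)
n, W = 20, 100                                   # n data points, W >> n weights
w_true = np.zeros(W); w_true[:5] = 1.0
X = rng.standard_normal((n, W))
y = X @ w_true

w_min = np.linalg.pinv(X) @ y                    # minimum-norm interpolant
w_other = w_min + np.linalg.svd(X)[2][n:].T @ rng.standard_normal(W - n)  # another exact fit

X_test = rng.standard_normal((1000, W))
y_test = X_test @ w_true
for name, w in [("min-norm", w_min), ("arbitrary", w_other)]:
    print(name, "train err:", np.abs(X @ w - y).max().round(6),
          "test MSE:", np.mean((X_test @ w - y_test) ** 2).round(3))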

slide-98
SLIDE 98


 Good generalization with less data than # weights

slide-99
SLIDE 99

No overfitting

slide-100
SLIDE 100

Why do Deep Learning Networks work? ===> In which cases will they fail? Is it possible to improve them? Is it possible to reduce the number of labeled examples?

Beyond today’s DLNNs: several scientific questions…

slide-101
SLIDE 101

Opportunity for a good project!

slide-102
SLIDE 102

Beyond today’s DLNNs: neurocognitive science

  • State-of-the-art DLNNs require ~1M labeled examples
  • This is not how we learn, not how children learn
slide-103
SLIDE 103

Today’s science, tomorrow’s engineering: learn like children learn

The first phase (and successes) of ML: supervised learning, big data: n → ∞

The next phase of ML: implicitly supervised learning, learning like children do, small data: n → 1

from programmers… …to labelers… …to computers that learn like children…

slide-104
SLIDE 104

Summary of today’s overview

  • Deep Learning, theory questions:
  • why depth works
  • why deep networks do not overfit
  • the challenge of sampling complexity

Summary: I told you why and when deep learning can avoid the curse of dimensionality while shallow nets cannot. I told you why SGD finds global minima and why they are likely to exist in overparametrized networks. I told you how the theory you will learn in classes 2-9 explains the puzzle of non-overfitting and good generalization by deep nets.

slide-105
SLIDE 105

Old applications

slide-106
SLIDE 106

Old applications

slide-107
SLIDE 107

Old applications

slide-108
SLIDE 108

A - more in a moment. Tony Ezzat, Geiger, Poggio, SIGGRAPH 2002

Mary101

slide-109
SLIDE 109

[System diagram: Phone Stream, Trajectory Synthesis, MMM, Phonetic Models, Image Prototypes.]

  • 1. Learning: the system learns, from 4 mins of video, the face appearance (Morphable Model) and the speech dynamics of the person.
  • 2. Run Time: for any speech input, the system provides as output a synthetic video stream.

slide-110
SLIDE 110
slide-111
SLIDE 111

B-Dido

slide-112
SLIDE 112

C-Hikaru

slide-113
SLIDE 113

D-Denglijun

slide-114
SLIDE 114

E-Marylin

slide-115
SLIDE 115

F-Katie Couric

slide-116
SLIDE 116

G-Katie

slide-117
SLIDE 117

H-Rehema

slide-118
SLIDE 118

I-Rehemax

slide-119
SLIDE 119

L-real-synth

A Turing test: what is real and what is synthetic?

slide-120
SLIDE 120

Tony Ezzat, Geiger, Poggio, SIGGRAPH 2002

A Turing test: what is real and what is synthetic?

slide-121
SLIDE 121

Fourth CBMM Summer School, 2017

slide-122
SLIDE 122

Fourth CBMM Summer School, 2017

slide-123
SLIDE 123

⇒ Bear (0° view)

⇒ Bear (45° view)

Learning: image analysis

slide-124
SLIDE 124

UNCONVENTIONAL GRAPHICS

Θ = 0° view ⇒ Θ = 45° view ⇒

Learning: image synthesis

slide-125
SLIDE 125

Memory Based Graphics DV

slide-126
SLIDE 126

126

slide-127
SLIDE 127
slide-128
SLIDE 128

Learning from examples paradigm

[Diagram: Examples feed a Statistical Learning Algorithm, which maps a New sample to a Prediction.]

Bioinformatics application: predicting type of cancer from DNA chip signals

slide-129
SLIDE 129

New feature selection SVM:

Only 38 training examples, 7100 features

AML vs ALL: 40 genes 34/34 correct, 0 rejects. 5 genes 31/31 correct, 3 rejects of which 1 is an error.

Pomeroy, S.L., P. Tamayo, M. Gaasenbeek, L.M. Sturia, M. Angelo, M.E. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J.C. Allen, D. Zagzag, M.M. Olson, T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander and T.R. Golub. Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression, Nature, 2002.

Learning: bioinformatics

slide-130
SLIDE 130
slide-131
SLIDE 131

Decoding the neural code: Matrix-like read-out from the brain

slide-132
SLIDE 132

The end station of the ventral stream in visual cortex is IT

slide-133
SLIDE 133

77 objects, 8 classes

Chou Hung, Gabriel Kreiman, James DiCarlo, Tomaso Poggio, Science, Nov 4, 2005

Reading-out the neural code in AIT

slide-134
SLIDE 134

Recording at each recording site during passive viewing (100 ms presentation, 100 ms inter-stimulus interval):

  • 77 visual objects
  • 10 presentation repetitions per object
  • presentation order randomized and counter-balanced

slide-135
SLIDE 135

Example of one AIT cell

slide-136
SLIDE 136

Decoding the neural code … using a classifier

Learning from (x, y) pairs, where x is the neuronal population activity and y ∈ {1,…,8} is the object class.
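A sketch of the decoding step with synthetic data standing in for the recordings (sizes chosen to echo the slides: ~200 sites, 8 classes): a linear classifier is trained on (population activity, object class) pairs and evaluated on held-out trials.

# Decoding sketch with synthetic data standing in for the AIT recordings:
# x = population activity across ~200 sites, y = object class in {1,...,8}.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sites, n_classes, trials_per_class = 200, 8, 40
centers = rng.standard_normal((n_classes, n_sites))        # class-specific mean activity
X = np.vstack([c + 0.8 * rng.standard_normal((trials_per_class, n_sites)) for c in centers])
y = np.repeat(np.arange(1, n_classes + 1), trials_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LinearSVC(max_iter=10000).fit(X_tr, y_tr)            # one linear readout per class
print("single-trial decoding accuracy:", clf.score(X_te, y_te))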

slide-137
SLIDE 137

Categorization

  • Toy
  • Body
  • Human Face
  • Monkey Face
  • Vehicle
  • Food
  • Box
  • Cat/Dog

[Video: neuronal population activity and classifier prediction; video speed 1 frame/sec, actual presentation rate 5 objects/sec.]

Hung, Kreiman, Poggio, DiCarlo. Science 2005

We can decode the brain’s code and read-out from neuronal populations:
 reliable object categorization (>90% correct) using ~200 arbitrary AIT “neurons”

slide-138
SLIDE 138

We can decode the brain’s code and read-out from neuronal populations:
 


reliable object categorization using ~100 arbitrary AIT sites

Mean single trial performance:

  • [100-300 ms] interval
  • 50 ms bin size
slide-139
SLIDE 139
slide-140
SLIDE 140

When is deep better than shallow

Theorem (informal statement)

Liao, Poggio, 2017

Theory II:
 What is the Landscape of the empirical risk?

Replacing the ReLUs with a univariate polynomial approximation, the Bezout theorem implies that the system of polynomial equations corresponding to zero empirical error has a very large number of degenerate solutions. The global zero-minimizers correspond to flat minima in many dimensions (generically, unlike local minima). Thus SGD is biased towards finding global minima of the empirical risk.

slide-141
SLIDE 141


 Bezout theorem

p(xi) − yi = 0 for i = 1, ..., n

The set of polynomial equations above, with k = degree of p(x), has a number of distinct zeros (counting points at infinity, using projective space, assigning an appropriate multiplicity to each intersection point, and excluding degenerate cases) equal to the product of the degrees of each of the equations, Z = k^n. As in the linear case, when the system of equations is underdetermined – as many equations as data points but more unknowns (the weights) – the theorem says that there are an infinite number of global minima, in the form of Z regions of zero empirical error.
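A small univariate sketch (illustrative only) of the underdetermined case: with more polynomial coefficients than data points, the solutions of p(xi) − yi = 0 form a continuum, since any null-space direction of the Vandermonde system can be added without changing the fit.

# Underdetermined polynomial fit: more unknown coefficients than equations
# gives infinitely many zero-empirical-error solutions.
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 12                                   # 5 equations, degree-12 polynomial (13 unknowns)
x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
V = np.vander(x, k + 1)                        # n x (k+1) Vandermonde matrix

coef_min = np.linalg.lstsq(V, y, rcond=None)[0]                       # one exact fit
null_dir = np.linalg.svd(V)[2][n:].T @ rng.standard_normal(k + 1 - n) # a null-space direction
coef_other = coef_min + null_dir                                      # a different exact fit

for c in (coef_min, coef_other):
    print("max |p(x_i) - y_i| =", np.abs(np.polyval(c, x) - y).max())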

slide-142
SLIDE 142

Global and local zeros

Global zeros: f(xi) − yi = 0 for i = 1, ..., n, i.e. n equations in W unknowns with W >> n.

Local zeros: W equations in W unknowns.

slide-143
SLIDE 143

When is deep better than shallow

Results

  • SGD finds with very high probability large volume, flat zero-minimizers;
  • Flat minimizers correspond to degenerate zero-minimizers and thus to global minimizers;
  • SGD minimizers select minima that correspond to small norm solutions and “good” expected error;

Theory III:

How can the underconstrained solutions found by SGD generalize?

Poggio, Rakhlin, Golowich, Zhang, Liao, 2017

slide-144
SLIDE 144


 Good generalization with less data than # weights

slide-145
SLIDE 145

No overfitting

slide-146
SLIDE 146

Why do Deep Learning Networks work? ===> In which cases will they fail? Is it possible to improve them? Is it possible to reduce the number of labeled examples?

Beyond today’s DLNNs: several scientific questions…

slide-147
SLIDE 147

Opportunity for a good project!

slide-148
SLIDE 148

Beyond today’s DLNNs: neurocognitive science

  • State-of-the-art DLNNs require ~1M labeled examples
  • This is not how we learn, not how children learn
slide-149
SLIDE 149

Today’s science, tomorrow’s engineering: learn like children learn

The first phase (and successes) of ML: supervised learning, big data: n → ∞

The next phase of ML: implicitly supervised learning, learning like children do, small data: n → 1

from programmers… …to labelers… …to computers that learn like children…