SLIDE 1

Overview of Machine Learning

“introducing the field and some of its key concepts”

Thomas Schön
Division of Systems and Control, Department of Information Technology, Uppsala University.
Email: thomas.schon@it.uu.se, www: user.it.uu.se/~thosc112

SLIDE 2

What is machine learning all about? “Machine learning is about learning, reasoning and acting based on data.”

“It is one of today’s most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science.”

Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature 521:452-459, 2015.
Jordan, M. I. and Mitchell, T. M. Machine Learning: Trends, perspectives and prospects. Science, 349(6245):255-260, 2015.
SLIDE 3

A probabilistic approach

Machine learning is about methods that allow computers/machines to automatically make use of data to solve tasks. Data on its own is typically useless; it is only when we can extract knowledge from the data that it becomes useful. Representation of the data: a model with unknown (a.k.a. latent or missing) variables related to the knowledge we are looking for.

Key concept: Uncertainty. Key ingredient: Data. Probability theory and statistics provide the theory and practice needed for representing and manipulating uncertainty about data, models and predictions. Learn the unknown variables from the data.

SLIDE 4

The data – model relationship

The first step in the extraction of knowledge from data often amounts to finding the unknown parameters in the model using the data that we have available. To do this, the learning system needs links between the latent variables and the observed data. These links are made via assumptions, and taken together these assumptions constitute the model. A mathematical model is a compact representation (set of assumptions) of the data that in precise mathematical form captures the key properties of the underlying system.

SLIDE 5

Mathematical models in machine learning

To enable reasoning about uncertainty we make extensive use of statistics and probability theory in building our models. We often work with very flexible models and methods, such as Gaussian processes and neural networks (deep learning). Simpler models, like linear regression, remain of key importance. Typically these simpler models are used as components within more complex models. “All models are wrong but some are useful.” Uncertainty plays a fundamental role since any reasonable model will be uncertain when making predictions of unobserved data.

SLIDE 6

Mathematical models – representations

The performance of an algorithm typically depends on which representation is used for the data. Learned representations often provide better solutions than hand-designed representations. When solving a problem – start by thinking about which model/representation to use!

SLIDE 7

Representation learning

[Figure 1.5 from http://www.deeplearningbook.org/: flowcharts contrasting rule-based systems, classic machine learning, representation learning and deep learning – from hand-designed programs/features, via learned features, to simple learned features followed by additional layers of more abstract features.]

Problem: How can we learn good representations of data? Ex. Deep learning (DL) solves the problem by introducing representations that are expressed in terms of other, simpler representations. International Conference on Learning Representations: http://www.iclr.cc/

SLIDE 8

The two basic rules from probability theory

Let x and y be continuous random variables. Let p(·) denote a general probability density function.

  • 1. Marginalization (integrate out a variable):

    p(x) = ∫ p(x, y) dy.

  • 2. Conditional probability:

    p(x, y) = p(x | y) p(y).

Combine them into Bayes’ rule:

p(y | x) = p(x | y) p(y) / p(x) = p(x | y) p(y) / ∫ p(x | y) p(y) dy.
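As a sanity check, here is a minimal numerical sketch (an assumed toy example, not from the slides) that verifies both rules and Bayes’ rule on a discretized two-dimensional Gaussian:

```python
import numpy as np

# Hypothetical joint density p(x, y): zero-mean Gaussian, correlation 0.6,
# discretized on a grid so that the integrals become sums.
x = np.linspace(-5, 5, 400)
y = np.linspace(-5, 5, 400)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")
rho = 0.6
p_xy = np.exp(-(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2))) \
       / (2 * np.pi * np.sqrt(1 - rho**2))

# 1. Marginalization: p(x) = ∫ p(x, y) dy
p_x = p_xy.sum(axis=1) * dy
# 2. Conditional probability: p(y | x) = p(x, y) / p(x)
p_y_given_x = p_xy / p_x[:, None]

# Bayes' rule: p(y | x) = p(x | y) p(y) / ∫ p(x | y) p(y) dy
p_y = p_xy.sum(axis=0) * dx
p_x_given_y = p_xy / p_y[None, :]
numer = p_x_given_y * p_y[None, :]
bayes = numer / (numer.sum(axis=1, keepdims=True) * dy)
print(np.max(np.abs(bayes - p_y_given_x)))  # ≈ 0 up to grid error
```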

SLIDE 9

Key objects – Learning a model

D - measured data. z - unknown model variables. The full probabilistic model is given by

p(D, z) = p(D | z) p(z),

where p(D | z) is the data distribution and p(z) is the prior. Inference amounts to computing the posterior distribution

p(z | D) = p(D | z) p(z) / p(D),

where p(D) is the model evidence.

Soon we will make this much more concrete.

SLIDE 10

The model – inference relationship

The problem of inferring (estimating) a model based on data leads to computational challenges, both

  • Integration: e.g. the high-dimensional integrals arising during marginalization (averaging over all possible parameter values z):

    p(D) = ∫ p(D | z) p(z) dz.

  • Optimization: e.g. when extracting point estimates, for example by maximizing the posterior or the likelihood:

    ẑ = arg max_z p(D | z).

These are typically impossible to compute exactly; instead we use approximate methods:

  • Monte Carlo (MC), Markov chain MC (MCMC), and sequential MC (SMC).
  • Variational inference (VI).
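To make the integration challenge concrete, here is a minimal Monte Carlo sketch (an assumed scalar toy model, not from the slides) estimating the evidence by averaging likelihood values over draws from the prior:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed toy model: scalar z with prior N(0, 1) and likelihood N(D | z, 0.5^2).
D = 1.3
z = rng.normal(0.0, 1.0, size=100_000)          # draws from the prior p(z)
p_D_mc = norm.pdf(D, loc=z, scale=0.5).mean()   # MC estimate of ∫ p(D|z)p(z)dz

# Closed form for comparison: marginally D ~ N(0, 1 + 0.5^2).
print(p_D_mc, norm.pdf(D, loc=0.0, scale=np.sqrt(1.25)))
```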

SLIDE 11

Example of 3 of the 4 cornerstones

The three cornerstones: 1. Data, 2. Model, and 3. Inference.

Aim: Compute the position and orientation of the different body segments of a person moving around indoors (motion capture).

Sensors (data) used:

  • 3D Accelerometer
  • 3D Gyroscope
  • 3D Magnetometer

A situation where we need to find latent variables based on observed data. We need a model to extract knowledge from the observed data.

SLIDE 12

Example data-model-inference

Illustrate the use of three different models:

  • 1. Integration of the observations from the sensors.
  • 2. Add a biomechanical model.
  • 3. Add a world model.

Add ultrawideband (UWB) measurements for absolute position.

Manon Kok, Jeroen D. Hol and Thomas B. Schön. An optimization-based approach to human body motion capture using inertial sensors. In Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC), Cape Town, South Africa, August 2014.
Manon Kok, Jeroen D. Hol and Thomas B. Schön. Indoor positioning using ultrawideband and inertial measurements. IEEE Transactions on Vehicular Technology, 64(4):1293-1303, April 2015.

SLIDE 13

Example – ambient magnetic field map

The Earth’s magnetic field sets a background for the ambient magnetic field. Deviations make the field vary from point to point. Aim: Build a map (i.e., a model) of the magnetic environment based on measurements from magnetometers. Solution: Customized Gaussian process that obeys Maxwell’s equations.

www.youtube.com/watch?v=enlMiUqPVJo

Arno Solin, Manon Kok, Niklas Wahlström, Thomas B. Schön and Simo Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. arXiv:1509.04634, 2015.
SLIDE 14

Example – WaveNet

A generative model capable of reading written text aloud using an artificial voice, beating all existing techniques. Given enough samples of a person’s voice, it can be used to synthesize speech from new written text in that particular voice. Application example: audiobooks?

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. WaveNet: a generative model for raw audio. arXiv:1609.03499v2, September, 2016.
SLIDE 15

The nature of Machine Learning

It is sometimes (often...) easier to solve a problem by starting from examples of input-output data than by trying to program it manually. In ML we start from the data. We need models that are flexible enough to capture the properties of the data that are necessary to achieve a certain task. There are basically two ways of building flexible models:

  • 1. Models that use a large (but fixed) number of parameters compared with the data set (parametric, e.g. deep learning).
  • 2. Models that use more parameters as we get access to more data (non-parametric, e.g. the Gaussian process).

SLIDE 16

The scientific field of Machine Learning

There are many related terms, e.g. Pattern Recognition, Statistical Modelling, Data Mining, Adaptive Control, Data Analytics, Data Science, Artificial Intelligence, and Machine Learning. Learning is clearly multidisciplinary; viewed from different fields:

  • Engineering: signal processing, system identification, adaptive and optimal control, computer vision/image processing, information theory, robotics, . . .
  • Computer Science: artificial intelligence, information retrieval, . . .
  • Statistics: learning theory, data mining, learning and inference from data, . . .
  • Cognitive Science and Psychology: perception, mathematical psychology, computational linguistics, . . .
  • Economics: decision theory, game theory, operational research, . . .

SLIDE 17

Different scientific fields, same mathematics

Machine learning develops methods allowing computers to improve their performance at certain tasks based on observed data. Find and understand hidden structures and regularities in data.

  • 1. Look at the data and define possible models to be used.
  • 2. Learn the parameters and structure of the models from data.
  • 3. Use the models to make predictions and decisions.

This provides you with a timely and highly sought-after skill set. The importance of these skills is very likely to increase in the future.

SLIDE 18

Field of machine learning

Top conferences on general machine learning:

  • 1. Neural Information Processing Systems (NIPS) and the International Conference on Machine Learning (ICML)
  • 2. International Conference on Artificial Intelligence and Statistics (AISTATS) and Uncertainty in Artificial Intelligence (UAI)

Top journals on general machine learning:

  • 1. Journal of Machine Learning Research (JMLR)
  • 2. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)

For new (and non-peer-reviewed) material see arXiv.org: arxiv.org/list/stat.ML/recent

SLIDE 19

My assignment

Aim of these lectures: to give an introduction to statistical Machine Learning (SML) by:

  • 1. providing a brief overview and
  • 2. introducing a few key techniques.

I will also make use of our own and others’ research in the area to exemplify the concepts that I am introducing.

SLIDE 20

Outline

  • 1. What is Machine Learning?
  • 2. Probabilistic modelling via probabilistic linear regression
  • 3. Flexible model 1 – Deep learning
  • 4. Flexible model 2 – Gaussian process
  • 5. (Deep) reinforcement learning (very brief if time is short)
  • 6. Some other interesting tools
  • 7. Conclude

SLIDE 21

Key building block – probability distributions

  • Important on their own.
  • Form building blocks for more sophisticated probabilistic models.
  • Distributions commonly used in models: Gaussian, exponential, Laplace, Wishart, Gamma, Student-t and inverse-Wishart, just to mention a few.

Given the computational tools we have today it is often rewarding to resist the linear Gaussian convenience!!

SLIDE 22

Probabilistic linear regression

Linear regression models the relationship between a continuous output variable yn and the input variable xn,

yn = θ1 xn,1 + θ2 xn,2 + · · · + θd xn,d + εn = θᵀxn + εn, εn ∼ N(0, β⁻¹), n = 1, . . . , N,

where θᵀxn is the function f(xn, θ) and εn is noise. We have N input-output pairs available. Model for n = 1, . . . , N:

yn = θᵀxn + εn, εn ∼ N(0, β⁻¹),

where β is a known constant. Let the input xn be modeled as a known deterministic variable.

SLIDE 23

Probabilistic linear regression

Recall that the full probabilistic model is given by

p(Y, θ) = p(Y | θ) p(θ),

where p(Y | θ) is the data distribution, p(θ) is the prior and Y = (y1, y2, . . . , yN)ᵀ. The data distribution factorizes as

p(Y | θ) = ∏_{n=1}^{N} p(yn | θ) = ∏_{n=1}^{N} N(yn | θᵀxn, β⁻¹).

For simplicity we let the prior be p(θ) = N(θ | 0, α⁻¹ I_d). Hence, for this example, the full probabilistic model is

p(Y, θ) = [ ∏_{n=1}^{N} N(yn | θᵀxn, β⁻¹) ] N(θ | 0, α⁻¹ I_d).

SLIDE 24

Probabilistic linear regression

The posterior distribution is now given by

p(θ | Y) = p(Y | θ) p(θ) / p(Y) = · · · = N(θ | mN, SN),

where

mN = β SN Xᵀ Y, SN = (α I_d + β Xᵀ X)⁻¹,

Y = (y1, y2, . . . , yN)ᵀ and X is the matrix with rows x1ᵀ, x2ᵀ, . . . , xNᵀ.

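For concreteness, a minimal NumPy sketch of this computation (the α, β and true θ values are taken from the example on the following slides; the synthetic data itself is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 2.0, 25.0                 # prior precision alpha, noise precision beta
theta_true = np.array([-0.3, 0.5])      # theta_star from the example

N = 30
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])   # rows are x_n^T
Y = X @ theta_true + rng.normal(0, 1 / np.sqrt(beta), N)

# S_N = (alpha I_d + beta X^T X)^{-1} and m_N = beta S_N X^T Y
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ Y
print(m_N)   # posterior mean; approaches theta_true as N grows
```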
SLIDE 25

Probabilistic linear reg. – example (I/VI)

Consider the problem of fitting a straight line to noisy measurements. Let the model be (yn ∈ R, xn ∈ R)

yn = θ0 + θ1 xn + εn, n = 1, . . . , N,

where f(xn, θ) = θ0 + θ1 xn and εn ∼ N(0, 0.2²), i.e. β = 1/0.2² = 25. The example lives in two dimensions, allowing us to plot the distributions when illustrating the inference.

SLIDE 26

Probabilistic linear reg. – example (II/VI)

Let the true values for θ be θ⋆ = (−0.3, 0.5)ᵀ, plotted using a filled white circle below. Generate synthetic measurements by

yn = θ⋆0 + θ⋆1 xn + εn, εn ∼ N(0, 0.2²),

where xn ∼ U(−1, 1). Furthermore, let the prior be

p(θ) = N(θ | (0, 0)ᵀ, α⁻¹ I),

where α = 2.

SLIDE 27

Probabilistic linear reg. – example (III/VI)

Plot of the situation before any data arrives. Prior: p(θ) = N(θ | (0, 0)ᵀ, (1/2) I).

[Figure: contour plot of the prior over (θ0, θ1), and a few realizations from the prior drawn as lines in the (x, y) plane over x, y ∈ [−1, 1].]

SLIDE 28

Probabilistic linear reg. – example (IV/VI)

Plot of the situation after one measurement has arrived. Data distribution (plotted as a function of θ):

p(y1 | θ) = N(y1 | θ0 + θ1 x1, β⁻¹).

Posterior/prior: p(θ | y1) = N(θ | m1, S1), with m1 = β S1 Xᵀ Y and S1 = (αI + β Xᵀ X)⁻¹.

[Figure: a few realizations from the posterior together with the first measurement (black circle).]

SLIDE 29

Probabilistic linear reg. – example (V/VI)

Plot of the situation after two measurements have arrived. Data distribution (plotted as a function of θ):

p(y2 | θ) = N(y2 | θ0 + θ1 x2, β⁻¹).

Posterior/prior: p(θ | Y) = N(θ | m2, S2), with m2 = β S2 Xᵀ Y and S2 = (αI + β Xᵀ X)⁻¹.

[Figure: a few realizations from the posterior together with the measurements (black circles).]

SLIDE 30

Probabilistic linear reg. – example (VI/VI)

Plot of the situation after 30 measurements have arrived. Data distribution (plotted as a function of θ):

p(y30 | θ) = N(y30 | θ0 + θ1 x30, β⁻¹).

Posterior/prior: p(θ | Y) = N(θ | m30, S30), with m30 = β S30 Xᵀ Y and S30 = (αI + β Xᵀ X)⁻¹.

[Figure: a few realizations from the posterior together with the measurements (black circles).]
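The sequential behaviour in panels IV-VI can be reproduced with a short sketch (synthetic data assumed; the posterior after n measurements simply uses the first n rows of X and Y):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 30)
X = np.column_stack([np.ones_like(x), x])          # rows (1, x_n)
Y = X @ np.array([-0.3, 0.5]) + rng.normal(0, 0.2, 30)

for n in (1, 2, 30):
    Xn, Yn = X[:n], Y[:n]
    S_n = np.linalg.inv(alpha * np.eye(2) + beta * Xn.T @ Xn)
    m_n = beta * S_n @ Xn.T @ Yn
    print(n, m_n)  # posterior mean approaches theta_star = (-0.3, 0.5)
```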

SLIDE 31

Kernelized linear regression

We can predict the output y⋆ for a previously unseen input x⋆ (a “test” input) according to

ŷ⋆ = f(x⋆, mN) = mNᵀ x⋆,

which we can rewrite as

ŷ⋆ = x⋆ᵀ mN = β x⋆ᵀ SN Xᵀ Y = ∑_{n=1}^{N} (β x⋆ᵀ SN xn) yn.

Hence, the predictive mean can be written

ŷ⋆ = ∑_{n=1}^{N} k(x⋆, xn) yn,

where k(x, x′) = β xᵀ SN x′ = ψ(x)ᵀ ψ(x′) (with ψ(x) = β^{1/2} SN^{1/2} x) is called the equivalent kernel.
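A quick numerical check of this identity on assumed toy data (the same kind of synthetic setup as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 30)
X = np.column_stack([np.ones_like(x), x])
Y = X @ np.array([-0.3, 0.5]) + rng.normal(0, 0.2, 30)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ Y

x_star = np.array([1.0, 0.3])     # test input (with the intercept term)
k = beta * X @ S_N @ x_star       # equivalent kernel values k(x_star, x_n)
print(k @ Y, m_N @ x_star)        # the two predictions coincide
```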

SLIDE 32

Kernelized linear regression

The exercise on the previous slide suggests an alternative approach to regression where, instead of postulating a parametric model y = θᵀx + ε, we directly make use of a kernel. This provides a non-parametric alternative to linear regression. General property of kernels: k(x, x′) = ψ(x)ᵀψ(x′) (an inner product of the inputs). Kernel trick: In any algorithm where the input data enters only in the form of an inner product, we can replace this inner product with any kernel! The Gaussian process is one construction that provides a non-parametric alternative to regression via the direct use of a kernel.

SLIDE 33

Learning the hyperparameters from data

yn = θᵀxn + εn, εn ∼ N(0, β⁻¹), θ ∼ N(0, α⁻¹ I_d).

Important question: How do we decide on suitable values for the hyperparameters η = (α, β)?

Pragmatic idea: Estimate the hyperparameters η from the data by selecting them such that they maximize the marginal likelihood function,

p(Y | η) = ∫ p(Y | θ, η) p(θ | η) dθ.

This travels under many names; besides empirical Bayes it is also referred to as type 2 maximum likelihood, generalized maximum likelihood, and evidence approximation.
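A minimal empirical-Bayes sketch, using the fact that for this linear-Gaussian model the marginal likelihood is available in closed form (the toy data and the choice of optimizer are assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Assumed toy data, generated with beta = 1/0.2^2 = 25.
x = rng.uniform(-1, 1, 100)
X = np.column_stack([np.ones_like(x), x])
Y = X @ np.array([-0.3, 0.5]) + rng.normal(0, 0.2, 100)

# After integrating out theta: Y ~ N(0, X X^T / alpha + I / beta).
def neg_log_marglik(log_eta):
    alpha, beta = np.exp(log_eta)          # log-parameterization keeps eta > 0
    cov = X @ X.T / alpha + np.eye(len(Y)) / beta
    return -multivariate_normal(mean=np.zeros(len(Y)), cov=cov).logpdf(Y)

res = minimize(neg_log_marglik, x0=np.zeros(2), method="Nelder-Mead")
print(np.exp(res.x))  # estimated (alpha, beta); beta should land near 25
```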

SLIDE 34

Summary – Probabilistic linear regression

Linear regression models the relationship between a continuous output variable yn and the input variable xn,

yn = θᵀxn + εn, εn ∼ N(0, β⁻¹), θ ∼ N(0, α⁻¹I).

We derived the expression for the posterior distribution

p(θ | Y) = p(Y | θ) p(θ) / p(Y).

Kernelized linear regression via the use of the kernel trick. Showed how to estimate the hyperparameters from data. We can show that the MAP point estimate with the Gaussian data distribution (likelihood) p(Y | θ) ∝ exp(−(β/2) ∑_{n=1}^{N} (yn − θᵀxn)²) together with a Gaussian prior leads to ridge regression, and with a Laplacian prior it leads to the LASSO.

SLIDE 35

Probabilistic mod. – Want to know more?

Good introduction

Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature 521:452-459, 2015.

A clear tutorial of probabilistic modelling

Bishop, C. M. Model-based machine learning. Philosophical Transactions of the Royal Society A, 371, 20120222 (2013).

Textbook style introductions

Bishop, C. M. Pattern Recognition and Machine Learning, Springer, 2006.
Murphy, K. P. Machine Learning – A Probabilistic Perspective, MIT Press, 2012.
Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.

Lecture on Probabilistic modelling

https://www.microsoft.com/en-us/research/video/posner-lecture-probabilistic-machine-learning-foundations-and-frontiers/
SLIDE 36

Outline

  • 1. What is Machine Learning?
  • 2. Probabilistic modelling via probabilistic linear regression
  • 3. Flexible model 1 – Deep learning
  • 4. Flexible model 2 – Gaussian process
  • 5. (Deep) reinforcement learning (very brief if time is short)
  • 6. Some other interesting tools
  • 7. Conclude

SLIDE 37

Motivating the name deep learning

Let the computer learn from experience and understand the situation in terms of a hierarchy of concepts, where each concept is defined in terms of its relation to simpler concepts. If we draw a graph showing these concepts on top of each other, the graph is deep, hence the name deep learning. Key aspect: We avoid the need for a human to formally specify all the knowledge that the computer needs. Deep learning achieves a highly flexible model by representing the world in terms of a sequential hierarchy of concepts, where more abstract representations are computed in terms of less abstract ones.

SLIDE 38

Why deep? - Image classification example

Image classification (input: pixels of an image; output: object identity). A 1 megapixel black-and-white image can take 2^1,000,000 possible values! A deep neural network can solve this with a few million parameters! Each hidden layer extracts increasingly abstract features.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), Zürich, Switzerland, September, 2014.
SLIDE 39

An example – deep autoencoder

Unsupervised learning procedure for dimensionality reduction. Notation:

  • yk - High-dim. observations
  • zk - Low-dim. features
  • dim(yk) ≫ dim(zk)

[Figure: feed-forward network with input layer (yk,1, . . . , yk,n), a low-dimensional hidden ”bottleneck” layer (zk,1, . . . , zk,m) and output layer (ŷk,1, . . . , ŷk,n); the first half is the encoder, the second half the decoder.]

Model components:

  • 1. Encoder: zk = g⁻¹(yk; θE)
  • 2. Decoder: ykR = g(zk; θD)

Reconstruction error: VR(θE, θD) = ∑_{k=1}^{N} ‖yk − ykR(θE, θD)‖²
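A minimal sketch of this idea with a purely linear encoder and decoder (all data, sizes and step lengths are assumed; a real deep autoencoder would stack several nonlinear layers):

```python
import numpy as np

rng = np.random.default_rng(3)

N, n, m = 500, 10, 2                       # N samples, dim(y) = 10, dim(z) = 2
Y = rng.normal(size=(N, m)) @ rng.normal(size=(m, n))  # data on a 2-dim subspace
Y += 0.01 * rng.normal(size=(N, n))

W_E = 0.1 * rng.normal(size=(m, n))        # encoder parameters theta_E
W_D = 0.1 * rng.normal(size=(n, m))        # decoder parameters theta_D
lr = 0.01
for _ in range(2000):
    Z = Y @ W_E.T                          # encode: z_k = g^{-1}(y_k; theta_E)
    Y_R = Z @ W_D.T                        # decode: y_k^R = g(z_k; theta_D)
    E = Y_R - Y
    W_D -= lr * E.T @ Z / N                # gradient steps on V_R
    W_E -= lr * (E @ W_D).T @ Y / N
print(np.mean(np.sum(E**2, axis=1)))       # reconstruction error after training
```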

SLIDE 43

Constructing an NN for regression

A neural network (NN) is a hierarchical nonlinear function y = gθ(x) from an input variable x to an output variable y, parameterized by θ. Linear regression models the relationship between a continuous output variable y and an input variable x,

y = ∑_{i=1}^{n} xi θi + θ0 + ε = xᵀθ + ε,

where θ is the parameter vector composed of the “weights” θi and the offset (“bias”) term θ0,

θ = (θ0, θ1, θ2, · · · , θn)ᵀ, x = (1, x1, x2, · · · , xn)ᵀ.

SLIDE 44

Generalized linear regression

We can generalize this by introducing nonlinear transformations of the predictor uᵀθ, i.e. y = f(uᵀθ). Let us consider an example of a feed-forward NN, the name indicating that the information flows from the input to the output layer.

SLIDE 45

NN for regression – an example

  • 1. Form m1 linear combinations of the input x ∈ Rⁿ:

    a_j^(1) = ∑_{i=1}^{n} θ_{ji}^(1) x_i + θ_{j0}^(1), j = 1, . . . , m1.

  • 2. Apply a nonlinear transformation (element-wise):

    z_j = f(a_j^(1)), j = 1, . . . , m1.

  • 3. Form my linear combinations of z ∈ R^{m1}:

    y_k = ∑_{j=1}^{m1} θ_{kj}^(2) z_j + θ_{k0}^(2), k = 1, . . . , my.

SLIDE 46

NN for regression – an example

Putting the three steps together:

ŷ_k(θ) = ∑_{j=1}^{m1} θ_{kj}^(2) f( ∑_{i=1}^{n} θ_{ji}^(1) x_i + θ_{j0}^(1) ) + θ_{k0}^(2).

[Figure: feed-forward network with inputs x1, . . . , xn, one hidden layer z1, . . . applying f, and outputs ŷ1, . . . , ŷ_my; the weights θ_{11}^(1) and θ_{11}^(2) are indicated.]
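The three steps translate directly into code; a small sketch with assumed layer sizes and tanh as the nonlinearity f:

```python
import numpy as np

rng = np.random.default_rng(4)

n, m1, my = 3, 5, 2                                        # assumed sizes
Theta1, b1 = rng.normal(size=(m1, n)), rng.normal(size=m1)  # theta^(1)
Theta2, b2 = rng.normal(size=(my, m1)), rng.normal(size=my) # theta^(2)

def nn(x, f=np.tanh):
    a1 = Theta1 @ x + b1    # step 1: m1 linear combinations of the input
    z = f(a1)               # step 2: element-wise nonlinearity
    return Theta2 @ z + b2  # step 3: my linear combinations of z

print(nn(rng.normal(size=n)))  # y_hat in R^my
```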

SLIDE 47

Deep neural networks

We can think of the neural network as a sequential construction of several generalized linear regressions. Each layer in a multi-layer NN is modelled as

z^(l+1) = f(Θ^(l+1) z^(l) + θ^(l+1)),

starting with the input z^(0) = x. (The nonlinearity operates element-wise.) The scalar nonlinear function f(·) is what makes the neural network nonlinear. Common choices are f(z) = 1/(1 + e⁻ᶻ), f(z) = tanh(z) and f(z) = max(0, z). The so-called rectified linear unit (ReLU), f(z) = max(0, z), is heavily used for deep architectures.
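A sketch of this recursion for a small ReLU network with assumed layer widths (the weight scaling is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)

widths = [4, 16, 16, 1]                          # assumed layer widths
layers = [(rng.normal(size=(m, n)) * np.sqrt(2 / n), np.zeros(m))
          for n, m in zip(widths[:-1], widths[1:])]

def forward(x):
    z = x                                        # z^(0) = x (the raw input)
    for Theta, theta in layers[:-1]:
        z = np.maximum(0, Theta @ z + theta)     # ReLU hidden layers
    Theta, theta = layers[-1]
    return Theta @ z + theta                     # linear output layer

print(forward(rng.normal(size=4)))
```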

SLIDE 48

Deep neural networks

Deep learning methods allow a machine to make use of raw data to automatically discover the representations (abstractions) that are necessary to solve a particular task. This is accomplished by using multiple levels of representation. Each level transforms the representation at the previous level into a new and more abstract representation,

z^(l+1) = f(Θ^(l+1) z^(l) + θ^(l+1)),

starting from the input (raw data) z^(0) = x. Key aspect: The layers are not designed by human engineers; they are generated from (typically lots of) data using a learning procedure and lots of computation.

SLIDE 49

Training an NN

The final layer z^(L) of the network is used for making a prediction ŷ(θ) = z^(L), and we train the network by employing:

  • 1. A set of training data.
  • 2. A cost function L(ŷ(θ), y) and a regularizer J(θ).
  • 3. An iterative scheme to optimize the cost function

    V(θ) = ∑_{n=1}^{N} L(ŷn(θ), yn) + λ J(θ).

There is great software support available, use it! For example: TensorFlow (and playground.tensorflow.org), Theano, Torch.
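A bare-bones sketch of this training setup (squared-error loss, an L2 regularizer as J(θ), plain gradient descent; the data and step sizes are assumed, and in practice one would use one of the libraries above):

```python
import numpy as np

rng = np.random.default_rng(6)

X = rng.uniform(-1, 1, size=(200, 1))
Y = np.sin(3 * X) + 0.1 * rng.normal(size=(200, 1))   # assumed toy data

W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)
lam, lr = 1e-4, 0.05
for _ in range(5000):
    Z = np.tanh(X @ W1 + b1)             # hidden layer
    Y_hat = Z @ W2 + b2                  # prediction y_hat(theta)
    E = Y_hat - Y
    V = np.sum(E**2) + lam * (np.sum(W1**2) + np.sum(W2**2))
    # Backpropagation (the chain rule) for the gradients of V:
    gW2 = 2 * Z.T @ E + 2 * lam * W2
    gb2 = 2 * E.sum(0)
    dZ = 2 * E @ W2.T * (1 - Z**2)       # tanh'(a) = 1 - tanh(a)^2
    gW1 = X.T @ dZ + 2 * lam * W1
    gb1 = dZ.sum(0)
    W1 -= lr * gW1 / len(X); b1 -= lr * gb1 / len(X)
    W2 -= lr * gW2 / len(X); b2 -= lr * gb2 / len(X)
print(V)  # the training cost decreases over the iterations
```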

SLIDE 50

Some comments - Why now?

Neural networks have been around for more than fifty years. Why have they become so popular now (again)? To solve really interesting problems you need:

  • 1. Efficient learning algorithms
  • 2. Efficient computational hardware
  • 3. A lot of labeled data!

These three factors were not fulfilled to a satisfactory level until the last 5-10 years.

SLIDE 51

Summary – Deep learning

A neural network (NN) is a hierarchical nonlinear function y = gθ(x) from an input variable x to an output variable y, parameterized by θ. We can think of an NN as a sequential construction of several generalized linear regressions. Deep learning refers to learning NNs with several hidden layers. It allows for data-driven models that automatically learn representations (features) of data with multiple layers of abstraction. A deep NN is very parameter-efficient when modelling high-dimensional, complex data.

SLIDE 52

Deep learning – Want to know more?

Good introduction

LeCun, Y., Bengio, Y., and Hinton, G. (2015) Deep learning, Nature, 521(7553), 436–444.

Timely introduction (the target audience includes software engineers who do not have a machine learning or statistics background)

I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. Book in preparation for MIT Press.

http://www.deeplearningbook.org

Interesting discussion about why it works so well

Lin, H. W. and Tegmark, M. (2016) Why does deep and cheap learning work so well?, arXiv

Deep learning summer school 2016

https://sites.google.com/site/deeplearningsummerschool2016/

Geoffrey Hinton’s Coursera course

https://www.coursera.org/learn/neural-networks/home/welcome

NIPS and ICML conferences and workshops!

SLIDE 53

Appendix

SLIDE 54

Probabilistic linear regression – summary

We have N input-output pairs available. Model for n = 1, . . . , N:

yn = θᵀxn + εn, εn ∼ N(0, β⁻¹), θ ∼ N(0, α⁻¹ I_d).

We have shown that the posterior distribution is

p(θ | Y) = N(θ | mN, SN), where mN = β SN Xᵀ Y, SN = (α I_d + β Xᵀ X)⁻¹,

Y = (y1, y2, . . . , yN)ᵀ and X is the matrix with rows x1ᵀ, x2ᵀ, . . . , xNᵀ.

What is the maximum a posteriori (MAP) point estimate?

θ̂_MAP = arg max_θ p(θ | Y)

SLIDE 55

On the relationship to maximum likelihood

The MAP solution is given by

θ̂_MAP = β (αI + β Xᵀ X)⁻¹ Xᵀ Y.  (1)

Do you recognize this solution? What if α = 0? Then this is the maximum likelihood solution, i.e. the solution to

θ̂_ML = arg min_θ ‖Y − Xθ‖²₂.

For α ≠ 0 we can show that the solution to

θ̂_RR = arg min_θ ‖Y − Xθ‖²₂ + (α/β) ‖θ‖²₂

is given by (1). This is commonly referred to as ridge regression. Hence, ridge regression is equivalent to computing the MAP estimate using a Gaussian prior for θ.
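A quick numerical check of this equivalence on assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(7)

alpha, beta = 2.0, 25.0
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
Y = X @ np.array([-0.3, 0.5]) + rng.normal(0, 0.2, 50)

# MAP estimate from (1): beta (alpha I + beta X^T X)^{-1} X^T Y
theta_map = beta * np.linalg.solve(alpha * np.eye(2) + beta * X.T @ X, X.T @ Y)

# Ridge regression: the normal equations of the penalized criterion give
# (X^T X + (alpha/beta) I) theta = X^T Y.
theta_rr = np.linalg.solve(X.T @ X + (alpha / beta) * np.eye(2), X.T @ Y)
print(theta_map, theta_rr)  # identical
```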

SLIDE 56

On the relationship to LASSO

Analogously, we can show an equivalence between the use of a Laplacian prior and ℓ1-norm regularization,

θ̂_LASSO = arg min_θ ‖Y − Xθ‖²₂ + λ ‖θ‖₁.

This is commonly referred to as the LASSO. Ridge regression and the LASSO are two examples of regularized (or penalized) maximum likelihood.

SLIDE 57

Probabilistic modeling of linear SSMs

A linear Gaussian state space model (SSM) consists of a Markov process {xt}t≥1 that is indirectly observed via a linear, Gaussian measurement process {yt}t≥1:

xt+1 = A xt + B ut + vt, vt ∼ N(0, Q),
yt = C xt + D ut + et, et ∼ N(0, R),
x1 ∼ µη(x1), θ ∼ π(θ),

where θ = {A, B, C, D, Q, R, η}. The full probabilistic model is given by

p(x1:T, θ, y1:T) = p(y1:T | x1:T, θ) p(x1:T, θ),

where p(y1:T | x1:T, θ) is the data distribution and p(x1:T, θ) is the prior.
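A minimal sketch simulating a scalar instance of this model (all parameter values are assumed, and the input ut is set to zero):

```python
import numpy as np

rng = np.random.default_rng(8)

A, C, Q, R, T = 0.9, 1.0, 0.1, 0.5, 100   # assumed theta = {A, C, Q, R}
x = np.empty(T); y = np.empty(T)
x[0] = rng.normal(0, 1)                    # x_1 ~ mu(x_1), here N(0, 1)
for t in range(T):
    y[t] = C * x[t] + rng.normal(0, np.sqrt(R))            # measurement
    if t + 1 < T:
        x[t + 1] = A * x[t] + rng.normal(0, np.sqrt(Q))    # state transition
print(y[:5])
```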
