
10/7/2014

A Gentle Introduction to Machine Learning

First Lecture

Olov Andersson, AIICS Linköpings Universitet

What is Machine Learning about?

  • To imbue the capacity to learn into machines
  • Our only frame of reference for learning is from biology
  • …but brains are hideously complex, the result of ages of evolution
  • Like much of AI, Machine Learning mainly takes an engineering approach [1]
  • Humanity didn’t first master flight by just imitating birds!

[1] Although there is some occasional biological inspiration.


Why Machine Learning

  • It may be impossible to manually program for every situation in advance
  • The world may change; if the agent cannot adapt, it will fail
  • Many argue that learning is required for AI to scale up
  • We are still far from a general learning agent!
  • …but the algorithms we have so far have shown themselves to be useful in a wide range of applications!


Some Application Aspects

  • May not be as versatile as human learning, but domain-specific problems can often be processed much faster than by a human
  • Computers work 24/7 and you can often scale performance by piling on more of them

Data Mining

  • Companies can collect ever more data and processing power is cheap
  • Put it to use automatically analyzing the performance of products!
  • Machine Learning is almost ubiquitous on the web: mail filters, search engines, product recommendations, customized content, ad serving…
  • “Big Data” – much hyped technology trend.

Robotics, Computer Vision

  • Many capabilities that humans take for granted, like locomotion, grasping and recognizing objects, have turned out to be ridiculously difficult to manually construct rules for.


Demo – Stanford Helicopter Acrobatics

…in narrow applications machine learning can even rival human performance


To Define Machine Learning

“A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.”
- Tom Mitchell

From the agent perspective:

[Figure: an Agent connected to the Task (Environment) through Input (Sensors) and Output (Actuators), judged by a Performance Metric]


The Three Main Types of Machine Learning


Machine learning is a young science that is still changing, but its algorithms are traditionally divided into three types depending on their purpose.

  • Supervised Learning
  • Reinforcement Learning
  • Unsupervised Learning

Supervised Learning at a glance

“A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.”
- Tom Mitchell

In Supervised Learning:

  • The correct output is given to the algorithm during a training phase
  • Experience E is thus tuples of training data consisting of (Input, Output) pairs
  • Performance metric P is some function of how well the predicted output matches the given correct output

Mathematically, this can be seen as trying to approximate an unknown function f(x) = y given examples of (x, y)


Supervised Learning at a glance II

Representation from agent perspective:

[Figure: a Reactive Agent receives the state from the Task (Environment) via Input (Sensors) and returns an action via Output (Actuators), judged by a Performance Metric. The agent computes f(input) = output, or f(state) = action.]

Supervised Learning is surprisingly powerful and ubiquitous

Some real world examples

  • Spam Filter: f(mail) = spam?
  • Microsoft Kinect: f(pixels, distance) = body part

…but it can also be used as a component in other architectures

Body Part Classification on the Microsoft Kinect

[Figure: body parts labelled in an image, e.g. right elbow, right hand, left shoulder, neck (Shotton et al. @ MSR, CVPR 2011)]


Reinforcement Learning at a glance

“A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.”
- Tom Mitchell

In Reinforcement Learning:

  • A reward is given at each step instead of the correct input/output
  • Experience E consists of the history of inputs, the chosen outputs and the rewards
  • Performance metric P is some sum of how much reward the agent can accumulate

Inspired by early work in psychology and how pets are trained

The agent can learn on its own as long as the reward signal can be concisely defined.


Reinforcement Learning at a glance II

RL fits neatly into a utility (reward) maximizing agent framework

  • Rewards of actions in different states are learned
  • Agent plans ahead to maximize reward over time

[Figure: an RL Agent receives the state from the Task (Environment) via Input (Sensors) and returns an action via Output (Actuators), judged by a Performance Metric (reward). The agent learns f(state, action) = reward and acts to maximize future reward.]

Real world examples – Robot Control, Game Playing (Checkers…)


Unsupervised Learning at a glance

“A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.”
- Tom Mitchell

In Unsupervised Learning:

  • The Task is to find a more concise representation of the data
  • Neither the correct answer nor a reward is given
  • Experience E is just the given data
  • P depends on the task

Examples:

Clustering – When the data distribution is confined to lie in a small number of “clusters” we can find these and use them instead of the original representation

Dimensionality Reduction – Finding a suitable lower dimensional representation while preserving as much information as possible


Clustering – Continuous Data

[Figure: clustering of two-dimensional continuous input (Bishop, 2006)]
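As a minimal sketch of clustering two-dimensional continuous data, the following uses k-means, one common clustering algorithm that the slides do not name. The function name `kmeans`, the sample points and the starting centres are all illustrative assumptions.

```python
import math

def kmeans(points, centres, iterations=10):
    """k-means on 2-D points, starting from the given centres."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its points.
        for k, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster ends up empty
                centres[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centres, clusters

# Hypothetical data: two well separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centres, clusters = kmeans(points, centres=[(0.0, 0.0), (10.0, 10.0)])
```

The two centres can then stand in for the six original points, which is the "more concise representation" the slide refers to.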


Outline of Machine Learning Lectures

First we will talk about Supervised Learning

  • Definition
  • Main Concepts
  • Some Approaches & Applications
  • Pitfalls & Limitations
  • In-depth: Decision Trees (a supervised learning approach)

Then finish with a short introduction to Reinforcement Learning

The idea is that you will be informed enough to find and try a learning algorithm if the need arises.


Supervised Learning in more detail…

Remember, in Supervised Learning:

  • Tuples of training data consisting of (x, y) pairs are given to the algorithm
  • The objective is to learn to predict the output y_i for an input x_i

Can be seen as searching for an approximation to the unknown function y = f(x) given N examples of x and y: (x_1, y_1), …, (x_N, y_N)

  • A candidate approximation is sometimes called a hypothesis

Two major classes of supervised learning

  • Classification – Output is a discrete category label
      • Example: Detecting cancer, y = “healthy” or “ill”
  • Regression – Output is a numeric value
      • Example: Predicting temperature, y = 15.3 degrees

In either case input data x_i can be vector valued and discrete, continuous or mixed. Example: x_1 = (12.5, “cloud free”, 1.35).


Supervised Learning in Practice

Can be seen as searching for an approximation to the unknown function y = f(x) given N examples of x and y: (x_1, y_1), …, (x_N, y_N)

The goal is to have the algorithm learn from training examples to successfully generalize to new examples

  • First construct an input vector x_i for each example by encoding relevant problem data. This is often called the feature vector.
  • A collection of such (x_i, y_i) examples is the training set.
  • A model is selected and trained on the examples by searching its parameters (the hypothesis space) for values that yield a good approximation to the unknown true function.
  • Evaluate performance, then (carefully) tweak the algorithm or features.


Feature Vector Construction

Want to learn y = f(x) given N examples of x and y: (x_1, y_1), …, (x_N, y_N)

  • Standard algorithms tend to work on variables defined as numbers
  • If the inputs x or outputs y contain categorical values like “book” or “car” we need to encode them, typically as integers.
  • With only two classes we get y in {0,1}, called binary classification
  • Classification into multiple classes can be reduced to a sequence of binary one-vs-all classifiers
  • The variables may also be structured like in text, graphs, audio, image or video data
  • Finding a suitable feature representation can be non-trivial, but there are standard approaches for the common domains
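The encoding steps can be sketched as follows. This is only an illustration: the helper names `encode_categories` and `one_vs_all_labels` are made up here, and “phone” is a hypothetical third category added next to the slide's “book” and “car”.

```python
def encode_categories(values):
    """Map each distinct categorical value to an integer code, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes

def one_vs_all_labels(labels, positive_class):
    """Binary targets for one one-vs-all classifier: 1 for the chosen class, 0 for the rest."""
    return [1 if label == positive_class else 0 for label in labels]

labels = ["book", "car", "book", "phone"]
encoded, codes = encode_categories(labels)
# One binary target vector per class, each to be learned by its own binary classifier.
binary_targets = {c: one_vs_all_labels(labels, c) for c in codes}
```

With three classes this yields three binary problems; a new example would be assigned to the class whose classifier is most confident.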


Example of Feature Vector Construction

One of the early successes was learning spam filters

Spam classification example:

Each mail is an input; some mails are flagged as spam or not spam to create training examples.

Feature vector: Encode the existence of a fixed set of relevant key words in each mail as the feature vector.

x_i = words_i =

Feature       Exists?
“Customer”    1 (Yes)
“Dollar”      0 (No)
“Nigeria”
“Accept”      1
“Bank”
…             …

y_i = 1 (spam) or 0 (not spam)

Simply learn f(x) = y using a suitable classifier!
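A minimal sketch of this keyword encoding is shown below. The keyword list follows the slide's table, while the function name `keyword_features`, the whole-word matching rule and the sample mail are assumptions made for illustration.

```python
import re

KEYWORDS = ["customer", "dollar", "nigeria", "accept", "bank"]  # from the slide's table

def keyword_features(mail_text, keywords=KEYWORDS):
    """Return a 0/1 vector: 1 if the keyword occurs as a whole word in the mail."""
    words = set(re.findall(r"[a-z]+", mail_text.lower()))
    return [1 if kw in words else 0 for kw in keywords]

# Hypothetical mail text.
x = keyword_features("Dear customer, please accept this offer")
```

Each mail becomes a fixed-length vector of zeros and ones, so any classifier that works on numeric inputs can be trained on it.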

Simple Linear Classification Example

I. Construct a feature vector x_i to be used with examples of y_i
II. Select algorithm and train on training data by searching for a good approximation to the unknown function

Fictional example: A learning smartphone app that determines if silent mode should be on or off based on background noise, light level and observed user behavior.

Feature vector x_i = (Light level, Noise), y_i ∈ {“silent on”, “silent off”}

  • Select the family of linear discriminant functions
  • Train the algorithm by searching for a line that separates the classes well
  • New cases will be classified according to which side they fall
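One way to search for such a line is the perceptron rule, used here only as an illustrative stand-in since the slides do not say which training method is meant. The training examples, the learning rate and the encoding 1 = “silent on”, 0 = “silent off” are hypothetical.

```python
def predict(w, x):
    """Return 1 ("silent on") if x lies on the positive side of the line, else 0."""
    return 1 if w[0] + w[1] * x[0] + w[2] * x[1] > 0 else 0

def train_perceptron(examples, epochs=1000, rate=0.1):
    """Search for a separating line w0 + w1*light + w2*noise = 0 with the perceptron rule."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            error = y - predict(w, x)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                # Nudge the line towards classifying this example correctly.
                w[0] += rate * error
                w[1] += rate * error * x[0]
                w[2] += rate * error * x[1]
        if mistakes == 0:  # every training example is on the correct side
            break
    return w

# Hypothetical (light level, noise) readings: dark and quiet means "silent on".
examples = [((0.1, 0.2), 1), ((0.2, 0.1), 1), ((0.3, 0.3), 1),
            ((0.8, 0.9), 0), ((0.9, 0.7), 0), ((0.7, 0.8), 0)]
w = train_perceptron(examples)
```

A new reading is then classified with `predict(w, (light, noise))`, that is, by which side of the learned line it falls on.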


Simple Linear Regression Example

I. Construct a feature vector x_i to be used with examples of y_i
II. Select algorithm and train on training set by searching for a good approximation to the unknown function

Fictional example: Same smartphone app but now we want to predict the ring volume based on noise level (only)

Feature vector x_i = (Noise), y_i = (Volume)

  • Select the family of linear functions
  • Train the algorithm by searching for a line that fits the data well

…but how does “training” really work?
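For the ring-volume example, a best-fitting line can be computed directly with the closed-form least-squares formulas for one input variable. This is a sketch of one possible method, not necessarily the one the lecture has in mind, and the noise and volume values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

noise = [1.0, 2.0, 3.0, 4.0]    # hypothetical noise levels
volume = [3.0, 5.0, 7.0, 9.0]   # hypothetical ring volumes, exactly 1 + 2 * noise
w0, w1 = fit_line(noise, volume)
```

The fitted line predicts a ring volume of `w0 + w1 * noise` for any new noise level.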


Training a learning algorithm...

  • Want to find approximation h(x) to the unknown function f(x)
  • As an example we select the hypothesis space to be the family of polynomials of degree one, that is linear functions: $h_w(x) = w_0 + w_1 x$
  • The hypothesis space has two parameters, w_0 and w_1
  • How do we find parameters that result in a good hypothesis h?

[Figure: three (poor) linear hypotheses. Feature vector x_i = (Noise), outputs y_i = (Volume)]


Training a Learning Algorithm – Loss Functions

How do we find parameters w that result in a good hypothesis h?

  • First we need to define “good”!
  • Define it mathematically as maximizing some function of how well the hypothesis fits the data
  • Often one instead minimizes a function of how poorly it fits, called a loss function
  • Search in continuous domains like w is known as optimization
  • (see Ch4.2 in course book AIMA)
  • For regression one common choice is a square loss function: $L(w) = \sum_{i=1}^{N} \left(y_i - h_w(x_i)\right)^2$
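As a small illustration, the square loss of a linear hypothesis h(x) = w0 + w1·x can be computed as the sum of squared differences between the targets and the predictions. The data points and the two candidate parameter settings are hypothetical.

```python
def square_loss(w, xs, ys):
    """Sum of squared differences between targets and predictions of h(x) = w[0] + w[1]*x."""
    return sum((y - (w[0] + w[1] * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                     # hypothetical data lying exactly on y = 2x
poor = square_loss((0.0, 1.0), xs, ys)   # candidate hypothesis h(x) = x
good = square_loss((0.0, 2.0), xs, ys)   # candidate hypothesis h(x) = 2x
```

A lower loss means a better fit, so training amounts to searching for the w with the smallest loss.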


Training a Learning Algorithm – Loss Functions II

  • What about classification?
  • Squared error does not make sense when the target output is in {0,1}
  • Custom loss functions for classification
  • Number of misclassifications (unsmooth w.r.t. parameter changes)
  • (inverse) information gain (used in decision trees, next lecture)
  • These require specialized parameter search methods
  • Alternative: Squash a numeric output to [0,1], then use square error!

Sigmoid functions allow us to use any regression method for classification. A common choice is the logistic function: $g(z) = \frac{1}{1 + e^{-z}}$
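A short sketch of the squashing idea: the logistic function maps any numeric output into the interval between 0 and 1, and a threshold turns it into a class label. The function names, the 0.5 threshold and the example weights are assumptions for illustration.

```python
import math

def logistic(z):
    """Squash any real number into the interval [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow in exp for very negative z.
    e = math.exp(z)
    return e / (1.0 + e)

def classify(w, x, threshold=0.5):
    """Turn the squashed output of a linear model w[0] + w[1]*x into a label in {0, 1}."""
    return 1 if logistic(w[0] + w[1] * x) >= threshold else 0

label_high = classify((-1.0, 2.0), 1.0)  # linear output is 1, squashed above 0.5
label_low = classify((-1.0, 2.0), 0.0)   # linear output is -1, squashed below 0.5
```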


Training a Learning Algorithm – Optimization

How do we find parameters w that minimize the loss?

  • Optimization approaches typically move in the direction that locally improves the objective (i.e. decreasing the loss function)
  • One simple and popular example is gradient descent:

Initialize w to some random point in the parameter space
loop until improvement in loss is small
    for each w_i in w do
        $w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} L(w)$

Where: L(w) is the loss and α is the step size (learning rate)
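The gradient descent loop can be sketched for the linear hypothesis h(x) = w0 + w1·x as follows. Assumptions in this sketch: the loss is the mean of the squared errors (which has the same minimizer as their sum), the start point is fixed at zero rather than random so the run is repeatable, and the step size, tolerance and data are hypothetical.

```python
def gradient_descent(xs, ys, rate=0.05, tolerance=1e-12, max_steps=100_000):
    """Fit h(x) = w0 + w1*x by gradient descent on the mean square loss."""
    w0, w1 = 0.0, 0.0  # fixed start instead of a random one, for repeatability
    n = len(xs)
    loss = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / n
    for _ in range(max_steps):
        # Partial derivatives of the loss with respect to w0 and w1.
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        grad_w0 = 2 * sum(errors) / n
        grad_w1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step against the gradient, i.e. downhill on the loss surface.
        w0 -= rate * grad_w0
        w1 -= rate * grad_w1
        new_loss = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / n
        if loss - new_loss < tolerance:  # improvement in loss is small: stop
            break
        loss = new_loss
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # hypothetical data lying exactly on y = 1 + 2x
w0, w1 = gradient_descent(xs, ys)
```

The step size matters: too large and the updates overshoot and diverge, too small and convergence is slow.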

Training a Learning Algorithm – Limitations

Limitations

  • Locally greedy optimization approaches are only guaranteed to find the best solution if the loss function is convex w.r.t. w, i.e. there is only one minimum
  • The loss of linear regression models is always convex, however more advanced models that we will look at later are vulnerable to getting stuck in local minima
  • Care should be taken when training such models by utilizing for example random restarts

[Figure: a non-convex loss function. Start positions in the red area will get stuck in a local minimum!]
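The effect of the start position, and the restart remedy, can be illustrated on a made-up one-parameter loss with one local and one global minimum. The loss function, step size and start positions are all hypothetical; real random restarts would sample the start positions at random instead of using a fixed list.

```python
def loss(w):
    """A hypothetical non-convex loss with one local and one global minimum."""
    return w ** 4 - 3 * w ** 2 + w

def gradient(w):
    """Derivative of the loss above."""
    return 4 * w ** 3 - 6 * w + 1

def descend(w, rate=0.01, steps=1000):
    """Plain gradient descent from the start position w."""
    for _ in range(steps):
        w -= rate * gradient(w)
    return w

starts = [-2.0, 2.0]                   # fixed here; random restarts would sample these
results = [descend(w) for w in starts]
best = min(results, key=loss)          # keep the restart that reached the lowest loss
```

The run started at 2.0 settles in the local minimum near w = 1.13, while the run started at -2.0 reaches the global minimum near w = -1.30, which is the one kept in `best`.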


Linear Models in Summary

Advantages

  • Linear algorithms are simple and computationally efficient
  • Training them is a convex problem, so one is guaranteed to find the best hypothesis in the hypothesis space

Disadvantages

  • The hypothesis space is very restricted, it cannot handle non-linear relations well

They are widely used in applications

  • Recommender Systems – Initial Netflix Cinematch was a linear regression, before their $1 million competition to improve it
  • At the core of the recent Google Gmail Priority feature is a linear classifier
  • And many more…


Beyond Linear Models – Artificial Neural Networks

  • One non-linear parametric model that has captivated people for decades is Artificial Neural Networks (ANNs)
  • These draw upon inspiration from the physical structure of the brain as an interconnected network of neurons, represented by non-linear “activation functions” of the inputs

[Figure: two panels, The Neuron and The Network]


Artificial Neural Networks – The Neuron

  • In univariate (one input) linear regression we used the model: $h_w(x) = w_0 + w_1 x$
  • Each neuron in an ANN is a linear model of all the inputs passed through a non-linear activation function g
  • The activation function is typically the sigmoid
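A single neuron can be sketched as a weighted sum of its inputs plus a bias term, passed through the sigmoid. The function names and the example inputs and weights are hypothetical, and the bias plays the role of w_0 in the linear model.

```python
import math

def sigmoid(z):
    """Logistic activation function g."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of all inputs plus a bias, passed through the activation function g."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(z)

# Hypothetical inputs and weights; here the weighted sum is exactly zero.
out = neuron([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```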


Artificial Neural Networks – The Neuron II

  • However, there is not just one neuron, but a network of neurons!
  • Each neuron may operate on inputs from other neurons and in turn pass its output on to other neurons
  • We rewrite our neuron definition using a_i for the input, a_j for the output and w_{i,j} for the weight parameters: $a_j = g\left(\sum_i w_{i,j} a_i\right)$


Artificial Neural Networks – The Network

  • Networks are composed of layers
  • All neurons in a layer are typically connected to all neurons in the next layer, but not to each other
  • A two layer network (one hidden layer) with a sufficient number of neurons is good enough for most problems
  • Expanding the output of a second layer neuron (5) we get
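The expansion can be illustrated with a forward pass through a tiny network. The layout is an assumption, since the slide's figure is not available: two inputs numbered 1 and 2, two hidden neurons 3 and 4, and one output neuron 5, with bias weights left out for brevity. The weight values are hypothetical.

```python
import math

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(a1, a2, w):
    """Output a5 of a network with inputs 1, 2, hidden neurons 3, 4 and output neuron 5."""
    a3 = g(w[(1, 3)] * a1 + w[(2, 3)] * a2)
    a4 = g(w[(1, 4)] * a1 + w[(2, 4)] * a2)
    # The output neuron applies g to a weighted sum of the hidden outputs, so a5 is
    # a nested expression g(w35 * g(...) + w45 * g(...)) of the two inputs.
    a5 = g(w[(3, 5)] * a3 + w[(4, 5)] * a4)
    return a5

# Hypothetical weights w[(i, j)] on the connection from neuron i to neuron j.
w = {(1, 3): 0.0, (2, 3): 0.0, (1, 4): 0.0, (2, 4): 0.0, (3, 5): 1.0, (4, 5): -1.0}
a5 = forward(1.0, 1.0, w)
```

With these weights both hidden neurons output 0.5, the two contributions cancel at neuron 5, and the network output is 0.5.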


Artificial Neural Networks – Training

  • How do we train an ANN to find the best parameters w_{i,j}?
  • By optimization, minimizing a loss function like before!
  • We can compute a loss function (errors) for outputs of the last layer against the training examples
  • A classical approach is backpropagation which propagates the errors backwards to earlier layers and applies gradient descent

[Figure: training cycle. Predict output on training set → Compute errors w.r.t. a loss function → Propagate errors backwards and adjust weights in all layers]
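A minimal sketch of one backpropagation step, under several assumptions not stated in the slides: a network with two inputs, two hidden neurons (3 and 4) and one output neuron (5), sigmoid activations, no bias weights, and the loss 0.5·(a5 − y)² on a single training example. The starting weights, learning rate and example are hypothetical.

```python
import math

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    """Activations of hidden neurons 3, 4 and output neuron 5 for inputs x = (a1, a2)."""
    a3 = g(w["13"] * x[0] + w["23"] * x[1])
    a4 = g(w["14"] * x[0] + w["24"] * x[1])
    a5 = g(w["35"] * a3 + w["45"] * a4)
    return a3, a4, a5

def backprop_step(w, x, y, rate=0.5):
    """One gradient-descent step on the loss 0.5 * (a5 - y)**2 for a single example."""
    a3, a4, a5 = forward(w, x)
    # Error signal at the output: loss derivative times the sigmoid derivative a5 * (1 - a5).
    delta5 = (a5 - y) * a5 * (1 - a5)
    # Propagate the error backwards through the weights to the hidden neurons.
    delta3 = delta5 * w["35"] * a3 * (1 - a3)
    delta4 = delta5 * w["45"] * a4 * (1 - a4)
    # Adjust every weight against its own partial derivative.
    return {
        "35": w["35"] - rate * delta5 * a3,
        "45": w["45"] - rate * delta5 * a4,
        "13": w["13"] - rate * delta3 * x[0],
        "23": w["23"] - rate * delta3 * x[1],
        "14": w["14"] - rate * delta4 * x[0],
        "24": w["24"] - rate * delta4 * x[1],
    }

w = {k: 0.5 for k in ("13", "23", "14", "24", "35", "45")}  # hypothetical starting weights
x, y = (1.0, 1.0), 1.0                                      # hypothetical training example
loss_before = 0.5 * (forward(w, x)[2] - y) ** 2
for _ in range(100):
    w = backprop_step(w, x, y)
loss_after = 0.5 * (forward(w, x)[2] - y) ** 2
```

Training on a full data set would repeat this step over all examples, usually for many passes.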


Deep Learning

  • ANNs (among others) can also be layered to capture abstractions
  • Some tasks are particularly well suited to this, like vision, text and speech recognition
  • Already used by Google, MSFT etc.
  • Training these requires a lot of care, often in combination with unsupervised learning methods

[Figure: layers of increasing abstraction, from Edges to Facial parts to Faces (Honglak Lee, 2009)]

Artificial Neural Networks – Summary

Advantages

  • Very large hypothesis space, under some conditions it is a universal approximator to any function f(x)
  • Some biological justification
  • Has been moderately popular in applications ever since the 1990s
  • Can be layered to capture some abstraction (deep learning)

Disadvantages

  • Training is a non-convex problem with many local minima
  • Has many tuning parameters to twiddle with (number of neurons, layers, starting weights…)
  • Very difficult to interpret or debug weights in the network
  • It is quite simplified compared to biological neural networks


Overfitting

  • Models can overfit if they have too many parameters in relation to the training set size
  • Example: 9th degree polynomial regression model on 15 data points:

[Figure: Green: True function (unknown). Blue: Training examples (noisy!). Red: Trained model. (Bishop, 2006)]

  • This is not a local minimum during training, it is the best fit possible on the given training examples!
  • The trained model captured noise, small independent variations that will change each time we get a new example
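Overfitting can be reproduced in a few lines. This sketch departs from the slide's example: instead of a 9th degree least-squares fit on 15 points it uses the most extreme case, a degree-5 polynomial that passes exactly through 6 noisy points. The "true" function, the noise values and all names are hypothetical.

```python
def interpolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1 through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)  # Lagrange basis factor
        total += term
    return total

def true_function(x):
    return 2.0 * x  # hypothetical "unknown" function

train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]   # hypothetical measurement noise
train_y = [true_function(x) + e for x, e in zip(train_x, noise)]

# The model reproduces every training example exactly ...
train_error = sum((interpolate(train_x, train_y, x) - y) ** 2
                  for x, y in zip(train_x, train_y))
# ... but is far off on new inputs that lie between them.
new_x = [0.5, 1.5, 2.5, 3.5, 4.5]
new_error = sum((interpolate(train_x, train_y, x) - true_function(x)) ** 2
                for x in new_x) / len(new_x)
```

The perfect training fit comes from modelling the noise, which is exactly why the error on new inputs is large.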

Overfitting – Where Does the Noise Come From?

  • Noise is small variations in the data due to unknown variables that cannot be represented by the given inputs
  • Example: Predict the temperature based on season, time-of-day and cloud cover. What about atmospheric changes like a cold front? As they are neither included in the model nor completely tied to the other variables, this variation will show up as random noise in the model!
  • With few data points to go on, the model may mistake the effect of such variables as coming from variables that are included.
  • Since this relationship is merely chance, the model will not generalize well to future situations


Model Selection – Choosing Between Models

  • In conclusion, we want to avoid unnecessarily complex models
  • This is a fairly general concept throughout science and is often referred to as Ockham’s Razor:

“Pluralitas non est ponenda sine necessitate” (Plurality should not be posited without necessity)
- William of Ockham

“Everything should be kept as simple as possible, but no simpler.”
- Albert Einstein (paraphrased)

  • There are several mathematically principled ways to penalize model complexity during training
  • However, the most straightforward way is to keep a separate validation set of known examples that is only used for evaluating the performance of different models and not for training them.


Model Selection – Hold-out Validation

  • This is called a hold-out validation set as we keep the data away from the training phase
  • Measuring performance on such a validation set is a better estimate of actual generalization error to unseen examples
  • With the validation set we can compare models of different complexity to select the one which generalizes best.
  • Examples could be polynomial models of different order, the number of neurons or layers in an ANN etc.

[Figure: the given example data split into a Training Set and a Validation Set]
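Hold-out validation can be sketched by fitting models of different complexity on the training set and comparing their errors on the held-out set. Everything concrete here is a hypothetical illustration: the three candidate models (a constant, a line, and a polynomial through every training point), the data, and the simplification that the validation targets are noise-free.

```python
def fit_constant(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: (my - w1 * mx) + w1 * x

def fit_interpolant(xs, ys):
    def predict(x):  # polynomial of degree len(xs) - 1 through every training point
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mean_square_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: y = 2x plus alternating noise for training, held-out points for validation.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 1.5, 4.5, 5.5, 8.5, 9.5]
val_x = [0.5, 1.5, 2.5, 3.5, 4.5]
val_y = [1.0, 3.0, 5.0, 7.0, 9.0]

candidates = {"constant": fit_constant, "line": fit_line, "interpolant": fit_interpolant}
models = {name: fit(train_x, train_y) for name, fit in candidates.items()}  # train on training set only
train_errors = {name: mean_square_error(m, train_x, train_y) for name, m in models.items()}
val_errors = {name: mean_square_error(m, val_x, val_y) for name, m in models.items()}
best_on_training = min(train_errors, key=train_errors.get)
best_on_validation = min(val_errors, key=val_errors.get)
```

Judged on training error the most complex model wins, while the validation set picks the line, which is the model that actually generalizes best on this data.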


Model Selection – Selection Strategy

  • As the number of parameters increases the size of the hypothesis space increases accordingly, allowing a better fit to training data
  • However, at some point the model is sufficiently flexible to capture the underlying patterns, and any more flexibility will just capture noise, leading to worse generalization to new examples!

[Figure: Red: Validation Set Error. Blue: Training Set Error. Marked on the plot: Best choice, Overfit. (Hastie et al., 2009)]

Thank you for listening!
