Introduction to Machine Learning Part I. Michèle Sebag, TAO

SLIDE 1

Introduction to Machine Learning Part I.

Michèle Sebag, TAO: Thème Apprentissage & Optimisation

http://tao.lri.fr/tiki-index.php

Sept 4th, 2012

SLIDE 2

Overview

Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator

SLIDE 3

Examples

◮ Vision
◮ Control
◮ Netflix
◮ Spam
◮ Playing Go
◮ Google

http://ai.stanford.edu/~ang/courses.html

SLIDE 4

Reading cheques

LeCun et al. 1990

SLIDE 5

MNIST: The drosophila of ML

Classification

SLIDE 6

Detecting faces

SLIDE 7

The 2005-2012 Visual Object Challenges

  • A. Zisserman, C. Williams, M. Everingham, L. v.d. Gool
SLIDE 8

The supervised learning setting

Input: set of (x, y)

◮ An instance x

e.g. a set of pixels, x ∈ ℝ^D

◮ A label y in {1, −1} or {1, . . . , K} or ℝ

SLIDE 9

The supervised learning setting

Input: set of (x, y)

◮ An instance x

e.g. a set of pixels, x ∈ ℝ^D

◮ A label y in {1, −1} or {1, . . . , K} or ℝ

Pattern recognition

◮ Classification

Does the image contain the target concept? h : {Images} → {1, −1}

◮ Detection: does the pixel belong to the image of the target concept?

h : {Pixels in an image} → {1, −1}

◮ Segmentation

Find contours of all instances of target concept in image

SLIDE 10

The 2005 Darpa Challenge

Thrun, Burgard and Fox 2005

Autonomous vehicle Stanley − Terrains

SLIDE 11

The Darpa challenge and the AI agenda

What remains to be done

Thrun 2005

◮ Reasoning: 10%
◮ Dialogue: 60%
◮ Perception: 90%

SLIDE 12

Robots

Ng, Russell, Veloso, Abbeel, Peters, Schaal, ...

Reinforcement learning Classification

SLIDE 13

Robots, 2

Toussaint et al. 2010 (a) Factor graph modelling the variable interactions (b) Behaviour of the 39-DOF Humanoid: Reaching goal under Balance and Collision constraints

Bayesian Inference for Motion Control and Planning

SLIDE 14

Go as AI Challenge

Gelly & Wang 07; Teytaud et al. 2008-2011

Reinforcement Learning, Monte-Carlo Tree Search

SLIDE 15

Energy policy

Claim: many problems can be phrased as optimization under uncertainty.

◮ Adversarial setting: a two-player game
◮ Uniform setting: a single-player game

Management of energy stocks under uncertainty

SLIDE 16

States and Decisions

States

◮ Amount of stock (60 nuclear, 20 hydro.)
◮ Varying: price, weather (random draw or archive)

Decisions

◮ Decision: release water from one reservoir to another
◮ Assessment: meet the demand, otherwise buy energy

[Diagram: Reservoirs 1 to 4, lost water, nuclear plant, demand, price]

SLIDE 17

Netflix Challenge 2007-2008

Collaborative Filtering

SLIDE 18

Collaborative filtering

Input

◮ A set of users: n_u ≈ 500,000

◮ A set of movies: n_m ≈ 18,000

◮ An n_m × n_u matrix: (person, movie, rating); very sparse (ca. 1% filled)

Output

◮ Filling the matrix!

SLIDE 19

Collaborative filtering

Input

◮ A set of users: n_u ≈ 500,000

◮ A set of movies: n_m ≈ 18,000

◮ An n_m × n_u matrix: (person, movie, rating); very sparse (ca. 1% filled)

Output

◮ Filling the matrix!

Criterion

◮ (relative) mean square error
◮ ranking error
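As a quick illustration of the mean square error criterion, a minimal sketch (the ratings and predictions below are invented; only the observed cells of the sparse matrix are scored):

```python
import math

# Hypothetical observed ratings and model predictions, keyed by
# (movie, user): only a small fraction of the full n_m x n_u matrix is known.
observed = {(0, 0): 5.0, (0, 2): 3.0, (1, 1): 4.0, (2, 0): 1.0}
predicted = {(0, 0): 4.5, (0, 2): 3.5, (1, 1): 4.0, (2, 0): 2.0}

def rmse(observed, predicted):
    """Root mean square error over the observed cells only."""
    errs = [(observed[k] - predicted[k]) ** 2 for k in observed]
    return math.sqrt(sum(errs) / len(errs))

print(rmse(observed, predicted))  # sqrt(1.5 / 4) ~ 0.612
```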

SLIDE 20

Spam − Phishing − Scam

Classification, Outlier detection

SLIDE 21

The power of big data

◮ Now-casting

  • Outbreak of flu

◮ Public relations >> Advertising

SLIDE 22

McLuhan and Google

We shape our tools and afterwards our tools shape us

Marshall McLuhan, 1964

First time ever a tool is observed to modify human cognition that fast.

Sparrow et al., Science 2011

SLIDE 23

Overview

Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator

SLIDE 24

Where we are

  • Ast. series

[Diagram "Pierre de Rosette" (Rosetta Stone): Maths ↔ World; Data / Principles; modelling of natural phenomena vs human-related phenomena; "You are here"; Common Sense]

SLIDE 25

WHERE WE ARE

  • Sc. data

[Diagram: Maths ↔ World; Data / Principles; modelling of natural phenomena vs human-related phenomena; "You are here"; Common Sense]

SLIDE 26

Types of Machine Learning problems

WORLD − DATA − USER

◮ Observations → understand / code: Unsupervised LEARNING
◮ + Target → predict (classification / regression): Supervised LEARNING
◮ + Rewards → decide (policy): Reinforcement LEARNING

SLIDE 27

Data

Example

◮ row: an example / case
◮ column: a feature / variable / attribute
◮ one attribute: the class / label

Instance space X

◮ Propositional: X ≡ ℝ^d
◮ Structured: sequential, spatio-temporal, relational (e.g. amino acids)

SLIDE 28

Supervised Learning, notations

Context: World → instance x_i → Oracle → label y_i

INPUT: E = {(x_i, y_i), x_i ∈ X, y_i ∈ {0, 1}, i = 1 . . . n}, drawn from P(x, y)

HYPOTHESIS SPACE H: h : X → {0, 1}

LOSS FUNCTION: ℓ : Y × Y → ℝ

OUTPUT: h* = arg max {score(h), h ∈ H}

SLIDE 29

Classification and criteria

Generalization error: Err(h) = E[ℓ(y, h(x))] = ∫ ℓ(y, h(x)) dP(x, y)

Empirical error: Err_e(h) = (1/n) Σ_{i=1}^{n} ℓ(y_i, h(x_i))

Structural risk bound: Err(h) < Err_e(h) + F(n, d(H)), where d(H) is the Vapnik-Chervonenkis dimension of H (see later)
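The empirical error is just an average loss over the sample; a minimal sketch with the 0/1 loss (the hypothesis and examples below are invented):

```python
def zero_one_loss(y, y_pred):
    """l(y, h(x)) = 1 if the prediction is wrong, 0 otherwise."""
    return 0 if y == y_pred else 1

def empirical_error(h, sample):
    """Err_e(h) = (1/n) * sum_i l(y_i, h(x_i)) over a finite sample E."""
    return sum(zero_one_loss(y, h(x)) for x, y in sample) / len(sample)

# Toy hypothesis on scalar instances: predict class 1 when x > 0.
h = lambda x: 1 if x > 0 else 0
E = [(0.5, 1), (-1.2, 0), (2.0, 0), (-0.3, 1)]  # 2 of 4 misclassified
print(empirical_error(h, E))  # -> 0.5
```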

SLIDE 30

The Bias-Variance Trade-off

Bias(H): error of the best hypothesis h* of H
Variance: variance of h_n as a function of the sample E

[Diagram: function space; hypothesis space H contains h*; bias = distance from H to the target concept; variance = spread of the learned h around h*]

SLIDE 31

The Bias-Variance Trade-off

Bias(H): error of the best hypothesis h* of H
Variance: variance of h_n as a function of the sample E

[Diagram: function space; hypothesis space H contains h*; bias = distance from H to the target concept; variance = spread of the learned h around h*]

Overfitting

[Plot: training error decreases with the complexity of H, while test error first decreases then increases]

SLIDE 32

Key notions

◮ The main issue in supervised learning is overfitting.
◮ How to tackle overfitting:

◮ Before learning: use a sound criterion (regularization)

◮ After learning: cross-validation

Case studies

Summary

◮ Learning is a search problem ◮ What is the space ? What are the navigation operators ?

SLIDE 33

Hypothesis Spaces

Logical spaces: Concept ← literals, conditions

◮ Conditions: [color = blue]; [age < 18]
◮ A condition is f : X → {True, False}
◮ Find: a disjunction of conjunctions of conditions
◮ Ex: (unions of) rectangles of the 2D plane X

SLIDE 34

Hypothesis Spaces

Numerical spaces: Concept = (h(·) > 0)

◮ h(x): a polynomial, a neural network, . . .
◮ h : X → ℝ
◮ Find: the (structure and) parameters of h

SLIDE 35

Hypothesis Space H

Logical Space

◮ h covers an example x iff h(x) = True.
◮ H is structured by a partial order relation:

h ≺ h′ iff ∀x, h(x) → h′(x)

Numerical Space H

◮ h(x) is a real value (more or less far from 0)
◮ we can define ℓ(h(x), y)
◮ H is structured by a partial order relation:

h ≺ h′ iff E[ℓ(h(x), y)] < E[ℓ(h′(x), y)]

SLIDE 36

Hypothesis Space H / Navigation

H / Operators

Version Space: logical; specialisation / generalisation
Decision Trees: logical; specialisation
Neural Networks: numerical; gradient
Support Vector Machines: numerical; quadratic optimization
Ensemble Methods: adaptation of the sample E

This course

◮ Decision Trees
◮ Support Vector Machines
◮ Ensemble methods

SLIDE 37

Overview

Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator

SLIDE 38

Decision Trees

C4.5 (Quinlan 86)

◮ Among the most widely used algorithms
◮ Easy to understand, to implement, to use, and cheap in CPU time
◮ J48 (Weka), SciKit

[Diagram: example decision tree on Age (≥ 55 / < 55), Smoker, Sport, Tension, Diabetes, with leaves labelled RISK NORMAL / HIGH / LOW / PATH.]

SLIDE 39

Decision Trees

SLIDE 40

Decision Trees (2)

Procedure DecisionTree(E)

1. Assume E = {(x_i, y_i), i = 1 . . . n, x_i ∈ ℝ^D, y_i ∈ {0, 1}}
   • If E is single-class (i.e., ∀i, j ∈ [1, n], y_i = y_j), return
   • If n is too small (i.e., below a threshold), return
   • Else, find the most informative attribute att
2. For each value val of att:
   • Set E_val = E ∩ [att = val]
   • Call DecisionTree(E_val)

Criterion: information gain

p = Pr(Class = 1 | att = val)
I([att = val]) = −p log p − (1 − p) log (1 − p)
I(att) = Σ_i Pr(att = val_i) · I([att = val_i])
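The information-gain criterion above can be sketched in a few lines (a base-2 logarithm is chosen here as an assumption; the toy column is invented):

```python
import math
from collections import Counter

def qi(labels):
    """Quantity of information I = -p log p - (1 - p) log (1 - p),
    generalized to any label multiset (Shannon entropy, base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_qi(att_values, labels):
    """I(att) = sum_i Pr(att = val_i) * I([att = val_i])."""
    n = len(labels)
    total = 0.0
    for v in set(att_values):
        subset = [y for a, y in zip(att_values, labels) if a == v]
        total += (len(subset) / n) * qi(subset)
    return total

# Toy attribute that separates the two classes perfectly:
att = ["a", "a", "b", "b"]
ys = [1, 1, 0, 0]
gain = qi(ys) - expected_qi(att, ys)
print(gain)  # prior entropy 1 bit, post-split 0 -> gain 1.0
```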
SLIDE 41

Decision Trees (3)

Contingency Table and Quantity of Information (QI)

[Plot: QI = −p log p − (1 − p) log(1 − p) as a function of p, maximal at p = 0.5]

Computation:

value    p(value)  p(poor | value)  QI(value)  p(value) × QI(value)
[0,10[   0.051     0.999            0.00924    0.000474
[10,20[  0.25      0.938            0.232      0.0570323
[20,30[  0.26      0.732            0.581      0.153715

SLIDE 42

Decision Trees (4)

Limitations

◮ XOR-like attributes ◮ Attributes with many values ◮ Numerical attributes ◮ Overfitting

SLIDE 43

Limitations

Numerical Attributes

◮ Order the values val_1 < . . . < val_t
◮ Compute QI([att < val_i])
◮ QI(att) = max_i QI([att < val_i])

The XOR case

Bias the distribution of the examples
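The numerical-attribute recipe above (order the values, score each candidate cut [att < val_i], keep the most informative) can be sketched as follows (toy data; base-2 entropy assumed):

```python
import math
from collections import Counter

def qi(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (threshold, gain) of the most informative cut [att < thr]."""
    pairs = sorted(zip(values, labels))
    n, prior = len(pairs), qi(labels)
    best_thr, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut between two equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = prior - (len(left) / n) * qi(left) - (len(right) / n) * qi(right)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

print(best_cut([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # -> (2.5, 1.0)
```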

SLIDE 44

Complexity

Quantity of information of an attribute: O(n ln n)

Adding a node: O(D × n ln n)

SLIDE 45

Tackling Overfitting

Penalize the selection of an already used variable

◮ Limits the tree depth.

Do not split subsets below a given minimal size

◮ Limits the tree depth.

Pruning

◮ Each leaf corresponds to one conjunction;
◮ Generalization by pruning literals;
◮ Greedy optimization, QI criterion.

SLIDE 46

Decision Trees, Summary

Still around after all these years

◮ Robust against noise and irrelevant attributes ◮ Good results, both in quality and complexity

Random Forests

Breiman 00

SLIDE 47

Overview

Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator

SLIDE 48

Validation issues

  • 1. What is the result ?
  • 2. My results look good. Are they ?
  • 3. Does my system outperform yours ?
  • 4. How to set up my system ?
SLIDE 49

Validation: Three questions

Define a good indicator of quality

◮ Misclassification cost ◮ Area under the ROC curve

Computing an estimate thereof

◮ Validation set ◮ Cross-Validation ◮ Leave one out ◮ Bootstrap

Compare estimates: Tests and confidence levels

SLIDE 50

Which indicator, which estimate: depends.

Settings

◮ Large/few data

Data distribution

◮ Dependent/independent examples ◮ balanced/imbalanced classes

SLIDE 51

Overview

Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator

SLIDE 52

Performance indicators

Binary class

◮ h*: the truth
◮ ĥ: the learned hypothesis

Confusion matrix:

ĥ \ h*    1      −1
1          a      b      a + b
−1         c      d      c + d
           a + c  b + d  a + b + c + d

SLIDE 53

Performance indicators, 2

ĥ \ h*    1      −1
1          a      b      a + b
−1         c      d      c + d
           a + c  b + d  a + b + c + d

◮ Misclassification rate: (b + c) / (a + b + c + d)
◮ Sensitivity, true positive rate (TPR): a / (a + c)
◮ False positive rate (FPR): b / (b + d); specificity = d / (b + d)
◮ Recall: a / (a + c)
◮ Precision: a / (a + b)

Note: always compare to random guessing / baseline alg.
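The indicators above, computed directly from the four cells of the confusion matrix (the cell counts below are invented):

```python
def indicators(a, b, c, d):
    """a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    n = a + b + c + d
    return {
        "misclassification rate": (b + c) / n,
        "TPR / sensitivity / recall": a / (a + c),
        "FPR": b / (b + d),
        "precision": a / (a + b),
    }

m = indicators(a=40, b=10, c=5, d=45)
print(m["misclassification rate"])  # -> 0.15
```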

SLIDE 54

Performance indicators, 3

The Area under the ROC curve

◮ ROC: Receiver Operating Characteristics ◮ Origin: Signal Processing, Medicine

Principle: h : X → ℝ; h(x) measures the risk of patient x; h induces an ordering of the examples:

+ + + − + − + + + + − − − + − − − + − − − − − − − − − − −−

SLIDE 55

Performance indicators, 3

The Area under the ROC curve

◮ ROC: Receiver Operating Characteristics ◮ Origin: Signal Processing, Medicine

Principle: h : X → ℝ; h(x) measures the risk of patient x; h induces an ordering of the examples:

+ + + − + − + + + + − − − + − − − + − − − − − − − − − − −−

Given a threshold θ, h yields a classifier: Yes iff h(x) > θ.

+ + + − + − + + ++ | − − − + − − − + − − − − − − − − − − −−

Here, TPR(θ) = .8; FPR(θ) = .1

SLIDE 56

ROC

SLIDE 57

The ROC curve

θ → ℝ²: M(θ) = (FPR(θ), TPR(θ)). Ideal classifier: (0 false positive rate, 1 true positive rate). Diagonal (true positive rate = false positive rate) ≡ nothing learned.

SLIDE 58

ROC Curve, Properties

Properties: the ROC curve depicts the trade-off between true positives and false negatives. Standard: misclassification cost (Domingos, KDD 99): Error = #false positives + c × #false negatives. In a multi-objective perspective, the ROC curve is a Pareto front. Best solution: intersection of the Pareto front with the line of direction ∆(−c, −1).

SLIDE 59

ROC Curve, Properties, foll’d

Used to compare learners

Bradley 97

◮ multi-objective-like
◮ insensitive to imbalanced distributions
◮ shows sensitivity to error cost

SLIDE 60

Area Under the ROC Curve

Often used to select a learner Don’t ever do this !

Hand, 09

Sometimes used as learning criterion

Mann Whitney Wilcoxon

AUC = Pr(h(x) > h(x′) | y > y′)

WHY

Rosset, 04

◮ More stable: O(n²) constraints vs O(n)
◮ With a probabilistic interpretation

Clemen¸ con et al. 08

HOW

◮ SVM-Ranking

Joachims 05; Usunier et al. 08, 09

◮ Stochastic optimization
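The Mann-Whitney-Wilcoxon formulation of the AUC can be checked numerically by counting the correctly ordered (positive, negative) pairs (the scores and labels below are invented; ties are counted as 1/2):

```python
def auc(scores, labels):
    """AUC = Pr(h(x) > h(x')) for a random positive x and negative x'."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# One misranked (positive, negative) pair out of 3 x 2 = 6 -> AUC = 5/6
print(auc([0.9, 0.8, 0.7, 0.6, 0.2], [1, 1, 0, 1, 0]))
```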

SLIDE 61

Overview

Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator

SLIDE 62

Validation, principle

Desired: performance on further instances

[Diagram: the WORLD provides further examples on which the quality of h would be measured; all we have is the Dataset]

Assumption: the Dataset is to the World what the Training set is to the Dataset.

[Diagram: within the DATASET, h is learned on the training set and its quality measured on test examples]

SLIDE 63

Validation, 2

[Diagram: DATASET split into a training set, which yields h given the learning parameters, and test examples, which yield perf(h)]

Unbiased Assessment of Learning Algorithms

  • T. Scheffer and R. Herbrich, 97
SLIDE 64

Validation, 2

[Diagram: DATASET split into a training set and test examples; the learning parameters are tuned on the test performance perf(h), yielding parameter*, h*, perf(h*)]

Unbiased Assessment of Learning Algorithms

  • T. Scheffer and R. Herbrich, 97
SLIDE 65

Validation, 2

[Diagram: DATASET split into a training set, test examples and a validation set; parameters tuned on the test performance yield parameter*, h*, perf(h*); the validation set measures the true performance]

Unbiased Assessment of Learning Algorithms

  • T. Scheffer and R. Herbrich, 97
SLIDE 66

Overview

Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator

SLIDE 67

Confidence intervals

Definition: given a random variable X on ℝ, a p%-confidence interval is I ⊂ ℝ such that Pr(X ∈ I) > p.

Binary variable with probability ε: the probability of r events out of n trials is

P_n(r) = (n! / (r! (n − r)!)) ε^r (1 − ε)^(n−r)

◮ Mean: nε
◮ Variance: σ² = nε(1 − ε)

Gaussian approximation: P(x) = (1 / √(2πσ²)) exp(−(1/2) ((x − µ)/σ)²)

SLIDE 68

Confidence intervals

Bounds on (true value x*, empirical value x̂_n) for n trials, n > 30:

Pr( |x̂_n − x*| > 1.96 √( x̂_n (1 − x̂_n) / n ) ) < .05

z−ε table:

z      .67   1.   1.28  1.64  1.96  2.33  2.58
ε (%)  50    32   20    10    5     2     1
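A sketch of the resulting recipe for an error-rate estimate (Gaussian approximation, n > 30; the numbers below are invented):

```python
import math

def confidence_interval(err_hat, n, z=1.96):
    """Interval err_hat +/- z * sqrt(err_hat * (1 - err_hat) / n);
    z = 1.96 gives the 95% level (epsilon = 5%)."""
    half = z * math.sqrt(err_hat * (1 - err_hat) / n)
    return err_hat - half, err_hat + half

lo, hi = confidence_interval(0.10, n=400)
print(round(lo, 3), round(hi, 3))  # -> 0.071 0.129
```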

SLIDE 69

Empirical estimates

When data abound (MNIST): split into Training / Test / Validation sets.

Cross-validation:

[Diagram: N-fold cross-validation; the data are split into N folds; run k learns h_k from all folds but fold k and tests it on fold k; Error = average error over the N runs]
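The N-fold splitting can be sketched as follows (index bookkeeping only; the learner itself is left out):

```python
def n_fold_indices(n_examples, n_folds):
    """Yield (train, test) index lists: each example lands in exactly one
    test fold; run k trains on the other N - 1 folds and tests on fold k."""
    folds = [list(range(i, n_examples, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, folds[k]

for train, test in n_fold_indices(6, 3):
    print(test)  # -> [0, 3] then [1, 4] then [2, 5]
```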

SLIDE 70

Empirical estimates, foll’d

Cross-validation → Leave-one-out

[Diagram: same folding, with one example per fold]

Leave-one-out: same as N-fold CV, with N = number of examples.

Properties: low bias; high variance; underestimates the error if the data are not independent.

SLIDE 71

Empirical estimates, foll’d

Bootstrap

[Diagram: the training set is drawn from the Dataset by uniform sampling with replacement; the test set is the rest of the examples]

Average the indicator over all (training set, test set) samplings.
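A sketch of one bootstrap sampling (uniform with replacement for training, the untouched rest for testing; the sizes below are invented):

```python
import random

def bootstrap_split(n_examples, seed=0):
    """Training set: n draws uniformly WITH replacement (repeats allowed);
    test set: the never-drawn rest (~36.8% of the data on average)."""
    rng = random.Random(seed)
    train = [rng.randrange(n_examples) for _ in range(n_examples)]
    test = sorted(set(range(n_examples)) - set(train))
    return train, test

train, test = bootstrap_split(1000)
print(len(train), len(test))  # test size is close to 1000 / e ~ 368
```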

SLIDE 72

Beware

Multiple hypothesis testing

◮ If you test many hypotheses on the same dataset ◮ one of them will appear confidently true...

More

◮ Tutorial slides:

http://www.lri.fr/~sebag/Slides/Validation Tutorial 11.pdf

◮ Video and slides (soon): ICML 2012 tutorial by Japkowicz & Shah, Videolectures: http://www.mohakshah.com/tutorials/icml2012/

SLIDE 73

Validation, summary

What is the performance criterion

◮ Cost function ◮ Account for class imbalance ◮ Account for data correlations

Assessing a result

◮ Compute confidence intervals ◮ Consider baselines ◮ Use a validation set

If the result looks too good, don’t believe it