SLIDE 1 Introduction to Machine Learning Part I.
Michèle Sebag, TAO: Thème Apprentissage & Optimisation
http://tao.lri.fr/tiki-index.php
Sept 4th, 2012
SLIDE 2
Overview
Examples
Introduction to Supervised Machine Learning
Decision trees
Empirical validation
Performance indicators
Estimating an indicator
SLIDE 3
Examples
◮ Vision
◮ Control
◮ Netflix
◮ Spam
◮ Playing Go
◮ Google
http://ai.stanford.edu/~ang/courses.html
SLIDE 4
Reading cheques
LeCun et al. 1990
SLIDE 5
MNIST: The drosophila of ML
Classification
SLIDE 6
Detecting faces
SLIDE 7 The 2005-2012 Visual Object Challenges
- A. Zisserman, C. Williams, M. Everingham, L. v.d. Gool
SLIDE 8
The supervised learning setting
Input: set of (x, y)
◮ An instance x, e.g. a set of pixels, x ∈ ℝ^D
◮ A label y in {1, −1}, {1, . . . , K}, or ℝ
SLIDE 9
The supervised learning setting
Input: set of (x, y)
◮ An instance x, e.g. a set of pixels, x ∈ ℝ^D
◮ A label y in {1, −1}, {1, . . . , K}, or ℝ
Pattern recognition
◮ Classification
Does the image contain the target concept? h : {Images} → {1, −1}
◮ Detection
Does the pixel belong to the image of the target concept?
h : {Pixels in an image} → {1, −1}
◮ Segmentation
Find contours of all instances of target concept in image
SLIDE 10
The 2005 Darpa Challenge
Thrun, Burgard and Fox 2005
Autonomous vehicle Stanley [images: terrains]
SLIDE 11
The Darpa challenge and the AI agenda
What remains to be done
Thrun 2005
◮ Reasoning: 10%
◮ Dialogue: 60%
◮ Perception: 90%
SLIDE 12
Robots
Ng, Russell, Veloso, Abbeel, Peters, Schaal, ...
Reinforcement learning Classification
SLIDE 13
Robots, 2
Toussaint et al. 2010
[Figure: (a) factor graph modelling the variable interactions; (b) behaviour of the 39-DOF humanoid reaching a goal under balance and collision constraints]
Bayesian Inference for Motion Control and Planning
SLIDE 14
Go as AI Challenge
Gelly Wang 07; Teytaud et al. 2008-2011
Reinforcement Learning, Monte-Carlo Tree Search
SLIDE 15
Energy policy
Claim
Many problems can be phrased as optimization under uncertainty:
◮ adversarial setting ≡ a two-player game
◮ uniform setting ≡ a single-player game
Example: management of energy stocks under uncertainty.
SLIDE 16 States and Decisions
States
◮ Amount of stock (60 nuclear, 20 hydro)
◮ Varying: price, weather (random draw or archive)
◮ Decision: release water from one reservoir to another
◮ Assessment: meet the demand; otherwise buy energy
[Diagram: reservoirs 1 to 4, hydro and nuclear plants, demand, price, lost water]
SLIDE 17
Netflix Challenge 2007-2008
Collaborative Filtering
SLIDE 18
Collaborative filtering
Input
◮ A set of n_u users, n_u ≈ 500,000
◮ A set of n_m movies, n_m ≈ 18,000
◮ An n_m × n_u matrix (person, movie, rating)
Very sparse matrix: ca. 1% filled
Output
◮ Filling the matrix!
SLIDE 19
Collaborative filtering
Input
◮ A set of n_u users, n_u ≈ 500,000
◮ A set of n_m movies, n_m ≈ 18,000
◮ An n_m × n_u matrix (person, movie, rating)
Very sparse matrix: ca. 1% filled
Output
◮ Filling the matrix!
Criterion
◮ (relative) mean square error
◮ ranking error
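As an illustration of how such a matrix can be filled, here is a minimal low-rank matrix factorization sketch, trained by stochastic gradient descent on the observed entries; the function, the rank and the hyper-parameters are illustrative assumptions, not the systems that won the challenge.

```python
import numpy as np

def factorize(ratings, n_users, n_movies, k=10, lr=0.01, reg=0.1, epochs=20):
    """Fill the sparse rating matrix with a rank-k model: rating(u, m) ~ P[u] . Q[m].
    `ratings` is a list of (user, movie, rating) triples (the observed ~1%)."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))    # user factors
    Q = 0.1 * rng.standard_normal((n_movies, k))   # movie factors
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - P[u] @ Q[m]                  # error on one observed entry
            pu = P[u].copy()
            P[u] += lr * (err * Q[m] - reg * P[u]) # SGD step on the squared error
            Q[m] += lr * (err * pu - reg * Q[m])
    return P, Q                                    # predicted rating: P[u] @ Q[m]
```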
SLIDE 20
Spam − Phishing − Scam
Classification, Outlier detection
SLIDE 21 The power of big data
◮ Now-casting
◮ Public relations ≫ advertising
SLIDE 22
McLuhan and Google
We shape our tools and afterwards our tools shape us
Marshall McLuhan, 1964
This is the first time a tool has been observed to modify human cognition so fast.
Sparrow et al., Science 2011
SLIDE 23
Overview
Examples
Introduction to Supervised Machine Learning
Decision trees
Empirical validation
Performance indicators
Estimating an indicator
SLIDE 24 Where we are
Rosetta Stone
[Diagram: Maths World, Data / Principles, Natural phenomena, Modelling, Human-related phenomena, "You are here", Common Sense]
SLIDE 25 WHERE WE ARE
[Diagram: Maths World, Data / Principles, Natural phenomena, Modelling, Human-related phenomena, "You are here", Common Sense]
SLIDE 26
Types of Machine Learning problems
WORLD − DATA − USER
◮ Observations, to understand / code: Unsupervised learning
◮ + Target, to predict (classification / regression): Supervised learning
◮ + Rewards, to decide (policy): Reinforcement learning
SLIDE 27
Data
Example
◮ row: an example / case
◮ column: a feature / variable / attribute
◮ one attribute: the class / label
Instance space X
◮ Propositional: X ≡ ℝ^d
◮ Structured: sequential, spatio-temporal, relational (e.g. amino-acid sequences)
SLIDE 28
Supervised Learning, notations
Context: World → instance x_i → Oracle → label y_i
INPUT: E = {(x_i, y_i), x_i ∈ X, y_i ∈ {0, 1}, i = 1 . . . n}, with (x_i, y_i) ∼ P(x, y)
HYPOTHESIS SPACE: H, with h : X → {0, 1}
LOSS FUNCTION: ℓ : Y × Y → ℝ
OUTPUT: h* = arg max {score(h), h ∈ H}
SLIDE 29 Classification and criteria
Generalization error: Err(h) = E[ℓ(y, h(x))]
Empirical error: Err_e(h) = (1/n) Σ_{i=1}^{n} ℓ(y_i, h(x_i))
Structural risk bound: Err(h) < Err_e(h) + F(n, d(H)),
where d(H) is the Vapnik-Chervonenkis dimension of H (see later).
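For concreteness, a minimal sketch of the empirical error with the 0/1 loss; the generalization error replaces this sample average by an expectation over P(x, y). The function name is illustrative.

```python
def empirical_error(h, E):
    """Err_e(h) = (1/n) * sum of the 0/1 loss over the sample E,
    with E a list of (x, y) pairs and h a callable classifier."""
    return sum(int(h(x) != y) for x, y in E) / len(E)
```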
SLIDE 30 The Bias-Variance Trade-off
Bias(H): error of the best hypothesis h* of H
Variance: variance of h_n as a function of the sample E
[Figure: function space, hypothesis space H; bias = distance from H to the target concept; variance = spread of the learned hypotheses h around h*]
SLIDE 31 The Bias-Variance Trade-off
Bias(H): error of the best hypothesis h* of H
Variance: variance of h_n as a function of the sample E
[Figure: function space, hypothesis space H; bias = distance from H to the target concept; variance = spread of the learned hypotheses h around h*]
Overfitting
[Plot: as the complexity of H grows, the training error keeps decreasing while the test error eventually increases]
SLIDE 32 Key notions
◮ The main issue in supervised learning is overfitting.
◮ How to tackle overfitting:
◮ Before learning: use a sound criterion (regularization)
◮ After learning: cross-validation
Case studies
Summary
◮ Learning is a search problem
◮ What is the search space? What are the navigation operators?
SLIDE 33
Hypothesis Spaces
Logical Spaces Concept ← Literal,Condition
◮ Conditions: [color = blue]; [age < 18]
◮ A condition is a function f : X → {True, False}
◮ Find: a disjunction of conjunctions of conditions
◮ Example: (unions of) rectangles of the 2D plane X
SLIDE 34
Hypothesis Spaces
Numerical Spaces Concept = (h() > 0)
◮ h(x) = polynomial, neural network, . . .
◮ h : X → ℝ
◮ Find: (structure and) parameters of h
SLIDE 35
Hypothesis Space H
Logical Space
◮ h covers an example x iff h(x) = True
◮ H is structured by a partial order relation:
h ≺ h′ iff ∀x, h(x) → h′(x)
Numerical Space H
◮ h(x) is a real value (more or less far from 0)
◮ we can define ℓ(h(x), y)
◮ H is structured by a partial order relation:
h ≺ h′ iff E[ℓ(h(x), y)] < E[ℓ(h′(x), y)]
SLIDE 36 Hypothesis Space H / Navigation
Within the hypothesis space H, given the sample E:
Version Space:            logical; specialisation / generalisation
Decision Trees:           logical; specialisation
Neural Networks:          numerical; gradient
Support Vector Machines:  numerical; quadratic optimization
Ensemble Methods:         adaptation of the sample E
This course:
◮ Decision Trees ◮ Support Vector Machines ◮ Ensemble methods
SLIDE 37
Overview
Examples
Introduction to Supervised Machine Learning
Decision trees
Empirical validation
Performance indicators
Estimating an indicator
SLIDE 38 Decision Trees
C4.5 (Quinlan 86)
◮ Among the most widely
used algorithms
◮ Easy
◮ to understand
◮ to implement
◮ to use
◮ and cheap in CPU time
◮ J48, Weka, SciKit
[Figure: example decision tree testing Age (≥ 55 / < 55), Smoker, Sport, Tension, Diabetes; leaves: NORMAL, RISK (high / low), PATH.]
SLIDE 39
Decision Trees
SLIDE 40 Decision Trees (2)
Procedure DecisionTree(E), with E = {(x_i, y_i), x_i ∈ ℝ^D, y_i ∈ {0, 1}, i = 1 . . . n}
1. If E is single-class (i.e., ∀i, j ∈ [1, n], y_i = y_j), return.
   If n is too small (i.e., below a threshold), return.
   Else, find the most informative attribute att.
2. For all values val of att:
   Set E_val = E ∩ [att = val].
   Call DecisionTree(E_val).
Criterion: information gain
p = Pr(Class = 1 | att = val)
I([att = val]) = −p log p − (1 − p) log(1 − p)
I(att) = Σ_i Pr(att = val_i) · I([att = val_i])
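A minimal sketch of this criterion; representing an example's attributes as a Python dict is an assumption made for illustration.

```python
import math

def qi(p):
    """I(p) = -p log p - (1 - p) log(1 - p), with 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def qi_attribute(E, att):
    """I(att) = sum over values of Pr(att = val) * I([att = val]);
    E is a list of (x, y) with x a dict of attribute values, y in {0, 1}.
    The most informative attribute is the one minimizing this score."""
    n = len(E)
    total = 0.0
    for val in {x[att] for x, _ in E}:
        labels = [y for x, y in E if x[att] == val]
        p = sum(labels) / len(labels)        # Pr(Class = 1 | att = val)
        total += (len(labels) / n) * qi(p)
    return total
```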
SLIDE 41
Decision Trees (3)
Contingency table and quantity of information (QI)
[Plot: QI as a function of p, QI(p) = −p log p − (1 − p) log(1 − p), maximal at p = 0.5]
Computation
value     p(value)  p(poor | value)  QI(value)  p(value) * QI(value)
[0,10[    0.051     0.999            0.00924    0.000474
[10,20[   0.25      0.938            0.232      0.0570323
[20,30[   0.26      0.732            0.581      0.153715
SLIDE 42
Decision Trees (4)
Limitations
◮ XOR-like attributes
◮ Attributes with many values
◮ Numerical attributes
◮ Overfitting
SLIDE 43
Limitations
Numerical Attributes
◮ Order the values: val_1 < . . . < val_t
◮ Compute QI([att < val_i]) for each candidate threshold
◮ QI(att) = max_i QI([att < val_i]) (see the sketch below)
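A sketch of this threshold scan; `qi_split` stands for whatever split-quality function is used and is a placeholder.

```python
def best_threshold(E, att, qi_split):
    """Scan the midpoints between consecutive sorted values of a numerical
    attribute, and keep the threshold whose binary split [att < theta]
    scores best under `qi_split` (e.g. an information-based criterion)."""
    values = sorted({x[att] for x, _ in E})
    best_score, best_theta = None, None
    for lo, hi in zip(values, values[1:]):
        theta = (lo + hi) / 2.0                # candidate threshold
        left = [(x, y) for x, y in E if x[att] < theta]
        right = [(x, y) for x, y in E if x[att] >= theta]
        score = qi_split(left, right)
        if best_score is None or score > best_score:
            best_score, best_theta = score, theta
    return best_theta, best_score
```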
The XOR case
Bias the distribution of the examples
SLIDE 44
Complexity
Quantity of information of an attribute: O(n ln n)
Adding a node: O(D × n ln n)
SLIDE 45
Tackling Overfitting
Penalize the selection of an already used variable
◮ Limits the tree depth.
Do not split subsets below a given minimal size
◮ Limits the tree depth.
Pruning
◮ Each leaf corresponds to one conjunction;
◮ Generalize by pruning literals;
◮ Greedy optimization, QI criterion.
SLIDE 46
Decision Trees, Summary
Still around after all these years
◮ Robust against noise and irrelevant attributes ◮ Good results, both in quality and complexity
Random Forests
Breiman 00
SLIDE 47
Overview
Examples
Introduction to Supervised Machine Learning
Decision trees
Empirical validation
Performance indicators
Estimating an indicator
SLIDE 48 Validation issues
- 1. What is the result ?
- 2. My results look good. Are they ?
- 3. Does my system outperform yours ?
- 4. How to set up my system ?
SLIDE 49
Validation: Three questions
Define a good indicator of quality
◮ Misclassification cost
◮ Area under the ROC curve
Computing an estimate thereof
◮ Validation set
◮ Cross-validation
◮ Leave-one-out
◮ Bootstrap
Compare estimates: Tests and confidence levels
SLIDE 50
Which indicator, which estimate: depends.
Settings
◮ Large/few data
Data distribution
◮ Dependent / independent examples
◮ Balanced / imbalanced classes
SLIDE 51
Overview
Examples
Introduction to Supervised Machine Learning
Decision trees
Empirical validation
Performance indicators
Estimating an indicator
SLIDE 52
Performance indicators
Binary class
◮ h*: the truth
◮ ĥ: the learned hypothesis
Confusion matrix:
ĥ \ h*   1      0      total
1        a      b      a + b
0        c      d      c + d
total    a + c  b + d  a + b + c + d
SLIDE 53
Performance indicators, 2
ĥ \ h*   1      0      total
1        a      b      a + b
0        c      d      c + d
total    a + c  b + d  a + b + c + d
◮ Misclassification rate: (b + c) / (a + b + c + d)
◮ Sensitivity, true positive rate (TPR): a / (a + c)
◮ False positive rate (FPR, i.e. 1 − specificity): b / (b + d)
◮ Recall: a / (a + c)
◮ Precision: a / (a + b)
Note: always compare to random guessing / a baseline algorithm.
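A minimal sketch computing these indicators from the four counts of the confusion matrix (the function name is illustrative).

```python
def indicators(a, b, c, d):
    """Indicators from the 2x2 confusion matrix:
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "misclassification rate": (b + c) / (a + b + c + d),
        "true positive rate":     a / (a + c),   # sensitivity, recall
        "false positive rate":    b / (b + d),   # 1 - specificity
        "precision":              a / (a + b),
    }
```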
SLIDE 54
Performance indicators, 3
The Area under the ROC curve
◮ ROC: Receiver Operating Characteristics ◮ Origin: Signal Processing, Medicine
Principle: h : X → ℝ, where h(x) measures the risk of patient x; h thus orders the examples:
+ + + − + − + + + + − − − + − − − + − − − − − − − − − − − −
SLIDE 55
Performance indicators, 3
The Area under the ROC curve
◮ ROC: Receiver Operating Characteristics ◮ Origin: Signal Processing, Medicine
Principle: h : X → ℝ, where h(x) measures the risk of patient x; h thus orders the examples:
+ + + − + − + + + + − − − + − − − + − − − − − − − − − − − −
Given a threshold θ, h yields a classifier: Yes iff h(x) > θ.
+ + + − + − + + + + | − − − + − − − + − − − − − − − − − − − −
Here, TPR(θ) = .8 and FPR(θ) = .1.
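A minimal sketch of how the ROC points are traced by sweeping the threshold down the ordered examples, one point per example; names are illustrative.

```python
def roc_points(scores, labels):
    """Sweep the threshold down through the scores and record one
    (FPR, TPR) point per example; labels are in {0, 1}."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1      # one more positive above the threshold
        else:
            fp += 1      # one more negative above the threshold
        points.append((fp / n_neg, tp / n_pos))
    return points
```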
SLIDE 56
ROC
SLIDE 57
The ROC curve
θ → M(θ) ∈ ℝ², M(θ) = (FPR(θ), TPR(θ))
Ideal classifier: the point (0, 1): no false positives, all true positives.
Diagonal (TPR = FPR) ≡ nothing has been learned.
SLIDE 58
ROC Curve, Properties
Properties
ROC depicts the trade-off between the true positive and false positive rates.
Standard: misclassification cost (Domingos, KDD 99)
Error = # false positives + c × # false negatives
In a multi-objective perspective, the ROC curve is a Pareto front.
Best solution: intersection of the Pareto front with the direction ∆(−c, −1).
SLIDE 59
ROC Curve, Properties, foll’d
Used to compare learners
Bradley 97
◮ multi-objective-like
◮ insensitive to imbalanced distributions
◮ shows the sensitivity to the error cost
SLIDE 60
Area Under the ROC Curve
Often used to select a learner. Don't ever do this!
Hand, 09
Sometimes used as a learning criterion: the Mann-Whitney-Wilcoxon statistic
AUC = Pr(h(x) > h(x′) | y > y′)
WHY (Rosset, 04)
◮ More stable: based on O(n²) pairs rather than O(n) examples
◮ Probabilistic interpretation (Clémençon et al. 08)
HOW
◮ SVM-Ranking (Joachims 05; Usunier et al. 08, 09)
◮ Stochastic optimization
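A minimal sketch of the Mann-Whitney-Wilcoxon estimate of the AUC, written as the naive pairwise count for clarity.

```python
def auc(scores, labels):
    """Mann-Whitney-Wilcoxon estimate of AUC = Pr(h(x) > h(x'))
    for a random positive x and negative x'; ties count 1/2.
    O(n_pos * n_neg) version, written for clarity, not speed."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```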
SLIDE 61
Overview
Examples
Introduction to Supervised Machine Learning
Decision trees
Empirical validation
Performance indicators
Estimating an indicator
SLIDE 62
Validation, principle
Desired: the performance on further instances.
[Diagram: h is learned from the Dataset; its true quality is measured on further examples from the WORLD]
Assumption: the Dataset is to the World what the Training set is to the Dataset.
[Diagram: within the DATASET, h is learned on the Training set and its quality is measured on the Test examples]
SLIDE 63 Validation, 2
[Diagram: within the DATASET, the learning parameters and the Training set yield h; perf(h) is measured on the Test examples]
Unbiased Assessment of Learning Algorithms
- T. Scheffer and R. Herbrich, 97
SLIDE 64 Validation, 2
[Diagram: the same loop, iterated over the learning parameters; returns parameter*, h*, perf(h*)]
Unbiased Assessment of Learning Algorithms
- T. Scheffer and R. Herbrich, 97
SLIDE 65 Validation, 2
[Diagram: the same loop, plus a held-out Validation set measuring the true performance of the selected parameter*, h*]
Unbiased Assessment of Learning Algorithms
- T. Scheffer and R. Herbrich, 97
SLIDE 66
Overview
Examples Introduction to Supervised Machine Learning Decision trees Empirical validation Performance indicators Estimating an indicator
SLIDE 67 Confidence intervals
Definition
Given a random variable X on ℝ, a p%-confidence interval is I ⊂ ℝ such that Pr(X ∈ I) > p.
Binary variable with probability ε: the probability of r events out of n trials is
P_n(r) = (n! / (r! (n − r)!)) ε^r (1 − ε)^(n−r)
◮ Mean: nε
◮ Variance: σ² = nε(1 − ε)
Gaussian approximation: P(x) = (1 / √(2πσ²)) exp(−(1/2) ((x − µ)/σ)²)
SLIDE 68 Confidence intervals
Bounds relating the empirical value x̂_n and the true value x* over n trials, n > 30:
Pr(|x̂_n − x*| > 1.96 √(x̂_n (1 − x̂_n) / n)) < .05
z / ε table:
z      .67  1.   1.28  1.64  1.96  2.33  2.58
ε (%)  50   32   20    10    5     2     1
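A minimal sketch of this Gaussian-approximation interval (the helper name is illustrative).

```python
import math

def confidence_interval(x_hat, n, z=1.96):
    """Gaussian-approximation interval for an empirical rate x_hat
    over n trials (n > 30): x_hat +/- z * sqrt(x_hat (1 - x_hat) / n)."""
    half = z * math.sqrt(x_hat * (1.0 - x_hat) / n)
    return x_hat - half, x_hat + half

# e.g. a 12% error rate measured on 500 test examples:
# confidence_interval(0.12, 500) -> roughly (0.092, 0.148)
```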
SLIDE 69 Empirical estimates
When data abound (e.g. MNIST): use disjoint Training / Test / Validation sets.
Otherwise, N-fold cross validation: partition the data into N folds; run i learns on all folds but the i-th, and measures the error on the held-out fold i.
Error = average of the errors over the N runs (a sketch follows).
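A minimal N-fold cross-validation sketch; `learn` and `error` are placeholders for the learning algorithm and the error measure.

```python
def cv_error(E, learn, error, n_folds=10):
    """N-fold cross validation: for each fold, learn on the other
    N-1 folds and measure the error on the held-out fold; average."""
    folds = [E[i::n_folds] for i in range(n_folds)]
    errs = []
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        errs.append(error(learn(train), test))
    return sum(errs) / n_folds
```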
SLIDE 70
Empirical estimates, foll’d
Cross validation → Leave-one-out
Leave-one-out: same as N-fold CV, with N = the number of examples.
Properties: low bias; high variance; underestimates the error if the data are not independent.
SLIDE 71
Empirical estimates, foll’d
Bootstrap
[Diagram: the Training set is drawn from the Dataset by uniform sampling with replacement; the Test set is the rest of the examples]
Average the indicator over all (Training set, Test set) samplings.
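A minimal bootstrap sketch, using the never-drawn (out-of-bag) examples as the test set; `learn` and `error` are placeholders.

```python
import random

def bootstrap_error(E, learn, error, n_runs=100):
    """Each run draws a training set of size n uniformly with replacement;
    the examples never drawn form the test set; average over the runs."""
    n, errs = len(E), []
    for _ in range(n_runs):
        idx = [random.randrange(n) for _ in range(n)]
        drawn = set(idx)
        train = [E[i] for i in idx]
        test = [ex for i, ex in enumerate(E) if i not in drawn]
        errs.append(error(learn(train), test))
    return sum(errs) / n_runs
```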
SLIDE 72
Beware
Multiple hypothesis testing
◮ If you test many hypotheses on the same dataset,
◮ one of them will appear confidently true...
More
◮ Tutorial slides:
http://www.lri.fr/~sebag/Slides/Validation_Tutorial_11.pdf
◮ Video and slides (soon): ICML 2012, Videolectures, Tutorial
Japkowicz & Shah http://www.mohakshah.com/tutorials/icml2012/
SLIDE 73
Validation, summary
What is the performance criterion
◮ Cost function
◮ Account for class imbalance
◮ Account for data correlations
Assessing a result
◮ Compute confidence intervals
◮ Consider baselines
◮ Use a validation set
If the result looks too good, don’t believe it