Stacking for supervised learning
Niall Rooney, NIKEL, University of Ulster
Ensemble learning
• Postulate multiple hypotheses to explain the data
• Shortcomings of single-model learning algorithms (Dietterich, 2002):
  – the statistical problem
  – the computational problem
  – the representational problem
Ensemble learning
• Generalization error = bias + variance (the decomposition is written out below)
  – Bias: how close the algorithm's average prediction is to the target
  – Variance: how much the algorithm's predictions "bounce around" across different training sets
  – A model that is too simple, or too inflexible, will have a large bias
  – A model that has too much flexibility will have high variance
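For reference, the decomposition behind this bullet stated explicitly (a standard identity not spelled out on the slide; σ² denotes irreducible noise):

```latex
% Bias-variance decomposition of expected squared error at a point x,
% for a learned model \hat{f} and target y = f(x) + \varepsilon with noise variance \sigma^2:
\mathbb{E}\!\left[(\hat{f}(x) - y)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^{2}}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^{2}\right]}_{\text{variance}}
  + \sigma^{2}
```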
Ensemble learning
• Generalization error: ensembles
  – Ensembles reduce bias and/or variance
  – To be effective, ensembles need diverse and accurate base models
  – Diversity is measured by the level of variability in the base members' predictions (for regression)
Ensemble learning
• Homogeneous learning: one algorithm, varied via data sampling, feature sampling, randomization, or parameter settings
• Heterogeneous learning: same data, different learning algorithms (both regimes are sketched below)
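A minimal sketch of the two regimes; scikit-learn is an assumption here (the slides name no library, and the estimators shown are illustrative):

```python
# Two ways to build a diverse ensemble.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Homogeneous: one algorithm, diversity from bootstrap data sampling
# (feature sampling, randomization, or parameter settings work similarly).
homogeneous = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10)

# Heterogeneous: same data, different learning algorithms.
heterogeneous = [DecisionTreeRegressor(), LinearRegression(), KNeighborsRegressor()]
```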
Ensemble learning
[Figure: input features feed Classifier 1, Classifier 2, ..., Classifier N; their class predictions feed a combiner, which outputs a single class prediction.]
Ensemble learning
Methods of combination:
• Voting, weighting, selection (a minimal sketch follows this list)
• Mixture of experts
• Error-correcting output codes
• Bagging
• Boosting
• Stacking
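The simplest of these combiners is a plain (weighted) vote or average; a minimal sketch for regression, with illustrative names:

```python
import numpy as np

def weighted_average(predictions, weights):
    """Combine m base-model regression predictions by a weighted average."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.dot(weights, predictions) / weights.sum()

# e.g. three base-model predictions for one instance
print(weighted_average([2.0, 2.4, 1.9], weights=[1.0, 2.0, 1.0]))  # -> 2.175
```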
Ensemble learning: Stacking
[Figure: a prediction instance is passed to Base Model 1, Base Model 2, ..., Base Model n; their outputs feed a meta model.]
Meta technique: SR (stacked regression)
• Cross-validation produces the meta-training set {(f1(xj), ..., fm(xj), yj)} from the base models M1, ..., Mm (the procedure is sketched below)
[Figure: an instance x* is passed to each base model fi; the base predictions f1(x*), ..., fm(x*) feed the combining (meta-level) model Meta-M, whose output is the final prediction Meta-M(f1(x*), ..., fm(x*)).]
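A sketch of this procedure, assuming scikit-learn-style estimators (the slide prescribes neither a library nor particular base/meta learners):

```python
# SR-style stacking for regression.
import numpy as np
from sklearn.model_selection import cross_val_predict

def fit_stacked_regressor(base_models, meta_model, X, y, cv=5):
    # Meta-training set: out-of-fold base predictions (f1(xj), ..., fm(xj))
    # paired with the true targets yj.
    meta_X = np.column_stack(
        [cross_val_predict(m, X, y, cv=cv) for m in base_models]
    )
    meta_model.fit(meta_X, y)
    # Refit each base model on all the data for use at prediction time.
    for m in base_models:
        m.fit(X, y)
    return base_models, meta_model

def stacked_predict(base_models, meta_model, X_new):
    # Final prediction: Meta-M(f1(x*), ..., fm(x*)).
    meta_X = np.column_stack([m.predict(X_new) for m in base_models])
    return meta_model.predict(meta_X)

# Usage: pass any scikit-learn regressors as base_models and meta_model.
```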
Stacking for classification
• Use class probability distributions from the base classifiers rather than class predictions; the meta-training set becomes {(P1(C1|x), ..., P1(Ck|x), ..., Pm(C1|x), ..., Pm(Ck|x), y)}
• Choice of meta-classifier: multi-response linear regression (MLR)
  – For a classification problem with k class values, solve k regression problems
  – Only use the probabilities related to class Cj to predict class Cj (see the sketch below)
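A sketch of MLR stacking under the same scikit-learn assumption; it follows the slide's restriction of using only the P(Cj|x) columns when predicting class Cj:

```python
# Stacking with multi-response linear regression (MLR) as the meta-learner.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression

def fit_mlr_stacking(base_models, X, y, cv=5):
    classes = np.unique(y)
    # Out-of-fold class-probability distributions, one (n, k) block per model.
    probas = [cross_val_predict(m, X, y, cv=cv, method="predict_proba")
              for m in base_models]
    regressors = []
    for j, c in enumerate(classes):
        # Per the slide: to predict class Cj, use only the P(Cj|x) columns.
        meta_X_j = np.column_stack([p[:, j] for p in probas])
        regressors.append(LinearRegression().fit(meta_X_j, (y == c).astype(float)))
    for m in base_models:
        m.fit(X, y)
    return base_models, regressors, classes

def predict_mlr_stacking(base_models, regressors, classes, X_new):
    probas = [m.predict_proba(X_new) for m in base_models]
    scores = np.column_stack([
        reg.predict(np.column_stack([p[:, j] for p in probas]))
        for j, reg in enumerate(regressors)
    ])
    return classes[np.argmax(scores, axis=1)]
```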
Stacking for classification
• Use base classifiers of different "types"
• Multi-response model trees as the meta-learner were shown to perform better than selecting the best classifier (Dzeroski & Zenko, 2004)
Stacking for regression
• Linear regression as the meta-learner requires non-negative weights (Breiman, 1996; sketched below)
• Model trees can also serve as the meta-learner
• Homogeneous stacking uses random feature sub-sets
• Feature sub-sets can be improved upon using hill-climbing or GA techniques
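A minimal sketch of the non-negative-weights constraint, assuming SciPy's nnls solver (the slide names no solver); meta_X would be the out-of-fold base predictions from the SR slide:

```python
# Learn non-negative stacking weights via non-negative least squares.
import numpy as np
from scipy.optimize import nnls

def fit_nonnegative_weights(meta_X, y):
    """meta_X: (n_instances, m) out-of-fold base predictions; y: targets."""
    weights, _residual = nnls(meta_X, y)
    return weights

def combine(weights, base_predictions):
    # Final prediction: non-negatively weighted sum of base predictions.
    return np.dot(np.asarray(base_predictions), weights)
```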
Related techniques: Multiple meta-levels
Cascade generalization (sketched below)
[Figure: Classifier 1 sees the original features <x>; Classifier 2 sees <x, P1(C1), ..., P1(Ck)>; Classifier 3 sees <x, P1(C1), ..., P1(Ck), P2(C1), ..., P2(Ck)>; each level appends the previous classifiers' class-probability estimates to the feature vector.]
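A sketch of the cascade, assuming scikit-learn-style classifiers with predict_proba; for brevity it extends features with in-sample probabilities, where a careful implementation would use cross-validated estimates as in stacking:

```python
# Cascade generalization: each classifier is trained on the original features
# extended with every earlier classifier's class-probability estimates.
import numpy as np

def fit_cascade(classifiers, X, y):
    X_ext = X
    for clf in classifiers:
        clf.fit(X_ext, y)
        # Append this level's probabilities to the next level's input.
        X_ext = np.hstack([X_ext, clf.predict_proba(X_ext)])
    return classifiers

def predict_cascade(classifiers, X_new):
    X_ext = X_new
    for clf in classifiers[:-1]:
        X_ext = np.hstack([X_ext, clf.predict_proba(X_ext)])
    return classifiers[-1].predict(X_ext)
```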
Related techniques: Multiple meta-levels
Combiner trees
[Figure: Classifiers 1-4 are trained on disjoint training sets T1, T2, ...; their predictions are combined pairwise by Combiner 1 and Combiner 2, whose outputs feed Combiner 3, forming a tree of combiners.]
Related techniques: Dynamic integration
• Meta-level training set: {(xj, Err1(xj), ..., Errm(xj), yj)}, where Erri(xj) = |fi(xj) − yj|
[Figure: an instance x* is passed to each base model fi of M1, ..., Mm; the base predictions f1(x*), ..., fm(x*), together with the stored base errors, feed the combining (meta-level) model Meta-M, which outputs the final prediction Meta-M(f1(x*), ..., fm(x*)).]
Dynamic integration
• Meta model Meta-M: distance-weighted k-NN
• NN: the set of k nearest meta-instances to the query
• Over NN, find the cumulative error of each base model
Dynamic integration
• Dynamic Selection (DS)
  – choose the model with the lowest cumulative error
• Dynamic Weighting (DW)
  – combine the models with weights based on their cumulative errors
• Dynamic Weighting with Selection (DWS)
  – combine the models as in DW, but exclude models whose cumulative error is larger than the median (all three are sketched below)
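A sketch of all three schemes over the k-NN meta-level from the previous slide (Euclidean distance and inverse-error weights are assumptions; the slides fix neither):

```python
import numpy as np

def dynamic_integrate(x_star, meta_X, meta_err, base_preds, k=5, mode="DS"):
    """
    meta_X:     (n, d) stored training instances xj
    meta_err:   (n, m) stored errors Erri(xj) = |fi(xj) - yj|
    base_preds: (m,)   base predictions f1(x*), ..., fm(x*) for the query
    """
    # Find the k nearest meta-instances to x* (Euclidean distance assumed).
    dist = np.linalg.norm(meta_X - x_star, axis=1)
    nn = np.argsort(dist)[:k]
    # Cumulative error of each model over the neighbourhood (an unweighted sum;
    # the slides' distance-weighted k-NN would weight each term by proximity).
    cum_err = meta_err[nn].sum(axis=0)

    base_preds = np.asarray(base_preds, dtype=float)
    if mode == "DS":   # Dynamic Selection: model with lowest cumulative error
        return base_preds[np.argmin(cum_err)]
    weights = 1.0 / (cum_err + 1e-12)    # DW: weights from cumulative error
    if mode == "DWS":  # exclude models above the median cumulative error
        weights[cum_err > np.median(cum_err)] = 0.0
    return np.dot(weights, base_preds) / weights.sum()
```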
Applications
• Distributed data mining
• Intrusion detection
• Concept drift
Key papers
• Wolpert, D. H.: Stacked Generalization. Neural Networks, 5 (1992) 241-259
• Breiman, L.: Stacked Regressions. Machine Learning, 24 (1996) 49-64
• Dietterich, T. G.: Ensemble Methods in Machine Learning. Lecture Notes in Computer Science, 1857 (2000) 1-15
• Dzeroski, S., & Zenko, B.: Is Combining Classifiers with Stacking Better than Selecting the Best One? Machine Learning, 54 (2004) 255-273
• Ting, K. M., & Witten, I. H.: Issues in Stacked Generalization. Journal of Artificial Intelligence Research, 10 (1999) 271-289