J. Vega
Asociación EURATOM/CIEMAT para Fusión
jesus.vega@ciemat.es

Outline: Concepts, Classification, Regression, Advanced methods

7th FDPVA. Frascati (March 26-28, 2012)
Technology: it gives stereotyped solutions to stereotyped problems
Basic science: it is the accumulation of knowledge to explain a phenomenon
Applied science: it is the application of scientific knowledge to a particular environment
- Machine learning can increase our comprehension of
plasma physics
Learning does not mean 'learning by heart' (any computer can memorize)
Learning means 'generalization capability': we learn with some samples and predict for other samples
The learning problem is the problem of finding a
desired dependence (function) using a limited number of observations (training data)
- Classification: the function can represent the separation frontier between two classes
- Regression: the function can provide a fit to the data
[Figure: left, classification (a frontier separating 'x' and 'o' samples); right, regression (a curve fitted to the data)]
The general model of learning from examples is described
through three components
The problem of learning is that of choosing from the given
set of functions f(x, a), the one that best approximates the supervisor’s response
Generator of random vectors: p(x), a fixed but unknown probability distribution function
Supervisor: returns y according to p(y|x), fixed and unknown
Learning machine: implements f(x, a)
Training samples: (xi, yi), i = 1, ..., N
The learning machine produces estimates ŷi = f(xi, a) such that ŷ1, ŷ2, ..., ŷN are "close" to y1, y2, ..., yN
Main hypothesis
- The training set, (xi, yi), i = 1, ..., N, is made up of independent and identically distributed (iid) observations drawn according to p(x, y) = p(y|x)p(x)
Loss function: L(y, f(x, a))
- It measures the quality of the approach performed by the learning algorithm, i.e. the discrepancy between the response y of the supervisor and the response f(x, a) of the learning machine. Its values are ≥ 0
Risk functional:

R(a) = ∫ L(y, f(x, a)) p(x, y) dx dy

The goal of a learning process is to find the function f(x, a0) that minimizes R(a) (over the class of functions f(x, a)) in the situation where p(x, y) is unknown and the only available information is contained in the training set
The three main learning problems and their loss functions:

- Pattern recognition (classification):
  L(y, f(x, a)) = 0 if y = f(x, a); 1 if y ≠ f(x, a)
- Regression estimation:
  L(y, f(x, a)) = (y - f(x, a))²
- Density estimation:
  L(p(x, a)) = -log p(x, a)

In each case the risk functional R(a) = ∫ L p(x, y) dx dy is minimized
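As an illustration (not part of the original slides), the three losses and the empirical risk, i.e. the training-set average that approximates R(a), can be sketched as:

```python
# Illustrative sketch: the three loss functions and the empirical risk,
# i.e. the average loss over the training samples.
import math

def zero_one_loss(y, f):          # classification loss
    return 0 if y == f else 1

def squared_loss(y, f):           # regression loss
    return (y - f) ** 2

def neg_log_loss(p):              # density estimation loss, p = p(x, a)
    return -math.log(p)

def empirical_risk(losses):       # average loss over the training samples
    return sum(losses) / len(losses)

# Toy classification data: true labels vs. a classifier's outputs
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 1, 0]
risk_cls = empirical_risk([zero_one_loss(y, f) for y, f in zip(y_true, y_pred)])
print(risk_cls)  # 0.5: two of the four samples are misclassified
```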
Example of a parametric family of classifiers in the plane:
f(x, a) = 0 if x2 < a·x1 + b, and 1 if x2 ≥ a·x1 + b, with parameter vector a = (a, b)
[Slide figure: examples of candidate function classes f1(x; A1, f1), f2(x; A2, f2) and f3(x; m, b) among which the learning machine must choose]
Feature types:
- Quantitative (numerical)
  - Continuous-valued (length, pressure)
  - Discrete (total basketball score, number of citizens in a town)
- Qualitative (categorical)
  - Ordinal (education degree)
  - Nominal (profession, brand of a car)

Dataset: (x1, y1), (x2, y2), ..., (xi, yi), ..., (xN, yN)
xi ∈ Rm: features that are of a distinctive nature (object description with attributes managed by computers)
yi ∈ {L1, L2, ..., LK}: label of the sample xi
Dataset: (x1, y1), (x2, y2), ..., (xi, yi), ..., (xN, yN); xi ∈ Rm: feature vectors; yi ∈ {L1, L2, ..., LK}: known label of the sample xi
Objective: to determine a separating function between classes (generalization) in order to predict the labels of new samples with known feature vectors ((xN+1, yN+1), (xN+2, yN+2), ...)
[Figure: two decision boundaries for the same data: a smooth one that generalizes, and an overly complex one that overfits the training samples]
How good is a classifier?
- Training set: a model is created to make predictions. Given xi, the model predicts yi
- Test set: model validation. The success rate is taken as the level of confidence and it is assumed to be the same for all future samples
Dataset: (x1, y1), (x2, y2), ..., (xN, yN) (xi, yi), i = 1, ..., J: training set (xi, yi), i = J+1, ..., N: test set
Multi-class problems (K > 2)
- They can be tackled as K binary problems: each class is compared with the rest (one-versus-the-rest approach)
[Figure: one-versus-the-rest decision regions for four classes c1, c2, c3, c4 (c1 vs. not c1, c2 vs. not c2, etc.), with an ambiguity region where the binary decisions conflict]
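A minimal one-versus-the-rest sketch (assumptions not fixed by the slides: perceptron base classifiers, and an argmax over activations to resolve the ambiguity region):

```python
# One-versus-the-rest: for each class k, train a binary classifier on
# "class k" vs. "the rest"; a new sample gets the label whose classifier
# gives the largest activation.
import numpy as np

def train_perceptron(X, y, epochs=50):
    # y in {+1, -1}; returns weights w and bias b
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # misclassified -> update
                w, b = w + yi * xi, b + yi
    return w, b

def ovr_fit(X, y, classes):
    return {k: train_perceptron(X, np.where(y == k, 1, -1)) for k in classes}

def ovr_predict(models, x):
    # argmax of the binary activations resolves the ambiguity region
    return max(models, key=lambda k: models[k][0] @ x + models[k][1])

X = np.array([[0., 0.], [0.2, 0.1], [5., 5.], [5.2, 4.9], [0., 5.], [0.1, 5.2]])
y = np.array([0, 0, 1, 1, 2, 2])
models = ovr_fit(X, y, classes=[0, 1, 2])
print(ovr_predict(models, np.array([0.1, 0.0])))   # expect class 0
```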
Examples of feature vectors
- Disruptions
- L/H transition
- Image classification
xi: the set of pixels of an image yi∈{1, 2, 3}
[Figure: time traces of plasma signals, sampled every Ts]

Disruption prediction: feature vectors are built from signal values at successive times,
x1 = (Ip(t), ne(t), ...), x2 = (Ip(t + Ts), ne(t + Ts), ...), x3 = (Ip(t + 2Ts), ne(t + 2Ts), ...), with labels yi ∈ {D, ND}
L/H transition: xi ∈ Rm, yi ∈ {L, H}
Single classifiers
- Support Vector Machines (SVM)
- Neural networks
- Bayes decision theory
Parametric and non-parametric methods
- Classification trees
Combining classifiers
Binary classifier
It finds the optimal separating hyper-plane between classes
Samples: (xk, yk), xk ∈ Rn, k = 1, ..., N, yk ∈ {C{+1}, C{-1}}

[Figure: two classes C{+1} and C{-1} separated by the hyper-plane D(x) = 0, with maximum margin 2t]

Decision function: D(x) = w·x + b, with D(x) > +1 on one side, D(x) < -1 on the other, and D(x) = ±1 on the margins
The distance from a sample xk to the hyper-plane is |D(xk)| / ||w||
Margin condition: yk D(xk) ≥ t ||w||, yk ∈ {+1, -1}, k = 1, ..., N
To find the optimal hyper-plane it is necessary to determine the vector w that maximizes the margin t. There are infinite solutions due to the presence of a scale factor. To avoid this: t ||w|| = 1. Therefore, maximizing the margin is equivalent to minimizing ||w||

Optimization problem:

min J(w) = ½ ||w||², subject to yk (w·xk + b) ≥ 1, k = 1, ..., N
Solution:

w* = Σi αi* yi xi  (sum over the support vectors), where the αi are the Lagrange multipliers

Samples associated with αi ≠ 0 are called "support vectors"
The constant b* is obtained from any Karush-Kuhn-Tucker condition:

αi [yi (w*·xi + b*) - 1] = 0, i = 1, ..., N

[Figure: separating hyper-plane w*·x + b* = 0 with the support vectors highlighted]

The rest of the training samples are irrelevant to classify new samples
- V. Cherkassky, F. Mulier. Learning from data. 2nd edition. Wiley-Interscience.
Classification rule. Given x to classify:

if sign( Σ(support vectors) αi* yi (xi·x) + b* ) = +1 → x ∈ C{+1}; otherwise x ∈ C{-1}

D(x) = w*·x + b* is the distance (with sign) from x to the separating hyper-plane
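The margin conditions can be checked numerically. The sketch below (not from the slides) fixes a toy hyper-plane by hand, with no QP solver, just to illustrate the constraints yk D(xk) ≥ 1 and the support vectors lying on the margin:

```python
# Checking the SVM conditions on a toy separable set: with the scale fixed
# so that t||w|| = 1, every sample satisfies y_k (w.x_k + b) >= 1, and the
# support vectors are exactly those with y_k (w.x_k + b) = 1.
import numpy as np

# Toy problem in the plane: class +1 at x1 >= 2, class -1 at x1 <= 0
X = np.array([[2., 0.], [3., 1.], [0., 0.], [-1., 2.]])
y = np.array([+1, +1, -1, -1])

# Maximum-margin hyper-plane for this geometry, rescaled so that the
# closest samples give D(x) = +/-1
w = np.array([1., 0.])
b = -1.0

D = X @ w + b                        # signed decision values D(x_k)
margins = y * D                      # constraint values y_k D(x_k)
assert np.all(margins >= 1 - 1e-9)   # all constraints satisfied

support = np.isclose(margins, 1.0)   # samples exactly on the margin
print(X[support])                    # [[2. 0.] [0. 0.]]
margin_width = 2.0 / np.linalg.norm(w)
print(margin_width)                  # 2.0
```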
Non-linearly separable case → kernels: a mapping from the input space to a feature space where the classes become linearly separable, represented by a kernel function H(x, x')

- Linear: H(x, x') = (x·x')
- Polynomial of degree q: H(x, x') = [(x·x') + 1]^q
- Radial basis functions: H(x, x') = exp(-||x - x'||² / σ²)
- Neural network: H(x, x') = tanh(2(x·x') + 1)
- V. Cherkassky, F. Mulier. Learning from data. 2nd edition. Wiley-Interscience.
Classification rule. Given x to classify:

if sign( Σ(support vectors) αi* yi H(xi, x) + b* ) = +1 → x ∈ C{+1}; otherwise x ∈ C{-1}
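A quick numerical sketch of the listed kernels (σ and the neural-network constants are free parameters; the values below are only illustrative):

```python
# The four kernels from the slide, evaluated on toy vectors.
import numpy as np

def k_linear(x, xp):            return float(x @ xp)
def k_poly(x, xp, q=2):         return (float(x @ xp) + 1.0) ** q
def k_rbf(x, xp, sigma=1.0):    return float(np.exp(-np.sum((x - xp) ** 2) / sigma ** 2))
def k_nn(x, xp):                return float(np.tanh(2.0 * (x @ xp) + 1.0))

x  = np.array([1.0, 0.0])
xp = np.array([0.0, 1.0])

print(k_linear(x, xp))   # 0.0 (orthogonal vectors)
print(k_poly(x, xp))     # 1.0
print(k_rbf(x, x))       # 1.0 (every point is maximally similar to itself)
# all kernels are symmetric: H(x, x') = H(x', x)
assert k_rbf(x, xp) == k_rbf(xp, x)
```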
[Figure: multi-layer perceptron with input layer (x1, ..., xn), two hidden layers with outputs O1(1), ..., Ok(1) and O1(2), ..., Om(2), weight matrices W(1), W(2), W(3), and a single output y]

Samples: (xj, yj), xj ∈ Rn, j = 1, ..., N, y ∈ {L1, L2}
W(1): (n+1 × k) matrix; W(2): (k+1 × m) matrix; W(3): (m+1 × 1) matrix (the extra row accounts for the bias inputs)

Activation functions:
s(u) = 1 / (1 + exp(-u))  (logistic sigmoid)
s(u) = tanh(u) = 2 / (1 + exp(-2u)) - 1  (hyperbolic tangent)

Aim: determine W for a minimum error

- C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press
[Figure: a single neuron with inputs x1, x2, ..., xn, weights W1, W2, ..., Wn, bias b and activation s]

Each neuron computes O = s(u) with u = W1 x1 + W2 x2 + ... + Wn xn + b and, for instance,
s(u) = tanh(u) = 2 / (1 + exp(-2u)) - 1

For the k-th neuron of the first layer:
Ok(1) = 2 / (1 + exp(-2(Wk1 x1 + Wk2 x2 + ... + Wkn xn + bk))) - 1

Classification rule: given x to classify, if the network output ≥ threshold → x ∈ {L1}; otherwise x ∈ {L2}
Samples: (xj, yj), xj∈Rn, j = 1,...,N, y∈{L1, L2}
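The threshold rule can be sketched with a hand-wired network (weights chosen by hand for the XOR problem, purely as an illustration; the slides instead determine W by minimizing the training error):

```python
# Forward pass of a tiny 2-input, 1-hidden-layer network with logistic
# sigmoid activations, wired by hand to realize XOR.
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x1, x2):
    # Hidden layer (2 neurons), then one output neuron
    h1 = sigmoid(20*x1 + 20*x2 - 10)    # acts like OR
    h2 = sigmoid(-20*x1 - 20*x2 + 30)   # acts like NAND
    return sigmoid(20*h1 + 20*h2 - 30)  # AND(h1, h2) = XOR

def classify(x1, x2, threshold=0.5):
    # the slide's rule: output >= threshold -> L1, otherwise L2
    return "L1" if forward(x1, x2) >= threshold else "L2"

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), classify(a, b))
# (0,0) and (1,1) -> L2 ; (0,1) and (1,0) -> L1
```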
Bayes decision theory (two classes C1, C2):

P(Cj | x) = p(x | Cj) P(Cj) / p(x), j = 1, 2

Posterior = (Likelihood × Prior) / p(x), where p(x) = Σj p(x | Cj) P(Cj) is the probability distribution function of x

- R. O. Duda, P. E. Hart, D. G. Stork. Pattern classification. 2nd edition. Wiley-Interscience (2001)

[Figure: posterior probabilities P(C1 | x) and P(C2 | x) versus x; the decision boundary lies where they cross]

Classification rule. Given x to classify:
if P(C1 | x) ≥ P(C2 | x) → x ∈ C1; otherwise x ∈ C2
Equivalently: if p(x | C1) P(C1) ≥ p(x | C2) P(C2) → x ∈ C1; otherwise x ∈ C2
Likelihood estimation:
- Parametric: parameter estimation
- Non-parametric: Parzen window
Samples: (xk, yk), xk∈Rn, k = 1, ..., N, y∈{C1, C2}
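A parametric sketch of the rule (Gaussian class likelihoods with illustrative means, a common variance and equal priors; all values are assumptions, not from the slides):

```python
# Bayes rule with 1-D Gaussian likelihoods: compare p(x|C1)P(C1) against
# p(x|C2)P(C2) and pick the larger.
import math

def gaussian(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2*sigma**2)) / (sigma * math.sqrt(2*math.pi))

def bayes_classify(x, mu1=0.0, mu2=4.0, sigma=1.0, p1=0.5, p2=0.5):
    g1 = gaussian(x, mu1, sigma) * p1    # likelihood x prior, class C1
    g2 = gaussian(x, mu2, sigma) * p2
    return "C1" if g1 >= g2 else "C2"

print(bayes_classify(0.5))   # C1: closer to mu1
print(bayes_classify(3.5))   # C2
# With equal priors and variances the boundary sits at (mu1 + mu2) / 2 = 2
print(bayes_classify(1.9), bayes_classify(2.1))  # C1 C2
```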
Some classification problems involve features that are nominal data (discrete descriptions without any natural notion of similarity or even ordering)
The questions asked at each node concern a particular property
Successive nodes are visited until a terminal or leaf node is reached, where the category label is read
The same question (Size?) appears in different places
Different leaf nodes can be labelled by the same category
[Figure: decision tree for fruit classification. Root (Level 0): Color? Green → Size? (big: Watermelon, medium: Apple, small: Grape). Yellow → Shape? (round → Size?: big: Grapefruit, small: Lemon; thin: Banana). Red → Size? (medium: Apple, small → Taste?: sweet: Cherry, sour: Grape). Levels 0 to 3]
- L. Breiman et al. Classification and Regression Trees. Chapman&Hall/CRC
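The fruit tree above can be rendered as nested questions (the branching is reconstructed from the flattened figure, so treat it as a sketch):

```python
# Hand-built decision tree: questions about nominal properties are asked
# at each node until a leaf with a category label is reached.
def classify_fruit(color, size, shape=None, taste=None):
    if color == "green":
        return {"big": "Watermelon", "medium": "Apple", "small": "Grape"}[size]
    if color == "yellow":
        if shape == "thin":
            return "Banana"
        return {"big": "Grapefruit", "small": "Lemon"}[size]   # round shape
    if color == "red":
        if size == "medium":
            return "Apple"
        return {"sweet": "Cherry", "sour": "Grape"}[taste]     # small size
    raise ValueError("unknown color")

print(classify_fruit("green", "big"))                    # Watermelon
print(classify_fruit("yellow", "small", shape="round"))  # Lemon
# Different leaves can carry the same label, as noted on the slide:
print(classify_fruit("red", "small", taste="sour"))      # Grape
print(classify_fruit("green", "small"))                  # Grape
```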
Combining several classifiers allows obtaining a consensus of results with greater accuracy
Approaches to building classifier ensembles:

[Figure: the dataset feeds base classifiers D1, D2, ..., DL, possibly through different feature subsets X(1), X(2), ..., X(L); their outputs go to a combiner]

- Combination level: design different combiners
- Classifier level: use different base classifiers
- Feature level: use different feature subsets
- Data level: use different data subsets
- L. I. Kuncheva. Combining pattern classifiers. Wiley Interscience
Two main strategies in combining classifiers:
- Fusion (competitive classifiers): each ensemble member is supposed to have knowledge of the whole feature space
- Selection (cooperative classifiers): each ensemble member is supposed to know well a part of the feature space and be responsible for the objects in that part

Methods:
- Majority vote
- Fuzzy aggregation operators
- Bagging
- Boosting
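The simplest fusion method, majority vote, can be sketched in a few lines (the three base classifiers here are hypothetical label outputs, not trained models):

```python
# Majority vote: for each sample, fuse the labels emitted by the base
# classifiers and keep the most frequent one.
from collections import Counter

def majority_vote(predictions_per_classifier):
    # predictions_per_classifier: one label list per base classifier
    n_samples = len(predictions_per_classifier[0])
    fused = []
    for i in range(n_samples):
        votes = [preds[i] for preds in predictions_per_classifier]
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Three imperfect classifiers that err on different samples
c1 = ["A", "A", "B", "B"]
c2 = ["A", "B", "B", "A"]
c3 = ["B", "A", "B", "B"]
print(majority_vote([c1, c2, c3]))   # ['A', 'A', 'B', 'B']
```

Because the individual errors fall on different samples, the fused output here is correct even though each base classifier makes a mistake, which is the consensus effect the slide describes.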
Dataset: (x1, y1), (x2, y2), ..., (xN, yN)
- (xi, yi), i = 1, ..., J: training set (model creation)
- (xi, yi), i = J+1, ..., N: test set (model validation)
The predicted labels are compared with the real ones and the success rate that is obtained is taken as the level of confidence of the classifier. This level of confidence is assumed to be the same for all future samples
- (xi, yi), i = N+1, ...: predictions
Predictions corresponding to different samples can
have different levels of confidence
Objective: to qualify each particular prediction with
a measure of its reliability
Approaches: Bayes classifiers, logistic regression, conformal predictors
Bayes classifiers: the posterior P(Cj | x) = p(x | Cj) P(Cj) / p(x), j = L1, L2, ..., itself qualifies each prediction, but the likelihoods p(x | Cj) must be known or assumed

Logistic regression: the signed distance D(x) to the decision boundary is mapped to a probability of class membership, P(class1 | x) = 1 / (1 + exp(-k D(x)))
- R. O. Duda, P. E. Hart, D. G. Stork. Pattern classification. 2nd edition. Wiley-Interscience. (2001)
The greater the distance D(x), the deeper the point lies in its corresponding class; this translates directly into a probability
- E. Alpaydin. Introduction to Machine Learning. The MIT Press
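The distance-to-probability mapping can be sketched numerically (the slope k is a free parameter; its value below is only illustrative):

```python
# Logistic mapping of the signed distance D(x) to a class probability.
import math

def prob_class1(D, k=1.0):
    # larger positive distance D(x) -> probability closer to 1
    return 1.0 / (1.0 + math.exp(-k * D))

print(prob_class1(0.0))    # 0.5: exactly on the decision boundary
print(prob_class1(3.0))    # ~0.95: deep inside class 1
print(prob_class1(-3.0))   # ~0.05: deep inside the other class
```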
- V. Vovk, A. Gammerman, G. Shafer. Algorithmic learning in a random world. Springer (2005)
Conformal predictors (CP):
- Only require the iid hypothesis, and several potential underlying classifiers can be used
- The accuracy and reliability of the prediction are based on two values: confidence and credibility
- The basis of CP is to measure how conformal the samples are among themselves
- Confidence: how good the predicted class is against the other possibilities
- Credibility: how good the training set is for making a prediction on the current sample
In context-free classifications no relation exists among the various classes
In context-dependent classifications the various classes are closely related
- Successive feature vectors are not independent
- Classifying each feature vector separately from the others obviously has no meaning
One of the most widely used models describing the underlying class dependence is the Markov chain model
In the special case in which the states of the Markov chain are not directly observable and can only be inferred from the sequence of observations via some optimization technique, these types of Markov models are known as hidden Markov models
The Bayesian point of view is typically used
- S. Theodoridis, K. Koutroumbas. Pattern recognition. 2nd edition. Academic Press
Dataset: x1, x2, ..., xi, ..., xN; the class labels yi ∈ {L1, L2, ..., LK} of the samples xi are not available
Objective: to "reveal" the organization of the samples into a number of "sensible" clusters, which will allow us to discover similarities and differences among samples and to derive useful conclusions about them
- The labels are known just after the training
- A clustering criterion is needed: a proximity measure (distance or similarity)
[Figure: three clusterings of {sheep, dog, cat, sparrow, seagull, viper, lizard, goldfish, red mullet, blue shark, frog} under different criteria: class of animals (mammals, birds, reptiles, fish, amphibians); existence of lungs (goldfish, red mullet, shark vs. the rest); environment where the animals live]
Sequential algorithms
- k-means
Hierarchical clustering algorithms
- Agglomerative algorithms
- Divisive algorithms
Clustering algorithms based on cost function optimization
- Hard or crisp clustering algorithms
- Probabilistic clustering algorithms
- Fuzzy clustering algorithms
- Boundary detection algorithms
- S. Theodoridis, K. Koutroumbas. Pattern recognition. 2nd edition. Academic Press
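A plain k-means sketch (Lloyd's iterations) on two obvious clusters, with Euclidean distance as the proximity measure and a deterministic initialization chosen just for this illustration:

```python
# k-means: alternate an assignment step (nearest centre) and an update
# step (mean of each cluster) until the partition stabilizes.
import numpy as np

def kmeans(X, centers, iters=20):
    for _ in range(iters):
        # assignment step: index of the nearest centre for each sample
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centre moves to the mean of its cluster
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, centers

X = np.array([[0., 0.], [0.1, 0.2], [-0.1, 0.1],
              [5., 5.], [5.2, 4.8], [4.9, 5.1]])
init = X[[0, 3]]                 # deterministic initialization for the sketch
labels, centers = kmeans(X, init)
print(labels)    # [0 0 0 1 1 1]
print(centers)   # cluster means near (0, 0.1) and (5.03, 4.97)
```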
Support vector machines
- No estimation of error bars
Bayesian estimators
- Error bars estimation
- Expensive from a computational point of view
What is the prediction region of the estimations?
Linear regression algorithms are faster than general regression methods
Can we transform a (highly) non-linear regression problem into a linear one?
- Kernel trick
Ridge regression confidence machine (conformal predictors to
determine the prediction regions)
- The ridge regression prediction for an object x based on samples (xi, yi), i = 1, ..., n, xi ∈ Rd:

ŷ = Y'n (X'n Xn + aI)^(-1) X'n x

- Kernelized version, using a kernel H(x, x'):

ŷ = Y'n (K + aI)^(-1) k, with Ki,j = H(xi, xj) and ki = H(xi, x)

- ŷ depends on the objects x1, ..., xn, x only via the scalar products between them

- V. Vovk, A. Gammerman, G. Shafer. Algorithmic learning in a random world. Springer (2005)
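The kernelized prediction ŷ = Y'(K + aI)^(-1) k can be sketched numerically (the RBF kernel width σ and the ridge parameter a are illustrative choices, not values from the slides):

```python
# Kernel ridge prediction: build the kernel matrix K on the training
# objects, the vector k against the new object x, and evaluate
# y_hat = Y' (K + aI)^{-1} k.
import numpy as np

def H(x, xp, sigma=1.0):
    # RBF kernel
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

def ridge_predict(X, Y, x, a=1e-3):
    n = len(X)
    K = np.array([[H(X[i], X[j]) for j in range(n)] for i in range(n)])
    k = np.array([H(X[i], x) for i in range(n)])
    return Y @ np.linalg.solve(K + a * np.eye(n), k)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([0.0, 1.0, 4.0, 9.0])            # y = x^2 at the training points
print(ridge_predict(X, Y, np.array([2.0])))   # ~4: close to the training value
```

For small a the prediction at a training object nearly reproduces its label, while larger a smooths the fit; note that only kernel evaluations H(xi, xj) are needed, never the feature-space coordinates.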
Intelligent selection of features in machine learning
- An a priori decision on the number of features needed to characterize an object (x ∈ Rn) is not possible
- If n is very large, the aim is to determine m < n features to
  optimize computation
  make physical interpretation easier
- Classification
  Support Vector Machines; genetic algorithms
- Regression
  Determination of relevant variables in sparse linear regression models; conformal predictors are used
- G. Rattá et al. Improved Feature Selection based on Genetic Algorithms for Real Time Disruption Prediction on JET. Submitted
to Fus. Eng. Des.
- J. M. Ramírez et al. Parallel software code for feature selection in combined classifiers by means of genetic algorithms. In
submission
- M. Hebiri. Sparse conformal predictors. Stat. Comput. (2010) 20: 253:266
Sparse linear model: yi = w·xi, i = 1, ..., N, xi ∈ Rd. In matrix form: Y = X'w
- G. S. González et al. Support vector machine-based feature extractor for L/H transitions in JET. Rev. Sci. Ins. 81, 10E123
(2010) 3pp
Intelligent selection of samples for training purposes Dataset: (x1, y1), (x2, y2), ..., (xN, yN)
- Training set: (xi, yi), i = 1, ..., J
- Test set: (xi, yi), i = J+1, ..., N
The optimal training set cannot be chosen with a random selection
- Can we determine a good enough training dataset?
- What improvements can be achieved?
- Conformal predictions can be used in classification problems
[Figure: random selection of training samples vs. intelligent (active) selection]
- L. Makili et al. Active learning using conformal predictors: application to image classification. Next talk
Real-time prediction of disruptions in JET
Intelligent system for feature extraction to characterize L/H transitions in JET and DIII-D
Automatic determination of L/H transition times in JET
Automatic determination of L/H transition times in DIII-D
Intelligent data retrieval of waveforms and images based on patterns from massive databases (JET and TJ-II)
Automatic detection of plasma events in waveforms and video-movies (JET)
Automatic ELM location in JET
Automatic analysis system in the TJ-II Thomson Scattering based on pattern recognition
Noise reduction in images (TJ-II Thomson scattering)
Application of event-based sampling strategies
Spatial location of local perturbations in plasma emissivity derived from projections using conformal predictors