slide-1
SLIDE 1
  • J. Vega

Asociación EURATOM/CIEMAT para Fusión jesus.vega@ciemat.es

slide-2
SLIDE 2

  • Concepts
  • Classification
  • Regression
  • Advanced methods

7th FDPVA. Frascati (March 26-28, 2012) 2

slide-3
SLIDE 3

  • Technology: it gives stereotyped solutions to stereotyped problems
  • Basic science: it is the accumulation of knowledge to explain a phenomenon
  • Applied science: it is the application of scientific knowledge to a particular environment
  • Machine learning can increase our comprehension of plasma physics

7th FDPVA. Frascati (March 26-28, 2012) 3

slide-4
SLIDE 4

  • Learning does not mean ‘learning by heart’ (any computer can memorize)
  • Learning means ‘generalization capability’: we learn with some samples and predict for other samples

7th FDPVA. Frascati (March 26-28, 2012) 4

slide-5
SLIDE 5

  • The learning problem is the problem of finding a desired dependence (function) using a limited number of observations (training data)
      • Classification: the function can represent the separation frontier between two classes
      • Regression: the function can provide a fit to the data
7th FDPVA. Frascati (March 26-28, 2012) 5

[Figure: left, a classification example with a frontier separating 'x' samples from 'o' samples; right, a regression example with a curve fitted to the data]

slide-6
SLIDE 6

  • The general model of learning from examples is described through three components
  • The problem of learning is that of choosing, from the given set of functions f(x, a), the one that best approximates the supervisor’s response

7th FDPVA. Frascati (March 26-28, 2012) 6

  • Generator of random vectors x, drawn from p(x): a fixed but unknown probability distribution function
  • Supervisor: returns an output y for every input x according to p(y|x) (fixed and unknown)
  • Learning machine: implements a set of functions f(x, a)

(xi, yi), i = 1, ..., N: training samples
The learning machine produces predictions ŷi = f(xi, a) such that (ŷ1, ŷ2, ..., ŷN) is "close" to (y1, y2, ..., yN)

slide-7
SLIDE 7

  • Main hypothesis
      • The training set, (xi, yi), i = 1, ..., N, is made up of independent and identically distributed (iid) observations drawn according to p(x, y) = p(y|x)p(x)
  • Loss function: L(y, f(x, a))
      • It measures the quality of the approach performed by the learning algorithm, i.e. the discrepancy between the response y of the supervisor and the response f(x, a) of the learning machine. Its values are ≥ 0
  • Risk functional:

7th FDPVA. Frascati (March 26-28, 2012) 7

R(a) = ∫ L(y, f(x, a)) p(x, y) dx dy

The goal of a learning process is to find the function f(x, a0) that minimizes R(a) (over the class of functions f(x, a)) in the situation where p(x, y) is unknown and the only available information is contained in the training set

slide-8
SLIDE 8

  • Pattern recognition (or classification)
  • Regression estimation
  • Density estimation

7th FDPVA. Frascati (March 26-28, 2012) 8

R(a) = ∫ L(y, f(x, a)) p(x, y) dx dy

  • Pattern recognition: L(y, f(x, a)) = 0 if y = f(x, a); 1 if y ≠ f(x, a)
  • Regression estimation: L(y, f(x, a)) = (y - f(x, a))²
  • Density estimation: L(p(x, a)) = -log p(x, a)
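Since p(x, y) is unknown, in practice the risk is approximated by the average loss over the training samples (empirical risk). A minimal numpy sketch of the empirical version of these three losses, with made-up toy values (not from the slides):

```python
import numpy as np

# Classification: 0/1 loss averaged over the training samples
y_true = np.array([0, 1, 1, 0])          # supervisor's responses
y_pred = np.array([0, 1, 0, 0])          # learning machine outputs f(x_i, a)
risk_classification = np.mean(y_true != y_pred)

# Regression: squared loss between real-valued targets and predictions
t_true = np.array([1.2, 0.7, 2.1])
t_pred = np.array([1.0, 0.9, 2.0])
risk_regression = np.mean((t_true - t_pred) ** 2)

# Density estimation: negative log-likelihood of the samples under a model p(x, a)
x = np.array([0.1, -0.3, 0.8])
p_x = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # e.g. a standard Gaussian model
risk_density = np.mean(-np.log(p_x))

print(risk_classification, risk_regression, risk_density)
```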

slide-9
SLIDE 9

7th FDPVA. Frascati (March 26-28, 2012) 9

Example: a set of indicator functions defined by straight lines in the plane,

f(x, a) = 0 if x2 < a·x1 + b, 1 if x2 ≥ a·x1 + b,  with a = (a, b)

slide-10
SLIDE 10

7th FDPVA. Frascati (March 26-28, 2012) 10

Examples of sets of candidate functions f(x, a): sinusoids f(x, a) = A sin(f·x + φ), parameterized by amplitude, frequency and phase (a = (A, f, φ)), and straight lines f(x, a) = m·x + b, parameterized by slope and intercept (a = (m, b))

slide-11
SLIDE 11

7th FDPVA. Frascati (March 26-28, 2012) 11

slide-12
SLIDE 12

7th FDPVA. Frascati (March 26-28, 2012) 12

Feature types
  • Quantitative (numerical)
      • Continuous-valued (length, pressure)
      • Discrete (total basketball score, number of citizens in a town)
  • Qualitative (categorical)
      • Ordinal (education degree)
      • Nominal (profession, brand of a car)

Dataset: (x1, y1), (x2, y2), ..., (xi, yi), ..., (xN, yN)
xi ∈ Rm: features that are of distinctive nature (object description with attributes managed by computers)
yi ∈ {L1, L2, ..., LK}: label of the sample xi

slide-13
SLIDE 13

7th FDPVA. Frascati (March 26-28, 2012) 13

Dataset: (x1, y1), (x2, y2), ..., (xi, yi), ..., (xN, yN)
xi ∈ Rm: features that are of distinctive nature (object description with attributes managed by computers)
yi ∈ {L1, L2, ..., LK}: known label of the sample xi

Objective: to determine a separating function between classes (generalization) in order to predict the label of new samples with known feature vectors ((xN+1, yN+1), (xN+2, yN+2), ...)

[Figure: two decision boundaries between classes 'x' and 'o'; a smooth boundary generalizes well, while an overly convoluted boundary illustrates overfitting]

slide-14
SLIDE 14

  • How good is a classifier?
  • Training set: a model is created to make predictions. Given xi, the model predicts yi
  • Test set: model validation. The success rate is taken as the level of confidence and it is assumed to be the same for all future samples (see the sketch below)

7th FDPVA. Frascati (March 26-28, 2012) 14

Dataset: (x1, y1), (x2, y2), ..., (xN, yN) (xi, yi), i = 1, ..., J: training set (xi, yi), i = J+1, ..., N: test set
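A minimal sketch of this train/test protocol (the classifier, the split proportion and the toy data are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Toy dataset: N samples with 5 features and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Training set (model creation) and test set (model validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = SVC().fit(X_train, y_train)

# Success rate on the test set, taken as the classifier's level of confidence
success_rate = accuracy_score(y_test, model.predict(X_test))
print(f"success rate: {success_rate:.2f}")
```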

slide-15
SLIDE 15

  • Multi-class problems: K > 2
      • They can be tackled as K binary problems. Each class is compared with the rest (one-versus-the-rest approach; see the sketch below)

7th FDPVA. Frascati (March 26-28, 2012) 15

[Figure: a four-class problem (c1, c2, c3, c4) handled with four binary classifiers (c1 vs. not c1, ..., c4 vs. not c4); regions claimed by none or by several of them form an ambiguity region]
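A minimal one-versus-the-rest sketch (the base classifier and the toy data are arbitrary choices, not from the slides):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy data with K = 4 classes centred at different points of the plane
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
y = rng.integers(0, 4, size=400)
X = centers[y] + rng.normal(scale=0.5, size=(400, 2))

# One binary classifier per class: class k versus "not class k"
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(ovr.predict([[3.8, 0.2]]))   # predicted class label
```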

slide-16
SLIDE 16

  • Examples of feature vectors
      • Disruptions
      • L/H transition
      • Image classification: xi is the set of pixels of an image, yi ∈ {1, 2, 3}

7th FDPVA. Frascati (March 26-28, 2012) 16

[Figure: time traces of plasma signals, sampled at successive times t, t+T, t+2T, ...]

Disruptions: feature vectors are built from signal values at successive times,
x1 = (Ip(t), ne(t), ...), y1 ∈ {D, ND}
x2 = (Ip(t+T), ne(t+T), ...), y2 ∈ {D, ND}
x3 = (Ip(t+2T), ne(t+2T), ...), y3 ∈ {D, ND}
...

L/H transition: (xi, yi) with yi ∈ {L, H}

slide-17
SLIDE 17

  • Single classifiers
      • Support Vector Machines (SVM)
      • Neural networks
      • Bayes decision theory
          • Parametric method
          • Non-parametric method
      • Classification trees
  • Combining classifiers

7th FDPVA. Frascati (March 26-28, 2012) 17

slide-18
SLIDE 18

Binary classifier

It finds the optimal separating hyper-plane between classes

Samples: (xk, yk), xk ∈ Rn, k = 1, ..., N, yk ∈ {+1, -1} (classes C{+1} and C{-1})

7th FDPVA. Frascati (March 26-28, 2012) 18

[Figure: two classes C{+1} and C{-1} separated by the hyper-plane D(x) = 0, with margin hyper-planes D(x) = +1 and D(x) = -1 and maximum margin 2t]

D(x) = w·x + b
D(x) > +1 for samples of C{+1}, D(x) < -1 for samples of C{-1}
Distance from a sample xk to the hyper-plane: |D(xk)| / ||w||

Margin condition: yk·D(xk) / ||w|| ≥ t,  yk ∈ {+1, -1},  k = 1, ..., N

To find the optimal hyper-plane it is necessary to determine the vector w that maximizes the margin t. There are infinite solutions due to the presence of a scale factor. To avoid this: t·||w|| = 1. Therefore, to maximize the margin is equivalent to minimize ||w||.

Optimization problem:  min_w J(w) = ||w||²,  subject to  yi·(w·xi + b) ≥ 1,  i = 1, ..., N

slide-19
SLIDE 19

Solution:

Samples associated to ai ≠ 0 are called “support vectors”

The constant b is obtained from any of the Karush-Kuhn-Tucker conditions

7th FDPVA. Frascati (March 26-28, 2012) 19

[Figure: separating hyper-plane with the support vectors lying on the margins of C{+1} and C{-1}]

Samples: (xk, yk), xk ∈ Rn, k = 1, ..., N, yk ∈ {+1, -1} (classes C{+1} and C{-1})

w* = Σi ai*·yi·xi, where the ai are the Lagrange multipliers; since ai* = 0 except for the support vectors, the sum runs over the support vectors only

Karush-Kuhn-Tucker conditions: ai·[yi·(w·xi + b) - 1] = 0,  i = 1, ..., N

D(x) = w*·x + b* is the distance (with sign) from x to the separating hyper-plane

Classification rule: given x to classify, if sign( Σ(support vectors) ai*·yi·(xi·x) + b* ) ≥ 0 then x ∈ C{+1}; otherwise x ∈ C{-1}

The rest of the training samples are irrelevant to classify new samples

  • V. Cherkassky, F. Mulier. Learning from data. 2nd edition. Wiley-Interscience.
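For illustration, a brief scikit-learn sketch of a (nearly) hard-margin linear SVM; the data and the value of C are placeholders, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=+2.0, size=(50, 2)),    # class C{+1}
               rng.normal(loc=-2.0, size=(50, 2))])   # class C{-1}
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates the hard-margin case

# Samples with non-zero Lagrange multipliers a_i: the support vectors
print("number of support vectors:", len(clf.support_vectors_))

# decision_function gives w*.x + b*, i.e. the signed distance up to the factor ||w||
x_new = np.array([[1.5, 0.5]])
print("D(x) =", clf.decision_function(x_new), "-> class", clf.predict(x_new))
```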

slide-20
SLIDE 20

  • Non-linearly separable case
  • Kernels
      • Linear
      • Polynomials of degree q
      • Radial basis functions
      • Neural network

7th FDPVA. Frascati (March 26-28, 2012) 20

[Figure: the kernel H(x, x') maps the input space into a feature space where the classes become linearly separable]

Linear: H(x, x') = (x·x')
Polynomial of degree q: H(x, x') = [(x·x') + 1]^q
Radial basis functions: H(x, x') = exp(-||x - x'||² / σ²)
Neural network: H(x, x') = tanh(2(x·x') + 1)

  • V. Cherkassky, F. Mulier. Learning from data. 2nd edition. Wiley-Interscience.

Classification rule: given x to classify, if sign( Σ(support vectors) ai*·yi·H(xi, x) + b* ) ≥ 0 then x ∈ C{+1}; otherwise x ∈ C{-1}
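Switching to the non-linearly separable case is just a change of kernel. A short sketch with toy ring-shaped data (the kernel parameters are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Class +1 inside a ring of class -1: not linearly separable in the input space
rng = np.random.default_rng(3)
radius = np.hstack([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
angle = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.hstack([np.ones(100), -np.ones(100)])

# Only the kernel changes with respect to the linear case
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # H(x, x') = exp(-gamma * ||x - x'||^2)
print(clf.predict([[0.2, 0.1], [2.5, 0.0]]))   # expected: inner point -> +1, outer -> -1
```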

slide-21
SLIDE 21

7th FDPVA. Frascati (March 26-28, 2012) 21

Samples: (xj, yj), xj ∈ Rn, j = 1, ..., N, y ∈ {L1, L2}

Architecture: input, hidden layer 1, hidden layer 2, output
W(1): (n+1) x k matrix, W(2): (k+1) x m matrix, W(3): (m+1) x 1 matrix

Activation functions:
s(u) = 1 / (1 + exp(-u))
s(u) = tanh(u) = 2 / (1 + exp(-2u)) - 1

Aim: determine W for a minimum error

[Figure: feed-forward network with inputs x1, ..., xn, hidden units O1(1), ..., Ok(1) and O1(2), ..., Om(2), and a single output y, connected by the weight matrices W(1), W(2), W(3)]

  • C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press
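A minimal numpy sketch of the forward pass through such a network. The weights are random placeholders; the bias of each layer is handled by appending a constant 1 to the layer input, which is one way to read the '+1' in the matrix sizes above:

```python
import numpy as np

def s(u):
    # Sigmoid activation: s(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

n, k, m = 4, 5, 3                       # input size and hidden-layer widths
rng = np.random.default_rng(4)
W1 = rng.normal(size=(n + 1, k))        # W(1): (n+1) x k
W2 = rng.normal(size=(k + 1, m))        # W(2): (k+1) x m
W3 = rng.normal(size=(m + 1, 1))        # W(3): (m+1) x 1

def forward(x):
    # Propagate one sample x in R^n through the two hidden layers to the output
    o1 = s(np.append(x, 1.0) @ W1)      # hidden layer 1
    o2 = s(np.append(o1, 1.0) @ W2)     # hidden layer 2
    return (s(np.append(o2, 1.0) @ W3)).item()   # output in (0, 1)

x = rng.normal(size=n)
y_hat = forward(x)
print("x belongs to L1" if y_hat >= 0.5 else "x belongs to L2")
```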

slide-22
SLIDE 22

7th FDPVA. Frascati (March 26-28, 2012) 22

[Figure: a single unit with inputs x1, x2, ..., xn, weights W1, W2, ..., Wn and output O1(1)]

Samples: (xj, yj), xj ∈ Rn, j = 1, ..., N, y ∈ {L1, L2}

Each unit computes a weighted sum of its inputs and passes it through an activation function:
s1(u) = tanh(u) = 2 / (1 + exp(-2u)) - 1
s2(u) = k·u + b (linear unit)

O1(1) = s1(W1·x1 + W2·x2 + ... + Wn·xn) = 2 / (1 + exp(-2·(W1·x1 + W2·x2 + ... + Wn·xn))) - 1
O2(2) = s2 applied to the output of the previous unit, i.e. k·O1(1) + b

Given x to classify: if O2(2) ≥ threshold, x ∈ {L1}; otherwise x ∈ {L2}

slide-23
SLIDE 23

7th FDPVA. Frascati (March 26-28, 2012) 23

Bayes’ rule:  P(Cj | x) = p(x | Cj) P(Cj) / p(x),  j = 1, 2
(Posterior = Likelihood × Prior / Probability distribution function of x)

p(x) = Σj p(x | Cj) P(Cj)

[Figure: posterior probabilities P(C1 | x) and P(C2 | x) as functions of x]

Classification rule
Given x to classify: if P(C1 | x) > P(C2 | x) then x ∈ C1; otherwise x ∈ C2
Equivalently: if p(x | C1) P(C1) > p(x | C2) P(C2) then x ∈ C1; otherwise x ∈ C2

Likelihood estimation:
  • Parametric: parameter estimation
  • Non-parametric: Parzen window

Samples: (xk, yk), xk ∈ Rn, k = 1, ..., N, y ∈ {C1, C2}

  • R. O. Duda, P. E. Hart, D. G. Stork. Pattern classification. 2nd edition. Wiley-Interscience. (2001)
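A minimal sketch of this rule with parametric (Gaussian) likelihood estimation in one dimension; the training values and the class priors are illustrative assumptions, not from the slides:

```python
import numpy as np
from scipy.stats import norm

# Parametric likelihood estimation: fit a Gaussian per class from training samples
x_c1 = np.array([0.9, 1.1, 1.3, 0.8, 1.0])   # samples of class C1
x_c2 = np.array([2.9, 3.2, 3.0, 3.4, 2.8])   # samples of class C2
mu1, s1 = x_c1.mean(), x_c1.std(ddof=1)
mu2, s2 = x_c2.mean(), x_c2.std(ddof=1)

# Priors P(C1), P(C2) estimated from the class frequencies
P1 = len(x_c1) / (len(x_c1) + len(x_c2))
P2 = 1.0 - P1

def classify(x):
    # Bayes rule: pick the class with the larger p(x|Cj) P(Cj)
    g1 = norm.pdf(x, mu1, s1) * P1
    g2 = norm.pdf(x, mu2, s2) * P2
    return "C1" if g1 > g2 else "C2"

print(classify(1.2), classify(2.7))
```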

slide-24
SLIDE 24

Some classification problems involve features that are nominal data (discrete descriptions without any natural notion of similarity or even ordering)

The questions asked at each node concern a particular property

Successive nodes are visited until a terminal or leaf node is reached, where the category label is read

The same question (Size?) appears in different places

Different leaf nodes can be labelled by the same category

7th FDPVA. Frascati (March 26-28, 2012) 24

[Figure: decision tree for fruit classification. The root node (level 0) asks Color? with branches green / yellow / red; lower levels ask Size?, Shape? and Taste?; leaf nodes carry the labels Watermelon, Apple, Grape, Grapefruit, Lemon, Banana and Cherry. The question Size? appears in different places, and several leaves share the labels Apple and Grape]

  • L. Breiman et al. Classification and Regression Trees. Chapman&Hall/CRC
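A minimal sketch of a classification tree on nominal features; the fruit attributes, their encoding and the toy data are illustrative, not from the slides (nominal values are one-hot encoded because scikit-learn trees expect numerical input):

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Nominal features (color, size) -> fruit label (toy data)
X_nominal = [["green", "big"], ["green", "small"], ["yellow", "big"],
             ["yellow", "small"], ["red", "medium"], ["red", "small"]]
y = ["Watermelon", "Grape", "Grapefruit", "Lemon", "Apple", "Cherry"]

enc = OneHotEncoder(sparse_output=False)
X = enc.fit_transform(X_nominal)

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=[str(f) for f in enc.get_feature_names_out()]))
print(tree.predict(enc.transform([["green", "big"]])))   # expected: ['Watermelon']
```
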
slide-25
SLIDE 25

  • Combining several classifiers allows getting a consensus of results for greater accuracy

7th FDPVA. Frascati (March 26-28, 2012) 25

Approaches to building classifier ensembles

[Figure: a dataset feeds L base classifiers D1, D2, ..., DL, possibly through different feature subsets X(1), X(2), ..., X(L); their outputs are merged by a combiner]

  • Combination level: design different combiners
  • Classifier level: use different base classifiers
  • Feature level: use different feature subsets
  • Data level: use different data subsets

  • L. I. Kuncheva. Combining pattern classifiers. Wiley Interscience

Two main strategies in combining classifiers
  • Fusion (competitive classifiers): each ensemble member is supposed to have knowledge of the whole feature space
  • Selection (cooperative classifiers): each ensemble member is supposed to know well a part of the feature space and be responsible for objects in this part

Methods (the first and third are sketched below)
  • Majority vote
  • Fuzzy aggregation operators
  • Bagging
  • Boosting
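A minimal sketch of majority vote (fusion of heterogeneous classifiers) and bagging (the same base classifier on different data subsets); the classifiers and toy data are arbitrary choices, not from the slides:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Toy data
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Fusion by majority vote over heterogeneous base classifiers
ensemble = VotingClassifier(
    estimators=[("svm", SVC()), ("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    voting="hard",
).fit(X, y)
print(ensemble.predict(X[:5]))

# Bagging: the same base classifier trained on different bootstrap data subsets
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25).fit(X, y)
print(bag.predict(X[:5]))
```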

slide-26
SLIDE 26

  • Dataset: (x1, y1), (x2, y2), ..., (xN, yN)
      • (xi, yi), i = 1, ..., J: training set (model creation)
      • (xi, yi), i = J+1, ..., N: test set (model validation)
  • The predicted labels are compared with the real ones and the success rate that is obtained is taken as the level of confidence of the classifier
  • This level of confidence is assumed to be the same for all future samples

  • (xi, yi), i = N+1, ...: predictions

  • Predictions corresponding to different samples can have different levels of confidence
  • Objective: to qualify each particular prediction with a measure of its reliability

7th FDPVA. Frascati (March 26-28, 2012) 26

slide-27
SLIDE 27

  • Bayes classifiers
  • Logistic regression
  • Conformal predictors

7th FDPVA. Frascati (March 26-28, 2012) 27

Bayes classifiers: P(Cj | x) = p(x | Cj) P(Cj) / p(x), j = L1, L2, ...; the priors P(Cj) must be known or assumed
  • R. O. Duda, P. E. Hart, D. G. Stork. Pattern classification. 2nd edition. Wiley-Interscience. (2001)

Logistic regression: the classifier output D(x) is mapped to a probability, P(class +1 | x) = 1 / (1 + exp(-k·D(x))). The greater the distance D(x), the deeper the point lies in its corresponding class; this has a translation in terms of probability (see the sketch below)
  • E. Alpaydin. Introduction to Machine Learning. The MIT Press

Conformal predictors (CP)
  • Only require the iid hypothesis, and several potential underlying classifiers can be used
  • The accuracy and reliability of the prediction is based on two values: confidence and credibility
  • The basis of CP is to know how conformal the samples are among themselves
  • Confidence: how good the predicted class is against the other possibilities
  • Credibility: how good the training set is to make a prediction for the current sample
  • V. Vovk, A. Gammerman, G. Shafer. Algorithmic learning in a random world. Springer (2005)
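A minimal sketch of the logistic mapping from a classifier's signed distance D(x) to a class probability; the slope k and the distance values are illustrative placeholders:

```python
import math

def prob_class_plus(D, k=1.0):
    # P(class +1 | x) = 1 / (1 + exp(-k * D(x)))
    return 1.0 / (1.0 + math.exp(-k * D))

# Points deeper inside the +1 class (larger D) get probabilities closer to 1
for D in (-2.0, 0.0, 0.5, 3.0):
    print(f"D(x) = {D:+.1f}  ->  P(class +1 | x) = {prob_class_plus(D):.2f}")
```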

slide-28
SLIDE 28

  • In context-free classifications no relation exists among the various classes
  • In context-dependent classifications the various classes are closely related
      • Successive feature vectors are not independent
      • Classifying each feature vector separately from the others obviously has no meaning
  • One of the most widely used models describing the underlying class dependence is the Markov chain rule
  • In the special case in which the states of the Markov chain are not directly observable and can only be inferred from the sequence of observations via some optimization technique, these types of Markov models are known as hidden Markov models
  • The Bayesian point of view is typically used
7th FDPVA. Frascati (March 26-28, 2012) 28

  • S. Theodoridis, K. Koutroumbas. Pattern recognition. 2nd edition. Academic Press
slide-29
SLIDE 29

7th FDPVA. Frascati (March 26-28, 2012) 29

Dataset: (x1, y1), (x2, y2), ..., (xi, yi), ..., (xN, yN)
yi ∈ {L1, L2, ..., LK}: the class labels of the training samples xi are not available

Objective: to “reveal” the organization of the samples into a number of “sensible” clusters, which will allow us to discover similarities and differences among samples and to derive useful conclusions about them
  • The labels are known just after the training
  • A clustering criterion is needed: a proximity measure (distance or similarity)

Example: {sheep, dog, cat, sparrow, seagull, viper, lizard, goldfish, red mullet, blue shark, frog}

[Figure: the same set of animals clustered in different ways depending on the criterion: class of animals (mammals, birds, reptiles, fish, amphibians), existence of lungs, or environment where the animals live]
slide-30
SLIDE 30

  • Sequential algorithms
      • k-means (see the sketch below)
  • Hierarchical clustering algorithms
      • Agglomerative algorithms
      • Divisive algorithms
  • Clustering algorithms based on cost function optimization
      • Hard or crisp clustering algorithms
      • Probabilistic clustering algorithms
      • Fuzzy clustering algorithms
      • Boundary detection algorithms

7th FDPVA. Frascati (March 26-28, 2012) 30

  • S. Theodoridis, K. Koutroumbas. Pattern recognition. 2nd edition. Academic Press
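A minimal k-means sketch (the number of clusters and the toy data are arbitrary choices, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled samples: three groups in the plane (no labels are given to the algorithm)
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 3])])

# Proximity measure: Euclidean distance to the cluster centres
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.cluster_centers_)   # "revealed" cluster centres
print(km.labels_[:10])       # cluster assignment of the first samples
```
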
slide-31
SLIDE 31

7th FDPVA. Frascati (March 26-28, 2012) 31

slide-32
SLIDE 32

7th FDPVA. Frascati (March 26-28, 2012) 32

  • Support vector machines
      • No estimation of error bars
  • Bayesian estimators
      • Error bars estimation
      • Expensive from a computational point of view

What is the prediction region of the estimations?

slide-33
SLIDE 33

  • Linear regression algorithms are faster than general regression methods
  • Can we transform a (highly) non-linear regression problem into a linear one?
      • Kernel trick
  • Ridge regression confidence machine (conformal predictors to determine the prediction regions)
      • The ridge regression prediction for an object x based on samples (xi, yi), i = 1, ..., n, xi ∈ Rd:

7th FDPVA. Frascati (March 26-28, 2012) 33

  • V. Vovk, A. Gammerman, G. Shafer. Algorithmic learning in a random world. Springer (2005)

ŷ = Y'n Xn (X'n Xn + aI)⁻¹ x

With a kernel H(x, x'):  ŷ = Y'n (Kn + aI)⁻¹ kn,  where (Kn)i,j = H(xi, xj) and (kn)i = H(xi, x)

ŷ depends on the objects x1, ..., xn, x only via the scalar products between them
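A minimal numpy sketch of these two predictions on toy data; the ridge parameter a and the Gaussian kernel are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 2))                       # samples x_i in R^d
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)    # responses y_i
x_new = np.array([0.3, -1.2])                      # object x to predict
a = 0.1                                            # ridge parameter

# Linear ridge regression: y_hat = Y' X (X'X + aI)^(-1) x
y_hat_linear = Y @ X @ np.linalg.solve(X.T @ X + a * np.eye(X.shape[1]), x_new)

# Kernel version with a Gaussian kernel H(x, x')
def H(u, v):
    return np.exp(-np.sum((u - v) ** 2) / 2.0)

K = np.array([[H(xi, xj) for xj in X] for xi in X])   # (K_n)_{i,j} = H(x_i, x_j)
k = np.array([H(xi, x_new) for xi in X])              # (k_n)_i = H(x_i, x)
y_hat_kernel = Y @ np.linalg.solve(K + a * np.eye(len(X)), k)

print(y_hat_linear, y_hat_kernel)
```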

slide-34
SLIDE 34

Intelligent selection of features in machine learning

  • It is not possible to decide a priori the number of features needed to characterize an object (x ∈ Rn)
  • If n is large, the aim is to determine m < n features to
      • optimize computation
      • make physical interpretation easier
  • Classification
      • Support Vector Machines
      • Genetic algorithms
  • Regression
      • Determination of relevant variables in sparse linear regression models (see the sketch at the end of this slide)
      • Conformal predictors are used

7th FDPVA. Frascati (March 26-28, 2012) 34

  • G. Rattá et al. Improved Feature Selection based on Genetic Algorithms for Real Time Disruption Prediction on JET. Submitted to Fus. Eng. Des.
  • J. M. Ramírez et al. Parallel software code for feature selection in combined classifiers by means of genetic algorithms. In submission.
  • M. Hebiri. Sparse conformal predictors. Stat. Comput. (2010) 20: 253-266

Sparse linear regression model: (xi, yi), i = 1, ..., N, with yi = w·xi + εi; in matrix form, Y = Xw + ε

  • G. S. González et al. Support vector machine-based feature extractor for L/H transitions in JET. Rev. Sci. Ins. 81, 10E123 (2010) 3pp
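As an illustration of determining relevant variables in a sparse linear model, a plain Lasso sketch on synthetic data; the slides' approach uses sparse conformal predictors (Hebiri 2010), which this does not implement:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic sparse model: only features 0 and 3 out of n = 10 are relevant
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
relevant = np.flatnonzero(lasso.coef_)   # indices of the non-zero coefficients
print("relevant variables:", relevant)   # expected: [0 3]
```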

slide-35
SLIDE 35

  • Intelligent selection of samples for training purposes
  • Dataset: (x1, y1), (x2, y2), ..., (xN, yN)
      • Training set: (xi, yi), i = 1, ..., J
      • Test set: (xi, yi), i = J+1, ..., N
  • The optimal training set cannot be chosen with a random selection
      • Can we determine a good enough training dataset?
      • What improvements can be achieved?
      • Conformal predictions can be used in classification problems

7th FDPVA. Frascati (March 26-28, 2012) 35


  • L. Makili et al. Active learning using conformal predictors: application to image classification. Next talk
slide-36
SLIDE 36

  • Real-time prediction of disruptions in JET
  • Intelligent system for feature extraction to characterize L/H transitions in JET and DIII-D
  • Automatic determination of L/H transition times in JET
  • Automatic determination of L/H transition times in DIII-D
  • Intelligent data retrieval of waveforms and images based on patterns from massive databases (JET and TJ-II)
  • Automatic detection of plasma events in waveforms and video-movies (JET)
  • Automatic ELM location in JET
  • Automatic analysis system in the TJ-II Thomson Scattering based on pattern recognition
  • Noise reduction in images (TJ-II Thomson scattering)
  • Application of event-based sampling strategies
  • Spatial location of local perturbations in plasma emissivity derived from projections using conformal predictors

7th FDPVA. Frascati (March 26-28, 2012) 36

CIEMAT/Consorzio RFX/JET/UNED/UPM