Fusion of Continuous Output Classifiers - PowerPoint PPT Presentation
SLIDE 1

Fusion of Continuous Output Classifiers

Jacob Hays, Amit Pillay, James DeFelice

SLIDE 2

Definitions

  • x – feature vector
  • c – number of classes
  • L – number of classifiers
  • {ω1, ω2, …, ωc} – Set of class labels
  • {D1, D2, …, DL} – Set of classifiers

▫ All c outputs from Di are in interval [0,1]

  • DP(x) – Decision Profile matrix

          = . ) ( ... ) ( ... ) ( ) ( ... ) ( ... ) ( ) ( ... ) ( ... ) ( ) (

, , 1 , , , 1 , , 1 , 1 1 , 1

x d x d x d x d x d x d x d x d x d x DP

C L j L L c i j i i c j
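Concretely, DP(x) is just the L classifiers' soft outputs stacked row-wise. A minimal sketch (mine, not from the slides), assuming numpy and scikit-learn-style classifiers exposing predict_proba:

```python
import numpy as np

def decision_profile(classifiers, x):
    """Stack the c soft outputs of each of the L classifiers into an
    L x c decision profile DP(x); row i is D_i's support for the c classes."""
    return np.vstack([clf.predict_proba(x.reshape(1, -1))[0]
                      for clf in classifiers])
```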

SLIDE 3

Approaches

  • Class Conscious – Use one column of DP(x) at a time

▫ Ex) Simple/Weighted Averages

  • Class Indifferent – Treat DP(x) as a whole new feature space; use a new classifier to make the final decision.

          = . ) ( ... ) ( ... ) ( ) ( ... ) ( ... ) ( ) ( ... ) ( ... ) ( ) (

, , 1 , , , 1 , , 1 , 1 1 , 1

x d x d x d x d x d x d x d x d x d x DP

C L j L L c i j i i c j

SLIDE 4

Discriminant to Continuous

  • Non-continuous classifiers produce a label
  • {g1(x), g2(x), …, gc(x)} – output of D

▫ Would like to normalize to the [0,1] interval

  • Softmax Method

▫ Normalizes to [0,1] ▫ Better if g′(x) is a probability

  • {g′1(x), g′2(x), …, g′c(x)}, where

$$g'_j(x) = \frac{\exp\{g_j(x)\}}{\sum_{k=1}^{c} \exp\{g_k(x)\}}, \qquad \sum_{j=1}^{c} g'_j(x) = 1$$
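A sketch of this normalization in numpy (the max-subtraction is a standard numerical-stability trick, not part of the slide):

```python
import numpy as np

def softmax(g):
    """Map raw discriminant scores g = [g_1(x), ..., g_c(x)] to
    values g' in [0,1] that sum to 1 (the softmax method above)."""
    g = np.asarray(g, dtype=float)
    e = np.exp(g - g.max())   # subtract max for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))   # -> approx. [0.659 0.242 0.099]
```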

SLIDE 5

Converting Linear Discriminant

  • Assuming normal densities
  • Let C be the constant additive terms we drop

$$g_j(x) = \log\{P(\omega_j)\, p(x \mid \omega_j)\}$$

  • Plug into Bayes Rule, and it simplifies to the

softmax function

$$P(\omega_j)\, p(x \mid \omega_j) = A \exp\{g_j(x)\}, \qquad A = \exp\{C\}$$

$$P(\omega_j \mid x) = \frac{A \exp\{g_j(x)\}}{\sum_{k=1}^{c} A \exp\{g_k(x)\}} = \frac{\exp\{g_j(x)\}}{\sum_{k=1}^{c} \exp\{g_k(x)\}}$$

SLIDE 6

Neural Networks

  • Consider a NN with c outputs, {y1, …, yc}
  • When trained using squared error, the outputs can be used as an approximation of the posterior probability.
  • Normalize them to the [0,1] interval using the softmax function.
  • The normalization is independent of the neural network training; it only operates on the outputs.

SLIDE 7

Laplace Estimator for Decision Tree

  • In Decision Trees, you use entropy to split the distribution based on a single feature per level
  • Normally, you continue to split until there is a single class in each leaf of the tree
  • In Probability Estimating Trees, instead of splitting until a single class is in a leaf, split until around K points are in each leaf, and use various methods to calculate the probability of each class at each leaf.

SLIDE 8

Count-based probability, Laplace

  • {k1, k2, …, kc} – Number of sample points of class {ω1, ω2, …, ωc} respectively in the leaf
  • K = k1 + k2 + … + kc
  • Maximum Likelihood (ML) estimate:

$$\hat{P}(\omega_j \mid x) = \frac{k_j}{K}, \qquad j = 1, \ldots, c$$

  • When K is too small, estimates are unpredictable

SLIDE 9

Laplace Estimator

  • Laplace Correction:

$$\hat{P}(\omega_j \mid x) = \frac{k_j + 1}{K + c}$$

  • m-estimation:

$$\hat{P}(\omega_j \mid x) = \frac{k_j + m\, \hat{P}(\omega_j)}{K + m}$$

  • Best to set m so that

$$m \times \hat{P}(\omega_j) \approx 10$$
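A sketch of both estimators, assuming raw leaf counts k and, for m-estimation, externally supplied class priors:

```python
import numpy as np

def laplace_estimate(k):
    """Laplace correction: (k_j + 1) / (K + c) for leaf counts k."""
    k = np.asarray(k, dtype=float)
    return (k + 1) / (k.sum() + len(k))

def m_estimate(k, priors, m):
    """m-estimation: (k_j + m * P(w_j)) / (K + m)."""
    k = np.asarray(k, dtype=float)
    return (k + m * np.asarray(priors)) / (k.sum() + m)

# leaf with 3 points of class 1 and 1 point of class 2
print(laplace_estimate([3, 1]))              # -> [0.667 0.333]
print(m_estimate([3, 1], [0.5, 0.5], 20))    # m * P(w_j) = 10, as suggested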

SLIDE 10

Ting and Witten Laplace estimator

  • Ting and Witten estimator

▫ ω* is the majority class in the leaf

$$\hat{P}(\omega_j \mid x) = \begin{cases} 1 - \dfrac{\sum_{l \neq j} k_l + 1}{K + 2}, & \text{if } \omega_j = \omega^* \\[1.5ex] \left(1 - \hat{P}(\omega^* \mid x)\right) \times \dfrac{k_j}{\sum_{l \neq *} k_l}, & \text{otherwise} \end{cases}$$

SLIDE 11

Weighted Distance Laplace Estimate

  • Take the average distance from x to all samples of class ωj, over the average distance to all samples; the estimate uses inverse distances:

$$\hat{P}(\omega_j \mid x) = \frac{\sum_{x_i \in \omega_j} 1/d(x, x_i)}{\sum_{i=1}^{K} 1/d(x, x_i)}$$

SLIDE 12

Example

SLIDE 13

Class Conscious Combiners

  • Non-trainable Combiners

▫ No extra parameters, all defined up front ▫ Function of the classifier outputs for a specific class

$$\mu_j(x) = F[d_{1,j}(x), d_{2,j}(x), \ldots, d_{L,j}(x)]$$

  • Simple mean

$$\mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x)$$

SLIDE 14

Popular Class Conscious Combiners

  • Minimum/Maximum/Median, e.g.

$$\mu_j(x) = \max_i \{d_{i,j}(x)\}$$

  • Trimmed Mean:

▫ L degrees of support sorted, X percent of values are dropped. Mean taken of remaining.

  • Product

$$\mu_j(x) = \prod_{i=1}^{L} d_{i,j}(x)$$
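A sketch of these non-trainable, class-conscious rules in one helper; the `combine` name and the trim convention (fraction dropped from each end) are my own assumptions:

```python
import numpy as np

def combine(dp, j, rule="mean", trim=0.2):
    """Class-conscious combiners: all operate only on column j of the
    L x c decision profile dp."""
    col = np.sort(dp[:, j])
    if rule == "mean":    return col.mean()
    if rule == "min":     return col[0]
    if rule == "max":     return col[-1]
    if rule == "median":  return np.median(col)
    if rule == "product": return col.prod()
    if rule == "trimmed":             # drop `trim` fraction from each end
        t = int(len(col) * trim)
        return col[t:len(col) - t].mean() if t else col.mean()
    raise ValueError(rule)

dp = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
print([combine(dp, 0, r) for r in ("mean", "min", "max", "product")])
```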

SLIDE 15

Generalized Mean Function

  • Generalized Mean is defined as

$$\mu_j(x, \alpha) = \left( \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x)^{\alpha} \right)^{1/\alpha}$$

except for the following special cases:

▫ α → -∞, Minimum ▫ α = -1, Harmonic Mean ▫ α = 0, Geometric Mean (below) ▫ α = 1, Simple Arithmetic Mean ▫ α → +∞, Maximum

$$\mu_j(x) = \left( \prod_{i=1}^{L} d_{i,j}(x) \right)^{1/L}$$

  • α is chosen beforehand; it sets the level of optimism
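A sketch of the generalized mean with the limiting cases handled explicitly (α = 0 and α → ±∞ must be special-cased since the formula is undefined there):

```python
import numpy as np

def generalized_mean(d, alpha):
    """mu_j(x, alpha) = ((1/L) * sum d_i^alpha)^(1/alpha)."""
    d = np.asarray(d, dtype=float)
    if np.isinf(alpha):
        return d.max() if alpha > 0 else d.min()
    if alpha == 0:                          # geometric mean
        return d.prod() ** (1.0 / len(d))
    return np.mean(d ** alpha) ** (1.0 / alpha)

d = [0.6, 0.7, 0.2]
for a in (-np.inf, -1, 0, 1, np.inf):       # min, harmonic, geometric, mean, max
    print(a, generalized_mean(d, a))
```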

SLIDE 16

Class Conscious Combiner Example

SLIDE 17

Example: Effect of Optimism α

  • 100 training / test sets

▫ Training set (a), 200 samples ▫ Testing set (b), 1000 samples

  • For each ensemble

▫ 10 bootstrap samples (200 values) ▫ Train a classifier on each (Parzen)

SLIDE 18

Example: Effect of Optimism α

  • Generalized mean

▫ -50 ≤ α ≤ +50, steps of 1 ▫ -1 ≤ α ≤ +1, steps of 0.1

  • Simple mean combiner gives the best result
SLIDE 19

Interpreting Results

  • The mean classifier isn’t always the best
  • Shape of the error curve depends upon

▫ The problem ▫ The base classifier used

  • Average and product are the most intensely studied combiners

▫ For some problems, average may be… less accurate, but more stable

SLIDE 20

Ordered Weight Averaging

  • Generalized, non-trainable
  • L coefficients (one for each classifier)
  • Order the classifier outputs for ωj (column j of DP(x)) in descending order
  • Multiply by a vector of coefficients b (weights)

▫ i1, …, iL is a permutation of the indices 1, …, L

$$\mu_j(x) = \sum_{k=1}^{L} b_k \, d_{i_k, j}(x)$$

SLIDE 21

Ordered Weight Averaging: Example

  • Consider a jury assessing sport performance (diving)

▫ Reduce subjective bias

  • Trimmed mean: drop the lowest and highest scores, average the remaining

$$d_j = [0.6 \;\; 0.7 \;\; 0.2 \;\; 0.6 \;\; 0.6]^T, \qquad b = [0 \;\; 1/3 \;\; 1/3 \;\; 1/3 \;\; 0]^T$$

$$\mu_j = b^T\, [0.7 \;\; 0.6 \;\; 0.6 \;\; 0.6 \;\; 0.2]^T = 0.6$$
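The jury example, reproduced as a sketch (the `owa` helper is mine):

```python
import numpy as np

def owa(d, b):
    """Ordered weighted averaging: sort the supports descending,
    then take the inner product with the coefficient vector b."""
    return np.sort(np.asarray(d))[::-1] @ np.asarray(b)

scores = [0.6, 0.7, 0.2, 0.6, 0.6]   # the five jury scores above
b = [0, 1/3, 1/3, 1/3, 0]            # trimmed mean: drop highest and lowest
print(owa(scores, b))                # -> 0.6
```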

SLIDE 22

Ordered Weight Averaging

  • General form of trimmed mean

▫ b = [0, 1/(L-2), 1/(L-2), …, 1/(L-2), 0]^T

  • Other operations may be modeled with careful selection of b

▫ Minimum: b = [0, 0, …, 1]^T ▫ Maximum: b = [1, 0, …, 0]^T ▫ Average: b = [1/L, 1/L, …, 1/L]^T

  • Many resources are spent on developing new aggregation connectives

▫ Bigger question: when to use which one?

SLIDE 23

Trainable Combiners

  • Combiners with additional parameters to be trained

▫ Weighted Average ▫ Fuzzy Integral

SLIDE 24

Weighted Average

  • 3 groups, based on the number of weights
  • L weights

▫ One weight per classifier

$$\mu_j(x) = \sum_{i=1}^{L} w_i \, d_{i,j}(x)$$

▫ Similar to the equation we saw for ordered weight averaging, except we’re trying to optimize wi here (and we’re not reordering di,j) ▫ wi for classifier Di is usually based on its estimated error rate

SLIDE 25

Weighted Average

  • c×L weights

▫ Weights are specific to each class

$$\mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)$$

▫ Only the jth column is used in the calculation ▫ Linear regression is commonly used to derive optimal weights ▫ A “class conscious” combiner

SLIDE 26

Weighted Average

  • c×c×L weights

▫ Support for each class is determined from the entire decision profile DP(x)

$$\mu_j(x) = \sum_{i=1}^{L} \sum_{k=1}^{c} w_{ikj} \, d_{i,k}(x)$$

▫ Different weight space for each class ωj ▫ Whole decision profile is an intermediate feature space ▫ A “class indifferent” combiner

SLIDE 27

Weighted Average: Class Conscious

  • di,j(x) are point estimates of P(ωj | x)

$$\mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)$$

▫ If the estimates are unbiased, Q(x) is a nonbiased minimum variance estimate of P(ωj | x), conditional upon the restriction that the coefficients wi sum to 1:

$$\sum_{i=1}^{L} w_i = 1$$

SLIDE 28

Weighted Average: Class Conscious

  • Weights derived to minimize variance of Q(x)

$$\mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)$$

▫ Q(x) variance ≤ variance of any single classifier

  • We assume the point estimates are unbiased

▫ Variance of di,j(x) = expected squared error of di,j(x)

  • When the coefficients wi minimize the variance

▫ Q(x) is a better estimate of P(ωj | x) than any di,j(x)

SLIDE 29

Ex: Variance of Estimate of P(ωj | x)

  • Calculate variance of the estimates di,j(x)
  • Target values of P(ωj | x) are

▫ 1 (in class ωj), 0 (not in class ωj) ▫ Output for class ω1 of a 2-classifier ensemble D = {D1, D2}, dataset Z = {z1, z2, …, z10} ▫ First 3 points in ω1 ▫ Table shows first columns of the 10 DPs

SLIDE 30

Ex: Variance of Estimate of P(ωj | x)

  • Variance of Di here is the variance of the approximation error

▫ Approximation error determined as

{(1 – 0.71), (1 – 0.76), (1 – 0.15), …, (0 – 0.79)}

▫ Mean of approx. error for D1 is –0.225 ▫ Variance of approx. error for D1 is

$$\sigma_1^2 = \frac{1}{10}\left[ (1 - 0.71 + 0.225)^2 + \cdots + (0 - 0.79 + 0.225)^2 \right] \approx 0.32$$

▫ Covariance matrix of approx. errors for the classifiers is

$$\Sigma = \begin{bmatrix} 0.32 & 0.22 \\ 0.22 & 0.34 \end{bmatrix}$$

SLIDE 31

Constrained Regression

  • Assume approximation errors are normally distributed, zero mean

▫ P(ωj | x) – di,j(x) = approximation error ▫ σik is the covariance of the approximation errors between classifiers Di and Dk

  • General Lagrange form:

$$L(x, \lambda) = f(x) + \lambda \, g(x)$$

  • Find our optimal weights by minimizing J:

$$J = \sum_{i=1}^{L} \sum_{k=1}^{L} w_i w_k \sigma_{ik} - \lambda \left( \sum_{i=1}^{L} w_i - 1 \right)$$

SLIDE 32

Constrained Regression

  • Solution for minimizing

$$J = \sum_{i=1}^{L} \sum_{k=1}^{L} w_i w_k \sigma_{ik} - \lambda \left( \sum_{i=1}^{L} w_i - 1 \right)$$

is

$$w = \Sigma^{-1} I \, (I^T \Sigma^{-1} I)^{-1}$$

  • where w = [w1, w2, …, wL]^T is our set of weights and I is a vector of size L, all 1’s

SLIDE 33

Ex: Constrained Regression

  • Going back to the numbers we had from Table 5.1:

$$\Sigma = \begin{bmatrix} 0.32 & 0.22 \\ 0.22 & 0.34 \end{bmatrix}, \qquad w = \Sigma^{-1} I \, (I^T \Sigma^{-1} I)^{-1}$$

▫ All the weights and covariances need to be labeled with j to indicate which P(ωj | x) we’re estimating

$$w = \left( \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 5.6 & -3.6 \\ -3.6 & 5.3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right)^{-1} \begin{bmatrix} 5.6 & -3.6 \\ -3.6 & 5.3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.54 \\ 0.46 \end{bmatrix}$$
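The worked example can be checked with a few lines of numpy; this is a sketch, and note the slide’s [0.54, 0.46] reflects rounding Σ⁻¹ to one decimal place (exact arithmetic gives roughly [0.55, 0.45]):

```python
import numpy as np

# Covariance of approximation errors from the Table 5.1 example
Sigma = np.array([[0.32, 0.22],
                  [0.22, 0.34]])
I = np.ones((2, 1))

Si = np.linalg.inv(Sigma)
w = Si @ I / (I.T @ Si @ I)   # w = Sigma^-1 I (I^T Sigma^-1 I)^-1
print(w.ravel())              # -> approx. [0.5455 0.4545]
```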

SLIDE 34

Constrained Regression / Comparison

  • Comparison of two combiners

▫ Simple avg ▫ Weighted avg

  • L varied: 2 – 30

▫ L > 20 tends to overfit
SLIDE 35

Constrained Regression, Extension

  • Suppose classifier outputs for ωj are independent

▫ Σ is diagonal, with the variances for D1, …, DL along the diagonal ▫ This simplifies the weight optimization:

$$w_i = \frac{1/\sigma_i^2}{\sum_{k=1}^{L} 1/\sigma_k^2}$$

SLIDE 36

Fuzzy Integral

  • Based on fuzzy set theory
  • Main Idea: measure the strength not only of each classifier but also of every subset of classifiers
  • The measure of strength of each subset of classifiers gives how good that subset is for the given input x. It is also called a fuzzy measure.

SLIDE 37

Fuzzy Integral cont.

subset   D1    D2    D3    D1,D2   D1,D3   D2,D3   D1,D2,D3
g        0.3   0.1   0.4   0.4     0.5     0.8     1

jth column of the decision profile for input x: [0.1 0.7 0.5]
Goal: to find μj(x)

  • 1. Sort the degrees of support in ascending order
  • 2. Append 0 and 1 to the list if not present
  • 3. For each value of α in the list, find the classifiers having support more than or equal to α
  • 4. The subset of such classifiers is called the α-cut (Hα)
SLIDE 38

Fuzzy Integral cont.

α = 0:    H0 = {D1, D2, D3},     g(H0) = 1
α = 0.1:  H0.1 = {D1, D2, D3},   g(H0.1) = 1
α = 0.5:  H0.5 = {D2, D3},       g(H0.5) = 0.8
α = 0.7:  H0.7 = {D2},           g(H0.7) = 0.1
α = 1:    H1 = ∅,                g(H1) = 0

μj(x) = max{min(α, g(Hα))} = max{min(0,1), min(0.1,1), min(0.5,0.8), min(0.7,0.1), min(1,0)} = max(0, 0.1, 0.5, 0.1, 0) = 0.5
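A sketch of this max-min (Sugeno-style) fuzzy integral, with the fuzzy measure g stored as a dict keyed by classifier subsets (my own representation):

```python
import numpy as np

def sugeno_integral(support, g):
    """mu_j(x) = max over alpha of min(alpha, g(H_alpha)), where H_alpha
    is the set of classifiers with support >= alpha."""
    alphas = sorted(set(support) | {0.0, 1.0})
    best = 0.0
    for a in alphas:
        H = frozenset(i for i, s in enumerate(support) if s >= a)
        best = max(best, min(a, g.get(H, 0.0)))   # empty set measures 0
    return best

# Fuzzy measure from the table above (classifiers indexed 0, 1, 2)
g = {frozenset({0}): 0.3, frozenset({1}): 0.1, frozenset({2}): 0.4,
     frozenset({0, 1}): 0.4, frozenset({0, 2}): 0.5,
     frozenset({1, 2}): 0.8, frozenset({0, 1, 2}): 1.0}

print(sugeno_integral([0.1, 0.7, 0.5], g))   # -> 0.5
```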

SLIDE 39

Class-Indifferent Combiners

  • Unlike class-conscious combiners, these combiners use all L×c degrees of support in the Decision Profile DP(x)

SLIDE 40

Decision Templates

  • A typical decision profile (DP) that is representative of class ωj is called a Decision Template (DTj)
  • Main Idea:

▫ Compare the decision template (DTj) with the current decision profile DP(x) for some test input x using some similarity measure.

  • Training:

▫ For j = 1 to c, calculate the mean of the DP(zk) for inputs zk from some data set Z that belong to class ωj. This mean represents the decision template (DTj).

SLIDE 41

Decision Template Cont.

so we have

$$DT_j = \frac{1}{N_j} \sum_{z_k \in \omega_j} DP(z_k)$$

where Nj is the number of elements of Z from ωj

  • Operation:

▫ Given an input x ∈ R^n, construct DP(x) and then compute the similarity S between DP(x) and each DTj: μj(x) = S(DP(x), DTj), j = 1, …, c.

SLIDE 42
SLIDE 43

Decision Template cont.

  • Similarity Measure

1) Squared Euclidean Distance (DT(E)):

$$\mu_j(x) = 1 - \frac{1}{L \times c} \sum_{i=1}^{L} \sum_{k=1}^{c} \left[ DT_j(i,k) - d_{i,k}(x) \right]^2$$

where DTj(i,k) is the (i,k) entry in decision template DTj ▫ Similar to nearest mean classification in the intermediate space ▫ We can use other distance measures like Minkowski, Mahalanobis, etc.

SLIDE 44

Decision Template cont.

Similarity Measure

2) Symmetric Difference (DT(S))

▫ This measure comes from fuzzy set theory

SLIDE 45

Decision Template cont.

  • Illustration of Decision Templates:

$$DT_1 = \begin{bmatrix} 0.6 & 0.4 \\ 0.8 & 0.2 \\ 0.5 & 0.5 \end{bmatrix}, \qquad DT_2 = \begin{bmatrix} 0.3 & 0.7 \\ 0.4 & 0.6 \\ 0.1 & 0.9 \end{bmatrix}, \qquad DP(x) = \begin{bmatrix} 0.3 & 0.7 \\ 0.6 & 0.4 \\ 0.5 & 0.5 \end{bmatrix}$$

DT version   μ1(x)    μ2(x)    Label
DT(E)        0.9567   0.9333   ω1
DT(S)        0.5333   0.5333   ω2
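A sketch reproducing the DT(E) numbers above (the 1 − SSD/(L·c) form follows the similarity measure on the earlier slide):

```python
import numpy as np

def dt_euclidean(dp, dts):
    """DT(E): mu_j(x) = 1 - (1/(L*c)) * sum of squared differences
    between DP(x) and each template DT_j."""
    L, c = dp.shape
    return np.array([1 - np.sum((dt - dp) ** 2) / (L * c) for dt in dts])

DT1 = np.array([[0.6, 0.4], [0.8, 0.2], [0.5, 0.5]])
DT2 = np.array([[0.3, 0.7], [0.4, 0.6], [0.1, 0.9]])
DP  = np.array([[0.3, 0.7], [0.6, 0.4], [0.5, 0.5]])

print(dt_euclidean(DP, [DT1, DT2]))   # -> [0.9567 0.9333], label w1
```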

SLIDE 46

Why Class-Indifferent?

  • The Decision Templates approach is context-free (free from the nature of the classifier)
  • Unlike class-conscious combiners, which are idempotent by design
  • Assume we have L copies of classifier D in the ensemble and the DTs for the two classes are

$$DT_1 = \begin{bmatrix} 0.55 & 0.45 \\ \vdots & \vdots \\ 0.55 & 0.45 \end{bmatrix}, \qquad DT_2 = \begin{bmatrix} 0.2 & 0.8 \\ \vdots & \vdots \\ 0.2 & 0.8 \end{bmatrix}$$

and the decision of D for x is d1 = 0.4 and d2 = 0.6

SLIDE 47

Why Class-Indifferent?

  • Then all class-conscious methods will assign x to class ω2
  • But based on DT(E) we have the two squared Euclidean distances: L(0.15² + 0.15²) = 0.045L to DT1 and L(0.2² + 0.2²) = 0.08L to DT2
  • Hence x is classified as ω1. Which means it is possible that the true class is ω1, so the DTs prove to be correct where the other combiners, including D itself, are wrong

SLIDE 48

Dempster-Shafer Combination

  • It’s just another method of comparing the DTs and the DP of a new x
  • Instead of calculating the similarity between the DTj and DP(x), this method measures the proximity of each individual classifier’s output to the corresponding row of each DT:

$$\Phi_{j,i}(x) = \frac{\left(1 + \| DT_j^i - D_i(x) \|^2\right)^{-1}}{\sum_{k=1}^{c} \left(1 + \| DT_k^i - D_i(x) \|^2\right)^{-1}}$$

where DTij = ith row of DTj and Di(x) = output of Di (the ith row of DP(x))

SLIDE 49

Dempster-Shafer Combination cont.

  • Based on this, we calculate, for each class j = 1 to c and for each classifier, belief degrees as:

$$b_j(D_i(x)) = \frac{\Phi_{j,i}(x) \prod_{k \neq j} \left(1 - \Phi_{k,i}(x)\right)}{1 - \Phi_{j,i}(x) \left[ 1 - \prod_{k \neq j} \left(1 - \Phi_{k,i}(x)\right) \right]}$$

  • And the final degree of support for the given input is given as the normalized product:

$$\mu_j(x) \propto \prod_{i=1}^{L} b_j(D_i(x))$$

SLIDE 50

Dempster-Shafer Combination cont.

  • An illustration, using the decision templates and decision profile from the earlier example:

$$DT_1 = \begin{bmatrix} 0.6 & 0.4 \\ 0.8 & 0.2 \\ 0.5 & 0.5 \end{bmatrix}, \qquad DT_2 = \begin{bmatrix} 0.3 & 0.7 \\ 0.4 & 0.6 \\ 0.1 & 0.9 \end{bmatrix}, \qquad DP(x) = \begin{bmatrix} 0.3 & 0.7 \\ 0.6 & 0.4 \\ 0.5 & 0.5 \end{bmatrix}$$

Then the proximities for each of the 3 classifiers:

class   Φj,1(x)   Φj,2(x)   Φj,3(x)
ω1      0.4587    0.5000    0.5690
ω2      0.5413    0.5000    0.4310

SLIDE 51

Dempster-Shafer Combination cont.

  • The belief equation for ω1, classifier D1:

$$b_1(D_1(x)) = \frac{0.4587 \times (1 - 0.5413)}{1 - 0.4587 \times 0.5413} = 0.2799$$

  • Similarly we calculate the beliefs for ω2, and the final degrees of belief for classes 1 and 2 are:

Class   bj(D1(x))   bj(D2(x))   bj(D3(x))   μj(x)
ω1      0.2799      0.3333      0.4289      0.5558
ω2      0.3898      0.3333      0.2462      0.4442
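A sketch of the whole pipeline (proximities, beliefs, normalized product) that reproduces the numbers above; the helper name and looping structure are mine:

```python
import numpy as np

def ds_combine(dp, dts):
    """Dempster-Shafer combiner: per-classifier proximities to each
    template row, belief degrees, then a normalized product."""
    L, c = dp.shape
    mu = np.ones(c)
    for i in range(L):
        # proximity of classifier i's output to row i of each template
        prox = np.array([1.0 / (1.0 + np.sum((dt[i] - dp[i]) ** 2))
                         for dt in dts])
        prox /= prox.sum()
        for j in range(c):
            rest = np.prod(1 - np.delete(prox, j))      # prod over k != j
            mu[j] *= prox[j] * rest / (1 - prox[j] * (1 - rest))
    return mu / mu.sum()

DT1 = np.array([[0.6, 0.4], [0.8, 0.2], [0.5, 0.5]])
DT2 = np.array([[0.3, 0.7], [0.4, 0.6], [0.1, 0.9]])
DP  = np.array([[0.3, 0.7], [0.6, 0.4], [0.5, 0.5]])
print(ds_combine(DP, [DT1, DT2]))   # -> approx. [0.556 0.444]
```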

SLIDE 52

Classifier Fusion using DS

Classifier Fusion using Dempster-Shafer theory of evidence to predict Breast Cancer Tumors

  • DS theory of belief was applied to fuse breast cancer data obtained from different diagnostic techniques
  • Classifiers used were SVMs with linear, polynomial, and RBF kernels
  • The classifiers give beliefs for two classes: benign and malignant
  • These evidences are then used to reach a final diagnosis using the DS belief combination formula.

SLIDE 53

References

  • L. Kuncheva (2004). Combining Pattern Classifiers: Methods and Algorithms. Wiley.
  • Raza, Mansoor; Gondal, Iqbal; Green, David; Coppel, Ross L., “Classifier Fusion Using Dempster-Shafer Theory of Evidence to Predict Breast Cancer Tumors”, TENCON 2006, IEEE Region 10 Conference, 14-17 Nov. 2006, pp. 1-4.