  1. Fusion of Continuous Output Classifiers
  Jacob Hays, Amit Pillay, James DeFelice

  2. Definitions
  • x – feature vector
  • c – number of classes
  • L – number of classifiers
  • {ω_1, ω_2, …, ω_c} – set of class labels
  • {D_1, D_2, …, D_L} – set of classifiers
  ▫ All c outputs of each D_i lie in the interval [0,1]
  • DP(x) – decision profile matrix:

  DP(x) = \begin{bmatrix} d_{1,1}(x) & \cdots & d_{1,j}(x) & \cdots & d_{1,c}(x) \\ \vdots & & \vdots & & \vdots \\ d_{i,1}(x) & \cdots & d_{i,j}(x) & \cdots & d_{i,c}(x) \\ \vdots & & \vdots & & \vdots \\ d_{L,1}(x) & \cdots & d_{L,j}(x) & \cdots & d_{L,c}(x) \end{bmatrix}
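As a concrete illustration, a minimal Python sketch of building DP(x). The `support(x)` method and the `decision_profile` name are hypothetical conventions, not part of the slides:

```python
import numpy as np

def decision_profile(classifiers, x):
    """Stack the c soft outputs of L classifiers into an L x c matrix DP(x).

    Each element of `classifiers` is assumed to expose a `support(x)` method
    returning a length-c array of degrees of support in [0, 1].
    """
    return np.vstack([clf.support(x) for clf in classifiers])
```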

  3. Approaches
  • Class Conscious – use one column of DP(x) at a time
  ▫ Ex) simple/weighted averages
  • Class Indifferent – treat the whole DP(x) as a new feature space and use a new classifier to make the final decision

  4. Discriminant to Continuous
  • Non-continuous classifiers produce a label
  • {g_1(x), g_2(x), …, g_c(x)} – outputs of D
  ▫ We would like to normalize them to the [0,1] interval
  • {g'_1(x), g'_2(x), …, g'_c(x)}, where

  \sum_{j=1}^{c} g'_j(x) = 1

  • Softmax method, which normalizes to [0,1]:

  g'_j(x) = \frac{\exp\{g_j(x)\}}{\sum_{k=1}^{c} \exp\{g_k(x)\}}

  • Better still if g'_j(x) is a probability
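A minimal Python sketch of this normalization; the subtraction of the maximum score is only for numerical stability and is not part of the slide:

```python
import numpy as np

def softmax(g):
    """Normalize raw discriminant scores g_1..g_c to values in [0,1] that sum to 1."""
    g = np.asarray(g, dtype=float)
    e = np.exp(g - g.max())  # shift by the max for numerical stability
    return e / e.sum()

# Example: discriminant scores for c = 3 classes
print(softmax([2.0, 1.0, -0.5]))  # approximately [0.69, 0.25, 0.06]
```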

  5. Converting a Linear Discriminant
  • Assuming normal densities,

  g_j(x) = \log\{P(\omega_j)\, p(x \mid \omega_j)\}

  • Let C be the constant additive terms we drop, and A = \exp\{C\}

  P(\omega_j)\, p(x \mid \omega_j) = A \exp\{g_j(x)\}

  • Plug into Bayes' rule, and it simplifies to the softmax function:

  P(\omega_j \mid x) = \frac{A \exp\{g_j(x)\}}{\sum_{k=1}^{c} A \exp\{g_k(x)\}} = \frac{\exp\{g_j(x)\}}{\sum_{k=1}^{c} \exp\{g_k(x)\}}

  6. Neural Networks
  • Consider a NN with c outputs, {y_1, …, y_c}
  • When trained using squared error, the outputs can be used as an approximation of the posterior probabilities
  • Normalize them to the [0,1] interval using the softmax function
  • The normalization is independent of the neural network training; it is applied only to the outputs

  7. Laplace Estimator for Decision Trees
  • In decision trees, entropy is used to split the distribution on a single feature per level
  • Normally, you continue to split until there is a single class in each leaf of the tree
  • In Probability Estimating Trees, instead of splitting until a single class remains in a leaf, split until around K points are in each leaf, and use various methods to calculate the probability of each class at each leaf

  8. Count-Based Probability, Laplace
  • {k_1, k_2, …, k_c} – number of sample points of classes {ω_1, ω_2, …, ω_c}, respectively, in the leaf
  • K = k_1 + k_2 + … + k_c
  • Maximum Likelihood (ML) estimate:

  \hat{P}(\omega_j \mid x) = \frac{k_j}{K}, \quad j = 1, \ldots, c

  • When K is too small, the estimates are unpredictable
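A one-function sketch of the ML estimate from leaf counts (the function name is illustrative):

```python
import numpy as np

def ml_estimate(counts):
    """Maximum-likelihood class probabilities k_j / K from per-class leaf counts."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

print(ml_estimate([3, 1, 0]))  # [0.75, 0.25, 0.0]
```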

  9. Laplace Estimator
  • Laplace correction:

  \hat{P}(\omega_j \mid x) = \frac{k_j + 1}{K + c}

  • m-estimation:

  \hat{P}(\omega_j \mid x) = \frac{k_j + m \hat{P}(\omega_j)}{K + m}

  • Best to set m so that m \times \hat{P}(\omega_j) \approx 10
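The same leaf counts under the two corrections above, as a brief Python sketch. The prior \hat{P}(ω_j) for m-estimation would normally come from the training-set class frequencies; here it is simply passed in:

```python
import numpy as np

def laplace_estimate(counts):
    """Laplace correction: (k_j + 1) / (K + c)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + 1.0) / (counts.sum() + len(counts))

def m_estimate(counts, priors, m):
    """m-estimation: (k_j + m * P(w_j)) / (K + m), with class priors P(w_j)."""
    counts = np.asarray(counts, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return (counts + m * priors) / (counts.sum() + m)

counts = [3, 1, 0]
print(laplace_estimate(counts))                 # [0.571, 0.286, 0.143]
print(m_estimate(counts, [1/3, 1/3, 1/3], 30))  # uniform priors; m chosen so m * P ~ 10
```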

  10. Ting and William Laplace Estimator
  • Ting and William
  ▫ ω* is the majority class in the leaf

  \hat{P}(\omega_j \mid x) = \begin{cases} 1 - \dfrac{\sum_{l \ne j} k_l + 1}{K + 2} & \text{if } \omega_j = \omega^* \\[2ex] \dfrac{k_j}{\sum_{l \ne *} k_l} \left[ 1 - \hat{P}(\omega^* \mid x) \right] & \text{otherwise} \end{cases}
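A minimal sketch of this two-case smoothing as reconstructed above (ties for the majority class are broken here by the first maximal count, an implementation choice not taken from the slide):

```python
import numpy as np

def ting_william_estimate(counts):
    """Smoothed leaf probabilities: the majority class gets 1 - (sum of the other counts + 1)/(K + 2);
    the remaining classes share the leftover mass in proportion to their counts."""
    counts = np.asarray(counts, dtype=float)
    K = counts.sum()
    star = int(np.argmax(counts))            # index of the majority class
    p_star = 1.0 - (K - counts[star] + 1.0) / (K + 2.0)
    rest = K - counts[star]                  # total count of the non-majority classes
    probs = np.zeros_like(counts)
    probs[star] = p_star
    if rest > 0:
        mask = np.arange(len(counts)) != star
        probs[mask] = counts[mask] / rest * (1.0 - p_star)
    return probs

print(ting_william_estimate([3, 1, 0]))  # [0.667, 0.333, 0.0]
```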

  11. Weighted Distance Laplace Estimate
  • Take the average distance from x to all samples of class ω_j, over the average distance from x to all samples:

  \hat{P}(\omega_j \mid x) = \frac{\frac{1}{k_j} \sum_{x^{(j)} \in \omega_j} d(x, x^{(j)})}{\frac{1}{K} \sum_{i=1}^{K} d(x, x^{(i)})}
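A literal sketch of the ratio described above. The Euclidean distance, the variable names, and the leaf-level arguments `X_leaf`/`y_leaf` are assumptions for illustration:

```python
import numpy as np

def weighted_distance_estimate(x, X_leaf, y_leaf, target_class):
    """Average distance from x to the leaf samples of one class,
    divided by the average distance from x to all leaf samples."""
    X_leaf = np.asarray(X_leaf, dtype=float)
    d_all = np.linalg.norm(X_leaf - np.asarray(x, dtype=float), axis=1)   # distances to every leaf sample
    d_class = d_all[np.asarray(y_leaf) == target_class]                   # distances to the target class only
    return d_class.mean() / d_all.mean()
```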

  12. Example

  13. Class Conscious Combiners
  • Non-trainable combiners
  ▫ No extra parameters; everything is defined up front
  ▫ A function of the classifier outputs for a specific class:

  \mu_j(x) = F[d_{1,j}(x), d_{2,j}(x), \ldots, d_{L,j}(x)]

  • Simple mean:

  \mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x)
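A brief sketch of the simple mean combiner operating column-wise on the decision profile:

```python
import numpy as np

def simple_mean_support(DP):
    """Per-class support: the mean of each column of the L x c decision profile."""
    return np.asarray(DP, dtype=float).mean(axis=0)

DP = np.array([[0.8, 0.2],
               [0.6, 0.4],
               [0.9, 0.1]])
print(simple_mean_support(DP))                   # [0.767, 0.233]
print(int(np.argmax(simple_mean_support(DP))))   # predicted class index: 0
```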

  14. Popular Class Conscious Combiners
  • Minimum/Maximum/Median, e.g.

  \mu_j(x) = \max_i \{ d_{i,j}(x) \}

  • Trimmed mean:
  ▫ The L degrees of support are sorted, X percent of the values are dropped, and the mean is taken of the remaining values
  • Product:

  \mu_j(x) = \prod_{i=1}^{L} d_{i,j}(x)
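Sketches of these combiners applied to a single column of supports; the 20% trim level is only an example value:

```python
import numpy as np

def max_support(d):      return np.max(d)
def min_support(d):      return np.min(d)
def median_support(d):   return np.median(d)
def product_support(d):  return np.prod(d)

def trimmed_mean_support(d, trim_frac=0.2):
    """Drop trim_frac of the values from each end of the sorted supports, then average."""
    d = np.sort(np.asarray(d, dtype=float))
    k = int(len(d) * trim_frac)
    return d[k:len(d) - k].mean()

d_j = [0.6, 0.7, 0.2, 0.6, 0.6]    # supports for class j from L = 5 classifiers
print(trimmed_mean_support(d_j))    # 0.6: drops the 0.2 and the 0.7
```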

  15. Generalized Mean Function

  \mu_j(x, \alpha) = \left( \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x)^{\alpha} \right)^{1/\alpha}

  • The generalized mean is defined as above except for the following special cases:
  ▫ α → −∞: minimum
  ▫ α = −1: harmonic mean
  ▫ α = 0: geometric mean, \mu_j(x) = \left( \prod_{i=1}^{L} d_{i,j}(x) \right)^{1/L}
  ▫ α = 1: simple arithmetic mean
  ▫ α → +∞: maximum
  • α is chosen beforehand and reflects the level of optimism
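A compact sketch of the generalized mean with the limit cases handled explicitly:

```python
import numpy as np

def generalized_mean(d, alpha):
    """Generalized mean of the supports d with exponent alpha, covering the special cases."""
    d = np.asarray(d, dtype=float)
    if alpha == 0:                      # limit alpha -> 0: geometric mean
        return float(np.prod(d) ** (1.0 / len(d)))
    if np.isinf(alpha):                 # +inf -> maximum, -inf -> minimum
        return float(d.max() if alpha > 0 else d.min())
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

d_j = [0.6, 0.7, 0.2, 0.6, 0.6]
for a in (-np.inf, -1, 0, 1, np.inf):   # minimum, harmonic, geometric, arithmetic, maximum
    print(a, round(generalized_mean(d_j, a), 3))
```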

  16. Class Conscious Combiner Example

  17. Example: Effect of Optimism α
  • 100 training/test sets
  ▫ Training set (a): 200 samples
  ▫ Testing set (b): 1000 samples
  • For each ensemble
  ▫ 10 bootstrap samples (200 values)
  ▫ Train a classifier on each (Parzen)

  18. Example: Effect of Optimism α
  • Generalized mean
  ▫ −50 ≤ α ≤ +50, in steps of 1
  ▫ −1 ≤ α ≤ +1, in steps of 0.1
  • The simple mean combiner gives the best result

  19. Interpreting Results
  • The mean classifier isn't always the best
  • The shape of the error curve depends upon
  ▫ The problem
  ▫ The base classifier used
  • Average and product are the most intensely studied combiners
  ▫ For some problems, the average may be…
  - Less accurate, but
  - More stable

  20. Ordered Weighted Averaging
  • Generalized, non-trainable
  • L coefficients (one for each classifier)
  • Sort the classifiers' supports for ω_j in descending order
  • Multiply by a vector of coefficients b (weights)
  ▫ i_1, …, i_L is a permutation of the indices 1, …, L such that d_{i_1,j}(x) ≥ … ≥ d_{i_L,j}(x)

  \mu_j(x) = \sum_{k=1}^{L} b_k \, d_{i_k, j}(x)

  21. Ordered Weighted Averaging: Example
  • Consider a jury assessing sport performance (diving)
  ▫ Reduce subjective bias
  - Trimmed mean
  - Drop the lowest and highest scores
  - Average the remaining scores

  d_j = \begin{bmatrix} 0.6 & 0.7 & 0.2 & 0.6 & 0.6 \end{bmatrix}^T, \quad b = \begin{bmatrix} 0 & \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & 0 \end{bmatrix}^T

  \mu_j = \begin{bmatrix} 0 & \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & 0 \end{bmatrix} \begin{bmatrix} 0.7 \\ 0.6 \\ 0.6 \\ 0.6 \\ 0.2 \end{bmatrix} = 0.6
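A short sketch of the OWA combiner, checked against the jury example above; the extra weight vectors anticipate the special cases on the next slide:

```python
import numpy as np

def owa(d, b):
    """Ordered weighted average: sort the supports in descending order,
    then take the dot product with the coefficient vector b."""
    d_sorted = np.sort(np.asarray(d, dtype=float))[::-1]
    return float(np.dot(b, d_sorted))

d_j = [0.6, 0.7, 0.2, 0.6, 0.6]
b_trimmed = [0, 1/3, 1/3, 1/3, 0]        # drop the highest and lowest score
print(round(owa(d_j, b_trimmed), 3))     # 0.6

# Other combiners fall out of the same form by choosing b:
print(owa(d_j, [0, 0, 0, 0, 1]))         # minimum -> 0.2
print(owa(d_j, [1, 0, 0, 0, 0]))         # maximum -> 0.7
print(owa(d_j, [0.2] * 5))               # simple average -> 0.54
```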

  22. Ordered Weighted Averaging
  • General form of the trimmed mean
  ▫ b = [0, 1/(L−2), 1/(L−2), …, 1/(L−2), 0]^T
  • Other operations may be modeled with careful selection of b
  ▫ Minimum: b = [0, 0, …, 1]^T
  ▫ Maximum: b = [1, 0, …, 0]^T
  ▫ Average: b = [1/L, 1/L, …, 1/L]^T
  • Many resources have been spent on developing new aggregation connectives
  ▫ The bigger question: when to use which one?

  23. Trainable Combiners
  • Combiners with additional parameters to be trained
  ▫ Weighted Average
  ▫ Fuzzy Integral

  24. Weighted Average
  • 3 groups, based on the number of weights
  • L weights
  ▫ One weight per classifier:

  \mu_j(x) = \sum_{i=1}^{L} w_i \, d_{i,j}(x)

  ▫ Similar to the equation we saw for the ordered weighted average, except here we are trying to optimize the w_i (and we are not reordering the d_{i,j})
  ▫ w_i for classifier D_i is usually based on its estimated error rate

  25. Weighted Average
  • c × L weights
  ▫ Weights are specific to each class:

  \mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)

  ▫ Only the j-th column of DP(x) is used in the calculation
  ▫ Linear regression is commonly used to derive the optimal weights
  ▫ A "class conscious" combiner

  26. Weighted Average
  • c × c × L weights
  ▫ Support for each class is determined from the entire decision profile DP(x):

  \mu_j(x) = \sum_{i=1}^{L} \sum_{k=1}^{c} w_{ikj} \, d_{i,k}(x)

  ▫ A different set of weights for each class ω_j
  ▫ The whole decision profile is the intermediate feature space
  - A "class indifferent" combiner
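A compact sketch contrasting the three weight layouts on an L × c decision profile. The weight values below are placeholders chosen so that every variant reduces to the simple mean, which makes the shapes easy to check:

```python
import numpy as np

def weighted_l(DP, w):
    """L weights: one weight per classifier, shared by all classes. w has shape (L,)."""
    return np.asarray(w) @ np.asarray(DP)                 # length-c support vector

def weighted_cl(DP, W):
    """c x L weights: one weight per classifier and per class. W has shape (L, c)."""
    return np.sum(np.asarray(W) * np.asarray(DP), axis=0)

def weighted_ccl(DP, W):
    """c x c x L weights: support for class j uses the whole profile. W has shape (c, L, c)."""
    DP = np.asarray(DP)
    return np.array([np.sum(W[j] * DP) for j in range(W.shape[0])])

L, c = 3, 2
DP = np.array([[0.8, 0.2], [0.6, 0.4], [0.9, 0.1]])
print(weighted_l(DP, np.full(L, 1 / L)))                  # equal weights -> simple mean
print(weighted_cl(DP, np.full((L, c), 1 / L)))
print(weighted_ccl(DP, np.zeros((c, L, c)) + np.eye(c)[:, None, :] / L))  # picks out column j
```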

  27. Weighted Average: Class Conscious

  \mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)

  • The d_{i,j}(x) are point estimates of P(ω_j | x)
  ▫ If the estimates are unbiased,
  - Q(x) (the combined estimate) is an unbiased minimum-variance estimate of P(ω_j | x), conditional upon…
  - the restriction that the coefficients w_i sum to 1:

  \sum_{i=1}^{L} w_i = 1

  28. Weighted Average: Class Conscious

  \mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)

  • The weights are derived to minimize the variance of Q(x)
  ▫ The variance of Q(x) is ≤ the variance of any single classifier
  • We assume the point estimates are unbiased
  ▫ The variance of d_{i,j}(x) equals the expected squared error of d_{i,j}(x)
  • When the coefficients w_i minimize the variance
  ▫ Q(x) is a better estimate of P(ω_j | x) than any d_{i,j}(x)
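For intuition, a standard closed form under an additional assumption not stated on the slide, namely that the individual estimates are independent with variances σ_i²: the variance-minimizing weights are inversely proportional to each classifier's variance,

  w_i = \frac{1 / \sigma_i^2}{\sum_{k=1}^{L} 1 / \sigma_k^2},
  \qquad
  \operatorname{Var}[Q(x)] = \frac{1}{\sum_{k=1}^{L} 1 / \sigma_k^2} \le \min_i \sigma_i^2 ,

which makes explicit why the combined estimate can be no worse, and is usually better, than the best single classifier under these assumptions.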
