
Pattern Analysis & Applic. (1998) 1:18-27 © 1998 Springer-Verlag London Limited

Combining Classifiers: A Theoretical Framework

J. Kittler

Centre for Vision, Speech and Signal Processing, School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, Guildford, UK

Abstract: The problem of classifier combination is considered in the context of the two main fusion scenarios: fusion of opinions based on identical and on distinct representations. We develop a theoretical framework for classifier combination for these two scenarios. For multiple experts using distinct representations we argue that many existing schemes such as the product rule, sum rule, min rule, max rule, majority voting, and weighted combination, can be considered as special cases of compound classification. We then consider the effect of classifier combination in the case of multiple experts using a shared representation, where the aim of fusion is to obtain a better estimate of the appropriate a posteriori class probabilities. We also show that the two theoretical frameworks can be used for devising fusion strategies when the individual experts use features some of which are shared and the remaining ones distinct. We show that in both cases (distinct and shared representations) the expert fusion involves the computation of a linear or nonlinear function of the a posteriori class probabilities estimated by the individual experts. Classifier combination can therefore be viewed as a multistage classification process whereby the a posteriori class probabilities generated by the individual classifiers are considered as features for a second stage classification scheme. Most importantly, when the linear or nonlinear combination functions are obtained by training, the distinctions between the two scenarios fade away, and one can view classifier fusion in a unified way.

Keywords: Compound decision theory; Multiple expert fusion; Pattern classification

1. INTRODUCTION

The problem of classifier combination has always been of interest to the pattern recognition community. Initially, the goal of classifier combination was to improve the efficiency of decision making by adopting multistage combination rules, whereby objects are classified by a simple classifier using a small set of inexpensive features in combination with a reject option. For the more difficult objects, more complex procedures, possibly based on additional, more costly features, are employed [1-4]. In other studies, successive classification stages gradually reduce the set of possible classes [5-8]. Multistage classifiers may also be used to stabilise the training of classifiers based on a small sample size, e.g. by the use of bootstrapping [9].

Received: 8 October 1997. Received in revised form: 6 January 1998. Accepted: 10 January 1998

More recently, it has been observed that the accuracy of pattern classification can also be improved by multiple expert fusion. In other words, the idea is not to rely on a single decision making scheme. Instead, several designs (experts) are used for decision making. By combining the opinions of the individual experts, a consensus decision is derived. Various classifier combination schemes have been devised, and it has been experimentally demonstrated that some of them consistently outperform a single best classifier.

An interesting issue in the research concerning classifier ensembles is the way they are combined. If only labels are available, a majority vote [7,10] or a label ranking [11,12] may be used. If continuous outputs like a posteriori probabilities are supplied, an average or some other linear combination has been suggested [13,14]. Whether this can be theoretically justified depends upon the nature of the input classifiers and the feature space. A review of these possibilities is presented in Hansen and Salamon [15]. If the classifier outputs are interpreted as fuzzy membership values, belief values or evidence, fuzzy rules [16,17], belief functions and Dempster-Shafer techniques [10,14,18,19] are used. Finally, it is possible to train the output classifier separately using the outputs of the input classifiers as new features [20,21]. Woods et al [22], on the other hand, take the view that different classifiers are competent to make decisions in different regions, and their approach involves partitioning the observation space into such regions. For a recent review of the literature see Kittler [23].

From the point of view of their analysis, there are basically two classifier combination scenarios. In the first scenario, all the classifiers use the same representation of the input pattern. In this case, each classifier, for a given input pattern, can be considered to produce an estimate of the same a posteriori class probability. In the second scenario, each classifier uses its own representation of the input pattern. In other words, the measurements extracted from the pattern are unique to each classifier. An important application of combining classifiers in this scenario is the possibility to integrate physically different types of measurements/features. In this case, it is no longer possible to consider the computed a posteriori probabilities to be estimates of the same functional value, as the classification systems operate in different measurement spaces.

In this paper, we develop a theoretical framework for classifier combination approaches for these two scenarios. For multiple experts using distinct representations, we argue that many existing schemes can be considered as special cases of compound classification, where all the representations are used jointly to make a decision. We note that under different assumptions and using different approximations, we can derive the commonly used classifier combination schemes such as the product rule, sum rule, min rule, max rule, majority voting and weighted combination schemes. We address the issue of the sensitivity of various combination rules to estimation errors, and point out that the techniques based on the benevolent sum-rule fusion are more resilient to errors than those derived from the severe product rule.

We then consider the effect of classifier combination in the case of multiple experts using a shared representation. We show that here the aim of fusion is to obtain a better estimate of the appropriate a posteriori class probabilities. This is achieved by means of reducing the estimation error variance. We also show that the two theoretical frameworks, for the case of distinct and shared representation respectively, can be used for devising fusion strategies when the individual experts use features some of which are shared and the remaining ones distinct.

We show that in both cases (distinct and shared representations) the expert fusion involves the computation of a linear or nonlinear function of the a posteriori class probabilities estimated by the individual experts. Classifier combination can therefore be viewed as a multistage classification process, whereby the a posteriori class probabilities generated by the individual classifiers are considered as features for a second stage classification scheme. Most importantly, when the linear or nonlinear combination functions are obtained by training, the distinctions between the two scenarios fade away, and one can view classifier fusion in a unified way. This probably explains the success of many heuristic combination strategies that have been suggested in the literature without any concerns about the underlying theory.

The paper is organised as follows. In Section 2 we discuss combination strategies for experts using independent (distinct) representations. In Section 3 we consider the effect of classifier combination for the case of shared (identical) representation. The findings of the two sections are discussed in Section 4. Finally, Section 5 offers a brief summary.

2. DISTINCT REPRESENTATIONS

It has been observed that classifier combination is particularly effective if the individual classifiers employ different features [12,14,24]. Consider a pattern recognition problem where pattern Z is to be assigned to one of the m possible classes {ω_1, ..., ω_m}. Let us assume that we have R classifiers, each representing the given pattern by a distinct measurement vector. Denote the measurement vector used by the i-th classifier by x_i. In the measurement space each class ω_k is modelled by the probability density function p(x_i|ω_k), and its a priori probability of occurrence is denoted P(ω_k). We shall consider the models to be mutually exclusive, which means that only one model can be associated with each pattern. Now, according to the Bayesian theory, given measurements x_i, i = 1, ..., R, the pattern Z should be assigned to class ω_j, i.e. its label θ should assume value θ = ω_j, provided the a posteriori probability of that interpretation is maximum, i.e.

assign θ → ω_j if

$$P(\theta = \omega_j \mid x_1, \ldots, x_R) = \max_{k=1}^{m} P(\theta = \omega_k \mid x_1, \ldots, x_R) \qquad (1)$$

Let us rewrite the a posteriori probability P(θ = ω_k | x_1, ..., x_R) using the Bayes theorem. We have


$$P(\theta = \omega_k \mid x_1, \ldots, x_R) = \frac{p(x_1, \ldots, x_R \mid \theta = \omega_k)\, P(\omega_k)}{p(x_1, \ldots, x_R)} \qquad (2)$$

where p(x_1, ..., x_R | θ = ω_k) is the conditional joint probability density of the measurements and p(x_1, ..., x_R) is the unconditional measurement joint probability density. Since the latter is class independent, in the following we can concentrate only on the numerator terms of Eq. (2). Let us assume that the measurements x_j, ∀j, are conditionally statistically independent. This assumption may seem to be rather strong, but as the classifiers use distinct representations, it will often be satisfied, especially if the representations are derived from completely different sensing modalities [25]. Under this assumption

$$p(x_1, \ldots, x_R \mid \theta = \omega_k) = \prod_{i=1}^{R} p(x_i \mid \theta = \omega_k) \qquad (3)$$

where p(x_i | θ = ω_k) is the measurement process model of the i-th representation. Substituting from Eq. (3) into Eq. (2) and eventually into Eq. (1), we obtain the decision rule

assign θ → ω_j if

$$P(\omega_j) \prod_{i=1}^{R} p(x_i \mid \theta = \omega_j) = \max_{k=1}^{m} P(\omega_k) \prod_{i=1}^{R} p(x_i \mid \theta = \omega_k) \qquad (4)$$

or, in terms of the a posteriori probabilities yielded by the respective classifiers,

assign θ → ω_j if

$$P^{-(R-1)}(\omega_j) \prod_{i=1}^{R} P(\theta = \omega_j \mid x_i)\, p(x_i) = \max_{k=1}^{m} P^{-(R-1)}(\omega_k) \prod_{i=1}^{R} P(\theta = \omega_k \mid x_i)\, p(x_i) \qquad (5)$$

The decision rule (5) quantifies the likelihood of a hypothesis by combining the a posteriori probabilities generated by the individual classifiers by means of a product rule. It is effectively a severe rule of fusing the classifier outputs, as it is sufficient for a single recognition engine to inhibit a particular interpretation by outputting a close to zero probability for it. We shall adopt the approach used in Kittler et al [26] to show that, under certain assumptions, this severe rule can be developed into a benevolent information fusion rule which has the form of a sum. Benevolent fusion rules are less affected by one particular expert than severe rules. Thus, even if the soft decision outputs of a few experts for a particular hypothesis are close to zero, the hypothesis may be accepted, provided it receives sufficient support from all the other experts. To develop such a benevolent rule, let us express the product of the a posteriori probabilities and mixture densities P(θ = ω_k | x_i) p(x_i) on the right-hand side of Eq. (5) as

$$P(\theta = \omega_k \mid x_i)\, p(x_i) = P(\theta = \omega_k)\, p_i\, (1 + \delta_{ki}) \qquad (6)$$

where p_i is a nominal reference value of the mixture density p(x_i). A suitable choice of p_i is, for instance, p_i = max_{x_i} p(x_i). Substituting Eq. (6) for the a posteriori probabilities in Eq. (5), we find

$$P^{-(R-1)}(\omega_k) \prod_{i=1}^{R} P(\theta = \omega_k \mid x_i)\, p(x_i) = P(\omega_k) \prod_{i=1}^{R} p_i \prod_{i=1}^{R} (1 + \delta_{ki}) \qquad (7)$$

If we expand the product and neglect any terms of second and higher order, we can approximate the right-hand side of Eq. (7) as

$$P(\omega_k) \prod_{i=1}^{R} p_i \prod_{i=1}^{R} (1 + \delta_{ki}) \approx P(\omega_k) \prod_{i=1}^{R} p_i + P(\omega_k) \prod_{i=1}^{R} p_i \sum_{i=1}^{R} \delta_{ki} \qquad (8)$$

Substituting Eqs (8) and (6) into Eq. (5) and eliminating ∏_{i=1}^{R} p_i, we obtain a sum decision rule

assign θ → ω_j if

$$(1 - R)\, P(\omega_j) + \sum_{i=1}^{R} \frac{P(\omega_j \mid x_i)\, p(x_i)}{p_i} = \max_{k=1}^{m} \left[ (1 - R)\, P(\omega_k) + \sum_{i=1}^{R} \frac{P(\omega_k \mid x_i)\, p(x_i)}{p_i} \right] \qquad (9)$$
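The first-order approximation behind Eq. (9) can be checked numerically. The sketch below uses hypothetical toy numbers and the simplification p(x_i) = p_i = 1, comparing the product-rule scores of Eq. (5) with the sum-rule scores of Eq. (9) for posteriors that deviate only slightly from the priors:

```python
# Toy check of the sum-rule approximation: m = 3 classes, R = 3 experts.
# All numbers are hypothetical; p(x_i) = p_i = 1 is assumed throughout.
priors = [0.5, 0.3, 0.2]
# Per-expert posteriors P(w_k | x_i), kept close to the priors so that
# |delta_ki| << 1 and the first-order expansion in Eq. (8) is accurate.
posteriors = [
    [0.55, 0.28, 0.17],  # expert 1
    [0.48, 0.33, 0.19],  # expert 2
    [0.52, 0.29, 0.19],  # expert 3
]
R, m = len(posteriors), len(priors)

def product_score(k):
    # Eq. (5): P^{-(R-1)}(w_k) * prod_i P(w_k | x_i)
    s = priors[k] ** (1 - R)
    for p in posteriors:
        s *= p[k]
    return s

def sum_score(k):
    # Eq. (9): (1 - R) P(w_k) + sum_i P(w_k | x_i)
    return (1 - R) * priors[k] + sum(p[k] for p in posteriors)

prod_scores = [product_score(k) for k in range(m)]
sum_scores = [sum_score(k) for k in range(m)]
# The two score vectors nearly coincide and rank the classes identically.
print(prod_scores)
print(sum_scores)
```

Here both rules select the first class, and each sum-rule score agrees with the corresponding product-rule score to within about 0.004, as the expansion predicts.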

This approximation will be valid provided that δ_{ki} satisfies |δ_{ki}| ≪ 1. It can easily be established that this condition will be satisfied if P(ω_k|x_i)p(x_i)/[p_i P(ω_k)] − 1 is small in the absolute value sense. Note that this condition will hold when the amount of information about the class identity of the object gained by observing x_i is small, and the observation is representative of the distribution of x_i, which means that p(x_i) will be close to the reference value p_i. However, whatever approximation error is introduced when the conditions do not hold, we shall see later that the adoption of the approximation has some other benefits which will justify even the introduction of relatively gross errors at this step.

Before proceeding any further, it may be pertinent to ask why we did not cancel out the unconditional probability density functions p(x_i) from the decision rule. The main reason is that this term conveys very useful information about the confidence of the classifier in the observation made. It is clear that a pattern representation for which the value of the probability density is very small for all the classes will be an outlier, and should not be classified by the respective classifier. By retaining this information, in the case of the product rule (5), we have the option of suppressing the effect of outliers on the decision making process by setting the a posteriori probabilities for all the classes to a constant, i.e.

$$\text{if } p(x_i) < \text{threshold} \ \text{ then } \ P(\omega_k \mid x_i) = \text{const.} \quad \forall k \qquad (10)$$

In contrast, the sum information fusion rule will automatically control the influence of such outliers on the final decision. In other words, the classifier combination rule in Eq. (9) is a weighted average rule, where the weights w(x_i) = p(x_i)/p_i reflect the confidence in the soft decision values computed by the individual classifiers. Thus, our decision rule (9) can be expressed as

assign θ → ω_j if

$$(1 - R)\, P(\omega_j) + \sum_{i=1}^{R} w(x_i)\, P(\omega_j \mid x_i) = \max_{k=1}^{m} \left[ (1 - R)\, P(\omega_k) + \sum_{i=1}^{R} w(x_i)\, P(\omega_k \mid x_i) \right] \qquad (11)$$

The main practical difficulty with the weighted average classifier combiner as specified in Eq. (11) is that not all classifiers will have the inner capability to output such information. For instance, it would not be provided by a multilayer perceptron and many other classification methods. We shall therefore limit our objectives somewhat, and identify weights w_i which reflect the relative confidence in the classifiers in expectation. This can be done easily by selecting weight values by means of minimising the empirical classification error count produced by the decision rule

assign θ → ω_j if

$$(1 - R)\, P(\omega_j) + \sum_{i=1}^{R} w_i\, P(\omega_j \mid x_i) = \max_{k=1}^{m} \left[ (1 - R)\, P(\omega_k) + \sum_{i=1}^{R} w_i\, P(\omega_k \mid x_i) \right] \qquad (12)$$

in which the data dependence of the weights has been suppressed. In other words, we find w_i, i = 1, ..., R, such that

$$e = \sum_{k=1}^{N} \eta(Z_k) \qquad (13)$$

is minimised, where Z_k, k = 1, ..., N, is the k-th training sample and η(Z_k) takes values

$$\eta(Z_k) = \begin{cases} 0 & \text{if } \vartheta_k = \theta_k \\ 1 & \text{otherwise} \end{cases} \qquad (14)$$

In Eq. (14), ϑ_k is the true class label of pattern Z_k and θ_k is the class label assigned to it by the decision rule (12). The optimisation can easily be achieved by an exhaustive search through the weight space. For equal a priori class probabilities, the decision rule (12) simplifies to

assign θ → ω_j if

$$\sum_{i=1}^{R} w_i\, P(\omega_j \mid x_i) = \max_{k=1}^{m} \sum_{i=1}^{R} w_i\, P(\omega_k \mid x_i) \qquad (15)$$
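The exhaustive search through the weight space suggested above can be sketched as follows. The training set, the coarse weight grid and the two-expert setup are all hypothetical; the sketch applies decision rule (15) (equal priors) and minimises the empirical error count of Eqs (13)-(14):

```python
import itertools

# Hypothetical training set: for each sample, the posteriors P(w_k|x_i)
# output by R = 2 experts (rows) over m = 2 classes (columns), plus the
# true class label. Expert 0 is deliberately more reliable than expert 1.
samples = [
    ([[0.9, 0.1], [0.4, 0.6]], 0),
    ([[0.8, 0.2], [0.3, 0.7]], 0),
    ([[0.2, 0.8], [0.6, 0.4]], 1),
    ([[0.3, 0.7], [0.2, 0.8]], 1),
    ([[0.6, 0.4], [0.2, 0.8]], 0),
]

def decide(posteriors, weights):
    # Decision rule (15), equal priors: argmax_k sum_i w_i P(w_k|x_i)
    scores = [sum(w * p[k] for w, p in zip(weights, posteriors))
              for k in range(2)]
    return max(range(2), key=lambda k: scores[k])

def error_count(weights):
    # Empirical error count e of Eqs (13)-(14)
    return sum(decide(p, weights) != label for p, label in samples)

# Exhaustive search over a coarse grid of weight values
grid = [i / 10 for i in range(11)]
best = min(itertools.product(grid, repeat=2), key=error_count)
print(best, error_count(best))
```

On this toy set the equal-weight rule misclassifies one sample, while weight vectors that favour the more reliable expert reach zero empirical error.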

2.1. Error Sensitivity

In practice, the individual experts will not output the true a posteriori probabilities P(ω_k|x_i), i = 1, ..., R, but instead their estimates P̂(ω_k|x_i), where

$$\hat P(\omega_k \mid x_i) = P(\omega_k \mid x_i) + \varepsilon_{ki} \qquad (16)$$

and ε_{ki} is the estimation error. Replacing the a posteriori class probabilities in decision rule (12) with their hatted counterparts, and substituting from Eq. (16), we have

assign θ → ω_j if

$$(1 - R)\, P(\omega_j) + \sum_{i=1}^{R} w_i \left[ P(\omega_j \mid x_i) + \varepsilon_{ji} \right] = \max_{k=1}^{m} \left\{ (1 - R)\, P(\omega_k) + \sum_{i=1}^{R} w_i \left[ P(\omega_k \mid x_i) + \varepsilon_{ki} \right] \right\} \qquad (17)$$

which can be rewritten as

assign θ → ω_j if

$$(1 - R)\, P(\omega_j) + \left[ \sum_{i=1}^{R} w_i\, P(\omega_j \mid x_i) \right] \left[ 1 + \frac{\sum_{i=1}^{R} w_i\, \varepsilon_{ji}}{\sum_{i=1}^{R} w_i\, P(\omega_j \mid x_i)} \right] = \max_{k=1}^{m} \left\{ (1 - R)\, P(\omega_k) + \left[ \sum_{i=1}^{R} w_i\, P(\omega_k \mid x_i) \right] \left[ 1 + \frac{\sum_{i=1}^{R} w_i\, \varepsilon_{ki}}{\sum_{i=1}^{R} w_i\, P(\omega_k \mid x_i)} \right] \right\} \qquad (18)$$

A comparison of Eqs (12) and (18) shows that each term in the error free classifier combination rule (12) is affected by the error factor

$$1 + \frac{\sum_{i=1}^{R} w_i\, \varepsilon_{ki}}{\sum_{i=1}^{R} w_i\, P(\omega_k \mid x_i)} \qquad (19)$$

Thus, in the weighted average rule the compounded effect of errors, which is computed as a sum, is scaled by the sum of the weighted a posteriori probabilities. A judicious choice of weights (by training) and the implied error averaging process will result in the dampening of the errors. Thus, the weighted sum decision rule can be expected to be resilient to estimation errors, and also to approximation errors that we may have inadvertently introduced in developing it. This contrasts with the inordinate sensitivity to errors exhibited by the product rule [26]. Although the product rule can be expected to perform better when no estimation errors are present, for large errors the superior performance of the sum rule has been confirmed experimentally [27,28]. It follows, therefore, that the weighted average classifier combination rule is not only a very simple and intuitive technique of improving the reliability of decision making based on different classifier opinions, but it is also remarkably robust. It can readily be shown that the decision rules (5) and (9) simplify to the following commonly used combination strategies:

Product Rule

assign θ → ω_j if

$$P^{-(R-1)}(\omega_j) \prod_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_{k=1}^{m} P^{-(R-1)}(\omega_k) \prod_{i=1}^{R} P(\theta = \omega_k \mid x_i) \qquad (20)$$

This rule follows directly from Eq. (5).

Sum Rule

assign θ → ω_j if

$$(1 - R)\, P(\omega_j) + \sum_{i=1}^{R} P(\omega_j \mid x_i) = \max_{k=1}^{m} \left[ (1 - R)\, P(\omega_k) + \sum_{i=1}^{R} P(\omega_k \mid x_i) \right] \qquad (21)$$

This rule follows from Eq. (9) under the assumption of equal weighting of the outputs of the respective experts, i.e. w(x_i) = 1, ∀i and ∀x_i.

Max Rule

assign θ → ω_j if

$$\max_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_{k=1}^{m} \max_{i=1}^{R} P(\theta = \omega_k \mid x_i) \qquad (22)$$

This rule approximates the sum rule in Eq. (21) under the assumption that all the classes are a priori equiprobable, and the sum will be dominated by the expert decision output which lends the maximum support to a particular hypothesis.

Min Rule

assign θ → ω_j if

$$\min_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_{k=1}^{m} \min_{i=1}^{R} P(\theta = \omega_k \mid x_i) \qquad (23)$$

This rule approximates the product rule (20) under the assumption that all the classes are a priori equiprobable, and the product will be dominated by the expert decision output which lends the minimum support to a particular hypothesis.

Majority Vote Rule

assign θ → ω_j if

$$\sum_{i=1}^{R} \Delta_{ji} = \max_{k=1}^{m} \sum_{i=1}^{R} \Delta_{ki} \qquad (24)$$

This rule is obtained from the sum rule in Eq. (21) under the assumption that all the classes are a priori equiprobable and the individual expert outputs P(θ = ω_k | x_i) are hardened into outputs Δ_{ki}, where Δ_{ki} = 1 if P(θ = ω_k | x_i) = max_{l=1}^{m} P(θ = ω_l | x_i), and zero otherwise.

As the combination strategies max rule and vote are related to the sum rule [26], they are less sensitive to estimation errors, and are therefore likely to perform better than the min rule, which can be derived from the product rule.
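A minimal sketch of the five fixed rules (20)-(24) applied to a single toy pattern (all posteriors hypothetical):

```python
import math

# The fixed rules (20)-(24) on one toy pattern. P[i][k] = P(theta=w_k|x_i)
# for R = 3 experts and m = 3 classes; the numbers are hypothetical and
# equal priors are assumed, so the prior factors in rules (20)-(21) are
# common constants that do not affect the argmax.
P = [
    [0.60, 0.30, 0.10],
    [0.02, 0.28, 0.70],  # this expert all but vetoes class 0
    [0.50, 0.40, 0.10],
]
m = len(P[0])

def col(k):
    return [row[k] for row in P]

product_rule = max(range(m), key=lambda k: math.prod(col(k)))  # Eq. (20)
sum_rule = max(range(m), key=lambda k: sum(col(k)))            # Eq. (21)
max_rule = max(range(m), key=lambda k: max(col(k)))            # Eq. (22)
min_rule = max(range(m), key=lambda k: min(col(k)))            # Eq. (23)
# Eq. (24): harden each expert's soft output into a single vote
votes = [max(range(m), key=lambda k: row[k]) for row in P]
majority_vote = max(range(m), key=votes.count)

print(product_rule, sum_rule, max_rule, min_rule, majority_vote)
```

The example illustrates the severity distinction drawn above: the single near-zero output vetoes class 0 under the product and min rules, which pick class 1, while the benevolent sum and vote rules still select class 0; the max rule follows the strongest single opinion and picks class 2.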

3. IDENTICAL REPRESENTATIONS

In many situations we wish to combine the results of multiple classifiers which use an identical representation for the input pattern x. A typical example of this situation is a battery of k-NN classifiers which employ different numbers of nearest neighbours to reach a decision. Alternatively, neural network classifiers trained with different initialisations or different training sets [21,29,30] also fall into this category. The combination of ensembles of neural networks has been studied elsewhere [13,15-18,20].

By means of classifier combination, one is able to obtain a better estimate of the a posteriori class probabilities, and in consequence, a reduced classification error. A typical estimator is the averaging estimator

$$\hat P(\omega_i \mid x) = \frac{1}{N} \sum_{j=1}^{N} \hat P_j(\omega_i \mid x) \qquad (25)$$

where P̂_j(ω_i|x) is the a posteriori class probability estimate given pattern x, delivered by the j-th estimator, and P̂(ω_i|x) is the combined estimate based on N observations. Assuming that the errors ε_j(ω_i|x) between the true class a posteriori probabilities P(ω_i|x) and their estimates are unbiased, i.e.

$$E\{\varepsilon_j(\omega_i \mid x)\} = E\{\hat P_j(\omega_i \mid x) - P(\omega_i \mid x)\} = 0 \quad \forall i, j, x \qquad (26)$$


the combined estimate P̂(ω_i|x) will be an unbiased estimate of P(ω_i|x). Suppose the standard deviations σ_j(ω_i|x), ∀i,j, of the errors ε_j(ω_i|x) are equal, i.e.

$$\sigma_j(\omega_i \mid x) = \sigma(x) \quad \forall i, j \qquad (27)$$

Then, provided the errors ε_j(ω_i|x) are independent, the variance σ̂²(x) of the error distribution for the combined estimate will be

$$\hat\sigma^2(x) = \frac{\sigma^2(x)}{N} \qquad (28)$$

Now, if the standard deviations σ_j(ω_i|x) of the errors are not identical, then the combined estimate should take that into account by weighting more the contributions of the estimates associated with a lower variance, i.e.

$$\hat P(\omega_i \mid x) = \frac{\displaystyle\sum_{j=1}^{N} \hat P_j(\omega_i \mid x) \,/\, \sigma_j^2(\omega_i \mid x)}{\displaystyle\sum_{j=1}^{N} 1 \,/\, \sigma_j^2(\omega_i \mid x)} \qquad (29)$$

Provided the errors are unbiased and independent, the combined estimate in Eq. (29) will also be unbiased, and its variance σ̂²(ω_i|x) will be

$$\hat\sigma^2(\omega_i \mid x) = \frac{1}{\displaystyle\sum_{j=1}^{N} 1 \,/\, \sigma_j^2(\omega_i \mid x)} \qquad (30)$$

From Eq. (30), it can be seen that the variance of the error distribution of the combined estimator will be dominated by the low variance terms. The weighted estimator (29) represents a general case which may be written as

$$\hat P(\omega_i \mid x) = \sum_{j=1}^{N} w_{ij}(x)\, \hat P_j(\omega_i \mid x) \qquad (31)$$

with the weights w_{ij}(x) satisfying

$$\sum_{j=1}^{N} w_{ij}(x) = 1 \qquad (32)$$
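The variance reduction behind Eqs (25)-(30) can be illustrated by a small Monte Carlo sketch (all numbers hypothetical): N = 3 unbiased Gaussian estimators with unequal error standard deviations are combined by the plain average of Eq. (25) and by the inverse-variance weighting of Eq. (29):

```python
import random
import statistics

# Monte Carlo sketch of Eqs (25) and (29); all numbers are hypothetical.
random.seed(0)
true_p = 0.7                   # true a posteriori probability at a fixed x
sigmas = [0.05, 0.05, 0.20]    # unequal error std devs of N = 3 estimators
inv_var = [1.0 / s ** 2 for s in sigmas]
trials = 20000

plain_err, weighted_err = [], []
for _ in range(trials):
    # Unbiased Gaussian estimation errors, as assumed in Eq. (26)
    est = [true_p + random.gauss(0.0, s) for s in sigmas]
    plain = sum(est) / len(est)                                         # Eq. (25)
    weighted = sum(w * e for w, e in zip(inv_var, est)) / sum(inv_var)  # Eq. (29)
    plain_err.append(plain - true_p)
    weighted_err.append(weighted - true_p)

# Theory: the plain average has error std sqrt(sum of sigma^2)/3 ~ 0.071,
# while Eq. (30) predicts sqrt(1 / sum(inv_var)) ~ 0.035 for the
# inverse-variance combination.
print(statistics.pstdev(plain_err), statistics.pstdev(weighted_err))
```

As Eq. (30) predicts, down-weighting the high-variance estimator roughly halves the error standard deviation of the combined estimate relative to plain averaging.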

It will assume a specific form in particular circumstances. For instance, if the properties of the individual estimators are class independent, the weights will satisfy

$$w_{ij}(x) = w_j(x) \qquad (33)$$

If, in addition, the variances of the error distributions of the individual estimators σ_j²(ω_i|x) are independent of the position in the pattern space, the weights will satisfy

$$w_{ij}(x) = w_j \qquad (34)$$

It also subsumes the case when the variances are all identical, with

$$w_{ij}(x) = \frac{1}{N} \qquad (35)$$

Recall that when the respective variances of the individual estimators are known, the weights can be determined using the formula

$$w_{ij}(x) = \frac{1 \,/\, \sigma_j^2(\omega_i \mid x)}{\displaystyle\sum_{l=1}^{N} 1 \,/\, \sigma_l^2(\omega_i \mid x)} \qquad (36)$$

If this information is not available, it may be possible to estimate the appropriate weights so that the classification error obtained with the estimator in Eq. (31) is minimised. To adopt this approach, it will be necessary to have another independent set of training data.

Note that the estimator (31) is defined as a linear combination of the individual estimates. This immediately suggests that it may be possible to obtain an even better combined estimate of the class a posteriori probabilities by means of a nonlinear combination function

$$\hat P(\omega_i \mid x) = F\big(\hat P_1(\omega_i \mid x), \ldots, \hat P_N(\omega_i \mid x)\big) \qquad (37)$$

In fact, estimators which aim to enhance their resilience to outliers by adopting a rank order statistic such as the median,

$$\hat P(\omega_i \mid x) = \operatorname{med}_{j=1}^{N} \hat P_j(\omega_i \mid x) \qquad (38)$$

fall into this category. Such nonlinear estimators do not require any additional training. However, if sufficient additional training data is available, a suitable nonlinear function may be found by means of general function approximation (i.e. neural network methodology), or by other design alternatives. The effective local variance of the resulting estimator could be estimated from the input variances by function linearisation techniques.

To investigate the effect of classifier combination, let us examine the distribution of the a posteriori probabilities at a single point x. Suppose the a posteriori probability of class ω_s is maximum, i.e. P(ω_s|x) = max_{i=1}^{m} P(ω_i|x), giving the local Bayes error e_B = 1 − max_{i=1}^{m} P(ω_i|x). However, our classifiers only estimate these a posteriori class probabilities, and the associated estimation errors may result in suboptimal decisions and consequently in an additional classification error. To quantify this additional error, we have to establish what the probability is for the recognition system to make a labelling error. This situation will occur when any of the a posteriori class probability estimates for a


class other than ω_s becomes maximum over all the classes. Let us derive the probability of this event occurring for class ω_i, i.e. when

$$\hat P(\omega_i \mid x) - \hat P(\omega_j \mid x) > 0 \quad \forall j \neq i \qquad (39)$$

Note that the left-hand side of Eq. (39) can be expressed as

$$P(\omega_i \mid x) - P(\omega_j \mid x) + \varepsilon(\omega_i \mid x) - \varepsilon(\omega_j \mid x) > 0 \qquad (40)$$

where ε(ω_k|x), k = i, j, is the error of the combined estimate. Equation (40) defines a constraint for the two estimation errors as

$$\varepsilon(\omega_i \mid x) - \varepsilon(\omega_j \mid x) > P(\omega_j \mid x) - P(\omega_i \mid x) \qquad (41)$$

Now, on the left-hand side of Eq. (41) we have two identically distributed random variables. Let us assume that the distributions are Gaussian. In practice this will approximate the true distribution of estimation errors very coarsely, as both ends of the [0,1] interval from which the a posteriori class probabilities can assume values will clip the errors. Nevertheless, the analysis under even such a simplistic assumption will give an indication of the benefits of classifier combination.

Since the error distributions are Gaussian, the distribution of the difference of the two random variables will also be Gaussian, with a twice as large variance. The probability of constraint (41) being satisfied is given by the area under the Gaussian tail with a cut-off point at P(ω_j|x) − P(ω_i|x). More specifically, this probability, which we shall denote Q_{ij}(ΔP_{ji}(x)), is given by

$$Q_{ij}(\Delta P_{ji}(x)) = \frac{1}{2} \left[ 1 - \operatorname{erf}\left( \frac{\Delta P_{ji}(x)}{2 \hat\sigma} \right) \right] \qquad (42)$$

where ΔP_{ji}(x) = P(ω_j|x) − P(ω_i|x), σ̂² is the error variance of the combined estimate (the expression covers both signs of ΔP_{ji}(x)), and erf is the error function, defined as

$$\operatorname{erf}(u) = \frac{2}{\sqrt{\pi}} \int_0^{u} \exp(-\tau^2)\, d\tau \qquad (43)$$

Now, the event in Eq. (39) will occur with probability

$$Q_i(x) = \prod_{\substack{j=1 \\ j \neq i}}^{m} Q_{ij}(\Delta P_{ji}(x)) \qquad (44)$$

Hence, the pattern x will be misclassified with probability

$$Q(x) = \sum_{\substack{i=1 \\ i \neq s}}^{m} Q_i(x) \qquad (45)$$
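Under the Gaussian assumption, Eqs (42), (44) and (45), together with the improvement factor derived below in this section, can be evaluated directly with the standard error function. The sketch uses hypothetical posteriors and error variance:

```python
import math

def Q(delta_p, sigma):
    # Eq. (42): probability that the difference of two estimation errors
    # (Gaussian, variance 2*sigma^2) exceeds the margin delta_p
    return 0.5 * (1.0 - math.erf(delta_p / (2.0 * sigma)))

# Toy point x (hypothetical values): m = 3 classes, class 0 is the Bayes
# label omega_s, class 1 the close runner-up; sigma is the error std of
# the combined posterior estimate.
post = [0.50, 0.40, 0.10]
sigma = 0.05
m, s = len(post), 0

def Q_i(i):
    # Eq. (44): probability that class i overtakes every other class
    return math.prod(Q(post[j] - post[i], sigma)
                     for j in range(m) if j != i)

# Eq. (45): additional (over and above Bayes) pointwise error probability
Q_total = sum(Q_i(i) for i in range(m) if i != s)

# Combining N experts divides the variance by N (sigma by sqrt(N)); the
# resulting error-reduction ratio falls much faster than 1/N.
N = 4
reduction = (Q(post[s] - post[1], sigma / math.sqrt(N))
             / Q(post[s] - post[1], sigma))
print(Q_total, reduction)
```

Numerically, Q(x) is almost entirely contributed by the runner-up class, and with N = 4 experts the pointwise error shrinks by a factor of about 0.03, far below the 1/4 variance reduction.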

In fact, the additional error probability Q(x) will be dominated by the second most probable class, which will be involved in defining the decision boundary. This can be observed by considering all the classes with very low a posteriori probabilities. For those, the probability Q_j(x) will be brought to zero by the term Q_{js}(ΔP_{sj}(x)), which will be extremely small because of the large difference ΔP_{sj}(x). Only the class ω_k whose a posteriori probability is comparable to P(ω_s|x) will contribute a non-negligible probability value, because of its small ΔP_{sk}(x) and the negative ΔP_{jk}(x) with respect to all the other classes ω_j, ∀j ≠ k, s, which will produce multiplicative factors Q_{kj}(ΔP_{jk}(x)) close to unity. Hence, Q(x) will effectively be determined by Q_{ks}(ΔP_{sk}(x)). The average additional (over and above the Bayes error) misclassification error will then be

$$\bar e = \int Q(x)\, p(x)\, dx \qquad (46)$$

Recalling Eq. (42), each probability Q_{ij}(ΔP_{ji}(x)) in Eq. (44) depends heavily upon the variance of the error of the a posteriori class probability estimate. With the number of multiple experts increasing, the estimate variance goes down by a factor of N. However, the probability of the additional error goes down much more dramatically. In comparison with a single expert (N = 1), the probability of the pointwise error, assuming that only P(ω_s|x) and P(ω_k|x) are comparable, will be reduced by the factor

$$\frac{1 - \operatorname{erf}\left( \sqrt{N}\, \Delta P_{sk}(x) / 2\sigma \right)}{1 - \operatorname{erf}\left( \Delta P_{sk}(x) / 2\sigma \right)} \qquad (47)$$

Note that these improvements are achieved only near the decision boundaries, as far from the boundaries the probability of a pattern x being misclassified is negligible. Thus these impressive improvements will be diluted by the averaging process in Eq. (46), where over extensive regions the local probability of additional error will effectively be zero, because of the large difference between the maximum class a posteriori probability and all the others.

For discriminant function classifiers, the benefit of combining multiple experts using an identical representation has been investigated by Tumer and Ghosh [31,32]. They showed that the classification error will be reduced as a result of the effective discriminant function of the combiner being closer to the Bayesian decision boundary. An earlier study of the effect of combining multiple experts which base their decisions on their estimates of the class a posteriori probabilities can be found elsewhere [33,34].

A linear combiner of classifier outputs has been applied to the problem of combining evidence in an automatic personal identity verification system [25]. The system fuses multiple instances of biometric data to improve performance. In this application, a single classifier computes a posteriori class probabilities for several instances of input data over a short period of time, which are then combined. For this reason, an equal weight combination was appropriate. A combination strategy involving unequal weights has been used [35] to fuse the a posteriori class probabilities of several classifiers employed in the detection of microcalcifications in mammographic images. The weights were estimated by training. The combination of classifiers which produce statistically dependent outputs is discussed in Bishop [33]. The approach also leads to a linear combination, where the weights reflect the correlations between individual expert outputs.

4. DISCUSSION

In practical situations, one is also likely to face a problem where a part of the representation used by the respective experts is shared and a part is distinct. Let us assume that the components of each pattern vector x~ can be divided into two groups, forming vectors y and ~i, i.e. xi = [y-r,~T]T, where the vector of measurements y is shared by all of the R classifiers, whereas ~ is specific to the i-th classifier. We shall assume that given a class identity, the classifier specific part of the pattern representation ~ is conditionally independent from { j # i. Let us now return to the joint probability density

p(X 1 .... ,XRI0• ~ok) in Eq. (3), and express it as

p(xi .... ,xRtO = ~,~) = p(~l,...,~ly, O

  • - ~ok)p(ylo = o~k)

(48)

Recalling our assumption that the classifier specific representations se/ i= 1 .... ,R are conditionally statisti- cally independent, we can write

p(x~ .... ,xR[O = oJk) = [II~<p(~ly, O = ~o~)]

p(y[O = ~o~) (49) which, assuming that the shared measurements are conditionally independent from the classifier specific

  • nes can be expressed as

I P(O = ~okJy,~,)p(y,~:i)]

P(~okly)P(y) P(~ok) (50)

and finally,

P(o_ <x,>tx l]

p(x~ .... ,xRlO = ~ok) = N~=~ P(~klY)P(Y) ] P(~okly)P(Y) P(~ok) (51) In Eq. (51), P(~oklY) is the k-th class probability based on the shared freatures, and p(y) is the corre- sponding mixture measurement density. We thus

  • btain the decision rule

assign o --, oj if

IH~=t P(O=p(~o ~~ y, J P(O= ~oj[y)p(y) = maxk=l H~=~ P(0 = (xi) P(O = ~o~ly)p(y)(52) in which p(y) in the denominator was cancelled out

  • n the grounds that the numerator term p(y) serves

as an outlier indicator adequately. The rule combines the individual classifier outputs in terms of a product. Each factor in the product for class ~ok is normalised by the a posteriori probability of the class given the shared representation. A linearisation of the product in Eq. (52) using the methodology introduced in Section 2 yields the corresponding weighted sum rule [35]

assign $\theta \to \omega_j$ if

$$w_y\, P(\theta=\omega_j \mid y) + \sum_{i=1}^{R} w_i\, P(\theta=\omega_j \mid x_i) = \max_{k=1}^{m} \Big[ w_y\, P(\theta=\omega_k \mid y) + \sum_{i=1}^{R} w_i\, P(\theta=\omega_k \mid x_i) \Big] \tag{53}$$
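As a concrete illustration, the combined rules (52) and (53) can be sketched in a few lines; the posterior values below are hypothetical, and the weights in the sum rule are simply set to one rather than tuned as in [35].

```python
import numpy as np

def combined_product_rule(post_xi, post_y):
    # Eq. (52): product over classifiers of P(omega_k | x_i) / P(omega_k | y),
    # multiplied back by P(omega_k | y); the class-independent p(y) is dropped.
    # post_xi: (R, m) array of P(omega_k | x_i); post_y: (m,) array of P(omega_k | y)
    scores = post_y * np.prod(post_xi / post_y, axis=0)
    return int(np.argmax(scores))

def combined_sum_rule(post_xi, post_y, w_y, w):
    # Eq. (53): the linearised, weighted-sum counterpart of Eq. (52).
    scores = w_y * post_y + np.sum(w[:, None] * post_xi, axis=0)
    return int(np.argmax(scores))

# Hypothetical posteriors: R = 2 classifiers, m = 3 classes
post_xi = np.array([[0.6, 0.3, 0.1],
                    [0.5, 0.4, 0.1]])
post_y = np.array([0.4, 0.4, 0.2])  # estimate based on the shared features y
print(combined_product_rule(post_xi, post_y))               # -> 0
print(combined_sum_rule(post_xi, post_y, 1.0, np.ones(2)))  # -> 0
```

Both rules agree on this example; they differ in their sensitivity to estimation errors, the product rule being the more severe of the two.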

Note that the classifier combination rules (52) and (53) are expressed in terms of the a posteriori class probabilities returned by the individual classifiers using mixed representations, and the a posteriori class probability based on the shared representation. Each classifier provides an independent estimate of the latter. It is therefore sensible to average these values to obtain a more reliable estimate, as discussed in Section 3. This problem has been considered by Kittler et al [36], and the combination strategies developed have


been applied to the problem of automatic detection of microcalcifications in digital mammograms.
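The factorisation in Eqs. (49)–(51) that underpins these rules can be checked numerically on a small discrete model; the construction below is purely illustrative and not part of the original derivation.

```python
import itertools
import math
import random

random.seed(0)
R = 2  # two classifiers sharing y; classifier i also sees its own feature xi_i

def norm(v):
    s = sum(v)
    return [x / s for x in v]

# Random discrete model over binary variables: P(omega_k), p(y | omega_k), and
# p(xi_i | y, omega_k) -- the xi_i are conditionally independent given (y, class).
P_k  = norm([random.random() for _ in range(2)])
p_yk = {k: norm([random.random() for _ in range(2)]) for k in range(2)}
p_xi = {(i, y, k): norm([random.random() for _ in range(2)])
        for i in range(R) for y in range(2) for k in range(2)}

def joint(y, xis, k):
    # Eq. (49): p(x_1,...,x_R | omega_k) = [prod_i p(xi_i | y, omega_k)] p(y | omega_k)
    out = p_yk[k][y]
    for i, xi in enumerate(xis):
        out *= p_xi[(i, y, k)][xi]
    return out

def p_y(y):
    return sum(p_yk[k][y] * P_k[k] for k in range(2))

def post_shared(k, y):  # P(omega_k | y)
    return p_yk[k][y] * P_k[k] / p_y(y)

def p_x(i, y, xi):      # p(x_i) with x_i = (y, xi_i)
    return sum(p_xi[(i, y, k)][xi] * p_yk[k][y] * P_k[k] for k in range(2))

def post_full(i, k, y, xi):  # P(omega_k | x_i)
    return p_xi[(i, y, k)][xi] * p_yk[k][y] * P_k[k] / p_x(i, y, xi)

# Verify Eq. (51) for every configuration of (y, xi_1, xi_2, omega_k).
for y, x1, x2, k in itertools.product(range(2), repeat=4):
    lhs = joint(y, (x1, x2), k)
    rhs = post_shared(k, y) * p_y(y) / P_k[k]
    for i, xi in enumerate((x1, x2)):
        rhs *= post_full(i, k, y, xi) * p_x(i, y, xi) / (post_shared(k, y) * p_y(y))
    assert math.isclose(lhs, rhs)
print("Eq. (51) holds on the random discrete model")
```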

The combination strategies discussed in Sections 2 and 3 can be viewed as a multistage process, whereby the input data is used to compute the relevant a posteriori class probabilities which, in turn, are used as features in the next processing stage. The problem is then to find class-separating surfaces in this new feature space. The sum rule and the averaging estimator, and their weighted versions, then implement linear separating boundaries in this space. The other combination strategies implement nonlinear boundaries. The idea can then be extended further, and the problem of combination posed as one of training the second stage using these probabilities so as to minimise the recognition error. This is the approach adopted by various multistage combination strategies, as exemplified by the behaviour knowledge space method of Huang and Suen [37] and the techniques in [20,21]. In the behaviour knowledge space method, the space of the classifier outputs is tessellated into small bins, and the computed a posteriori class probabilities are used as indices to address these bins. The training data is mapped into these cells via the a posteriori class probabilities, and their true class labels stored. A pattern of unknown class membership is then classified by indexing into one of the bins and identifying the class which receives the majority vote.

When linear or nonlinear combination functions are acquired by means of training, there is very little distinction between the two basic scenarios. Moreover, such solutions are able to handle the fusion of measurements which are not conditionally statistically independent. Consequently, it is possible to view classifier combination in a unified way. This probably explains the successes achieved with heuristic combination schemes derived without any serious concerns about their theoretical legitimacy.
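A minimal sketch of the behaviour knowledge space idea, assuming a toy interface in which each classifier reports a posterior for one class; the original method of Huang and Suen [37] indexes cells by crisp class labels rather than quantised probabilities.

```python
from collections import Counter, defaultdict

def bks_fit(train_probs, train_labels, n_bins=4):
    # Tessellate the classifier-output space: quantise each reported
    # posterior into n_bins intervals and count true labels per cell.
    table = defaultdict(Counter)
    for probs, label in zip(train_probs, train_labels):
        cell = tuple(min(int(p * n_bins), n_bins - 1) for p in probs)
        table[cell][label] += 1
    return table

def bks_predict(table, probs, n_bins=4, default=0):
    # Index into the cell addressed by the classifier outputs; the majority
    # vote of the training labels stored there decides the class.
    cell = tuple(min(int(p * n_bins), n_bins - 1) for p in probs)
    votes = table.get(cell)
    return votes.most_common(1)[0][0] if votes else default

# Toy data: two classifiers, each reporting P(class 1 | x)
train_probs  = [(0.9, 0.8), (0.85, 0.9), (0.2, 0.1), (0.1, 0.3)]
train_labels = [1, 1, 0, 0]
table = bks_fit(train_probs, train_labels)
print(bks_predict(table, (0.88, 0.82)))  # -> 1
```

The quality of such a table depends heavily on the amount of training data, since every cell must be populated well enough for its majority vote to be reliable.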

5. CONCLUSIONS

The problem of combining classifiers was considered. Recent developments in the methodology of multiple expert fusion were reviewed. The review was organised according to the two main fusion scenarios: fusion of opinions based on identical, and on distinct, representations. A theoretical framework for classifier combination approaches for these two scenarios was then developed. For multiple experts using distinct representations, we argued that many existing schemes could be considered as special cases of compound classification, where all the representations are used jointly to make a decision. Under different assumptions and using different approximations, we derived the commonly used classifier combination schemes such as the product rule, sum rule, min rule, max rule, median rule, majority voting, and weighted combination schemes. We addressed the issue of the sensitivity of various combination rules to estimation errors, and pointed out that the techniques based on the benevolent sum-rule fusion are more resilient to errors than those derived from the severe product rule.

We then considered the effect of classifier combination in the case of multiple experts using a shared representation. We showed that here the aim of fusion was to obtain a better estimate of the appropriate a posteriori class probabilities. This can be achieved by means of estimation-error variance reduction. We also showed that the two theoretical frameworks, for the case of distinct and shared representations respectively, could also be used for devising fusion strategies when the individual experts use features some of which are shared and the remaining ones distinct. We showed that in both cases (distinct and shared representations), the expert fusion involves the computation of a linear or nonlinear function of the a posteriori class probabilities estimated by the individual experts. Classifier combination can therefore be viewed as a multistage classification process, whereby the a posteriori class probabilities generated by the individual classifiers are considered as features for a second-stage classification scheme. Most importantly, when the linear or nonlinear combination functions are obtained by training, the distinctions between the two scenarios fade away, and one can view classifier fusion in a unified way. This probably explains the success of many heuristic combination strategies that have been suggested in the literature without any concerns about the underlying theory.

Acknowledgements

This work was supported by the Engineering and Physical Sciences Research Council Grant GR/J89255.

References

1. Pudil P, Novovicova J, Blaha S, Kittler J. Multistage pattern recognition with reject option. Proceedings 11th IAPR International Conference on Pattern Recognition, Volume II, Conference B: Pattern Recognition Methodology and Systems 1992; 92-95
2. El-Shishini H, Abdel-Mottaleb MS, El-Raey M, Shoukry A. A multistage algorithm for fast classification of patterns. Pattern Recognition Letters 1989; 10(4): 211-215
3. Zhou JY, Pavlidis T. Discrimination of characters by a multistage recognition process. Pattern Recognition 1994; 27(11): 1539-1549


4. Kurzynski MW. On the identity of optimal strategies for multistage classifiers. Pattern Recognition Letters 1989; 10(1): 36-46
5. Fairhurst MC, Abdel Wahab HMS. An interactive two-level architecture for a memory network pattern classifier. Pattern Recognition Letters 1990; 11(8): 537-540
6. Denisov DA, Dudkin AK. Model-based chromosome recognition via hypotheses construction/verification. Pattern Recognition Letters 1994; 15(2): 299-307
7. Kimura F, Shridhar M. Handwritten numerical recognition based on multiple algorithms. Pattern Recognition 1991; 24(10): 969-983
8. Tung CH, Lee HJ, Tsai JY. Multi-stage pre-candidate selection in handwritten Chinese character recognition systems. Pattern Recognition 1994; 27(8): 1093-1102
9. Skurichina M, Duin RPW. Stabilizing classifiers for very small sample sizes. Proceedings 13th IAPR International Conference on Pattern Recognition, Vienna, 1996

10. Franke J, Mandler E. A comparison of two approaches for combining the votes of cooperating classifiers. Proceedings 11th IAPR International Conference on Pattern Recognition, Volume II, Conference B: Pattern Recognition Methodology and Systems, 1992; 611-614
11. Bagui SC, Pal NR. A multistage generalization of the rank nearest neighbor classification rule. Pattern Recognition Letters 1995; 16(6): 601-614
12. Ho TK, Hull JJ, Srihari SN. Decision combination in multiple classifier systems. IEEE Transactions PAMI 1994; 16(1): 66-75
13. Hashem S, Schmeiser B. Improving model accuracy using optimal linear combinations of trained neural networks. IEEE Transactions Neural Networks 1995; 6(3): 792-794
14. Xu L, Krzyzak A, Suen CY. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions SMC 1992; 22(3): 418-435
15. Hansen LK, Salamon P. Neural network ensembles. IEEE Transactions PAMI 1990; 12(10): 993-1001
16. Cho SB, Kim JH. Combining multiple neural networks by fuzzy integral for robust classification. IEEE Transactions Systems, Man and Cybernetics 1995; 25(2): 380-384
17. Cho SB, Kim JH. Multiple network fusion using fuzzy logic. IEEE Transactions Neural Networks 1995; 6(2): 497-501

18. Rogova G. Combining the results of several neural network classifiers. Neural Networks 1994; 7(5): 777-781
19. Tresp V, Taniguchi M. Combining estimators using non-constant weighting functions. In Advances in Neural Information Processing Systems 7, Tesauro G, Touretzky DS, Leen TK (eds). MIT Press, 1995
20. Krogh A, Vedelsby J. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems 7, Tesauro G, Touretzky DS, Leen TK (eds). MIT Press, 1995
21. Wolpert DH. Stacked generalization. Neural Networks 1992; 5(2): 241-260
22. Woods KS, Bowyer K, Kegelmeyer WP. Combination of multiple classifiers using local accuracy estimates. Proceedings CVPR96 1996; 391-396

23. Kittler J. Improving recognition rates by classifier combination: A review. Proceedings IAPR 1st Int Workshop on Statistical Techniques in Pattern Recognition, Prague, 1997; 205-210
24. Ali KM, Pazzani MJ. On the link between error correlation and error reduction in decision tree ensembles. Technical Report 95-38, ICS-UCI, 1995
25. Kittler J, Matas J, Jonsson K, Ramos Sánchez MV. Combining evidence in personal identity verification systems. Pattern Recognition Letters 1997; 18: 845-852
26. Kittler J, Hatef M, Duin RPW. Combining classifiers. Proc 13th Int Conf Pattern Recognition, Volume II, Track B, Vienna, 1996; 897-901
27. Tax DMJ, Duin RPW, van Breukelen M. Comparison between product and mean classifier combination rules. Proceedings IAPR 1st Int Workshop on Statistical Techniques in Pattern Recognition, Prague, 1997; 165-170
28. Tax DMJ, Duin RPW, van Breukelen M, Kittler J. Combining multiple classifiers by averaging or multiplying. Machine Learning (submitted)
29. Ho TK. Random decision forests. Third International Conference on Document Analysis and Recognition, Montreal, Canada, August 14-16 1995; 278-282

30. Cao J, Ahmadi M, Shridhar M. Recognition of handwritten numerals with multiple feature and multistage classifier. Pattern Recognition 1995; 28(2): 153-160
31. Tumer K, Ghosh J. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 1996; 29: 341-348
32. Tumer K, Ghosh J. Classifier combining: Analytical results and implications. Proceedings of the National Conference on Artificial Intelligence, Portland, OR, 1996
33. Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, 1995
34. Kittler J. Improving recognition rates by classifier combination: A theoretical framework. In Progress in Handwriting Recognition, Downton AC, Impedovo S (eds). World Scientific, 1997; 231-247
35. Kittler J, Hojjatoleslami A, Windeatt T. Weighting factors in multiple expert fusion. Proceedings of the British Machine Vision Conference, Colchester, UK, 1997; 41-50
36. Kittler J, Hojjatoleslami A, Windeatt T. Strategies for combining classifiers employing shared and distinct pattern representations. Pattern Recognition Letters 1997 (to appear)

37. Huang YS, Suen CY. Combination of multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions PAMI 1995; 17: 90-94

Josef Kittler graduated from the University of Cambridge in Electrical Engineering in 1971, where he also obtained his PhD in Pattern Recognition in 1974 and the ScD degree in 1991. He joined the Department of Electronic and Electrical Engineering of Surrey University in 1986, where he is a Professor in charge of the Centre for Vision, Speech and Signal Processing. He has worked on various theoretical aspects of pattern recognition and on many applications including automatic inspection, ECG diagnosis, remote sensing, robotics, speech recognition, and document processing. His current research interests include pattern recognition, image processing and computer vision. He has co-authored a book with the title Pattern Recognition: A Statistical Approach, published by Prentice-Hall. He has published more than 300 papers. He is a member of the editorial boards of Pattern Recognition Journal, Image and Vision Computing, Pattern Recognition Letters, Pattern Recognition and Artificial Intelligence, and Machine Vision and Applications.

Correspondence and offprint requests to: J. Kittler, Centre for Vision, Speech and Signal Processing, School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, Guildford GU2 5XH, UK. Email: J.Kittler@ee.surrey.ac.uk