 
              Latent Classification Models Classification in continuous domains Helge Langseth and Thomas D. Nielsen LCM – p.1/ ??
Outline
➤ Probabilistic classifiers
➤ Naïve Bayes classifiers
  • Relaxing the assumptions
➤ Latent classification models (LCMs)
  • A linear generative model
  • A non-linear generative model
➤ Learning LCMs
  • Structural learning
  • Parameter learning
➤ Experimental results
Classification in a probabilistic framework

A Bayes optimal classifier will classify an instance $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ to the class:

$$y^* = \arg\min_{y \in \mathrm{sp}(Y)} \sum_{y' \in \mathrm{sp}(Y)} L(y, y')\, P(Y = y' \mid \mathbf{X} = \mathbf{x}),$$

where $L(\cdot, \cdot)$ is the loss function.

When learning a probabilistic classifier, the task is to learn the probability distribution $P(Y = y \mid \mathbf{X} = \mathbf{x})$ from a set of $N$ labeled training samples $\mathcal{D}_N = \{\mathbf{D}_1, \ldots, \mathbf{D}_N\}$.
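The decision rule on this slide can be illustrated with a small numerical sketch. The function name `bayes_optimal_class`, the posterior values, and the two loss matrices below are hypothetical and chosen only for illustration; the convention assumed is that `loss[y, k]` holds the loss of predicting class `y` when the true class is `k`.

```python
import numpy as np

def bayes_optimal_class(posterior, loss):
    """Return the class minimizing the expected loss, plus all expected losses.

    posterior[k] = P(Y = k | x); loss[y, k] = L(y, k), the cost of
    predicting y when the true class is k (assumed convention).
    """
    posterior = np.asarray(posterior, dtype=float)
    loss = np.asarray(loss, dtype=float)
    expected_loss = loss @ posterior  # one entry per candidate prediction y
    return int(np.argmin(expected_loss)), expected_loss

# Hypothetical posterior over two classes for a single instance.
posterior = np.array([0.7, 0.3])

# Under 0/1 loss the rule reduces to picking the most probable class.
zero_one = 1.0 - np.eye(2)
y_01, _ = bayes_optimal_class(posterior, zero_one)

# An asymmetric loss can move the decision away from the most probable class.
asymmetric = np.array([[0.0, 5.0],
                       [1.0, 0.0]])
y_asym, risk_asym = bayes_optimal_class(posterior, asymmetric)
```

With the 0/1 loss the most probable class is chosen; with the asymmetric loss the expensive mistake is avoided even though the other class is more probable.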
Learning a probabilistic classifier

An immediate approach:
➤ Learn a Bayesian network using standard score functions such as MDL, BIC, or BDe for Gaussian networks.

But such global score functions are not necessarily well-suited for learning a classifier! Instead we could:
(i) Use predictive MDL as the score function.
(ii) Use the wrapper approach to score a classifier, i.e., apply cross-validation on the training data and use the estimated accuracy as the score.

Unfortunately, (i) does not decompose, and (ii) comes at the cost of high computational complexity!

The complexity problem can be relieved by focusing on a particular sub-class of BNs!
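The wrapper approach in (ii) can be sketched as a k-fold cross-validation loop. Everything below is an illustrative assumption rather than the authors' implementation: `wrapper_score` is a hypothetical helper, the nearest-centroid classifier merely stands in for a candidate model, and the data are synthetic.

```python
import numpy as np

def wrapper_score(fit, predict, X, y, k=5, seed=0):
    """Estimate accuracy by k-fold cross-validation on the training data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    correct = 0
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])  # one full training run per fold
        correct += int(np.sum(predict(model, X[test_idx]) == y[test_idx]))
    return correct / len(y)

# Stand-in candidate classifier (hypothetical): nearest class centroid.
def fit_centroids(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    classes, centroids = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

# Synthetic two-class data, used only to exercise the scoring loop.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(6.0, 1.0, (40, 2))])
y = np.repeat([0, 1], 40)
score = wrapper_score(fit_centroids, predict_centroids, X, y)
```

Each candidate structure costs k complete training runs, which is the source of the computational burden mentioned on the slide.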
The Naïve Bayes classifier: an example

A Naïve Bayes classifier for the crabs domain:

[Figure: class node $Y$ with the attributes $X_1, \ldots, X_5$.]

Class $Y$: Blue male, Blue female, Orange male, Orange female
$X_1$ - Width of frontal lip
$X_2$ - Length along the midline
$X_3$ - Body length
$X_4$ - Rear width
$X_5$ - Maximum width of the carapace

The two assumptions:
• The attributes are conditionally independent given the class.
• The continuous attributes are generated by a specific parametric family of distributions.
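The two assumptions translate directly into a classifier with one univariate Gaussian per attribute and class. The following is a minimal sketch under those assumptions; the function names are hypothetical, and the data are synthetic rather than the crabs data.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class, per-attribute Gaussian parameters."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant guards against zero variance (an implementation choice).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def log_posterior(model, X):
    """Return log P(Y = c | x) for every row of X and every class c."""
    _, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]
    # Conditional independence: the log-likelihood is a sum over attributes.
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_joint = np.log(priors)[None, :] + log_lik
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    return log_joint - np.log(np.exp(log_joint).sum(axis=1, keepdims=True))

def predict(model, X):
    return model[0][np.argmax(log_posterior(model, X), axis=1)]

# Synthetic, well-separated two-class data (not the crabs data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
model = fit_gaussian_nb(X, y)
accuracy = np.mean(predict(model, X) == y)
```

Because every attribute is modeled separately, any correlation between attributes within a class is ignored by construction.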
Conditional correlations: the crabs domain

[Figure: scatter plot of width of frontal lip vs. rear width for each of the four classes (blue male, blue female, orange male, orange female).]

Note that there is a strong conditional correlation between the attributes.
➤ This is inconsistent with the independence assumptions of the Naïve Bayes classifier (in our experiments it achieves an accuracy of only 39.5% in this domain).
Handling conditional correlations

Methods for handling conditional dependencies can roughly be characterized as either:
➤ Allowing a more general correlation structure among the attributes.
  • TAN extended to the continuous domain (Friedman et al., 1998).
  • Linear Bayes; the continuous attributes are grouped together and associated with a multivariate Gaussian (Gama, 2000).
➤ Relying on a preprocessing of the data. For example:
  (i) Transform the data into a vector of independent factors (e.g. class-wise PCA or FA).
  (ii) Learn an NB classifier based on the expected values of the factors.
Transformation of the data: an example

[Figure: scatter plot of the first two factors found by applying a PCA on the crabs database, one marker type per class.]

Note that the conditional correlation is reduced significantly.
➤ The Naïve Bayes classifier achieves an accuracy of 94.0% when working on the transformed data.
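The kind of transformation used here can be sketched with a PCA computed from a singular value decomposition. The helper `pca_transform` and the synthetic data (two measurements driven by a shared "size" factor) are assumptions for illustration. The sketch shows that the PCA scores are uncorrelated over the whole sample; it does not by itself guarantee that correlations within each class vanish.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project centered data onto its leading principal directions."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    return (X - mean) @ components.T, components, mean

# Synthetic data: two measurements that both scale with a common size factor.
rng = np.random.default_rng(0)
size = rng.normal(15.0, 3.0, size=200)
X = np.column_stack([size + rng.normal(0.0, 0.5, 200),
                     0.8 * size + rng.normal(0.0, 0.5, 200)])

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
factors, _, _ = pca_transform(X, n_components=2)
corr_after = np.corrcoef(factors, rowvar=False)[0, 1]
```

A Naïve Bayes classifier would then be trained on `factors` instead of `X`.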
The distribution assumption

The Naïve Bayes classifier usually assumes that the continuous attributes are generated by a specific parametric family of distributions (usually Gaussians), but this is not always appropriate:

[Figure: histogram of the silicon contents (SI) in float processed glass, taken from the glass2 domain.]

To avoid having to make the Gaussian assumption, one could:
➤ Discretize the continuous attributes (Fayyad and Irani, 1993).
➤ Use kernel density estimation with Gaussian kernels (John and Langley, 1995).
➤ Use a dual representation (Friedman et al., 1998).
➤ Apply a finite mixture model (Monti and Cooper, 1999).
➤ . . .
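Of the alternatives listed, kernel density estimation with Gaussian kernels is easy to sketch: one Gaussian is centered on every training value and the results are averaged. The function name, the bandwidth value, and the bimodal synthetic sample below are assumptions for illustration; the sample is not the glass2 data.

```python
import numpy as np

def gaussian_kde_1d(train, query, bandwidth):
    """Average of Gaussian kernels centered on the training values."""
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    z = (query[:, None] - train[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Synthetic bimodal sample that a single Gaussian would fit poorly.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(71.0, 0.3, 100),
                         rng.normal(73.5, 0.3, 100)])

grid = np.linspace(68.0, 77.0, 2001)
dx = grid[1] - grid[0]
density = gaussian_kde_1d(sample, grid, bandwidth=0.2)
total_mass = (density * dx).sum()  # Riemann sum, should be close to 1

at_mode, between_modes = gaussian_kde_1d(sample, [71.0, 72.25], bandwidth=0.2)
```

In a Naïve Bayes classifier, one such estimate per attribute and class would replace the Gaussian density.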
Latent Classification Models: motivation I

Consider a factor analysis for dimensionality reduction, i.e., the attributes $\mathbf{X}$ are modeled with a $q$-dimensional vector of factor variables $\mathbf{Z}$:

$$\mathbf{X} = \mathbf{L}\mathbf{Z} + \boldsymbol{\epsilon},$$

where:
• $\mathbf{L}$ is the regression matrix.
• $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$.
• $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \Theta)$ is an $n$-dimensional random variable with diagonal covariance matrix $\Theta$.

[Figure: factor variables $Z_1, Z_2$ with the attributes $X_1, \ldots, X_5$ as children.]

In this model, the factor variables model the dependencies among the attributes, and $\boldsymbol{\epsilon}$ can be interpreted as the sensor noise associated with the attributes.
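A short simulation makes the role of the factors concrete: although the noise is independent across attributes, the attributes become correlated through the shared factors. The sketch assumes the standard factor-analysis model (attributes equal a regression matrix times standard-normal factors plus diagonal Gaussian noise); the matrix `L` and the noise variances `theta` are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Hypothetical 5 x 2 regression matrix and diagonal noise variances.
L = np.array([[1.0, 0.0],
              [0.8, 0.2],
              [0.5, 0.5],
              [0.2, 0.8],
              [0.0, 1.0]])
theta = np.array([0.3, 0.2, 0.4, 0.1, 0.3])

Z = rng.normal(size=(n_samples, 2))                    # factors, N(0, I)
eps = rng.normal(size=(n_samples, 5)) * np.sqrt(theta)  # independent sensor noise
X = Z @ L.T + eps                                       # generated attributes

implied_cov = L @ L.T + np.diag(theta)   # covariance implied by the model
empirical_cov = np.cov(X, rowvar=False)
max_error = np.abs(empirical_cov - implied_cov).max()
```

The off-diagonal entries of `implied_cov` come entirely from `L @ L.T`, i.e. from the factors.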
Latent Classification Models: motivation II

Motivation: Consider the following simple idea for handling the conditional dependencies among the attributes.
(i) Learn a factor analyzer from the data.
(ii) For each data-point $\mathbf{D}_i = (\mathbf{x}_i, y_i)$ let the expectation $E(\mathbf{Z} \mid \mathbf{X} = \mathbf{x}_i, Y = y_i)$ represent observation $\mathbf{D}_i$ in the database.
(iii) Learn a Naïve Bayes classifier using the transformed dataset generated in step (ii).

Unfortunately:
• The conditional independencies are not consistent with the NB classifier.
• When we substitute the actual observations with the expected values of the factor variables in step (ii), we disregard information about the uncertainty in the generative model.
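For a plain factor analyzer, the expectation used in step (ii) has a closed form, and so does the posterior covariance that the procedure throws away. The sketch below assumes the zero-mean model with standard-normal factors and diagonal noise and ignores the class label (a plain FA does not contain it); `factor_posterior`, `L`, `theta`, and `x` are hypothetical names and values.

```python
import numpy as np

def factor_posterior(x, L, theta):
    """Posterior mean and covariance of the factors given one observation.

    Assumes X = L Z + eps with Z ~ N(0, I) and eps ~ N(0, diag(theta)).
    """
    sigma = L @ L.T + np.diag(theta)      # marginal covariance of X
    gain = np.linalg.solve(sigma, L).T    # equals L^T sigma^{-1} (sigma is symmetric)
    mean = gain @ x                       # E[Z | X = x]
    cov = np.eye(L.shape[1]) - gain @ L   # Cov[Z | X = x], independent of x
    return mean, cov

# Hypothetical model parameters and a single observation.
L = np.array([[1.0, 0.0],
              [0.8, 0.2],
              [0.5, 0.5],
              [0.2, 0.8],
              [0.0, 1.0]])
theta = np.array([0.3, 0.2, 0.4, 0.1, 0.3])
x = np.array([1.0, 0.5, -0.2, 0.3, 0.8])

post_mean, post_cov = factor_posterior(x, L, theta)
```

Step (ii) keeps only `post_mean`; the non-zero `post_cov` is the uncertainty that is disregarded when the expected values replace the observations.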
Latent Classification Models: The idea

Latent classification models can be seen as a combination of the generative model from an FA and a Naïve Bayes classifier:

[Figure: network over the nodes Class; $Z_1, \ldots, Z_4$; $Z'_1, \ldots, Z'_4$; and $X_1, \ldots, X_5$.]
Latent Classification Models: The linear case

[Figure: network with the class node, the latent variables $Z_1, \ldots, Z_4$, and the attributes $X_1, \ldots, X_5$.]

For the quantitative part we have:
• Conditionally on $Y = j$, the latent variables, $\mathbf{Z}$, follow a Gaussian distribution with $E[\mathbf{Z} \mid Y = j] = \boldsymbol{\mu}_j$ and $\mathrm{Cov}(\mathbf{Z} \mid Y = j) = \Gamma_j$.
• Conditionally on $\mathbf{Z} = \mathbf{z}$, the variables, $\mathbf{X}$, follow a Gaussian distribution with $E(\mathbf{X} \mid \mathbf{Z} = \mathbf{z}) = \mathbf{L}\mathbf{z}$ and $\mathrm{Cov}(\mathbf{X} \mid \mathbf{Z} = \mathbf{z}) = \Theta$.

Note that:
• Both $\Gamma_j$ and $\Theta$ must be diagonal.
• As opposed to the generative model underlying factor analysis, we do not assume that $E(\mathbf{Z} \mid Y) = \mathbf{0}$ and $\mathrm{Cov}(\mathbf{Z} \mid Y) = \mathbf{I}$.

The complexity is not that bad!
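Marginalizing out the latent variables makes the classifier implied by the linear model explicit. The derivation below is a sketch using the standard result for linear-Gaussian models; it writes $\mathbf{L}$ for the regression matrix and $\boldsymbol{\mu}_j$, $\Gamma_j$ for the class-conditional mean and covariance of the latent variables, which are notational assumptions.

```latex
\begin{align*}
\mathbf{Z} \mid \{Y = j\} &\sim N(\boldsymbol{\mu}_j, \Gamma_j), \qquad
\mathbf{X} \mid \{\mathbf{Z} = \mathbf{z}\} \sim N(\mathbf{L}\mathbf{z}, \Theta) \\
\Rightarrow \quad \mathbf{X} \mid \{Y = j\} &\sim
  N\!\left(\mathbf{L}\boldsymbol{\mu}_j,\; \mathbf{L}\Gamma_j\mathbf{L}^{T} + \Theta\right), \\
P(Y = j \mid \mathbf{X} = \mathbf{x}) &\propto
  P(Y = j)\, N\!\left(\mathbf{x};\; \mathbf{L}\boldsymbol{\mu}_j,\; \mathbf{L}\Gamma_j\mathbf{L}^{T} + \Theta\right).
\end{align*}
```

Although $\Gamma_j$ and $\Theta$ are diagonal, $\mathbf{L}\Gamma_j\mathbf{L}^{T} + \Theta$ is in general a full matrix, which is how the model can represent conditional correlations among the attributes.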
Linear Latent Classification Models: Expressive power

Proposition:
Assume that $Y$ follows a multinomial distribution, and that $\mathbf{X} \mid \{Y = j\} \sim N(\boldsymbol{\alpha}_j, \Sigma_j)$ for a given set of parameters $\{\boldsymbol{\alpha}_j, \Sigma_j\}_{j=1}^{|\mathrm{sp}(Y)|}$. Assume that $\mathrm{rank}(\Sigma_j) = n$ for at least one $j$. Then the joint distribution $(Y, \mathbf{X})$ can be represented by an LCM with $q \le n \cdot |\mathrm{sp}(Y)|$.

Proof outline:
The proof is constructive. First we must show that for a given $\{\boldsymbol{\alpha}_j, \Sigma_j\}_{j=1}^{|\mathrm{sp}(Y)|}$ we can define $\mathbf{L}$, $\Theta$, $\{\boldsymbol{\mu}_j\}_{j=1}^{|\mathrm{sp}(Y)|}$ and $\{\Gamma_j\}_{j=1}^{|\mathrm{sp}(Y)|}$ such that $\mathbf{X} \mid \{Y = j\} \sim N(\boldsymbol{\alpha}_j, \Sigma_j)$.
➤ If $\mathbf{L} \in \mathbb{R}^{n \times q}$ has rank $n$, then $\mathbf{L}\boldsymbol{\mu} = \boldsymbol{\alpha}$ has a solution for any $\boldsymbol{\alpha} \in \mathbb{R}^n$.

Next we need to define $\mathbf{L}$ and $\{\Gamma_j\}_{j=1}^{|\mathrm{sp}(Y)|}$ (we assume that $\Theta = \mathbf{0}$). Pick a $\Sigma_j$ and note that since $\Sigma_j$ is a positive semi-definite and symmetric matrix we can create a singular value decomposition of $\Sigma_j$ as $\Sigma_j = \mathbf{U}_j \Lambda_j \mathbf{U}_j^{T}$; $\mathbf{U}_j$ contains the eigenvectors of $\Sigma_j$ and $\Lambda_j$ is the diagonal matrix holding the corresponding eigenvalues.
➤ Define $\mathbf{L}$ by the $\mathbf{U}_j$ matrices, and let $\Gamma_j$ be the block diagonal matrix constructed from the $\Lambda_j$ and $\mathbf{0}$.
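The construction in the proof outline can be checked numerically. The sketch below is one possible reading of it, not the authors' code: it stacks the eigenvector matrices of the class covariances side by side to form the regression matrix, uses as many latent variables as attributes times classes, sets the noise covariance to zero, and places each class's eigenvalues in its own diagonal block. The function `build_lcm` and the randomly generated target parameters are hypothetical.

```python
import numpy as np

def build_lcm(alphas, sigmas):
    """Build L, latent means, and diagonal latent covariances (noise = 0)."""
    k, n = len(sigmas), sigmas[0].shape[0]
    eig = [np.linalg.eigh(s) for s in sigmas]   # (eigenvalues, eigenvectors) per class
    L = np.hstack([u for _, u in eig])          # n x (n * k), rank n
    mus, gammas = [], []
    for j, (lam, u) in enumerate(eig):
        block = slice(j * n, (j + 1) * n)
        gamma_diag = np.zeros(n * k)
        gamma_diag[block] = np.clip(lam, 0.0, None)  # eigenvalues in block j, zeros elsewhere
        mu = np.zeros(n * k)
        mu[block] = u.T @ alphas[j]                   # one solution of L mu = alpha_j
        gammas.append(np.diag(gamma_diag))
        mus.append(mu)
    return L, mus, gammas

# Hypothetical target class-conditional Gaussians: 2 classes, 3 attributes.
rng = np.random.default_rng(0)
alphas = [rng.normal(size=3) for _ in range(2)]
sigmas = []
for _ in range(2):
    a = rng.normal(size=(3, 3))
    sigmas.append(a @ a.T + 0.1 * np.eye(3))    # symmetric positive definite

L, mus, gammas = build_lcm(alphas, sigmas)
mean_ok = all(np.allclose(L @ mus[j], alphas[j]) for j in range(2))
cov_ok = all(np.allclose(L @ gammas[j] @ L.T, sigmas[j]) for j in range(2))
```

Both checks confirm that the implied class-conditional mean and covariance of the attributes reproduce the targets, with every latent covariance diagonal as the model requires.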
Non-Linear Latent Classification Models: Motivation

In the (linear) model, one of the main assumptions is that there exists a linear mapping, $\mathbf{L}$, from the factor space to the attribute space:
➤ Conditionally on $Y = j$, the attributes are assumed to follow a multivariate Gaussian distribution with mean $\mathbf{L}\boldsymbol{\mu}_j$ and covariance matrix $\mathbf{L}\Gamma_j\mathbf{L}^{T} + \Theta$.

Unfortunately, this is not sufficiently accurate for some domains:

[Figure: empirical probability distribution of the variables Silicon and Potassium for float processed glass (taken from the glass2 domain).]

The proposed linear LCM achieved an accuracy of only 66.9%.
Non-Linear Latent Classification Models

In order to extend the expressive power of linear LCMs we propose a natural generalization, termed non-linear LCMs.

Analogously to the linear LCM, the non-linear LCM can be seen as combining an NB classifier with a mixture of FAs:
• a mixture variable $M$ governing the mixture distribution
• a (generalized) FA for each mixture component $M = m$:

$$\mathbf{X} \mid \{M = m\} = \mathbf{L}_m \mathbf{Z} + \boldsymbol{\eta}_m + \boldsymbol{\epsilon}_m,$$

where $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$ and $\boldsymbol{\epsilon}_m \sim N(\mathbf{0}, \Theta_m)$. Note that $\boldsymbol{\eta}_m$ models the data mean.

[Figure: network with the latent variables $Z_1, Z_2$, the mixture variable $M$, and the attributes $X_1, \ldots, X_5$.]
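The effect of the mixture variable can be made explicit by marginalizing out the factors within each component. The sketch below assumes that each component has the form of a factor analyzer with its own regression matrix $\mathbf{L}_m$, offset $\boldsymbol{\eta}_m$, and diagonal noise covariance $\Theta_m$, with standard-normal factors; the symbols are notational assumptions, and the class variable is left out for brevity.

```latex
\begin{align*}
\mathbf{X} \mid \{M = m\} &\sim
  N\!\left(\boldsymbol{\eta}_m,\; \mathbf{L}_m\mathbf{L}_m^{T} + \Theta_m\right), \\
p(\mathbf{x}) &= \sum_{m} P(M = m)\,
  N\!\left(\mathbf{x};\; \boldsymbol{\eta}_m,\; \mathbf{L}_m\mathbf{L}_m^{T} + \Theta_m\right).
\end{align*}
```

Under these assumptions the attributes follow a mixture of Gaussians, which can approximate distributions that a single linear-Gaussian component cannot.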