

1. Latent Classification Models: Classification in continuous domains
   Helge Langseth and Thomas D. Nielsen

2. Outline
   ➤ Probabilistic classifiers
   ➤ Naïve Bayes classifiers
     • Relaxing the assumptions
   ➤ Latent classification models (LCMs)
     • A linear generative model
     • A non-linear generative model
   ➤ Learning LCMs
     • Structural learning
     • Parameter learning
   ➤ Experimental results

3. Classification in a probabilistic framework
   A Bayes optimal classifier will classify an instance x = (x_1, ..., x_n) ∈ R^n to the class

       y* = argmin_{y ∈ sp(Y)} Σ_{y′ ∈ sp(Y)} L(y, y′) · P(Y = y′ | X = x),

   where L(·, ·) is the loss function.
   When learning a probabilistic classifier, the task is to learn the probability distribution P(Y = y | X = x) from a set of N labeled training samples D_N = {(x_1, y_1), ..., (x_N, y_N)}.
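As a small illustration (not part of the original slides), the decision rule above fits in a few lines of Python. The posterior P(Y | x) is assumed to be given, and the 0/1 loss matrix is only an example.

```python
# Minimal sketch of the Bayes-optimal decision rule under a loss matrix.
import numpy as np

def bayes_optimal_class(posterior, loss):
    """posterior[k] = P(Y = k | x); loss[j, k] = L(j, k), the loss of predicting j when k is true."""
    expected_loss = loss @ posterior          # expected loss of each candidate prediction j
    return int(np.argmin(expected_loss))

# With 0/1 loss the rule reduces to picking the most probable class:
posterior = np.array([0.2, 0.5, 0.3])
zero_one_loss = 1.0 - np.eye(3)
print(bayes_optimal_class(posterior, zero_one_loss))   # -> 1
```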

4. Learning a probabilistic classifier
   An immediate approach:
   ➤ Learn a Bayesian network using standard score functions such as MDL, BIC, or BGe for Gaussian networks.
   But such global score functions are not necessarily well-suited for learning a classifier! Instead we could:
   (i) Use predictive MDL as the score function.
   (ii) Use the wrapper approach to score a classifier, i.e., apply cross-validation on the training data and use the estimated accuracy as the score (see the sketch below).
   Unfortunately, (i) does not decompose, and (ii) comes at the cost of high computational complexity!
   The complexity problem can be relieved by focusing on a particular sub-class of BNs!
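A minimal sketch of the wrapper score in (ii), assuming scikit-learn is available; the candidate classifier and the data matrices are placeholders.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def wrapper_score(candidate, X_train, y_train, folds=5):
    """Score a candidate classifier by its cross-validated accuracy on the training data."""
    return cross_val_score(candidate, X_train, y_train, cv=folds).mean()

# Usage: keep the candidate structure with the highest wrapper score, e.g.
# wrapper_score(GaussianNB(), X_train, y_train)
```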

5. The Naïve Bayes classifier: an example
   A Naïve Bayes classifier for the crabs domain: the class variable Y (Blue male, Blue female, Orange male, Orange female) is the parent of the attributes X1, ..., X5:
   X1 - Width of the frontal lobe      X4 - Rear width
   X2 - Length along the midline       X5 - Maximum width of the carapace
   X3 - Body length
   The two assumptions:
   • The attributes are conditionally independent given the class.
   • The continuous attributes are generated by a specific parametric family of distributions.
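A hedged sketch (not from the slides) of a classifier built on exactly these two assumptions: attribute independence given the class, and Gaussian class-conditional densities. The function names and the generic (X, y) interface are illustrative.

```python
import numpy as np
from scipy.stats import norm

def fit_naive_bayes(X, y):
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    mean  = {c: X[y == c].mean(axis=0) for c in classes}
    std   = {c: X[y == c].std(axis=0) for c in classes}
    return prior, mean, std

def predict_naive_bayes(x, prior, mean, std):
    # log P(Y=c) + sum_i log N(x_i; mean_{c,i}, std_{c,i})  -- independence given the class
    scores = {c: np.log(prior[c]) + norm.logpdf(x, mean[c], std[c]).sum() for c in prior}
    return max(scores, key=scores.get)
```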

6. Conditional correlations: the crabs domain
   [Scatter plot: width of the frontal lobe vs. rear width, one marker per class (BM, BF, OM, OF).]
   Note that there is a strong conditional correlation between the attributes.
   ➤ This is inconsistent with the independence assumption of the Naïve Bayes classifier (in our experiments it achieves an accuracy of only 39.5% in this domain).

7. Handling conditional correlations
   Methods for handling conditional dependencies can roughly be characterized as either:
   ➤ Allowing a more general correlation structure among the attributes.
     • TAN extended to the continuous domain (Friedman et al., 1998).
     • Linear Bayes: the continuous attributes are grouped together and associated with a multivariate Gaussian (Gama, 2000).
   ➤ Relying on a preprocessing of the data. For example:
     (i) Transform the data into a vector of independent factors (e.g. class-wise PCA or FA).
     (ii) Learn an NB classifier based on the expected values of the factors.

8. Transformation of the data: an example
   [Scatter plot: the first two factors found by applying a PCA to the crabs database, one marker per class (BM, BF, OM, OF).]
   Note that the conditional correlation is reduced significantly.
   ➤ The Naïve Bayes classifier achieves an accuracy of 94.0% when working on the transformed data.
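A rough sketch of this preprocessing idea, assuming scikit-learn; the slide applies a class-wise transformation, whereas a single global PCA is used here for brevity, and X_crabs, y_crabs are placeholder names for the data.

```python
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Project the attributes onto a few principal components, then train a Naive Bayes on them.
pca_nb = make_pipeline(PCA(n_components=2), GaussianNB())
# accuracy = cross_val_score(pca_nb, X_crabs, y_crabs, cv=5).mean()
```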

9. The distribution assumption
   The Naïve Bayes classifier usually assumes that the continuous attributes are generated by a specific parametric family of distributions (usually Gaussians), but this is not always appropriate:
   [Histogram of the silicon content in float processed glass, taken from the glass2 domain.]
   To avoid having to make the Gaussian assumption, one could:
   ➤ Discretize the continuous attributes (Fayyad and Irani, 1993).
   ➤ Use kernel density estimation with Gaussian kernels (John and Langley, 1995).
   ➤ Use a dual representation (Friedman et al., 1998).
   ➤ Apply a finite mixture model (Monti and Cooper, 1999).
   ➤ ...
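A hedged sketch of the kernel-density alternative listed above: keep the NB independence assumption but estimate each class-conditional attribute density with a univariate Gaussian-kernel KDE. scikit-learn is assumed, and the bandwidth value is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_kde_nb(X, y, bandwidth=0.5):
    # one univariate KDE per (class, attribute) pair
    return {c: [KernelDensity(bandwidth=bandwidth).fit(X[y == c, i:i + 1])
                for i in range(X.shape[1])]
            for c in np.unique(y)}

def class_log_likelihoods(kdes, x):
    # log p(x | Y = c) = sum_i log p(x_i | Y = c)
    return {c: sum(kde.score_samples([[xi]])[0] for kde, xi in zip(models, x))
            for c, models in kdes.items()}
```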

10. Latent Classification Models: motivation I
    Consider a factor analysis for dimensionality reduction, i.e., the attribute vector X is modeled via a q-dimensional vector of factor variables Z:

        X = L Z + ε,

    where:
    • L is the regression matrix,
    • Z ∼ N(0, I),
    • ε ∼ N(0, Θ) is an n-dimensional random variable with diagonal covariance matrix Θ.
    [Graph: latent factors Z1, Z2 with edges to each of the attributes X1, ..., X5.]
    In this model, the factor variables model the dependencies among the attributes, and ε can be interpreted as the sensor noise associated with the attributes.
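A minimal sketch of sampling from this generative model; the dimensions match the figure (n = 5 attributes, q = 2 factors), and the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 5, 2
L = rng.normal(size=(n, q))                    # regression (loading) matrix
theta = rng.uniform(0.1, 0.5, size=n)          # diagonal of the sensor-noise covariance

Z = rng.normal(size=q)                         # latent factors, Z ~ N(0, I)
X = L @ Z + rng.normal(scale=np.sqrt(theta))   # one sampled attribute vector
# Marginally X ~ N(0, L L^T + Theta): the off-diagonal entries of L L^T carry
# the dependencies among the attributes, while Theta is pure sensor noise.
```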

11. Latent Classification Models: motivation II
    Motivation: Consider the following simple idea for handling the conditional dependencies among the attributes.
    (i) Learn a factor analyzer from the data.
    (ii) For each data point d_i = (x_i, y_i), let the expectation E[Z | X = x_i, Y = y_i] represent observation i in the database.
    (iii) Learn a Naïve Bayes classifier using the transformed dataset generated in step (ii) (a sketch follows below).
    Unfortunately:
    • The conditional independencies are not consistent with those of the NB classifier.
    • When we substitute the actual observations with the expected values of the factor variables in step (ii), we disregard information about the uncertainty in the generative model.
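A sketch of steps (i)-(iii), assuming scikit-learn. For simplicity a single class-independent factor analyzer is fitted here, whereas the slide conditions the expectation on the class label as well.

```python
from sklearn.decomposition import FactorAnalysis
from sklearn.naive_bayes import GaussianNB

def two_stage_fa_nb(X_train, y_train, q=2):
    fa = FactorAnalysis(n_components=q).fit(X_train)    # (i)  learn a factor analyzer
    Z_train = fa.transform(X_train)                      # (ii) expected factor values E[Z | x]
    nb = GaussianNB().fit(Z_train, y_train)              # (iii) NB on the transformed data
    return fa, nb

# Prediction ignores the latent-space uncertainty, which is exactly the second caveat above:
# fa, nb = two_stage_fa_nb(X_train, y_train); y_hat = nb.predict(fa.transform(X_test))
```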

12. Latent Classification Models: the idea
    Latent classification models can be seen as a combination of the generative model from an FA and a Naïve Bayes classifier:
    [Model structure: the class node is the parent of the latent variables Z1, ..., Z4; these connect to Z′1, ..., Z′4, which in turn are parents of the attributes X1, ..., X5.]

13. Latent Classification Models: the linear case
    [Model structure: the class node is the parent of the latent variables Z1, ..., Z4, which in turn are parents of the attributes X1, ..., X5.]
    For the quantitative part we have:
    • Conditionally on Y = j, the latent variables Z follow a Gaussian distribution with E[Z | Y = j] = μ_j and Cov(Z | Y = j) = Γ_j.
    • Conditionally on Z = z, the variables X follow a Gaussian distribution with E[X | Z = z] = L z and Cov(X | Z = z) = Θ.
    Note that:
    • Both Γ_j and Θ must be diagonal.
    • As opposed to the generative model underlying factor analysis, we do not assume that E[Z | Y] = 0 and Cov(Z | Y) = I.
    The complexity is not that bad!
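A sketch of classification with a linear LCM whose parameters are known. Marginalizing out Z gives X | Y = j ∼ N(L μ_j, L Γ_j L^T + Θ), so the class posterior follows from Bayes' rule; all parameter names below are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def lcm_predict(x, prior, mu, Gamma, L, Theta):
    """prior[j]: P(Y=j); mu[j]: (q,) latent mean; Gamma[j]: (q,q) diagonal; L: (n,q); Theta: (n,n) diagonal."""
    log_post = {j: np.log(prior[j]) +
                   multivariate_normal.logpdf(x, L @ mu[j], L @ Gamma[j] @ L.T + Theta)
                for j in prior}
    return max(log_post, key=log_post.get)
```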

14. Linear Latent Classification Models: expressive power
    Proposition: Assume that Y follows a multinomial distribution, and that X | {Y = j} ∼ N(μ_j, Σ_j) for a given set of parameters {μ_j, Σ_j}_{j=1}^{|sp(Y)|}. Assume that rank(Σ_j) = n for at least one j. Then the joint distribution of (Y, X) can be represented by an LCM with q ≤ n · |sp(Y)|.
    Proof outline:
    The proof is constructive. First we must show that for a given {μ_j, Σ_j}_{j=1}^{|sp(Y)|} we can define L, Θ, and class-conditional latent parameters {m_j, Γ_j}_{j=1}^{|sp(Y)|} such that X | {Y = j} ∼ N(μ_j, Σ_j).
    ➤ If L ∈ R^(n×q) has rank n, then L z = x has a solution for any x ∈ R^n.
    Next we need to define L and {Γ_j}_{j=1}^{|sp(Y)|} (we assume that Θ = 0). Pick a Σ_j and note that since Σ_j is a positive semi-definite and symmetric matrix we can create a singular value decomposition of Σ_j as Σ_j = U_j Λ_j U_j^T; U_j contains the eigenvectors of Σ_j and Λ_j is the diagonal matrix holding the corresponding eigenvalues.
    ➤ Define L by stacking the U_j matrices, and let Γ_j be the block-diagonal matrix constructed from Λ_j and 0.
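A numerical check of the covariance part of this construction (with Θ = 0): L is built by stacking the eigenvector matrices U_j, and Γ_j is block-diagonal with Λ_j in block j and zeros elsewhere, so that L Γ_j L^T = Σ_j. The example covariances are random.

```python
import numpy as np

def build_lcm_covariances(Sigmas):
    n, k = Sigmas[0].shape[0], len(Sigmas)
    Us, Lams = [], []
    for S in Sigmas:
        lam, U = np.linalg.eigh(S)          # S = U diag(lam) U^T
        Us.append(U)
        Lams.append(lam)
    L = np.hstack(Us)                       # n x (n*k), i.e. q = n * |sp(Y)|
    Gammas = []
    for j in range(k):
        diag = np.zeros(n * k)
        diag[j * n:(j + 1) * n] = Lams[j]   # Lambda_j in block j, 0 elsewhere
        Gammas.append(np.diag(diag))
    return L, Gammas

rng = np.random.default_rng(1)
Sigmas = [M @ M.T for M in (rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))]
L, Gammas = build_lcm_covariances(Sigmas)
assert all(np.allclose(L @ G @ L.T, S) for G, S in zip(Gammas, Sigmas))
```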

15. Non-Linear Latent Classification Models: motivation
    In the (linear) model, one of the main assumptions is that there exists a linear mapping, L, from the factor space to the attribute space:
    ➤ Conditionally on Y = j, the attributes are assumed to follow a multivariate Gaussian distribution with mean L μ_j and covariance matrix L Γ_j L^T + Θ.
    Unfortunately, this is not sufficiently accurate for some domains:
    [Figure: the empirical probability distribution of the variables Silicon and Potassium for float processed glass (taken from the glass2 domain).]
    The proposed linear LCM achieved an accuracy of only 66.9%.

16. Non-Linear Latent Classification Models
    In order to extend the expressive power of linear LCMs we propose a natural generalization, termed non-linear LCMs. Analogously to the linear LCM, the non-linear LCM can be seen as combining an NB classifier with a mixture of FAs:
    • a mixture variable M governing the mixture distribution;
    • a (generalized) FA for each mixture component M = m:

          X | {M = m} = L_m Z + μ_m + ε_m,

      where Z ∼ N(0, I) and ε_m ∼ N(0, Θ_m). Note that μ_m models the data mean.
    [Graph: latent factors Z1, Z2 and the mixture variable M, all parents of the attributes X1, ..., X5.]
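A sketch of sampling from this non-linear generative model: first draw the mixture component M = m, then X = L_m Z + μ_m + ε_m. All sizes and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, k = 5, 2, 3                                        # attributes, factors, mixture components
weights = np.full(k, 1 / k)                              # P(M = m)
L  = [rng.normal(size=(n, q)) for _ in range(k)]         # component loading matrices L_m
mu = [rng.normal(size=n) for _ in range(k)]              # component means mu_m
th = [rng.uniform(0.1, 0.5, size=n) for _ in range(k)]   # diagonals of Theta_m

m = rng.choice(k, p=weights)
Z = rng.normal(size=q)
X = L[m] @ Z + mu[m] + rng.normal(scale=np.sqrt(th[m]))
# Marginally X is a mixture of Gaussians with component covariances L_m L_m^T + Theta_m,
# which lets the model capture non-Gaussian class-conditional densities.
```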
