 
Chapter 10. Semi-Supervised Learning
Wei Pan
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475
© Wei Pan
Outline
◮ Mixture model: $L_1$ penalization for variable selection; Pan et al (2006, Bioinformatics)
  ◮ Introduction: motivating example
  ◮ Methods: standard and new ones
  ◮ Simulation
  ◮ Example
  ◮ Discussion
◮ Transductive SVM (TSVM): Wang, Shen & Pan (2007, CM; 2009, JMLR)
◮ Constrained K-means: Wagstaff et al (2001)
Introduction
◮ Biology: Do human blood outgrowth endothelial cells (BOECs) belong to, or are they closer to, large vessel endothelial cells (LVECs) or microvascular endothelial cells (MVECs)?
◮ Why important: BOECs are being explored for efficacy in endothelial-based gene therapy (Lin et al 2002) and for use in vascular diagnostics (Hebbel et al 2005); in each case, it is important to know whether BOECs have characteristics of MVECs or of LVECs.
◮ Based on the expression of gene CD36, it seems reasonable to characterize BOECs as MVECs (Swerlick et al 1992).
◮ However, CD36 is expressed in endothelial cells, monocytes, some epidermal cells and a variety of cell lines; characterizing BOECs, or any other cells, by a single gene marker seems unreliable.
◮ Jiang (2005) conducted a genome-wide comparison: microarray gene expression profiles for BOEC, LVEC and MVEC samples were clustered; BOEC samples tended to cluster together with MVEC samples, suggesting that BOECs were closer to MVECs.
◮ Two potential shortcomings:
  1. Used hierarchical clustering, ignoring the known classes of LVEC and MVEC samples. Alternative? Semi-supervised learning: treat LVEC and MVEC as known and BOEC as unknown (see McLachlan and Basford 1988; Zhu 2006 for reviews). Here it requires learning a novel class: BOEC may or may not belong to LVEC or MVEC.
  2. Used only the 37 genes that best discriminate between LVEC and MVEC. Important: the result may critically depend on the features or genes being used; the few genes might not reflect the whole picture. Alternative? Start with more genes; but ... a dilemma: too many genes might mask the true clustering structure; to be shown later.
◮ For high-dimensional data, feature selection is necessary, preferably embedded within the learning framework: automatic/simultaneous feature selection.
◮ In contrast to sequential methods, which first select features and then fit/learn a model: pre-selection may perform terribly. Why: because the two steps are separated, the selected features may not be relevant at all to uncovering interesting clustering structures.
◮ We propose a penalized mixture model: semi-supervised learning with automatic variable selection performed simultaneously with model fitting.
◮ With more genes included in a starting model and with appropriate gene selection, BOEC samples are separate from LVEC and MVEC samples.
◮ Although finite mixture models have been studied in the statistics and machine learning literature (McLachlan and Peel 2002; Nigam et al 2006), and even applied to microarray data analysis (Alexandridis et al 2004), our proposal of using a penalized likelihood to realize automatic variable selection is novel; in fact, variable selection in this context is a largely neglected topic.
◮ This work extends the penalized unsupervised learning/clustering analysis method of Pan and Shen (2007) to semi-supervised learning.
Semi-Supervised Learning via Standard Mixture Model
◮ Data: given $n$ $K$-dimensional observations $x_1, ..., x_n$; the first $n_0$ do not have class labels while the last $n_1$ do ($n = n_0 + n_1$).
  There are $g = g_0 + g_1$ classes: the first $g_0$ are unknown/novel classes to be discovered, while the last $g_1$ are known.
  $z_{ij} = 1$ iff $x_j$ is known to be in class $i$; $z_{ij} = 0$ otherwise. Note: the $z_{ij}$'s are missing for $1 \le j \le n_0$.
◮ A mixture model as a generative model:
$$f(x; \Theta) = \sum_{i=1}^{g} \pi_i f_i(x; \theta_i),$$
  $\pi_i$: unknown prior probabilities; $f_i$: class-specific distribution with unknown parameters $\theta_i$.
◮ For high-dimensional, low-sample-size data, we propose
$$f_i(x_j; \theta_i) = \frac{1}{(2\pi)^{K/2} |V|^{1/2}} \exp\left\{ -\frac{1}{2} (x_j - \mu_i)' V^{-1} (x_j - \mu_i) \right\},$$
  where $V = \mathrm{diag}(\sigma_1^2, \sigma_2^2, ..., \sigma_K^2)$ and $|V| = \prod_{k=1}^{K} \sigma_k^2$.
◮ Posterior probability of $x_j$ coming from class/component $i$:
$$\tau_{ij} = \frac{\pi_i f_i(x_j; \theta_i)}{\sum_{l=1}^{g} \pi_l f_l(x_j; \theta_l)} = \frac{\pi_i \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left\{ -\frac{(x_{jk} - \mu_{ik})^2}{2\sigma_k^2} \right\}}{\sum_{l=1}^{g} \pi_l \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left\{ -\frac{(x_{jk} - \mu_{lk})^2}{2\sigma_k^2} \right\}}.$$
◮ Assign $x_j$ to cluster $i_0 = \arg\max_i \tau_{ij}$.
◮ A key observation: if $\mu_{1k} = \mu_{2k} = ... = \mu_{gk}$ for some $k$, the terms involving $x_{jk}$ cancel out in $\tau_{ij}$: feature selection!
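The cancellation in the key observation can be checked numerically. The following is a minimal sketch, assuming a common diagonal covariance as on this slide; the function names `log_gauss_diag` and `posterior` and the toy parameter values are illustrative choices, not part of the original method's code.

```python
import math

def log_gauss_diag(x, mu, var):
    """Log density of N(mu, diag(var)) at x; x, mu, var are lists of length K."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (xk - m) ** 2 / (2.0 * v)
        for xk, m, v in zip(x, mu, var)
    )

def posterior(x, pis, mus, var):
    """Posterior probabilities [tau_1, ..., tau_g] for a single observation x."""
    logs = [math.log(p) + log_gauss_diag(x, mu, var) for p, mu in zip(pis, mus)]
    top = max(logs)  # subtract the max before exponentiating, for stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy setting (hypothetical values): feature 2 has the same mean in both classes.
pis = [0.5, 0.5]
mus = [[0.0, 0.3], [2.0, 0.3]]
var = [1.0, 1.0]

tau_a = posterior([1.5, -4.0], pis, mus, var)
tau_b = posterior([1.5, 9.0], pis, mus, var)  # only feature 2 differs from tau_a
assigned = tau_a.index(max(tau_a))            # i_0 = argmax_i tau_ij
```

Up to floating-point rounding, `tau_a` and `tau_b` agree, since the second feature contributes the same factor to every class and cancels in the ratio.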
◮ Note: variable selection is possible under a common diagonal covariance matrix $V$ across all clusters. E.g., if we use cluster-specific $V_i$ (or a non-diagonal $V$), then even if $\mu_{1k} = \mu_{2k} = ... = \mu_{gk}$, $x_{jk}$ is still informative; e.g., $N(0, 1)$ vs $N(0, 2)$.
◮ $\Theta = \{(\pi_i, \theta_i) : i = 1, ..., g\}$ needs to be estimated; MLE.
◮ The log-likelihood is
$$\log L(\Theta) = \sum_{j=1}^{n_0} \log\left[ \sum_{i=1}^{g} \pi_i f_i(x_j; \theta_i) \right] + \sum_{j=n_0+1}^{n} \log\left[ \sum_{i=1}^{g} z_{ij} f_i(x_j; \theta_i) \right].$$
◮ Common to use the EM algorithm (Dempster et al 1977) to get the MLE; see below for details.
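As a sketch of how this log-likelihood could be evaluated, the code below treats unlabeled observations with the full mixture sum and labeled ones with the single component selected by $z_{ij}$. The function names and the convention of encoding a missing label as `None` are assumptions made for illustration.

```python
import math

def log_gauss_diag(x, mu, var):
    """Log density of N(mu, diag(var)) at x; x, mu, var are lists of length K."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (xk - m) ** 2 / (2.0 * v)
        for xk, m, v in zip(x, mu, var)
    )

def semi_supervised_loglik(xs, labels, pis, mus, var):
    """labels[j] is None for an unlabeled x_j, otherwise its known class index."""
    total = 0.0
    for x, lab in zip(xs, labels):
        if lab is None:
            # log sum_i pi_i f_i(x_j), computed with the log-sum-exp trick
            logs = [math.log(p) + log_gauss_diag(x, mu, var)
                    for p, mu in zip(pis, mus)]
            top = max(logs)
            total += top + math.log(sum(math.exp(v - top) for v in logs))
        else:
            # sum_i z_ij f_i(x_j) has exactly one nonzero term
            total += log_gauss_diag(x, mus[lab], var)
    return total

# Tiny one-feature example (hypothetical values): one unlabeled, one labeled point.
ll = semi_supervised_loglik(
    xs=[[0.0], [1.0]], labels=[None, 1],
    pis=[0.5, 0.5], mus=[[0.0], [1.0]], var=[1.0],
)
```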
Penalized Mixture Model
◮ Penalized log-likelihood: use a weighted $L_1$ penalty, subtracted from the log-likelihood to be maximized:
$$\log L_P(\Theta) = \log L(\Theta) - \lambda \sum_{i} \sum_{k} w_{ik} |\mu_{ik}|,$$
  where the $w_{ik}$'s are weights to be given later.
◮ Penalty: model regularization; Bayesian connection.
◮ Assume that the data have been standardized so that each feature has sample mean 0 and sample variance 1.
◮ Hence, for any $k$, if $\mu_{1k} = ... = \mu_{gk} = 0$, then feature $k$ will not be used.
◮ The $L_1$ penalty serves to obtain a sparse solution: some $\mu_{ik}$'s are automatically set to 0, realizing variable selection.
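◮ EM algorithm: the E-step and the M-step for the other parameters are the same as in the usual EM, except for the M-step for $\mu_{ik}$:
$$\hat\pi_i^{(m+1)} = \sum_{j=1}^{n} \tau_{ij}^{(m)} / n, \qquad (1)$$
$$\hat\sigma_k^{2,(m+1)} = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij}^{(m)} (x_{jk} - \hat\mu_{ik}^{(m)})^2 / n, \qquad (2)$$
$$\hat\mu_i^{(m+1)} = \mathrm{sign}(\tilde\mu_i^{(m+1)}) \left( |\tilde\mu_i^{(m+1)}| - \frac{\lambda}{\sum_j \tau_{ij}^{(m)}} V^{(m)} w_i \right)_+, \qquad (3)$$
  where
$$\tau_{ij}^{(m)} = \begin{cases} \pi_i^{(m)} f_i(x_j; \theta_i^{(m)}) / f(x_j; \Theta^{(m)}), & \text{if } 1 \le j \le n_0, \\ z_{ij}, & \text{if } n_0 < j \le n, \end{cases} \qquad (4)$$
$$\tilde\mu_i^{(m+1)} = \sum_{j=1}^{n} \tau_{ij}^{(m)} x_j \Big/ \sum_{j=1}^{n} \tau_{ij}^{(m)}. \qquad (5)$$
  In (3), sign, $|\cdot|$ and $(a)_+ = \max(a, 0)$ are applied componentwise.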
◮ Soft-thresholding: if $\lambda w_{ik} > \left| \sum_{j=1}^{n} \tau_{ij}^{(m)} x_{jk} / \sigma_k^{2,(m)} \right|$, then $\hat\mu_{ik}^{(m+1)} = 0$; otherwise, $\hat\mu_{ik}^{(m+1)}$ is obtained by shrinking $\tilde\mu_{ik}^{(m+1)}$ toward 0 by an amount $\lambda w_{ik} \sigma_k^{2,(m)} / \sum_{j=1}^{n} \tau_{ij}^{(m)}$.
◮ In the EM for the standard mixture model, use $\tilde\mu_i^{(m+1)}$; no shrinkage or thresholding.
◮ Zou (2005, 2006) proposed using the weighted $L_1$ penalty in the context of supervised learning; we extend the idea to the current context, using $w_{ik} = 1 / |\tilde\mu_{ik}|^w$ with $w \ge 0$; the standard $L_1$ penalty corresponds to $w = 0$.
◮ The weighted penalty automatically realizes a data-adaptive penalization: it penalizes smaller $\mu_{ik}$ more and larger $\mu_{ik}$ less, thus reducing the bias for the latter and leading to better feature selection and classification performance.
◮ As in Zou (2006), we tried $w \in \{0, 1, 2, 4\}$ and found only minor differences in results for $w > 0$; for simplicity we will present results only for $w = 0$ and $w = 1$.
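The penalized M-step for the means can be sketched as follows, for one class $i$ at a time. This is an illustrative sketch only: the function names (`soft_threshold`, `update_means`) and the toy inputs are assumptions, and the posterior probabilities are taken as given rather than computed by an E-step.

```python
def soft_threshold(mu_tilde, shrink):
    """sign(mu_tilde) * max(|mu_tilde| - shrink, 0)."""
    magnitude = abs(mu_tilde) - shrink
    if magnitude <= 0.0:
        return 0.0
    return magnitude if mu_tilde > 0 else -magnitude

def update_means(xs, tau_i, var, lam, weights):
    """Penalized M-step for the mean vector of one class i.

    xs: n observations, each a list of K features
    tau_i: n posterior probabilities tau_ij for this class
    var: K current variance estimates sigma_k^2
    weights: K penalty weights w_ik
    """
    tau_sum = sum(tau_i)
    new_mu = []
    for k in range(len(var)):
        # Unpenalized weighted mean, as in the standard EM
        mu_tilde = sum(t * x[k] for t, x in zip(tau_i, xs)) / tau_sum
        # Amount of shrinkage: lambda * w_ik * sigma_k^2 / sum_j tau_ij
        shrink = lam * weights[k] * var[k] / tau_sum
        new_mu.append(soft_threshold(mu_tilde, shrink))
    return new_mu

# Toy example (hypothetical values): three points fully assigned to the class.
xs = [[2.0, 0.3], [1.0, -0.1], [3.0, 0.1]]
tau_i = [1.0, 1.0, 1.0]
penalized = update_means(xs, tau_i, var=[1.0, 1.0], lam=0.6, weights=[1.0, 1.0])
standard = update_means(xs, tau_i, var=[1.0, 1.0], lam=0.0, weights=[1.0, 1.0])
```

With $\lambda = 0.6$ the shrinkage amount is 0.2 per feature here, so the first mean is shrunk from 2.0 to 1.8 while the second (about 0.1) is thresholded to exactly 0; with $\lambda = 0$ the update reduces to the standard weighted mean.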
Model Selection
◮ To determine $g_0$ (and $\lambda$), use BIC (Schwarz 1978):
$$\mathrm{BIC} = -2 \log L(\hat\Theta) + \log(n)\, d,$$
  where $d = g + K + gK - 1$ is the total number of unknown parameters in the model; the model with the minimum BIC is selected (Fraley and Raftery 1998).
◮ For the penalized mixture model, Pan and Shen (2007) proposed a modified BIC:
$$\mathrm{BIC} = -2 \log L(\hat\Theta) + \log(n)\, d_e,$$
  where $d_e = g + K + gK - 1 - q = d - q$ with $q = \#\{\hat\mu_{ik} : \hat\mu_{ik} = 0\}$, an estimate of the “effective” number of parameters.
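A minimal sketch of the modified BIC, assuming the fitted log-likelihood and the estimated mean matrix are already available; the function name `modified_bic` and the example numbers are hypothetical.

```python
import math

def modified_bic(loglik, n, mu_hat):
    """Modified BIC with d_e = g + K + g*K - 1 - q.

    loglik: log L at the fitted parameters
    n: number of observations
    mu_hat: g lists of K estimated means; exact zeros are counted in q
    """
    g = len(mu_hat)
    K = len(mu_hat[0])
    d = g + K + g * K - 1
    q = sum(1 for row in mu_hat for m in row if m == 0.0)
    return -2.0 * loglik + math.log(n) * (d - q)

# Hypothetical fitted values: g = 3 classes, K = 4 features, 6 means set to zero.
mu_hat = [
    [0.0, 1.2, 0.0, -0.4],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, -1.1, 0.3, 0.2],
]
bic = modified_bic(loglik=-100.0, n=60, mu_hat=mu_hat)
```

When no mean is estimated as exactly zero, $q = 0$ and the value coincides with the standard BIC.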
◮ The idea was borrowed from Efron et al (2004) and Zou et al (2004) in penalized regression/LASSO.
◮ No proof yet...
◮ Data-based methods, such as cross-validation or data perturbation (Shen and Ye 2002; Efron 2004), can also be used, but are computationally more demanding.
◮ Trial and error to find a $\lambda$ (and $g_0$).
Simulated Data
◮ Simulation set-ups:
  ◮ Four non-null (i.e. $g_0 > 0$) cases;
  ◮ 20 observations in each of the $g_0 = 1$ unknown and $g_1 = 2$ known classes;
  ◮ $K = 200$ independent attributes; only $2K_1$ were informative;
  ◮ Each of the first $K_1$ informative attributes: independent $N(0, 1)$, $N(0, 1)$ and $N(1.5, 1)$ for the 3 classes;
  ◮ Each of the next $K_1$ informative ones: independent $N(1.5, 1)$, $N(0, 1)$ and $N(0, 1)$;
  ◮ Each of the $K - 2K_1$ noise variables: $N(0, 1)$;
  ◮ $K_1 = 10$, 15, 20 and 30.
◮ Null case: $g_0 = 0$; only the first $K_1 = 30$ attributes were discriminatory as before, and the others were not.
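One dataset from a non-null set-up could be generated along these lines. This is a sketch under assumptions: the slides do not say which of the three classes is the unlabeled one, so the code takes the first class (index 0) as unknown, following the ordering convention of the data description; the function name `simulate` and the seed are illustrative.

```python
import random

def simulate(K1, K=200, n_per_class=20, shift=1.5, seed=0):
    """Return (xs, true_class, labels) for one non-null simulated dataset.

    Assumption: class 0 is the unknown class, so its labels are None.
    """
    rng = random.Random(seed)
    xs, true_class, labels = [], [], []
    for c in range(3):
        for _ in range(n_per_class):
            row = []
            for k in range(K):
                if k < K1:
                    mean = shift if c == 2 else 0.0   # first K1: N(0,1), N(0,1), N(1.5,1)
                elif k < 2 * K1:
                    mean = shift if c == 0 else 0.0   # next K1: N(1.5,1), N(0,1), N(0,1)
                else:
                    mean = 0.0                        # noise variables
                row.append(rng.gauss(mean, 1.0))
            xs.append(row)
            true_class.append(c)
            labels.append(None if c == 0 else c)
    return xs, true_class, labels

xs, true_class, labels = simulate(K1=10)
```

The returned `labels` list uses `None` for the 20 unlabeled observations, which is a convention chosen here rather than one given in the slides.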
◮ For each case, 100 independent datasets.
◮ Comparing the standard method without variable selection (i.e. $\lambda = 0$) and the penalized method with $w = 0$.
◮ For each dataset, the EM was run 10 times; its starting values were from the output of K-means with random starts; the final result was the one with the maximum (penalized) likelihood (for the given $\lambda$).
◮ $\lambda \in \Phi = \{0, 2, 4, 6, 8, 10, 12, 15, 20, 25\}$; for a given $g_0$, chose the one with the minimum BIC.
◮ Comparison between the standard and penalized methods: