

SLIDE 1

T-61.3050 Machine Learning: Basic Principles
Decision Trees

Kai Puolamäki

Laboratory of Computer and Information Science (CIS)
Department of Computer Science and Engineering
Helsinki University of Technology (TKK)

Autumn 2007

SLIDE 2

Outline

1. Clustering: k-means Clustering, Greedy algorithms, EM Algorithm
2. Decision Trees: Introduction, Classification Trees, Regression Trees

SLIDE 3

k-means Clustering

Lloyd's algorithm

LLOYD(X, k)
{Input: X, data set; k, number of clusters.
 Output: {m_i}_{i=1}^k, cluster prototypes.}
Initialize m_i, i = 1, ..., k, appropriately, for example at random.
repeat
  for all t ∈ {1, ..., N} do {E step}
    b_i^t ← 1 if i = arg min_j ||x^t − m_j||, 0 otherwise
  end for
  for all i ∈ {1, ..., k} do {M step}
    m_i ← Σ_t b_i^t x^t / Σ_t b_i^t
  end for
until the error E({m_i}_{i=1}^k | X) does not change
return {m_i}_{i=1}^k
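For readers who want to experiment, here is a minimal NumPy sketch of the pseudocode above. The function name lloyd and the seed/max_iter parameters are choices made here, not part of the lecture.

```python
import numpy as np

def lloyd(X, k, seed=0, max_iter=100):
    """Minimal Lloyd's algorithm: X is an (N, d) array, returns (k, d) prototypes."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes, for example by picking k random data points.
    m = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_error = np.inf
    for _ in range(max_iter):
        # E step: b[t] is the index of the prototype nearest to x^t.
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # (N, k)
        b = d2.argmin(axis=1)
        # M step: move each prototype to the mean of its assigned points.
        for i in range(k):
            if np.any(b == i):
                m[i] = X[b == i].mean(axis=0)
        error = d2[np.arange(len(X)), b].sum()
        if error == prev_error:  # stop when the error no longer changes
            break
        prev_error = error
    return m
```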

SLIDE 4

k-means Clustering

Lloyd’s algorithm

[Figure: panels (a)–(i) showing successive E and M steps of k-means on a two-dimensional data set. Figure 9.1 of Bishop (2006).]


SLIDE 5

k-means Clustering

Lloyd’s algorithm

Observations:

The iteration cannot increase the error E({m_i}_{i=1}^k | X).
There is a finite number, k^N, of possible clusterings. It follows that the algorithm always stops after a finite time (it can take no more than k^N steps).
In practice, however, k-means is usually relatively fast: “In practice the number of iterations is generally much less than the number of points.” (Duda, Hart & Stork, 2000)
The worst-case running time, with really bad data and really bad initialization, is 2^Ω(√N); luckily this usually does not happen in real life (Arthur D, Vassilvitskii S (2006) How slow is the k-means method? In Proc. 22nd SoCG).

SLIDE 6

k-means Clustering

Lloyd’s algorithm

Observations: The result can in the worst case be really bad. Example:

Four data vectors (N = 4) from R^d in X:
x_1 = (0, 0, ..., 0)^T, x_2 = (1, 0, ..., 0)^T, x_3 = (0, 1, ..., 1)^T and x_4 = (1, 1, ..., 1)^T.
The optimal clustering into two clusters (k = 2) is given by the prototype vectors m_1 = (0.5, 0, ..., 0)^T and m_2 = (0.5, 1, ..., 1)^T, the error being E({m_i}_{i=1}^k | X) = 1.
Lloyd's algorithm can, however, also converge to m_1 = (0, 0.5, ..., 0.5)^T and m_2 = (1, 0.5, ..., 0.5)^T, the error being E({m_i}_{i=1}^k | X) = d − 1. (Check that the iteration stops here!)
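The claim that the iteration stops at the bad solution can be checked numerically. A self-contained sketch; the choice d = 5 is arbitrary, for illustration only.

```python
import numpy as np

d = 5  # any d > 1 works; the error at the bad fixed point is d - 1
x1 = np.zeros(d)
x2 = np.eye(d)[0]               # (1, 0, ..., 0)
x3 = np.ones(d) - np.eye(d)[0]  # (0, 1, ..., 1)
x4 = np.ones(d)
X = np.stack([x1, x2, x3, x4])

def lloyd_step(X, m):
    """One E step + M step of Lloyd's algorithm: new prototypes and error."""
    b = ((X[:, None] - m[None]) ** 2).sum(-1).argmin(1)       # E step
    m = np.stack([X[b == i].mean(0) for i in range(len(m))])  # M step
    return m, ((X - m[b]) ** 2).sum()

bad = np.full((2, d), 0.5)
bad[0, 0], bad[1, 0] = 0.0, 1.0  # m1 = (0, .5, ..., .5), m2 = (1, .5, ..., .5)
m, err = lloyd_step(X, bad)
print(err)                                   # d - 1 = 4.0, the bad local optimum
print(np.allclose(lloyd_step(X, m)[0], m))   # True: the iteration stops here
```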

SLIDE 7

k-means Clustering

Lloyd’s algorithm

Example: cluster taxa into k = 6 clusters 1000 times with Lloyd's algorithm. The error E({m_i}_{i=1}^k | X) is different for different runs!

You should try several random initializations and choose the solution with the smallest error, as sketched below.

For a cool initialization, see Arthur D, Vassilvitskii S (2006) k-means++: The Advantages of Careful Seeding.

[Figures: histogram of the error over 1000 runs with k = 6, and plots of the Cenozoic Large Land Mammals data (taxa vs. fossil sites) showing the six clusters and the cluster prototypes.]
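A short sketch of the restart strategy. It assumes the lloyd(X, k, seed) function from the earlier sketch is in scope; the data X and the number of restarts are made up for illustration.

```python
import numpy as np

def kmeans_error(X, m):
    """E({m_i} | X): sum of squared distances to the nearest prototype."""
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

X = np.random.default_rng(1).normal(size=(200, 2))   # made-up data
runs = [lloyd(X, k=6, seed=s) for s in range(100)]   # several random restarts
best = min(runs, key=lambda m: kmeans_error(X, m))   # keep the smallest error
print(sorted(round(kmeans_error(X, m), 1) for m in runs)[:5])  # errors differ per run
```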

SLIDE 8

Outline

1. Clustering: k-means Clustering, Greedy algorithms, EM Algorithm
2. Decision Trees: Introduction, Classification Trees, Regression Trees

SLIDE 9

Greedy algorithm

Task: solve arg min_θ E(θ | X), where 0 ≤ E(θ | X) < ∞.

Assume that the cost/error E(θ | X) can be evaluated in polynomial time O(N^k), given an instance of the parameters θ and a data set X, where N is the size of the data set and k is some constant.
Often, no polynomial-time algorithm to minimize the cost is known.
Assume that for each instance of parameter values θ there exists a candidate set C(θ) such that θ ∈ C(θ).
Assume that arg min_{θ′∈C(θ)} E(θ′ | X) can be solved in polynomial time.


SLIDE 10

Greedy algorithm

GREEDY(E, C, ε, X)
{Input: E, cost function; C, candidate set; ε ≥ 0, convergence cutoff; X, data set.
 Output: Instance of parameter values θ.}
Initialize θ appropriately, for example at random.
repeat
  θ ← arg min_{θ′∈C(θ)} E(θ′ | X)
until the change in E(θ | X) is no more than ε
return θ
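The pattern above is easy to state generically. A minimal sketch, where the cost function, the candidate generator and the toy problem are all made up for illustration:

```python
def greedy(cost, candidates, theta0, eps=0.0, max_iter=1000):
    """Generic greedy descent: move to the cheapest candidate of the
    current solution until the improvement in cost is at most eps."""
    theta = theta0
    for _ in range(max_iter):
        # C(theta) contains theta itself, so the cost can never increase.
        nxt = min(candidates(theta), key=cost)
        if cost(theta) - cost(nxt) <= eps:
            return nxt
        theta = nxt
    return theta

# Toy usage: minimize (t - 7)^2 over the integers with C(t) = {t-1, t, t+1}.
print(greedy(lambda t: (t - 7) ** 2, lambda t: [t - 1, t, t + 1], theta0=0))  # 7
```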

SLIDE 11

Greedy algorithm

Examples of greedy algorithms:

Forward and backward selection.
Lloyd's algorithm.
Optimizing a cost function using gradient descent and line search.


SLIDE 12

Greedy algorithm

Observations

Each step (except the last) reduces the cost by more than ε.
Each step can be done in polynomial time.
The algorithm stops after a finite number of steps (at least if ε > 0).

Difficult parts:
What is a good initialization?
What is a good candidate set C(θ)?

θ is a global optimum if θ = arg min_θ E(θ | X).
θ is a local optimum if θ = arg min_{θ′∈C(θ)} E(θ′ | X).
The algorithm always finds a local optimum, but not necessarily a global optimum. (Interesting sidenote: greedoid.)

SLIDE 13

Greedy algorithm

Approximation ratio

Denote E* = min_θ E(θ | X), θ_ALG = GREEDY(E, C, ε, X) and E_ALG = E(θ_ALG | X).

1 ≤ α < ∞ is an approximation ratio if E_ALG ≤ αE* is satisfied for all X.
1 ≤ α < ∞ is an expected approximation ratio if E[E_ALG] ≤ αE* is satisfied for all X (the expectation is over instances of the algorithm).

Observation: if an approximation ratio exists, then the algorithm always finds a zero-cost solution if such a solution exists for the given data set.
Sometimes the approximation ratio can be proven; often one can only run the algorithm several times and observe the distribution of costs.
For k-means with expected approximation ratio α = O(log k) and references, see Arthur D, Vassilvitskii S (2006) k-means++: The Advantages of Careful Seeding.

SLIDE 14

Greedy algorithm

Running times

We can usually easily say that the running time of one step is polynomial. Often, the number of steps the algorithm takes is also polynomial, and hence the algorithm is often polynomial (at least in practice).
Proving the number of steps required until convergence is often quite difficult, however. Again, the easiest approach is to run the algorithm several times and observe the distribution of the number of steps.

SLIDE 15

Greedy algorithm

Questions to ask about a greedy algorithm

Does the definition of the cost function make sense in your application? Should you use some other cost, for example, some utility?
There may be several solutions with small cost. Do these solutions have similar parameters, for example, prototype vectors (interpretation of the results)?
How efficient is the optimization step involving C(θ)? Could you find a better C(θ)?
If there exists a zero-cost solution, does your algorithm find it? Is there an approximation ratio?
Can you say anything about the number of steps required?
What is the empirical distribution of the error E_ALG and of the number of steps taken, in your typical application?


SLIDE 16

Outline

1. Clustering: k-means Clustering, Greedy algorithms, EM Algorithm
2. Decision Trees: Introduction, Classification Trees, Regression Trees

SLIDE 17

EM Algorithm

Expectation-Maximization algorithm (EM): a greedy algorithm that finds soft cluster assignments. It has a probabilistic interpretation, that is, we are maximizing a likelihood.


SLIDE 18

EM Algorithm

[Figure: panels (a)–(f) showing successive iterations of EM on a two-dimensional data set. Figure 9.8 of Bishop (2006).]

The EM algorithm is like k-means, except that the cluster assignments are “soft”: each data point is a member of a given cluster with a certain probability. The hard assignments b_i^t ∈ {0, 1} become soft assignments h_i^t ∈ [0, 1].

SLIDE 19

EM Algorithm

Find the maximum likelihood solution of the mixture model

L = log ∏_{t=1}^N p(x^t | θ) = Σ_{t=1}^N log p(x^t | θ),

where the parameters θ are μ_i, Σ_i and π_i = P(G_i).
The maximum likelihood solution is found by the EM algorithm (which is essentially a generalization of Lloyd's algorithm to soft cluster memberships).
Idea: iteratively find the membership weights of each data vector in the clusters, and then the parameter values. Continue until convergence. The end result is intuitive.

[Figure: graphical model of the mixture, with component indicator G (prior P(G)), observation x with parameters μ and Σ, and a plate over the N observations.]

SLIDE 20

EM Algorithm

Example: soft Gaussian mixture with a fixed shared diagonal covariance matrix Σ_i = s²I and P(G_i) = π_i = 1/k.

EM(X, k)
{Input: X, data set; k, number of mixture components.
 Output: {m_i}_{i=1}^k, mixture components.}
Initialize m_i, i = 1, ..., k, for example using some k-means algorithm.
repeat
  for all t ∈ {1, ..., N} do {E step}
    h_i^t ← exp[−||x^t − m_i||² / (2s²)] / Σ_j exp[−||x^t − m_j||² / (2s²)]
  end for
  for all i ∈ {1, ..., k} do {M step}
    m_i ← Σ_t h_i^t x^t / Σ_t h_i^t
  end for
until convergence
return {m_i}_{i=1}^k
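A minimal NumPy sketch of this soft k-means EM with fixed spherical covariance. The function name, the default variance s2 and the fixed iteration count are assumptions of this sketch; a k-means initialization, as on the slide, could replace the random one.

```python
import numpy as np

def soft_kmeans_em(X, k, s2=1.0, n_iter=50, seed=0):
    """EM for a Gaussian mixture with fixed covariance s2 * I and mixing
    proportions 1/k; only the means are learned, as on the slide."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # E step: h[t, i] ∝ exp(-||x^t - m_i||^2 / (2 s2)), normalized over i.
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        logits = -d2 / (2.0 * s2)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        h = np.exp(logits)
        h /= h.sum(axis=1, keepdims=True)
        # M step: each mean is the responsibility-weighted average of the data.
        m = (h.T @ X) / h.sum(axis=0)[:, None]
    return m
```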

SLIDE 21

EM Algorithm

For the derivation, see Alpaydin (2004), section 7.4 (pages 139–144); for an alternative derivation, see Bishop (2006), section 9.4 (pages 450–455). A sketch follows.

Task: find an ML solution of a likelihood function given by p(X | θ) = Σ_Z p(X, Z | θ).

Σ_t log p(x^t | θ) ≥ Σ_t log p(x^t | θ) − Σ_t KL(h^t || p(z^t | x^t, θ))
                   = Σ_t Σ_i h_i^t log p(x^t, z_i^t | θ) + Σ_t H(h^t),

where we have used the Kullback–Leibler (KL) divergence KL(q(i) || p(i)) = Σ_i q(i) log (q(i)/p(i)). The KL divergence is always non-negative, and it vanishes only when the distributions q and p are equal. The entropy is given by H(q(i)) = −Σ_i q(i) log q(i).
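The identity and its tightness at the posterior can be checked numerically for a single data point. A throwaway sketch; all numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.dirichlet(np.ones(3))   # stands in for p(x, z_i), i = 1, 2, 3
log_px = np.log(joint.sum())        # log p(x) = log sum_i p(x, z_i)
posterior = joint / joint.sum()     # p(z_i | x)

def bound(h):
    # sum_i h_i log p(x, z_i) + H(h), the right-hand side above
    return np.sum(h * np.log(joint)) - np.sum(h * np.log(h))

h = rng.dirichlet(np.ones(3))       # an arbitrary soft assignment
kl = np.sum(h * np.log(h / posterior))
print(np.isclose(bound(h), log_px - kl))      # True: the identity holds
print(np.isclose(bound(posterior), log_px))   # True: tight when h is the posterior
```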

SLIDE 22

EM Algorithm

Expectation step (E step): find h_i^t by minimizing the KL divergence.
Maximization step (M step): find θ by maximizing the expectation.

[Figure 9.14 of Bishop (2006): the lower bound is maximized alternately with respect to the soft assignments (E step) and the parameters θ (M step).]

SLIDE 23

Outline

1. Clustering: k-means Clustering, Greedy algorithms, EM Algorithm
2. Decision Trees: Introduction, Classification Trees, Regression Trees

SLIDE 24

Decision Trees

!"#$%&"'()$"*'+)&','-./012!3'4556'73$&)2%#$8)3'$)'90#:83"'!"0&383;'< =:"'97='>&"**'?@ABAC

Kai Puolam¨ aki T-61.3050

SLIDE 25

Decision Trees

Each internal node tests an attribute.
Each branch corresponds to a set of attribute values.
Each leaf node assigns a classification (classification tree) or a real number (regression tree).
The tree is usually learned using a greedy algorithm built around ID3, such as C4.5. (The problem of finding an optimal tree is generally NP-hard.)

Advantages of trees:
Learning and classification are fast.
Trees are accurate in many domains.
Trees are easy to interpret as sets of decision rules.

Often, trees should be used as a benchmark before more complicated algorithms are attempted.

For an alternative discussion, see Mitchell (1997), Ch. 3.

SLIDE 26

Outline

1. Clustering: k-means Clustering, Greedy algorithms, EM Algorithm
2. Decision Trees: Introduction, Classification Trees, Regression Trees

SLIDE 27

Example Data from Mitchell (1997)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No


SLIDE 28

Example: Final Decision Tree

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes

Figure 3.1 of Mitchell (1997).


SLIDE 29

ID3 algorithm for discrete attributes

ID3(X)
{Input: X = {(r^t, x^t)}_{t=1}^N, data set with binary labels r^t ∈ {−1, +1} and a vector of discrete variables x^t.
 Output: T, classification tree.}
Create a root node for T.
If all items in X are positive (negative), return a single-node tree with label “+” (“−”).
Let A be the attribute that “best” classifies the examples.
for all values v of A do
  Let X_v be the subset of X that has value v for A.
  if X_v is empty then
    Below the root of T, add a leaf node with the most common label in X.
  else
    Below the root of T, add the subtree ID3(X_v).
  end if
end for
return T
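A compact Python sketch of ID3 with information gain for discrete attributes. The dict-based tree representation and the function names are choices made here, not part of the lecture; it can be run directly on the PlayTennis table from the earlier slide.

```python
import math
from collections import Counter

def entropy(labels):
    """-sum_c p_c log2 p_c over the label proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    """Expected reduction in entropy from splitting on attribute a."""
    g = entropy(labels)
    for v in set(r[a] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[a] == v]
        g -= len(sub) / len(labels) * entropy(sub)
    return g

def id3(rows, labels, attributes):
    """rows: list of dicts attribute -> value. Returns a leaf label or a
    nested tree {attribute: {value: subtree}}. Branches are grown only for
    observed values, so the 'X_v is empty' case never arises here."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure node: leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority leaf
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        keep = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        tree[best][v] = id3([r for r, _ in keep], [l for _, l in keep], rest)
    return tree
```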

SLIDE 30

Entropy

X is a sample of training examples. p+ is the proportion of positive and p− = 1 − p+ the proportion of negative samples in X. Entropy measures the impurity of X:

Entropy(X) = −p+ log2 p+ − p− log2 p−

[Plot of the entropy −p log2 p − (1 − p) log2(1 − p) as a function of p. Figure 9.2: Entropy function for a two-class problem. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.]


SLIDE 31

Entropy

Entropy(X) is the expected number of bits needed to encode the class (+1 or −1) of a randomly drawn member of X (under the optimal, shortest-length code).

Information theory: the optimal (shortest expected coding length) code for an event with probability p uses − log2 p bits. Therefore, the expected number of bits to encode the class of a random member of X is

p+ (− log2 p+) + p− (− log2 p−), that is, Entropy(X) = −p+ log2 p+ − p− log2 p−.

[Plot of the entropy function, as on the previous slide. Figure 9.2 of E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.]

SLIDE 32

Information Gain

Gain(X, A) is the expected reduction in entropy due to sorting on A:

Gain(X, A) = Entropy(X) − Σ_{v∈values(A)} (|X_v| / |X|) Entropy(X_v).

For ID3: the attribute A that has the highest gain classifies the examples X “best”.
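A quick check of the gains computed on the next slide, using the class counts read off the PlayTennis table. A throwaway sketch; the small discrepancy 0.152 vs. 0.151 comes from the slide rounding the intermediate entropies.

```python
import math

def entropy2(pos, neg):
    """Two-class entropy from positive/negative counts."""
    h = 0.0
    for c in (pos, neg):
        p = c / (pos + neg)
        if p > 0:
            h -= p * math.log2(p)
    return h

s = entropy2(9, 5)                                   # 0.940 for S: [9+, 5-]
gain_humidity = s - 7/14 * entropy2(3, 4) - 7/14 * entropy2(6, 1)
gain_wind     = s - 8/14 * entropy2(6, 2) - 6/14 * entropy2(3, 3)
print(round(gain_humidity, 3), round(gain_wind, 3))  # 0.152 0.048
```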

SLIDE 33

Selecting the Next Attribute

Which attribute is the best classifier?

S: [9+, 5−], E = 0.940.
Humidity: High [3+, 4−], E = 0.985; Normal [6+, 1−], E = 0.592.
  Gain(S, Humidity) = 0.940 − (7/14) 0.985 − (7/14) 0.592 = 0.151
Wind: Weak [6+, 2−], E = 0.811; Strong [3+, 3−], E = 1.00.
  Gain(S, Wind) = 0.940 − (8/14) 0.811 − (6/14) 1.0 = 0.048

Humidity provides greater information gain than Wind, relative to the target classification. E stands for entropy and S for the collection of examples. Figure 3.3 of Mitchell (1997).

SLIDE 34

Example: Final Decision Tree

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes

The final decision tree. Figure 3.1 of Mitchell (1997).


SLIDE 35

Variations of ID3

Alternative impurity measures:

Entropy: −p+ log2 p+ − p− log2 p−.
Gini index: 2 p+ p−.
Misclassification error: 1 − max(p+, p−).
All vanish for p+ ∈ {0, 1} and have a maximum at p+ = p− = 1/2.

Continuous or ordered variables: sort the values x_A^t of some attribute A and find the best split x_A ≤ w vs. x_A > w, as sketched below.
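A sketch of the threshold search for one continuous attribute, scored here by the weighted entropy of the two halves; the function name and the choice of midpoints between consecutive distinct values as candidates are assumptions of this sketch.

```python
import numpy as np

def best_threshold(x, y):
    """Find w minimizing the weighted entropy of the split x <= w vs. x > w."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    order = np.argsort(x)
    x, y = x[order], y[order]
    best_w, best_imp = None, np.inf
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue  # only midpoints between distinct consecutive values
        w = (x[i] + x[i + 1]) / 2.0
        n_left = i + 1
        imp = (n_left * entropy(y[:n_left])
               + (len(x) - n_left) * entropy(y[n_left:])) / len(x)
        if imp < best_imp:
            best_w, best_imp = w, imp
    return best_w, best_imp
```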

SLIDE 36

Rule Extraction from Trees

!"#$%&"'()$"*'+)&','-./012!3'4556'73$&)2%#$8)3'$)'90#:83"'!"0&383;'< =:"'97='>&"**'?@ABAC

!"#$%&'()* +,&-.'/.0*12234

Kai Puolam¨ aki T-61.3050

SLIDE 37

Observations of ID3

Inductive bias:

Preference for short trees.
Preference for trees with high information gain near the root.

Vanilla ID3 classifies the training data perfectly. Hence, in the presence of noise, vanilla ID3 overfits.


SLIDE 38

Pruning

How to avoid overfitting?

Prepruning: stop growing when a data split is not statistically significant. For example: stop tree construction when a node is smaller than a given limit, or the impurity of a node is below a given limit θ_I. (Faster.)
Postpruning: grow the whole tree, then prune subtrees which overfit on the pruning (validation) set. (More accurate.)

SLIDE 39

Pruning

Postpruning

Split the data into training and pruning (validation) sets. Do until further pruning is harmful:

1. Evaluate the impact on the pruning set of pruning each possible node (plus those below it).
2. Greedily remove the one that most improves the pruning set accuracy.

This produces the smallest version of the most accurate subtree. Alternative: rule postpruning (commonly used, for example, in C4.5).

SLIDE 40

Outline

1. Clustering: k-means Clustering, Greedy algorithms, EM Algorithm
2. Decision Trees: Introduction, Classification Trees, Regression Trees

SLIDE 41

Examples: Predicting woody cover in African savannas

Task: predict woody cover (% of surface covered by trees) as a function of precipitation (MAP, mean annual precipitation), soil characteristics (texture, total nitrogen and total phosphorus, and nitrogen mineralization), and fire and herbivory regimes.
Result: MAP is the most important factor.

[Figure 1: Change in woody cover of African savannas as a function of MAP. Maximum tree cover is modeled with a 99th-quantile piecewise linear regression; the breakpoint (the rainfall at which maximum tree cover is attained) lies at 650 ± 134 mm MAP (between 516 and 784 mm). Trees are typically absent below 101 mm MAP; between 101 and 650 mm MAP the upper bound on tree cover is Cover(%) = 0.14(MAP) − 14.2. Data are from 854 sites across Africa.]
[Figure 3: Regression tree relating woody cover to MAP, fire-return interval and percentage of sand, pruned to four terminal nodes and based on 161 sites for which all data were available. No consistent herbivore effects were detected. The pruned tree explained ~45.2% of the variance in woody cover, significantly more than a random tree (P < 0.001); the first split accounted for 31% of this and the second for an additional 10%.]
[Figure 4: The distributions of MAP-determined (“stable”, <516 mm MAP) and disturbance-determined (“unstable”, >784 mm MAP) savannas in Africa, with a transition zone between 516 and 784 mm MAP.]

From Sankaran M et al. (2005) Determinants of woody cover in African savannas. Nature 438: 846–849.

SLIDE 42

Regression Trees

Error at node m:

b_m(x) = 1 if x reaches node m, 0 otherwise

E_m = (1/N_m) Σ_t (r^t − g_m)² b_m(x^t),  where  g_m = Σ_t b_m(x^t) r^t / Σ_t b_m(x^t).

After splitting:

b_mj(x) = 1 if x reaches node m and branch j, 0 otherwise

E′_m = (1/N_m) Σ_j Σ_t (r^t − g_mj)² b_mj(x^t),  where  g_mj = Σ_t b_mj(x^t) r^t / Σ_t b_mj(x^t).
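The formulas translate directly into a threshold search for a regression split. A sketch on made-up one-dimensional data; the noise level, the step location and the midpoint candidate grid are arbitrary illustrations.

```python
import numpy as np

def split_error(x, r, w):
    """E'_m for the split x <= w vs. x > w: each branch predicts its mean g_mj."""
    err = 0.0
    for mask in (x <= w, x > w):
        if mask.any():
            err += ((r[mask] - r[mask].mean()) ** 2).sum()
    return err / len(r)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
r = np.where(x < 4, 1.0, 5.0) + rng.normal(0, 0.3, size=50)  # step at x = 4
cands = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2  # midpoints of sorted values
best = min(cands, key=lambda w: split_error(x, r, w))
print(round(best, 2))  # close to the true breakpoint at 4
```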

SLIDE 43

!"#$%&"'()$"*'+)&','-./012!3'4556'73$&)2%#$8)3'$)'90#:83"'!"0&383;'< =:"'97='>&"**'?@ABAC

9)2".'D"."#$8)3'83'=&""*E

Kai Puolam¨ aki T-61.3050

SLIDE 44

Implementations

There are many implementations, with sophisticated pruning methods.
