Outils Statistiques pour Data Science, Part II: Unsupervised Learning
Massih-Reza Amini
Université Grenoble Alpes, Laboratoire d'Informatique de Grenoble
Massih-Reza.Amini@imag.fr
Clustering
❑ The aim of clustering is to identify disjoint groups of observations within a given collection.
⇒ The aim is to find homogeneous groups, by assembling observations that are close to one another and by separating as well as possible those that are different.
❑ Let G be a partition found over the collection C of N observations. An element of G is called a group (or cluster). A group Gk, where 1 ≤ k ≤ |G|, corresponds to a subset of observations in C.
❑ A representative of a group Gk, generally its center of gravity rk, is called a prototype.
Classification vs. Clustering
❑ In classification: we have pairs of examples constituted by observations and their associated class labels (x, y) ∈ Rd × {1, . . . , K}.
❑ The class information is provided by an expert, and the aim is to find a prediction function f : Rd → Y that makes the association between inputs and outputs following the ERM or the SRM principle.
❑ In clustering: the class information does not exist, and the aim is to find homogeneous clusters or groups reflecting the relationships between observations.
❑ The main hypothesis here is that these relationships can be recovered from the disposition of the examples in the characteristic space.
❑ The exact number of groups for a problem is very difficult to find, and it is generally fixed beforehand to some arbitrary value.
❑ The partitioning is usually done iteratively, and it mainly depends on the initialization.
K-means algorithm [MacQueen, 1967]
❑ The K-means algorithm seeks the partition for which the sum of squared distances between the observations and the prototypes of their groups is minimised:

\[
\operatorname{argmin}_{G} \sum_{k=1}^{K} \sum_{x \in G_k} \| x - r_k \|_2^2
\]

❑ From a given set of centroids, the algorithm then iteratively
❑ assigns each observation to the centroid to which it is the closest, resulting in new clusters;
❑ estimates new centroids for the clusters that have been found.
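A minimal NumPy sketch of these two alternating steps (illustrative only; the data X, the number of clusters K, and the random initialisation scheme are assumptions of the example):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with K observations drawn at random
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on random 2-D data
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = kmeans(X, K=3)
```

As on the next slides, the result depends on the initial centroids; several restarts are usually compared.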
Clustering with K-means
[Figure slides illustrating K-means clustering]
But also ...
[Figure slides]
Different forms of clustering
There are two main forms of clustering:
1. Flat partitioning, where the groups are supposed to be independent of one another. The user then chooses a number of clusters and a threshold over the similarity measure.
2. Hierarchical partitioning, where the groups are structured in the form of a taxonomy, which in general is a binary tree (each group has two children).
Hierarchical partitioning
❑ Hierarchical clustering constructs a tree, and it can be realized
❑ in a bottom-up manner, by building the tree from the observations (agglomerative techniques), or top-down, by building the tree from its root (divisive techniques).
❑ Hierarchical methods are purely deterministic and do not require the number of groups to be fixed beforehand. On the other hand, their complexity is in general quadratic in the number of observations (N)!
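As an illustration, a bottom-up (agglomerative) clustering can be run with SciPy; the Ward linkage criterion, the data, and the cut into 3 groups are arbitrary choices made for this sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # 50 observations in R^2

# Bottom-up construction of the binary tree (dendrogram);
# cost grows quadratically with N, as noted above
Z = linkage(X, method="ward")

# Cut the tree to obtain a flat partition into 3 groups
labels = fcluster(Z, t=3, criterion="maxclust")
```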
Steps of clustering
Clustering is an iterative process including the following steps:
1. Choose a similarity measure and, if needed, compute a similarity matrix.
2. Clustering:
   a. Choose a family of partitioning methods.
   b. Choose an algorithm within that family.
3. Validate the obtained groups.
4. Return to step 2, by modifying the parameters of the clustering algorithm or the family of partitioning methods.
Similarity measures
There exist several similarity measures or distances; the most common ones are:
❑ Jaccard measure, which estimates the proportion of common terms between two documents. In the case where the feature values are between 0 and 1, this measure takes the form:

\[
\mathrm{sim}_{\mathrm{Jaccard}}(x, x') = \frac{\sum_{i=1}^{d} x_i x'_i}{\sum_{i=1}^{d} \left( x_i + x'_i - x_i x'_i \right)}
\]

❑ Dice coefficient takes the form:

\[
\mathrm{sim}_{\mathrm{Dice}}(x, x') = \frac{2 \sum_{i=1}^{d} x_i x'_i}{\sum_{i=1}^{d} \left( x_i^2 + (x'_i)^2 \right)}
\]
Similarity measures
❑ cosine similarity writes:

\[
\mathrm{sim}_{\cos}(x, x') = \frac{\sum_{i=1}^{d} x_i x'_i}{\sqrt{\sum_{i=1}^{d} x_i^2}\, \sqrt{\sum_{i=1}^{d} (x'_i)^2}}
\]

❑ Euclidean distance is given by:

\[
\mathrm{dist}_{\mathrm{eucl}}(x, x') = \| x - x' \|_2 = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}
\]

This distance is then transformed into a similarity measure, by using for example its opposite.
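A NumPy sketch of these four measures, assuming dense feature vectors x and x' of equal dimension (the toy vectors are made up for the example):

```python
import numpy as np

def sim_jaccard(x, xp):
    # Generalized Jaccard for feature values in [0, 1]
    return np.sum(x * xp) / np.sum(x + xp - x * xp)

def sim_dice(x, xp):
    return 2 * np.sum(x * xp) / (np.sum(x ** 2) + np.sum(xp ** 2))

def sim_cos(x, xp):
    return np.sum(x * xp) / (np.sqrt(np.sum(x ** 2)) * np.sqrt(np.sum(xp ** 2)))

def dist_eucl(x, xp):
    return np.linalg.norm(x - xp)

x  = np.array([0.2, 0.0, 0.7])
xp = np.array([0.1, 0.5, 0.4])
# The Euclidean distance is turned into a similarity by taking its opposite
print(sim_jaccard(x, xp), sim_dice(x, xp), sim_cos(x, xp), -dist_eucl(x, xp))
```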
Mixture models
❑ With probabilistic approaches, we suppose that each group Gk is generated by a probability density with parameters θk.
❑ Following the law of total probability, an observation x is then supposed to be generated with probability

\[
P(x, \Theta) = \sum_{k=1}^{K} \underbrace{P(y = k)}_{\pi_k} \, P(x \mid y = k, \theta_k)
\]

where Θ = {πk, θk; k ∈ {1, . . . , K}} are the parameters of the mixture.
❑ The aim is then to find the parameters Θ with which the mixture model best fits the observations.
Mixture models (2)
❑ If we have a collection of N observations, x1:N, the log-likelihood writes

\[
L_M(\Theta) = \sum_{i=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \, P(x_i \mid y = k, \theta_k) \right]
\]

❑ The aim is then to find the parameters Θ∗ that maximize this criterion:

\[
\Theta^* = \operatorname{argmax}_{\Theta} L_M(\Theta)
\]

❑ The direct maximisation of this criterion is not possible in closed form, because it involves a sum of logarithms of sums.
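To see the "logarithm of a sum" structure concretely, here is a sketch that evaluates L_M(Θ) for a mixture of two one-dimensional Gaussians; the Gaussian form of the component densities and the chosen parameter values are assumptions of this example, not of the slide:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_likelihood(x, pis, mus, sigmas):
    """L_M(Theta) = sum_i ln sum_k pi_k P(x_i | y=k, theta_k), Gaussian components."""
    # log of the per-component weighted densities, shape (N, K)
    log_terms = np.log(pis) + norm.logpdf(x[:, None], loc=mus, scale=sigmas)
    # logsumexp over the components gives the log of the inner sum, per observation
    return logsumexp(log_terms, axis=1).sum()

x = np.concatenate([np.random.normal(-2, 1, 100), np.random.normal(3, 1, 100)])
print(log_likelihood(x, pis=np.array([0.5, 0.5]),
                     mus=np.array([-2.0, 3.0]), sigmas=np.array([1.0, 1.0])))
```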
Mixture models (3)
❑ We then use iterative methods for its maximisation (e.g. the EM algorithm).
❑ Once the optimal parameters of the mixture are found, each observation is assigned to a group following the Bayesian decision rule:

\[
x \in G_k \iff k = \operatorname{argmax}_{\ell} P(y = \ell \mid x, \Theta^*)
\]

where, for all ℓ ∈ {1, . . . , K},

\[
P(y = \ell \mid x, \Theta^*) = \frac{\pi^*_\ell \, P(x \mid y = \ell, \theta^*_\ell)}{P(x, \Theta^*)} \propto \pi^*_\ell \, P(x \mid y = \ell, \theta^*_\ell)
\]
EM algorithm [?]
❑ The idea behind the algorithm is to introduce hidden random variables Z such that, if Z were known, the parameter values maximizing the likelihood would be simple to find:

\[
L_M(\Theta) = \ln \sum_{Z} P(x_{1:N} \mid Z, \Theta) \, P(Z \mid \Theta)
\]

❑ By denoting the current estimates of the parameters at iteration t by Θ(t), the next iteration t + 1 consists in finding new parameters Θ that maximize L_M(Θ) − L_M(Θ(t)):

\[
L_M(\Theta) - L_M(\Theta^{(t)}) = \ln \sum_{Z} P(Z \mid x_{1:N}, \Theta^{(t)}) \, \frac{P(x_{1:N} \mid Z, \Theta) \, P(Z \mid \Theta)}{P(Z \mid x_{1:N}, \Theta^{(t)}) \, P(x_{1:N} \mid \Theta^{(t)})}
\]
EM algorithm [?]
❑ From Jensen's inequality and the concavity of the logarithm, it follows:

\[
L_M(\Theta) - L_M(\Theta^{(t)}) \ge \sum_{Z} P(Z \mid x_{1:N}, \Theta^{(t)}) \ln \frac{P(x_{1:N} \mid Z, \Theta) \, P(Z \mid \Theta)}{P(x_{1:N} \mid \Theta^{(t)}) \, P(Z \mid x_{1:N}, \Theta^{(t)})}
\]

❑ Let

\[
Q(\Theta, \Theta^{(t)}) = L_M(\Theta^{(t)}) + \sum_{Z} P(Z \mid x_{1:N}, \Theta^{(t)}) \ln \frac{P(x_{1:N} \mid Z, \Theta) \, P(Z \mid \Theta)}{P(x_{1:N} \mid \Theta^{(t)}) \, P(Z \mid x_{1:N}, \Theta^{(t)})}
\]
EM algorithm [?]
[Figure: the log-likelihood LM(Θ) and its lower bound Q(Θ, Θ(t)), with Θ(t) and Θ(t+1) on the Θ axis]
EM algorithm [?]
❑ At iteration t + 1, we look for the parameters Θ that maximise Q(Θ, Θ(t)):

\[
\Theta^{(t+1)} = \operatorname{argmax}_{\Theta} \, \mathbb{E}_{Z \mid x_{1:N}} \left[ \ln P(x_{1:N}, Z \mid \Theta) \mid \Theta^{(t)} \right]
\]

❑ The EM algorithm is an iterative algorithm:

Algorithm 1 The EM algorithm
1: Input: a collection x1:N = {x1, · · · , xN}
2: Initialize the parameters Θ(0) at random
3: for t ≥ 0 do
4:   E-step: estimate E_{Z|x1:N} [ln P(x1:N, Z | Θ) | Θ(t)]
5:   M-step: find the new parameters Θ(t+1) that maximise Q(Θ, Θ(t))
6: end for
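A compact EM sketch for a one-dimensional Gaussian mixture follows; the Gaussian form of P(x | y = k, θk), the initialisation, and the fixed number of iterations are assumptions made here for illustration, not prescribed by the slides:

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, K, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture with K components."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # Random initialisation of Theta^(0)
    pis = np.full(K, 1.0 / K)
    mus = rng.choice(x, size=K, replace=False)
    sigmas = np.full(K, x.std())
    for _ in range(n_iter):
        # E-step: posterior responsibilities P(y=k | x_i, Theta^(t)), shape (N, K)
        resp = pis * norm.pdf(x[:, None], loc=mus, scale=sigmas)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: new parameters Theta^(t+1) maximising Q(Theta, Theta^(t))
        Nk = resp.sum(axis=0)
        pis = Nk / N
        mus = (resp * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((resp * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
    return pis, mus, sigmas

x = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 1, 200)])
print(em_gmm(x, K=2))
```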
EM algorithm [?]
Figure from https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
CEM algorithm [?]
We suppose that:
❑ each group k ∈ {1, ..., K} is generated by a probability distribution with parameters θk,
❑ the observations are identically and independently distributed according to a probability distribution,
❑ each observation xi ∈ C belongs to one and only one group; we define an indicator cluster vector ti = (ti1, . . . , tiK) such that

\[
x_i \in G_\ell \iff y_i = \ell \iff t_{ik} = \begin{cases} 1, & \text{if } k = \ell, \\ 0, & \text{otherwise.} \end{cases}
\]

The aim is to find the parameters Θ = {θk; k ∈ {1, . . . , K}} that maximize the complete likelihood

\[
V(C, \pi, \Theta, G) = \prod_{i=1}^{N} P(x_i, y_i, \theta_{y_i}) = \prod_{i=1}^{N} \prod_{k=1}^{K} P(x_i, y_i = k, \theta_k)^{t_{ik}}
\]
Objective
In general the parameters Θ are those that maximize

\[
L(C, \Theta, G) = \sum_{i=1}^{N} \sum_{k=1}^{K} t_{ik} \log P(x_i, y_i = k, \theta_k) = \sum_{i=1}^{N} \sum_{k=1}^{K} t_{ik} \log \left[ \underbrace{P(y_i = k)}_{\pi_k} \, P(x_i \mid y_i = k, \theta_k) \right]
\]

The maximization can be carried out using the classification EM (CEM) algorithm.
CEM algorithm [?]
Begin with an initial partition G(0); t ← 0
while L(C, Θ(t+1), G(t+1)) − L(C, Θ(t), G(t)) > ϵ do
  E-step: estimate the posterior probabilities using the current parameters Θ(t):

\[
\forall \ell \in \{1, \ldots, K\},\quad \mathbb{E}[t_{i\ell} \mid x_i, G^{(t)}, \Theta^{(t)}] = \frac{\pi^{(t)}_\ell \, P(x_i \mid G^{(t)}_\ell, \theta^{(t)}_\ell)}{\sum_{k=1}^{K} \pi^{(t)}_k \, P(x_i \mid G^{(t)}_k, \theta^{(t)}_k)}
\]

  C-step: assign each example xi to the cluster for which its posterior probability is maximum. Denote G(t+1) this new partition.
  M-step: estimate the new parameters Θ(t+1) that maximise L(C, Θ(t), G(t+1))
  t ← t + 1
end while
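The same Gaussian-mixture sketch as for EM, with hard assignments in a C-step, gives a CEM-style procedure; the Gaussian components, the fixed iteration count, and the handling of empty clusters are assumptions of this illustration:

```python
import numpy as np
from scipy.stats import norm

def cem_gmm(x, K, n_iter=50, seed=0):
    """CEM-style clustering for a 1-D Gaussian mixture with K components."""
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=K, replace=False)
    sigmas = np.full(K, x.std())
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probabilities under the current parameters Theta^(t)
        post = pis * norm.pdf(x[:, None], loc=mus, scale=sigmas)
        post /= post.sum(axis=1, keepdims=True)
        # C-step: hard assignment of each x_i to its most probable group
        labels = post.argmax(axis=1)
        # M-step: re-estimate the parameters on the new partition G^(t+1)
        for k in range(K):
            xk = x[labels == k]
            if len(xk) == 0:          # keep empty clusters unchanged
                continue
            pis[k] = len(xk) / len(x)
            mus[k] = xk.mean()
            sigmas[k] = xk.std() if xk.std() > 0 else sigmas[k]
        pis /= pis.sum()
    return labels, pis, mus, sigmas
```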
CEM algorithm (convergence)
The algorithm converges to a local maximum of the complete log-likelihood.
❑ At the C-step, we choose the new partition G(t+1) using the current set of parameters Θ(t), according to the Bayesian decision rule:
L(C, Θ(t), G(t+1)) ≥ L(C, Θ(t), G(t))
❑ At the M-step, new parameters Θ(t+1) are found by maximising L(C, Θ(t), G(t+1)):
L(C, Θ(t+1), G(t+1)) ≥ L(C, Θ(t), G(t+1))
❑ At each iteration t we thus have:
L(C, Θ(t+1), G(t+1)) ≥ L(C, Θ(t), G(t))
As there is a finite number of partitions, the alternation of these two steps is guaranteed to converge.
Study case: document clustering
❑ Documents are usually represented using the Vector Space Model (VSM) proposed by Salton;
❑ In this case, the feature characteristics of a document reflect the presence of the terms of the vocabulary V = (t1, . . . , tV) in that document.
❑ If these features are based on term frequencies, a document d is then represented by a vector of dimension V: d = (tf1,d, . . . , tfV,d)
❑ In the case where the presence of the terms in a document is supposed to be independent of one another, the probability distributions are multinomial:

\[
\forall \ell \in \{1, \ldots, K\},\quad P(d \mid y = \ell) = \frac{tf_d!}{tf_{1,d}! \cdots tf_{V,d}!} \prod_{j=1}^{V} \theta_{j \mid \ell}^{tf_{j,d}}
\]

where tfd = tf1,d + . . . + tfV,d.
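As an illustration, the log of this multinomial probability for a term-frequency vector can be computed as follows; the multinomial coefficient, which does not depend on the cluster, is dropped, and the toy frequencies and parameters are made up for the example:

```python
import numpy as np

def log_multinomial(tf, theta_l, eps=1e-12):
    """log P(d | y = l), up to the multinomial coefficient,
    for a term-frequency vector tf and parameters theta_{j|l} summing to 1."""
    # eps avoids log(0) for terms with zero probability in cluster l
    return np.sum(tf * np.log(theta_l + eps))

tf = np.array([3, 0, 1, 2])                 # term frequencies of a document
theta = np.array([0.4, 0.1, 0.2, 0.3])      # theta_{j|l} for one cluster
print(log_multinomial(tf, theta))
```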
Study case: document clustering
❑ The parameters of the mixture model are then

\[
\Theta = \left\{ \theta_{j \mid k};\ j \in \{1, \ldots, V\},\ k \in \{1, \ldots, K\};\ \pi_k;\ k \in \{1, \ldots, K\} \right\}
\]

❑ By neglecting the multinomial coefficients, the optimization of the complete log-likelihood over a collection of N documents C = {d1, . . . , dN} writes

\[
\max_{\Theta} \sum_{i=1}^{N} \sum_{k=1}^{K} t_{ik} \left( \ln \pi_k + \sum_{j=1}^{V} tf_{j,d_i} \ln \theta_{j \mid k} \right)
\quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_k = 1, \qquad \forall k,\ \sum_{j=1}^{V} \theta_{j \mid k} = 1
\]
Study case: document clustering
❑ The maximization of the complete log-likelihood with respect to the model parameters then leads to the estimates

\[
\forall j, \forall k,\quad \theta_{j \mid k} = \frac{\sum_{i=1}^{N} t_{ik}\, tf_{j,d_i}}{\sum_{j'=1}^{V} \sum_{i=1}^{N} t_{ik}\, tf_{j',d_i}}
\qquad
\forall k,\quad \pi_k = \frac{\sum_{i=1}^{N} t_{ik}}{N}
\]
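Given the hard assignments t_ik and a document-term matrix, these estimates reduce to a few NumPy operations; this is a sketch in which TF (the N x V term-frequency matrix), T (the N x K indicator matrix), and the small smoothing constant added to avoid zero probabilities are all assumptions of the example:

```python
import numpy as np

def m_step(TF, T, smooth=1e-10):
    """TF: (N, V) term frequencies; T: (N, K) hard cluster indicators t_ik."""
    # theta_{j|k} is proportional to the total frequency of term j in cluster k
    counts = T.T @ TF                       # shape (K, V)
    theta = (counts + smooth) / (counts + smooth).sum(axis=1, keepdims=True)
    # pi_k is the fraction of documents assigned to cluster k
    pi = T.sum(axis=0) / T.shape[0]
    return theta, pi
```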
Evaluation
❑ The results of clustering can be evaluated using a labeled training set.
❑ The two most common measures are purity and Normalised Mutual Information (NMI).
❑ The purity measure quantifies the ability of the clustering method to group the observations of the same class into the same partitions. Let G be the partition found and C the set of classes. The purity measure is then defined by:

\[
\mathrm{pure}(G, C) = \frac{1}{N} \sum_{k} \max_{l} |G_k \cap C_l|
\]
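A sketch of the purity computation from two label arrays (cluster assignments and reference classes), assuming both are integer-coded; the toy arrays are made up for the example:

```python
import numpy as np

def purity(clusters, classes):
    """pure(G, C) = (1/N) * sum_k max_l |G_k ∩ C_l|."""
    N = len(clusters)
    total = 0
    for k in np.unique(clusters):
        members = classes[clusters == k]     # classes of the observations in G_k
        counts = np.bincount(members)        # |G_k ∩ C_l| for every class l
        total += counts.max()
    return total / N

clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2])
classes  = np.array([0, 0, 1, 1, 1, 1, 2, 2])
print(purity(clusters, classes))             # 7/8 for this toy example
```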
Evaluation
❑ The Normalised Mutual Information is defined by:

\[
\mathrm{NMI}(G, C) = \frac{2 \times I(G, C)}{H(G) + H(C)}
\]

where I is the mutual information and H the entropy. These two quantities can be computed as:

\[
I(G, C) = \sum_{k} \sum_{l} P(G_k \cap C_l) \log \frac{P(G_k \cap C_l)}{P(G_k) P(C_l)} = \sum_{k} \sum_{l} \frac{|G_k \cap C_l|}{N} \log \frac{N |G_k \cap C_l|}{|G_k| |C_l|}
\]

and:

\[
H(G) = -\sum_{k} P(G_k) \log P(G_k) = -\sum_{k} \frac{|G_k|}{N} \log \frac{|G_k|}{N}
\]

NMI is equal to 1 if the two sets G and C are identical.
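A corresponding sketch for NMI, following the two formulas above (natural logarithms; the convention 0·log 0 = 0 is handled by skipping empty intersections, and integer-coded labels are assumed):

```python
import numpy as np

def nmi(clusters, classes):
    """NMI(G, C) = 2 I(G, C) / (H(G) + H(C))."""
    N = len(clusters)
    # Mutual information I(G, C)
    I = 0.0
    for k in np.unique(clusters):
        for l in np.unique(classes):
            n_kl = np.sum((clusters == k) & (classes == l))
            if n_kl == 0:
                continue
            n_k, n_l = np.sum(clusters == k), np.sum(classes == l)
            I += (n_kl / N) * np.log(N * n_kl / (n_k * n_l))
    # Entropies H(G) and H(C)
    def entropy(labels):
        p = np.bincount(labels) / len(labels)
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    return 2 * I / (entropy(clusters) + entropy(classes))

clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2])
classes  = np.array([0, 0, 1, 1, 1, 1, 2, 2])
print(nmi(clusters, classes))
```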
References
G.J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience, 1992.
J.B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, (1): 281–297, 1967.