Be certain of how-to before mining uncertain data

F. Gullo (Yahoo Labs, Barcelona, Spain)
G. Ponti (ENEA Research Center, Portici (NA), Italy)
A. Tagarelli (University of Calabria, Cosenza, Italy)

7th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2014)
September 15-19, 2014, Nancy (France)

Giovanni Ponti

Uncertainty
Uncertainty inherently affects data from a wide range of emerging application domains:
- sensor data
- location-based services (e.g., moving objects data)
- biomedical and biometric data (e.g., gene expression data)
- distributed applications
- RFID data
It is generally due to noisy factors, such as signal noise, instrumental errors, and wireless transmission.

Uncertainty
[Figure: three panels, (a), (b), (c)]

Uncertainty representation
Different granularities:
- table
- tuple
- attribute
Different models:
- fuzzy
- evidence-oriented
- probabilistic
Attribute-level uncertainty modeled according to a probabilistic model (i.e., a probability distribution) ⇒ uncertain object

Uncertain object
Modeling by regions (domains) of definition and probability density functions (pdfs)
[Figure borrowed from Kriegel and Pfeifle, ICDM 2005]

Uncertain object
- $m$-dimensional region
- multivariate pdf defined over the region
Definition (uncertain object): An uncertain object $o$ is a pair $(R, f)$, where
- $R \subseteq \Re^m$ is the $m$-dimensional domain region in which $o$ is defined
- $f : \Re^m \to \Re_0^+$ is the probability density function of $o$ at each point $x \in \Re^m$, such that $f(x) > 0$, $\forall x \in R$, and $f(x) = 0$, $\forall x \in \Re^m \setminus R$
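
The definition above can be illustrated with a small sketch. It assumes, purely for illustration, that the region $R$ is an axis-aligned box and that the pdf is uniform over it; the names `UncertainObject`, `uniform_object`, and `density` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class UncertainObject:
    """Pair (R, f): a box-shaped region R and a pdf f (both are assumptions)."""
    lower: Tuple[float, ...]   # lower corner of the box R
    upper: Tuple[float, ...]   # upper corner of the box R
    pdf: Callable[[Sequence[float]], float]  # density used inside R

    def in_region(self, x: Sequence[float]) -> bool:
        return all(lo <= xi <= hi for lo, xi, hi in zip(self.lower, x, self.upper))

    def density(self, x: Sequence[float]) -> float:
        # f(x) = 0 outside R, as required by the definition
        return self.pdf(x) if self.in_region(x) else 0.0


def uniform_object(lower: Sequence[float], upper: Sequence[float]) -> UncertainObject:
    """Uncertain object whose pdf is constant (1 / volume) over the box."""
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= hi - lo
    return UncertainObject(tuple(lower), tuple(upper), lambda x: 1.0 / volume)


o = uniform_object([0.0, 0.0], [2.0, 1.0])
inside = o.density([1.0, 0.5])    # positive inside R
outside = o.density([3.0, 0.5])   # zero outside R
```

Other region shapes or pdf families would fit the same definition; the box and the uniform density are only the simplest case to write down.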

Dealing with uncertainty
Two main general tasks:
1. Defining a proximity measure between uncertain objects: needed in almost all major data-management and data-mining tasks (e.g., visualization, classification, clustering)
2. Defining a model to summarize a set of uncertain objects: required for tasks like data compression or clustering, and to speed up complex data-analysis/management tasks

Similarity detection in uncertain data

Distance between uncertain objects
Traditional approaches:
1. Difference between expected values
2. Expected Distance (ED):
$$ED(o_1, o_2) = \int_{x \in R_1} \int_{y \in R_2} \|x - y\|_2^2 \, f_1(x) \, f_2(y) \, dx \, dy$$
Main drawbacks:
1. Difference between expected values is inaccurate: it considers only very little information stored in the pdfs
2. Expected distance is slow: it has quadratic complexity in the number of statistical samples used to represent/approximate pdfs
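
Both drawbacks can be seen in a minimal sketch, assuming each uncertain object is approximated by a finite list of equally weighted samples drawn from its pdf. The function names are hypothetical, and the squared Euclidean norm is taken from the ED formula above.

```python
def squared_norm(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def expected_distance(samples_1, samples_2):
    """Sample-based estimate of ED: average over all |S1| * |S2| sample pairs."""
    total = 0.0
    for x in samples_1:          # outer loop over samples of o1
        for y in samples_2:      # inner loop over samples of o2 -> quadratic cost
            total += squared_norm(x, y)
    return total / (len(samples_1) * len(samples_2))


def mean(samples):
    """Sample-based estimate of the expected value of an object."""
    m = len(samples[0])
    return tuple(sum(s[j] for s in samples) / len(samples) for j in range(m))


def expected_value_distance(samples_1, samples_2):
    """Squared distance between expected values: linear cost, little information."""
    return squared_norm(mean(samples_1), mean(samples_2))


spread = [(0.0,), (2.0,)]   # samples of an object spread around 1
point = [(1.0,)]            # samples of an object concentrated at 1
ed = expected_distance(spread, point)
ev = expected_value_distance(spread, point)
```

Here the two objects have the same expected value, so the expected-value distance is zero even though their pdfs differ, while ED notices the difference at the price of visiting every pair of samples.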

Distance between uncertain objects
Need for a novel distance measure that trades off between accuracy and efficiency
- Idea: resort to Information Theory
- Information Theory alone is not enough

Distance measures for pdfs
Distance measures for pdfs are information-theoretic (IT) measures: Kullback-Leibler (KL), Chernoff, Hellinger, ...
IT measures are accurate, but they work only for pdfs that share a reasonably large area of overlapping probability values.
[Figure: two plots, panels (a) and (b)]

Compound distance for uncertain objects
$$\Delta(o_i, o_j) = f(\Delta_{IT}(o_i, o_j), \Delta_{EV}(o_i, o_j))$$
- $\Delta_{IT}$ involves a comparison by means of a certain IT measure
- $\Delta_{EV}$ measures the distance proportionally to the difference of the expected values
Two critical choices for defining $\Delta$:
1. IT measure used for $\Delta_{IT}$ ⇒ Hellinger distance ($H$):
$$H(f, f') = \sqrt{1 - \rho(f, f')}, \qquad \rho(f, f') = \int_{x \in \Re^m} \sqrt{f(x) \, f'(x)} \, dx$$
2. Way of combining $\Delta_{IT}$ and $\Delta_{EV}$ ⇒ $\Delta_{IT}$ should prevail over $\Delta_{EV}$ as long as discriminating among different cases by means of IT measures is possible
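
As a rough numerical illustration of the Hellinger distance, the sketch below approximates $\rho(f, f')$ with a midpoint rule for one-dimensional pdfs. The integration grid, the uniform pdfs, and all function names are assumptions made for the example, not part of the slides.

```python
import math


def rho(f, g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of sqrt(f(x) * g(x)) on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += math.sqrt(f(x) * g(x))
    return total * h


def hellinger(f, g, lo, hi, steps=20000):
    """H(f, g) = sqrt(1 - rho(f, g)); clamped at 0 to absorb rounding error."""
    return math.sqrt(max(0.0, 1.0 - rho(f, g, lo, hi, steps)))


def uniform_pdf(a, b):
    """Uniform density on [a, b], zero elsewhere."""
    return lambda x: 1.0 / (b - a) if a <= x <= b else 0.0


same = hellinger(uniform_pdf(0, 1), uniform_pdf(0, 1), -1.0, 3.0)
partial = hellinger(uniform_pdf(0, 2), uniform_pdf(1, 3), -1.0, 4.0)
near = hellinger(uniform_pdf(0, 1), uniform_pdf(2, 3), -1.0, 12.0)
far = hellinger(uniform_pdf(0, 1), uniform_pdf(10, 11), -1.0, 12.0)
```

The last two values coincide: once the pdfs stop overlapping, $\rho$ is zero and $H$ saturates at 1 regardless of how far apart the objects are, which is why an expected-value term is needed alongside the IT term.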

Compound distance for uncertain objects
Definition (uncertain distance): The uncertain distance $\Delta(o, o')$ between two uncertain objects $o = (R, f)$ and $o' = (R', f')$ is defined in terms of $H(f, f')$, $\rho(f, f')$, and $e^{-ED_2(\tilde{f}, \tilde{f}')}$, with three annotated parts:
- the $\Delta_{IT}$ term
- the $\Delta_{EV}$ term
- the combination between $\Delta_{IT}$ and $\Delta_{EV}$
$ED_2(\tilde{f}, \tilde{f}')$ is the expected distance between the uniform approximations of $f$ and $f'$.

Centroid-based agglomerative hierarchical clustering
F. Gullo, G. Ponti, A. Tagarelli, S. Greco [ICDM'08]
Application: hierarchical clustering of uncertain objects
Motivations:
- Hierarchical clustering is computationally expensive: need for a fast (yet accurate) proximity measure
- The way of combining $\Delta_{IT}$ and $\Delta_{EV}$ theoretically guarantees high accuracy in an agglomerative hierarchical clustering scheme
The U-AHC Algorithm
Input: a set of uncertain objects $\mathcal{D} = \{o_1, \ldots, o_n\}$
Output: a set of partitions $\mathbf{D}$
  $\mathcal{C} \leftarrow \{\{o_1\}, \ldots, \{o_n\}\}$
  $\mathbf{D} \leftarrow \{\mathcal{C}\}$
  repeat
    let $C_i, C_j$ be the pair of clusters in $\mathcal{C}$ such that $\Delta(P_{C_i}, P_{C_j})$ is minimum
    $\mathcal{C} \leftarrow \mathcal{C} \setminus \{C_i, C_j\} \cup \{C_i \cup C_j\}$
    $\mathbf{D} \leftarrow \mathbf{D} \cup \{\mathcal{C}\}$
  until $|\mathcal{C}| = 1$
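
The agglomerative loop of U-AHC can be sketched as follows. This is only an outline of the control flow: the `distance` and `prototype` arguments are placeholders supplied by the caller, and the demo uses pooled samples and a difference of sample means as stand-ins, which are assumptions for illustration and not the uncertain distance $\Delta$ of the slides.

```python
def u_ahc(objects, distance, prototype):
    """Return the list of partitions produced by repeatedly merging the
    two clusters whose prototypes are closest. Clusters hold object indices."""
    clusters = [[i] for i in range(len(objects))]
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                p_i = prototype([objects[t] for t in clusters[i]])
                p_j = prototype([objects[t] for t in clusters[j]])
                d = distance(p_i, p_j)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # remove C_i and C_j, add their union
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
        partitions.append([list(c) for c in clusters])
    return partitions


# Stand-ins (assumptions): objects are lists of 1-D samples, a prototype pools
# the samples of its members, and the distance compares sample means.
def pooled_samples(members):
    return [s for obj in members for s in obj]


def mean_gap(p, q):
    return abs(sum(p) / len(p) - sum(q) / len(q))


objs = [[0.0, 0.2], [0.1, 0.3], [5.0, 5.2]]
levels = u_ahc(objs, mean_gap, pooled_samples)
```

Recomputing every prototype pair at each step keeps the sketch short; a practical version would cache distances between unchanged clusters.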

Uncertain data summarization

Summarization of a set of uncertain objects
Traditional approaches (e.g., Chau et al., UK-means, PAKDD'06) ⇒ uncertain prototype defined as the average of the expected values of the objects to be summarized
Main drawbacks:
- Deterministic representation ⇒ a lot of information is discarded
- Only central tendency is expressed ⇒ variance is completely ignored

Summarization of a set of uncertain objects
[Figure] Uncertain objects with the same central tendency: lower-variance, more-compact cluster (left) and higher-variance, less-compact cluster (right)
[Figure] Uncertain objects with different central tendency: lower-variance, less-compact cluster (left) and higher-variance, more-compact cluster (right)

Summarization of a set of uncertain objects
Solutions:
1. Mixture-model-based uncertain data summarization
2. Random-variable-based uncertain data summarization

Mixture-model-based uncertain data summarization
Idea: compute a prototype of a set of uncertain objects as a mixture model:
- set of uncertain objects $S = \{o_i\}_{i=1}^{k}$
- uncertain prototype $P_S = (R_S, f_S)$, where
$$R_S = \bigcup_{o = (R, f) \in S} R, \qquad f_S(x) = |S|^{-1} \sum_{o = (R, f) \in S} f(x)$$
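
A minimal sketch of the mixture-model prototype, assuming one-dimensional objects whose regions are closed intervals. The helper names (`mixture_pdf`, `union_of_intervals`, `uniform_pdf`) are hypothetical.

```python
def mixture_pdf(pdfs):
    """f_S(x): equally weighted average of the member pdfs."""
    n = len(pdfs)
    return lambda x: sum(f(x) for f in pdfs) / n


def union_of_intervals(intervals):
    """R_S for 1-D interval regions: sorted list of merged intervals."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # overlaps the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def uniform_pdf(a, b):
    """Uniform density on [a, b], zero elsewhere."""
    return lambda x: 1.0 / (b - a) if a <= x <= b else 0.0


regions = [(2.0, 4.0), (0.0, 1.0)]
f_s = mixture_pdf([uniform_pdf(a, b) for a, b in regions])
r_s = union_of_intervals(regions)
```

Because the weights are all $|S|^{-1}$, the mixture integrates to 1 whenever each member pdf does, so the prototype is itself an uncertain object.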

Mixture-model-based uncertain data summarization
Despite its simplicity, the mixture-model-based prototype plays a key role in the task of clustering uncertain objects: it enables a novel clustering criterion that does not require any distance measure between uncertain objects ⇒ minimizing the variance of cluster prototypes
[Figure: (a)–(c): sets of uncertain objects; (b)–(d): the corresponding mixture models]

Minimizing the variance of cluster mixture models for clustering uncertain objects
F. Gullo, G. Ponti, A. Tagarelli [ICDM'10, SAM'13]
A novel criterion for clustering uncertain objects: minimizing the variance of cluster mixture models
$$J(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sigma^2(P_C)$$
- accuracy: the lower the variance, the higher the cluster compactness
- efficiency: capability of exploiting interesting analytical properties
Computing objective function $J$
- Moving object $o$ from $C \in \mathcal{C}$ to $\widehat{C} \in \mathcal{C}$ leads to a new $\mathcal{C}' = \mathcal{C} \setminus \{C, \widehat{C}\} \cup \{C', \widehat{C}'\}$, where $C' = C \setminus \{o\}$, $\widehat{C}' = \widehat{C} \cup \{o\}$
- $J(\mathcal{C}')$ can be efficiently computed in $O(m)$ as:
$$J(\mathcal{C}') = J(\mathcal{C}) - (\sigma^2(P_C) + \sigma^2(P_{\widehat{C}})) + (\sigma^2(P_{C'}) + \sigma^2(P_{\widehat{C}'}))$$
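
The $O(m)$ update can be sketched under two assumptions that the slide does not spell out: each object is summarized by its per-dimension first and second moments, and $\sigma^2(P_C)$ is taken as the sum over dimensions of the mixture's second moment minus its squared first moment. Since the moments of an equally weighted mixture are the averages of its members' moments, each cluster only needs running sums. All names below are hypothetical.

```python
def variance(stats):
    """Variance of a cluster mixture model from (sum_mu, sum_mu2, size)."""
    sum_mu, sum_mu2, size = stats
    if size == 0:
        return 0.0
    return sum(s2 / size - (s1 / size) ** 2 for s1, s2 in zip(sum_mu, sum_mu2))


def shifted(stats, obj, sign):
    """Cluster statistics after adding (sign=+1) or removing (sign=-1) obj."""
    sum_mu, sum_mu2, size = stats
    mu, mu2 = obj
    return ([a + sign * u for a, u in zip(sum_mu, mu)],
            [b + sign * u for b, u in zip(sum_mu2, mu2)],
            size + sign)


def j_after_move(j, src, dst, obj):
    """Updated J when obj moves from cluster src to cluster dst: O(m) work."""
    new_src = shifted(src, obj, -1)
    new_dst = shifted(dst, obj, +1)
    new_j = j - (variance(src) + variance(dst)) + (variance(new_src) + variance(new_dst))
    return new_j, new_src, new_dst


# 1-D objects as (first moments, second moments); second moment = var + mean^2
o1, o2, o3 = ([0.0], [1.0]), ([2.0], [5.0]), ([10.0], [100.0])
empty = ([0.0], [0.0], 0)
a = shifted(shifted(empty, o1, +1), o3, +1)   # cluster {o1, o3}
b = shifted(empty, o2, +1)                    # cluster {o2}
j0 = variance(a) + variance(b)
j1, a1, b1 = j_after_move(j0, a, b, o3)
```

Only the two clusters touched by the move are re-evaluated, which is what keeps the cost independent of the number of objects.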

The MMVar algorithm
Input: a set $\mathcal{D}$ of uncertain objects (UO); the number $k$ of output clusters
Output: a partition $\mathcal{C}$ of $\mathcal{D}$
  compute $\mu(o)$, $\mu_2(o)$, $\forall o \in \mathcal{D}$
  $\mathcal{C} \leftarrow randomPartition(\mathcal{D}, k)$
  compute $\mu(P_C)$, $\mu_2(P_C)$, $\forall C \in \mathcal{C}$
  $v \leftarrow J(\mathcal{C})$
  repeat
    for all $o \in \mathcal{D}$ do
      let $C \in \mathcal{C}$ be the cluster s.t. $o \in C$
      $C^* \leftarrow \arg\min_{\widehat{C}} J_{\mathcal{C}}(C, o, \widehat{C})$
      if $C^* \neq C$ then
        $v \leftarrow J_{\mathcal{C}}(C, o, C^*)$
        recompute $\mathcal{C}$ by moving $o$ from $C$ to $C^*$
        recompute $\mu(P_C)$, $\mu_2(P_C)$, $\mu(P_{C^*})$, $\mu_2(P_{C^*})$
  until no object in $\mathcal{D}$ is relocated
- MMVar converges to a local optimum of function $J$ in a finite number $I$ of iterations
- MMVar works in $O(I \, k \, |\mathcal{D}| \, m)$
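
The relocation loop can be sketched in Python as below. This is an illustrative reading of the pseudocode, not the authors' implementation: objects are assumed to be given as (first moments, second moments) pairs, the variance is summed over dimensions, a small tolerance guards against rounding, and singleton clusters are never emptied. The optional `labels` argument, used here to fix the initial partition, is also an addition for reproducibility.

```python
import random


def variance(stats):
    """Variance of a cluster mixture model from (sum_mu, sum_mu2, size)."""
    s1, s2, n = stats
    if n == 0:
        return 0.0
    return sum(b / n - (a / n) ** 2 for a, b in zip(s1, s2))


def shifted(stats, obj, sign):
    """Cluster statistics after adding (+1) or removing (-1) an object."""
    s1, s2, n = stats
    mu, mu2 = obj
    return ([a + sign * u for a, u in zip(s1, mu)],
            [b + sign * u for b, u in zip(s2, mu2)],
            n + sign)


def mmvar(objects, k, labels=None, seed=0):
    """Relocate objects one at a time while the sum of cluster variances drops."""
    m = len(objects[0][0])
    if labels is None:
        # random initial partition in which every cluster gets at least one object
        labels = [i % k for i in range(len(objects))]
        random.Random(seed).shuffle(labels)
    labels = list(labels)
    stats = [([0.0] * m, [0.0] * m, 0) for _ in range(k)]
    for obj, c in zip(objects, labels):
        stats[c] = shifted(stats[c], obj, +1)
    relocated = True
    while relocated:
        relocated = False
        for i, obj in enumerate(objects):
            src = labels[i]
            if stats[src][2] == 1:
                continue  # assumption: do not empty a cluster
            src_after = shifted(stats[src], obj, -1)
            best, best_delta = src, 0.0
            for dst in range(k):
                if dst == src:
                    continue
                dst_after = shifted(stats[dst], obj, +1)
                # change in J caused by moving obj from src to dst
                delta = (variance(src_after) + variance(dst_after)
                         - variance(stats[src]) - variance(stats[dst]))
                if delta < best_delta - 1e-12:
                    best, best_delta = dst, delta
            if best != src:
                stats[best] = shifted(stats[best], obj, +1)
                stats[src] = src_after
                labels[i] = best
                relocated = True
    return labels


# four 1-D objects with means 0, 1, 10, 11 and variance 0.25 each
objs = [([mu], [mu * mu + 0.25]) for mu in (0.0, 1.0, 10.0, 11.0)]
result = mmvar(objs, 2, labels=[0, 1, 0, 1])
```

Each accepted move strictly lowers $J$, and there are finitely many partitions, so the loop stops; the partition it stops at depends on the initial one.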
