Pseudodimension for Data Analytics, Matteo Riondato, Amherst College (PowerPoint PPT presentation)


SLIDE 1

Pseudodimension for Data Analytics

Matteo Riondato

Amherst College

ICERM — May 17, 2019

SLIDE 2

Takeaway message

High-quality approximations of data mining tasks can be obtained very quickly from small random samples of the dataset. Pseudodimension, a concept from statistical learning theory, can be used to analyze the trade-off between sample size and approximation quality. Although it was originally developed for supervised learning, we use it to analyze algorithms for unsupervised, combinatorial problems on graphs, transactional datasets, databases, ...

SLIDE 3

Outline

1. Random sampling for data analytics
2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 4

Approximations from a random sample

Dataset D, |D| = n. Random sample S, |S| = ℓ ≪ n.

Data mining task: for each color c ∈ C, compute the fraction rc(D) of c-colored jelly beans in D:

rc(D) = (1/n) Σ_{j∈D} fc(j), where fc(j) = 1 if j is c-colored, 0 otherwise.

Too expensive.

Sampling-based data analytics task: estimate rc(D) with rc(S), for each c ∈ C:

rc(S) = (1/ℓ) Σ_{j∈S} fc(j).

Fast. Acceptable if max_{c∈C} |rc(S) − rc(D)| is small.

Key challenge: tell how small this error is.
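The jelly-bean task above is easy to sketch in code. A toy version (dataset, colors, and sizes are invented for illustration):

```python
import random
from collections import Counter

# Toy version of the jelly-bean task; dataset, colors, and sizes
# are invented for illustration.
random.seed(42)
colors = ["red", "green", "blue", "yellow"]
D = [random.choice(colors) for _ in range(100_000)]   # dataset, |D| = n
S = random.sample(D, 1_000)                           # sample, |S| = ell << n

def fractions(beans):
    """r_c over the given collection, for every color c."""
    counts = Counter(beans)
    return {c: counts[c] / len(beans) for c in colors}

r_D, r_S = fractions(D), fractions(S)
# The quantity the talk wants to bound -- computable here only because
# the toy D is small enough to scan in full:
max_err = max(abs(r_S[c] - r_D[c]) for c in colors)
```

On real data we cannot scan D, which is exactly why the rest of the talk bounds `max_err` without computing it.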

SLIDE 5

Error bounds

max_{c∈C} |rc(S) − rc(D)| is not computable from S. Let's get an upper bound ε to it.

Probabilistic upper bound to the max. error: fix a failure probability δ ∈ (0, 1). A value ε ∈ (0, 1) is a probabilistic upper bound to max_{c∈C} |rc(S) − rc(D)| if

Pr[ max_{c∈C} |rc(S) − rc(D)| < ε ] ≥ 1 − δ.

The probability is over the samples of size ℓ. Ingredients to compute ε: δ, C, D or S, and |S| = ℓ:

ε = g(δ, C, D or S, ℓ)

How do we find such a function g?

SLIDE 6

Outline

✔ 1. Random sampling for data analytics
2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 7

A classic probabilistic upper bound to the error

Theorem (Chernoff bound + union bound): let

ε = √( 3 (ln |C| + ln(2/δ)) / ℓ ).

Then

Pr[ max_{c∈C} |rc(S) − rc(D)| < ε ] ≥ 1 − δ.

Not a function of D or S! We want ε = g(δ, C, D or S, ℓ); D or S give information on the complexity of approximation through sampling; "ln |C|" is a rough measure of the sample complexity of the task, as it ignores the data.
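A minimal calculator for this bound, assuming the form ε = √(3 (ln|C| + ln(2/δ)) / ℓ) (the function name is ours):

```python
from math import log, sqrt

def union_bound_eps(num_colors: int, ell: int, delta: float) -> float:
    """eps = sqrt(3 * (ln|C| + ln(2/delta)) / ell).
    Note the inputs: only |C|, ell, and delta -- never D or S."""
    return sqrt(3 * (log(num_colors) + log(2 / delta)) / ell)

eps_small = union_bound_eps(100, 10_000, 0.1)
eps_large = union_bound_eps(100, 40_000, 0.1)   # 4x the sample halves eps
```

The 1/√ℓ decay means quadrupling the sample only halves the error bound, while |C| enters only logarithmically.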

SLIDE 8

Are there better measures of sample complexity?

Measures from Statistical Learning Theory replace "ln |C|" with h(C, D) or h(C, S): VC-dimension, pseudodimension, covering numbers, Rademacher averages, ... Developed for supervised learning, they had a reputation of being only of theoretical interest; we showed they can be used for efficient practical algorithms for data mining problems.

Example: betweenness centrality estimation on a graph G = (V, E):

Union bound: ε = O( √( (ln |V| + ln(1/δ)) / ℓ ) );

VC-dimension [ACM WSDM'14, DMKD'16]: ε = O( √( (ln diam(G) + ln(1/δ)) / ℓ ) );

Exponential reduction on important classes of graphs (small-world, power-law, ...).

SLIDE 9

VC-dimension

The Vapnik-Chervonenkis dimension VC(F) of a family F of subsets of X (equivalently, of 0-1 functions on X) is a combinatorial measure of the richness of F. Originally developed to study generalization error bounds for classification [VC71]; also picked up by the computational geometry community. A set X = {x1, . . . , xℓ} ⊆ X is shattered by F if {X ∩ A : A ∈ F} = 2^X. VC-dimension of F: size of the largest set that can be shattered by F.

SLIDE 10

VC-dimension example

X = ℝ², F = axis-aligned rectangles. Shattering a set of four points is easy, but shattering five points is impossible. (Figure: four points shattered by rectangles; five points where no rectangle isolates the required subset.) Proving an upper bound of k on the VC-dimension requires showing that no set of size k + 1 can be shattered.
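The rectangle example can be checked by brute force. A sketch (all names ours): a subset is realizable iff the bounding box of the subset contains no other point of the set, so testing bounding boxes suffices.

```python
from itertools import combinations

def rect_shatters(points):
    """Brute-force check that axis-aligned rectangles shatter the given
    2-D point set: every nonempty subset must be exactly the set of
    points inside its own bounding box."""
    pts = list(points)
    for r in range(1, len(pts) + 1):   # empty set: any tiny rectangle works
        for A in combinations(pts, r):
            xs = [p[0] for p in A]
            ys = [p[1] for p in A]
            inside = {p for p in pts
                      if min(xs) <= p[0] <= max(xs)
                      and min(ys) <= p[1] <= max(ys)}
            if inside != set(A):
                return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # four points: shattered
with_center = diamond + [(0, 0)]               # this five-point set is not
# (proving VC-dim <= 4 needs the same failure for *every* five-point set)
```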

SLIDE 11

Pseudodimension

Pseudodimension PD(F) of a family F of real-valued functions from domain X to [a, b]. Combinatorial measure of the richness of F. Originally developed to study generalization error bounds for regression [Pollard84]. Intuition: If the graphs of the f’s in F cross many times, the pseudodimension is high.

SLIDE 12

Pseudodimension

A set X = {x1, . . . , xℓ} ⊆ X is (pseudo-)shattered by F if there exist t1, . . . , tℓ ∈ ℝ such that

|{ (sgn(f(x1) − t1), . . . , sgn(f(xℓ) − tℓ)) : f ∈ F }| = 2^ℓ

(the vectors live in {−1, 1}^ℓ).

PD(F): size of the largest pseudo-shattered set.
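For a finite family F the definition can be checked by brute force. A sketch (names and the toy family of affine functions are ours; we map sgn(0) to −1, one of the usual conventions):

```python
def is_pseudo_shattered(X, F, thresholds):
    """Check whether the witnesses t_1..t_ell pseudo-shatter X under the
    finite family F: the sign patterns (sgn(f(x_i) - t_i))_i over f in F
    must hit all 2^|X| vectors in {-1, 1}^|X|."""
    patterns = {tuple(1 if f(x) > t else -1 for x, t in zip(X, thresholds))
                for f in F}
    return len(patterns) == 2 ** len(X)

# Four affine functions pseudo-shatter X = {-1, 1} with thresholds (0, 0):
F = [lambda x, a=a, b=b: a * x + b
     for a, b in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
shattered = is_pseudo_shattered((-1.0, 1.0), F, (0.0, 0.0))
```

With only the first two functions, only two of the four sign patterns appear, so the same witnesses no longer pseudo-shatter X.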

SLIDE 13

Pseudodimension as VC-dimension

For each f ∈ F, let Rf = {(x, t) : t ≤ f(x)} ⊂ X × [a, b]. Define the family of sets F⁺ = {Rf : f ∈ F}. Then PD(F) = VC(F⁺).

SLIDE 14

Proving upper bounds to pseudodimension

The game is always about restricting the class of sets that may be shattered. Two useful general restrictions [R.-Upfal '18 (someone must have known before)]: if B ⊆ X × [a, b] is shattered by F⁺, then 1) B may contain at most one element (x, t) for each x ∈ X; 2) B cannot contain any element (x, a) for any x ∈ X.

SLIDE 15

Pseudodimension and sampling

Theorem [Li et al. '01]: let PD(F) ≤ d and

ε = O( √( (d + ln(2/δ)) / s ) ).

Then

Pr[ max_{f∈F} | (1/s) Σ_{x∈S} f(x) − E_S[ (1/s) Σ_{x∈S} f(x) ] | < ε ] ≥ 1 − δ.

If F is finite and d ≪ ln |F|, this ε is much smaller than the one derived with Hoeffding + union bound. The theorem works even if F is infinite.

SLIDE 16

Outline

✔ 1. Random sampling for data analytics
✔ 2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 17

Making miso soup

Ingredients (for 4 people):

  • 2 teaspoons dashi granules
  • 4 cups water
  • 3 tablespoons miso paste
  • 1 (8 ounce) package silken tofu, diced
  • 2 green onions, sliced diagonally into 1/2 inch pieces

“Miso makes a soup loaded with flavour that saves you the hassle of making stock.”

  • Y. Ottolenghi (world-class chef)

(Really: [R.-Vandin '18]: MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension)

SLIDE 18

Section outline

1. Settings: datasets, subgroups, interestingness measures
2. Approximating subgroups: a sufficient condition
3. The pseudodimension of subgroups

SLIDE 19

Subgroups

D = {t1, . . . , tn} (transactions); t = (t.A1, t.A2, . . . , t.Az, t.T) ∈ Y1 × · · · × Yz × {0, 1}, where the t.Ai are the description attributes and t.T is the target. E.g., t3 = (blue, 4, false, 1), t4 = (red, 3, true, 1).

Subgroup B = (cond1,1 ∨ · · · ∨ cond1,r1) ∧ · · · ∧ (condq,1 ∨ · · · ∨ condq,rq). E.g., B = (A1 = blue ∨ A1 = red) ∧ (A2 = 4): t3 supports B, t4 does not.

Language L: candidate subgroups of potential interest to the analyst. E.g., for L = "conjunctions of two equality conditions", B ∉ L, but ((A1 = blue) ∧ (A2 = 4)) ∈ L.
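One possible encoding of subgroups in code (the encoding is our illustration, not the paper's): a list of clauses, each clause a list of (attribute-index, value) equality conditions; clauses are ANDed, conditions within a clause ORed.

```python
def supports(t, subgroup):
    """True iff transaction t satisfies every clause of the subgroup
    (clauses ANDed, equality conditions within a clause ORed)."""
    return all(any(t[attr] == val for attr, val in clause)
               for clause in subgroup)

# B = (A1 = blue OR A1 = red) AND (A2 = 4); A1, A2 at indices 0, 1
B = [[(0, "blue"), (0, "red")], [(1, 4)]]
t3 = ("blue", 4, False, 1)   # supports B
t4 = ("red", 3, True, 1)     # A2 = 3, so it fails the second clause
```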

SLIDE 20

Mining Interesting Subgroups

Interesting subgroup: subgroup associated with the target value (e.g., 1). Examples:

  • social networks: attributes = user features, target = interest in a topic
  • biomedicine: attributes = mutations, target = response to therapy
  • classification: attributes = features, target = test label XOR prediction

Inherently interpretable!

SLIDE 21

Subgroup quality measures

p-Quality of B in D: q(p)_D(B) = gD(B)^p × uD(B).

Generality of B in D: gD(B) = |CD(B)| / |D|, where CD(B) = {t ∈ D : t supports B} is the cover of B.

Unusualness of B in D: uD(B) = (1/|CD(B)|) Σ_{t∈CD(B)} t.T − (1/|D|) Σ_{t∈D} t.T, i.e., the target mean of CD(B) minus the target mean µ of D.

p weights generality vs. unusualness (usually p ∈ {1/2, 1, 2}); p = 1/2 ⇒ quality of B ∼ z-score of B. Rest of the talk: p = 1 ⇒ quality qD(B).
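The three measures are direct to compute. A sketch over transactions shaped as (attributes..., 0/1 target), with `cover` standing in for "t supports B" (the dataset values are illustrative):

```python
def generality(D, cover):
    """g_D(B): fraction of transactions in the cover."""
    return sum(1 for t in D if cover(t)) / len(D)

def unusualness(D, cover):
    """u_D(B): target mean of the cover minus target mean of D."""
    C = [t for t in D if cover(t)]
    mu = sum(t[-1] for t in D) / len(D)
    return sum(t[-1] for t in C) / len(C) - mu

def quality(D, cover, p=1):
    """q^(p)_D(B) = g_D(B)^p * u_D(B)."""
    return generality(D, cover) ** p * unusualness(D, cover)

D = [(1, 1, 1, 1), (3, 1, 1, 1), (1, 1, 1, 1), (2, 1, 1, 0)]
B = lambda t: t[0] >= 2 and t[2] == 1   # "A1 >= 2 AND A3 = 1"
q = quality(D, B)                       # 0.5 * (0.5 - 0.75) = -0.125
```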

SLIDE 22

Example of subgroup quality measures

Dataset (attributes A1, A2, A3, target T):

A1 A2 A3 T
1  1  1  1
3  1  1  1
1  1  1  1
2  1  1  0

Target mean µ of D: (1/|D|) Σ_{t∈D} t.T = 3/4 = 0.75.

Subgroup B = "A1 ≥ 2 ∧ A3 = 1":

Generality gD(B) = |{t ∈ D : t supports B}| / |D| = 2/4 = 0.5.

Unusualness uD(B) = (1/|CD(B)|) Σ_{t∈CD(B)} t.T − µ = 1/2 − 0.75 = −0.25.

1-quality: qD(B) = gD(B) × uD(B) = 0.5 × (−0.25) = −0.125.

SLIDE 23

The top-k subgroup mining task

Input: D, L, k ≥ 1. rD(k): the k-th highest quality in D of a subgroup from L. Output: TOP(D, L, k) = {B ∈ L : qD(B) ≥ rD(k)}.
SLIDE 24

Exact solutions and approximations

Observations:

  • the number of subgroups may be exponential in the number of attributes
  • the quality function is not anti-monotonic
  • the quality function is not a mean

Many approaches obtain an exact solution [Klosgen'92], [Wrobel'97], ... They require processing the entire dataset ⇒ computationally expensive as data grows!

Goal: find an approximate solution by processing only a random sample S of the data. Trade-off: sample size / speed vs. accuracy / confidence. Qs: What approximation? How to estimate quality? How many transactions in S? What algorithm?

SLIDE 25

ε-approximation

ε-Approximation to TOP(D, L, k): a set C = {(B, fB) : B ∈ L, fB ∈ [−1, 1]} s.t.:

1. for each B ∈ TOP(D, L, k), there is (B, fB) ∈ C;
2. there is no (B, fB) ∈ C s.t. qD(B) < rD(k) − ε;
3. for each (B, fB) ∈ C, |fB − qD(B)| < ε/4.

SLIDE 26

A novel formulation of quality

(Recall: target mean µ = (1/|D|) Σ_{z∈D} z.T.)

For B ∈ L and t ∈ D:

ρB(t) = 1 − µ if t ∈ CD(B) and t.T = 1; −µ if t ∈ CD(B) and t.T = 0; 0 otherwise.

Lemma: qD(B) = (1/|D|) Σ_{t∈D} ρB(t).

Approximate quality of B ∈ L on a sample S: q̃S(B) = (1/|S|) Σ_{t∈S} ρB(t).

Lemma: ES[q̃S(B)] = qD(B).
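A quick numerical check of the first lemma on a toy dataset (the dataset and the `cover` predicate are our illustration): the average of ρB over D equals gD(B) × uD(B).

```python
# Transactions shaped (attributes..., 0/1 target); cover plays "t supports B".
D = [(1, 1, 1, 1), (3, 1, 1, 1), (1, 1, 1, 1), (2, 1, 1, 0)]
cover = lambda t: t[0] >= 2 and t[2] == 1
mu = sum(t[-1] for t in D) / len(D)       # target mean of D

def rho(t):
    """rho_B(t): 1 - mu, -mu, or 0, per the reformulation above."""
    if not cover(t):
        return 0.0
    return 1 - mu if t[-1] == 1 else -mu

avg_rho = sum(rho(t) for t in D) / len(D)

# Direct computation of q_D(B) = g_D(B) * u_D(B) for comparison:
C = [t for t in D if cover(t)]
q_direct = (len(C) / len(D)) * (sum(t[-1] for t in C) / len(C) - mu)
```

Both evaluate to −0.125 here, matching the lemma.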

SLIDE 27

Section outline

✔ 1. Settings: datasets, subgroups, interestingness measures
2. Approximating subgroups: a sufficient condition
3. The pseudodimension of subgroups

SLIDE 28

Sufficient condition for ε-approximation

r̃S(k): k-th highest approximate quality in S of a B ∈ L.

Thm: if |q̃S(B) − qD(B)| < ε/4 for every B ∈ L, then

{ (B, q̃S(B)) : B ∈ L, q̃S(B) > r̃S(k) − ε/2 }

is an ε-approximation to TOP(D, L, k).

SLIDE 29

Sufficient condition for ε-approximation

Hyp: |q̃S(B) − qD(B)| < ε/4 for every B ∈ L. Want:

1. for each B ∈ TOP(D, L, k), q̃S(B) ≥ r̃S(k) − ε/2; ✔
2. for each B ∈ L s.t. qD(B) < rD(k) − ε, q̃S(B) < r̃S(k) − ε/2; ✔
3. for each (B, q̃S(B)) ∈ C, |q̃S(B) − qD(B)| < ε/4. ✔

SLIDE 30

Random samples and approximations

Probabilistic tail bounds: Hoeffding + union bound (baseline).

Thm: given ε, δ ∈ (0, 1), let S be a sample of s = (16/ε²)(ln |L| + ln(2/δ)) transactions from D. Then, with probability ≥ 1 − δ over the choice of S, |q̃S(B) − qD(B)| < ε/4 for every B ∈ L.

Q: How to better characterize the dependency on L?
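A calculator for the baseline sample size (the function name is ours):

```python
from math import ceil, log

def hoeffding_union_size(num_subgroups: int, eps: float, delta: float) -> int:
    """s = (16/eps^2)(ln|L| + ln(2/delta)) from the baseline theorem.
    Depends on |L|, not on the data."""
    return ceil(16 / eps ** 2 * (log(num_subgroups) + log(2 / delta)))

s_small = hoeffding_union_size(10 ** 3, 0.05, 0.1)
s_large = hoeffding_union_size(10 ** 6, 0.05, 0.1)
# 1000x more subgroups less than doubles the required sample...
# ...but |L| itself can be exponential in the number of attributes,
# which is what the pseudodimension bound improves on.
```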

SLIDE 31

Section outline

✔ 1. Settings: datasets, subgroups, interestingness measures
✔ 2. Approximating subgroups: a sufficient condition
3. The pseudodimension of subgroups

SLIDE 32

Pseudodimension for subgroups

FL,D = {ρB : B ∈ L}, X = D, and

| (1/s) Σ_{x∈S} f(x) − ES[ (1/s) Σ_{x∈S} f(x) ] | = |q̃S(B) − qD(B)|.

Missing ingredient: an upper bound to PD(FL,D).

Thm: let d be the maximum number of subgroups from L that a transaction of D can support. Then PD(FL,D) ≤ ⌊log2 d⌋ + 1.

Corol: let L be the set of subgroups involving up to c conjunctions of equality conditions, and let z be the number of attributes of D. Then PD(FL,D) ≤ ⌊log2 Σ_{i=1}^{c} (z choose i)⌋ + 1 ≤ z.
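The corollary's bound is a one-liner (the function name is ours):

```python
from math import comb, floor, log2

def pd_bound(z: int, c: int) -> int:
    """Corollary above: a transaction supports at most
    d = sum_{i=1..c} C(z, i) subgroups with up to c equality conditions
    over z attributes, so PD <= floor(log2 d) + 1 <= z."""
    d = sum(comb(z, i) for i in range(1, c + 1))
    return floor(log2(d)) + 1
```

For example, with z = 20 attributes and up to c = 3 conditions, d = 20 + 190 + 1140 = 1350 and the bound is 11, far below both z = 20 and ln |L|.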

SLIDE 33

Proof “intuition”

Z = D × [−µ, 1 − µ], R = {RB : B ∈ L}, RB = {(t, q) : t ∈ D, q ≤ ρB(t)}.

PD(FL,D) = |A|, where A is a largest subset of Z s.t. {A ∩ RB : B ∈ L} = 2^A (i.e., A is shattered).

We show lemmas to restrict the class of subsets of Z that can be shattered:

0) only subsets of D × (−µ, 1 − µ] can be shattered;
1) there is a shattered set of maximal size with only elements of the form (•, 0) or (•, 1 − µ);
2) only sets of elements of the form (t, 1 − µ) with t.T = 1 or (t, 0) with t.T = 0 can be shattered;
3.a) a shattered set containing an element (t, 1 − µ) with t.T = 1 cannot be larger than ⌊log2 d⌋ + 1;
3.b) a shattered set containing an element (t, 0) with t.T = 0 cannot be larger than ⌊log2 d⌋ + 1.

The proofs of 3.a and 3.b follow from the pigeonhole principle.

SLIDE 34

Look at the bound again

Thm: let d be the maximum number of subgroups from L that a transaction of D can support. Then PD(FL,D) ≤ ⌊log2 d⌋ + 1.

Corol: let L be the set of subgroups involving up to c conjunctions of equality conditions, and let z be the number of attributes of D. Then PD(FL,D) ≤ ⌊log2 Σ_{i=1}^{c} (z choose i)⌋ + 1 ≤ z.

SLIDE 35

MiSoSouP, the algorithm

In: D, L, k, ε, δ. Out: with prob. ≥ 1 − δ, an ε-approximation to TOP(D, L, k).

  • 0. Compute µ = (1/|D|) Σ_{t∈D} t.T
  • 1. d ← upper bound to PD(FL,D)
  • 2. s ← (16/ε²)(d + ln(2/δ))
  • 3. S ← uniform random sample of s transactions from D
  • 4. Return {(B, q̃S(B)) : B ∈ L, q̃S(B) ≥ r̃S(k) − ε/2}
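The five steps can be sketched end to end. This is our illustrative reading of the pseudocode, not the paper's implementation: L is passed as an explicit list of predicates over transactions shaped (attributes..., 0/1 target), `pd_upper` is the step-1 bound supplied by the caller, and step 3 samples with replacement.

```python
import random
from math import ceil, log

def misosoup(D, L, k, eps, delta, pd_upper):
    """Sketch of the algorithm above; returns (subgroup, approx-quality)
    pairs whose approximate quality clears the k-th highest minus eps/2."""
    mu = sum(t[-1] for t in D) / len(D)                    # step 0
    s = ceil(16 / eps ** 2 * (pd_upper + log(2 / delta)))  # step 2
    S = random.choices(D, k=s)                             # step 3

    def q_tilde(B):  # approximate quality via the rho reformulation
        return sum((1 - mu if t[-1] == 1 else -mu) if B(t) else 0.0
                   for t in S) / len(S)

    quals = sorted((q_tilde(B) for B in L), reverse=True)
    r_k = quals[min(k, len(quals)) - 1]                    # k-th highest
    return [(B, q_tilde(B)) for B in L
            if q_tilde(B) >= r_k - eps / 2]                # step 4

demo_D = [(1, 1), (2, 1), (2, 0), (1, 0)]           # transactions (A1, T)
demo_L = [lambda t: t[0] == 1, lambda t: t[0] == 2]
result = misosoup(demo_D, demo_L, k=1, eps=0.5, delta=0.1, pd_upper=2)
```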

SLIDE 36

Experimental evaluation

There are experiments. Stuff works well.

SLIDE 37

Outline

✔ 1. Random sampling for data analytics
✔ 2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
✔ 3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 38

Pseudodimension and betweenness centrality

Betweenness centrality: a measure of the importance of vertices/edges in a graph. For each vertex: what fraction of shortest paths goes through it? [R.-Upfal '18]: how to approximate the BC of all nodes through sampling; it obtains sample-dependent error bounds with Rademacher averages. But we can also get sample-agnostic, graph-dependent error bounds using pseudodimension.

SLIDE 39

Pseudodimension and aggregate SQL queries

Setting: very large relational database with multiple tables. Goal: run many aggregate queries, e.g.,

SELECT AVG(INCOME) FROM TAXPAYERS WHERE STATE='RI' AND CITY='PVD' GROUP BY ZIPCODE

How good an approximation do we get if we run the queries on a sample? It depends on the maximal syntactic complexity of the queries we want to run: number, type, and arrangement of selection predicates; number, type, and arrangement of join predicates; presence of GROUP BY clauses, ... This can be measured with pseudodimension (work in progress).
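The idea can be demoed with SQLite: run the same GROUP BY aggregate on the full table and on a uniform sample of its rows. Schema, column names, and data below are invented for the demo.

```python
import random
import sqlite3

random.seed(7)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE taxpayers (zipcode TEXT, income REAL)")
rows = [("0290" + str(random.randint(1, 4)), random.uniform(2e4, 2e5))
        for _ in range(50_000)]
con.executemany("INSERT INTO taxpayers VALUES (?, ?)", rows)
# A 2000-row uniform sample of the table:
con.execute("CREATE TABLE sampled AS "
            "SELECT * FROM taxpayers ORDER BY RANDOM() LIMIT 2000")

query = "SELECT zipcode, AVG(income) FROM {} GROUP BY zipcode"
full = dict(con.execute(query.format("taxpayers")))
approx = dict(con.execute(query.format("sampled")))
# Per-group relative error of the sampled averages:
max_rel_err = max(abs(approx[z] - full[z]) / full[z] for z in full)
```

With roughly 500 sampled rows per group, the sampled averages land within a few percent of the exact ones; how large a sample suffices for a whole query class is what the pseudodimension analysis would quantify.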

SLIDE 40

Recap

✔ 1. Random sampling for data analytics
✔ 2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
✔ 3. MiSoSouP: approximating interesting subgroups with pseudodimension
✔ 4. What else to do with pseudodimension

Thank you! Matteo Riondato

EML: mriondato@amherst.edu TWTR: @teorionda WWW: http://matteo.rionda.to

SLIDE 41

Image Credits

Image on slide 9 from Mohri et al., Foundations of Machine Learning, page 241. Image on slide 17: http://www.publicdomainfiles.com/show_file.php?id=13920173422245, public domain.
