  1. Pseudodimension for Data Analytics. Matteo Riondato, Amherst College. ICERM, May 17, 2019.

  2. Takeaway message: High-quality approximations of data mining tasks can be obtained very quickly from small random samples of the dataset. Pseudodimension, a concept from statistical learning theory, can be used to analyze the trade-off between sample size and approximation quality. Pseudodimension was originally developed for supervised learning; we use it to analyze algorithms for unsupervised, combinatorial problems on graphs, transactional datasets, databases, and more.

  3. Outline: (1) Random sampling for data analytics; (2) Sample size vs. error trade-off: how Statistical Learning Theory saves the day; (3) MiSoSouP: approximating interesting subgroups with pseudodimension; (4) What else to do with pseudodimension.

  4. Approximations from a random sample. Dataset D with |D| = n; random sample S of D with |S| = ℓ ≪ n.
     Data mining task: for each color c ∈ C, compute the fraction r_c(D) of c-colored jelly beans in D, i.e., r_c(D) = (1/n) Σ_{j ∈ D} f_c(j), where f_c(j) = 1 if j is c-colored and 0 otherwise. Too expensive.
     Sampling-based data analytics task: estimate r_c(D) with r_c(S) = (1/ℓ) Σ_{j ∈ S} f_c(j), for each c ∈ C. Fast. Acceptable if max_{c ∈ C} |r_c(S) − r_c(D)| is small.
     Key challenge: tell how small this error is.
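A minimal sketch of the jelly-bean estimation in Python (not code from the talk; the function name estimate_fractions and the toy dataset are my own): draw a uniform sample of size ℓ and compare the sampled fractions r_c(S) with the exact r_c(D).

```python
import random
from collections import Counter

def estimate_fractions(D, ell):
    """Return r_c(S) for every color c, from a uniform sample S of size ell."""
    S = random.choices(D, k=ell)          # uniform sample with replacement
    counts = Counter(S)
    return {c: counts[c] / ell for c in set(D)}

# Toy dataset of n = 1,000,000 jelly beans.
D = ["red"] * 600_000 + ["blue"] * 300_000 + ["green"] * 100_000
est = estimate_fractions(D, ell=10_000)
exact = {c: D.count(c) / len(D) for c in set(D)}
# The quantity the talk wants to bound: max_c |r_c(S) - r_c(D)|.
print(max(abs(est[c] - exact[c]) for c in exact))
```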

  5. Error bounds. max_{c ∈ C} |r_c(S) − r_c(D)| is not computable from S alone. Let's get an upper bound ε to it.
     Probabilistic upper bound to the max. error: fix a failure probability δ ∈ (0, 1). A value ε ∈ (0, 1) is a probabilistic upper bound to max_{c ∈ C} |r_c(S) − r_c(D)| if
         Pr( max_{c ∈ C} |r_c(S) − r_c(D)| < ε ) ≥ 1 − δ,
     where the probability is over the samples of size ℓ.
     Ingredients to compute ε: δ, C, D or S, and |S| = ℓ, i.e., ε = g(δ, C, D or S, ℓ). How do we find such a function g?

  6. Outline: ✔ (1) Random sampling for data analytics; (2) Sample size vs. error trade-off: how Statistical Learning Theory saves the day; (3) MiSoSouP: approximating interesting subgroups with pseudodimension; (4) What else to do with pseudodimension.

  7. A classic probabilistic upper bound to the error. Theorem (Chernoff bound + Union bound): let
         ε = √( 3 (ln |C| + ln (2/δ)) / ℓ ).
     Then
         Pr( max_{c ∈ C} |r_c(S) − r_c(D)| < ε ) ≥ 1 − δ.
     Not a function of D or S! We want ε = g(δ, C, D or S, ℓ): D or S give information on the complexity of approximation through sampling, while "ln |C|" is a rough measure of the sample complexity of the task, as it ignores the data.
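Spelled out in Python, the ε from the theorem as reconstructed above looks as follows (the constant 3 is my reading of the garbled slide, so treat the exact value as indicative only):

```python
from math import log, sqrt

def eps_chernoff_union(num_functions, delta, ell):
    """eps = sqrt(3 (ln|C| + ln(2/delta)) / ell) for |C| functions."""
    return sqrt(3 * (log(num_functions) + log(2 / delta)) / ell)

print(eps_chernoff_union(num_functions=100, delta=0.1, ell=100_000))  # ~0.015
```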

  8. Are there better measures of sample complexity? Measures from Statistical Learning Theory replace "ln |C|" with h(C, D) or h(C, S): VC-dimension, pseudodimension, covering numbers, Rademacher averages, ... Developed for supervised learning, they had a reputation of being only of theoretical interest; we showed they can be used for efficient practical algorithms for data mining problems.
     Example: betweenness centrality estimation on a graph G = (V, E):
         Union bound: ε = O( √( (ln |V| + ln (1/δ)) / ℓ ) );
         VC-dimension [ACM WSDM'14, DMKD'16]: ε = O( √( (ln diam(G) + ln (1/δ)) / ℓ ) ).
     Exponential reduction on important classes of graphs (small-world, power-law, ...).
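To see the gap, one can plug numbers into the two bounds (a sketch of mine; the hidden big-O constants are dropped, so the values are only indicative): the union bound pays ln |V| while the VC-dimension bound pays ln diam(G).

```python
from math import log, sqrt

def eps_shape(complexity_term, delta, ell):
    # Common shape of both bounds, without the hidden big-O constants.
    return sqrt((complexity_term + log(1 / delta)) / ell)

n_vertices, diameter, delta, ell = 10**7, 20, 0.1, 10_000
print(eps_shape(log(n_vertices), delta, ell))  # union bound: ~0.043
print(eps_shape(log(diameter), delta, ell))    # VC-dimension bound: ~0.023
```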

  9. VC-dimension. The Vapnik-Chervonenkis dimension VC(F) of a family F of subsets of 𝒳 (or of 0-1 functions on 𝒳) is a combinatorial measure of the richness of F. Originally developed to study generalization error bounds for classification [VC71]; also picked up by the computational geometry community. A set X = {x_1, ..., x_ℓ} ⊆ 𝒳 is shattered by F if {X ∩ A : A ∈ F} = 2^X. The VC-dimension of F is the size of the largest set that can be shattered by F.

  10. VC-dimension example. 𝒳 = R², F = axis-aligned rectangles. Shattering a set of four points is easy, but shattering five points is impossible. [Figure: two plots in the (x, y) plane, one with four points that can be shattered and one with five points that cannot.] Proving a VC-dimension upper bound k requires showing that no set of size k + 1 can be shattered.
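A brute-force check of the definition for this example (my own sketch, not part of the talk): a subset A of a finite point set X can be cut out by an axis-aligned rectangle iff the bounding box of A contains no point of X outside A, so shattering can be tested by enumerating subsets.

```python
from itertools import combinations

def shattered_by_rectangles(X):
    """True iff axis-aligned rectangles shatter the finite point set X."""
    for r in range(1, len(X) + 1):
        for A in combinations(X, r):
            xs = [p[0] for p in A]
            ys = [p[1] for p in A]
            in_box = lambda p: min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)
            # If the bounding box of A traps a point outside A, no rectangle
            # can contain exactly the points of A.
            if any(in_box(p) for p in X if p not in A):
                return False
    return True

print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2)]))          # True: 4 points
print(shattered_by_rectangles([(0, 0), (4, 0), (2, 1), (0, 3), (4, 4)]))  # False: 5 points
```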

  11. Pseudodimension. The pseudodimension PD(F) of a family F of real-valued functions from a domain 𝒳 to [a, b] is a combinatorial measure of the richness of F. Originally developed to study generalization error bounds for regression [Pollard84]. Intuition: if the graphs of the f's in F cross many times, the pseudodimension is high.

  12. Pseudodimension. A set X = {x_1, ..., x_ℓ} ⊆ 𝒳 is (pseudo-)shattered by F if there exist t_1, ..., t_ℓ ∈ R such that
          |{ ( sgn(f(x_1) − t_1), ..., sgn(f(x_ℓ) − t_ℓ) ) : f ∈ F }| = 2^ℓ,
      i.e., all 2^ℓ vectors in {−1, 1}^ℓ are realized. PD(F) is the size of the largest pseudo-shattered set.
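The definition can be checked by brute force for a finite family and a fixed choice of thresholds (a sketch of mine; it uses the two-valued sign, f(x_i) > t_i vs. f(x_i) ≤ t_i, and only tests the given thresholds rather than searching over all of them):

```python
def pseudo_shatters(F, xs, ts):
    """True iff all 2^len(xs) sign patterns of (f(x_i) - t_i) appear over f in F."""
    patterns = {tuple(f(x) > t for x, t in zip(xs, ts)) for f in F}
    return len(patterns) == 2 ** len(xs)

# Toy example: four functions on [0, 1] realize all four patterns on two points.
F = [lambda x: 0.0, lambda x: 1.0, lambda x: x, lambda x: 1.0 - x]
print(pseudo_shatters(F, xs=[0.2, 0.8], ts=[0.5, 0.5]))  # True
```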

  13. Pseudodimension as VC-dimension. For each f ∈ F, let R_f = {(x, t) : t ≤ f(x)} ⊂ 𝒳 × [a, b]. Define the family of sets F⁺ = {R_f : f ∈ F}. Then PD(F) = VC(F⁺).

  14. Proving upper bounds to the pseudodimension. The game is always about restricting the class of sets that may be shattered. Two useful general restrictions [R.-Upfal '18 (someone must have known before)]: if B ⊆ 𝒳 × [a, b] is shattered by F⁺, then (1) B may contain at most one element (x, t) for each x ∈ 𝒳; and (2) B cannot contain any element (x, a) for any x ∈ 𝒳.

  15. Pseudodimension and sampling. Theorem [Li et al. '01]: let PD(F) ≤ d and
          ε = O( √( (d + ln (1/δ)) / ℓ ) ).
      Then
          Pr( max_{f ∈ F} | (1/ℓ) Σ_{x ∈ S} f(x) − E[ (1/ℓ) Σ_{x ∈ S} f(x) ] | < ε ) ≥ 1 − δ.
      If F is finite and d ≪ ln |F|, this ε is much smaller than the one derived with Hoeffding + Union bound. The theorem works even if F is infinite.
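In code form, the shape of this bound and its inversion into a sample size look as follows (a sketch under the assumption that the hidden big-O constant is 1, so these numbers are indicative, not guarantees):

```python
from math import ceil, log, sqrt

def eps_from_pseudodim(d, delta, ell):
    # eps ~ sqrt((d + ln(1/delta)) / ell), dropping the big-O constant.
    return sqrt((d + log(1 / delta)) / ell)

def sample_size(d, delta, eps):
    # Invert the bound to get the ell needed for a target eps.
    return ceil((d + log(1 / delta)) / eps ** 2)

print(eps_from_pseudodim(d=5, delta=0.1, ell=100_000))  # ~0.0085
print(sample_size(d=5, delta=0.1, eps=0.01))            # ~73,000
```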

  16. Outline: ✔ (1) Random sampling for data analytics; ✔ (2) Sample size vs. error trade-off: how Statistical Learning Theory saves the day; (3) MiSoSouP: approximating interesting subgroups with pseudodimension; (4) What else to do with pseudodimension.

  17. Making miso soup. Ingredients (for 4 people): 2 teaspoons dashi granules; 4 cups water; 3 tablespoons miso paste; 1 (8 ounce) package silken tofu, diced; 2 green onions, sliced diagonally into 1/2 inch pieces. "Miso makes a soup loaded with flavour that saves you the hassle of making stock." Y. Ottolenghi (world-class chef). (Really: [R.-Vandin '18]: Mining Interesting Subgroups with Sampling and Pseudodimension.)

  18. Section outline: (1) Settings: datasets, subgroups, interestingness measures; (2) Approximating subgroups: a sufficient condition; (3) The pseudodimension of subgroups.

  19. Subgroups. D = {t_1, ..., t_n}, where each transaction t = (t.A_1, t.A_2, ..., t.A_z, t.T) ∈ Y_1 × ··· × Y_z × {0, 1}; the t.A_i are the description attributes and t.T is the target. E.g., t_3 = (blue, 4, false, 1), t_4 = (red, 3, true, 1).
      A subgroup is a conjunction of disjunctions of conditions on the attributes: B = (cond_{1,1} ∨ ··· ∨ cond_{1,r_1}) ∧ ··· ∧ (cond_{q,1} ∨ ··· ∨ cond_{q,r_q}). E.g., for B = (A_1 = blue ∨ A_1 = red) ∧ (A_2 = 4), t_3 supports B and t_4 does not.
      Language L: candidate subgroups of potential interest to the analyst. E.g., for L = "conjunctions of two equality conditions", B ∉ L, but ((A_1 = blue) ∧ (A_2 = 4)) ∈ L.
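A minimal sketch of the subgroup syntax and the support check (not the paper's code; the encoding of a subgroup as a list of clauses of (attribute, value) pairs is my own choice):

```python
def supports(t, subgroup):
    """t: dict attribute -> value; subgroup: conjunction of clauses, where each
    clause is a list of (attribute, value) equality conditions joined by OR."""
    return all(any(t[attr] == val for attr, val in clause) for clause in subgroup)

# B = (A1 = blue OR A1 = red) AND (A2 = 4), as on the slide.
B = [[("A1", "blue"), ("A1", "red")], [("A2", 4)]]
t3 = {"A1": "blue", "A2": 4, "A3": False, "T": 1}
t4 = {"A1": "red", "A2": 3, "A3": True, "T": 1}
print(supports(t3, B), supports(t4, B))  # True False
```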

  20. Mining interesting subgroups. An interesting subgroup is one associated with the target value (e.g., 1). Examples: social networks: attributes = user features, target = interest in a topic; biomedicine: attributes = mutations, target = response to therapy; classification: attributes = features, target = test label XOR prediction. Inherently interpretable!

  21. Subgroup quality measures. Cover of B in D: C_D(B) = {t ∈ D : t supports B}.
      Generality of a subgroup B in D: g_D(B) = |{t ∈ D : t supports B}| / |D|.
      Unusualness of B in D: u_D(B) = (1/|C_D(B)|) Σ_{t ∈ C_D(B)} t.T − (1/|D|) Σ_{t ∈ D} t.T, i.e., the target mean of C_D(B) minus the target mean µ of D.
      p-quality of B in D: q^(p)_D(B) = g_D(B)^p × u_D(B), where p weights generality vs. unusualness (usually p ∈ {1/2, 1, 2}); p = 1/2 makes the quality of B behave like the z-score of B. Rest of the talk: p = 1, with the quality written q_D(B).
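A sketch of these measures in Python, with the subgroup passed as a predicate over transactions so that arbitrary conditions (equalities, inequalities) are allowed; the function name and the convention that an empty cover gets quality 0 are my assumptions, not the paper's:

```python
def p_quality(D, pred, p=1.0):
    """q^(p)_D(B) = g_D(B)^p * u_D(B) for the subgroup B defined by pred."""
    cover = [t for t in D if pred(t)]                 # C_D(B)
    if not cover:
        return 0.0                                    # assumption: empty cover -> 0
    g = len(cover) / len(D)                           # generality g_D(B)
    mu = sum(t["T"] for t in D) / len(D)              # target mean of D
    u = sum(t["T"] for t in cover) / len(cover) - mu  # unusualness u_D(B)
    return (g ** p) * u
```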

  22. Example of subgroup quality measures. Dataset D:
          A_1  A_2  A_3  T
           1    1    0   1
           3    1    1   0
           1    1    0   1
           2    0    1   1
      Target mean of D: µ = (1/|D|) Σ_{t ∈ D} t.T = 3/4 = 0.75.
      Subgroup B = "A_1 ≥ 2 ∧ A_3 = 1":
          generality g_D(B) = |{t ∈ D : t supports B}| / |D| = 2/4 = 0.5;
          unusualness u_D(B) = (1/|C_D(B)|) Σ_{t ∈ C_D(B)} t.T − µ = 1/2 − 0.75 = −0.25;
          1-quality q_D(B) = g_D(B) × u_D(B) = 0.5 × (−0.25) = −0.125.
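Running the p_quality sketch from above on this exact dataset reproduces the slide's numbers:

```python
D = [
    {"A1": 1, "A2": 1, "A3": 0, "T": 1},
    {"A1": 3, "A2": 1, "A3": 1, "T": 0},
    {"A1": 1, "A2": 1, "A3": 0, "T": 1},
    {"A1": 2, "A2": 0, "A3": 1, "T": 1},
]
B = lambda t: t["A1"] >= 2 and t["A3"] == 1   # "A1 >= 2 AND A3 = 1"
print(p_quality(D, B, p=1))                   # -0.125, as on the slide
```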

  23. The top-k subgroup mining task. Input: D, L, k ≥ 1. Let r_D(k) be the k-th highest quality in D of a subgroup from L. Output: TOP(D, L, k) = {B ∈ L : q_D(B) ≥ r_D(k)}. [Figure: quality axis from −1 to 1 with r_D(k) marked; the output is the subgroups whose quality falls at or above r_D(k).]
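The exact task, which sampling is meant to avoid running on the full dataset, is short to write once the qualities are available; a sketch of mine reusing p_quality, with the language L passed as a dict from subgroup names to predicates (assumes |L| ≥ k):

```python
def top_k_subgroups(D, L, k, p=1.0):
    """TOP(D, L, k): all subgroups whose quality is at least the k-th highest."""
    quals = {name: p_quality(D, pred, p) for name, pred in L.items()}
    r_k = sorted(quals.values(), reverse=True)[k - 1]   # r_D(k)
    return {name for name, q in quals.items() if q >= r_k}
```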
