Pseudodimension for Data Analytics
Mateo Riondato
Amherst College
ICERM, May 17, 2019

Takeaway message
High-quality approximations of data mining tasks can be obtained very quickly from small random samples of the dataset. Pseudodimension, a concept from statistical learning theory, can be used to analyze the trade-off between sample size and approximation quality. Originally developed for supervised learning, it can also be used to analyze algorithms for unsupervised, combinatorial problems on graphs, transactional datasets, databases, ...
2 / 41
1 Random sampling for data analytics 2 Sample size vs error trade-off: how Statistical Learning Theory saves the day 3 MiSoSouP: approximating interesting subgroups with pseudodimension 4 What else to do with pseudodimension
3 / 41
Dataset D, |D| = n. Random Sample S, |S| = ℓ ≪ n.
Data Mining Task: for each color c ∈ C, compute the fraction rc(D) of c-colored jelly beans in D:
rc(D) = (1/n) Σ_{j∈D} fc(j), where fc(j) = 1 if j is c-colored, 0 otherwise. Too expensive.
Sampling-based Data Analytics Task: estimate rc(D) with rc(S), for each c ∈ C:
rc(S) = (1/ℓ) Σ_{j∈S} fc(j). Fast.
Acceptable if max_{c∈C} |rc(S) − rc(D)| is small.
Key challenge: tell how small this error is.
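As a minimal sketch of the task above (the toy dataset and function names are mine, not the talk's), estimating each rc(D) with rc(S) on a uniform sample:

```python
import random
from collections import Counter

def sample_color_fractions(dataset, colors, ell, seed=0):
    """Estimate r_c(D) by r_c(S) on a uniform sample S of size ell.

    dataset: list of colors, one entry per jelly bean (toy encoding)."""
    rng = random.Random(seed)
    sample = [rng.choice(dataset) for _ in range(ell)]  # sampling with replacement
    counts = Counter(sample)
    return {c: counts[c] / ell for c in colors}

# Toy dataset: 70% red, 30% blue.
D = ["red"] * 700 + ["blue"] * 300
est = sample_color_fractions(D, ["red", "blue"], ell=200)
# The quantity the talk cares about: the max. estimation error.
err = max(abs(est["red"] - 0.7), abs(est["blue"] - 0.3))
```

The whole point of the next slides is that `err` is not computable without knowing D, so we bound it probabilistically.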
4 / 41
max_{c∈C} |rc(S) − rc(D)| is not computable from S. Let’s get an upper bound ε to it.
Probabilistic upper bound to the max. error: fix a failure probability δ ∈ (0, 1). A value ε ∈ (0, 1) is a probabilistic upper bound to max_{c∈C} |rc(S) − rc(D)| if
Pr( max_{c∈C} |rc(S) − rc(D)| < ε ) ≥ 1 − δ.
The probability is over the samples of size ℓ.
Ingredients to compute ε: δ, C, D or S, and |S| = ℓ: ε = g(δ, C, D or S, ℓ). How do we find such a function g?
5 / 41
✔ 1 Random sampling for data analytics 2 Sample size vs error trade-off: how Statistical Learning Theory saves the day 3 MiSoSouP: approximating interesting subgroups with pseudodimension 4 What else to do with pseudodimension
6 / 41
Theorem (Chernoff bound + Union bound)
Let ε = √( ln(2|C|/δ) / (2ℓ) ). Then
Pr( max_{c∈C} |rc(S) − rc(D)| < ε ) ≥ 1 − δ.
Not a function of D or S! We want ε = g(δ, C, D or S, ℓ); D or S give information on the complexity of approximation through sampling; “ln |C|” is a rough measure of the sample complexity of the task, as it ignores the data.
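A quick way to evaluate this ε (assuming the standard Hoeffding + union bound form ε = √(ln(2|C|/δ)/(2ℓ)); the slide's exact constants may differ):

```python
import math

def union_bound_epsilon(num_colors, ell, delta):
    # epsilon = sqrt(ln(2|C|/delta) / (2*ell)): one Hoeffding bound per
    # color, combined with a union bound over the |C| colors.
    return math.sqrt(math.log(2 * num_colors / delta) / (2 * ell))

eps = union_bound_epsilon(num_colors=50, ell=10_000, delta=0.05)
```

Note how ε grows with ln |C| and shrinks with ℓ, but never looks at the data itself; that is exactly the weakness the next slides address.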
7 / 41
Measures from Statistical Learning Theory replace “ln |C|” with h(C, D) or h(C, S): VC-dimension, pseudodimension, covering numbers, Rademacher averages, ...
Developed for supervised learning, they had a reputation of being only of theoretical interest; we showed they can be used for efficient practical algorithms for data mining problems.
Example: betweenness centrality estimation on a graph G = (V, E):
Union bound: ε = O( √( ln(|V|/δ) / ℓ ) );
VC-dimension [ACM WSDM’14, DMKD’16]: ε = O( √( (log₂ VD(G) + ln(1/δ)) / ℓ ) ), where VD(G) is the vertex diameter;
Exponential reduction on important classes of graphs (small-world, power-law, ...).
8 / 41
Vapnik-Chervonenkis dimension VC(F) of a family F of subsets of X (or of 0–1 functions from X): a combinatorial measure of the richness of F. Originally developed to study generalization error bounds for classification [VC71]. Also picked up by the computational geometry community.
A set X = {x1, . . . , xℓ} ⊆ X is shattered by F if {X ∩ A : A ∈ F} = 2^X.
VC-dimension of F: size of the largest set that can be shattered by F.
9 / 41
X = R², F = axis-aligned rectangles. Shattering a set of four points is easy, but shattering five points is impossible. Proving an upper bound VC(F) < k requires showing that no set of size k can be shattered.
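The four-vs-five-points claim can be checked by brute force, using the fact that a subset is cut out by some axis-aligned rectangle iff the subset's bounding box contains no point outside the subset (a sketch; the point sets are illustrative):

```python
from itertools import combinations

def can_shatter(points):
    """Check whether axis-aligned rectangles shatter `points`."""
    for r in range(1, len(points) + 1):
        for subset in combinations(points, r):
            xs = [p[0] for p in subset]
            ys = [p[1] for p in subset]
            lo_x, hi_x = min(xs), max(xs)
            lo_y, hi_y = min(ys), max(ys)
            # Any rectangle containing `subset` contains its bounding box,
            # so an outside point inside the box kills this subset.
            others = [p for p in points if p not in subset]
            if any(lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y
                   for p in others):
                return False
    return True  # empty subset is always realizable by a far-away rectangle

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # 4 points: shattered
five = diamond + [(0, 0)]                      # center lies in the bounding box
```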
10 / 41
Pseudodimension PD(F) of a family F of real-valued functions from domain X to [a, b]. Combinatorial measure of the richness of F. Originally developed to study generalization error bounds for regression [Pollard84]. Intuition: If the graphs of the f’s in F cross many times, the pseudodimension is high.
11 / 41
A set X = {x1, . . . , xℓ} ⊆ X is (pseudo-)shattered by F if there exist t1, . . . , tℓ ∈ R s.t. the vectors
( sgn(f(x1) − t1), . . . , sgn(f(xℓ) − tℓ) ), for f ∈ F,
take all 2^ℓ values in {−1, 1}^ℓ.
PD(F): size of the largest pseudo-shattered set.
12 / 41
For each f ∈ F, let Rf = {(x, t) : t ≤ f(x)} ⊂ X × [a, b] Define the family of sets F+ = {Rf, f ∈ F} PD(F) = VC(F+)
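A brute-force pseudo-shattering test for tiny families (a sketch: it uses the dichotomy f(x) ≥ t, equivalent to the sgn definition up to the choice of thresholds, and only tries thresholds at attained values since the sign pattern can only change there):

```python
from itertools import product

def is_pseudo_shattered(functions, points):
    """Is there a threshold t_i per point x_i so that the dichotomy
    vectors (1[f(x_i) >= t_i])_i, over f in `functions`, hit all
    2^len(points) patterns?"""
    cands = []
    for x in points:
        vals = sorted({f(x) for f in functions})
        cands.append(vals + [vals[-1] + 1.0])  # one threshold above the max
    target = 2 ** len(points)
    for ts in product(*cands):
        patterns = {tuple(f(x) >= t for x, t in zip(points, ts))
                    for f in functions}
        if len(patterns) == target:
            return True
    return False

# Constant functions can pseudo-shatter one point...
consts = [lambda x, c=c: c for c in (0.0, 1.0)]
one = is_pseudo_shattered(consts, [0])
# ...but never two: a constant's dichotomies at two points are comparable.
two = is_pseudo_shattered(consts, [0, 1])
# Four functions realizing all 0/1 patterns on two points do shatter them.
bits = [lambda x, v=v: v[x] for v in ((0, 0), (0, 1), (1, 0), (1, 1))]
two_ok = is_pseudo_shattered(bits, [0, 1])
```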
13 / 41
The game is always about restricting the class of sets that may be shattered. Two useful general restrictions [R.-Upfal18 (someone must have known before)]: If B ⊆ X × [a, b] is shattered by F+: 1) B may contain at most one element (x, t) for each x ∈ X; 2) B cannot contain any element (x, a) for any x ∈ X.
14 / 41
Theorem [Li et al. ’01]
Let PD(F) ≤ d and ε = O( √( (d + ln(1/δ)) / ℓ ) ). Then
Pr( sup_{f∈F} | (1/ℓ) Σ_{x∈S} f(x) − E[f] | < ε ) ≥ 1 − δ.
If F is finite and d ≪ ln |F|, ε ≪ the one derived with Hoeffding+Union. This theorem works even if F is infinite.
15 / 41
✔ 1 Random sampling for data analytics ✔ 2 Sample size vs error trade-off: how Statistical Learning Theory saves the day 3 MiSoSouP: approximating interesting subgroups with pseudodimension 4 What else to do with pseudodimension
16 / 41
Ingredients (for 4 people):
“Miso makes a soup loaded with flavour that saves you the hassle of making stock.”
(Really: [R.-Vandin ’18]: Mining Interesting Subgroups with Sampling and Pseudodimension)
17 / 41
1 Settings: datasets, subgroups, interestingness measures 2 Approximating subgroups: a sufficient condition 3 The pseudodimension of subgroups
18 / 41
D = { t1, . . . , tn transactions }
t = ( t.A1, t.A2, . . . , t.Az description attributes, t.T target ) ∈ Y1 × · · · × Yz × {0, 1}
E.g., t3 = (blue, 4, false, 1), t4 = (red, 3, true, 1)
Subgroup B = (cond1,1 ∨ · · · ∨ cond1,r1) ∧ · · · ∧ (condq,1 ∨ · · · ∨ condq,rq)
E.g., B = (A1 = blue ∨ A1 = red) ∧ (A2 = 4): t3 supports B, t4 does not.
Language L: candidate subgroups of potential interest to the analyst.
E.g., for L = “conjunctions of two equality conditions”: B ∉ L, but ((A1 = blue) ∧ (A2 = 4)) ∈ L.
19 / 41
Interesting subgroup: subgroup associated with target value (e.g., 1) Examples
Inherently interpretable!
20 / 41
p-Quality of B in D: q(p)_D(B) = gD(B)^p × uD(B)
Generality of a subgroup B in D: gD(B) = |CD(B)| / |D|, where CD(B) = {t ∈ D : t supports B} is the cover of B.
Unusualness of B in D: uD(B) = (1/|CD(B)|) Σ_{t∈CD(B)} t.T − (1/|D|) Σ_{t∈D} t.T, i.e., the target mean of CD(B) minus the target mean µ of D.
p weights generality vs unusualness (usually p ∈ {1/2, 1, 2}); p = 1/2 ⇒ quality of B ∼ z-score of B.
Rest of the talk: p = 1 ⇒ quality qD(B) = gD(B) uD(B).
21 / 41
Dataset:
A1 A2 A3 T
 1  1  1 1
 3  1  1 1
 1  2  1 1
 2  1  1 0
Target mean µ of D: (1/|D|) Σ_{t∈D} t.T = 3/4 = 0.75
Subgroup B = “A1 ≥ 2 ∧ A3 = 1”:
Generality gD(B) = |{t ∈ D : t supports B}| / |D| = 2/4 = 0.5
Unusualness uD(B) = (1/|CD(B)|) Σ_{t∈CD(B)} t.T − µ = 1/2 − 0.75 = −0.25
1-quality: qD(B) = gD(B) uD(B) = 0.5 × (−0.25) = −0.125
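The worked example, as a sketch in code (the transactions are my reconstruction of the slide's table; A2 values are guesses consistent with the stated numbers):

```python
def generality(D, supports):
    cover = [t for t in D if supports(t)]
    return len(cover) / len(D)

def unusualness(D, supports):
    cover = [t for t in D if supports(t)]
    mu = sum(t["T"] for t in D) / len(D)          # target mean of D
    return sum(t["T"] for t in cover) / len(cover) - mu

def quality(D, supports):                          # p = 1: q = g * u
    return generality(D, supports) * unusualness(D, supports)

D = [
    {"A1": 1, "A2": 1, "A3": 1, "T": 1},
    {"A1": 3, "A2": 1, "A3": 1, "T": 1},
    {"A1": 1, "A2": 2, "A3": 1, "T": 1},
    {"A1": 2, "A2": 1, "A3": 1, "T": 0},
]
B = lambda t: t["A1"] >= 2 and t["A3"] == 1  # subgroup "A1 >= 2 AND A3 = 1"
g, u, q = generality(D, B), unusualness(D, B), quality(D, B)
```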
22 / 41
Input: D, L, k ≥ 1.
rD(k): k-th highest quality in D of a subgroup from L.
Output: TOP(D, L, k) = {B ∈ L : qD(B) ≥ rD(k)}
23 / 41
Observations:
Many approaches to obtain an exact solution [Klosgen’92], [Wrobel’97], ... They require processing the entire dataset ⇒ computationally expensive as data grows! Goal: find an approximate solution by processing only a random sample S of the data. Trade-off: sample size / speed vs accuracy / confidence. Qs: What approximation? How to estimate quality? How many transactions in S? What algorithm?
24 / 41
ε-Approximation to TOP(D, L, k): C = {(B, fB) : B ∈ L, fB ∈ [−1, 1]} s.t.: 1 For each B ∈ TOP(D, L, k), there is (B, fB) ∈ C; 2 There is no (B, fB) ∈ C s.t. qD(B) < rD(k) − ε; 3 For each (B, fB) ∈ C, |fB − qD(B)| < ε/4;
25 / 41
(Recall: target mean µ = (1/|D|) Σ_{t∈D} t.T.)
For B ∈ L, t ∈ D:
ρB(t) = 1 − µ if t ∈ CD(B) and t.T = 1; −µ if t ∈ CD(B) and t.T = 0; 0 if t ∉ CD(B).
Lemma: qD(B) = (1/|D|) Σ_{t∈D} ρB(t).
Approximate quality of B ∈ L on a sample S: ˜qS(B) = (1/|S|) Σ_{t∈S} ρB(t).
Lemma: ES[˜qS(B)] = qD(B).
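A sketch of ρB and the first lemma, checked on the same toy transactions as the worked example (attribute values other than A1, A3, T are guesses):

```python
def rho(t, supports, mu):
    # rho_B(t): 1 - mu if t is in the cover and t.T = 1,
    # -mu if in the cover and t.T = 0, and 0 outside the cover.
    if not supports(t):
        return 0.0
    return 1.0 - mu if t["T"] == 1 else -mu

D = [
    {"A1": 1, "A3": 1, "T": 1},
    {"A1": 3, "A3": 1, "T": 1},
    {"A1": 1, "A3": 1, "T": 1},
    {"A1": 2, "A3": 1, "T": 0},
]
B = lambda t: t["A1"] >= 2 and t["A3"] == 1
mu = sum(t["T"] for t in D) / len(D)
# Lemma: averaging rho_B over all of D recovers q_D(B) = g * u.
q_exact = sum(rho(t, B, mu) for t in D) / len(D)
```

Averaging ρB over a uniform sample S instead of D gives ˜qS(B), whose expectation is qD(B) by the same computation.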
26 / 41
✔ 1 Settings: datasets, subgroups, interestingness measures 2 Approximating subgroups: a sufficient condition 3 The pseudodimension of subgroups
27 / 41
˜rS(k): k-th highest approx. quality in S of a B ∈ L.
Thm: If |˜qS(B) − qD(B)| < ε/4 for every B ∈ L, then
C = {(B, ˜qS(B)) : B ∈ L, ˜qS(B) ≥ ˜rS(k) − ε/2}
is an ε-approximation to TOP(D, L, k).
28 / 41
Hyp: |˜ qS(B) − qD(B)| < ε/4 for every B ∈ L Want: 1 For each B ∈ TOP(D, L, k), ˜ qS(B) ≥ ˜ rS(k) − ε/2;✔ 2 For each B ∈ L s.t. qD(B) < rD(k) − ε, ˜ qS(B) < ˜ rS(k) − ε/2;✔ 3 For each (B, ˜ qS(B)) ∈ C, |˜ qS(B) − qD(B)| < ε/4;✔
29 / 41
Probabilistic tail bounds: Hoeffding + Union (baseline)
Thm: Given ε, δ ∈ (0, 1), let S be a sample of s = (16/ε²) ln(2|L|/δ) transactions drawn uniformly at random from D. Then, with probability ≥ 1 − δ,
|˜qS(B) − qD(B)| < ε/4 for every B ∈ L.
Q: How to better characterize the dependency on L?
30 / 41
✔ 1 Settings: datasets, subgroups, interestingness measures ✔ 2 Approximating subgroups: a sufficient condition 3 The pseudodimension of subgroups
31 / 41
FL,D = {ρB : B ∈ L}, X = D, so
max_{f∈FL,D} | (1/|S|) Σ_{t∈S} f(t) − ES[(1/|S|) Σ_{t∈S} f(t)] | = max_{B∈L} |˜qS(B) − qD(B)|.
Missing ingredient: upper bound to PD(FL,D).
Thm: Let d be the max. number of subgroups from L that a transaction of D can support. Then PD(FL,D) ≤ ⌊log₂ d⌋ + 1.
Corol: Let L be the set of subgroups involving up to c conjunctions of equality conditions, and let z be the number of attributes of D. Then PD(FL,D) ≤ ⌊log₂ Σ_{i=1}^{c} (z choose i)⌋ + 1.
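The corollary's bound is easy to evaluate (assuming d = Σ_{i=1}^{c} (z choose i), i.e., a transaction supports at most one subgroup per choice of up to c constrained attributes):

```python
import math

def pd_upper_bound(z, c):
    # A transaction supports at most d = sum_{i=1}^{c} C(z, i) subgroups
    # built from up to c equality conditions, so PD <= floor(log2 d) + 1.
    d = sum(math.comb(z, i) for i in range(1, c + 1))
    return math.floor(math.log2(d)) + 1

bound = pd_upper_bound(z=100, c=2)  # d = 100 + 4950 = 5050
```

Note the bound is logarithmic in d, while the Hoeffding+Union baseline pays ln |L|, which also grows with the number of attribute values.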
32 / 41
Z = D × [−µ, 1 − µ], R = {RB : B ∈ L}, RB = {(t, q) : t ∈ D, q ≤ ρB(t)}.
PD(FL,D) = |A|, where A is a largest A ⊆ Z s.t. {A ∩ RB : B ∈ L} = 2^A (A is shattered).
We show lemmas to restrict the class of subsets of Z that can be shattered:
0) Only subsets of D × (−µ, 1 − µ] can be shattered.
1) There is a shattered set of maximal size with only elements of the form (•, 0) or (•, 1 − µ).
2) Only sets of elements of the form (t, µ) if t.T = 1 or (t, 0) if t.T = 0 can be shattered.
3.a) A shattered set containing an element (t, µ) with t.T = 1 cannot be larger than d.
3.b) A shattered set containing an element (t, 0) with t.T = 0 cannot be larger than d.
The proofs of 3.a and 3.b follow from the pigeonhole principle.
33 / 41
Thm: Let d be the max. number of subgroups from L that a transaction of D can support. Then PD(FL,D) ≤ ⌊log₂ d⌋ + 1.
Corol: Let L be the set of subgroups involving up to c conjunctions of equality conditions, and let z be the number of attributes of D. Then PD(FL,D) ≤ ⌊log₂ Σ_{i=1}^{c} (z choose i)⌋ + 1.
34 / 41
In: D, L, k, ε, δ.
Out: With prob. ≥ 1 − δ, an ε-approximation to TOP(D, L, k).
1. Compute d and let s = min{ |D|, (16/ε²) (d + ln(2/δ)) }.
2. Draw a sample S of s transactions from D, uniformly at random.
3. Output C = {(B, ˜qS(B)) : B ∈ L, ˜qS(B) ≥ ˜rS(k) − ε/2}.
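A sketch of the overall scheme (all names are mine; the constant c0 = 16 mirrors the baseline slide and is an assumption, and `q_tilde` stands in for the sample estimate ˜qS):

```python
import math
import random

def sample_size(d, eps, delta, c0=16.0):
    # s = (c0/eps^2) * (d + ln(2/delta)); c0 is an assumed constant.
    return math.ceil(c0 / eps**2 * (d + math.log(2.0 / delta)))

def misosoup(D, L, k, eps, delta, d, q_tilde, seed=0):
    """Sample, estimate every subgroup's quality on the sample, and keep
    those within eps/2 of the k-th highest sample quality."""
    s = sample_size(d, eps, delta)
    rng = random.Random(seed)
    S = rng.choices(D, k=s) if s < len(D) else list(D)
    quals = {B: q_tilde(B, S) for B in L}
    r_k = sorted(quals.values(), reverse=True)[min(k, len(quals)) - 1]
    return {(B, q) for B, q in quals.items() if q >= r_k - eps / 2}

# Toy instance: subgroup "A=1" is perfectly associated with T=1.
D = [{"A": 1, "T": 1}] * 50 + [{"A": 2, "T": 0}] * 50
mu = sum(t["T"] for t in D) / len(D)  # target mean (assumed precomputed)
preds = {"A=1": lambda t: t["A"] == 1, "A=2": lambda t: t["A"] == 2}

def q_tilde(B, S):  # sample estimate of the quality, via rho_B
    return sum((1 - mu if t["T"] == 1 else -mu) if preds[B](t) else 0.0
               for t in S) / len(S)

out = misosoup(D, ["A=1", "A=2"], k=1, eps=0.2, delta=0.1, d=2, q_tilde=q_tilde)
```

On this tiny D the computed sample size exceeds |D|, so the "sample" is the whole dataset and the output is exact: only ("A=1", 0.25) survives.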
35 / 41
There are experiments. Stuff works well.
36 / 41
✔ 1 Random sampling for data analytics ✔ 2 Sample size vs error trade-off: how Statistical Learning Theory saves the day ✔ 3 MiSoSouP: approximating interesting subgroups with pseudodimension 4 What else to do with pseudodimension
37 / 41
Betweenness Centrality: a measure of importance of vertices/edges in a graph. For each vertex, what is the fraction of shortest paths that go through it? [R.-Upfal ’18]: how to approximate the BC of all nodes through sampling. Obtain sample-dependent error bounds with Rademacher averages. But we can also get sample-agnostic, graph-dependent error bounds using pseudodimension.
38 / 41
Setting: very large relational database with multiple tables. Goal: run many aggregate queries:
SELECT AVG(INCOME) FROM TAXPAYERS WHERE STATE='RI' AND CITY='PVD' GROUP BY ZIPCODE
How good of an approximation do we get if we run the queries on a sample? Depends on the maximal syntactical complexity of the queries we want to run: number, type, and arrangement of selection predicates; number, type, and arrangement of join predicates; presence of GROUP BY clauses, ... Can be measured with pseudodimension (work in progress).
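A toy version of running such a query on a sample versus the full table (the TAXPAYERS rows are hypothetical, and GROUP BY is dropped for brevity):

```python
import random

def avg_income(rows):
    # Mirrors SELECT AVG(INCOME) FROM TAXPAYERS
    #         WHERE STATE='RI' AND CITY='PVD'  (no GROUP BY here).
    sel = [r["INCOME"] for r in rows
           if r["STATE"] == "RI" and r["CITY"] == "PVD"]
    return sum(sel) / len(sel) if sel else None

random.seed(7)
table = ([{"STATE": "RI", "CITY": "PVD", "INCOME": random.gauss(50_000, 10_000)}
          for _ in range(9_000)]
         + [{"STATE": "RI", "CITY": "WWK", "INCOME": random.gauss(60_000, 10_000)}
            for _ in range(1_000)])
sample = random.sample(table, 1_000)   # 10% uniform sample, no replacement
exact, approx = avg_income(table), avg_income(sample)
```

The error of `approx` depends not only on the sample size but also on how rich the class of allowed queries is, which is what the pseudodimension analysis is meant to capture.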
39 / 41
✔ 1 Random sampling for data analytics ✔ 2 Sample size vs error trade-off: how Statistical Learning Theory saves the day ✔ 3 MiSoSouP: approximating interesting subgroups with pseudodimension ✔ 4 What else to do with pseudodimension
EML: mriondato@amherst.edu TWTR: @teorionda WWW: http://matteo.rionda.to
40 / 41
Image on slide 9 from Mohri et al., Foundations of Machine Learning, page 241. Image on slide 17: http://www.publicdomainfiles.com/show_file.php?id=13920173422245, public domain.
41 / 41