Pseudodimension for Data Analytics, Matteo Riondato, Amherst College (PowerPoint PPT presentation)


SLIDE 1

Pseudodimension for Data Analytics

Matteo Riondato

Amherst College

ICERM — May 17, 2019

SLIDE 2

Takeaway message

High-quality approximations of data mining tasks can be obtained very quickly from small random samples of the dataset. Pseudodimension, a concept from statistical learning theory, can be used to analyze the trade-off between sample size and approximation quality. Although it was originally developed for supervised learning, we use it to analyze algorithms for unsupervised, combinatorial problems on graphs, transactional datasets, databases, ...

SLIDE 3

Outline

1. Random sampling for data analytics
2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 4

Approximations from a random sample

Dataset D, |D| = n. Random sample S, |S| = ℓ ≪ n.

Data mining task: for each color c ∈ C, compute the fraction rc(D) of c-colored jelly beans in D:

rc(D) = (1/n) Σ_{j∈D} fc(j), where fc(j) = 1 if j is c-colored, 0 otherwise.

Too expensive.

Sampling-based data analytics task: estimate rc(D) with rc(S), for each c ∈ C:

rc(S) = (1/ℓ) Σ_{j∈S} fc(j).

Fast. Acceptable if max_{c∈C} |rc(S) − rc(D)| is small.

Key challenge: tell how small this error is.
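The jelly-bean task above is easy to sketch in code. A toy version (dataset, colors, and sizes are invented for illustration):

```python
import random
from collections import Counter

# Toy version of the jelly-bean task; dataset, colors, and sizes
# are invented for illustration.
random.seed(42)
colors = ["red", "green", "blue", "yellow"]
D = [random.choice(colors) for _ in range(100_000)]   # dataset, |D| = n
S = random.sample(D, 1_000)                           # sample, |S| = ell << n

def fractions(beans):
    """r_c over the given collection, for every color c."""
    counts = Counter(beans)
    return {c: counts[c] / len(beans) for c in colors}

r_D, r_S = fractions(D), fractions(S)
# The quantity the talk wants to bound -- computable here only because
# the toy D is small enough to scan in full:
max_err = max(abs(r_S[c] - r_D[c]) for c in colors)
```

On real data we cannot scan D, which is exactly why the rest of the talk bounds `max_err` without computing it.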

SLIDE 5

Error bounds

max_{c∈C} |rc(S) − rc(D)| is not computable from S. Let's get an upper bound ε to it.

Probabilistic upper bound to the max. error: fix a failure probability δ ∈ (0, 1). A value ε ∈ (0, 1) is a probabilistic upper bound to max_{c∈C} |rc(S) − rc(D)| if

Pr[ max_{c∈C} |rc(S) − rc(D)| < ε ] ≥ 1 − δ.

The probability is over the samples of size ℓ. Ingredients to compute ε: δ, C, D or S, and |S| = ℓ:

ε = g(δ, C, D or S, ℓ)

How do we find such a function g?

SLIDE 6

Outline

✔ 1. Random sampling for data analytics
2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 7

A classic probabilistic upper bound to the error

Theorem (Chernoff bound + union bound): let

ε = √( 3 (ln |C| + ln(2/δ)) / ℓ ).

Then

Pr[ max_{c∈C} |rc(S) − rc(D)| < ε ] ≥ 1 − δ.

Not a function of D or S! We want ε = g(δ, C, D or S, ℓ); D or S give information on the complexity of approximation through sampling; "ln |C|" is a rough measure of the sample complexity of the task, as it ignores the data.
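A minimal calculator for this bound, assuming the form ε = √(3 (ln|C| + ln(2/δ)) / ℓ) (the function name is ours):

```python
from math import log, sqrt

def union_bound_eps(num_colors: int, ell: int, delta: float) -> float:
    """eps = sqrt(3 * (ln|C| + ln(2/delta)) / ell).
    Note the inputs: only |C|, ell, and delta -- never D or S."""
    return sqrt(3 * (log(num_colors) + log(2 / delta)) / ell)

eps_small = union_bound_eps(100, 10_000, 0.1)
eps_large = union_bound_eps(100, 40_000, 0.1)   # 4x the sample halves eps
```

The 1/√ℓ decay means quadrupling the sample only halves the error bound, while |C| enters only logarithmically.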

SLIDE 8

Are there better measures of sample complexity?

Measures from Statistical Learning Theory replace "ln |C|" with h(C, D) or h(C, S): VC-dimension, pseudodimension, covering numbers, Rademacher averages, ... Developed for supervised learning, they had a reputation of being only of theoretical interest; we showed they can be used for efficient practical algorithms for data mining problems.

Example: betweenness centrality estimation on a graph G = (V, E):

Union bound: ε = O( √( (ln |V| + ln(1/δ)) / ℓ ) );

VC-dimension [ACM WSDM'14, DMKD'16]: ε = O( √( (ln diam(G) + ln(1/δ)) / ℓ ) );

Exponential reduction on important classes of graphs (small-world, power-law, ...).

SLIDE 9

VC-dimension

The Vapnik-Chervonenkis dimension VC(F) of a family F of subsets of X (equivalently, of 0-1 functions on X) is a combinatorial measure of the richness of F. Originally developed to study generalization error bounds for classification [VC71]; also picked up by the computational geometry community. A set X = {x1, . . . , xℓ} ⊆ X is shattered by F if {X ∩ A : A ∈ F} = 2^X. VC-dimension of F: size of the largest set that can be shattered by F.

SLIDE 10

VC-dimension example

X = ℝ², F = axis-aligned rectangles. Shattering a set of four points is easy, but shattering five points is impossible. (Figure: four points shattered by rectangles; five points where no rectangle isolates the required subset.) Proving an upper bound of k on the VC-dimension requires showing that no set of size k + 1 can be shattered.
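The rectangle example can be checked by brute force. A sketch (all names ours): a subset is realizable iff the bounding box of the subset contains no other point of the set, so testing bounding boxes suffices.

```python
from itertools import combinations

def rect_shatters(points):
    """Brute-force check that axis-aligned rectangles shatter the given
    2-D point set: every nonempty subset must be exactly the set of
    points inside its own bounding box."""
    pts = list(points)
    for r in range(1, len(pts) + 1):   # empty set: any tiny rectangle works
        for A in combinations(pts, r):
            xs = [p[0] for p in A]
            ys = [p[1] for p in A]
            inside = {p for p in pts
                      if min(xs) <= p[0] <= max(xs)
                      and min(ys) <= p[1] <= max(ys)}
            if inside != set(A):
                return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # four points: shattered
with_center = diamond + [(0, 0)]               # this five-point set is not
# (proving VC-dim <= 4 needs the same failure for *every* five-point set)
```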

SLIDE 11

Pseudodimension

Pseudodimension PD(F) of a family F of real-valued functions from domain X to [a, b]. Combinatorial measure of the richness of F. Originally developed to study generalization error bounds for regression [Pollard84]. Intuition: If the graphs of the f’s in F cross many times, the pseudodimension is high.

SLIDE 12

Pseudodimension

A set X = {x1, . . . , xℓ} ⊆ X is (pseudo-)shattered by F if there exist t1, . . . , tℓ ∈ ℝ such that

|{ (sgn(f(x1) − t1), . . . , sgn(f(xℓ) − tℓ)) : f ∈ F }| = 2^ℓ

(the vectors live in {−1, 1}^ℓ).

PD(F): size of the largest pseudo-shattered set.
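For a finite family F the definition can be checked by brute force. A sketch (names and the toy family of affine functions are ours; we map sgn(0) to −1, one of the usual conventions):

```python
def is_pseudo_shattered(X, F, thresholds):
    """Check whether the witnesses t_1..t_ell pseudo-shatter X under the
    finite family F: the sign patterns (sgn(f(x_i) - t_i))_i over f in F
    must hit all 2^|X| vectors in {-1, 1}^|X|."""
    patterns = {tuple(1 if f(x) > t else -1 for x, t in zip(X, thresholds))
                for f in F}
    return len(patterns) == 2 ** len(X)

# Four affine functions pseudo-shatter X = {-1, 1} with thresholds (0, 0):
F = [lambda x, a=a, b=b: a * x + b
     for a, b in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
shattered = is_pseudo_shattered((-1.0, 1.0), F, (0.0, 0.0))
```

With only the first two functions, only two of the four sign patterns appear, so the same witnesses no longer pseudo-shatter X.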

SLIDE 13

Pseudodimension as VC-dimension

For each f ∈ F, let Rf = {(x, t) : t ≤ f(x)} ⊂ X × [a, b]. Define the family of sets F⁺ = {Rf : f ∈ F}. Then PD(F) = VC(F⁺).

SLIDE 14

Proving upper bounds to pseudodimension

The game is always about restricting the class of sets that may be shattered. Two useful general restrictions [R.-Upfal '18 (someone must have known before)]: if B ⊆ X × [a, b] is shattered by F⁺, then 1) B may contain at most one element (x, t) for each x ∈ X; 2) B cannot contain any element (x, a) for any x ∈ X.

SLIDE 15

Pseudodimension and sampling

Theorem [Li et al. '01]: let PD(F) ≤ d and

ε = O( √( (d + ln(2/δ)) / s ) ).

Then

Pr[ max_{f∈F} | (1/s) Σ_{x∈S} f(x) − E_S[ (1/s) Σ_{x∈S} f(x) ] | < ε ] ≥ 1 − δ.

If F is finite and d ≪ ln |F|, this ε is much smaller than the one derived with Hoeffding + union bound. The theorem works even if F is infinite.

SLIDE 16

Outline

✔ 1. Random sampling for data analytics
✔ 2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 17

Making miso soup

Ingredients (for 4 people):

  • 2 teaspoons dashi granules
  • 4 cups water
  • 3 tablespoons miso paste
  • 1 (8 ounce) package silken tofu, diced
  • 2 green onions, sliced diagonally into 1/2 inch pieces

“Miso makes a soup loaded with flavour that saves you the hassle of making stock.”

  • Y. Ottolenghi (world-class chef)

(Really: [R.-Vandin '18]: MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension)

SLIDE 18

Section outline

1. Settings: datasets, subgroups, interestingness measures
2. Approximating subgroups: a sufficient condition
3. The pseudodimension of subgroups

SLIDE 19

Subgroups

D = {t1, . . . , tn} (transactions); t = (t.A1, t.A2, . . . , t.Az, t.T) ∈ Y1 × · · · × Yz × {0, 1}, where the t.Ai are the description attributes and t.T is the target. E.g., t3 = (blue, 4, false, 1), t4 = (red, 3, true, 1).

Subgroup B = (cond1,1 ∨ · · · ∨ cond1,r1) ∧ · · · ∧ (condq,1 ∨ · · · ∨ condq,rq). E.g., B = (A1 = blue ∨ A1 = red) ∧ (A2 = 4): t3 supports B, t4 does not.

Language L: candidate subgroups of potential interest to the analyst. E.g., for L = "conjunctions of two equality conditions", B ∉ L, but ((A1 = blue) ∧ (A2 = 4)) ∈ L.
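One possible encoding of subgroups in code (the encoding is our illustration, not the paper's): a list of clauses, each clause a list of (attribute-index, value) equality conditions; clauses are ANDed, conditions within a clause ORed.

```python
def supports(t, subgroup):
    """True iff transaction t satisfies every clause of the subgroup
    (clauses ANDed, equality conditions within a clause ORed)."""
    return all(any(t[attr] == val for attr, val in clause)
               for clause in subgroup)

# B = (A1 = blue OR A1 = red) AND (A2 = 4); A1, A2 at indices 0, 1
B = [[(0, "blue"), (0, "red")], [(1, 4)]]
t3 = ("blue", 4, False, 1)   # supports B
t4 = ("red", 3, True, 1)     # A2 = 3, so it fails the second clause
```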

SLIDE 20

Mining Interesting Subgroups

Interesting subgroup: subgroup associated with the target value (e.g., 1). Examples:

  • social networks: attributes = user features, target = interest in a topic
  • biomedicine: attributes = mutations, target = response to therapy
  • classification: attributes = features, target = test label XOR prediction

Inherently interpretable!

SLIDE 21

Subgroup quality measures

p-Quality of B in D: q(p)_D(B) = gD(B)^p × uD(B).

Generality of B in D: gD(B) = |CD(B)| / |D|, where CD(B) = {t ∈ D : t supports B} is the cover of B.

Unusualness of B in D: uD(B) = (1/|CD(B)|) Σ_{t∈CD(B)} t.T − (1/|D|) Σ_{t∈D} t.T, i.e., the target mean of CD(B) minus the target mean µ of D.

p weights generality vs. unusualness (usually p ∈ {1/2, 1, 2}); p = 1/2 ⇒ quality of B ∼ z-score of B. Rest of the talk: p = 1 ⇒ quality qD(B).
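The three measures are direct to compute. A sketch over transactions shaped as (attributes..., 0/1 target), with `cover` standing in for "t supports B" (the dataset values are illustrative):

```python
def generality(D, cover):
    """g_D(B): fraction of transactions in the cover."""
    return sum(1 for t in D if cover(t)) / len(D)

def unusualness(D, cover):
    """u_D(B): target mean of the cover minus target mean of D."""
    C = [t for t in D if cover(t)]
    mu = sum(t[-1] for t in D) / len(D)
    return sum(t[-1] for t in C) / len(C) - mu

def quality(D, cover, p=1):
    """q^(p)_D(B) = g_D(B)^p * u_D(B)."""
    return generality(D, cover) ** p * unusualness(D, cover)

D = [(1, 1, 1, 1), (3, 1, 1, 1), (1, 1, 1, 1), (2, 1, 1, 0)]
B = lambda t: t[0] >= 2 and t[2] == 1   # "A1 >= 2 AND A3 = 1"
q = quality(D, B)                       # 0.5 * (0.5 - 0.75) = -0.125
```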

SLIDE 22

Example of subgroup quality measures

Dataset (attributes A1, A2, A3, target T):

A1 A2 A3 T
1  1  1  1
3  1  1  1
1  1  1  1
2  1  1  0

Target mean µ of D: (1/|D|) Σ_{t∈D} t.T = 3/4 = 0.75.

Subgroup B = "A1 ≥ 2 ∧ A3 = 1":

Generality gD(B) = |{t ∈ D : t supports B}| / |D| = 2/4 = 0.5.

Unusualness uD(B) = (1/|CD(B)|) Σ_{t∈CD(B)} t.T − µ = 1/2 − 0.75 = −0.25.

1-quality: qD(B) = gD(B) × uD(B) = 0.5 × (−0.25) = −0.125.

SLIDE 23

The top-k subgroup mining task

Input: D, L, k ≥ 1. rD(k): the k-th highest quality in D of a subgroup from L. Output: TOP(D, L, k) = {B ∈ L : qD(B) ≥ rD(k)}.
SLIDE 24

Exact solutions and approximations

Observations:

  • the number of subgroups may be exponential in the number of attributes
  • the quality function is not anti-monotonic
  • the quality function is not a mean

Many approaches obtain an exact solution [Klosgen'92], [Wrobel'97], ... They require processing the entire dataset ⇒ computationally expensive as data grows!

Goal: find an approximate solution by processing only a random sample S of the data. Trade-off: sample size / speed vs. accuracy / confidence. Qs: What approximation? How to estimate quality? How many transactions in S? What algorithm?

SLIDE 25

ε-approximation

ε-Approximation to TOP(D, L, k): a set C = {(B, fB) : B ∈ L, fB ∈ [−1, 1]} s.t.:

1. for each B ∈ TOP(D, L, k), there is (B, fB) ∈ C;
2. there is no (B, fB) ∈ C s.t. qD(B) < rD(k) − ε;
3. for each (B, fB) ∈ C, |fB − qD(B)| < ε/4.

SLIDE 26

A novel formulation of quality

(Recall: target mean µ = (1/|D|) Σ_{z∈D} z.T.)

For B ∈ L and t ∈ D:

ρB(t) = 1 − µ if t ∈ CD(B) and t.T = 1; −µ if t ∈ CD(B) and t.T = 0; 0 otherwise.

Lemma: qD(B) = (1/|D|) Σ_{t∈D} ρB(t).

Approximate quality of B ∈ L on a sample S: q̃S(B) = (1/|S|) Σ_{t∈S} ρB(t).

Lemma: ES[q̃S(B)] = qD(B).
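A quick numerical check of the first lemma on a toy dataset (the dataset and the `cover` predicate are our illustration): the average of ρB over D equals gD(B) × uD(B).

```python
# Transactions shaped (attributes..., 0/1 target); cover plays "t supports B".
D = [(1, 1, 1, 1), (3, 1, 1, 1), (1, 1, 1, 1), (2, 1, 1, 0)]
cover = lambda t: t[0] >= 2 and t[2] == 1
mu = sum(t[-1] for t in D) / len(D)       # target mean of D

def rho(t):
    """rho_B(t): 1 - mu, -mu, or 0, per the reformulation above."""
    if not cover(t):
        return 0.0
    return 1 - mu if t[-1] == 1 else -mu

avg_rho = sum(rho(t) for t in D) / len(D)

# Direct computation of q_D(B) = g_D(B) * u_D(B) for comparison:
C = [t for t in D if cover(t)]
q_direct = (len(C) / len(D)) * (sum(t[-1] for t in C) / len(C) - mu)
```

Both evaluate to −0.125 here, matching the lemma.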

SLIDE 27

Section outline

✔ 1. Settings: datasets, subgroups, interestingness measures
2. Approximating subgroups: a sufficient condition
3. The pseudodimension of subgroups

SLIDE 28

Sufficient condition for ε-approximation

r̃S(k): k-th highest approximate quality in S of a B ∈ L.

Thm: if |q̃S(B) − qD(B)| < ε/4 for every B ∈ L, then

{ (B, q̃S(B)) : B ∈ L, q̃S(B) > r̃S(k) − ε/2 }

is an ε-approximation to TOP(D, L, k).

SLIDE 29

Sufficient condition for ε-approximation

Hyp: |q̃S(B) − qD(B)| < ε/4 for every B ∈ L. Want:

1. for each B ∈ TOP(D, L, k), q̃S(B) ≥ r̃S(k) − ε/2; ✔
2. for each B ∈ L s.t. qD(B) < rD(k) − ε, q̃S(B) < r̃S(k) − ε/2; ✔
3. for each (B, q̃S(B)) ∈ C, |q̃S(B) − qD(B)| < ε/4. ✔

SLIDE 30

Random samples and approximations

Probabilistic tail bounds: Hoeffding + union bound (baseline).

Thm: given ε, δ ∈ (0, 1), let S be a sample of s = (16/ε²)(ln |L| + ln(2/δ)) transactions from D. Then, with probability ≥ 1 − δ over the choice of S, |q̃S(B) − qD(B)| < ε/4 for every B ∈ L.

Q: How to better characterize the dependency on L?
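A calculator for the baseline sample size (the function name is ours):

```python
from math import ceil, log

def hoeffding_union_size(num_subgroups: int, eps: float, delta: float) -> int:
    """s = (16/eps^2)(ln|L| + ln(2/delta)) from the baseline theorem.
    Depends on |L|, not on the data."""
    return ceil(16 / eps ** 2 * (log(num_subgroups) + log(2 / delta)))

s_small = hoeffding_union_size(10 ** 3, 0.05, 0.1)
s_large = hoeffding_union_size(10 ** 6, 0.05, 0.1)
# 1000x more subgroups less than doubles the required sample...
# ...but |L| itself can be exponential in the number of attributes,
# which is what the pseudodimension bound improves on.
```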

SLIDE 31

Section outline

✔ 1. Settings: datasets, subgroups, interestingness measures
✔ 2. Approximating subgroups: a sufficient condition
3. The pseudodimension of subgroups

SLIDE 32

Pseudodimension for subgroups

FL,D = {ρB : B ∈ L}, X = D, and

| (1/s) Σ_{x∈S} f(x) − ES[ (1/s) Σ_{x∈S} f(x) ] | = |q̃S(B) − qD(B)|.

Missing ingredient: an upper bound to PD(FL,D).

Thm: let d be the maximum number of subgroups from L that a transaction of D can support. Then PD(FL,D) ≤ ⌊log2 d⌋ + 1.

Corol: let L be the set of subgroups involving up to c conjunctions of equality conditions, and let z be the number of attributes of D. Then PD(FL,D) ≤ ⌊log2 Σ_{i=1}^{c} (z choose i)⌋ + 1 ≤ z.
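The corollary's bound is a one-liner (the function name is ours):

```python
from math import comb, floor, log2

def pd_bound(z: int, c: int) -> int:
    """Corollary above: a transaction supports at most
    d = sum_{i=1..c} C(z, i) subgroups with up to c equality conditions
    over z attributes, so PD <= floor(log2 d) + 1 <= z."""
    d = sum(comb(z, i) for i in range(1, c + 1))
    return floor(log2(d)) + 1
```

For example, with z = 20 attributes and up to c = 3 conditions, d = 20 + 190 + 1140 = 1350 and the bound is 11, far below both z = 20 and ln |L|.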

SLIDE 33

Proof “intuition”

Z = D × [−µ, 1 − µ], R = {RB : B ∈ L}, RB = {(t, q) : t ∈ D, q ≤ ρB(t)}.

PD(FL,D) = |A|, where A is a largest subset of Z s.t. {A ∩ RB : B ∈ L} = 2^A (i.e., A is shattered).

We show lemmas to restrict the class of subsets of Z that can be shattered:

0) only subsets of D × (−µ, 1 − µ] can be shattered;
1) there is a shattered set of maximal size with only elements of the form (•, 0) or (•, 1 − µ);
2) only sets of elements of the form (t, 1 − µ) with t.T = 1 or (t, 0) with t.T = 0 can be shattered;
3.a) a shattered set containing an element (t, 1 − µ) with t.T = 1 cannot be larger than ⌊log2 d⌋ + 1;
3.b) a shattered set containing an element (t, 0) with t.T = 0 cannot be larger than ⌊log2 d⌋ + 1.

The proofs of 3.a and 3.b follow from the pigeonhole principle.

SLIDE 34

Look at the bound again

Thm: let d be the maximum number of subgroups from L that a transaction of D can support. Then PD(FL,D) ≤ ⌊log2 d⌋ + 1.

Corol: let L be the set of subgroups involving up to c conjunctions of equality conditions, and let z be the number of attributes of D. Then PD(FL,D) ≤ ⌊log2 Σ_{i=1}^{c} (z choose i)⌋ + 1 ≤ z.

SLIDE 35

MiSoSouP, the algorithm

In: D, L, k, ε, δ. Out: with prob. ≥ 1 − δ, an ε-approximation to TOP(D, L, k).

  • 0. Compute µ = (1/|D|) Σ_{t∈D} t.T
  • 1. d ← upper bound to PD(FL,D)
  • 2. s ← (16/ε²)(d + ln(2/δ))
  • 3. S ← uniform random sample of s transactions from D
  • 4. Return {(B, q̃S(B)) : B ∈ L, q̃S(B) ≥ r̃S(k) − ε/2}
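The five steps can be sketched end to end. This is our illustrative reading of the pseudocode, not the paper's implementation: L is passed as an explicit list of predicates over transactions shaped (attributes..., 0/1 target), `pd_upper` is the step-1 bound supplied by the caller, and step 3 samples with replacement.

```python
import random
from math import ceil, log

def misosoup(D, L, k, eps, delta, pd_upper):
    """Sketch of the algorithm above; returns (subgroup, approx-quality)
    pairs whose approximate quality clears the k-th highest minus eps/2."""
    mu = sum(t[-1] for t in D) / len(D)                    # step 0
    s = ceil(16 / eps ** 2 * (pd_upper + log(2 / delta)))  # step 2
    S = random.choices(D, k=s)                             # step 3

    def q_tilde(B):  # approximate quality via the rho reformulation
        return sum((1 - mu if t[-1] == 1 else -mu) if B(t) else 0.0
                   for t in S) / len(S)

    quals = sorted((q_tilde(B) for B in L), reverse=True)
    r_k = quals[min(k, len(quals)) - 1]                    # k-th highest
    return [(B, q_tilde(B)) for B in L
            if q_tilde(B) >= r_k - eps / 2]                # step 4

demo_D = [(1, 1), (2, 1), (2, 0), (1, 0)]           # transactions (A1, T)
demo_L = [lambda t: t[0] == 1, lambda t: t[0] == 2]
result = misosoup(demo_D, demo_L, k=1, eps=0.5, delta=0.1, pd_upper=2)
```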

SLIDE 36

Experimental evaluation

There are experiments. Stuff works well.

SLIDE 37

Outline

✔ 1. Random sampling for data analytics
✔ 2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
✔ 3. MiSoSouP: approximating interesting subgroups with pseudodimension
4. What else to do with pseudodimension

SLIDE 38

Pseudodimension and betweenness centrality

Betweenness centrality: a measure of the importance of vertices/edges in a graph. For each vertex: what fraction of shortest paths goes through it? [R.-Upfal '18]: how to approximate the BC of all nodes through sampling; it obtains sample-dependent error bounds with Rademacher averages. But we can also get sample-agnostic, graph-dependent error bounds using pseudodimension.

SLIDE 39

Pseudodimension and aggregate SQL queries

Setting: very large relational database with multiple tables. Goal: run many aggregate queries, e.g.,

SELECT AVG(INCOME) FROM TAXPAYERS WHERE STATE='RI' AND CITY='PVD' GROUP BY ZIPCODE

How good an approximation do we get if we run the queries on a sample? It depends on the maximal syntactic complexity of the queries we want to run: number, type, and arrangement of selection predicates; number, type, and arrangement of join predicates; presence of GROUP BY clauses, ... This can be measured with pseudodimension (work in progress).
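The idea can be demoed with SQLite: run the same GROUP BY aggregate on the full table and on a uniform sample of its rows. Schema, column names, and data below are invented for the demo.

```python
import random
import sqlite3

random.seed(7)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE taxpayers (zipcode TEXT, income REAL)")
rows = [("0290" + str(random.randint(1, 4)), random.uniform(2e4, 2e5))
        for _ in range(50_000)]
con.executemany("INSERT INTO taxpayers VALUES (?, ?)", rows)
# A 2000-row uniform sample of the table:
con.execute("CREATE TABLE sampled AS "
            "SELECT * FROM taxpayers ORDER BY RANDOM() LIMIT 2000")

query = "SELECT zipcode, AVG(income) FROM {} GROUP BY zipcode"
full = dict(con.execute(query.format("taxpayers")))
approx = dict(con.execute(query.format("sampled")))
# Per-group relative error of the sampled averages:
max_rel_err = max(abs(approx[z] - full[z]) / full[z] for z in full)
```

With roughly 500 sampled rows per group, the sampled averages land within a few percent of the exact ones; how large a sample suffices for a whole query class is what the pseudodimension analysis would quantify.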

SLIDE 40

Recap

✔ 1. Random sampling for data analytics
✔ 2. Sample size vs. error trade-off: how Statistical Learning Theory saves the day
✔ 3. MiSoSouP: approximating interesting subgroups with pseudodimension
✔ 4. What else to do with pseudodimension

Thank you! Matteo Riondato

EML: mriondato@amherst.edu TWTR: @teorionda WWW: http://matteo.rionda.to

SLIDE 41

Image Credits

Image on slide 9 from Mohri et al., Foundations of Machine Learning, page 241. Image on slide 17: http://www.publicdomainfiles.com/show_file.php?id=13920173422245, public domain.
