
Detecting Change in Categorical Data: Mining Contrast Sets

Stephen D. Bay and Michael J. Pazzani
University of California, Irvine, 1999

Presented by: Ken Dwyer
April 4, 2007

Background Information


What is a Contrast Set?

Definition

A contrast set is a conjunction of attributes and values that differs meaningfully in its distribution across groups.

Example

United States census data from 1979 to 1992:

◮ P(occupation=sales | PhD) = 2.7%
◮ P(occupation=sales | Bachelor) = 15.8%

Mining contrast sets

◮ Question: “How do several contrasting groups differ?”
◮ Given data, automatically detect such differences between contrasting groups.

Definitions

Framework: Grouped categorical data

◮ The data is a set of k-dimensional attribute vectors.
◮ The attributes/variables are denoted A1, A2, ..., Ak.
◮ Each attribute Ai takes one of a finite number of discrete values Vi1, Vi2, ..., Vim.
◮ The vectors (or examples) are divided into n mutually exclusive groups G1, G2, ..., Gn.
◮ A contrast set is a conjunction of attribute-value pairs defined over the groups.
  ◮ e.g. (sex = male) ∧ (occupation = manager)
◮ The support of a contrast set for a group G is the fraction of examples in G where the contrast set is true (see the sketch below).
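To make the support definition concrete, here is a minimal Python sketch; the data layout, group key, and function name are illustrative assumptions, not from the paper.

```python
from collections import defaultdict

def support(examples, group_key, cset):
    """Support of a contrast set per group: the fraction of examples in
    each group for which every attribute-value pair in cset holds."""
    totals, hits = defaultdict(int), defaultdict(int)
    for ex in examples:
        g = ex[group_key]
        totals[g] += 1
        if all(ex.get(a) == v for a, v in cset.items()):
            hits[g] += 1
    return {g: hits[g] / totals[g] for g in totals}

# Toy data: support of (sex = male) ∧ (occupation = manager) per degree group
data = [
    {'education': 'PhD',      'sex': 'male',   'occupation': 'manager'},
    {'education': 'PhD',      'sex': 'female', 'occupation': 'sales'},
    {'education': 'Bachelor', 'sex': 'male',   'occupation': 'manager'},
    {'education': 'Bachelor', 'sex': 'male',   'occupation': 'sales'},
]
print(support(data, 'education', {'sex': 'male', 'occupation': 'manager'}))
# {'PhD': 0.5, 'Bachelor': 0.5}
```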

slide-2
SLIDE 2

Mining Contrast Sets: Objective

Formally, find contrast sets ‘cset’ where:

  • 1. ∃ i, j : P(cset = True | Gi) ≠ P(cset = True | Gj)
  • 2. max over i, j : |support(cset, Gi) − support(cset, Gj)| ≥ δ

◮ δ is a threshold called the minimum support difference

Definitions

A contrast set is...

◮ significant if Equation 1 is statistically valid.
◮ large if Equation 2 is satisfied.
◮ a deviation if both requirements are met.
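A minimal sketch of the “large” test (condition 2), assuming per-group supports have already been computed as in the earlier sketch; δ and the dict layout are illustrative.

```python
from itertools import combinations

def is_large(supports, delta):
    """Condition 2: some pair of groups differs in support by at least delta."""
    return max(abs(supports[a] - supports[b])
               for a, b in combinations(supports, 2)) >= delta

# Supports from the census example: occupation = sales, PhD vs. Bachelor
print(is_large({'PhD': 0.027, 'Bachelor': 0.158}, delta=0.01))  # True
```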

Background Information
Naïve Approach

Naïve Approach

Problem definition revisited

Find conjunctions of attributes and values that have different levels of support in different groups.

Apply association rule learning!

◮ Association rule learners [AS94] can discover relationships between attributes in a dataset.
◮ Idea: Add a group attribute to the data and run the association rule learner to find group differences (see the sketch below).
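A minimal hand-rolled sketch of this naïve approach; a real system would use an Apriori implementation [AS94], and the function name, thresholds, and data layout are assumptions for illustration.

```python
from itertools import combinations

def naive_group_rules(examples, group_key, min_supp=0.01, min_conf=0.8):
    """Treat the group label as just another attribute and mine rules of
    the form (attribute=value, ...) -> group, as a plain association
    rule learner would."""
    n = len(examples)
    items = sorted({(a, v) for ex in examples
                    for a, v in ex.items() if a != group_key})
    rules = []
    for size in (1, 2):  # antecedents of up to two attribute-value pairs
        for ante in combinations(items, size):
            covered = [ex for ex in examples
                       if all(ex.get(a) == v for a, v in ante)]
            if not covered:
                continue
            for g in {ex[group_key] for ex in covered}:
                both = sum(1 for ex in covered if ex[group_key] == g)
                supp, conf = both / n, both / len(covered)
                if supp >= min_supp and conf >= min_conf:
                    rules.append((ante, g, supp, conf))
    return rules
```

Run on grouped census-style records, this produces exactly the flood of per-group rules that the next slide criticizes.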

Naïve Approach

Association Rule Learning with a Group Variable

Shortcomings

◮ The number of rules discovered is large:
  ◮ For US census data: 26,796 rules for the Bachelor group, 1,674 for the PhD group (1% min. support, 80% min. confidence).
◮ Main problem: The association rule learner does not enforce consistent contrast [DB96]:
  ◮ The same attributes are not necessarily used to compare the groups.
  ◮ At least 25,122 of the Bachelor rules have no matching PhD rule.
◮ Are the differences in support and confidence statistically significant?

slide-3
SLIDE 3

Background Information
Naïve Approach
The STUCCO Algorithm

STUCCO: An Algorithm for Mining Contrast Sets

Search and Testing for Understandable Consistent COntrasts

◮ Construct a search tree in which the root node is the empty contrast set.
◮ Generate a node’s children by adding a new attribute-value pair, based on an ordering of such pairs (see the sketch below).
◮ Search in a breadth-first manner, counting the support for each group in a level as we proceed.
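A minimal sketch of the child-generation step, assuming a fixed canonical ordering of attribute-value pairs; the names are illustrative, not from the paper.

```python
def children(cset, ordered_pairs):
    """Children of a node: extend the conjunction with pairs that come after
    its last pair in the canonical ordering, using each attribute once."""
    start = ordered_pairs.index(cset[-1]) + 1 if cset else 0
    used = {a for a, _ in cset}
    return [cset + [(a, v)]
            for a, v in ordered_pairs[start:] if a not in used]

pairs = [('sex', 'male'), ('sex', 'female'), ('occupation', 'manager')]
print(children([], pairs))                 # the three 1-pair contrast sets
print(children([('sex', 'male')], pairs))  # [[('sex', 'male'), ('occupation', 'manager')]]
```

Breadth-first search then counts group supports for a whole level of such candidates in one pass over the data.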

Finding Significant Contrast Sets

Statistical hypothesis testing

◮ Null hypothesis: Contrast set support is equal across all groups (support is independent of group membership).
◮ Form a 2 × |G| contingency table.

Example

SAT Verbal scores for different UCI schools:

                   Arts   Bio.   Eng.   ICS   SocEc
  SATV > 700         45    142     85    60      11
  ¬(SATV > 700)     583   2465   1523   502     414
  High scores (%)   7.2    5.4    5.3   10.7    2.6

Finding Significant Contrast Sets

Chi-square test for independence

◮ χ² = Σ_{i=1..r} Σ_{j=1..c} (o_ij − e_ij)² / e_ij

                   Arts   Bio.   Eng.   ICS   SocEc   Total
  SATV > 700         45    142     85    60      11     343
  ¬(SATV > 700)     583   2465   1523   502     414    5487
  Total             628   2607   1608   562     425    5830

◮ Consider cell (1,1): e_{1,1} = (343 · 628)/5830 = 36.95
◮ This cell contributes (45 − 36.95)²/36.95 = 1.75 to χ²
◮ In total, we get χ² = 35.4

  • 1. Choose a significance level α
  • 2. Calculate the p-value p
  • 3. Reject the null hypothesis if p < α

◮ In this case, we compute p = 3.8e−7
◮ For α = .05, ‘SATV > 700’ is significant.
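The same test in a few lines of Python, assuming SciPy is available; the numbers reproduce the worked example above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2 x |G| contingency table from the SAT Verbal example
observed = np.array([[45, 142, 85, 60, 11],
                     [583, 2465, 1523, 502, 414]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.1e}, dof = {dof}")  # chi2 = 35.4, p = 3.8e-07, dof = 4
print(round(expected[0, 0], 2))  # 36.95, the e_{1,1} computed by hand above
```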

slide-4
SLIDE 4

Controlling Type I Error

Type I error (false positive)

◮ Occurs when the null hypothesis is falsely rejected.
◮ The maximum probability of such an error is α.
  ◮ e.g. With 1000 tests at α = .05, we would expect about 50 spurious “significant” differences, even if there really are none.

Bonferroni method (see [Sha95])

◮ Choose a significance level αi for the i-th test such that Σi αi ≤ α.
◮ Traditionally, for n tests, αi = α/n.
◮ STUCCO: For level l, use αl = min(α / (2^l · |Cl|), αl−1)
  ◮ |Cl| is the number of candidates at level l.
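A minimal sketch of the level-wise cutoff; the candidate counts are made up for illustration. Since Σ_l α/2^l ≤ α, the total Type I error stays bounded by α.

```python
def stucco_alphas(alpha, candidates_per_level):
    """alpha_l = min(alpha / (2**l * |C_l|), alpha_{l-1}): level l receives
    an alpha / 2**l share, split evenly across its |C_l| candidates, and
    the cutoff never increases with depth."""
    alphas, prev = [], float('inf')
    for l, n_cands in enumerate(candidates_per_level, start=1):
        prev = min(alpha / (2 ** l * n_cands), prev)
        alphas.append(prev)
    return alphas

# e.g. 50 candidates at level 1, 400 at level 2, 900 at level 3
print(stucco_alphas(0.05, [50, 400, 900]))
# [0.0005, 3.125e-05, ~6.9e-06]
```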

Pruning

◮ Idea: If none of the children (specializations) of a node can possibly be significant and large, then the node is pruned from the search tree.

Pruning criteria

  • 1. Minimum deviation size
  • 2. Expected cell frequencies
  • 3. Chi-square bounds

(1) Recall: the maximum difference between the support of any two groups must be at least δ for the contrast set to be large. This is not possible for the children of a node unless at least one group’s support is greater than δ.

(2) The expected cell frequencies in the top row of the contingency table cannot increase as new attributes are added to the contrast set. The χ² test is invalid for small counts (i.e. < 5), so the node is pruned in such cases.

(3) An upper bound on the χ² statistic can be computed for any child node. A candidate is pruned if further specializations cannot possibly meet the αl significance cutoff.
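A minimal sketch of the first two checks; criterion 3’s χ² upper bound is more involved and omitted here. Names are illustrative, and min_expected follows the slide’s small-count rule.

```python
def can_prune(group_supports, group_sizes, delta, min_expected=5):
    """Return True if no child of this node can be significant and large.

    group_supports: support of the node's contrast set per group
    group_sizes:    number of examples per group
    """
    # (1) Minimum deviation size: children can only lose support, so unless
    # some group's support already exceeds delta, no child can be 'large'.
    if max(group_supports.values()) <= delta:
        return True
    # (2) Expected cell frequencies: top-row counts only shrink as attributes
    # are added; once an expected top-row cell is below 5, the chi-square
    # test is invalid for every specialization.
    counts = {g: group_supports[g] * group_sizes[g] for g in group_sizes}
    row_total, n = sum(counts.values()), sum(group_sizes.values())
    if any(row_total * group_sizes[g] / n < min_expected for g in group_sizes):
        return True
    return False
```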

slide-5
SLIDE 5

Surprising Contrast Sets

◮ Idea: Only show those contrast sets that are surprising given what has already been shown.

Example

◮ Suppose we know the following:
  ◮ P(sex=male | PhD) = .81
  ◮ P(occupation=manager | PhD) = .14
◮ Assuming independence, we expect that:
  ◮ P(male ∧ manager | PhD) = .81 × .14 = .113
◮ The actual probability is .109, which is not surprising in this sense.
◮ Iterative proportional fitting [Eve92] is used to find the MLEs for a conjunction of variables based on its subsets.
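A minimal sketch of the independence baseline from this example. Real STUCCO fits the expectation by iterative proportional fitting over all previously shown subsets; this only multiplies marginal probabilities, and the 5% tolerance is an illustrative assumption.

```python
from math import prod

def is_surprising(actual, marginals, tolerance=0.05):
    """Compare the observed probability of a conjunction with the value
    expected if its parts were independent (product of the marginals)."""
    expected = prod(marginals)
    return abs(actual - expected) / expected > tolerance

# P(male | PhD) = .81, P(manager | PhD) = .14, observed P(both | PhD) = .109
print(is_surprising(0.109, [0.81, 0.14]))  # False: close to .81 * .14 = .113
```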

Background Information
Naïve Approach
The STUCCO Algorithm
Experimental Evaluation

Experimental Evaluation

◮ Two datasets; continuous attributes discretized.

  • 1. US Census [NHBM98]: 48,842 examples; 14 attributes.
       ◮ “What are the differences between people with PhD and Bachelor degrees?” (δ = 1%, α = .05)
  • 2. UCI Admissions: 6 years of data with ∼17,000 examples per year; 17 attributes.
       ◮ “How has the applicant pool changed from 1993-1998?” (δ = 5%, α = .05)

Experimental Evaluation

Results for US Census data

                                     Observed %     Expected %
  Contrast Set                       PhD   Bach.    PhD   Bach.     χ²       p
  workclass = State-gov             21.0    5.4                   225.1   6.9e-51
  occupation = sales                 2.7   15.8                    74.9   4.8e-18
  hours per week > 60                8.4    3.2                    43.4   4.4e-11
  native country = U.S.             80.5   89.5                    45.9   1.3e-11
  native country = Canada            1.9    0.5                    18.6   1.6e-5
  native country = India             1.6    0.5                    15.2   9.5e-5
  salary > 50K                      72.6   41.3                   220.2   8.3e-50
  sex = male ∧ salary > 50K         61.8   34.8    58.8   28.5    173.6   1.2e-39
  occupation = prof-specialty
    ∧ sex = female ∧ salary > 50K    7.6    2.6    10.7    3.5     48.2   3.8e-12

(Expected percentages are shown only for conjunctions.)

slide-6
SLIDE 6

Experimental Evaluation

Results for UCI Admissions data

[Figure: support of ‘Admit = Yes ∧ SAT Verbal < 400’ per year; expected values are indicated by the dotted lines.]

Experimental Evaluation

Effectiveness of Pruning

◮ Graph plots the number of candidates counted at each level.
◮ Deviation size pruning (dashed) versus all three pruning methods (solid).
◮ Census (left) and UCI Admissions in 1998 (right).

Background Information
Naïve Approach
The STUCCO Algorithm
Experimental Evaluation
Concluding Remarks

Concluding Remarks

Summary

◮ Motivated the problem of identifying differences between several groups.
◮ Defined a contrast set, and proposed the STUCCO algorithm for mining meaningful contrast sets.
◮ Demonstrated STUCCO’s utility on two real datasets.

Key Strengths

◮ Formulated a statistically grounded approach.
◮ Introduced pruning strategies that are specific to contrast set mining.
◮ Concentrated on keeping the results compact.

slide-7
SLIDE 7

Selected References

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. 20th International Conference on Very Large Data Bases, pages 487–499, 1994.

[BP99] S. D. Bay and M. J. Pazzani. Detecting change in categorical data: mining contrast sets. In Proc. 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 302–306, 1999.

[DB96] J. Davies and D. Billman. Hierarchical categorization and the effects of contrast inconsistency in an unsupervised learning task. In Proc. 18th Annual Conference of the Cognitive Science Society, page 750, 1996.

[Eve92] B. S. Everitt. The Analysis of Contingency Tables. Chapman and Hall, second edition, 1992.

[NHBM98] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.

[Sha95] J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561–584, 1995.