Subgroup Discovery Exploratory Data Analysis Exploratory Data - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Subgroup Discovery

slide-2
SLIDE 2

Exploratory Data Analysis

slide-6
SLIDE 6

Exploratory Data Analysis

§ Classification: model the dependence of the target on the remaining attributes.

§ Problem: a classifier sometimes uses only some of the available dependencies, or the classifier is a black box.
§ For example: in decision trees, some attributes may not appear because of overshadowing.

§ Exploratory Data Analysis: understanding the effects of all attributes on the target.
Q: How can we use ideas from C4.5 to approach this task?
A: Why not list the info gain of all attributes, and rank according to this?
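The ranking idea in the answer can be sketched as follows. This is a minimal illustration, not the slides' own code: the toy records, the attribute names `outlook` and `windy`, the target `play`, and the helpers `entropy` and `info_gain` are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        part = [r[target] for r in rows if r[attribute] == value]
        remainder += len(part) / len(rows) * entropy(part)
    return base - remainder

# Hypothetical toy data; 'play' is the target attribute.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "no"},
]

# Rank all attributes by their information gain, highest first.
attributes = ["outlook", "windy"]
ranking = sorted(attributes, key=lambda a: info_gain(rows, a, "play"), reverse=True)
for a in ranking:
    print(a, round(info_gain(rows, a, "play"), 3))
```

Unlike a decision tree, this list reports every attribute, so an attribute that would be overshadowed in a tree still shows up with its own score.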

slide-7
SLIDE 7

Interactions between Attributes

§ Single-attribute effects are not enough
§ The XOR problem is an extreme example: 2 attributes with no info gain form a good model
§ Apart from A=a, B=b, C=c, …
§ consider also A=a∧B=b, A=a∧C=c, …, B=b∧C=c, …, A=a∧B=b∧C=c, …
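The XOR case can be checked numerically. The sketch below assumes a hypothetical four-record database with binary attributes `a` and `b` and target `a XOR b`; the helper names are illustrative only.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, condition):
    """Information gain of splitting rows into 'condition holds' and 'condition fails'."""
    labels = [t for _, _, t in rows]
    inside = [t for a, b, t in rows if condition(a, b)]
    outside = [t for a, b, t in rows if not condition(a, b)]
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in (inside, outside) if part
    )
    return entropy(labels) - remainder

# Hypothetical database: every combination of a and b once, target = a XOR b.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

gain_a = info_gain(rows, lambda a, b: a == 1)                 # single attribute
gain_b = info_gain(rows, lambda a, b: b == 1)                 # single attribute
gain_ab = info_gain(rows, lambda a, b: a == 1 and b == 0)     # conjunction

print(gain_a, gain_b, round(gain_ab, 3))
```

Each single attribute has zero gain here, while the conjunction isolates a pure group and scores well above zero.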

slide-8
SLIDE 8

Subgroup Discovery Task

“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute”

§ Inductive constraints:

§ Minimum support
§ (Maximum support)
§ Minimum quality (Information gain, χ², WRAcc)
§ Maximum complexity
§ …

slide-9
SLIDE 9

Confusion Matrix

§ A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:

§ within subgroup, positive
§ within subgroup, negative
§ outside subgroup, positive
§ outside subgroup, negative

                    target
                  T      F
subgroup   T     .42    .13    .55
           F     .12    .33    .45
                 .54    .46    1.0

slide-10
SLIDE 10

Confusion Matrix

§ High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target
§ High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target
§ Target distribution on DB is fixed

                    target
                  T      F
subgroup   T     .42    .13    .55
           F     .12    .33    .45
                 .54    .46    1.0

slide-15
SLIDE 15

Quality Measures

A quality measure for subgroups summarizes the interestingness of a subgroup's confusion matrix into a single number.

WRAcc, weighted relative accuracy

§ WRAcc(S,T) = p(ST) − p(S)⋅p(T)
§ between −.25 and .25, 0 means uninteresting
§ Balance between coverage and unexpectedness

                    target
                  T      F
subgroup   T     .42    .13    .55
           F     .12    .33    .45
                 .54    .46    1.0

WRAcc(S,T) = p(ST) − p(S)⋅p(T) = .42 − .55⋅.54 = .42 − .297 = .123
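The WRAcc computation for this confusion matrix can be reproduced in a few lines. This is a minimal sketch; the function name `wracc` and the variable names are assumptions made for illustration.

```python
def wracc(p_st, p_s, p_t):
    """Weighted relative accuracy from the joint and marginal probabilities."""
    return p_st - p_s * p_t

# Cell fractions from the confusion matrix (rows: subgroup, columns: target).
tt, tf = 0.42, 0.13   # inside the subgroup: positive, negative
ft, ff = 0.12, 0.33   # outside the subgroup: positive, negative

p_s = tt + tf          # coverage of the subgroup (.55)
p_t = tt + ft          # fraction of positives in the database (.54)
score = wracc(tt, p_s, p_t)
print(round(score, 3))  # 0.123
```

The bounds of ±.25 follow from the same formula: a subgroup that covers half the database and coincides exactly with the target gives .5 − .5⋅.5 = .25, and its complement gives −.25.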

slide-16
SLIDE 16

Quality Measures

§ WRAcc: Weighted Relative Accuracy
§ Information gain
§ χ²
§ Correlation Coefficient
§ Laplace
§ Jaccard
§ Specificity

slide-17
SLIDE 17

Subgroup Discovery as Search

T

slide-18
SLIDE 18

Subgroup Discovery as Search

T A=a1 A=a2 B=b1 B=b2 C=c1 …

slide-19
SLIDE 19

Subgroup Discovery as Search

T A=a1 A=a2 B=b1 B=b2 C=c1 …

        T      F
 T     .42    .13    .55
 F     .12    .33    .45
       .54    .46    1.0

slide-23
SLIDE 23

Subgroup Discovery as Search

T
A=a1, A=a2, B=b1, B=b2, C=c1, …
A=a2∧B=b1, A=a1∧B=b1, A=a1∧B=b2, …
A=a1∧B=b1∧C=c1, …
minimum support level reached
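A level-wise walk through this lattice might look like the sketch below. It is only an illustration under stated assumptions: the root T is treated as the empty description, WRAcc is used as the quality measure, and the toy records, `search`, `covers`, `min_support`, and `max_depth` are hypothetical names, not the slides' algorithm.

```python
def covers(description, row):
    """True if the row satisfies every attribute=value condition in the description."""
    return all(row[attr] == val for attr, val in description)

def wracc(rows, description, target):
    """WRAcc of the subgroup defined by description: p(ST) - p(S) * p(T)."""
    n = len(rows)
    sub = [r for r in rows if covers(description, r)]
    p_t = sum(r[target] for r in rows) / n
    return sum(r[target] for r in sub) / n - (len(sub) / n) * p_t

def search(rows, attributes, target, min_support, max_depth):
    """Level-wise search over conjunctions, pruning on minimum support."""
    conditions = sorted({(a, r[a]) for r in rows for a in attributes})
    level = [()]  # assumed root: the empty description, covering the whole database
    results = []
    for _ in range(max_depth):
        next_level = []
        for desc in level:
            used = {a for a, _ in desc}
            for cond in conditions:
                # Extend only with later conditions on unused attributes,
                # so each conjunction is generated exactly once.
                if cond[0] in used or (desc and cond <= desc[-1]):
                    continue
                refined = desc + (cond,)
                support = sum(covers(refined, r) for r in rows)
                if support < min_support:
                    continue  # refinements can only cover fewer rows: prune
                results.append((wracc(rows, refined, target), support, refined))
                next_level.append(refined)
        level = next_level
    return sorted(results, reverse=True)

# Hypothetical toy database with a boolean target 'pos'.
rows = [
    {"A": "a1", "B": "b1", "pos": True},
    {"A": "a1", "B": "b1", "pos": True},
    {"A": "a1", "B": "b2", "pos": False},
    {"A": "a2", "B": "b1", "pos": False},
    {"A": "a2", "B": "b2", "pos": False},
    {"A": "a2", "B": "b2", "pos": True},
]

results = search(rows, ["A", "B"], "pos", min_support=2, max_depth=2)
for quality, support, desc in results[:3]:
    print(round(quality, 3), support, desc)
```

Pruning on support is safe because a refinement never covers more records than its parent; the same is not true for quality, so low-quality nodes are still expanded here.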

slide-31
SLIDE 31

Refinements are (anti-)monotonic

[Figure: nested regions within the entire database: the target concept, subgroup S1, S2 (refinement of S1), and S3 (refinement of S2).]

Refinements are (anti-)monotonic in their support, but not in interestingness, which may go up or down.
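A small numeric example makes the contrast concrete. The counts below are hypothetical and chosen only to illustrate the point; WRAcc stands in for "interestingness".

```python
def wracc(n_sub, n_sub_pos, n_total, n_pos):
    """WRAcc from counts: p(ST) - p(S) * p(T)."""
    return n_sub_pos / n_total - (n_sub / n_total) * (n_pos / n_total)

N, POS = 20, 10  # hypothetical database: 20 records, 10 of them positive

# Hypothetical refinement chain S1 ⊇ S2 ⊇ S3: (name, records covered, positives covered).
chain = [("S1", 10, 6), ("S2", 5, 5), ("S3", 2, 2)]

supports = [size for _, size, _ in chain]
qualities = [wracc(size, pos, N, POS) for _, size, pos in chain]

for (name, size, _), q in zip(chain, qualities):
    print(name, size, round(q, 3))
```

Support shrinks at every step (10, 5, 2), while the quality first rises from S1 to S2 and then falls again at S3.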

slide-32
SLIDE 32

Subgroup Discovery and ROC space

slide-33
SLIDE 33

ROC Space

Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of all positive cases that fall in the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of all negative cases that fall in the subgroup)
ROC = Receiver Operating Characteristics

slide-39
SLIDE 39

ROC Space Properties

[Figure: ROC space with labels: ‘ROC heaven’ (perfect subgroup), ‘ROC hell’ (random subgroup), entire database, empty subgroup, perfect negative subgroup, minimum support threshold.]

slide-41
SLIDE 41

Measures in ROC Space

[Figure: isometrics of WRAcc and Information Gain in ROC space, with positive and negative regions marked.]

source: Flach & Fürnkranz
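WRAcc can be rewritten in ROC coordinates. Since p(ST) = TPR⋅p(T) and p(S) = TPR⋅p(T) + FPR⋅(1 − p(T)), substituting into p(ST) − p(S)⋅p(T) gives WRAcc = p(T)⋅(1 − p(T))⋅(TPR − FPR). The sketch below checks this identity on the confusion matrix used earlier; the function names are assumptions for illustration.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) for a subgroup given its confusion-matrix cells."""
    return fp / (fp + tn), tp / (tp + fn)

def wracc_from_roc(fpr, tpr, p_t):
    """WRAcc in ROC coordinates: p(T) * (1 - p(T)) * (TPR - FPR)."""
    return p_t * (1 - p_t) * (tpr - fpr)

# Cells of the earlier confusion matrix, as fractions of the database.
tp, fp, fn, tn = 0.42, 0.13, 0.12, 0.33

fpr, tpr = roc_point(tp, fp, fn, tn)
p_t = tp + fn                          # fraction of positives in the database
direct = tp - (tp + fp) * p_t          # p(ST) - p(S) * p(T)
via_roc = wracc_from_roc(fpr, tpr, p_t)

print(round(fpr, 3), round(tpr, 3), round(direct, 3), round(via_roc, 3))
```

Because the expression depends only on TPR − FPR, points with equal WRAcc lie on straight lines parallel to the diagonal TPR = FPR.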

slide-42
SLIDE 42

Other Measures

Precision, Gini index, Correlation coefficient, Foil gain

slide-47
SLIDE 47

Refinements in ROC Space

Refinements of S cannot increase the FPR or TPR, so they appear to the left of and below S. The blue polygon represents the possible refinements of S. With a convex quality measure f, the value of f over the polygon is bounded by its value at the corners.

If no corner is above the minimum quality or the current best (top k?), prune the search space below S.
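The pruning rule can be sketched for WRAcc, which is linear (and therefore convex) in ROC coordinates. As a simplifying assumption, the polygon of possible refinements is replaced by the full rectangle between the origin and S, ignoring the minimum support line; this can only loosen the bound. The function names and the thresholds are hypothetical.

```python
def wracc_from_roc(fpr, tpr, p_t):
    """WRAcc in ROC coordinates: p(T) * (1 - p(T)) * (TPR - FPR)."""
    return p_t * (1 - p_t) * (tpr - fpr)

def optimistic_bound(fpr, tpr, p_t):
    """Largest WRAcc over the corners of the rectangle [0, fpr] x [0, tpr]."""
    corners = [(0.0, 0.0), (fpr, 0.0), (0.0, tpr), (fpr, tpr)]
    return max(wracc_from_roc(x, y, p_t) for x, y in corners)

def can_prune(fpr, tpr, p_t, threshold):
    """Prune below S when no corner is above the threshold."""
    return optimistic_bound(fpr, tpr, p_t) <= threshold

# Subgroup S from the earlier confusion matrix.
p_t = 0.54
fpr, tpr = 0.13 / 0.46, 0.42 / 0.54

bound = optimistic_bound(fpr, tpr, p_t)
print(round(bound, 4))
print(can_prune(fpr, tpr, p_t, threshold=0.20))
print(can_prune(fpr, tpr, p_t, threshold=0.15))
```

For WRAcc the best corner is (0, TPR): a refinement that keeps all positives of S and drops all its negatives. The threshold can be a fixed minimum quality or the score of the current k-th best subgroup.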

slide-48
SLIDE 48

Combining Two Subgroups

slide-54
SLIDE 54

Multi-class problems

§ Generalising to problems with more than 2 classes is fairly straightforward:

                    target
                  C1     C2     C3
subgroup   T     .27    .06    .22    .55
           F     .03    .19    .23    .45
                 .30    .25    .45    1.0

Combine the values into a quality measure: χ², Information gain
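Both measures can be computed directly from the table of fractions. The sketch below is an illustration with assumed helper names; because the slide gives fractions rather than counts, the χ² value is reported per record (the usual statistic divided by the database size N).

```python
from math import log2

# Joint fractions from the table: rows are subgroup T/F, columns are classes C1..C3.
table = [
    [0.27, 0.06, 0.22],
    [0.03, 0.19, 0.23],
]

def marginals(table):
    """Row totals (subgroup / complement) and column totals (class distribution)."""
    return [sum(row) for row in table], [sum(col) for col in zip(*table)]

def chi2_per_record(table):
    """Chi-squared statistic divided by N: sum of (observed - expected)^2 / expected."""
    rows, cols = marginals(table)
    return sum(
        (table[i][j] - rows[i] * cols[j]) ** 2 / (rows[i] * cols[j])
        for i in range(len(rows))
        for j in range(len(cols))
    )

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def info_gain(table):
    """Entropy of the target minus the expected entropy inside and outside the subgroup."""
    rows, cols = marginals(table)
    conditional = sum(
        rows[i] * entropy([cell / rows[i] for cell in table[i]])
        for i in range(len(rows))
    )
    return entropy(cols) - conditional

chi2 = chi2_per_record(table)
gain = info_gain(table)
print(round(chi2, 4), round(gain, 4))
```

Multiplying `chi2` by the number of records gives the ordinary χ² statistic for a database of that size.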

slide-58
SLIDE 58

Numeric Subgroup Discovery

§ Target is numeric: find subgroups with significantly higher or lower average value
§ Trade-off between size of subgroup and average target value

[Figure labels: h = 2200, h = 3100, h = 3600]
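One way to express this trade-off is to weight the deviation of the subgroup mean by the square root of the subgroup size. The slides do not name a specific measure, so the choice of this score, the function name `mean_test`, and the values below are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean

def mean_test(subgroup_values, all_values):
    """sqrt(size) * (subgroup mean - overall mean): rewards large, deviating subgroups."""
    return sqrt(len(subgroup_values)) * (mean(subgroup_values) - mean(all_values))

# Hypothetical numeric target values for the whole database.
all_values = [1800, 2000, 2200, 2600, 3000, 3200, 3400, 3600]

large = [3000, 3200, 3400, 3600]   # bigger subgroup, moderately high average
small = [3600]                     # tiny subgroup, highest average

score_large = mean_test(large, all_values)
score_small = mean_test(small, all_values)
print(score_large, score_small)
```

The single-record subgroup has the higher average, yet the larger subgroup scores better because its size outweighs the smaller deviation.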

slide-61
SLIDE 61

Quiz 1

Q: Assume you have found a subgroup with a positive WRAcc (or infoGain). Can any refinement of this subgroup be negative?
A: Yes.

slide-66
SLIDE 66

Quiz 2

Q: Assume both A and B are uninteresting subgroups. Can the subgroup A ∧ B be an interesting subgroup?
A: Yes. Think of the XOR problem: A ∧ B is either completely positive or completely negative.

slide-73
SLIDE 73

Quiz 3

Q: Can the combination of two positive subgroups ever produce a negative subgroup?
A: Yes.
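A small counterexample shows how this can happen. The sketch assumes that "combination" means the conjunction A ∧ B and that "positive" means positive WRAcc; the 14 records below are hypothetical.

```python
def wracc(rows, in_subgroup):
    """WRAcc of the subgroup selected by in_subgroup over (a, b, target) records."""
    n = len(rows)
    p_t = sum(t for _, _, t in rows) / n
    covered = [r for r in rows if in_subgroup(r)]
    p_s = len(covered) / n
    p_st = sum(t for _, _, t in covered) / n
    return p_st - p_s * p_t

# Hypothetical records (a, b, target).
rows = (
    [(1, 0, 1)] * 4    # only A holds: all positive
    + [(0, 1, 1)] * 4  # only B holds: all positive
    + [(1, 1, 0)] * 2  # both hold: all negative
    + [(0, 0, 0)] * 4  # neither holds: all negative
)

q_a = wracc(rows, lambda r: r[0] == 1)
q_b = wracc(rows, lambda r: r[1] == 1)
q_ab = wracc(rows, lambda r: r[0] == 1 and r[1] == 1)

print(round(q_a, 3), round(q_b, 3), round(q_ab, 3))
```

A and B are each mostly positive, but the records they share are exactly the negative ones, so the conjunction ends up with a negative score.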
