Finding Outstanding Aspects and Contrast Subspaces Jian Pei School - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Finding Outstanding Aspects and Contrast Subspaces

Jian Pei School of Computing Science Simon Fraser University jpei@cs.sfu.ca

slide-2
SLIDE 2

CHIRC

  • Computational Health Intelligence Research Centre
– Population health powered by big data
– Healthcare business intelligence
– Predictive health analytics
  • A collaborative research initiative with industry leaders
  • Technology transferred to industry
– Multi-million US dollars financial gain per year for industry partners

  • J. Pei: Finding Outstanding Aspects and Contrast Subspaces


slide-3
SLIDE 3
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …

In what aspect is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?

slide-4
SLIDE 4

Fraud Suspect Analysis

  • An insurance analyst is investigating a suspicious claim
  • How does the claim compare with normal and fraudulent claims?
– In what aspects is the suspicious case most similar to fraudulent cases and different from normal claims?

slide-5
SLIDE 5

Don’t You Ever Google Yourself?

  • Big data makes one know oneself better
  • 57% of American adults search for themselves on the Internet
– Good news: those people are better paid than those who haven’t done so! (Investors.com)
  • Egocentric analysis becomes more and more important with big data

slide-6
SLIDE 6

Egocentric Analysis

  • How am I different from (more often than not, better than) others?
  • In what aspects am I good?

slide-7
SLIDE 7

Contrast Subspace Finding

  • Given a set of labeled objects in two classes
  • For a query object q that is also labeled, the contrast subspace is the one where q is most likely to belong to the target class against the other class

slide-8
SLIDE 8

Related Work

  • Finding patterns and models that manifest drastic differences from one class against the other
– Example: emerging patterns
  • Subspace outlier detection
– The query object may not be an outlier
  • Typicality queries do not consider subspaces

slide-9
SLIDE 9

Problem Formulation

  • Find subspaces maximizing $LC_S(q) = \frac{L_S(q \mid O_+)}{L_S(q \mid O_-)}$
  • To avoid triviality, consider only subspaces where $L_S(q \mid O_+) \geq \delta$

slide-10
SLIDE 10

Density Estimation

  • Density estimated by

$$L_S(q \mid O) = \hat{f}_S(q, O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \sum_{o \in O} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2h_S^2}}$$

  • Then,

$$LC_S(q, O_+, O_-) = \frac{\hat{f}_S(q, O_+)}{\hat{f}_S(q, O_-)} = \frac{|O_-|\,h_{S_-}}{|O_+|\,h_{S_+}} \cdot \frac{\sum_{o \in O_+} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2h_{S_+}^2}}}{\sum_{o \in O_-} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2h_{S_-}^2}}}$$
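The density estimate and the likelihood contrast on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy data, and the fixed bandwidths passed in as `h_pos` and `h_neg` are hypothetical, since the slide does not say how the bandwidths are chosen.

```python
import math

def dist_sq(q, o, subspace):
    """Squared Euclidean distance between q and o, restricted to the dimensions in subspace."""
    return sum((q[d] - o[d]) ** 2 for d in subspace)

def likelihood(q, objects, subspace, h):
    """Gaussian kernel density estimate L_S(q | O); the bandwidth h is an assumed input."""
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    return sum(math.exp(-dist_sq(q, o, subspace) / (2 * h * h)) for o in objects) / norm

def likelihood_contrast(q, pos, neg, subspace, h_pos, h_neg):
    """LC_S(q, O+, O-): density of q in O+ divided by density of q in O-."""
    return likelihood(q, pos, subspace, h_pos) / likelihood(q, neg, subspace, h_neg)

# Hypothetical toy data: q looks like O+ on dimension 0 and like O- on dimension 1.
pos = [(1.0, 5.0), (1.2, 6.0), (0.8, 4.0)]
neg = [(4.0, 0.1), (5.0, -0.1), (6.0, 0.0)]
q = (1.0, 0.0)

lc_dim0 = likelihood_contrast(q, pos, neg, (0,), 1.0, 1.0)
lc_dim1 = likelihood_contrast(q, pos, neg, (1,), 1.0, 1.0)
print(lc_dim0 > 1.0, lc_dim1 < 1.0)
```

In this toy setting the contrast is large in the subspace {dimension 0} and small in {dimension 1}, which is the kind of difference the contrast subspace is meant to expose.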

slide-11
SLIDE 11

Complexity

  • MAX SNP-hard
– Reduction from the emerging pattern mining problem
  • Impossible to design a good approximation algorithm (no polynomial-time approximation scheme unless P = NP)

slide-12
SLIDE 12

A Monotonic Bound

  • $L_S(q \mid O_+)$ is not monotonic in subspaces
  • Develop an upper bound of $L_S(q \mid O_+)$, which is monotonic in subspaces
– Sort all the dimensions in their standard deviation descending order
– Let $\mathcal{S}$ be the set of children of S in the subspace set enumeration tree using the standard deviation descending order
– $L^*_S(q \mid O_+) = \frac{1}{|O_+|\sqrt{2\pi}\,\sigma'_{min}\,h'_{opt\_min}} \sum_{o \in O_+} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2(\sigma_S\,h'_{opt\_max})^2}}$
– $\sigma'_{min} = \min\{\sigma_{S'} \mid S' \in \mathcal{S}\}$, $h'_{opt\_min} = \min\{h_{S'\_opt} \mid S' \in \mathcal{S}\}$, and $h'_{opt\_max} = \max\{h_{S'\_opt} \mid S' \in \mathcal{S}\}$
slide-13
SLIDE 13

Monotonic Bound

For a query object q, a set of objects O, and subspaces $S_1$, $S_2$ such that $S_1$ is an ancestor of $S_2$ in the subspace set enumeration tree using the standard deviation descending order in $O_+$, $L^*_{S_1}(q \mid O_+) \geq L_{S_2}(q \mid O_+)$.

Baseline algorithm time complexity: $O(2^{|D|} \cdot (|O_+| + |O_-|))$
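The baseline complexity comes from enumerating every non-empty subspace and scanning both classes once per subspace. A minimal sketch of such a baseline follows, assuming a single fixed bandwidth `h` for all subspaces (an assumption made only for illustration) and hypothetical names and toy data; it does not use the monotonic bound for pruning.

```python
import math
from itertools import combinations

def likelihood(q, objects, subspace, h):
    """Gaussian kernel density estimate of q among objects in the given subspace."""
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    total = 0.0
    for o in objects:
        d2 = sum((q[d] - o[d]) ** 2 for d in subspace)
        total += math.exp(-d2 / (2 * h * h))
    return total / norm

def baseline_contrast_subspace(q, pos, neg, delta, h=1.0):
    """Try all 2^|D| - 1 non-empty subspaces; return (best contrast, best subspace).

    Subspaces with L_S(q | O+) < delta are skipped. The fixed bandwidth h is an assumption.
    """
    n_dims = len(q)
    best = (float("-inf"), None)
    for k in range(1, n_dims + 1):
        for subspace in combinations(range(n_dims), k):
            l_pos = likelihood(q, pos, subspace, h)
            if l_pos < delta:
                continue
            l_neg = likelihood(q, neg, subspace, h)
            # Guard against floating-point underflow of the denominator.
            lc = l_pos / l_neg if l_neg > 0 else float("inf")
            if lc > best[0]:
                best = (lc, subspace)
    return best

# Hypothetical toy data in two dimensions.
pos = [(1.0, 5.0), (1.2, 6.0), (0.8, 4.0)]
neg = [(4.0, 0.1), (5.0, -0.1), (6.0, 0.0)]
q = (1.0, 0.0)
best_lc, best_subspace = baseline_contrast_subspace(q, pos, neg, delta=0.01)
print(best_subspace)
```

Each call to `likelihood` touches every object of one class, so the two calls per subspace give the $|O_+| + |O_-|$ factor, and the double loop gives the $2^{|D|}$ factor.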

slide-14
SLIDE 14

Bounding Using Neighborhoods

  • Divide the neighborhood of an object into two parts, $N_S^{\epsilon}(q) = \{o \in O \mid \mathrm{dist}_S(q, o) \leq \epsilon\}$ and the rest
  • Then, $L_S(q \mid O) = L_S^{N^{\epsilon}}(q \mid O) + L_S^{rest}(q \mid O)$, where

$$L_S^{N^{\epsilon}}(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \sum_{o \in N_S^{\epsilon}(q)} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2h_S^2}}$$

$$L_S^{rest}(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \sum_{o \in O \setminus N_S^{\epsilon}(q)} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2h_S^2}}$$

slide-15
SLIDE 15

Bounding the Rest

  • Let $\mathrm{dist}_S(q \mid O)$ be the maximum distance between q and all objects in O in subspace S

$$\frac{|O| - |N_S^{\epsilon}(q)|}{|O|\sqrt{2\pi}\,h_S} \cdot e^{-\frac{\mathrm{dist}_S(q \mid O)^2}{2h_S^2}} \leq L_S^{rest}(q \mid O) \leq \frac{|O| - |N_S^{\epsilon}(q)|}{|O|\sqrt{2\pi}\,h_S} \cdot e^{-\frac{\epsilon^2}{2h_S^2}}$$

slide-16
SLIDE 16

Bounding

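For a query object q, a set of objects O and $\epsilon \geq 0$,

$$LL_S^{\epsilon}(q \mid O) \leq L_S(q \mid O) \leq UL_S^{\epsilon}(q \mid O)$$

where

$$LL_S^{\epsilon}(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \left( \sum_{o \in N_S^{\epsilon}(q)} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2h_S^2}} + (|O| - |N_S^{\epsilon}(q)|)\, e^{-\frac{\mathrm{dist}_S(q \mid O)^2}{2h_S^2}} \right)$$

and

$$UL_S^{\epsilon}(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \left( \sum_{o \in N_S^{\epsilon}(q)} e^{-\frac{\mathrm{dist}_S(q,o)^2}{2h_S^2}} + (|O| - |N_S^{\epsilon}(q)|)\, e^{-\frac{\epsilon^2}{2h_S^2}} \right)$$

For a query object q, a set of objects $O_+$, a set of objects $O_-$, and $\epsilon \geq 0$,

$$LC_S(q) \leq \frac{UL_S^{\epsilon}(q \mid O_+)}{LL_S^{\epsilon}(q \mid O_-)}$$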
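The neighborhood bounds can be checked numerically. The sketch below is an illustration only: the function name, the fixed bandwidth `h`, and the one-dimensional toy data are assumptions, and for clarity it computes all distances, whereas the point of the bound in practice is to avoid evaluating the kernel for objects outside the neighborhood.

```python
import math

def neighborhood_bounds(q, objects, subspace, h, eps):
    """Return (lower, exact, upper) for the kernel density of q among objects.

    Objects within distance eps are summed exactly. Every remaining object is
    replaced by the kernel value at the maximum distance (lower bound) or at
    distance eps (upper bound).
    """
    dists = [math.sqrt(sum((q[d] - o[d]) ** 2 for d in subspace)) for o in objects]
    near = [d for d in dists if d <= eps]
    n_rest = len(dists) - len(near)
    norm = len(objects) * math.sqrt(2 * math.pi) * h

    def kernel(d):
        return math.exp(-d * d / (2 * h * h))

    near_sum = sum(kernel(d) for d in near)
    lower = (near_sum + n_rest * kernel(max(dists))) / norm
    upper = (near_sum + n_rest * kernel(eps)) / norm
    exact = sum(kernel(d) for d in dists) / norm
    return lower, exact, upper

# Hypothetical one-dimensional toy data.
pos = [(0.0,), (0.5,), (1.0,), (3.0,), (6.0,)]
neg = [(4.0,), (5.0,), (7.0,)]
q = (0.2,)

lo_pos, ex_pos, up_pos = neighborhood_bounds(q, pos, (0,), h=1.0, eps=1.0)
lo_neg, ex_neg, up_neg = neighborhood_bounds(q, neg, (0,), h=1.0, eps=1.0)

lc_exact = ex_pos / ex_neg
lc_upper = up_pos / lo_neg  # upper bound on the likelihood contrast
print(lo_pos <= ex_pos <= up_pos, lc_exact <= lc_upper)
```

A larger `eps` puts more objects in the exactly summed part, which tightens both bounds at the cost of more kernel evaluations.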

slide-17
SLIDE 17

Algorithm


slide-18
SLIDE 18

Dimensionality of Inlying Contrast Subspaces


slide-19
SLIDE 19

Dimensionality of Outlying Contrast Subspaces


slide-20
SLIDE 20

Runtime


slide-21
SLIDE 21

In Which Aspects Is Johnson Good?

[Figure: three scatter plots with the query player Joe marked: Personal foul vs. Assist, Points/game vs. Assist, and Points/game vs. Personal foul]

slide-22
SLIDE 22

Fraud Investigation

  • Given a set of claims in an insurance company
  • For a claim c, in which aspects is c most different from the other claims?

slide-23
SLIDE 23

Outlying/Outstanding Aspect Mining

  • Given a set of objects in a multi-dimensional space
  • For an object q, find the subspaces where q is most unusual compared to the rest of the data

slide-24
SLIDE 24

Differences from Outlier Detection

  • Outlier detection finds objects that are different from the rest of the data
  • The query object in outlying aspect finding may not be an outlier

slide-25
SLIDE 25

Problem Formulation

  • A set of objects O in full space $D = \{D_1, \ldots, D_d\}$
  • Query object q
  • The density of q measures how outlying (uncommon) q is
– Density estimation: $\hat{f}_h(o) = \frac{1}{n} \sum_{i=1}^{n} K_h(o - o_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{o - o_i}{h}\right)$
  • Find a subspace where the density of q is lowest?

slide-26
SLIDE 26

Why Rank Statistics?

  • Densities in different subspaces are not comparable
  • We compare the same set of objects in different subspaces
  • Rank statistics: $rank_S(o) = |\{o' \mid o' \in O,\ OutDeg(o') < OutDeg(o)\}| + 1$
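The rank statistic can be illustrated with a short Python sketch. It assumes that OutDeg is the kernel density of an object in the subspace (so a lower density means a smaller, more outlying rank), uses a fixed bandwidth `h`, and omits the normalization constant because it does not change ranks within one subspace; the names and toy data are hypothetical.

```python
import math

def density(o, objects, subspace, h=1.0):
    """Unnormalized Gaussian kernel density of o among objects in the subspace."""
    total = 0.0
    for other in objects:
        d2 = sum((o[d] - other[d]) ** 2 for d in subspace)
        total += math.exp(-d2 / (2 * h * h))
    return total

def rank_in_subspace(q, objects, subspace, h=1.0):
    """rank_S(q) = number of objects with strictly smaller OutDeg than q, plus 1.

    OutDeg is taken to be the density here, which is an assumption of this sketch.
    """
    dens_q = density(q, objects, subspace, h)
    return sum(1 for o in objects if density(o, objects, subspace, h) < dens_q) + 1

# Hypothetical toy data: q is isolated on dimension 0 but typical on dimension 1.
objects = [(1.0, 1.0), (1.1, 5.0), (0.9, 9.0), (1.0, 1.2), (5.0, 1.1)]
q = objects[4]

rank_dim0 = rank_in_subspace(q, objects, (0,))
rank_dim1 = rank_in_subspace(q, objects, (1,))
print(rank_dim0, rank_dim1)
```

Because both ranks are computed over the same five objects, they can be compared across subspaces even though the raw densities cannot.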

slide-27
SLIDE 27

Unsupervised Problem Formulation

Given a set of objects O in a multidimensional space D, a query object $q \in O$ and a maximum dimensionality threshold $0 < \ell \leq |D|$, a subspace $S \subseteq D$ ($0 < |S| \leq \ell$) is called a minimal outlying subspace of q if

  1. (Rank minimality) there does not exist another subspace $S' \subseteq D$ ($S' \neq \emptyset$), such that $rank_{S'}(q) < rank_S(q)$; and
  2. (Subspace minimality) there does not exist another subspace $S'' \subset S$ such that $rank_{S''}(q) = rank_S(q)$.

The problem of outlying aspect mining is to find the minimal outlying subspaces of q.
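The two minimality conditions can be applied directly once the rank of q is known in every candidate subspace. The sketch below assumes the ranks have already been computed and are supplied as a dictionary covering all candidate subspaces being compared; the function name and the example ranks are hypothetical.

```python
def minimal_outlying_subspaces(ranks):
    """Select minimal outlying subspaces from precomputed ranks.

    ranks maps each candidate subspace (a frozenset of dimensions) to rank_S(q).
    It is assumed to contain every subspace that should be compared.
    """
    # Rank minimality: keep only subspaces achieving the smallest rank.
    best_rank = min(ranks.values())
    best = [s for s, r in ranks.items() if r == best_rank]
    # Subspace minimality: drop S if a proper subset of S achieves the same rank.
    return [s for s in best if not any(t < s for t in best)]

# Hypothetical ranks of q in the subspaces of three dimensions, up to two dimensions each.
example_ranks = {
    frozenset({0}): 3,
    frozenset({1}): 1,
    frozenset({2}): 4,
    frozenset({0, 1}): 1,
    frozenset({0, 2}): 1,
    frozenset({1, 2}): 2,
}
result = minimal_outlying_subspaces(example_ranks)
print(sorted(sorted(s) for s in result))
```

Here {0, 1} is discarded because its subset {1} already reaches the same rank, while {0, 2} is kept because neither {0} nor {2} does.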

slide-28
SLIDE 28

Density Estimation for Ranking

$$\hat{f}_S(q) \sim \tilde{f}_S(q) = \sum_{o \in O} e^{-\sum_{D_i \in S} \frac{(q.D_i - o.D_i)^2}{2h_{D_i}^2}}$$

  • Invariance

Given a set of objects O in space $S = \{D_1, \ldots, D_d\}$, define a linear transformation $g(o) = (a_1 o.D_1 + b_1, \ldots, a_d o.D_d + b_d)$ for any $o \in O$, where $a_1, \ldots, a_d$ and $b_1, \ldots, b_d$ are real numbers. Let $O' = \{g(o) \mid o \in O\}$ be the transformed data set. For any objects $o_1, o_2 \in O$ such that $\tilde{f}_S(o_1) > \tilde{f}_S(o_2)$ in O, $\tilde{f}_S(g(o_1)) > \tilde{f}_S(g(o_2))$ if the product kernel is used and the bandwidths are set using Härdle’s rule of thumb.
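The invariance property can be checked on a small example. The sketch below assumes the commonly stated form of Härdle's rule of thumb, $h = 1.06 \min\{\sigma, R/1.34\}\, n^{-1/5}$ with R an interquartile range; the simple index-based quartiles, the function names, the toy data, and the positive scaling factors in the example are all assumptions made for illustration.

```python
import math
import statistics

def hardle_bandwidth(values):
    """Rule-of-thumb bandwidth: 1.06 * min(sigma, R / 1.34) * n^(-1/5)."""
    n = len(values)
    ordered = sorted(values)
    k = int(0.25 * (n - 1))
    spread = ordered[n - 1 - k] - ordered[k]  # simple interquartile range (assumption)
    sigma = statistics.stdev(values)
    return 1.06 * min(sigma, spread / 1.34) * n ** (-0.2)

def quasi_density(q, objects, subspace, bandwidths):
    """Product-kernel quasi-density with one bandwidth per dimension."""
    total = 0.0
    for o in objects:
        exponent = sum((q[d] - o[d]) ** 2 / (2 * bandwidths[d] ** 2) for d in subspace)
        total += math.exp(-exponent)
    return total

def density_ordering(objects, subspace):
    """Indices of the objects sorted by increasing quasi-density in the subspace."""
    bandwidths = {d: hardle_bandwidth([o[d] for o in objects]) for d in subspace}
    dens = [quasi_density(o, objects, subspace, bandwidths) for o in objects]
    return sorted(range(len(objects)), key=lambda i: dens[i])

# Hypothetical toy data and a per-dimension linear transformation g.
objects = [(1.0, 10.0), (2.0, 14.0), (2.5, 11.0), (4.0, 30.0), (9.0, 12.0), (3.0, 13.0)]
transformed = [(3.0 * x + 7.0, 0.5 * y - 2.0) for x, y in objects]

order_before = density_ordering(objects, (0, 1))
order_after = density_ordering(transformed, (0, 1))
print(order_before == order_after)
```

The ordering is preserved because scaling a dimension by $a_i$ scales both the differences $q.D_i - o.D_i$ and the bandwidth $h_{D_i}$ by the same factor, so each term of the exponent is unchanged.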

slide-29
SLIDE 29

Algorithm Framework


slide-30
SLIDE 30

Pruning Rule 1

  • If $rank_S(q) = 1$, according to the dimensionality minimality condition in the problem definition, all super-spaces of S can be pruned.
  • Pruning on other ranks or density values?
– Neither rank nor density is monotonic with respect to subspaces

slide-31
SLIDE 31

Reducing Density Estimation Cost

  • To obtain the exact rank statistics in a subspace, the query object has to be compared with every other object
  • By estimating density values using neighborhoods, the cost of density computation can be reduced

slide-32
SLIDE 32

Cross Subspace Pruning

  • For subspaces $S \subset S'$, by estimating the bounds of possible changes in density, the range of the rank in $S'$ can be estimated from the rank in S
  • Some subspaces can be pruned using the ranges

slide-33
SLIDE 33

Distribution of Ranks


slide-34
SLIDE 34

Distribution of # Outlying Aspects


slide-35
SLIDE 35

Computational Performance


slide-36
SLIDE 36

Conclusions

  • Finding outlying/outstanding aspects and contrast subspaces has many applications
  • Computationally, it is challenging: the problem cannot even be approximated well
  • Future work
– Faster algorithms
– More effective measures
– Scaling out

slide-37
SLIDE 37

Papers

  • L. Duan, G. Tang, J. Pei, J. Bailey, G. Dong, A. Campbell, and C. Tang. "Mining Contrast Subspaces". In Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’14), (Best Paper Award) Tainan, Taiwan, May 13-16, 2014.
  • L. Duan, G. Tang, J. Pei, J. Bailey, G. Dong, A. Campbell, and C. Tang. "Mining Outlying Aspects on Numeric Data". ECML/PKDD 2015, and to appear in Data Mining and Knowledge Discovery, Springer-Verlag.