Finding Outstanding Aspects and Contrast Subspaces
Jian Pei
School of Computing Science, Simon Fraser University
jpei@cs.sfu.ca
CHIRC
- Computational Health Intelligence Research Centre
– Population health powered by big data
– Healthcare business intelligence
– Predictive health analytics
- A collaborative research initiative with industry leaders
- Technology transferred to industry
– Multi-million US dollar financial gain per year for industry partners
- J. Pei: Finding Outstanding Aspects and Contrast Subspaces
Symptoms:
- overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …
In what aspect is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?
Fraud Suspect Analysis
- An insurance analyst is investigating a suspicious claim
- How does the claim compare with normal and fraudulent claims?
– In what aspects is the suspicious case most similar to fraudulent cases and different from normal claims?
Don’t You Ever Google Yourself?
- Big data makes one know oneself better
- 57% of American adults search for themselves on the Internet
– Good news: those people are better paid than those who haven’t done so! (Investors.com)
- Egocentric analysis becomes more and more important with big data
Egocentric Analysis
- How am I different from (more often than not, better than) others?
- In what aspects am I good?
Contrast Subspace Finding
- Given a set of labeled objects in two classes
- For a query object q that is also labeled, the contrast subspace is the one where q is most likely to belong to the target class as opposed to the other class
Related Work
- Finding patterns and models that manifest drastic differences between one class and the other
– Example: emerging patterns
- Subspace outlier detection
– The query object may not be an outlier
- Typicality queries do not consider subspaces
Problem Formulation
- Find subspaces maximizing
$$LC_S(q) = \frac{L_S(q \mid O_+)}{L_S(q \mid O_-)}$$
- To avoid triviality, consider only subspaces where
$$L_S(q \mid O_+) \ge \delta$$
Density Estimation
- Density estimated by
$$L_S(q \mid O) = \hat{f}_S(q, O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \sum_{o \in O} e^{-\frac{dist_S(q,o)^2}{2h_S^2}}$$
- Then,
$$LC_S(q, O_+, O_-) = \frac{\hat{f}_S(q, O_+)}{\hat{f}_S(q, O_-)} = \frac{|O_-|\,h_{S_-}}{|O_+|\,h_{S_+}} \cdot \frac{\sum_{o \in O_+} e^{-\frac{dist_S(q,o)^2}{2h_{S_+}^2}}}{\sum_{o \in O_-} e^{-\frac{dist_S(q,o)^2}{2h_{S_-}^2}}}$$
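As a rough illustration of the density estimate and the contrast ratio, the following Python sketch evaluates both on a toy data set. The function names, the toy points, and the fixed bandwidths of 0.5 are assumptions made for this example and do not come from the slides.

```python
import math

def dist(q, o, subspace):
    """Euclidean distance between q and o restricted to the dimensions in `subspace`."""
    return math.sqrt(sum((q[d] - o[d]) ** 2 for d in subspace))

def likelihood(q, objects, subspace, h):
    """Gaussian kernel estimate L_S(q | O) with bandwidth h (assumed given)."""
    total = sum(math.exp(-dist(q, o, subspace) ** 2 / (2 * h * h)) for o in objects)
    return total / (len(objects) * math.sqrt(2 * math.pi) * h)

def likelihood_contrast(q, pos, neg, subspace, h_pos, h_neg):
    """Ratio LC_S(q) = L_S(q | O+) / L_S(q | O-)."""
    return likelihood(q, pos, subspace, h_pos) / likelihood(q, neg, subspace, h_neg)

# Hypothetical toy data: the classes separate on dimension 0 but not on dimension 1.
pos = [(1.0, 5.0), (1.2, 0.0), (0.9, 9.0)]
neg = [(4.0, 5.1), (4.2, 0.2), (3.9, 8.8)]
q = (1.1, 5.0)

lc_dim0 = likelihood_contrast(q, pos, neg, (0,), 0.5, 0.5)
lc_dim1 = likelihood_contrast(q, pos, neg, (1,), 0.5, 0.5)
print(lc_dim0 > lc_dim1)
```

On this data the contrast in dimension 0 is far larger than in dimension 1, which matches the intuition that q resembles the positive class only in the dimension where the classes differ.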
Complexity
- MAX SNP-hard
– Reduction from the emerging pattern mining problem
- Unlikely to admit a good approximation algorithm: MAX SNP-hard problems have no polynomial-time approximation scheme unless P = NP
A Monotonic Bound
- $L_S(q \mid O_+)$ is not monotonic in subspaces
- Develop an upper bound of $L_S(q \mid O_+)$, which is monotonic in subspaces
– Sort all the dimensions in descending order of standard deviation
– Let $\mathcal{S}$ be the set of children of $S$ in the subspace set enumeration tree using the standard deviation descending order
– $$L^*_S(q \mid O_+) = \frac{1}{|O_+|\sqrt{2\pi}\,\sigma'_{min}\,h'_{opt\_min}} \sum_{o \in O_+} e^{-\frac{dist_S(q,o)^2}{2(\sigma_S\,h'_{opt\_max})^2}}$$
– where $\sigma'_{min} = \min\{\sigma_{S'} \mid S' \in \mathcal{S}\}$, $h'_{opt\_min} = \min\{h_{S'\_opt} \mid S' \in \mathcal{S}\}$, and $h'_{opt\_max} = \max\{h_{S'\_opt} \mid S' \in \mathcal{S}\}$
Monotonic Bound
For a query object $q$, a set of objects $O_+$, and subspaces $S_1$, $S_2$ such that $S_1$ is an ancestor of $S_2$ in the subspace set enumeration tree using the standard deviation descending order in $O_+$,
$$L^*_{S_1}(q \mid O_+) \ge L_{S_2}(q \mid O_+).$$
Baseline algorithm time complexity: $O(2^{|D|} \cdot (|O_+| + |O_-|))$
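The exponential baseline cost comes from visiting every non-empty subspace and scanning both classes in each one. A minimal Python sketch of such an exhaustive search is given below. It assumes a single fixed bandwidth `h` for all subspaces and both classes, which is a simplification of the per-subspace bandwidths used in the formulas, and the function names and toy data are hypothetical.

```python
import math
from itertools import combinations

def likelihood(q, objects, subspace, h):
    """Gaussian kernel estimate L_S(q | O); one fixed bandwidth h is an assumption."""
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    total = 0.0
    for o in objects:
        sq_dist = sum((q[d] - o[d]) ** 2 for d in subspace)
        total += math.exp(-sq_dist / (2 * h * h))
    return total / norm

def baseline_contrast_subspace(q, pos, neg, h, delta):
    """Enumerate all 2^|D| - 1 non-empty subspaces and keep the one with the largest LC."""
    best_subspace, best_lc = None, -math.inf
    dims = range(len(q))
    for size in range(1, len(q) + 1):
        for subspace in combinations(dims, size):
            l_pos = likelihood(q, pos, subspace, h)
            if l_pos < delta:          # triviality filter: require L_S(q | O+) >= delta
                continue
            l_neg = likelihood(q, neg, subspace, h)
            lc = l_pos / l_neg if l_neg > 0 else math.inf
            if lc > best_lc:
                best_subspace, best_lc = subspace, lc
    return best_subspace, best_lc

# Hypothetical toy data.
pos = [(1.0, 5.0), (1.2, 0.0), (0.9, 9.0)]
neg = [(4.0, 5.1), (4.2, 0.2), (3.9, 8.8)]
q = (1.1, 5.0)
best_subspace, best_lc = baseline_contrast_subspace(q, pos, neg, h=0.5, delta=0.01)
print(best_subspace, best_lc)
```

Each subspace costs one pass over both classes, so the loop structure mirrors the stated bound directly. The monotonic upper bound is what would allow whole branches of this enumeration to be skipped.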
Bounding Using Neighborhoods
- Divide the neighborhood of an object into two parts: $N^\epsilon_S(q) = \{o \in O \mid dist_S(q, o) \le \epsilon\}$ and the rest
- Then,
$$L_S(q \mid O) = L_{N^\epsilon_S}(q \mid O) + L^{rest}_S(q \mid O)$$
$$L_{N^\epsilon_S}(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \sum_{o \in N^\epsilon_S(q)} e^{-\frac{dist_S(q,o)^2}{2h_S^2}}$$
$$L^{rest}_S(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \sum_{o \in O \setminus N^\epsilon_S(q)} e^{-\frac{dist_S(q,o)^2}{2h_S^2}}$$
Bounding the Rest
- Let $dist_S(q \mid O)$ be the maximum distance between $q$ and all objects in $O$ in subspace $S$
$$\frac{|O| - |N^\epsilon_S(q)|}{|O|\sqrt{2\pi}\,h_S} \cdot e^{-\frac{dist_S(q \mid O)^2}{2h_S^2}} \le L^{rest}_S(q \mid O) \le \frac{|O| - |N^\epsilon_S(q)|}{|O|\sqrt{2\pi}\,h_S} \cdot e^{-\frac{\epsilon^2}{2h_S^2}}$$
Bounding
For a query object $q$, a set of objects $O$ and $\epsilon \ge 0$,
$$LL^\epsilon_S(q \mid O) \le L_S(q \mid O) \le UL^\epsilon_S(q \mid O)$$
where
$$LL^\epsilon_S(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \left( \sum_{o \in N^\epsilon_S(q)} e^{-\frac{dist_S(q,o)^2}{2h_S^2}} + (|O| - |N^\epsilon_S(q)|)\, e^{-\frac{dist_S(q \mid O)^2}{2h_S^2}} \right)$$
and
$$UL^\epsilon_S(q \mid O) = \frac{1}{|O|\sqrt{2\pi}\,h_S} \left( \sum_{o \in N^\epsilon_S(q)} e^{-\frac{dist_S(q,o)^2}{2h_S^2}} + (|O| - |N^\epsilon_S(q)|)\, e^{-\frac{\epsilon^2}{2h_S^2}} \right)$$
For a query object $q$, a set of objects $O_+$, a set of objects $O_-$, and $\epsilon \ge 0$,
$$LC_S(q) \le \frac{UL^\epsilon_S(q \mid O_+)}{LL^\epsilon_S(q \mid O_-)}.$$
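The neighborhood bounds can be checked numerically. The Python sketch below computes the lower and upper bounds and the resulting upper bound on the contrast. For clarity it computes every distance up front, so it shows the arithmetic of the bounds rather than the savings an indexed implementation might obtain. All names, the bandwidth, and the toy data are assumptions.

```python
import math

def dist(q, o, subspace):
    return math.sqrt(sum((q[d] - o[d]) ** 2 for d in subspace))

def exact_likelihood(q, objects, subspace, h):
    """L_S(q | O) summed over every object."""
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    return sum(math.exp(-dist(q, o, subspace) ** 2 / (2 * h * h)) for o in objects) / norm

def likelihood_bounds(q, objects, subspace, h, eps):
    """Return (LL, UL): exact kernel terms inside the eps-neighborhood, bounds for the rest."""
    dists = [dist(q, o, subspace) for o in objects]
    near = [d for d in dists if d <= eps]
    n_rest = len(dists) - len(near)
    max_dist = max(dists)                      # maximum distance from q to any object
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    near_sum = sum(math.exp(-d * d / (2 * h * h)) for d in near)
    lower = (near_sum + n_rest * math.exp(-max_dist ** 2 / (2 * h * h))) / norm
    upper = (near_sum + n_rest * math.exp(-eps ** 2 / (2 * h * h))) / norm
    return lower, upper

def contrast_upper_bound(q, pos, neg, subspace, h, eps):
    """Upper bound on LC_S(q): UL on the positive class over LL on the negative class."""
    _, upper_pos = likelihood_bounds(q, pos, subspace, h, eps)
    lower_neg, _ = likelihood_bounds(q, neg, subspace, h, eps)
    return upper_pos / lower_neg

# Hypothetical one-dimensional toy data.
pos = [(0.0,), (0.5,), (1.0,), (3.0,), (6.0,)]
neg = [(2.0,), (2.5,), (4.0,), (7.0,)]
q = (0.2,)
for eps in (0.5, 1.0, 10.0):
    lower, upper = likelihood_bounds(q, pos, (0,), 1.0, eps)
    print(eps, lower, exact_likelihood(q, pos, (0,), 1.0), upper)
```

As `eps` grows, more objects are handled exactly and the two bounds close in on the exact value. If the contrast upper bound for a subspace falls below the best contrast found so far, that subspace can be skipped without computing its exact likelihoods.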
Algorithm
Dimensionality of Inlying Contrast Subspaces
Dimensionality of Outlying Contrast Subspaces
Runtime
In Which Aspects Is Johnson Good?
[Figure: three scatter plots with the point "Joe" marked, one for each pair of Assist, Personal foul, and Points/game]
Fraud Investigation
- Given a set of claims in an insurance company
- For a claim c, in which aspects is c most different from the other claims?
Outlying/Outstanding Aspect Mining
- Given a set of objects in a multi-dimensional space
- For an object q, find the subspaces where q is most unusual compared to the rest of the data
Differences from Outlier Detection
- Outlier detection finds objects that are different from the rest of the data
- The query object in outlying aspect finding may not be an outlier
Problem Formulation
- A set of objects $O$ in full space $D = \{D_1, \ldots, D_d\}$
- Query object $q$
- The density of $q$ measures how outlying (uncommon) $q$ is
– Density estimation:
$$\hat{f}_h(o) = \frac{1}{n} \sum_{i=1}^{n} K_h(o - o_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{o - o_i}{h}\right)$$
- Find a subspace where the density of $q$ is lowest?
Why Rank Statistics?
- Densities in different subspaces are not comparable
- We compare the same set of objects in different subspaces
- Rank statistics
$$rank_S(o) = |\{o' \mid o' \in O,\ OutDeg(o') < OutDeg(o)\}| + 1$$
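A small Python sketch may help make the rank statistic concrete. It assumes that OutDeg is an unnormalised Gaussian kernel density computed inside the subspace, so the object with the lowest density receives rank 1. The bandwidth, function names, and data are illustrative assumptions.

```python
import math

def density_score(p, objects, subspace, h):
    """Assumed OutDeg: unnormalised Gaussian kernel density of p within `subspace`."""
    return sum(
        math.exp(-sum((p[d] - o[d]) ** 2 for d in subspace) / (2 * h * h))
        for o in objects
    )

def rank_in_subspace(q, objects, subspace, h):
    """rank_S(q) = number of objects with a strictly smaller score than q, plus one."""
    score_q = density_score(q, objects, subspace, h)
    return 1 + sum(1 for o in objects if density_score(o, objects, subspace, h) < score_q)

# Hypothetical data: the last object is isolated on dimension 0 but central on dimension 1.
objects = [(0.0, 1.0), (0.1, 2.0), (0.2, 3.0), (0.1, 4.0), (5.0, 2.5)]
q = objects[4]
rank_dim0 = rank_in_subspace(q, objects, (0,), 0.5)
rank_dim1 = rank_in_subspace(q, objects, (1,), 0.5)
print(rank_dim0, rank_dim1)
```

Because the rank only counts objects within one subspace, it stays meaningful when raw density values from subspaces of different dimensionality are not comparable.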
Unsupervised Problem Formulation
Given a set of objects $O$ in a multidimensional space $D$, a query object $q \in O$ and a maximum dimensionality threshold $0 < \ell \le |D|$, a subspace $S \subseteq D$ ($0 < |S| \le \ell$) is called a minimal outlying subspace of $q$ if
- 1. (Rank minimality) there does not exist another subspace $S' \subseteq D$ ($S' \ne \emptyset$), such that $rank_{S'}(q) < rank_S(q)$; and
- 2. (Subspace minimality) there does not exist another subspace $S'' \subset S$ such that $rank_{S''}(q) = rank_S(q)$.
The problem of outlying aspect mining is to find the minimal outlying subspaces of $q$.
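The two conditions can be illustrated with an exhaustive search over small subspaces. The Python sketch below is only a brute-force reading of the definition. It assumes that competing subspaces are also limited to at most `max_dim` dimensions, uses an unnormalised product Gaussian kernel with fixed per-dimension bandwidths as the density, and all names and data are hypothetical.

```python
import math
from itertools import combinations

def quasi_density(p, objects, subspace, h):
    """Unnormalised product-kernel density; h[d] is an assumed bandwidth for dimension d."""
    return sum(
        math.exp(-sum((p[d] - o[d]) ** 2 / (2 * h[d] ** 2) for d in subspace))
        for o in objects
    )

def rank_in_subspace(q, objects, subspace, h):
    fq = quasi_density(q, objects, subspace, h)
    return 1 + sum(1 for o in objects if quasi_density(o, objects, subspace, h) < fq)

def minimal_outlying_subspaces(q, objects, h, max_dim):
    """Brute force: rank q in every subspace of size 1..max_dim, then apply both conditions."""
    dims = range(len(q))
    ranks = {
        subspace: rank_in_subspace(q, objects, subspace, h)
        for size in range(1, max_dim + 1)
        for subspace in combinations(dims, size)
    }
    best_rank = min(ranks.values())                          # rank minimality
    winners = [s for s, r in ranks.items() if r == best_rank]
    # Subspace minimality: drop any winner that has a proper subset with the same rank.
    return [s for s in winners if not any(set(t) < set(s) for t in winners)]

# Hypothetical data: the last object is isolated on dimension 0 only.
objects = [(0.0, 1.0), (0.1, 2.0), (0.2, 3.0), (0.1, 4.0), (5.0, 2.5)]
q = objects[4]
result = minimal_outlying_subspaces(q, objects, h=[0.5, 0.5], max_dim=2)
print(result)
```

In this toy case q has the lowest density both in dimension 0 alone and in the full space, and subspace minimality keeps only the smaller of the two.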
Density Estimation for Ranking
$$\hat{f}_S(q) \sim \tilde{f}_S(q) = \sum_{o \in O} e^{-\sum_{D_i \in S} \frac{(q.D_i - o.D_i)^2}{2h_{D_i}^2}}$$
- Invariance
Given a set of objects $O$ in space $S = \{D_1, \ldots, D_d\}$, define a linear transformation $g(o) = (a_1 o.D_1 + b_1, \ldots, a_d o.D_d + b_d)$ for any $o \in O$, where $a_1, \ldots, a_d$ and $b_1, \ldots, b_d$ are real numbers. Let $O' = \{g(o) \mid o \in O\}$ be the transformed data set. For any objects $o_1, o_2 \in O$ such that $\tilde{f}_S(o_1) > \tilde{f}_S(o_2)$ in $O$, $\tilde{f}_S(g(o_1)) > \tilde{f}_S(g(o_2))$ in $O'$ if the product kernel is used and the bandwidths are set using Härdle's rule of thumb.
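The invariance property can be checked numerically. The Python sketch below assumes the common form of Härdle's rule of thumb, $h = 1.06 \cdot \min\{\sigma, R/1.34\} \cdot n^{-1/5}$ with $R$ the interquartile range, and applies a transformation with positive scale factors. The data, the scale and shift values, and the function names are illustrative assumptions.

```python
import math
import statistics

def hardle_bandwidth(values):
    """Rule-of-thumb bandwidth (assumed form): 1.06 * min(sigma, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    sigma = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 1.06 * min(sigma, (q3 - q1) / 1.34) * n ** (-0.2)

def quasi_densities(objects):
    """Product-kernel quasi-density of every object, one bandwidth per dimension."""
    d = len(objects[0])
    h = [hardle_bandwidth([o[i] for o in objects]) for i in range(d)]
    return [
        sum(
            math.exp(-sum((p[i] - o[i]) ** 2 / (2 * h[i] ** 2) for i in range(d)))
            for o in objects
        )
        for p in objects
    ]

def ranking(scores):
    """Object indices ordered from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

# Hypothetical data and transformation g(o) = (a_1 * o.D_1 + b_1, a_2 * o.D_2 + b_2).
objects = [(1.0, 20.0), (1.5, 22.0), (2.0, 19.0), (2.2, 25.0),
           (3.0, 21.0), (3.1, 30.0), (4.0, 18.0), (9.0, 23.0)]
a, b = (3.0, 0.5), (10.0, -4.0)
transformed = [tuple(a[i] * o[i] + b[i] for i in range(2)) for o in objects]

original_scores = quasi_densities(objects)
transformed_scores = quasi_densities(transformed)
print(ranking(original_scores) == ranking(transformed_scores))
```

The rule-of-thumb bandwidth scales with the data, so each term $(q.D_i - o.D_i)^2 / h_{D_i}^2$ is unchanged by rescaling or shifting a dimension. The quasi-densities, and therefore the ranks, agree up to floating-point error.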
Algorithm Framework
Pruning Rule 1
- If $rank_S(q) = 1$, according to the dimensionality minimality condition in the problem definition, all super-spaces of $S$ can be pruned.
- Pruning on other ranks or density values?
– Neither rank nor density is monotonic with respect to subspaces
Reducing Density Estimation Cost
- To obtain the exact rank statistics in a subspace, the query object has to be compared with every other object
- By estimating density values using neighborhoods, the cost of density computation can be reduced
Cross Subspace Pruning
- For subspaces $S \subset S'$, by estimating the bounds of possible changes in density, the range of the rank in $S'$ can be estimated from the rank in $S$
- Some subspaces can be pruned using the ranges
Distribution of Ranks
Distribution of # Outlying Aspects
Computational Performance
Conclusions
- Finding outlying/outstanding aspects and contrast subspaces has many applications
- Computationally, the problem is challenging: it cannot even be approximated well
- Future work
– Faster algorithms
– More effective measures
– Scaling out
Papers
- L. Duan, G. Tang, J. Pei, J. Bailey, G. Dong, A. Campbell, and C. Tang. "Mining Contrast Subspaces". In Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'14), Tainan, Taiwan, May 13-16, 2014. (Best Paper Award)
- L. Duan, G. Tang, J. Pei, J. Bailey, G. Dong, A. Campbell, and C. Tang. "Mining Outlying Aspects on Numeric Data". ECML/PKDD 2015, and to appear in Data Mining and Knowledge Discovery, Springer-Verlag.