An Embedding Approach to Anomaly Detection
Renjun Hu¹, Charu Aggarwal², Shuai Ma¹, and Jinpeng Huai¹
¹SKLSDE Lab, Beihang University, China
²IBM T. J. Watson Research Center, USA
expected behaviors [Chandola et al. 2009]
mining tasks such as community detection and classification
have complementary sources to information
Burt, Ronald S. (1992). Structural holes: the social structure of competition. Harvard University Press. Burt, Ronald S. (2004). Structural Holes and Good Ideas. American Journal of Sociology 110 (2): 349–399.
than B, even though they have the same number of links.
How to detect social brokers? A formal quantitative definition is needed in the first place!
diverse influential communities
E.g., community detection, collective classification, link prediction, influence analysis
review of sociology, Vol. 27: 415-444, 2001.
diverse influential communities
E.g., all nodes tend to form one large cluster
E.g., hard for community detection algorithms to achieve meaningful clusters
C, even though they have the same (global) distance from B.
and the d communities
A reasonable selection of d suffices for anomaly detection; it is not necessary to use the number of real-life communities.
components in O
$$O = \sum_{(i,j) \in E} \|X_i - X_j\|^2 + \alpha \cdot \sum_{(i,j) \notin E} \left(\|X_i - X_j\| - 1\right)^2, \quad \alpha = \frac{m}{\binom{n}{2} - m}$$
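A minimal sketch of evaluating this objective on a toy graph may help make the two components concrete. It assumes O sums squared distances over edges plus α times the squared deviation from unit distance over non-edges, with α = m / (C(n,2) − m). The function name `objective`, the toy graph, and the hand-picked coordinates (scaled by 1/√2 so that nodes in different groups sit exactly one unit apart) are illustrative assumptions, not taken from the slides.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def objective(X, edges):
    """Edge term plus alpha-weighted non-edge term (assumed form of O)."""
    n = len(X)
    edge_set = {frozenset(e) for e in edges}
    m = len(edge_set)
    non_edge_count = n * (n - 1) // 2 - m
    # alpha balances the two terms: ratio of edges to non-edges.
    alpha = m / non_edge_count if non_edge_count else 0.0
    attract = 0.0  # linked nodes should be close
    repel = 0.0    # unlinked nodes should be about one unit apart
    for i, j in combinations(range(n), 2):
        d_ij = dist(X[i], X[j])
        if frozenset((i, j)) in edge_set:
            attract += d_ij ** 2
        else:
            repel += (d_ij - 1.0) ** 2
    return attract + alpha * repel


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
r = 1 / math.sqrt(2)
X_good = [(r, 0.0)] * 3 + [(0.0, r)] * 3  # one axis per triangle
X_bad = [(r, 0.0)] * 6                    # everything collapsed to one point

print(objective(X_good, edges), objective(X_bad, edges))
```

In this toy setting the community-aligned placement scores about 1 (only the bridging edge is penalized), while collapsing all nodes to one point scores 7, since every non-edge then violates the unit-distance target.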
$$NB(i) = (y_i^1, \ldots, y_i^d) = \sum_{(i,j) \in E} \left(1 - \|X_i - X_j\|\right) \cdot X_j$$
$$AScore(i) = \sum_{k=1}^{d} \frac{y_i^k}{y_i^*}, \quad y_i^* = \max\{y_i^1, \ldots, y_i^d\}$$
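The two definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it reads NB(i) as the sum over neighbors j of (1 − ‖X_i − X_j‖) · X_j and AScore(i) as the sum of the components of NB(i) divided by the largest component. The names `nb_vector` and `ascore`, the adjacency-dict representation, the zero returned for degenerate vectors, and the hand-placed coordinates (not the result of optimizing O) are all hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nb_vector(i, X, adj):
    """NB(i): neighbor positions weighted by closeness (1 - distance)."""
    y = [0.0] * len(X[i])
    for j in adj[i]:
        w = 1.0 - dist(X[i], X[j])
        for k, x_jk in enumerate(X[j]):
            y[k] += w * x_jk
    return y


def ascore(i, X, adj):
    """Sum of NB(i) components divided by the largest component."""
    y = nb_vector(i, X, adj)
    y_max = max(y)
    if y_max <= 0.0:
        return 0.0  # assumption: treat isolated or degenerate nodes as score 0
    return sum(y) / y_max


# Hypothetical toy: node 0 links to three 2-node communities (d = 3).
r = 1 / math.sqrt(2)
X = [
    (r / 3, r / 3, r / 3),         # node 0: pulled toward all three groups
    (r, 0.0, 0.0), (r, 0.0, 0.0),  # nodes 1, 2: community on axis 1
    (0.0, r, 0.0), (0.0, r, 0.0),  # nodes 3, 4: community on axis 2
    (0.0, 0.0, r), (0.0, 0.0, r),  # nodes 5, 6: community on axis 3
]
adj = {0: [1, 3, 5], 1: [0, 2], 2: [1], 3: [0, 4], 4: [3], 5: [0, 6], 6: [5]}

for node in (0, 1, 2):
    print(node, round(ascore(node, X, adj), 3))
```

Under this reading, a node whose neighbors all lie along one dimension scores close to 1, while a node whose neighbors spread evenly over several dimensions scores close to the number of dimensions involved, which is the sense in which a high AScore flags a node bridging diverse communities.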
because of missing edges
(no better embedding)
in dimensions of NB(red))
dimension)
$$O = \sum_{(i,j) \in E} \|X_i - X_j\|^2 + \alpha \cdot \sum_{(i,j) \notin E} \left(\|X_i - X_j\| - 1\right)^2$$
$$AScore(i) = \sum_{k=1}^{d} \frac{y_i^k}{y_i^*}, \quad y_i^* = \max\{y_i^1, \ldots, y_i^d\}$$
The red node is detected as an anomaly!
Leskovec 2012]
2012.
$$O \approx \sum_{(i,j) \in E} \|X_i - X_j\|^2 + \sum_{(i,j) \in E_s} \left(\|X_i - X_j\| - 1\right)^2, \quad E_s \subset \{(i,j) \mid (i,j) \notin E\}$$
$$O = \sum_{(i,j) \in E} \|X_i - X_j\|^2 + \alpha \cdot \sum_{(i,j) \notin E} \left(\|X_i - X_j\| - 1\right)^2, \quad \alpha = \frac{m}{\binom{n}{2} - m}$$
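The sampling idea can be sketched as follows, with the caveat that the slides do not state how large E_s is or how it is drawn. This sketch assumes E_s is a uniform sample of m distinct non-edges (capped at the number available), so the non-edge term has as many summands as the edge term. The helper names `sample_non_edges` and `sampled_objective`, the seed parameter, and the toy graph are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_non_edges(n, edge_set, size, rng):
    """Rejection-sample up to `size` distinct unlinked pairs (i < j)."""
    available = n * (n - 1) // 2 - len(edge_set)
    size = min(size, available)  # avoid looping forever on dense graphs
    sampled = set()
    while len(sampled) < size:
        i, j = rng.sample(range(n), 2)
        pair = (min(i, j), max(i, j))
        if pair not in edge_set:
            sampled.add(pair)
    return sampled


def sampled_objective(X, edges, seed=0):
    """Edge term plus an unweighted term over sampled non-edges E_s."""
    edge_set = {(min(i, j), max(i, j)) for i, j in edges}
    rng = random.Random(seed)
    # Assumption: |E_s| = m, the number of edges.
    e_s = sample_non_edges(len(X), edge_set, len(edge_set), rng)
    attract = sum(dist(X[i], X[j]) ** 2 for i, j in edge_set)
    repel = sum((dist(X[i], X[j]) - 1.0) ** 2 for i, j in e_s)
    return attract + repel


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
r = 1 / math.sqrt(2)
X_good = [(r, 0.0)] * 3 + [(0.0, r)] * 3
X_bad = [(r, 0.0)] * 6

print(sampled_objective(X_good, edges), sampled_objective(X_bad, edges))
```

With |E_s| = m and uniform sampling, the unweighted sampled term has the same expected value as the α-weighted sum over all non-edges, because α · (C(n,2) − m) = m. Only about 2m pairs are touched per evaluation rather than all C(n,2).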
have similar values of Xi
diverse values of Xi
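$$X_i = (x_i^1, \ldots, x_i^d)$$
where $x_i^j$ takes one value if $j = P_i$ and another if $j \neq P_i$ (the two values are not recoverable from the extracted slide text).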
limited number of communities
suffice to ascertain anomalies
(Gordon) Hughes Effect
Technique           Space                                  Efficiency                                  Effectiveness
Sampling            /                                      Prev.: O(n^2∙d), After: O(m∙d)              Remains effective (from experiments)
Graph partitioning  /                                      Prev.: 0, After: O(n+m+d∙log(d))            Provides a good initialization
k+β reduction       Prev.: O(n∙d), After: O(n∙(k+β))       Prev.: O(t∙m∙d), After: O(t∙m∙(k+β))        Slightly improves effectiveness
t: number of iterations
Dataset    # of nodes       # of edges   Description
Amazon     334,863          925,872      Product co-purchasing
DBLP       1,150,852        5,098,175    Co-authorship
Synthetic  10^5 - 4x10^6    m = n^1.15   LFR-benchmark graph
embedding (preserve global structure)
Wei Wang
California, San Diego (USA), etc.
100 countries
cyber security, sensor networks and data mining
community detection
Dataset  (unlabeled)  Embed(d)  Embed(k+β)
Amazon   2.1%         2.8%      3.0%
DBLP     4.2%         4.1%      5.6%
Table 1: Improvement of modularity
community structure ↓)
Setting              (unlabeled)  Embed(d)  Embed(k+β)
Varying graph sizes  70%          88%       89%
Varying μ            68%          86%       88%
Table 2: F1 score of anomalies
d         MDS(d)  Embed(d)
200       11.3%   89.4%
400       13.6%   90.6%
600       12.7%   89.8%
800       7.9%    85.5%
1000      11.3%   88.8%
Average   11.3%   88.8%
Table 3: MDS(d) vs. Embed(d) using F1 measure
x: out-of-memory exception
Dataset    E(k+β)/E(d)  E(k+β)/MDS(d)
Amazon     35.3%        25.0%
DBLP       23.4%        13.1%
Synthetic  25.6%        13.2%
Table 4: Running time comparison
inconsistencies and structural holes
communities
and Synthetic data