SLIDE 1

Modeling Heterogeneous Statistical Patterns in High-dimensional Data by Adversarial Distributions: An Unsupervised Generative Framework (FIRD)

Han Zhang¹, Wenhao Zheng³, Charley Chen¹, Kevin Gao¹, Yao Hu³, Ling Huang², Wei Xu¹

¹Tsinghua University  ²AHI Fintech  ³Youku Cognitive and Intelligent Lab, Alibaba Group

SLIDE 2

Fraud Hurts E-commerce Platforms in Many Ways

Fake reviews, identity theft, coupon hunting, payment fraud, merchant fraud, …

Fraud wastes over $1,000,000,000 a year.


SLIDE 3

Fraud Patterns vs. Normal Patterns [1, 2]

  • Fraudsters display synchronized behaviors.
  • In contrast, normal users are usually randomly distributed.


[Figure: fraudsters share resources (IP: 987.654.32.1, Phone No.: 12345) and run similar control scripts.]

[1] Girish Keshav Palshikar. 2002. The hidden truth: Frauds and their control: A critical application for business intelligence. Intelligent Enterprise 5, 9 (2002), 46–51.
[2] S. Benson Edwin Raj and A. Annie Portia. 2011. Analysis on credit card fraud detection methods. In 2011 International Conference on Computer, Communication and Electrical Technology (ICCCET). IEEE, 152–156.

SLIDE 4

Challenge 1: Fraud patterns change after exposure.

[Figure: after exposure (IP: 987.654.32.1, Phone No.: 12345), fraudsters buy a new IP and phone number (IP: 732.198.43.1, Phone No.: 54321), so existing fraud labels become obsolete for training.]

Takeaway: use unsupervised methods!


SLIDE 5

Challenge 2: Different Local Clustering Patterns


[Figure: sample records over the features IP, Phone No., GPS City, Device ID, and Email. Clusters A, B, and C are each synchronized on a different feature subset: only IP, only GPS City, or a combination of features.]

Select Useful Features!

SLIDE 6

Challenge 3: Noisy Random Normal Users


[Figure: six user groups (GS 1–GS 6) plotted by GPS City. Some normal users appear synchronized purely due to randomness. Ideally they are filtered out; in reality, methods that are not robust to this noise mislabel them, while a noise-robust method handles them correctly.]

SLIDE 7

Problem Definition – Clustering + Feature Selection

  • Discrete feature space.
    • Given a dataset $\mathcal{Y} = \{\boldsymbol{y}_o\}_{o=1}^{O}$, where each feature $y_{on}$ takes discrete values from $\{Y_{nk}\}_{k=1}^{K_n}$.
  • Local clustering patterns.
    • Data points are grouped into clusters $\{\mathcal{H}_h\}_{h=1}^{H}$.
    • Within each cluster $\mathcal{H}_h$, there exists a feature subset $\mathcal{F}_h$ such that $\forall \boldsymbol{y}, \boldsymbol{y}' \in \mathcal{H}_h$ and $\forall n \in \mathcal{F}_h$, $y_n = y'_n$ with high probability.
  • Goal: find all $\mathcal{H}_h$ and $\mathcal{F}_h$ while tolerating noise (see the sketch below).
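To make this concrete, here is a minimal sketch of synthetic data with local clustering patterns (hypothetical sizes, value ranges, and synchronization rate; not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_VALUES = 5, 50  # discrete features and values per feature

def make_cluster(size, sync_features, p_sync=0.95):
    """Cluster H_h: features in F_h share one value with high probability."""
    data = rng.integers(0, N_VALUES, size=(size, N_FEATURES))
    for n in sync_features:
        shared = rng.integers(0, N_VALUES)      # the synchronized value
        mask = rng.random(size) < p_sync        # "with high probability"
        data[mask, n] = shared
    return data

# Cluster 1 is synchronized only on feature 0, cluster 2 on features {2, 3};
# the remaining rows are randomly distributed normal users (noise).
Y = np.vstack([
    make_cluster(100, sync_features=[0]),
    make_cluster(100, sync_features=[2, 3]),
    rng.integers(0, N_VALUES, size=(300, N_FEATURES)),
])
```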


SLIDE 8

Key Results

  • Applicable to a variety of applications.
    • Fraud detection + anomaly detection.
  • Superior fraud detection performance.
    • 18% AUC improvement.
    • Interpretable results.
  • Superior anomaly detection performance.
    • Over 5% AUC improvement on average.
  • Robust to noise and hyperparameters.


SLIDE 9

Feature Selection in Clustering

  • Idea: delete some features, then cluster the data.
    • No feature should be deleted globally.
  • Three types of methods [3]:
    • Filter model: filter out low-quality features before clustering.
    • Wrapper model: enumerate feature combinations and evaluate clustering performance.
    • Hybrid model: select features during clustering.
  • *All three suffer from an identifiability issue in discrete space.


* We provide a proof in our paper.
[3] Salem Alelyani, Jiliang Tang, and Huan Liu. 2013. Feature Selection for Clustering: A Review. In Data Clustering: Algorithms and Applications. 29–60.

Challenge 2: LOCAL clustering patterns!

SLIDE 10

Dense Block Detection

  • Idea: high-density blocks in the data are potential anomalies [4, 5].
  • Steps (a toy sketch follows this list):
    • 1. Greedily search for the block with the highest density.
    • 2. Delete the block.
    • 3. Repeat the process on the remaining data.
  • Normal users with random synchronization significantly hurt the detection performance.
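Below is a toy sketch of this search-delete-repeat loop on a binary user-by-resource matrix; greedy degree peeling stands in for the actual M-Zoom/D-Cube search [4, 5]:

```python
import numpy as np

def densest_block(M):
    """Greedy peeling: drop the lowest-degree row/col, keep the best block."""
    rows, cols = list(range(M.shape[0])), list(range(M.shape[1]))
    best, best_density = (rows[:], cols[:]), 0.0
    while rows and cols:
        density = M[np.ix_(rows, cols)].sum() / (len(rows) + len(cols))
        if density > best_density:
            best, best_density = (rows[:], cols[:]), density
        row_deg = M[np.ix_(rows, cols)].sum(axis=1)
        col_deg = M[np.ix_(rows, cols)].sum(axis=0)
        if row_deg.min() <= col_deg.min():
            rows.pop(int(row_deg.argmin()))     # peel the weakest row
        else:
            cols.pop(int(col_deg.argmin()))     # ... or the weakest column
    return best

def detect_blocks(M, k=3):
    M = M.copy()
    blocks = []
    for _ in range(k):                   # 3. repeat on the remaining data
        rows, cols = densest_block(M)    # 1. greedy search
        blocks.append((rows, cols))
        M[np.ix_(rows, cols)] = 0        # 2. delete the block
    return blocks
```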


[4] Kijung Shin, Bryan Hooi, and Christos Faloutsos. 2016. M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees. ECML PKDD 2016, 264–280.
[5] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos. 2017. D-Cube: Dense-Block Detection in Terabyte-Scale Tensors. WSDM 2017, 681–689.

Challenge 3: Noise!

SLIDE 11

FIRD: A Generative Probabilistic Model

Feature Independence and adveRsarial Distributions.


SLIDE 12

Enumerating Possible Feature Combinations?

ⓧ Exponential number of feature combinations.
ⓧ Exponential number of feature value combinations.


[Figure: the features IP, Phone No., Active Time, GPS City, Device ID, and Email, and their value combinations (IP: A–C, PN: A–D, AT: A–B, GC: A–C, MA: A–D, EM: A–B), illustrating the exponential blow-up.]

SLIDE 13

A Decomposed Way of Feature Selection

  ✓ Conditional feature independence.
    • Features are independent within a cluster.
    • Linear complexity.
  ✓ Recognize the clustering pattern on each feature, then combine (factorization sketched below).
    • Use the adversarial distributions to fit the data.
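Concretely, conditional feature independence factorizes the per-cluster likelihood across features, and each factor mixes the two adversarial distributions. A sketch in the deck's notation, assuming $\nu_{hn}$ denotes the per-cluster, per-feature gate probability used in the generation process (slide 15):

$$q(\boldsymbol{y}_o \mid e_o = h) = \prod_{n=1}^{N} q(y_{on} \mid e_o = h), \qquad q(y_{on} \mid e_o = h) = \nu_{hn}\, q_{\mathrm{sparse}}(y_{on}) + (1 - \nu_{hn})\, q_{\mathrm{random}}(y_{on})$$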


SLIDE 14

Fitting Patterns Using Adversarial Distributions in Each Feature

  • For synchronized features in a cluster: a sparse distribution, concentrating probability on one value, e.g., (B, B, B, B, B, …).
  • For non-synchronized features in a cluster: a nearly random distribution, spreading probability across values, e.g., (A, D, C, B, E, …).


[Figure: two distributions over values A–E. The sparse distribution puts nearly all probability on B, generating (B, B, B, B, B, …); the nearly random distribution is close to uniform, generating (A, D, C, B, E, …).]

Solved Challenge 2: Detecting Local Clustering Patterns!

SLIDE 15

Observation Generation Process

  • Choose a cluster $e_o \sim \mathrm{Multinomial}(\boldsymbol{\rho})$.
  • For each feature $n$:
    • Choose an indicator variable $g_{on} \sim \mathrm{Bernoulli}(\nu_{e_o n})$.
    • If $g_{on} = 1$, generate the observation $y_{on}$ from the sparse multinomial distribution.
    • If $g_{on} = 0$, generate the observation $y_{on}$ from the nearly random multinomial distribution.

[Figure: within cluster $e_o$, a coin flip per feature (head/tail of $g_{on}$) selects between the sparse distribution and the nearly random distribution over values A–E.]
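A minimal sketch of this generation process (hypothetical parameter values; FIRD learns $\boldsymbol{\rho}$, $\boldsymbol{\nu}$, and the two multinomials from data):

```python
import numpy as np

rng = np.random.default_rng(0)
H, N, K = 3, 5, 20                       # clusters, features, values/feature

rho = np.full(H, 1.0 / H)                # cluster prior
nu = rng.uniform(0.0, 1.0, size=(H, N))  # per-cluster, per-feature gates
sparse = np.zeros((H, N, K))             # all mass on one value here
sparse[np.arange(H)[:, None], np.arange(N)[None, :],
       rng.integers(0, K, size=(H, N))] = 1.0
random_ = np.full((H, N, K), 1.0 / K)    # nearly random: uniform here

def sample_observation():
    e = rng.choice(H, p=rho)                      # e_o ~ Multinomial(rho)
    y = np.empty(N, dtype=int)
    for n in range(N):
        g = rng.random() < nu[e, n]               # g_on ~ Bernoulli(nu_en)
        p = sparse[e, n] if g else random_[e, n]  # pick the distribution
        y[n] = rng.choice(K, p=p)                 # draw the observed value
    return e, y
```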

SLIDE 16

Noise Reduction

  • Noise: outliers that are dissimilar to all clusters.
  • An information-theoretic rule recognizes an outlier:

$$J(\boldsymbol{y}_o \mid e_o = h) = -\log q(\boldsymbol{y}_o \mid e_o = h) < (1 + \vartheta)\, I[q(\boldsymbol{y}_o \mid e_o = h)]$$

An observation $\boldsymbol{y}_o$ that violates this bound, i.e., whose surprisal exceeds $(1 + \vartheta)$ times the information content $I[\cdot]$ of its cluster's distribution, is treated as an outlier.

[Figure: an outlier $\boldsymbol{y}_o$ lying far from the cluster distribution $q(\boldsymbol{y}_o \mid e_o = h)$.]
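A sketch of this rule, assuming $I[\cdot]$ denotes the entropy of the cluster's distribution and $\vartheta$ a slack hyperparameter:

```python
import numpy as np

def is_outlier(q_y, q_dist, theta=0.5):
    """Flag y_o when its surprisal exceeds (1 + theta) times the entropy."""
    surprisal = -np.log(q_y)                          # J(y_o | e_o = h)
    entropy = -np.sum(q_dist * np.log(q_dist + 1e-12))
    return surprisal >= (1.0 + theta) * entropy

# A peaked cluster distribution has low entropy, so an improbable value
# is flagged as noise instead of being forced into the cluster.
q_dist = np.array([0.96, 0.01, 0.01, 0.01, 0.01])
print(is_outlier(q_y=0.01, q_dist=q_dist))  # True
```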

Solve Challenge 3: Noise from normal users.

SLIDE 17

Probabilistic Inference Based on FIRD

  • Infer a label $\ell$ for each observation given the label of each cluster:

$$\ell_o \triangleq \mathbb{E}_{e_o}[\ell \mid \boldsymbol{y}_o] = \sum_{h=1}^{H} q(\ell \mid e_o = h)\, q(e_o = h \mid \boldsymbol{y}_o)$$

  • Cluster labels $q(\ell \mid e_o = h)$ are easier to obtain:
    • #Clusters << #Observations.
    • Cluster patterns are easier to classify.
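A sketch of this inference step, with hypothetical cluster labels encoded as fraud probabilities:

```python
import numpy as np

def infer_label(cluster_labels, posterior):
    """ell_o = sum_h q(ell | e_o = h) * q(e_o = h | y_o)."""
    return float(np.dot(cluster_labels, posterior))

# Two clusters labeled fraud (1.0) and normal (0.0); an observation with
# posterior (0.9, 0.1) over the clusters gets fraud score 0.9.
print(infer_label(np.array([1.0, 0.0]), np.array([0.9, 0.1])))  # 0.9
```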


[Figure: an observation assigned to Cluster A or Cluster B via the posterior $q(e_o = h \mid \boldsymbol{y}_o)$.]

From Clustering to Fraud Label Assignment

SLIDE 18

Experimental Evaluations

Our Cython code of FIRD is available at https://github.com/fingertap/fird.cython.


SLIDE 19

Identify Fraud Groups

  • Dataset
    • We collect registration records from an e-commerce platform.
    • An account is labeled as fraud if any malicious behavior is observed.
    • Labels are used only for evaluation.
  • Objectives
    • Good performance.
    • High interpretability.


SLIDE 20

Identify Fraud Groups - Performance

  • We compare with the dense block detection methods [4, 5]:
    • N:F is the ratio of normal users to fraudsters.
    • A higher N:F means more noise.


18% AUC ↑ Robust to noise!

SLIDE 21

Interpretability: Visualize Detected Clusters

[Figure: user count for each of the 20 detected clusters (discrete semantic representations), broken down into filtered fraudsters, fraudsters, filtered normal users, and normal users. The clusters separate fraud groups, synchronized normal users, and normal users mixed with individual fraudsters.]


SLIDE 22

Interpretability: Visualize One Fraud Cluster

[Figure: feature importance ($\boldsymbol{\nu}_2$) per feature, revealing the fraud signature shared by 3 fraud groups.]

10 random samples from the cluster:

Channel | Device ID | Time | IP           | IP City      | Phone   | Phone City     | OS Type
B       | 180061    | 5    | 737.32.7.7   | Coryborough  | 7037671 | West Kristen   | sim
B       | 405376    | 5    | 162.70.28.7  | Amandaview   | 2916214 | New Mariafurt  | android
B       | 861328    | 5    | 162.70.28.7  | Amandaview   | 1320211 | East Erika     | sim
B       | 201199    | 5    | 848.712.23.7 | Port Heather | 6571178 | Valerieside    | android
B       | 162176    | 15   | 761.326.87.7 | Luisstad     | 2064801 | Thompsonbury   | android
B       | 498726    | 5    | 761.326.87.7 | Luisstad     | 7932753 | Edwardsfurt    | android
B       | 893969    | 5    | 654.21.270.7 | Luisstad     | 6699477 | New Mariafurt  | android
B       | 195884    | 5    | 654.21.270.7 | Luisstad     | 1507813 | New Robertland | android
B       | 221445    | 5    | 654.21.270.7 | Luisstad     | 2611409 | West Kellyport | android
B       | 148534    | 5    | 90.713.87.7  | Luisstad     | 2999196 | West Kristen   | android


SLIDE 23

Interpretability: Visualize One Fraud Feature

[Figure: for the IP feature, user count and learned probability $\alpha$ per IP value (e.g., 761.326.87.7, 848.712.23.7, 654.21.270.7, 737.32.7.7, …), with fraudsters and normal users marked. The plot exposes a mislabeled fraudster and a group of synchronized normal users.]


SLIDE 24

Anomaly Detection

  • Assumption: anomalies are distant from the data manifolds [9].
  • Feature selection idea: subsampling and ensembling (a sketch follows this list).
  • This still enumerates the exponentially many feature combinations.
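A minimal sketch of the subsampling-and-ensemble idea, using a toy rarity detector and a plain average (LSCP [9] combines base detectors locally rather than averaging):

```python
import numpy as np

rng = np.random.default_rng(0)

def rarity_score(X):
    """Toy base detector: surprisal of each row's discrete feature values."""
    score = np.zeros(len(X))
    for n in range(X.shape[1]):
        _, inverse, counts = np.unique(X[:, n], return_inverse=True,
                                       return_counts=True)
        score += -np.log(counts[inverse] / len(X))
    return score

def ensemble_scores(X, n_detectors=10, subset_size=3):
    """Average the base detector over random feature subsets."""
    scores = np.zeros(len(X))
    for _ in range(n_detectors):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        scores += rarity_score(X[:, feats])   # detector on a feature subset
    return scores / n_detectors
```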


[9] Yue Zhao, Zain Nasrullah, Maciej K. Hryniewicki, and Zheng Li. 2019. LSCP: Locally Selective Combination in Parallel Outlier Ensembles. SDM 2019, 585–593.

[Figure: detectors score an anomaly on subsampled feature sets, e.g., {IP, Phone No., GPS City, Device ID, Email} and {IP, Phone No., GPS City, Email}.]

SLIDE 25

Comparison with SOTA Methods

  • More benchmark results are available in the PyOD benchmark.


Local clustering patterns matter in various cases!

SLIDE 26

Model Analysis – #Clusters: $H$

*Dimension capacity ratio: the ratio of the parameter $H$ to the ground-truth number of clusters.

Just choose a larger $H$.


SLIDE 27

Model Analysis – Regularizer Weight: $\mu$

Just choose a relatively large $\mu$.

*$\mu^{(1)}$ controls the selection of effective clusters; $\mu^{(2)}$ controls the adversarial distributions.
*$0 < \mu^{(1)}, \mu^{(2)} < 1$; the regularization effect weakens near the borders (0 and 1).


SLIDE 28

Model Analysis – Running Time

*We compare with the K-Means implementation in the Python package scikit-learn.
*We fix the #samples and the #values in each feature, varying the #features $N$.

Linear running time w.r.t. $N$


SLIDE 29

Conclusion

  • Fraud groups display synchronized behaviors on a subset of features.
  • Adversarial distributions select useful features by competing to fit the data.
  • Identifying local clustering patterns benefits various applications.
    • Up to an 18% AUC increase on fraud detection and over 5% on anomaly detection.


SLIDE 30

Thank you!

Q&A