Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]
Unsupervised Learning & Anomaly Detection
Overview
- Frequent itemset and association rule mining
- Other itemset extensions
- Clustering
- Anomaly detection
2
The analytics process
3
Recall
Predictive analytics (supervised learning)
- Predict the future based on patterns learnt from past data
- Classification (categorical) versus regression (continuous)
- You have a labelled data set at your disposal
Descriptive analytics (unsupervised learning)
- Describe patterns in data
- Clustering, association rules, sequence rules
- No labelling required
For unsupervised learning, we don't assume a label or target 4
Frequent itemset and association rule mining
5
Introduction
Association rule learning is a method for discovering interesting relations between variables
- Interesting? Frequent, rare, costly, strange?
- Intended to identify strong rules discovered in databases using some measures of interestingness
- For example, the rule {onions, tomatoes, ketchup} → {burger} found in the sales data of a supermarket would indicate that if a customer buys onions, tomatoes and ketchup together, they are likely to also buy hamburger meat, which can be used e.g. for promotional pricing or product placements
- Application areas include market basket analysis, web usage mining, intrusion detection, and production and manufacturing
- Association rule learning typically does not consider the order of items within a transaction (sequence mining does)
Pioneering technique: apriori algorithm (Rakesh Agrawal, 1993) 6
{beer, diapers}?
https://www.itbusiness.ca/news/behind-the-beer-and-diapers-data-mining-legend/136
- 1992: Thomas Blischok, manager of a retail consulting group at Teradata
- Prepared an analysis of 1.2 million market baskets from 25 Osco Drug stores
- Database queries were developed to identify affinities
- The analysis "did discover that between 5:00 and 7:00 p.m., consumers bought beer and diapers"
- Osco managers did not exploit the beer and diapers relationship by moving the products closer together
7
{lotion, calcium, zinc}?
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a- teen-girl-was-pregnant-before-her-father-did/
- Before long some useful patterns emerged: women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester
- Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc
- When someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date
- 25 products that, when analyzed together, allowed assigning each shopper a "pregnancy prediction" score. More important, her due date could also be estimated to within a small window, so Target could send coupons timed to very specific stages of her pregnancy
- "My daughter got this in the mail! She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?"
8
Transactional database
Every instance now represents a transaction
Features correspond to columns: binary indicators (one per item)
[Table: transactional database with transactions 101-105 over the items milk, bread, beer, cheese, wine and spaghetti (binary purchase matrix)] 9
Mining interesting rules
What constitutes a "good" rule?
To select rules, constraints on various measures of "interestingness" are used
Best-known measures/constraints: minimum thresholds on support and confidence
Support(X ⊆ I, T) = |{t ∈ T : X ⊆ t}| / |T|
[Table: same transactional database as before (transactions 101-105 over milk, bread, beer, cheese, wine, spaghetti)]
Support({milk, bread, cheese}) = 2/5 = 0.4
10
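To make the support computation concrete, here is a minimal Python sketch. The five baskets below are hypothetical, chosen only to be consistent with the support values quoted on these slides:

```python
# Hypothetical toy transactional database (IDs 101-105); the exact item
# assignments are illustrative, matching the slides' quoted support values
transactions = [
    {"milk", "bread", "beer"},                         # 101
    {"milk", "bread", "cheese", "wine"},               # 102
    {"bread", "cheese", "spaghetti", "beer"},          # 103
    {"cheese", "wine", "spaghetti"},                   # 104
    {"milk", "bread", "cheese", "wine", "spaghetti"},  # 105
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"milk", "bread", "cheese"}, transactions))  # 2/5 = 0.4
```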
Mining interesting rules
What constitutes a "good" rule?
To select rules, constraints on various measures of "interestingness" are used
Best-known measures/constraints: minimum thresholds on support and confidence
Confidence(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / Support(X, T)
[Table: same transactional database as before (transactions 101-105 over milk, bread, beer, cheese, wine, spaghetti)]
Confidence({cheese, wine} → {spaghetti}) = 0.4 / 0.6 ≈ 0.67
Can be interpreted as an estimate of the conditional probability P(Y | X)
11
Mining interesting rules
Other measures exist as well:
Lift(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / (Support(X, T) × Support(Y, T))
- If lift = 1, the occurrences of the antecedent and the consequent are independent of each other
- If lift > 1, the value signifies the degree of dependence between them
- Considers both the confidence of the rule and the overall data set
Conviction(X ⊆ I ⇒ Y ⊆ I, T) = (1 − Support(Y, T)) / (1 − Confidence(X ⇒ Y, T))
- Interpreted as the ratio of the expected frequency that X occurs without Y
- A measure for the frequency with which the rule makes an incorrect prediction: a value of 1.2 indicates that the rule would be incorrect 1.2 times as often if the association between X and Y were purely due to chance
Cost-sensitive measures exist here as well ("profit" or "utility" based rule mining)
E.g. Kitts et al., 2000:
ExpectedProfit(X ⊆ I ⇒ Y ⊆ I, T) = Confidence(X ⇒ Y, T) × Σ_i Profit(Y_i)
IncrementalProfit(X ⊆ I ⇒ Y ⊆ I, T) = (Confidence(X ⇒ Y, T) − P(Y)) × Σ_i Profit(Y_i)
12
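Continuing the sketch from the support slide (reusing support() and the hypothetical transactions defined there), the rule measures follow directly from the definitions above:

```python
def confidence(X, Y, T):
    return support(X | Y, T) / support(X, T)

def lift(X, Y, T):
    return support(X | Y, T) / (support(X, T) * support(Y, T))

def conviction(X, Y, T):
    return (1 - support(Y, T)) / (1 - confidence(X, Y, T))

X, Y = {"cheese", "wine"}, {"spaghetti"}
print(confidence(X, Y, transactions))  # 0.4 / 0.6 ~= 0.67
print(lift(X, Y, transactions))        # 0.4 / (0.6 * 0.6) ~= 1.11
print(conviction(X, Y, transactions))  # (1 - 0.6) / (1 - 0.67) = 1.2
```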
The apriori algorithm
Algorithm:
- 1. A minimum support threshold is applied to find all frequent itemsets
- 2. A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules
Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets. The set of possible itemsets is the power set over I and has size 2^|I| − 1 (excluding the empty set, which is not a valid itemset). Although the size of the power set grows exponentially in the number of items |I|, efficient search is possible using the downward-closure property of support (or: anti-monotonicity), which guarantees that for a frequent itemset all its subsets are also frequent, and thus that for an infrequent itemset all its supersets must also be infrequent. Exploiting this property, efficient algorithms can find all frequent itemsets.
13
[Table: same transactional database as before (transactions 101-105 over milk, bread, beer, cheese, wine, spaghetti)]
The apriori algorithm
A minimum support threshold is applied to find all frequent itemsets All itemsets with support >= 50%
Itemset ('milk',) has a support of: 0.6
Itemset ('bread',) has a support of: 0.8
Itemset ('cheese',) has a support of: 0.8
Itemset ('wine',) has a support of: 0.6
Itemset ('spaghetti',) has a support of: 0.6
Itemset ('milk', 'bread') has a support of: 0.6
Itemset ('bread', 'cheese') has a support of: 0.6
Itemset ('cheese', 'wine') has a support of: 0.6
Itemset ('cheese', 'spaghetti') has a support of: 0.6
... brute force leads to 2^|I| − 1 = 63 possibilities
For 100 items we'd already have 2^100 − 1 ≈ 1.27 × 10^30 possibilities
14
The apriori algorithm
A minimum support threshold is applied to find all frequent itemsets Speeding this up with a "step by step" expansion: we only need to continue expanding itemsets that are above the threshold, only using items above the threshold 15
The apriori algorithm
A minimum support threshold is applied to find all frequent itemsets "Join and prune": an even better way (proposed by apriori)
Say we want to generate candidate 3-itemsets (sets with three items)
Look at the previous step: we only need the (3-1)-itemsets to do so, and only the ones which had enough support: {milk, bread}, {bread, cheese}, {cheese, wine}, {cheese, spaghetti}
Join on self: join this set of itemsets with itself to generate a list of candidates of length 3:
{milk, bread} x {bread, cheese} = {milk, bread, cheese}, and similarly: {bread, cheese, wine}, {bread, cheese, spaghetti}, {cheese, wine, spaghetti}
Prune result: prune the candidates containing a (3-1)-subset that did not have enough support (all candidates can be pruned in this case, e.g. {milk, bread, cheese} contains the infrequent subset {milk, cheese}):
{milk, bread, cheese}, {bread, cheese, wine}, {bread, cheese, spaghetti}, {cheese, wine, spaghetti}
This is repeated for every step (a sketch of this join-and-prune candidate generation follows below) 16
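A minimal sketch of the join-and-prune step, with itemsets represented as frozensets; it reproduces the example above:

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Join: unite pairs of frequent (k-1)-itemsets that yield a k-itemset,
    then prune candidates containing an infrequent (k-1)-subset."""
    candidates = {a | b for a in frequent_prev for b in frequent_prev
                  if len(a | b) == k}
    return {c for c in candidates
            if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))}

freq2 = {frozenset(s) for s in ({"milk", "bread"}, {"bread", "cheese"},
                                {"cheese", "wine"}, {"cheese", "spaghetti"})}
print(generate_candidates(freq2, 3))  # set(): every candidate gets pruned
```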
The apriori algorithm
17
The apriori algorithm
A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules
Once the frequent itemsets are obtained, association rules are generated as follows
For each frequent itemset I, generate all non-empty proper subsets of I
For every such subset I_s, check the confidence of the rule I_s ⇒ I ∖ I_s and retain those above a threshold
∀ I_s ∈ P(I) : I_s ≠ ∅ ∧ I_s ≠ I ∧ Confidence(I_s ⇒ I ∖ I_s) > minconf → retain I_s ⇒ I ∖ I_s
E.g. for frequent itemset {cheese, wine, spaghetti}, we'd check:
{cheese, wine} → {spaghetti}
{cheese, spaghetti} → {wine}
{wine, spaghetti} → {cheese}
{cheese} → {wine, spaghetti}
{spaghetti} → {cheese, wine}
{wine} → {cheese, spaghetti}
... and keep those with sufficient confidence (see the sketch below)
18
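A sketch of this rule-generation step, again reusing support() and the hypothetical transactions from the earlier sketches:

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf=0.6):
    """Yield all rules antecedent -> consequent above the confidence threshold."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):                 # non-empty, proper subsets
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support(itemset, transactions) / support(antecedent, transactions)
            if conf >= minconf:
                yield set(antecedent), set(itemset - antecedent), conf

for a, c, conf in rules_from_itemset({"cheese", "wine", "spaghetti"}, transactions):
    print(a, "->", c, round(conf, 2))
```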
Extensions
19
Categorical and continuous variables
E.g. {browser=“firefox”, pages_visited < 10} → {sale_made=“no”}
Easy approach: convert each level to a binary "item" (see "dummy" codification in pre-processing) and place continuous variables in bins before converting to binary values: {browser-firefox, pages-visited-0-to-20} → {no-sale-made} (a small sketch follows below)
Group categorical variables with many levels
Drop frequent levels which are not considered interesting, i.e. prevent browser-chrome turning up in rules if 90% of visitors use this browser
Binning of continuous variables requires some fine-tuning (otherwise support or confidence ends up too low)
Better statistical methods available, see e.g. R. Rastogi and Kyuseok Shim, "Mining optimized association rules with categorical and numeric attributes," in IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp. 29-50, and others 20
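A small pandas sketch of this dummy-coding-plus-binning approach; column names, values and bin edges are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"browser": ["firefox", "chrome", "firefox"],
                   "pages_visited": [5, 25, 12],
                   "sale_made": ["no", "yes", "no"]})
# Bin the continuous variable first, then turn every level into a binary "item"
df["pages_visited"] = pd.cut(df["pages_visited"], bins=[0, 20, 10_000],
                             labels=["0-to-20", "20-plus"])
items = pd.get_dummies(df, prefix_sep="-")
print(items.columns.tolist())
# ['browser-chrome', 'browser-firefox', 'pages_visited-0-to-20', ...]
```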
Multi-level association rules
E.g. product hierarchies
Using lower levels only? Highest only? Something in-between?
Possible, but items in lower levels might not have enough support Rules overly specific
Srikant R. & Agrawal R., "Mining Generalized Association Rules", In Proc. 1995 Int. Conf. Very Large Data Bases, Zurich, 1995 21
Discriminative association rules
Bring in some "supervision"
E.g. you're interested in rules with outcome → {beer}, or perhaps → {beer, spirits}
Interesting because multiple class labels can be combined in the consequent, and the consequent can also involve non-class labels to derive interesting correlations
Learned patterns can be used as features for other learners, see e.g. Keivan Kianmehr, Reda Alhajj, "CARSVM: A class association rule-based classification framework and its application to gene expression data"
22
Tuning
Possibility to tune “interestingness”
Rare patterns: low support but still interesting
E.g. people buying Rolex watches Mine by setting special group-based support thresholds (context matters)
Negative patterns:
E.g. people buying a Hummer and Tesla together will most likely not occur Negatively correlated, infrequent patterns can be more interesting than positive, frequent patterns
Possibility to include domain knowledge
Block frequent itemsets, e.g. if {cats, mice} occurs as a subset (no matter the support) Or allow certain itemsets, even if the support is low
23
Filtering
Often lots of association rules will be discovered!
- Post-processing is a necessity
- Perform sensitivity analysis using minsup and minconf thresholds
- Trivial rules
E.g., buy spaghetti and spaghetti sauce
Unexpected / unknown rules
Novel and actionable patterns, potentially interesting! Confidence might not always be the best metric (lift, conviction, rarity, …)
Appropriate visualisation facilities are crucial!
Association rules can be powerful but really require that you "get to work" with the results! See e.g. the arulesViz package for R: https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf 24
Filtering
25
Applications
Market basket analysis
{baby food, diapers} => {stella}
- Put them closer together in the store? Put them far apart in the store?
- Package baby food, diapers and stella? Package baby food, diapers and stella with a poorly selling item?
- Raise the price on one, and lower it on the other?
- Do not advertise baby food, diapers and stella together?
- Up-, down- and cross-selling
Basic recommender systems
Customers who buy this also frequently buy… Can be a very simple though powerful approach
Fraud detection
{Insurer 64, Car Repair Shop A, Police Officer B} as frequent pattern
26
Sequence mining
In standard apriori, the order of items does not matter
Instead of item-sets, think now about item-bags and item-sequences
- Sets: unordered, each element appears at most once: every transaction is a set of items
- Bags: unordered, elements can appear multiple times: every transaction is a bag of items
- Sequences: ordered, every transaction is a sequence of items
27
Sequence mining
Mining of frequent sequences: algorithm very similar to apriori (i.e. GSP, Generalized Sequential Pattern algorithm)
Start with the set of frequent 1-sequences
But: expansion (candidate generation) is done differently:
E.g. in normal apriori, {A, B} and {A, C} would both be expanded into the same set {A, B, C}
For sequences, suppose we have <{A}, {B}> and <{A}, {C}>; these are now expanded (joined) into <{A}, {B}, {C}>, <{A}, {C}, {B}> and <{A}, {B, C}>
Often modified to suit particular use cases
Common case: just consider sequences with sets containing one item only: e.g. <A,B> and <A,C> expanded into <A,B,C> and <A,C,B>
E.g. in web mining or customer journey analytics
Pruning: k-sequences with infrequent (k-1)-subsequences are removed; only continue with candidates whose support exceeds the threshold (a sketch of the simplified join follows below) 28
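A sketch of the simplified expansion for the one-item-per-element case; this mirrors the slide's example (a common-prefix join), not the full GSP join:

```python
def join_sequences(frequent_prev):
    """Join frequent (k-1)-sequences sharing a common prefix, so that
    <A,B> and <A,C> become <A,B,C> and <A,C,B>."""
    return {a + (b[-1],) for a in frequent_prev for b in frequent_prev
            if a != b and a[:-1] == b[:-1]}

print(join_sequences({("A", "B"), ("A", "C")}))
# {('A', 'B', 'C'), ('A', 'C', 'B')}
```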
Sequence mining
Extensions exist that take timing into account 29
Sequence mining
Extension: frequent episode mining
For very long time series Find frequent sequences within the time series
Extension: continuous time series mining
By first binning the continuous time series into categorical levels (similar to normal apriori)
Extension: discriminative sequence mining
Again: if you know the outcome of interest (e.g. sequences which lead a customer to buy a certain product)
See e.g. SPMF: http://www.philippe-fournier-viger.com/spmf/ for a large collection of algorithms 30
Sequence mining
"Sankey diagram" 31
Conclusion
The main power of apriori comes from the fact that it can easily be extended and adapted towards specific settings
This also means that you'll probably have to go further than "out of the box" approaches
Many other extensions exist (e.g. for frequent sub-graphs)
32
Clustering
33
Clustering
Cluster analysis or clustering is the task of grouping a set of objects
In such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)
It is a main task of exploratory data mining, and a common technique for statistical data analysis
Used in many fields: machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, customer segmentation, etc.
Organizing data into clusters shows internal structure of the data
- Find patterns, structure, etc.
- Sometimes the partitioning is the goal, e.g. market segmentation
- As a first step towards predictive techniques, e.g. mine a decision tree model on cluster labels to get further insights or even predict group outcome; use labels and distances as features
34
Clustering
35
Two types
Hierarchical clustering:
Create a hierarchical decomposition of the set of objects using some criterion
Connectivity, distance based
- Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters
- Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions
Partitional clustering:
- Objective function based
- Construct various partitions and then evaluate them by some criterion
- k-means, k-means++, etc.
36
Hierarchical clustering
- Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters, bottom-up
- Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions, top-down
37
Hierarchical clustering
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of (dis)similarity between sets of observations is required... which can be quite subjective
In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets
38
Hierarchical clustering
The distance metric defines the distance between two observations
- Euclidean distance: ||a − b||_2 = √(Σ_i (a_i − b_i)²)
- Squared Euclidean distance: ||a − b||_2² = Σ_i (a_i − b_i)²
- Manhattan distance: ||a − b||_1 = Σ_i |a_i − b_i|
- Maximum distance: ||a − b||_∞ = max_i |a_i − b_i|
- Many more are possible...
Should possess:
- Symmetry: d(a, b) = d(b, a)
- Constancy of self-similarity: d(a, a) = 0
- Positivity: d(a, b) ≥ 0
- Triangle inequality: d(a, b) ≤ d(a, c) + d(c, b)
39
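The four metrics in NumPy, as a quick sanity check:

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))  # ||a - b||_2
squared   = np.sum((a - b) ** 2)           # ||a - b||_2^2
manhattan = np.sum(np.abs(a - b))          # ||a - b||_1
maximum   = np.max(np.abs(a - b))          # ||a - b||_inf
print(euclidean, squared, manhattan, maximum)  # ~2.236 5.0 3.0 2.0
```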
Hierarchical clustering
The linkage criterion defines the distance between groups of instances
Note that a group can consist of only one instance
Single linkage (minimum linkage):
D(A, B) = min {d(a, b) : a ∈ A, b ∈ B}
Leads to longer, skinnier clusters
Complete linkage (maximum linkage):
D(A, B) = max {d(a, b) : a ∈ A, b ∈ B}
Leads to tight clusters 40
Hierarchical clustering
The linkage criterion defines the distance between groups of instances
Note that a group can consist of only one instance
Average linkage (mean linkage):
D(A, B) = (1 / (|A| × |B|)) Σ_{a ∈ A, b ∈ B} d(a, b)
Favorable in most cases, robust to noise
Centroid linkage: based on the distance between the centroids: D(A, B) = d(c_A, c_B)
Also robust, requires definition of a centroid concept
41
Hierarchical clustering
A hierarchy is obtained over the instances
- No need to specify the desired amount of clusters in advance
- Hierarchical structure maps nicely to human intuition for some domains
- Not very scalable (though fast enough in most cases)
- Local optima are a problem: the cluster hierarchy is not necessarily the best one
- You might want/need to normalize your features first
42
Hierarchical clustering
Represent the hierarchy in a dendrogram
Can be used to decide on the number of desired clusters (cut in the dendrogram); see the sketch below 43
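A minimal SciPy sketch on synthetic two-blob data; the dendrogram call assumes matplotlib is available for plotting:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

Z = linkage(X, method="average")                 # "single"/"complete" also work
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
dendrogram(Z)                                    # draws the hierarchy
```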
Hierarchical clustering
Represent the hierarchy in a dendrogram
Also a good indication of possible outliers or anomalies 44
Agglomerative or divisive
Virtually all implementations you'll find implement agglomerative clustering
Divisive clustering turns out to be far more computationally expensive: don't bother! 45
Partitional clustering
Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters
Since only one set of clusters is output, the user normally has to input the desired number of clusters K
Most well-known algorithm: k-means clustering:
- 1. Decide on a value for K
- 2. Initialize K cluster centers (e.g. randomly over the feature space or by picking some instances randomly)
- 3. Decide the cluster memberships of the instances by assigning them to the nearest cluster center, using some distance measure
- 4. Recalculate the K cluster centers
- 5. Repeat 3 and 4 until none of the objects changed cluster membership in the last iteration (a minimal sketch follows below) 46
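A minimal NumPy sketch of the five steps above; it does not handle empty clusters, which a real implementation would need to:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random instances
    for _ in range(n_iter):
        # Step 3: assign each instance to its nearest center (squared Euclidean)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the centers as the cluster means
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # step 5: no more changes
            break
        centers = new_centers
    return centers, labels
```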
K-means example
Pick random centers (either in the feature space or by using some randomly chosen instances as starts): 47
K-means example
Calculate the membership of each instance: 48
K-means example
And reposition the cluster centroids: 49
K-means example
Reassign the instances again: 50
K-means example
Recalculate the centroids: No reassignments performed, so we stop here 51
K-means
Recall that a good cluster solution has high intra-cluster (within-cluster) similarity and low inter-cluster (between-cluster) similarity
K-means optimizes for high intra-cluster similarity by minimizing the total distortion: the sum of squared distances of points to their cluster centroid
min SSE = Σ_{k=1}^{K} Σ_{i=1}^{n_k} ||x_ki − μ_k||²
Note: an exact optimization of this objective function is hard; k-means is a heuristic approach not necessarily providing the globally optimal outcome
Except in the one-dimensional case
52
K-means
Strengths
- Very simple to implement and debug
- Intuitive objective function: optimizes intra-cluster similarity
- Relatively efficient
Weaknesses
- Applicable only when the mean is defined (to calculate the centroid of points): what about categorical data?
- Often terminates at a local optimum, hence initialization is extremely important: try multiple random starts or use k-means++ (most implementations do this by default)
- Need to specify K in advance: use the elbow point if unsure (e.g. plot SSE across different values for K; see the sketch below)
- Sensitive to noisy data and outliers
- Again: normalization might be required
- Not suitable to discover clusters with non-convex shapes
53
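A quick scikit-learn sketch of the elbow heuristic on synthetic data (in practice you would plot these SSE values rather than print them):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 4, 8)])  # 3 true blobs
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ is the SSE; look for the elbow at k=3
```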
K-means and non-convex shapes
54
K-means and local optima
55
K-means and setting the value for K
56
K-means and setting the value for K
Don't be too swayed by your intuition in this case
Often, it pays off to start with a higher setting for K and post-process/inspect accordingly
57
K-means and setting the value for K
Don't be too swayed by your intuition in this case
Often, it pays off to start with a higher setting for K and post-process/inspect accordingly
58
K-means++
An algorithm to pick good initial centroids
- 1. Choose the first center uniformly at random (in the feature space or from the instances)
- 2. For each instance x (or over a grid in the feature space), compute d(x, c): the distance between x and the nearest center c that has already been defined
- 3. Choose another center, using a weighted probability distribution where a point x is chosen with probability proportional to d(x, c)²
- 4. Repeat steps 2 and 3 until K centers have been chosen
- 5. Now proceed using standard K-means clustering (not repeating the initialization step, of course)
Basic idea: spread the initial centers out from each other, aiming for lower inter-cluster similarity (a sketch follows below)
Most k-means implementations actually implement this variant
59
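A minimal sketch of the k-means++ initialization, sampling centers from the instances:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]   # step 1: first center uniformly at random
    while len(centers) < k:
        # Step 2: squared distance of every instance to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next center with probability proportional to d(x, c)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)              # then run standard k-means from these
```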
DBSCAN
“Density-based spatial clustering of applications with noise”
- Groups together points that are closely packed together (points with many nearby neighbors)
- Marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away)
- DBSCAN is also one of the most common clustering algorithms
- Hierarchical versions exist as well
60
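A scikit-learn sketch on synthetic data; note how DBSCAN labels low-density points -1 (noise) instead of forcing them into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (5, 2))])  # two dense blobs plus scattered noise
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))                           # e.g. {0, 1, -1}; -1 marks outliers
```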
DBSCAN
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
61
Expectation-maximization based clustering
Based on Gaussian mixture models
K-means is EM'ish, but makes "hard" assignments of instances to clusters
Based on two steps:
- E-step: assign probabilistic membership: the probability of membership to a cluster given an instance in the dataset
- M-step: re-estimate the parameters of the model based on the probabilistic memberships
Which model are we estimating? A mixture of Gaussians
62
Expectation-maximization based clustering
- Also local optimization, but nonetheless robust
- Learns a model for each cluster, so you can generate new data points from it (though this is possible with k-means as well: fit a Gaussian for each cluster with mean on the centroid and variance derived using the minimal sum of squared errors)
- Relatively efficient
- Extensible to other model types (e.g. multinomial models for categorical data, or noise-robust models)
But:
- Initialization of the models still important (local optimum problem)
- Also still need to specify K
- Also problems with non-convex shapes
So basically: a less "hard" version of k-means
63
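A scikit-learn sketch contrasting the soft assignments of a Gaussian mixture with hard k-means-style labels (synthetic data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
print(gmm.predict_proba(X[:3]))  # soft memberships (E-step output)
print(gmm.predict(X[:3]))        # hardened assignments, comparable to k-means
print(gmm.sample(2))             # the fitted model is generative: draw new points
```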
Validation
How to check the result of a clustering run?
- Use the k-means objective function (sum of squared errors)
- Many other measures as well, e.g. (Davies and Bouldin, 1979), (Halkidi, Batistakis, Vazirgiannis, 2001)
These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters
E.g. the Davies-Bouldin index:
DBI = (1/N) Σ_{i=1}^{N} R_i, with R_i = max_{j ≠ i} R_{ij} and R_{ij} = (S_i + S_j) / D_{ij}
- S_i: mean distance between the instances in cluster i and its centroid
- D_{ij}: distance between the centroids of clusters i and j
A good result has a low index score
64
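scikit-learn ships this index directly, so validation can be a one-liner; a sketch on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 5)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower is better
```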
Applications
- Market research: customer and market segments, product positioning
- Social network analysis: recognize communities of people
- Social science: identify students, employees with similar properties
- Search: group search results
- Recommender systems: predict preferences based on user's cluster
- Crime analysis: identify areas with similar crimes
- Image segmentation and color palette detection
65
Domain knowledge
An interesting question for many algorithms is how domain knowledge can be incorporated in the learning step
For clustering, this is often done using must-link and can't-link constraints: who should and should not be in the same cluster?
Nice, as this does not require significant changes to the algorithm, but it can lead to infeasible solutions if too many constraints are added
Another approach is a "warm start" by providing a partial solution
66
More on distance metrics
Most metrics we've seen were defined for numerical features, though these exist for textual and categorical data as well
E.g. Levenshtein distance between two text fields
Based on the number of edits (changes) performed: deletions, insertions and substitutions
Other metrics exist as well, e.g. Jaccard, cosine, Gower distance (https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9)
67
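For reference, the standard dynamic-programming implementation of the Levenshtein distance:

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to every prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution (0 if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```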
Other techniques: self-organizing maps (SOMs)
- Can be formalized as a special type of artificial neural network
- Not really (only) clustering, but: unsupervised
- Produces a low-dimensional representation of the input space (just like PCA!)
- Also called a Kohonen map (Teuvo Kohonen)
https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/
68
Other techniques: t-SNE
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of very high-dimensional datasets
- Not really for clustering, though it can be used as a pre-processing step before clustering
- Or post-hoc to visualize a clustering result (alternative to PCA)
https://lvdmaaten.github.io/tsne/
69
Anomaly detection
70
Anomaly detection
Anomaly detection (also outlier detection or novelty detection) is the identification of instances which do not conform to an expected pattern or other items in a dataset
- Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text
- Anomalies: outliers, novelties, noise, deviations, exceptions
- Uses a combination of unsupervised, supervised and statistical techniques
At this stage, it's worthwhile to have a look at some common approaches in this setting 71
Statistical approaches
Histograms, distributions, box plots: spotting outliers
Easy to use but univariate in many cases
Other statistical methods, e.g. robust covariance 72
Clustering based
Cluster, and check who falls outside the clusters
- Distance based
- Clusters with few instances
- Based on the dendrogram
- Based on noisy instances in DBSCAN
- Combine with dimensionality reduction techniques
73
Clustering based
https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00
74
Clustering based
https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00
75
One-class based methods
E.g. "one-class SVMs" Linked to PU learning (positive-unlabeled)
See also: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.329.6479&rep=rep1&type=pdf
"Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive"
76
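A scikit-learn sketch: train on data assumed to be normal, then score new points:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 2))  # assumed to contain only "normal" behaviour
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)
print(clf.predict([[0.1, 0.2], [6.0, 6.0]]))  # +1 = inlier, -1 = outlier
```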
Isolation forests
Based on random forest idea
- But: splits for nodes are chosen completely randomly, not entropy-driven
- The anomaly score is then defined based on the average distance (path length) from the root node to the leaf node containing the instance
- Outliers have a higher chance to be separated quickly
https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py
77
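A scikit-learn sketch with one injected anomaly; the far-away point should be isolated after very few random splits:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[6.0, 6.0]]])  # one obvious anomaly
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)
print(iso.predict(X)[-1])  # -1: short average path length, flagged as outlier
```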
Local outlier factor (LOF)
Based on K-NN
- Measures the local deviation of density of a given sample with respect to its neighbors
- Depends on how isolated the object is with respect to the surrounding neighborhood; locality is given by the k-nearest neighbors, whose distance is used to estimate the local density
- By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers
https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html#sphx-glr-auto-examples-neighbors-plot-lof-outlier-detection-py
78
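The same setup with LOF in scikit-learn; the isolated point gets a much lower local density than its neighbors:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[5.0, 5.0]]])
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)                          # -1 = outlier
print(labels[-1], lof.negative_outlier_factor_[-1])  # low factor = anomalous
```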
CADE: Classifier-Adjusted Density Estimation for Anomaly Detection and One-Class Classification
Works with any classifier that can produce a probability
A data set is constructed from the original instances (non-outliers) and a synthetic dataset based on uniform data generation over the given features (outliers)
Original instances get Y = 0, synthetic ones get Y = 1
A classifier is trained on this data set; its predictions are used to determine which original instances get a high outlier score
79
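A minimal sketch of the CADE idea, using a random forest as the probability-producing classifier; the data and classifier choice are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_real = rng.normal(0, 1, (500, 2))                  # original instances: y = 0
X_synth = rng.uniform(X_real.min(axis=0), X_real.max(axis=0),
                      size=X_real.shape)             # uniform synthetic data: y = 1
X = np.vstack([X_real, X_synth])
y = np.r_[np.zeros(len(X_real)), np.ones(len(X_synth))]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = clf.predict_proba(X_real)[:, 1]  # P(synthetic): high value = outlier score
print(scores.round(2).max())
```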
Time series based
Peer Group Analysis (Bolton and Hand, 2001)
- Define a peer group for each object (e.g. transaction, account, customer)
- The peer group consists of other objects that behaved similarly in the past
- Anomalous behavior starts when an object starts behaving significantly differently from its peer group
- Focus on local instead of global patterns (depending upon the number of peers to consider)
- Especially suitable to monitor behavior over time, e.g. in time series analysis
80
Time series based
Break point analysis (Bolton and Hand, 2001)
- A break point is an observation or time where anomalous behavior is detected
- Operates on the account level by comparing sequences of transactions (amount or frequency) to find a break point
- Choose a fixed moving time window (new transactions enter; old transactions leave)
- Need to decide on the amount of recent versus old transactions to compare
- Can use statistical tests to compare new transactions to old transactions
- Works at account level, so does not build profiles by looking at other accounts
81
Time series based
Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test (extreme Studentized deviate) for detecting anomalies
- Can be used to detect both global as well as local anomalies
- This is achieved by employing time series decomposition and using robust statistical metrics, viz. the median, together with ESD to detect outliers:
- Trend component
- Cyclical component
- Seasonal component
- Irregular component
In addition, for long time series (say, 6 months of minutely data), the algorithm employs piecewise approximation; this is rooted in the fact that trend extraction in the presence of anomalies is non-trivial for anomaly detection
82
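The reference implementation is Twitter's R package, but the core idea (robust seasonal decomposition, then flag residuals far from the median) can be sketched in Python with statsmodels' STL; the data and the MAD-based threshold below are illustrative, not the actual S-H-ESD test:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
n = 24 * 14                                        # two weeks of hourly data
y = 10 + np.sin(np.arange(n) * 2 * np.pi / 24) + rng.normal(0, 0.2, n)
y[100] += 5                                        # inject one anomaly
series = pd.Series(y, index=pd.date_range("2020-01-01", periods=n, freq="h"))

resid = STL(series, period=24, robust=True).fit().resid
mad = np.median(np.abs(resid - np.median(resid)))  # robust (median-based) spread
print(np.where(np.abs(resid - np.median(resid)) > 5 * mad)[0])  # should include 100
```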
Seasonal Hybrid ESD (S-H-ESD)
83
Seasonal Hybrid ESD (S-H-ESD)
84
Seasonal Hybrid ESD (S-H-ESD)
https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series