Advanced Analytics in Business [D0S07a] / Big Data Platforms & Technologies [D0S06a]
Unsupervised Learning and Anomaly Detection
Overview
Frequent itemset and association rule mining
Other itemset extensions
Clustering
Dimensionality reduction
Anomaly detection
2
The analytics process
3
Recall
Supervised learning
You have a labelled data set at your disposal Correlate features to target Common case: predict the future based on patterns observed now (predictive) Classification (categorical) versus regression (continuous)
Unsupervised learning
Describe patterns in data Clustering, association rules, sequence rules No labelling required Common case: descriptive, explanatory
For unsupervised learning, we don’t assume a label or target … but we do need to define what kind of patterns we want 4
Frequent Itemset and Association Rule Mining
5
Introduction
Association rule learning is a method for discovering interesting relations between variables
Intended to identify rules discovered in databases using some measures of interestingness
“Interesting?” Frequent, rare, costly, strange?
For example, the rule {onions, tomatoes, ketchup} → {burger} found in the sales data of a supermarket would indicate that if a customer buys onions, tomatoes and ketchup together, they are likely to also buy hamburger meat, which can be used e.g. for promotional pricing or product placements
Application areas in market basket analysis, web usage mining, intrusion detection, production and manufacturing
Association rule learning typically does not consider the order of items, either within a transaction or across transactions (sequence mining does)
Pioneering technique: apriori algorithm (Rakesh Agrawal, 1993) 6
{beer, diapers}?
https://www.itbusiness.ca/news/behind-the-beer-and-diapers-data-mining-legend/136
1992, Thomas Blischok, manager of a retail consulting group at Teradata Prepared an analysis of 1.2 million market baskets from 25 Osco Drug stores Database queries were developed to identify affinities. The analysis “did discover that between 5 and 7 p.m., consumers bought beer and diapers” Osco managers did not exploit the beer and diapers relationship by moving the products closer together
7
{lotion, calcium, zinc}?
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
“Before long some useful patterns emerged…” Women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc. When someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date. This led to some 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
“My daughter got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
8
Transactional database
Every “instance” now represents a transaction Features correspond with items: binary categoricals
Tr. ID | milk | bread | beer | cheese | wine | spaghetti
(Five transactions, IDs 101 to 105, each marked with a 1 for the items it contains)
9
Mining interesting rules
What constitutes a “good” rule?
To select rules, constraints on various measures of “interest” are used Most known measures/constraints: minimum thresholds on support and confidence
(Same transactional database: five transactions 101 to 105 over milk, bread, beer, cheese, wine, spaghetti)
Support(X ⊆ I, T) = |{t ∈ T : X ⊆ t}| / |T|
Support({milk, bread, cheese}) = 2/5 = 0.4
10
Mining interesting rules
What constitutes a “good” rule?
To select rules, constraints on various measures of “interest” are used Most known measures/constraints: minimum thresholds on support and confidence
(Same transactional database: five transactions 101 to 105 over milk, bread, beer, cheese, wine, spaghetti)
Confidence(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / Support(X, T)
Confidence({cheese, wine} → {spaghetti}) = 0.4 / 0.6 ≈ 0.67
Can be interpreted as an estimate of the conditional probability P(Y | X)
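As an illustration, a minimal plain-Python sketch of these two measures, computed over a small hypothetical transaction list (not the exact table from the slides):

```python
# Minimal sketch: computing support and confidence over a small,
# hypothetical list of transactions (not the exact table from the slides).

transactions = [
    {"milk", "bread", "cheese"},
    {"bread", "cheese", "wine", "spaghetti"},
    {"milk", "bread", "beer"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]

def support(itemset, transactions):
    """Fraction of transactions containing all items in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"milk", "bread"}, transactions))                     # 0.6 on this toy data
print(confidence({"cheese", "wine"}, {"spaghetti"}, transactions))  # 1.0 on this toy data
```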
Mining interesting rules
Other measures exist as well:
Lift:
If lift = 1, the antecedent and the consequent occur independently of each other
If lift > 1, it signifies the degree of (positive) dependence; considers both the confidence of the rule and the overall data set
Conviction:
Interpreted as the ratio of the expected frequency that X occurs without Y if X and Y were independent, to the observed frequency of incorrect predictions; e.g. a value of 1.2 indicates that the rule would be incorrect 1.2 times as often if the association between X and Y were purely due to chance
Cost sensitive measures exist here as well (“profit” or “utility” based rule mining)
E.g. Kitts et al., 2000
Lift(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / (Support(X, T) × Support(Y, T))
Conviction(X ⊆ I ⇒ Y ⊆ I, T) = (1 − Support(Y, T)) / (1 − Confidence(X ⇒ Y, T))
ExpectedProfit(X ⊆ I ⇒ Y ⊆ I, T) = Confidence(X ⇒ Y, T) × ∑ᵢ Profit(Yᵢ)
IncrementalProfit(X ⊆ I ⇒ Y ⊆ I, T) = (Confidence(X ⇒ Y, T) − P(Y)) × ∑ᵢ Profit(Yᵢ)
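A similar sketch for lift and conviction, again on a hypothetical transaction list; the helper functions simply mirror the formulas above:

```python
# Minimal sketch: lift and conviction on a hypothetical transaction list.

transactions = [
    {"milk", "bread", "cheese"},
    {"bread", "cheese", "wine", "spaghetti"},
    {"milk", "bread", "beer"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(set(X) | set(Y)) / support(X)

def lift(X, Y):
    # = 1 -> X and Y independent; > 1 -> positive dependence
    return support(set(X) | set(Y)) / (support(X) * support(Y))

def conviction(X, Y):
    # Expected frequency of X without Y (under independence) versus observed errors
    c = confidence(X, Y)
    return float("inf") if c == 1 else (1 - support(Y)) / (1 - c)

print(lift({"cheese", "wine"}, {"spaghetti"}))   # 0.6 / (0.6 * 0.6) ~= 1.67 on this toy data
print(conviction({"cheese"}, {"spaghetti"}))     # (1 - 0.6) / (1 - 0.75) = 1.6 on this toy data
```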
12
The apriori algorithm
Algorithm:
1. A minimum support threshold is applied to find all frequent itemsets
2. A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules
Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets. The set of possible itemsets is the power set over I and has size 2^|I| − 1 (excluding the empty set, which is not a valid itemset). Although the size of the power set grows exponentially in the number of items |I|, efficient search is possible using the downward-closure property of support (or: anti-monotonicity), which guarantees that for a frequent itemset all its subsets are also frequent, and thus that for an infrequent itemset all its supersets must also be infrequent. Exploiting this property, efficient algorithms can find all frequent itemsets.
13
(Same transactional database: five transactions 101 to 105 over milk, bread, beer, cheese, wine, spaghetti)
The apriori algorithm
A minimum support threshold is applied to find all frequent itemsets: here, all itemsets with support ≥ 50%
Itemset (‘milk’,) has a support of: 0.6
Itemset (‘bread’,) has a support of: 0.8
Itemset (‘cheese’,) has a support of: 0.8
Itemset (‘wine’,) has a support of: 0.6
Itemset (‘spaghetti’,) has a support of: 0.6
Itemset (‘milk’, ‘bread’) has a support of: 0.6
Itemset (‘bread’, ‘cheese’) has a support of: 0.6
Itemset (‘cheese’, ‘wine’) has a support of: 0.6
Itemset (‘cheese’, ‘spaghetti’) has a support of: 0.6
… brute force leads to 2^|I| − 1 = 63 possibilities here
For 100 items we’d already have 1 271 427 896 possibilities
14
The apriori algorithm
A minimum support threshold is applied to find all frequent itemsets We can speed this up with a “step by step” expansion: we only need to continue expanding itemsets that are above the threshold, only using items above the threshold 15
The apriori algorithm
A minimum support threshold is applied to find all frequent itemsets “Join and prune”: an even better way (proposed by apriori)
Say we want to generate candidate 3-itemsets (sets with three items) Look at previous: we only need the (3-1)-itemsets to do so, and only the ones which had enough support: {milk, bread}, {bread, cheese}, {cheese, wine}, {cheese, spaghetti} Join on self: join this set of itemset on itself to generate a list of candidates with length 3:
E.g. {milk, bread} × {bread, cheese} = {milk, bread, cheese}; the full list of length-3 candidates: {milk, bread, cheese}, {bread, cheese, wine}, {bread, cheese, spaghetti}, {cheese, wine, spaghetti}
Prune result: prune the candidates containing a (3-1)-subset that did not have enough support (all candidates can be pruned in this case):
{milk, bread, cheese} {bread, cheese, wine} {bread, cheese, spaghetti} {cheese, wine, spaghetti}
This is repeated for every step 16
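A small sketch of this join-and-prune step in plain Python; frequent_2 below is the set of frequent 2-itemsets from the running example:

```python
# Sketch of apriori-style candidate generation: join frequent (k-1)-itemsets,
# then prune candidates that contain an infrequent (k-1)-subset.
from itertools import combinations

def generate_candidates(frequent_k_minus_1, k):
    """frequent_k_minus_1: set of frozensets (the frequent (k-1)-itemsets)."""
    # Join step: unions of pairs that yield a k-itemset
    candidates = {a | b for a, b in combinations(frequent_k_minus_1, 2) if len(a | b) == k}
    # Prune step: keep only candidates whose (k-1)-subsets are all frequent
    return {c for c in candidates
            if all(frozenset(s) in frequent_k_minus_1 for s in combinations(c, k - 1))}

frequent_2 = {frozenset(s) for s in [{"milk", "bread"}, {"bread", "cheese"},
                                     {"cheese", "wine"}, {"cheese", "spaghetti"}]}
print(generate_candidates(frequent_2, 3))   # empty set: every candidate gets pruned
```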
The apriori algorithm
17
The apriori algorithm
A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules
Once the frequent itemsets are obtained, association rules are generated as follows: for each frequent itemset I, generate all non-empty, proper subsets of I; for every such subset, check the confidence value of the corresponding rule and retain those above a threshold
E.g. for frequent itemset {cheese, wine, spaghetti}, we’d check
{cheese, wine} → {spaghetti}
{cheese, spaghetti} → {wine}
{wine, spaghetti} → {cheese}
{cheese} → {wine, spaghetti}
{spaghetti} → {cheese, wine}
{wine} → {cheese, spaghetti}
… and keep those with sufficient confidence
∀ Is ∈ P(I) : Is ≠ ∅ ∧ Is ≠ I ∧ Confidence(Is ⇒ I ∖ Is) > minconf → retain the rule Is ⇒ I ∖ Is
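Putting the two steps together, a sketch using the mlxtend library (assuming mlxtend and pandas are installed; the transaction list is again hypothetical):

```python
# Sketch of the full two-step procedure using mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread", "cheese"],
                ["bread", "cheese", "wine", "spaghetti"],
                ["milk", "bread", "beer"],
                ["cheese", "wine", "spaghetti"],
                ["milk", "bread", "cheese", "wine", "spaghetti"]]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Step 1: frequent itemsets above a minimum support threshold
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Step 2: association rules above a minimum confidence threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```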
Extensions
19
Categorical and continuous variables
E.g. {browser=“firefox”, pages_visited < 10} → {sale_made=“no”}
Easy approach: convert each level to a binary “item” (see “dummy” codification in pre-processing) and place continuous variables in bins before converting to binary values: {browser-firefox, pages-visited-0-to-20} → {no-sale-made}
Group categorical variables with many levels
Drop frequent levels which are not considered interesting, i.e. prevent browser-chrome turning up in rules if 90% of visitors use this browser
Binning of continuous variables requires some fine-tuning (otherwise support or confidence too low)
Better statistical methods are available, see e.g. R. Rastogi and Kyuseok Shim, “Mining optimized association rules with categorical and numeric attributes,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp. 29-50, and others
Multi-level association rules
E.g. product hierarchies
Using lower levels only? Highest only? Something in-between?
Possible, but items in lower levels might not have enough support Rules overly specific
Srikant R. & Agrawal R., “Mining Generalized Association Rules”, In Proc. 1995 Int. Conf. Very Large Data Bases, Zurich, 1995 21
Discriminative association rules
Bring in some “supervision”
E.g. you’re interested in rules with outcome → {beer}, or perhaps → {beer, spirits}
Interesting because multiple class labels can be combined in the consequent, and the consequent can also involve non-class labels to derive interesting correlations
Learned patterns can be used as features for other learners, see e.g. Keivan Kianmehr, Reda Alhajj, “CARSVM: A class association rule-based classification framework and its application to gene expression data”
22
Tuning
Possibility to tune “interestingness”
Rare patterns: low support but still interesting
E.g. people buying Rolex watches Mine by setting special group-based support thresholds (context matters)
Negative patterns:
E.g. people buying a Hummer and Tesla together will most likely not occur Negatively correlated, infrequent patterns can be more interesting than positive, frequent patterns
Possibility to include domain knowledge
Block frequent itemsets, e.g. if {cats, mice} occurs as a subset (no matter the support) Or allow certain itemsets, even if the support is low
23
Filtering
Often lots of association rules will be discovered!
Post-processing is a necessity Perform sensitivity analysis using minsup and minconf thresholds Trivial rules
E.g., buy spaghetti and spaghetti sauce
Unexpected / unknown rules
Novel and actionable patterns, potentially interesting! Confidence might not always be the best metric (lift, conviction, rarity, …)
Appropriate visualisation facilities are crucial!
Association rules can be powerful but really require that you “get to work” with the results! See e.g. the arulesViz package for R: https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf
Filtering
25
Applications
Market basket analysis
{baby food, diapers} => {stella}
Put them closer together in the store?
Put them far apart in the store?
Package baby food, diapers and stella?
Package baby food, diapers and stella with a poorly selling item?
Raise the price on one, and lower it on the other?
Do not advertise baby food, diapers and stella together?
Up-, down- and cross-selling
Basic recommender systems
Customers who buy this also frequently buy… Can be a very simple though powerful approach
To generate features
{Insurer 64, Car Repair Shop A, Police Officer B} as frequent pattern in fraud analytics But be wary of data leakage (train/test split applies!)
26
Sequence mining
In standard apriori, the order of items does not matter Instead of item-sets, think now about item-bags and item-sequences
Sets: unordered, each element appears at most once: every transaction is a set of items Bags: unordered, elements can appear multiple times: every transaction is a bag of items Sequences: ordered, every transaction is a sequence of items
27
Sequence mining
Mining of frequent sequences: algorithm very similar to apriori (i.e. GSP, Generalized Sequential Pattern algorithm)
Start with the set of frequent 1-sequences But: expansion (candidate generation) done differently:
E.g. in normal apriori, {A, B} and {A, C} would both be expanded into the same set {A, B, C}
For sequences, suppose we have <{A}, {B}> and <{A}, {C}>; these are now expanded (joined) into <{A}, {B}, {C}>, <{A}, {C}, {B}> and <{A}, {B, C}>
Often modified to suit particular use cases
Common case: just consider sequences with sets containing one item only, e.g. <A, B> and <A, C> expanded into <A, B, C> and <A, C, B>
E.g. in web mining or customer journey analytics
Pruning: k-sequences containing an infrequent (k−1)-subsequence are removed; only continue with sequences whose support is higher than the threshold
Sequence mining
Extensions exist that take timing into account 29
Sequence mining
Extension: frequent episode mining
For very long time series Find frequent sequences within the time series
Extension: continuous time series mining
By first binning the continuous time series in categorical levels (similar as with normal apriori)
Extension: discriminative sequence mining
Again: if you know the outcome of interest (i.e. sequences which lead customer to buy a certain product)
See e.g. SPMF : http://www.philippe-fournier-viger.com/spmf/ for a large collection of algorithms 30
Sequence mining
“Sankey diagram” 31
Conclusion
Main power of apriori comes from the fact that it can easily be extended and adapted towards specific settings
This also means that you’ll probably have to go further than “out of the box” approaches
Keep in mind that post-discovery filtering needs to be applied
Keep in mind the possibility to make this more supervised, or to use it as a feature-generating tool
Many other extensions exist (e.g. for frequent sub-graphs)
32
Clustering
33
Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)
It is a main task of exploratory data mining, and a common technique for statistical data analysis
Used in many fields: machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, customer segmentation, etc.
Organizing data into clusters shows the internal structure of the data
Find patterns, structure, etc.
Sometimes the partitioning is the goal, e.g. market segmentation
As a first step towards predictive techniques, e.g. mine a decision tree model on the cluster labels to get further insights or even predict group outcome; use labels and distances as features
34
Clustering
35
Two types
Hierarchical clustering:
Create a hierarchical decomposition of the set of objects using some criterion
Connectivity, distance based
Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions
Partitional clustering:
Objective function based
Construct various partitions and then evaluate them by some criterion
K-means, k-means++, etc.
36
Hierarchical clustering
Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters, bottom-up Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions, top-down 37
Hierarchical clustering
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of (dis)similarity between sets of observations is required … can be quite subjective In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets
38
Hierarchical clustering
The distance metric defines the distance between two observations
Euclidean distance: ||a − b||₂ = √(∑ᵢ (aᵢ − bᵢ)²)
Squared Euclidean distance: ||a − b||₂² = ∑ᵢ (aᵢ − bᵢ)²
Manhattan distance: ||a − b||₁ = ∑ᵢ |aᵢ − bᵢ|
Maximum distance: ||a − b||∞ = maxᵢ |aᵢ − bᵢ|
Many more are possible…

Should possess:
Symmetry: d(a, b) = d(b, a)
Constancy of self-similarity: d(a, a) = 0
Positivity: d(a, b) ≥ 0
Triangular inequality: d(a, b) ≤ d(a, c) + d(c, b)
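A minimal numpy sketch of these distance metrics on two example vectors:

```python
# Minimal numpy sketch of the distance metrics listed above.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((a - b) ** 2))      # ||a - b||_2
squared_euclidean = np.sum((a - b) ** 2)       # ||a - b||_2^2
manhattan = np.sum(np.abs(a - b))              # ||a - b||_1
maximum = np.max(np.abs(a - b))                # ||a - b||_inf

print(euclidean, squared_euclidean, manhattan, maximum)
```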
39
Hierarchical clustering
The linkage criterion defines the distance between groups of instances
Note that a group can consist of only one instance
Single linkage (minimum linkage): D(A, B) = min{d(a, b) : a ∈ A, b ∈ B}
Leads to longer, skinnier clusters
Complete linkage (maximum linkage): D(A, B) = max{d(a, b) : a ∈ A, b ∈ B}
Leads to tight clusters
Hierarchical clustering
The linkage criterion defines the distance between groups of instances
Note that a group can consist of only one instance
Average linkage (mean linkage): D(A, B) = (1 / (|A| × |B|)) ∑_{a ∈ A, b ∈ B} d(a, b)
Favorable in most cases, robust to noise
Centroid linkage: based on the distance between the centroids: D(A, B) = d(a_c, b_c)
Also robust, requires the definition of a centroid concept
Hierarchical clustering
A hierarchy is obtained over the instances
No need to specify the desired number of clusters in advance
Hierarchical structure maps nicely to human intuition for some domains
Not very scalable (though fast enough in most cases)
Local optima are a problem: the cluster hierarchy is not necessarily the best one
You might want/need to normalize your features first
42
Hierarchical clustering
Represent hierarchy in dendrogram Can be used to decide on number of desired clusters (a “cut” in the dendrogram) 43
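A sketch of agglomerative clustering with SciPy, showing the dendrogram and a cut into a chosen number of clusters (illustrative random data; average linkage and Euclidean distance assumed):

```python
# Sketch: agglomerative clustering with scipy, dendrogram and a "cut".
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="average", metric="euclidean")  # linkage matrix
dendrogram(Z)
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")       # cut into 2 clusters
print(labels)
```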
Hierarchical clustering
Represent hierarchy in dendrogram Also a good indication of possible outliers or anomalies 44
Agglomerative or divisive?
All implementations you’ll find will implement agglomerative clustering
Divisive clustering turns out to be way more computationally expensive: don’t bother!
Partitional clustering
Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters
Since only one set of clusters is output, the user normally has to input the desired number of clusters k
Most well-known algorithm: k-means clustering:
1. Decide on a value for k
2. Initialize k cluster centers (e.g. randomly over the feature space or by picking some instances randomly)
3. Decide the cluster memberships of the instances by assigning them to the nearest cluster center, using some distance measure
4. Recalculate the k cluster centers
5. Repeat steps 3 and 4 until none of the objects changed cluster membership in the last iteration (a minimal sketch of this loop follows below)
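A minimal numpy sketch of this loop, assuming random-instance initialization and Euclidean distance (no empty-cluster handling):

```python
# Minimal numpy sketch of the k-means loop described above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # step 2
    for _ in range(n_iter):
        # step 3: assign each instance to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute the centers (no empty-cluster handling in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                        # step 5: stop when stable
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(6, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```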
K-means example
Pick random centers (either in the feature space or by using some randomly chosen instances as starts): 47
K-means example
Calculate the membership of each instance: 48
K-means example
And reposition the cluster centroids: 49
K-means example
Reassign the instances again: 50
K-means example
Recalculate the centroids: No reassignments performed, so we stop here 51
K-means
Recall that a good cluster solution has high intra-cluster (in-cluster) similarity and low inter- cluster (between-cluster) similarity K-means optimizes for high intra-cluster similarity by optimizing towards a minimal total distortion: the sum of square distances of points to their cluster centroid
Note: an exact optimization of this objective function is hard, k-means is a heuristic approach not necessarily providing the global optimal outcome
Except in the one-dimensional case
min SSE = ∑_{k=1}^{K} ∑_{i=1}^{n_k} ||x_{ki} − μ_k||²
K-means
Strengths
Very simple to implement and debug Intuitive objective function: optimizes intra-cluster similarity Relatively efficient
Weaknesses
Applicable only when a mean is defined (to calculate the centroid of points): what about categorical data?
Often terminates at a local optimum, hence initialization is extremely important: try multiple random starts or use k-means++ (most implementations do one or both by default)
Need to specify k in advance: use the elbow point if unsure (e.g. plot SSE across different values for k, as sketched below)
Sensitive to noisy data and outliers
Again: normalization might be required
Not suitable to discover clusters with non-convex shapes
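A sketch of the elbow inspection mentioned above, using scikit-learn's KMeans and its inertia_ attribute (the SSE) on illustrative blob data:

```python
# Sketch: "elbow" inspection of the k-means objective (inertia / SSE) across k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)          # normalization often required

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k"); plt.ylabel("SSE (inertia)")
plt.show()
```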
53
K-means and non-convex shapes
54
K-means and local optima
55
K-means and setting the value for k
56
K-means and setting the value for k
Don’t be too swayed by your intuition in this case
Often, it pays off to start with a higher setting for k and post-process/inspect accordingly, even if you already have a number of optimal clusters in mind
57
K-means and setting the value for k
Don’t be too swayed by your intuition in this case
Often, it pays off to start with a higher setting for k and post-process/inspect accordingly, even if you already have a number of optimal clusters in mind
58
K-means++
An algorithm to pick good initial centroids
1. Choose the first center uniformly at random (in the feature space or from the instances)
2. For each instance x (or over a grid in the feature space), compute d(x, c): the distance between x and the nearest center c that has already been defined
3. Choose another center, using a weighted probability distribution where a point x is chosen with probability proportional to d(x, c)²
4. Repeat steps 2 and 3 until k centers have been chosen
5. Now proceed using standard k-means clustering (not repeating the initialization step, of course)
Basic idea: spread the initial centers out from each other, aiming for lower inter-cluster similarity (a minimal sketch of this seeding step follows below)
Most k-means implementations actually implement this variant
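A minimal numpy sketch of this seeding procedure, choosing the centers from the instances themselves:

```python
# Minimal numpy sketch of k-means++ seeding.
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                     # step 1: first center at random
    for _ in range(k - 1):
        # step 2: squared distance of each instance to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # step 3: sample the next center proportionally to d(x, c)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.random.rand(200, 2)
print(kmeanspp_init(X, k=3))
```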
59
DBSCAN
“Density-based spatial clustering of applications with noise”
Groups together points that are closely packed together (points with many nearby neighbors) Marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away) DBSCAN is also one of the most common clustering algorithms Hierarchical versions exist as well
60
DBSCAN
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
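A sketch of DBSCAN in scikit-learn on the classic two-moons toy data; eps and min_samples are illustrative choices:

```python
# Sketch: DBSCAN with scikit-learn; label -1 marks the "noise" points (outliers).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(labels))                       # cluster ids, possibly including -1 (noise)
print(np.sum(labels == -1), "noise points")
```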
61
Expectation-maximization based clustering
Based on Gaussian mixture models
K-means is EM’ish, but makes “hard” assignments of instances to clusters
Based on two steps:
E-step: assign probabilistic membership: the probability of membership to a cluster given an instance in the dataset
M-step: re-estimate the parameters of the model based on the probabilistic memberships
Which model are we estimating? A mixture of Gaussians
62
Expectation-maximization based clustering
63
Expectation-maximization based clustering
Also local optimization, but nonetheless robust
Learns a model for each cluster, so you can generate new data points from it (though this is possible with k-means as well: just fit a Gaussian for each cluster with mean on the centroid and variance derived using the minimal sum of squared errors)
Relatively efficient
Extensible to other model types (e.g. multinomial models for categorical data, or noise-robust models)
But:
Initialization of the models still important (local optimum problem) Also still need to specify k Also problems with non-convex shapes So basically: a less “hard” version of k-means
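A sketch of EM-based clustering with scikit-learn's GaussianMixture, showing the soft memberships and sampling new points from the fitted model (illustrative data):

```python
# Sketch: EM-based clustering with a Gaussian mixture model in scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.normal(0, 1, (100, 2)), np.random.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)   # E-step style probabilistic membership
hard_labels = gmm.predict(X)
new_points, _ = gmm.sample(10)            # generate new data from the learned model

print(soft_memberships[:3])
```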
64
Mean-shift clustering
Mean shift builds upon the concept of kernel density estimation (KDE): assume the data was sampled from a probability distribution; KDE is a method to estimate the underlying distribution
Works by placing a kernel on each point in the data set (kernel here: a weighting function); the most popular one is the Gaussian kernel
Adding all of the individual kernels up generates a probability surface (i.e., a density function)
65
Mean-shift clustering
https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/
66
Mean-shift clustering
Mean shift uses the KDE idea: how would the points move if they climbed up to the nearest peak of the KDE surface: iteratively shifting each point uphill until it reaches a peak
Depending on the kernel bandwidth used, the KDE surface (and clustering result) will be different
E.g. for tall, skinny kernels (i.e., a small kernel bandwidth), the resultant KDE surface will have a peak for each point; this will result in each point being placed into its own cluster
For extremely short, wide kernels (i.e., a large kernel bandwidth), this results in a wide, smooth KDE surface with one peak that all of the points will climb up to, resulting in one cluster
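A sketch of mean-shift in scikit-learn; the bandwidth (here estimated with a heuristic) plays the role of the kernel bandwidth discussed above:

```python
# Sketch: mean-shift clustering in scikit-learn; the bandwidth strongly affects the result.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.vstack([np.random.normal(0, 0.5, (100, 2)), np.random.normal(4, 0.5, (100, 2))])

bw = estimate_bandwidth(X, quantile=0.2)      # heuristic bandwidth estimate
ms = MeanShift(bandwidth=bw).fit(X)

print("bandwidth:", bw)
print("number of clusters:", len(ms.cluster_centers_))
```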
67
Validation
How to check the result of a clustering run?
Use k-means objective function (sum of squared errors) Many other measures as well, e.g. (Davies and Bouldin, 1979), (Halkidi, Batistakis, Vazirgiannis 2001) These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters
E.g. the Davies-Bouldin index:
DB_N = (1/N) ∑_{i=1}^{N} R_i, with R_i = max_{j≠i} R_ij (i = 1, …, N) and R_ij = (S_i + S_j) / D_ij
S_i: mean distance between the instances in cluster i and its centroid
D_ij: distance between the centroids of clusters i and j
A good result has a low index score
But: given the descriptive nature, validation here can be somewhat subjective
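A sketch comparing k-means runs for several values of k with scikit-learn's Davies-Bouldin implementation (illustrative data; lower is better):

```python
# Sketch: comparing clustering runs with the Davies-Bouldin index (lower is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.vstack([np.random.normal(0, 1, (100, 2)),
               np.random.normal(6, 1, (100, 2)),
               np.random.normal((0, 8), 1, (100, 2))])

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))
```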
68
Applications
Market research: customer and market segments, product positioning Social network analysis: recognize communities of people Social science: identify students, employees with similar properties Search: group search results Recommender systems: predict preferences based on user’s cluster Crime analysis: identify areas with similar crimes Image segmentation and color palette detection (e.g. recall LIME on images)
69
Domain knowledge
An interesting question for many algorithms is how domain knowledge can be incorporated in the learning step
For clustering, this is often done using must-link and can’t-link constraints: who should be and should not be in the same cluster? Nice as this does not require significant changes to algorithm, but can lead to infeasible solutions if too many constraints are added Another approach is a “warm start” solution by providing a partial solution
70
More on distance metrics
Most metrics we’ve seen were defined for numerical features, though these exist for textual and categorical data as well
E.g. Levenshtein distance between two text fields
Based on the number of edits (changes) performed: deletions, insertions and substitutions
Other metrics exist as well, e.g. Jaccard, cosine, Gower distance (https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9)
71
Self-organizing maps (SOMs)
Can be formalized as a special type of artificial neural networks Not really (only) clustering, but: unsupervised Produces a low-dimensional representation of the input space (just like PCA!) Also called a Kohonen map (Teuvo Kohonen)
https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/
Not that often used anymore, but this brings us to…
72
Dimensionality Reduction
73
PCA
http://setosa.io/ev/principal-component-analysis/
Principal components are calculated by making use of the eigenvector decomposition of the covariance matrix of the data:
PC_j = e_j′ X = e_{j1} X_1 + e_{j2} X_2 + ⋯ + e_{jp} X_p
(The eigenvalue λ_j represents the variance explained by PC_j)
Pro: powerful data reduction tool and principal components are uncorrelated
Con: PCA may be difficult to interpret, linear approach
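A sketch of PCA with scikit-learn on illustrative data; explained_variance_ratio_ corresponds to the (normalized) eigenvalues and components_ to the loadings:

```python
# Sketch: PCA with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 10)
X = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scales

pca = PCA(n_components=3).fit(X)
scores = pca.transform(X)                 # the principal component scores PC_j

print(pca.explained_variance_ratio_)      # fraction of variance per component
print(pca.components_.shape)              # (3, 10): loadings e_j
```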
t-SNE
t-Distributed Stochastic Neighbor Embedding
L.J.P. van der Maaten and G.E. Hinton, Visualizing High-Dimensional Data Using t-SNE, Journal of Machine Learning Research, 9(Nov):2579-2605, 2008 t-SNE is a dimensionality reduction technique
Comparable to PCA t-SNE seeks to preserve local similarities (small pairwise distances)
t-SNE is a non-linear dimensionality reduction technique based on manifold learning
Assumes data points lie on embedded non-linear manifold within higher-dimensional space Manifold is topological space that locally resembles Euclidean space near each data point Example:
A surface as a 2D manifold, locally resembling Euclidean plane near each data point A 3D manifold which can be described by collection of 2D manifolds
Higher dimensional space can thus be well “embedded” in lower dimensional space
Other manifold dimensionality reduction techniques
Multi Dimensional Scaling (MDS) Isomap Locally linear embedding (LLE) Auto-encoders
75
t-SNE
Works in two steps:
1. A probability distribution representing a similarity measure over pairs of high-dimensional data points is constructed
2. A similar probability distribution over the data points in the low-dimensional map is constructed
“Similar” using the Kullback–Leibler divergence (aka information gain, relative entropy): the divergence between the two distributions is minimized

p_{j|i} = exp(−||x_i − x_j||² / (2σ_i²)) / ∑_{k≠i} exp(−||x_i − x_k||² / (2σ_i²))
p_{ij} = (p_{j|i} + p_{i|j}) / (2N), with N the dimensionality of the data
Assumption: p_{i|i} = p_{ii} = 0
t-SNE
Step 1: measure similarities between data points in the original (high) dimensional space
Similarity of x_j to x_i is the conditional probability p_{j|i} that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian kernel centered at x_i
Then measure the density of all other data points under this Gaussian distribution and normalize
77
t-SNE
With N = 2, suppose we have 3 data points: x_i, x_j and x_k
We then compute the conditional and joint probabilities using the formulas for p_{j|i} and p_{ij} given earlier
Obviously a toy example, as “reducing” two dimensions doesn’t make a lot of sense
t-SNE
With N = 2, numerator: a Gaussian distribution centered at x_i gives the unnormalized similarities
p_{j|i} = 0.78 / z,  p_{k|i} = 0.60 / z
79
t-SNE
With N = 2, denominator: these similarity measures are normalized against all other points, except x_i itself
p_{j|i} = 0.78 / 55.62
80
t-SNE
With N = 2, finally, we can compute the joint probabilities
p_{ij} = 0.0069 (← more similar), p_{ik} = 0.0049
81
t-SNE
Based on Euclidean distance; the bandwidth σ_i of the Gaussian kernel is data point dependent!
σ_i is set based upon the perplexity, a measure estimating how well the distribution predicts a sample
σ_i is then set in such a way that the perplexity of the conditional distribution equals a predefined perplexity
As a result, the bandwidth parameter is adapted to the density of the data: smaller values of σ_i are used in denser parts of the data space
This (the perplexity) is one of the key user-specified hyperparameters of t-SNE
82
t-SNE
Step 2: measure similarities q_ij between data points in the low dimensional space instead of p_ij, i.e. between the mapped points y_i, y_j, …
A Student t-distribution is used to measure the similarities, with degrees of freedom = dimensionality of the mapped space − 1
The Student t-distribution has fatter tails than the Gaussian distribution
Assumption: q_{i|i} = q_{ii} = 0
No perplexity parameter here
83
t-SNE
Next, the differences between p_ij and q_ij are considered
q_ij obviously depends on how we place the data points in the mapped low dimensional space
So the locations are determined by minimizing the Kullback-Leibler divergence, optimized using standard gradient descent:
KL(P || Q) = ∑_{i≠j} p_ij log(p_ij / q_ij)
t-SNE
t-SNE shines when dealing with high dimensional data
E.g. images or word documents
85
t-SNE
t-SNE itself is not a clustering technique
A 2nd-level clustering can however be easily performed on the mapped space, using for example k-means clustering, DBSCAN or other clustering techniques
Of course, you can also use the mapped coordinates as new instance features
But be careful: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne/264647#264647
86
t-SNE
No theoretical reason, but most implementations only allow the lower-dimensional space to be 2D or 3D (computational costs)
Also, t-SNE learns a non-parametric mapping
There is no explicit function that maps data from the input space to the map
Not possible to embed unseen test points in an existing map, so featurization is difficult, and t-SNE is less suitable as a dimensionality reduction technique in a predictive setup
Extensions exist however that learn a multivariate regressor to predict the map location from the input data, or construct a regressor that minimizes the t-SNE loss directly (“parametric t-SNE”)
Also, you can provide your own pairwise similarity matrix and do KL-minimization instead of using the built-in conditional probability based similarity measure
Diagonal elements should be 0 and should be normalized to sum to 1 Distance matrix can also be used: similarity = 1 - distance This avoids having to tune the perplexity parameter (but you will need to decide on a similarity metric)
87
t-SNE
As t-SNE uses a gradient descent based approach, remarks regarding learning rates and initialization of mapped points apply
E.g. initialization sometimes done using PCA Defaults usually work well See deep learning session for more on gradient descent and learning rates
Most important hyperparameter is perplexity
Knob that sets number of effective nearest neighbors (similar to k in k-NN) Perplexity value depends on density of data Denser dataset requires larger perplexity Typical values range between 5 and 50
Thinking point: couldn’t the conditional probability be set using a (weighted) k-NN?
t-SNE
Impact of perplexity: neighborhood effectively considered! 89
t-SNE
Different perplexity values can lead to very different results
“Size” of “clusters” has no meaning Neither does “distance” between clusters (only locally: manifold) Random data can end up looking “meaningful” More examples at https://distill.pub/2016/misread-tsne/
90
t-SNE
Matlab (original release): https://lvdmaaten.github.io/tsne/
Now built-in in Matlab
R (tsne package): https://cran.r-project.org/web/packages/tsne/
Python (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Julia, Java, Torch implementations also available
Parametric version: https://github.com/kylemcdonald/Parametric-t-SNE
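A sketch of the scikit-learn implementation on the digits data set, with the perplexity set explicitly:

```python
# Sketch: t-SNE with scikit-learn on the digits data, plotting the 2D map.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.show()
```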
91
UMAP
Uniform Manifold Approximation and Projection for Dimension Reduction
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426
UMAP is a dimensionality reduction technique
Comparable with Principal Component Analysis (PCA) and t-SNE
Like t-SNE, UMAP is a non-linear dimensionality reduction technique based on manifold learning
Newer (2018) but already well-proven in bioinformatics, materials science and machine learning
Performs better in mappings with a dimensionality > 3
Can incorporate supervised labels in the construction of the mapping
Parametric by default: better suited as a general-purpose preprocessing technique compared to t-SNE
Good performance (scales better than t-SNE, MDS)
Nicer properties in terms of interpretation when compared to t-SNE
92
UMAP
Like t-SNE, UMAP is a manifold learner
Recall: a manifold is a topological space that locally resembles Euclidean space near each point UMAP aims to construct a locally-connected Riemannian manifold with a locally constant Riemannian metric Study of Riemannian manifolds is a (hard) research area on its own, in what follows, we present a basic intuitive overview of UMAP
93
UMAP
First, we need a concept from topology: simplicial complexes
A means to construct topological spaces out of simple components Allows to reduce the complexities of dealing with the continuous geometry of topological spaces to relatively simple combinatorics and counting
The basic building block is a simplex: a way to build a k dimensional object A k-simplex is constructed by taking the convex hull of k + 1 independent points Every k-simplex hence describes a simple combinatorial structure
A k-simplex can be regarded as a set of k + 1 objects with faces E.g. tetrahedron (3-simplex) consists of 4 triangles
94
UMAP
To construct topological spaces, simplices can be combined in a “simplicial complex” K
A set of simplices glued together along faces Any face of any simplex in K is also in K (i.e. every sub-simplex of simplices in K are also part of the complex) The intersection of any two simplices in K is a face of both simplices
Next, we will construct a Čech complex (which is a simplicial complex) given an open cover of a topological space
An open cover is some family of sets whose union is the whole space
A Čech complex converts an open cover into a simplicial complex: let each set in the cover be represented with a 0-simplex (a single point); create a 1-simplex between two such sets if they have a non-empty intersection; create a 2-simplex between three such sets if the triple intersection of all three is non-empty; and so on
Topological theory provides guarantees about how well this simple process can produce something that represents the topological space itself in a meaningful way (the Nerve theorem)
95
UMAP
Let’s illustrate this on a toy example of a two-dimensional data set
Assume samples are drawn from an underlying topological space To generate an open cover, we can simply create blobs with a fixed radius around each point as an approximation of the topological space (as we need to define intersections)
96
UMAP
We can then construct a Čech complex
Every data point can serve as the 0-simplex to expand from I.e. we get points, lines and triangles Similar in higher dimensional space (but harder to plot)
Note that the simplicial complex relatively well captures the topology of the dataset In fact, most of the work is being done by the 0- and 1-simplices: points and lines You could argue that this will be the same in higher-dimensional space, i.e. why bother with triangles, tetrahedrons, …?
Vietoris-Rips complex is similar to the Čech complex but is only determined by the 0- and 1-simplices: computationally easier and can be used instead! 97
UMAP
The goal is now to construct a lower-dimensional mapping of the data that has a similar topological representation
If we continue with a Vietoris-Rips complex, we basically get a graph with nodes (points) and edges (lines)
We could then use any existing graph layout algorithm (e.g. based on spectral methods or force layout) to lay out the graph structure in a 2D (or higher-dimensional) space
This is a simple way to think about the basics of UMAP: construct a graph over the original data set and then reduce it using a layout algorithm to the lower-dimensional space
Also see the social network session later on
UMAP does it differently, however…
98
UMAP
An obvious difficulty is that picking a radius for the blobs around our data points is so far arbitrary
Radius too small → the resulting simplicial complex splits into many connected components Radius too large → simplicial complex turns into just a few very high dimensional simplices and fails to capture the manifold structure anymore
If the data would be uniformly distributed across the underlying topology, picking a radius would be easy and stable: 99
UMAP
This assumption of uniform distribution turns up frequently in manifold learning (Laplacian eigenmaps, Nerve theorem, …)
But real data is not Solution: assume that the data is uniformly distributed, but the notion of distance is varying across the (original) manifold We can compute (or at least approximate) a local notion of distance for each point by making use of Riemannian geometry: a unit blob around a point stretches to the k-th nearest neighbor of the point, where k is the sample size we are using to approximate the local sense of distance Each point is given its own unique distance function, and we construct point-local blobs Similar to point dependent bandwidth around points in t-SNE with fixed perplexity!
E.g. for k = 2:
100
UMAP
We have now replaced choosing the radius of the blobs with choosing a value for k
However, it is often easier to pick a scale in terms of the number of neighbors than it is to pick a radius per point
The choice of k determines how locally we wish to estimate the Riemannian metric
Small means a very local interpretation which will capture fine details Large means estimates will be based on larger regions and more broadly accurate across the data set as a whole
Note that the point-local radii can also provide us with a weighting mechanism based on distance to the nearest neighbors (i.e. we could make the definitions of intersections of the open covers fuzzy instead of hard-bound) E.g. shown around two points (again comparable to t-SNE):
101
UMAP
When we map this back to the Vietoris-Rips complex and the resulting graph, you can think of the distances now determining the weights of the edges
Weights are scaled between [0, 1] to be interpreted as the probability that a 1-simplex exists A threshold can be set to remove edges with low weights (improves computational time) Or simply based on the non-fuzzy k-NN approach To prevent isolated points, we ensure connections between nearest neighbors (graph as a whole does not need to be connected, but nodes cannot be completely isolated)
102
UMAP
One issue remains: the local fuzzy similarities are not compatible with each other, e.g. d(a, b) might be different from d(b, a): for each pair of points we actually have two edges with differing weights
Again comparable with the potentially asymmetric conditional probabilistic similarity in t-SNE
UMAP sets the final weight to d(a, b) + d(b, a) − d(a, b) × d(b, a)
Summarized: UMAP constructs a weighted graph over the data points using a fuzzy distance metric
We now need to find a good low dimensional representation
For any such representation, we can construct the same weighted graph using the same procedure
However, contrary to the initial data, here we would like the distances between points to be standard rather than locally varying, so we can pick a fixed “radius” (the “min_dist” hyperparameter: distance to the closest neighbor)
We need a measure to evaluate the degree of matching between those two representations, i.e. the two weighted graphs
UMAP
In both representations (original and reduced), we have a set of pairwise weights between points which can be interpreted as the probability that a 1-simplex exists between them
Assume these to be Bernoulli variables, as ultimately the simplex either exists or not
We can then use a standard cross-entropy minimization (e.g. using gradient descent as in t-SNE) to have the points settle in the lower dimensional space, minimizing the errors between the two graphs
Efficient implementations use some further shortcuts to make this optimization procedure fast, e.g. the Nearest-Neighbor-Descent algorithm, smoothing the loss function and negative sampling
A Vietoris-Rips complex with 0- and 1-simplices is sufficient, though a full Čech complex could be used as well (“edges” can then involve more than 2 nodes)
104
UMAP
UMAP is quite similar to t-SNE but comes with a sounder theoretical foundation, leading to more stable results
Only a couple of hyperparameters to tune: the number of neighbors (to set the fuzzy distance radius), the minimum distance between instances in the mapped space, the number of dimensions in the mapped space and the base distance metric to use between pairs of points (Euclidean or otherwise)
Allows embedding unseen points in the mapped space (exact workings are out of scope)
Like t-SNE, a second-stage clustering can be applied
Generally easier to interpret as more global structure is preserved: UMAP aims to have distances between clusters of points also be meaningful, which is not the case in t-SNE
Labels can be incorporated as well (basically splits up the weighted graph according to labels), or semi-supervised (partial labels)
105
UMAP
UMAP can (should, even) be used as a drop-in t-SNE replacement and hence also works very well with high-dimensional data
https://umap-learn.readthedocs.io/en/latest/supervised.html
106
UMAP
Python (reference) implementation (works on top of scikit-learn): https://umap-learn.readthedocs.io/en/latest/index.html (contains much more information on UMAP)
R implementations: umap (https://github.com/tkonopka/umap), umapr (https://github.com/ropenscilabs/umapr) and uwot (https://github.com/jlmelville/uwot) packages
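A sketch using the umap-learn reference implementation (assuming the package is installed); the key hyperparameters discussed above are passed explicitly, and unlike t-SNE the fitted model can embed unseen points:

```python
# Sketch: UMAP with the umap-learn reference implementation.
from sklearn.datasets import load_digits
import umap

X, y = load_digits(return_X_y=True)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric="euclidean")
X_2d = reducer.fit_transform(X)

# Unlike t-SNE, the fitted model can embed unseen points:
new_points = reducer.transform(X[:10])
print(X_2d.shape, new_points.shape)
```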
107
Anomaly Detection
108
Anomaly detection
Anomaly detection (also called outlier detection or novelty detection) is the identification of instances which do not conform to an expected pattern or other items in a dataset
Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text Anomalies: outliers, novelties, noise, deviations, exceptions Uses a combination of unsupervised, supervised, statistical techniques
109
Statistical approaches
Histograms, distributions, box plots: spotting outliers
Easy to use but univariate in many cases
Other statistical methods, e.g. robust covariance 110
Clustering based
Cluster, and check who falls outside the clusters
Distance based Or clusters with few instances Or based on dendrogram Or based on noisy instances in DBSCAN Can be combined with dimensionality reduction techniques as seen before
111
Clustering based
https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00
112
Clustering based
https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00
113
One-class based methods
E.g. “one-class SVMs” Linked to PU learning (positive-unlabeled)
See also: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.329.6479&rep=rep1&type=pdf
“Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive”
114
Isolation forests
Based on random forest idea
But: splits for nodes are chosen completely at random, not entropy-driven An extreme form of extra randomized trees Anomaly is then defined based on average distance from root node to leaf node containing the instance Outliers have a higher chance to be separated quickly! Very powerful technique
https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py
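A sketch of an isolation forest in scikit-learn; the contamination value is an illustrative assumption:

```python
# Sketch: isolation forest in scikit-learn; predict returns -1 for anomalies,
# score_samples gives a continuous anomaly score (lower = more anomalous).
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.vstack([np.random.normal(0, 1, (500, 2)),       # "normal" data
               np.random.uniform(-6, 6, (10, 2))])     # a few scattered outliers

iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)            # 1 = inlier, -1 = outlier
scores = iso.score_samples(X)

print(np.sum(labels == -1), "flagged instances")
```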
115
Local outlier factor (LOF)
Based on K-NN
Measure the local deviation of density of a given sample with respect to its neighbors Depends on how isolated the object is with respect to the surrounding neighborhood, locality is given by k-nearest neighbors, whose distance is used to estimate the local density By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers
https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html#sphx-glr-auto-examples-neighbors-plot-lof-outlier-detection-py
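A sketch of LOF in scikit-learn on the same kind of illustrative data; n_neighbors is the k used to estimate the local density:

```python
# Sketch: Local Outlier Factor in scikit-learn (fit_predict: -1 = outlier).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.normal(0, 1, (500, 2)),
               np.random.uniform(-6, 6, (10, 2))])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)                   # 1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_    # higher = more outlying

print(np.sum(labels == -1), "flagged instances")
```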
116
CADE: Classifier-Adjusted Density Estimation for Anomaly Detection and One- Class Classification
Works with any classifier that can produce a probability
A data set is constructed based on the original instances (non-outliers) and a synthetic dataset based on uniform data generation over the given features (outliers)
Original instances have y = 0, synthetic ones y = 1
A classifier is trained on this data set; its predictions are used to determine which original instances get a high outlier score
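A sketch of this idea, using a random forest as the probability-producing classifier (any probabilistic classifier would do; data and parameters are illustrative):

```python
# Sketch of the CADE idea: augment the real data with uniform synthetic points,
# train a probabilistic classifier to separate real (y=0) from synthetic (y=1),
# and use the predicted probability of "synthetic" as an outlier score.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_real = np.random.normal(0, 1, (500, 3))

# Uniform synthetic data over the observed feature ranges
lo, hi = X_real.min(axis=0), X_real.max(axis=0)
X_synth = np.random.uniform(lo, hi, size=X_real.shape)

X = np.vstack([X_real, X_synth])
y = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_synth))])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
outlier_scores = clf.predict_proba(X_real)[:, 1]   # high score = looks like uniform noise

print(np.argsort(outlier_scores)[-5:])             # indices of the 5 most outlying instances
```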
117
Some ideas for time series…
Peer Group Analysis (Bolton and Hand, 2001)
Define peer group for each object (e.g. transaction, account, customer) Peer group consists of other objects that behaved similarly in the past Anomalous behavior starts when object starts behaving significantly different from peer group Focus on local instead of global patterns (depending upon the number of peers to consider) Especially suitable to monitor behavior over time in e.g. time series analysis
118
Some ideas for time series…
Break point analysis (Bolton and Hand, 2001)
Break point is an observation or time where anomalous behavior is detected Operates on the account level by comparing sequences of transactions (amount or frequency) to find a break point Choose a fixed time moving window (new transactions enter; old transactions leave) Need to decide on amount of recent versus old transactions to compare Can use statistical tests to compare new transactions to old transactions Works at account-level so does not build profiles by looking at other accounts
119
Time series based
Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test (extreme Studentized deviate) for detecting anomalies
Can be used to detect both global as well as local anomalies This is achieved by employing time series decomposition and using robust statistical metrics, viz., median together with ESD (detect outliers)
Trend component Cyclical component Seasonal component Irregular component
In addition, for long time series (say, 6 months of minutely data), the algorithm employs piecewise approximation; this is rooted in the fact that trend extraction in the presence of anomalies is non-trivial for anomaly detection
120
Seasonal Hybrid ESD (S-H-ESD)
121
Seasonal Hybrid ESD (S-H-ESD)
122
Seasonal Hybrid ESD (S-H-ESD)
https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series