Advanced Analytics in Business [D0S07a] Big Data Platforms & - - PowerPoint PPT Presentation


Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Unsupervised Learning Anomaly Detection Overview Frequent itemset and association rule mining Other itemset extensions Clustering Dimensionality reduction


slide-1
SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Unsupervised Learning Anomaly Detection

slide-2
SLIDE 2

Overview

Frequent itemset and association rule mining
Other itemset extensions
Clustering
Dimensionality reduction
Anomaly detection

2

slide-3
SLIDE 3

The analytics process

3

slide-4
SLIDE 4

Recall

Supervised learning

You have a labelled data set at your disposal
Correlate features to target
Common case: predict the future based on patterns observed now (predictive)
Classification (categorical) versus regression (continuous)

Unsupervised learning

Describe patterns in data
Clustering, association rules, sequence rules
No labelling required
Common case: descriptive, explanatory

For unsupervised learning, we don’t assume a label or target … but we do need to define what kind of patterns we want

slide-5
SLIDE 5

Frequent Itemset and Association Rule Mining

5

slide-6
SLIDE 6

Introduction

Association rule learning is a method for discovering interesting relations between variables

Intended to identify rules discovered in databases using some measures of interestingness
“Interesting?” Frequent, rare, costly, strange?
For example, the rule {onions, tomatoes, ketchup} → {burger} found in the sales data of a supermarket would indicate that if a customer buys onions, tomatoes and ketchup together, they are likely to also buy hamburger meat, which can be used e.g. for promotional pricing or product placements
Application areas in market basket analysis, web usage mining, intrusion detection, production and manufacturing
Association rule learning typically does not consider the order of items, either within a transaction or across transactions (sequence mining does)

Pioneering technique: apriori algorithm (Rakesh Agrawal, 1993) 6

slide-7
SLIDE 7

{beer, diapers}?

https://www.itbusiness.ca/news/behind-the-beer-and-diapers-data-mining-legend/136

1992, Thomas Blischok, manager of a retail consulting group at Teradata
Prepared an analysis of 1.2 million market baskets from 25 Osco Drug stores
Database queries were developed to identify affinities. The analysis “did discover that between 5 and 7 p.m., consumers bought beer and diapers”
Osco managers did not exploit the beer and diapers relationship by moving the products closer together

7

slide-8
SLIDE 8

{lotion, calcium, zinc}?

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

“Before long some useful patterns emerged…”
Women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester
Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc
When someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date
25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score
More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy

“My daughter got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

8

slide-9
SLIDE 9

Transactional database

Every “instance” now represents a transaction Features correspond with items: binary categoricals

  • Tr. ID

milk bread beer cheese wine spaghetti 101 1 1 1 102 1 1 1 1 103 1 1 1 1 104 1 1 1 105 1 1 1 1 1

9

slide-10
SLIDE 10

Mining interesting rules

What constitutes a “good” rule?

To select rules, constraints on various measures of “interest” are used Most known measures/constraints: minimum thresholds on support and confidence

  • Tr. ID

milk bread beer cheese wine spaghetti 101 1 1 1 102 1 1 1 1 103 1 1 1 1 104 1 1 1 105 1 1 1 1 1

Support({milk, bread, cheese}) = 2/5 = 0.4

$$\mathrm{Support}(X \subseteq I, T) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}$$

10
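As a minimal sketch of the support definition, the snippet below counts the fraction of transactions that contain an itemset. The `transactions` list is a hypothetical example (it is not the table from the slide), and the `support` helper name is an assumption made for illustration.

```python
# Hypothetical transactions, each one a set of items (not the slide's table).
transactions = [
    {"milk", "bread", "cheese"},
    {"bread", "cheese", "wine"},
    {"milk", "bread", "beer"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"milk", "bread"}, transactions))
print(support({"milk", "bread", "cheese"}, transactions))
```

The subset test `itemset <= t` is the direct counterpart of X ⊆ t in the formula.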

slide-11
SLIDE 11

Mining interesting rules

What constitutes a “good” rule?

To select rules, constraints on various measures of “interest” are used Most known measures/constraints: minimum thresholds on support and confidence

  • Tr. ID

milk bread beer cheese wine spaghetti 101 1 1 1 102 1 1 1 1 103 1 1 1 1 104 1 1 1 105 1 1 1 1 1

Confidence({cheese, wine} → {spaghetti}) = 0.4 / 0.6 ≈ 0.67
Can be interpreted as an estimate of the conditional probability $P(Y \mid X)$

$$\mathrm{Confidence}(X \subseteq I \Rightarrow Y \subseteq I, T) = \frac{\mathrm{Support}(X \cup Y, T)}{\mathrm{Support}(X, T)}$$

slide-12
SLIDE 12

Mining interesting rules

Other measures exist as well:

Lift:

$$\mathrm{Lift}(X \subseteq I \Rightarrow Y \subseteq I, T) = \frac{\mathrm{Support}(X \cup Y, T)}{\mathrm{Support}(X, T) \times \mathrm{Support}(Y, T)}$$

If lift = 1, the occurrences of the antecedent and the consequent are independent of each other
If lift > 1, its value signifies the degree of dependence
Considers both the confidence of the rule and the overall data set

Conviction:

$$\mathrm{Conviction}(X \subseteq I \Rightarrow Y \subseteq I, T) = \frac{1 - \mathrm{Support}(Y, T)}{1 - \mathrm{Confidence}(X \Rightarrow Y, T)}$$

Interpreted as the ratio of the expected frequency that X occurs without Y (if X and Y were independent) to the observed frequency of X occurring without Y
Measure for the frequency that the rule makes an incorrect prediction, e.g. a value of 1.2 would indicate that the rule would be incorrect 1.2 times as often if the association between X and Y was by chance

Cost sensitive measures exist here as well (“profit” or “utility” based rule mining)

E.g. Kitts et al., 2000

$$\mathrm{ExpectedProfit}(X \subseteq I \Rightarrow Y \subseteq I, T) = \mathrm{Confidence}(X \Rightarrow Y, T) \sum_i \mathrm{Profit}(Y_i)$$

$$\mathrm{IncrementalProfit}(X \subseteq I \Rightarrow Y \subseteq I, T) = \left(\mathrm{Confidence}(X \Rightarrow Y, T) - P(Y)\right) \sum_i \mathrm{Profit}(Y_i)$$

12
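The confidence, lift and conviction formulas can be sketched in a few lines of Python. The transactions are a hypothetical example and all function names are assumptions for illustration; the code does not guard against an antecedent with zero support.

```python
# Hypothetical transactions (not the slide's table).
transactions = [
    {"milk", "bread", "cheese"},
    {"bread", "cheese", "wine"},
    {"milk", "bread", "beer"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]


def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    # Support(X u Y) / Support(X): an estimate of P(Y | X).
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    # Support(X u Y) / (Support(X) * Support(Y)), written via the confidence.
    return confidence(x, y, transactions) / support(y, transactions)


def conviction(x, y, transactions):
    conf = confidence(x, y, transactions)
    if conf == 1.0:
        return float("inf")  # the rule is never wrong on this data
    return (1 - support(y, transactions)) / (1 - conf)


rule = ({"cheese", "wine"}, {"spaghetti"})
print(confidence(*rule, transactions))
print(lift(*rule, transactions))
print(conviction(*rule, transactions))
```

On this toy data the rule {cheese, wine} → {spaghetti} has a confidence of about 0.67, a lift of about 1.67 and a conviction of about 1.8.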

slide-13
SLIDE 13

The apriori algorithm

Algorithm:

  • 1. A minimum support threshold is applied to find all frequent itemsets
  • 2. A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules

Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets
The set of possible itemsets is the power set over I and has size $2^{|I|} - 1$ (excluding the empty set which is not a valid itemset)
Although the size of the power set grows exponentially in the number of items $|I|$, efficient search is possible using the downward-closure property of support (or: anti-monotonicity), which guarantees that for a frequent itemset, all its subsets are also frequent, and thus for an infrequent itemset, all its supersets must also be infrequent
Exploiting this property, efficient algorithms can find all frequent itemsets

13

slide-14
SLIDE 14
  • Tr. ID

milk bread beer cheese wine spaghetti 101 1 1 1 102 1 1 1 1 103 1 1 1 1 104 1 1 1 105 1 1 1 1 1

The apriori algorithm

A minimum support threshold is applied to find all frequent itemsets All itemsets with support >= 50%

Itemset (‘milk’,) has a support of: 0.6
Itemset (‘bread’,) has a support of: 0.8
Itemset (‘cheese’,) has a support of: 0.8
Itemset (‘wine’,) has a support of: 0.6
Itemset (‘spaghetti’,) has a support of: 0.6
Itemset (‘milk’, ‘bread’) has a support of: 0.6
Itemset (‘bread’, ‘cheese’) has a support of: 0.6
Itemset (‘cheese’, ‘wine’) has a support of: 0.6
Itemset (‘cheese’, ‘spaghetti’) has a support of: 0.6

… brute force leads to $2^{|I|} - 1 = 2^6 - 1 = 63$ possibilities

For 100 items we’d already have $2^{100} - 1 \approx 1.27 \times 10^{30}$ possibilities

14

slide-15
SLIDE 15

The apriori algorithm

A minimum support threshold is applied to find all frequent itemsets We can speed this up with a “step by step” expansion: we only need to continue expanding itemsets that are above the threshold, only using items above the threshold 15

slide-16
SLIDE 16

The apriori algorithm

A minimum support threshold is applied to find all frequent itemsets “Join and prune”: an even better way (proposed by apriori)

Say we want to generate candidate 3-itemsets (sets with three items)
Look at previous: we only need the (3-1)-itemsets to do so, and only the ones which had enough support: {milk, bread}, {bread, cheese}, {cheese, wine}, {cheese, spaghetti}
Join on self: join this set of itemsets on itself to generate a list of candidates with length 3:

{milk, bread} x {bread, cheese} = {milk, bread, cheese}
{bread, cheese, wine}
{bread, cheese, spaghetti}
{cheese, wine, spaghetti}

Prune result: prune the candidates containing a (3-1)-subset that did not have enough support (all candidates can be pruned in this case):

{milk, bread, cheese}
{bread, cheese, wine}
{bread, cheese, spaghetti}
{cheese, wine, spaghetti}

This is repeated for every step
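The join and prune steps can be sketched as follows. This is a simplified illustration, not the original apriori implementation: the join simply unions pairs of frequent (k-1)-itemsets, which yields the same candidates after pruning as the classic prefix-based join. The function names and the `transactions` data are assumptions for illustration.

```python
from itertools import combinations


def join_and_prune(frequent_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    frequent_prev = {frozenset(s) for s in frequent_prev}
    # Join: union pairs of (k-1)-itemsets whose union has exactly k items.
    joined = {a | b for a in frequent_prev for b in frequent_prev
              if len(a | b) == k}
    # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
    return {c for c in joined
            if all(frozenset(s) in frequent_prev
                   for s in combinations(c, k - 1))}


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def sup(c):
        return sum(1 for t in transactions if c <= t) / n

    level = {frozenset([i]) for t in transactions for i in t}
    level = {c for c in level if sup(c) >= min_support}
    frequent, k = {}, 1
    while level:
        frequent.update({c: sup(c) for c in level})
        k += 1
        level = {c for c in join_and_prune(level, k) if sup(c) >= min_support}
    return frequent


# Frequent 2-itemsets listed on the slide: every 3-candidate gets pruned.
slide_pairs = [{"milk", "bread"}, {"bread", "cheese"},
               {"cheese", "wine"}, {"cheese", "spaghetti"}]
print(join_and_prune(slide_pairs, 3))

# Hypothetical transactions (not the slide's table).
transactions = [
    {"milk", "bread", "cheese"},
    {"bread", "cheese", "wine"},
    {"milk", "bread", "beer"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]
print(apriori(transactions, 0.6))
```

Only the candidates that survive pruning are counted against the database, which is where the saving over brute force comes from.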

slide-17
SLIDE 17

The apriori algorithm

17

slide-18
SLIDE 18

The apriori algorithm

A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules

Once the frequent itemsets are obtained, association rules are generated as follows
For each frequent itemset I, generate all non-empty subsets of I
For every non-empty proper subset, check the confidence value of the corresponding rule and retain those above a threshold
E.g. for frequent itemset {cheese, wine, spaghetti}, we’d check
{cheese, wine} → {spaghetti}
{cheese, spaghetti} → {wine}
{wine, spaghetti} → {cheese}
{cheese} → {wine, spaghetti}
{spaghetti} → {cheese, wine}
{wine} → {cheese, spaghetti}
… and keep those with sufficient confidence

$$\forall I_s \in \mathcal{P}(I) : I_s \neq \emptyset \wedge I_s \neq I \wedge \mathrm{Confidence}(I_s \Rightarrow I \setminus I_s) > \mathit{minconf} \rightarrow I_s \Rightarrow I \setminus I_s$$
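A short sketch of this rule-generation step: enumerate every non-empty proper subset of a frequent itemset as antecedent and keep the rules whose confidence exceeds the threshold. The transactions and the `rules_from_itemset` name are hypothetical and only serve as illustration.

```python
from itertools import combinations

# Hypothetical transactions (not the slide's table).
transactions = [
    {"milk", "bread", "cheese"},
    {"bread", "cheese", "wine"},
    {"milk", "bread", "beer"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]


def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, min_conf):
    """Return (antecedent, consequent, confidence) for rules above min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):          # sizes 1 .. |I|-1: proper subsets
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = (support(itemset, transactions)
                    / support(antecedent, transactions))
            if conf > min_conf:
                rules.append((antecedent, consequent, conf))
    return rules


for a, c, conf in rules_from_itemset({"cheese", "wine", "spaghetti"},
                                     transactions, min_conf=0.9):
    print(sorted(a), "->", sorted(c), round(conf, 2))
```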

slide-19
SLIDE 19

Extensions

19

slide-20
SLIDE 20

Categorical and continuous variables

E.g. {browser=“firefox”, pages_visited < 10} → {sale_made=“no”}

Easy approach: convert each level to a binary “item” (see “dummy” codification in pre-processing) and place continuous variables in bins before converting to binary value: {browser-firefox, pages-visited-0-to-20} → {no-sale-made}

Group categorical variables with many levels
Drop frequent levels which are not considered interesting, e.g. prevent browser-chrome turning up in rules if 90% of visitors use this browser
Binning of continuous variables requires some fine-tuning (otherwise support or confidence too low)

Better statistical methods are available, see e.g. R. Rastogi and Kyuseok Shim, “Mining optimized association rules with categorical and numeric attributes,” in IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp. 29-50 and others
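The “easy approach” can be sketched as a small helper that turns one record into a set of binary items. The `to_items` name, the record fields and the bin edges are hypothetical choices for illustration.

```python
def to_items(record, bins):
    """Turn one record (a dict) into a set of binary 'items'.

    `bins` maps a continuous feature name to a list of half-open
    intervals (lo, hi); all other features are treated as categorical.
    """
    items = set()
    for name, value in record.items():
        if name in bins:
            # Continuous feature: emit the bin [lo, hi) the value falls into.
            for lo, hi in bins[name]:
                if lo <= value < hi:
                    items.add(f"{name}-{lo}-to-{hi}")
                    break
        else:
            # Categorical feature: one item per level.
            items.add(f"{name}-{value}")
    return items


record = {"browser": "firefox", "pages_visited": 7, "sale_made": "no"}
bins = {"pages_visited": [(0, 20), (20, 100)]}
print(to_items(record, bins))
```

The resulting item sets can be fed to any itemset miner; choosing the bin edges is the part that needs tuning.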

slide-21
SLIDE 21

Multi-level association rules

E.g. product hierarchies

Using lower levels only? Highest only? Something in-between?

Possible, but items in lower levels might not have enough support Rules overly specific

Srikant R. & Agrawal R., “Mining Generalized Association Rules”, In Proc. 1995 Int. Conf. Very Large Data Bases, Zurich, 1995 21

slide-22
SLIDE 22

Discriminative association rules

Bring in some “supervision”

E.g. you’re interested in rules with outcome → {beer}, or perhaps → {beer, spirits}
Interesting because multiple class labels can be combined in consequent, and consequent can also involve non-class labels to derive interesting correlations
Learned patterns can be used as features for other learners, see e.g. Keivan Kianmehr, Reda Alhajj, “CARSVM: A class association rule-based classification framework and its application to gene expression data”

22

slide-23
SLIDE 23

Tuning

Possibility to tune “interestingness”

Rare patterns: low support but still interesting

E.g. people buying Rolex watches Mine by setting special group-based support thresholds (context matters)

Negative patterns:

E.g. people buying a Hummer and Tesla together will most likely not occur Negatively correlated, infrequent patterns can be more interesting than positive, frequent patterns

Possibility to include domain knowledge

Block frequent itemsets, e.g. if {cats, mice} occurs as a subset (no matter the support) Or allow certain itemsets, even if the support is low

23

slide-24
SLIDE 24

Filtering

Often lots of association rules will be discovered!

Post-processing is a necessity Perform sensitivity analysis using minsup and minconf thresholds Trivial rules

E.g., buy spaghetti and spaghetti sauce

Unexpected / unknown rules

Novel and actionable patterns, potentially interesting! Confidence might not always be the best metric (lift, conviction, rarity, …)

Appropriate visualisation facilities are crucial!

Association rules can be powerful but really require that you “get to work” with the results!
See e.g. arulesViz package for R: https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf

slide-25
SLIDE 25

Filtering

25

slide-26
SLIDE 26

Applications

Market basket analysis

{baby food, diapers} => {stella} Put them closer together in the store? Put them far apart in the store? Package baby food, diapers and stella? Package baby food, diapers and stella with a poorly selling item? Raise the price on one, and lower it on the other? Do not advertise baby food, diapers and stella together? Up, down, cross-selling

Basic recommender systems

Customers who buy this also frequently buy… Can be a very simple though powerful approach

To generate features

{Insurer 64, Car Repair Shop A, Police Officer B} as frequent pattern in fraud analytics But be wary of data leakage (train/test split applies!)

26

slide-27
SLIDE 27

Sequence mining

In standard apriori, the order of items does not matter Instead of item-sets, think now about item-bags and item-sequences

Sets: unordered, each element appears at most once: every transaction is a set of items Bags: unordered, elements can appear multiple times: every transaction is a bag of items Sequences: ordered, every transaction is a sequence of items

27

slide-28
SLIDE 28

Sequence mining

Mining of frequent sequences: algorithm very similar to apriori (i.e. GSP, Generalized Sequential Pattern algorithm)

Start with the set of frequent 1-sequences But: expansion (candidate generation) done differently:

E.g. in normal apriori, {A, B} and {A, C} would both be expanded into the same set {A, B, C}

For sequences, suppose we have <{A}, {B}> and <{A}, {C}>, then these are now expanded (joined) into <{A}, {B}, {C}> and <{A}, {C}, {B}> and <{A}, {B, C}>
Often modified to suit particular use cases

Common case: just consider sequences with sets containing one item only: e.g. <A, B> and <A, C> expanded into <A, B, C> and <A, C, B>

E.g. in web mining or customer journey analytics

Pruning k-sequences with infrequent k-1 subsequences, only continue with support higher than threshold 28
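For the common single-item case, the support of a candidate sequence can be sketched as the fraction of recorded sequences that contain it as an ordered subsequence (gaps allowed). This is only an illustration of support counting, not a GSP implementation; the helper names and the example sequences are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)


def sequence_support(pattern, sequences):
    """Fraction of sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


# Hypothetical click paths.
sequences = [list("ABC"), list("ACB"), list("BAC"), list("CBA")]
print(sequence_support(["A", "C"], sequences))
print(sequence_support(["C", "A"], sequences))
```

Unlike itemsets, <A, C> and <C, A> are different patterns with different supports.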

slide-29
SLIDE 29

Sequence mining

Extensions exist that take timing into account 29

slide-30
SLIDE 30

Sequence mining

Extension: frequent episode mining

For very long time series Find frequent sequences within the time series

Extension: continuous time series mining

By first binning the continuous time series in categorical levels (similar as with normal apriori)

Extension: discriminative sequence mining

Again: if you know the outcome of interest (i.e. sequences which lead customer to buy a certain product)

See e.g. SPMF : http://www.philippe-fournier-viger.com/spmf/ for a large collection of algorithms 30

slide-31
SLIDE 31

Sequence mining

“Sankey diagram” 31

slide-32
SLIDE 32

Conclusion

Main power of apriori comes from the fact that it can easily be extended and adapted towards specific settings
Also means that you’ll probably have to go further than “out of the box” approaches
Keep in mind post-discovery filtering needs to be applied
Keep in mind the possibility to make this more supervised, or use it as a feature-generating tool
Many other extensions exist (e.g. for frequent sub-graphs)

32

slide-33
SLIDE 33

Clustering

33

slide-34
SLIDE 34

Clustering

Cluster analysis or clustering is the task of grouping a set of objects

In such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)

It is a main task of exploratory data mining, and a common technique for statistical data analysis Used in many fields: machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics, customer segmentation, etc…

Organizing data into clusters shows internal structure of the data

Find patterns, structure, etc.
Sometimes the partitioning is the goal, e.g. market segmentation
As a first step towards predictive techniques, e.g. mine decision tree model on cluster labels to get further insights or even predict group outcome; use labels and distances as features

34

slide-35
SLIDE 35

Clustering

35

slide-36
SLIDE 36

Two types

Hierarchical clustering:

Create a hierarchical decomposition of the set of objects using some criterion

Connectivity, distance based

Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions

Partitional clustering:

Objective function based
Construct various partitions and then evaluate them by some criterion
K-means, k-means++, etc

36

slide-37
SLIDE 37

Hierarchical clustering

Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters, bottom-up Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions, top-down 37

slide-38
SLIDE 38

Hierarchical clustering

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of (dis)similarity between sets of observations is required … can be quite subjective In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets

38

slide-39
SLIDE 39

Hierarchical clustering

The distance metric defines the distance between two observations

Euclidean distance: $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
Squared Euclidean distance: $\|a - b\|_2^2 = \sum_i (a_i - b_i)^2$
Manhattan distance: $\|a - b\|_1 = \sum_i |a_i - b_i|$
Maximum distance: $\|a - b\|_\infty = \max_i |a_i - b_i|$
Many more are possible…

Should possess:

Symmetry: $d(a, b) = d(b, a)$
Constancy of self-similarity: $d(a, a) = 0$
Positivity: $d(a, b) \geq 0$
Triangular inequality: $d(a, b) \leq d(a, c) + d(c, b)$

39
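The distance metrics listed on this slide translate directly into code. The following is a minimal sketch using only the standard library; the function names are chosen here for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum(a, b):
    # Also known as the Chebyshev distance.
    return max(abs(x - y) for x, y in zip(a, b))


a, b = (0, 0), (3, 4)
print(euclidean(a, b), squared_euclidean(a, b), manhattan(a, b), maximum(a, b))
```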

slide-40
SLIDE 40

Hierarchical clustering

The linkage criterion defines the distance between groups of instances

Note that a group can consist of only one instance

Single linkage (minimum linkage): $D(A, B) = \min\{d(a, b) \mid a \in A, b \in B\}$

Leads to longer, skinnier clusters

Complete linkage (maximum linkage): $D(A, B) = \max\{d(a, b) \mid a \in A, b \in B\}$

Leads to tight clusters

slide-41
SLIDE 41

Hierarchical clustering

The linkage criterion defines the distance between groups of instances

Note that a group can consist of only one instance

Average linkage (mean linkage): $D(A, B) = \frac{1}{|A| \times |B|} \sum_{a \in A, b \in B} d(a, b)$

Favorable in most cases, robust to noise

Centroid linkage: based on distance between the centroids: $D(A, B) = d(a_c, b_c)$

Also robust, requires definition of a centroid concept
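The four linkage criteria can be sketched as functions of two groups of points. The function names are assumptions for illustration, Euclidean distance (`math.dist`, Python 3.8+) is used as the default metric, and the centroid is taken to be the coordinate-wise mean.

```python
import math
from itertools import product


def single_linkage(A, B, d=math.dist):
    return min(d(a, b) for a, b in product(A, B))


def complete_linkage(A, B, d=math.dist):
    return max(d(a, b) for a, b in product(A, B))


def average_linkage(A, B, d=math.dist):
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))


def centroid_linkage(A, B, d=math.dist):
    def centroid(points):
        # Coordinate-wise mean of the group.
        return tuple(sum(c) / len(points) for c in zip(*points))
    return d(centroid(A), centroid(B))


A = [(0.0,), (1.0,)]
B = [(3.0,), (5.0,)]
print(single_linkage(A, B), complete_linkage(A, B),
      average_linkage(A, B), centroid_linkage(A, B))
```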

slide-42
SLIDE 42

Hierarchical clustering

A hierarchy is obtained over the instances

No need to specify desired amount of clusters in advance Hierarchical structure maps nicely to human intuition for some domains Not very scalable (though fast enough in most cases) Local optima are a problem: cluster hierarchy not necessarily the best one You might want/need to normalize your features first

42

slide-43
SLIDE 43

Hierarchical clustering

Represent hierarchy in dendrogram Can be used to decide on number of desired clusters (a “cut” in the dendrogram) 43

slide-44
SLIDE 44

Hierarchical clustering

Represent hierarchy in dendrogram Also a good indication of possible outliers or anomalies 44

slide-45
SLIDE 45

Agglomerative or divisive?

Most implementations you’ll find will implement agglomerative clustering
Divisive clustering turns out to be way more computationally expensive: don’t bother!

slide-46
SLIDE 46

Partitional clustering

Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters

Since only one set of clusters is output, the user normally has to input the desired number of clusters k

Most well-known algorithm: k-means clustering:

  • 1. Decide on a value for k
  • 2. Initialize k cluster centers (e.g. randomly over the feature space or by picking some instances randomly)
  • 3. Decide the cluster memberships of the instances by assigning them to the nearest cluster center, using some distance measure
  • 4. Recalculate the k cluster centers
  • 5. Repeat 3 and 4 until none of the objects changed cluster membership in the last iteration
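The five steps can be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance and initialization by picking k random instances; the `kmeans` name, the `seed` and `max_iter` parameters and the example points are hypothetical.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, labels) for a list of equally sized point tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: pick k instances
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every instance to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:             # step 5: nothing changed, stop
            break
        labels = new_labels
        # Step 4: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                      # keep the old center if empty
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Which cluster receives label 0 depends on the random initialization, so only the grouping itself is meaningful.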

slide-47
SLIDE 47

K-means example

Pick random centers (either in the feature space or by using some randomly chosen instances as starts): 47

slide-48
SLIDE 48

K-means example

Calculate the membership of each instance: 48

slide-49
SLIDE 49

K-means example

And reposition the cluster centroids: 49

slide-50
SLIDE 50

K-means example

Reassign the instances again: 50

slide-51
SLIDE 51

K-means example

Recalculate the centroids: No reassignments performed, so we stop here 51

slide-52
SLIDE 52

K-means

Recall that a good cluster solution has high intra-cluster (in-cluster) similarity and low inter-cluster (between-cluster) similarity
K-means optimizes for high intra-cluster similarity by optimizing towards a minimal total distortion: the sum of squared distances of points to their cluster centroid

$$\min \mathrm{SSE} = \sum_{k=1}^{K} \sum_{i=1}^{n_k} \|x_{ki} - \mu_k\|^2$$

Note: an exact optimization of this objective function is hard, k-means is a heuristic approach not necessarily providing the global optimal outcome

Except in the one-dimensional case
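The SSE objective can be computed directly from a clustering result. A minimal sketch, where `sse` is a hypothetical helper taking the points, the cluster centers and one cluster label per point:

```python
def sse(points, centers, labels):
    """Sum of squared Euclidean distances of points to their cluster center."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(p, centers[l]))
        for p, l in zip(points, labels)
    )


points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centers = [(1, 0), (11, 0)]
labels = [0, 0, 1, 1]
print(sse(points, centers, labels))
```

Computing this value for several values of k gives the curve used for the elbow heuristic.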

slide-53
SLIDE 53

K-means

Strengths

Very simple to implement and debug
Intuitive objective function: optimizes intra-cluster similarity
Relatively efficient

Weaknesses

Applicable only when mean is defined (to calculate centroid of points)
What about categorical data?
Often terminates at a local optimum, hence initialization is extremely important: try multiple random starts (most implementations do this by default) or use K-means++ (most implementations do so)
Need to specify k in advance: use elbow point if unsure (e.g. plot SSE across different values for k)
Sensitive to noisy data and outliers
Again: normalization might be required
Not suitable to discover clusters with non-convex shapes

53

slide-54
SLIDE 54

K-means and non-convex shapes

54

slide-55
SLIDE 55

K-means and local optima

55

slide-56
SLIDE 56

K-means and setting the value for k

56

slide-57
SLIDE 57

K-means and setting the value for k

Don’t be too swayed by your intuition in this case

Often, it pays off to start with a higher setting for k and post-process/inspect accordingly, even if you already have a number of optimal clusters in mind

57

slide-58
SLIDE 58

K-means and setting the value for k

Don’t be too swayed by your intuition in this case

Often, it pays off to start with a higher setting for k and post-process/inspect accordingly, even if you already have a number of optimal clusters in mind

58

slide-59
SLIDE 59

K-means++

An algorithm to pick good initial centroids

  • 1. Choose first center uniformly at random (in the feature space or from the instances)
  • 2. For each instance x (over a grid in the feature space), compute $d(x, c)$: the distance between x and the (nearest) center c that has already been defined
  • 3. Choose another center, using a weighted probability distribution where a point x is chosen with probability proportional to $d(x, c)^2$
  • 4. Repeat Steps 2 and 3 until k centers have been chosen
  • 5. Now proceed using standard k-means clustering (not repeating the initialization step, of course)

Basic idea: spread the initial clusters out from each other: try lower inter-cluster similarity

Most k-means implementations actually implement this variant

59
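The initialization steps can be sketched as follows, choosing centers from the instances themselves. The `kmeans_pp_init` name and the example points are assumptions for illustration; the sketch assumes at least k distinct points, since `random.choices` raises an error when all weights are zero.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from `points` with k-means++ style weighting."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]                  # step 1: uniform choice
    while len(centers) < k:                         # step 4: repeat
        # Step 2: squared distance to the nearest already chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Step 3: sample the next center with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
print(kmeans_pp_init(points, k=3))
```

Points that are already centers have weight zero, so far-away points are favoured as the next center.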

slide-60
SLIDE 60

DBSCAN

“Density-based spatial clustering of applications with noise”

Groups together points that are closely packed together (points with many nearby neighbors) Marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away) DBSCAN is also one of the most common clustering algorithms Hierarchical versions exist as well

60

slide-61
SLIDE 61

DBSCAN

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

61

slide-62
SLIDE 62

Expectation-maximization based clustering

Based on Gaussian mixture models

K-means is EM’ish, but makes “hard” assignments of instances to clusters Based on two steps:

E-step: assign probabilistic membership: probability of membership to a cluster given an instance in dataset
M-step: re-estimate parameters of model based on the probabilistic membership

Which model are we estimating? A mixture of Gaussians

62

slide-63
SLIDE 63

Expectation-maximization based clustering

63

slide-64
SLIDE 64

Expectation-maximization based clustering

Also local optimization, but nonetheless robust
Learns a model for each cluster
So you can generate new data points from it (though this is possible with k-means as well: just fit a Gaussian for each cluster with mean on the centroid and variance derived using min sum of squared errors)
Relatively efficient
Extensible to other model types (e.g. multinomial models for categorical data, or noise-robust models)

But:

Initialization of the models still important (local optimum problem)
Also still need to specify k
Also problems with non-convex shapes
So basically: a less “hard” version of k-means

64

slide-65
SLIDE 65

Mean-shift clustering

Mean shift builds upon the concept of kernel density estimation (KDE): assume data was sampled from a probability distribution. KDE is a method to estimate the underlying distribution
Works by placing a kernel on each point in the data set (kernel here: a weighting function); most popular one is the Gaussian kernel
Adding all of the individual kernels up generates a probability surface (i.e., a density function)

65

slide-66
SLIDE 66

Mean-shift clustering

https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/

66

slide-67
SLIDE 67

Mean-shift clustering

Mean shift uses KDE idea: how would the points move if they climbed up to the nearest peak of the KDE surface: iteratively shifting each point uphill until it reaches a peak
Depending on the kernel bandwidth used, the KDE surface (and clustering result) will be different

E.g. for tall skinny kernels (i.e., a small kernel bandwidth), the resultant KDE surface will have a peak for each point. This will result in each point being placed into its own cluster
For extremely short wide kernels (i.e., a large kernel bandwidth), this will result in a wide smooth KDE surface with one peak that all of the points will climb up to, resulting in one cluster

67
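The hill-climbing idea and the effect of the bandwidth can be illustrated with a one-dimensional sketch using a Gaussian kernel. The `mean_shift_1d` name, its parameters and the data are hypothetical; points that end up at the same mode would be placed in the same cluster.

```python
import math


def mean_shift_1d(points, bandwidth, n_iter=200, tol=1e-6):
    """Shift every point uphill on the KDE surface; return the reached modes."""
    modes = []
    for x in points:
        for _ in range(n_iter):
            # Gaussian kernel weights of all data points as seen from x.
            w = [math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2))
                 for p in points]
            # The weighted mean is the mean-shift update for x.
            new_x = sum(wi * p for wi, p in zip(w, points)) / sum(w)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print([round(m, 2) for m in mean_shift_1d(points, bandwidth=0.5)])
print([round(m, 2) for m in mean_shift_1d(points, bandwidth=100.0)])
```

With the moderate bandwidth the points settle on two peaks; with the very large bandwidth they all climb to a single peak.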

slide-68
SLIDE 68

Validation

How to check the result of a clustering run?

Use k-means objective function (sum of squared errors)
Many other measures as well, e.g. (Davies and Bouldin, 1979), (Halkidi, Batistakis, Vazirgiannis 2001)
These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters

E.g. Davies Bouldin Index:

$$\mathrm{DBI}_N = \frac{1}{N} \sum_{i=1}^{N} R_i \quad \text{with} \quad R_i = \max_{j \neq i} R_{ij}, \; i = 1, \ldots, N \quad \text{and} \quad R_{ij} = \frac{S_i + S_j}{D_{ij}}$$

$S_i$: mean distance between instances in cluster i and its centroid
$D_{ij}$: distance between centroids of clusters i and j
A good result has a low index score

But: given the descriptive nature, validation here can be somewhat subjective

68
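The Davies Bouldin index can be sketched in a few lines, assuming Euclidean distance, centroids defined as coordinate-wise means, and at least two clusters with distinct centroids. The `davies_bouldin` name and the example clusters are assumptions for illustration.

```python
import math


def davies_bouldin(clusters):
    """`clusters` is a list of clusters, each a list of point tuples."""
    cents = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in clusters]
    # S_i: mean distance between the instances of cluster i and its centroid.
    S = [sum(math.dist(p, m) for p in pts) / len(pts)
         for pts, m in zip(clusters, cents)]
    n = len(clusters)
    # R_i: worst-case ratio (S_i + S_j) / D_ij over all other clusters j.
    R = [max((S[i] + S[j]) / math.dist(cents[i], cents[j])
             for j in range(n) if j != i)
         for i in range(n)]
    return sum(R) / n


tight = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
loose = [[(0, 0), (4, 0)], [(6, 0), (10, 0)]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

The tight, well separated clustering gets the lower (better) score.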

slide-69
SLIDE 69

Applications

Market research: customer and market segments, product positioning Social network analysis: recognize communities of people Social science: identify students, employees with similar properties Search: group search results Recommender systems: predict preferences based on user’s cluster Crime analysis: identify areas with similar crimes Image segmentation and color palette detection (e.g. recall LIME on images)

69

slide-70
SLIDE 70

Domain knowledge

An interesting question for many algorithms is how domain knowledge can be incorporated in the learning step

For clustering, this is often done using must-link and can’t-link constraints: who should be and should not be in the same cluster? Nice as this does not require significant changes to algorithm, but can lead to infeasible solutions if too many constraints are added Another approach is a “warm start” solution by providing a partial solution

70

slide-71
SLIDE 71

More on distance metrics

Most metrics we’ve seen were defined for numerical features, though these exist for textual and categorical data as well

E.g. Levenshtein distance between two text fields
Based on the number of edits (changes) performed: deletions, insertions and substitutions
Other metrics exist as well, e.g. Jaccard, cosine, Gower distance (https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9)

71
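As an illustration of distances on non-numerical data, the sketch below implements the Levenshtein distance with the standard dynamic-programming recurrence, plus a Jaccard distance on sets. The function names are chosen here for illustration.

```python
def levenshtein(s, t):
    """Minimum number of deletions, insertions and substitutions from s to t."""
    prev = list(range(len(t) + 1))           # distances from "" to t[:j]
    for i, a in enumerate(s, start=1):
        curr = [i]                           # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (a != b)))   # substitution or match
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A n B| / |A u B| for two non-empty sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)


print(levenshtein("kitten", "sitting"))
print(jaccard_distance({"milk", "bread"}, {"bread", "cheese"}))
```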

slide-72
SLIDE 72

Self-organizing maps (SOMs)

Can be formalized as a special type of artificial neural networks Not really (only) clustering, but: unsupervised Produces a low-dimensional representation of the input space (just like PCA!) Also called a Kohonen map (Teuvo Kohonen)

https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/

Not that often used anymore, but this brings us to…

72

slide-73
SLIDE 73

Dimensionality Reduction

73

slide-74
SLIDE 74

PCA

http://setosa.io/ev/principal-component-analysis/

Principal components calculated by making use of eigenvector decomposition of the covariance matrix of the data

$$PC_j = e_j' X = e_{j1} X_1 + e_{j2} X_2 + \cdots + e_{jp} X_p$$

(Eigenvalue $\lambda_j$ represents the explained variance)

Pro: powerful data reduction tool and principal components are uncorrelated
Con: PCA may be difficult to interpret, linear approach

slide-75
SLIDE 75

t-SNE

t-Distributed Stochastic Neighbor Embedding

L.J.P. van der Maaten and G.E. Hinton, Visualizing High-Dimensional Data Using t-SNE, Journal of Machine Learning Research, 9(Nov):2579-2605, 2008

t-SNE is a dimensionality reduction technique

Comparable to PCA
t-SNE seeks to preserve local similarities (small pairwise distances)

t-SNE is a non-linear dimensionality reduction technique based on manifold learning

Assumes data points lie on embedded non-linear manifold within higher-dimensional space
Manifold is topological space that locally resembles Euclidean space near each data point
Example:

A surface as a 2D manifold, locally resembling Euclidean plane near each data point
A 3D manifold which can be described by collection of 2D manifolds

Higher dimensional space can thus be well “embedded” in lower dimensional space

Other manifold dimensionality reduction techniques

Multi Dimensional Scaling (MDS)
Isomap
Locally linear embedding (LLE)
Auto-encoders

75

slide-76
SLIDE 76

t-SNE

Works in two steps:

  • 1. Probability distribution representing similarity measure over pairs of high-dimensional data points is constructed
  • 2. Similar probability distribution over data points in low-dimensional map is constructed

“Similar” using Kullback–Leibler divergence (aka information gain, relative entropy): divergence between two distributions is minimized

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

with N dimensionality of data
Assumption: $p_{i|i} = p_{ii} = 0$

slide-77
SLIDE 77

t-SNE

Step 1: measure similarities between data points in the original (high) dimensional space

Similarity of $x_j$ to $x_i$ is the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian kernel centered at $x_i$
Then measure the density of all other data points under this Gaussian distribution and normalize

77

slide-78
SLIDE 78

t-SNE

Suppose we have 3 data points $x_i$, $x_j$ and $x_k$, with N=2

We then compute the conditional and joint probabilities using the formulas below
Obviously a toy example, as “reducing” two dimensions doesn’t make a lot of sense

$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$

$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$

78

slide-79
SLIDE 79

t-SNE

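with N=2

Numerator: Gaussian distribution centered at $x_i$

$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$

$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$

$p_{j|i} = 0.78/z$, $p_{k|i} = 0.60/z$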

79

slide-80
SLIDE 80

t-SNE

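with N=2

Denominator: these similarity measures are normalized against all points, except $x_i$ itself

$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$

$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$

$p_{j|i} = 0.78/55.62$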

80

slide-81
SLIDE 81

t-SNE

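with N=2

Finally, we can compute the joint probabilities

$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$

$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$

$p_{ij} = 0.0069$ ← more similar
$p_{ik} = 0.0049$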

81
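The numerator, denominator and joint probability steps from the preceding slides can be put together in a short NumPy sketch. The three points and the fixed bandwidth of 1 are made-up values, not the ones behind the numbers on the slides. N is passed explicitly; the example uses the number of data points, as in the original t-SNE paper, so that the joint probabilities sum to 1.

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Row i holds p_{j|i}: Gaussian kernel centered at x_i, normalized over k != i."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    kernel = np.exp(-sq_dist / (2.0 * sigma[:, None] ** 2))   # numerator, bandwidth sigma_i per row
    np.fill_diagonal(kernel, 0.0)                             # p_{i|i} = 0
    return kernel / kernel.sum(axis=1, keepdims=True)         # denominator: sum over k != i

def joint_probabilities(P_cond, N):
    """p_ij = (p_{j|i} + p_{i|j}) / 2N, symmetric by construction."""
    return (P_cond + P_cond.T) / (2.0 * N)

# Hypothetical points x_i, x_j, x_k (x_j is closer to x_i than x_k is)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
sigma = np.ones(len(X))                    # assumed fixed bandwidth for illustration
P_cond = conditional_probabilities(X, sigma)
P = joint_probabilities(P_cond, N=len(X))
```

In this sketch `P[0, 1]` is larger than `P[0, 2]`, i.e. the closer pair is the more similar one, mirroring the comparison of $p_{ij}$ and $p_{ik}$ on the slide.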

slide-82
SLIDE 82

t-SNE

Based on Euclidean distance

$\sigma_i$ (bandwidth of the Gaussian kernel) is data point dependent!
Set based upon the perplexity, which is a measure of how well the distribution predicts a sample
$\sigma_i$ is then set in such a way that the perplexity of the conditional distribution equals a predefined perplexity
As a result, the bandwidth parameter $\sigma_i$ is adapted to the density of the data: smaller values are used in denser parts of the data space
The perplexity is one of the key user-specified hyperparameters of t-SNE

$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$

82
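How a bandwidth can be tuned to hit a predefined perplexity can be sketched with a simple bisection search. The perplexity is taken as 2 to the power of the Shannon entropy of the conditional distribution, as in the original t-SNE paper; the function names, search bounds and the two toy neighborhoods are assumptions for illustration.

```python
import numpy as np

def conditional_row(sq_dist, sigma):
    """p_{j|i} for one point x_i, given squared distances to all other points."""
    # Subtracting the minimum rescales numerator and denominator alike (numerical stability)
    kernel = np.exp(-(sq_dist - sq_dist.min()) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def perplexity(p):
    """2 ** H(p), with H the Shannon entropy in bits."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def find_sigma(sq_dist, target, lo=1e-4, hi=1e4, n_iter=100):
    """Bisection on a log scale: perplexity grows monotonically with sigma."""
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_row(sq_dist, mid)) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

# Two hypothetical neighborhoods: same layout, but one is 10x denser
steps = np.arange(1, 20)
dense = (0.1 * steps) ** 2      # squared distances to neighbors spaced 0.1 apart
sparse = (1.0 * steps) ** 2     # squared distances to neighbors spaced 1.0 apart
sigma_dense = find_sigma(dense, target=5.0)
sigma_sparse = find_sigma(sparse, target=5.0)
```

For the same target perplexity, the point in the denser region ends up with the smaller bandwidth, which is the adaptation to density described above.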

slide-83
SLIDE 83

t-SNE

Step 2: measure similarities between data points in low dimensional space

$q_{ij}$ instead of $p_{ij}$, between mapped points $y_i, y_j, \dots$

Student t-distribution used to measure similarities, with degrees of freedom = dimensionality in mapped space − 1
Student t-distribution has fatter tails than Gaussian distribution
Assumption: $q_{i|i} = q_{ii} = 0$
No perplexity parameter

83

slide-84
SLIDE 84

t-SNE

Next, the distances between $p_{ij}$ and $q_{ij}$ are considered

$q_{ij}$ obviously depends on how we place the data points in the mapped low dimensional space

So the locations are determined by minimizing the Kullback-Leibler divergence:

$KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

Optimized using standard gradient descent

84
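The objective can be made concrete with a small sketch that computes the low-dimensional similarities and the KL divergence for a given placement of points. It assumes a Student t kernel with one degree of freedom (the 2D map case, as in the original paper); the matrix `P` is only a stand-in for real high-dimensional similarities, and all names are illustrative.

```python
import numpy as np

def joint_q(Y):
    """q_ij from mapped points Y, using a Student t kernel with one degree of freedom."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    kernel = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(kernel, 0.0)          # q_ii = 0
    return kernel / kernel.sum()           # normalize over all pairs

def kl_divergence(P, Q):
    """KL(P || Q) = sum over i != j of p_ij * log(p_ij / q_ij)."""
    mask = P > 0                            # skips the zero diagonal
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

# Stand-in for the high-dimensional similarities p_ij (symmetric, sums to 1)
P = joint_q(np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]))
# A candidate placement of the same three points in the map
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
Q = joint_q(Y)
loss = kl_divergence(P, Q)
```

Gradient descent would move the rows of `Y` to reduce `loss`; the divergence is zero only when the two distributions coincide.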

slide-85
SLIDE 85

t-SNE

t-SNE shines when dealing with high dimensional data

E.g. images or word documents

85

slide-86
SLIDE 86

t-SNE

t-SNE itself is not a clustering technique

A 2nd-level clustering can however be easily performed on the mapped space, using for example k-means clustering, DBSCAN or other clustering techniques
Of course, you can also use the mapped coordinates as new instance features
But be careful: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne/264647#264647

86

slide-87
SLIDE 87

t-SNE

No theoretical reason, but most implementations only allow the lower-dimensional space to be 2D or 3D (computational costs)

Also, t-SNE learns a non-parametric mapping

There is no explicit function that maps data from input space to map
Not possible to embed unseen test points in an existing map, so featurization is difficult, and t-SNE is less suitable as a dimensionality reduction technique in a predictive setup
Extensions exist, however, that learn a multivariate regressor to predict map location from input data or construct a regressor that minimizes the t-SNE loss directly (“parametric t-SNE”)

Also, you can provide your own pairwise similarity matrix and do the KL-minimization on it instead of using the built-in conditional probability based similarity measure

Diagonal elements should be 0 and the matrix should be normalized to sum to 1
A distance matrix can also be used: similarity = 1 - distance
This avoids having to tune the perplexity parameter (but you will need to decide on a similarity metric)

87

slide-88
SLIDE 88

t-SNE

As t-SNE uses a gradient descent based approach, remarks regarding learning rates and initialization of mapped points apply

E.g. initialization sometimes done using PCA Defaults usually work well See deep learning session for more on gradient descent and learning rates

Most important hyperparameter is perplexity

Knob that sets number of effective nearest neighbors (similar to k in k-NN)
Perplexity value depends on density of data
Denser dataset requires larger perplexity
Typical values range between 5 and 50

Thinking point: couldn’t the conditional probability be set using a (weighted) k-NN?

88

slide-89
SLIDE 89

t-SNE

Impact of perplexity: neighborhood effectively considered! 89

slide-90
SLIDE 90

t-SNE

Different perplexity values can lead to very different results

“Size” of “clusters” has no meaning Neither does “distance” between clusters (only locally: manifold) Random data can end up looking “meaningful” More examples at https://distill.pub/2016/misread-tsne/

90

slide-91
SLIDE 91

t-SNE

Matlab (original release): https://lvdmaaten.github.io/tsne/

Now built-in in Matlab

R (tsne package): https://cran.r-project.org/web/packages/tsne/
Python (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Julia, Java, Torch implementations also available
Parametric version: https://github.com/kylemcdonald/Parametric-t-SNE

91

slide-92
SLIDE 92

UMAP

Uniform Manifold Approximation and Projection for Dimension Reduction

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426

UMAP is a dimensionality reduction technique
Comparable with Principal Component Analysis (PCA) and t-SNE
Like t-SNE, UMAP is a non-linear dimensionality reduction technique based on manifold learning
Newer (2018) but already well-proven in bioinformatics, materials science and machine learning
Performs better in mappings with a dimensionality > 3
Can incorporate supervised labels in the construction of the mapping
Parametric by default: better suited as a general-purpose preprocessing technique compared to t-SNE
Good performance (scales better than t-SNE, MDS)
Nicer properties in terms of interpretation when compared to t-SNE

92

slide-93
SLIDE 93

UMAP

Like t-SNE, UMAP is a manifold learner

Recall: a manifold is a topological space that locally resembles Euclidean space near each point UMAP aims to construct a locally-connected Riemannian manifold with a locally constant Riemannian metric Study of Riemannian manifolds is a (hard) research area on its own, in what follows, we present a basic intuitive overview of UMAP

93

slide-94
SLIDE 94

UMAP

First, we need a concept from topology: simplicial complexes

A means to construct topological spaces out of simple components
Allows us to reduce the complexities of dealing with the continuous geometry of topological spaces to relatively simple combinatorics and counting

The basic building block is a simplex: a way to build a k-dimensional object
A k-simplex is constructed by taking the convex hull of k + 1 independent points
Every k-simplex hence describes a simple combinatorial structure

A k-simplex can be regarded as a set of k + 1 objects with faces
E.g. a tetrahedron (3-simplex) consists of 4 triangles

94

slide-95
SLIDE 95

UMAP

To construct topological spaces, simplices can be combined in a “simplicial complex” K

A set of simplices glued together along faces
Any face of any simplex in K is also in K (i.e. every sub-simplex of a simplex in K is also part of the complex)
The intersection of any two simplices in K is a face of both simplices

Next, we will construct a Čech complex (which is a simplicial complex) given an open cover of a topological space

An open cover is some family of sets whose union is the whole space
A Čech complex converts an open cover into a simplicial complex: let each set in the cover be represented with a 0-simplex (a single point). Create a 1-simplex between two such sets if they have a non-empty intersection; create a 2-simplex between three such sets if the triple intersection of all three is non-empty; and so on
Topological theory provides guarantees about how well this simple process can produce something that represents the topological space itself in a meaningful way (Nerve theorem)

95

slide-96
SLIDE 96

UMAP

Let’s illustrate this on a toy example of a two-dimensional data set

Assume samples are drawn from an underlying topological space To generate an open cover, we can simply create blobs with a fixed radius around each point as an approximation of the topological space (as we need to define intersections)

96

slide-97
SLIDE 97

UMAP

We can then construct a Čech complex

Every data point can serve as the 0-simplex to expand from I.e. we get points, lines and triangles Similar in higher dimensional space (but harder to plot)

Note that the simplicial complex relatively well captures the topology of the dataset In fact, most of the work is being done by the 0- and 1-simplices: points and lines You could argue that this will be the same in higher-dimensional space, i.e. why bother with triangles, tetrahedrons, …?

Vietoris-Rips complex is similar to the Čech complex but is only determined by the 0- and 1-simplices: computationally easier and can be used instead! 97
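A Vietoris-Rips complex over a handful of points can be sketched directly: two fixed-radius blobs intersect when their centers are at most twice the radius apart, and a triangle is added whenever all three of its edges exist. The function name and the four toy points are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def vietoris_rips(X, radius):
    """Return the 0-, 1- and 2-simplices of a Vietoris-Rips complex (illustrative sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    vertices = [(i,) for i in range(n)]
    # Blobs of the given radius intersect when the centers are within 2 * radius
    edges = [(i, j) for i, j in combinations(range(n), 2) if D[i, j] <= 2 * radius]
    edge_set = set(edges)
    # Higher simplices are fully determined by the edges: fill in every 3-clique
    triangles = [t for t in combinations(range(n), 3)
                 if all(pair in edge_set for pair in combinations(t, 2))]
    return vertices, edges, triangles

# Three nearby points and one far-away point
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
vertices, edges, triangles = vietoris_rips(X, radius=0.75)
```

With this radius the three nearby points form one triangle while the fourth point stays an isolated 0-simplex, which also shows how a poorly chosen radius disconnects the complex.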

slide-98
SLIDE 98

UMAP

The goal is now to construct a lower-dimensional mapping of the data that has a similar topological representation

If we continue with a Vietoris-Rips complex, we basically get a graph with nodes (points) and edges (lines)
We could then use any existing graph layout algorithm (e.g. based on spectral methods or force layout) to lay out the graph structure in a 2D (or higher-dimensional) space
This is a simple way to think about the basics of UMAP: construct a graph over the original data set and then reduce it using a layout algorithm to the lower-dimensional space
Also see social network session later on
UMAP does it differently, however…

98

slide-99
SLIDE 99

UMAP

An obvious difficulty is that picking a radius for the blobs around our data points is so far arbitrary

Radius too small → the resulting simplicial complex splits into many connected components
Radius too large → the simplicial complex turns into just a few very high dimensional simplices and fails to capture the manifold structure

If the data were uniformly distributed across the underlying topology, picking a radius would be easy and stable:

99

slide-100
SLIDE 100

UMAP

This assumption of uniform distribution turns up frequently in manifold learning (Laplacian eigenmaps, Nerve theorem, …)

But real data is not uniformly distributed
Solution: assume that the data is uniformly distributed, but that the notion of distance varies across the (original) manifold
We can compute (or at least approximate) a local notion of distance for each point by making use of Riemannian geometry: a unit blob around a point stretches to the k-th nearest neighbor of the point, where k is the sample size we are using to approximate the local sense of distance
Each point is given its own unique distance function, and we construct point-local blobs
Similar to the point dependent bandwidth around points in t-SNE with fixed perplexity!

E.g. for k = 2:

100

slide-101
SLIDE 101

UMAP

We have now replaced choosing the radius of the blobs with choosing a value for k

However, it is often easier to pick a scale in terms of number of neighbors than it is to pick a radius per point
The choice of k determines how locally we wish to estimate the Riemannian metric

Small k means a very local interpretation which will capture fine details
Large k means estimates will be based on larger regions and will be more broadly accurate across the data set as a whole

Note that the point-local radii can also provide us with a weighting mechanism based on distance to the nearest neighbors (i.e. we could make the definitions of intersections of the open covers fuzzy instead of hard-bound)
E.g. shown around two points (again comparable to t-SNE):

101

slide-102
SLIDE 102

UMAP

When we map this back to the Vietoris-Rips complex and the resulting graph, you can think of the distances now determining the weights of the edges

Weights are scaled to [0, 1] so that they can be interpreted as the probability that a 1-simplex exists
A threshold can be set to remove edges with low weights (improves computational time)
Or edges can simply be based on the non-fuzzy k-NN approach
To prevent isolated points, we ensure connections between nearest neighbors (the graph as a whole does not need to be connected, but nodes cannot be completely isolated)

102

slide-103
SLIDE 103

UMAP

One issue remains: local fuzzy similarity is not compatible, e.g. d(a, b) might be different from d(b, a), i.e. for each pair of points we actually have two edges with differing weights

Again comparable with the potentially asymmetric conditional probabilistic similarity in t-SNE
UMAP sets the final weight to d(a, b) + d(b, a) − d(a, b) × d(b, a)

Summarized: UMAP constructs a weighted graph over the data points using a fuzzy distance metric

We now need to find a good low dimensional representation
For any such representation, we can construct the same weighted graph using the same procedure
However, contrary to the initial data, we here would like the distances between points to be standard, rather than locally varying, so we can pick a fixed “radius” (“min_dist” hyperparameter to closest neighbor)
We need a measure to evaluate the degree of matching between those two representations, i.e. the two weighted graphs

103
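The construction of the weighted graph can be sketched as follows. This is a simplification and not the procedure used by the umap-learn library: the directed weight is assumed to decay exponentially with the distance beyond the nearest neighbor, scaled by the distance to the k-th neighbor as a stand-in for the point-local metric. Only the symmetrization step, a + b − a × b, is taken directly from the slide.

```python
import numpy as np

def directed_weights(X, k=5):
    """Directed fuzzy k-NN weights with a point-local scale (simplified assumption)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                 # a point is not its own neighbor
    W = np.zeros((n, n))
    for a in range(n):
        nn = np.argsort(D[a])[:k]               # k nearest neighbors of a
        rho = D[a, nn[0]]                       # distance to the nearest neighbor
        scale = D[a, nn[-1]]                    # distance to the k-th neighbor (local unit)
        W[a, nn] = np.exp(-(D[a, nn] - rho) / scale)   # nearest neighbor gets weight 1
    return W

def fuzzy_union(W):
    """Combine the two directed weights of each pair: a + b - a * b."""
    return W + W.T - W * W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                    # hypothetical data
W = directed_weights(X)
S = fuzzy_union(W)
```

`W` is generally asymmetric, while `S` is symmetric with weights in [0, 1]; because every point gives weight 1 to its nearest neighbor, no node ends up completely isolated.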

slide-104
SLIDE 104

UMAP

In both representations (original and reduced), we have a set of pairwise weights between points which can be interpreted as the probability that a 1-simplex exists between them

Assume these to be Bernoulli variables, as ultimately the simplex exists or not
We can then use a standard cross-entropy minimization (e.g. using gradient descent as in t-SNE) to have the points settle in the lower dimensional space, minimizing the errors between the two graphs
Efficient implementations use some further shortcuts to make this optimization procedure fast, e.g. the Nearest-Neighbor-Descent algorithm, smoothing the loss function and negative sampling
A Vietoris-Rips complex with 0- and 1-simplices is sufficient, though a full Čech complex could be used as well (“edges” can then involve more than 2 nodes)

104
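The cross-entropy between the two weighted graphs can be written out for small hand-made weight matrices. Each edge is treated as a Bernoulli variable, so the loss sums one binary cross-entropy term per pair of points; the three matrices below are invented for illustration.

```python
import numpy as np

def graph_cross_entropy(W_high, W_low, eps=1e-12):
    """Sum over edges of the Bernoulli cross-entropy between two weighted graphs."""
    iu = np.triu_indices(len(W_high), k=1)        # every unordered pair once
    w = np.clip(W_high[iu], eps, 1 - eps)         # edge probabilities in the original space
    v = np.clip(W_low[iu], eps, 1 - eps)          # edge probabilities in the mapped space
    return np.sum(w * np.log(w / v) + (1 - w) * np.log((1 - w) / (1 - v)))

# Hypothetical edge weights over three points
W_high = np.array([[0.0, 0.9, 0.1], [0.9, 0.0, 0.5], [0.1, 0.5, 0.0]])
W_good = np.array([[0.0, 0.8, 0.2], [0.8, 0.0, 0.5], [0.2, 0.5, 0.0]])   # similar layout
W_bad = np.array([[0.0, 0.1, 0.9], [0.1, 0.0, 0.5], [0.9, 0.5, 0.0]])    # swapped neighbors

loss_good = graph_cross_entropy(W_high, W_good)
loss_bad = graph_cross_entropy(W_high, W_bad)
```

A layout whose graph resembles the original one yields the smaller loss, which is what the optimizer exploits when moving the points in the mapped space.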

slide-105
SLIDE 105

UMAP

UMAP is quite similar to t-SNE but comes with a sounder theoretical foundation leading to more stable results
Only a couple of hyperparameters to tune: number of neighbors (to set fuzzy distance radius), minimum distance between instances in the mapped space, number of dimensions in the mapped space and base distance metric to use between pairwise points (Euclidean or otherwise)
Allows to embed unseen points in the mapped space (exact workings are out of scope)
Like t-SNE, a second-stage clustering can be applied

Generally easier to interpret as more global structure is preserved: UMAP aims to have distances between clusters of points also be meaningful, which is not the case in t-SNE

Labels can be incorporated as well (basically splits up the weighted graph according to labels), also in a semi-supervised setting (partial labels)

105

slide-106
SLIDE 106

UMAP

UMAP can (should, even) be used as a drop-in t-SNE replacement and hence also works very well with high-dimensional data

https://umap-learn.readthedocs.io/en/latest/supervised.html

106

slide-107
SLIDE 107

UMAP

Python (reference) implementation (works on top of scikit-learn): https://umap-learn.readthedocs.io/en/latest/index.html (contains much more information on UMAP)
R implementations: umap (https://github.com/tkonopka/umap), umapr (https://github.com/ropenscilabs/umapr) and uwot (https://github.com/jlmelville/uwot) packages

107

slide-108
SLIDE 108

Anomaly Detection

108

slide-109
SLIDE 109

Anomaly detection

Anomaly detection (also called outlier detection or novelty detection) is the identification of instances which do not conform to an expected pattern or other items in a dataset

Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text Anomalies: outliers, novelties, noise, deviations, exceptions Uses a combination of unsupervised, supervised, statistical techniques

109

slide-110
SLIDE 110

Statistical approaches

Histograms, distributions, box plots: spotting outliers

Easy to use but univariate in many cases

Other statistical methods, e.g. robust covariance 110
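The box plot rule for spotting univariate outliers can be sketched as follows, assuming the common convention of whiskers at 1.5 times the interquartile range; the transaction amounts are invented.

```python
import numpy as np

def iqr_outliers(x, whisker=1.5):
    """Flag values outside the box plot whiskers (Q1 - w * IQR, Q3 + w * IQR)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - whisker * iqr) | (x > q3 + whisker * iqr)

amounts = [10, 12, 11, 13, 12, 11, 10, 95]    # hypothetical values, one extreme
flags = iqr_outliers(amounts)
```

Only the last value falls outside the whiskers here. The rule looks at one variable at a time, which is the univariate limitation noted above.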

slide-111
SLIDE 111

Clustering based

Cluster, and check who falls outside the clusters

Distance based Or clusters with few instances Or based on dendrogram Or based on noisy instances in DBSCAN Can be combined with dimensionality reduction techniques as seen before

111

slide-112
SLIDE 112

Clustering based

https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00

112

slide-113
SLIDE 113

Clustering based

https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00

113

slide-114
SLIDE 114

One-class based methods

E.g. “one-class SVMs”
Linked to PU learning (positive-unlabeled)

See also: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.329.6479&rep=rep1&type=pdf

“Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive”

114

slide-115
SLIDE 115

Isolation forests

Based on random forest idea

But: splits for nodes are chosen completely at random, not entropy-driven
An extreme form of extra randomized trees
Anomaly score is then defined based on the average distance (path length) from the root node to the leaf node containing the instance
Outliers have a higher chance to be separated quickly!
Very powerful technique

https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py

115
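The isolation idea can be illustrated without a library by following, for each instance, a sequence of completely random splits until it is alone, and averaging the number of splits over many repetitions. This is a simplified sketch (no subsampling and no score normalization as in full isolation forest implementations); the function names and data are assumptions.

```python
import numpy as np

def path_length(x, X, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate x from the other rows of X."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(X.shape[1])               # pick a feature at random
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                      # nothing left to split on this feature
        return depth
    split = rng.uniform(lo, hi)                       # pick a split value at random
    same_side = (X[:, feature] < split) == (x[feature] < split)
    return path_length(x, X[same_side], rng, depth + 1, max_depth)

def average_path_lengths(X, n_trees=50, seed=0):
    """Average isolation depth per instance; a lower value means more anomalous."""
    rng = np.random.default_rng(seed)
    return np.array([np.mean([path_length(x, X, rng) for _ in range(n_trees)])
                     for x in X])

# 100 hypothetical inliers plus one obvious outlier as the last row
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])
avg_depth = average_path_lengths(X)
```

The far-away point is separated after very few splits, so it has the smallest average depth. In practice one would rather use a library implementation such as the scikit-learn one linked above.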

slide-116
SLIDE 116

Local outlier factor (LOF)

Based on K-NN

Measure the local deviation of density of a given sample with respect to its neighbors
Depends on how isolated the object is with respect to the surrounding neighborhood; locality is given by the k-nearest neighbors, whose distance is used to estimate the local density
By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers

https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html#sphx-glr-auto-examples-neighbors-plot-lof-outlier-detection-py

116
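The density comparison behind LOF can be written out compactly with NumPy. The sketch follows the standard definition (reachability distance, local reachability density, ratio to the neighbors' densities) but ignores ties and duplicate points; the function name and data are illustrative.

```python
import numpy as np

def local_outlier_factor(X, k=5):
    """LOF per instance: values around 1 are inliers, clearly larger values are outliers."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                      # exclude each point from its own neighbors
    nn = np.argsort(D, axis=1)[:, :k]                # indices of the k nearest neighbors
    k_dist = D[np.arange(n), nn[:, -1]]              # distance to the k-th neighbor
    # Reachability distance of p from neighbor o: max(k-distance(o), d(p, o))
    reach = np.maximum(D[np.arange(n)[:, None], nn], k_dist[nn])
    lrd = 1.0 / reach.mean(axis=1)                   # local reachability density
    return lrd[nn].mean(axis=1) / lrd                # neighbors' density relative to own density

# 50 hypothetical inliers plus one far-away point as the last row
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])
lof = local_outlier_factor(X)
```

The isolated point has a much lower local density than its neighbors and therefore receives the largest factor.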

slide-117
SLIDE 117

CADE: Classifier-Adjusted Density Estimation for Anomaly Detection and One-Class Classification

Works with any classifier that can produce a probability

Data set is constructed based on original instances (non-outliers) and a synthetic dataset based on uniform data generation over the given features (outliers)
Original instances have y = 0, synthetic ones are y = 1
Classifier trained on this data set, prediction used to determine original instances that get a high outlier score

117
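The data construction step can be sketched with a deliberately simple classifier. A k-nearest-neighbor vote is assumed here as the probability-producing classifier, and the density adjustment of the full CADE method is left out, so this only illustrates the idea of labelling real versus uniformly generated instances.

```python
import numpy as np

def cade_scores(X, k=10, seed=0):
    """Outlier score: estimated probability that an original instance is 'synthetic'."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Synthetic instances drawn uniformly over the range of each feature (y = 1)
    synthetic = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    data = np.vstack([X, synthetic])
    y = np.concatenate([np.zeros(n), np.ones(n)])     # originals have y = 0
    D = np.linalg.norm(X[:, None, :] - data[None, :, :], axis=2)
    D[np.arange(n), np.arange(n)] = np.inf            # do not let an instance vote for itself
    nn = np.argsort(D, axis=1)[:, :k]
    return y[nn].mean(axis=1)                         # share of synthetic neighbors

# 200 hypothetical inliers plus one far-away point as the last row
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
scores = cade_scores(X)
```

Instances in sparse regions are surrounded mostly by synthetic points and score close to 1, while instances in the dense core score close to 0.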

slide-118
SLIDE 118

Some ideas for time series…

Peer Group Analysis (Bolton and Hand, 2001)

Define peer group for each object (e.g. transaction, account, customer)
Peer group consists of other objects that behaved similarly in the past
Anomalous behavior starts when object starts behaving significantly differently from its peer group
Focus on local instead of global patterns (depending upon the number of peers to consider)
Especially suitable to monitor behavior over time in e.g. time series analysis

118

slide-119
SLIDE 119

Some ideas for time series…

Break point analysis (Bolton and Hand, 2001)

Break point is an observation or time where anomalous behavior is detected
Operates on the account level by comparing sequences of transactions (amount or frequency) to find a break point
Choose a fixed time moving window (new transactions enter; old transactions leave)
Need to decide on amount of recent versus old transactions to compare
Can use statistical tests to compare new transactions to old transactions
Works at account-level so does not build profiles by looking at other accounts

119
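A moving-window comparison for a single account can be sketched as follows. A plain z-score of the recent mean against the older transactions is assumed as the statistical test, and the window sizes, threshold and transaction amounts are made up for illustration.

```python
import numpy as np

def find_break_points(amounts, old=20, recent=5, threshold=3.0):
    """Indices at which the recent window deviates strongly from the older window."""
    amounts = np.asarray(amounts, dtype=float)
    hits = []
    for t in range(old + recent, len(amounts) + 1):
        old_window = amounts[t - old - recent:t - recent]   # older transactions
        new_window = amounts[t - recent:t]                   # most recent transactions
        se = old_window.std(ddof=1) / np.sqrt(recent)        # standard error of the recent mean
        if se > 0 and abs(new_window.mean() - old_window.mean()) / se > threshold:
            hits.append(t - 1)                               # index of the newest transaction
    return hits

# Hypothetical account: stable amounts around 100, then a jump to 300 from index 60 on
normal = [100.0, 102.0, 98.0, 101.0, 99.0] * 12
amounts = normal + [300.0] * 10
hits = find_break_points(amounts)
```

The first flagged index is the first transaction after the jump. The window sizes determine how quickly a change is picked up and how long it keeps being flagged.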

slide-120
SLIDE 120

Time series based

Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test (extreme Studentized deviate) for detecting anomalies

Can be used to detect both global as well as local anomalies
This is achieved by employing time series decomposition and using robust statistical metrics, viz. the median together with ESD (to detect outliers)

Trend component
Cyclical component
Seasonal component
Irregular component

In addition, for long time series (say, 6 months of minutely data), the algorithm employs piecewise approximation; this is rooted in the fact that trend extraction in the presence of anomalies is non-trivial for anomaly detection

120

slide-121
SLIDE 121

Seasonal Hybrid ESD (S-H-ESD)

121

slide-122
SLIDE 122

Seasonal Hybrid ESD (S-H-ESD)

122

slide-123
SLIDE 123

Seasonal Hybrid ESD (S-H-ESD)

https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series

See: https://github.com/twitter/AnomalyDetection and https://facebook.github.io/prophet/ (e.g. using changepoint detection) 123

slide-124
SLIDE 124

Prophet

124