Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] - PowerPoint PPT Presentation

SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Unsupervised Learning Anomaly Detection

SLIDE 2

Overview

Frequent itemset and association rule mining Other itemset extensions Clustering Anomaly detection

SLIDE 3

The analytics process

SLIDE 4

Recall

Predictive analytics (supervised learning)

Predict the future based on patterns learnt from past data
Classification (categorical) versus regression (continuous)
You have a labelled data set at your disposal

Descriptive analytics (unsupervised learning)

Describe patterns in data
Clustering, association rules, sequence rules
No labelling required

For unsupervised learning, we don't assume a label or target

SLIDE 5

Frequent itemset and association rule mining

5

slide-6
SLIDE 6

Introduction

Association rule learning is a method for discovering interesting relations between variables

Interesting? Frequent, rare, costly, strange?
Intended to identify strong rules discovered in databases using some measure of interestingness
For example, the rule {onions, tomatoes, ketchup} → {burger}, found in the sales data of a supermarket, would indicate that if a customer buys onions, tomatoes and ketchup together, they are likely to also buy hamburger meat; this can be used e.g. for promotional pricing or product placements
Application areas include market basket analysis, web usage mining, intrusion detection, and production and manufacturing
Association rule learning typically does not consider the order of items, either within a transaction or across transactions (sequence mining does)

Pioneering technique: the apriori algorithm (Rakesh Agrawal, 1993)

SLIDE 7

{beer, diapers}?

https://www.itbusiness.ca/news/behind-the-beer-and-diapers-data-mining-legend/136

1992, Thomas Blischok, manager of a retail consulting group at Teradata Prepared an analysis of 1.2 million market baskets from 25 Osco Drug stores Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m., consumers bought beer and diapers" Osco managers did not exploit the beer and diapers relationship by moving the products closer together

SLIDE 8

{lotion, calcium, zinc}?

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

Before long some useful patterns emerged
Women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester; another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc
When someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date
About 25 products, when analyzed together, allowed Target to assign each shopper a "pregnancy prediction" score; more importantly, it could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy
"My daughter got this in the mail! She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?"

SLIDE 9

Transactional database

Every instance now represents a transaction
Features correspond to columns: binary indicators per item

Tr. ID | milk | bread | beer | cheese | wine | spaghetti
101    |  1   |   1   |  1   |        |      |
102    |  1   |   1   |      |   1    |  1   |
103    |      |   1   |  1   |   1    |      |     1
104    |      |       |      |   1    |  1   |     1
105    |  1   |   1   |      |   1    |  1   |     1

SLIDE 10

Mining interesting rules

What constitutes a "good" rule?

To select rules, constraints on various measures of "interest" are used Most known measures/constraints: minimum thresholds on support and confidence

Support(X ⊆ I, T) = |{t ∈ T : X ⊆ t}| / |T|

Tr. ID | milk | bread | beer | cheese | wine | spaghetti
101    |  1   |   1   |  1   |        |      |
102    |  1   |   1   |      |   1    |  1   |
103    |      |   1   |  1   |   1    |      |     1
104    |      |       |      |   1    |  1   |     1
105    |  1   |   1   |      |   1    |  1   |     1

Support({milk, bread, cheese}) = 2/5 = 0.4
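In code, support is just a count over the transaction list. A minimal Python sketch (the transactions mirror the example table; the exact item placement per basket is illustrative):

```python
# Minimal sketch: itemset support over a list of transactions.
# Baskets mirror the example table above (illustrative placement).
transactions = [
    {"milk", "bread", "beer"},
    {"milk", "bread", "cheese", "wine"},
    {"bread", "beer", "cheese", "spaghetti"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"milk", "bread", "cheese"}, transactions))  # 0.4
```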

SLIDE 11

Mining interesting rules

What constitutes a "good" rule?

To select rules, constraints on various measures of "interest" are used Most known measures/constraints: minimum thresholds on support and confidence

Confidence(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / Support(X, T)

Tr. ID | milk | bread | beer | cheese | wine | spaghetti
101    |  1   |   1   |  1   |        |      |
102    |  1   |   1   |      |   1    |  1   |
103    |      |   1   |  1   |   1    |      |     1
104    |      |       |      |   1    |  1   |     1
105    |  1   |   1   |      |   1    |  1   |     1

Confidence({cheese, wine} → {spaghetti}) = 0.4 / 0.6 ≈ 0.67
Can be interpreted as an estimate of the conditional probability P(Y | X)

SLIDE 12

Mining interesting rules

Other measures exist as well:

Lift(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / (Support(X, T) × Support(Y, T))

If lift = 1, the occurrences of the antecedent and the consequent are independent of each other
If lift > 1, its magnitude signifies the degree of dependence
Considers both the confidence of the rule and the overall data set

Conviction(X ⊆ I ⇒ Y ⊆ I, T) = (1 − Support(Y, T)) / (1 − Confidence(X ⇒ Y, T))

Interpreted as the ratio of the expected frequency that X occurs without Y
A measure for the frequency with which the rule makes an incorrect prediction, e.g. a value of 1.2 would indicate that the rule would be incorrect 1.2 times as often if the association between X and Y were due to chance

Cost-sensitive measures exist here as well ("profit"- or "utility"-based rule mining)

E.g. Kitts et al., 2000:

ExpectedProfit(X ⊆ I ⇒ Y ⊆ I, T) = Confidence(X ⇒ Y, T) × Σᵢ Profit(Yᵢ)
IncrementalProfit(X ⊆ I ⇒ Y ⊆ I, T) = (Confidence(X ⇒ Y, T) − P(Y)) × Σᵢ Profit(Yᵢ)
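These measures can be computed in a few lines of Python; a sketch using the same illustrative transactions as before:

```python
# Sketch: support, confidence, lift and conviction for association rules.
# Transactions are illustrative, matching the supports used on the slides.
transactions = [
    {"milk", "bread", "beer"},
    {"milk", "bread", "cheese", "wine"},
    {"bread", "beer", "cheese", "spaghetti"},
    {"cheese", "wine", "spaghetti"},
    {"milk", "bread", "cheese", "wine", "spaghetti"},
]

def support(itemset, ts):
    itemset = set(itemset)
    return sum(itemset <= t for t in ts) / len(ts)

def confidence(x, y, ts):
    return support(set(x) | set(y), ts) / support(x, ts)

def lift(x, y, ts):
    return support(set(x) | set(y), ts) / (support(x, ts) * support(y, ts))

def conviction(x, y, ts):
    c = confidence(x, y, ts)
    return float("inf") if c == 1 else (1 - support(y, ts)) / (1 - c)

x, y = {"cheese", "wine"}, {"spaghetti"}
print(round(confidence(x, y, transactions), 2))  # 0.67
print(round(lift(x, y, transactions), 2))        # 1.11
print(round(conviction(x, y, transactions), 2))  # 1.2
```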
SLIDE 13

The apriori algorithm

Algorithm:

1. A minimum support threshold is applied to find all frequent itemsets
2. A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules

Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets
The set of possible itemsets is the power set over I and has size 2^|I| - 1 (excluding the empty set, which is not a valid itemset)
Although the size of the power set grows exponentially in the number of items |I|, efficient search is possible using the downward-closure property of support (or: anti-monotonicity), which guarantees that for a frequent itemset all its subsets are also frequent, and thus for an infrequent itemset all its supersets must also be infrequent
Exploiting this property, efficient algorithms can find all frequent itemsets

SLIDE 14
The apriori algorithm

A minimum support threshold is applied to find all frequent itemsets
All itemsets with support >= 50%:

Tr. ID | milk | bread | beer | cheese | wine | spaghetti
101    |  1   |   1   |  1   |        |      |
102    |  1   |   1   |      |   1    |  1   |
103    |      |   1   |  1   |   1    |      |     1
104    |      |       |      |   1    |  1   |     1
105    |  1   |   1   |      |   1    |  1   |     1

Itemset ('milk',) has a support of: 0.6
Itemset ('bread',) has a support of: 0.8
Itemset ('cheese',) has a support of: 0.8
Itemset ('wine',) has a support of: 0.6
Itemset ('spaghetti',) has a support of: 0.6
Itemset ('milk', 'bread') has a support of: 0.6
Itemset ('bread', 'cheese') has a support of: 0.6
Itemset ('cheese', 'wine') has a support of: 0.6
Itemset ('cheese', 'spaghetti') has a support of: 0.6
...

Brute force leads to 2^|I| - 1 = 63 possibilities
For 100 items we'd already have 2^100 - 1 ≈ 1.27 × 10^30 possibilities

SLIDE 15

The apriori algorithm

A minimum support threshold is applied to find all frequent itemsets
Speeding this up with a "step by step" expansion: we only need to continue expanding itemsets that are above the threshold, only using items above the threshold

SLIDE 16

The apriori algorithm

A minimum support threshold is applied to find all frequent itemsets
"Join and prune": an even better way (proposed by apriori)

Say we want to generate candidate 3-itemsets (sets with three items)
Look at the previous step: we only need the (3-1)-itemsets to do so, and only the ones which had enough support: {milk, bread}, {bread, cheese}, {cheese, wine}, {cheese, spaghetti}
Join on self: join this set of itemsets with itself to generate the list of candidates of length 3, e.g. {milk, bread} joined with {bread, cheese} gives {milk, bread, cheese}:

{milk, bread, cheese}
{bread, cheese, wine}
{bread, cheese, spaghetti}
{cheese, wine, spaghetti}

Prune result: prune the candidates containing a (3-1)-subset that did not have enough support (all four candidates can be pruned in this case)

This is repeated for every step
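The join-and-prune step above can be sketched as follows (a simplified version of apriori's candidate generation; itemsets are represented as frozensets):

```python
from itertools import combinations

# Sketch of apriori's "join and prune": build (k+1)-item candidates from
# frequent k-itemsets, then drop any candidate with an infrequent k-subset.
def join_and_prune(frequent_k):
    """frequent_k: set of frozensets, all of the same size k."""
    frequent_k = set(frequent_k)
    k = len(next(iter(frequent_k)))
    candidates = {a | b for a in frequent_k for b in frequent_k
                  if len(a | b) == k + 1}
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

f2 = {frozenset(p) for p in [("milk", "bread"), ("bread", "cheese"),
                             ("cheese", "wine"), ("cheese", "spaghetti")]}
print(join_and_prune(f2))  # set() -- every 3-item candidate is pruned
```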

SLIDE 17

The apriori algorithm

SLIDE 18

The apriori algorithm

A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules

Once the frequent itemsets are obtained, association rules are generated as follows
For each frequent itemset I, generate all non-empty proper subsets of I
For every such subset I_s, check the rule's confidence value and retain those above a threshold:

∀ I_s ∈ P(I) : I_s ≠ ∅ ∧ I_s ≠ I ∧ Confidence(I_s ⇒ I ∖ I_s, T) > minconf → keep rule I_s ⇒ I ∖ I_s

E.g. for frequent itemset {cheese, wine, spaghetti}, we'd check
{cheese, wine} → {spaghetti}
{cheese, spaghetti} → {wine}
{wine, spaghetti} → {cheese}
{cheese} → {wine, spaghetti}
{spaghetti} → {cheese, wine}
{wine} → {cheese, spaghetti}
... and keep those with sufficient confidence
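Enumerating the candidate rules for one frequent itemset can be sketched as:

```python
from itertools import chain, combinations

# Sketch: generate candidate rules S -> I \ S from one frequent itemset I;
# every non-empty proper subset S becomes an antecedent.
def candidate_rules(itemset):
    items = sorted(itemset)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(1, len(items)))
    return [(frozenset(s), frozenset(itemset) - set(s)) for s in subsets]

rules = candidate_rules({"cheese", "wine", "spaghetti"})
print(len(rules))  # 6 candidate rules to check against minconf
```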

SLIDE 19

Extensions

SLIDE 20

Categorical and continuous variables

E.g. {browser=“firefox”, pages_visited < 10} → {sale_made=“no”}

Easy approach: convert each level to a binary "item" (see "dummy" codification in pre-processing) and place continuous variables in bins before converting to binary values: {browser-firefox, pages-visited-0-to-20} → {no-sale-made}

Group categorical variables with many levels
Drop frequent levels which are not considered interesting, e.g. prevent browser-chrome turning up in rules if 90% of visitors use this browser

Binning of continuous variables requires some fine-tuning (otherwise support or confidence gets too low)

Better statistical methods are available, see e.g. R. Rastogi and Kyuseok Shim, "Mining optimized association rules with categorical and numeric attributes," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp. 29-50, and others

SLIDE 21

Multi-level association rules

E.g. product hierarchies

Using lower levels only? Highest only? Something in-between?

Possible, but items in lower levels might not have enough support
Rules become overly specific

Srikant R. & Agrawal R., "Mining Generalized Association Rules", in Proc. 1995 Int. Conf. Very Large Data Bases, Zurich, 1995

SLIDE 22

Discriminative association rules

Bring in some "supervision"

E.g. you're interested in rules with outcome → {beer}, or perhaps → {beer, spirits}
Interesting because multiple class labels can be combined in the consequent, and the consequent can also involve non-class labels to derive interesting correlations
Learned patterns can be used as features for other learners, see e.g. Keivan Kianmehr, Reda Alhajj, "CARSVM: A class association rule-based classification framework and its application to gene expression data"

SLIDE 23

Tuning

Possibility to tune “interestingness”

Rare patterns: low support but still interesting

E.g. people buying Rolex watches
Mine by setting special group-based support thresholds (context matters)

Negative patterns:

E.g. people buying a Hummer and a Tesla together will most likely not occur
Negatively correlated, infrequent patterns can be more interesting than positive, frequent patterns

Possibility to include domain knowledge

Block frequent itemsets, e.g. if {cats, mice} occurs as a subset (no matter the support)
Or allow certain itemsets, even if the support is low

SLIDE 24

Filtering

Often lots of association rules will be discovered!

Post-processing is a necessity
Perform sensitivity analysis using minsup and minconf thresholds
Trivial rules

E.g., buy spaghetti and spaghetti sauce

Unexpected / unknown rules

Novel and actionable patterns, potentially interesting! Confidence might not always be the best metric (lift, conviction, rarity, …)

Appropriate visualisation facilities are crucial!

Association rules can be powerful but really require that you "get to work" with the results!
See e.g. the arulesViz package for R: https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf

SLIDE 25

Filtering

SLIDE 26

Applications

Market basket analysis

{baby food, diapers} => {stella}
Put them closer together in the store? Put them far apart in the store?
Package baby food, diapers and stella? Package baby food, diapers and stella with a poorly selling item?
Raise the price on one, and lower it on the other?
Do not advertise baby food, diapers and stella together?
Up-, down- and cross-selling

Basic recommender systems

Customers who buy this also frequently buy...
Can be a very simple though powerful approach

Fraud detection

{Insurer 64, Car Repair Shop A, Police Officer B} as frequent pattern

SLIDE 27

Sequence mining

In standard apriori, the order of items does not matter Instead of item-sets, think now about item-bags and item-sequences

Sets: unordered, each element appears at most once: every transaction is a set of items
Bags: unordered, elements can appear multiple times: every transaction is a bag of items
Sequences: ordered, every transaction is a sequence of items

SLIDE 28

Sequence mining

Mining of frequent sequences: algorithm very similar to apriori (i.e. GSP, Generalized Sequential Pattern algorithm)

Start with the set of frequent 1-sequences
But: expansion (candidate generation) is done differently:

E.g. in normal apriori, {A, B} and {A, C} would both be expanded into the same set {A, B, C}

For sequences, suppose we have <{A}, {B}> and <{A}, {C}>; these are now expanded (joined) into <{A}, {B}, {C}>, <{A}, {C}, {B}> and <{A}, {B, C}>
Often modified to suit particular use cases

Common case: just consider sequences with sets containing one item only: e.g. <A,B> and <A,C> expanded into <A,B,C> and <A,C,B>

E.g. in web mining or customer journey analytics

Pruning of k-sequences with infrequent (k-1)-subsequences works as before: only continue with sequences whose support is higher than the threshold
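For the common single-item-per-set case, the expansion step can be sketched as follows (a simplification that reproduces the <A,B>, <A,C> example above; itemset-valued steps like <{A},{B,C}> are left out):

```python
# Sketch: candidate expansion for single-item sequences.
# Two frequent k-sequences sharing the same (k-1)-prefix join in both
# orders, so order matters, unlike in normal apriori.
def expand(freq_k):
    """freq_k: set of tuples (sequences), all of the same length k."""
    out = set()
    for a in freq_k:
        for b in freq_k:
            if a != b and a[:-1] == b[:-1]:  # same prefix, different last item
                out.add(a + (b[-1],))
    return out

print(sorted(expand({("A", "B"), ("A", "C")})))
# [('A', 'B', 'C'), ('A', 'C', 'B')]
```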

SLIDE 29

Sequence mining

Extensions exist that take timing into account

SLIDE 30

Sequence mining

Extension: frequent episode mining

For very long time series Find frequent sequences within the time series

Extension: continuous time series mining

By first binning the continuous time series into categorical levels (similar to normal apriori)

Extension: discriminative sequence mining

Again: if you know the outcome of interest (i.e. sequences which lead customer to buy a certain product)

See e.g. SPMF: http://www.philippe-fournier-viger.com/spmf/ for a large collection of algorithms

SLIDE 31

Sequence mining

"Sankey diagram"

SLIDE 32

Conclusion

The main power of apriori comes from the fact that it can easily be extended and adapted towards specific settings
This also means that you'll probably have to go further than "out of the box" approaches
Many other extensions exist (e.g. for frequent subgraphs)

SLIDE 33

Clustering

SLIDE 34

Clustering

Cluster analysis or clustering is the task of grouping a set of objects

In such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)

It is a main task of exploratory data mining, and a common technique for statistical data analysis
Used in many fields: machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, customer segmentation, etc.

Organizing data into clusters shows internal structure of the data

Find patterns, structure, etc.
Sometimes the partitioning is the goal, e.g. market segmentation
As a first step towards predictive techniques, e.g. mine a decision tree model on cluster labels to get further insights or even predict group outcome; use labels and distances as features

SLIDE 35

Clustering

SLIDE 36

Two types

Hierarchical clustering:

Create a hierarchical decomposition of the set of objects using some criterion

Connectivity, distance based

Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters
Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions

Partitional clustering:

Objective function based
Construct various partitions and then evaluate them by some criterion
K-means, k-means++, etc.

SLIDE 37

Hierarchical clustering

Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters, bottom-up
Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions, top-down

SLIDE 38

Hierarchical clustering

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of (dis)similarity between sets of observations is required … can be quite subjective In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets

SLIDE 39

Hierarchical clustering

The distance metric defines the distance between two observations

Euclidean distance: ||a − b||₂ = √(Σᵢ (aᵢ − bᵢ)²)
Squared Euclidean distance: ||a − b||₂² = Σᵢ (aᵢ − bᵢ)²
Manhattan distance: ||a − b||₁ = Σᵢ |aᵢ − bᵢ|
Maximum distance: ||a − b||∞ = maxᵢ |aᵢ − bᵢ|

Many more are possible...

Should possess:

Symmetry: d(a,b) = d(b,a)
Constancy of self-similarity: d(a,a) = 0
Positivity: d(a,b) ≥ 0
Triangle inequality: d(a,b) ≤ d(a,c) + d(c,b)
SLIDE 40

Hierarchical clustering

The linkage criterion defines the distance between groups of instances

Note that a group can consist of only one instance

Single linkage (minimum linkage):

D(A, B) = min{d(a, b) : a ∈ A, b ∈ B}

Leads to longer, skinnier clusters

Complete linkage (maximum linkage):

D(A, B) = max{d(a, b) : a ∈ A, b ∈ B}

Leads to tight clusters

SLIDE 41

Hierarchical clustering

The linkage criterion defines the distance between groups of instances

Note that a group can consist of only one instance

Average linkage (mean linkage):

D(A, B) = (1 / (|A| × |B|)) Σ_{a∈A, b∈B} d(a, b)

Favorable in most cases, robust to noise
Centroid linkage: based on the distance between the centroids: D(A, B) = d(c_A, c_B)
Also robust, requires definition of a centroid concept
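A sketch of three of the linkage criteria, parameterised by an arbitrary pairwise distance function d:

```python
from itertools import product

# Sketch: single, complete and average linkage between two groups of points.
def single(A, B, d):
    return min(d(a, b) for a, b in product(A, B))

def complete(A, B, d):
    return max(d(a, b) for a, b in product(A, B))

def average(A, B, d):
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))

d = lambda a, b: abs(a - b)          # 1-D example distance
A, B = [0.0, 1.0], [4.0, 6.0]
print(single(A, B, d), complete(A, B, d), average(A, B, d))  # 3.0 6.0 4.5
```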

SLIDE 42

Hierarchical clustering

A hierarchy is obtained over the instances

No need to specify the desired number of clusters in advance
Hierarchical structure maps nicely to human intuition for some domains
Not very scalable (though fast enough in most cases)
Local optima are a problem: the cluster hierarchy is not necessarily the best one
You might want/need to normalize your features first

SLIDE 43

Hierarchical clustering

Represent the hierarchy in a dendrogram
Can be used to decide on the number of desired clusters (cut in the dendrogram)

SLIDE 44

Hierarchical clustering

Represent the hierarchy in a dendrogram
Also a good indication of possible outliers or anomalies

SLIDE 45

Agglomerative or divisive

All implementations you'll find will implement agglomerative clustering
Divisive clustering turns out to be way more computationally expensive: don't bother!

SLIDE 46

Partitional clustering

Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters

Since only one set of clusters is output, the user normally has to input the desired number of clusters K

Most well-known algorithm: k-means clustering:

1. Decide on a value for K
2. Initialize K cluster centers (e.g. randomly over the feature space or by picking some instances randomly)
3. Decide the cluster memberships of the instances by assigning them to the nearest cluster center, using some distance measure
4. Recalculate the K cluster centers
5. Repeat steps 3 and 4 until none of the objects changed cluster membership in the last iteration
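Steps 1 to 5 above can be sketched as a minimal, unoptimized implementation of Lloyd's algorithm for 2-D points:

```python
import random

# Sketch: plain k-means following steps 1-5, squared Euclidean distance.
def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: instances as centers
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2
                            + (p[1] - centers[j][1]) ** 2)
            for p in points                  # step 3: nearest center
        ]
        if new_assign == assign:             # step 5: no change -> stop
            break
        assign = new_assign
        for j in range(k):                   # step 4: recompute centers
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assign

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, k=2)
print(labels)  # the two tight groups end up in different clusters
```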

SLIDE 47

K-means example

Pick random centers (either in the feature space or by using some randomly chosen instances as starts):

SLIDE 48

K-means example

Calculate the membership of each instance:

SLIDE 49

K-means example

And reposition the cluster centroids:

SLIDE 50

K-means example

Reassign the instances again:

SLIDE 51

K-means example

Recalculate the centroids: no reassignments are performed, so we stop here

SLIDE 52

K-means

Recall that a good cluster solution has high intra-cluster (in-cluster) similarity and low inter-cluster (between-cluster) similarity
K-means optimizes for high intra-cluster similarity by minimizing the total distortion: the sum of squared distances of points to their cluster centroid

min SSE = Σ_{k=1}^{K} Σ_{i=1}^{n_k} ||x_{ki} − μ_k||²

Note: an exact optimization of this objective function is hard; k-means is a heuristic approach, not necessarily providing the globally optimal outcome

Except in the one-dimensional case
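The objective above, computed for a given assignment (a minimal 2-D sketch):

```python
# Sketch: total distortion (SSE) of a k-means solution, i.e. the sum of
# squared distances of points to their assigned cluster centroid.
def sse(points, centers, assign):
    return sum((p[0] - centers[a][0]) ** 2 + (p[1] - centers[a][1]) ** 2
               for p, a in zip(points, assign))

pts = [(0, 0), (0, 2), (10, 10)]
centers = [(0.0, 1.0), (10.0, 10.0)]
print(sse(pts, centers, [0, 0, 1]))  # 2.0
```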

SLIDE 53

K-means

Strengths

Very simple to implement and debug
Intuitive objective function: optimizes intra-cluster similarity
Relatively efficient

Weaknesses

Applicable only when the mean is defined (to calculate the centroid of points): what about categorical data?
Often terminates at a local optimum, hence initialization is extremely important: try multiple random starts or use k-means++ (most implementations do so by default)
Need to specify K in advance: use the elbow point if unsure (e.g. plot SSE across different values of K)
Sensitive to noisy data and outliers
Again: normalization might be required
Not suitable to discover clusters with non-convex shapes

SLIDE 54

K-means and non-convex shapes

SLIDE 55

K-means and local optima

SLIDE 56

K-means and setting the value for K

SLIDE 57

K-means and setting the value for K

Don't be too swayed by your intuition in this case

Often, it pays off to start with a higher setting for K and post-process/inspect accordingly

SLIDE 58

K-means and setting the value for K

Don't be too swayed by your intuition in this case

Often, it pays off to start with a higher setting for K and post-process/inspect accordingly

SLIDE 59

K-means++

An algorithm to pick good initial centroids

1. Choose the first center uniformly at random (in the feature space or from the instances)
2. For each instance x (or over a grid in the feature space), compute d(x, c): the distance between x and the nearest center c that has already been chosen
3. Choose another center, using a weighted probability distribution where a point x is chosen with probability proportional to d(x, c)²
4. Repeat steps 2 and 3 until K centers have been chosen
5. Now proceed using standard K-means clustering (not repeating the initialization step, of course)

Basic idea: spread the initial centers out from each other: aim for lower inter-cluster similarity

Most k-means implementations actually implement this variant
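A minimal sketch of this seeding procedure for 2-D points (`random.choices` does the weighted draw of step 3):

```python
import random

# Sketch: k-means++ seeding; each new center is drawn with probability
# proportional to the squared distance to its nearest already-chosen center.
def kmeans_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(points)]                          # step 1
    while len(centers) < k:
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                  for c in centers)
              for p in points]                              # step 2
        centers.append(rng.choices(points, weights=d2)[0])  # step 3
    return centers

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_pp_init(pts, k=2))
```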

SLIDE 60

DBSCAN

“Density-based spatial clustering of applications with noise”

Groups together points that are closely packed (points with many nearby neighbors)
Marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away)
DBSCAN is also one of the most common clustering algorithms
Hierarchical versions exist as well
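A minimal usage sketch, assuming scikit-learn is available (the eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Sketch: two dense groups plus one isolated point;
# DBSCAN labels the isolated point -1 ("noise").
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [50, 50]])                      # the lone outlier
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
print(labels)  # the isolated point gets label -1
```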

SLIDE 61

DBSCAN

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68

SLIDE 62

Expectation-maximization based clustering

Based on Gaussian mixture models

K-means is EM'ish, but makes "hard" assignments of instances to clusters
Based on two steps:

E-step: assign probabilistic membership: the probability of membership to a cluster, given an instance in the dataset
M-step: re-estimate the parameters of the model based on the probabilistic memberships
Which model are we estimating? A mixture of Gaussians

SLIDE 63

Expectation-maximization based clustering

Also local optimization, but nonetheless robust
Learns a model for each cluster, so you can generate new data points from it (though this is possible with k-means as well: fit a Gaussian for each cluster with mean on the centroid and variance derived using the minimum sum of squared errors)
Relatively efficient
Extensible to other model types (e.g. multinomial models for categorical data, or noise-robust models)

But:

Initialization of the models is still important (local optimum problem)
Also still need to specify K
Also problems with non-convex shapes
So basically: a less "hard" version of k-means

SLIDE 64

Validation

How to check the result of a clustering run?

Use k-means objective function (sum of squared errors) Many other measures as well, e.g. (Davies and Bouldin, 1979), (Halkidi, Batistakis, Vazirgiannis 2001)

These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters

E.g. the Davies-Bouldin index:

DBI = (1/N) Σ_{i=1}^{N} R_i, with R_i = max_{j≠i} R_{ij}, i = 1, ..., N, and R_{ij} = (S_i + S_j) / D_{ij}

S_i: mean distance between the instances in cluster i and its centroid
D_{ij}: distance between the centroids of clusters i and j

A good result has a low index score
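A sketch of the index for clusters of 2-D points (scikit-learn also ships this as `davies_bouldin_score`):

```python
import math

# Sketch: Davies-Bouldin index for a clustering given as a list of clusters
# (each cluster a list of 2-D points), using Euclidean distance.
def dbi(clusters):
    cents = [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]
    S = [sum(math.dist(p, m) for p in c) / len(c)
         for c, m in zip(clusters, cents)]
    n = len(clusters)
    R = [max((S[i] + S[j]) / math.dist(cents[i], cents[j])
             for j in range(n) if j != i) for i in range(n)]
    return sum(R) / n

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (0, 5)], [(3, 0), (3, 5)]]
print(dbi(tight) < dbi(loose))  # True: tight, well-separated clusters score lower
```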

SLIDE 65

Applications

Market research: customer and market segments, product positioning
Social network analysis: recognize communities of people
Social science: identify students, employees with similar properties
Search: group search results
Recommender systems: predict preferences based on the user's cluster
Crime analysis: identify areas with similar crimes
Image segmentation and color palette detection

SLIDE 66

Domain knowledge

An interesting question for many algorithms is how domain knowledge can be incorporated in the learning step

For clustering, this is often done using must-link and can’t-link constraints: who should be and should not be in the same cluster? Nice as this does not require significant changes to algorithm, but can lead to infeasible solutions if too many constraints are added Another approach is a "warm start" solution by providing a partial solution

SLIDE 67

More on distance metrics

Most metrics we've seen were defined for numerical features, though metrics exist for textual and categorical data as well

E.g. the Levenshtein distance between two text fields
Based on the number of edits (changes) performed: deletions, insertions and substitutions
Other metrics exist as well, e.g. Jaccard, cosine, and Gower distance (https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9)
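A standard dynamic-programming sketch of the Levenshtein distance:

```python
# Sketch: Levenshtein distance -- the minimum number of single-character
# insertions, deletions and substitutions turning s into t.
def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```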

SLIDE 68

Other techniques: self-organizing maps (SOMs)

Can be formalized as a special type of artificial neural networks Not really (only) clustering, but: unsupervised Produces a low-dimensional representation of the input space (just like PCA!) Also called a Kohonen map (Teuvo Kohonen)

https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/

SLIDE 69

Other techniques: t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of very high-dimensional datasets

Not really for clustering, though can be used as a pre-processing step before clustering Or post-hoc to visualize a clustering result (alternative to PCA)

https://lvdmaaten.github.io/tsne/

SLIDE 70

Anomaly detection

SLIDE 71

Anomaly detection

Anomaly detection (also outlier detection or novelty detection) is the identification of instances which do not conform to an expected pattern or other items in a dataset

Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text
Anomalies: outliers, novelties, noise, deviations, exceptions
Uses a combination of unsupervised, supervised and statistical techniques

At this stage, it's worthwhile to have a look at some common approaches in this setting

SLIDE 72

Statistical approaches

Histograms, distributions, box plots: spotting outliers

Easy to use but univariate in many cases

Other statistical methods, e.g. robust covariance estimation

SLIDE 73

Clustering based

Cluster, and check who falls outside the clusters

Distance based
Clusters with few instances
Based on the dendrogram
Based on noisy instances in DBSCAN
Combine with dimensionality reduction techniques

SLIDE 74

Clustering based

https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00

SLIDE 75

Clustering based

https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00

SLIDE 76

One-class based methods

E.g. "one-class SVMs" Linked to PU learning (positive-unlabeled)

See also: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.329.6479&rep=rep1&type=pdf

Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive

SLIDE 77

Isolation forests

Based on random forest idea

But: the splits for nodes are chosen completely randomly, not entropy-driven
An anomaly score is then defined based on the average distance from the root node to the leaf node containing the instance
Outliers have a higher chance to be separated quickly
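A minimal usage sketch, assuming scikit-learn is available (the contamination value is illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Sketch: the point far from the dense group is isolated in few random
# splits; fit_predict marks it -1 ("anomaly"), inliers get 1.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),  # dense "normal" cluster
               [[8.0, 8.0]]])                      # one obvious outlier
pred = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
print(pred[-1])  # -1
```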

https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py

SLIDE 78

Local outlier factor (LOF)

Based on K-NN

Measure the local deviation of density of a given sample with respect to its neighbors
Depends on how isolated the object is with respect to the surrounding neighborhood; locality is given by the k nearest neighbors, whose distance is used to estimate the local density
By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors: these are considered outliers
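A minimal usage sketch, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Sketch: LOF compares each sample's local density to that of its
# neighbors; -1 marks samples in unusually sparse regions.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [10, 10]])                      # far from everyone else
labels = LocalOutlierFactor(n_neighbors=3).fit_predict(X)
print(labels[-1])  # -1
```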

https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html#sphx-glr-auto-examples-neighbors-plot-lof-outlier-detection-py

SLIDE 79

CADE: Classifier-Adjusted Density Estimation for Anomaly Detection and One-Class Classification

Works with any classifier that can produce a probability

A data set is constructed based on the original instances (non-outliers) and a synthetic dataset based on uniform data generation over the given features (outliers)
Original instances get Y = 0, synthetic ones Y = 1
A classifier is trained on this data set; its predictions are used to determine which original instances get a high outlier score
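A sketch of the CADE idea, assuming scikit-learn is available; the choice of a (depth-limited) random forest as the classifier is arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch: train a classifier to separate real data (y=0) from uniform
# synthetic data (y=1); real instances scored "synthetic-like" are
# anomaly candidates.
rng = np.random.RandomState(0)
real = np.vstack([rng.normal(0, 1, size=(200, 2)),
                  [[6.0, 6.0]]])                 # planted anomaly
lo, hi = real.min(axis=0), real.max(axis=0)
synth = rng.uniform(lo, hi, size=real.shape)     # uniform over feature range
X = np.vstack([real, synth])
y = np.r_[np.zeros(len(real)), np.ones(len(synth))]
clf = RandomForestClassifier(n_estimators=200, max_depth=3,
                             random_state=0).fit(X, y)
scores = clf.predict_proba(real)[:, 1]           # outlier score per real row
print(float(scores[-1]))  # high score for the planted anomaly
```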

SLIDE 80

Time series based

Peer Group Analysis (Bolton and Hand, 2001)

Define a peer group for each object (e.g. transaction, account, customer)
The peer group consists of other objects that behaved similarly in the past
Anomalous behavior starts when an object starts behaving significantly differently from its peer group
Focus on local instead of global patterns (depending upon the number of peers to consider)
Especially suitable to monitor behavior over time, e.g. in time series analysis

SLIDE 81

Time series based

Break point analysis (Bolton and Hand, 2001)

A break point is an observation or time where anomalous behavior is detected
Operates on the account level by comparing sequences of transactions (amount or frequency) to find a break point
Choose a fixed moving time window (new transactions enter; old transactions leave)
Need to decide on the amount of recent versus old transactions to compare
Can use statistical tests to compare new transactions to old transactions
Works at the account level, so does not build profiles by looking at other accounts

SLIDE 82

Time series based

Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test (extreme Studentized deviate) for detecting anomalies

Can be used to detect both global as well as local anomalies
This is achieved by employing time series decomposition and using robust statistical metrics, viz. the median, together with ESD (to detect outliers):

Trend component
Cyclical component
Seasonal component
Irregular component

In addition, for long time series (say, 6 months of minutely data), the algorithm employs piecewise approximation; this is rooted in the fact that trend extraction in the presence of anomalies is non-trivial for anomaly detection

SLIDE 83

Seasonal Hybrid ESD (S-H-ESD)

SLIDE 84

Seasonal Hybrid ESD (S-H-ESD)

SLIDE 85

Seasonal Hybrid ESD (S-H-ESD)

https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series

See: https://github.com/twitter/AnomalyDetection and https://facebook.github.io/prophet/ (e.g. using changepoint detection)