Anomaly detection - DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON - PowerPoint PPT Presentation



SLIDE 1

Anomaly detection

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

  • Dr. Chris Anagnostopoulos

Honorary Associate Professor

SLIDE 2

Anomalies and outliers

Supervised vs. unsupervised

SLIDE 3

Anomalies and outliers

  • One of the two classes is very rare
  • Extreme case of dataset shift
  • Examples: cybersecurity, fraud detection, anti-money laundering, fault detection

SLIDE 4

Unsupervised workflows

  • How to fit an algorithm without labels?
  • How to estimate its performance?
  • Careful use of a handful of labels:
    too few for training without overfitting
    just enough for model selection and an unbiased estimate of accuracy
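A minimal numpy sketch of this idea, with an entirely hypothetical detector: the anomaly scores stand in for the output of any unsupervised method, and the handful of labels is used only to pick a decision threshold, never for training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anomaly scores from an unsupervised detector
# (higher = more anomalous), plus a handful of ground-truth labels.
scores = np.concatenate([rng.normal(0, 1, 95), rng.normal(5, 1, 5)])
labels = np.array([0] * 95 + [1] * 5)

# Model selection only: choose the threshold that maximizes F1
# on the small labelled set.
best_t, best_f1 = None, -1.0
for t in np.unique(scores):
    pred = scores >= t
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    if f1 > best_f1:
        best_t, best_f1 = float(t), float(f1)
```

Because the labels only select one scalar, overfitting them is much less of a risk than fitting a full model to them would be.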

SLIDE 5

  • Outlier: a datapoint that lies outside the range of the majority of the data
  • Local outlier: a datapoint that lies in an isolated region without other data

SLIDE 6

Local outlier factor (LoF)
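The figure on this slide illustrates the intuition behind LOF: compare the density around each point with the density around its neighbours. A simplified density-ratio sketch of that intuition (not scikit-learn's exact reachability-distance formula) follows; all names here are hypothetical.

```python
import numpy as np

def simple_lof(X, k=3):
    # Simplified LOF-style score: average density of a point's k nearest
    # neighbours divided by the point's own density. Scores well above 1
    # mean the point is much less dense than its neighbourhood.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)                      # ignore self-distances
    idx = np.argsort(D, axis=1)[:, :k]               # k nearest neighbours
    kdist = np.take_along_axis(D, idx, axis=1).mean(axis=1)
    density = 1.0 / kdist
    return density[idx].mean(axis=1) / density

# Two tight clusters plus one isolated point between them (a local outlier).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [5, 5]], dtype=float)
scores = simple_lof(X)
```

The eight cluster points score close to 1, while the isolated middle point scores several times higher, matching the "local outlier" definition from the previous slide.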

SLIDE 7

Local outlier factor (LoF)

from sklearn.neighbors import LocalOutlierFactor as lof

clf = lof()
y_pred = clf.fit_predict(X)

y_pred[:4]
array([ 1,  1,  1, -1])

clf.negative_outlier_factor_[:4]
array([-0.99, -1.02, -1.08, -0.97])

confusion_matrix(y_pred, ground_truth)
array([[  5,  16],
       [  0, 184]])

SLIDE 8

Local outlier factor (LoF)

clf = lof(contamination=0.02)
y_pred = clf.fit_predict(X)

confusion_matrix(y_pred, ground_truth)
array([[  5,   0],
       [  0, 200]])

SLIDE 9

Who needs labels anyway!

SLIDE 10

Novelty detection

  • Dr. Chris Anagnostopoulos

Honorary Associate Professor

SLIDE 11

One-class classification

  • Training data: without anomalies
  • Future / test data: with anomalies
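This setup can be sketched without any library support, using a Gaussian density as a hypothetical stand-in for a one-class model: fit on the anomaly-free training data, then flag test points whose Mahalanobis distance to that fit is too large.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(0, 1, size=(200, 2))            # training data: no anomalies
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),  # normal test points
                    rng.normal(6, 1, size=(5, 2))])  # 5 anomalies at the end

# Fit a Gaussian to the clean training data.
mu = X_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_train, rowvar=False))

# Squared Mahalanobis distance of each test point to the training fit.
d = np.einsum('ij,jk,ik->i', X_test - mu, cov_inv, X_test - mu)

threshold = 9.21  # chi-squared (2 dof) critical value at p = 0.01
y_pred = np.where(d > threshold, -1, 1)  # -1 = anomaly, scikit-learn's convention
```

The one-class estimators on the following slides play exactly this role, with more flexible notions of "far from the training data".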

SLIDE 12

Novelty LoF

Workaround

preds = lof().fit_predict(np.concatenate([X_train, X_test]))
preds = preds[X_train.shape[0]:]

Novelty LoF

clf = lof(novelty=True)
clf.fit(X_train)
y_pred = clf.predict(X_test)

SLIDE 13

One-class Support Vector Machine

from sklearn.svm import OneClassSVM

clf = OneClassSVM()
clf.fit(X_train)
y_pred = clf.predict(X_test)

y_pred[:4]
array([ 1,  1,  1, -1])

SLIDE 14

One-class Support Vector Machine

clf = OneClassSVM()
clf.fit(X_train)

y_scores = clf.score_samples(X_test)
threshold = np.quantile(y_scores, 0.1)
y_pred = y_scores <= threshold

SLIDE 15

Isolation Forests

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

clf = IsolationForest()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

SLIDE 16

clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

roc_auc_score(y_test, clf_lof.score_samples(X_test))
0.9897
roc_auc_score(y_test, clf_isf.score_samples(X_test))
0.9692
roc_auc_score(y_test, clf_svm.score_samples(X_test))
0.9948

SLIDE 17

clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

accuracy_score(y_test, clf_lof.predict(X_test))
0.9318
accuracy_score(y_test, clf_isf.predict(X_test))
0.9545
accuracy_score(y_test, clf_svm.predict(X_test))
0.5

SLIDE 18

What's new?

SLIDE 19

Distance-based learning

  • Dr. Chris Anagnostopoulos

Honorary Associate Professor

SLIDE 20

Distance and similarity

from sklearn.neighbors import DistanceMetric as dm

dist = dm.get_metric('euclidean')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0.        , 2.82842712, 5.        ],
       [2.82842712, 0.        , 3.60555128],
       [5.        , 3.60555128, 0.        ]])

X = np.matrix(X)
np.sqrt(np.sum(np.square(X[0, :] - X[1, :])))
2.82842712

SLIDE 21

Non-Euclidean Local Outlier Factor

clf = LocalOutlierFactor(novelty=True, metric='chebyshev')
clf.fit(X_train)
y_pred = clf.predict(X_test)

dist = dm.get_metric('chebyshev')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0., 2., 5.],
       [2., 0., 3.],
       [5., 3., 0.]])

SLIDE 22

Are all metrics similar?

Hamming distance matrix:

dist = dm.get_metric('hamming')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0. , 1. , 0.5],
       [1. , 0. , 1. ],
       [0.5, 1. , 0. ]])
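scipy and scikit-learn report Hamming distance as the fraction of coordinates that differ, which is why [0, 1] and [0, 6] above are at distance 0.5: they agree on one of their two entries. A one-function sketch of that convention:

```python
def hamming(u, v):
    # Fraction of coordinates at which the two vectors differ
    # (the convention used by scipy and scikit-learn).
    return sum(a != b for a, b in zip(u, v)) / len(u)
```

Note that the magnitude of a mismatch is irrelevant: differing by 5 counts the same as differing by 1, which is exactly what makes this metric behave so differently from Euclidean distance.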

SLIDE 23

Are all metrics similar?

from scipy.spatial.distance import pdist

X = [[0, 1], [2, 3], [0, 6]]
pdist(X, 'cityblock')
array([4., 5., 5.])

from scipy.spatial.distance import squareform

squareform(pdist(X, 'cityblock'))
array([[0., 4., 5.],
       [4., 0., 5.],
       [5., 5., 0.]])

SLIDE 24

A real-world example

The Hepatitis dataset:

   Class   AGE  SEX  STEROID ...
0    2.0  40.0  0.0      0.0 ...
1    2.0  30.0  0.0      0.0 ...
2    1.0  47.0  0.0      1.0 ...

https://archive.ics.uci.edu/ml/datasets/Hepatitis


SLIDE 25

A real-world example

Euclidean distance:

squareform(pdist(X_hep, 'euclidean'))
[[  0.  127.   64.1]
 [127.    0.  128.2]
 [ 64.1 128.2   0. ]]

1 nearest to 3: wrong class

Hamming distance:

squareform(pdist(X_hep, 'hamming'))
[[0.  0.5 0.7]
 [0.5 0.  0.6]
 [0.7 0.6 0. ]]

1 nearest to 2: right class

SLIDE 26

A bigger toolbox

SLIDE 27

Unstructured data

  • Dr. Chris Anagnostopoulos

Honorary Associate Professor

SLIDE 28

Structured versus unstructured

   Class   AGE  SEX  STEROID ...
0    2.0  50.0  2.0      1.0 ...
1    2.0  40.0  1.0      1.0 ...
...

           label                                           sequence
0          VIRUS  AVTVVPDPTCCGTLSFKVPKDAKKGKHLGTFDIRQAIMDYGGLHSQ...
1  IMMUNE SYSTEM  QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLE...
2  IMMUNE SYSTEM  QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLF...
3          VIRUS  MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIAT...
...

Can we build a detector that flags viruses as anomalous in this data?

SLIDE 29

import stringdist

stringdist.levenshtein('abc', 'acc')
1
stringdist.levenshtein('acc', 'cce')
2

             label   sequence
169  IMMUNE SYSTEM  ILSALVGIV
170  IMMUNE SYSTEM  ILSALVGIL

stringdist.levenshtein('ILSALVGIV', 'ILSALVGIL')
1
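stringdist is a third-party package, but the Levenshtein distance itself is a short dynamic program, so a pure-Python sketch (useful when the package is not installed) looks like:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic programme: prev[j] holds the distance
    # between the prefix a[:i-1] and the prefix b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

It reproduces the values above: 'abc' to 'acc' is one substitution, and the two nine-letter protein sequences differ only in their final residue.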

SLIDE 30

Some debugging

# This won't work
pdist(proteins['sequence'].iloc[:3], metric=stringdist.levenshtein)
Traceback (most recent call last):
ValueError: A 2-dimensional array must be passed.

SLIDE 31

Some debugging

sequences = np.array(proteins['sequence'].iloc[:3]).reshape(-1, 1)

# This won't work for a different reason
pdist(sequences, metric=stringdist.levenshtein)
Traceback (most recent call last):
TypeError: argument 1 must be str, not numpy.ndarray

SLIDE 32

Some debugging

# This one works!!
def my_levenshtein(x, y):
    return stringdist.levenshtein(x[0], y[0])

pdist(sequences, metric=my_levenshtein)
array([136.,   2., 136.])

SLIDE 33

Protein outliers with precomputed matrices

# This takes 2 minutes for about 1000 examples.
# pdist returns a condensed vector; metric='precomputed' needs
# the full square matrix, hence the squareform call.
M = squareform(pdist(sequences, my_levenshtein))

LoF detector with a precomputed distance matrix:

# This takes 3 seconds
detector = lof(metric='precomputed', contamination=0.1)
preds = detector.fit_predict(M)

roc_auc_score(proteins['label'] == 'VIRUS', preds == -1)
0.64

SLIDE 34

Pick your distance

SLIDE 35

Concluding remarks

  • Dr. Chris Anagnostopoulos

Honorary Associate Professor

SLIDE 36

Concluding remarks

  • Refresher of supervised learning pipelines: feature engineering, modeling, model selection
  • Risks of overfitting
  • Data fusion
  • Noisy labels and heuristics
  • Loss functions: costs of false positives vs. costs of false negatives

SLIDE 37

Concluding remarks

  • Unsupervised learning: anomaly detection, novelty detection, distance metrics, unstructured data
  • Real-world use cases: cybersecurity, healthcare, retail banking

SLIDE 38

Congratulations!