Anomaly detection
DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON
- Dr. Chris Anagnostopoulos
Honorary Associate Professor
Supervised vs. unsupervised
One of the two classes is very rare
An extreme case of dataset shift
Examples:
- cybersecurity
- fraud detection
- anti-money laundering
- fault detection
How to fit an algorithm without labels? How to estimate its performance?
Careful use of a handful of labels:
- too few for training without overfitting
- just enough for model selection
- the price: we give up an unbiased estimate of accuracy
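The model-selection idea can be sketched as follows; the synthetic two-cluster data, the candidate `contamination` grid, and the variable names are illustrative assumptions, not from the course. The detector is fit without labels, and the handful of labels is spent only on picking `contamination`:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # normal points
               rng.normal(6, 1, size=(5, 2))])    # a few anomalies
y = np.array([0] * 200 + [1] * 5)                 # the handful of labels

best_c, best_auc = None, -1.0
for c in [0.01, 0.02, 0.05, 0.1]:                 # assumed candidate grid
    clf = LocalOutlierFactor(contamination=c)
    preds = clf.fit_predict(X)                    # -1 = outlier, 1 = inlier
    auc = roc_auc_score(y, preds == -1)           # labels used only to score
    if auc > best_auc:
        best_c, best_auc = c, auc

print(best_c, round(best_auc, 3))
```

Because the same labels picked the winner, the winning AUC is an optimistic estimate, which is exactly the trade-off noted above.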
Outlier: a datapoint that lies outside the range of the majority of the data
Local outlier: a datapoint that lies in an isolated region, with no other data nearby
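A minimal sketch of the distinction, using assumed synthetic data: the point at the origin sits well inside the overall range of the data, so a simple range check would miss it, yet `LocalOutlierFactor` flags it because its neighborhood is empty:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
cluster_a = rng.normal(-5, 0.5, size=(100, 2))
cluster_b = rng.normal(5, 0.5, size=(100, 2))
local_outlier = np.array([[0.0, 0.0]])   # inside the global range, far from both clusters
X = np.vstack([cluster_a, cluster_b, local_outlier])

clf = LocalOutlierFactor(n_neighbors=20)
preds = clf.fit_predict(X)
print(preds[-1])   # the isolated point is flagged despite unremarkable coordinates
```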
from sklearn.neighbors import LocalOutlierFactor as lof

clf = lof()
y_pred = clf.fit_predict(X)

y_pred[:4]
array([ 1,  1,  1, -1])

clf.negative_outlier_factor_[:4]
array([-0.99, -1.02, -1.08, -0.97])

confusion_matrix(y_pred, ground_truth)
array([[  5,  16],
       [  0, 184]])
clf = lof(contamination=0.02)
y_pred = clf.fit_predict(X)

confusion_matrix(y_pred, ground_truth)
array([[  5,   0],
       [  0, 200]])
Training data without anomalies:
Future / test data with anomalies:
Workaround
preds = lof().fit_predict(
    np.concatenate([X_train, X_test]))
preds = preds[X_train.shape[0]:]
Novelty LoF
clf = lof(novelty=True)
clf.fit(X_train)
y_pred = clf.predict(X_test)
clf = OneClassSVM()
clf.fit(X_train)
y_pred = clf.predict(X_test)

y_pred[:4]
array([ 1,  1,  1, -1])
clf = OneClassSVM()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)
threshold = np.quantile(y_scores, 0.1)
y_pred = y_scores <= threshold
clf = IsolationForest()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)
y_scores = clf.score_samples(X_test)
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

roc_auc_score(y_test, clf_lof.score_samples(X_test))
0.9897
roc_auc_score(y_test, clf_isf.score_samples(X_test))
0.9692
roc_auc_score(y_test, clf_svm.score_samples(X_test))
0.9948
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

accuracy_score(y_test, clf_lof.predict(X_test))
0.9318
accuracy_score(y_test, clf_isf.predict(X_test))
0.9545
accuracy_score(y_test, clf_svm.predict(X_test))
0.5
from sklearn.neighbors import DistanceMetric as dm

dist = dm.get_metric('euclidean')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0.        , 2.82842712, 5.        ],
       [2.82842712, 0.        , 3.60555128],
       [5.        , 3.60555128, 0.        ]])

X = np.array(X)
np.sqrt(np.sum(np.square(X[0, :] - X[1, :])))
2.82842712
clf = LocalOutlierFactor(
    novelty=True, metric='chebyshev')
clf.fit(X_train)
y_pred = clf.predict(X_test)

dist = dm.get_metric('chebyshev')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0., 2., 5.],
       [2., 0., 3.],
       [5., 3., 0.]])
Hamming distance matrix:
dist = dm.get_metric('hamming')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0. , 1. , 0.5],
       [1. , 0. , 1. ],
       [0.5, 1. , 0. ]])
from scipy.spatial.distance import pdist

X = [[0, 1], [2, 3], [0, 6]]
pdist(X, 'cityblock')
array([4., 5., 5.])

from scipy.spatial.distance import squareform

squareform(pdist(X, 'cityblock'))
array([[0., 4., 5.],
       [4., 0., 5.],
       [5., 5., 0.]])
The Hepatitis dataset:
   Class   AGE  SEX  STEROID  ...
0    2.0  40.0  0.0      0.0  ...
1    2.0  30.0  0.0      0.0  ...
2    1.0  47.0  0.0      1.0  ...

https://archive.ics.uci.edu/ml/datasets/Hepatitis
Euclidean distance:
squareform(pdist(X_hep, 'euclidean'))
[[  0.  127.   64.1]
 [127.    0.  128.2]
 [ 64.1 128.2   0. ]]
Row 1 is nearest to row 3: wrong class.

Hamming distance:
squareform(pdist(X_hep, 'hamming'))
[[0.  0.5 0.7]
 [0.5 0.  0.6]
 [0.7 0.6 0. ]]
Row 1 is nearest to row 2: right class.
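One common remedy not shown here, offered as a sketch with made-up rows standing in for the Hepatitis excerpt: standardize the features first, so AGE-sized gaps no longer dominate the Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler

# Hypothetical rows mimicking the excerpt: AGE dwarfs the binary columns
X_hep = np.array([[40.0, 0.0, 0.0],
                  [30.0, 0.0, 0.0],
                  [47.0, 0.0, 1.0]])

raw = squareform(pdist(X_hep, 'euclidean'))
scaled = squareform(pdist(StandardScaler().fit_transform(X_hep), 'euclidean'))

# Unscaled, row 0 is nearest row 2 (the AGE gap of 7 beats the AGE gap of 10,
# and the binary difference barely registers); after scaling, the binary
# feature counts for as much as AGE and row 0 is nearest row 1 instead.
print(raw[0])
print(scaled[0])
```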
   Class   AGE  SEX  STEROID  ...
0    2.0  50.0  2.0      1.0  ...
1    2.0  40.0  1.0      1.0  ...
...

           label                                           sequence
0          VIRUS  AVTVVPDPTCCGTLSFKVPKDAKKGKHLGTFDIRQAIMDYGGLHSQ...
1  IMMUNE SYSTEM  QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLE...
2  IMMUNE SYSTEM  QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLF...
3          VIRUS  MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIAT...
...
Can we build a detector that flags viruses as anomalous in this data?
import stringdist

stringdist.levenshtein('abc', 'acc')
1
stringdist.levenshtein('acc', 'cce')
2

             label   sequence
169  IMMUNE SYSTEM  ILSALVGIV
170  IMMUNE SYSTEM  ILSALVGIL

stringdist.levenshtein('ILSALVGIV', 'ILSALVGIL')
1
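`stringdist` is a third-party package; if it is not installed, a minimal dynamic-programming Levenshtein (a sketch, not the course's code) produces the same counts:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))            # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                   # delete ca
                            curr[j - 1] + 1,               # insert cb
                            prev[j - 1] + (ca != cb)))     # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein('abc', 'acc'))                  # 1
print(levenshtein('acc', 'cce'))                  # 2
print(levenshtein('ILSALVGIV', 'ILSALVGIL'))      # 1
```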
# This won't work
pdist(proteins['sequence'].iloc[:3],
      metric=stringdist.levenshtein)

Traceback (most recent call last):
ValueError: A 2-dimensional array must be passed.
sequences = np.array(
    proteins['sequence'].iloc[:3]).reshape(-1, 1)

# This won't work for a different reason
pdist(sequences, metric=stringdist.levenshtein)

Traceback (most recent call last):
TypeError: argument 1 must be str, not numpy.ndarray
# This one works!
def my_levenshtein(x, y):
    return stringdist.levenshtein(x[0], y[0])

pdist(sequences, metric=my_levenshtein)
array([136.,   2., 136.])
# This takes about 2 minutes for roughly 1,000 examples
M = squareform(pdist(sequences, my_levenshtein))
LoF detector with a precomputed distance matrix:
# This takes 3 seconds
detector = lof(metric='precomputed', contamination=0.1)
preds = detector.fit_predict(M)

roc_auc_score(proteins['label'] == 'VIRUS', preds == -1)
0.64
Refresher of supervised learning pipelines:
- feature engineering
- model fitting
- model selection
Risks of overfitting
Data fusion
Noisy labels and heuristics
Loss functions: costs of false positives vs. costs of false negatives
Unsupervised learning:
- anomaly detection
- novelty detection
- distance metrics
- unstructured data
Real-world use cases:
- cybersecurity
- healthcare
- retail banking