Anomaly detection
Designing Machine Learning Workflows in Python (PowerPoint PPT Presentation)



  1. Anomaly detection
     DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON
     Dr. Chris Anagnostopoulos, Honorary Associate Professor

  2. Anomalies and outliers
     Supervised
     Unsupervised

  3. Anomalies and outliers
     One of the two classes is very rare: an extreme case of dataset shift.
     Examples: cybersecurity, fraud detection, anti-money laundering, fault detection.

  4. Unsupervised workflows
     Careful use of a handful of labels:
       too few for training without overfitting
       just enough for model selection
       you must drop the unbiased estimate of accuracy
     How to fit an algorithm without labels? How to estimate its performance?

  5. Outlier: a datapoint that lies outside the range of the majority of the data.
     Local outlier: a datapoint that lies in an isolated region, without other data nearby.

  6. Local outlier factor (LoF)

  7. Local outlier factor (LoF)

     from sklearn.neighbors import LocalOutlierFactor as lof

     clf = lof()
     y_pred = clf.fit_predict(X)
     y_pred[:4]
     array([ 1,  1,  1, -1])

     clf.negative_outlier_factor_[:4]
     array([-0.99, -1.02, -1.08, -0.97])

     confusion_matrix(y_pred, ground_truth)
     array([[  5,  16],
            [  0, 184]])

  8. Local outlier factor (LoF)

     clf = lof(contamination=0.02)
     y_pred = clf.fit_predict(X)
     confusion_matrix(y_pred, ground_truth)
     array([[  5,   0],
            [  0, 200]])
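The dataset X and ground truth used on slides 7-8 are not shown. A minimal runnable sketch of the same contamination pattern on synthetic data (the cluster layout and planted outliers are my own choices for illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 200 inliers around the origin plus 5 planted outliers far away
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(6, 10, size=(5, 2))])

# contamination sets the threshold on the LoF scores, i.e. roughly
# what fraction of points gets flagged as outliers (-1)
clf = LocalOutlierFactor(contamination=0.02)
y_pred = clf.fit_predict(X)
flagged = np.where(y_pred == -1)[0]
print(flagged)  # the flagged indices should be among the planted points (>= 200)
```

With contamination=0.02 only about 2% of the 205 points are flagged, and those are the planted ones, mirroring the cleaner confusion matrix on slide 8.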

  9. Who needs labels anyway!

 10. Novelty detection

 11. One-class classification
     Training data without anomalies; future/test data with anomalies.

 12. Novelty LoF

     Workaround:
         preds = lof().fit_predict(
             np.concatenate([X_train, X_test]))
         preds = preds[X_train.shape[0]:]

     Novelty LoF:
         clf = lof(novelty=True)
         clf.fit(X_train)
         y_pred = clf.predict(X_test)
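The two variants above are equivalent in spirit; novelty=True is the supported way to score unseen data after fitting on clean data. A small self-contained check, with the training and test sets made up here for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X_train = rng.normal(0, 1, size=(100, 2))             # clean training data
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),
                    rng.normal(10, 1, size=(5, 2))])  # 5 novelties appended

# novelty=True disables fit_predict but enables predict() on new data
clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)
y_pred = clf.predict(X_test)
print(y_pred[-5:])  # the five planted novelties come out as -1
```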

 13. One-class Support Vector Machine

     clf = OneClassSVM()
     clf.fit(X_train)
     y_pred = clf.predict(X_test)
     y_pred[:4]
     array([ 1,  1,  1, -1])

 14. One-class Support Vector Machine

     clf = OneClassSVM()
     clf.fit(X_train)
     y_scores = clf.score_samples(X_test)
     threshold = np.quantile(y_scores, 0.1)
     y_pred = y_scores <= threshold
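score_samples returns a continuous score (higher means more normal), and the quantile turns it into a decision rule that flags the lowest-scoring 10% of the test set. A runnable sketch on synthetic data (the data is my own, not the course's):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, size=(200, 2))
X_test = rng.normal(0, 1, size=(100, 2))

clf = OneClassSVM()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)    # higher score = more normal
threshold = np.quantile(y_scores, 0.1)  # 10th percentile of the scores
y_pred = y_scores <= threshold          # True marks the flagged points
print(y_pred.sum())  # about 10 of the 100 test points are flagged
```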

 15. Isolation Forests

     clf = IsolationForest()
     clf.fit(X_train)
     y_scores = clf.score_samples(X_test)

     clf = LocalOutlierFactor(novelty=True)
     clf.fit(X_train)
     y_scores = clf.score_samples(X_test)
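Both detectors expose the same score_samples interface, which is what makes them easy to swap. A tiny sanity check with IsolationForest on synthetic data (the layout and planted anomaly are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
X_train = rng.normal(0, 1, size=(200, 2))
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),
                    [[8.0, 8.0]]])        # one planted anomaly at the end

clf = IsolationForest(random_state=0)
clf.fit(X_train)
y_scores = clf.score_samples(X_test)  # lower score = more anomalous
print(np.argmin(y_scores))            # index 20: the planted point scores lowest
```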

 16. clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
     clf_isf = IsolationForest().fit(X_train)
     clf_svm = OneClassSVM().fit(X_train)

     roc_auc_score(y_test, clf_lof.score_samples(X_test))
     0.9897
     roc_auc_score(y_test, clf_isf.score_samples(X_test))
     0.9692
     roc_auc_score(y_test, clf_svm.score_samples(X_test))
     0.9948

 17. clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
     clf_isf = IsolationForest().fit(X_train)
     clf_svm = OneClassSVM().fit(X_train)

     accuracy_score(y_test, clf_lof.predict(X_test))
     0.9318
     accuracy_score(y_test, clf_isf.predict(X_test))
     0.9545
     accuracy_score(y_test, clf_svm.predict(X_test))
     0.5
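The gap between slides 16 and 17 (OneClassSVM has the best AUC yet 0.5 accuracy) is a thresholding effect: score_samples ranks points well, but predict applies a default cutoff (driven by the nu parameter, 0.5 by default) that flags far too many points. A sketch reproducing the effect on synthetic data; the exact numbers here will differ from the slides:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(3)
X_train = rng.normal(0, 1, size=(200, 2))
X_test = np.vstack([rng.normal(0, 1, size=(90, 2)),
                    rng.normal(6, 1, size=(10, 2))])
y_test = np.r_[np.ones(90), -np.ones(10)]   # +1 = inlier, -1 = outlier

clf = OneClassSVM().fit(X_train)            # default nu=0.5
auc = roc_auc_score(y_test, clf.score_samples(X_test))
acc = accuracy_score(y_test, clf.predict(X_test))
print(auc, acc)  # ranking is near-perfect; accuracy is much worse
```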

 18. What's new?

 19. Distance-based learning

 20. Distance and similarity

     from sklearn.neighbors import DistanceMetric as dm

     dist = dm.get_metric('euclidean')
     X = [[0, 1], [2, 3], [0, 6]]
     dist.pairwise(X)
     array([[0.        , 2.82842712, 5.        ],
            [2.82842712, 0.        , 3.60555128],
            [5.        , 3.60555128, 0.        ]])

     X = np.matrix(X)
     np.sqrt(np.sum(np.square(X[0, :] - X[1, :])))
     2.82842712

 21. Non-Euclidean Local Outlier Factor

     clf = LocalOutlierFactor(
         novelty=True, metric='chebyshev')
     clf.fit(X_train)
     y_pred = clf.predict(X_test)

     dist = dm.get_metric('chebyshev')
     X = [[0, 1], [2, 3], [0, 6]]
     dist.pairwise(X)
     array([[0., 2., 5.],
            [2., 0., 3.],
            [5., 3., 0.]])

 22. Are all metrics similar?
     Hamming distance matrix:

     dist = dm.get_metric('hamming')
     X = [[0, 1], [2, 3], [0, 6]]
     dist.pairwise(X)
     array([[0. , 1. , 0.5],
            [1. , 0. , 1. ],
            [0.5, 1. , 0. ]])

 23. Are all metrics similar?

     from scipy.spatial.distance import pdist
     X = [[0, 1], [2, 3], [0, 6]]
     pdist(X, 'cityblock')
     array([4., 5., 5.])

     from scipy.spatial.distance import squareform
     squareform(pdist(X, 'cityblock'))
     array([[0., 4., 5.],
            [4., 0., 5.],
            [5., 5., 0.]])

 24. A real-world example
     The Hepatitis dataset:

        Class   AGE  SEX  STEROID  ...
     0    2.0  40.0  0.0      0.0  ...
     1    2.0  30.0  0.0      0.0  ...
     2    1.0  47.0  0.0      1.0  ...

     [1] https://archive.ics.uci.edu/ml/datasets/Hepatitis

 25. A real-world example

     Euclidean distance:
     squareform(pdist(X_hep, 'euclidean'))
     [[  0.  127.   64.1]
      [127.    0.  128.2]
      [ 64.1 128.2    0. ]]
     1 nearest to 3: wrong class

     Hamming distance:
     squareform(pdist(X_hep, 'hamming'))
     [[0.  0.5 0.7]
      [0.5 0.  0.6]
      [0.7 0.6 0. ]]
     1 nearest to 2: right class
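The effect on slides 24-25 is easy to reproduce: with one large-scale column (AGE) and small categorical columns, Euclidean distance is dominated by AGE, while Hamming only counts mismatched attributes. A toy stand-in for the three rows (values made up to echo slide 24, not the real Hepatitis data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Columns: AGE, SEX, STEROID (toy values echoing slide 24)
X = np.array([[40.0, 0.0, 0.0],
              [30.0, 0.0, 0.0],
              [47.0, 0.0, 1.0]])

D_euc = squareform(pdist(X, 'euclidean'))
D_ham = squareform(pdist(X, 'hamming'))

# Nearest neighbour of row 0 under each metric (index 0 is itself)
print(np.argsort(D_euc[0])[1])  # Euclidean picks row 2: dominated by AGE
print(np.argsort(D_ham[0])[1])  # Hamming picks row 1: counts mismatched columns
```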

 26. A bigger toolbox

 27. Unstructured data

 28. Structured versus unstructured

        Class   AGE  SEX  STEROID  ...
     0    2.0  50.0  2.0      1.0  ...
     1    2.0  40.0  1.0      1.0  ...
     ...

              label                                            sequence
     0          VIRUS  AVTVVPDPTCCGTLSFKVPKDAKKGKHLGTFDIRQAIMDYGGLHSQ...
     1  IMMUNE SYSTEM  QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLE...
     2  IMMUNE SYSTEM  QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLF...
     3          VIRUS  MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIAT...
     ...

     Can we build a detector that flags viruses as anomalous in this data?

 29. import stringdist

     stringdist.levenshtein('abc', 'acc')
     1
     stringdist.levenshtein('acc', 'cce')
     2

              label   sequence
     169  IMMUNE SYSTEM  ILSALVGIV
     170  IMMUNE SYSTEM  ILSALVGIL

     stringdist.levenshtein('ILSALVGIV', 'ILSALVGIL')
     1
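stringdist is a third-party package; the same edit distance can be computed in a few lines of standard Python. A sketch of the classic dynamic-programming algorithm (my own implementation, not the course's code), reproducing the values on the slide:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))      # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                      # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

print(levenshtein('abc', 'acc'))              # 1
print(levenshtein('acc', 'cce'))              # 2
print(levenshtein('ILSALVGIV', 'ILSALVGIL'))  # 1
```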

 30. Some debugging

     # This won't work
     pdist(proteins['sequence'].iloc[:3], metric=stringdist.levenshtein)
     Traceback (most recent call last):
     ValueError: A 2-dimensional array must be passed.

 31. Some debugging

     sequences = np.array(proteins['sequence'].iloc[:3]).reshape(-1, 1)

     # This won't work for a different reason
     pdist(sequences, metric=stringdist.levenshtein)
     Traceback (most recent call last):
     TypeError: argument 1 must be str, not numpy.ndarray

 32. Some debugging

     # This one works!
     def my_levenshtein(x, y):
         return stringdist.levenshtein(x[0], y[0])

     pdist(sequences, metric=my_levenshtein)
     array([136.,   2., 136.])

 33. Protein outliers with precomputed matrices

     # This takes 2 minutes for about 1000 examples
     # pdist returns a condensed vector; metric='precomputed' needs
     # the full square matrix, hence squareform
     M = squareform(pdist(sequences, my_levenshtein))

     LoF detector with a precomputed distance matrix:

     # This takes 3 seconds
     detector = lof(metric='precomputed', contamination=0.1)
     preds = detector.fit_predict(M)

     roc_auc_score(proteins['label'] == 'VIRUS', preds == -1)
     0.64
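The precomputed pattern works with any pairwise distance, not just Levenshtein. A self-contained sketch with cheap Euclidean distances standing in for the slow string-distance matrix (the data and planted outlier are synthetic assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               [[9.0, 9.0]]])               # one planted outlier at the end

# Expand the condensed pdist output into the square matrix that
# metric='precomputed' expects
M = squareform(pdist(X, 'euclidean'))
detector = LocalOutlierFactor(metric='precomputed', contamination=0.02)
preds = detector.fit_predict(M)
print(preds[-1])  # -1: the planted point is flagged
```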

 34. Pick your distance

 35. Concluding remarks

 36. Concluding remarks
     Refresher of supervised learning pipelines:
       feature engineering
       model fitting
       model selection
     Risks of overfitting
     Data fusion
     Noisy labels and heuristics
     Loss functions: costs of false positives vs costs of false negatives

 37. Concluding remarks
     Unsupervised learning:
       anomaly detection
       novelty detection
       distance metrics
       unstructured data
     Real-world use cases:
       cybersecurity
       healthcare
       retail banking

 38. Congratulations!
