
Introduction in ML with scikit-learn, Professor Patrick McDaniel


  1. Introduction in ML with scikit-learn
     Professor Patrick McDaniel
     Jonathan Price
     Fall 2015

  2. Features
     • Attributes in a data set
     • “Individual measurable property of phenomenon being observed”
     • Choosing/discovering features is a crucial part of ML
     • Ex:
       ‣ Character Recognition: histograms of pixels
       ‣ Speech Recognition: sound length, power, frequency
       ‣ Malware Detection: function use counts, byte counts
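The character-recognition example above can be made concrete. This is a minimal sketch, assuming a grayscale image stored as a NumPy array with pixel values from 0 to 16; the helper name `pixel_histogram`, the bin count, and the toy image are illustrative assumptions, not something taken from the slides.

```python
import numpy as np


def pixel_histogram(image, bins=4, value_range=(0, 16)):
    """Turn an image into a fixed-length feature vector of intensity fractions.

    Hypothetical helper: counts how many pixels fall into each intensity bin
    and normalizes by the number of pixels.
    """
    counts, _ = np.histogram(image, bins=bins, range=value_range)
    return counts / image.size


# Toy 4 x 4 "image" (made-up values, 0 = background, 16 = full ink).
image = np.array([
    [0, 0, 16, 16],
    [0, 8, 8, 16],
    [0, 0, 0, 16],
    [4, 4, 12, 12],
])

features = pixel_histogram(image)
print(features)  # one number per intensity bin
```

Whatever the image size, the result always has `bins` entries, which is what lets a learning algorithm compare one sample with another.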

  3. Supervised Learning
     • Inferring a function from labeled training data
     • The features are selected by the developer
     • As such, it requires the developer to know something about the dataset to infer good features
     • Based on pairs of input objects and output values
     • Ex:
       ‣ Regression – Predict values
       ‣ Classification – Predict groupings
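The two examples on this slide can be sketched with scikit-learn. The estimators (`LinearRegression`, `KNeighborsClassifier`) and the toy data are assumptions chosen for brevity; the slides do not prescribe either one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Regression: each input is paired with a continuous output value.
# Toy data generated from y = 2x + 1.
X_reg = np.array([[0.0], [1.0], [2.0], [3.0]])
y_reg = np.array([1.0, 3.0, 5.0, 7.0])
reg = LinearRegression().fit(X_reg, y_reg)
predicted_value = reg.predict(np.array([[5.0]]))[0]

# Classification: each input is paired with a group label (0 or 1).
X_clf = np.array([[0.0], [1.0], [10.0], [11.0]])
y_clf = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=1).fit(X_clf, y_clf)
predicted_groups = knn.predict(np.array([[2.0], [9.0]]))

print(predicted_value)   # a value on the fitted line
print(predicted_groups)  # one label per query point
```

Both estimators follow the same pattern: `fit` on labeled pairs, then `predict` on new inputs.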

  4. Unsupervised Learning
     • Find hidden structure or patterns in unlabeled data
     • Requires no prior knowledge of the nature of data
     • Not limited by biases inherent in feature selection
     • Ex:
       ‣ K-means
       ‣ Clustering
       ‣ Neural networks (some variants, e.g. autoencoders)
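As a rough illustration of finding structure without labels, the sketch below runs K-means on six made-up 2-D points. The data, the choice of two clusters, and the `random_state` are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six unlabeled points forming two well-separated groups (toy data).
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# No targets are passed: the algorithm only sees the inputs.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # one center per discovered group
```

The cluster indices themselves are arbitrary (which group is called 0 or 1 can vary); only the grouping is meaningful.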

  5. Scikit-learn
     • The easy way to do data mining and data analysis
     • It's all Python (yay)
     • Built on NumPy, SciPy, and matplotlib
     • Okay, let's get it:
       ‣ pip install numpy scipy scikit-learn

  6. Let's Do One
     • Digit classification problem
     • Classify images of hand-drawn digits

  7. Before We Start
     • What properties of a character's image can we use as features to solve this problem?

  8. Dataset
     • A dataset object in scikit-learn is a dictionary-like object that holds all the data (and some metadata).
     • The actual data is stored as an (n_samples, n_features) array
     • Let's get the digit dataset:
       >>> from sklearn import datasets
       >>> digits = datasets.load_digits()
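A quick way to see the dictionary-like structure is to inspect the loaded object. This is a small sketch; the variable names `n_samples` and `n_features` are ours.

```python
from sklearn import datasets

digits = datasets.load_digits()

# The object behaves like a dictionary: data and metadata live under keys.
print(sorted(digits.keys()))

# The samples themselves are one 2-D array: rows are samples, columns are features.
n_samples, n_features = digits.data.shape
print(n_samples, n_features)

# The labels (the digit each sample represents) are a parallel 1-D array.
print(digits.target.shape)
```

The same entries are also reachable as attributes, which is why the slides write `digits.data` and `digits.target`.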

  9. Dataset (image-only slide)

  10. Dataset
      • “digit database by collecting 250 samples from 44 writers. The samples written by 30 writers are used for training, cross-validation and writer dependent testing, and the digits written by the other 14 are used for writer independent testing”
      • 500 x 500 pixel characters, compressed to form this (and then a feature vector of length 64):
      • As shipped with scikit-learn, load_digits() returns 1,797 samples, each an 8 x 8 image flattened into a feature vector of length 64
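The link between a small image and its length-64 feature vector can be checked directly on the data that `load_digits()` returns. A minimal sketch; the variable names are ours.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

first_image = digits.images[0]  # 2-D array of pixel intensities
first_vector = digits.data[0]   # the same sample as a flat feature vector

print(first_image.shape)
print(first_vector.shape)

# Flattening the image row by row reproduces the feature vector.
same_pixels = np.array_equal(first_image.ravel(), first_vector)
print(same_pixels)
```

In other words, each feature here is simply one pixel intensity; no hand-crafted features are involved at this stage.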

  11. Let's Do Some Estimating
      • We’re going to use support vector classification (SVC). We’ll explain later.
      • This code sets up the classifier clf:
        >>> from sklearn import svm
        >>> clf = svm.SVC(gamma=0.001, C=100.)
      • We will also treat this as a black box and come back to the gamma/C values later
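For readers who want a first look inside the black box: in scikit-learn's `SVC`, `gamma` is the kernel coefficient for the default RBF kernel, and `C` is the regularization parameter (smaller values mean stronger regularization). One hedged way to see their effect is to compare cross-validated accuracy for a few settings. The candidate `gamma` values and the 3-fold setting below are arbitrary assumptions, not recommendations from the slides.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

# Try a few gamma values around the one used on the slide (illustrative choices).
results = {}
for gamma in (0.0001, 0.001, 0.01):
    clf = svm.SVC(gamma=gamma, C=100.)
    # 3-fold cross-validation: train on two thirds, score on the remaining third.
    scores = cross_val_score(clf, digits.data, digits.target, cv=3)
    results[gamma] = scores.mean()

for gamma, mean_accuracy in results.items():
    print(gamma, round(mean_accuracy, 3))
```

The same loop could vary `C` instead; scikit-learn's `GridSearchCV` automates this kind of search.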

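  12. Fit And Predict
      • To fit the classifier (on every sample except the last):
        >>> clf.fit(digits.data[:-1], digits.target[:-1])
      • Now, we predict! (predict expects a 2-D array, so the last sample is sliced out as a single row)
        >>> clf.predict(digits.data[-1:])
        array([8])
      • Which is apparently the digit image shown before (image not reproduced)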
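Predicting a single held-out sample says little about how well the classifier works overall. A common next step, sketched here under our own assumptions (a 25% test split and a fixed `random_state`, neither of which comes from the slides), is to hold out a larger test set and measure accuracy on it.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the samples; the classifier never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# score() returns the fraction of test samples classified correctly.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```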

  13. It's (Sort of) That Easy!
      • We glossed over a couple of details, but this shows how easy scikit-learn makes the actual implementation
      • Let's talk about some of the concepts we skipped over earlier

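  14. SVCs
      • We are NOT going into implementation details.
      • Support vector machines are used for classification, regression, and detecting outliers
      • Advantages:
        ‣ Work in high-dimensional spaces
        ‣ Memory efficient (the decision function uses only a subset of the training points, the support vectors)
        ‣ Versatile (different kernel functions can be specified)
      • Disadvantages:
        ‣ Prone to overfitting when # of features is much greater than # of samples
        ‣ Don’t directly provide probability estimates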
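The last disadvantage can be seen in code. What a fitted `SVC` offers by default is `decision_function`, which returns one score per class rather than probabilities. A minimal sketch, reusing the classifier settings from the earlier slides:

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

# One row per sample, one column per class. These are signed scores,
# not probabilities: they are not confined to [0, 1] and need not sum to 1.
scores = clf.decision_function(digits.data[-1:])
print(scores.shape)
print(scores)
```

If calibrated probabilities are needed, they have to be obtained through an extra calibration step on top of these scores.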

  15. SVC: Graphically (image-only slide)

  16. Next Week
      • Next, we will go over a security usage of data analysis: a malware classification Kaggle challenge from Microsoft
      • See the course site for supplemental readings and setup instructions
