SLIDE 1

Unsupervised Learning

UNSUPERVISED LEARNING IN PYTHON

Benjamin Wilson

Director of Research at lateral.io

SLIDE 2

Unsupervised learning

Unsupervised learning finds patterns in data
  • E.g., clustering customers by their purchases
  • Compressing the data using purchase patterns (dimension reduction)

SLIDE 3

Supervised vs unsupervised learning

Supervised learning finds patterns for a prediction task
  • E.g., classify tumors as benign or cancerous (labels)
Unsupervised learning finds patterns in data
  • ... but without a specific prediction task in mind

SLIDE 4

Iris dataset

Measurements of many iris plants
Three species of iris:
  • setosa
  • versicolor
  • virginica
Petal length, petal width, sepal length, sepal width (the features of the dataset)

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

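The dataset referenced above ships with scikit-learn itself, so a minimal loading sketch looks like this (variable names are illustrative):

```python
from sklearn.datasets import load_iris

# Load the iris dataset bundled with scikit-learn
iris = load_iris()
samples = iris.data                 # 2D array: 150 samples x 4 features

print(samples.shape)                # (150, 4)
print(list(iris.target_names))     # ['setosa', 'versicolor', 'virginica']
```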

SLIDE 5

Arrays, features & samples

2D NumPy array
  • Columns are measurements (the features)
  • Rows represent iris plants (the samples)

SLIDE 6

Iris data is 4-dimensional

Iris samples are points in 4-dimensional space
  • Dimension = number of features
  • Dimension too high to visualize!
  • ... but unsupervised learning gives insight

SLIDE 7

k-means clustering

Finds clusters of samples
Number of clusters must be specified
Implemented in sklearn ("scikit-learn")

SLIDE 8

print(samples)
[[ 5.   3.3  1.4  0.2]
 [ 5.   3.5  1.3  0.3]
 ...
 [ 7.2  3.2  6.   1.8]]

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
KMeans(algorithm='auto', ...)
labels = model.predict(samples)
print(labels)
[0 0 1 1 0 1 2 1 0 1 ...]

SLIDE 9

Cluster labels for new samples

New samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the "centroids")
  • Finds the nearest centroid to each new sample

SLIDE 10

Cluster labels for new samples

print(new_samples)
[[ 5.7  4.4  1.5  0.4]
 [ 6.5  3.   5.5  1.8]
 [ 5.8  2.7  5.1  1.9]]

new_labels = model.predict(new_samples)
print(new_labels)
[0 2 1]
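Under the hood, predict() just measures distances to the stored centroids. A small sketch on synthetic 2-D data (not the iris samples) confirming that the nearest centroid in cluster_centers_ determines the label:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs, for illustration only
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(0, 0.3, (20, 2)),
                     rng.normal(3, 0.3, (20, 2))])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(samples)

# predict() assigns each new sample to a cluster
new_samples = np.array([[0.1, -0.2], [2.9, 3.1]])
labels = model.predict(new_samples)

# The same assignment computed by hand: distance from each new sample
# to every stored centroid, then take the closest one
dists = np.linalg.norm(
    new_samples[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
manual = dists.argmin(axis=1)
print(np.array_equal(labels, manual))  # True
```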

SLIDE 11

Scatter plots

Scatter plot of sepal length vs. petal length
Each point represents an iris sample
Color points by cluster labels
PyPlot (matplotlib.pyplot)

SLIDE 12

Scatter plots

import matplotlib.pyplot as plt
xs = samples[:, 0]
ys = samples[:, 2]
plt.scatter(xs, ys, c=labels)
plt.show()

SLIDE 13

Let's practice!


SLIDE 14

Evaluating a clustering


Benjamin Wilson

Director of Research at lateral.io

SLIDE 15

Evaluating a clustering

Can check correspondence with e.g. iris species
  • ... but what if there are no species to check against?
Measure quality of a clustering
  • Informs choice of how many clusters to look for

SLIDE 16

Iris: clusters vs species

k-means found 3 clusters amongst the iris samples Do the clusters correspond to the species?

species  setosa  versicolor  virginica
labels
0             0           2         36
1            50           0          0
2             0          48         14

SLIDE 17

Cross tabulation with pandas

Clusters vs species is a "cross-tabulation"
Use the pandas library
Given the species of each sample as a list, species

print(species)
['setosa', 'setosa', 'versicolor', 'virginica', ... ]

SLIDE 18

Aligning labels and species

import pandas as pd
df = pd.DataFrame({'labels': labels, 'species': species})
print(df)
   labels     species
0       1      setosa
1       1      setosa
2       2  versicolor
3       2   virginica
4       1      setosa
...

SLIDE 19

Crosstab of labels and species

ct = pd.crosstab(df['labels'], df['species'])
print(ct)
species  setosa  versicolor  virginica
labels
0             0           2         36
1            50           0          0
2             0          48         14

How to evaluate a clustering, if there were no species information?

SLIDE 20

Measuring clustering quality

Using only samples and their cluster labels
A good clustering has tight clusters
  • Samples in each cluster bunched together

SLIDE 21

Inertia measures clustering quality

Measures how spread out the clusters are (lower is better)
Distance from each sample to centroid of its cluster
After fit(), available as attribute inertia_
k-means attempts to minimize the inertia when choosing clusters

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)
78.9408414261
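inertia_ is exactly the sum of squared distances from each sample to the centroid of its assigned cluster, which can be verified by hand. A sketch on synthetic data (for self-containment, not the iris samples):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the samples array
rng = np.random.default_rng(42)
samples = rng.normal(size=(60, 4))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(samples)

# Recompute inertia by hand: squared distance from each sample to the
# centroid of its assigned cluster, summed over all samples
centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((samples - centroids) ** 2).sum()
print(np.isclose(manual_inertia, model.inertia_))  # True
```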

SLIDE 22

The number of clusters

Clusterings of the iris dataset with different numbers of clusters
More clusters means lower inertia
What is the best number of clusters?

SLIDE 23

How many clusters to choose?

A good clustering has tight clusters (so low inertia)
  • ... but not too many clusters!
Choose an "elbow" in the inertia plot
  • Where inertia begins to decrease more slowly
E.g., for iris dataset, 3 is a good choice
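The elbow search can be sketched as a loop over candidate k values, here using the iris data bundled with scikit-learn (the Agg backend line is an assumption for scripted, headless use):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data

# Fit k-means for each candidate number of clusters, recording the inertia
ks = range(1, 7)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(samples).inertia_
            for k in ks]

# Inertia only ever decreases as k grows; pick the "elbow" where it
# starts decreasing more slowly (k=3 for iris)
plt.plot(list(ks), inertias, "-o")
plt.xticks(list(ks))
plt.xlabel("number of clusters, k")
plt.ylabel("inertia")
plt.show()
```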

SLIDE 24

Let's practice!


SLIDE 25

Transforming features for better clusterings


Benjamin Wilson

Director of Research at lateral.io

SLIDE 26

Piedmont wines dataset

178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera
Features measure chemical composition, e.g. alcohol content
Visual properties like "color intensity"

Source: https://archive.ics.uci.edu/ml/datasets/Wine


SLIDE 27

Clustering the wines

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)

SLIDE 28

Clusters vs. varieties

df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)
varieties  Barbera  Barolo  Grignolino
labels
0               29      13          20
1                0      46           1
2               19       0          50

SLIDE 29

Feature variances

The wine features have very different variances!
Variance of a feature measures spread of its values

feature      variance
alcohol          0.65
malic_acid       1.24
...
od280            0.50
proline      99166.71
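These variances are easy to check with pandas. A sketch using the copy of the same UCI wine data that ships with scikit-learn (load_wine and its feature names are scikit-learn's, an assumption since the slides cite the UCI source; pandas' var() uses the sample variance, so values may differ slightly from the table):

```python
import pandas as pd
from sklearn.datasets import load_wine

# scikit-learn bundles the UCI wine data
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Per-feature variances differ by several orders of magnitude
variances = df.var()
print(variances["alcohol"])   # well under 1
print(variances["proline"])   # on the order of 1e5
```

This spread is why proline would dominate an unscaled k-means clustering.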


SLIDE 31

StandardScaler

In k-means: feature variance = feature influence
StandardScaler transforms each feature to have mean 0 and variance 1
Features are said to be "standardized"

SLIDE 32

sklearn StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)
StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)

SLIDE 33

Similar methods

StandardScaler and KMeans have similar methods
  • Use fit() / transform() with StandardScaler
  • Use fit() / predict() with KMeans

SLIDE 34

StandardScaler, then KMeans

Need to perform two steps: StandardScaler, then KMeans
Use sklearn pipeline to combine multiple steps
Data flows from one step into the next

SLIDE 35

Pipelines combine multiple steps

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
Pipeline(steps=...)
labels = pipeline.predict(samples)

SLIDE 36

Feature standardization improves clustering

With feature standardization:

varieties  Barbera  Barolo  Grignolino
labels
0                0      59           3
1               48       0           3
2                0       0          65

Without feature standardization, the clustering was very bad:

varieties  Barbera  Barolo  Grignolino
labels
0               29      13          20
1                0      46           1
2               19       0          50

SLIDE 37

sklearn preprocessing steps

StandardScaler is a "preprocessing" step
MaxAbsScaler and Normalizer are other examples
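The three preprocessing steps behave quite differently; a toy sketch (the 3x2 array is made up for illustration) contrasting them:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, Normalizer

X = np.array([[1.0, -2.0],
              [2.0,  0.0],
              [3.0,  2.0]])

# StandardScaler: each COLUMN (feature) gets mean 0 and variance 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))              # ~[0. 0.]

# MaxAbsScaler: each column divided by its maximum absolute value
X_maxabs = MaxAbsScaler().fit_transform(X)
print(np.abs(X_maxabs).max(axis=0))    # [1. 1.]

# Normalizer: each ROW (sample) rescaled to unit norm --
# it acts on samples, not features
X_norm = Normalizer().fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # [1. 1. 1.]
```

Normalizer is the odd one out: because it rescales samples rather than features, it suits cases like word-frequency vectors where only the direction of each sample matters.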

SLIDE 38

Let's practice!
