Unsupervised Learning

  1. Unsupervised Learning (UNSUPERVISED LEARNING IN PYTHON). Benjamin Wilson, Director of Research at lateral.io

  2. Unsupervised learning: unsupervised learning finds patterns in data. E.g., clustering customers by their purchases, or compressing the data using purchase patterns (dimension reduction).

  3. Supervised vs unsupervised learning: supervised learning finds patterns for a prediction task, e.g., classify tumors as benign or cancerous (labels). Unsupervised learning finds patterns in data, but without a specific prediction task in mind.

  4. Iris dataset: measurements of many iris plants. Three species of iris: setosa, versicolor, virginica. Petal length, petal width, sepal length, sepal width (the features of the dataset). [1] http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

  5. Arrays, features & samples: a 2D NumPy array. Columns are measurements (the features). Rows represent iris plants (the samples).
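
  As a minimal sketch (not from the original slides), such an array can be obtained and inspected with the load_iris loader cited on slide 4:

      # Load the iris measurements as a 2D NumPy array.
      from sklearn.datasets import load_iris

      samples = load_iris().data
      print(samples.shape)   # (150, 4): 150 samples (rows), 4 features (columns)
      print(samples[0])      # the four measurements of the first iris plant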

  6. Iris data is 4-dimensional: iris samples are points in 4-dimensional space. Dimension = number of features. Dimension too high to visualize... but unsupervised learning gives insight.

  7. k-means clustering: finds clusters of samples. The number of clusters must be specified. Implemented in sklearn ("scikit-learn").

  8. print(samples)
      [[ 5.   3.3  1.4  0.2]
       [ 5.   3.5  1.3  0.3]
       ...
       [ 7.2  3.2  6.   1.8]]
      from sklearn.cluster import KMeans
      model = KMeans(n_clusters=3)
      model.fit(samples)
      KMeans(algorithm='auto', ...)
      labels = model.predict(samples)
      print(labels)
      [0 0 1 1 0 1 2 1 0 1 ...]

  9. Cluster labels for new samples: new samples can be assigned to existing clusters. k-means remembers the mean of each cluster (the "centroids") and finds the nearest centroid to each new sample; a hand-rolled version is sketched below.
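
  As a sketch (not from the slides), the learned centroids live in the fitted model's cluster_centers_ attribute, and the nearest-centroid assignment can be reproduced by hand; `model` is assumed to be the KMeans instance fitted on slide 8:

      import numpy as np

      print(model.cluster_centers_)  # one row per cluster: the centroid coordinates

      # Nearest-centroid assignment for one new sample, done manually:
      new_sample = np.array([5.7, 4.4, 1.5, 0.4])
      distances = np.linalg.norm(model.cluster_centers_ - new_sample, axis=1)
      print(np.argmin(distances))    # same label that model.predict() would return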

  10. Cluster labels for new samples:
      print(new_samples)
      [[ 5.7  4.4  1.5  0.4]
       [ 6.5  3.   5.5  1.8]
       [ 5.8  2.7  5.1  1.9]]
      new_labels = model.predict(new_samples)
      print(new_labels)
      [0 2 1]

  11. Scatter plots: scatter plot of sepal length vs. petal length. Each point represents an iris sample. Color points by cluster labels. PyPlot (matplotlib.pyplot).

  12. Scatter plots:
      import matplotlib.pyplot as plt
      xs = samples[:,0]
      ys = samples[:,2]
      plt.scatter(xs, ys, c=labels)
      plt.show()
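
  An optional extension (not in the slides): the cluster centroids can be overlaid on the same scatter plot, taking the same two feature columns from cluster_centers_:

      # Mark each centroid with a large diamond on top of the colored samples.
      centroids = model.cluster_centers_
      plt.scatter(xs, ys, c=labels)
      plt.scatter(centroids[:, 0], centroids[:, 2], marker='D', s=100)
      plt.show()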

  13. Let's practice!

  14. Evaluating a clustering. Benjamin Wilson, Director of Research at lateral.io

  15. Evaluating a clustering: can check correspondence with e.g. iris species... but what if there are no species to check against? Measure the quality of a clustering. Informs the choice of how many clusters to look for.

  16. Iris: clusters vs species. k-means found 3 clusters amongst the iris samples. Do the clusters correspond to the species?
      species  setosa  versicolor  virginica
      labels
      0             0           2         36
      1            50           0          0
      2             0          48         14

  17. Cross tabulation with pandas: clusters vs species is a "cross-tabulation". Use the pandas library. Given the species of each sample as a list species:
      print(species)
      ['setosa', 'setosa', 'versicolor', 'virginica', ... ]

  18. Aligning labels and species:
      import pandas as pd
      df = pd.DataFrame({'labels': labels, 'species': species})
      print(df)
         labels     species
      0       1      setosa
      1       1      setosa
      2       2  versicolor
      3       2   virginica
      4       1      setosa
      ...

  19. Crosstab of labels and species:
      ct = pd.crosstab(df['labels'], df['species'])
      print(ct)
      species  setosa  versicolor  virginica
      labels
      0             0           2         36
      1            50           0          0
      2             0          48         14
      How to evaluate a clustering, if there were no species information?

  20. Measuring clustering quality: using only samples and their cluster labels. A good clustering has tight clusters, with the samples in each cluster bunched together.

  21. Inertia measures clustering quality: measures how spread out the clusters are (lower is better), based on the distance from each sample to the centroid of its cluster. After fit(), available as the attribute inertia_. k-means attempts to minimize the inertia when choosing clusters.
      from sklearn.cluster import KMeans
      model = KMeans(n_clusters=3)
      model.fit(samples)
      print(model.inertia_)
      78.9408414261

  22. The number of clusters: clusterings of the iris dataset with different numbers of clusters. More clusters means lower inertia. What is the best number of clusters?

  23. How many clusters to choose? A good clustering has tight clusters (so low inertia)... but not too many clusters! Choose an "elbow" in the inertia plot: the point where inertia begins to decrease more slowly. E.g., for the iris dataset, 3 is a good choice (see the sketch below).
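
  A minimal sketch (not from the slides) of how such an inertia plot could be produced, assuming samples is the iris array used earlier:

      import matplotlib.pyplot as plt
      from sklearn.cluster import KMeans

      ks = range(1, 7)
      inertias = []
      for k in ks:
          model = KMeans(n_clusters=k)
          model.fit(samples)
          inertias.append(model.inertia_)  # drops as k grows, but flattens out

      # Look for the "elbow" where the curve starts to flatten.
      plt.plot(ks, inertias, '-o')
      plt.xlabel('number of clusters, k')
      plt.ylabel('inertia')
      plt.xticks(list(ks))
      plt.show()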

  24. Let's practice!

  25. Transforming features for better clusterings. Benjamin Wilson, Director of Research at lateral.io

  26. Piedmont wines dataset: 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera. Features measure chemical composition (e.g. alcohol content) and visual properties like "color intensity". [1] Source: https://archive.ics.uci.edu/ml/datasets/Wine

  27. Clustering the wines:
      from sklearn.cluster import KMeans
      model = KMeans(n_clusters=3)
      labels = model.fit_predict(samples)

  28. Clusters vs. varieties:
      df = pd.DataFrame({'labels': labels, 'varieties': varieties})
      ct = pd.crosstab(df['labels'], df['varieties'])
      print(ct)
      varieties  Barbera  Barolo  Grignolino
      labels
      0               29      13          20
      1                0      46           1
      2               19       0          50

  29. Feature variances: the wine features have very different variances! The variance of a feature measures the spread of its values.
      feature       variance
      alcohol           0.65
      malic_acid        1.24
      ...
      od280             0.50
      proline       99166.71
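
  A sketch (not from the slides) of how these variances could be computed; samples is assumed to be the wine feature array, and feature_names is a hypothetical list of column names introduced here only for illustration:

      import numpy as np
      import pandas as pd

      # Per-feature (per-column) variance of the wine data.
      variances = pd.Series(np.var(samples, axis=0), index=feature_names)
      print(variances)  # proline's variance dwarfs all the others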

  31. StandardScaler: in k-means, feature variance = feature influence. StandardScaler transforms each feature to have mean 0 and variance 1. Features are said to be "standardized".

  32. sklearn StandardScaler:
      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      scaler.fit(samples)
      StandardScaler(copy=True, with_mean=True, with_std=True)
      samples_scaled = scaler.transform(samples)

  33. Similar methods: StandardScaler and KMeans have similar methods. Use fit() / transform() with StandardScaler. Use fit() / predict() with KMeans.

  34. StandardScaler, then KMeans: need to perform two steps, StandardScaler, then KMeans. Use an sklearn pipeline to combine multiple steps. Data flows from one step into the next.

  35. Pipelines combine multiple steps:
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import KMeans
      from sklearn.pipeline import make_pipeline
      scaler = StandardScaler()
      kmeans = KMeans(n_clusters=3)
      pipeline = make_pipeline(scaler, kmeans)
      pipeline.fit(samples)
      Pipeline(steps=...)
      labels = pipeline.predict(samples)

  36. Feature standardization improves clustering. With feature standardization:
      varieties  Barbera  Barolo  Grignolino
      labels
      0                0      59           3
      1               48       0           3
      2                0       0          65
      Without feature standardization, the clustering was very bad:
      varieties  Barbera  Barolo  Grignolino
      labels
      0               29      13          20
      1                0      46           1
      2               19       0          50

  37. sklearn preprocessing steps: StandardScaler is a "preprocessing" step. MaxAbsScaler and Normalizer are other examples, and can be swapped into the pipeline the same way, as sketched below.
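
  A sketch (not from the slides) of swapping one of these alternatives into the slide-35 pipeline; Normalizer rescales each sample (row) to unit norm, rather than each feature (column):

      from sklearn.preprocessing import Normalizer
      from sklearn.cluster import KMeans
      from sklearn.pipeline import make_pipeline

      # Same pipeline shape as before, with Normalizer as the preprocessing step.
      pipeline = make_pipeline(Normalizer(), KMeans(n_clusters=3))
      labels = pipeline.fit_predict(samples)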

  38. Let's practice!
