Assignment 4: Clustering Languages
Verena Blaschke
July 04, 2018
Assignment 4
I: Feature extraction
II: K-means clustering
III: Principal component analysis
IV: Evaluation with gold-standard labels
V: Calculating distances
VI: Hierarchical clustering
I: Feature extraction
fin   s i l m æ
fin   k O r V A
fin   n E n æ
fin   s u u
...
cmn   th U N t ù 1
cmn   õ @ n n a I
80 languages × 272 IPA segments
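Building such a language-by-segment count matrix can be sketched as follows (hypothetical toy data stands in for the real transcription files; the actual assignment yields 80 rows and 272 columns):

```python
from collections import Counter

# Hypothetical toy input: (language, transcription) pairs,
# with IPA segments separated by spaces as on the slide.
data = [
    ("fin", "s i l m æ"),
    ("fin", "k O r V A"),
    ("cmn", "õ @ n n a I"),
]

counts = {}  # language -> Counter over IPA segments
for lang, transcription in data:
    counts.setdefault(lang, Counter()).update(transcription.split())

# One row per language, one column per segment
# (80 x 272 on the real data).
segments = sorted({s for c in counts.values() for s in c})
languages = sorted(counts)
features = [[counts[lang][seg] for seg in segments] for lang in languages]
```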
II: K-means clustering
k-means clustering for k in [2, 70]
silhouette coefficient
How close (= similar) is each data point to the other points in its own cluster, compared to points in other clusters?
Range [−1, +1]; higher scores are better.
[Figure: Silhouette score by number of clusters.]
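The sweep can be sketched with scikit-learn's KMeans and silhouette_score (random stand-in data here rather than the real 80 × 272 count matrix, and a shortened range instead of the assignment's [2, 70]):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
features = rng.random((80, 272))  # stand-in for the language x segment matrix

scores = {}
for k in range(2, 10):  # the assignment sweeps k in [2, 70]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```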
error function
sum of squared distances of each point to its closest centroid
kmeans.inertia_
What is a good number of clusters? → elbow method
[Figure: Sum of squared distances from the centroids by number of clusters.]
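Collecting the error values for the elbow plot is a small variation of the same loop (again on random stand-in data, with a shortened k range):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((80, 272))  # stand-in data

errors = {}
for k in range(2, 10):  # shortened from the assignment's [2, 70]
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    errors[k] = km.inertia_  # sum of squared distances to the closest centroids
```

Plotting `errors` against k and looking for the bend in the curve is the elbow method.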
III: Principal component analysis
remove redundant features
remove noise
train machine learning models more quickly
[Figure: Feature vectors scaled down to 2 dimensions; each point is one of the 80 languages, labeled by its language code.]
[Figure: Feature vectors scaled down to 2 dimensions, with languages colored by family: Turkic, Indo-European, Uralic, Nakh-Daghestanian, Dravidian.]
III: Principal component analysis

from sklearn.decomposition import PCA

# Fit a full PCA, then count how many components are needed
# to explain 90% of the variance. (PCA(features.shape[1]) would
# fail here, since n_components may be at most
# min(n_samples, n_features) = 80; PCA() keeps that maximum.)
pca = PCA().fit(features)
d = 0
var_explained = 0
while var_explained < 0.9:
    var_explained += pca.explained_variance_ratio_[d]
    d += 1
featuresPCA = PCA(d).fit_transform(features)

# Shortcut: passing a float in (0, 1) makes PCA choose the number
# of components that explains that fraction of the variance.
pca = PCA(0.9).fit(features)
print(pca.n_components_)
[Figure: Variance explained per PCA component, and cumulative variance explained.]
IV: Evaluation with gold-standard labels
from sklearn.cluster import KMeans

n_fam = len(set(family))
pred_all = KMeans(n_fam).fit_predict(features)
pred_pca = KMeans(n_fam).fit_predict(featuresPCA)

[Table: predicted cluster with all features vs. with PCA features, next to the gold-standard family, per language (e.g. kan, tam, tel, mal: Dravidian; bul, ces: Indo-European; ...).]
IV: Evaluation with gold-standard labels
Homogeneity: each cluster contains only data points from a single gold-standard class.
Completeness: all members of a gold-standard class are assigned to the same cluster.
V-measure: harmonic mean of homogeneity and completeness.
All three scores lie in [0, 1]; higher is better.
       H        C        V
all    0.1707   0.1461   0.1575
PCA    0.1728   0.1572   0.1646
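scikit-learn computes all three scores in one call; a minimal sketch with hypothetical toy labels in place of the real family/cluster assignments:

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical gold-standard families and predicted cluster ids.
gold = ["Dravidian", "Dravidian", "Uralic", "Uralic", "Turkic"]
pred = [0, 0, 1, 2, 2]

h, c, v = homogeneity_completeness_v_measure(gold, pred)
```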
V: Calculating distances
      A     B     C     D     E
A         123   452    10   572
B               342   370   908
C                     127   754
D                            23
E
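Such a pairwise distance matrix can be computed with scipy (a sketch on random stand-in vectors; the Euclidean metric is an assumption, the assignment may specify a different one):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
features = rng.random((5, 272))  # stand-in for 5 language vectors

condensed = pdist(features, metric="euclidean")  # upper triangle as a flat array
dist = squareform(condensed)                     # full symmetric 5 x 5 matrix
```

`scipy.cluster.hierarchy.linkage` also accepts the condensed form directly.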
VI: Hierarchical clustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy

for m in ['single', 'complete', 'average']:
    fig, ax = plt.subplots()
    z = scipy.cluster.hierarchy.linkage(dist, method=m)
    scipy.cluster.hierarchy.dendrogram(z, labels=languages, ax=ax)
    fig.savefig('dendrogram-{}.pdf'.format(m))  # was format(method): a NameError
[Figure: Hierarchical clustering with single linkage (dendrogram over all 80 languages).]
[Figure: Hierarchical clustering with complete linkage.]
[Figure: Hierarchical clustering with average linkage.]