

SLIDE 1

Assignment 4

Clustering Languages
Verena Blaschke
July 4, 2018

SLIDE 2

Assignment 4

I: Feature extraction
II: K-means clustering
III: Principal component analysis
IV: Evaluation with gold-standard labels
V: Calculating distances
VI: Hierarchical clustering

SLIDE 3

I: Feature extraction

fin   s i l m æ
fin   k O r V A
fin   n E n æ
fin   s u u
...
cmn   th U N t ù 1
cmn   õ @ n n a I

SLIDE 4

I: Feature extraction

fin   s i l m æ
fin   k O r V A
fin   n E n æ
fin   s u u
...
cmn   th U N t ù 1
cmn   õ @ n n a I

80 languages × 272 IPA segments
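A minimal sketch of how such a count matrix could be assembled, assuming the input is available as (language, IPA-segment-list) pairs; build_feature_matrix and all variable names here are illustrative, not part of the assignment code:

from collections import Counter
import numpy as np

# Hypothetical input format: one ('fin', ['s', 'i', 'l', 'm', 'æ']) pair per word.
def build_feature_matrix(word_list):
    counts = {}  # language -> Counter over IPA segments
    for lang, word_segments in word_list:
        counts.setdefault(lang, Counter()).update(word_segments)
    languages = sorted(counts)
    all_segments = sorted({s for c in counts.values() for s in c})
    # One row per language, one column per IPA segment (here: 80 x 272).
    features = np.array([[counts[lang][seg] for seg in all_segments]
                         for lang in languages])
    return languages, all_segments, features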

SLIDE 5

II: K-means clustering

k-means clustering for k in [2, 70]
silhouette coefficient

SLIDE 6

II: K-means clustering

k-means clustering for k in [2, 70]

silhouette coefficient: How close (= similar) is each data point to other points from its own cluster, compared to points in other clusters? Range [−1, +1]; higher scores are better.
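A sketch of this scan over k, assuming the feature matrix from part I is in a variable named features:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette of point i: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where
# a(i) is the mean distance to the other points in its own cluster and
# b(i) the mean distance to the points in the nearest other cluster.
sil_scores = {}
for k in range(2, 71):
    labels = KMeans(n_clusters=k).fit_predict(features)
    sil_scores[k] = silhouette_score(features, labels)
print(max(sil_scores, key=sil_scores.get))  # k with the best mean silhouette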

SLIDE 7

[Figure: silhouette score (y-axis, 0.04 to 0.16) against number of clusters (x-axis, 10 to 70).]

Silhouette score by number of clusters.

SLIDE 8

II: K-means clustering

k-means clustering for k in [2, 70]

silhouette coefficient: How close (= similar) is each data point to other points from its own cluster, compared to points in other clusters? Range [−1, +1]; higher scores are better.

error function: sum of squared distances from the closest centroids (kmeans.inertia_)

SLIDE 9

II: K-means clustering

k-means clustering for k in [2, 70]

silhouette coefficient: How close (= similar) is each data point to other points from its own cluster, compared to points in other clusters? Range [−1, +1]; higher scores are better.

error function: sum of squared distances from the closest centroids (kmeans.inertia_)

What is a good number of clusters? → elbow method (a sketch follows below)
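A possible sketch of the elbow plot, again assuming the feature matrix is in features; the output filename is made up:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(2, 71)
# kmeans.inertia_: sum of squared distances of samples to their closest centroid.
errors = [KMeans(n_clusters=k).fit(features).inertia_ for k in ks]
plt.plot(ks, errors)
plt.xlabel('number of clusters')
plt.ylabel('error')
plt.savefig('elbow.pdf')  # look for the bend (the "elbow") in the curve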

SLIDE 10

[Figure: error (y-axis, scale 1e8) against number of clusters (x-axis, 10 to 70).]

Sum of squared distances from the centroids by number of clusters.

SLIDE 11

III: Principal component analysis

remove redundant features

SLIDE 12

III: Principal component analysis

remove redundant features

remove noise
train machine learning models more quickly

SLIDE 13

[Figure: 2D scatter plot of all 80 language codes (both axes roughly −1500 to 2000).]

Feature vectors scaled down to 2 dimensions.

SLIDE 14

[Figure: the same 2D scatter plot, with languages colored by family.]

Feature vectors scaled down to 2 dimensions. Legend: Turkic, Indo-European, Uralic, Nakh-Daghestanian, Dravidian.

SLIDE 15

III: Principal component analysis

from sklearn.decomposition import PCA

# Fit a full PCA, then count how many components are needed for 90% of the variance.
pca = PCA(features.shape[1]).fit(features)  # fit() added: explained_variance_ratio_ only exists after fitting
d = 0
var_explained = 0
while var_explained < 0.9:
    var_explained += pca.explained_variance_ratio_[d]
    d += 1
featuresPCA = PCA(d).fit_transform(features)

SLIDE 16

III: Principal component analysis

from sklearn.decomposition import PCA

pca = PCA(features.shape[1]).fit(features)  # fit() added, as on the previous slide
d = 0
var_explained = 0
while var_explained < 0.9:
    var_explained += pca.explained_variance_ratio_[d]
    d += 1
featuresPCA = PCA(d).fit_transform(features)

# Shorter: let PCA pick the number of components itself.
pca = PCA(0.9).fit(features)
print(pca.n_components_)
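Passing a float between 0 and 1 as n_components makes scikit-learn keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction, which replaces the manual loop above.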

SLIDE 17

[Figure: variance explained (y-axis, 0.0 to 1.0) against component (x-axis, 1 to about 20); two curves: variance explained per component and cumulative variance explained.]

Variance explained per PCA component.

SLIDE 18

IV: Evaluation with gold-standard labels

from sklearn.cluster import KMeans

n_fam = len(set(family))  # number of distinct gold-standard families
pred_all = KMeans(n_fam).fit_predict(features)
pred_pca = KMeans(n_fam).fit_predict(featuresPCA)

SLIDE 19

IV: Evaluation with gold-standard labels

from sklearn.cluster import KMeans

n_fam = len(set(family))
pred_all = KMeans(n_fam).fit_predict(features)
pred_pca = KMeans(n_fam).fit_predict(featuresPCA)

lang   all   pca   family
kan     2     4    Dravidian
tam     3          Dravidian
tel     4          Dravidian
mal     4     2    Dravidian
bul     1          Indo-European
ces     1          Indo-European
...

SLIDE 20

IV: Evaluation with gold-standard labels

Homogeneity: Each cluster contains data points of the same gold-standard class.

SLIDE 21

IV: Evaluation with gold-standard labels

Homogeneity: Each cluster contains data points of the same gold-standard class.

Completeness: All members of a gold-standard class are in the same cluster.

SLIDE 22

IV: Evaluation with gold-standard labels

Homogeneity: Each cluster contains data points of the same gold-standard class.

Completeness: All members of a gold-standard class are in the same cluster.

V-measure: Harmonic mean of homogeneity and completeness. Range [0, 1]; higher is better.
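A sketch of how these scores can be computed with scikit-learn, assuming the gold-standard labels are in family and the predictions pred_all come from the previous slide:

from sklearn.metrics import homogeneity_completeness_v_measure

# V = 2 * h * c / (h + c), the harmonic mean of homogeneity h and completeness c.
h, c, v = homogeneity_completeness_v_measure(family, pred_all)
print('H: {:.4f} C: {:.4f} V: {:.4f}'.format(h, c, v))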

SLIDE 23

IV: Evaluation with gold-standard labels

Homogeneity: Each cluster contains data points of the same gold-standard class.

Completeness: All members of a gold-standard class are in the same cluster.

V-measure: Harmonic mean of homogeneity and completeness. Range [0, 1]; higher is better.

       H        C        V
all    0.1707   0.1461   0.1575
PCA    0.1728   0.1572   0.1646

SLIDE 24

V: Calculating distances

      A     B     C     D     E
A          123   452    10   572
B                342   370   908
C                      127   754
D                             23
E
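One way to compute such a pairwise distance matrix, sketched with SciPy; the Euclidean metric is an assumption, since the slides do not name one. The condensed vector dist is also the form that the linkage call in part VI expects:

from scipy.spatial.distance import pdist, squareform

# Condensed distance vector: the upper triangle above, read row by row.
dist = pdist(features, metric='euclidean')
dist_matrix = squareform(dist)  # full symmetric matrix, zeros on the diagonal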

SLIDE 25

VI: Hierarchical clustering

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy

for m in ['single', 'complete', 'average']:
    fig, ax = plt.subplots()
    # dist is the condensed pairwise distance matrix from part V.
    z = scipy.cluster.hierarchy.linkage(dist, method=m)
    scipy.cluster.hierarchy.dendrogram(z, labels=languages, ax=ax)
    fig.savefig('dendrogram-{}.pdf'.format(m))  # was format(method): an undefined name
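For reference: single linkage merges clusters by the minimum pairwise distance between their members, complete linkage by the maximum, and average linkage by the mean; the three dendrograms below show how much this choice matters.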

SLIDE 26

[Figure: dendrogram over the 80 languages (distance scale 1000 to 7000).]

Hierarchical clustering with single linkage.

SLIDE 27

[Figure: dendrogram over the 80 languages (distance scale 5000 to 15000).]

Hierarchical clustering with complete linkage.

SLIDE 28

[Figure: dendrogram over the 80 languages (distance scale 2000 to 10000).]

Hierarchical clustering with average linkage.