SLIDE 1

Estimating and Visualizing Language Similarities Using Weighted Alignment and Force-Directed Graph Layout

Gerhard Jäger

April 24, 2012, Avignon

joint work with Armin Buch, David Erschler & Andrei Lupas

Gerhard Jäger (Tübingen) Visualizing Language Similarities 4-24-12 1 / 27
SLIDE 2

Force Directed Graph Layout

- method to visualize graphs or similarity matrices in two or three dimensions
- simulation of a physical system:
  - data items ⇔ physical particles
  - pairwise attractive force between particles, proportional to their similarity
  - constant repelling force between any pair of particles
  - this is just one of many protocols to determine forces
- initially, all particles are placed at random
- in each time step, each particle is moved a small amount along the resulting force vector
- the last step is repeated until a stable state is reached
- tends to stabilize in a state where groups of mutually similar items form clusters
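The simulation described on this slide can be sketched in a few lines. This is a minimal toy version, not the CLANS implementation: the function name `force_layout`, the parameters `steps`, `step_size` and `repel`, and in particular the choice to let the attraction grow with distance (spring-like, so that the toy system settles) are assumptions made for illustration.

```python
import math
import random


def force_layout(sim, steps=500, step_size=0.01, repel=0.1, seed=0):
    """Toy 2-D force-directed layout for a symmetric similarity matrix."""
    rng = random.Random(seed)
    n = len(sim)
    # Initially, all particles are placed at random.
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                # Assumption: attraction = similarity * distance (pulls i
                # towards j), repulsion = constant magnitude (pushes i away).
                magnitude = sim[i][j] * dist - repel
                forces[i][0] += magnitude * dx / dist
                forces[i][1] += magnitude * dy / dist
        # Move each particle a small amount along its resulting force vector.
        for i in range(n):
            pos[i][0] += step_size * forces[i][0]
            pos[i][1] += step_size * forces[i][1]
    return pos
```

Called on a small matrix with two groups of mutually similar items, the returned coordinates should place the members of each group close together and the groups apart. A fixed number of steps stands in for a proper convergence test here.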

SLIDE 3

CLANS

- Cluster Analysis of Sequences
- developed by bioinformaticians Tancred Frickey and Andrei Lupas as an exploratory tool for studying evolutionary relationships among protein sequences (Frickey and Lupas 2004)
- similarity of proteins is determined via sequence alignment; the resulting matrix is visualized using CLANS
- advantages in comparison to tree-based algorithms:
  - does not a priori assume a tree-like signal (useful when lateral transfer plays a role)
  - fast (esp. in comparison to character-based algorithms)
  - robust (noise in data items does not accumulate)
- general impression so far (Lupas, p.c.):
  - tree algorithms are more precise when evolutionary distances are small
  - CLANS is more sensitive to weak evolutionary signals

SLIDE 4

The Automated Similarity Judgment Program

- project at MPI EVA in Leipzig around Søren Wichmann
- covers more than 5,000 languages
- basic vocabulary of 40 words for each language, in uniform phonetic transcription
- freely available
- concepts used: I, you, we, one, two, person, fish, dog, louse, tree, leaf, skin, blood, bone, horn, ear, eye, nose, tooth, tongue, knee, hand, breast, liver, drink, see, hear, die, come, sun, star, water, stone, fire, path, mountain, night, full, new, name

SLIDE 5

First shot: Levenshtein Distance

- first step: find the minimal edit distance between all translation pairs of the languages to be compared
- e.g. German ↔ Latin: edit distance = 2
- transformation into a similarity measure:

$$\mathrm{sim}(x, y) \doteq \frac{2\,\bigl(\max(l(x), l(y)) - d_{\mathrm{Lev}}(x, y)\bigr)}{l(x) + l(y)}$$

- similarity between L1 and L2: average similarity of translation pairs between L1 and L2
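As a sketch of this first step, the following computes the Levenshtein distance with the standard dynamic-programming recurrence and turns it into the similarity measure given on this slide. The function names are illustrative; the test words [hant], [hEnd] and [mano] are the ones discussed later in the talk.

```python
def levenshtein(x, y):
    """Minimal number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def sim(x, y):
    """Similarity from the slide: 2 * (max length - distance) / (sum of lengths)."""
    return 2 * (max(len(x), len(y)) - levenshtein(x, y)) / (len(x) + len(y))


print(levenshtein("hant", "hEnd"), sim("hant", "hEnd"))
```

Note that `sim` is undefined for two empty strings (division by zero); real word lists would not contain such entries.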

SLIDE 6

First shot: normalized Levenshtein Distance

SLIDE 7

First shot: normalized Levenshtein Distance

SLIDE 8

First shot: normalized Levenshtein Distance

basic problem here: the smaller the sound inventories of the languages compared, the higher the probability of false positives

[Figure: plot with axes labeled "phoneme inventory" and "similarity"]

SLIDE 9

Benchmark: LDND measure

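- Wichmann et al.: doubly normalized Levenshtein distance (Levenshtein Distance Normalized and Divided)
- normalization for word length:

$$\mathrm{nld}(x, y) \doteq \frac{d_{\mathrm{Lev}}(x, y)}{\max(l(x), l(y))} \tag{1}$$

- normalization for language-specific patterns (including sound inventory size):
  - normalization factor $1/\mu$
  - $\mu_{L_1,L_2}$: mean of $\{\mathrm{nld}(x, y) \mid x \in L_1, y \in L_2, x \neq y\}$, i.e. over word pairs with different meanings

$$\mathrm{ldnd}(x, y, L_1, L_2) \doteq \frac{\mathrm{nld}(x, y)}{\mu_{L_1,L_2}}$$

$$\mathrm{ldnd}(L_1, L_2) \doteq \frac{\sum_{x \in L_1, y \in L_2} \{\mathrm{ldnd}(x, y, L_1, L_2) : x = y\}}{\#\{x, y : x = y\}}$$

Here $x = y$ means that $x$ and $y$ express the same concept.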
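The two normalizations can be illustrated with a small sketch. It assumes each language is given as a dictionary from concept to transcribed word; the function names are illustrative, and the toy word lists are invented for the example rather than taken from ASJP.

```python
def levenshtein(x, y):
    """Plain Levenshtein distance (unit costs)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


def nld(x, y):
    """First normalization: divide by the length of the longer word."""
    return levenshtein(x, y) / max(len(x), len(y))


def ldnd(lang1, lang2):
    """Second normalization: divide by the mean nld of different-meaning pairs."""
    shared = [c for c in lang1 if c in lang2]
    same = [nld(lang1[c], lang2[c]) for c in shared]
    diff = [nld(lang1[c1], lang2[c2])
            for c1 in shared for c2 in shared if c1 != c2]
    mu = sum(diff) / len(diff)  # chance-level distance for this language pair
    return (sum(same) / len(same)) / mu


# Toy data (hypothetical transcriptions, for illustration only).
toy_a = {"hand": "hant", "you": "du", "water": "vasa"}
toy_b = {"hand": "hEnd", "you": "yu", "water": "wat3r"}
print(ldnd(toy_a, toy_b))
```

Dividing every same-meaning distance by $\mu$ and then averaging gives the same value as averaging first and dividing once, which is what the sketch does.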

SLIDE 10

Benchmark: LDND measure

SLIDE 11

Needleman-Wunsch Algorithm

- Levenshtein distance is somewhat coarse-grained
- the simply normalized distance is 0.5 in both cases
- after the second normalization, [hant] even appears somewhat closer to [mano] (ldnd = 0.54) than to [hEnd] (ldnd = 0.55)
- the correspondences a∼E and t∼d are (according to linguistic criteria like place of articulation) much more natural than h∼m or t∼o
- German appears equidistant to English and Spanish here, even though the distance to English is clearly smaller

SLIDE 12

Weighted Alignment

Needleman-Wunsch algorithm

- similar to the computation of Levenshtein distance
- edit operations are weighted: the algorithm finds the optimal alignment, i.e. the one that minimizes total weight
- a ∼ E and d ∼ t should have lower weight than t ∼ o

How to determine these weights?

- bioinformatics: log-odds
- logarithm of the ratio between the probability of a replacement and the probability of chance co-occurrence of the pair of molecules in question
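A weighted alignment of this kind can be sketched as below. The sketch uses the score-maximizing convention (high score = good correspondence), which is the mirror image of minimizing a weight. The four log-odds values are the ones reported later in the talk; the score for identical sounds, the default score for all other pairs, and the gap penalty are assumptions chosen only for the example.

```python
def needleman_wunsch(x, y, score, gap=-1.0):
    """Return the total score of the best global alignment of x and y."""
    prev = [j * gap for j in range(len(y) + 1)]  # aligning "" with y[:j]
    for i, cx in enumerate(x, start=1):
        curr = [i * gap]                          # aligning x[:i] with ""
        for j, cy in enumerate(y, start=1):
            curr.append(max(
                prev[j] + gap,                # cx aligned with a gap
                curr[j - 1] + gap,            # cy aligned with a gap
                prev[j - 1] + score(cx, cy),  # cx aligned with cy
            ))
        prev = curr
    return prev[-1]


# Log-odds for four sound pairs, as given in the talk.
LOG_ODDS = {
    frozenset("dt"): 0.69,
    frozenset("aE"): 0.07,
    frozenset("hm"): -0.61,
    frozenset("to"): -0.80,
}


def toy_score(a, b, match=1.0, default=-1.0):
    """Hypothetical scoring: assumed values for matches and unlisted pairs."""
    if a == b:
        return match
    return LOG_ODDS.get(frozenset((a, b)), default)


print(needleman_wunsch("hant", "hEnd", toy_score))
print(needleman_wunsch("hant", "mano", toy_score))
```

Under these assumed scores, [hant] comes out clearly closer to [hEnd] than to [mano], unlike under plain Levenshtein distance, where both pairs are at distance 2.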

SLIDE 13

Weighted Alignment

estimation of correspondence probabilities of two sounds in cognates:

- large sample of pairwise related languages
- replacement operations under Levenshtein alignment of translation pairs are counted
- a substantial share of the word pairs considered are true cognates: replacement operations thus reflect genuine language change processes
- replacement of sounds between non-cognates is randomly distributed and boils down to an additive constant in the logarithmic term
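The counting procedure can be sketched as follows. This is one possible reading, not the authors' code: it takes a single optimal Levenshtein alignment per word pair, counts identical as well as non-identical aligned sounds, ignores gaps, and estimates chance co-occurrence from overall sound frequencies (with a factor of 2 for unordered pairs of distinct sounds). All of these choices, and all names, are assumptions.

```python
import math
from collections import Counter


def aligned_pairs(x, y):
    """Aligned sound pairs from one optimal Levenshtein alignment of x and y."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    pairs, i, j = [], n, m
    while i > 0 and j > 0:  # trace back, preferring the diagonal step
        if d[i][j] == d[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            pairs.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def log_odds(word_pairs):
    """Log of observed pair probability over chance co-occurrence probability."""
    pair_counts, sound_counts = Counter(), Counter()
    for x, y in word_pairs:
        sound_counts.update(x)
        sound_counts.update(y)
        for a, b in aligned_pairs(x, y):
            pair_counts[frozenset((a, b))] += 1
    total_pairs = sum(pair_counts.values())
    total_sounds = sum(sound_counts.values())
    scores = {}
    for pair, count in pair_counts.items():
        sounds = tuple(pair)
        a, b = sounds[0], sounds[-1]  # a == b for identical sounds
        q = sound_counts[a] / total_sounds * sound_counts[b] / total_sounds
        chance = q if a == b else 2 * q  # unordered pair: two possible orders
        scores[pair] = math.log((count / total_pairs) / chance)
    return scores
```

On real data, `word_pairs` would be the translation pairs from a large sample of related languages; pairs that never occur in any alignment receive no score here and would need smoothing.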

SLIDE 14

Weighted Alignment

log odds:

- d ∼ t: 0.69
- a ∼ E: 0.07
- h ∼ m: −0.61
- t ∼ o: −0.80

[Figure: plot over the sound symbols, with labeled groups: velar/uvular, vowels, labials, dental]

SLIDE 15

Weighted Alignment

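- the total value of the optimal alignment is interpreted as similarity between strings
- similarity between languages is computed via p-values:

$$\mathrm{nwpv}(x, y, L_1, L_2) \doteq \frac{\#\{(x', y') \mid \mathrm{nw}(x', y') \geq \mathrm{nw}(x, y)\}}{\#L_1 \times \#L_2}$$

$$\mathrm{nwpv}(L_1, L_2) \doteq \frac{\sum_{x \in L_1, y \in L_2} -\log(\mathrm{pv}(x, y))}{\#\{(x, y) : x = y\}}$$

Here $(x', y')$ ranges over $L_1 \times L_2$, $\mathrm{pv}(x, y)$ is short for $\mathrm{nwpv}(x, y, L_1, L_2)$, and $x = y$ means that $x$ and $y$ express the same concept.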
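A sketch of the p-value step, under simplifying assumptions: languages are dictionaries from concept to word, the alignment score uses flat match, mismatch and gap values instead of estimated log-odds, and the sum runs over same-concept pairs only. Names and toy data are hypothetical.

```python
import math


def nw(x, y, match=1.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score with flat (assumed) scores instead of log-odds."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, cx in enumerate(x, start=1):
        curr = [i * gap]
        for j, cy in enumerate(y, start=1):
            sub = match if cx == cy else mismatch
            curr.append(max(prev[j] + gap, curr[j - 1] + gap, prev[j - 1] + sub))
        prev = curr
    return prev[-1]


def nwpv(lang1, lang2):
    """Mean of -log(p) over same-concept pairs; higher means more similar."""
    # Scores of all word pairs, regardless of meaning, form the reference set.
    all_scores = [nw(x, y) for x in lang1.values() for y in lang2.values()]
    shared = [c for c in lang1 if c in lang2]
    total = 0.0
    for c in shared:
        s = nw(lang1[c], lang2[c])
        # p-value: share of all pairs that align at least as well as this one.
        p = sum(1 for t in all_scores if t >= s) / len(all_scores)
        total += -math.log(p)
    return total / len(shared)


# Toy data (hypothetical transcriptions, for illustration only).
toy_a = {"hand": "hant", "you": "du", "water": "vasa"}
toy_shuffled = {"hand": "du", "you": "vasa", "water": "hant"}
print(nwpv(toy_a, toy_a), nwpv(toy_a, toy_shuffled))
```

Because each same-concept pair is itself part of the reference set, `p` is never zero and the logarithm is always defined.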

SLIDE 16

Weighted Alignment

similarities of English to

- Dutch: 0.60 / 3.38
- German: 0.68 / 3.16
- Proto-Indo-European: 0.86 / 2.33
- Latin: 0.88 / 1.85
- Spanish: 0.93 / 1.59
- Russian: 0.93 / 1.52
- Hungarian: 0.95 / 1.30
- Turkish: 1.03 / 0.83

SLIDE 17

Comparison

two reasonable gold standards for comparing these two similarity/distance measures:

- expert judgments on cognacy
- expert judgments on language classification

SLIDE 18

Comparison: cognacy

- Dyen-Kruskal database: cognacy judgments for 200-item Swadesh lists from 95 Indo-European languages
- experiment:
  - extract those items from the Dyen-Kruskal database that occur in ASJP
  - define a cognacy estimator based on LDND by finding the optimal cutoff
  - do the same for NWPV
  - compare
- result:
  - LDND: optimally achievable Matthews Correlation Coefficient: 0.547
  - NWPV: optimally achievable Matthews Correlation Coefficient: 0.574
  - (+1 means perfect prediction, −1 perfect mis-prediction)
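The "optimal cutoff" evaluation can be sketched like this: every observed score is tried as a threshold, and the threshold with the highest Matthews Correlation Coefficient is kept. Function names, the `higher_is_cognate` switch (off for a distance such as LDND, on for a similarity such as NWPV) and the toy numbers are assumptions for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_cutoff(scores, labels, higher_is_cognate=False):
    """Return (best MCC, cutoff) over all observed scores as candidate cutoffs."""
    best = None
    for cut in sorted(set(scores)):
        tp = tn = fp = fn = 0
        for s, gold in zip(scores, labels):
            pred = s >= cut if higher_is_cognate else s <= cut
            if pred and gold:
                tp += 1
            elif pred:
                fp += 1
            elif gold:
                fn += 1
            else:
                tn += 1
        value = mcc(tp, tn, fp, fn)
        if best is None or value > best[0]:
            best = (value, cut)
    return best


# Toy distances and expert cognacy judgments (invented for the example).
toy_scores = [0.2, 0.3, 0.4, 0.8, 0.9]
toy_labels = [True, True, True, False, False]
print(best_cutoff(toy_scores, toy_labels))
```

Since the cutoff is tuned on the same data it is scored on, the resulting value is an upper bound ("optimally achievable"), which matches how the result above is phrased.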

SLIDE 19

Comparison: language classification

- Ethnologue: provides a taxonomic classification of virtually all living languages
- Robinson-Foulds metric:
  - compares two trees over the same set of leaves
  - returns the number of partitions that one of the two trees makes and the other doesn't
- Neighbor Joining algorithm: bottom-up clustering algorithm that extracts an unrooted tree from a distance matrix
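The Robinson-Foulds comparison can be sketched for trees written as nested tuples of leaf names. The representation and function names are assumptions; the sketch treats the trees as unrooted and returns the two one-sided counts separately, so that their sum is the Robinson-Foulds distance.

```python
def splits(tree):
    """Non-trivial bipartitions of the leaf set induced by internal edges."""
    found = set()

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(collect(child) for child in node))

    all_leaves = collect(tree)

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        rest = all_leaves - below
        if len(below) > 1 and len(rest) > 1:
            # Store the split without orientation, so rooting does not matter.
            found.add(frozenset([below, rest]))
        return below

    visit(tree)
    return found


def robinson_foulds(t1, t2):
    """Return (splits only in t1, splits only in t2)."""
    s1, s2 = splits(t1), splits(t2)
    return len(s1 - s2), len(s2 - s1)


tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree_a, tree_b))
```

If the first tree is the inferred one and the second the expert classification, the first count corresponds to false positives and the second to false negatives. A poorly resolved reference tree contributes few splits, which inflates the first count.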

SLIDE 20

Comparison: language classification

Experiment:

- compute NJ trees for all languages in ASJP based on LDND and NWPV distances
- extract the sub-tree of the Ethnologue tree for the languages in ASJP
- compute the Robinson-Foulds metric between the Ethnologue tree and each of the two NJ trees

Outcome:

- LDND: 5,522 (4,550 false positives, 972 false negatives)
- NWPV: 5,476 (4,527 false positives, 949 false negatives)

SLIDE 21

Qualitative comparison

NJ trees for the languages of Eurasia (left: LDND; right: NWPV)

SLIDE 22

Qualitative comparison

CLANS visualization for the languages of Eurasia (LDND left, NWPV right)

SLIDE 23

CLANS and dimensionality reduction

- CLANS performs a kind of (non-deterministic) dimensionality reduction
- How does this relate to more established methods?

SLIDE 24

CLANS vs. Multi-Dimensional Scaling

MDS applied to NWPV-matrix of the Eurasian languages

[Figure: MDS plot; points labeled by language family code: AA, Bas, IE, Ura, Alt, Yka, Niv, Yen, CK, Ain, Jap, Kor, NWC, NDa, Krt, Brs, Nah, Kus, Dra, ST, HM, TK, AuA, Sho]

SLIDE 25

CLANS vs. Principal Component Analysis

PCA applied to NWPV-matrix of the Eurasian languages

[Figure: PCA plot; points labeled by language family code: AA, Bas, IE, Ura, Alt, Yka, Niv, Yen, CK, Ain, Jap, Kor, NWC, NDa, Krt, Brs, Nah, Kus, Dra, ST, HM, TK, AuA, Sho]

SLIDE 26

CLANS and dimensionality reduction

- language families vary massively in size
- MDS and PCA only provide information about the largest families
- CLANS is sensitive to local patterns

SLIDE 27

Conclusion

- weighted alignment improves the results of lexico-statistical language classification
- more powerful methods from bioinformatics (such as progressive multiple alignment) are likely to lead to further improvement
- CLANS is a useful exploratory tool

SLIDE 28

Frickey, T. and A. N. Lupas (2004). CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics, 20(18):3702–3704.