Word manifolds John Goldsmith University of Chicago July 15, 2015 - - PowerPoint PPT Presentation



SLIDE 1

Word manifolds

John Goldsmith

University of Chicago

July 15, 2015

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 1 / 49

SLIDE 4

Goals

- Visualize the global structure of a language
- Solve a technical problem in the unsupervised learning of morphology (the past tenses of English verbs)
- Develop a language-independent method

SLIDE 11

Algorithm

The algorithm is in three steps:

1. Compare all pairs of words to see which words agree on the words that immediately precede and follow them. the and my will agree a lot.
2. Turn this abstract graph into something in a geometric space, so that we can talk about distances.
3. In that geometric space of dimension 10, find for each word the 6 closest words to it, and make a graph S out of those edges. The graph S can be viewed directly with data-visualization tools such as Gephi, and various clustering techniques can be applied to it as well.

SLIDE 14

Algorithm

The algorithm is in three steps:

1. Determine the similarity between all pairs of words, based on a comparison of word contexts, and create a graph C whose edge weights are determined directly by those similarities: for every pair of words (w1, w2), count how many contexts they share.
2. Compute the K most significant eigenvectors of the normalized Laplacian of graph C, and calculate the coordinates of each word in R^K based on these eigenvectors (here K is 10. Why 10? Why not?).
3. Calculate a new distance d(·, ·) between all pairs of words, viewing the words as points in R^K; construct a new graph S whose edge weights are based directly on distance in R^K. The graph S can be viewed directly with data-visualization tools such as Gephi, and various clustering techniques can be applied to it as well.

SLIDE 15

First step: 1

Property
W(-1) = wj : the word immediately to the left of w is wj
W(1) = wj : the word immediately to the right of w is wj
W(-2) = wj : the word two words to the left of w is wj; etc.
W(-2,-1) = (wj, wk) : W(-2) = wj and W(-1) = wk
W(-1,1) = (wj, wk) : W(-1) = wj and W(1) = wk

SLIDE 16

(Figure: example contexts shared by determiners. Contexts such as "in —", "part of —", "spirit of —", "of all —", and "— way" are each filled by words like a, an, his, their, its, our, my, your, this.)

SLIDE 17

(Figure: example contexts shared by modals. Contexts such as "that he —", "— be taken", "maybe I —", "he — get", "— be considered", and "— be a" are each filled by words like would, could, can, should, must, might, may, will, didn’t, couldn’t.)

SLIDE 18

Step 2

Eigenvector number 1 (rank, word, coordinate):

0 world −0.059
1 problem −0.054
2 family −0.054
3 car −0.054
4 state −0.053
5 same −0.053
6 city −0.052
7 way −0.052
8 man −0.052
9 church −0.051
10 number −0.051
11 house −0.051
12 program −0.050
13 day −0.049
14 company −0.049
15 case −0.049
...
985 had 0.094
986 as 0.096
987 is 0.100
988 at 0.103
989 was 0.104
990 with 0.104
991 a 0.105
992 that 0.108
993 on 0.110
994 and 0.114
995 for 0.115
996 of 0.123
997 the 0.125
998 to 0.142
999 in 0.148

SLIDE 19

Eigenvector number 2 (rank, word, coordinate):

0 the −0.155
1 a −0.129
2 his −0.103
3 this −0.086
4 it −0.086
5 that −0.084
6 to −0.080
7 in −0.079
8 their −0.076
9 an −0.074
10 he −0.071
11 our −0.070
12 its −0.068
13 of −0.067
14 for −0.066
15 they −0.065
...
985 bring 0.118
986 think 0.119
987 tell 0.131
988 say 0.132
989 go 0.134
990 know 0.141
991 give 0.145
992 find 0.161
993 see 0.166
994 do 0.174
995 make 0.177
996 take 0.179
997 get 0.182
998 be 0.190
999 have 0.202

SLIDE 20

Eigenvector number 3 (rank, word, coordinate):

0 would −0.148
1 was −0.142
2 could −0.140
3 had −0.131
4 is −0.125
5 can −0.123
6 has −0.114
7 must −0.110
8 may −0.110
9 should −0.105
10 might −0.103
11 will −0.100
12 did −0.099
13 didn’t −0.089
14 were −0.085
15 of −0.078
...
985 it 0.107
986 get 0.108
987 its 0.108
988 see 0.111
989 take 0.112
990 them 0.112
991 him 0.119
992 make 0.122
993 be 0.135
994 their 0.136
995 this 0.143
996 her 0.147
997 his 0.171
998 a 0.185
999 the 0.238

SLIDE 21

Eigenvector number 4 (rank, word, coordinate):

0 of −0.161
1 and −0.156
2 in −0.153
3 to −0.137
4 for −0.130
5 with −0.119
6 is −0.111
7 from −0.109
8 by −0.106
9 on −0.100
10 into −0.096
11 was −0.088
12 at −0.086
13 or −0.083
14 are −0.074
15 will −0.072
16 would −0.071
...
984 presented 0.096
985 sent 0.097
986 expected 0.098
987 able 0.099
988 obtained 0.100
989 said 0.102
990 called 0.105
991 held 0.107
992 asked 0.108
993 been 0.110
994 brought 0.110
995 told 0.113
996 given 0.120
997 done 0.140
998 made 0.142
999 taken 0.147

SLIDE 22

Eigenvector number 10 (rank, word, coordinate):

0 them −0.131
1 him −0.128
2 me −0.103
3 himself −0.103
4 years −0.097
5 may −0.095
6 God −0.094
7 dollars −0.093
8 can −0.092
9 should −0.089
10 out −0.089
11 money −0.088
12 must −0.085
13 might −0.082
14 time −0.082
15 discrimination −0.080
16 up −0.076
17 courses −0.075
...
984 took 0.066
985 Federal 0.066
986 Soviet 0.066
987 its 0.067
988 gave 0.067
989 San 0.068
990 Democratic 0.068
991 General 0.069
992 Hospital 0.069
993 saw 0.076
994 got 0.077
995 had 0.080
996 a 0.087
997 Highway 0.091
998 Health 0.094
999 the 0.113

SLIDE 23

‘made’ 3-neighbors and 2 generations

obtained made built played developed studied engaged expressed created formed presented

SLIDE 24

First step: 3

Let V be the number of distinct word types in the language. Then there are in principle V² features of the type W(-2,-1), and likewise of the types W(-1,1) and W(1,2). But the features that are actually attested form a small subset of that total. For example, in an English-language encyclopedia composed of 888,000 distinct words, there were 1,689,000 distinct trigrams, of which 1,465,000 (nearly 87%) occur only once.
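Counting which features actually occur takes only a few lines. A minimal sketch on a hypothetical toy corpus (the encyclopedia figures above are not reproduced here):

```python
from collections import Counter

# Hypothetical toy corpus standing in for the encyclopedia data.
tokens = "the cat sat on the mat and the cat ran".split()
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

hapax = sum(1 for c in trigrams.values() if c == 1)
print(len(trigrams), hapax)  # distinct trigrams, and how many occur exactly once
```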

SLIDE 25

First step: 4

We define f(wi, wj) as the number of distinct features (using the contextual features just defined) shared by words wi and wj. It is natural to think of a graph C in which the nodes are our words and the edges are weighted by f(wi, wj). The weight between two nodes indicates how many contexts they share, so, all other things being equal, the stronger the weight of the edge between word A and word B, the more similar A and B are with respect to their syntactic contexts.
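This first step can be sketched as follows (hypothetical toy corpus; only the W(-1,1) feature is used here, where the full method also uses W(-2,-1) and W(1,2)):

```python
from collections import defaultdict

def context_features(tokens):
    """Map each word type to the set of W(-1,1) contexts
    (left neighbor, right neighbor) in which it occurs."""
    feats = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        feats[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return feats

def f(feats, w1, w2):
    """f(w1, w2): number of distinct contextual features shared by
    w1 and w2. These counts become the edge weights of the graph C."""
    return len(feats[w1] & feats[w2])

tokens = "i saw the dog run and i saw my dog run".split()
feats = context_features(tokens)
print(f(feats, "the", "my"))  # 'the' and 'my' share the context (saw, dog)
```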

SLIDE 26

Laplacian of a graph

The Laplacian of a graph, such as C, is defined as the matrix M in which M(i, j) = f(wi, wj) when i ≠ j. We can think of the edges of the graph as paths through which activation passes from one node to its neighboring nodes on each of a number of successive iterations. If we think of the graph as a recipe for moving activation from one node to another, then the off-diagonal elements M(i, j) show how much activation unit i sends to unit j. For the diagonal elements, we first define d(i) as

d(i) = Σ_{k≠i} M(i, k).

d(i) is the number of times word i appears in the corpus (you see that?). M(i, i) is defined as −1 × d(i): M(i, i) is the negative of the total activation that unit i sends out.
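In matrix form, M can be sketched like this (the 3×3 shared-context counts are hypothetical; f is taken to be symmetric with a zero diagonal):

```python
import numpy as np

# Hypothetical shared-context counts f(w_i, w_j) for three word types.
F = np.array([[0, 2, 1],
              [2, 0, 3],
              [1, 3, 0]], dtype=float)

d = F.sum(axis=1)      # d(i) = sum over k != i of M(i, k)
M = F - np.diag(d)     # M(i, j) = f(w_i, w_j) off the diagonal; M(i, i) = -d(i)

# Every row sums to zero: the diagonal cancels what the node sends out.
print(M.sum(axis=1))
```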

SLIDE 27

We now have an initial similarity measure between words, but this similarity is not normalized for frequency: high-frequency words will be much more similar to other words than low-frequency words will be. Even if we normalize for frequency, though, the simplest way of estimating similarity of distribution between two words on the basis of this data (using the cosine of the angle subtended by vectors pointing to each of the two words) is not as good as we might hope.

SLIDE 28

Second step: 1

A number of researchers have explored the idea of taking a large set of data in a space of very high dimensionality, and finding a subspace of much lower dimensionality which is almost everywhere fairly close to the data. We’ve been especially influenced by the work of Partha Niyogi and Mikhail Belkin in the discussion that follows.


SLIDE 29

Second step: 2

This means finding the eigenvectors of a normalized version of the graph Laplacian. The normalized version of M, which we call N, is defined as follows: for all i, N(i, i) = −1, while for i ≠ j we use the d() function defined above to normalize, and say that

N(i, j) = M(i, j) / √(d(i) d(j)).

SLIDE 30

Second step and third step

We computed the first 11 eigenvectors of this normalized Laplacian (those with the lowest eigenvalues), and used the 2nd through the 11th to give us coordinates for each word. Each word is thus associated with a point in R^10. We then select, for each word, the k closest words to it in this new space. These are the neighbors that we will explore below.
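The second and third steps can be sketched end to end. The similarity matrix below is a hypothetical 4-word toy; the talk keeps the 2nd through 11th eigenvectors for R^10, while this sketch keeps only 2 dimensions:

```python
import numpy as np

def spectral_embedding(F, dims):
    """Coordinates from eigenvectors of the normalized Laplacian of the
    similarity graph F, skipping the trivial first eigenvector."""
    d = F.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)
    # L_sym = I - D^{-1/2} F D^{-1/2}; this is -N in the slides' notation.
    L = np.eye(len(F)) - inv_sqrt[:, None] * F * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, 1:1 + dims]

def nearest_neighbors(X, k):
    """Indices of the k closest rows (words) to each row of X."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

# Hypothetical counts: words 0 and 1 share many contexts, as do words 2 and 3.
F = np.array([[0, 9, 1, 1],
              [9, 0, 1, 1],
              [1, 1, 0, 9],
              [1, 1, 9, 0]], dtype=float)
X = spectral_embedding(F, dims=2)
print(nearest_neighbors(X, k=1).ravel())  # each word's single closest word
```

The neighbor lists computed this way are what define the edges of the graph S described above.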

SLIDE 31

2,000 words of French

(Figure: Gephi visualization of 2,000 French word types laid out by this method; the full word scatter is omitted here. Cluster labels on the figure include: infinitives, names of cities, passé simple forms, adjectives of countries, feminine nouns, plural nouns, adverbs, names of countries, the determiners des, les, ses, and xvii.)

SLIDE 32

‘made’ 3-neighbors and 2 generations

obtained made built played developed studied engaged expressed created formed presented

SLIDE 33

‘made’ 3 neighbors and 3 generations

executed followed played developed engaged added revived described built achieved expressed directed extended initiated sold formed imposed presented opened obtained made practiced lost created studied

SLIDE 34

Help with learning morphology

jump   jumps   jumped   jumping    NULL-s-ed-ing
walk   walks   walked   walking    NULL-s-ed-ing
move   moves   moved    moving     e-es-ed-ing
build  builds  built    building   d-ds-t-ding
make   makes   ??       making     NULL-s-ing
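The bookkeeping behind such signatures can be sketched as follows. This is a deliberately simplified, hypothetical stand-in for real unsupervised morphology learning: the candidate suffix list is hard-coded, and stem changes like move/moving (the e-es-ed-ing signature above) are not handled:

```python
from collections import defaultdict

SUFFIXES = ["", "s", "ed", "ing"]  # "" plays the role of NULL

def signatures(words):
    """Group observed words by candidate stem and record which suffixes
    attach to each stem, e.g. 'jump' -> 'NULL-s-ed-ing'."""
    wordset = set(words)
    stems = defaultdict(set)
    for w in wordset:
        for suf in SUFFIXES:
            if suf == "":
                stems[w].add("")
            elif w.endswith(suf):
                stems[w[: -len(suf)]].add(suf)
    out = {}
    for stem, sufs in stems.items():
        # Keep stems with at least two attested forms, all present as words.
        if len(sufs) > 1 and all(stem + s in wordset for s in sufs):
            out[stem] = "-".join("NULL" if s == "" else s
                                 for s in SUFFIXES if s in sufs)
    return out

words = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking", "make", "makes", "making"]
print(signatures(words))
```

Note how 'make' ends up with a gapped signature because 'maked' is unattested; the word-manifold neighbors are one way to decide what should fill such a gap.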

SLIDE 35

‘with’ 3 neighbors and 3 generations

on from for into toward with

SLIDE 36

‘with’ 5-neighbors and 2 generations

on from for that into within against to as through toward with over

SLIDE 37

‘with’ 5-neighbors and 3 generations

on only around for that into within upon near against to as both through but see whose toward with over

SLIDE 38

alliance dynasty crusade assembly continent regime style alphabet agency father lake day instance station capital policy marriage basin territory conflict principle province direction junction theory park agreement coast peninsula minister tradition height hall bible language career valley route movement project encyclopedia era action dispute example husband

Figure: ‘language’ 9 neighbors and 2 generations

SLIDE 39

‘the’ 5 neighbors, 3 generations

a both all his whose her most these many no some two one four this other every three various the

SLIDE 40

‘would’ 5 neighbors, 3 generations

contains takes would shall may becomes could remained had took serves should will did became attempted grew began seems

SLIDE 41

‘pays’ 5 neighbors and 2 generations

fils roi palais concept corps pays massif prix procédé progrès peuple désert nom discours gaz pape mot terme bras tiers congrès fleuve récit

SLIDE 42

‘langue’ 3 neighbors and 3 generations

volonté compagnie lutte conception chaîne vallée langue ligne quantité révolte capacité résistance puissance crise domination pensée voie force vision

SLIDE 43

‘langage’ 3 neighbors and 3 generations

conseil langage travail goût climat texte journal jeu rythme projet château bassin théâtre lac

SLIDE 44

‘le’ 3 neighbors and 3 generations

«le le «la la notre aucune cette ce son chaque d'une aucun qu'une celui-ci

SLIDE 45

‘moment’ 4 neighbors, 3 generations

conflit coeur pont massif terrain cercle rythme frère rayonnement village compositeur canal voyage langage peintre climat jeu duc revenu souverain détroit mandat médecin moment département poète golfe mont

SLIDE 46

‘petites’, 3 neighbors

propres remarquables grands graves nouveaux diverses cents autres anciennes premières principales derniers dernières petits nombreux petites différents


slide-48
SLIDE 48

There is a simple connection between minimizing the squared distance between nodes (though we haven’t explained yet what kind of distance we are talking about now) of a weighted graph and the graph’s Laplacian. We assume that no vertex is adjacent to itself. From a purely formal point of view, we could say that we are looking for a vector x in RV which minimizes the expression, where W is the adjacency matrix of the graph, and wi,j are its entries:

$$\sum_{i,j} (x_i - x_j)^2\, w_{i,j} \qquad (1)$$

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 42 / 49

slide-49
SLIDE 49

Now we get to the kind of distance we’re talking about: from the point of view of a projection, imagine that the entries wi,j in the matrix W express the “similarity” between the ith and the jth element. We are looking for a single vector x, then, which assigns very similar values to its ith and jth coordinates just in case those two coordinates correspond to elements that are “similar”. We can think of that vector as representing a map from the graph’s nodes to the real line; that is how we will think about it now, for the most part.

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 43 / 49

slide-50
SLIDE 50

We define a diagonal matrix D such that dii is the sum of the weights associated with the edges adjacent to the ith vertex: $d_{ii} = \sum_j w_{i,j}$. Then

$$\sum_i \sum_j (x_i - x_j)^2\, w_{i,j} = \sum_i \sum_j (x_i^2 + x_j^2 - 2 x_i x_j)\, w_{i,j} \qquad (2)$$

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 44 / 49

slide-51
SLIDE 51

$$= \sum_i \sum_j x_i^2\, w_{i,j} + \sum_i \sum_j x_j^2\, w_{i,j} - 2 \sum_i \sum_j x_i x_j\, w_{i,j} \qquad (3)$$

$$= \sum_i x_i^2 \sum_j w_{i,j} + \sum_j \sum_i x_j^2\, w_{i,j} - 2 \sum_i \sum_j x_i x_j\, w_{i,j} \qquad (4)$$

$$= \sum_i x_i^2\, d_{ii} + \sum_j x_j^2 \sum_i w_{i,j} - 2 \sum_i \sum_j x_i x_j\, w_{i,j} \qquad (5)$$

$$= \sum_i x_i^2\, d_{ii} + \sum_j x_j^2\, d_{jj} - 2 \sum_i \sum_j x_i x_j\, w_{i,j} \qquad (6)$$

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 45 / 49

slide-52
SLIDE 52

The first two terms are identical, and each is equal to XT DX, while the third term is twice XT WX. So

$$\sum_i \sum_j (x_i - x_j)^2\, w_{i,j} = 2(X^T D X - X^T W X) = 2\, X^T (D - W) X \qquad (7)$$

It turns out that the matrix D − W has a name: it is the Laplacian of the matrix W (or of the graph of which W is the adjacency matrix). So we’ll write L = D − W. And there is a more natural way of writing XT (D − W)X, which is to write (X, LX), which we can read as the inner product of the vector X and the vector LX.
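Identity (7) can be checked numerically. The sketch below uses a small made-up weight matrix and an arbitrary vector, chosen only for illustration:

```python
import numpy as np

# A small symmetric weight matrix W (a hypothetical 4-node word graph),
# with zero diagonal since no vertex is adjacent to itself.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 3.],
              [1., 0., 0., 1.],
              [0., 3., 1., 0.]])

D = np.diag(W.sum(axis=1))   # degree matrix: d_ii = sum_j w_ij
L = D - W                    # the (unnormalized) graph Laplacian

x = np.array([0.5, -1.0, 2.0, 0.25])

# Left-hand side: sum over all pairs of (x_i - x_j)^2 * w_ij
lhs = sum((x[i] - x[j]) ** 2 * W[i, j]
          for i in range(4) for j in range(4))

# Right-hand side: 2 * (x, Lx)
rhs = 2 * x @ L @ x

assert np.isclose(lhs, rhs)
```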

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 46 / 49

slide-53
SLIDE 53

If we restrict our attention to vectors of unit length, then this quantity (X, LX) is called the Rayleigh quotient, and we can find its maximal and minimal values along the eigenvectors of the Laplacian. This is quite remarkable!

Before we get to why that should be the case, we are going to squeeze the matrix so that its major diagonal consists of just 1’s. We do this by defining a normalized Laplacian, dividing each entry lij of L by $\sqrt{d_{ii}}\sqrt{d_{jj}}$. We can write this:

$$L' = D^{-\frac{1}{2}} L\, D^{-\frac{1}{2}} \qquad (8)$$
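As a sanity check on (8), here is a small sketch (with an arbitrary 3-node weight matrix, invented for illustration) confirming that the normalized Laplacian has 1’s down its major diagonal:

```python
import numpy as np

# Arbitrary symmetric weight matrix with zero diagonal.
W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

d = W.sum(axis=1)            # d_ii = sum_j w_ij
L = np.diag(d) - W           # unnormalized Laplacian

# L' = D^{-1/2} L D^{-1/2}: divide l_ij by sqrt(d_ii) * sqrt(d_jj)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt

# The major diagonal of L' is all 1's, since l_ii = d_ii.
assert np.allclose(np.diag(L_norm), 1.0)
```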

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 47 / 49

slide-54
SLIDE 54

If you are following this, you can see that $L' = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$. The first term is the identity matrix; the second has 0s down the major diagonal, is symmetric, and has only non-negative values; let’s call it W′, because it is the normalized form of W. And we have a better intuitive understanding of a matrix such as W′, because it can naturally describe an ellipsoid: if we look at the points x such that (x, W′x) is a constant, we get an ellipsoid. Furthermore, W′x is a vector normal to the surface of that ellipsoid at the point x. If we think about this geometrically, that means that (x, W′x) will be at a local maximum when x and W′x point in the same direction, which is the same thing as saying that x is an eigenvector of W′.

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 48 / 49

slide-55
SLIDE 55

So we look at the eigenvectors of W′, or of L′. If we look at the eigenvectors of W′, we sort them by decreasing eigenvalue, so that λ0 is the largest eigenvalue, and its eigenvector simply reflects the overall frequencies of the graph. [Note: sometimes people start numbering the eigenvalues at 1, and sometimes at 0, as I have done here.] The second eigenvalue, λ1, is of great importance in graph theory. Here we care about its eigenvector, though, and we look at the values it assigns to each word.
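A minimal sketch of the whole recipe on a made-up five-word similarity graph (the words and weights are invented; a real run would use the context-agreement counts described earlier): normalize W, take the eigenvector for λ1, and read off each word’s position on the real line.

```python
import numpy as np

# Hypothetical word graph: weights = how strongly two words share contexts.
words = ["the", "my", "dog", "cat", "runs"]
W = np.array([[0., 5., 1., 1., 0.],
              [5., 0., 1., 1., 0.],
              [1., 1., 0., 4., 1.],
              [1., 1., 4., 0., 1.],
              [0., 0., 1., 1., 0.]])

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
W_norm = D_inv_sqrt @ W @ D_inv_sqrt       # the normalized W'

vals, vecs = np.linalg.eigh(W_norm)        # ascending eigenvalues
order = np.argsort(vals)[::-1]             # sort by decreasing eigenvalue
v1 = vecs[:, order[1]]                     # eigenvector for lambda_1

# Place each word on the real line by its lambda_1 coordinate;
# words with similar contexts receive nearby values.
placement = sorted(zip(v1, words))

# The determiners land on one side, the nouns on the other.
assert v1[0] * v1[1] > 0                   # "the" and "my" together
assert v1[0] * v1[2] < 0 and v1[0] * v1[3] < 0
```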

John Goldsmith (University of Chicago) Word manifolds July 15, 2015 49 / 49

slide-56
SLIDE 56

Towards a new empiricism for linguistics

Introduction

The goal of this chapter

The view of linguistics which we will consider in this chapter is empiricist in the sense explored in Chapter One of this book: it is epistemologically empiricist, rather than psychologically empiricist; in fact, it is a view that is rather agnostic about psychology — ready to cooperate with psychology and psychologists, but from a certain respectful distance. It is empiricist in the belief that the justification of a scientific theory must drive deep into the quantitative measure of real-world data, both experimental and observational, and it is empiricist in seeing continuity (rather than rupture or discontinuity) between the careful treatment of large-scale data and the desire to develop elegant high-level theories. To put that last point slightly differently, it is not an empiricism that is skeptical of elegant theories, or worried that the elegance of a theory is a sign of its disconnect from reality. But it is an empiricism that insists on measuring just how elegant a theory is, and measuring how well it is (or isn’t) in sync with what we have observed about the world. It is not an empiricism that is afraid of theories that leave observations unexplained, but it is an empiricism that insists that discrepancies between theory and observation are a sign that more work will be needed, and sooner rather than later. And it is an empiricism that knows that scientific progress cannot be reduced to mechanistic procedures, and even knows exactly why it cannot.

Thus this chapter has four points to make: first, that linguists can and should make an effort to measure explicitly how good the theoretical generalizations of their theories are; second, that linguists must make an effort to measure the distance between their theories’ predictions and our observations; third, that there are actually things we working linguists could do in order to achieve those goals; and fourth, that many of the warnings to the contrary have turned out to be much less compelling than they seemed to be, once upon a time.

slide-57
SLIDE 57

…arbitrarily close to the probability of the whole set. Thus a probability measure assigned to an infinite set makes it almost as manageable as a finite set, while still remaining resolutely infinite. That is the heart of the matter.

We need to be clear right from the start that the use of probabilistic models does not require that we assume that the data itself is in a linguistic sense “variable,” or in any sense fuzzy or unclear. I will come back to this point; it is certainly possible within a probabilistic framework to deal with data in which the judgments are non-categorical and in which a grammar predicts multiple possibilities. But in order to clarify the fundamental points, I will not assume that the data are anything except categorical and clear.

Assume most of what you normally assume about formal grammars: they specify an infinite set of linguistic representations, they characterize what is particular about particular languages, and at their most explicit they specify sequences of sounds as well as sequences of words. It is not altogether unreasonable, then, to say that a grammar essentially is a specification of sounds (or letters) particular to a language, plus a function that assigns to every sequence of sounds a real value: a non-negative value, with the characteristic that the sum of these values is 1.0. To make matters simpler for us, we will assume that we can adopt a universal set of symbols that can be used to describe all languages, and refer to that set as Σ.³

3 I do not really believe this is true, but it is much easier to express the ideas we are interested in here if we make this assumption. See Robert Ladd, Handbook of Phonological Theory, vol. 2, for discussion.

A grammar, then, is a function g with the properties in (1).

$$g : \Sigma^* \to [0,1], \qquad \sum_{s \in \Sigma^*} g(s) = 1 \qquad (1)$$

The grammar assigns a probability (necessarily non-negative, but not necessarily positive) to all strings of segments, and these sum to 1.⁴

4 If you are concerned about what happened to trees and the rest of linguistic structure, don’t worry. We typically assign a probability to a structure, and then the probability assigned to the string is the sum of the probabilities assigned to all of the structures that involve the same string.
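A toy instance of (1) can make this concrete. Here the infinite sum over Σ* is truncated to a finite support, and the strings and probabilities are invented purely for illustration:

```python
# A toy "grammar" in the sense of (1): a map from strings over an
# alphabet to probabilities that sum to 1. A real grammar assigns a
# probability to every string in the infinite set Sigma*; here all
# but four strings receive probability 0.
g = {"ba": 0.5, "baba": 0.25, "bababa": 0.125, "": 0.125}

# Every value lies in [0, 1], and the values sum to 1.
assert all(0.0 <= p <= 1.0 for p in g.values())
assert abs(sum(g.values()) - 1.0) < 1e-12
```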

A theory of grammar is much the same, at a higher level of abstraction. It is a specification of the set G of all possible grammars, along with a function that maps each grammar to a positive number (which we call its probability), and the sum of these values must be 1.0, as in (2). We use the symbol π to represent such functions, and each one is in essence a particular Universal Grammar.

$$\pi : \mathcal{G} \to [0,1], \qquad \sum_{g \in \mathcal{G}} \pi(g) = 1 \qquad (2)$$

To make things a bit more concrete, we can look ahead and see that the function π is closely related to grammar complexity: in

slide-58
SLIDE 58

Let’s look more closely at this grammar-compiler, which we will refer to as UG(UTM1): it is a Universal Grammar for UTM1, and for any particular UTM, there can be many such. Each grammar-compiler constitutes a set of recommendations for best practices for writing grammars of natural languages: in short, a linguistic theory. In particular, we define a given UG by an interface, in the following sense; we need to do this in order to be able to speak naturally about one and the same UG being run on different UTMs (a point we will need to talk about in the next section). A UG specifies how grammars should be written, and it specifies exactly what it costs to write out any particular thing a grammarian might want to put into a grammar. Naturally, for a given UTM, there may be a large number of ways of implementing this, but we care only about the simplest one, and we will henceforth take it for granted that we can hire someone and outsource the problem of finding the implementation of a particular UG on any particular UTM.

Once we have such a grammar, we can make a long tape, consisting first of UG(UTM1), followed by a Grammar for English (or whatever language we’re analyzing), as we have already noted, plus a compressed form of the data, which is a sequence of 0s and 1s which allows the grammar to perfectly reconstruct the original data. It is a basic fact of information theory that if one has a probabilistic grammar, then the number of bits (0s and 1s) that it takes to perfectly reproduce the original data is exactly −log₂ pr(data). We use that fact here: and we set things up so that the third section of the information passed to the UTM is a sequence of 0s and 1s that perfectly describes the original data, given the Universal Grammar and the grammar of the language in question.
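The counting fact used here can be checked on a toy probabilistic model (a unigram distribution over three letters, with invented probabilities): the number of bits needed to encode the data is −log₂ pr(data).

```python
import math

# Toy unigram "grammar" over three letters (probabilities made up).
model = {"a": 0.5, "b": 0.25, "c": 0.25}
data = "abca"

# Description length of the data given the model, in bits:
# -log2 pr(data) = sum over symbols of -log2 p(symbol).
bits = sum(-math.log2(model[ch]) for ch in data)

# pr(data) = (1/2)(1/4)(1/4)(1/2) = 2**-6, so the data costs 6 bits.
assert abs(bits - 6.0) < 1e-9
```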

[Figure 1: Input to the Turing machine: a tape consisting of UG1, then the Grammar, then the Data.]

As we have already mentioned, there will be many different ways of accomplishing this. Each UTM is consistent with an indefinitely large number of such universal grammars, so notationally we’ll have to index them; we’ll refer to different Universal Grammars for a given UTM (let’s say it is still UTM1) as UG1(UTM1), UG2(UTM1), etc. This is no different from the situation we live in currently: there are different theories of grammar, and each one can be thought of

slide-59
SLIDE 59

…device that would produce a grammar, given a natural language corpus. In the second model, the formal device would not generate the grammar, but it would check to ensure that in some fashion or other the grammar was (or could be) properly and appropriately deduced or induced from the data. In the third model, linguists would develop a formal model that neither produced nor verified grammars, given data, but rather, the device would take a set of observations, and a set of two (or more) grammars, and determine which one was the more (or most) appropriate for the corpus. Chomsky suggests that the third, the weakest, is good enough, and he expresses doubt that either of the first two is feasible in practice.

[Figure 2: Chomsky’s three conceptions of linguistic theory. (i) Data → discovery device → correct grammar of the corpus. (ii) Data + grammar → verification device → yes or no. (iii) Data + Grammar 1 + Grammar 2 → evaluation metric → G1 is better, or G2 is better.]

Chomsky believed that we could and should account for grammar selection on the basis of the formal simplicity of the grammar, and that the specifics of how that simplicity should be defined was a matter to be decided by studying actual languages in detail. In the last stage of classical generative grammar, Chomsky went so far as to propose that the specifics of how grammar complexity should be defined is part of our genetic endowment. His argument against Views #1 and #2 was weak, so weak as to perhaps not merit being called an argument: what he wrote was that he thought that neither could be successfully accomplished, based in part on the fact that he had tried for several years, and in addition he felt hemmed in by the kind of grammatical theory that appeared to be necessary to give such perspectives a try. But regardless of whether it was a strong argument, it was convincing.17

17 Chomsky’s view of scientific knowledge was deeply influenced by Nelson Goodman’s view, a view rooted in a long braid of thought about the nature of science. Without going back too far, we can trace the roots back to Ernst Mach, who emphasized the role of simplicity of data description in the role played by science, and to the Vienna Circle, which began as a group of scholars interested in developing Mach’s perspectives on knowledge and science. And all of these scholars viewed themselves, quite correctly, as trying to cope with the problem of induction as it was identified by David Hume in the 18th century: how can anyone be sure of a generalization (especially one with

Chomsky proposed the following methodology, in three steps. First, linguists should develop formal grammars for individual languages, and treat them as scientific theories, whose predictions could be tested against native speaker intuitions among other things. Eventually, in a fashion parallel to the way in which a theory of

slide-60
SLIDE 60

[Fig. 1.4: Unsupervised learning of grammars. Data → bootstrap device → grammar G; an incremental change produces a new G; an evaluation metric selects the preferred grammar G∗; if the halting test fails, the loop repeats with G∗; otherwise, halt.]

length (which we would minimize, because in some respects it is inverted with respect to probability).

Given data D, find $g = \arg\max_{g \in \mathcal{G}} p_g(D)$.

Given data D, find $g = \arg\max_{g \in \mathcal{G}} [p_g(D) - \mathrm{cost}(g)]$.

These are two very different goals! And a person could perfectly well want to work on both problems.

Very important: the most important reason that we develop probabilistic models is to evaluate and compare different grammars. (It is not in order to assign probabilities to data that already exists, or that does not exist yet. We are not rolling dice.)

Importance in the computational sphere of quantitative measurement of success. There is little emphasis on evaluating a model based on its fit to a scientist’s intuition. Probability is the quantitative theory of evidence.

Chapter 1 Class 1: Overview of information theory and machine learning for

slide-61
SLIDE 61

[Figure 3: The pre-classical generative problem. My UG yields my grammar of English and my grammar of Swahili; your UG yields your grammar of English and your grammar of Swahili; each grammar is matched against the English corpus or the Swahili corpus.]

slide-62
SLIDE 62

[Figure 4: Generative model with a data term. English grammars 1–3, each paired with compressed English data, against an English corpus; Arabic grammars 1–3, each paired with compressed Arabic data, against an Arabic corpus.]

slide-63
SLIDE 63

[Figure 5: The importance of measuring the size of the UG. Under UG1 and under UG2 alike: English grammars 1–3 with their compressed data and the English corpus, and Arabic grammars 1–3 with their compressed data and the Arabic corpus; the two UGs differ in their own size.]

slide-64
SLIDE 64

The limits of conventionalism for UTMs

Join the club

We propose that the solution to the problem is to divide our effort up into four pieces: the selection of the best UTM, the selection of a universal grammar UG∗ among the candidate universal grammars proposed by linguists, the selection of the best grammar g for each corpus, and the compressed length (the plog) of that corpus, given that grammar g: see Figure 6.

[Figure 6: Total model. A UTM, then a UG, then for each language a grammar and its data: English grammar-2 with English data, Igbo grammar-5 with Igbo data, Arabic grammar-2 with Arabic data.]

We assume that the linguists who are engaged in the task of discovering the best UG will make progress on that challenge by competing to find the best UG and by cooperating to find the best common UTM. In this section, we will describe a method by which they can cooperate to find a best common UTM, which will allow one of them (at any given moment) to unequivocally have the best UG, and hence the best grammar for each of the data sets from the different languages. The concern now, however, is this: we cannot use even an approximation of Kolmogorov complexity in order to help us choose the best UTM, because we have to have already chosen a UTM in order to talk about Kolmogorov complexity. We need to find a different rational solution to the problem of selecting a UTM that we can all agree on. We will now imagine an almost perfect scientific linguistic world in which there is a competition among a certain number of groups of researchers, each particular group defined by sharing a general formal linguistic theory. The purpose of the community is to play

slide-65
SLIDE 65

[Figure 7: 3 competing UTMs out of k. For each of UTM1, UTM2, and UTM3, the translations from it to each of the other machines: UTM1 → UTM2, UTM1 → UTM3, UTM1 → UTM4, …, UTM1 → UTMk; UTM2 → UTM1, …, UTM2 → UTMk; UTM3 → UTM1, …, UTM3 → UTMk.]

slide-66
SLIDE 66

[Figure 8: The effect of using different UTMs. Under UTMα and under UTMβ alike: UG1 with grammar G1 and Data1, and UG2 with grammar G2 and Data2.]

a game by which the best general formal linguistic theory can be encouraged and identified. Who the winner is will probably change over time as theories change and develop.

The annual winner of the competition will be the one whose total model length (given this year’s UTM choice) is the smallest: the total model length is the size of the team’s UG when coded for the year’s UTM, plus the length of all of the grammars, plus the compressed length of all of the data, given those grammars. Of these terms, only the size of the UG will vary as we consider different UTMs. The winning overall team will have an influence, but only a minor influence, on the selection of the year’s winning UTM. We will return in just a moment to a method for selecting the year’s winning UTM; first, we will spell out a bit more of the details of the competition. Let us say that there are N members (that is, N member groups). To be a member of this club, you must subscribe to the following (and let’s suppose you’re in group i):

1. You adopt an approved Universal Turing machine (UTMα). I will explain later how a person can propose a new Turing machine and get it approved. But at the beginning, let’s just assume that there is a set of approved UTMs, and each group must adopt one. I will index different UTMs with superscript lower-case Greek letters, like UTMα: that is a particular approved universal Turing machine; the set of such machines that have already been approved is U. You will probably not be allowed to keep your UTM for the final competition, but you might. You have a weak preference for your own UTM, but you recognize that your preference is likely not going to be adopted by the group. The group will jointly try to find the UTM which shows the least bias with respect to the submissions of all of

slide-67
SLIDE 67

[Figure 9: What Linguistic Group k wants to minimize. UGk, plus its grammar of English with compressed English data, its grammar of Swahili with compressed Swahili data, …, and its grammar of Sierra Miwok with compressed Sierra Miwok data.]

…tors’ systems, because of the UTM that they use to find the minimum. That is, suppose we are talking about two groups, Group 1 and Group 2, which utilize UTMα and UTMβ. It is perfectly possible (indeed, it is natural) to find that (see Figure 8)

$$|UG_i|_{UTM^\alpha} + \mathrm{Emp}(UG_i, \{\Gamma_l\}_1, C) < |UG_j|_{UTM^\alpha} + \mathrm{Emp}(UG_j, \{\Gamma_l\}_2, C) \qquad (13)$$

and yet, for a value of β different from α:

$$|UG_i|_{UTM^\beta} + \mathrm{Emp}(UG_i, \{\Gamma_l\}_1, C) > |UG_j|_{UTM^\beta} + \mathrm{Emp}(UG_j, \{\Gamma_l\}_2, C) \qquad (14)$$

This is because each group has a vested interest in developing a UTM which makes their Universal Grammar extremely small. This is just a twist, just a variant, on the Universal-Grammar-is-Free fallacy that I discussed above. Comparison of UGs, grammars, and compressed data is made, relatively easily, across different groups of researchers, because for these three things there is a common unit of measurement, the bit. This is not the case, however, for UTMs: we have no common currency with which to measure the length, in any meaningful sense, of a UTM. We need, therefore, a qualitatively different way to reach consensus on a UTM across a group of competitors, our research groups.

Which Turing machine? The least biased one.

With all of this bad news about the difficulty of choosing a universally accepted Universal Turing machine, how can we play this game

slide-68
SLIDE 68

[Figure 10: Competing to be the UTM of the year. Each candidate UTM is evaluated by the set of translations from it to the other UTMs (UTM1 → UTM2, …, UTM1 → UTM6, and so on for each candidate).]