Word manifolds
John Goldsmith
University of Chicago
July 15, 2015
Goals

Visualize the global structure of a language.
Solve a technical problem in the unsupervised learning of morphology (the past tenses of English verbs).
Develop a language-independent method.
The algorithm is in three steps:

Algorithm
1. Compare all pairs of words to see which words agree on the words that precede and follow them. ‘the’ and ‘my’ will agree a lot.
2. Turn this abstract graph into something in a geometric space, so that we can talk about distances.
3. In that geometric space of dimension 10, ask each word to find the 6 words closest to it, and make a graph out of those edges.
The same three steps, stated more carefully:

Algorithm
1. Determine the similarity between all pairs of words, based on a comparison of word contexts, and create a graph C whose edge weights are determined directly by those similarities: for every pair of words (w1, w2), we count how many contexts they share.
2. Compute the K most significant eigenvectors of the normalized Laplacian of graph C, and calculate the coordinates of each of the words in R^K based on these eigenvectors (where K is 10; why 10? why not?).
3. Calculate a new distance d(·,·) between all pairs of words, viewing the words as points in R^K; a new graph S is constructed, whose edge weights are directly based on distance in R^K. The graph S can be viewed directly, using data visualization tools such as Gephi, and various clustering techniques can be applied to it as well.
First step: 1

Property definitions:
W(-1) = wj : the word immediately to the left of w is wj.
W(1) = wj : the word immediately to the right of w is wj.
W(-2) = wj : the word two words to the left of w is wj; and so on.
W(-2,-1) = (wj, wk) : W(-2) = wj and W(-1) = wk.
W(-1,1) = (wj, wk) : W(-1) = wj and W(1) = wk.
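To make the feature inventory concrete, here is a minimal sketch (my own code, not the author's) of how the three pair features W(-2,-1), W(-1,1), and W(1,2) might be collected from a token sequence; the padding symbols and data layout are assumptions of mine.

```python
from collections import defaultdict

def context_features(tokens):
    """Map each word type to the set of contextual features it occurs with,
    using the three feature types from the talk: W(-2,-1), W(-1,1), W(1,2).
    The sentence-boundary padding symbols are my own choice."""
    feats = defaultdict(set)
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>", "</s>"]
    for i in range(2, len(padded) - 2):
        w = padded[i]
        l2, l1, r1, r2 = padded[i - 2], padded[i - 1], padded[i + 1], padded[i + 2]
        feats[w].add(("W(-2,-1)", l2, l1))
        feats[w].add(("W(-1,1)", l1, r1))
        feats[w].add(("W(1,2)", r1, r2))
    return feats
```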
[Figure: contexts shared with ‘the’: in frames such as ‘even in —’, ‘part of —’, ‘— way’, ‘in — small’, and ‘spirit of —’, the slot is also filled by a, an, his, their, its, my, this, and your.]

[Figure: contexts shared with ‘would’: in frames such as ‘that he —’, ‘maybe I —’, ‘he — get’, ‘— be taken’, ‘— be considered’, and ‘— be a’, the slot is also filled by could, can, should, must, might, may, will, didn’t, and couldn’t.]
Step 2

Eigenvector number 1 (word, coordinate). Words at one extreme of the ranking: world, problem, family, car, state, same, city, way, man, church, number, house, program, day, company, case. Words at the other extreme (ranks 985-999, with coordinates): had 0.094, as 0.096, is 0.100, at 0.103, was 0.104, with 0.104, a 0.105, that 0.108, (?) 0.110, and 0.114, for 0.115, (?) 0.123, the 0.125, to 0.142, in 0.148.
Eigenvector number 2 (word, coordinate). One extreme: the, a, his, this, it, that, to, in, their, an, he, (?), its, (?), for, they. Other extreme (ranks 985-999, with coordinates): bring 0.118, think 0.119, tell 0.131, say 0.132, go 0.134, know 0.141, give 0.145, find 0.161, see 0.166, do 0.174, make 0.177, take 0.179, get 0.182, be 0.190, have 0.202.
Eigenvector number 3 (word, coordinate). One extreme: would, was, could, had, is, can, has, must, may, should, might, will, did, didn’t, were, (?). Other extreme (ranks 985-999, with coordinates): it 0.107, get 0.108, its 0.108, see 0.111, take 0.112, them 0.112, him 0.119, make 0.122, be 0.135, their 0.136, this 0.143, her 0.147, his 0.171, a 0.185, the 0.238.
Eigenvector number 4 (word, coordinate). One extreme: and, in, to, for, with, is, from, by, (?), into, was, at, (?), are, will, would. Other extreme (ranks 984-999, with coordinates): presented 0.096, sent 0.097, expected 0.098, able 0.099, (?) 0.100, said 0.102, called 0.105, held 0.107, asked 0.108, been 0.110, brought 0.110, told 0.113, given 0.120, done 0.140, made 0.142, taken 0.147.
Eigenvector number 10 (word, coordinate). One extreme: them, him, me, himself, years, may, God, dollars, can, should, (?), money, must, might, time, discrimination, up, courses. Other extreme (ranks 984-999, with coordinates): took 0.066, Federal 0.066, Soviet 0.066, its 0.067, gave 0.067, San 0.068, Democratic 0.068, General 0.069, Hospital 0.069, saw 0.076, got 0.077, had 0.080, a 0.087, Highway 0.091, Health 0.094, the 0.113.
‘made’ 3-neighbors and 2 generations
made built played developed studied engaged expressed created formed presented
First step: 3

Let V be the number of distinct word types in the language. Then there are in principle V² features of the type W(-2,-1), and likewise of the types W(-1,1) and W(1,2). But the number of such features that are actually used is a small subset of the total. For example, in an English-language encyclopedia containing 888,000 distinct words, there were 1,689,000 distinct trigrams, of which 1,465,000 (nearly 87%) occur only once.
First step: 4

We define f(wi, wj) as the number of distinct features (using the contextual features just defined) shared by words wi and wj. It is natural to think of a graph C in which the nodes are our words, and the edges are weighted by f(wi, wj). The weight between two nodes indicates how many contexts they share, so, all other things being equal, the stronger the weight of the edge between word A and word B, the more similar A and B are with respect to their syntactic contexts.
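Continuing the sketch (again my code, not the author's): the weight f(w1, w2) is just the size of the intersection of the two words' feature sets. The rankings on the eigenvector slides run up to 999, which suggests the computation was restricted to roughly the 1,000 most frequent words, though that is an inference on my part.

```python
from itertools import combinations

def shared_context_graph(feats):
    """Edge weight f(w1, w2) = number of distinct contextual features shared by
    the two word types.  `feats` maps word -> set of features, as in the earlier
    sketch.  Returns a dict from unordered word pairs to weights."""
    edges = {}
    for w1, w2 in combinations(feats, 2):
        shared = len(feats[w1] & feats[w2])
        if shared:
            edges[(w1, w2)] = shared
    return edges
```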
Laplacian of a graph

The Laplacian of a graph, such as C, is defined as the matrix M in which M(i, j) = f(wi, wj) when i ≠ j. We can think of the edges as channels along which each node passes activation to its neighboring nodes on each of a number of successive iterations. If we think of the graph as a recipe for moving activation from one node to another, then the off-diagonal elements M(i, j) show how much activation unit i sends to unit j. For the diagonal elements, we first define d(i) as ∑_{k≠i} M(i, k); d(i) is the number of times word i appears in the corpus (you see that?). M(i, i) is defined as −1 × d(i): it is the negative of the total activation that unit i sends out.
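A small sketch of that matrix (my code; note that this M is the negative of the more common convention L = D − W, which only flips the sign of the eigenvalues):

```python
import numpy as np

def slide_laplacian(words, edges):
    """Build the matrix M as defined on the slide: M(i, j) = f(wi, wj) for i != j,
    and M(i, i) = -d(i), where d(i) is the sum of row i's off-diagonal entries."""
    idx = {w: i for i, w in enumerate(words)}
    n = len(words)
    M = np.zeros((n, n))
    for (w1, w2), f in edges.items():
        M[idx[w1], idx[w2]] = f
        M[idx[w2], idx[w1]] = f
    d = M.sum(axis=1)                 # d(i): total weight leaving node i
    M[np.diag_indices(n)] = -d        # diagonal entries are -d(i)
    return M, d
```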
We now have an initial similarity measure between words, but this similarity is not normalized for frequency: high-frequency words will be much more similar to other words than low-frequency words will be. Even if we normalize for frequency, though, the simplest way of estimating similarity of distribution between two words on the basis of this data—using the cosine of the angle subtended by the vectors pointing to each of the two words—is not as good as we might hope.
Second step: 1
A number of researchers have explored the idea of taking a large set of data in a space of very high dimensionality, and finding a subspace of much lower dimensionality which is almost everywhere fairly close to the data. We’ve been especially influenced by the work of Partha Niyogi and Mikhail Belkin in the discussion that follows.
Second step: 2

This means finding the eigenvectors of a normalized version of the graph Laplacian. The normalized version of M, which we call N, is defined as follows: for all i, N(i, i) = −1, while for i ≠ j we use the d() function defined above to normalize, and say that

N(i, j) = M(i, j) / √(d(i) d(j)).
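A sketch of this second step (my code; I use the standard sign convention L_norm = I − D^(−1/2) W D^(−1/2), which is the negative of the slide's N, so the relevant eigenvectors are the ones with the smallest eigenvalues):

```python
import numpy as np

def spectral_embedding(W, K=10):
    """Embed the words in R^K using eigenvectors of the normalized Laplacian.
    W is the symmetric weight matrix (zero diagonal).  We take the K+1
    eigenvectors with the smallest eigenvalues and drop the first one,
    keeping the 2nd through the (K+1)th as coordinates."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    L_norm = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_norm)     # eigenvalues in ascending order
    return eigvecs[:, 1:K + 1]
```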
Second step and third step

We computed the first 11 eigenvectors of this normalized Laplacian (those with the lowest eigenvalues) and used the 2nd through the 11th to give us coordinates for each word. Each word is thus associated with a point in R^10. We then select, for each word, the k closest words to it in this new space. These are the neighbors that we will explore below.
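The third step can then be sketched as an ordinary k-nearest-neighbor graph in the embedded space (my code; the earlier slide uses 6 neighbors). The resulting edge list is what one would load into a tool such as Gephi.

```python
import numpy as np

def knn_graph(coords, k=6):
    """Build the graph S: connect each word to its k nearest neighbors in R^K
    under Euclidean distance.  Returns a set of undirected edges (index pairs)."""
    edges = set()
    for i, x in enumerate(coords):
        dists = np.linalg.norm(coords - x, axis=1)
        dists[i] = np.inf                       # a word is not its own neighbor
        for j in np.argsort(dists)[:k]:
            edges.add(tuple(sorted((i, int(j)))))
    return edges
```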
2,000 words of French

[Figure: visualization of the manifold for 2,000 words of French; labeled clusters include infinitifs (infinitives), noms de villes (city names), passé simple (simple past forms), noms féminins (feminine nouns), noms au pluriel (plural nouns), adverbes (adverbs), and noms de pays (country names).]
‘made’ 3-neighbors and 2 generations
made built played developed studied engaged expressed created formed presented
‘made’ 3 neighbors and 3 generations
executed followed played developed engaged added revived described built achieved expressed directed extended initiated sold formed imposed presented
made practiced lost created studied
Help with learning morphology

jump   jumps   jumped   jumping    NULL-s-ed-ing
walk   walks   walked   walking    NULL-s-ed-ing
move   moves   moved    moving     e-es-ed-ing
build  builds  built    building   d-ds-t-ding
make   makes   ??       making     NULL-s-ing
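To see how the neighbor clusters could feed this morphology problem, here is a toy sketch of my own (not the actual unsupervised learner): given a set of related forms, strip their longest common prefix and read off the suffix signature.

```python
import os

def signature(forms):
    """Strip the longest common prefix from a set of related word forms and
    return (stem, suffix signature); 'NULL' marks an empty suffix."""
    stem = os.path.commonprefix(sorted(forms))
    return stem, "-".join((f[len(stem):] or "NULL") for f in sorted(forms))

# signature(["jump", "jumps", "jumped", "jumping"])   -> ("jump", "NULL-ed-ing-s")
# signature(["build", "builds", "built", "building"]) -> ("buil", "d-ding-ds-t")
```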
‘with’ 3 neighbors and 3 generations
‘with’ 5-neighbors and 2 generations
from for that into within against to as through toward with
‘with’ 5-neighbors and 3 generations
around for that into within upon near against to as both through but see whose toward with
alliance dynasty crusade assembly continent regime style alphabet agency father lake day instance station capital policy marriage basin territory conflict principle province direction junction theory park agreement coast peninsula minister tradition height hall bible language career valley route movement project encyclopedia era action dispute example husband
Figure: ‘language’ 9 neighbors and 2 generations
‘the’ 5 neighbors, 3 generations
a both all his whose her most these many no some two four this every three various the
‘would’ 5 neighbors, 3 generations
contains takes would shall may becomes could remained had took serves should will did became attempted grew began seems
‘pays’ 5 neighbors and 2 generations
palais
discours
‘langue’ 3 neighbors and 3 generations
volonté compagnie lutte conception chaîne vallée langue ligne quantité révolte capacité résistance puissance crise domination pensée voie force vision
‘langage’ 3 neighbors and 3 generations
conseil langage travail goût climat texte journal jeu rythme projet château bassin théâtre lac
‘le’ 3 neighbors and 3 generations
«le le «la la notre aucune cette ce son chaque d'une aucun qu'une celui-ci
‘moment’ 4 neighbors, 3 generations
conflit coeur pont massif terrain cercle rythme frère rayonnement village compositeur canal voyage langage peintre climat jeu duc revenu souverain détroit mandat médecin moment département poète golfe mont
petites, 3 neighbors
propres remarquables grands graves nouveaux diverses cents autres anciennes premières principales derniers dernières petits nombreux petites différents
There is a simple connection between minimizing the squared distances between the nodes of a weighted graph (though we haven't yet explained what kind of distance we are talking about) and the graph's Laplacian. We assume that no vertex is adjacent to itself. From a purely formal point of view, we could say that we are looking for a vector x in R^V which minimizes the following expression, where W is the adjacency matrix of the graph and the w_ij are its entries:

∑_{i,j} (x_i − x_j)² w_ij    (1)
Now we get to the kind of distance we're talking about: from the point of view of a projection, imagine that the entries w_ij in matrix W express the "similarity" between the ith and the jth words. We are looking for a vector x that assigns very similar values to its ith and jth coordinates just in case those two coordinates correspond to elements that are "similar". We can think of that vector as representing a map from the graph's nodes to the real line; that is how we will think about it now, for the most part.
We define a diagonal matrix D such that d_ii is the sum of the weights associated with the edges adjacent to the ith vertex: d_ii = ∑_j w_ij. Then

∑_{i,j} (x_i − x_j)² w_ij = ∑_{i,j} (x_i² + x_j² − 2 x_i x_j) w_ij    (2)
= ∑_{i,j} x_i² w_ij + ∑_{i,j} x_j² w_ij − 2 ∑_{i,j} x_i x_j w_ij    (3)
= ∑_i x_i² ∑_j w_ij + ∑_{i,j} x_j² w_ij − 2 ∑_{i,j} x_i x_j w_ij    (4)
= ∑_i x_i² d_ii + ∑_{i,j} x_j² w_ij − 2 ∑_{i,j} x_i x_j w_ij    (5)
= ∑_i x_i² d_ii + ∑_j x_j² d_jj − 2 ∑_{i,j} x_i x_j w_ij    (6)
The first two terms are identical, and each is equal to X^T D X, while the third term is twice X^T W X. So

∑_{i,j} (x_i − x_j)² w_ij = 2(X^T D X − X^T W X) = 2 X^T (D − W) X    (7)

It turns out that the matrix D − W has a name: it is the Laplacian of the matrix W (or of the graph of which W is the adjacency matrix). So we'll write L = D − W. And there is a more natural way of writing X^T (D − W) X, which is to write (X, LX), which we can read as the inner product of the vector X and the vector LX.
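A quick numeric sanity check of equation (7) on a small random graph (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 5))
W = (W + W.T) / 2                 # symmetric weights
np.fill_diagonal(W, 0.0)          # no vertex is adjacent to itself
D = np.diag(W.sum(axis=1))
L = D - W
x = rng.random(5)

lhs = sum((x[i] - x[j]) ** 2 * W[i, j] for i in range(5) for j in range(5))
rhs = 2 * x @ L @ x               # equation (7): the two sides agree
assert np.isclose(lhs, rhs)
```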
If we restrict our attention to vectors of unit length, then this quantity (X, LX) is called the Rayleigh quotient, and we can find its maximal and minimal values along the eigenvectors of the matrix L. Before we get to why that should be the case, we are going to squeeze the matrix so that its major diagonal consists of just 1s. We do this by defining a normalized Laplacian: we divide each entry l_ij of L by √(d_ii) √(d_jj). We can write this:

L′ = D^(−1/2) L D^(−1/2)    (8)
If you are following this, you can see that L′ = I − D^(−1/2) W D^(−1/2). The first term is the identity matrix; the second has 0s down the major diagonal, is symmetric, and has only non-negative values; let's call it W′, because it is the normalized form of W. And we have a better intuitive understanding of a matrix such as W′, because it can naturally describe an ellipsoid: if we look at the points x such that (x, W′x) is a constant, we get an ellipsoid. Furthermore, W′x is a vector normal to the surface of that ellipsoid at the point x. If we think about this geometrically, that means that (x, W′x) will be a local maximum when x and W′x point in the same direction, which is the same thing as saying that x is an eigenvector of W′.
So we look at the eigenvectors of W′, or of L′. If we look at the eigenvectors of W′, we sort them by decreasing eigenvalue, so λ0 is the largest eigenvalue, and its eigenvector simply reflects the overall frequency of each word, so we set it aside. [Note: sometimes people start numbering the eigenvalues at 1, and sometimes at 0, as I have done here.] The second eigenvalue, λ1, is the next one down; what matters to us is its eigenvector, though, and we look at the values it assigns to each word.
Introduction
The goal of this chapter
The view of linguistics which we will consider in this chapter is empiricist in the sense explored in Chapter One of this book: it is epistemologically empiricist, rather than psychologically empiricist; in fact, it is a view that is rather agnostic about psychology, ready to cooperate with psychology and psychologists, but from a certain respectful distance. It is empiricist in the belief that the justification of our analyses must ultimately rest on what we can observe, and empiricist in seeing continuity (rather than rupture or discontinuity) between the careful treatment of large-scale data and the desire to develop elegant high-level theories. To put that last point slightly differently, it is not an empiricism that is skeptical of elegant theories, or worried that the elegance of a theory is a sign of its disconnect from reality: it sees no tension between appreciating how elegant a theory is, and measuring how well it is (or isn't) in sync with what we have observed about the world. It is not an empiricism that is afraid of theories that leave observations unexplained, but it is an empiricism that insists that discrepancies between theory and observation be confronted sooner rather than later. And it is an empiricism that knows that scientific progress cannot be reduced to mechanistic procedures, and even knows exactly why it cannot.

Thus this chapter has four points to make: first, that linguists can and should make an effort to measure explicitly how good the theoretical generalizations of their theories are; second, that linguists must make an effort to measure the distance between their theories' predictions and our observations; third, that there are actually things we working linguists could do in order to achieve those goals; and fourth, that many of the warnings to the contrary have turned out to be much less compelling than they seemed to be, once upon a time.
A probability measure assigned to an infinite set makes it almost as manageable as a finite set, while still remaining resolutely infinite. That is the heart of the matter.

We need to be clear right from the start that the use of probabilistic models does not require that we assume that the data itself is in a linguistic sense "variable," or in any sense fuzzy or unclear. I will come back to this point; it is certainly possible within a probabilistic framework to deal with data in which the judgments are non-categorical and in which a grammar predicts multiple possibilities. But in order to clarify the fundamental points, I will not assume that the data are anything except categorical and clear.

Assume most of what you normally assume about formal grammars: they specify an infinite set of linguistic representations, they characterize what is particular about particular languages, and at their most explicit they specify sequences of sounds as well as sequences of words. It is not altogether unreasonable, then, to say that a grammar essentially is a specification of sounds (or letters) particular to a language, plus a function that assigns to every sequence of sounds a real value: a non-negative value, with the characteristic that the sum of these values is 1.0. To make matters simpler for us, we will assume that we can adopt a universal set of symbols that can be used to describe all languages, and refer to that set as Σ.3
3. I do not really believe this is true, but it is much easier to express the ideas we are interested in here if we make this assumption. See Robert Ladd, Handbook of Phonological Theory vol. 2, for discussion.
A grammar, then, is a function g with the properties in (1):

g : Σ* → [0, 1],    ∑_{s ∈ Σ*} g(s) = 1    (1)

The grammar assigns a probability (necessarily non-negative, but not necessarily positive) to all strings of segments, and these probabilities sum to 1.4
4. If you are concerned about what happened to trees and the rest of linguistic structure, don't worry. We typically assign a probability to a structure, and then the probability assigned to the string is the sum of the probabilities assigned to all of the structures that involve the same string.
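To make (1) concrete, here is a toy example of my own devising (not a grammar of any real language): a function g over strings in Σ* = {a, b}* whose values sum to 1, generated by a simple stop-or-continue process.

```python
from itertools import product

SIGMA = ["a", "b"]
P_STOP = 0.5                       # probability of ending the string
P_SYM = 1.0 / len(SIGMA)           # each symbol equally likely

def g(s):
    """g(s) = ((1 - P_STOP) * P_SYM) ** len(s) * P_STOP; sums to 1 over Sigma*."""
    return ((1 - P_STOP) * P_SYM) ** len(s) * P_STOP

# Partial check: summing over all strings of length <= 12 already gives ~1.
total = sum(g("".join(chars))
            for n in range(13)
            for chars in product(SIGMA, repeat=n))
print(round(total, 6))             # 0.999878, and the tail shrinks geometrically
```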
A theory of grammar is much the same, at a higher level of abstraction: it specifies a set G of grammars, along with a function that maps each grammar to a positive number (which we call its probability), and the sum of these values must be 1.0, as in (2). We use the symbol π to represent such functions, and each one is in essence a particular Universal Grammar:

π : G → [0, 1],    ∑_{g ∈ G} π(g) = 1    (2)

To make things a bit more concrete, we can look ahead and see that the function π is closely related to grammar complexity.
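The passage breaks off here; one standard way of cashing out the connection (an assumption on my part, though it is consistent with the description-length reasoning later in the chapter) is to give each grammar a prior proportional to 2 raised to minus its length in bits, so that shorter grammars are a priori more probable.

```python
def make_prior(grammar_lengths_in_bits):
    """A hypothetical pi: weight each grammar by 2**(-|g|) and normalize so the
    values sum to 1, as equation (2) requires."""
    weights = {g: 2.0 ** -length for g, length in grammar_lengths_in_bits.items()}
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

print(make_prior({"g1": 1000, "g2": 1002}))   # g1 comes out 4 times more probable
```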
Let's look more closely at this grammar-compiler, which we will refer to as UG(UTM1): it is a Universal Grammar for UTM1, and for any particular UTM there can be many such. Each grammar-compiler constitutes a set of recommendations for best practices for writing grammars of natural languages: in short, a linguistic theory. In particular, we define a given UG by an interface, in the following sense: we need to do this in order to be able to speak naturally about one and the same UG being run on different UTMs (a point we will need to talk about in the next section). A UG specifies how grammars should be written, and it specifies exactly what it costs to write out any particular thing a grammarian might want to put into a particular UG on any particular UTM.

Once we have such a grammar, we can make a long tape, consisting first of UG(UTM1), followed by a Grammar for English (or whatever language we're analyzing), as we have already noted, plus a compressed form of the data, which is a sequence of 0s and 1s that allows the grammar to perfectly reconstruct the original data. It is a basic fact about information theory that if one has a probabilistic grammar, then the number of bits (0s and 1s) that it takes to perfectly reproduce the original data is exactly −log2 pr(data). We use that fact here: we set things up so that the third section of the information passed to the UTM is a sequence of 0s and 1s that perfectly describes the original data, given the Universal Grammar and the grammar of the language in question.
Figure 1: Input to the Turing machine (the tape contains UG1, then the Grammar, then the compressed Data).
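A toy calculation of the total length of such a tape, with purely hypothetical numbers (the figures below are mine, chosen only for illustration):

```python
# Total cost of the tape in Figure 1, in bits:
#   |UG| + |grammar| + (-log2 of the probability the grammar assigns to the data).
ug_bits = 50_000          # hypothetical length of UG(UTM1)
grammar_bits = 200_000    # hypothetical length of the grammar of English
data_bits = 3_000_000     # hypothetical -log2 pr(data | grammar)

total = ug_bits + grammar_bits + data_bits
print(f"total description length: {total:,} bits")   # 3,250,000 bits
```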
As we have already mentioned, there will be many different ways to build such a grammar-compiler; there will in general be a large number of such universal grammars, so notationally we'll have to index them; we'll refer to different Universal Grammars for a given UTM (let's say it is still UTM1) as UG1(UTM1), UG2(UTM1), and so on. These are different theories of grammar, and each one can be thought of
The first was a device that would produce a grammar, given a natural-language corpus. The second would not itself produce a grammar, but it would check to ensure that in some fashion or other the grammar was (or could be) properly and appropriately deduced from the data. The third was to develop a formal model that neither produced nor verified grammars, given data; rather, the device would take a set of observations and a set of two (or more) grammars, and determine which one was the more (or most) appropriate for the corpus. Chomsky suggests that the third, the weakest, is good enough, and he expresses doubt that either of the first two is feasible in practice.
Figure 2: Chomsky's three conceptions (a discovery device: data in, correct grammar of the corpus out; a verification device: data and a grammar in, "yes" or "no" out; an evaluation metric: data and two grammars in, "G1 is better" or "G2 is better" out).
Chomsky believed that we could and should account for grammar selection on the basis of the formal simplicity of the grammar, and that the specifics of how that simplicity should be defined was a matter to be decided by studying actual languages in detail. In the last stage of classical generative grammar, Chomsky went so far as to propose that the specifics of how grammar complexity should be defined is part of our genetic endowment. His argument against Views #1 and #2 was weak, so weak as to perhaps not merit being called an argument: what he wrote was that he thought that neither could be successfully accomplished, based in part on the fact that he had tried for several years, and in addition he felt hemmed in by the kind of grammatical theory that appeared to be necessary to give such perspectives a try. But regardless of whether it was a strong argument, it was convincing.17
17. Chomsky's view of scientific knowledge was deeply influenced by Nelson Goodman's view, a view rooted in a long braid of thought about the nature of science; without going back too far, we can trace the roots back to Ernst Mach, who emphasized the role of simplicity in science, and to the Vienna Circle, which began as a group of scholars interested in developing Mach's perspectives on knowledge and science. And all of these scholars viewed themselves, quite correctly, as trying to cope with the problem of induction as it was identified by David Hume in the eighteenth century: how can anyone be sure of a generalization (especially one with
Chomsky proposed the following methodology, in three steps. First, linguists should develop formal grammars for individual languages, and treat them as scientific theories, whose predictions could be tested against native speaker intuitions among other things. Eventually, in a fashion parallel to the way in which a theory of
[Diagram: an iterative grammar-improvement loop. A bootstrap device takes the Data and proposes a grammar G; an incremental change yields G′; an evaluation metric compares them and keeps the preferred grammar G∗; the loop repeats until no further improvement is found, at which point it halts with G∗.]
length (which we would minimize, because in some respects it is inverted with respect to probability).

Given data D, find g = arg max_{g∈G} p_g(D).
Given data D, find g = arg max_{g∈G} [ log p_g(D) − cost(g) ].

These are two very different goals! And a person could perfectly well want to work on both problems. Very important: the most important reason that we develop probabilistic models is to evaluate and compare different grammars. (It is not in
exist yet. We are not rolling dice.)

Note the importance, in the computational sphere, of the quantitative measurement of evidence, as opposed to a scientist's intuition: probability is the quantitative theory of evidence.
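A small sketch contrasting the two goals stated above, with hypothetical numbers of my own:

```python
# Each candidate grammar: (log2 of the probability it assigns the data, its cost in bits).
candidates = {
    "g_small": (-1_050_000,  10_000),
    "g_big":   (-1_000_000, 200_000),
}

mle_winner = max(candidates, key=lambda g: candidates[g][0])
mdl_winner = max(candidates, key=lambda g: candidates[g][0] - candidates[g][1])
print(mle_winner)   # g_big   : fits the data best, grammar cost ignored
print(mdl_winner)   # g_small : 190,000 extra bits of grammar buy only 50,000 bits of fit
```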
Figure 3: The pre-classical generative problem (several of my grammars and several of your grammars, confronted with an English corpus and a Swahili corpus).
Figure 4: Generative model with a data term (for each corpus, English and Arabic, several candidate grammars, each paired with the compressed form of the data under that grammar).
Figure 5: The importance of measuring the size of the UG (two copies of the arrangement in Figure 4: English and Arabic corpora, candidate grammars, and their compressed data).
The limits of conventionalism for UTMs
Join the club
We propose that the solution to the problem is to divide our effort up into four pieces: the selection of the best UTM, the selection of a universal grammar UG∗ among the candidate universal grammars proposed by linguists, the selection of the best grammar g for each corpus, and the compressed length (the plog) of that corpus, given that grammar g: see Figure 7.
Figure 6: Total model (English grammar-2 with English data, Igbo grammar-5 with Igbo data, Arabic grammar-2 with Arabic data).
We assume that the linguists who are engaged in the task of discovering the best UG will make progress on that challenge by competing to find the best UG and by cooperating to find the best common UTM. In this section, we will describe a method by which they can cooperate to find a best common UTM, which will allow one of them (at any given moment) to unequivocally have the best UG, and hence the best grammar for each of the data sets from the different languages.

The concern now, however, is this: we cannot use even an approximation of Kolmogorov complexity in order to help us choose the best UTM, because Kolmogorov complexity is itself defined only relative to a choice of UTM. We need to find a different rational solution to the problem of selecting a UTM that we can all agree on.

We will now imagine an almost perfect scientific linguistic world in which there is a competition among a certain number of groups, each of which has developed its own formal linguistic theory. The purpose of the community is to play
Figure 7: 3 competing UTMs out of k (each UTMi shown with translations UTMi → UTMj to the other machines).
Figure 8: The effect of using different UTMs (UG1 with G1 and Data1, and UG2 with G2 and Data2, each evaluated under UTMα and again under UTMβ).
a game by which the best general formal linguistic theory can be encouraged and identified. Who the winner is will probably change from year to year.

The annual winner of the competition will be the one whose total model length (given this year's UTM choice) is the smallest: the total model length is the size of the team's UG when coded for the year's UTM, plus the length of all of the grammars, plus the compressed length of all of the data, given those grammars. Of these terms, only the size of the UG will vary as we consider different UTMs. The winning overall team will have an influence, but only a minor influence, on which UTM is selected. We will return in a moment to a method for selecting the year's winning UTM; first, we will spell out a bit more of the details of the competition.

Let us say that there are N members (that is, N member groups). To be a member of this club, you must subscribe to the following (and let's suppose you're in group i). I will explain later how a person can propose a new Turing machine and get it approved. But at the beginning, let's just assume that there is a set of approved UTMs, and each group must adopt one. I will index different UTMs with superscript lower-case Greek letters, like UTMα: that is a particular approved universal Turing machine; the set of such machines that have already been approved is U. You will probably not be allowed to keep your UTM for the final competition, but you might. You have a weak preference for your own UTM, but you recognize that your preference is likely not going to be adopted by the group. The group will jointly try to find the UTM which shows the least bias with respect to the submissions of all of the member groups.
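A sketch of that scoring rule with made-up numbers (the group names and figures are mine, not the author's):

```python
def total_model_length(ug_bits, grammar_bits_per_language, data_bits_per_language):
    """Score for one group under the year's UTM: |UG| + sum of grammar lengths
    + sum of compressed-data lengths, all in bits."""
    return ug_bits + sum(grammar_bits_per_language) + sum(data_bits_per_language)

groups = {
    "group_1": total_model_length(40_000, [180_000, 210_000], [2_950_000, 3_050_000]),
    "group_2": total_model_length(90_000, [150_000, 190_000], [2_900_000, 3_000_000]),
}
print(min(groups, key=groups.get))   # the year's winner: the smallest total
```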
Figure 9: What Linguistic Group k wants to minimize (the length of UGk, plus, for each language from Swahili through Sierra Miwok, the length of its grammar and of its compressed data).
Each group may well be suspicious of its competitors' systems, because of the UTM that they use to find the minimum description length. Consider Group 1 and Group 2, which utilize UTMα and UTMβ respectively. It is perfectly possible (indeed, it is natural) to find that (see Figure 8)

|UG_i|_UTMα + Emp(UG_i, {Γl}1, C) < |UG_j|_UTMα + Emp(UG_j, {Γl}2, C)    (13)

and yet, for a value of β different from α,

|UG_i|_UTMβ + Emp(UG_i, {Γl}1, C) > |UG_j|_UTMβ + Emp(UG_j, {Γl}2, C)    (14)

This is because each group has a vested interest in developing a UTM which makes their Universal Grammar extremely small. This is just a twist, just a variant, on the problem described in the preceding section.

The lengths of a UG, of its grammars, and of the compressed data can be compared across different groups of researchers, because for these three things there is a common unit of measurement, the bit. This is not the case, however, for UTMs: we have no common currency with which to measure the length, in any meaningful sense, of a UTM. We need, therefore, a qualitatively different way to reach consensus on a UTM across a group of competitors, our research groups.
Which Turing machine? The least biased one.
With all of this bad news about the difficulty of choosing a universally accepted Universal Turing machine, how can we play this game?
Figure 10: Competing to be the UTM of the year (candidate machines UTM1, UTM2, and so on, each shown with the translations UTM1 → UTM2, UTM1 → UTM3, etc.).