MDL and the complexity of natural language


  1. MDL and the complexity of natural language John Goldsmith University of Chicago/CNRS MoDyCo January 2007

  2. Thanks • Carl de Marcken, Partha Niyogi, Antonio Galves, Jesus Garcia, Yu Hu…

  3. The word segmentation problem. Input: noprincípioeraaquelequeéapalavra → [language-independent device] → Output: no princípio era aquele que é a palavra

  4. Naïve model of language. There exists an alphabet A = {a…z}, and a finite lexicon W ⊂ A*, where A* is the set of all strings of elements of A. There exists a (potentially unbounded) set of sentences of a language, L ⊂ W*. An utterance is a string of sentences, that is, an element of L*.

  5. Picture of the naïve view. [Diagram: the Alphabet A sits inside A*, the set of all strings of letters in the alphabet; the Lexicon L is a subset of A*; the Sentences sit inside L*, the set of all strings of words in the lexicon.]

  6. “Naïve” view? The naïve view is still interesting – even if it is a great simplification. We can ask: if we embed the naïve view inside an MDL framework, do the results resemble known words (in English, Italian, etc.)? What if we apply it to DNA or protein sequences?

  7. Word segmentation. Work by Michael Brent and by Carl de Marcken in the mid-1990s at MIT. A lexicon is a pair of objects (L, p_L): a set L ⊂ A*, and a probability distribution p_L defined on A* for which L is the support of p_L. We call the elements of L the words. • We insist that A ⊂ L: all individual letters are words. • We define a language as a subset of L*; its members are sentences. • Each sentence can be uniquely associated with an utterance (an element of A*) by a mapping F.
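
A minimal sketch of such a lexicon in Python (the function name, alphabet, and counts below are illustrative assumptions, not from the talk):

```python
# A lexicon is a pair (L, p_L): a finite set of words L ⊂ A* together
# with a probability distribution p_L on A* whose support is exactly L.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")

def make_lexicon(word_counts):
    """Build (L, p_L) from token counts, enforcing A ⊂ L."""
    counts = dict(word_counts)
    for letter in ALPHABET:           # every single letter must be a word
        counts.setdefault(letter, 1)  # minimal count for unseen letters
    z = sum(counts.values())          # total number of word tokens
    p_L = {w: c / z for w, c in counts.items()}
    return set(counts), p_L

words, p_L = make_lexicon({"th": 50, "the": 30})
```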

  8. [Diagram: the mapping F: L* → A* sends a sentence such as 'in principio era il verbo' (a string of words in L*, carrying the measure p_L) to the utterance 'inprincipioerailverbo' (a string of letters in A*).]

  9. [Diagram, continued: F maps the sentence S = 'in principio e r a il ver bo' in L* to U = 'inprincipioerailverbo' in A*.] If F(S) = U, then we say that S is a parse of U.

  10. The probability of a sentence is $p(s) = \lambda(|s|)\,\prod_i \mathrm{pr}(s[i])$: a factor depending on the sentence's length in words, times the product of the probabilities of its words. We pull back the measure from the space of letters to the space of words. [Diagram: the parse 'in principio e r a il ver bo' in L* maps to 'inprincipioerailverbo' in A*.]

  11. Different lexicons lead to different probabilities of the data. Given an utterance U, $p(U \mid L) = \max_{q \in \mathrm{parses}(U)} p_L(q)$: the probability of a string of letters is the probability assigned to its best parse.
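
Finding that best parse is naturally done by dynamic programming, since the best parse of a prefix ends in some word whose remainder is itself optimally parsed. A sketch under the assumptions above (the cap of 10 letters per word is mine, purely for speed):

```python
import math

def best_parse(u, p_L):
    """Viterbi-style DP: best[i] is the best log probability of any
    parse of the prefix u[:i]; back[i] is where its last word starts."""
    n = len(u)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # assumed cap: words <= 10 letters
            w = u[j:i]
            if w in p_L and best[j] + math.log(p_L[w]) > best[i]:
                best[i] = best[j] + math.log(p_L[w])
                back[i] = j
    words, i = [], n
    while i > 0:                # walk the backpointers to read off the parse
        words.append(u[back[i]:i])
        i = back[i]
    return best[n], words[::-1]
```

Because all individual letters are in the lexicon, every utterance has at least one parse.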

  12. Class of models originally studied in the word segmentation problem [eventually we will come to regret the limitations of this class…]: • Our data is a finite string ("corpus") over a finite alphabet; • We find the best parse for the string; • The probability of the parse is the product of the probabilities of its words; • The words are assigned maximum likelihood probabilities of the simplest sort.
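
The last step, maximum likelihood estimation "of the simplest sort", is just relative frequency over the parsed corpus. A sketch (the helper name is mine):

```python
from collections import Counter

def mle_probabilities(parsed_corpus):
    """p(w) = [w] / Z: each word's token count over the total
    number of word tokens in the best parses."""
    counts = Counter(w for parse in parsed_corpus for w in parse)
    z = sum(counts.values())
    return {w: c / z for w, c in counts.items()}

# e.g. mle_probabilities([["in", "principio"], ["era", "il", "verbo"]])
```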

  13. A little example, to fix ideas. How do these two multigram models of English compare, and why is Number 2 better? Lexicon 1: {a, b, …, h, …, s, t, u, …, z}. Lexicon 2: {a, b, …, h, …, s, t, th, u, …, z}.

  14. A bit of notation. [t] = count of t; [h] = count of h; [th] = count of th; Z = total number of word tokens, $Z = \sum_{l \in \mathrm{lexicon}} [l]$. Log probability of the corpus: $\sum_{m \in \mathrm{lexicon}} [m] \log \frac{[m]}{Z}$.
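
That formula is the maximum-likelihood log probability of the corpus under a unigram lexicon; in code, a direct transcription (the count table is made up):

```python
import math

def corpus_log_prob(counts):
    """Sum over lexicon words m of [m] * log([m] / Z)."""
    z = sum(counts.values())
    return sum(c * math.log(c / z) for c in counts.values() if c > 0)

# e.g. corpus_log_prob({"t": 900, "h": 500, "e": 1200})
```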

  15. Log probability of the corpus under each lexicon. Lexicon 1, where all letters are separate: $[t]_1 \log\frac{[t]_1}{Z_1} + [h]_1 \log\frac{[h]_1}{Z_1} + \sum_{m \neq t,h} [m] \log\frac{[m]}{Z_1}$. Lexicon 2, where th is treated as a separate chunk: $[t]_2 \log\frac{[t]_2}{Z_2} + [h]_2 \log\frac{[h]_2}{Z_2} + \sum_{m \neq t,h} [m] \log\frac{[m]}{Z_2} + [th]_2 \log\frac{[th]_2}{Z_2}$, where $[t]_2 = [t]_1 - [th]_2$, $[h]_2 = [h]_1 - [th]_2$, and $Z_2 = Z_1 - [th]_2$.
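
The two columns can be computed directly: lexicon 2's counts are derived from lexicon 1's by routing the [th] tokens through the new chunk. A self-contained sketch (the toy counts are made up):

```python
import math

def log_prob(counts):
    """Corpus log probability: sum of [m] * log([m] / Z)."""
    z = sum(counts.values())
    return sum(c * math.log(c / z) for c in counts.values() if c > 0)

# Lexicon 1: every letter separate (toy counts).
counts1 = {"t": 900, "h": 500, "e": 1200, "a": 800}
th = 400  # assumed number of t-h adjacencies reanalyzed as the chunk 'th'

# Lexicon 2: [t]2 = [t]1 - [th], [h]2 = [h]1 - [th], hence Z2 = Z1 - [th].
counts2 = dict(counts1, t=counts1["t"] - th, h=counts1["h"] - th, th=th)

gain = log_prob(counts2) - log_prob(counts1)  # > 0 iff Lexicon 2 is better
```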

  16. Define $\Delta_f = \log \frac{f_2}{f_1}$ (the subscripts 1 and 2 referring to the two lexicons); then $\Delta \log \mathrm{pr}(C) = -Z_1 \Delta_Z + [t]_1 \Delta_t + [h]_1 \Delta_h + [th]_2 \log \frac{\mathrm{pr}_2(th)}{\mathrm{pr}_2(t)\,\mathrm{pr}_2(h)}$. This is positive if Lexicon 2 (th treated as a separate chunk) is better than Lexicon 1 (all letters separate).

  17. Effect of having fewer "words" altogether: the term $-Z_1 \Delta_Z$ in the expression above. Since $Z_2 < Z_1$, we have $\Delta_Z < 0$, so this term is positive.

  18. Effect of the frequency of /t/ and /h/ decreasing: the terms $[t]_1 \Delta_t + [h]_1 \Delta_h$. Since $[t]_2 < [t]_1$ and $[h]_2 < [h]_1$, both terms are negative.

  19. Effect of /th/ being treated as a unit rather than as separate pieces: the term $[th]_2 \log \frac{\mathrm{pr}_2(th)}{\mathrm{pr}_2(t)\,\mathrm{pr}_2(h)}$, which is positive exactly when t and h co-occur more often than their separate probabilities would predict.
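
The closed form of slides 16 through 19 can be checked against the direct computation; a self-contained sketch using the same toy counts as before:

```python
import math

def log_prob_gain(counts1, th):
    """Delta log pr(C) = -Z1*Delta_Z + [t]1*Delta_t + [h]1*Delta_h
                         + [th]2 * log(pr2(th) / (pr2(t) * pr2(h))),
    with Delta_f = log(f2 / f1). Positive iff the chunked lexicon
    assigns the corpus a higher probability."""
    z1 = sum(counts1.values())
    t1, h1 = counts1["t"], counts1["h"]
    t2, h2, z2 = t1 - th, h1 - th, z1 - th
    pr2 = lambda count: count / z2
    return (-z1 * math.log(z2 / z1)
            + t1 * math.log(t2 / t1)
            + h1 * math.log(h2 / h1)
            + th * math.log(pr2(th) / (pr2(t2) * pr2(h2))))

# Agrees with log_prob(counts2) - log_prob(counts1) from the earlier sketch:
# log_prob_gain({"t": 900, "h": 500, "e": 1200, "a": 800}, th=400)
```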

  20. Description Length. We also need to account for the increase in the length of the lexicon, which is our model of the data. Adding "th" to the lexicon costs $\log \frac{Z_2}{[t]_2} + \log \frac{Z_2}{[h]_2} = -\log(\mathrm{pr}_2(t)\,\mathrm{pr}_2(h))$ in description length (the new entry is spelled out of t and h). Subtracting this cost from the gain gives $-Z_1 \Delta_Z + [t]_1 \Delta_t + [h]_1 \Delta_h + [th]_2 \log \frac{\mathrm{pr}_2(th)}{\mathrm{pr}_2(t)\,\mathrm{pr}_2(h)} + \log(\mathrm{pr}_2(t)\,\mathrm{pr}_2(h))$. This is the generic form of the MDL criterion for adding a new word to the lexicon.
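
Putting the pieces together, a self-contained sketch of this criterion (toy counts again; in a real learner the candidate count [th] would come from the current best parses):

```python
import math

def log_prob(counts):
    z = sum(counts.values())
    return sum(c * math.log(c / z) for c in counts.values() if c > 0)

def mdl_gain(counts1, th):
    """Gain in log prob of the data from chunking 'th', minus the
    description-length cost of the new entry, -log(pr2(t) * pr2(h)).
    Add 'th' to the lexicon iff the result is positive."""
    counts2 = dict(counts1, t=counts1["t"] - th, h=counts1["h"] - th, th=th)
    z2 = sum(counts2.values())
    cost = -math.log((counts2["t"] / z2) * (counts2["h"] / z2))
    return log_prob(counts2) - log_prob(counts1) - cost

# mdl_gain({"t": 900, "h": 500, "e": 1200, "a": 800}, th=400)
```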

  21. Results • The Fulton County Grand Ju ry s aid Friday an investi gation of At l anta 's recent prim ary e lection produc ed no e videnc e that any ir regul ar it i e s took place . • Thejury further s aid in term - end present ment s thatthe City Ex ecutive Commit t e e ,which had over - all charg e ofthe e lection , d e serv e s the pra is e and than k softhe City of At l anta forthe man ner in whichthe e lection was conduc ted. Some chunks are too big; some chunks are too small.

  22. Start with: BREVES INSTRUCÇÕES AOS CORRESPONDENTES DA ACADEMIA DAS SCIENCIAS DE LISBOA 1781 [Brief instructions to the correspondents of the Academy of Sciences of Lisbon, 1781] As relações, por mais exactas e completas que sejão, nunca chegão a dar-nos huma idéa tão perfeita das coisas, como a sua mesma presença: por esta causa se tem occupado os Sabios, particularmente neste seculo, em ajuntar com a protecção dos Principes os exemplares de varios individuos das diversas especies de Animaes, Vegetaes e Mineraes, que se encontrão em differentes paizes, para apresentarem do modo possivel á vista dos curiosos hum como compendio das principaes maravilhas da Natureza.—

  23. Remove spaces • Asrelações,pormaisexactasecompletasquesejão,nuncachegãoadar-noshumaidéatãoperfeitadascoisas,comoasuamesmapresença:porestacausasetemoccupadoosSabios,particularmentenesteseculo,emajuntarcomaprotecçãodosPrincipesosexemplaresdevariosindividuosdasdiversasespeciesdeAnimaes,VegetaeseMineraes,queseencontrãoemdifferentespaizes,paraapresentaremdomodopossivelávistadoscuriososhumcomocompendiodasprincipaesmaravilhasdaNatureza.—

  24. The segmentation found: • As relações ,pormais exacta—se complet—as que sejão , nunca che—gão a da—r-nos humaidéa tão perfeita das coisas, como asu—a mes—ma-presenç—a : por esta caus—a setem occupa—do os S—abios, particula—r—mente neste seculo , em ajuntar coma prote—cção dos Principes os exemplaresde varios individuos dasdivers—asespeciesde An—imaes, Vege—ta—e—se Min—eraes,que se encontr—ãoem differentes paizes ,para apresenta—rem do modopossivel á vista dos curios-os hum como compendi—o das principa—es maravilhas da Natureza.

  25. What do we conclude? • From the point of view of linguistics, this does not teach us anything about language (at least, not directly). • From the point of view of statistical learning, this does not teach us anything about statistical learning procedures.

  26. What do we conclude? What is most interesting about the results is that the linguist sees the errors committed by the system (by comparison with standard spelling, for example) as the result of specifying a class of models that gives the method no way to capture the structure that linguistics has found in language.

  27. We return to this… …in a moment. First, an observation about the behavior of MDL in this process so far.

  28. Usage of MDL? If the description length of data D, given model M, is equal to the negative log probability assigned to D by M plus the compressed length of M, then the process of word-learning is unambiguously one of increasing the probability of the data, with the length of M used as a stopping criterion.
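
In symbols, this is the standard two-part description length (my notation, consistent with the slide's wording):

```latex
\mathrm{DL}(D, M) \;=\; \underbrace{-\log p(D \mid M)}_{\text{length of } D \text{ given } M}
\;+\; \underbrace{\ell(M)}_{\text{compressed length of } M}
```

Word-learning lowers the first term; the second term decides when adding further words stops paying for itself.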
