

SLIDE 1

Statistics and the Scientific Study of Language

What do they have to do with each other? Mark Johnson

Brown University

ESSLLI 2005

Outline

◮ Why Statistics?
◮ Learning probabilistic context-free grammars
◮ Factoring learning into simpler components
◮ The Janus-faced nature of computational linguistics
◮ Conclusion

Statistical revolution in computational linguistics

◮ Speech recognition
◮ Syntactic parsing
◮ Machine translation

[Figure: syntactic parse accuracy by year, 1994–2006 (vertical axis roughly 0.84–0.92)]

Statistical models in computational linguistics

◮ Supervised learning: structure to be learned is visible
  ◮ speech transcripts, treebank, proposition bank, translation pairs
  ◮ more information than available to a child
  ◮ annotation requires (linguistic) knowledge
  ◮ a more practical method of making information available to a computer than writing a grammar by hand
◮ Unsupervised learning: structure to be learned is hidden
  ◮ alien radio, alien TV
SLIDE 2

Chomsky’s “Three Questions”

◮ What constitutes knowledge of language?
  ◮ grammar (universal, language specific)
◮ How is knowledge of language acquired?
  ◮ language acquisition
◮ How is knowledge of language put to use?
  ◮ psycholinguistics

(last two questions are about inference)

The centrality of inference

◮ “poverty of the stimulus”

  ⇒ innate knowledge of language (universal grammar)
  ⇒ intricate grammar with rich deductive structure


◮ Statistics is the theory of optimal inference in the presence of uncertainty
◮ We can define probability distributions over structured objects
  ⇒ no inherent contradiction between statistical inference and linguistic structure
◮ probabilistic models are declarative
◮ probabilistic models can be systematically combined

P(X, Y) = P(X) P(Y | X)

Questions that statistical models might answer

◮ What information is required to learn language?
◮ How useful are different kinds of information to language learners?
  ◮ Bayesian inference can utilize prior knowledge
  ◮ Prior can encode “soft” markedness preferences and “hard” universal constraints
◮ Are there synergies between different information sources?
  ◮ Does knowledge of phonology or morphology make word segmentation easier?
◮ May provide hints about human language acquisition
SLIDE 3

Outline

◮ Why Statistics?
◮ Learning probabilistic context-free grammars
◮ Factoring learning into simpler components
◮ The Janus-faced nature of computational linguistics
◮ Conclusion

Probabilistic Context-Free Grammars

1.0  S → NP VP
1.0  VP → V
0.75 NP → George
0.25 NP → Al
0.6  V → barks
0.4  V → snores

P( [S [NP George] [VP [V barks]]] ) = 1.0 · 0.75 · 1.0 · 0.6 = 0.45
P( [S [NP Al] [VP [V snores]]] ) = 1.0 · 0.25 · 1.0 · 0.4 = 0.1

Estimating PCFGs from visible data

Training trees:
  [S [NP rice] [VP grows]]   [S [NP rice] [VP grows]]   [S [NP corn] [VP grows]]

Rule         Count   Rel Freq
S → NP VP      3       1
NP → rice      2       2/3
NP → corn      1       1/3
VP → grows     3       1

Relative frequency is the maximum likelihood estimator (it selects the rule probabilities that maximize the probability of the trees).

P( [S [NP rice] [VP grows]] ) = 2/3
P( [S [NP corn] [VP grows]] ) = 1/3

Estimating PCFGs from hidden data

◮ Training data consists of strings w alone
◮ Maximum likelihood selects rule probabilities that maximize the marginal probability of the strings w
◮ Expectation maximization is a way of building hidden-data estimators out of visible-data estimators
  ◮ parse trees of iteration i are training data for rule probabilities at iteration i + 1
◮ Each iteration is guaranteed not to decrease P(w) (but can get trapped in local maxima of the likelihood)
◮ This can be done without enumerating the parses
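The following is a minimal EM sketch under a simplifying assumption: each training string is represented by an explicitly enumerated set of candidate parses, with each parse reduced to the list of rules it uses. As the slide notes, real implementations avoid this enumeration and compute expected rule counts with the inside-outside algorithm; the function name and data layout here are invented for illustration.

```python
from collections import defaultdict

def em_step(sentences, probs):
    """One EM iteration: expected rule counts, then relative-frequency re-estimation.
    `sentences` is a list of sentences; each sentence is a list of candidate parses;
    each parse is a list of (lhs, rhs) rules. `probs` maps rules to probabilities."""
    expected = defaultdict(float)
    for parses in sentences:                          # E-step
        parse_probs = []
        for parse in parses:
            p = 1.0
            for r in parse:
                p *= probs[r]
            parse_probs.append(p)
        z = sum(parse_probs)                          # marginal probability of the string
        if z == 0.0:
            continue
        for parse, p in zip(parses, parse_probs):
            for r in parse:
                expected[r] += p / z                  # fractional (expected) rule counts
    lhs_totals = defaultdict(float)                   # M-step: relative frequency
    for (lhs, _), c in expected.items():
        lhs_totals[lhs] += c
    return {r: c / lhs_totals[r[0]] for r, c in expected.items()}
```

Iterating `em_step` is the update described above: the expected counts play the role of the visible-data counts, and each iteration cannot decrease the marginal probability of the strings.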
SLIDE 4

Example: The EM algorithm with a toy PCFG

Initial rule probs

rule            prob
· · ·           · · ·
VP → V          0.2
VP → V NP       0.2
VP → NP V       0.2
VP → V NP NP    0.2
VP → NP NP V    0.2
· · ·           · · ·
Det → the       0.1
N → the         0.1
V → the         0.1

“English” input:
  the dog bites
  the dog bites a man
  a man gives the dog a bone
  · · ·

“pseudo-Japanese” input:
  the dog bites
  the dog a man bites
  a man the dog a bone gives
  · · ·

Probability of “English”

[Figure: geometric average sentence probability of the “English” corpus (log scale, 10−6 to 1) over EM iterations 1–5]

Rule probabilities from “English”

[Figure: rule probabilities (0–1) over EM iterations 1–5 for VP → V NP, VP → NP V, VP → V NP NP, VP → NP NP V, Det → the, N → the, V → the, trained on the “English” corpus]

Probability of “Japanese”

[Figure: geometric average sentence probability of the “pseudo-Japanese” corpus (log scale, 10−6 to 1) over EM iterations 1–5]

SLIDE 5

Rule probabilities from “Japanese”

[Figure: rule probabilities (0–1) over EM iterations 1–5 for the same rules, trained on the “pseudo-Japanese” corpus]

Learning in statistical paradigm

◮ The likelihood is a differentiable function of rule probabilities
  ⇒ learning can involve small, incremental updates
◮ Learning structure (rules) is hard, but . . .
◮ Parameter estimation can approximate rule learning
  ◮ start with a “superset” grammar
  ◮ estimate rule probabilities
  ◮ discard low-probability rules
◮ Parameters can be associated with other things besides rules (e.g., HeadInitial, HeadFinal)

Applying EM to real data

◮ ATIS treebank consists of 1,300 hand-constructed parse trees
◮ ignore the words (in this experiment)
◮ about 1,000 PCFG rules are needed to build these trees

Example sentence from the treebank (preterminals shown; the slide displays its full parse tree):
Show/VB me/PRP all/PDT the/DT nonstop/JJ flights/NNS from/IN Dallas/NNP to/TO Denver/NNP early/JJ in/IN the/DT morning/NN ./.

Experiments with EM

1. Extract productions from the trees and estimate their probabilities to produce a PCFG.
2. Initialize EM with the treebank grammar and MLE probabilities.
3. Apply EM (to strings alone) to re-estimate production probabilities.
4. At each iteration:
   ◮ Measure the likelihood of the training data and the quality of the parses produced by each grammar.
   ◮ Test on training data (so poor performance is not due to overlearning).
SLIDE 6

Log likelihood of training strings

[Figure: log probability of the training strings (roughly −16,000 to −14,000) over 20 EM iterations]

Quality of ML parses

[Figure: precision and recall of the maximum-likelihood parses (parse accuracy, 0.7–1.0) over 20 EM iterations]
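The parse-accuracy curves are presumably computed with something like labeled-bracket precision and recall. Here is a small, assumption-laden sketch in which a parse is reduced to a multiset of (label, start, end) brackets; how brackets are extracted from trees is omitted.

```python
from collections import Counter

def precision_recall(gold_brackets, test_brackets):
    """Labeled-bracket precision and recall between a gold and a test parse."""
    gold, test = Counter(gold_brackets), Counter(test_brackets)
    matched = sum((gold & test).values())        # brackets present in both parses
    return matched / sum(test.values()), matched / sum(gold.values())

gold = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)]
test = [("S", 0, 3), ("NP", 0, 2), ("VP", 1, 3)]
print(precision_recall(gold, test))              # (0.666..., 0.666...)
```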

Why does it work so poorly?

◮ Wrong data: grammar is a transduction between form and meaning
  ⇒ learn from form/meaning pairs
  ◮ exactly what contextual information is available to a language learner?
◮ Wrong model: PCFGs are poor models of syntax
◮ Wrong objective function: maximum likelihood makes the sentences as likely as possible, but syntax isn’t intended to predict sentences (Klein and Manning)
  ◮ How can information about the marginal distribution of strings P(w) provide information about the conditional distribution of parses t given strings, P(t|w)?
  ◮ need additional linking assumptions about the relationship between parses and strings
◮ . . . but no one really knows!

Outline

◮ Why Statistics?
◮ Learning probabilistic context-free grammars
◮ Factoring learning into simpler components
◮ The Janus-faced nature of computational linguistics
◮ Conclusion

SLIDE 7

Factoring the language learning problem

◮ Factor the language learning problem into linguistically simpler components
◮ Focus on components that might be less dependent on context and semantics (e.g., word segmentation, phonology)
◮ Identify relevant information sources (including prior knowledge, e.g., UG) by comparing models
◮ Combine components to produce more ambitious learners
◮ PCFG-like grammars are a natural way to formulate many of these components

Joint work with Sharon Goldwater and Tom Griffiths

Word Segmentation

Data = t h e d o g b a r k s

(Utterance (Word t h e) (Utterance (Word d o g) (Utterance (Word b a r k s))))

Utterance → Word Utterance
Utterance → Word
Word → w,   w ∈ Σ⋆

◮ Algorithms for word segmentation from this information already exist (e.g., Elman, Brent)

◮ Likely that children perform some word segmentation before they know the meanings of words
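As a concrete illustration of what this grammar buys us, a Viterbi-style dynamic program finds the most probable segmentation under a unigram word model, i.e. the Utterance → Word Utterance | Word grammar above with given probabilities for Word → w. This is only a sketch, not Elman's or Brent's algorithm, and the word probabilities are invented.

```python
def segment(chars, word_prob):
    """Most probable split of `chars` into words under a unigram word model
    (Viterbi over split points)."""
    n = len(chars)
    best = [(1.0, [])] + [(0.0, None)] * n      # best[i] = (prob, words) for chars[:i]
    for i in range(1, n + 1):
        for j in range(i):
            w = chars[j:i]
            p = best[j][0] * word_prob.get(w, 0.0)
            if p > best[i][0]:
                best[i] = (p, best[j][1] + [w])
    return best[n]

print(segment("thedogbarks", {"the": 0.5, "dog": 0.3, "barks": 0.2}))
# (0.03..., ['the', 'dog', 'barks'])
```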

Concatenative morphology

Data = t a l k i n g

(Verb (Stem t a l k) (Suffix i n g))

Verb → Stem Suffix
Stem → w,   w ∈ Σ⋆
Suffix → w,   w ∈ Σ⋆

◮ Morphological alternation provides primary evidence for phonological generalizations (“trucks” /s/ vs. “cars” /z/)
◮ Morphemes may also provide clues for word segmentation
◮ Algorithms for doing this already exist (e.g., Goldsmith)

PCFG components can be integrated

(Utterance (Words_N (N (Stem_N d o g) (Suffix_N s))
                    (Words_V (V (Stem_V b a r k) (Suffix_V)))))

Utterance → Words_S,   S ∈ 𝒮
Words_S → S Words_T,   T ∈ 𝒮
S → Stem_S Suffix_S
Stem_S → t,   t ∈ Σ⋆
Suffix_S → f,   f ∈ Σ⋆

SLIDE 8

Problems with maximum likelihood estimation

◮ Maximum likelihood picks the model that best fits the data
◮ Saturated models exactly mimic the training data
  ⇒ highest likelihood
◮ Need a different estimation framework

Saturated analyses of the earlier examples:
(Utterance (Word t h e d o g b a r k s))
(Verb (Stem t a l k i n g) (Suffix))

Bayesian estimation

P(Hypothesis | Data)  ∝  P(Data | Hypothesis)  ×  P(Hypothesis)
     Posterior               Likelihood              Prior
◮ Priors can be sensitive to linguistic structure (e.g., a word should contain a vowel)
◮ Priors can encode linguistic universals and markedness preferences (e.g., complex clusters appear at word onsets)
◮ Priors can prefer sparse solutions
◮ The choice of the prior is as much a linguistic issue as the design of the grammar!
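A toy illustration of the posterior ∝ likelihood × prior recipe with a structure-sensitive prior (here, a penalty for vowel-less words). The hypotheses, likelihoods, and penalty are invented for the example.

```python
def prior(words, penalty=0.1):
    """Toy structural prior: penalize analyses containing vowel-less words."""
    p = 1.0
    for w in words:
        p *= 1.0 if any(c in "aeiou" for c in w) else penalty
    return p

def posterior(hypotheses, likelihood):
    """Normalize likelihood * prior over a finite set of hypotheses."""
    scores = {h: likelihood[h] * prior(h) for h in hypotheses}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

hyps = [("the", "dog"), ("th", "edog")]
lik = {hyps[0]: 0.4, hyps[1]: 0.6}       # invented likelihoods
print(posterior(hyps, lik))              # the vowel prior shifts mass to ("the", "dog")
```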

Morphological segmentation experiment

◮ Trained on orthographic verbs from the U. Penn. Wall Street Journal treebank
◮ Dirichlet prior prefers sparse solutions (sparser solutions as α → 0)
◮ Gibbs sampler used to sample from the posterior distribution over parses
  ◮ reanalyses each word based on a grammar estimated from the parses of the other words

Posterior samples from WSJ verb tokens

α = 0.1       α = 10−5      α = 10−10     α = 10−15
expect        expect        expect        expect
expects       expects       expects       expects
expected      expected      expected      expected
expecting     expect ing    expect ing    expect ing
include       include       include       include
includes      includes      includ es     includ es
included      included      includ ed     includ ed
including     including     including     including
add           add           add           add
adds          adds          adds          add s
added         added         add ed        added
adding        adding        add ing       add ing
continue      continue      continue      continue
continues     continues     continue s    continue s
continued     continued     continu ed    continu ed
continuing    continuing    continu ing   continu ing
report        report        report        report

SLIDE 9

Log posterior of models on token data

[Figure: log posterior log Pα as a function of the Dirichlet prior parameter α (from 1 down to 10−20) for the sampled posterior, the true-suffix analysis, and the null-suffix analysis; log Pα ranges from about −1.2 × 10^6 to −8 × 10^5]

◮ The correct solution is nowhere near as likely as the posterior
  ⇒ model is wrong!

Independence assumption in PCFG model

(Verb (Stem t a l k) (Suffix i n g))

P(Word) = P(Stem) P(Suffix)

◮ Model expects the relative frequency of each suffix to be the same for all stems

Relative frequencies of inflected verb forms

Types and tokens

◮ A word type is a distinct word shape
◮ A word token is an occurrence of a word

Data = “the cat chased the other cat”
Tokens = “the” 2, “cat” 2, “chased” 1, “other” 1
Types = “the” 1, “cat” 1, “chased” 1, “other” 1

◮ Using word types instead of word tokens effectively normalizes for frequency variations
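A one-screen sketch of the type/token distinction on the example above (standard Python, nothing slide-specific):

```python
from collections import Counter

tokens = "the cat chased the other cat".split()
token_counts = Counter(tokens)     # Counter({'the': 2, 'cat': 2, 'chased': 1, 'other': 1})
types = set(tokens)                # each distinct word shape counted once
print(len(tokens), "tokens,", len(types), "types")   # 6 tokens, 4 types
```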

SLIDE 10

Posterior samples from WSJ verb types

α = 0.1       α = 10−5      α = 10−10     α = 10−15
expect        expect        expect        exp ect
expects       expect s      expect s      exp ects
expected      expect ed     expect ed     exp ected
expect ing    expect ing    expect ing    exp ecting
include       includ e      includ e      includ e
include s     includ es     includ es     includ es
included      includ ed     includ ed     includ ed
including     includ ing    includ ing    includ ing
add           add           add           add
adds          add s         add s         add s
add ed        add ed        add ed        add ed
adding        add ing       add ing       add ing
continue      continu e     continu e     continu e
continue s    continu es    continu es    continu es
continu ed    continu ed    continu ed    continu ed
continuing    continu ing   continu ing   continu ing
report        report        repo rt       rep ort

Summary so far

◮ Unsupervised learning is difficult on real data!
◮ There’s a lot to learn from simple problems
  ◮ need models that require all stems in the same class to have the same suffixes but permit suffix frequencies to vary with the stem
◮ Related problems arise in other linguistic domains as well
  ◮ Many verbs share the same subcategorization frames, but subcategorization frame frequencies depend on the head verb.
◮ Hopefully we can combine these simple learners to study their interaction in more complex domains

Outline

◮ Why Statistics?
◮ Learning probabilistic context-free grammars
◮ Factoring learning into simpler components
◮ The Janus-faced nature of computational linguistics
◮ Conclusion

Psalter Mappa Mundi (1225?)

SLIDE 11

Portolan chart circa 1424

Portolan chart circa 1424 (center)

Waldseemüller 1507, after Ptolemy

Battista Agnese portolan chart circa 1550

SLIDE 12

Mercator 1569

... back to computational linguistics

◮ Be wary of analogies from the history of science!
  ◮ we only remember the successes
  ◮ May wind up learning something very different from what you hoped
◮ Cartography and geography benefited from both the academic and Portolan traditions
◮ Geography turned out to be about brute empirical facts
  ◮ but geology and plate tectonics
◮ Mathematics (geometry and trigonometry) turned out to be essential
◮ Even wrong ideas can be very important
  ◮ the cosmographic tradition survives in celestial navigation

Outline

◮ Why Statistics?
◮ Learning probabilistic context-free grammars
◮ Factoring learning into simpler components
◮ The Janus-faced nature of computational linguistics
◮ Conclusion

Conclusion

◮ Statistical methods have both engineering and scientific applications
◮ Inference plays a central role in linguistic theory
◮ Uncertain information ⇒ statistical inference
◮ The statistical component of a model may require as much linguistic insight as the structural component
◮ Factoring the learning problem into linguistically simpler pieces may be a good way to proceed

◮ Who knows what the future will bring?
SLIDE 13

Thank you

“I ask you to look both ways. For the road to a knowledge of the stars leads through the atom; and important knowledge of the atom has been reached through the stars.” — Sir Arthur Eddington

“Everything should be made as simple as possible, but not one bit simpler.” — Albert Einstein

“Something unknown is doing we don’t know what.” — Sir Arthur Eddington

“You can observe a lot just by watching.” — Yogi Berra

Log posterior of models on type data

[Figure: log posterior log Pα as a function of the Dirichlet prior parameter α (from 1 down to 10−20) for the optimal-suffix, true-suffix, and null-suffix analyses; log Pα ranges from about −4 × 10^5 to −2 × 10^5]

◮ Correct solution is close to optimal for α = 10−3

Morpheme frequencies provide useful information

Yarowsky and Wicentowski (2000) “Minimally supervised Morphological Analysis by Multimodal Alignment”

Suffix frequency depends on stem

(Word (V t a l k (Suffix_{V,talk} i n g)))

Word → S,   S ∈ 𝒮
S → t Suffix_{S,t},   t ∈ Σ⋆
Suffix_{S,t} → f,   f ∈ Σ⋆

◮ Suffix distributions Suffix_{S,t} → f depend on the stem t
◮ Prior constrains the suffix distributions Suffix_{S,t} → f for stems t in the same class to be similar
◮ Model is saturated and not context-free
SLIDE 14

Dirichlet priors and sparse solutions

◮ The expansions of a nonterminal in a PCFG are distributed according to a multinomial
◮ Dirichlet priors are a standard prior over multinomial distributions

P(p₁, . . . , pₙ) ∝ ∏_{i=1}^{n} pᵢ^{α−1},   α > 0

[Figure: Dirichlet density Pα(p) over the binomial probability p for α = 2.0, 1.0, 0.5, 0.1; as α shrinks the density concentrates near p = 0 and p = 1]
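A quick numerical check of the sparsity claim (a sketch assuming NumPy is available): samples from a symmetric Dirichlet concentrate their mass on fewer outcomes as α shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (2.0, 1.0, 0.5, 0.1):
    sample = rng.dirichlet([alpha] * 5)     # one draw from a symmetric Dirichlet
    print(alpha, np.round(sample, 3))       # small alpha: most mass on a few outcomes
```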

Estimation procedures

◮ Dirichlet prior prefers sparse solutions ⇒ MAP grammar may be undefined even though MAP parses are defined
◮ Markov chain Monte Carlo techniques can sample from the posterior distribution over grammars and parses
◮ Gibbs sampling:
  1. Construct a corpus of (word, tree) pairs by randomly assigning trees to each word in the data
  2. Repeat:
     2.1 Pick a word w and its tree from the corpus at random
     2.2 Estimate a grammar from the trees assigned to the other words in the corpus
     2.3 Parse w with this grammar, producing a distribution over trees
     2.4 Select a tree t from this distribution, and add (w, t) to the corpus
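A minimal sketch of this sampler for the stem + suffix case, where a “tree” is just a split point. It follows the four steps above but is not the implementation used in the experiments; the additive smoothing constant stands in for the Dirichlet parameter and is invented for illustration.

```python
import random
from collections import Counter

def gibbs(words, iters=2000, alpha=0.001):
    splits = {w: random.randint(0, len(w)) for w in words}          # 1. random "trees"
    for _ in range(iters):                                          # 2. repeat
        w = random.choice(words)                                    # 2.1 pick a word
        stems = Counter(v[:splits[v]] for v in words if v != w)     # 2.2 "grammar" from
        suffixes = Counter(v[splits[v]:] for v in words if v != w)  #     the other words
        scores = [(stems[w[:k]] + alpha) * (suffixes[w[k:]] + alpha)
                  for k in range(len(w) + 1)]                       # 2.3 "parse" w
        total = sum(scores)
        splits[w] = random.choices(range(len(w) + 1),
                                   [s / total for s in scores])[0]  # 2.4 sample a tree
    return {w: (w[:k], w[k:]) for w, k in splits.items()}

print(gibbs(["walking", "talking", "walked", "talked", "jumping", "jumped"]))
```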

Outline

◮ Why Statistics?
◮ Learning probabilistic context-free grammars
◮ Factoring learning into simpler components
◮ The Janus-faced nature of computational linguistics
◮ Conclusion

Maximum likelihood estimation from visible data

[Diagram: the correct parses of the training-data sentences shown as a region inside all possible parses of all possible sentences]

◮ Standard maximum likelihood estimation makes the treebank trees t and strings w as likely as possible relative to all other possible trees and strings

ĝ = arg max_g Pg(w, t) = arg max_g Pg(t | w) Pg(w)

SLIDE 15

Maximum likelihood estimation from hidden data

[Diagram: the correct parses of the training sentences, inside all possible parses of the training sentences, inside all possible parses of all possible sentences]

◮ Maximum likelihood estimation maximizes the probability of the words w of the training data, relative to all other possible word strings

ĝ = arg max_g Pg(w) = arg max_g Σ_t Pg(t, w)

Conditional MLE from visible data

[Diagram: correct parses of the training sentences, all possible parses of the training sentences, and all possible parses of all possible sentences, as nested regions]

◮ Conditional MLE maximizes the conditional probability Pg(t | w) of the training trees t relative to the training words w
◮ learns nothing from the distribution Pg(w) of words

Language as a transduction from form to meaning

[Diagram: “language” maps semantic representations S (the domain of cognition / A.I.) to phonological representations W]

◮ Grammar generates a phonological form w from a semantic representation s

P(w, s) = Pg(w | s) Pc(s)      (Pg: “language”,  Pc: “cognition”)

◮ Interpretation is finding the most likely meaning s⋆

s⋆(w) = arg max_{s ∈ S} P(s | w) = arg max_{s ∈ S} Pg(w | s) Pc(s)
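A toy rendering of this interpretation rule: pick the meaning s that maximizes Pg(w | s) · Pc(s). Both distributions below are invented placeholders, not part of the slides.

```python
P_c = {"DOG-BARKS": 0.7, "CAT-BARKS": 0.3}              # "cognition": prior over meanings
P_g = {("the dog barks", "DOG-BARKS"): 0.9,             # "language": P(form | meaning)
       ("the dog barks", "CAT-BARKS"): 0.2}

def interpret(w):
    """Return the meaning s maximizing Pg(w|s) * Pc(s)."""
    return max(P_c, key=lambda s: P_g.get((w, s), 0.0) * P_c[s])

print(interpret("the dog barks"))   # DOG-BARKS
```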
SLIDE 16

Maximum likelihood estimate g from visible data

[Diagram: a training pair (w, s), with w ∈ W and s ∈ S]

◮ Training data consists of phonology/semantics pairs (w, s)
◮ Maximum likelihood estimate of the grammar g makes (w, s) as likely as possible relative to all other possible pairs (w′, s′), w′ ∈ W, s′ ∈ S

ĝ = arg max_g P(w, s) = arg max_g Pg(w | s)

MLE g from hidden data

[Diagram: a phonological string w ∈ W with its hidden meaning s ∈ S]

◮ Training data consists of phonological strings w alone
◮ MLE makes w as likely as possible relative to other strings

ĝ = arg max_g Pg(w) = arg max_g Σ_{s ∈ S} Pg(w | s) Pc(s)

⇒ It may be possible to learn g from strings alone

◮ The cognitive model Pc can in principle be learnt the same way