  1. Data Intensive Linguistics — Lecture 1
     Introduction (I): Words and Probability
     Philipp Koehn, 9 January 2006

  2. 1 Welcome to DIL
     • Lecturer: Philipp Koehn
     • TA: Sebastian Riedel
     • Lectures: Mondays and Thursdays, 14:00, FH Room A9/11
     • Practical sessions: 4 extra sessions
     • Project (worth 30%) will be given out next week
     • Exam counts for 70% of the grade

  3. 2 Outline
     • Introduction: words, probability, information theory, n-grams and language modeling
     • Methods: tagging, finite state machines, statistical modeling, parsing, clustering
     • Applications: word sense disambiguation, information retrieval, text categorisation, summarisation, information extraction, question answering
     • Statistical machine translation

  4. 3 References
     • Manning and Schütze: "Foundations of Statistical Natural Language Processing", 1999, MIT Press, available online
     • Jurafsky and Martin: "Speech and Language Processing", 2000, Prentice Hall
     • also: research papers, other handouts

  5. 4 MSc Dissertation Topics
     • Lattice Decoding for Machine Translation
     • Word Alignment for Machine Translation
     • Exploiting Factored Translation Models
     • Discriminative Training for Machine Translation
     • Discontinuous phrases in Statistical Machine Translation
     • Learning inflectional paradigms using parallel corpora
     • Harvesting multi-lingual comparable corpora from the web
     • Syntax-Based Models for Statistical Machine Translation

  6. 5 What is Data Intensive Linguistics?
     • Data: work on corpora using statistical models or other machine learning methods
     • Intensive: fine by me
     • Linguistics: computational linguistics vs. natural language processing

  7. 6 Quotes
     "It must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." (Noam Chomsky, 1969)
     "Whenever I fire a linguist our system performance improves." (Frederick Jelinek, 1988)

  8. 7 Conflicts?
     • Scientist vs. engineer
     • Explaining language vs. building applications
     • Rationalist vs. empiricist
     • Insight vs. data analysis

  9. 8 Why is Language Hard?
     • Ambiguities on many levels
     • Rules, but many exceptions
     • No clear understanding of how humans process language
       → ignore humans, learn from data?

  10. 9 Language as Data
      A lot of text is now available in digital form:
      • billions of words of news text distributed by the LDC
      • billions of documents on the web (trillions of words?)
      • tens of thousands of sentences annotated with syntactic trees for a number of languages (around one million words for English)
      • tens to hundreds of millions of words translated between English and other languages

  11. 10 Word Counts
      One simple statistic: counting words in Mark Twain's Tom Sawyer:
      Word   Count
      the    3332
      and    2973
      a      1775
      to     1725
      of     1440
      was    1161
      it     1027
      in      906
      that    877
      (from Manning+Schütze, page 21)
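
A table like this is easy to reproduce in a few lines. The sketch below assumes a plain-text copy of the novel in a hypothetical file tom_sawyer.txt and a deliberately crude lowercase, letters-only tokenizer; these are illustrative assumptions, not the preprocessing behind the slide.

# Minimal word-count sketch. Assumes a plain-text file "tom_sawyer.txt"
# (hypothetical name) and a crude tokenizer: lowercase, letters only.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(tokens)

# Print the most frequent words, as in the slide's table.
for word, count in counts.most_common(9):
    print(f"{word}\t{count}")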

  12. 11 Counts of counts
      Count    Count of count
      1        3993
      2        1292
      3         664
      4         410
      5         243
      6         199
      7         172
      ...       ...
      10         91
      11-50     540
      51-100     99
      > 100     102
      • 3993 singletons (words that occur only once in the text)
      • Most words occur only a very few times.
      • Most of the text consists of a few hundred high-frequency words.
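
The counts-of-counts histogram follows directly from the word counts; a short sketch under the same illustrative assumptions as above:

# Counts of counts: how many word types occur exactly n times.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    counts = Counter(re.findall(r"[a-z]+", f.read().lower()))

count_of_counts = Counter(counts.values())
for n in sorted(count_of_counts):
    print(f"{n}\t{count_of_counts[n]}")

# Singletons (hapax legomena): word types that occur exactly once.
print("singletons:", count_of_counts[1])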

  13. 12 Zipf's Law
      Zipf's law: f × r = k
      Rank r   Word         Count f   f × r
      1        the          3332       3332
      2        and          2973       5944
      3        a            1775       5235
      10       he            877       8770
      20       but           410       8400
      30       be            294       8820
      100      two           104      10400
      1000     family          8       8000
      8000     applausive      1       8000
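
To see how nearly constant f × r stays, one can rank words by frequency and print the product at a few ranks; again a sketch under the same illustrative assumptions (corpus file and tokenizer are hypothetical):

# Rough check of Zipf's law (f * r roughly constant): rank words by
# frequency and print frequency * rank at a few ranks.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    counts = Counter(re.findall(r"[a-z]+", f.read().lower()))

ranked = counts.most_common()   # [(word, frequency), ...], most frequent first
for rank in (1, 2, 3, 10, 20, 30, 100, 1000):
    if rank <= len(ranked):
        word, freq = ranked[rank - 1]
        print(f"{rank}\t{word}\t{freq}\t{freq * rank}")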

  14. 13 Probabilities
      • Given word counts, we can estimate a probability distribution:
        P(w) = count(w) / Σ_{w′} count(w′)
      • This type of estimation is called maximum likelihood estimation. Why? We will get to that later.
      • Estimating probabilities based on frequencies is called the frequentist approach to probability.
      • This probability distribution answers the question: if we randomly pick a word out of a text, how likely is it to be word w?
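
In code, the maximum likelihood estimate is just a normalised count table; a minimal sketch under the same illustrative assumptions:

# Maximum likelihood estimate of the unigram distribution:
# P(w) = count(w) / sum of count(w') over all words w'.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    counts = Counter(re.findall(r"[a-z]+", f.read().lower()))

total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}

print(p["the"])          # relative frequency of "the"
print(sum(p.values()))   # sums to 1.0 (up to floating point error)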

  15. 14 A bit more formal
      • We introduced a random variable W.
      • We defined a probability distribution p that tells us how likely it is that the random variable W takes the value w:
        prob(W = w) = p(w)

  16. 15 Joint probabilities
      • Sometimes we want to deal with two random variables at the same time.
      • Example: words w1 and w2 that occur in sequence (a bigram). We model this with the distribution p(w1, w2).
      • If the occurrence of words in bigrams is independent, we can reduce this to p(w1, w2) = p(w1) p(w2). Intuitively, this is not the case for word bigrams.
      • We can estimate joint probabilities over two variables the same way we estimated the probability distribution over a single variable:
        p(w1, w2) = count(w1, w2) / Σ_{w1′, w2′} count(w1′, w2′)
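
Estimating the joint distribution over adjacent word pairs works the same way, with bigram counts in place of word counts; a sketch under the same illustrative assumptions:

# Maximum likelihood estimate of the bigram joint distribution:
# p(w1, w2) = count(w1, w2) / total number of bigrams.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

bigram_counts = Counter(zip(tokens, tokens[1:]))
total_bigrams = sum(bigram_counts.values())

p_joint = {pair: c / total_bigrams for pair, c in bigram_counts.items()}
print(p_joint.get(("of", "the"), 0.0))   # joint probability of the bigram "of the"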

  17. 16 Conditional probabilities
      • Another useful concept is the conditional probability p(w2|w1). It answers the question: given that the random variable W1 takes the value w1, what is the probability that the second random variable W2 takes the value w2?
      • Mathematically, we can define conditional probability as
        p(w2|w1) = p(w1, w2) / p(w1)
      • If W1 and W2 are independent: p(w2|w1) = p(w2)
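
With relative-frequency estimates, the definition p(w2|w1) = p(w1, w2) / p(w1) reduces to a ratio of counts; a sketch under the same illustrative assumptions:

# Conditional probability from counts: p(w2 | w1) = count(w1, w2) / count(w1),
# which is what p(w1, w2) / p(w1) reduces to under relative-frequency estimates.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

first_counts = Counter(tokens[:-1])              # counts of words in the w1 position
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_cond(w2, w1):
    """MLE of p(w2 | w1); 0.0 if w1 never occurs as a first word."""
    if first_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / first_counts[w1]

print(p_cond("the", "of"))   # how often "of" is followed by "the"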

  18. 17 Chain rule
      • A bit of math gives us the chain rule:
        p(w2|w1) = p(w1, w2) / p(w1)
        p(w1) p(w2|w1) = p(w1, w2)
      • What if we want to break down large joint probabilities like p(w1, w2, w3)? We can repeatedly apply the chain rule:
        p(w1, w2, w3) = p(w1) p(w2|w1) p(w3|w1, w2)
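
The same step extends to any number of words, which is what the n-gram language modeling mentioned in the outline builds on. The general n-word form is not written out on the slide; a LaTeX sketch of it:

% General chain rule for a sequence of n words (the i = 1 factor is
% conditioned on the empty context and is just p(w_1)).
\[
p(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})
\]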

  19. 18 Bayes rule
      • Finally, another important rule: Bayes rule
        p(x|y) = p(y|x) p(x) / p(y)
      • It can easily be derived from the chain rule:
        p(x, y) = p(x, y)
        p(x|y) p(y) = p(y|x) p(x)
        p(x|y) = p(y|x) p(x) / p(y)
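
As a sanity check, Bayes rule can be verified numerically on the bigram estimates from the sketches above; the corpus file, tokenizer, and word pair remain illustrative assumptions.

# Numerical check of Bayes rule on bigram counts:
# p(w1 | w2) should equal p(w2 | w1) * p(w1) / p(w2),
# where w1 is the first word of a bigram and w2 the second.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

first = Counter(tokens[:-1])                 # counts of bigram first words
second = Counter(tokens[1:])                 # counts of bigram second words
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens) - 1                          # number of bigrams

w1, w2 = "of", "the"                         # any pair that occurs in the text
p_w1 = first[w1] / n
p_w2 = second[w2] / n
p_w2_given_w1 = bigrams[(w1, w2)] / first[w1]

# Left-hand side and right-hand side of Bayes rule agree.
print(bigrams[(w1, w2)] / second[w2])        # p(w1 | w2) directly from counts
print(p_w2_given_w1 * p_w1 / p_w2)           # p(w2 | w1) p(w1) / p(w2)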
