

SLIDE 1

Taal- en spraaktechnologie

Sophia Katrenko

Utrecht University, the Netherlands

Sophia Katrenko Lecture 1

SLIDE 2

Outline

1. Today
2. Collocations in linguistics
3. Automatic collocation extraction

SLIDE 3

Today we discuss Chapter 5 of Manning and Schütze, "Foundations of Statistical Natural Language Processing", and more precisely look at how to use the machinery of probability theory for NLP research (e.g., mutual information for detecting collocations).

SLIDE 4

Terminology

What is a collocation in linguistics? A collocation is a recurrent, relatively fixed word combination (S. Bartsch, 2004). E.g., red herring, kick the bucket, dark night, dogs bark.

Firth (1951): "I propose to bring forward as a technical term, meaning by 'collocation', and to apply the test of 'collocability'."

Jespersen (1917): "Little and few are also incomplete negatives: note the frequent collocation with no: there is little or no danger."

SLIDE 5

Terminology

Firth: "Meaning by collocation is an abstraction at the syntagmatic level and is not directly concerned with the conceptual or idea approach to the meaning of words."

Firth (1957): "One of the meanings of night is its collocability with dark, and of dark, of course, its collocation with night."

SLIDE 6

Terminology

Choueka (1988): "A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components."

SLIDE 7

Terminology

Two main views on collocations:
FIRTH: collocations as lexical proximities in text
CHOUEKA: collocations as syntactic and semantic units, marked by semantic irregularity

SLIDE 8

Terminology

Halliday (1966):
- a collocation defines the membership of lexical sets; e.g., he studies the difference between strong and powerful using examples such as strong tea, strong car (non-acceptable), and strong or powerful argument.
- he argues that various grammatical configurations are possible, e.g., "he argued strongly against", "the strength of his argument", and others.

SLIDE 9

Terminology

Further, we distinguish between:
- strong collocators (e.g., blond is used to describe hair color) and weak collocators (e.g., the is used with various nouns)
- idioms (in the English literature) or phraseological units (in the German literature), e.g., black ingratitude
- and then many relations expressing:
  - synonyms (honest/fair)
  - antonyms (old/new)
  - homonyms (words that share the same spelling and pronunciation, but not origin, e.g., bank)

SLIDE 10

Terminology

Criteria for collocations:

NON-COMPOSITIONALITY The meaning of a collocation is not a straightforward composition of the meanings of its parts (especially in the case of idioms).
NON-SUBSTITUTABILITY It is not possible to substitute the components of a collocation by their near-synonyms (e.g., orange tape instead of red tape; gray elephant instead of white elephant).
NON-MODIFIABILITY Many collocations cannot be freely modified with additional lexical material or through grammatical transformations (e.g., kick the heavy bucket instead of kick the bucket).


SLIDE 13

Terminology

S. Evert notes, however, that:
- collocations range from completely fixed to syntactically flexible constructions;
- syntactic restrictions usually coincide with semantic restrictions and are thus indicators of the degree of lexicalization of a particular word combination;
- particular word combinations are associated with specific restrictions that cannot be inferred from standard rules of grammar and thus need to be stored together with the collocation.

SLIDE 14

Terminology

Collocations can be:
- word-level phenomena (e.g., Red Cross, fix und fertig)
- phrase-level phenomena ("collocation phrases": copula constructions, proverbs)

Collocation phrases either consist of the lexically determined words (collocates) only, or contain additional lexically underspecified material.

SLIDE 15

Terminology

STRUCTURAL DEPENDENCY The collocates of a collocation are syntactic dependents, so knowledge of syntactic structure is a precondition for accurate collocation identification.
SYNTACTIC CONTEXT may help to discriminate literal and collocational readings (think, e.g., of red tape).

SLIDE 16

Terminology

Collocation subclasses:
- verb-particle constructions (e.g., look up)
- light verbs: verbs like do or make in do a favor or make a decision
- proper nouns: considered collocations in computational linguistics (e.g., New York)
- terminological expressions: phrases in a particular domain, e.g., hydraulic oil filter

SLIDE 17

Terminology

How to recognize a collocation?
TRANSLATION TEST If we cannot translate a phrase word by word, then it is likely to be a collocation: make a decision, break a record.
ASSOCIATION A more general notion, which does not necessarily encompass grammatically bound elements: plane - airport.

SLIDE 18

Recent work

Computational work on extracting collocations automatically has been very popular:
- work by Stefan Evert, incl. his PhD dissertation "The Statistics of Word Cooccurrences: Word Pairs and Collocations" (2004) (check http://www.collocations.de/EK/);
- multilingual collocation extraction using syntax, as in Seretan, Violeta (2011). Syntax-Based Collocation Extraction. Springer.

Why is it still challenging?
- collocations are not always adjacent
- collocations pose a problem for machine translation

SLIDE 19

Procedure

According to S. Evert, the extraction of collocations involves the following steps:
1. Use an application-dependent notion of collocation, as opposed to Firth/Choueka.
2. Extract collocations from relational data (relational n-grams).
3. Consider grammatically homogeneous data (e.g., Adj-N, PP-V).
4. Use recurrence/cooccurrence frequency as the main criterion for collocation extraction (statistical approaches).

SLIDE 20

Procedure

Depending on what type of collocations we want to extract (e.g., of the type subj-verb), we may need the following preprocessing steps:
- tokenization (orthographic words)
- PoS-tagging
- morphological analysis / lemmatization
- partial parsing (or full parsing)

SLIDE 21

Procedure

Candidate extraction strategies:
STRATEGY 1 Retrieval of n-grams from word forms only.
STRATEGY 2 Retrieval of n-grams from part-of-speech annotated word forms.
STRATEGY 3 Retrieval of n-grams from word forms with particular parts of speech, at particular positions in syntactic structure.

SLIDE 22

Procedure

Candidate extraction strategies, pros and cons:
- the strategy choice depends on the language and collocation type
- retrieval of PP-verb collocations (e.g., zur Verfuegung stellen) from word forms only is clearly inappropriate
- blunt use of stop-word lists leads to the loss of collocation-relevant information, as access to prepositions and determiners may be crucial for successful extraction

SLIDE 23

Methods

The main approaches to collocation extraction are based on:
- data filtering using PoS information
- using mean/variance information
- hypothesis testing
- mutual information

SLIDE 24

FREQUENCY AND FILTERING

SLIDE 25

Methods

A frequency-based method:
1. select a span (context window) size (e.g., n)
2. collect all n-grams from corpora
3. sort the n-grams according to their frequency
4. take the most frequent n-grams as collocations
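These steps can be sketched in a few lines of Python (a toy illustration with made-up tokens, not code from the lecture):

```python
from collections import Counter

def frequent_ngrams(tokens, n=2, top=5):
    """Count all n-grams over a sliding window and return the most frequent ones."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(top)

tokens = "the quick fox and the lazy dog saw the quick fox".split()
top = frequent_ngrams(tokens, n=2, top=2)
# ('the', 'quick') and ('quick', 'fox') each occur twice
```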

SLIDE 26

Methods

A frequency-based method is easy: select your favourite context window size and start counting, but . . . frequency alone will not produce desirable results (why?). A PoS filter can be used to discard uninteresting patterns (Justeson and Katz, 1995b).
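A minimal sketch of such a filter, assuming pre-tagged bigrams and a simplified tag set (the tags and patterns here are illustrative; Justeson and Katz define the full pattern list over real PoS tags):

```python
# Justeson & Katz-style PoS filter: keep only bigram tag patterns that tend
# to form phrases (adjective-noun, noun-noun). The tagged input is hand-made
# for illustration; in practice it would come from a PoS tagger.
GOOD_PATTERNS = {("A", "N"), ("N", "N")}

def pos_filter(tagged_bigrams):
    return [words for words, tags in tagged_bigrams if tags in GOOD_PATTERNS]

bigrams = [
    (("New", "York"), ("N", "N")),
    (("strong", "tea"), ("A", "N")),
    (("of", "the"), ("P", "DET")),   # function-word bigram, to be discarded
]
print(pos_filter(bigrams))
```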

SLIDE 27

Frequency

SLIDE 28

Zipf's law

Zipf's law is named after G. K. Zipf (1902-1950): ". . . the observation that the frequency of occurrence of some event (A), as a function of the rank (i), when the rank is determined by the above frequency of occurrence, is a power-law function f(i) ≈ 1/i^α, with the exponent α close to unity (1)."
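The law is easy to check numerically: for ideal Zipf frequencies with α = 1, the product rank × frequency is constant (a synthetic illustration, not corpus data):

```python
# Generate an ideal Zipfian frequency profile f(i) = f1 / i**alpha and verify
# the back-of-the-envelope test: with alpha = 1, rank * frequency is constant.
def zipf_frequencies(n_ranks, alpha=1.0, f1=1000.0):
    return [f1 / (i ** alpha) for i in range(1, n_ranks + 1)]

freqs = zipf_frequencies(10)
products = [rank * f for rank, f in enumerate(freqs, start=1)]
# every rank-frequency product equals f1 = 1000 when alpha = 1
```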

SLIDE 29

Zipf's law

Examples of Zipf's law:
POPULATION: there are a few very populous cities, while numerous cities have a small population.
ECONOMICS: the income or revenue of a company is a function of its rank.
LANGUAGE: English word frequencies follow a power-law distribution. The most common words tend to be short and appear often.

SLIDE 30

Zipf's law

Source: E. Glaeser. A Tale of Many Cities. http://economix.blogs.nytimes.com/2010/04/20/a-tale-of-many-cities/

SLIDE 31

Zipf's law

(Zipf plots omitted; sources: Emre Sevinc; Le Quan Ha et al., 2002)

SLIDE 32

Back to collocation extraction

Preliminary conclusions:
- frequency alone is not sufficient
- PoS filtering is already helpful
- how do we extract collocates which are not adjacent?

SLIDE 33

Back to collocation extraction

Other observations:
- the definition of the span size is crucial!
- if the span size is kept small, it is unlikely to properly cover non-adjacent collocates of structurally flexible collocations.
- enlarging the span size leads to an increase in candidate collocations, including an increase in noisy data which needs to be discarded in a further processing step.

SLIDE 34

COLLOCATION EXTRACTION USING MEAN/VARIANCE (PROBABILITY THEORY: REMINDER)

SLIDE 35

Probability theory: reminder

Probability theory studies how likely it is that an event will happen. For instance:
1. how likely is it that it will be raining tomorrow, provided it has been sunny last week?
2. how likely is it that a tossed coin will come up heads (is it a fair or a biased coin)?
3. how likely is it to encounter the sequence 'green apple' vs. 'apple green'? What about 'blue apple'?

Probability theory is widely used for modeling and prediction.

SLIDE 36

Probability theory: reminder

1. An outcome of an event is observed via an experiment (trial).
2. The basic outcomes form the sample space Ω.
3. Sample spaces can be either discrete (countable) or continuous (uncountable).
4. An event A is a subset of Ω, A ⊂ Ω.
5. We distinguish between a certain event (Ω) and an impossible event (∅).

SLIDE 37

Probability theory: reminder

1. The likelihood that an event will occur is called its probability (an impossible event has a probability of 0, and a certain event a probability of 1).
2. The probability of A is denoted by P(A).
3. A discrete probability function is any function P : F → [0, 1] such that P(Ω) = 1 and, for disjoint sets A_i ∈ F, P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

SLIDE 38

Probability theory: reminder

A random variable is a function X : Ω → Rⁿ. A discrete random variable is X : Ω → S, where S ⊂ Rⁿ is countable. X is an indicator random variable (Bernoulli trial) if X : Ω → {0, 1}. The probability mass function (pmf) of X assigns a probability P(X = x_i) to each possible value x_i of the variable. If X is distributed according to some pmf p(x), we write X ∼ p(x). A pmf should satisfy: p(x_i) ≥ 0 and Σ_{i=1}^n p(x_i) = 1.

SLIDE 39

Probability theory: reminder

An example:
1. What is the sample space of a fair coin tossed twice?
2. What is the probability of a basic outcome here?
3. How likely is it to get at least one head? What is the event A here?


SLIDE 42

Probability theory: reminder

An example:
1. What is the sample space of a fair coin tossed twice? (answer: Ω = {TT, HH, HT, TH})
2. What is the probability of a basic outcome here? (answer: 1/4)
3. How likely is it to get at least one head? What is the event A here? (answer: A = {HH, HT, TH})
4. The probability of A is therefore P(A) = |A|/|Ω| = 3/4. (1)
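The answers can be verified by enumerating the sample space (a small Python check of the slide's arithmetic):

```python
from itertools import product

# Enumerate the sample space of two fair coin tosses and compute P(A),
# where A = "at least one head", as on the slide.
omega = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
A = [outcome for outcome in omega if "H" in outcome]
p_A = len(A) / len(omega)
# p_A == 0.75, i.e. 3/4
```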

SLIDE 43

Probability theory: reminder

The expectation (or expected value, or mean) is the average of a random variable: E(X) = Σ_x x p(x). (2)

The variance of a random variable shows how consistent its values are over time: Var(X) = E((X − E(X))²) = E(X²) − (E(X))². (3)

The standard deviation for a sample is defined as s = √( Σ_x (x − x̄)² / (n − 1) ).

SLIDE 44

Probability theory: reminder

Example: rolling a die. Let K be the number that we get. Then
E(K) = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 7/2 = 3.5 (4)
The variance of K is computed as
Var(K) = (1/6)(1 − 7/2)² + (1/6)(2 − 7/2)² + (1/6)(3 − 7/2)² + (1/6)(4 − 7/2)² + (1/6)(5 − 7/2)² + (1/6)(6 − 7/2)² ≈ 2.92
Note that the expectation of a random variable can be a value it never takes!

SLIDE 45

Mean/variance

Mean and variance are good indicators when a collocation can span several words (Smadja, 1993), as in she knocked on the wooden door. How to use them?

Corpus:
(1) she knocked on his door
(2) they knocked at the door
(3) 100 women knocked on Donaldson's door
(4) a man knocked on the metal front door

The mean offset between knock and door equals (1/4)(3 + 3 + 5 + 5) = 4, and
σ = √( (1/3)((3 − 4)² + (3 − 4)² + (5 − 4)² + (5 − 4)²) ) ≈ 1.15 (5)
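A quick check of the slide's arithmetic, using the four offsets 3, 3, 5, 5 directly:

```python
from math import sqrt

# Mean and sample standard deviation of the offsets between 'knock' and
# 'door' in the four corpus lines from the slide.
offsets = [3, 3, 5, 5]
mean = sum(offsets) / len(offsets)
sd = sqrt(sum((x - mean) ** 2 for x in offsets) / (len(offsets) - 1))
# mean == 4.0, sd ≈ 1.15
```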

SLIDE 46

Mean/variance

Now, what does σ = 0 mean? Compare the two plots (omitted here). What do the mean and deviation tell us in this case? What happens if the mean is around 1.0 and σ is low?

SLIDE 47

Analysis

Problem: low variance and high frequency can be accidental. The basic assumptions for collocation extraction using statistical information are:
- The collocates of a collocation co-occur more frequently within text than arbitrary word combinations (recurrence).
- Stricter control of co-occurrence data leads to more meaningful results in collocation extraction.

SLIDE 48

HYPOTHESIS TESTING

SLIDE 49

Hypothesis testing

Frequency counts for collocation candidates (e.g., all bigrams, i.e., consecutive pairs of words) are stored in a contingency table.

SLIDE 50

Hypothesis testing

But what happens if we do not put any restrictions on collocations, except for their length (2)? The number of resulting n-grams is very large; in the worst case it is linear in the corpus size. So how do we use the contingency matrix to get meaningful collocations?

SLIDE 51

Hypothesis testing

Answer: we can rank co-occurrence data by "association scores"! For two words (or types) v and u, we:
- measure the statistical association between u and v
- assume that true collocations should obtain high scores
- use association measures (AMs)
- obtain an N-best list, i.e., a listing of the N highest-ranked collocation candidates

SLIDE 52

Hypothesis testing

The idea of hypothesis testing reflects the intuition that collocates occur together more often than by chance. The following terminology is used:
- Null hypothesis H0: there is no association between the collocates.
- If the probability p of observing the data under H0 is too low (smaller than some significance level), then we reject H0; otherwise we retain H0.

What is the probability for two words w1 and w2 to occur together by chance? (Recall independent events from the last lecture.)
P(w1 w2) = P(w1) P(w2) (6)

SLIDE 53

Significance

- significant = probably true (not due to chance).
- significance levels show how likely it is that a result is due to chance.
- the most common level, used to mean something is good enough to be believed, is .95; this means that the finding has a 95% chance of being true.
- sometimes it is written α = .05, meaning that the finding has a five percent (.05) chance of not being true, which is the converse of a 95% chance of being true.

SLIDE 54

t-test

The t-test operates on the following information:
- a sample of observations with their mean and variance
- H0: the sample is drawn from a distribution with mean µ
- it checks the difference between the observed and estimated mean, and tells us how likely it is to get a sample with that mean, assuming the sample is drawn from a normal distribution with mean µ

SLIDE 55

t-test

For a sample of size N, with mean x̄ and sample variance s², the t-statistic is calculated as follows:
t = (x̄ − µ) / √(s²/N) (7)

SLIDE 56

t-test

Example:
P(companies) = 4675/14307668
P(new) = 15828/14307668
H0: P(new companies) = P(companies) P(new) ≈ 3.615 × 10⁻⁷
x̄ = 8/14307668 ≈ 5.591 × 10⁻⁷
N = 14307668
s² = p(1 − p) ≈ p

SLIDE 57

t-test

Example:
t = (x̄ − µ) / √(s²/N) ≈ (5.591 × 10⁻⁷ − 3.615 × 10⁻⁷) / √(5.591 × 10⁻⁷ / 14307668) (8)
t ≈ 0.999932 < 2.576 (the critical value at α = 0.005), so H0 is not rejected.
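The computation can be reproduced directly from the counts on the previous slide (a sketch; the slide approximates the Bernoulli variance p(1 − p) by p, here we keep the exact form):

```python
from math import sqrt

# t-statistic for the bigram 'new companies', using maximum likelihood
# estimates of the bigram and unigram probabilities.
N = 14307668
p_new = 15828 / N
p_companies = 4675 / N
mu = p_new * p_companies      # expected bigram probability under H0
x_bar = 8 / N                 # observed bigram probability
s2 = x_bar * (1 - x_bar)      # Bernoulli variance, approximately x_bar
t = (x_bar - mu) / sqrt(s2 / N)
# t is approximately 1.0, below the critical value 2.576, so H0 stands
```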

SLIDE 58

t-test: Results

The t-test applied to 10 bigrams that occur with frequency 20.

SLIDE 59

Pearson's chi-square test

- the t-test assumes a normal distribution, which does not necessarily hold
- χ² is an alternative: rather than assuming normality, it looks at the difference between expected and observed frequencies:
X² = Σ_{i,j} (O_{ij} − E_{ij})² / E_{ij} (9)

SLIDE 60

Pearson's chi-square test

The O_{ij} are found in the contingency matrix; the E_{ij} are expected frequencies, computed from the marginal probabilities, e.g. for new companies:
E_11 = ((8 + 4667)/N) × ((8 + 15820)/N) × N ≈ 5.2 (10)
Explanation: if new and companies occurred completely independently of each other, we would expect 5.2 occurrences of new companies on average in a text the size of our corpus. Using the X² formula, X² ≈ 1.55. At α = 0.05 the critical value is χ² = 3.841, so H0 is not rejected.
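The slide's numbers can be reproduced from the 2×2 contingency table (the cell counts below are derived from the unigram and bigram counts given earlier; the table itself is not shown in these notes):

```python
# Pearson's X^2 for 'new companies', from C(new) = 15828, C(companies) = 4675,
# C(new companies) = 8, corpus size N = 14307668.
N = 14307668
O = [[8, 4667],                       # companies: with new / without new
     [15820, N - 8 - 4667 - 15820]]   # no companies: with new / without new
row = [sum(r) for r in O]
col = [O[0][0] + O[1][0], O[0][1] + O[1][1]]
e11 = row[0] * col[0] / N             # expected count for 'new companies'
x2 = sum((O[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
         for i in range(2) for j in range(2))
# e11 is roughly 5.2 and x2 roughly 1.55, matching the slide
```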

SLIDE 61

Pointwise MI

Pointwise mutual information (Church et al., 1991) has also been used for collocation discovery. Fano (1961): "The amount of information provided by the occurrence of the event represented by [y′] about the occurrence of the event represented by [x′] is defined as"
I(x′, y′) = log [ P(x′, y′) / (P(x′) P(y′)) ] = log [ P(x′|y′) / P(x′) ] (11)

SLIDE 62

Pointwise MI

Example: (Ayatollah, Ruhollah)
freq(Ayatollah) = 42
freq(Ruhollah) = 20
freq(Ayatollah, Ruhollah) = 20
I(Ayatollah, Ruhollah) ≈ 18.38 (12)
So: "the amount of information we have about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i + 1."
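A sketch of the computation; the corpus size N = 14307668 is reused from the t-test example (an assumption: Manning and Schütze's examples come from the same corpus):

```python
from math import log2

# Pointwise mutual information for (Ayatollah, Ruhollah), in bits.
N = 14307668
p_xy = 20 / N
p_x = 42 / N
p_y = 20 / N
pmi = log2(p_xy / (p_x * p_y))
# pmi is roughly 18.38 bits, as on the slide
```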

SLIDE 63

Mutual Information: Results

Mutual information applied to 10 bigrams with frequency 20.

SLIDE 64

Mutual Information: Problems

There are the following problems with using mutual information for collocation extraction:
- a larger decrease in uncertainty does not always reveal good collocation candidates
- mutual information estimates are sensitive to data sparseness

SLIDE 65

Mutual Information: Problem 1

Consider the following example (from the aligned Hansard corpus): which word, chambre or communes, should be used to translate house?

SLIDE 66

Mutual Information: Problem 1

The MI is calculated as follows:
log [ P(house|chambre) / P(house) ] = log [ (31950/(31950 + 4793)) / P(house) ] = log [ 0.87 / P(house) ]
log [ P(house|communes) / P(house) ] = log [ (4974/(4974 + 441)) / P(house) ] = log [ 0.92 / P(house) ]
Consequently, log [ P(house|chambre) / P(house) ] < log [ P(house|communes) / P(house) ].

SLIDE 67

Mutual Information: Conclusions

Mutual information is:
- a good measure of independence: for independent words, I(x, y) = log P(x, y)/(P(x)P(y)) = log P(x)P(y)/(P(x)P(y)) = 0
- a bad measure of dependence: the frequency of the individual words influences the final score, so bigrams composed of low-frequency words receive a higher score than bigrams composed of high-frequency words.
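The low-frequency bias is easy to demonstrate: two word pairs that always co-occur get very different PMI scores depending on how frequent the words are (synthetic counts, for illustration only):

```python
from math import log2

def pmi(f_xy, f_x, f_y, N):
    """Pointwise mutual information from raw counts, in bits."""
    return log2((f_xy / N) / ((f_x / N) * (f_y / N)))

N = 1_000_000
rare = pmi(2, 2, 2, N)               # both words occur twice, always together
frequent = pmi(1000, 1000, 1000, N)  # both occur 1000 times, always together
# rare > frequent: the rare pair gets the higher score even though the
# evidence for it (2 co-occurrences) is much weaker
```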

SLIDE 68

Software

If you want to try it yourself, check the Ngram Statistics Package (NSP) by Ted Pedersen: http://www.d.umn.edu/~tpederse/nsp.html

SLIDE 69

Recent work

Wehrli et al. ("Collocations in a rule-based MT system: A case study evaluation of their translation adequacy", 2009) have conducted experiments with a large-scale translation system (the ITS-2 system), which:
- supports English, German, Italian and Spanish to French, French-German, and French-English translations
- includes monolingual and bilingual lexicons with marked collocations and idioms
They also studied two other translation systems, Google and Systran.

SLIDE 70

Recent work

Data: verb-object collocations

SLIDE 71

Recent work

Precision: the test set is split into 3 disjoint subsets, according to the distance between the items of a collocation instance: low (distance = 1, 2), medium (distance = 3, 4) and high (distance > 4).

SLIDE 72

To summarize (1)

Today, we have looked at:
- collocation definitions
- how to extract collocations from text:
  - a frequency-based, filtering approach
  - a method using mean/variance
  - methods based on statistical tests

SLIDE 74

To summarize (2)

ToDo: read at home (if you haven't done so yet) Chapter 5 from Manning and Schütze.