Probabilistic Content Models
Marc Schulder, Saarland University
Presenting: Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization, Barzilay & Lee (2004)
Probabilistic Content Models
Aim: Model topical structures of a text
Means: Hidden Markov Models, Language Bigrams, Clustering
Tasks: Sentence Ordering, Extractive Summarization
Reminder: Hidden Markov Model
States: y1 y2 y3
Observations: x1 x2 x3
Transition: from one state to the next
Emission: from a state to its observation
Reminder: Hidden Markov Model
States: N V N
Observations: romanes eunt domus
Reminder: Hidden Markov Model
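States: $ N V N ($ marks the start)
Observations: romanes eunt domus
Transitions: P(N|$), P(V|N), P(N|V)
Emissions: P(romanes|N), P(eunt|V), P(domus|N)
P(N|$) * P(romanes|N) * P(V|N) * P(eunt|V) * P(N|V) * P(domus|N) = P(romanes eunt domus, N V N | $)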
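The product on this slide can be sketched in a few lines of Python: for a fixed state sequence, multiply one transition and one emission probability per position. The probability values below are made up for illustration and are not taken from the talk; `path_probability` is a hypothetical helper name.

```python
import math

# Hypothetical probabilities, chosen only for illustration.
transition = {("$", "N"): 0.6, ("N", "V"): 0.5, ("V", "N"): 0.7}
emission = {("N", "romanes"): 0.01, ("V", "eunt"): 0.02, ("N", "domus"): 0.03}

def path_probability(states, words, start="$"):
    """Multiply one transition and one emission probability per position."""
    prob = 1.0
    prev = start
    for state, word in zip(states, words):
        prob *= transition[(prev, state)] * emission[(state, word)]
        prev = state
    return prob

p = path_probability(["N", "V", "N"], ["romanes", "eunt", "domus"])
```

With real models the product quickly becomes very small, so implementations usually sum log-probabilities instead.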
HMM as Content Model
State: topic (y1 y2 y3 become t1 t2 t3)
Observation: sentence (x1 x2 x3 become s1 s2 s3)
Transition: from one topic to the next
Emission: a topic generating its sentence
Sentences as Bigram Word Sequences
P(romanes eunt domus) = P(romanes|$) * P(eunt|romanes) * P(domus|eunt)
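A minimal sketch of the bigram factorisation above, assuming plain relative-frequency estimates from a tiny made-up corpus. The function names and the corpus are hypothetical, and no smoothing is applied, so any unseen bigram yields probability 0; a practical model would smooth the counts.

```python
from collections import Counter

def train_bigrams(sentences, start="$"):
    """Count bigrams and their left-hand words, with $ as sentence start."""
    bigrams, histories = Counter(), Counter()
    for sentence in sentences:
        tokens = [start] + sentence.split()
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, histories

def sentence_probability(sentence, bigrams, histories, start="$"):
    """P(w1|$) * P(w2|w1) * ... using unsmoothed relative frequencies."""
    tokens = [start] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if histories[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / histories[prev]
    return prob

corpus = ["romanes eunt domus", "romani ite domum"]  # toy training data
bigrams, histories = train_bigrams(corpus)
p_seen = sentence_probability("romanes eunt domus", bigrams, histories)
p_unseen = sentence_probability("romanes ite domum", bigrams, histories)
```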
Topics as Sentence Clusters
A topic is defined by its content.
Group similar sentences together to form topics.
But: what does "similar" mean?
Here: using the same words.
Topics as Sentence Clusters
Step 1: Make text generic
Replace proper names, numbers and dates with placeholders.
Before: The U.S. Geological Survey said the June earthquake was centered 121 miles west-northwest of Bengkulu on Sumatra island, at a depth of 14 miles.
After: The NAME said the DATE earthquake was centered NUM miles west-northwest of NAME on NAME island, at a depth of NUM miles.
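The placeholder step could be approximated as follows. This is only a crude heuristic sketch, not the preprocessing used in the paper: month names become DATE, digit strings become NUM, and runs of capitalized tokens that do not start the sentence become a single NAME. The function `make_generic` and its rules are assumptions; a real system would more likely use a named-entity recognizer.

```python
import re

MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def make_generic(sentence):
    """Replace dates, numbers and (heuristically) proper names by placeholders."""
    tokens = sentence.split()
    out = []
    for i, token in enumerate(tokens):
        # Set trailing punctuation aside; strip a full stop only from the
        # last token, so abbreviations such as "U.S." stay in one piece.
        strip_chars = ",;:." if i == len(tokens) - 1 else ",;:"
        core = token.rstrip(strip_chars)
        tail = token[len(core):]
        if core in MONTHS:
            label = "DATE"
        elif NUMBER.fullmatch(core):
            label = "NUM"
        elif i > 0 and core[:1].isupper():
            label = "NAME"
        else:
            out.append(token)  # ordinary word, kept unchanged
            continue
        if label == "NAME" and out and out[-1] == "NAME":
            out[-1] = "NAME" + tail  # merge multi-word names into one NAME
        else:
            out.append(label + tail)
    return " ".join(out)

original = ("The U.S. Geological Survey said the June earthquake was centered "
            "121 miles west-northwest of Bengkulu on Sumatra island, "
            "at a depth of 14 miles.")
generic = make_generic(original)
```

The heuristic never treats the first word of a sentence as a name, which is a deliberate simplification.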
Topics as Sentence Clusters
Step 2: Group similar sentences together
Sentence similarity = cosine of bigram vectors
Example sentences:
- The NAME seismological institute said the temblor’s epicenter was located NUM kilometers (NUM miles) south of the capital.
- The temblor was centered NUM kilometers (NUM miles) northwest of the provincial capital of NAME, about NUM kilometers (NUM miles) southwest of NAME, a bureau seismologist said.
- The NAME said the DATE earthquake was centered NUM miles west-northwest of NAME on NAME island, at a depth of NUM miles.
- Seismologists in NAME’s NAME said the temblor’s epicenter was about NUM kilometers (NUM miles) north of the provincial capital NAME.
- NAME of NAME's NAME said the quake which was felt in some cities on NAME did not have the potential to trigger a tsunami.
- It was initially reported as a NUM magnitude but quickly downgraded.
Resulting clusters: Location Information, etcetera
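The similarity measure named above can be sketched as follows: each sentence becomes a vector of bigram counts, and two sentences are compared by the cosine of those vectors. Whitespace tokenisation and the helper names are assumptions made for this example; the clustering algorithm that uses the similarity is not shown.

```python
import math
from collections import Counter

def bigram_vector(sentence):
    """Sparse vector of bigram counts for a whitespace-tokenised sentence."""
    tokens = sentence.split()
    return Counter(zip(tokens, tokens[1:]))

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

s1 = "the quake was centered NUM miles south of NAME"
s2 = "the temblor was centered NUM miles north of NAME"
similarity = cosine(bigram_vector(s1), bigram_vector(s2))
```

The two toy sentences share four of their eight bigrams each, so they come out as moderately similar, which is the effect the placeholders from Step 1 are meant to encourage.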
Topics as Sentence Clusters
Step 3: Viterbi re-estimation
- 1. Compute probabilities based on the initial topic clusters
- 2. Let the HMM predict the topics of the sentences
- 3. Put each sentence in its predicted topic cluster
- 4. Rinse, repeat
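Step 2 of the loop above, predicting the most likely topic sequence, is a standard Viterbi decode. The sketch below works in log space and assumes a non-empty document; the toy topics, keywords and probabilities are invented for illustration, and in the content model the emission score would come from each topic's bigram language model rather than from a keyword check.

```python
import math

def viterbi(sentences, topics, log_start, log_trans, log_emit):
    """Return the most likely topic sequence for a non-empty list of sentences."""
    best = [{t: log_start[t] + log_emit(t, sentences[0]) for t in topics}]
    back = []
    for sentence in sentences[1:]:
        scores, pointers = {}, {}
        for t in topics:
            # Best predecessor topic for ending in topic t at this sentence.
            prev = max(topics, key=lambda s: best[-1][s] + log_trans[(s, t)])
            scores[t] = best[-1][prev] + log_trans[(prev, t)] + log_emit(t, sentence)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final topic.
    path = [max(topics, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model (all values hypothetical).
topics = ["summary", "location"]
keywords = {"summary": "struck", "location": "centered"}
log_start = {"summary": math.log(0.8), "location": math.log(0.2)}
log_trans = {("summary", "summary"): math.log(0.3), ("summary", "location"): math.log(0.7),
             ("location", "summary"): math.log(0.4), ("location", "location"): math.log(0.6)}

def log_emit(topic, sentence):
    return math.log(0.9 if keywords[topic] in sentence else 0.1)

doc = ["a powerful earthquake struck off NAME",
       "the quake was centered NUM miles west of NAME"]
predicted = viterbi(doc, topics, log_start, log_trans, log_emit)
```

In the re-estimation loop, the predicted topics would be used to rebuild the clusters, the probabilities would be recomputed from them, and decoding would run again until the assignments stop changing.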
Evaluation
Evaluation 1 Information Ordering
5 Domains
- Earthquakes
- Clashes between armies and rebel groups
- Drug-related criminal offenses
- Financial reports
- Aviation accidents
Evaluation 1 Information Ordering
Domain       Average Length  Standard Deviation  Vocabulary  Token/Type
Earthquakes  10.4            5.2                 1182        13.2
Clashes      14.0            2.6                 1302        4.5
Drugs        10.3            7.5                 1566        4.1
Finance      13.7            1.6                 1378        12.8
Accidents    11.5            6.3                 2003        5.6
Evaluation 1 Information Ordering
Original sentence order:
- A powerful earthquake with a magnitude of 6.4 on Saturday struck off Mentawai islands in western Indonesia, causing panic but officials said there were no reports of damages or casualties.
- The U.S. Geological Survey said the earthquake was centered 121 miles west-northwest of Bengkulu on Sumatra island, at a depth of 14 miles.
- Hardimansyah Maitam, a local maritime patrol officer, said residents in Sikakap town on North Pagai, an island in the Mentawai chain, poured into the streets and ran to higher ground as the quake struck.
One of the other possible orderings:
- Hardimansyah Maitam, a local maritime patrol officer, said residents in Sikakap town on North Pagai, an island in the Mentawai chain, poured into the streets and ran to higher ground as the quake struck.
- A powerful earthquake with a magnitude of 6.4 on Saturday struck off Mentawai islands in western Indonesia, causing panic but officials said there were no reports of damages or casualties.
- The U.S. Geological Survey said the earthquake was centered 121 miles west-northwest of Bengkulu on Sumatra island, at a depth of 14 miles.
3 sentences: 6 possible orderings
3! = 3*2*1 = 6
4! = 4*3*2*1 = 24
10! = 10*9*...*1 = 3,628,800
14! = 14*13*...*1 = 87,178,291,200
Average document length per domain: Earthquakes 10.4, Clashes 14.0, Drugs 10.3, Finance 13.7, Accidents 11.5
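The growth in the number of orderings can be checked directly. The snippet below enumerates the orderings of three placeholder sentences and computes the factorials quoted on the slide; the sentence labels are stand-ins.

```python
import math
from itertools import permutations

sentences = ["s1", "s2", "s3"]  # stand-ins for three sentences
orderings = list(permutations(sentences))
count_3 = len(orderings)

# Number of orderings for documents of n sentences.
counts = {n: math.factorial(n) for n in (3, 4, 10, 14)}
```

At the average document lengths of 10 to 14 sentences, the number of orderings already runs into the millions or billions.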
Evaluation 1 Information Ordering
Task:
- 1. Create all possible sentence orderings
- 2. Assign a probability to each ordering
- 3. Sort by probability
Metric: Original Sentence Order (OSO) rank, the position of the original sentence order in the sorted list
Baseline: word bigram model
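The task and metric above can be sketched for short documents as follows. The scoring function stands in for the probability a model assigns to an ordering; here it is a made-up table of pair scores. The rank is zero-based in this sketch, so a model that puts the original order first gets rank 0; whether the reported ranks use the same convention is an assumption.

```python
from itertools import permutations

def oso_rank(sentences, score):
    """Zero-based position of the original order among all orderings, best first."""
    original = tuple(sentences)
    ranked = sorted(permutations(sentences), key=score, reverse=True)
    return ranked.index(original)

# Hypothetical model: rewards adjacent pairs it has learned to expect.
pair_score = {("summary", "location"): 2, ("location", "reaction"): 1}

def score(ordering):
    return sum(pair_score.get(pair, 0) for pair in zip(ordering, ordering[1:]))

rank_good = oso_rank(["summary", "location", "reaction"], score)
rank_poor = oso_rank(["location", "reaction", "summary"], score)
```

Enumerating every ordering is only feasible for very short documents, as the factorials for 10 and 14 sentences show.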
Evaluation 1 Information Ordering
Domain       System   Rank
Earthquakes  Content  2.67
Earthquakes  Bigram   485.16
Clashes      Content  3.05
Clashes      Bigram   635.15
Drugs        Content  15.38
Drugs        Bigram   712.03
Finance      Content  0.05
Finance      Bigram   7.44
Accidents    Content  10.96
Accidents    Bigram   973.75
Evaluation 1 Information Ordering
State of the art: Lapata (2003), pairwise sentence ordering
Metric: OSO prediction rate, the percentage of test cases in which the OSO is ranked highest
Evaluation 1 Information Ordering
Domain       System   Rank    Prediction rate
Earthquakes  Content  2.67    72%
Earthquakes  Bigram   485.16  4%
Earthquakes  Lapata   -       24%
Clashes      Content  3.05    48%
Clashes      Bigram   635.15  12%
Clashes      Lapata   -       27%
Drugs        Content  15.38   38%
Drugs        Bigram   712.03  11%
Drugs        Lapata   -       27%
Finance      Content  0.05    96%
Finance      Bigram   7.44    66%
Finance      Lapata   -       18%
Accidents    Content  10.96   41%
Accidents    Bigram   973.75  2%
Accidents    Lapata   -       10%
Evaluation 2 Summarization
Task: Shorten text to length of gold summary
Data:
- Domain: Earthquakes
- Summaries by AP journalists
- 60 documents, 900 sentences
- 50% training, 50% testing
- Avg. document: 15 sentences
- Avg. summary: 6 sentences
Evaluation 2 Summarization
Baseline: Pick the first L sentences
Sentence classifier: Kupiec et al. (1999)
Looks at words and where they are in the sentence, but not at connections between sentences
Evaluation 2 Summarization
Content Model
- 1. Assign topics to all sentences in documents
- a. Count in how many documents topic appears
- 2. Assign topics to all sentences in summaries
- a. Count in how many summaries topic appears
Evaluation 2 Summarization
Document:
- [Event Details] NAME, a local maritime patrol officer, said residents in NAME town on NAME, an island in the NAME chain, poured into the streets and ran to higher ground as the quake struck.
- [Event Details] NAME of NAME's NAME said the quake which was felt in some cities on NAME did not have the potential to trigger a tsunami.
- [Location Information] The NAME said the DATE earthquake was centered NUM miles west-northwest of NAME on NAME island, at a depth of NUM miles.
- [Previous Event] DATE's quake happened NUM days after a magnitude NUM tremor killed at least NUM people and damaged more than NUM houses and buildings in NAME province.
- [Event Summary] A powerful earthquake with a magnitude of NUM on DATE struck off NAME islands in western NAME, causing panic but officials said there were no reports of damages or casualties.
- [Previous Event] A magnitude NUM earthquake off NAME in DATE triggered a tsunami, killing NUM people in NUM countries.
Summary:
- [Event Details] NAME of NAME's NAME said the quake which was felt in some cities on NAME did not have the potential to trigger a tsunami.
- [Location Information] The NAME said the DATE earthquake was centered NUM miles west-northwest of NAME on NAME island, at a depth of NUM miles.
- [Event Summary] A powerful earthquake with a magnitude of NUM on DATE struck off NAME islands in western NAME, causing panic but officials said there were no reports of damages or casualties.
Topic                 Document count  Summary count
Event Summary         1               1
Location Information  1               1
Event Details         1               1
Previous Event        1               0
Evaluation 2 Summarization
Content Model
- 1. Assign topics to all sentences in documents
- a. Count in how many documents topic appears
- 2. Assign topics to all sentences in summaries
- a. Count in how many summaries topic appears
- 3. Probability = Summary count / Doc count
- 4. Choose sentences whose topic has a high probability of appearing in summaries
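The counting scheme in the list above can be sketched as follows, assuming topics have already been assigned to every sentence. The topic labels, toy data and function names are hypothetical, and ties between equally probable topics are broken by document position, which is a choice made for this example only.

```python
from collections import Counter

def topic_summary_probability(doc_topics, summary_topics):
    """Per topic: number of summaries containing it / number of documents containing it."""
    doc_count = Counter(t for topics in doc_topics for t in set(topics))
    sum_count = Counter(t for topics in summary_topics for t in set(topics))
    return {t: sum_count[t] / doc_count[t] for t in doc_count}

def select_sentences(sentences, topics, probability, length):
    """Keep the `length` sentences with the most summary-worthy topics, in document order."""
    order = sorted(range(len(sentences)),
                   key=lambda i: probability.get(topics[i], 0.0), reverse=True)
    return [sentences[i] for i in sorted(order[:length])]

# Toy training data: topic labels per document and per matching summary.
doc_topics = [["details", "location", "previous", "summary"],
              ["details", "previous", "summary"]]
summary_topics = [["location", "summary"], ["summary", "details"]]
probability = topic_summary_probability(doc_topics, summary_topics)

new_doc = ["d", "l", "p", "s"]  # stand-ins for four sentences
new_topics = ["details", "location", "previous", "summary"]
extract = select_sentences(new_doc, new_topics, probability, length=2)
```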
Evaluation 2 Summarization
System               Accuracy
Baseline             69%
Sentence classifier  76%
Content model        88%
Evaluation 3 Relation between Tasks
[Plot: Ordering and Summarization results (0% to 100%) against the number of topic clusters (10, 20, 40, 60, 64, 80)]
Conclusion
Probabilistic Content Model: Hidden Markov Model, Clustering
Applications: Sentence Ordering, Extractive Summarization
Questions?
Sources
- Barzilay & Lee (2004): Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization
- Lapata (2003): Probabilistic text structuring: Experiments with sentence ordering
- Kupiec et al. (1999): A trainable document summarizer