Computational Models of Discourse: TextTiling (Caroline Sporleder)

SLIDE 1

Computational Models of Discourse: TextTiling

Caroline Sporleder

Universität des Saarlandes

Summer Semester 2009, 13 May 2009

Caroline Sporleder csporled@coli.uni-sb.de Computational Models of Discourse

SLIDE 2

Text Segmentation

. . . identification of (sub-)topic shifts

SLIDE 3

Text Segmentation

Example

Penicillin is a group of beta-lactam antibiotics used in the treatment of bacterial infections caused by susceptible, usually Gram-positive, organisms. The discovery of penicillin is usually attributed to Scottish scientist Sir Alexander Fleming in 1928. Fleming noticed a halo of inhibition of bacterial growth around a contaminant blue-green mold on a Staphylococcus plate culture. Fleming concluded that the mold was releasing a substance that was inhibiting bacterial growth and lysing the bacteria. Common adverse drug reactions associated with use of the penicillins include: diarrhea, nausea, rash, urticaria, and/or superinfection (including candidiasis).


SLIDE 7

Identifying Sub-Topic Boundaries

Why not simply use paragraph boundaries or section headings?

  • not all paragraph boundaries reflect topic changes (Stark 1988)
  • paragraphing conventions are genre-dependent (e.g., news texts)
  • subsections often too large

⇒ segmentation into multi-paragraph units

Applications

  • hypertext display
  • information retrieval
  • text summarisation

SLIDE 9

Sub-Topic Segmentation

Methods

  • find linguistic markers for “topic-shift”
      • adverbial clauses and prosodic markers (Brown & Yule, 1983)
      • certain (discourse) markers: oh, well, ok, however (Litman & Passonneau, 1995)
      • pronoun reference structure (Passonneau & Litman, 1993)
      • tense and aspect (Webber 1987)
  • distribution of lexical chains
  • distribution of lexical items (Hearst 1997) (for expository texts)

SLIDE 11

TextTiling (Hearst, 1997)

Main Hypothesis

“. . . when [a] subtopic changes, a significant proportion of the vocabulary changes as well.” (Hearst, 1997, p. 40)

⇒ related to segmentation with lexical chains but “shallower” (thus easier to compute)

SLIDE 12

Text Structure Types (Skorochod’ko 1972)

Compute word overlap between sentences and look at the distribution of highly connected sentences:

  • chained
  • ringed
  • monolithic
  • piecewise

SLIDE 13

TextTiling (Hearst, 1997)

Example: Stargazer text (life on earth and other planets)

Segments:

1 Intro: search for life in space (paragraphs 1–3)
2 The moon’s chemical composition (4–5)
3 How early earth-moon proximity shaped the moon (6–8)
4 How the moon helped life evolve on earth (9–12)
5 Improbability of the earth-moon system (13)
6 Binary/trinary star systems make life unlikely (14–16)
7 The low probability of nonbinary/trinary systems (17–18)
8 Properties of earth’s sun that facilitate life (19–20)
9 Summary (21)

SLIDE 14

Distribution of Terms

Distribution of terms in Stargazer (Hearst 1997, p. 42)

SLIDE 15

Distribution of Terms

  • frequent terms (e.g., moon): main topic
  • less frequent but uniformly distributed terms (scientist): neutral
  • terms that cluster (binary): indicative of sub-topic boundaries

⇒ need to find clusters
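One rough way to look at such distributions is to record, for each term, the sentence indices where it occurs, and then check how much of the text those occurrences span. This is only an illustrative sketch: the function names `term_positions` and `spread` and the toy sentences are assumptions made for the example, not part of Hearst's method.

```python
from collections import defaultdict


def term_positions(sentences):
    """Map each lower-cased token to the sentence indices where it occurs."""
    positions = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        # set() so a term repeated within one sentence is counted once
        for token in set(sentence.lower().split()):
            positions[token].append(idx)
    return dict(positions)


def spread(indices, n_sentences):
    """Fraction of the text lying between first and last occurrence."""
    return (indices[-1] - indices[0] + 1) / n_sentences


# Toy data (hypothetical): "moon" recurs throughout, "binary" clusters.
sentences = [
    "the moon orbits the earth",
    "scientists study the moon",
    "the moon shaped early life",
    "binary star systems are common",
    "binary systems make life unlikely",
    "the moon is unusual",
]
positions = term_positions(sentences)
for term in ("moon", "binary"):
    occ = positions[term]
    print(term, occ, round(spread(occ, len(sentences)), 2))
```

A term with several occurrences but a small spread is a candidate "clustering" term; a term whose occurrences span the whole text behaves more like a main-topic or neutral term.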

SLIDE 16

Comparing adjacent text blocks

Aim

Determine which sentence boundaries correspond to sub-topic boundaries.

  • compute a lexical score for the gap between pairs of text blocks
  • text blocks contain k adjacent sentences and act as moving windows
  • low lexical scores preceded and followed by high lexical scores may indicate topic shifts

Lexical Score

Three possibilities:

  • dot product of word vectors (tf.idf weighting found not to work so well)
  • vocabulary introduction
  • lexical chains
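The block comparison can be sketched as follows, assuming the text has already been split into token sequences. The sketch uses raw term counts and normalises the dot product (cosine) so that scores are comparable across gaps; the normalisation, the function names, and the toy data are assumptions of this example.

```python
import math
from collections import Counter


def block_vector(token_sequences):
    """Term-frequency vector for a block of token sequences."""
    counts = Counter()
    for seq in token_sequences:
        counts.update(seq)
    return counts


def cosine(a, b):
    """Normalised dot product of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())
                     * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(token_sequences, k=2):
    """Score every gap between adjacent token sequences.

    The block on each side holds up to k sequences (fewer near the edges).
    """
    scores = []
    for gap in range(1, len(token_sequences)):
        left = block_vector(token_sequences[max(0, gap - k):gap])
        right = block_vector(token_sequences[gap:gap + k])
        scores.append(cosine(left, right))
    return scores


# Toy token sequences (hypothetical): vocabulary changes after the second.
seqs = [
    ["moon", "earth", "orbit"],
    ["moon", "earth", "life"],
    ["star", "binary", "system"],
    ["binary", "star", "planet"],
]
print([round(s, 2) for s in gap_scores(seqs, k=2)])
```

In the toy data the middle gap gets the lowest score, because the blocks on either side share no vocabulary.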

SLIDE 17

Comparing adjacent text blocks

Hearst (1997), p. 44:

SLIDE 18

The TextTiling Algorithm

Three main steps:

1 Tokenisation:

  • tokenisation
  • stop word removal
  • stemming
  • splitting text in equal-size “pseudo-sentences” (token sequences)

2 Computing Lexical Score

  • set block size k (i.e., size of moving window) ≈ avg. paragraph length
  • compute score

3 Boundary Identification

  • assign a depth score to each token sequence gap i (find the peak on the left-hand side l and the peak on the right-hand side r; depth-score(i) = (score(l) − score(i)) + (score(r) − score(i)))
  • smooth scores
  • minor heuristics for boundary assignment (prevent boundaries that are too close, move boundaries to paragraph breaks)
  • sort boundary scores and assign top n boundaries
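Parts of steps 1 and 3 can be sketched as below. This is a minimal illustration, not a full implementation: stop word removal, stemming, smoothing, and the move to paragraph breaks are left out, and the parameter names `w`, `n`, and `min_gap`, as well as the treatment of plateaus when climbing to a peak, are assumptions of the example.

```python
def pseudo_sentences(tokens, w=20):
    """Split a token list into sequences of w tokens (last may be shorter)."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def depth_scores(scores):
    """Depth of each gap: climb to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        # depth-score(i) = (score(l) - score(i)) + (score(r) - score(i))
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths


def pick_boundaries(depths, n, min_gap=2):
    """Take the n deepest gaps, skipping any closer than min_gap to one
    that has already been chosen."""
    chosen = []
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    for i in ranked:
        if len(chosen) == n:
            break
        if depths[i] > 0 and all(abs(i - c) >= min_gap for c in chosen):
            chosen.append(i)
    return sorted(chosen)


# Toy gap scores (hypothetical): two clear valleys at positions 2 and 5.
scores = [0.6, 0.5, 0.1, 0.4, 0.7, 0.3, 0.6]
depths = depth_scores(scores)
print(pick_boundaries(depths, n=2))
```

Fixing `n` in advance mirrors the "assign top n boundaries" step; a threshold on the depth score would be an alternative design.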

SLIDE 19

Determining Boundaries

Hearst (1997), p. 51

SLIDE 20

Applying the algorithm to Stargazers

Hearst (1997), p. 55

SLIDE 21

Evaluation

Manually created gold standard

  • seven annotators for 12 articles
  • inter-annotator agreement: kappa score of .647

Kappa Score

$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

where P(A) is the proportion of times that the annotators agree and P(E) the proportion of times that they would be expected to agree by chance.
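As a worked illustration of the formula, the sketch below computes kappa for the simplified case of two annotators making binary boundary decisions. How P(E) was estimated for the seven annotators is not stated here, so the chance-agreement estimate (from each annotator's own boundary rate) and the toy labels are assumptions of this example.

```python
def kappa(p_agree, p_expected):
    """K = (P(A) - P(E)) / (1 - P(E))."""
    return (p_agree - p_expected) / (1 - p_expected)


def two_annotator_kappa(a, b):
    """Kappa for two annotators' binary decisions (1 = boundary)."""
    n = len(a)
    p_agree = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own rate of marking boundaries.
    rate_a, rate_b = sum(a) / n, sum(b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return kappa(p_agree, p_expected)


# Toy labels (hypothetical): one decision per candidate gap.
ann_a = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
ann_b = [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(round(two_annotator_kappa(ann_a, ann_b), 3))
```

Here the annotators agree on 8 of 10 gaps, so P(A) = 0.8, while P(E) = 0.3 · 0.3 + 0.7 · 0.7 = 0.58, which gives K = 0.22 / 0.42.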

SLIDE 22

Evaluation

Algorithm parameters

  • block/window size
  • number of words in a token sequence
  • smoothing parameters
  • number of boundaries to assign

⇒ ideally optimise on development set

SLIDE 23

Evaluation

Hearst (1997), p. 56

SLIDE 24

And after TextTiling?

Other work on text segmentation

  • mostly also based on lexical overlap, e.g., Choi (2000): cosine between word vectors of adjacent sentences plus comparison of ranks
  • typically evaluated on artificial data sets (join different texts and attempt to find the boundaries)

What do you think about evaluation on artificial data sets?
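To make the construction concrete, an artificial test document of this kind could be assembled roughly as follows. The chunk lengths, the function name `make_artificial_document`, and the placeholder texts are assumptions of this sketch and do not reproduce any particular published protocol.

```python
import random


def make_artificial_document(texts, rng, min_len=3, max_len=6):
    """Concatenate an opening chunk of sentences from each source text.

    Returns the joined sentence list and the gold boundary positions
    (index of the first sentence of every segment after the first).
    """
    document, boundaries = [], []
    for text in texts:
        length = min(len(text), rng.randint(min_len, max_len))
        if document:
            boundaries.append(len(document))
        document.extend(text[:length])
    return document, boundaries


# Placeholder "texts": lists of sentence strings from unrelated sources.
texts = [[f"text{t} sentence{s}" for s in range(8)] for t in range(3)]
doc, gold = make_artificial_document(texts, random.Random(0))
print(len(doc), gold)
```

Because every gold boundary here joins two unrelated sources, the vocabulary shift is likely to be sharper than at a sub-topic shift within a single text, which bears on the question raised above.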
