Computational Models of Discourse: TextTiling Caroline Sporleder

Computational Models of Discourse: TextTiling Caroline Sporleder Universität des Saarlandes Summer semester 2009 13.05.2009 Caroline Sporleder csporled@coli.uni-sb.de Computational Models of Discourse

Text Segmentation
. . . identification of (sub-)topic shifts

Text Segmentation Example
Penicillin is a group of beta-lactam antibiotics used in the treatment of bacterial infections caused by susceptible, usually Gram-positive, organisms. The discovery of penicillin is usually attributed to Scottish scientist Sir Alexander Fleming in 1928. Fleming noticed a halo of inhibition of bacterial growth around a contaminant blue-green mold on a Staphylococcus plate culture. Fleming concluded that the mold was releasing a substance that was inhibiting bacterial growth and lysing the bacteria. Common adverse drug reactions associated with use of the penicillins include: diarrhea, nausea, rash, urticaria, and/or superinfection (including candidiasis).

Identifying Sub-Topic Boundaries
Why not simply use paragraph boundaries or section headings?
- not all paragraph boundaries reflect topic changes (Stark 1988)
- paragraphing conventions are genre-dependent (e.g., news texts)
- subsections are often too large
⇒ segmentation into multi-paragraph units
Applications
- hypertext display
- information retrieval
- text summarisation

Sub-Topic Segmentation Methods
- find linguistic markers for “topic shift”:
  - adverbial clauses and prosodic markers (Brown & Yule, 1983)
  - certain (discourse) markers: oh, well, ok, however (Litman & Passonneau, 1995)
  - pronoun reference structure (Passonneau & Litman, 1993)
  - tense and aspect (Webber 1987)
- distribution of lexical chains
- distribution of lexical items (Hearst 1997) (for expository texts)

TextTiling (Hearst, 1997)
Main Hypothesis
“. . . when [a] subtopic changes, a significant proportion of the vocabulary changes as well.” (Hearst, 1997, p. 40)
⇒ related to segmentation with lexical chains but “shallower” (thus easier to compute)

Text Structure Types (Skorochod’ko 1972)
Compute word overlap between sentences and look at the distribution of highly connected sentences:
- chained
- ringed
- monolithic
- piecewise

TextTiling (Hearst, 1997)
Example: Stargazer text (life on earth and other planets)
Segments:
1. Intro: search for life in space (paragraphs 1–3)
2. The moon’s chemical composition (4–5)
3. How early earth-moon proximity shaped the moon (6–8)
4. How the moon helped life evolve on earth (9–12)
5. Improbability of the earth-moon system (13)
6. Binary/trinary star systems make life unlikely (14–16)
7. The low probability of nonbinary/trinary systems (17–18)
8. Properties of earth’s sun that facilitate life (19–20)
9. Summary (21)

Distribution of Terms
Distribution of terms in Stargazer (Hearst 1997, p. 42)

Distribution of Terms
- frequent terms (e.g., moon): main topic
- less frequent but uniformly distributed terms (e.g., scientist): neutral
- terms that cluster (e.g., binary): indicative of sub-topic boundaries
⇒ need to find clusters
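
The three kinds of terms can be told apart by looking at how often a term occurs and how much of the text its occurrences span. The sketch below is only an illustration on invented toy sentences; the helper names `term_positions` and `describe_term` and the "spread" measure are assumptions, not something taken from Hearst (1997).

```python
from collections import defaultdict

def term_positions(sentences):
    """Map each lowercased token to the list of sentence indices where it occurs."""
    positions = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            positions[token].append(idx)
    return dict(positions)

def describe_term(positions, n_sentences):
    """Return (frequency, spread); spread is the fraction of the text lying
    between the first and the last occurrence of the term (hypothetical measure)."""
    freq = len(positions)
    spread = (positions[-1] - positions[0] + 1) / n_sentences
    return freq, spread

# Toy data, invented for illustration only.
sentences = [
    "the moon orbits the earth",
    "scientists study the moon",
    "binary stars orbit each other",
    "binary systems are common",
    "the moon affects tides",
    "scientists measure tides",
]
pos = term_positions(sentences)
moon = describe_term(pos["moon"], len(sentences))        # frequent, wide spread
binary = describe_term(pos["binary"], len(sentences))    # clustered, narrow spread
```

A term such as "binary" that is confined to a narrow stretch of the text is the kind of cluster that hints at a sub-topic; a term such as "moon" that spans nearly the whole text does not.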

Comparing adjacent text blocks
Aim
Determine which sentence boundaries correspond to sub-topic boundaries.
- compute a lexical score for the gap between each pair of adjacent text blocks
- text blocks contain k adjacent sentences and act as moving windows
- low lexical scores preceded and followed by high lexical scores may indicate topic shifts
Lexical Score
Three possibilities:
- dot product of word vectors (tf.idf weighting found not to work so well)
- vocabulary introduction
- lexical chains
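
A minimal sketch of the first option, the dot product of word vectors, computed for every gap with a moving window of k token sequences on each side. The dot product is length-normalised here (i.e., a cosine), and the window simply shrinks near the edges of the text; both choices, as well as the function names, are assumptions of this sketch.

```python
import math
from collections import Counter

def block_similarity(left, right):
    """Normalised dot product (cosine) of the raw term-count vectors of two blocks."""
    a, b = Counter(left), Counter(right)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_scores(token_sequences, k):
    """Score each gap by comparing the k sequences before it with the k after it."""
    scores = []
    for gap in range(1, len(token_sequences)):
        left = [t for seq in token_sequences[max(0, gap - k):gap] for t in seq]
        right = [t for seq in token_sequences[gap:gap + k] for t in seq]
        scores.append(block_similarity(left, right))
    return scores

# Toy token sequences (already tokenised and stemmed), invented for illustration.
sequences = [["moon", "rock"], ["moon", "orbit"], ["star", "binary"], ["binary", "star"]]
scores = gap_scores(sequences, k=2)
```

In this toy input the middle gap scores lowest because the two blocks around it share no vocabulary, which is the pattern that the boundary identification step looks for.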

Comparing adjacent text blocks
Hearst (1997), p. 44

The TextTiling Algorithm
Three main steps:
1. Tokenisation
   - tokenisation
   - stop word removal
   - stemming
   - splitting the text into equal-size “pseudo-sentences” (token sequences)
2. Computing Lexical Score
   - set block size k (i.e., size of the moving window) ≈ avg. paragraph length
   - compute score
3. Boundary Identification
   - assign a depth score to each token sequence gap i: find the peak on the left-hand side, l, and the peak on the right-hand side, r; then $\text{depth-score}(i) = (\text{score}(l) - \text{score}(i)) + (\text{score}(r) - \text{score}(i))$
   - smooth scores
   - minor heuristics for boundary assignment (prevent boundaries that are too close together, move boundaries to paragraph breaks)
   - sort boundary scores and assign the top n boundaries
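
The depth score and the final "top n" selection can be sketched as follows. This is a simplified illustration: smoothing and the boundary heuristics are left out, peaks are found by climbing from gap i while the scores do not decrease, and the function names are assumptions of this sketch.

```python
def depth_scores(scores):
    """depth-score(i) = (score(l) - score(i)) + (score(r) - score(i)), where l and r
    are the peaks reached by climbing left and right from gap i."""
    depths = []
    for i, s in enumerate(scores):
        l = i
        while l > 0 and scores[l - 1] >= scores[l]:
            l -= 1
        r = i
        while r < len(scores) - 1 and scores[r + 1] >= scores[r]:
            r += 1
        depths.append((scores[l] - s) + (scores[r] - s))
    return depths

def top_boundaries(depths, n):
    """Return the indices of the n deepest gaps, in text order."""
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    return sorted(ranked[:n])

# Toy gap scores, invented for illustration.
gap_score_list = [0.8, 0.6, 0.2, 0.5, 0.9, 0.4, 0.7]
depths = depth_scores(gap_score_list)
boundaries = top_boundaries(depths, n=2)
```

The depth score is relative: the gap with score 0.4 is ranked above the gap with score 0.5 only because it sits in a deeper valley between its neighbouring peaks, not because its absolute score is low.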

Determining Boundaries
Hearst (1997), p. 51

Applying the algorithm to Stargazers
Hearst (1997), p. 55

Evaluation
Manually created gold standard
- seven annotators for 12 articles
- inter-annotator agreement: kappa score of .647
Kappa Score
$K = \frac{P(A) - P(E)}{1 - P(E)}$
where P(A) is the proportion of times that the annotators agree and P(E) the proportion of times that they would be expected to agree by chance.
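
A small worked example of the kappa formula. It is restricted to two annotators making binary boundary decisions, with chance agreement estimated from each annotator's own boundary rate; this is a simplification chosen for illustration and not the seven-annotator setting reported above. The annotations are invented.

```python
def kappa(p_agree, p_chance):
    """K = (P(A) - P(E)) / (1 - P(E))."""
    return (p_agree - p_chance) / (1 - p_chance)

def two_annotator_kappa(a, b):
    """Kappa for two annotators' binary boundary decisions (1 = boundary)."""
    n = len(a)
    p_agree = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own rate of marking a boundary.
    pa, pb = sum(a) / n, sum(b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    return kappa(p_agree, p_chance)

# Invented decisions for ten candidate gaps.
annotator_1 = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
annotator_2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
k_value = two_annotator_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 gaps, so P(A) = 0.8, while P(E) = 0.3 · 0.3 + 0.7 · 0.7 = 0.58, giving K = 0.22 / 0.42 ≈ 0.52. Raw agreement looks high mainly because most gaps are not boundaries, which is what the chance correction accounts for.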

Evaluation
Algorithm parameters
- block/window size
- number of words in a token sequence
- smoothing parameters
- number of boundaries to assign
⇒ ideally optimise on development set

Evaluation
Hearst (1997), p. 56

And after TextTiling?
Other work on text segmentation
- mostly also based on lexical overlap, e.g., Choi (2000): cosine between word vectors of adjacent sentences plus comparison of ranks
- typically evaluated on artificial data sets (join different texts and attempt to find the boundaries)
What do you think about evaluation on artificial data sets?
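
To make the idea of an artificial data set concrete: joining several texts yields a document whose gold boundaries are known by construction. The sketch below shows only that general idea; it does not reproduce the sampling procedure of any particular published data set, and the function name is hypothetical.

```python
def make_artificial_document(texts):
    """Concatenate sentence lists and return (sentences, gold_boundaries), where
    each boundary is the index of the first sentence of a newly joined text."""
    sentences, boundaries = [], []
    for text in texts:
        if sentences:
            boundaries.append(len(sentences))
        sentences.extend(text)
    return sentences, boundaries

# Placeholder sentences standing in for three unrelated source texts.
texts = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
document, gold = make_artificial_document(texts)
```

Boundaries produced this way separate entirely unrelated texts, so the vocabulary shift at each one is likely to be sharper than at a sub-topic shift within a single expository text.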
