Center for Reflected Text Analytics
Lecture 2: Annotation tools & Segmentation
- Annotation theory
- Guidelines
- Inter-Annotator agreement
- Inter-subjective annotations
- Annotation exercise
- Discuss disagreements with your neighbor
- Improve annotation guidelines
University of Stuttgart 2
Summary of Part 1
- Tools can support the annotation process at various stages
- Managing multiple annotators
- Assign documents to annotate
- Supervise their progress
- Analyse disagreements
- Display disagreements (only)
- Calculate quantitative IAA (κ)
- Create a gold standard
- Make decisions on disagreements
- Record final decisions
- Usable tools: See handout
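The quantitative agreement step can be illustrated with a small calculation. This is a minimal sketch, assuming that κ refers to Cohen's kappa for two annotators who labelled the same items; the function name and the sample labels are hypothetical and not taken from any of the tools on the handout.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Invented sample annotations (hypothetical label set).
ann_1 = ["PER", "PER", "LOC", "O", "O", "PER"]
ann_2 = ["PER", "LOC", "LOC", "O", "O", "O"]
print(round(cohen_kappa(ann_1, ann_2), 3))  # 0.52
```

Here the annotators agree on 4 of 6 items (0.667), chance agreement is 11/36 (0.306), so κ = 13/25 = 0.52.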
Tool Support Annotation
Segmentation
Segmentation Tool Download
http://tinyurl.com/cretanetworker = http://www2.ims.uni-stuttgart.de/gcl/reiterns/creta/CRETANetworker.jar
- Abstract definition
- No meaning of a segment implied
- The task of separating a text into multiple parts (“segments”)
- Segmentation according to various criteria based on
- Structure (chapters, acts, letters, speeches)
- Linguistics (sentences, paragraphs)
- Narrative content (scenes, time, place)
- Content (topics under discussion)
- No generic criterion covering multiple research questions
Segmentation
- Focus on segments
- Spans of text
- Focus on segment boundaries
- Positions in a text
- Views are equivalent – we will switch between them when appropriate
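The equivalence of the two views can be made concrete with a pair of conversion functions. This is only a sketch under the assumption that segments are contiguous, non-overlapping character spans that cover the whole text; the function names and offsets are hypothetical.

```python
def segments_to_boundaries(segments):
    """Spans [(start, end), ...] -> positions where a new segment starts."""
    return [start for start, _ in segments[1:]]

def boundaries_to_segments(boundaries, text_length):
    """Boundary positions -> contiguous half-open spans covering the text."""
    cuts = [0] + sorted(boundaries) + [text_length]
    return list(zip(cuts[:-1], cuts[1:]))

# Four segments, given as half-open character spans (invented offsets).
spans = [(0, 120), (120, 310), (310, 400), (400, 560)]
bounds = segments_to_boundaries(spans)
print(bounds)                                       # [120, 310, 400]
print(boundaries_to_segments(bounds, 560) == spans)  # True
```

Four segments correspond to three boundaries, and converting back and forth loses no information as long as the text length is known.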
Viewpoints Segmentation
[Figure: a text divided into Segment 1 to Segment 4, separated by three segment boundaries]
Entities + Segments = Networks
Co-Occurrence Network
[Figure: example network with the nodes Mary, Peter and Paul]
- Segmented text with the appearing entities
⟨{A, B}, {A, B, B, B, A}, {A, C}⟩
- Convert into a (square) adjacency matrix
- Diagonal is typically uninteresting
- Matrix is symmetric
- Create network
- A node is created for each row (or column)
- An edge is created for each non-zero cell, weighted according to the cell value
Slightly more abstract description Entities + Segments = Networks
    A  B  C
A   -  2  1
B   2  -  0
C   1  0  -
[Figure: resulting network with the nodes A, B and C]
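The conversion from segments to a network can be sketched in a few lines. The sketch assumes that a pair of entities is counted once per segment in which both occur, however often each is mentioned there; this reading is consistent with the A/B/C example, but the function names are hypothetical and this is not the code of the course tool.

```python
from itertools import combinations

def cooccurrence_matrix(segments):
    """Count, for each entity pair, the segments in which both occur."""
    entities = sorted({e for seg in segments for e in seg})
    index = {e: i for i, e in enumerate(entities)}
    matrix = [[0] * len(entities) for _ in entities]
    for seg in segments:
        # set(): repeated mentions within one segment count once
        for a, b in combinations(sorted(set(seg)), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1  # keep the matrix symmetric
    return entities, matrix

def edges(entities, matrix):
    """One weighted edge per non-zero cell above the diagonal."""
    return [(entities[i], entities[j], matrix[i][j])
            for i in range(len(entities))
            for j in range(i + 1, len(entities))
            if matrix[i][j] > 0]

segments = [["A", "B"], ["A", "B", "B", "B", "A"], ["A", "C"]]
names, m = cooccurrence_matrix(segments)
print(m)                # [[0, 2, 1], [2, 0, 0], [1, 0, 0]]
print(edges(names, m))  # [('A', 'B', 2), ('A', 'C', 1)]
```

Because the matrix is symmetric and the diagonal is ignored, reading only the cells above the diagonal yields each undirected edge exactly once.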
- Theoretically
- Segments can be annotated just like entity references
- Both cover sequences of words
- Appropriate annotation guidelines would define when to annotate segments
- Practically
- Segmentation criterion closely tied to research question
- No reasonable generic abstraction layer
- That works for multiple research questions and/or text corpora
- A single text contains only a few segments
- Many more annotated texts would be needed for any kind of automation
Segmentation Annotation
- Web-based UI
- Beta software
- Automatic annotation through rules and tools
- Entity annotation
- Stanford Named Entity Recognizer (Finkel et al., 2005)
- Only proper names, no descriptive noun phrases
- Rules (regular expressions) – to specify the entity references
- Segment annotation
- Rules (regular expressions) – to specify the segment boundaries
- Unsupervised segmentation algorithm (TextTiling; Hearst, 1994)
- Network export → Gephi
Segment Annotation Tool
- Free and open source
- https://gephi.org
- Wide range of metrics, filters and layout algorithms
- Network editing (e.g., merge nodes)
- Plugins
- Export into static images
Network Tool Gephi
demo
- A powerful way to describe sets of character sequences
- Many search tools support regular expressions (REs), and so do virtually all programming languages
- Looks cryptic, but is quite systematic
- REs on the slides and the handout are enclosed in forward slashes / / for readability
- The slashes don’t need to be typed in the tool
- Basics
- Many regular characters stand for themselves
- The RE /a/ finds occurrences of the character “a”
- Sequences of characters stand for sequences of themselves
- The RE /the/ finds occurrences of the string “the”
Useful text processing skills 101 Regular Expressions
- Meta characters (“quantifiers”) apply to the preceding character
- ?: preceding character is optional (0 or 1 times)
- /them?/ finds both “the” and “them”
- +: preceding character one or more times
- /ab+/ finds “ab”, “abb”, “abbb”, …
- *: (the Kleene star) preceding character zero or more times
- /ab*/ finds “a”, “ab”, “abb”, “abbb”, …
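The three quantifiers can be tried out directly. The following is a small sketch using Python's standard `re` module (one of many RE implementations; the course tool may differ in details). The helper `matches` is hypothetical and uses `re.fullmatch`, so a candidate only counts if the whole string is described by the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern describes completely."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(matches(r"them?", ["the", "them", "themm"]))  # ['the', 'them']
print(matches(r"ab+", ["a", "ab", "abbb"]))         # ['ab', 'abbb']
print(matches(r"ab*", ["a", "ab", "abbb", "b"]))    # ['a', 'ab', 'abbb']
```

Note that the quantifier applies only to the single preceding character: in /ab+/ the “a” must occur exactly once.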
Basics Regular Expressions
- /(re1|re2)/ matches everything that either re1 or re2 matches
- /(good|better|best)/ finds the positive, comparative and superlative forms of the adjective “good”
- /great(er|est)?/ finds “great” and its comparative and superlative forms
- The question mark makes the suffix group optional
- We can mark alternatives on character level in square brackets: […]
- /[Tt]he/ finds upper and lower case forms of “the”
- Square brackets support ranges of characters
- /[A-Z]/ finds upper-case characters (beware: locale)
- /[0-9]/ finds digits
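Alternations and character classes can be checked against a short sample sentence. This sketch again assumes Python's `re` module; the sample text is invented. The group is written as `(?:...)` because `re.findall` would otherwise return only the content of the parentheses instead of the whole match.

```python
import re

text = ("The good, the better and the best; "
        "great, greater, greatest. Call 0711 in 2016.")

forms_good = re.findall(r"good|better|best", text)
forms_great = re.findall(r"great(?:er|est)?", text)  # non-capturing group
articles = re.findall(r"[Tt]he", text)
numbers = re.findall(r"[0-9]+", text)

print(forms_good)   # ['good', 'better', 'best']
print(forms_great)  # ['great', 'greater', 'greatest']
print(articles)     # ['The', 'the', 'the']
print(numbers)      # ['0711', '2016']
```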
Alternations and Character Classes Regular Expressions
- The dot . matches any single character (in most tools, except a line break)
- /a.*b/ finds everything that begins with a and ends with b
- Escape character: backslash
- To find a literal dot, we need to switch off its special meaning
- /.*\.doc/ finds everything that ends in “.doc” (e.g., filenames)
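The difference between an escaped and an unescaped dot shows up as soon as a name contains some other character in that position. A brief sketch with Python's `re` module and invented file names:

```python
import re

names = ["report.doc", "notes.txt", "reportXdoc", "a.docx"]

# Unescaped dot: any character may stand before "doc".
unescaped = [n for n in names if re.fullmatch(r".*.doc", n)]
# Escaped dot: a literal "." is required before "doc".
escaped = [n for n in names if re.fullmatch(r".*\.doc", n)]

print(unescaped)  # ['report.doc', 'reportXdoc']
print(escaped)    # ['report.doc']

# .* is greedy: it extends to the last possible "b".
longest = re.search(r"a.*b", "xxaYYbZZbxx").group()
print(longest)    # aYYbZZb
```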
Special cases and exceptions Regular Expressions
- Chapter 10.
- /Chapter [0-9]+\./
- Chapter V. (Roman numerals)
- /Chapter [IVXLCDM]+\./
- Beware: possible over-matching
- Dates: MAY 22., AUGUST 23.
- /[A-Z]+ [0-9]+\./
- Beware: possible over-matching
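The over-matching risk can be demonstrated on a few lines of text. This is a sketch with Python's `re` module; the sample text is invented, and anchoring the pattern to a whole line with `^`, `$` and `re.MULTILINE` is just one possible way to limit over-matching, not necessarily the one the course tool offers.

```python
import re

# Invented sample text with two headings, one date line and one false friend.
text = (
    "Chapter 1.\nThe traveller lived at No. 7.\n"
    "Chapter II.\nHe met PASSENGER 12. at the dock.\n"
    "MAY 22.\nThe journey began.\n"
)

heading = re.compile(r"Chapter (?:[0-9]+|[IVXLCDM]+)\.")
date = re.compile(r"[A-Z]+ [0-9]+\.")
# Assumption: real date headings stand alone on their line.
date_line = re.compile(r"^[A-Z]+ [0-9]+\.$", re.MULTILINE)

headings = [m.group() for m in heading.finditer(text)]
dates_loose = [m.group() for m in date.finditer(text)]
dates_strict = [m.group() for m in date_line.finditer(text)]

print(headings)      # ['Chapter 1.', 'Chapter II.']
print(dates_loose)   # ['PASSENGER 12.', 'MAY 22.']
print(dates_strict)  # ['MAY 22.']
```

The loose date pattern also accepts “PASSENGER 12.” in running text, whereas the line-anchored variant keeps only the heading.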
Real examples Regular Expressions
- Unsupervised segmentation algorithm, developed for expository texts
- Compares the vocabulary in windows to the left and right of each candidate sentence gap
Hearst (1994) TextTiling
[Figure: windows of size 2 to the left and right of the sentence boundary between sentences n and n+1; the window vectors v1 and v2 yield the distance d_n = dist(v1, v2); step size = 3]
[Figure: the window comparison is repeated at later sentence gaps, giving the distances d_n and d_{n+3}]
- More powerful algorithms are available
- E.g., topic segmentation
- Clear adaptation possibilities
- How to create word vectors?
- Which words are included (function/content words)?
- Which value is represented in the vector (frequency, tf*idf, information, …)?
- How to calculate similarity/distance?
- Cosine, Manhattan, …
- But: Evaluation is hard
- No gold standard available
- Different expectations
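The window comparison and the adaptation choices listed above can be sketched as follows. This is a heavily simplified illustration, not Hearst's full TextTiling algorithm: it assumes raw term frequencies over content words, cosine distance, a comparison at every sentence gap (step size 1), and a hypothetical threshold rule for picking boundaries. The stop word list, function names, threshold and sample sentences are all invented.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "on", "was", "as", "of", "and", "in"}  # toy list

def bag_of_words(sentences):
    """Raw frequencies of content words in a window of sentences."""
    words = (w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))
    return Counter(w for w in words if w not in STOPWORDS)

def cosine_distance(v1, v2):
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(x * x for x in v1.values()))
            * math.sqrt(sum(x * x for x in v2.values())))
    return 1.0 - dot / norm if norm else 1.0

def gap_distances(sentences, window=2):
    """Distance d_n between the windows left and right of each gap."""
    return [cosine_distance(bag_of_words(sentences[max(0, n - window):n]),
                            bag_of_words(sentences[n:n + window]))
            for n in range(1, len(sentences))]

def boundaries(distances, threshold=0.9):
    """Gaps whose distance is a local maximum at or above the threshold."""
    result = []
    for i, d in enumerate(distances):
        left = distances[i - 1] if i > 0 else -1.0
        right = distances[i + 1] if i + 1 < len(distances) else -1.0
        if d >= threshold and d >= left and d >= right:
            result.append(i + 1)  # boundary before sentence index i + 1
    return result

sentences = [
    "The ship left the harbour.",
    "The ship crossed the sea.",
    "The harbour was far behind the ship.",
    "Stocks fell on the market.",
    "The market lost value as stocks dropped.",
    "Traders watched the market.",
]
d = gap_distances(sentences)
print([round(x, 2) for x in d])
print(boundaries(d))  # [3]
```

Swapping `bag_of_words` for tf*idf weights or `cosine_distance` for a Manhattan distance changes the profile of `d`, which is exactly where the adaptation questions above come in.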
- Go to …
- Load a text of your choice (ideally one you are familiar with)
- Add entity references by applying the Stanford NER system
- Briefly check whether the important entities are included (“Passepartout”, for instance, is not)
- You can add specific names by specifying regular expressions
- Add reasonable segment annotations
- Export a GEXF file and load it into Gephi
- Play with various options and see how the network changes