SLIDE 1

Center for Reflected Text Analytics

Lecture 2 Annotation tools & Segmentation

SLIDE 2

Annotation theory
Guidelines
Inter-Annotator agreement
Inter-subjective annotations
Annotation exercise
Discuss disagreements with your neighbor
Improve annotation guidelines

University of Stuttgart 2

Summary of Part 1

SLIDE 3

Tools can support the annotation process at various stages
Managing multiple annotators
Assign documents to annotate
Supervise their progress
Analyse disagreements
Display disagreements (only)
Calculate quantitative IAA (κ)
Create a gold standard
Make decisions on disagreements
Record final decisions
Usable tools: See handout
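
The quantitative IAA step can be illustrated with Cohen's κ for two annotators. This is a minimal sketch, assuming each annotator assigned exactly one label per item; the function name and the example labels are hypothetical and not taken from any tool on the handout.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of six tokens by two annotators.
ann_1 = ["PER", "PER", "LOC", "O", "O", "PER"]
ann_2 = ["PER", "LOC", "LOC", "O", "O", "O"]
kappa = cohen_kappa(ann_1, ann_2)
print(round(kappa, 2))  # 0.52
```

Here the annotators agree on 4 of 6 items, but κ is lower than 4/6 because part of that agreement is expected by chance.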

Tool Support Annotation

SLIDE 4

Segmentation

SLIDE 5

Segmentation Tool Download

http://tinyurl.com/cretanetworker = http://www2.ims.uni-stuttgart.de/gcl/reiterns/creta/CRETANetworker.jar

SLIDE 6

Abstract definition
No meaning of a segment implied
The task of separating a text into multiple parts (“segments”)
Segmentation according to various criteria based on
Structure (chapters, acts, letters, speeches)
Linguistics (sentences, paragraphs)
Narrative content (scenes, time, place)
Content (topics under discussion)
No generic criterion covering multiple research questions

Segmentation

SLIDE 7

Focus on segments
Spans of text
Focus on segment boundaries
Positions in a text
Views are equivalent – we will switch between them when appropriate
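
The equivalence of the two views can be sketched as a round trip between spans and boundary positions. This assumes the segments tile the text without gaps, with end-exclusive character offsets starting at 0; the function names and offsets are hypothetical.

```python
def spans_to_boundaries(spans):
    """Turn contiguous (start, end) spans into the positions between them."""
    return [end for _, end in spans[:-1]]

def boundaries_to_spans(boundaries, text_length):
    """Turn boundary positions back into contiguous (start, end) spans."""
    edges = [0, *sorted(boundaries), text_length]
    return list(zip(edges[:-1], edges[1:]))

# Four segments are separated by three boundaries.
spans = [(0, 120), (120, 310), (310, 475), (475, 600)]
boundaries = spans_to_boundaries(spans)
print(boundaries)  # [120, 310, 475]
print(boundaries_to_spans(boundaries, text_length=600) == spans)  # True
```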

Viewpoints Segmentation

[Figure: a text split into Segment 1 to Segment 4, with a segment boundary between each pair of neighbouring segments]

SLIDE 8

Entities + Segments = Networks

[Figure: co-occurrence network with the nodes Mary, Peter and Paul]

SLIDE 9

Segmented text with the appearing entities

⟨{A, B}, {A, B, B, B, A}, {A, C}⟩

Convert into a (square) adjacency matrix
Diagonal is typically uninteresting
Matrix is symmetric
Create network
A node is created for each row (or column)
An edge is created for each cell, weighted according to the cell value

Slightly more abstract description Entities + Segments = Networks

    A   B   C
A       2   1
B   2
C   1

[Figure: network with the nodes A, B and C]
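
The conversion from segments to an adjacency matrix can be sketched as follows. This assumes that a pair of entities is counted once per segment in which both occur, regardless of how often each is mentioned there, which is consistent with the example values (A and B: 2, A and C: 1). The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(segments):
    """Count, for each entity pair, the segments in which both occur."""
    weights = Counter()
    for segment in segments:
        # Repeated mentions inside one segment count once (assumption).
        for pair in combinations(sorted(set(segment)), 2):
            weights[pair] += 1
    return weights

segments = [["A", "B"], ["A", "B", "B", "B", "A"], ["A", "C"]]
weights = cooccurrence(segments)
entities = sorted({e for seg in segments for e in seg})

# Symmetric adjacency matrix; the diagonal is left at zero.
matrix = [[weights[tuple(sorted((r, c)))] if r != c else 0 for c in entities]
          for r in entities]
for name, row in zip(entities, matrix):
    print(name, row)
# A [0, 2, 1]
# B [2, 0, 0]
# C [1, 0, 0]
```

Each non-zero cell above the diagonal then corresponds to one weighted edge in the network.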

SLIDE 10

Theoretically
Segments can be annotated just like entity references
Both cover sequences of words
Appropriate annotation guidelines would define when to annotate segments
Practically
Segmentation criterion closely tied to research question
No reasonable generic abstraction layer
That works for multiple research questions and/or text corpora
Single texts only contain a few segments
Many more annotated texts are needed for any kind of automation

Segmentation Annotation

SLIDE 11

Web-based UI
Beta software
Automatic annotation through rules and tools
Entity annotation
Stanford Named Entity Recognizer (Finkel et al., 2005)
Only proper names, no descriptive noun phrases
Rules (regular expressions) – to specify the entity references
Segment annotation
Rules (regular expressions) – to specify the segment boundaries
Unsupervised segmentation algorithm (TextTiling; Hearst, 1994)
Network export → Gephi
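
To give an idea of what a network export for Gephi contains, the following sketch writes a minimal GEXF document using only the Python standard library. It is not the tool's own export code: the function name is hypothetical, and the element and attribute names follow the GEXF 1.2 draft format as an assumption that should be checked against the GEXF documentation.

```python
import xml.etree.ElementTree as ET

def to_gexf(weights):
    """Serialise {(source, target): weight} as a minimal GEXF 1.2 document."""
    root = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(root, "graph", mode="static", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    # One node per entity that takes part in at least one edge.
    for name in sorted({n for pair in weights for n in pair}):
        ET.SubElement(nodes, "node", id=name, label=name)
    # One weighted edge per co-occurring pair.
    for i, ((source, target), weight) in enumerate(sorted(weights.items())):
        ET.SubElement(edges, "edge", id=str(i), source=source,
                      target=target, weight=str(weight))
    return ET.tostring(root, encoding="unicode")

gexf = to_gexf({("A", "B"): 2, ("A", "C"): 1})
print(gexf)
```

Saving the returned string to a file with the extension .gexf should give a file that can be opened in Gephi.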

Segment Annotation Tool

SLIDE 12

Free and open source
https://gephi.org
Wide range of metric, filter and layout algorithms
Network editing (e.g., merge nodes)
Plugins
Export into static images

Network Tool Gephi

SLIDE 13

demo

SLIDE 14

A powerful way to describe sets of character sequences
Many search tools and most programming languages support REs
They look cryptic, but are quite systematic
REs on the slides and the handout are enclosed in forward slashes / / for readability; the slashes do not need to be typed in the tool
Basics
Many regular characters stand for themselves
The RE /a/ finds occurrences of the character “a”
Sequences of characters stand for sequences of themselves
The RE /the/ finds occurrences of the string “the”

Useful text processing skills 101 Regular Expressions

SLIDE 15

Meta characters (“quantifiers”) apply to the preceding character
?: the preceding character is optional (0 or 1 times)
/them?/ finds both “the” and “them”
+: the preceding character occurs one or more times
/ab+/ finds “ab”, “abb”, “abbb”, …
*: the Kleene star; the preceding character occurs zero or more times
/ab*/ finds “a”, “ab”, “abb”, “abbb”, …
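
The three quantifiers can be tried out with Python's re module. The sketch uses re.fullmatch so that a pattern has to cover a whole candidate string, which makes the difference between ?, + and * visible; the helper name and the word list are hypothetical.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches completely."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

words = ["a", "ab", "abb", "abbb", "the", "them"]
print(full_matches(r"them?", words))  # ['the', 'them']
print(full_matches(r"ab+", words))    # ['ab', 'abb', 'abbb']
print(full_matches(r"ab*", words))    # ['a', 'ab', 'abb', 'abbb']
```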

Basics Regular Expressions

SLIDE 16

/(re1|re2)/ finds everything that either re1 or re2 finds
/(good|better|best)/ finds the positive, comparative and superlative forms of the adjective “good”

/great(er|est)?/ finds the positive, comp. and sup. forms of “great”
The question mark makes the suffixes optional
We can mark alternatives on character level in square brackets: […]
/[Tt]he/ finds upper and lower case forms of “the”
Square brackets support ranges of characters
/[A-Z]/ finds upper-case characters (beware: locale)
/[0-9]/ finds digits
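
Alternations and character classes behave the same way in Python's re module. One detail in this sketch is specific to Python: re.findall returns the content of a capturing group if the pattern contains one, so the group in the "great" pattern is written as a non-capturing group (?:…). The example sentence is made up.

```python
import re

text = "The best plan is greater than the good one; room 42 is the greatest."

print(re.findall(r"good|better|best", text))   # ['best', 'good']
print(re.findall(r"great(?:er|est)?", text))   # ['greater', 'greatest']
print(re.findall(r"[Tt]he", text))             # ['The', 'the', 'the']
print(re.findall(r"[0-9]+", text))             # ['42']
```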

Alternations and Character Classes Regular Expressions

SLIDE 17

The dot . matches any single character
/a.*b/ finds everything that begins with a and ends with b
Escape character: Backslash
In order to find a literal dot, we need to suppress its special meaning
/.*\.doc/ finds everything that ends in “.doc” (e.g., filenames)
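
The effect of escaping the dot can be checked with a few made-up file names. The sketch again uses re.fullmatch, so each pattern has to cover the whole name.

```python
import re

names = ["report.doc", "notes.txt", "reportXdoc", "a.docx"]

# Unescaped dot: any character may stand before "doc".
unescaped = [n for n in names if re.fullmatch(r".*.doc", n)]
# Escaped dot: a literal "." is required before "doc".
escaped = [n for n in names if re.fullmatch(r".*\.doc", n)]

print(unescaped)  # ['report.doc', 'reportXdoc']
print(escaped)    # ['report.doc']

# The dot stands for exactly one character.
print(re.findall(r"a.b", "aab a-b acb ab"))  # ['aab', 'a-b', 'acb']
```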

Special cases and exceptions Regular Expressions

SLIDE 18

Chapter 10.
/Chapter [0-9]+\./
Chapter V. (Roman numerals)
/Chapter [IVXLCDM]+\./
Beware: Possible over-matching
Dates: MAY 22., AUGUST 23.
/[A-Z]+ [0-9]+\./
Beware: Possible over-matching
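
The boundary patterns and the over-matching problem can be tried on a small made-up text. Anchoring a pattern to whole lines, as in the last pattern, is one possible way to reduce over-matching; it is an assumption that headings stand on a line of their own, and the Roman numeral class here includes L and D so that numerals such as XL are covered.

```python
import re

text = """Chapter 1.
MAY 22. We left London in the evening.
Chapter V.
I. Chapter 10. is mentioned in passing here.
AUGUST 23. Rain all day."""

arabic = re.compile(r"Chapter [0-9]+\.")
roman = re.compile(r"Chapter [IVXLCDM]+\.")
dates = re.compile(r"[A-Z]+ [0-9]+\.")

print(arabic.findall(text))  # ['Chapter 1.', 'Chapter 10.'] (second is over-matched)
print(roman.findall(text))   # ['Chapter V.']
print(dates.findall(text))   # ['MAY 22.', 'AUGUST 23.']

# Requiring the heading to fill a whole line drops the mention in running text.
anchored = re.compile(r"^Chapter [0-9]+\.$", re.MULTILINE)
print(anchored.findall(text))  # ['Chapter 1.']
```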

Real examples Regular Expressions

SLIDE 19

Unsupervised segmentation algorithm, developed for expository texts
Compares lexicon in a window left and right of a target sentence gap

Hearst (1994) TextTiling

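[Figure: TextTiling at the gap between sentences n and n+1 (a sentence boundary), with window size = 2 and step size = 3. The word vectors v1 (left window) and v2 (right window) are compared; dist(v1, v2) gives the score d_n for this gap. Example values shown: 2 1 7 3 2 9.]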

SLIDE 20

Hearst (1994) TextTiling

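[Figure: the same text after moving on by the step size; the scores d_n and d_(n+3) are computed at sentence boundaries three sentences apart.]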
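
The window comparison can be sketched in a few lines. This is a simplification of what the slides depict, not Hearst's full algorithm, which additionally smooths the scores and selects boundaries from the depth of the valleys in the score curve. It assumes pre-tokenised sentences, raw word counts as vectors and cosine distance; all names and the toy sentences are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_distance(left, right):
    """1 - cosine similarity of two word-count vectors (Counters)."""
    dot = sum(left[w] * right[w] for w in left)
    norm = (sqrt(sum(v * v for v in left.values()))
            * sqrt(sum(v * v for v in right.values())))
    return 1.0 - dot / norm if norm else 1.0

def gap_scores(sentences, window=2, step=3):
    """Score every `step`-th sentence gap by the distance between its windows.

    A gap is identified by the number of sentences that precede it.
    """
    scores = {}
    for gap in range(window, len(sentences) - window + 1, step):
        left = Counter(w for s in sentences[gap - window:gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + window] for w in s)
        scores[gap] = cosine_distance(left, right)
    return scores

sentences = [
    ["ship", "sails", "sea"], ["sea", "storm", "ship"],
    ["ship", "captain", "sea"], ["captain", "storm", "sea"],
    ["ship", "sea", "captain"],
    ["bank", "money", "loan"], ["loan", "money", "bank"],
    ["bank", "interest", "money"], ["money", "loan", "interest"],
]
scores = gap_scores(sentences)
print({gap: round(d, 2) for gap, d in scores.items()})  # {2: 0.3, 5: 1.0}
print(max(scores, key=scores.get))  # 5, the gap where the topic changes
```

With a step size of 3, only the gaps after sentences 2 and 5 are scored, so a boundary that falls between two scored gaps can be missed; a smaller step size trades speed for precision.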

SLIDE 21

More powerful algorithms are available
E.g., topic segmentation
Clear adaptation possibilities
How to create word vectors?
Which words are included (function/content words)?
Which value is represented in the vector (frequency, tf*idf, information, …)?
How to calculate similarity/distance?
Cosine, Manhattan, …
But: Evaluation is hard
No gold standard available
Different expectations
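
How much these choices matter can be seen on a toy example that compares two word windows with and without function words, using cosine and Manhattan distance. The function word list, the helper names and the two phrases are made up for illustration.

```python
from collections import Counter
from math import sqrt

FUNCTION_WORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative list

def vector(tokens, content_only=False):
    """Word-frequency vector, optionally without function words."""
    return Counter(t for t in tokens
                   if not (content_only and t in FUNCTION_WORDS))

def cosine_distance(u, v):
    """1 - cosine similarity of two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / norm if norm else 1.0

def manhattan_distance(u, v):
    """Sum of absolute count differences over all words."""
    return sum(abs(u[w] - v[w]) for w in set(u) | set(v))

left = "the captain of the ship".split()
right = "the director of the bank".split()
for content_only in (False, True):
    u, v = vector(left, content_only), vector(right, content_only)
    print(content_only, round(cosine_distance(u, v), 3), manhattan_distance(u, v))
# False 0.286 4
# True 1.0 4
```

Dropping the function words moves the cosine distance from 0.286 to 1.0, while the Manhattan distance stays at 4, so the two design decisions interact.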

Hearst (1994) TextTiling

SLIDE 22

Go to …
Load a text of your choice (it's better if you are familiar with it)
Add entity references by applying the Stanford NER system
Briefly check whether the important entities are included (“Passepartout”, for instance, is not)

You can add specific names by specifying regular expressions
Add reasonable segment annotations
Export a GEXF file and load it into Gephi
Play with various options and see how the network changes

Hands-On Session 2