Lecture 2: Annotation Tools & Segmentation



SLIDE 1

Center for Reflected Text Analytics

Lecture 2: Annotation Tools & Segmentation

SLIDE 2
Summary of Part 1

  • Annotation theory
      • Guidelines
      • Inter-annotator agreement
      • Inter-subjective annotations
  • Annotation exercise
      • Discuss disagreements with your neighbor
      • Improve annotation guidelines

University of Stuttgart

SLIDE 3
Annotation: Tool Support

  • Tools can support the annotation process at various stages
  • Managing multiple annotators
      • Assign documents to annotate
      • Supervise their progress
  • Analyse disagreements
      • Display disagreements (only)
      • Calculate quantitative IAA (κ)
  • Create a gold standard
      • Make decisions on disagreements
      • Record final decisions
  • Usable tools: See handout
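
The slide only names κ as the quantitative agreement measure. As an illustration, here is a minimal sketch that assumes Cohen's κ for exactly two annotators who labelled the same items; the function name `cohen_kappa` and the label lists are invented for this example and are not taken from any of the tools on the handout.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both labels are identical.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations of six tokens by two annotators.
annotator_1 = ["PER", "PER", "LOC", "O", "O", "PER"]
annotator_2 = ["PER", "LOC", "LOC", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 4 of 6 items, but part of that agreement is expected by chance, so κ comes out lower than the raw agreement rate.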

SLIDE 4

Segmentation

SLIDE 5
Segmentation Tool: Download

http://tinyurl.com/cretanetworker (short link to http://www2.ims.uni-stuttgart.de/gcl/reiterns/creta/CRETANetworker.jar)

SLIDE 6
Segmentation

  • Abstract definition
      • No meaning of a segment implied
      • The task of separating a text into multiple parts (“segments”)
  • Segmentation according to various criteria, based on
      • Structure (chapters, acts, letters, speeches)
      • Linguistics (sentences, paragraphs)
      • Narrative content (scenes, time, place)
      • Content (topics under discussion)
  • No generic criterion covers multiple research questions

SLIDE 7
Segmentation: Viewpoints

  • Focus on segments
      • Spans of text
  • Focus on segment boundaries
      • Positions in a text
  • The two views are equivalent; we will switch between them when appropriate

[Figure: a text divided into Segment 1 to Segment 4, separated by three segment boundaries]
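
The equivalence of the two views can be sketched in a few lines. This is only an illustration: representing a text as a token list and a boundary as a token index are assumptions made for this example, and both function names are invented.

```python
def segments_to_boundaries(segments):
    """Token positions at which a new segment starts (position 0 is omitted)."""
    boundaries, position = [], 0
    for segment in segments[:-1]:
        position += len(segment)
        boundaries.append(position)
    return boundaries


def boundaries_to_segments(tokens, boundaries):
    """Cut a token list at the given boundary positions."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [len(tokens)]
    return [tokens[start:end] for start, end in zip(starts, ends)]


tokens = "a b c d e f g".split()
segments = [["a", "b"], ["c", "d", "e"], ["f"], ["g"]]
boundaries = segments_to_boundaries(segments)           # four segments, three boundaries
round_trip = boundaries_to_segments(tokens, boundaries)  # recovers the segments
```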

SLIDE 8
Entities + Segments = Networks

[Figure: co-occurrence network with the nodes Mary, Peter and Paul]

SLIDE 9
Entities + Segments = Networks: Slightly more abstract description

  • Segmented text with the appearing entities

⟨{A, B}, {A, B, B, B, A}, {A, C}⟩

  • Convert into a (square) adjacency matrix
      • Diagonal is typically uninteresting
      • Matrix is symmetric

          A   B   C
      A       2   1
      B   2
      C   1

  • Create network
      • A node is created for each row (or column)
      • An edge is created for each non-empty cell, weighted according to the cell value

[Figure: the resulting network with the nodes A, B and C]
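
The conversion from segments to a weighted network can be sketched as follows. The counting rule (one count per segment in which both entities occur, however often each is mentioned) is inferred from the example matrix on this slide; the function names are invented for this illustration and do not describe the tool's internals.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(segments):
    """For each unordered entity pair, count the segments in which both occur."""
    counts = Counter()
    for mentions in segments:
        # Repeated mentions within one segment count only once.
        for pair in combinations(sorted(set(mentions)), 2):
            counts[pair] += 1
    return counts


def adjacency_matrix(counts, entities):
    """Square, symmetric matrix; the diagonal is left at 0."""
    index = {entity: i for i, entity in enumerate(entities)}
    matrix = [[0] * len(entities) for _ in entities]
    for (a, b), weight in counts.items():
        matrix[index[a]][index[b]] = weight
        matrix[index[b]][index[a]] = weight
    return matrix


segments = [["A", "B"], ["A", "B", "B", "B", "A"], ["A", "C"]]
counts = cooccurrence_counts(segments)
matrix = adjacency_matrix(counts, ["A", "B", "C"])
# One weighted, undirected edge per entity pair that co-occurs at least once.
edges = [(a, b, weight) for (a, b), weight in sorted(counts.items())]
```

With the slide's example, A and B share two segments and A and C share one, which reproduces the matrix values 2 and 1.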

SLIDE 10
Annotation: Segmentation

  • Theoretically
      • Segments can be annotated just like entity references
          • Both cover sequences of words
      • Appropriate annotation guidelines would define when to annotate segments
  • Practically
      • The segmentation criterion is closely tied to the research question
      • There is no reasonable generic abstraction layer that works for multiple research questions and/or text corpora
      • Single texts contain only a few segments
          • Many more annotated texts would be needed for any kind of automation

SLIDE 11
Segment Annotation Tool

  • Web-based UI
  • Beta software
  • Automatic annotation through rules and tools
      • Entity annotation
          • Stanford Named Entity Recognizer (Finkel et al., 2005)
              • Only proper names, no descriptive noun phrases
          • Rules (regular expressions) to specify the entity references
      • Segment annotation
          • Rules (regular expressions) to specify the segment boundaries
          • Unsupervised segmentation algorithm (TextTiling; Hearst, 1994)
  • Network export → Gephi

SLIDE 12
Network Tool: Gephi

  • Free and open source
  • https://gephi.org
  • Wide range of metrics, filters and layout algorithms
  • Network editing (e.g., merge nodes)
  • Plugins
  • Export to static images

SLIDE 13

Demo

SLIDE 14
Regular Expressions: Useful text processing skills 101

  • A powerful way to describe sets of character sequences
  • Many search tools support REs, and so do virtually all programming languages
  • Looks cryptic, but is quite systematic
  • REs on the slides/handout are enclosed in forward slashes / / for readability
      • The slashes don’t need to be typed in the tool
  • Basics
      • Many regular characters stand for themselves
          • The RE /a/ finds occurrences of the character “a”
      • Sequences of characters stand for sequences of themselves
          • The RE /the/ finds occurrences of the string “the”

SLIDE 15
Regular Expressions: Basics

  • Many regular characters stand for themselves
      • The RE /a/ finds occurrences of the character “a”
  • Sequences of characters stand for sequences of themselves
      • The RE /the/ finds occurrences of the string “the”
  • Meta characters (“quantifiers”) apply to the previous character
      • ?: previous character is optional (0-1 times)
          • /them?/ finds both “the” and “them”
      • +: previous character one or more times
          • /ab+/ finds “ab”, “abb”, “abbb”, …
      • The Kleene star * matches the previous character zero or more times
          • /ab*/ finds “a”, “ab”, “abb”, “abbb”, …
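
The basics above can be tried out with Python's standard `re` module. This is just one possible environment, and the annotation tool itself may behave differently in details. The sample text is invented; as noted on the slides, the enclosing slashes are not part of the pattern.

```python
import re

text = "a ab abb abbb the them"

# A plain character sequence matches itself, also inside longer words ("them").
literal = re.findall(r"the", text)

# ? makes the previous character optional.
optional = re.findall(r"them?", text)

# + requires the previous character at least once.
one_or_more = re.findall(r"ab+", text)

# * (Kleene star) allows the previous character zero or more times.
zero_or_more = re.findall(r"ab*", text)
```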

SLIDE 16
Regular Expressions: Alternations and Character Classes

  • /(re1|re2)/ matches everything that either re1 or re2 matches
      • /(good|better|best)/ finds the positive, comparative and superlative forms of the adjective “good”
      • /great(er|est)?/ finds the positive, comparative and superlative forms of “great”
          • The question mark makes the suffixes optional
  • We can mark alternatives on character level in square brackets: […]
      • /[Tt]he/ finds capitalised and lower-case forms of “the”
  • Square brackets support ranges of characters
      • /[A-Z]/ finds upper-case characters (beware: locale)
      • /[0-9]/ finds digits
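
A small sketch of alternations and character classes, again using Python's `re` module as an assumed environment. The helper `find_all` and the sample sentence are invented for this illustration; the helper returns whole matches, because `re.findall` would return only the bracketed group for patterns such as `great(er|est)?`.

```python
import re


def find_all(pattern, text):
    """Return the full text of every match (not just the bracketed group)."""
    return [match.group(0) for match in re.finditer(pattern, text)]


text = ("The good plan was great, the better one greater, "
        "and the best was the greatest of 2016.")

good_forms = find_all(r"(good|better|best)", text)
great_forms = find_all(r"great(er|est)?", text)
articles = find_all(r"[Tt]he", text)
capitals = find_all(r"[A-Z]", text)
numbers = find_all(r"[0-9]+", text)  # the class combined with the + quantifier
```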

SLIDE 17
Regular Expressions: Special cases and exceptions

  • The dot . matches any single character
      • /a.*b/ finds everything that begins with a and ends with b
  • Escape character: Backslash
      • In order to find a literal dot, we need to prevent its special meaning
      • /.*\.doc/ finds everything that ends in “.doc” (e.g., filenames)
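
The effect of the dot and of escaping it can be seen in a short experiment. Python's `re` module is assumed here, and using `re.fullmatch` to test whole filenames is a choice made for this example; the sample strings are invented.

```python
import re

# .* is greedy: the match runs from the first "a" to the last "b" on the line.
sentence = "xx acb and ab yy"
greedy = re.search(r"a.*b", sentence).group(0)

names = ["notes.doc", "report.pdf", "mydoc"]

# Escaped dot: only names that really end in ".doc".
escaped = [name for name in names if re.fullmatch(r".*\.doc", name)]

# Unescaped dot: any character may precede "doc", so "mydoc" slips through.
unescaped = [name for name in names if re.fullmatch(r".*.doc", name)]
```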

SLIDE 18
Regular Expressions: Real examples

  • Chapter 10.
      • /Chapter [0-9]+\./
  • Chapter V. (Roman numerals)
      • /Chapter [IVXCM]+\./
      • Beware: Possible over-matching
  • Dates: MAY 22., AUGUST 23.
      • /[A-Z]+ [0-9]+\./
      • Beware: Possible over-matching
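
How such patterns turn into segment boundaries can be sketched as follows. The splitting logic is an assumption made for illustration (the tool's own implementation is not shown on the slides), the helper `split_at` and both sample texts are invented, and a single space is assumed between the two parts of each pattern.

```python
import re

CHAPTER = re.compile(r"Chapter [0-9]+\.")
DATE = re.compile(r"[A-Z]+ [0-9]+\.")


def split_at(pattern, text):
    """Split text into segments that each start at a match of the pattern."""
    starts = [match.start() for match in pattern.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any text before the first heading
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


novel = "Chapter 1. It began. Chapter 2. It went on. Chapter 3. It ended."
chapters = split_at(CHAPTER, novel)

# Over-matching: "USD 40." has the same shape as a date heading.
diary = "MAY 22. We left. He paid USD 40. Then AUGUST 23. came."
date_like = DATE.findall(diary)
```

The second example shows why the slide warns about over-matching: the date pattern cannot tell a heading from any other upper-case word followed by a number and a full stop.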

SLIDE 19
TextTiling (Hearst, 1994)

  • Unsupervised segmentation algorithm, developed for expository texts
  • Compares the vocabulary in a window to the left and to the right of a target sentence gap

[Figure: a window of size 2 on each side of the sentence boundary between sentences n and n+1 yields the word vectors v1 and v2 (example counts shown: 2 1 7 3 2 9); the score for this gap is d_n = dist(v1, v2). Window size = 2, step size = 3.]

SLIDE 20
TextTiling (Hearst, 1994)

  • Unsupervised segmentation algorithm, developed for expository texts
  • Compares the vocabulary in a window to the left and to the right of a target sentence gap

[Figure: the same comparison repeated further along the text; with step size 3, the gap scores are d_n, d_(n+3), and so on.]
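
The window comparison can be sketched in a few lines. This is a strongly simplified illustration and not Hearst's full algorithm: whitespace tokenisation, raw word counts, cosine distance and the rule "highest distance wins" are all assumptions made here, and the function names and sample sentences are invented.

```python
import math
from collections import Counter


def cosine_distance(v1, v2):
    """1 - cosine similarity of two word-count vectors (Counters)."""
    dot = sum(v1[word] * v2[word] for word in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return 1.0 - dot / norm if norm else 1.0


def gap_distances(sentences, window_size=2, step_size=1):
    """Map each scored gap (index of the sentence after it) to its distance."""
    distances = {}
    last_gap = len(sentences) - window_size
    for gap in range(window_size, last_gap + 1, step_size):
        left = Counter(word for sentence in sentences[gap - window_size:gap]
                       for word in sentence.lower().split())
        right = Counter(word for sentence in sentences[gap:gap + window_size]
                        for word in sentence.lower().split())
        distances[gap] = cosine_distance(left, right)
    return distances


sentences = [
    "the cat sat", "the cat slept", "the cat purred",
    "stocks fell today", "stocks rose later", "stocks fell again",
]
distances = gap_distances(sentences, window_size=2, step_size=1)
# Treat the gap with the largest distance as the most likely boundary.
best_gap = max(distances, key=distances.get)
```

A larger step size scores fewer gaps, which saves computation on long texts at the cost of coarser boundary positions.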

SLIDE 21
TextTiling (Hearst, 1994)

  • More powerful algorithms are available
      • E.g., topic segmentation
  • Clear adaptation possibilities
      • How to create word vectors?
          • Which words are included (function/content words)?
          • Which value is represented in the vector (frequency, tf*idf, information, …)?
      • How to calculate similarity/distance?
          • Cosine, Manhattan, …
  • But: Evaluation is hard
      • No gold standard available
      • Different expectations
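
Two of the adaptation choices, which words enter the vector and which distance is used, can be compared on a toy example. Everything here is illustrative: the tiny function-word list, the two phrases and the function names are invented, and raw frequencies stand in for any of the weighting schemes listed above.

```python
import math
from collections import Counter

FUNCTION_WORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative stop list


def word_vector(text, drop_function_words=False):
    """Raw word frequencies, optionally without function words."""
    words = text.lower().split()
    if drop_function_words:
        words = [word for word in words if word not in FUNCTION_WORDS]
    return Counter(words)


def manhattan_distance(v1, v2):
    """Sum of absolute count differences over all words in either vector."""
    return sum(abs(v1[word] - v2[word]) for word in v1.keys() | v2.keys())


def cosine_distance(v1, v2):
    """1 - cosine similarity of the two count vectors."""
    dot = sum(v1[word] * v2[word] for word in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return 1.0 - dot / norm if norm else 1.0


left, right = "the king of the castle", "the queen of the castle"

all_words = (word_vector(left), word_vector(right))
content_only = (word_vector(left, True), word_vector(right, True))

cosine_all = cosine_distance(*all_words)
cosine_content = cosine_distance(*content_only)
manhattan_all = manhattan_distance(*all_words)
manhattan_content = manhattan_distance(*content_only)
```

In this example the shared function words make the two phrases look much more similar under cosine distance, while the Manhattan distance is unaffected, because it only sums the count differences.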

SLIDE 22
Hands-On Session 2

  • Go to …
  • Load a text of your liking (it’s better if you are familiar with it)
  • Add entity references by applying the Stanford NER system
      • Briefly check whether the important entities are included (“Passepartout”, for instance, is not)
      • You can add specific names by specifying regular expressions
  • Add reasonable segment annotations
  • Export a GEXF file and load it into Gephi
  • Play with various options and see how the network changes
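
For orientation, a GEXF file is an XML description of nodes and weighted edges. The sketch below writes a minimal file of that kind with Python's standard library, assuming the GEXF 1.2 draft layout. It does not reproduce the tool's own export, and the function name `to_gexf`, the output filename and the edge list are invented for this example.

```python
import xml.etree.ElementTree as ET


def to_gexf(edges):
    """Serialise weighted, undirected edges as a minimal GEXF 1.2 document."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", mode="static", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edge_list = ET.SubElement(graph, "edges")
    # One node per entity that appears in any edge.
    names = sorted({name for a, b, _ in edges for name in (a, b)})
    for name in names:
        ET.SubElement(nodes, "node", id=name, label=name)
    # One edge per entity pair, weighted by its co-occurrence count.
    for i, (a, b, weight) in enumerate(edges):
        ET.SubElement(edge_list, "edge", id=str(i), source=a, target=b,
                      weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


xml_text = to_gexf([("A", "B", 2), ("A", "C", 1)])
# To open the result in Gephi, write it to a file first, for example:
# with open("network.gexf", "w", encoding="utf-8") as handle:
#     handle.write(xml_text)
```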