SLIDE 1

Natural Language Processing: Traditional Processing Pipeline

SCIENCE PASSION TECHNOLOGY

Natural Language Processing: Traditional Processing Pipeline

Roman Kern <rkern@tugraz.at> 2020-03-19

Roman Kern <rkern@tugraz.at>, Institute for Interactive Systems and Data Science 2020-03-19 1

slide-2
SLIDE 2

Natural Language Processing: Traditional Processing Pipeline

Outline

1 Introduction 2 Building Blocks 3 Basic Tasks 4 Document Representation 5 Language Models 6 PoS Tagging 7 Information Extraction 8 Sentiment Detection

SLIDE 3

Introduction

Processing pipeline at a glance

SLIDE 4

Introduction

Motivational Example Given a piece of written text We want to analyse its content e.g., for sentiment detection Often the starting point is not pure text Some sort of file format (PDF, Word Documents, HTML, ...)

SLIDE 5

Introduction

Text Extraction Before the text can be analysed, it needs to be extracted Multiple terms being used Format conversion, document conversion, data normalisation, ... Challenges Various interpretations of a standard, e.g., PDF Keep document structure information, e.g., headings

SLIDE 6

Introduction

PDF Extraction PDF was originally designed to provide device-independent rendering Based on pages, and a stream of operations e.g., goto <position>; put <letter> Problems Fonts only partially embedded (also covered by licences) Scanned-in PDFs (only images) No structure information, e.g., no distinction between headings and text, captions of images, ... Glyphs could be directly drawn, or inline images

SLIDE 7

Introduction

Basic Idea Starting with the raw text

  • 1. Keep the raw text (do not change)
  • 2. Apply a component from a predefined pipeline
  • 3. Store the output of the component as annotation (meta-data)

e.g., (start, stop, annotation-specific data)

  • 4. Goto 2 until end of pipeline reached
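The steps above can be sketched in a few lines of Python; the `Annotation` tuple and the toy `tokenize` component are illustrative names, not a real annotation framework API:

```python
# Sketch of the basic idea: the raw text is never modified; each pipeline
# component only appends (start, stop, ...) annotations (stand-off meta-data).
# The Annotation tuple and the tokenize component are illustrative only.
from typing import NamedTuple

class Annotation(NamedTuple):
    start: int   # offset of the first covered character
    stop: int    # offset one past the last covered character
    kind: str    # e.g. "token", "sentence"
    data: dict   # annotation-specific payload

def tokenize(text, annotations):
    """Toy component: annotate whitespace-separated tokens."""
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        annotations.append(Annotation(start, start + len(word), "token", {}))
        pos = start + len(word)

text = "Berlin is a city."
annotations = []
for component in [tokenize]:      # step 2/4: apply each pipeline component
    component(text, annotations)  # step 1: the raw text stays unchanged

print([text[a.start:a.stop] for a in annotations])
# → ['Berlin', 'is', 'a', 'city.']
```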

SLIDE 8

Introduction

Recall Pipeline Architecture A pipeline is a specialised form of a pipes and filters architecture Typically a central storage for the meta-data is used (option b)

SLIDE 9

Introduction

Text Pipeline Architecture Each component (filter) in the pipeline serves a purpose Waits until all previous components are finished Takes the raw text and the annotations as input Produces new annotations (optionally), but does not change the raw text May use information (annotations) already produced by previous components

→ Implicit dependencies

Optionally, the pipeline has dynamic components (added on the fly)

SLIDE 10

Introduction

Text Pipeline Architecture + Pros Simple processing scheme Easy to adapt − Cons Might not be suited for low-latency applications Error propagation throughout the pipeline

SLIDE 11

Introduction

Typical Text Pipeline From shallow to deep parsing

  • 1. Split sentences
  • 2. Split tokens (words)
  • 3. Apply Part-of-Speech tagging (word groups)
  • 4. Chunking (phrases)
  • 5. Build sentence tree (constituency parsing)
  • 6. Extract grammatical relationship between words (dependency parsing)

Everything up to POS is considered to be shallow parsing; building a sentence tree is considered to be deep parsing.

SLIDE 12

Introduction

Text Pipeline Typical scenario After all pipeline components are finished The annotations are transformed into features ... and often analysed using machine learning techniques e.g., a classification algorithm

SLIDE 13

Introduction

Text Pipeline Architecture An NLP application typically has multiple pre-defined pipelines ... based on input format ... based on language ... based on target

SLIDE 14

Building Blocks

Key Technologies to Build Text Annotator Components

SLIDE 15

Building Blocks

Basics Traditionally, many processing components are based on expert knowledge ... formulated as rules, e.g., a ’.’ character signifies the end of a sentence In many cases these rules will work well ... but not all the time

e.g., consider sentence splitting and the term “I.B.M.”

SLIDE 16

Building Blocks

White and Black Lists White list: collection of tokens (typically words) that fall into a certain category Black list: collection of tokens not belonging to a category Typically combined Apply a white list and then filter out the exceptions listed in the black list

SLIDE 17

Building Blocks

White and Black Lists Examples Gazetteer Traditionally a list of geographical entities Nowadays used for any list of (named) entities Dictionary of words carrying sentiment

SLIDE 18

Building Blocks

White and Black Lists + Pros Easy to create, understand and curate Easy to debug − Cons Limited flexibility May grow excessively (for languages that allow high flexibility)

SLIDE 19

Building Blocks

Regular Expressions Specialised language for pattern matching Developed in the 1950s Popularised by Unix Integral part of many programming languages e.g., Perl Provides a match (i.e., start/stop positions) Sentence: Albert Einstein was a German-born theoretical physicist. RegEx: ([A-Z])\w+ Matches: Albert, Einstein, German
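In Python, the slide's pattern can be run with the standard `re` module; `finditer` yields the start/stop positions of each match:

```python
import re

sentence = "Albert Einstein was a German-born theoretical physicist."
# The slide's pattern: a capital letter followed by word characters.
for m in re.finditer(r"([A-Z])\w+", sentence):
    print(m.group(0), m.start(), m.end())
# → Albert 0 6
#   Einstein 7 15
#   German 22 28
```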

SLIDE 20

Building Blocks

Regular Expressions + Pros Relatively easy to create Well known (many tools) − Cons Only a “hard” match (vs. a probabilistic, soft match) Tendency to grow complex Trade-off between false positives and false negatives (i.e., precision and recall)

SLIDE 21

Building Blocks

Hearst Patterns Proposed by Marti Hearst1 Lexico-syntactic patterns Typically manually created (and curated)

Subsequently, automatic extraction was proposed by many works

... for specific types of information

Features for specific tasks Detect semantic relationships (ontology learning) e.g., hyponyms, synonyms, causal relationships, ...

1Hearst, M. A. (1992, August). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 (pp. 539-545). Association for Computational Linguistics.

SLIDE 22

Building Blocks

Hearst Patterns - Examples

SLIDE 23

Building Blocks

Hearst Patterns + Pros Relatively easy to create and to understand Fast execution (simple matches) − Cons Hard to automatically learn Low coverage (would require many rules to find all instances)

SLIDE 24

Building Blocks

Lexer and Scanner Lexer: lexical transformation of a stream of characters into tokens Also called: scanner, tokenizer Parser: syntactical transformation (a lexer is often part of a parser) Annotate sequences of characters with a pre-defined class, e.g., person name, word, ... Often generated using a dedicated (programming) language e.g., lex, flex, ANTLR, JavaCC

SLIDE 25

Building Blocks

Lexer and Scanner

Figure: Example from the ClassicTokenizerImpl class of Apache Lucene (JFlex)

SLIDE 26

Building Blocks

Edit Distance Task: Measure the difference between tokens Often rules should not match exactly ... but should allow for small changes e.g., in case of spelling errors: beans vs. beens

SLIDE 27

Building Blocks

Edit Distance Most common approach: Levenshtein Distance Measures the number of operations to transform a token into another token Operations: add, remove, replace (optional)
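A minimal dynamic-programming implementation of the Levenshtein distance (the function name is ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Number of add/remove/replace operations to turn a into b."""
    prev = list(range(len(b) + 1))               # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # remove ca
                            curr[j - 1] + 1,             # add cb
                            prev[j - 1] + (ca != cb)))   # replace (or keep)
        prev = curr
    return prev[-1]

print(levenshtein("beans", "beens"))  # → 1 (one replacement: a → e)
```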

SLIDE 28

Building Blocks

Machine Learning Components As an alternative to manually-created rule-based components ... use of statistical models ... use of machine learning models Expert knowledge is then often used ... to create an annotated dataset (supervised scenario) ... typically called ground truth (or gold standard)

SLIDE 29

Building Blocks

Machine Learning Components + Pros Often provide a “probabilistic” output

For example, decision with a confidence value ... or a ranked list of candidates

− Cons Data hungry

Complex machine learning algorithms require large amounts of (clean, unambiguous) data

Typically associated with higher computational complexity

SLIDE 30

Basic Tasks

Often used and basic processing components

SLIDE 31

Basic Tasks

Language Detection Task: Detect the language of the input text Often the first step is to identify the language of a text Typical assumption: the language does not change within the text

SLIDE 32

Basic Tasks

Language Detection Approaches (1/2) Use white lists Have a list of typical words for each language Count the frequency of their occurrence Typically these words will be function words e.g., German: {der, die, das, einer, eine, ...}

SLIDE 33

Basic Tasks

Language Detection Approaches (2/2) Learn the character distribution Often on sub-word level, e.g., character 3-grams Based on corpora of specific languages Compare the language-specific distributions with the text

Further literature: Jauhiainen, T. S., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2019). Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65, 675-782.
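A toy version of the character-distribution approach; the miniature "corpora" are invented for illustration (real profiles are trained on large monolingual text):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Frequency profile of character n-grams (default: 3-grams)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = sum(v * v for v in p.values()) ** 0.5 * sum(v * v for v in q.values()) ** 0.5
    return dot / norm if norm else 0.0

# Tiny illustrative training texts, one per language.
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund"),
}

text = "the dog and the fox"
best = max(profiles, key=lambda lang: similarity(char_ngrams(text), profiles[lang]))
print(best)  # → en
```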

SLIDE 34

Basic Tasks

Sentence Segmentation Task: split a sequence of characters into sentences Also called sentence splitting, sentence detection Typically solved via a white list of sentence boundary characters ... in combination with exception rules

SLIDE 35

Basic Tasks

Word Segmentation Task: split a sequence of characters into words (tokens) Often also called word tokenisation Hard for languages without (white-)space between words Problem: Ambiguous what a token should be e.g., clitic contractions: don’t → <don’t> | <don, t> | <do, n’t> | <do, not> Typically approached using rules (regex)

SLIDE 36

Basic Tasks

Compound-Word Splitting Task: split a compound word into its sub-words e.g., Frühlingserwachen → <Frühling, s2, erwachen> (spring awakening) Approach

  • 1. Split the word into its syllables
  • 2. Check every combination of consecutive syllables against a dictionary

2Called the Fugen-s
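A hedged sketch of the dictionary-based idea: instead of a proper syllable analysis, it greedily splits on dictionary words plus a linking element (such as the Fugen-s). All names and the tiny dictionary are illustrative:

```python
def split_compound(word, dictionary, linkers=("s", "es")):
    """Greedy sketch: split a compound into dictionary words, optionally
    bridged by a linking element such as the German Fugen-s."""
    word = word.lower()
    if word in dictionary:
        return [word]
    for i in range(len(word) - 1, 2, -1):        # try longer heads first
        head, tail = word[:i], word[i:]
        for linker in ("",) + linkers:
            stem = head[:len(head) - len(linker)] if linker else head
            if head.endswith(linker) and stem in dictionary:
                rest = split_compound(tail, dictionary, linkers)
                if rest:
                    return [stem] + ([linker] if linker else []) + rest
    return None  # no split found

dictionary = {"frühling", "erwachen"}
print(split_compound("Frühlingserwachen", dictionary))
# → ['frühling', 's', 'erwachen']
```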

SLIDE 37

Basic Tasks

Subword Splitting Task: split single words (tokens) into smaller parts Motivation: Subwords may still carry some semantics ... deal with out-of-vocabulary words Example, using the Wordpiece3 segmenter <Jim, Hen, ##son, was, a, puppet, ##eer>

3Yonghui Wu, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144

SLIDE 38

Basic Tasks

Word n-Grams Task: capture the sequence information of consecutive words Approach: Combine pairs of adjacent tokens into a single token Example (for word 2-gram) Das ist ein kleiner Test → { Das_ist, ist_ein, ein_kleiner, kleiner_Test } Many variations exist, for example skip-grams

Useful resource: https://books.google.com/ngrams
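The n-gram construction itself is a one-liner over the token sequence (function name is ours):

```python
def word_ngrams(tokens, n=2):
    """Combine n adjacent tokens into a single underscore-joined token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Das ist ein kleiner Test".split()
print(word_ngrams(tokens, 2))
# → ['Das_ist', 'ist_ein', 'ein_kleiner', 'kleiner_Test']
```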

SLIDE 39

Basic Tasks

Character n-Grams Task: capture the sequence information of consecutive characters Approach: Combine adjacent characters into a single token Example (for character 3-gram) Das ist ein kleiner Test → { Das, ist, ein, kle, lei, ein, ine, ner, Tes, est } Often used to find similar (written) words ... to generate candidates for spelling correction

SLIDE 40

Basic Tasks

Text Normalisation Task: Transform the text into a normalised form In search engines the document and the query are normalised the same way Often transform all characters to lower case Also called case folding Unix command line tool

$ tr A-Z a-z

Note: This task is language-dependent

SLIDE 41

Basic Tasks

Text Normalisation Additionally, a wide variety of normalisation strategies For example, removal of diacritics e.g., for German: Größe → groesze Conflate character repetitions (whitespace) Removal of special characters e.g., word_press → wordpress Many domain-dependent approaches (e.g., mathematical formulas)

SLIDE 42

Basic Tasks

Text Normalisation + Pros Reduces the number of unique tokens

Increased performance when using machine learning

Cleaner text − Cons May remove too much information for certain tasks

e.g., sentiment detection: “TELL ME MORE‼1!”

SLIDE 43

Basic Tasks

Stop word list Manually assembled list of non-content words e.g. the, a, with, to, ... Remove words without semantics

SLIDE 44

Basic Tasks

Stemming Task: reduce words (tokens) to their root form (stem) Typically cutting off the suffix, and optionally replacing it e.g., hopping in snowy conditions → hop in snowi condit Often rule-based, for example the Porter Stemmer List of rewrite rules Problems: under- and overgeneralising
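A toy suffix-stripping stemmer in the spirit of the Porter stemmer; the rule list is a tiny illustrative subset, not the real algorithm. It also shows the overgeneralising problem: without Porter's extra consonant-doubling rule, hopping becomes hopp instead of hop:

```python
# Ordered rewrite rules: (suffix, replacement). Illustrative subset only,
# NOT the real Porter rule set.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        # only strip if a minimal stem of 3 characters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)] + replacement
    return word

print([stem(w) for w in ["hopping", "conditions", "ponies"]])
# → ['hopp', 'condition', 'poni']
```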

SLIDE 45

Basic Tasks

Lemmatisation Task: reduce words (tokens) to their root form (lemma) e.g., going, went, gone → go, go, go Often based on dictionaries Based on corpora4

4For example: https://github.com/WZBSocialScienceCenter/germalemma

SLIDE 46

Basic Tasks

Part-of-Speech (PoS) Tagger Task: add the word group to each token Nouns, verbs, adverbs, determiners, ... Often based on machine learning

SLIDE 47

Basic Tasks

Chunker Task: combine multiple, consecutive words to form phrases Noun phrases, verb phrases, prepositional phrases, ... Typically based on the output of the PoS tagger e.g., <adjective> <noun> → NP

SLIDE 48

Basic Tasks

Syntactic Parsing Transform a sentence into a tree representation ... which reflects the grammatical structure of the sentence Example sentence

The cop saw the man with the binoculars.

Taken from: Bergmann, A., Hall, K. C., & Ross, S. M. (2007). Language files: Materials for an introduction to language and linguistics. Ohio State University Press.

SLIDE 49

Basic Tasks

Syntactic Parsing - Example

SLIDE 50

Basic Tasks

Syntactic Parsing - Example

SLIDE 51

Basic Tasks

Dependency Parsing Transform a sentence into a graph representation ... where each vertex is a word ... and each edge represents a grammatical relationship Example sentence

Afterward , I watched as a butt-ton of good , but misguided people filed out of the theater , and immediately lit up a smoke .

SLIDE 52

Basic Tasks

Dependency Parsing - Example Sentence Tree

SLIDE 53

Basic Tasks

Dependency Parsing - Example Dependency Output

SLIDE 54

Document Representation

Feature extraction and engineering for text

SLIDE 55

Document Representation

Bag of Words (BoW) After the processing pipeline has been executed (on a document/sentence) ... the output is collected and put into a bag Thus, each word is treated independently The sequence information is lost Semantic similarity (between two documents/bags) Compare the overlap between the two bags

Assumption: Many tokens in common equates to similar content For example, the Jaccard distance, Dice coefficient, ...

SLIDE 56

Document Representation

Bag of Words (BoW) Example “The green house is next to the blue building” →

{blue, building, green, house, is, next, the, to}
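Bags and their overlap are directly expressible with Python sets; the sketch below computes the Jaccard index (the Jaccard distance is 1 minus this value), with the second sentence invented for illustration:

```python
def bag_of_words(text):
    """Lower-cased set of whitespace tokens: order and counts are discarded."""
    return set(text.lower().split())

def jaccard(a, b):
    """Overlap of two bags: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

d1 = bag_of_words("The green house is next to the blue building")
d2 = bag_of_words("The blue house is near the green building")
print(sorted(d1))
# → ['blue', 'building', 'green', 'house', 'is', 'next', 'the', 'to']
print(round(jaccard(d1, d2), 2))  # → 0.67 (6 shared words out of 9)
```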

SLIDE 57

Document Representation

Vector Space Model Each unique word is assigned its own dimension Each document is then represented by a single vector Assumption: The vector represents the semantics of the document In a simple setting If the word is contained in the document The value for the dimension in the vector ... will be set to a non-zero value The process of assigning a dimension to each word is also called vectorisation

SLIDE 58

Document Representation

Vector Space Model Document-Term Matrix Documents are rows, and terms are columns The resulting matrix is very sparse Typically approx. 2% non-zero entries

SLIDE 59

Document Representation

Vector Space Model One-Hot Encoding Representation of a set of words (document, sentence, sequence, single word, ...) Words contained in the set are represented by a 1 Can be seen as a single row of the document-term matrix

Note: Often also used to encode nominal features as multiple binary features

SLIDE 60

Document Representation

Vector Space Model Weighting strategies (term weighting) Simple case: use 1 as “non-zero value” More sophisticated strategies Count how often a token occurs → term frequency Down-weight common tokens → inverse document frequency Take into consideration the length of a document → length normalisation
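The three ingredients combine into the classic tf-idf score. A minimal sketch over a toy corpus; this uses one common idf variant, and real libraries differ in smoothing details:

```python
import math
from collections import Counter

docs = [
    "the green house".split(),
    "the blue house".split(),
    "blue skies".split(),
]
vocab = sorted({w for d in docs for w in d})   # one dimension per unique word

def tfidf_vector(doc):
    tf = Counter(doc)
    vec = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)    # document frequency
        idf = math.log(len(docs) / df)         # down-weight common tokens
        vec.append(tf[w] / len(doc) * idf)     # length-normalised term frequency
    return vec

print(vocab)  # → ['blue', 'green', 'house', 'skies', 'the']
print([round(x, 2) for x in tfidf_vector(docs[0])])
# → [0.0, 0.37, 0.14, 0.0, 0.14]
```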

SLIDE 61

Document Representation

Latent Semantic Analysis (LSI/LSA) [1] Idea: Apply thin SVD on the document-term matrix Where the SVD is limited to the k most important singular values Requires as input: Document/term matrix Fixed number of topics Provides: Mapping of document to a (dense) lower-dimensional representation Probabilistic version: pLSA [2]

[1] Landauer, et al. (1997). A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. [2] Hofmann, T. (1999). Probabilistic latent semantic indexing.

SLIDE 62

Document Representation

Latent Dirichlet Allocation (LDA) Requires as input: Document/term matrix Fixed number of topics Provides: Mapping of document to topics (as vector of probabilities) Mapping of terms to topics (as vector of probabilities) Can be seen as fuzzy co-clustering

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation.

SLIDE 63

Document Representation

Latent Dirichlet Allocation - Example

Figure: Example of LDA built using the TASA corpus

Steyvers, M., & Griffiths, T. (2007). Probabilistic Topic Models. Handbook of Latent Semantic Analysis.

SLIDE 64

Language Models

Probabilities of Words

SLIDE 65

Language Models

Introduction Estimate the probabilities of words e.g., occurring in a document Estimate the probability of a span of text e.g., sequence of words

SLIDE 66

Language Models

Unigram Language Model Estimate the probability of a single word P(wi) Can be estimated from a corpus:

P(wi) ≈ count(wi) / Σ_{wj ∈ W} count(wj)
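The maximum-likelihood estimate in code, over a toy corpus invented for illustration:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())   # Σ count(wj) over all tokens

def p_unigram(word):
    """P(w) ≈ count(w) / total number of tokens."""
    return counts[word] / total

print(p_unigram("the"))  # → 0.333… (3 of 9 tokens)
```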

SLIDE 67

Language Models

n-Gram Language Model Estimate the probability of a sequence of words P(w1, . . . , wm) Can be used to predict the next (unseen) word

P(w1, . . . , wm) = Π_{i=1..m} P(wi | w1, . . . , wi−1) ≈ Π_{i=1..m} P(wi | wi−(n−1), . . . , wi−1)

Estimated via a corpus: P(wi | wi−(n−1), . . . , wi−1) = count(wi−(n−1), . . . , wi−1, wi) / count(wi−(n−1), . . . , wi−1)
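For n = 2 this becomes the bigram estimate; a sketch with sentence-boundary markers, on a toy corpus invented for illustration:

```python
from collections import Counter

corpus = "<s> the cat sat </s> <s> the cat ran </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)

def p_bigram(word, prev):
    """P(w_i | w_(i-1)) = count(w_(i-1), w_i) / count(w_(i-1))."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("cat", "the"))  # → 1.0 ("the" is always followed by "cat")
print(p_bigram("sat", "cat"))  # → 0.5
```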

SLIDE 68

PoS Tagging

Assigning word groups to individual words

SLIDE 69

PoS Tagging

What is PoS tagging? Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus [Jurafsky & Martin] Input: a string of words and a specified tagset Output: a single best match for each word

Figure: Assigning words to tags out of a tagset [Jon Atle Gulla]

SLIDE 70

PoS Tagging

POS Examples

Book that flight. → VB DT NN
Does that flight serve dinner? → VBZ DT NN VB NN

This task is not trivial For example: “book” is ambiguous (noun or verb) Challenge for POS tagging: resolve these ambiguities!

SLIDE 71

PoS Tagging

Tagset The tagset is the vocabulary of possible POS tags. Choosing a tagset Striking a balance between Expressiveness (number of different word classes) “Classifiability” (ability to automatically classify words into the classes)

SLIDE 72

PoS Tagging

Examples for existing tagsets Brown corpus, 87-tag tagset (1979) Penn Treebank, 45-tag tagset, selected from Brown tagset (1993) C5, 61-tag tagset C7, 146-tag tagset STTS, German tagset (1995/1999)

http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html

Today, universal tagsets are common for coarse-grained classification.

SLIDE 73

PoS Tagging

Penn Treebank Over 4.5 million words Presumed to be the first large syntactically annotated corpus Annotated with POS information And with skeletal syntactic structure Two-stage tagging process:

  • 1. Assigning POS tags automatically (stochastic approach, 3-5% error)
  • 2. Correcting tags by human annotators

SLIDE 74

PoS Tagging

How hard is the tagging problem?

Figure: The number of word classes in the Brown corpus by degree of ambiguity

SLIDE 75

PoS Tagging

Some approaches for POS tagging Rule based ENGTWOL tagger Transformation based Brill tagger Stochastic (machine learning) HMM tagger

SLIDE 76

PoS Tagging

Rule based POS tagging A two-stage process

  • 1. Assign a list of potential parts-of-speech to each word, e.g. BRIDGE → V, N
  • 2. Using rules, eliminate parts-of-speech tags from that list until a single tag remains

ENGTWOL uses about 1,100 rules to rule out incorrect parts-of-speech

SLIDE 77

PoS Tagging

Rule based POS tagging Input

SLIDE 78

PoS Tagging

Rule based POS tagging Rules

SLIDE 79

PoS Tagging

Pros and Cons of Rule-Based Systems + Interpretable model + Make use of expert/domain knowledge − A lot of work to create rules − Number of rules may explode (and contradict each other) ○ Good starting point

SLIDE 80

PoS Tagging

Transformation based POS tagging Brill Tagger, a combination of a rule-based tagger with supervised learning [Brill 1995] Rules Initially assign each word a tag (without taking the context into account)

Known words → assign the most frequent tag Unknown words → e.g. noun (guesser rules)

Apply rules iteratively (taking the surrounding context into account → context rules)

e.g. If Trigger, then change the tag from X to Y; If Trigger, then change the tag to Y

Typically 50 guessing rules and 300 context rules

Rules have been induced from tagged corpora by means of Transformation-Based Learning (TBL)

http://www.ling.gu.se/~/lager/mogul/brill-tagger/index.html

SLIDE 81

PoS Tagging

Transformation-Based Learning

  • 1. Generate all rules that correct at least one error
  • 2. For each rule:
  • a. Apply it to a copy of the most recent state of the training set
  • b. Score the result using the objective function (e.g. number of wrong tags)
  • 3. Select the rule with the best score
  • 4. Update the training set by applying the selected rule
  • 5. Stop if the score is smaller than some pre-set threshold T; otherwise repeat from step 1
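One round of this loop can be sketched as follows; the tag data, the rule format ("if the previous tag is T, change X to Y"), and all names are invented for illustration:

```python
words = ["the", "can", "can", "rust"]
gold  = ["DT",  "MD",  "VB",  "VB"]   # reference tags (the ground truth)
tags  = ["DT",  "MD",  "MD",  "VB"]   # initial guess: most frequent tag per word

def apply_rule(tagging, rule):
    """Rule = (previous tag, old tag, new tag), applied simultaneously."""
    prev_tag, old, new = rule
    return [new if t == old and i > 0 and tagging[i - 1] == prev_tag else t
            for i, t in enumerate(tagging)]

def errors(tagging):
    return sum(t != g for t, g in zip(tagging, gold))

# 1. generate candidate rules that correct at least one error
#    (only positions with a left neighbour are considered in this sketch)
candidates = {(tags[i - 1], tags[i], gold[i])
              for i in range(1, len(tags)) if tags[i] != gold[i]}
# 2. score each rule on a copy of the current state; 3. keep the best one
best = min(candidates, key=lambda r: errors(apply_rule(tags, r)))
# 4. update the training set by applying the selected rule
tags = apply_rule(tags, best)
print(best, tags)
# → ('MD', 'MD', 'VB') ['DT', 'MD', 'VB', 'VB']
```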

SLIDE 82

PoS Tagging

Pros and Cons of Hybrid System + Interpretable model + Make use of expert/domain knowledge + Less work to create rules − Additional work to annotate dataset − Risk of many rules ○ Works in special cases

SLIDE 83

PoS Tagging

Stochastic part-of-speech tagging Based on the probability of a certain tag given a certain context Requires a training corpus No probabilities available for words not in the training corpus → Smoothing Simple method: Choose the most frequent tag in the training text for each word Result: 90% accuracy Baseline method Many non-trivial methods, e.g. Hidden Markov Models (HMM)

SLIDE 84

PoS Tagging

Generative Stochastic Part-of-Speech Tagging Intuition: Pick the most likely tag for each word Choose the best tag sequence for an entire sentence, seeking to maximize the formula P(word|tag) × P(tag|previous n tags) Let T = t1, ..., tn be a sequence of tags Let W = w1, ..., wn be a sequence of words Find the PoS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W

SLIDE 85

PoS Tagging

Markov models & Markov chains Markov chains can be seen as a weighted finite-state machines They have the following Markov properties, where Xi is a state in the Markov chain, and s is a value that the state takes: Limited horizon: P(Xt+1 = s|X1, ..., Xt) = P(Xt+1 = s|Xt) (first order Markov models) ... the value at state t + 1 just depends on the previous state Time invariant: P(Xt+1 = s|Xt) is always the same, regardless of t ... there are no side effects

SLIDE 86

PoS Tagging

Hidden Markov Models

Now we are given a sequence of words (the observations) and want to find the PoS tags
Each state in the Markov model is a PoS tag (hidden state), but we don't know the correct state sequence
The underlying sequence of hidden events (= the PoS tags) can be seen as generating the sequence of words
... thus, we have a Hidden Markov Model

SLIDE 87

PoS Tagging

Hidden Markov Model

Needs three matrices as input: A (transition, PoS → PoS), B (emission, PoS → Word), π (initial probabilities, PoS)

SLIDE 88

PoS Tagging

Three Fundamental Problems

  • 1. Probability estimation: How do we efficiently compute P(O|θ), the probability of an observation sequence O given a model θ = (A, B, π)? (A ... transition matrix, B ... emission matrix, π ... initial probability vector)
  • 2. Best path estimation: How do we choose the best sequence of hidden states X, given our observation O and the model θ, i.e., how do we maximise P(X|O, θ)?
  • 3. Parameter estimation: From a space of models, how do we find the best parameters (A, B, π) to explain the observation, i.e., how do we (re)estimate θ in order to maximise P(O|θ)?

SLIDE 89

PoS Tagging

Three Fundamental Problems - Algorithmic Approaches

  • 1. Probability estimation: dynamic programming (summing forward probabilities)
  • 2. Best path estimation: Viterbi algorithm
  • 3. Parameter estimation: Baum-Welch algorithm (Forward-Backward algorithm)

SLIDE 90

PoS Tagging

Simplifying the Probabilities

argmax_{t1,n} P(t1,n | w1,n) = argmax_{t1,n} P(w1,n | t1,n) P(t1,n) → refers to the whole sentence

... estimating probabilities for an entire sentence is a bad idea
Markov models have the property of limited horizon: a state refers only back to the previous n (typically 1) steps - the model has no memory beyond that

SLIDE 91

PoS Tagging

Simplifying the probabilities

Independence assumption: words/tags are independent of each other

For a bi-gram model:
P(t1,n) ≈ P(tn|tn−1) P(tn−1|tn−2) ... P(t2|t1) = ∏_{i=1..n} P(ti|ti−1)

A word's identity only depends on its tag:
P(w1,n|t1,n) ≈ ∏_{i=1..n} P(wi|ti)

The final equation is:
t̂1,n = argmax_{t1,n} ∏_{i=1..n} P(wi|ti) P(ti|ti−1)

SLIDE 92

PoS Tagging

Probability Estimation for Tagging

How do we get such probabilities?
→ With supervised tagging we can simply use Maximum Likelihood Estimation (MLE) and use counts (C) from a reference corpus:

P(ti | ti−1) ≈ C(ti−1, ti) / C(ti−1)
P(wi | ti) ≈ C(wi, ti) / C(ti)

Given these probabilities we can finally assign a probability to a sequence of states (tags)
To find the best sequence (of tags) we can apply the Viterbi algorithm
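MLE counting and Viterbi decoding can be combined into a minimal sketch (the toy training corpus, the artificial `<s>` start tag, and the tag names are illustrative; no smoothing is applied, so unseen words get probability 0):

```python
from collections import Counter

def mle_estimates(tagged_sents):
    """MLE from counts: P(ti|ti-1) = C(ti-1,ti)/C(ti-1),
    P(wi|ti) = C(wi,ti)/C(ti); '<s>' marks the sentence start."""
    trans, emit, tag_c = Counter(), Counter(), Counter()
    for sent in tagged_sents:
        prev = "<s>"
        tag_c["<s>"] += 1
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(word, tag)] += 1
            tag_c[tag] += 1
            prev = tag

    def P_t(t, prev):
        return trans[(prev, t)] / tag_c[prev]

    def P_w(w, t):
        return emit[(w, t)] / tag_c[t]

    return P_t, P_w, {t for t in tag_c if t != "<s>"}

def viterbi(words, P_t, P_w, tags):
    """Best tag sequence maximising the product of P(wi|ti) P(ti|ti-1)."""
    # delta[t] = probability of the best path ending in tag t so far.
    delta = {t: P_t(t, "<s>") * P_w(words[0], t) for t in tags}
    backpointers = []
    for w in words[1:]:
        new, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: delta[p] * P_t(t, p))
            new[t] = delta[best] * P_t(t, best) * P_w(w, t)
            ptr[t] = best
        delta = new
        backpointers.append(ptr)
    # Trace back from the most probable final tag.
    t = max(tags, key=lambda t: delta[t])
    path = [t]
    for ptr in reversed(backpointers):
        t = ptr[t]
        path.append(t)
    return path[::-1]

train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("bark", "NN")]]
P_t, P_w, tags = mle_estimates(train)
print(viterbi(["the", "dog", "barks"], P_t, P_w, tags))  # ['DT', 'NN', 'VBZ']
```

The dynamic program keeps only the best path into each tag at each position, so the search is linear in sentence length rather than exponential.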

SLIDE 93

PoS Tagging

Probability Estimation - Justification

Given an observation, estimate the underlying probability
e.g., recall the PMF of the binomial distribution: p(k) = (n choose k) (1 − p)^(n−k) p^k
We want to estimate the best p:
argmax_p P(observed data) = argmax_p (n choose k) (1 − p)^(n−k) p^k
→ take the derivative to find the maximum: 0 = ∂/∂p [ (n choose k) (1 − p)^(n−k) p^k ]
For large np one can approximate p by k/n (with standard deviation sqrt(k(n−k)/n³) for an independent and unbiased estimate)
Note: There are alternative versions on how to estimate the probabilities
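The claim that the likelihood is maximised at p = k/n can be checked numerically over a grid (n = 10, k = 3 are arbitrary illustrative values):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial PMF: (n choose k) (1 - p)^(n-k) p^k."""
    return comb(n, k) * (1 - p) ** (n - k) * p ** k

# Likelihood of k=3 successes in n=10 trials, maximised over a grid of p:
n, k = 10, 3
best_p = max((i / 1000 for i in range(1, 1000)),
             key=lambda p: binom_pmf(k, n, p))
print(best_p)  # 0.3, i.e. k/n
```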

SLIDE 94

PoS Tagging

Pros and Cons of Stochastic Systems

+ Less involvement of expert/domain knowledge (annotate dataset)
+ Finds an optimal working point
− High complexity
− Black box, cannot easily be interpreted
− Requires machine learning expert (e.g., check for preconditions)
○ Best results (if preconditions met)

SLIDE 95

PoS Tagging

POS tagging - Stochastic part-of-speech tagging

Works for cases where there is evidence in the corpus
But what to do if there are rare events, which just did not make it into the corpus?
Simple non-solution: always assume their probability to be 0
Alternative solution: smoothing

SLIDE 96

PoS Tagging

POS tagging - Stochastic part-of-speech tagging

Will the sun rise tomorrow? Laplace's Rule of Succession
We start with the assumption that rise/non-rise are equally probable
On day n + 1, we've observed that the sun has risen s times before:

p_Lap(S_{n+1} = 1 | S_1 + ... + S_n = s) = (s + 1) / (n + 2)

What is the probability on day 0, 1, ...?
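The rule of succession is a one-liner, which answers the question directly:

```python
def p_laplace(s, n):
    """Laplace's rule of succession: P(rise on day n+1 | s rises in n days)."""
    return (s + 1) / (n + 2)

# Day 0 (nothing observed): 1/2; after one observed sunrise: 2/3;
# after 100 sunrises in 100 days: 101/102.
print(p_laplace(0, 0), p_laplace(1, 1), p_laplace(100, 100))
```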

SLIDE 97

PoS Tagging

POS tagging - Stochastic part-of-speech tagging

Laplace Smoothing
Simply add one:

C(ti−1, ti) / C(ti−1) ⇒ (C(ti−1, ti) + 1) / (C(ti−1) + V(ti−1, t))

... where V(ti−1, t) = | {ti | C(ti−1, ti) > 0} | (vocabulary size)
Can be further generalised by introducing a smoothing parameter λ:

(C(ti−1, ti) + λ) / (C(ti−1) + λ V(ti−1, t))

Note: Also called Lidstone smoothing, additive smoothing
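The generalised formula translates directly into code (the function name and the example counts are illustrative):

```python
def lidstone(count_bigram, count_prev, vocab_size, lam=1.0):
    """Additive (Lidstone) smoothing of P(ti | ti-1); lam=1 gives Laplace."""
    return (count_bigram + lam) / (count_prev + lam * vocab_size)

# An unseen bigram no longer gets probability 0:
print(lidstone(0, 10, 5))           # 1/15
print(lidstone(0, 10, 5, lam=0.1))  # smaller lambda moves less mass to unseen events
```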

SLIDE 98

PoS Tagging

POS tagging - Stochastic part-of-speech tagging

Estimate the smoothing parameter in

(C(ti−1, ti) + λ) / (C(ti−1) + λ V(ti−1, t))

... typically λ is set between 0 and 1
How to choose the correct λ? Separate a small part of the training set (held-out data) ... the development set
Apply the maximum likelihood estimate on the held-out data

SLIDE 99

PoS Tagging

State-of-the-Art in POS Tagging Selected Approaches

System name              Short description            All tokens   Unknown words
TnT                      Hidden Markov model          96.46%       85.86%
MElt                     MEMM                         96.96%       91.29%
GENiA Tagger             Maximum entropy              97.05%       Not available
Averaged Perceptron      Averaged perceptron          97.11%       Not available
Maxent easiest-first     Maximum entropy              97.15%       Not available
SVMTool                  SVM-based                    97.16%       89.01%
LAPOS                    Perceptron-based             97.22%       Not available
Morče/COMPOST            Averaged perceptron          97.23%       Not available
Stanford Tagger 2.0      Maximum entropy              97.32%       90.79%
LTAG-spinal              Bidirectional perceptron     97.33%       Not available
SCCN                     Condensed nearest neighbor   97.50%       Not available

Taken from: http://aclweb.org/aclwiki/index.php?title=POS_Tagging_%28State_of_the_art%29

SLIDE 100

Information Extraction

Extract semantic content from text

SLIDE 101

Information Extraction

Information Extraction (IE)

Goal: Transform unstructured text into a higher degree of structure
Might be a goal in itself
For example, to find all mentions of a person name
Or a preprocessing step
For example, person names are good indicators for the content of a document

SLIDE 102

Information Extraction

Named Entity Recognition (NER)

Task: identify named entities in the text
Nouns, verbs, adverbs, determiner, ...
Example sentence: <person>Albert Einstein</person> wurde in <location>Ulm</location> geboren.
(English: Albert Einstein was born in Ulm.)
Encoded: Albert/B-Person Einstein/I-Person wurde/O in/O Ulm/B-Location geboren/O.
Example of BIO encoding (beginning, inside, other)
Common alternative: BILOU (beginning, inside, last, other, unit)
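The BIO scheme can be sketched as a small encoder; the `bio_encode` helper and the `(start, end, type)` span format are illustrative, not from a standard library:

```python
def bio_encode(tokens, entities):
    """Encode entity spans (start, end, type), end exclusive, as B-/I-/O tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens inside the entity
    return list(zip(tokens, tags))

tokens = ["Albert", "Einstein", "wurde", "in", "Ulm", "geboren"]
print(bio_encode(tokens, [(0, 2, "Person"), (4, 5, "Location")]))
# [('Albert', 'B-Person'), ('Einstein', 'I-Person'), ('wurde', 'O'),
#  ('in', 'O'), ('Ulm', 'B-Location'), ('geboren', 'O')]
```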

Sang, E. F. T. K., & De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CONLL ’03 Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 4, 142–147.

SLIDE 103

Information Extraction

Wikification

The list of extracted terms is expanded to all Wikipedia articles
i.e., each Wikipedia article is treated as an entity
Also called entity linking

Figure: Screenshot of the TAGME system: https://tagme.d4science.org/tagme/

Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C. (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), 95, 1–8.

SLIDE 104

Information Extraction

Approaches

Gazetteer list
Combination of white lists and black lists
Machine learning
Typically supervised learning
Often sequence classification
e.g., (Hidden) Markov Models, Conditional Random Fields

SLIDE 105

Sentiment Detection

Example application for the NLP pipeline

SLIDE 106

Sentiment Detection

Introduction

Goal: Identify the overall attitude of a given piece of text
Positive/negative emotion
Alternatively also called: opinion mining, opinion extraction
Related tasks like subjectivity/objectivity extraction

Généreux, Michel, and Roger Evans. "Distinguishing Affective States in Weblog Posts." AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. 2006.

SLIDE 107

Sentiment Detection

Example Sentiment Detection

Based on a survey paper [1]
Opinion: a conclusion open to dispute (because different experts have different opinions)
View: subjective opinion
Belief: deliberate acceptance and intellectual assent
Sentiment: opinion representing one's feelings

[1] Kharde, V., & Sonawane, P. (2016). Sentiment analysis of twitter data: a survey of techniques. arXiv preprint arXiv:1601.06971.

SLIDE 108

Sentiment Detection

Example Sentiment Pipeline

SLIDE 109

Sentiment Detection

Processing Pipeline

Stages: Language Detection → Pre-Processing → NLP Pipeline → Feature Extraction

Pre-Processing
Remove non-English tweets
Remove all URLs, hash tags (e.g., #topic), targets (@username)
Replace all the emoticons with their sentiment
Remove all punctuations, symbols, numbers
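The pre-processing steps can be sketched with regular expressions (the two-entry emoticon lookup and the example tweet are illustrative):

```python
import re

EMOTICONS = {":)": "positive", ":(": "negative"}  # toy emoticon-to-sentiment lookup

def preprocess(tweet):
    """Pre-processing steps: emoticons -> sentiment, then strip URLs,
    hash tags, @targets, punctuation, symbols, and numbers."""
    for emo, sentiment in EMOTICONS.items():
        tweet = tweet.replace(emo, " " + sentiment + " ")
    tweet = re.sub(r"https?://\S+", " ", tweet)   # remove URLs
    tweet = re.sub(r"#\w+", " ", tweet)           # remove hash tags
    tweet = re.sub(r"@\w+", " ", tweet)           # remove targets
    tweet = re.sub(r"[^A-Za-z\s]", " ", tweet)    # punctuation, symbols, numbers
    return " ".join(tweet.split()).lower()

print(preprocess("Loving this! :) http://t.co/x #nlp @user 2020"))
# 'loving this positive'
```

Note the ordering: emoticons must be mapped to sentiment words before the punctuation pass, which would otherwise destroy them.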

SLIDE 110

Sentiment Detection

NLP Pipeline

Sentence splitting (optional)
Tokenisation
Normalisation
Stop-word removal
Spelling correction (e.g., character repetitions)
Expand acronyms
PoS tagging (note: specific to Tweets)
Constituency parsing

SLIDE 111

Sentiment Detection

Feature Extraction

Words and their frequencies
Parts-of-speech tags
Opinion words and phrases, e.g., from a lexicon5
Position of terms
Negation
Syntax, e.g., collocations

5e.g., https://github.com/aesuli/sentiwordnet

SLIDE 112

Sentiment Detection

Classification Approaches

Lexicon-based (starting with known opinion words)
  Dictionary-based
    Start with known seed words, plus a strategy to expand them
  Corpus-based (often domain-specific)
    Apply statistical methods to grow the dictionary
Machine learning
  Supervised
    Starting with a manually labelled dataset
  Unsupervised (typically clustering)

SLIDE 113

Sentiment Detection

Results

Group             Method           Data Set                     Acc.     Author
Machine learning  SVM              Movie reviews                86.40%   Pang, Lee [23]
Machine learning  CoTraining SVM   Twitter                      82.52%   Liu [14]
Machine learning  Deep learning    Stanford Sentiment Treebank  80.70%   Richard [18]
Lexicon-based     Corpus           Product reviews              74.00%   Turney
Lexicon-based     Dictionary      Amazon's Mechanical Turk     —        Taboada [20]

References in: Kharde, V., & Sonawane, P. (2016). Sentiment analysis of twitter data: a survey of techniques. arXiv preprint arXiv:1601.06971.

SLIDE 114

Thank You!

Next: Evaluation & Hypothesis-Testing
