Text Processing
CS440
11/5/19

Text processing

NLP tasks typically require multiple steps of text processing:

  • Segmenting/tokenizing words in text
  • Normalizing word forms
  • Segmenting sentences

Tokenization

  • Tokenization: segmenting text into words, using punctuation as separate tokens
  • Example: "The San Francisco-based restaurant," they said, "doesn’t charge".

" The | San | Francisco-based | restaurant | , | " | they | said | , | " | does | n’t | charge | "

  • Seems like an easy problem. But:

– Commas typically appear at word boundaries, except inside numbers like 1,000,000
– Apostrophes must be distinguished from quotation marks and handled differently in cases like the book's cover or they're
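The comma and apostrophe cases above can be sketched as an ordered-alternation regex tokenizer. This is a toy illustration, not a production tokenizer: the pattern, its rules, and the function name are our own, and it assumes straight ASCII apostrophes.

```python
import re

# Toy tokenizer sketch: alternatives are tried in order, so more specific
# rules (comma-grouped numbers, the "n't" clitic) win over generic words.
TOKEN_RE = re.compile(r"""
    \d+(?:,\d{3})+        # numbers with comma separators, e.g. 1,000,000
  | \w+(?:-\w+)*(?='n't)  # (placeholder -- see next two branches)
  | \w+(?:-\w+)*(?=n't)   # host word before the clitic n't ("does" in "doesn't")
  | n't                   # the clitic itself, kept as one token
  | \w+(?:-\w+)*          # words, keeping internal hyphens (Francisco-based)
  | [^\w\s]               # any single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

# tokenize('"The San Francisco-based restaurant," they said, "doesn\'t charge"')
# -> ['"', 'The', 'San', 'Francisco-based', 'restaurant', ',', '"',
#     'they', 'said', ',', '"', 'does', "n't", 'charge', '"']
```

Ordering the alternatives is the whole trick: putting the number rule first keeps `1,000,000` whole, and the lookahead splits `doesn't` into `does` + `n't` as in the example above.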

Issues in tokenization

  • Finland’s capital → Finland? Finland’s?
  • what’re, I’m, isn’t → what are, I am, is not
  • Hewlett-Packard → Hewlett Packard?
  • state-of-the-art → state of the art?
  • lowercase → lower-case? lowercase? lower case?
  • San Francisco → one token or two?
  • m.p.h., PhD. → ??


Tokenization: language issues

  • French
– L'ensemble → one token or two? L? L’? Le?
  • German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter: ‘life insurance company employee’
– German information retrieval needs a compound splitter

Normalization

  • Need to “normalize” terms: bring them into a common form

– Example: we want to match US, U.S.A. and USA

  • Case folding: convert everything to lowercase

– But: need to distinguish between US and us
– Exceptions: upper case in mid-sentence

  • e.g., General Motors
  • Fed vs. fed
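A minimal sketch of the case-folding heuristic just described, assuming already-tokenized input. The function name and the keep-mid-sentence rule are our simplification of the slide's point, not a standard implementation.

```python
# Case-folding sketch: lowercase the sentence-initial token (its capital is
# uninformative), but keep mid-sentence uppercase such as "US", "Fed", or
# "General Motors" unchanged, since there the case is likely meaningful.
# Known limitation: a sentence that genuinely starts with "US" is folded too.
def case_fold(tokens):
    folded = []
    for i, tok in enumerate(tokens):
        if i == 0 and tok[:1].isupper():
            folded.append(tok.lower())   # sentence-initial capital
        else:
            folded.append(tok)           # preserve US vs. us, Fed vs. fed
    return folded
```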

Morphology

  • Morphemes:

– The small meaningful units that make up words
– Stems: the core meaning-bearing units
– Affixes: bits and pieces that adhere to stems

  • Often with grammatical functions

– Example: the word cats is composed of the stem "cat" and the affix "s"

Stemming

  • Reduce terms to their stems
  • Stemming is crude chopping of affixes

– language dependent
– e.g., automate(s), automatic, automation all reduced to automat
– Example: "for example compressed and compression are both accepted as equivalent to compress" stems to "for exampl compress and compress ar both accept as equival to compress"


Porter’s algorithm

  • The most common English stemmer

Step 1a
    sses → ss    caresses  → caress
    ies  → i     ponies    → poni
    ss   → ss    caress    → caress
    s    → ø     cats      → cat

Step 1b
    (*v*)ing → ø    walking   → walk
                    sing      → sing
    (*v*)ed  → ø    plastered → plaster

Step 2 (for long stems)
    ational → ate    relational → relate
    izer    → ize    digitizer  → digitize
    ator    → ate    operator   → operate
    …

Step 3 (for longer stems)
    al   → ø    revival    → reviv
    able → ø    adjustable → adjust
    ate  → ø    activate   → activ
    …
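Step 1a alone can be sketched directly from the rules above. This is a toy fragment, not the full Porter algorithm: the real algorithm adds stem-measure conditions and many more steps, and rule order matters (longest suffix is tried first).

```python
# Porter Step 1a only: apply the first matching suffix rule.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]      # sses -> ss   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]      # ies  -> i    (ponies -> poni)
    if word.endswith("ss"):
        return word           # ss   -> ss   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]      # s    -> 0    (cats -> cat)
    return word
```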

Viewing morphology in a corpus

  • Why only strip -ing if there is a vowel?

    (*v*)ing → ø    walking → walk
                    sing    → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr
tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr

All words ending in -ing:        Only -ing preceded by a vowel:
    1312 King                        548 being
     548 being                       541 nothing
     541 nothing                     152 something
     388 king                        145 coming
     375 bring                       130 morning
     358 thing                       122 having
     307 ring                        120 living
     152 something                   117 loving
     145 coming                      116 Being
     130 morning                     102 going


Regular expressions

  • A formal language for specifying text strings
  • How can we search for any of these?

– woodchuck – woodchucks – Woodchuck – Woodchucks

Regular Expressions: Disjunctions

  • Letters inside square brackets []
  • Ranges, e.g. [A-Z]

    Pattern         Matches                 Example
    [wW]oodchuck    Woodchuck, woodchuck
    [1234567890]    Any digit
    [A-Z]           An upper case letter    Drenched Blossoms
    [a-z]           A lower case letter     my beans were impatient
    [0-9]           A single digit          Chapter 1: Down the Rabbit Hole
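The bracket disjunctions can be tried directly with Python's `re` module (Python is used here only for illustration; the slides use bare regex syntax, and the example strings are from the table):

```python
import re

# Square-bracket disjunctions: each bracket matches exactly one character
# drawn from the listed set or range.
print(re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck"))
# -> ['Woodchuck', 'woodchuck']   (either capitalization)

print(re.findall(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))
# -> ['1']                        (the single digit)

print(re.findall(r"[A-Z]", "Drenched Blossoms"))
# -> ['D', 'B']                   (upper case letters only)
```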


Regular Expressions: Negation

  • Negation [^Ss]

– Caret means negation only when first in []

    Pattern    Matches                     Example
    [^A-Z]     Not an upper case letter    Oyfn pripetchik
    [^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason”
    [^e^]      Neither e nor ^             Look here
    a^b        The pattern a caret b       Look up a^b now

Regular Expressions: More Disjunction

  • Woodchuck is another name for groundhog!
  • The pipe | for disjunction

    Pattern                      Matches
    groundhog|woodchuck          groundhog, woodchuck
    yours|mine                   yours, mine
    a|b|c                        = [abc]
    [gG]roundhog|[Ww]oodchuck


Regular Expressions: ? * + .

Stephen C. Kleene

    Pattern    Matches                         Examples
    colou?r    Optional previous char          color  colour
    oo*h!      0 or more of previous char      oh! ooh! oooh! ooooh!
    o+h!       1 or more of previous char      oh! ooh! oooh! ooooh!
    baa+                                       baa baaa baaaa baaaaa
    beg.n                                      begin begun beg3n

  • * and + are called Kleene *, Kleene +
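The four operators can be checked with Python's `re.fullmatch`, which tests the pattern against the entire string (Python is illustrative only; the patterns are the ones from the table):

```python
import re

# ? : optional previous character
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# * : zero or more of the previous character
assert re.fullmatch(r"oo*h!", "oh!")
assert re.fullmatch(r"oo*h!", "ooooh!")

# + : one or more of the previous character
assert re.fullmatch(r"o+h!", "ooh!")
assert re.fullmatch(r"baa+", "baaaa")

# . : any single character
assert re.fullmatch(r"beg.n", "begun")
assert not re.fullmatch(r"beg.n", "begin!")   # trailing ! makes it too long
```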

Regular Expressions: Anchors ^ $

    Pattern       Matches
    ^[A-Z]        Palo Alto
    ^[^A-Za-z]    “Hello”
    \.$           The end.
    .$            The end!
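The anchors behave the same way under Python's `re.search` (again, Python only for illustration; the test strings are from the table, with straight quotes):

```python
import re

assert re.search(r"^[A-Z]", "Palo Alto")        # ^ anchors to the start
assert not re.search(r"^[A-Z]", "palo alto")

assert re.search(r"^[^A-Za-z]", '"Hello"')      # first char is not a letter

assert re.search(r"\.$", "The end.")            # literal period at the end
assert not re.search(r"\.$", "The end!")
assert re.search(r".$", "The end!")             # . matches any final char
```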


Example

  • Find me all instances of the word “the” in a text.

    the                         Misses capitalized examples
    [tT]he                      Incorrectly returns other or theology
    [^a-zA-Z][tT]he[^a-zA-Z]
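The three successive attempts can be run on a sample sentence (the sentence is ours, chosen to trigger each error type):

```python
import re

text = "The other one, the theology student, was there."

print(re.findall(r"the", text))
# -> ['the', 'the', 'the', 'the']   (misses "The"; wrongly hits other, theology, there)

print(re.findall(r"[tT]he", text))
# -> ['The', 'the', 'the', 'the', 'the']   (gets "The", still hits other, theology, there)

print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))
# -> [' the ']   (word-bounded, but now misses the sentence-initial "The")
```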

Errors

  • The process we just went through was based on fixing two kinds of errors

– Matching strings that we should not have matched (there, then, other)
  • False positives
– Not matching things that we should have matched (The)
  • False negatives

Errors cont.

  • In NLP we are always dealing with these kinds of errors.
  • Reducing the error rate for an application often involves two antagonistic efforts:

– Increasing precision (minimizing false positives)
– Increasing coverage or recall (minimizing false negatives)
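The two quantities can be written out directly; the counts below are made up purely for illustration, continuing the "the"-matching example.

```python
# Precision: of the strings we matched, what fraction were correct?
def precision(tp, fp):
    return tp / (tp + fp)

# Recall: of the strings we should have matched, what fraction did we find?
def recall(tp, fn):
    return tp / (tp + fn)

# Suppose a "the"-matcher found 90 true hits, 10 false positives
# (matches inside "other", "theology"), and missed 30 sentence-initial "The"s:
print(precision(90, 10))   # 0.9
print(recall(90, 30))      # 0.75
```

Tightening the pattern raises precision but lowers recall, and vice versa, which is exactly the antagonism described above.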

Regular expressions

  • Regular expressions play a surprisingly large role

– Sophisticated sequences of regular expressions are often the first model for any text processing task

  • For many hard tasks, we use machine learning classifiers


Lemmatization

  • Reduce variant forms to a base form

– am, are, is → be
– car, cars, car's, cars' → car

  • the boy's cars are different colors → the boy car be different color
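A real lemmatizer combines a dictionary lookup with morphological analysis; as a toy sketch of the idea, a hand-written lookup table (entries chosen to cover the examples above; the table and function name are ours) already reproduces the mapping:

```python
# Toy lemma dictionary covering the slide's examples only.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car",
          "boy's": "boy", "colors": "color"}

def lemmatize(tokens):
    # Fall back to the lowercased token when no lemma is listed.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]
```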

Stemming vs lemmatization

  • Stemming: a crude heuristic process that chops off the ends of words
  • Lemmatization: return the base or dictionary form of a word, which is known as the lemma


Sentence segmentation

  • !, ? are relatively unambiguous
  • Period “.” is quite ambiguous

– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3

  • Build a binary classifier

– Looks at a “.”
– Decides EndOfSentence/NotEndOfSentence
– Classifiers: hand-written rules, regular expressions, or machine learning
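A hand-written-rule version of that binary classifier can be sketched as follows. The abbreviation list and rule set are a minimal illustration, not a complete classifier: a period is called a sentence boundary unless it follows a known abbreviation or sits inside a number.

```python
import re

# Toy abbreviation list (illustrative; a real system uses a much larger one).
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Prof", "Inc", "etc"}

def is_end_of_sentence(text, i):
    """Decide EndOfSentence for the '.' at position i of text."""
    assert text[i] == "."
    before = text[:i]
    after = text[i + 1:]
    # Rule 1: the word ending at the period is a known abbreviation ("Dr.").
    word = re.findall(r"[A-Za-z]+$", before)
    if word and word[0] in ABBREVIATIONS:
        return False
    # Rule 2: digits on both sides mean a number like 4.3.
    if re.search(r"\d$", before) and re.match(r"\d", after):
        return False
    # Otherwise, call it a sentence boundary.
    return True
```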

Determining if a word is end-of-sentence: a Decision Tree