
CoLi USb
Resources for Computational Linguists Seminar

Shallow NLP
Three Early Stages: Pre-processing, Tokenization & Morphological Analysis

by Achmad Yani
CoLi, Saarland University

Contents
1. Stages in NLP Systems
2. Pre-processing
3. Tokenization
4. Morphological Analysis
5. Conclusions

Stages in a Comprehensive NLP System

[Figure: diagram of processing stages. Labels: Text Generation, KB Reasoning, P&D Analysis, Semantic Analysis, Syntactic Analysis, Morphological Analysis, Tokenization, Preprocessing. Group labels: Linguistic Analysis Stages, Pre-Linguistic Analysis.]

1. Preprocessing Stage

Main task of the stage:
- Filter unnecessary characters out of the text, such as:
  - extra whitespace
  - text subdivisions
  - special characters
  - SGML-type codes

How?
- Using lex or flex on Unix-based workstations

Flex program for filtering out SGML markings

    /* Call this file StripSGML.lx, and then run:
         flex -8 -CF StripSGML.lx; gcc -o StripSGML lex.yy.c -lfl -s
       To pass this simple filter over a text file called toto, run:
         StripSGML < toto
    */
    %%
    "<"[^\n<>]+">"    ;        /* drop the SGML tag */
    .                 ECHO;    /* copy any other character */
    [\n]              ECHO;    /* copy newlines */
    %%

Deletes SGML markings from an input file.

Flex program for dehyphenating a text

    /* Call this file dehyphen.lx, and then run:
         flex -8 -CF dehyphen.lx; gcc -o dehyphen lex.yy.c -lfl -s
       To pass this simple filter over a text file called toto, run:
         dehyphen < toto
    */
    %%
    [a-z]-[ \t]*\n[ \t]*    { printf("%c", yytext[0]); }   /* keep only the letter */
    %%

The rule matches a lower-case letter followed by a hyphen, then any number of tabs or spaces, followed by a newline character and more tabs or spaces. Only the letter is printed, so the two halves of the hyphenated word are joined.
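For readers without a lex toolchain, the two filters can be approximated with ordinary regular expressions. The following is a minimal Python sketch, not part of the original slides; the function names `strip_sgml` and `dehyphenate` are illustrative assumptions.

```python
import re

# Mirrors the flex rule "<"[^\n<>]+">": a tag containing no newline and no nested bracket.
SGML_TAG = re.compile(r"<[^\n<>]+>")

# Lower-case letter, hyphen, optional spaces/tabs, newline, optional spaces/tabs.
HYPHEN_BREAK = re.compile(r"([a-z])-[ \t]*\n[ \t]*")


def strip_sgml(text: str) -> str:
    """Delete SGML-style markings and copy everything else unchanged."""
    return SGML_TAG.sub("", text)


def dehyphenate(text: str) -> str:
    """Join a word split by a hyphen at a line end, keeping only the letter."""
    return HYPHEN_BREAK.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_sgml("<p>Hello <b>world</b></p>"))
    print(dehyphenate("tokeni-\n  zation is fun"))
```

Like the flex rule, this sketch joins every lower-case hyphenated line break, so a genuine compound that happens to break at its hyphen would also lose the hyphen.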

2. Tokenization

Main task of the stage:
- Isolation of word-like units from a text / recognition of sentence boundaries
- Each element of the text is recognized by:
  - a certain syntactic class, for example: dog -> SINGULAR-NOUN
  - sentence boundaries

Non-Trivial Tokenization Cases

- Recognizing tokens that contain ambiguous punctuation:
  - Numbers, alphanumeric references, e.g. T-1-AB.1.2
  - Dates, e.g. 05/07/07
  - Acronyms, e.g. AT&T
  - Punctuation: !, ?, .
  - Abbreviations, e.g. m.p.h.

Sentence Boundary Identification Approaches

Tokenization approaches:
- Manual approach: the primitive way, using a hand-written regular expression grammar.
- Maximum Entropy approach: the system learns to classify each occurrence of punctuation as a sentence boundary or not.

MANUAL APPROACH - RE for Ambiguous Separators in Numbers

- English number, e.g. 123,456.78
    ([0-9]+[,])*[0-9]+([.][0-9]+)?
- French number, e.g. 123 456,78
    ([0-9]+[ ])*[0-9]+([,][0-9]+)?
- Fractions, dates
    [0-9]+(\/[0-9]+)+
- Percent
    ([+\-])?[0-9]+(\.)?[0-9]*%
- Decimal numbers, e.g. 1,234.56
    ([0-9]+,?)+(\.[0-9]+|[0-9]+)*
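As a rough illustration of how such patterns can drive a tokenizer, the sketch below tests a token against four of the number patterns with Python's `re.fullmatch`. It is an assumption-laden example rather than the seminar's implementation: the dictionary keys and the function `classify_number` are made up for illustration, and the digit group after the separator groups is written as `[0-9]+` so that the example numbers match in full.

```python
import re
from typing import List

# Patterns adapted from the slide; the names on the left are illustrative labels.
NUMBER_PATTERNS = {
    "english_number": r"([0-9]+[,])*[0-9]+([.][0-9]+)?",
    "french_number": r"([0-9]+[ ])*[0-9]+([,][0-9]+)?",
    "fraction_or_date": r"[0-9]+(/[0-9]+)+",
    "percent": r"([+\-])?[0-9]+(\.)?[0-9]*%",
}


def classify_number(token: str) -> List[str]:
    """Return the labels of all patterns that match the whole token."""
    return [
        label
        for label, pattern in NUMBER_PATTERNS.items()
        if re.fullmatch(pattern, token)
    ]


if __name__ == "__main__":
    for token in ["123,456.78", "123 456,78", "05/07/07", "-12.5%", "42"]:
        print(token, classify_number(token))
```

A plain integer such as 42 satisfies both the English and the French pattern, which is one small instance of the ambiguity that makes hand-written grammars hard to maintain.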

MANUAL APPROACH - RE for Abbreviations

Three classes of abbreviations:
1. A single capital followed by a period, e.g. A., B., C.
     [A-Za-z]\.
2. A sequence of letter-period-letter-period pairs, e.g. U.S., i.e., m.p.h.
     [A-Za-z]\.([A-Za-z0-9]\.)+
3. A capital letter followed by a sequence of consonants followed by a period, e.g. Mr., St., Assn.
     [A-Z][bcdfghj-np-tvxz]+\.

MANUAL APPROACH - System Performance

Using the Brown Corpus:

Regular Expression             Correct   Errors   Full Stop
[A-Za-z]\.                        1327       52          14
[A-Za-z]\.([A-Za-z0-9]\.)+         570        0          66
[A-Z][bcdfghj-np-tvxz]+\.         1938       44          26
Totals                            3835       96         106
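The three abbreviation classes can be tried out directly. The sketch below is a minimal Python illustration, with the class labels and the function `abbreviation_class` chosen here for readability; it returns the first class whose pattern matches the whole token.

```python
import re
from typing import Optional

# The three regular expressions from the slide, in the order they are listed.
ABBREVIATION_CLASSES = [
    ("single_letter", re.compile(r"[A-Za-z]\.")),
    ("letter_period_sequence", re.compile(r"[A-Za-z]\.([A-Za-z0-9]\.)+")),
    ("capital_plus_consonants", re.compile(r"[A-Z][bcdfghj-np-tvxz]+\.")),
]


def abbreviation_class(token: str) -> Optional[str]:
    """Return the label of the first class matching the whole token, else None."""
    for label, pattern in ABBREVIATION_CLASSES:
        if pattern.fullmatch(token):
            return label
    return None


if __name__ == "__main__":
    for token in ["A.", "U.S.", "m.p.h.", "Mr.", "Assn.", "dog."]:
        print(token, abbreviation_class(token))
```

A token such as "dog." matches none of the classes, so its period would be treated as a full stop; the Errors column in the table counts the cases where a pattern fires on something that is not an abbreviation.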

MANUAL APPROACH - Problems

- Exception lists will never be exhaustive; they always need to be updated.
- Multiple rules may interact badly, since punctuation marks do not always follow the logic of the formula, e.g.:
    The president lives in Washington D.C.
  Logically, it should be:
    The president lives in Washington D.C..

Maximum Entropy (ME) Approach

The idea:
- Scan the text for sequences of characters separated by whitespace (tokens) that contain ., ?, or !
- Treat these tokens as potential sentence boundaries.
- Use contextual information to classify each one.

ME APPROACH - Terminology

- Candidate: the token containing the symbol which marks a putative sentence boundary
- Prefix: the portion of the Candidate preceding the potential sentence boundary
- Suffix: the portion of the Candidate following the potential sentence boundary

ME APPROACH - Contextual Templates

- The Prefix
- The Suffix
- Whether the Prefix or Suffix is on the list of induced abbreviations (from training data)
- The word to the left of the Candidate
- The word to the right of the Candidate
- Whether the word to the left or right of the Candidate is on the list of induced abbreviations
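The terminology and templates above can be made concrete with a small context extractor. This is a hedged Python sketch: the function `candidate_contexts`, the dictionary keys, and the hand-supplied abbreviation set are assumptions for illustration (in the approach described, the abbreviation list is induced from training data), and only the prefix-abbreviation template is implemented.

```python
from typing import Dict, List, Set

BOUNDARY_MARKS = ".?!"


def candidate_contexts(text: str, induced_abbreviations: Set[str]) -> List[Dict[str, str]]:
    """Collect one context record per potential sentence boundary in the text."""
    tokens = text.split()  # tokens are sequences of characters separated by whitespace
    contexts = []
    for i, token in enumerate(tokens):
        for pos, char in enumerate(token):
            if char not in BOUNDARY_MARKS:
                continue
            prefix, suffix = token[:pos], token[pos + 1:]
            contexts.append({
                "Candidate": token,
                "PreviousWord": tokens[i - 1] if i > 0 else "NULL",
                "FollowingWord": tokens[i + 1] if i + 1 < len(tokens) else "NULL",
                "Prefix": prefix or "NULL",
                "Suffix": suffix or "NULL",
                # Hypothetical encoding of the induced-abbreviation template.
                "PrefixFeature": "InducedAbbreviation"
                if prefix in induced_abbreviations else "NULL",
            })
    return contexts


if __name__ == "__main__":
    sentence = "ANLP Corp. chairman Dr. Smith resigned."
    for context in candidate_contexts(sentence, {"Corp", "Dr"}):
        print(context)
```

A token with several marks, such as U.S., yields one record per mark, each with its own Prefix and Suffix.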

ME APPROACH - Example 1

- Sentence:
    ANLP Corp. chairman Dr. Smith resigned.
- The exact information for the potential sentence boundary marked by . in Corp. would be:
    PreviousWord=ANLP, FollowingWord=chairman, Prefix=Corp, Suffix=NULL, PrefixFeature=InducedAbbreviation.

ME APPROACH - Joint Probability

- For each potential boundary token (., ?, !), estimate the joint probability p of the token and its surrounding context:

$$ p(b, c) = \prod_{j=1}^{k} \alpha_j^{f_j(b, c)}, \quad \text{where } b \in \{\text{no}, \text{yes}\} $$

The \alpha_j are the unknown parameters of the model; each \alpha_j corresponds to a feature f_j. The probability of seeing an actual sentence boundary in the context c is given by p(yes, c).

ME APPROACH - Example of a Useful Feature

$$ f_j(b, c) = \begin{cases} 1 & \text{if } \mathrm{Prefix}(c) = \text{Mr and } b = \text{no} \\ 0 & \text{otherwise} \end{cases} $$

This feature allows the model to discover that the period at the end of the word Mr. seldom occurs as a sentence boundary.

ME APPROACH - Decision Rule

- A potential sentence boundary is an actual sentence boundary if and only if p(yes | c) > 0.5, where:

$$ p(\text{yes} \mid c) = \frac{p(\text{yes}, c)}{p(\text{yes}, c) + p(\text{no}, c)} $$
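The product form of the joint probability and the decision rule fit in a few lines of code. The Python sketch below is illustrative only: the second feature and both alpha values are invented for the example and are not trained parameters, and the joint scores are left unnormalized because any constant factor cancels in the ratio p(yes | c).

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
Feature = Callable[[str, Context], int]


def mr_not_boundary(b: str, c: Context) -> int:
    """The feature from the slide: fires when Prefix(c) = Mr and b = no."""
    return 1 if c.get("Prefix") == "Mr" and b == "no" else 0


def empty_suffix_boundary(b: str, c: Context) -> int:
    """Hypothetical extra feature: fires when the Suffix is empty and b = yes."""
    return 1 if c.get("Suffix") == "NULL" and b == "yes" else 0


FEATURES: List[Feature] = [mr_not_boundary, empty_suffix_boundary]
ALPHAS: List[float] = [4.0, 3.0]  # made-up weights, one per feature


def joint_probability(b: str, c: Context) -> float:
    """Product over j of alpha_j ** f_j(b, c)."""
    p = 1.0
    for feature, alpha in zip(FEATURES, ALPHAS):
        p *= alpha ** feature(b, c)
    return p


def p_yes_given(c: Context) -> float:
    """p(yes | c) = p(yes, c) / (p(yes, c) + p(no, c))."""
    p_yes = joint_probability("yes", c)
    p_no = joint_probability("no", c)
    return p_yes / (p_yes + p_no)


def is_sentence_boundary(c: Context) -> bool:
    """Decision rule: an actual boundary if and only if p(yes | c) > 0.5."""
    return p_yes_given(c) > 0.5


if __name__ == "__main__":
    print(is_sentence_boundary({"Prefix": "Mr", "Suffix": "NULL"}))
    print(is_sentence_boundary({"Prefix": "resigned", "Suffix": "NULL"}))
```

With these invented weights, the Mr feature pulls the score toward "no" strongly enough to outweigh the empty-suffix feature, which is the behavior the slide's example feature is meant to make learnable.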

ME APPROACH - System Performance

Corpus                  WSJ     Brown
Sentences             20478     51672
Candidate P. Marks    32173     61282
Accuracy              98.8%     97.9%
False Positives         201       750
False Negatives         171       506

3. Morphological Analysis

Main task of the stage:
- Analysing the meaningful components of words
- Non-trivial case: word division

Word Division

- English: It's, he's, that's, there's, who's, she's
- French: L'addition, m'appelle, donne-le, va-t-il, etc.
- Bahasa (Indonesian): Pertanggungjawaban, kemerdekaan

Morphology

- Hebrew (transliterated): ukshepagashtihu
- English translation: and when I met him (masculine)

Morphological Analysis Tools: PC-Kimmo

- A two-level processor for morphological analysis
- The program is designed to generate and parse words using a two-level model of word structure, in which a word is represented as a correspondence between:
  1. its lexical-level form and
  2. its surface-level form.

PC-KIMMO FILES (provided by the user)

1. A rules file: specifies the alphabet and the phonological (spelling) rules
2. A lexicon file: lists lexical items (words and morphemes) and their glosses, and encodes morphotactic constraints

Main Components of PC-Kimmo

[Figure: the Rules and the Lexicon feed two components. Recognizer: Surface Form -> Lexical Form. Generator: Lexical Form -> Surface Form.]

Example:
- Word form: dying
- Lexical representation: d i e + i n g
- Surface representation: d y 0 0 i n g

+ indicates a morpheme boundary
0 indicates a null element

Example (cont.)

- Rules must be written to account for the correspondences:
    d:d, i:y, e:0, +:0, i:i, n:n and g:g
- The phonological rule looks roughly like this:
    i:y => ___ e:0 +:0 i
- It is translated into a finite state table like this:

        i  e  +  i  @
        y  0  0  i  @
    1:  2  1  1  1  1
    2:  0  3  0  0  0
    3:  0  0  4  0  0
    4:  0  0  0  1  0

Two-Level Rules Notation

- A rule is made up of three parts:
  - the correspondence
  - the rule operator
  - the environment or context
- Example:
  - Lexical Representation (LR): t a t i
  - Surface Representation (SR): t a c i
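To see how the finite state table for the i:y rule is applied, the table can be stepped through one lexical:surface pair at a time. The following Python sketch is an illustration under stated assumptions, not PC-Kimmo's own code: the names `align` and `accepts` are made up, a transition to 0 is treated as rejection, the @:@ column is taken to cover every pair without a column of its own, and only state 1 is treated as an accepting end state.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (lexical character, surface character)

# Column headers of the table, left to right; the last column is @:@ (any other pair).
COLUMNS: List[Pair] = [("i", "y"), ("e", "0"), ("+", "0"), ("i", "i")]
OTHER = len(COLUMNS)

# Rows of the table: state -> next state per column; 0 means the pairing is rejected.
TABLE: Dict[int, List[int]] = {
    1: [2, 1, 1, 1, 1],
    2: [0, 3, 0, 0, 0],
    3: [0, 0, 4, 0, 0],
    4: [0, 0, 0, 1, 0],
}


def align(lexical: str, surface: str) -> List[Pair]:
    """Pair up space-separated lexical and surface characters one to one."""
    lex, surf = lexical.split(), surface.split()
    if len(lex) != len(surf):
        raise ValueError("lexical and surface forms must have the same length")
    return list(zip(lex, surf))


def accepts(pairs: List[Pair]) -> bool:
    """Run the pairs through the table, starting in state 1."""
    state = 1
    for pair in pairs:
        column = COLUMNS.index(pair) if pair in COLUMNS else OTHER
        state = TABLE[state][column]
        if state == 0:
            return False
    return state == 1  # assumption: state 1 is the only accepting state


if __name__ == "__main__":
    print(accepts(align("d i e + i n g", "d y 0 0 i n g")))
    print(accepts(align("p i n", "p y n")))
```

The first call walks through states 1, 2, 3, 4 and back to 1, so the pairing for dying is accepted; in the second, i:y is followed by n:n, which hits a 0 entry in row 2 and is rejected.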

1. Correspondence

- Correspondence pair: lexical-character:surface-character
- There must be an exact one-to-one correspondence between the characters of LR and SR.
- From the example, LR: t a t i and SR: t a c i, we have:
  - t:t, a:a, i:i -> default correspondences
  - t:c -> special correspondence

2. Rule Operator

Four types of operator:
  =>    the correspondence only (but not always) occurs in the environment
  <=    the correspondence always (but not only) occurs in the environment
  <=>   the correspondence always and only occurs in the environment
  /<=   the correspondence never occurs in the environment
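The four operators can be read as checks over every place where the lexical character occurs. The Python sketch below illustrates that reading under assumptions of its own: the environment is simplified to "immediately before a given lexical character", the choice of "before i" for t:c is an assumption made for the example, and the names `find_sites` and `satisfies` are hypothetical.

```python
from typing import List, Tuple

# One entry per occurrence of the lexical character:
# (special correspondence used?, environment present?)
Site = Tuple[bool, bool]


def find_sites(lexical: str, surface: str, lex_char: str,
               surf_char: str, right_context: str) -> List[Site]:
    """Record, for each lex_char in LR, whether it surfaces as surf_char
    and whether it stands directly before right_context."""
    lex, surf = lexical.split(), surface.split()
    if len(lex) != len(surf):
        raise ValueError("LR and SR must correspond one to one")
    sites = []
    for k, (l, s) in enumerate(zip(lex, surf)):
        if l != lex_char:
            continue
        in_env = k + 1 < len(lex) and lex[k + 1] == right_context
        sites.append((s == surf_char, in_env))
    return sites


def satisfies(operator: str, sites: List[Site]) -> bool:
    """Check the observed sites against one of the four rule operators."""
    only = all(env for special, env in sites if special)      # =>
    always = all(special for special, env in sites if env)    # <=
    never = not any(special and env for special, env in sites)  # /<=
    checks = {"=>": only, "<=": always, "<=>": only and always, "/<=": never}
    return checks[operator]


if __name__ == "__main__":
    sites = find_sites("t a t i", "t a c i", "t", "c", "i")
    for operator in ["=>", "<=", "<=>", "/<="]:
        print(operator, satisfies(operator, sites))
```

For LR t i t i with SR t i c i, the first t stands before i but surfaces as t, so the pairing satisfies `=>` (t:c occurs only before i) but not `<=` (it does not always occur there).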
