Shallow NLP. Three Early Stages: Pre-processing, Tokenization & Morphological Analysis



SLIDE 1

CoLi USb

Shallow NLP

Resources for Computational Linguists Seminar

Three Early Stages: Pre-processing, Tokenization & Morphological Analysis

by Achmad Yani

CoLi Saarland University

Contents

1. STAGES IN NLP SYSTEMS
2. PRE-PROCESSING
3. TOKENIZATION
4. MORPHOLOGICAL ANALYSIS
5. CONCLUSIONS

SLIDE 2

Stages in a Comprehensive NLP System

[Figure: stack of processing stages. Stages shown: Preprocessing, Tokenization, Morphological Analysis, Syntactic Analysis, Semantic Analysis, P&D Analysis, KB Reasoning, Text Generation. The stages are grouped under the labels "Pre-Linguistic Analysis" and "Linguistic Analysis Stages".]

1. Preprocessing Stage

Main task of the stage: filter unnecessary characters out of the text, such as:

  • extra whitespace
  • text subdivisions
  • special characters
  • SGML-type codes

How? Using lex or flex on Unix-based workstations.

SLIDE 3

Flex program for filtering out SGML markings

/* Call this file StripSGML.lx, and then run:
     flex -8 -CF StripSGML.lx; gcc -o StripSGML lex.yy.c -lfl -s
   To pass this simple filter over a text file called toto, run:
     StripSGML < toto
*/
%%
"<"[^\n<>]+">"    ;
.                 ECHO;
[\n]              ECHO;
%%

Deletes SGML markings from an input file.
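
For readers without flex at hand, the same filter can be sketched in Python. This is only an illustration of the pattern used in the flex rule, not part of the original slides; the function name strip_sgml is hypothetical.

```python
import re

# Same pattern as the flex rule: "<", then one or more characters that are
# not a newline, "<" or ">", then ">".
SGML_TAG = re.compile(r"<[^\n<>]+>")


def strip_sgml(text: str) -> str:
    """Return the text with SGML-style markings removed; everything else is kept."""
    return SGML_TAG.sub("", text)


if __name__ == "__main__":
    print(strip_sgml("<p>Shallow <b>NLP</b> stages</p>"))
```

As in the flex version, a "<" that is not closed by ">" on the same line is left untouched.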

Flex program for dehyphenating a text

/* Call this file dehyphen.lx, and then run:
     flex -8 -CF dehyphen.lx; gcc -o dehyphen lex.yy.c -lfl -s
   To pass this simple filter over a text file called toto, run:
     dehyphen < toto
*/
%%
[a-z]-[ \t]*\n[ \t]*    { printf("%c", yytext[0]); }
%%

The pattern matches a lower-case letter, followed by a hyphen, then any number of tabs or spaces, followed by a newline character and more tabs or spaces. Only the letter is printed, so the hyphen and the line break are dropped.
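
The dehyphenation rule can be sketched the same way in Python. This assumes that the character classes around the newline are meant to cover both spaces and tabs, as the description says; the function name dehyphenate is hypothetical.

```python
import re

# A lower-case letter, a hyphen, optional spaces/tabs, a newline,
# then optional spaces/tabs. Only the letter (group 1) is kept.
HYPHEN_BREAK = re.compile(r"([a-z])-[ \t]*\n[ \t]*")


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return HYPHEN_BREAK.sub(r"\1", text)


if __name__ == "__main__":
    print(dehyphenate("tokeni-\n   zation is hard"))
```

Note that a genuine hyphenated compound that happens to be split at the hyphen also loses its hyphen, so this filter is a heuristic.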

SLIDE 4

2. Tokenization

Main task of the stage: isolating word-like units from a text and recognizing sentence boundaries.

The elements of the text are recognized as:

  • units of a certain syntactic class, for example: dog SINGULAR-NOUN
  • sentence boundaries

Non-Trivial Tokenization Cases

Recognizing tokens that contain ambiguous punctuation:

Numbers, alphanumeric references

  • e.g. T-1-AB.1.2

Dates

  • e.g. 05/07/07

Acronyms

  • e.g. AT&T

Punctuations

  • !,?,.

Abbreviations

  • e.g. m.p.h.

SLIDE 5

Sentence Boundary Identification Approaches

Tokenization approaches:

  • Maximum Entropy Approach: the system learns to classify each occurrence of punctuation as a sentence boundary or not.
  • Manual Approach: the primitive way, hand-writing a regular expression grammar.

MANUAL APPROACH - RE for Ambiguous Separators in Numbers

  • English number: 123,456.78

([0-9]+[,])*[0-9]+([.][0-9]+)?

  • French number: 123 456,78

([0-9]+[ ])*[0-9]+([,][0-9]+)?

  • Fractions, dates

[0-9]+(\/[0-9]+)+

  • Percent

([+\-])?[0-9]+(\.)?[0-9]*%

  • Decimal numbers (e.g. 1,234.56)

([0-9]+,?)+(\.[0-9]+|[0-9]+)*
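
A quick way to check these expressions is to run them against the slide's own examples. The sketch below uses Python's re module and makes two assumptions about the intended patterns: the last integer group is `[0-9]+` rather than a single digit, and the fractional part of the French pattern is optional. The dictionary keys and the function name are hypothetical.

```python
import re

# Patterns from the slide, written without the extraction spaces.
# Assumptions: "[0-9]+" before the decimal part, optional French fraction.
NUMBER_PATTERNS = {
    "english": r"([0-9]+[,])*[0-9]+([.][0-9]+)?",
    "french": r"([0-9]+[ ])*[0-9]+([,][0-9]+)?",
    "fraction_or_date": r"[0-9]+(\/[0-9]+)+",
    "percent": r"([+\-])?[0-9]+(\.)?[0-9]*%",
    "decimal": r"([0-9]+,?)+(\.[0-9]+|[0-9]+)*",
}


def matching_classes(token: str) -> list:
    """Return the names of all patterns that match the whole token."""
    return [name for name, pattern in NUMBER_PATTERNS.items()
            if re.fullmatch(pattern, token)]


if __name__ == "__main__":
    for token in ["123,456.78", "123 456,78", "05/07/07", "+12.5%", "1,234.56"]:
        print(token, matching_classes(token))
```

A token can match more than one pattern (the English number also satisfies the decimal-number expression), which is one reason the rules need to be ordered or combined with care.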

SLIDE 6

MANUAL APPROACH - RE for Abbreviations

Three classes of Abbreviations:

1. A single capital followed by a period, e.g. A., B., C.
     [A-Za-z]\.
2. A sequence of letter-period-letter-periods, e.g. U.S., i.e., m.p.h.
     [A-Za-z]\.([A-Za-z0-9]\.)+
3. A capital letter followed by a sequence of consonants followed by a period, e.g. Mr., St., Assn.
     [A-Z][bcdfghj-np-tvxz]+\.
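
The three classes can be tried out with a small classifier. This is a sketch only: the class labels and the function name abbreviation_class are hypothetical, and the patterns are applied to whole tokens with re.fullmatch.

```python
import re

# (label, pattern) pairs for the three abbreviation classes on the slide.
ABBREVIATION_CLASSES = [
    ("single letter", r"[A-Za-z]\."),
    ("letter-period sequence", r"[A-Za-z]\.([A-Za-z0-9]\.)+"),
    ("capital plus consonants", r"[A-Z][bcdfghj-np-tvxz]+\."),
]


def abbreviation_class(token: str):
    """Return the label of the first class matching the whole token, else None."""
    for label, pattern in ABBREVIATION_CLASSES:
        if re.fullmatch(pattern, token):
            return label
    return None


if __name__ == "__main__":
    for token in ["A.", "U.S.", "m.p.h.", "Mr.", "Assn.", "dog."]:
        print(token, "->", abbreviation_class(token))
```

A token such as "dog." falls into none of the classes, so its period would be treated as a full stop.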

MANUAL APPROACH - System Performance

  • Using the Brown Corpus

Regular Expression             Correct   Errors   Full Stop
[A-Za-z]\.                        1327       52          14
[A-Za-z]\.([A-Za-z0-9]\.)+         570        0          66
[A-Z][bcdfghj-np-tvxz]+\.         1938       44          26
Totals                            3835       96         106

SLIDE 7

MANUAL APPROACH - Problems

  • The exception lists will never be exhaustive; they always need to be updated.
  • Multiple rules may interact badly, since punctuation marks do not always follow the logic of the formulas, e.g.

      The president lives in Washington D.C.

    Logically (one period for the abbreviation and one for the end of the sentence), it should be:

      The president lives in Washington D.C..

Maximum Entropy (ME) Approach

THE IDEA:

  • Scan the text for sequences of characters separated by whitespace (tokens).
  • Tokens containing ., ?, or ! are potential sentence boundaries.
  • Contextual information around each potential boundary is used to classify it.

SLIDE 8

ME APPROACH - Terminology

  • Candidate: the token containing the symbol which marks a putative sentence boundary
  • Prefix: the portion of the Candidate preceding the potential sentence boundary
  • Suffix: the portion of the Candidate following the potential sentence boundary

ME APPROACH - Contextual Templates

  • The Prefix
  • The Suffix
  • Whether the Prefix or Suffix is on the list of induced abbreviations (from training data)
  • The word to the left of the Candidate
  • The word to the right of the Candidate
  • Whether the word to the left or right of the Candidate is on the list of induced abbreviations

SLIDE 9

ME APPROACH - Example 1

Sentence: ANLP Corp. chairman Dr. Smith resigned.

The exact information for the potential sentence boundary marked by . in Corp. would be:

PreviousWord=ANLP, Following-Word=chairman, Prefix=Corp, Suffix=NULL, PrefixFeature=InducedAbbreviation.
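
The context extraction in this example can be sketched as follows. This is an illustration under simplifying assumptions, not the system's actual code: tokens are split on whitespace only, just the Prefix is checked against the abbreviation list, and the function name, dictionary keys, and the sample abbreviation list are hypothetical.

```python
BOUNDARY_MARKS = ".?!"


def candidate_contexts(sentence: str, induced_abbreviations: set) -> list:
    """Return one context dictionary per potential sentence boundary."""
    tokens = sentence.split()
    contexts = []
    for i, token in enumerate(tokens):
        for pos, ch in enumerate(token):
            if ch not in BOUNDARY_MARKS:
                continue
            prefix = token[:pos]                 # part before the mark
            suffix = token[pos + 1:] or "NULL"   # part after the mark
            context = {
                "Candidate": token,
                "PreviousWord": tokens[i - 1] if i > 0 else "NULL",
                "FollowingWord": tokens[i + 1] if i + 1 < len(tokens) else "NULL",
                "Prefix": prefix,
                "Suffix": suffix,
            }
            if prefix in induced_abbreviations:
                context["PrefixFeature"] = "InducedAbbreviation"
            contexts.append(context)
    return contexts


if __name__ == "__main__":
    # The abbreviation list would be induced from training data; this one is made up.
    for ctx in candidate_contexts("ANLP Corp. chairman Dr. Smith resigned.", {"Corp", "Dr"}):
        print(ctx)
```

For the sample sentence this yields three candidates (Corp., Dr. and resigned.), the first of which carries the information listed above.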

ME APPROACH - Joint Probability

For each potential boundary token (., ?, !), estimate the joint probability p of the token and its surrounding context:

$$p(b, c) = \prod_{j=1}^{k} \alpha_j^{f_j(b, c)}, \qquad b \in \{\text{no}, \text{yes}\}$$

  • $\alpha_j$: the unknown parameters of the model; each $\alpha_j$ corresponds to a feature $f_j$.

The probability of seeing an actual sentence boundary in the context c is given by p(yes, c).

SLIDE 10

ME APPROACH - Example of a Useful Feature

$$f_j(b, c) = \begin{cases} 1 & \text{if } \mathrm{Prefix}(c) = \text{Mr} \ \&\ b = \text{no} \\ 0 & \text{otherwise} \end{cases}$$

This feature allows the model to discover that the period at the end of the word Mr. seldom occurs as a sentence boundary.

ME APPROACH - Decision Rule

A potential sentence boundary is an actual sentence boundary if and only if p(yes | c) > 0.5, where:

$$p(\text{yes} \mid c) = \frac{p(\text{yes}, c)}{p(\text{yes}, c) + p(\text{no}, c)}$$
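
The joint probability and the decision rule fit in a few lines of Python. The sketch below assumes the product form p(b, c) = product over j of alpha_j raised to f_j(b, c); the two features and their weights are made up for illustration (in the real system the alpha_j are estimated from training data), and all function names are hypothetical.

```python
from math import prod


def joint_probability(b, c, features, alphas):
    """p(b, c) as the product of alpha_j ** f_j(b, c) over all features."""
    return prod(alpha ** f(b, c) for f, alpha in zip(features, alphas))


def p_yes_given(c, features, alphas):
    """p(yes | c) = p(yes, c) / (p(yes, c) + p(no, c))."""
    p_yes = joint_probability("yes", c, features, alphas)
    p_no = joint_probability("no", c, features, alphas)
    return p_yes / (p_yes + p_no)


def is_sentence_boundary(c, features, alphas):
    """Decision rule: an actual boundary iff p(yes | c) > 0.5."""
    return p_yes_given(c, features, alphas) > 0.5


def f_mr_not_boundary(b, c):
    """The slide's example feature: Prefix is Mr and b is no."""
    return 1 if c.get("Prefix") == "Mr" and b == "no" else 0


def f_boundary(b, c):
    """Hypothetical second feature: fires whenever b is yes."""
    return 1 if b == "yes" else 0


FEATURES = [f_mr_not_boundary, f_boundary]
ALPHAS = [9.0, 2.0]  # made-up weights, not trained values

if __name__ == "__main__":
    print(is_sentence_boundary({"Prefix": "Mr"}, FEATURES, ALPHAS))
    print(is_sentence_boundary({"Prefix": "resigned"}, FEATURES, ALPHAS))
```

With these toy weights the period after Mr gets p(yes | c) = 2/11 and is rejected, while the period after resigned gets 2/3 and is accepted.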

SLIDE 11

ME APPROACH - System Performance

Corpus                 WSJ     Brown
Sentences            20478     51672
Candidate P. Marks   32173     61282
Accuracy             98.8%     97.9%
False Positives        201       750
False Negatives        171       506

3. Morphological Analysis

Main task of the stage: analysing the meaningful components of words.

Non-trivial case: word division

SLIDE 12

Word Division

  • English: It's, he's, that's, there's, who's, she's
  • French: l'addition, m'appelle, donne-le, va-t-il, etc.
  • Bahasa (Indonesian): pertanggungjawaban, kemerdekaan

Morphology

  • Hebrew (transliterated): ukshepagashtihu
  • English translation: and when I met you (masculine)

SLIDE 13

Morphological Analysis Tools: PC-Kimmo

Two-level Processor for Morphological Analysis

The program is designed to generate and parse words using a two-level model of word structure, in which a word is represented as a correspondence between:

  • its lexical level form and
  • its surface level form.

PC-KIMMO FILES (provided by the user)

1. A rules file, which specifies the alphabet and the phonological (spelling) rules
2. A lexicon file, which lists lexical items (words and morphemes) and their glosses, and encodes morphotactic constraints

SLIDE 14

Main Components of PC-Kimmo

[Figure: components shown are the Rules, the Lexicon, the Recognizer (Surface Form to Lexical Form) and the Generator (Lexical Form to Surface Form).]

Example:

Word form: dying
Lexical Representation: d i e + i n g
Surface Representation: d y 0 0 i n g

+ indicates a morpheme boundary
0 indicates a null element
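
Because the two representations have the same length, they can be aligned character by character. The short Python sketch below pairs them up and recovers the written word form by dropping the null symbols; the function names are hypothetical.

```python
NULL = "0"  # null element in the two-level notation


def correspondences(lexical: str, surface: str) -> list:
    """Pair each lexical character with its surface character."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface forms must have equal length")
    return [f"{lex}:{sur}" for lex, sur in zip(lexical, surface)]


def surface_word(surface: str) -> str:
    """Drop null elements to obtain the written word form."""
    return surface.replace(NULL, "")


if __name__ == "__main__":
    print(correspondences("die+ing", "dy00ing"))
    print(surface_word("dy00ing"))
```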

SLIDE 15

Example (cont.)

Rules must be written to account for the correspondences: d:d, i:y, e:0, +:0, i:i, n:n and g:g

  • The phonological rule looks something like this:

i:y => ___ e:0 +:0 i

  • It will be translated into a finite state table like this:

[Table: finite state table for the rule, with states 1 to 4 and columns over the symbols i, e, +, y and @; the cell layout is not recoverable from this copy.]

Two-Level Rules Notation

A rule is made up of three parts:

  • the correspondence
  • the rule operator
  • the environment or context

Example:

Lexical Representation (LR): t a t i
Surface Representation (SR): t a c i

SLIDE 16

1. Correspondence

A correspondence pair has the form lexical-character:surface-character.
There must be an exact one-to-one correspondence between LR and SR.
From the example, LR: t a t i and SR: t a c i, we have:

  • t:t, a:a, i:i (default correspondences)
  • t:c (special correspondence)

2. Rule Operator

Four types of operator:

  • =>   the correspondence only (but not always) occurs in the environment
  • <=   the correspondence always (but not only) occurs in the environment
  • <=>  the correspondence always and only occurs in the environment
  • /<=  the correspondence never occurs in the environment

SLIDE 17

3. The Environment or Context

Indicates the position of the correspondence in the environment. The complete rule for the example:

LR: t a t i
SR: t a c i

is:

t:c => ___ i:i

or simply:

t:c => ___ i

Meaning: a lexical t corresponds to a surface c only when it precedes a lexical i that corresponds to a surface i.
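
The meaning of the => operator can be checked directly on an aligned LR/SR pair. The sketch below hard-codes this one rule (t:c only before i:i) purely for illustration; it is not how PC-KIMMO evaluates rules, and the function name is hypothetical.

```python
def obeys_t_c_rule(lexical: str, surface: str) -> bool:
    """Check the rule t:c => ___ i:i on an aligned lexical/surface pair.

    The "=>" operator only restricts where t:c may occur; it does not
    force a lexical t before i to surface as c.
    """
    pairs = list(zip(lexical, surface))
    for k, pair in enumerate(pairs):
        if pair == ("t", "c"):
            # t:c must be immediately followed by i:i
            if k + 1 >= len(pairs) or pairs[k + 1] != ("i", "i"):
                return False
    return True


if __name__ == "__main__":
    print(obeys_t_c_rule("tati", "taci"))  # t:c before i:i
    print(obeys_t_c_rule("tata", "taca"))  # t:c before a:a
```

Note that LR tati with SR tati also satisfies the rule, because => says "only", not "always".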

FST Diagram

[Figure: finite state transducer with two states, 1 and 2, and arcs labelled t:c, i:i and @:@.]

SLIDE 18

Finite State Transition Table

       t   i   @
       c   i   @
  1:   2   1   1
  2.   0   1   0
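
A table of this kind can be run mechanically. The sketch below assumes the table has columns t:c, i:i and @:@, rows "1: 2 1 1" and "2. 0 1 0", that state 1 (marked with a colon) is final and state 2 (marked with a period) is not, and that 0 means failure. It also treats every pair other than t:c and i:i as matching @:@, which is a simplification. All names are hypothetical.

```python
# Column index for each correspondence in the table header.
COLUMNS = {("t", "c"): 0, ("i", "i"): 1}
OTHER = 2  # the @:@ column: any other pair (simplifying assumption)

# state -> next state per column; 0 means "no legal transition".
TABLE = {
    1: (2, 1, 1),
    2: (0, 1, 0),
}
FINAL_STATES = {1}


def accepts(lexical: str, surface: str) -> bool:
    """Run the aligned lexical/surface pair through the state table."""
    if len(lexical) != len(surface):
        return False
    state = 1
    for pair in zip(lexical, surface):
        state = TABLE[state][COLUMNS.get(pair, OTHER)]
        if state == 0:
            return False
    return state in FINAL_STATES


if __name__ == "__main__":
    print(accepts("tati", "taci"))
    print(accepts("tata", "taca"))
```

After reading t:c the machine sits in the non-final state 2, and only i:i leads back to state 1, which is how the table encodes "only before i".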

CONCLUSION

  • Preparing raw text for linguistic treatment is non-trivial.
  • To maintain as much flexibility as possible, the tokenization process should be considered as a series of modular filters through which text can be selectively passed.
  • The Preprocessing, Tokenization and Morphological Analysis stages play important roles in providing a robust foundation for the next steps of linguistic processing.

SLIDE 19

References

  • Grefenstette, G., Tapanainen, P. (1994). What is a Word, What is a Sentence? Problems of Tokenization.
  • Reynar, J.C., Ratnaparkhi, A. (1997). A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C.
  • Antworth, E.L. (1990). PC-KIMMO: A Two-level Processor for Morphological Analysis. Occasional Publications in Academic Computing No. 16. Dallas, TX: Summer Institute of Linguistics.
  • Antworth, E.L. Appendix A: Developing the Rules Component. http://www.sil.org/pckimmo/v2/doc/guide.html

Comments, Discussion?