
EDAN20 Language Technology
http://cs.lth.se/edan20/
Chapter 2: Corpus Processing Tools

Pierre Nugues
Lund University
Pierre.Nugues@cs.lth.se
http://cs.lth.se/pierre_nugues/
August 28 and 31, 2017

Corpora

A corpus is a collection of texts (written or spoken) or speech.
Corpora are balanced from different sources: news, novels, etc.

                                      English   French         German
Most frequent words in a collection   the       de             der
of contemporary running texts         of        le (article)   die
                                      to        la (article)   und
                                      in        et             in
                                      and       les            des
Most frequent words in Genesis        and       et             und
                                      the       de             die
                                      of        la             der
                                      his       à              da
                                      he        il             er

Characteristics of Current Corpora

Big: The Bank of English (Collins and U Birmingham) has more than 500 million words
Available in many languages
Easy to collect: The web is the largest corpus ever built and within the reach of a mouse click
Parallel: the same text in two or more languages: English/French (Canadian Hansards), European Parliament (23 languages)
Annotated with parts of speech or manually parsed (treebanks):
    Characteristics/N of/PREP Current/ADJ Corpora/N
    (NP (NP Characteristics) (PP of (NP Current Corpora)))

Lexicography

Writing dictionaries
Dictionaries for language learners should be built on real usage:
    They’re just trying to score brownie points with politicians
    The boss is pleased – that’s another brownie point
Bank of English: brownie point (6 occs), brownie points (76 occs)
Extensive use of corpora to:
    Find concordances and cite real examples
    Extract collocations and describe frequent pairs of words

Concordances

A word and its context:

Language   Concordances
English    s beginning of miracles did Je
           n they saw the miracles which
           n can do these miracles that t
           ain the second miracle that Je
           e they saw his miracles which
French     le premier des miracles que fi
           i dirent: Quel miracle nous mo
           om, voyant les miracles qu’il
           peut faire ces miracles que tu
           s ne voyez des miracles et des
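A concordance like the one above can be produced with a few lines of code. The following is a minimal sketch, not the tool used for the slides: the function name `concordance`, its `width` parameter, and the sample text are all made up for illustration.

```python
import re

def concordance(text, word, width=20):
    """Return each occurrence of `word` with `width` characters of context per side."""
    # Collapse line breaks and repeated spaces so every result fits on one line.
    text = " ".join(text.split())
    lines = []
    for match in re.finditer(r"\b" + re.escape(word) + r"\b", text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the keyword is aligned in a column.
        lines.append(f"{left:>{width}}{match.group()}{right:<{width}}")
    return lines

# Invented sample text, only there to exercise the function.
SAMPLE = "They saw the miracles which he did. No man can do these miracles alone."

for line in concordance(SAMPLE, "miracles", width=10):
    print(line)
```

Padding the left and right contexts to a fixed width keeps the keyword in the same column, which is what makes a concordance easy to scan.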

Collocations

Word preferences: words that occur together

                English             French                German
You say         Strong tea          Thé fort              Schmales Gesicht
                Powerful computer   Ordinateur puissant   Enge Kleidung
You don’t say   Strong computer     Thé puissant          Schmale Kleidung
                Powerful tea        Ordinateur fort       Enges Gesicht

Word Preferences

Strong w                            Powerful w
strong w   powerful w   w           strong w   powerful w   w
161        0            showing     1          32           than
175        2            support     1          32           figure
106        0            defense     3          31           minority
...

Corpora as Knowledge Sources

Short term:
    Describe usage more accurately
    Learn statistical/machine learning models for speech recognition, taggers, parsers
    Assess tools: part-of-speech taggers, parsers
    Automatically derive patterns from annotated or unannotated corpora
Longer term:
    Semantic processing
    Information and knowledge extraction from text

Finite-State Automata

A flexible tool to search and process text
An FSA accepts and generates strings, here ac, abc, abbc, abbbc, abbbbbbbbbbbbc, etc.

[Figure: automaton with states q0, q1, q2; an arc labeled a from q0 to q1, a loop labeled b on q1, and an arc labeled c from q1 to q2]

FSA

Mathematically defined by:
    Q, a finite set of states;
    Σ, a finite set of symbols or characters: the input alphabet;
    q0, a start state;
    F, a set of final states, F ⊆ Q;
    δ, a transition function Q × Σ → Q, where δ(q, i) returns the state to which the automaton moves when it is in state q and consumes the input symbol i.
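The five components of the definition map directly onto data structures. The sketch below encodes the automaton that accepts ac, abc, abbc, etc.; the Python names (`STATES`, `DELTA`, `accepts`) are illustrative choices, and δ is stored as a partial function, so a missing entry means the input is rejected.

```python
# Components of the automaton (Q, Σ, q0, F, δ); all names are illustrative.
STATES = {"q0", "q1", "q2"}      # Q
ALPHABET = {"a", "b", "c"}       # Σ
START = "q0"                     # q0
FINAL = {"q2"}                   # F, a subset of Q
DELTA = {                        # δ, stored as a partial function on Q × Σ
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}

def accepts(symbols):
    """Return True if the automaton ends in a final state after reading all symbols."""
    state = START
    for symbol in symbols:
        if (state, symbol) not in DELTA:
            return False         # no transition defined: reject
        state = DELTA[(state, symbol)]
    return state in FINAL

for string in ["ac", "abbbc", "abbcb", "ab"]:
    print(string, accepts(string))
```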

FSA in Prolog

    % The start state
    start(q0).

    % The final states
    final(q2).

    transition(q0, a, q1).
    transition(q1, b, q1).
    transition(q1, c, q2).

    accept(Symbols) :-
        start(StartState),
        accept(Symbols, StartState).

    accept([], State) :-
        final(State).
    accept([Symbol | Symbols], State) :-
        transition(State, Symbol, NextState),
        accept(Symbols, NextState).

FSA with OpenFst

OpenFst (http://openfst.org) is a comprehensive library to build and process transducers.
OpenFst represents automata in a tabular format.
The first transition is represented by the line:

    0 1 a

and the whole automaton by (fsa1.fst):

    0 1 a
    1 1 b
    1 2 c
    2

FSA with OpenFst (II)

OpenFst only accepts numbers and we need to provide it with a conversion table, where we encode the symbols as integers (symbols.txt):

    <epsilon> 0
    a 1
    b 2
    c 3

OpenFst compiles the text files into a binary format (fsa1.bin):

    $ fstcompile --isymbols=symbols.txt --osymbols=symbols.txt \
        --acceptor fsa1.fst fsa1.bin

FSA with OpenFst (III)

Inputs, abbc or abbcb, are entered as linear chain automata:

    The sequence abbc       The sequence abbcb
    in file input1.fst      in file input2.fst
    0 1 a                   0 1 a
    1 2 b                   1 2 b
    2 3 b                   2 3 b
    3 4 c                   3 4 c
    4                       4 5 b
                            5

    $ fstcompose input1.bin fsa1.bin | fstprint --acceptor \
        --isymbols=symbols.txt
    0 1 a
    1 2 b
    2 3 b
    3 4 c
    4
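Writing the linear chain files by hand becomes tedious for longer inputs. The helper below is a small sketch that generates the text of such a file from a string; `linear_chain` is a hypothetical name and is not part of OpenFst. It only produces the text shown in the slide, one transition per symbol followed by the final state.

```python
def linear_chain(symbols):
    """Return the acceptor text for a linear chain automaton over `symbols`."""
    # One transition per symbol: source state, target state, label.
    lines = [f"{i} {i + 1} {symbol}" for i, symbol in enumerate(symbols)]
    # The last line is the single final state.
    lines.append(str(len(symbols)))
    return "\n".join(lines)

print(linear_chain("abbc"))
```

The result can be saved, for instance as input1.fst, and then compiled with fstcompile in the same way as fsa1.fst.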

Regular Expressions

Regexes are equivalent to FSA and generally easier to use
Constant regular expressions:

    Pattern    String
    regular    A section on regular expressions
    the        The book of the life

The automaton above is described by the regex ab*c

    grep 'ab*c' myFile1 myFile2

While grep was the first regex tool, most programming languages adopt the Perl syntax
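The same regex can be tried from a programming language. Here is a minimal sketch using Python's standard `re` module, whose syntax is close to Perl's for this pattern; the helper names `accepted` and `grep_like` and the sample lines are made up for illustration.

```python
import re

PATTERN = re.compile(r"ab*c")

def accepted(string):
    """True if the whole string is in the language of ab*c, as with the automaton."""
    return PATTERN.fullmatch(string) is not None

def grep_like(lines):
    """Keep the lines that contain a match anywhere, as grep does."""
    return [line for line in lines if PATTERN.search(line)]

print([s for s in ["ac", "abc", "abbbc", "abbcb"] if accepted(s)])
print(grep_like(["xx abbcb yy", "no match here", "ac"]))
```

Note the difference between the two helpers: `fullmatch` requires the entire string to match, while `search`, like grep, only requires a matching substring.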

regex101.com

regex101.com: A site to experiment with and test regular expressions.

Metacharacters

Chars   Descriptions and examples

*       Matches any number of occurrences of the previous character: zero or more.
        Example: ac*e matches the strings ae, ace, acce, accce, etc. as in
        “The aerial acceleration alerted the ace pilot”

?       Matches at most one occurrence of the previous character: zero or one.
        Example: ac?e matches ae and ace as in
        “The aerial acceleration alerted the ace pilot”

+       Matches one or more occurrences of the previous character.
        Example: ac+e matches ace, acce, accce, etc. as in
        “The aerial acceleration alerted the ace pilot”
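The three metacharacters can be compared on the example sentence. This is a short sketch with Python's `re` module; the variable names are illustrative.

```python
import re

SENTENCE = "The aerial acceleration alerted the ace pilot"

# findall returns every non-overlapping match, scanning from left to right.
star = re.findall(r"ac*e", SENTENCE)      # zero or more c
optional = re.findall(r"ac?e", SENTENCE)  # zero or one c
plus = re.findall(r"ac+e", SENTENCE)      # one or more c

print(star)      # ['ae', 'acce', 'ace']
print(optional)  # ['ae', 'ace']
print(plus)      # ['acce', 'ace']
```

The match ae comes from aerial, acce from acceleration, and ace from the word ace itself.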

Metacharacters

Chars   Descriptions and examples

{n}     Matches exactly n occurrences of the previous character.
        Example: ac{2}e matches acce as in
        “The aerial acceleration alerted the ace pilot”

{n,}    Matches n or more occurrences of the previous character.
        Example: ac{2,}e matches acce, accce, etc.

{n,m}   Matches from n to m occurrences of the previous character.
        Example: ac{2,4}e matches acce, accce, and acccce.

To match a metacharacter literally, escape it with a backslash \
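The counted repetitions and the backslash escape can be checked in the same way. The sketch below uses Python's `re` module; the candidate strings and the helper `full` are made up for the illustration.

```python
import re

CANDIDATES = ["ae", "ace", "acce", "accce", "acccce", "accccce"]

def full(pattern, string):
    """True if the pattern matches the whole string."""
    return re.fullmatch(pattern, string) is not None

exactly_two = [s for s in CANDIDATES if full(r"ac{2}e", s)]
two_or_more = [s for s in CANDIDATES if full(r"ac{2,}e", s)]
two_to_four = [s for s in CANDIDATES if full(r"ac{2,4}e", s)]

print(exactly_two)  # ['acce']
print(two_or_more)  # ['acce', 'accce', 'acccce', 'accccce']
print(two_to_four)  # ['acce', 'accce', 'acccce']

# Escaped with a backslash, * is an ordinary character, not a repetition operator.
print(full(r"ac\*e", "ac*e"), full(r"ac\*e", "acce"))  # True False
```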

The Dot Metacharacter

The dot . is a metacharacter that matches one occurrence of any character except a new line
a.e matches the strings ale and ace in:
    The aerial acceleration alerted the ace pilot
as well as age, ape, are, ate, awe, axe, or aae, aAe, abe, aBe, a1e, etc.
.* matches any string of characters until we encounter a new line.
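A brief sketch of the dot's behavior with Python's `re` module, under its default settings, where the dot does not match a new line; the two-line sample text is invented.

```python
import re

SENTENCE = "The aerial acceleration alerted the ace pilot"

# a.e: an a, any single character except a new line, then an e.
dot_matches = re.findall(r"a.e", SENTENCE)
print(dot_matches)  # ['ale', 'ace']

# .* stops at the first new line.
TEXT = "first line\nsecond line"
first_line = re.match(r".*", TEXT).group()
print(first_line)   # first line

# The dot does not match the new line character itself.
print(re.fullmatch(r"a.e", "a\ne") is None)  # True
```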
