Domain-Specific Corpora Many Document Features Grammatical Text - PowerPoint PPT Presentation

Domain-Specific Corpora

Many Document Features
Grammatical text: sentences plus some formatting & links; paragraphs without formatting
  Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
Charts

Pattern Complexity
Regular set: U.S. phone numbers
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299
Closed set: U.S. states
  He was born in Alabama…
  The big Wyoming sky…
Complex: U.S. postal addresses
  University of Arkansas, P.O. Box 140, Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
Ambiguous, needing context: Person names
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
Unusual language models
  “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7 ○ 7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
Courtesy of Andrew McCallum

Small amount of relevant content; irrelevant content very similar to relevant content

Spreadsheets Created For Human Consumption

Databases with PDF Code Books

Data In Web Tables

Practical Considerations
How good (precision/recall) is necessary?
  High precision when showing KG nodes to users
  High recall when used for ranking results
How long does it take to construct?
  Minutes, hours, days, months
What expertise do I need?
  None (domain expertise), patience (annotation), scripting, machine learning guru
What tools can I use?
  Many …
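
To make the precision/recall question concrete, here is a minimal sketch of how the two measures could be computed for a set of extractions. The function name and the sample values are illustrative assumptions, not part of the original slides.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of correct items that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: what an extractor returned vs. what an annotator marked.
extracted = {"LGYV", "Legacy Ventures Intl, Inc.", "2017-07-15"}
gold = {"LGYV", "Legacy Ventures Intl, Inc.", "2017-07-14", "391,030"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

An extractor tuned for showing nodes to users would aim to push the first number up, while one feeding a ranking function would favor the second.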

Information Extraction Process
Segmentation, then Data Extraction
Example extraction:
  Name: Legacy Ventures Intl, Inc.
  Stock: LGYV
  Date: 2017-07-14
  Market Cap: 391,030

Segmentation

Segmentation: homogeneous blocks

Segmentation
Block Type                      Tool
Repeating blocks (short tail)   Web wrappers
Tables (long tail)              Data table extractors
Main content (long tail)        https://code.google.com/archive/p/arc90labs-readability/
                                https://github.com/kohlschutter/boilerpipe
Microdata (long tail)           https://github.com/namsral/microdata

Web Wrappers

myDIG Demo Focusing On Inferlink Web Wrapper

Table Extraction

Classification Of Web Tables
Table type                       % total   count
“Tiny” tables                    88.06     12.34B
HTML forms                       1.34      187.37M
Calendars                        0.04      5.50M
Filtered Non-relational, total   89.44     12.53B
Other non-rel (est.)             9.46      1.33B
Relational (est.)                1.10      154.15M
Source: Cafarella’08

Tables In The Human Trafficking Domain
[Charts: number of rows, number of columns]

Data Tables: Relational

Data Tables: Matrix Table, List Table, Entity Table

Table Type Classification
Feature-based supervised classification
  Cafarella’08
  Crestan’11
  Eberius’15
Deep Learning
  Nishida’2017

Identifying Data Tables
Heuristic: HTML tables that don’t contain nested tables and contain at least 2 rows and 2 columns
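
The heuristic above can be sketched with Python's standard html.parser module. This is an illustrative sketch under stated assumptions, not the implementation used in the presentation: the class and function names are made up here, and colspan/rowspan attributes are ignored.

```python
from html.parser import HTMLParser


class TableStats(HTMLParser):
    """Collect row count, widest row, and nesting info for every <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tables currently open, innermost last
        self.tables = []  # finished tables, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                # The enclosing table contains a nested table.
                self.stack[-1]["has_nested"] = True
            self.stack.append({"rows": 0, "cols": 0, "cur": 0, "has_nested": False})
        elif self.stack and tag == "tr":
            self.stack[-1]["rows"] += 1
            self.stack[-1]["cur"] = 0
        elif self.stack and tag in ("td", "th"):
            table = self.stack[-1]
            table["cur"] += 1
            table["cols"] = max(table["cols"], table["cur"])

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.tables.append(self.stack.pop())


def is_data_table(table):
    """No nested tables, at least 2 rows and at least 2 columns."""
    return not table["has_nested"] and table["rows"] >= 2 and table["cols"] >= 2


def count_data_tables(html):
    parser = TableStats()
    parser.feed(html)
    parser.close()
    return sum(is_data_table(t) for t in parser.tables)


# A layout table wrapping a 2x2 table: only the inner one qualifies.
layout = ("<table><tr><td><table><tr><td>a</td><td>b</td></tr>"
          "<tr><td>c</td><td>d</td></tr></table></td><td>x</td></tr>"
          "<tr><td>y</td><td>z</td></tr></table>")
print(count_data_tables(layout))
```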

Extracting Data From Tables Co-embedding table structure and content words

Data Extraction

Data Extraction Techniques
  Glossary
  Regular expressions
  Natural language rules
  Named entity recognition
  Sequence labeling (Conditional Random Fields)

Glossary Extraction

Glossary Extraction
Simple list of words or phrases to extract
Challenges
  Ambiguity: Charlotte is a name of a person and a city
  Colloquial expressions: “Asia Broadband, Inc.” vs “Asia Broadband”
Research
  Improving precision of glossary extractions using context
  Creating/extending glossaries automatically
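
A minimal sketch of glossary extraction, assuming a plain list of phrases and case-insensitive whole-word matching. The function names are illustrative. Note that the sketch finds both mentions of "Charlotte" but cannot tell the person from the city, which is the ambiguity challenge listed above.

```python
import re


def build_glossary_matcher(glossary):
    """Compile one regex that matches any glossary phrase as a whole word."""
    # Longest phrases first so "Asia Broadband, Inc." wins over "Asia Broadband".
    phrases = sorted(glossary, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def extract(text, matcher):
    """Return (offset, matched text) pairs in document order."""
    return [(m.start(), m.group(0)) for m in matcher.finditer(text)]


glossary = ["Asia Broadband, Inc.", "Asia Broadband", "Charlotte"]
matcher = build_glossary_matcher(glossary)
text = "Charlotte bought shares of Asia Broadband, Inc. in Charlotte."
print(extract(text, matcher))
```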

Regex Extraction

Extraction Using Regular Expressions
Too difficult for non-programmers
  Regex for North American phone numbers:
  ^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
Brittle and difficult to adapt to specific domains
  unusual nomenclature and short-hands
  obfuscation
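
As a hedged illustration of both the technique and its brittleness, the sketch below uses a deliberately simplified phone pattern (an assumption for readability, looser than the regex quoted above). It finds the two well-formed numbers from the earlier examples and misses the obfuscated one.

```python
import re

# Simplified pattern, assumed for illustration only: optional +1, an area
# code with or without parentheses, then 3 digits and 4 digits.
PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]*)?"          # optional country code
    r"(?:\(\s*(\d{3})\s*\)|(\d{3}))"    # area code: "(413)" or "413"
    r"[\s.-]*(\d{3})[\s.-]*(\d{4})(?!\d)"
)


def extract_phones(text):
    """Return phone numbers normalized to AAA-EEE-NNNN."""
    results = []
    for m in PHONE.finditer(text):
        area = m.group(1) or m.group(2)
        results.append(f"{area}-{m.group(3)}-{m.group(4)}")
    return results


text = ("Phone: (413) 545-1323. The CALD main office can be "
        "reached at 412-268-1299. Obfuscated: 7 ○ 7~7two7~7four77")
print(extract_phones(text))
```

Handling the obfuscated form would need extra rules (spelled-out digits, filler characters), which is where regex-only approaches tend to become hard to maintain.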

NLP Rule-Based Extraction

https://spacy.io/docs/usage/rule-based-matching Kejriwal, Szekely

NLP Rule-Based Extraction: Pattern, Tokenization, Matching

Tokenization matters, a lot
  “My name is Pedro”
  “310-822-1511”: one token (310-822-1511) or five tokens (310 - 822 - 1511)
  “Candy is here”
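
A small sketch of why the tokenizer choice matters: the same phone number becomes one token or five depending on how punctuation is treated. Both tokenizers are simple stand-ins written for this example, not the ones used in the presentation.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached."""
    return text.split()


def punctuation_tokenize(text):
    """Keep runs of letters/digits together; other non-space characters stand alone."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)


for text in ["My name is Pedro", "310-822-1511"]:
    print(whitespace_tokenize(text), punctuation_tokenize(text))
```

A pattern written as "three digits, dash, three digits, dash, four digits" only matches under the second tokenizer, while a pattern expecting a single phone-shaped token only matches under the first.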

Token Properties
Surface properties
  Literal, type, shape, capitalization, length, prefix, suffix, minimum, maximum
Language properties
  Part of speech tag, lemma, dependency
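
The surface properties can be computed without any language model. The sketch below is one possible reading of them; the category names ("alphabetic", "title", and so on) and the three-character prefix/suffix length are assumptions made for this example.

```python
def token_shape(token):
    """Map letters to X/x and digits to d, keeping other characters as they are."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


def token_type(token):
    if token.isalpha():
        return "alphabetic"
    if token.isdigit():
        return "numeric"
    if not any(ch.isalnum() for ch in token):
        return "punctuation"
    return "mixed"


def capitalization(token):
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    if token.islower():
        return "lower"
    return "none"


def surface_properties(token):
    return {
        "literal": token,
        "type": token_type(token),
        "shape": token_shape(token),
        "capitalization": capitalization(token),
        "length": len(token),
        "prefix": token[:3],
        "suffix": token[-3:],
    }


print(surface_properties("Pedro"))
print(surface_properties("310-822-1511"))
```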

Token Types

Patterns
Pattern := Token-Spec
         | [Token-Spec]         (optional)
         | Token-Spec +         (one or more)
         | Token-Spec Pattern
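
The pattern grammar can be sketched as a small backtracking matcher over tokens. Here a pattern is assumed to be a list of (token test, quantifier) pairs, where the quantifier is "1" (exactly one), "?" (optional) or "+" (one or more). The representation and all names are illustrative, not the DIG or spaCy API.

```python
def literal(word):
    """Token-Spec matching one exact word, ignoring case."""
    return lambda token: token.lower() == word


def capitalized(token):
    """Token-Spec matching any token that starts with an uppercase letter."""
    return token[:1].isupper()


def match_at(tokens, i, pattern):
    """Return the end index of a match starting at tokens[i], or None."""
    if not pattern:
        return i
    (test, quant), rest = pattern[0], pattern[1:]
    matches_here = i < len(tokens) and test(tokens[i])
    if quant == "?":
        if matches_here:
            end = match_at(tokens, i + 1, rest)
            if end is not None:
                return end
        # Optional spec skipped.
        return match_at(tokens, i, rest)
    if not matches_here:
        return None
    if quant == "+":
        # Greedy: try to consume another token with the same spec first.
        end = match_at(tokens, i + 1, pattern)
        if end is not None:
            return end
    return match_at(tokens, i + 1, rest)


def find_matches(tokens, pattern):
    """Scan left to right and return non-overlapping matched token spans."""
    spans, i = [], 0
    while i < len(tokens):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            spans.append(tokens[i:end])
            i = end
        else:
            i += 1
    return spans


# "my" ["full"] "name" "is" Capitalized+
NAME_PATTERN = [
    (literal("my"), "1"),
    (literal("full"), "?"),
    (literal("name"), "1"),
    (literal("is"), "1"),
    (capitalized, "+"),
]

tokens = "Hello my name is Pedro Szekely and I live here".split()
print(find_matches(tokens, NAME_PATTERN))
```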

Positive/Negative Patterns
Positive (general): generate candidates
Negative (specific): remove candidates that overlap positive candidates
Output: positive candidates not removed by a negative pattern
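
One way to read the positive/negative scheme is sketched below: general positive patterns propose candidate spans, and any candidate that overlaps a match of a more specific negative pattern is dropped. The overlap rule and the example patterns are assumptions made for illustration; regular expressions stand in for token patterns to keep the sketch short.

```python
import re


def spans(patterns, text):
    """All (start, end) character spans matched by any of the patterns."""
    return [m.span() for p in patterns for m in re.finditer(p, text)]


def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]


def extract(text, positive, negative):
    """Keep positive candidates that do not overlap any negative match."""
    blocked = spans(negative, text)
    kept = [s for s in spans(positive, text)
            if not any(overlaps(s, b) for b in blocked)]
    return [text[start:end] for start, end in sorted(kept)]


# General positive pattern: any 3-5 letter all-caps word could be a ticker.
positive = [r"\b[A-Z]{3,5}\b"]
# Specific negative patterns: known non-tickers, some identified by context.
negative = [r"\bCEO\b", r"\b(?:on|from) (?:CNN|NPR)\b"]

text = "The CEO said on CNN that LGYV will rise."
print(extract(text, positive, negative))
```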

DIG Demo

NLP Rule-Based Extraction
Advantages
  Easy to define
  High precision
  Recall increases with number of rules
Disadvantages
  Text must follow strict patterns

Named-Entity Recognizers

Named Entity Recognizers
Machine learning models
  people, places, organizations and a few others
SpaCy
  complete NLP toolkit, Python (Cython), MIT license
  code: https://github.com/explosion/spaCy
  demo: http://textanalysisonline.com/spacy-named-entity-recognition-ner
Stanford NER
  part of Stanford’s NLP software library, Java, GNU license
  code: https://nlp.stanford.edu/software/CRF-NER.shtml
  demo: http://nlp.stanford.edu:8080/ner/process

https://spacy.io/docs/usage/entity-recognition

https://demos.explosion.ai/displacy-ent

Named Entity Recognizers
Advantages
  Easy to use
  Tolerant of some noise
  Easy to train
Disadvantages
  Performance degrades rapidly for new genres, language models
  Requires hundreds to thousands of training examples

Conditional Random Fields

Conditional Random Fields (CRF)
Good for fields that have regular text structure/context

Modeling Problems With CRF
i   X1 (word)   X2 (capitalized)   X3 (POS Tag)      Y (entity)
1   My          1                  Possessive Pron   Other
2   name        0                  Noun              Other
3   is          0                  Verb              Other
4   Pedro       1                  Proper Noun       Person-Name
5   Szekely     1                  Proper Noun       Person-Name
Other common features: lemma, prefix, suffix, length
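
The feature table above can be produced by a small feature function, sketched here. The feature names, the two-character prefix/suffix, and the neighboring-word context features are assumptions for this example. Per-token dictionaries of this kind are a typical input format for CRF toolkits, but no specific toolkit is used or implied.

```python
def token_features(tokens, pos_tags, i):
    """Features for token i: the slide's X1-X3 plus a few common extras."""
    word = tokens[i]
    features = {
        "word": word.lower(),                    # X1
        "capitalized": int(word[:1].isupper()),  # X2
        "pos": pos_tags[i],                      # X3
        "prefix": word[:2].lower(),
        "suffix": word[-2:].lower(),
        "length": len(word),
    }
    # Context features: neighboring words, or sentence-boundary markers.
    if i > 0:
        features["prev_word"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next_word"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens, pos_tags):
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


tokens = ["My", "name", "is", "Pedro", "Szekely"]
pos_tags = ["Possessive Pron", "Noun", "Verb", "Proper Noun", "Proper Noun"]
labels = ["Other", "Other", "Other", "Person-Name", "Person-Name"]  # Y

X = sentence_features(tokens, pos_tags)
for features, label in zip(X, labels):
    print(label, features)
```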

CRF Advantages/Disadvantages
Advantages
  Expressive
  Tolerant of noise
  Stood test of time
  Software packages available
Disadvantages
  Requires feature engineering
  Requires thousands of training examples

Open Information Extraction

http://openie.allenai.org/

Practical IE Technologies Glossary Regex NLP Rules CRF NER Table Semi-Structured assemble O(1000) O(10) hours hours minutes zero Effort glossary annotations annotations high, minimal low minimal low-medium zero minimal Expertise programmer medium medium- medium- high high high high Precision (ambiguity) high high medium low medium high medium medium high Recall (formatting) f(# regex) f(# rules) single wide wide wide genre genre narrow Coverage site

How to represent KGs?

KG Definition
A directed, labeled multi-relational graph representing facts/assertions as triples
  (h, r, t): head entity, relation, tail entity
  (s, p, o): subject, predicate, object
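
A minimal sketch of the triple representation, assuming an in-memory set of (head, relation, tail) tuples with an index by head entity. The class names are illustrative, and the edge directions in the example are an assumption about how the entities used elsewhere in the presentation relate.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


class KnowledgeGraph:
    """Directed, labeled multi-relational graph stored as a set of triples."""

    def __init__(self):
        self.triples = set()
        self.outgoing = defaultdict(set)  # head entity -> triples leaving it

    def add(self, head, relation, tail):
        triple = Triple(head, relation, tail)
        self.triples.add(triple)
        self.outgoing[head].add(triple)

    def tails(self, head, relation):
        """All tail entities reachable from `head` via `relation`."""
        return sorted(t.tail for t in self.outgoing.get(head, ())
                      if t.relation == relation)


kg = KnowledgeGraph()
company = "Legacy Ventures International Inc"
kg.add(company, "stock-ticker", "LGYV")
kg.add(company, "is-a", "Company")
kg.add(company, "promoter", "Damn Good Penny Stocks")

print(kg.tails(company, "stock-ticker"))
```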

Simplest Knowledge Graph
Entities
[Graph: nodes LGYV, Legacy Ventures International Inc, Damn Good Penny Stocks; edge label mentions (three times)]
Easiest to build

Simple, But Useful KG
Entities + properties
[Graph: nodes LGYV, Legacy Ventures International Inc, Damn Good Penny Stocks; edge labels stock-ticker, company, promoter]
“Easy” to build

Semantic Web KG (RDF/OWL)
Entities + properties + classes
[Graph: nodes LGYV, Company, Legacy Ventures International Inc, Damn Good Penny Stocks; edge labels stock-ticker, is-a (twice), promoter]
Very hard to build

“Ideal” KG
Entities + properties + classes + qualifiers
[Graph: nodes LGYV, Company, Legacy Ventures International Inc, Damn Good Penny Stocks; edge labels stock-ticker, is-a (twice), promoter; qualifiers source: stockreads.com, start-date: June 2017]
Very very hard to build

Semi-Structured KG
Entities + properties + text + provenance + confidence
[Graph labels: image-id-123; 0.92; isi-extractor; source; extraction method; segment (150,230)x(560,720); confidence; origin; 0.72; media type: image; 0.14; reliability; ambiguity; date: 2 june 2014; # sources: 2; provenance; qualifiers; confidence; 0.81; location; event 123]
“Not so hard” to build
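
One possible shape for such a graph is sketched below: each property value is stored as an extraction carrying its own confidence and provenance. The class layout, the way the slide's labels are attached to fields, and the second (low-confidence) extraction are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Provenance:
    source: str        # document or image the value came from
    method: str        # extractor that produced it
    segment: str = ""  # location inside the source, if known


@dataclass
class Extraction:
    value: str
    confidence: float
    provenance: Provenance


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)  # property -> [Extraction]

    def add(self, prop, extraction):
        self.properties.setdefault(prop, []).append(extraction)

    def best(self, prop, threshold=0.0):
        """Highest-confidence extraction for `prop` at or above `threshold`."""
        candidates = [e for e in self.properties.get(prop, [])
                      if e.confidence >= threshold]
        return max(candidates, key=lambda e: e.confidence, default=None)


event = Node("event 123")
event.add("date", Extraction(
    "2 june 2014", 0.92,
    Provenance("image-id-123", "isi-extractor", "(150,230)x(560,720)")))
# Hypothetical competing extraction from another source.
event.add("date", Extraction(
    "3 june 2014", 0.14, Provenance("image-id-456", "isi-extractor")))

print(event.best("date").value)
```

Keeping every candidate with its confidence, rather than committing to one value at extraction time, lets downstream queries choose their own precision/recall trade-off through the threshold.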
