Converting Fieldbooks to Databases
Talk given by Carsten Ehrler for the Project Seminar “Text Mining for Historical Documents”, Computational Linguistics Department, Saarland University - 23.02.2009
Based on the publication: Sander Canisius and Caroline Sporleder. Bootstrapping information extraction from field books. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 827-836.

Introduction
“Sander Canisius and Caroline Sporleder. Bootstrapping information extraction from field books. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 827-836.”

Introduction
The same citation, split into fields:
Author: Canisius, Sander; Sporleder, Caroline
Title: Bootstrapping information extraction from field books
Type: Proceedings
Conference: Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
Year: 2007
Location: Prague, Czech Republic
Page: 827-836

Overview
• Semi-structured documents
• Field segmentation
• Field segmentation methods
• Practical examples

Data Sources
Examples of semi-structured documents:
• apartment advertisements
• logs (e.g. archeological findings)
• business cards
• web pages
• ...

Example
Descriptions of two zoological specimens:
Leptophis ahaetulla, road to Overtoom, in bush above water in the process of eating Hyla minuta 16-V-1968. RMNH 15100
Hyla minuta 1 ♀ 2 ♂ Las Claritas, 9-VI-1978 quaking near water 50 cm above water surface, near secondary vegetation, 200 m, M.S. Hoogmoed, RMNH 27217 27219

Pitfalls
Leptophis ahaetulla, road to Overtoom, in bush above water in the process of eating Hyla minuta 16-V-1968. RMNH 15100
Hyla minuta 1 ♀ 2 ♂ Las Claritas, 9-VI-1978 quaking near water 50 cm above water surface, near secondary vegetation, 200 m, M.S. Hoogmoed, RMNH 27217 27219
Fields to identify: genus, species, gender, place, biotope, remark, date, collector, reg.no.

Pitfalls
• missing entries
• variable ordering
• mixed delimiters
• variable length
• encoding (e.g. date)

Databases
Goal: transform semi-structured text into a database

Field      Entry 1                    Entry 2
genus      Leptophis                  Hyla
species    ahaetulla                  minuta
gender     -                          1 female; 2 male
place      road to Overtoom           Las Claritas
biotope    in bush above water        quaking near water 50 cm
remark     in the process of eating   -
date       16/05/1968                 09/06/1978
collector  -                          M.S. Hoogmoed
reg.no     15100                      27217; 27219

Databases
Goal: transform semi-structured text into a database
We gain structure, but this implies a loss of information!
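The date column illustrates the "encoding" pitfall: the field-book entries write the month as a Roman numeral (16-V-1968), while the database holds 16/05/1968. A minimal sketch of that conversion, assuming every date follows the day-numeral-year pattern of the two example entries; the function name `normalize_date` is hypothetical.

```python
import re

# Roman numerals for the twelve months, as used in dates like "16-V-1968".
ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# Assumed layout: day, Roman-numeral month, four-digit year, joined by hyphens.
DATE_PATTERN = re.compile(r"^(\d{1,2})-([IVX]+)-(\d{4})$")


def normalize_date(raw: str) -> str:
    """Convert 'D-ROMAN-YYYY' into 'DD/MM/YYYY'; raise ValueError otherwise."""
    match = DATE_PATTERN.match(raw.strip().rstrip("."))
    if match is None:
        raise ValueError(f"unrecognized date format: {raw!r}")
    day, roman, year = match.groups()
    if roman not in ROMAN_MONTHS:
        raise ValueError(f"unknown month numeral: {roman!r}")
    return f"{int(day):02d}/{ROMAN_MONTHS[roman]:02d}/{year}"


print(normalize_date("16-V-1968"))
print(normalize_date("9-VI-1978"))
```

Dates that do not fit the assumed pattern raise an error instead of being guessed, so they can be reviewed by hand.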

Why use Databases?
Structured text gives lots of advantages:
• We can formulate complex queries over database entries
• E.g.: all locations of a certain collector sorted by date => visualize on a map
• Citation flow graph
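The collector query mentioned above could look roughly like this once the entries are in a relational table. This is only a sketch using Python's built-in sqlite3 module; the table name `specimen`, its column names, and the ISO date format (chosen so that text sorting matches date order) are assumptions, and the rows are the two example entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimen ("
    "genus TEXT, species TEXT, place TEXT, collected TEXT, collector TEXT)"
)
# The two example entries; a missing collector is stored as NULL.
rows = [
    ("Leptophis", "ahaetulla", "road to Overtoom", "1968-05-16", None),
    ("Hyla", "minuta", "Las Claritas", "1978-06-09", "M.S. Hoogmoed"),
]
conn.executemany("INSERT INTO specimen VALUES (?, ?, ?, ?, ?)", rows)


def places_by_collector(connection, collector):
    """Return (place, date) pairs for one collector, oldest first."""
    cursor = connection.execute(
        "SELECT place, collected FROM specimen "
        "WHERE collector = ? ORDER BY collected",
        (collector,),
    )
    return cursor.fetchall()


print(places_by_collector(conn, "M.S. Hoogmoed"))
```

The resulting list of places and dates is what a map visualization would take as input.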

Main Question
How can we transform a semi-structured text into a database format?
Task known as: Field Segmentation
“Field segmentation refers to the automated finding and labeling [of fields] in object or event descriptions”

Requirements
How can we transform a semi-structured text into a database format?
Requirements (for a good method):
• Low error rate
• Robust
• Reusable
• Unsupervised (or at least requiring little training data)

Methods
• By manual inspection: expensive, error-prone, often requires domain experts
• Apply methods from CS:
  • Write a parser or rule set: not reusable, deals badly with semi-structured text
  • Probabilistic methods: apply supervised or unsupervised machine learning techniques

Methods
• Almost all common machine learning methods for field segmentation are supervised
• E.g. using Hidden Markov Models or trained context-free grammars
• Drawback: requires effort to generate training data

Methods
How to bootstrap a field segmentation algorithm from an existing database?
=> Approach by S. Canisius and C. Sporleder

Dataset
For the evaluation of the method, two datasets were used:
• RA dataset: field book about reptiles and amphibians; 16670 entries in DB; 19 fields
• Pisces dataset: field book about fish specimens; 1375 entries in DB; 4 fields
Both datasets were provided by the Dutch National Museum of Natural History

Field Segmenter
Pipeline: Token Sequence => Segmented Text => Labeled Text
Main Ideas:
• Use a trained language model to partition a semi-structured text into a pre-segmentation
• A Hidden Markov Model assigns the most likely label to each segment

Segmentation Model
Assumption: Segment boundaries are due to unlikely tokens
Train a bigram language model on the entries in your database
=> Use Viterbi with the language model to obtain a segmentation
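The boundary assumption can be illustrated with a small sketch. It is deliberately simpler than the method on the slide: instead of a Viterbi search it places a boundary wherever the add-one smoothed bigram probability drops below a fixed threshold. The function names, the threshold value, and the choice to drop punctuation are all assumptions; the training values come from the example table.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep word tokens only (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())


def train_bigrams(field_values):
    """Collect vocabulary, history counts and bigram counts from field values."""
    vocab, history, bigrams = set(), Counter(), Counter()
    for value in field_values:
        tokens = tokenize(value)
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            history[prev] += 1
            bigrams[(prev, word)] += 1
    return vocab, history, bigrams


def segment(text, model, threshold):
    """Split text before every token that is unlikely given its predecessor."""
    vocab, history, bigrams = model
    tokens = tokenize(text)
    segments, current = [], tokens[:1]
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
        prob = (bigrams[(prev, word)] + 1) / (history[prev] + len(vocab) + 1)
        if prob < threshold:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


# Field values taken from the example database entries.
FIELD_VALUES = [
    "Leptophis", "ahaetulla", "road to Overtoom", "in bush above water",
    "in the process of eating", "Hyla", "minuta", "Las Claritas",
]
model = train_bigrams(FIELD_VALUES)
print(segment("road to Overtoom, in bush above water", model, threshold=0.08))
```

A fixed threshold decides each boundary locally, whereas a Viterbi search scores the segmentation as a whole; the sketch only shows why token pairs never seen inside a database field are good boundary candidates.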

HMM Parameters
For an HMM, several parameters have to be derived from the data:
• Initial distribution: $P(X_0 = s_i)$
• State-transition distribution: $P(X_t = s_i \mid X_{t-1} = s_j)$
• State-emission distribution: $P(O_t = o_i \mid X_t = s_i)$

HMM Parameters
Use your database to derive these parameters where it provides the information.
For the rest: use the Baum-Welch algorithm.
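Once the three distributions are available, labeling a segmented entry amounts to finding the most likely state sequence. The sketch below shows standard Viterbi decoding over whole segments. All probabilities are hypothetical toy values set by hand for illustration; they are not estimated from a database or with Baum-Welch, and the three-field state set is an assumption.

```python
import math

STATES = ["genus", "species", "place"]

# Hypothetical toy parameters, chosen by hand for illustration only.
INITIAL = {"genus": 0.8, "species": 0.1, "place": 0.1}
TRANSITION = {
    "genus": {"genus": 0.05, "species": 0.9, "place": 0.05},
    "species": {"genus": 0.05, "species": 0.05, "place": 0.9},
    "place": {"genus": 0.1, "species": 0.1, "place": 0.8},
}
EMISSION = {
    "genus": {"hyla": 0.5, "leptophis": 0.5},
    "species": {"minuta": 0.5, "ahaetulla": 0.5},
    "place": {"las claritas": 0.5, "road to overtoom": 0.5},
}
UNSEEN = 1e-6  # floor probability for segments a state never emitted


def emit(state, observation):
    """Log-probability that `state` emits `observation`."""
    return math.log(EMISSION[state].get(observation, UNSEEN))


def viterbi(observations):
    """Return the most likely state sequence for the observed segments."""
    if not observations:
        return []
    # scores[s] = best log-probability of any path ending in state s
    scores = {s: math.log(INITIAL[s]) + emit(s, observations[0]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANSITION[p][s])
            )
            new_scores[s] = (
                scores[best_prev] + math.log(TRANSITION[best_prev][s]) + emit(s, obs)
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi(["hyla", "minuta", "las claritas"]))
```

Working in log space avoids numeric underflow on longer entries; the floor value `UNSEEN` stands in for proper smoothing of the emission distribution.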

Baseline
The HMM is evaluated on RA and Pisces against several baselines:
• Majority: always assign the majority label
• Exact: match longest substring with DB
• Unigram: match most likely DB entry
• Trigram: match most likely DB entry
• Voted trigram: match most likely DB entry over all trigrams
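As an illustration of the "Exact" baseline, here is one possible reading of it: scan the entry from left to right and, at each position, label the longest token span that literally occurs as a value in the database. The dictionary layout of `database` and the function name are hypothetical, and the slide does not say how ties or unmatched tokens are handled; here unmatched tokens get no label.

```python
def exact_match_labels(tokens, database):
    """Label tokens by greedily matching the longest span found in the database.

    `database` maps a field name to a list of known values (assumed layout).
    Tokens that match nothing are labeled None.
    """
    # Index every known value by its lowercased token tuple.
    known = {}
    for field, values in database.items():
        for value in values:
            known.setdefault(tuple(value.lower().split()), field)

    lowered = [token.lower() for token in tokens]
    labels = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i first, then shorter ones.
        for j in range(len(tokens), i, -1):
            field = known.get(tuple(lowered[i:j]))
            if field is not None:
                labels[i:j] = [field] * (j - i)
                i = j
                break
        else:
            i += 1  # nothing matched at this position
    return labels


# Values taken from the example database entries.
DATABASE = {
    "genus": ["Hyla", "Leptophis"],
    "species": ["minuta", "ahaetulla"],
    "place": ["Las Claritas", "road to Overtoom"],
}
print(exact_match_labels("Hyla minuta 1 Las Claritas".split(), DATABASE))
```

Such a baseline can only label text that already appears verbatim in the database, which is why it serves as a lower reference point for the learned model.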

Results
[Figure: bar charts for the RA dataset and the Pisces dataset, comparing HMM and Voted Trigram on token accuracy and F-score (y-axis: %, 0 to 100)]

Results
[Same figure, with the annotation “hard” added]

Conclusion
• Bootstrapping a field segmentation method is possible
• You won’t get it for free, but it works with very little training data
• All necessary information can be derived from a preexisting database

That’s it... Thanks for your attention. Questions?
