Recycling Named Entity Taggers Unsupervised Domain and Language - PowerPoint PPT Presentation

Recycling Named Entity Taggers Unsupervised Domain and Language Adaptation for Named Entity Recognition based on Parallel Corpora Master thesis of Chrysoula Zerva EPFL supervisor SONY supervisor Dr Martin Rajman Dr Wilhelm Haag

Outline ● Named Entity Recognition task ○ Definition, Process, Evaluation ● Importance ○ of NER ○ of Language (& domain) adaptation ● Core System ○ Architecture, Early results, Problems ● Evaluation ○ Final Results and Error analysis ● Conclusions

Named Entity Recognition

Named Entity: Definition Named entities : Atomic elements (in a text) that consist of one or more consecutive words and belong to predefined categories (labels) .

Named Entity: Definition Named entities : Atomic elements (in a text) that consist of one or more consecutive words and belong to predefined categories (labels) . Common labels: ORGANISATION, PERSON, LOCATION

Named Entity: Definition Named entities : Atomic elements (in a text) that consist of one or more consecutive words and belong to predefined categories (labels) . Common labels: ORGANISATION, PERSON, LOCATION The word sequence has to refer to a particular representation of the label For example: The president failed to explain the new military policy→ NO NE The president Barack Obama failed to explain the new military policy

Named Entity Recognition : LABELS Name expressions: PERSON: People, including fictional Mr Thomson explained... NORP: Nationalities or religious or political groups The swiss law prohibits... FACILITY: Buildings, airports, highways, bridges Our reporter at the White House ... ORGANIZATION: Companies, agencies, institutions EPFL is located near... GPE: Countries, cities, states, administrative areas Lausanne has a population of... LOCATION: Non-GPE locations, mountains, rivers The situation in the Balkans is... PRODUCT: Vehicles, weapons, foods (Not services) He is driving an SUV car... EVENT: Named hurricanes, battles, sports events After the second world war the ... WORK OF ART: Titles of books, songs “ Lord of the Rings ” is a three ... LAW: Named documents made into laws In the European Constitution... LANGUAGE: Any named language English is an international...

Named Entity Recognition : LABELS Time and Date expressions: DATE: Absolute or relative dates or periods Last year the results... TIME: Times smaller than a day Tomorrow at noon ... PERCENT: Percentage (including “%”) An estimated 5% of the people... MONEY: Monetary values, including unit A monthly salary of 5000$ QUANTITY: Measurements, as of weight or distance It weighs 3 pounds. ORDINAL: “first”, “second”, etc The first time that I ... CARDINAL: Numerals that do not fall under another type At least three people

Named Entity Recognition : LABELS Choosing criterion : Sufficient training resources Label distribution vs F-score performance 1.00 0.80 Ontonotes (pre-annotated) Europarl (non-annotated) 0.60 Europarl TestSet F-Score (Europarl) 0.40 Fscore (Ontonotes) 0.20 0.00 GPE DATE NORP PERCENT LOC QUANTITY EVENT PRODUCT LANGUAGE ORG PERSON CARDINAL MONEY ORDINAL TIME WORK_OF_ART FAC LAW

Named Entity Recognition Step 1: Named Entity Identification Step 2: Named Entity Classification

Named Entity Recognition Step 1: Named Entity Identification Step 2: Named Entity Classification Classify every token under the following set of Classify the tokens that are part of a NE under labels: (BIOES scheme) given set of predefined labels Label Meaning PERCENT B beginning of NE ORDINAL CARDINAL ORGANISATION I inside NE DATE PERSON O outside NE LOCATION E end of NE GPE NORP S single NE

Feature Extraction

Feature Extraction: Performed always: Before training AND Before parsing Preprocessing: ● tokenization ● Part Of Speech tagging ● Use of gazetteers, lexicons containing NEs Feature categories: ● Character-based (N-grams, Capitalised, All-Capitalised, Special Character, Numeric) ● Lexical (included in a gazetteer, lexicon, wordForm, left WordForm, right wordForm) ● Grammatical (Genitive, POS tag, left POS tag, right POS tag) ● Other (position in sentence, context (sequence of words)) ++ Combined Features : pair combination of the above

Feature extraction: Example CLASSIFICATION We deal with a horrific story in Kosovo IDENTIFICATION Capitalised Numeric Genitive N-grams Right word Similarly for the rest of the features... ... ... ... We 0 0 1 0 1 1 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1 1 0 1 1 O O ... ... ... are 0 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 O O ... ... ... dealing 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 1 0 O O ... ... ... with 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 1 O O ... ... ... a 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 1 O O horrific 1 0 1 ... 0 1 1 1 0 0 1 ... 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 ... 1 1 1 0 0 1 0 1 0 1 1 O O ... ... ... situation 1 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 1 O O ... ... ... in 0 1 1 1 0 1 0 0 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 0 O O Kosovo 1 0 0 ... 0 0 0 1 0 1 0 ... 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 ... 0 1 0 1 1 0 1 0 0 0 1 I GPE . 0 1 1 ... 0 1 0 1 1 1 1 ... 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 ... 0 0 1 1 0 0 0 1 1 1 0 O O

Evaluation: Metrics and Methods

Evaluation: Exact and Partial Matches Original Output1 Output2 Output3 Output 4 Output5 Output6 Output7 The B-L1 B-L1 O O O O B-L2 O European I-L1 I-L1 B-L1 O I-L1 I-L2 I-L2 O Parliament E-L1 E-L1 E-L1 I-L1 I-L1 I-L1 E-L2 O Evaluation metrics: Precision, Recall, F-score

Evaluation: Exact and Partial Matches Exact matches: Partial matches: Correctly identified NE: Assuming a NE (word Correctly identified NE: Assuming a NE (word sequence) that is labelled as L1, all tokens in the sequence) that is labelled as L1, at least one NE are attributed identical labelling to the original token in the NE is also labelled as L1

Evaluation: Example Tokens Original Attributed Tokens Original Attributed I O O to O O will O O talk O O leave O O to O O for O O Minister O O exact + partial Stockholm I-GPE I-GPE Ringholm I-PERSON O on O O , O O Monday B-DATE I-DATE to O O , I-DATE O members O O partial 6 I-DATE B-DATE of O O March E-DATE E-DATE the B-ORG O , O O Swedish I-ORG I-NORP in O O parliament E-ORG O order O O . O O

Importance of NER ...and of Recycling it...

Why is efficient NER important? Applications of NER in NLP Generally NER is an important first step in extracting meaningful information from text ● Provide keywords for indexing documents ○ news recommenders : document clustering, user profiles ○ document classification/retrieval ○ search engines ○ Automated keyword extraction ● Entities (especially proper names) point to objects about which we need to define relations, roles, events ○ question answering: refers to “grounding” named entities to a model, defining their scope and role ○ semantic parsing ○ coreference resolution

Why Recycle? Need for multilingual NLP applications multilingual NE recognition sufficient resources and tools for English BUT resources are fewer and expensive… manual annotation requires time and manpower not very flexible method to acquire a new corpus for every adaptation need

Why Recycle? Adaptation to other domains is also important: New domains require NER (biology, medicine, scientific texts) Even top scorers in evaluation campaigns fail to perform well on different test sets ( drop of 10%-30% ) [1],[2]

What to Recycle? Available : One NE tagger trained for ● English ● News Articles Ontonotes corpus English news Broadcasts Conll 2012 labels F-score performance : 74% - 79% Exact matches

What to Recycle? Available : One NE tagger trained for ● English ● News Articles Used for: ● news recommender (main application) ● conference management tool ● coreference resolution ● sentiment analysis

Recycling Scheme: Core System Architecture

Recycling Scheme: Core System Architecture SC TC Source Target SYSTEM Language Language transfer NE tagger NE tagger

Recycling Scheme: Core System Architecture SC TC Source Corpus Target Corpus Existing Existing NE tagger NE tagger European Parliament Proceedings (EuroParl) English - French English - Greek

Recycling Scheme: Core System Architecture Phase 2 Phase 1 Phase 3 Manually Source Target Annotated Source Language Language Target Transfer Source Train Language Parse Train Parallel Parallel Language Language NEs NE Tagger corpus corpus NE Tagger corpus

Early Results

Early Results: Exact Match Partial Match Precision Recall F-score Precision Recall F-score English Europarl 69.06 67.3 68.17 87.5 73.3 80.01 French EuroParl 63.23 53.41 57.91 74.88 74.05 74.46 Greek EuroParl 50.77 45.18 47.81 68.34 75.76 71.86 English Ontonotes 80.24 78.81 79.52 83.2 96.16 89.21 Need also to adapt to other domains...

Recycling Named Entity Taggers Unsupervised Domain and Language - PowerPoint PPT Presentation

Recycling Named Entity Taggers Unsupervised Domain and Language Adaptation for Named Entity Recognition based on Parallel Corpora Master thesis of Chrysoula Zerva EPFL supervisor SONY supervisor Dr Martin Rajman Dr Wilhelm Haag Outline

Recycling 2 Recycling 2 Recycling 2 Recycling 2 Recycling 2 Recycling 2 Recycling 2

Named Entity Recognition Using BERT and ELMo Group 8 : Mikaela Guerrero Vikash Kumar Nitya

United Electronic Recycling Agenda United Electronic Recycling Introduction Customer

Named Entity WordNet *Istituto di Linguistica Computazionale (Pisa, Italy) ^University of

Character Eyes: Seeing Language through Character-Level Taggers Yuval Pinter Marc Marone Jacob

Evolve Recycling Industry Leading Inkjet, Toner and Small Electronics Recycling Program for

Multi-Task Transfer Learning for Fine-Grained Named Entity Recognition Masato Hagiwara 1 , Ryuji

Information Extraction Extracting limited forms of information from text Named entity

AIDA-light: High-Throughput Named-Entity Disambiguation Ba Dat Nguyen Johannes Hoffart Martin

VI.3 Named Entity Reconciliation Problem: Same entity appears in Different spellings

Agenda 1. Overview 2. Organics Recycling in the US Georgia Recycling Coalition 3. Organics

TOXIC RECYCLING How to Avoid Poisoning the Recycling Chain? The BIR World Recycling Convent ion

NYCHA-wide Residential Recycling Program May 2015 Recycling at NYCHA Agenda Introductions (5

Recycling Challenge Recycling Challenge Background The average Canadian recycles 112 kg of

Household Recycling How housing encourage recycling Billy Wells Acting Head of Service

Tennessee Recycling Symposium Providing total recycling and waste solutions to meet your

Principle of least decoherence in semiclassical gravity (1986-2017-?) Lajos Di osi Wigner

Advanced Structural Concrete Prof. Dr. W. Kaufmann Dr. J. Mata Falcn Autumn Semester 2020

WCET Analysis of ARM Processors using Real-Time Model Checking Andreas Engelbredt Dalsgaard Mads

QUANTLIB USER MEETING (QLUM) 2017 NOVEMBER 30 th 2017 Venue: IKB Deutsche InDustrIeBanK aG |

RIPE Internet Measurements Presentation of RIPEs tools, methodologies and datasets Vesna

5/2/2013 Mogens Flensted-Jensen - Department of Mathematical Sciences Mogens Flensted-Jensen -

Gravity duals of unquenched quark-gluon plasmas Francesco Bigazzi (Universit Libre de

Probes in holographic plasmas with unquenched quarks Liuba Mazzanti (University of Santiago de