SLIDE 1

Combining Labeled and Unlabeled Data in Statistical Natural Language Parsing

Simon Fraser University – April 18, 2002 Anoop Sarkar Department of Computer and Information Science University of Pennsylvania

anoop@linc.cis.upenn.edu http://www.cis.upenn.edu/˜anoop

1

SLIDE 2

  • Task: find the most likely parse for natural language sentences
  • Approach: rank alternative parses with statistical methods trained on data annotated by experts (labeled data)
  • Focus of this talk:
    1. Motivate a particular probabilistic grammar formalism for statistical parsing: tree-adjoining grammar
    2. Combine labeled data with unlabeled data to improve parsing performance using co-training

2

SLIDE 3

Overview

  • Introduction to Statistical Parsing
  • Tree Adjoining Grammars and Statistical Parsing
  • Combining Labeled and Unlabeled Data in Statistical Parsing
  • Summary and Future Directions

3

SLIDE 4

Applications of Language Processing Algorithms

  • Information Extraction: converting unstructured data (text) into a structured form
  • Improving the word error rate in speech recognition
  • Human-Computer Interaction: dialog systems, machine translation, summarization, etc.
  • Cognitive Science: computational models of human linguistic behaviour
  • Biological structure prediction: formal grammars for RNA secondary structures

4

SLIDE 5

A Key Problem in Processing Language: Ambiguity

(Church and Patil 1982; Collins 1999)

  • Part-of-speech ambiguity:
    saw → noun; saw → verb
  • Structural ambiguity: prepositional phrases
    I saw (the man) with the telescope
    I saw (the man with the telescope)
  • Structural ambiguity: coordination
    a program to promote safety in ((trucks) and (minivans))
    a program to promote ((safety in trucks) and (minivans))
    ((a program to promote safety in trucks) and (minivans))

5

SLIDE 6

Ambiguity = attachment choice in alternative parses

[Two parse trees for "a program to promote safety in trucks and minivans": one coordinates "trucks and minivans" inside the PP under "safety in", the other attaches "and minivans" higher in the NP.]

6

SLIDE 7

Parsing as a machine learning problem

  • S = a sentence; T = a parse tree
    A statistical parsing model defines P(T | S)
  • Find the best parse: arg max_T P(T | S)
  • P(T | S) = P(T, S) / P(S), and P(S) is constant over candidate parses
  • Best parse: arg max_T P(T, S)
  • e.g. for PCFGs: P(T, S) = Π_{i=1...n} P(RHS_i | LHS_i)

7
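The PCFG factorization on slide 7 can be sketched in a few lines. The toy grammar, its rule probabilities, and the nested-tuple tree encoding below are illustrative assumptions, not the talk's actual model:

```python
# Toy PCFG: P(RHS | LHS) for each rule, keyed by (LHS, RHS).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("I",)): 0.6,
    ("NP", ("the", "man")): 0.4,
    ("VP", ("saw", "NP")): 1.0,
}

def tree_prob(tree):
    """P(T, S): multiply P(RHS | LHS) over every rule used in the tree."""
    if isinstance(tree, str):              # leaf: a word contributes no rule
        return 1.0
    lhs, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB.get((lhs, rhs), 0.0)
    for child in children:
        p *= tree_prob(child)
    return p

def best_parse(candidate_trees):
    """arg max_T P(T, S): P(S) is constant, so ranking by P(T, S) suffices."""
    return max(candidate_trees, key=tree_prob)
```

Because P(S) does not depend on T, maximizing the joint P(T, S) gives the same arg max as maximizing P(T | S), which is why the sketch never computes P(S).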

SLIDE 8

Parsing as a machine learning problem

  • Training data: the Penn WSJ Treebank (Marcus et al. 1993)
  • Learn a probabilistic grammar from the training data
  • Evaluate accuracy on test data
  • A standard evaluation: train on 40,000 sentences, test on 2,300 sentences
  • The simplest technique, plain PCFGs, performs badly. Reason: not sensitive to the words

8

SLIDE 9

Machine Learning for ambiguity resolution: prepositional phrases (supervised learning)

    V          N1           P     N2         Attachment
    making     paper        for   filters    N
    join       board        as    director   V
    is         chairman     of    N.V.       N
    using      crocidolite  in    filters    V
    bring      attention    to    problem    V
    is         asbestos     in    products   N
    including  three        with  cancer     N

9

SLIDE 10

Machine Learning for ambiguity resolution: prepositional phrases

    Method                                                 Accuracy
    Always noun attachment                                 59.0
    Most likely for each preposition                       72.2
    Average Human (4 head words only)                      88.2
    Average Human (whole sentence)                         93.2
    Lexicalized Model (Collins and Brooks 1995)            84.0
    Lexicalized Model + Wordnet (Stetina and Nagao 1998)   88.0

10
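The two simplest baselines in the table above can be sketched directly. The labeled 5-tuples below are hypothetical examples echoing slide 9, not the actual training data:

```python
from collections import Counter, defaultdict

# Hypothetical labeled PP-attachment examples: (v, n1, p, n2, attachment).
DATA = [
    ("making", "paper", "for", "filters", "N"),
    ("join", "board", "as", "director", "V"),
    ("using", "crocidolite", "in", "filters", "V"),
    ("is", "asbestos", "in", "products", "N"),
    ("bring", "attention", "to", "problem", "V"),
]

def always_noun(_example):
    """'Always noun attachment' baseline (59.0 in the table)."""
    return "N"

def train_per_preposition(data):
    """'Most likely attachment for each preposition' baseline (72.2)."""
    counts = defaultdict(Counter)
    for v, n1, p, n2, label in data:
        counts[p][label] += 1
    # For each preposition, predict its majority attachment label.
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}
```

The jump from 72.2 to 84.0+ in the table comes from lexicalized models that also condition on the head words v, n1, and n2 rather than the preposition alone.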

SLIDE 11

Statistical Parsing:

the company 's clinical trials of both its animal and human-based insulins indicated no difference in the level of hypoglycemia between users of either product

[Lexicalized parse tree: S(indicated) dominates NP(trials) ("the company 's clinical trials . . .") and VP(indicated); VP(indicated) dominates V(indicated), NP(difference) ("no difference"), and PP(in), with PP(in) dominating P(in) and NP(level) ("the level of . . .").]

11

SLIDE 12

Bilexical CFG: dependencies between pairs of words

  • Full context-free rule:
    VP(indicated) → V-hd(indicated) NP(difference) PP(in)
  • Each rule is generated in three steps (Collins 1999):
    1. Generate the head daughter of the LHS: VP(indicated) → V-hd(indicated)
    2. Generate non-terminals to the left of the head daughter, ending with STOP:
       STOP . . . V-hd(indicated)
    3. Generate non-terminals to the right of the head daughter, ending with STOP:
       – V-hd(indicated) . . . NP(difference)
       – V-hd(indicated) . . . PP(in)
       – V-hd(indicated) . . . STOP

12
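The three-step head-outward generation above can be sketched as follows. The probability tables are made-up numbers for illustration, and the real Collins model conditions each daughter on more context (distance, adjacency); this is a zeroth-order sketch:

```python
# Hypothetical tables for VP(indicated) -> V-hd(indicated) NP(difference) PP(in).
P_HEAD = {("VP", "indicated"): {"V-hd": 1.0}}
P_LEFT = {("VP", "V-hd", "indicated"): {"STOP": 1.0}}
P_RIGHT = {("VP", "V-hd", "indicated"): {
    ("NP", "difference"): 0.5, ("PP", "in"): 0.3, "STOP": 0.2}}

def rule_prob(parent, head_word, head_label, left, right):
    """Step 1: generate the head daughter; steps 2-3: generate daughters
    outward on each side of the head, terminating each side with STOP."""
    p = P_HEAD[(parent, head_word)][head_label]
    ctx = (parent, head_label, head_word)
    for daughter in list(left) + ["STOP"]:       # left of the head
        p *= P_LEFT[ctx][daughter]
    for daughter in list(right) + ["STOP"]:      # right of the head
        p *= P_RIGHT[ctx][daughter]
    return p
```

Factoring the rule this way lets the model score rules never seen verbatim in training, since each daughter is generated from smaller, better-estimated conditional distributions.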

SLIDE 13

Independence Assumptions

[Table contrasting rule probabilities under the independence assumptions: e.g. VP → VB NP at 60.8% vs. VP → VB PP NP at 0.7%, and VP → VB NP PP in two different contexts at 2.23% vs. 0.06%.]

13

SLIDE 14

Overview

  • Introduction to Statistical Parsing
  • Tree Adjoining Grammars and Statistical Parsing
  • Combining Labeled and Unlabeled Data in Statistical Parsing
  • Summary and Future Directions

14

SLIDE 15

Lexicalization of Context-Free Grammars

  • CFG G: (r1) S → S S
           (r2) S → a
  • Tree-substitution Grammar G′:
    [Elementary trees α1, α2, α3, each anchored by the lexical item a and built from r1 and r2, with substitution nodes S↓ for the remaining daughters.]

15

SLIDE 16

Lexicalization of Context-Free Grammars

[Schematic of adjunction: an auxiliary tree β, with root X and foot node X*, is adjoined at a node labeled X inside tree α, yielding the derived tree γ.]

16

SLIDE 17

Lexicalization of Context-Free Grammars

  • CFG G: (r1) S → S S
           (r2) S → a
  • Tree-adjoining Grammar G′′:
    [Elementary trees α1, α2, α3 with foot nodes S∗, plus derived trees γ, γ′, each anchored by the lexical item a.]

17

SLIDE 18

Tree Adjoining Grammars: Different Modeling of Bilexical Dependencies

[Elementary trees for "the store which IBM bought last week": NP trees for "the store", "which", and "IBM"; a relative-clause auxiliary tree NP∗ SBAR anchored by "bought", with substitution nodes WH↓ and NP↓ and an empty object ǫ; and a VP∗ auxiliary tree for "last week".]

18

SLIDE 19

Probabilistic TAGs: Substitution

[The initial tree α ("IBM") substitutes at the node η (NP↓) of tree t, the relative-clause tree anchored by "bought".]

Σ_α Ps(t, η → α) = 1

19

SLIDE 20

Probabilistic TAGs: Adjunction

[The auxiliary tree β ("last week") adjoins at the node η (VP) of tree t, the relative-clause tree anchored by "bought".]

Pa(t, η → NONE) + Σ_β Pa(t, η → β) = 1

20
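The normalization constraints on slides 19 and 20 can be checked mechanically. The distributions below are hypothetical toy tables, keyed by (tree, node); NONE stands for the "no adjunction" outcome:

```python
NONE = "NONE"   # the "no adjunction" outcome at an internal node

# Hypothetical substitution/adjunction distributions, keyed by (tree, node).
P_SUBST = {("alpha_bought", "NP_subj"): {"alpha_IBM": 0.7, "alpha_Lotus": 0.3}}
P_ADJ = {("alpha_bought", "VP"): {"beta_last_week": 0.4, NONE: 0.6}}

def substitution_ok(p_subst, tol=1e-9):
    """Check sum_alpha Ps(t, eta -> alpha) = 1 at every substitution node."""
    return all(abs(sum(d.values()) - 1.0) < tol for d in p_subst.values())

def adjunction_ok(p_adj, tol=1e-9):
    """Check Pa(t, eta -> NONE) + sum_beta Pa(t, eta -> beta) = 1 everywhere."""
    return all(NONE in d and abs(sum(d.values()) - 1.0) < tol
               for d in p_adj.values())
```

The asymmetry is the point: substitution nodes must be filled, so their distribution has no NONE outcome, while adjunction is optional at every internal node.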

SLIDE 21

Tree Adjoining Grammars

  • Start of a derivation: Σ_α Pi(α) = 1
  • Probability of a derivation:
    Pr(D, w0 . . . wn) = Pi(α, wi) ×
      Π_p Ps(τ, η, w → α, w′) ×
      Π_q Pa(τ, η, w → β, w′) ×
      Π_r Pa(τ, η, w → NONE)
  • Events for these probability models can be extracted from an expert-annotated set of derivations (e.g. the Penn Treebank)

21
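The derivation probability on slide 21 is a product over one initial-tree event, the substitution events, and the adjunction events (including the NONE outcomes). A minimal sketch, with hypothetical event keys and log-space accumulation for numerical stability:

```python
import math

def derivation_prob(start_tree, subst_events, adj_events, Pi, Ps, Pa):
    """Pr(D, w0..wn) = Pi(alpha) * prod_p Ps(...) * prod_{q,r} Pa(...).
    Events are keys into the (toy) probability tables Pi, Ps, Pa."""
    logp = math.log(Pi[start_tree])
    for ev in subst_events:           # substitutions at substitution nodes
        logp += math.log(Ps[ev])
    for ev in adj_events:             # adjunctions, including eta -> NONE
        logp += math.log(Pa[ev])
    return math.exp(logp)
```

Because every node contributes exactly one event (a substitution, an adjunction, or an explicit NONE), the products over p, q, and r cover the whole derived tree with no double counting.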

SLIDE 22

Performance of supervised statistical parsers

    System             LP(≤40wds)  LR(≤40wds)  LP(≤100wds)  LR(≤100wds)
    (Magerman 95)      84.9        84.6        84.3         84.0
    (Collins 99)       88.5        88.7        88.1         88.3
    (Charniak 97)      87.5        87.4        86.7         86.6
    (Ratnaparkhi 97)   86.3        87.5        -            -
    Current            86.0        85.2        -            -
    (Chiang 2000)      87.7        87.7        86.9         87.0

  • Labeled Precision = (number of correct constituents in proposed parse) / (number of constituents in proposed parse)
  • Labeled Recall = (number of correct constituents in proposed parse) / (number of constituents in treebank parse)

22
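The LP/LR definitions above translate directly into code. This sketch assumes constituents are encoded as (label, start, end) spans and uses sets, so duplicate spans are counted once (real evaluation scripts such as evalb use multiset counts):

```python
def labeled_precision_recall(proposed, gold):
    """Labeled precision and recall over (label, start, end) constituents."""
    correct = len(set(proposed) & set(gold))
    lp = correct / len(proposed)   # correct / constituents in proposed parse
    lr = correct / len(gold)       # correct / constituents in treebank parse
    return lp, lr
```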

SLIDE 23

Theory of Probabilistic TAGs

PCFGs: (Booth and Thompson 1973); (Jelinek and Lafferty 1991)

  • A probabilistic grammar is well-defined or consistent if:
    Σ_{n=1}^∞ Σ_{a1 a2 . . . an, ai ∈ V} P(S → a1 a2 . . . an) = 1
  • What is the single most likely parse (or derivation) for input string a1, . . . , an?
  • What is the probability of a1, . . . , ai, where a1, . . . , ai is a prefix of some string generated by the grammar?
    Σ_{w∈Σ*} P(a1 . . . ai w)

23

SLIDE 24

Tree Adjoining Grammars

  • Locality and independence assumptions are captured elegantly with a simple and well-defined probability model.
  • Parsing can be treated in two steps:
    1. Classification: structured labels (elementary trees) are assigned to each word in the sentence.
    2. Attachment: the elementary trees are connected to each other to form the parse.
  • Produces more than just the phrase structure of each sentence: it directly gives the predicate-argument structure.

24

SLIDE 25

Overview

  • Introduction to Statistical Parsing
  • Tree Adjoining Grammars and Statistical Parsing
  • Combining Labeled and Unlabeled Data in Statistical Parsing
  • Summary and Future Directions

25

SLIDE 26

Training a Statistical Parser

  • How should the rule probabilities be chosen?
  • Alternatives:
    – EM algorithm: completely unsupervised (Schabes 1992)
    – Supervised training from a Treebank (Chiang 2000)
    – Weakly supervised learning: exploit the new representation to combine labeled and unlabeled data

26

SLIDE 27

Co-Training

  • Pick two "views" of a classification problem.
  • Build a separate model for each view and train each model on a small set of labeled data.
  • Sample the unlabeled data to find examples that each model independently labels with high confidence.
  • Add these confidently labeled examples to the labeled data. Iterate.
  • Each model labels examples for the other in each iteration.

27

SLIDE 28

Co-training for simple classifiers (Blum and Mitchell 1998)

  • Task: build a classifier that categorizes web pages into two classes, +: is a course web page, −: is not a course web page
  • Each labeled example has two views:
    1. Text in hyperlink: <a href=". . ."> CSE 120, Fall semester </a>
    2. Text in web page: <html>. . . Assignment #1 . . .</html>
  • Combining labeled and unlabeled data outperforms using labeled data alone

28

SLIDE 29

Pierre Vinken will join the board as a non-executive director

[Parse tree: S dominates NP ("Pierre Vinken") and VP; VP dominates "will" and a lower VP; that VP dominates "join", NP ("the board"), and PP ("as a non-executive director").]

29

SLIDE 30

Parsing = n-best Tree Classification and Stapling: (Srinivas 1997)

[One elementary tree per word: NP trees for "Pierre" and "Vinken", a VP∗ auxiliary tree for "will", an S tree with NP↓ substitution nodes anchored by "join", NP trees for "the" and "board", a VP∗ PP auxiliary tree for "as", and NP trees for "a", "non-executive", and "director".]

Model H1: P(Ti | Ti−2 Ti−1) × P(wi | Ti)

30
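Model H1 scores a sequence of elementary trees (supertags) with a trigram model over tags times a lexical emission term. A minimal scoring sketch with hypothetical probability tables (a real tagger would search over tag sequences, e.g. with Viterbi):

```python
import math

def h1_score(words, tags, p_trans, p_emit):
    """Model H1: prod_i P(T_i | T_{i-2}, T_{i-1}) * P(w_i | T_i), in log space.
    p_trans maps (T_{i-2}, T_{i-1}) to a distribution over T_i;
    p_emit maps T_i to a distribution over words."""
    prev2, prev1 = "<s>", "<s>"           # sentence-start padding tags
    logp = 0.0
    for w, t in zip(words, tags):
        logp += math.log(p_trans[(prev2, prev1)][t])
        logp += math.log(p_emit[t][w])
        prev2, prev1 = prev1, t
    return logp
```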

SLIDE 31

Parsing = Finding Best Bilexical Dependencies

[Dependency links among the elementary trees α1(join), α2(Vinken), β1(Pierre), β2(will), α3(board), β3(the), β4(as), α4(director), β5(non-executive), β6(a).]

Model H2: P(w, T | ) × Π_i P(wi, Ti | η, w, T)

31

SLIDE 32

The Co-Training Algorithm for Parsing

  1. Input: labeled and unlabeled sets
  2. Update cache:
     – Randomly select sentences from unlabeled and refill cache
     – If cache is empty, exit
  3. Train models H1 and H2 using labeled
  4. Apply H1 and H2 to cache
  5. Pick the n sentences most confidently labeled by H1 and add them to labeled
  6. Pick the n sentences most confidently labeled by H2 and add them to labeled
  7. n = n + k; go to Step 2

32
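The loop on slide 32 can be sketched as follows. The callbacks `train` and `label_with_conf` are hypothetical stand-ins for the two parsing models: `train(labeled, view)` returns a model for one view, and `label_with_conf(model, sentence)` returns a (labeled example, confidence) pair:

```python
import random

def co_train(labeled, unlabeled, train, label_with_conf,
             n=10, k=5, cache_size=3000, max_iterations=12):
    """Sketch of the co-training loop, with hypothetical model callbacks."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(max_iterations):
        # Step 2: refill the cache with randomly selected unlabeled sentences.
        random.shuffle(unlabeled)
        cache, unlabeled = unlabeled[:cache_size], unlabeled[cache_size:]
        if not cache:
            break                       # cache empty: exit
        # Step 3: train both views on the current labeled set.
        models = [train(labeled, view) for view in (1, 2)]
        # Steps 4-6: each model labels the cache; keep its n most confident.
        for model in models:
            scored = sorted((label_with_conf(model, s) for s in cache),
                            key=lambda pair: pair[1], reverse=True)
            labeled.extend(example for example, conf in scored[:n])
        # Step 7: grow the number of examples added per iteration.
        n += k
    return labeled
```

Retraining both models from scratch each round keeps the sketch faithful to the slide; incremental retraining is an efficiency refinement, not part of the algorithm.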

SLIDE 33

Experiment

  • labeled was set to Sections 02-06 of the Penn Treebank WSJ (9625 sentences)
  • unlabeled was 30137 sentences (Sections 07-21 of the Treebank, stripped of all annotations)
  • A tree dictionary of all lexicalized trees from labeled and unlabeled, similar to the approach of (Brill 1997); new trees were treated as unknown tree tokens
  • The cache size was 3000 sentences

33

SLIDE 34

Results

  • Test set: Section 23
  • Baseline model, trained only on the labeled set:
    Labeled Bracketing Precision = 72.23%, Recall = 69.12%
  • After 12 iterations of co-training:
    Labeled Bracketing Precision = 80.02%, Recall = 79.64%
  • Evaluation of an unsupervised approach is directly comparable to other supervised parsers (unlike previous work)

34

SLIDE 35

Experiment with a large set of labeled data

  • Still needs human supervision to create the tree dictionary. For small datasets, this is unavoidable.
  • Another application: use a large labeled dataset, but improve performance using a much larger unlabeled dataset.
  • Expt: 1M words labeled and 23M words unlabeled. The tree dictionary is completely defined by the labeled set.
  • Even after 12 iterations of co-training, performance did not improve significantly over the baseline of LR 85.2% and LP 86%.

35

SLIDE 36

Co-Training and Graph Mincuts (Blum and Mitchell 1998; Blum and Chawla 2001)

[Figure: labeled + and − examples as source (S) and sink (T) nodes and unlabeled examples as internal nodes of a graph; co-training viewed as finding a min-cut / max-flow separating the two classes.]

36

SLIDE 37

Co-Training and EM

                              max likelihood over     iterative selection
                              full unlabeled set      from unlabeled set
    Q0 || Q∞                  EM†                     self-training
    conditionally
    independent features      co-EM∗                  Co-Training

∗ (Nigam and Ghani, 2000)
† Discriminative objective f ; Q0 || Qdis (Mitchell, to appear)

37

SLIDE 38

Overview

  • Introduction to Statistical Parsing
  • Tree Adjoining Grammars and Statistical Parsing
  • Combining Labeled and Unlabeled Data in Statistical Parsing
  • Summary and Future Directions

38

SLIDE 39

Future Directions

  • Co-training multiple parsers: (JHU summer workshop 2002)
  • Applications of statistical parsing, e.g. information extraction, data mining from text
  • Predicting RNA secondary structures, protein folding
  • Effective use of unlabeled data in machine learning (in other applications)
  • Multilingual statistical parsing: English, Korean, Czech, Hindi, Chinese, Arabic

39

SLIDE 40

Other Contributions of the Dissertation (not presented in this talk)

  • Theoretical Work
    – Consistency of Probabilistic TAGs (- 1998)
    – Prefix Probabilities from Probabilistic TAGs (- 1998)
    – Head-corner parsing algorithm for TAGs ( 2001, + 2000) (implementation used in the XTAG system)
  • Corpus-Based Work (combining labeled and unlabeled data)
    – Learning unknown verb subcategorization frames in Czech ( 2000)
    – Applying subcategorization-frame learning to verb alternation classes
    – Multilingual parsing: Korean, Hindi

40

SLIDE 41

Summary

  • Provided a new approach to parsing: parsing is treated as two steps, classification and attachment, each with an associated probability model
  • First application of co-training to the complex problem of statistical parsing (previous work only on binary classifiers)

41

SLIDE 42

Summary

  • Showed experimental evidence that one can bootstrap from small amounts of labeled data in statistical parsing
  • Results are competitive with methods that use larger amounts of labeled data
  • First unsupervised approach to statistical parsing that produces the same output as supervised approaches
  • Allows direct comparison of unsupervised and supervised approaches

42