Natural language processing and weak supervision


  1. Natural language processing and weak supervision. Léon Bottou, COS 424 – 4/27/2010

  2. Introduction
Natural language processing "from scratch":
– Natural language processing systems are heavily engineered.
– How much engineering can we avoid by using more data?
– Work by Ronan Collobert, Jason Weston, and the NEC team.
Summary:
– Natural language processing
– Embeddings and models
– Lots of unlabeled data
– Task-dependent hacks

  3. I. Natural language processing

  4. The Goal
We want to have a conversation with our computer... still a long way before HAL 9000.
Convert a piece of English into a computer-friendly data structure.
How to measure if the computer "understands" something?

  5. Natural Language Processing Tasks
Intermediate steps to reach the goal?
– Part-Of-Speech Tagging (POS): syntactic roles (noun, adverb, ...)
– Chunking (CHUNK): syntactic constituents (noun phrase, verb phrase, ...)
– Named Entity Recognition (NER): person/company/location, ...
– Semantic Role Labeling (SRL): semantic roles, e.g. [John]ARG0 [ate]REL [the apple]ARG1 [in the garden]ARGM-LOC

  6. NLP Benchmarks
Datasets:
⋆ POS, CHUNK, SRL: WSJ (up to ≈ 1M labeled words)
⋆ NER: Reuters (≈ 200K labeled words)
(a) POS (accuracy), as in (Toutanova, 2003):
    Shen, 2007       97.33%
    Toutanova, 2003  97.24%
    Gimenez, 2004    97.16%
(b) CHUNK (F1), CoNLL 2000:
    Shen, 2005       95.23%
    Sha, 2003        94.29%
    Kudoh, 2001      93.91%
(c) NER (F1), CoNLL 2003:
    Ando, 2005       89.31%
    Florian, 2003    88.76%
    Kudoh, 2001      88.31%
(d) SRL (F1), CoNLL 2005:
    Koomen, 2005     77.92%
    Pradhan, 2005    77.30%
    Haghighi, 2005   77.04%
We chose as benchmark systems:
⋆ Well-established systems
⋆ Systems avoiding external labeled data
Notes:
⋆ Ando, 2005 uses external unlabeled data
⋆ Koomen, 2005 uses 4 parse trees not provided by the challenge

  7. Complex Systems
Two extreme choices to get a complex system:
⋆ Large Scale Engineering: design a lot of complex features, use a fast existing linear machine learning algorithm

  8. Complex Systems
Two extreme choices to get a complex system:
⋆ Large Scale Engineering: design a lot of complex features, use a fast existing linear machine learning algorithm
⋆ Large Scale Machine Learning: use simple features, design a complex model which will implicitly learn the right features

  9. NLP: Large Scale Engineering (1/2)
Choose some good hand-crafted features:
– Predicate and POS tag of predicate
– Voice: active or passive (hand-built rules)
– Phrase type: adverbial phrase, prepositional phrase, ...
– Governing category: parent node's phrase type(s)
– Head word and POS tag of the head word
– Position: left or right of verb
– Path: traversal from predicate to constituent
– Predicted named entity class
– Word-sense disambiguation of the verb
– Verb clustering
– Length of the target constituent (number of words)
– NEG feature: whether the verb chunk has a "not"
– Partial Path: lowest common ancestor in path
– Head word replacement in prepositional phrases
– First and last words and POS in constituents
– Ordinal position from predicate + constituent type
– Constituent tree distance
– Temporal cue words (hand-built rules)
– Dynamic class context: previous node labels
– Constituent relative features: phrase type
– Constituent relative features: head word
– Constituent relative features: head word POS
– Constituent relative features: siblings
– Number of pirates existing in the world...
Feed them to a simple classifier like an SVM

  10. NLP: Large Scale Engineering (2/2)
Cascade features: e.g. extract POS tags, then construct a parse tree.
Extract hand-made features from the parse tree.
Feed these features to a simple classifier like an SVM.

  11. NLP: Large Scale Machine Learning
Goals:
– Task-specific engineering limits NLP scope
– Can we find unified hidden representations?
– Can we build a unified NLP architecture?
Means:
– Start from scratch: forget (most of) NLP knowledge
– Compare against classical NLP benchmarks
– Avoid task-specific engineering

  12. II. Embeddings and models

  13. Multilayer Networks
Stack several layers together:
[Architecture: input vector x → linear layer 1 (matrix-vector operation W_1 x) → non-linearity f(·) = HardTanh → linear layer 2 (matrix-vector operation W_2 ·) → output vector y]
– Increasing level of abstraction at each layer
– Requires simpler features than "shallow" classifiers
– The "weights" W_i are trained by gradient descent
How can we feed words?
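
A minimal sketch of such a stacked network in NumPy (my own illustration, not from the slides; the layer sizes and initialization scheme are assumed):

```python
import numpy as np

def hard_tanh(z):
    # HardTanh non-linearity: clip activations to [-1, 1]
    return np.clip(z, -1.0, 1.0)

class TwoLayerNet:
    """Input vector x -> linear layer 1 -> HardTanh -> linear layer 2 -> output y."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization scaled by the fan-in (see the "tricks" on the training slide)
        self.W1 = rng.uniform(-1, 1, size=(n_hidden, n_in)) / np.sqrt(n_in)
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-1, 1, size=(n_out, n_hidden)) / np.sqrt(n_hidden)
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = hard_tanh(self.W1 @ x + self.b1)   # linear layer 1 + non-linearity
        return self.W2 @ h + self.b2           # linear layer 2: output vector y

net = TwoLayerNet(n_in=50, n_hidden=300, n_out=10)
print(net.forward(np.zeros(50)).shape)   # (10,)
```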

  14. Words into Vectors
Idea:
– Words are embedded in a vector space, here R^50 (figure shows example words: the, cat, sat, on, mat, car, smoke, jesus)
– Embeddings are trained
Implementation:
– A word w is an index in a dictionary: w ∈ D ⊂ ℕ
– Use a lookup table W of size feature size × dictionary size: LT_W(w) = W_•w, the w-th column of W
Remarks:
– Applicable to any discrete feature (words, caps, stems, ...)
– See (Bengio et al., 2001)
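
A minimal sketch of the lookup-table layer (my own illustration; the tiny dictionary and the 50-dimensional feature size are assumptions): each word index simply selects one column of a trainable matrix W.

```python
import numpy as np

dictionary = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}   # toy dictionary
emb_dim = 50                                                     # R^50 as on the slide

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(emb_dim, len(dictionary)))       # feature size x dictionary size

def lookup(words):
    """LT_W(w) = the w-th column of W, for each word in the input."""
    idx = [dictionary[w] for w in words]
    return W[:, idx]                       # emb_dim x len(words)

x = lookup(["the", "cat", "sat", "on", "the", "mat"])
print(x.shape)   # (50, 6); these columns are the trained embeddings
```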


  16. Window Approach
– Tags one word at a time
– Feed a fixed-size window of text around each word to tag
– Works fine for most tasks
– How to deal with long-range dependencies? E.g. in SRL, the verb of interest might be outside the window!
[Architecture: input window centered on the word of interest (K discrete features per word) → lookup tables LT_{W^1} ... LT_{W^K} → concatenation → linear M^1 (n^1_hu units) → HardTanh → linear M^2 (n^2_hu = #tags units)]
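
A sketch of the window approach for a single discrete feature (word identity only, i.e. K = 1); the window size, hidden size, and tag count are assumptions. Embeddings of the words in a fixed window around the word of interest are concatenated and fed to the two-layer network:

```python
import numpy as np

rng = np.random.default_rng(0)
dictionary = {"PAD": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
emb_dim, win, n_hidden, n_tags = 50, 5, 300, 10          # assumed sizes

W_emb = rng.normal(scale=0.1, size=(emb_dim, len(dictionary)))
M1 = rng.normal(scale=0.01, size=(n_hidden, emb_dim * win))
M2 = rng.normal(scale=0.01, size=(n_tags, n_hidden))

def window_scores(sentence, t):
    """Tag scores for the word at position t, using a fixed-size window around it."""
    half = win // 2
    padded = ["PAD"] * half + sentence + ["PAD"] * half
    window = padded[t:t + win]                                     # win words centered on t
    x = np.concatenate([W_emb[:, dictionary[w]] for w in window])  # concatenated lookups
    h = np.clip(M1 @ x, -1.0, 1.0)                                 # linear M1 + HardTanh
    return M2 @ h                                                  # linear M2: one score per tag

print(window_scores(["the", "cat", "sat", "on", "the", "mat"], t=2).shape)   # (10,)
```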

  17. Sentence Approach (1/2)
– Feed the whole sentence to the network
– Tag one word at a time: add extra position features
– Convolutions over time (W × ·) handle variable-length inputs and produce local features with a higher level of abstraction
– Max over time captures the most relevant features and outputs a fixed-size feature vector

  18. Sentence Approach (2/2)
[Architecture: input sentence "The cat sat on the mat" with padding (K discrete features per word) → lookup tables LT_{W^1} ... LT_{W^K} → convolution M^1 (n^1_hu units per window) → max over time max(·) (n^1_hu features) → linear M^2 (n^2_hu units) → HardTanh → linear M^3 (n^3_hu = #tags units)]
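
A sketch of the convolution and max-over-time layers (my own illustration; the layer sizes are assumptions and the per-word position features are omitted for brevity). The same matrix M^1 is applied to every window of word vectors, and the max over time turns a variable-length sentence into a fixed-size feature vector:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, kw, n_hu1, n_hu2, n_tags = 50, 3, 300, 300, 10   # assumed sizes

M1 = rng.normal(scale=0.01, size=(n_hu1, emb_dim * kw))   # convolution weights
M2 = rng.normal(scale=0.01, size=(n_hu2, n_hu1))
M3 = rng.normal(scale=0.01, size=(n_tags, n_hu2))

def sentence_scores(E):
    """E: emb_dim x T matrix of looked-up word vectors for one sentence."""
    T = E.shape[1]
    pad = np.zeros((emb_dim, kw // 2))
    P = np.hstack([pad, E, pad])
    # Convolution over time: the same M1 is applied to every window of kw word vectors.
    windows = [np.concatenate([P[:, t + j] for j in range(kw)]) for t in range(T)]
    conv = np.stack([M1 @ w for w in windows], axis=1)     # n_hu1 x T local features
    pooled = conv.max(axis=1)                              # max over time: fixed-size vector
    h = np.clip(M2 @ pooled, -1.0, 1.0)                    # linear M2 + HardTanh
    return M3 @ h                                          # linear M3: one score per tag

print(sentence_scores(rng.normal(size=(emb_dim, 6))).shape)   # (10,)
```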

  19. Training
Given a training set T:
– Convert the network outputs into probabilities
– Maximize the log-likelihood: θ ↦ Σ_{(x,y)∈T} log p(y | x, θ)
Use stochastic gradient (see Bottou, 1991) with a fixed learning rate λ:
θ ← θ + λ ∂ log p(y | x, θ) / ∂θ
"Tricks":
⋆ Divide the learning rate by the "fan-in"
⋆ Initialization according to the "fan-in"
Use the chain rule ("back-propagation") for efficient gradient computation. The network f(·) has L layers, f = f_L ∘ ··· ∘ f_1, with parameters θ = (θ_L, ..., θ_1):
∂ log p(y | x, θ) / ∂θ_i = ∂ log p(y | x, θ) / ∂f_i · ∂f_i / ∂θ_i
∂ log p(y | x, θ) / ∂f_{i−1} = ∂ log p(y | x, θ) / ∂f_i · ∂f_i / ∂f_{i−1}
How to interpret neural network outputs as probabilities?
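
A sketch of the stochastic gradient step together with the two "fan-in" tricks (illustrative only; the parameter shapes and the base learning rate are assumptions):

```python
import numpy as np

def init_layer(n_out, n_in, rng):
    # Trick: initialize according to the fan-in so units start in HardTanh's linear range.
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_out, n_in))

def sgd_step(weights, grads, base_lr=0.01):
    """One stochastic gradient ascent step on log p(y|x, theta) for a single example.

    weights: list of weight matrices; grads: gradients of log p(y|x, theta) w.r.t. each.
    """
    for W, dW in zip(weights, grads):
        fan_in = W.shape[1]            # number of inputs feeding each unit of this layer
        W += (base_lr / fan_in) * dW   # trick: divide the learning rate by the fan-in

rng = np.random.default_rng(0)
weights = [init_layer(300, 50, rng), init_layer(10, 300, rng)]
grads = [np.zeros_like(w) for w in weights]   # gradients would come from back-propagation
sgd_step(weights, grads)
```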

  20. Word Tag Likelihood (WTL)
The network has one output f(x, i, θ) per tag i, interpreted as a probability with a softmax over all tags:
p(i | x, θ) = exp(f(x, i, θ)) / Σ_j exp(f(x, j, θ))
Define the logadd operation:
logadd_i z_i = log(Σ_i exp(z_i))
Log-likelihood for example (x, y):
log p(y | x, θ) = f(x, y, θ) − logadd_j f(x, j, θ)
How to leverage the sentence structure?
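
A sketch of the logadd operation and the word tag log-likelihood (my own illustration; shifting by the maximum is a standard numerical-stability detail not shown on the slide):

```python
import numpy as np

def logadd(z):
    """logadd_i z_i = log(sum_i exp(z_i)), shifted by max(z) for numerical stability."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def wtl_log_likelihood(scores, y):
    """log p(y | x, theta) = f(x, y, theta) - logadd_j f(x, j, theta).

    scores[i] plays the role of f(x, i, theta); y is the index of the correct tag.
    """
    return scores[y] - logadd(scores)

scores = np.array([1.2, -0.3, 0.7, 2.1])    # hypothetical per-tag network outputs
print(wtl_log_likelihood(scores, y=3))
```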

  21. Sentence Tag Likelihood (STL) (1/2)
The network score for tag k at the t-th word is f(x_1 ... x_T, k, t, θ).
A_kl is a transition score to jump from tag k to tag l.
[Figure: a lattice of per-word tag scores f(x, k, t) over the sentence "The cat sat on the mat", with tags Arg0, Arg1, Arg2, Verb and transition scores A_ij between consecutive tags]
Sentence score for a tag path i_1 ... i_T:
s(x_1 ... x_T, i_1 ... i_T, θ̃) = Σ_{t=1}^{T} ( A_{i_{t−1} i_t} + f(x_1 ... x_T, i_t, t, θ) )
Conditional likelihood by normalizing w.r.t. all possible paths:
log p(y_1 ... y_T | x_1 ... x_T, θ̃) = s(x_1 ... x_T, y_1 ... y_T, θ̃) − logadd_{j_1 ... j_T} s(x_1 ... x_T, j_1 ... j_T, θ̃)
How to efficiently compute the normalization?
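
The closing question is answered by dynamic programming; as a sketch (my own illustration, ignoring any start-transition score), the normalizer can be computed with a forward recursion in log-space, where delta[k] accumulates the logadd of all partial paths ending in tag k:

```python
import numpy as np

def logadd(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def log_normalizer(f, A):
    """logadd over all tag paths j_1 ... j_T of the sentence score s(x, j_1 ... j_T).

    f: T x K matrix, f[t, k] = network score for tag k at word t.
    A: K x K matrix, A[k, l] = transition score from tag k to tag l.
    """
    T, K = f.shape
    delta = f[0].copy()                    # logadd of partial-path scores ending in each tag
    for t in range(1, T):
        delta = np.array([logadd(delta + A[:, l]) + f[t, l] for l in range(K)])
    return logadd(delta)

rng = np.random.default_rng(0)
print(log_normalizer(rng.normal(size=(6, 4)), rng.normal(size=(4, 4))))   # toy example
```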

