

  1. Exploring Multi-level Distributional Semantics for Cross-lingual Entity Discovery and Linking. Boliang Zhang, Xiaoman Pan, Lifu Huang, Ying Lin, Heng Ji (jih@rpi.edu)

  2. Noisy Training Data Acquisition 1: Chinese Room

  3. Noisy Training Data Acquisition 1: Chinese Room (cont'd)

  4. Noisy Training Data Acquisition 2: Wikipedia Mining
     § Generate "silver-standard" training data automatically
     § Apply self-training to make the training data more complete and consistent

  5. Exploit Non-traditional Universal Linguistic Resources
     • Grammar books from Lori Levin's bookshelf and CIA Names from the DARPA PM's bookshelf
     • Unicode Common Locale Data Repository, Wiktionary, Panlex, Multilingual WordNet, GeoNames, JRC Names, phrase pairs mined from Wikipedia
     • Phrase Books from Language Survival Kits and the Elicitation Corpus
     • Largely ignored by the NLP community

  6. Linguistic Structure from the WALS Database and Syntactic Structures of the World's Languages (SSWL)
     • Universal morphology analyzer based on Wikipedia markup, e.g. (Turkish):
       Kıta Fransası, güneyde [[Akdeniz]] den kuzeyde [[Manş Denizi]] ve [[Kuzey Denizi]] ne, doğuda [[Ren Nehri]] nden batıda [[Atlas Okyanusu]] na kadar yayılan topraklarda yer alır.
       (Metropolitan France extends over territory reaching from the [[Mediterranean Sea]] in the south to the [[English Channel]] and the [[North Sea]] in the north, and from the [[Rhine River]] in the east to the [[Atlantic Ocean]] in the west.)
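In the Turkish example above, the case suffix is written as a separate token after the closing `]]`, which is exactly what makes the markup mineable for morphology. A minimal sketch of that mining step (the regex, function name, and the tiny suffix inventory are my own illustrative assumptions, not the authors' analyzer):

```python
import re

# Toy inventory of Turkish case-suffix forms (illustrative, not exhaustive)
CASE_SUFFIXES = {"den", "dan", "nden", "ndan", "ne", "na", "ye", "ya", "de", "da"}

def mine_entity_suffixes(text):
    """Collect (entity, suffix) pairs from Wikipedia markup in which a
    morphological suffix appears as a separate token after the link."""
    pairs = []
    for entity, token in re.findall(r"\[\[([^\]|]+)\]\]\s*(\w+)", text):
        if token.lower() in CASE_SUFFIXES:
            pairs.append((entity, token))
    return pairs

sentence = ("Kıta Fransası, güneyde [[Akdeniz]] den kuzeyde [[Manş Denizi]] ve "
            "[[Kuzey Denizi]] ne, doğuda [[Ren Nehri]] nden batıda "
            "[[Atlas Okyanusu]] na kadar yayılan topraklarda yer alır.")
```

On the example sentence this yields pairs such as ("Akdeniz", "den") and ("Atlas Okyanusu", "na"), while ordinary following words like the conjunction "ve" are filtered out by the suffix inventory.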

  7. Character-Aware Word Embeddings
     • Motivation: mentions of the same concept across languages may share a set of similar characters, e.g., Semsettin Gunaltay (English) = Şemsettin Günaltay (Turkish) = Semsetin Ganoltey (Somali)
     • Compose word embeddings from shared character embeddings using Convolutional Neural Networks
     • Further optimized by a language model based on Recurrent Neural Networks: maximize the prediction of the current word given the previous words
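A minimal numpy sketch of the composition step (random, untrained filters; the dimensions and the tanh nonlinearity are illustrative assumptions, not the paper's configuration): a word vector is built from character embeddings by a convolution over character windows followed by max-pooling, so transliteration variants that share most character n-grams receive largely identical pooled features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Character inventory covering the example names (toy setup)
chars = sorted(set("semsettin gunaltay şü"))
char2id = {c: i for i, c in enumerate(chars)}
DIM_CHAR, N_FILTERS, WIDTH = 8, 64, 3               # assumed sizes
E = rng.normal(size=(len(chars), DIM_CHAR))         # character embedding table
W = rng.normal(size=(N_FILTERS, WIDTH * DIM_CHAR)) * 0.1  # conv filters

def word_embedding(word):
    """Compose a word vector from characters: convolution + max-pooling."""
    X = E[[char2id[c] for c in word if c in char2id]]
    if X.shape[0] < WIDTH:                          # pad short words
        X = np.vstack([X, np.zeros((WIDTH - X.shape[0], DIM_CHAR))])
    windows = np.stack([X[i:i + WIDTH].ravel()
                        for i in range(X.shape[0] - WIDTH + 1)])
    return np.tanh(windows @ W.T).max(axis=0)       # max over positions

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_en, v_tr = word_embedding("semsettin"), word_embedding("şemsettin")
```

Because "semsettin" and "şemsettin" share six of their seven character trigrams, most filters take their maximum on a shared window and the two composed vectors largely coincide.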

  8. Feed Non-traditional Linguistic Resources into DNN
     [Architecture figure: character embeddings composed by a CNN and concatenated with word embeddings and linguistic-feature embeddings; fed through left and right (bidirectional) LSTM hidden layers into a B/I/O CRF output layer. Linguistic features: English and low-resource-language patterns, low-resource-language-to-English lexicons, gazetteers, low-resource-language grammar rules.]
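The CRF layer on top of the LSTMs picks the best B/I/O tag sequence jointly rather than tagging each token independently. A minimal numpy sketch of the decoding step (Viterbi over the network's per-token emission scores and learned transition scores; the scores below are invented for illustration):

```python
import numpy as np

TAGS = ["O", "B", "I"]

def viterbi(emissions, transitions):
    """Best-scoring tag path given per-token emission scores (n x k)
    and tag-to-tag transition scores (k x k)."""
    n, k = emissions.shape
    dp = emissions[0].copy()                 # best score ending in each tag
    back = np.zeros((n, k), dtype=int)       # backpointers
    for t in range(1, n):
        scores = dp[:, None] + transitions + emissions[t][None, :]
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0)
    path = [int(dp.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [TAGS[i] for i in reversed(path)]

# Invented scores for a 3-token sentence; forbid the invalid O -> I transition
emissions = np.array([[0.0, 5.0, 0.0],
                      [0.0, 0.0, 5.0],
                      [5.0, 0.0, 0.0]])
transitions = np.zeros((3, 3))
transitions[0, 2] = -100.0                   # "I" may not follow "O"
```

Here the emissions prefer B, I, O in turn, and the transition penalty ensures I can only continue a mention, so decoding returns the well-formed sequence B, I, O.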

  9. Common Semantic Space Construction

  10. Construct a Common Semantic Space for Thousands of Languages
     § Motivations
       § There are 3,000+ languages with an electronic record
       § NLP training data is only available for several dominant languages
     § Goals
       § Build a common semantic space across thousands of languages for resource sharing and a richer continuous semantic representation of words, concepts and entities
     § Limitations of previous attempts (e.g., Upadhyay et al., 2016; Cho et al., 2017)
       § Mostly English-anchored, so they cannot capture all linguistic phenomena
       § Heavily reliant on bilingual dictionaries and parallel data, which are not always available
       § Limited to dozens of languages

  11. Multi-Level Multi-lingual Alignment
     • When bilingual word dictionaries are not available, back off to shared linguistic structures
       § e.g., apposition, conjunction, plural suffixes (English -s/-es, Turkish -lar/-ler, Somali -o)
       § Generalized from language-universal resources such as the WALS database and Syntactic Structures of the World's Languages
       § These classify languages according to a large number of typological properties (phonological, lexical, grammatical): 2,676 languages, 58,000+ (language, feature, feature value) tuples, e.g., (English, canonical word order, SVO)
     • Project monolingual word embeddings into a common semantic space, and align both the representations of words and of linguistic structures in the common space
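When a seed dictionary of translation pairs is available, the word-level projection step can be done with an orthogonal (Procrustes) mapping. This is one standard formulation, sketched here as an illustration; the slides' structure-level alignment goes beyond it, and the authors' exact objective may differ:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimizing ||X W - Y||_F, where row i of X and Y
    holds the embeddings of the i-th translation pair (source, target)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check: recover a known orthogonal transform exactly
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))                  # 50 seed pairs, dim 5
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Y = X @ Q
W = procrustes_align(X, Y)
```

Constraining W to be orthogonal preserves distances and angles in the source space, which is why this closed-form SVD solution is a common choice for cross-lingual embedding alignment.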

  12. Model Training
     • Language model prediction loss
     • Multilingual alignment loss
     • Overall loss
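The loss equations on this slide were rendered as images and are not recoverable from the transcript. As an assumed illustration only (the weight λ and the exact form of the alignment term are my guesses, consistent with the bullet structure but not taken from the slides), a typical combination would be:

```latex
\mathcal{L}_{\mathrm{LM}} = -\sum_{t} \log P(w_t \mid w_{<t}), \qquad
\mathcal{L}_{\mathrm{align}} = \sum_{(x,\, y) \in D} \lVert \mathbf{x} - W\mathbf{y} \rVert^2, \qquad
\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda\, \mathcal{L}_{\mathrm{align}}
```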

  13. Linguistic Features Matter: More Robust to Noise (Uzbek; Zhang et al., 2017)

  14. Impact of Character-Aware Word Embeddings
     Name tagging F-score (%):
     Model    Chinese   English   Spanish
     Before   64.1      67.4      64.6
     After    68.0      70.9      68.9

  15. Impact of Common Semantic Space
     Chechen name tagging:
     Model                                    P (%)   R (%)   F (%)
     Randomly initialized                     46.3    45.31   45.8
     Pre-trained                              54.8    41.3    47.1
     + Common semantic space word embedding   62.1    50.1    55.4

  16. Something Old: Hierarchical Brown Clustering
     Language         w/o BC (%)   with BC (%)
     Albanian         72.4         74.6
     Chechen          53.1         55.4
     Chinese          66.3         68.0
     English          69.5         70.9
     Kannada          51.9         56.0
     Kikuyu           84.2         88.7
     Nepali           41.6         43.9
     Northern Sotho   90.2         90.8
     Polish           49.6         53.2
     Somali           76.9         78.5
     Spanish          67.1         68.9
     Swahili          64.3         67.8
     Yoruba           46.1         49.5

  17. Joint Learning of Word and Entity Embeddings from Wikipedia
     • Treating all Wikipedia anchor links as entity annotations, a training corpus can be created by replacing anchor links with unique entity IDs, e.g.:
       [[en/Apple|apple]] is a fruit  →  apple is a fruit  /  en/Apple is a fruit
       [[en/Apple_Inc.|apple]] is a company  →  apple is a company  /  en/Apple_Inc. is a company
     • Multi-lingual
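The replacement step is simple string processing over the wiki markup; a minimal sketch (the regex and function names are mine, not the authors'):

```python
import re

# Matches [[entity|surface]] anchors in wiki markup
ANCHOR = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")

def entity_view(text):
    """Replace each anchor with the unique entity ID (the link target)."""
    return ANCHOR.sub(lambda m: m.group(1), text)

def surface_view(text):
    """Replace each anchor with its surface form, recovering plain text."""
    return ANCHOR.sub(lambda m: m.group(2), text)
```

Running both views over the same corpus puts words and entity IDs in shared contexts, so a standard word-embedding model trained on the concatenation places them in one vector space.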

  18. Joint Learning of Word and Entity Embeddings from Wikipedia
     [Architecture figure: a knowledge space and a text space trained jointly.
     Entity representation learning models P(N(e_j) | e_j), predicting an entity's knowledge-base neighbors N(·) via inlinks, outlinks, and categories (e.g., Independence Day (US) is observed by the United States; Independence Day (film) stars Will Smith; both link to Public holidays in the United States).
     Mention representation learning models P(e_j | C(m_l), s*_j), mapping anchor mentions such as "[[Independence Day (US)|July 4th]] celebrations" to entity senses s* through their contexts C(·).
     Text representation learning models P(C(w_i) | w_i) · P(C(m_l) | s*_j) over the words surrounding anchors in Wikipedia text, e.g., "In the 1996 action film [[Independence Day (film)|Independence Day]], the United States military uses alien technology captured ..." and "early Confederate [[Memorial Day]] celebrations were simple, somber occasions ...".]

  19. Learning Entity Embeddings from DBpedia
     • Construct a weighted undirected graph G = (E, D) from DBpedia, where E is the set of all entities in DBpedia and d_ij ∈ D indicates that two entities e_i and e_j share some DBpedia properties. The weight w_ij of d_ij is computed as
       w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|)
       where p_i and p_j are the sets of DBpedia properties of e_i and e_j respectively.
     • Apply the graph embedding framework proposed by Tang et al. (2015) to generate knowledge representations for all entities
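The edge-weight formula translates directly to code over Python sets; a small sketch (the example property names are invented for illustration):

```python
def edge_weight(p_i, p_j):
    """w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|) for the DBpedia property
    sets of two entities; 0.0 if either set is empty."""
    if not p_i or not p_j:
        return 0.0
    return len(p_i & p_j) / max(len(p_i), len(p_j))

# Invented example property sets for two entities
props_a = {"dbo:country", "dbo:capital", "dbo:population"}
props_b = {"dbo:country", "dbo:capital", "dbo:area", "dbo:mayor"}
```

Normalizing by the larger property set keeps the weight in [0, 1] and prevents entities with many properties from dominating purely by size.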

  20. Impact of Joint Embeddings on Entity Linking
     • Unsupervised entity linking based on salience, similarity and coherence
     • Tested on EDL16 perfect English NAM mentions
       Model                                             CEAFm P   CEAFm R   CEAFm F1
       Baseline                                          0.762     0.843     0.801
       + Joint word and entity embeddings (Wikipedia)    0.791     0.875     0.831
       + Entity embeddings (DBpedia)                     0.812     0.897     0.852

  21. Resources and Demos

  22. Systems, Data and Resources Publicly Available
     § Re-trainable systems:
       § http://blender02.cs.rpi.edu:3300/elisa_ie/api
       § Source code base available to government users upon request
       § Tri-lingual EDL is being integrated into CoreNLP, with release hoped for in 2017
     § Data and resources:
       § http://nlp.cs.rpi.edu/wikiann/
     § Demos:
       § http://blender02.cs.rpi.edu:3300/elisa_ie
       § http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap

  23. Demo 1: Cross-lingual Entity Discovery and Linking for 282 Languages
     § http://blender02.cs.rpi.edu:3300/elisa_ie
     § http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap

  24. Cross-lingual Entity Discovery and Linking for 282 Languages (Cont'd)

  25. Cross-lingual Entity Discovery and Linking for 282 Languages (Cont'd)

  26. IE Application Example: Disaster Relief

  27. Cross-lingual Entity Discovery and Linking for 282 Languages (Cont'd)
     § http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap
