SLIDE 1

Exploring Multi-level Distributional Semantics for Cross-lingual Entity Discovery and Linking

Boliang Zhang, Xiaoman Pan, Lifu Huang, Ying Lin, Heng Ji (jih@rpi.edu)

SLIDE 2

Noisy Training Data Acquisition 1: Chinese Room


SLIDE 3

Noisy Training Data Acquisition 1: Chinese Room


SLIDE 4

Noisy Training Data Acquisition 2: Wikipedia Mining

§ Generate “silver-standard” training data automatically
§ Apply self-training to make training data more complete and consistent


SLIDE 5

Exploit Non-traditional Universal Linguistic Resources

• Grammar books from Lori Levin’s bookshelf and CIA Names from DARPA PM’s bookshelf
• Unicode Common Locale Data Repository, Wiktionary, PanLex, Multilingual WordNet, GeoNames, JRC Names, phrase pairs mined from Wikipedia
• Phrase Books from Language Survival Kits and Elicitation Corpus
• Ignored by the NLP community


SLIDE 6

Linguistic Structure from the WALS Database and Syntactic Structures of the World's Languages (SSWL)

• Universal Morphology Analyzer based on Wikipedia markups
• Kıta Fransası, güneyde [[Akdeniz]]den kuzeyde [[Manş Denizi]] ve [[Kuzey Denizi]]ne, doğuda [[Ren Nehri]]nden batıda [[Atlas Okyanusu]]na kadar yayılan topraklarda yer alır. (Continental France lies on lands extending from the [[Mediterranean Sea]] in the south to the [[English Channel]] and [[North Sea]] in the north, and from the [[Rhine River]] in the east to the [[Atlantic Ocean]] in the west.)
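Because such case suffixes attach directly outside the closing brackets of anchor links, (entity, suffix) pairs can be harvested from raw markup with a simple pattern match. A minimal sketch (the regex and function name are ours, not from the slides):

```python
import re

# Matches a wiki anchor link plus any word characters glued to it,
# e.g. "[[Akdeniz]]den" -> entity "Akdeniz", suffix "den".
ANCHOR_SUFFIX = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\](\w*)")

def mine_suffixes(text):
    """Collect (entity, attached suffix) pairs from Wikipedia markup."""
    return [(m.group(1), m.group(2))
            for m in ANCHOR_SUFFIX.finditer(text) if m.group(2)]

sentence = ("Kıta Fransası, güneyde [[Akdeniz]]den kuzeyde [[Manş Denizi]] ve "
            "[[Kuzey Denizi]]ne, doğuda [[Ren Nehri]]nden batıda "
            "[[Atlas Okyanusu]]na kadar yayılan topraklarda yer alır.")
print(mine_suffixes(sentence))
# [('Akdeniz', 'den'), ('Kuzey Denizi', 'ne'),
#  ('Ren Nehri', 'nden'), ('Atlas Okyanusu', 'na')]
```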


SLIDE 7
Character-Aware Word Embeddings

• Motivation: mentions of the same concept across languages may share a set of similar characters, e.g., Semsettin Gunaltay (English) = Şemsettin Günaltay (Turkish) = Semsetin Ganoltey (Somali)
• Compose word embeddings from shared character embeddings using Convolutional Neural Networks
• Further optimized by a language model based on Recurrent Neural Networks
  § maximize the prediction of the current word based on previous words
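A minimal sketch of the character-CNN composition, in PyTorch; the filter widths and sizes are illustrative assumptions, not the authors' configuration. The resulting word vectors would then feed the RNN language model described above:

```python
import torch
import torch.nn as nn

class CharCNNWordEmbedder(nn.Module):
    """Compose a word embedding from its characters (shared across languages)."""
    def __init__(self, n_chars, char_dim=32, n_filters=128, widths=(2, 3, 4)):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, w, padding=w - 1) for w in widths)

    def forward(self, char_ids):          # (batch, max_word_len)
        x = self.char_emb(char_ids)       # (batch, len, char_dim)
        x = x.transpose(1, 2)             # (batch, char_dim, len)
        # Max-over-time pooling per filter width, then concatenate.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)   # (batch, n_filters * len(widths))
```

Because the character inventory is shared, transliteration variants such as "Gunaltay" / "Günaltay" / "Ganoltey" receive nearby vectors even with no bilingual supervision.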


SLIDE 8


Feed Non-traditional Linguistic Resources into DNN

[Architecture figure: character embeddings (CNN) and word embeddings, concatenated with linguistic feature embeddings, feed stacked bidirectional (left/right) LSTMs; hidden layers feed a CRF network that outputs B/I/O tags]

Linguistic Features
• English and Low-resource Language Patterns
• Low-resource Language to English Lexicons
• Gazetteers
• Low-resource Language Grammar Rules
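A condensed sketch of such a tagger, assuming PyTorch and the CRF layer from the pytorch-crf package; the dimensions and the single feature-embedding table are illustrative simplifications, and the character-CNN branch from the previous slide is omitted for brevity:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class FeatureAwareTagger(nn.Module):
    def __init__(self, n_words, n_feats, n_tags,
                 word_dim=100, feat_dim=20, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        # One embedded feature id per token: patterns, lexicon matches,
        # gazetteer hits, grammar-rule firings, etc.
        self.feat_emb = nn.Embedding(n_feats, feat_dim)
        self.lstm = nn.LSTM(word_dim + feat_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(2 * hidden, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def _emissions(self, words, feats):
        x = torch.cat([self.word_emb(words), self.feat_emb(feats)], dim=-1)
        h, _ = self.lstm(x)
        return self.hidden2tag(h)

    def loss(self, words, feats, tags, mask):
        return -self.crf(self._emissions(words, feats), tags, mask=mask)

    def decode(self, words, feats, mask):
        # Returns B/I/O tag sequences per sentence.
        return self.crf.decode(self._emissions(words, feats), mask=mask)
```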

SLIDE 9

Common Semantic Space Construction


SLIDE 10

Construct a Common Semantic Space for Thousands of Languages

§ Motivations
  § There are 3,000+ languages with electronic records
  § NLP training data is only available for several dominant languages
§ Goals
  § Build a common semantic space across thousands of languages for resource sharing and richer continuous semantic representations for words, concepts and entities
§ Limitations of Previous Attempts (e.g., Upadhyay et al., 2016; Cho et al., 2017)
  § Mostly English-anchored, cannot capture all linguistic phenomena
  § Heavily relied on bilingual dictionaries and parallel data, which are not always available
  § Limited to dozens of languages


SLIDE 11
Multi-Level Multi-lingual Alignment

• When bilingual word dictionaries are not available, back off to shared linguistic structures
  § e.g., apposition, conjunction, plural suffixes (English -s/-es, Turkish -lar/-ler, Somali -o)
  § Generalized from language-universal resources such as the WALS database and Syntactic Structures of the World's Languages
  § These classify languages according to a large number of typological properties (phonological, lexical, grammatical)
  § 2,676 languages, 58,000+ (language, feature, feature value) tuples, e.g., (English, canonical word order, SVO)
• Project monolingual word embeddings into a common semantic space, and align the representations of both words and linguistic structures in the common space
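One standard realization of such a projection is an orthogonal (Procrustes) mapping fit on whatever anchors are available: dictionary entries when they exist, shared linguistic structures as the back-off. A sketch under that assumption, not necessarily the authors' exact method:

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal W minimizing ||XW - Y||_F for anchor matrices X, Y (n x d).

    Rows are embeddings of anchor pairs: bilingual dictionary entries when
    available, otherwise representations of shared structures (e.g., the
    plural-suffix contexts of -lar/-ler vs. -s/-es).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Hypothetical usage: map a Turkish embedding matrix E_tr into the common
# space via W = procrustes_map(X_tr, Y_common); projected = E_tr @ W
```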


SLIDE 12
Model Training

• Language model prediction loss:
• Multilingual alignment loss:
• Overall loss:
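The three loss formulas on this slide were images and did not survive extraction. A hedged reconstruction consistent with the bullets and with the earlier slide (an RNN LM maximizing prediction of the current word from previous words), in our notation rather than necessarily the authors' exact objective:

```latex
% Language model prediction loss: maximize prediction of the current
% word given previous words (negative log-likelihood form).
\mathcal{L}_{\mathrm{lm}} = -\sum_{t} \log P(w_t \mid w_1, \ldots, w_{t-1})

% Multilingual alignment loss over aligned word/structure pairs (x, y),
% with f(\cdot) the projection into the common semantic space.
\mathcal{L}_{\mathrm{align}} = \sum_{(x, y)} \lVert f(x) - f(y) \rVert_2^2

% Overall loss; \lambda is a hypothetical interpolation weight.
\mathcal{L} = \mathcal{L}_{\mathrm{lm}} + \lambda\, \mathcal{L}_{\mathrm{align}}
```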


SLIDE 13

Linguistic Features Matter: More Robust to Noise

[Figure: Uzbek name tagging results (Zhang et al., 2017)]


SLIDE 14

Impact of Character-Aware Word Embeddings


Name Tagging F-Score (%)

Models   Chinese   English   Spanish
Before   64.1      67.4      64.6
After    68.0      70.9      68.9

SLIDE 15
Impact of Common Semantic Space

• Chechen Name Tagging

Models                                   P (%)   R (%)   F (%)
Randomly initialized                     46.3    45.31   45.8
Pre-trained                              54.8    41.3    47.1
+ Common semantic space word embedding   62.1    50.1    55.4

SLIDE 16

Something Old: Hierarchical Brown Clustering


Language         w/o BC (%)   with BC (%)
Albanian         72.4         74.6
Chechen          53.1         55.4
Chinese          66.3         68.0
English          69.5         70.9
Kannada          51.9         56.0
Kikuyu           84.2         88.7
Nepali           41.6         43.9
Northern Sotho   90.2         90.8
Polish           49.6         53.2
Somali           76.9         78.5
Spanish          67.1         68.9
Swahili          64.3         67.8
Yoruba           46.1         49.5
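Brown cluster (BC) features typically enter a tagger as bit-string prefixes at several granularities. A minimal sketch, assuming the paths-file format of Liang's brown-cluster tool (one `bitstring<TAB>word<TAB>count` line per word); the prefix lengths are an illustrative choice:

```python
def load_brown_paths(path):
    """Read a Liang-style paths file: 'bitstring<TAB>word<TAB>count' per line."""
    word2bits = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            word2bits[word] = bits
    return word2bits

def brown_features(word, word2bits, prefixes=(4, 6, 10, 20)):
    """Bit-string prefix features at several granularities for a tagger."""
    bits = word2bits.get(word)
    if bits is None:
        return []
    return [f"brown_{p}={bits[:p]}" for p in prefixes]
```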

SLIDE 17

Joint Learning of Word and Entity Embeddings from Wikipedia

• Considering all Wikipedia anchor links as entity annotations, a training corpus can be created by replacing anchor links with unique entity IDs, e.g.:

  [[en/Apple|apple]] is a fruit
  [[en/Apple_Inc.|apple]] is a company

  → apple is a fruit
    apple is a company
    en/Apple is a fruit
    en/Apple_Inc. is a company

• Multi-lingual
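A minimal sketch of deriving the two corpora from one anchor-annotated line, surface-form text for word embeddings and entity-ID text for entity embeddings; the regex and function names are ours:

```python
import re

# [[en/Apple|apple]] or [[en/Apple]]: group 1 = entity ID, group 2 = surface form.
ANCHOR = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def to_word_corpus(line):
    """'[[en/Apple|apple]] is a fruit' -> 'apple is a fruit'"""
    return ANCHOR.sub(lambda m: m.group(2) or m.group(1), line)

def to_entity_corpus(line):
    """'[[en/Apple|apple]] is a fruit' -> 'en/Apple is a fruit'"""
    return ANCHOR.sub(lambda m: m.group(1), line)

line = "[[en/Apple|apple]] is a fruit"
print(to_word_corpus(line))    # apple is a fruit
print(to_entity_corpus(line))  # en/Apple is a fruit
```

Concatenating both corpora lets word and entity vectors be learned jointly in a single space.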
SLIDE 18

Joint Learning of Word and Entity Embeddings from Wikipedia

[Figure: joint representation learning across a Knowledge Space and a Text Space. Entity representation learning models each entity's knowledge-base neighborhood, P(N(e_j) | e_j): e.g., Independence Day (US) connects to United States, Fireworks, Memorial Day and Public holidays in the United States via relations such as "observed by" and "category", plus inlinks and outlinks; Independence Day (film) connects to Will Smith ("starring") and Philadelphia ("born", "country"). Text representation learning models word contexts, P(C(w_i) | w_i), from anchor-text sentences such as "... bands played it during public events, such as [[Independence Day (US)|July 4th]] celebrations ..." and "... In the 1996 action film [[Independence Day (film)|Independence Day]], the United States military uses alien technology captured ...". Mention representation learning maps each mention to a sense s* via g(mention, entity), e.g., g(July 4th, e_1), with P(C(m_l) | s*_j) and P(e_j | C(m_l), s*_j) tying the two spaces together.]

SLIDE 19

Learning Entity Embeddings from DBpedia

• Construct a weighted undirected graph G = (E, D) from DBpedia, where E is the set of all entities in DBpedia and d_ij ∈ D indicates that two entities e_i and e_j share some DBpedia properties. The weight w_ij of d_ij is computed as:

  w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|)

• where p_i, p_j are the sets of DBpedia properties of e_i and e_j respectively.
• Apply the graph embedding framework proposed by Tang et al. (2015) to generate knowledge representations for all entities.
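A direct sketch of the weight computation; the property sets below are toy stand-ins for DBpedia properties:

```python
def edge_weight(props_i, props_j):
    """w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|) for two entities' property sets."""
    if not props_i or not props_j:
        return 0.0
    return len(props_i & props_j) / max(len(props_i), len(props_j))

# Toy example (hypothetical property sets):
p_film = {"starring", "director", "country", "releaseDate"}
p_holiday = {"country", "observedBy", "date"}
print(edge_weight(p_film, p_holiday))  # 0.25
```

The resulting weighted graph is then embedded with the LINE framework of Tang et al. (2015).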

SLIDE 20

Impact of Joint Embeddings on Entity Linking


                                                    CEAFm P   CEAFm R   CEAFm F1
Baseline                                            0.762     0.843     0.801
+ Joint word and entity embeddings from Wikipedia   0.791     0.875     0.831
+ Entity embeddings from DBpedia                    0.812     0.897     0.852

• Unsupervised entity linking based on salience, similarity and coherence
• Tested on EDL16 perfect English NAM mentions
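As a rough illustration of how salience, similarity and coherence might be combined in an unsupervised linker; the weights, field names and scoring functions below are our assumptions, not the authors' formulation:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def link_score(mention_vec, entity, context_entities, w=(0.3, 0.4, 0.3)):
    """Combine three signals for one candidate entity (all illustrative)."""
    salience = entity["popularity"]                      # e.g., normalized inlink count
    similarity = cos(mention_vec, entity["embedding"])   # joint word/entity space
    coherence = (np.mean([cos(entity["embedding"], e["embedding"])
                          for e in context_entities])
                 if context_entities else 0.0)           # agreement with nearby links
    return w[0] * salience + w[1] * similarity + w[2] * coherence
```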
SLIDE 21

Resources and Demos


SLIDE 22

Systems, Data and Resources Publicly Available

§ Re-trainable Systems:
  § http://blender02.cs.rpi.edu:3300/elisa_ie/api
  § Source code base available to government users upon request
  § Tri-lingual EDL is being integrated into CoreNLP, with a release hoped for in 2017
§ Data and Resources:
  § http://nlp.cs.rpi.edu/wikiann/
§ Demos:
  § http://blender02.cs.rpi.edu:3300/elisa_ie
  § http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap


SLIDE 23

Demo 1: Cross-lingual Entity Discovery and Linking for 282 Languages

§ http://blender02.cs.rpi.edu:3300/elisa_ie
§ http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap


SLIDE 24

Cross-lingual Entity Discovery and Linking for 282 Languages (Cont'd)


SLIDE 25

Cross-lingual Entity Discovery and Linking for 282 Languages (Cont'd)


SLIDE 26


IE Application Example: Disaster Relief

SLIDE 27


Cross-lingual Entity Discovery and Linking for 282 Languages (Cont'd)

§ http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap