Framework for Supporting Multilingual Resource Development at Expert System



SLIDE 1

Framework for Supporting Multilingual Resource Development at Expert System

META-FORUM 2016, July 5th, 2016

Jose Manuel Gomez-Perez jmgomez@expertsystem.com

SLIDE 2

Framework for Supporting Multilingual Resource Development at Expert System, META-FORUM 2016

Expert System – About us

SLIDE 3

Expert System’s COGITO

  • COGITO interprets text to empower better, more informed decision making
  • Based on Sensigrafo, a monolingual representation of knowledge that is both deep and wide
  • Sensigrafo contains millions of word definitions, related concepts and linguistic information
  • Several person-years each
  • COGITO leverages context information for disambiguation based on Sensigrafo
  • Document categorization and information extraction encoded on top of Sensigrafo in rule-based categorization and extraction languages
  • Rule modeling supported by COGITO Studio

SLIDE 4

Expert System Today

14 languages natively supported

SLIDE 5

Challenges and Opportunities

  • Due to Expert System's rapid expansion in the European market, the company faced the challenge of creating new monolingual resources from scratch, or…
  • Achieve native multilinguality in a cost-effective manner, while maintaining high accuracy and reducing time to market
  • Generalized MT is not the solution: the resulting accuracy drops by at least 10% on average
  • Many of the projects in the new countries are conceptually similar to previous projects in other languages
  • Enable reuse of existing semantic and linguistic resources, including monolingual rule bases, across languages

SLIDE 6

Approach

  • The goal is not to automate the whole process, rather:
  • Bootstrap resources, providing knowledge engineers with a solid base and alleviating blank page syndrome, particularly for rule development
  • Leverage context information, both in text and in the monolingual Sensigrafos, to improve translation quality
  • Provide confidence values to guide validation efforts
  • Focus on the existing monolingual rule bases

[Diagram. Components: Automatic Rule Translation; Word & Sense Embeddings; Rule Learning. Labels: context-based mapping identification; no previous rule base to reuse; large document corpus available; reusable rule base exists (in a different language)]

SLIDE 7

Automatic Rule Translation

  • Transform rules in the original language into Abstract Syntax Trees (ASTs). Main nodes include concepts (word senses), lemmas, and keywords
  • AST translator replicates ASTs, modifying or replacing nodes from the source language to the target language
  • Different handling for each node and operator type. Relies on concept mapping between source and target Sensigrafos
  • Applied to 90K rules in IPTC, EUROVOC, etc. and language pairs IT-ES, IT-FR, EN-DE
  ✓ 99.9% of rules translated
  ✓ 55% to 70% accuracy
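
The AST translation step above can be sketched as follows. This is a minimal illustration, not Expert System's implementation: the Node class, the node kinds, the concept identifiers and the toy Italian to Spanish lexicon are all hypothetical, and the sketch assumes operators are language independent, which the slide does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    kind: str    # hypothetical kinds: "op", "concept", "lemma", "keyword"
    value: str   # operator name, concept id, or surface string
    children: List["Node"] = field(default_factory=list)


def translate(node: Node,
              concept_map: Dict[str, str],
              lexicon: Dict[str, str],
              missing: List[str]) -> Node:
    """Return a translated copy of `node`; record unmapped leaves in `missing`."""
    if node.kind == "op":
        # Assumption: operators carry over unchanged, only children are translated.
        kids = [translate(c, concept_map, lexicon, missing) for c in node.children]
        return Node("op", node.value, kids)
    # Concepts go through the Sensigrafo-to-Sensigrafo mapping,
    # lemmas and keywords through a bilingual lexicon.
    table = concept_map if node.kind == "concept" else lexicon
    if node.value in table:
        return Node(node.kind, table[node.value])
    # No mapping found: keep the source value and flag it for manual review.
    missing.append(node.value)
    return Node(node.kind, node.value)


# Toy rule: concept AND (lemma OR keyword). All ids and words are made up.
rule = Node("op", "AND", [
    Node("concept", "it:1001"),
    Node("op", "OR", [Node("lemma", "banca"), Node("keyword", "spread")]),
])
concept_map = {"it:1001": "es:2001"}
lexicon = {"banca": "banco"}
missing: List[str] = []
translated = translate(rule, concept_map, lexicon, missing)
```

The translator builds a new tree rather than editing the source rule, so the original rule base stays intact, and the `missing` list gives knowledge engineers a starting point for validation.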

SLIDE 8

Word and Sense Embeddings

  • Suggest missing links between Sensigrafos using context for word sense disambiguation
  • Builds on MT work to infer missing dictionary entries (Mikolov et al.)
  • Learn monolingual models and a linear projection between them
  • Learnt relations display several degrees of relatedness with different confidence values, e.g. equivalence, similarity, co-occurrence, etc.
  • Pleno (ES) -> full, plenary, partsession, Hortefeux, approve, summarize (EN)
  • Tokenized, lemmatized and normalized the EUROPARL parallel corpora using COGITO
  • Skip-gram model, window size 10, vector dimensionality 400
  • Linear projection learnt from a dictionary with the 5,000 most frequent terms in the source language and their MT equivalent in the target
  • Translation matrix code in Java available on GitHub: https://github.com/josemanuelgp/word2vec_vector-translation-java
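
The linear projection idea can be illustrated with a small, dependency-free sketch. It is not the Java code referenced above: the function names, the 2-dimensional vectors and the toy vocabulary are assumptions made for illustration (the slide reports 400 dimensions and a 5,000-entry seed dictionary). The sketch fits a matrix W that minimises the squared error between projected source vectors and their target counterparts, then ranks target words by cosine similarity, which here stands in for the confidence value.

```python
import math
from typing import Dict, List, Tuple

Vec = List[float]


def apply(W: List[Vec], x: Vec) -> Vec:
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def learn_projection(pairs: List[Tuple[Vec, Vec]],
                     lr: float = 0.1, epochs: int = 200) -> List[Vec]:
    """Fit W minimising sum ||W x - z||^2 with stochastic gradient descent."""
    dim_out, dim_in = len(pairs[0][1]), len(pairs[0][0])
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for x, z in pairs:
            err = [p - t for p, t in zip(apply(W, x), z)]
            for i in range(dim_out):
                for j in range(dim_in):
                    W[i][j] -= lr * 2.0 * err[i] * x[j]
    return W


def cosine(a: Vec, b: Vec) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def candidates(W: List[Vec], x: Vec,
               target_vocab: Dict[str, Vec]) -> List[Tuple[str, float]]:
    """Project x into the target space and rank target words by cosine."""
    projected = apply(W, x)
    scored = [(word, cosine(projected, vec)) for word, vec in target_vocab.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)


# Toy seed dictionary: target vectors are the source vectors rotated by 90 degrees.
seed_pairs = [([1.0, 0.0], [0.0, 1.0]),
              ([0.0, 1.0], [-1.0, 0.0]),
              ([1.0, 1.0], [-1.0, 1.0])]
W = learn_projection(seed_pairs)

# Made-up vectors for a held-out source word and three target words.
pleno = [2.0, 1.0]
target_vocab = {"plenary": [-1.0, 2.0],
                "approve": [1.0, 0.5],
                "summarize": [1.0, -1.0]}
ranked = candidates(W, pleno, target_vocab)
```

Returning the full ranked list rather than a single best match mirrors the slide's point that learnt relations come with several degrees of relatedness.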

SLIDE 9

Rule Learning

  • Automatically bootstrap a rule base starting from a targeted
  • Focus on beginner's rules rather than perfect rules
  • Two main approaches, based on tf-idf and decision trees
  ✓ Precision >34%
  ✓ Recall >65%
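
As a rough illustration of the tf-idf approach only (the decision-tree variant is not covered), the sketch below derives a starter keyword list per category from a labelled toy corpus. The function name, the corpus, and the choice to pool all documents of a category into one bag of words are assumptions, not details from the talk.

```python
import math
from collections import Counter
from typing import Dict, List


def bootstrap_rules(corpus: Dict[str, List[str]],
                    top_k: int = 2) -> Dict[str, List[str]]:
    """Pick, per category, the top_k terms by tf-idf as a starter keyword rule."""
    # Pool all documents of a category into one bag of words.
    bags = {cat: Counter(w for doc in docs for w in doc.lower().split())
            for cat, docs in corpus.items()}
    n = len(bags)
    # Document frequency: in how many categories does each term occur?
    df = Counter(term for bag in bags.values() for term in bag)
    rules: Dict[str, List[str]] = {}
    for cat, bag in bags.items():
        total = sum(bag.values())
        score = {t: (c / total) * math.log(n / df[t]) for t, c in bag.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(score, key=lambda t: (-score[t], t))
        rules[cat] = ranked[:top_k]
    return rules


# Made-up two-category corpus.
corpus = {
    "sports": ["the team won the match", "the team lost the match"],
    "finance": ["the bank raised the rate", "the bank cut the rate"],
}
rules = bootstrap_rules(corpus)
```

Terms that occur in every category, such as "the", get an idf of zero and never make it into a rule, which is what makes the selected keywords discriminative enough to serve as a first draft for a knowledge engineer to refine.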

SLIDE 10

Come see our poster!

Framework for Supporting Multilingual Resource Development at Expert System

Jose Manuel Gomez-Perez jmgomez@expertsystem.com