Embedded Controlled Languages. Aarne Ranta. CNL 2014, Galway, 20-22 August 2014.





SLIDE 1

Aarne Ranta

CNL 2014, Galway 20-22 August 2014

CLT

Embedded Controlled Languages

SLIDE 2

Joint work with

Krasimir Angelov, Björn Bringert, Grégoire Détrez, Ramona Enache, Erik de Graaf, Normunds Gruzitis, Qiao Haiyan, Thomas Hallgren, Prasanth Kolachina, Inari Listenmaa, Peter Ljunglöf, K.V.S. Prasad, Scharolta Siencnik, Shafqat Virk 50+ GF Resource Grammar Library contributors

SLIDE 3

Embedded programming languages

DSL = Domain-Specific Language
Embedded DSL = fragment (library) of a host language

+ low implementation effort
+ no additional learning if you know the host language
+ you can fall back to the host language if the DSL is not enough
- reasoning about DSL properties is more difficult
SLIDE 4

Timeline

1998: GF = Grammatical Framework
2001: RGL = Resource Grammar Library
2008: CNL, explicitly
2010: MOLTO: CNL-based translation
2012: wide-coverage translation
2014: embedded CNL translation

SLIDE 5

Outline

  • “CNL is a part of NL”
  • CNL embedded in NL
  • Example: translation
  • Demo: web and mobile app
SLIDE 6

CNL as a part of NL

It is a part:

  • it is understandable without extra learning

It is a proper part:

  • it excludes parts that are not so good
  • it can be controlled, maybe even defined
SLIDE 7

How to define and delimit a CNL

How to guarantee that it is a part

  • the CNL may be formal, the NL certainly isn’t

How to help keep within the limits

  • so that the user stays within the CNL
SLIDE 8

Bottom-up vs. top-down CNL

Bottom-up: define CNL rule by rule

  • nothing is in the CNL unless given by rules
  • e.g. Attempto Controlled English

Top-down: delimit CNL by constraining NL

  • everything is in the CNL unless blocked by rules
  • e.g. Simplified English
SLIDE 9

Defining and delimiting CNL

Bottom-up:

  • How do we know that the rules are valid NL?

Top-down:

  • How do we decide what is in the CNL?
SLIDE 10

Defining bottom-up

Message ::= "you have" Number "points"

you have five points
you have one points (the string-based rule overgenerates: it cannot enforce number agreement)
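The overgeneration problem can be made concrete as a toy GF grammar over plain strings (a sketch; the module and the One/Five constructors are illustrative, not from the talk):

```gf
-- Abstract syntax: messages built from numbers
abstract Msg = {
  cat Message ; Number ;
  fun HaveMsg : Number -> Message ;
  fun One, Five : Number ;
}

-- String-based concrete syntax, mirroring the BNF rule above
concrete MsgEng of Msg = {
  lincat Message, Number = Str ;
  lin HaveMsg n = "you have" ++ n ++ "points" ;
  lin One = "one" ;
  lin Five = "five" ;
}

-- linearize (HaveMsg Five) : "you have five points"
-- linearize (HaveMsg One)  : "you have one points" -- no agreement possible over plain strings
```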

SLIDE 11

Delimiting top-down

Passives must be avoided. But how do we recognize them in all contexts? They occur in all tenses, in questions, in infinitives, and must be kept apart from adjectives...

SLIDE 12

An answer to both problems

Define CNL formally as a part of NL

  • use a grammar of the whole NL
  • bottom-up: rules defined as applications of NL rules
  • top-down: constraints written as conditions on NL trees
SLIDE 13

The whole NL?

An approximation: GF Resource Grammar Library (RGL)

  • morphology
  • syntactic structures
  • lexicon
  • common syntax API
  • 29 languages
SLIDE 14

Bottom-up CNL

Use RGL as library

  • use its API function calls rather than plain strings

HavePoints p n = mkCl p have_V2 (mkNP n point_N)

This generates you have five points, she has one point, etc., and likewise in other languages.
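Spelled out as complete modules, the rule might look as follows (a sketch: the Points grammar and its Player/Amount categories are illustrative; have_V2 and point_N are assumed to come from the RGL lexicon, and mkCl, mkNP from its syntax API, as on the slide):

```gf
abstract Points = {
  cat Message ; Player ; Amount ;
  fun HavePoints : Player -> Amount -> Message ;
}

concrete PointsEng of Points = open SyntaxEng, LexiconEng in {
  lincat Message = Cl ; Player = NP ; Amount = Numeral ;
  -- the RGL supplies the agreement: "you have five points", "she has one point"
  lin HavePoints p n = mkCl p have_V2 (mkNP n point_N) ;
}
```

A Swedish or German concrete syntax reuses the same abstract syntax, opening SyntaxSwe or SyntaxGer instead.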

SLIDE 15

Top-down CNL

Use RGL as run-time grammar

  • use its parser to produce trees
  • filter trees by pattern matching

hasPassive t = case t of
  PassVPSlash _ -> return True
  _ -> composOp hasPassive t

(Bringert & Ranta, A Pattern for Almost Compositional Operations, JFP 2008)

SLIDE 16

Top-down CNL

Use RGL as run-time grammar

  • change unwanted input

unPassive t = case t of
  PredVP np (PassVPSlash vps) -> liftM2 PredVP (unPassive np) (unPassive vps)
  _ -> composOp unPassive t

Non-CNL input is recognized but corrected.

SLIDE 17

Embedded bottom-up CNL

  • 1. Define CNL as usual, maybe with RGL as library
  • 2. Build a module that inherits both CNL and RGL

abstract Embedded = CNL, RGL ** {
  cat Start ;
  fun UseCNL : CNL_Start -> Start ;
  fun UseRGL : RGL_Start -> Start ;
}
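On the concrete side, the two embedding functions can be linearized as identities, so CNL trees and raw RGL trees share one string language (a sketch; the module names and the assumption that CNL_Start and RGL_Start have the linearization type {s : Str} are illustrative):

```gf
concrete EmbeddedEng of Embedded = CNLEng, RGLEng ** {
  lincat Start = {s : Str} ;
  -- both kinds of trees linearize to the same Start strings
  lin UseCNL x = x ;
  lin UseRGL x = x ;
}
```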

SLIDE 18

Using embedded CNL

Parsing will try both CNL and RGL. You can give priority to CNL trees. The parser is robust (if the RGL has enough coverage). Non-CNL input is not a failure, but can be processed further.

SLIDE 19

Example: translation

We want to have machine translation that

  • delivers publication quality in areas where reasonable effort is invested
  • degrades gracefully to browsing quality in other areas
  • shows a clear distinction between these

We do this by using grammars and type-theoretical interlinguas implemented in GF, Grammatical Framework

SLIDE 20

GF translation app in greyscale

SLIDE 21

GF translation app in full colour

SLIDE 22

translation by meaning

  • correct
  • idiomatic

translation by syntax

  • grammatical
  • often strange
  • often wrong

translation by chunks

  • probably ungrammatical
  • probably wrong
SLIDE 23

The Vauquois triangle: word-to-word transfer, syntactic transfer, semantic interlingua.

SLIDE 24

The Vauquois triangle: word-to-word transfer, syntactic transfer, semantic interlingua.

SLIDE 25

What is it good for?

SLIDE 26

get an idea
get the grammar right
publish the content

SLIDE 27

Who is doing it?

SLIDE 28

Google, Bing, Apertium
GF the last 15 months
GF in MOLTO

SLIDE 29

What should we work on?

SLIDE 30

chunks for robustness and speed
syntax for grammaticality
semantics for full quality and speed

All!

SLIDE 31

We want a system that

  • can reach perfect quality
  • has robustness as back-up
  • tells the user which is which

We “combine GF, Apertium, and Google”. But we do it all in GF!

SLIDE 32

How to do it?

a brief summary

SLIDE 33

translator
chunk grammar
resource grammar
CNL grammar

SLIDE 34

How much work is needed?

SLIDE 35

translator

chunk grammar

resource grammar

CNL grammars

SLIDE 36

resource grammar

  • morphology
  • syntax
  • generic lexicon

precise linguistic knowledge
manual work can’t be escaped

SLIDE 37

CNL grammars

domain semantics, domain idioms

  • need domain expertise

use resource grammar as library

  • minimize hand-hacking

the work never ends

  • we can only cover some domains
SLIDE 38

chunk grammar

words
suitable word sequences

  • local agreement
  • local reordering

easily derived from resource grammar
easily varied
minimize hand-hacking

SLIDE 39

translator

PGF run-time system

  • parsing
  • linearization
  • disambiguation

generic for all grammars
portable to different user interfaces

  • web
  • mobile
SLIDE 40

Disambiguation?

Grammatical: give priority to green over yellow, yellow over red
Statistical: use a distribution model for grammatical constructs (incl. word senses)
Interactive: for the last mile in the green zone

SLIDE 41

Advantages of GF

Expressivity: easy to express complex rules

  • agreement
  • word order
  • discontinuity

Abstractions: easy to manage complex code
Interlinguality: easy to add new languages

SLIDE 42

Resources: basic and bigger

Norwegian, Danish, Afrikaans, Maltese, Romanian, Catalan, Polish, Estonian, Russian, Latvian, Thai, Japanese, Urdu, Punjabi, Sindhi, Greek, Nepali, Persian, English, Swedish, German, Dutch, French, Italian, Spanish, Bulgarian, Finnish, Chinese, Hindi

SLIDE 43
SLIDE 44

How to do it?

some more details

SLIDE 45

Translation model: multi-source multi-target compiler

SLIDE 46

Translation model: multi-source multi-target compiler-decompiler

Abstract Syntax

Hindi, Chinese, Finnish, Swedish, English, Spanish, German, French, Bulgarian, Italian

SLIDE 47

Word alignment: compiler

1 + 2 * 3

00000011 00000100 00000101 01101000 01100000

SLIDE 48

Abstract syntax

Add : Exp -> Exp -> Exp
Mul : Exp -> Exp -> Exp
E1, E2, E3 : Exp

Add E1 (Mul E2 E3)

SLIDE 49

Concrete syntax

abstract   Java       JVM
Add x y    x "+" y    x y "01100000"
Mul x y    x "*" y    x y "01101000"
E1         "1"        "00000011"
E2         "2"        "00000100"
E3         "3"        "00000101"
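The table can be read as two GF concrete syntaxes for the abstract syntax on the previous slide (a sketch; the concrete module names are illustrative):

```gf
abstract Exp = {
  cat Exp ;
  fun Add, Mul : Exp -> Exp -> Exp ;
  fun E1, E2, E3 : Exp ;
}

concrete ExpJava of Exp = {
  lincat Exp = Str ;
  lin Add x y = x ++ "+" ++ y ;
  lin Mul x y = x ++ "*" ++ y ;
  lin E1 = "1" ; lin E2 = "2" ; lin E3 = "3" ;
}

concrete ExpJVM of Exp = {
  lincat Exp = Str ;
  -- postfix code: operands first, then the instruction
  lin Add x y = x ++ y ++ "01100000" ;
  lin Mul x y = x ++ y ++ "01101000" ;
  lin E1 = "00000011" ; lin E2 = "00000100" ; lin E3 = "00000101" ;
}
```

Linearizing the tree Add E1 (Mul E2 E3) with ExpJava gives the source line and with ExpJVM the byte code: parsing in one concrete syntax and linearizing in the other is the compiler (or decompiler).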

SLIDE 50

Compiling natural language

Abstract syntax:

Pred : NP -> V2 -> NP -> S
Mod : AP -> CN -> CN
Love : V2

Concrete syntax:

             English    Latin
Pred s v o   s v o      s o v
Mod a n      a n        n a
Love         "love"     "amare"
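Ignoring agreement, the word-order part of this table can be written as two Str-based concrete syntaxes over one abstract syntax (a sketch; real RGL linearization types are richer records, as slide 52 shows):

```gf
abstract Pred3 = {
  cat S ; NP ; V2 ; AP ; CN ;
  fun Pred : NP -> V2 -> NP -> S ;
  fun Mod  : AP -> CN -> CN ;
  fun Love : V2 ;
}

concrete Pred3Eng of Pred3 = {
  lincat S, NP, V2, AP, CN = Str ;
  lin Pred s v o = s ++ v ++ o ;  -- English: subject-verb-object
  lin Mod a n   = a ++ n ;        -- adjective before noun
  lin Love = "love" ;
}

concrete Pred3Lat of Pred3 = {
  lincat S, NP, V2, AP, CN = Str ;
  lin Pred s v o = s ++ o ++ v ;  -- Latin: subject-object-verb
  lin Mod a n   = n ++ a ;        -- adjective after noun
  lin Love = "amare" ;
}
```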

SLIDE 51

Word alignment

the clever woman loves the handsome man
femina sapiens virum formosum amat

Pred (Def (Mod Clever Woman)) Love (Def (Mod Handsome Man))

SLIDE 52

Linearization types

English:
  CN = {s : Number => Str}
  AP = {s : Str}
  Mod ap cn = {s = \\n => ap.s ++ cn.s ! n}

Latin:
  CN = {s : Number => Case => Str ; g : Gender}
  AP = {s : Gender => Number => Case => Str}
  Mod ap cn = {s = \\n,c => cn.s ! n ! c ++ ap.s ! cn.g ! n ! c ; g = cn.g}

SLIDE 53

Abstract syntax trees

my name is John
HasName I (Name “John”)

SLIDE 54

Abstract syntax trees

my name is John
HasName I (Name “John”)
Pred (Det (Poss i_NP) name_N) (NameNP “John”)

SLIDE 55

Abstract syntax trees

my name is John
HasName I (Name “John”)
Pred (Det (Poss i_NP) name_N) (NameNP “John”)
[DetChunk (Poss i_NP), NChunk name_N, copulaChunk, NPChunk (NameNP “John”)]

SLIDE 56

Building the yellow part

SLIDE 57

Building a basic resource grammar

Programming skills
Theoretical knowledge of language
3-6 months of work
3000-5000 lines of GF code

  • not easy to automate

+ only done once per language

SLIDE 58

Building a large lexicon

Monolingual (morphology + valencies)

  • extraction from open sources (SALDO etc)
  • extraction from text (extract)
  • smart paradigms

Multilingual (mapping from abstract syntax)

  • extraction from open sources (Wordnet, Wiktionary)
  • extraction from parallel corpora (Giza++)

Manual quality control at some point needed

SLIDE 59

Improving the resources

Multiwords: non-compositional translation

  • kick the bucket - ta ner skylten

Constructions: multiwords with arguments

  • i sötaste laget - excessively sweet

Extraction from free resources (Konstruktikon)
Extraction from phrase tables

  • example-based grammar writing
SLIDE 60

Building the green part

SLIDE 61

Define semantically based abstract syntax:

fun HasName : Person -> Name -> Fact

Define concrete syntax by mapping to resource grammar structures:

lin HasName p n = mkCl (possNP p name_N) n    -- my name is John
lin HasName p n = mkCl p heta_V2 n            -- jag heter John
lin HasName p n = mkCl p (reflV chiamare_V) n -- (io) mi chiamo John

SLIDE 62

Resource grammars give crucial help

  • CNL grammarians need not know linguistics
  • a substantial grammar can be built in a few days
  • adding new languages is a matter of a few hours

MOLTO’s goal was to make this possible.

SLIDE 63

Automatic extraction of CNLs?

  • abstract syntax from ontologies
  • concrete syntax from examples

    ○ including phrase tables

As always, full green quality needs expert verification.

  • formal methods help (REMU project)
SLIDE 64

These grammars are a source of

  • “non-compositional” translations
  • compile-time transfer
  • idiomatic language
  • translating meaning, not syntax

Constructions are the generalized form of this idea, originally domain-specific.

SLIDE 65

Building the red part

SLIDE 66
  • 1. Write a grammar that builds sentences from sequences of chunks

cat Chunk
fun SChunks : [Chunk] -> S

  • 2. Introduce chunks to cover phrases

fun NP_nom_Chunk : NP -> Chunk
fun NP_acc_Chunk : NP -> Chunk
fun AP_sg_masc_Chunk : AP -> Chunk
fun AP_pl_fem_Chunk : AP -> Chunk
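A minimal concrete syntax for such a chunk grammar, for a language whose NPs inflect for case (a sketch: the Case parameter and the Str lincats are illustrative; a real chunk grammar takes its categories and inflection tables from the RGL):

```gf
abstract Chunks = {
  cat S ; Chunk ; NP ;
  cat [Chunk] {1} ;   -- generates BaseChunk and ConsChunk
  fun SChunks : [Chunk] -> S ;
  fun NP_nom_Chunk, NP_acc_Chunk : NP -> Chunk ;
}

concrete ChunksL of Chunks = {
  param Case = Nom | Acc ;
  lincat S, Chunk, [Chunk] = Str ;
  lincat NP = Case => Str ;         -- NPs inflect for case
  lin BaseChunk c = c ;
  lin ConsChunk c cs = c ++ cs ;    -- a sentence is just a sequence of chunks
  lin SChunks cs = cs ;
  lin NP_nom_Chunk np = np ! Nom ;  -- the chunk fixes the NP to nominative
  lin NP_acc_Chunk np = np ! Acc ;  -- or to accusative
}
```

Agreement inside each chunk is thus preserved, while agreement between chunks is deliberately given up for robustness.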

SLIDE 67

Do this for all categories and feature combinations you want to cover. Include both long and short phrases

  • long phrases have better quality
  • short phrases add to robustness

Give long phrases priority by probability settings.

SLIDE 68

Long chunks are better:

[this yellow house] - [det här gula huset]
[this] [yellow house] - [den här] [gult hus]
[this] [yellow] [house] - [den här] [gul] [hus]

Limiting case: whole sentences as chunks.

SLIDE 69

Accurate feature distinctions are good, especially between closely related language pairs.

good: god, gott, goda (Swedish) / bon, bonne, bons, bonnes (French) / buono, buona, buoni, buone (Italian)

Apertium does this for every language pair.

SLIDE 70

Resource grammar chunks of course come with reordering and internal agreement:

Prep Det+Fem+Sg N+Fem+Sg A+Fem+Sg
dans la maison bleue

Prep-Det+Neutr+Sg+Dat A+Weak+Dat N+Neutr+Sg
im blauen Haus

SLIDE 71

Recall: chunks are just a by-product of the real grammar. Their sizes span single words <---> entire sentences. A wide-coverage chunking grammar can be built in a couple of hours by using the RGL.

SLIDE 72

Building the translation system

SLIDE 73

GF source

SLIDE 74

GF source

probability model

SLIDE 75

GF source

probability model
GF compiler
PGF binary

SLIDE 76

PGF binary
PGF runtime system

SLIDE 77

PGF binary
PGF runtime system
user interface

SLIDE 78

PGF binary
PGF runtime system
user interface
another PGF binary

SLIDE 79

PGF binary
PGF runtime system
user interface
another PGF binary

CNL

SLIDE 80

PGF binary
PGF runtime system
user interface
another PGF binary

another CNL

SLIDE 81

PGF binary
PGF runtime system
custom user interface
generic user interface
PGF runtime system
generic grammar
CNL

White: free, open-source. Green: a business idea (Digital Grammars).

SLIDE 82

User interfaces

  • command-line shell
  • web server
  • web applications
  • mobile applications

SLIDE 83

Demos

SLIDE 84

To test it yourself

Android app

http://www.grammaticalframework.org/demos/app.html

Web app

http://www.grammaticalframework.org/demos/translation.html

SLIDE 85

Take home

SLIDE 86

Implementing CNL in GF using RGL

  • less work and linguistic expertise
  • multilinguality (29 languages)

Embedding CNL in RGL

  • robustness
  • confidence control

On-going effort: translation

  • CNL as semantic model
  • contributions wanted to lexicon etc!

Other CNL applications: to do!

SLIDE 87