PROGENIE: Biographical descriptions for Intelligence Analysis

SLIDE 1

PROGENIE: Biographical descriptions for Intelligence Analysis

Pablo Duboue, Kathleen McKeown and Vassileios Hatzivassiloglou

Computer Science Department

Columbia University

in the city of New York

SLIDE 2

Goals

  • Provide end users with quick and concise descriptions of:

– Foreign military personnel
– Foreign political personnel
– Terrorists
– Criminals

  • Customizable

– Different users
– Different scenarios
– Different requirements

  • PROGENIE’s approach

– On-the-fly generation of descriptions of people

SLIDE 3

Motivation and Relevance

  • Information Retrieval

– Look for existing biographies

  • Summarization

– Integrate pieces of text from various textual sources

  • Natural Language Generation (NLG)

– Create text from structured information sources

  • PROGENIE’s Approach

– Builds on the NLG tradition
∗ Diverges from it by automatically constructing content plans
– Combines a generator with an agent-based infrastructure
– Mixes textual with non-textual sources

SLIDE 4

System Description

[Figure: system architecture diagram. Components: Knowledge Component, Generation Component, Learning Component (Content Planner). Other labels: Internet, Knowledge Sources, Text and knowledge resource (text, KB), Schema, Knowledge, Generated Biographies.]

SLIDE 5

Learning Component

  • Content Planner

– Structuring: Distribution of the information among textual elements
– Selection: Filtering of the available data

  • Schemas

– An implementation for Content Planners (McKeown, 1983)

  • Construct Content Planning Schemas from training data

– Training material: data and biographies
– The learned schemas will be used with new, unseen people

SLIDE 6

Text and Knowledge Resource

  • Celebrities

– Easily available
– Representative of the learning issues
– Possibility of corpus re-distribution

  • Size

– Data frames for 1,100 different celebrities
– Assorted biographies, ranging from 110 to 500 words
– Data and biographies crawled from independent web sites

SLIDE 7

Example of Text and Knowledge Resource

Actor, born Thomas Connery on August 25, 1930, in Fountainbridge, Edinburgh, Scotland, the son of a truck driver and charwoman. He has a brother, Neil, born in 1938. Connery dropped out of school at age fifteen to join the British Navy. Connery is best known for his portrayal of the suave, sophisticated British spy, James Bond, in the 1960s. . . .

[Figure: knowledge-base fragment paired with the biography above. Node identifiers: person-2654, person-7312, birth-1, occupation-1, relative-1, relative-2, name-1, name-2, date-1. Attribute and type labels: birth, occupation, relative, TYPE, person, name, date, year, first, middle, last. Values: "Thomas", "Jason", "Dashiel", "Sean", "Connery", 1930, c-actor, c-son, c-grand-son.]

SLIDE 8

Learning of Content Selection Rules (1)

  • To appear

– Duboue and McKeown, “Statistical Acquisition of Content Selection Rules for Natural Language Generation”, EMNLP 2003

  • Goals

– Analyze how variations in the data influence variations in the text
– Obtain high-level content selection rules to filter the input

SLIDE 9

Learning of Content Selection Rules (2)

  • Example

Given:
– (KB-1, Bio-1), (KB-2, Bio-2), (KB-3, Bio-3), (KB-4, Bio-4)

If:
– KB-{1, 2} contain birth place state ‘MD’
– KB-{3, 4} contain birth place state ‘NY’

Then:
– Compare the language models of Bio-{1, 2} against Bio-{3, 4}.
– If the models differ (as measured by cross entropy), select birth place state as content.
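The comparison in this example can be sketched in a few lines. This is a minimal illustration, not the method of the EMNLP 2003 paper: it assumes add-one smoothed unigram language models and a hand-picked threshold, neither of which is stated on the slide. The names `unigram_model`, `cross_entropy` and `should_select` are hypothetical.

```python
import math
from collections import Counter


def unigram_model(texts, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary (assumed model)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log2 q(w), in bits."""
    return -sum(p[w] * math.log2(q[w]) for w in p)


def should_select(bios_a, bios_b, threshold=0.05):
    """Select the attribute if the two groups of biographies use different language.

    The threshold is an arbitrary illustrative value, not one from the slides.
    """
    vocab = {w for t in bios_a + bios_b for w in t.lower().split()}
    p = unigram_model(bios_a, vocab)
    q = unigram_model(bios_b, vocab)
    # Cross entropy in excess of the entropy of p; zero when the models coincide.
    divergence = cross_entropy(p, q) - cross_entropy(p, p)
    return divergence > threshold
```

Here `bios_a` would hold the texts of Bio-{1, 2} and `bios_b` those of Bio-{3, 4}; a `True` result corresponds to selecting birth place state.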

SLIDE 10

Learning of Content Planning Schemas

  • Earlier experiments performed in a medical domain.
  • Corpus collected during the evaluation described in McKeown et al. (2001).
  • In Duboue and McKeown (2001), we mined the corpus to extract order constraints between semantic elements.
  • In Duboue and McKeown (2002), we used the corpus to learn content planning schemas using an alignment-based fitness function.

[Figure: learning of content planning schemas by genetic search. Labels: semantic input, transcripts, order constraints, genetic search, genetic pool, generation system, planner, operators, fitness fn, atomic operators, structure.]
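As a rough illustration of what mining order constraints between semantic elements can look like, the sketch below keeps a pair (a, b) when a comes before b in every sequence that contains both. It assumes each transcript has already been reduced to a sequence of semantic tags; the function name and the tags in the usage note are hypothetical, and the procedure in Duboue and McKeown (2001) may differ.

```python
from itertools import permutations


def mine_order_constraints(sequences):
    """Return pairs (a, b) where a precedes b in every sequence containing both.

    Precedence is judged on the first occurrence of each tag in a sequence.
    """
    tags = {t for seq in sequences for t in seq}
    constraints = set()
    for a, b in permutations(tags, 2):
        together = [s for s in sequences if a in s and b in s]
        if together and all(s.index(a) < s.index(b) for s in together):
            constraints.add((a, b))
    return constraints
```

For two tag sequences `["age", "gender", "history", "drugs"]` and `["age", "history", "gender", "drugs"]`, the pair (gender, history) is dropped in both directions because the sequences disagree, while pairs such as (age, drugs) are kept.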
SLIDE 11

Knowledge Component

  • Data for Learning

– Supplied by internal databases and networks
– E.g., Intelink, IAFIS

  • Data for Execution

– Information Extraction Agents on the Internet
– Publicly available data as a test bed
– Data represented in RDF (Semantic Web)

SLIDE 12

Generation Component

  • 1. Inference Module

Limited world-knowledge inference

  • 2. Content Planner

McKeown’s schemas

  • 3. Text Planner

Splits a rhetorical tree into paragraphs

  • 4. Referring Expression Generator

Handles pronominalization

  • 5. Aggregation

Mixes together clauses with similar structure

  • 6. Lexical Chooser

Selects words for concepts

  • 7. Surface Realizer

FUF/SURGE unification-based realizer
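The modules above run as a pipeline, each consuming the output of the previous one. The toy sketch below illustrates that idea for three of the seven stages only (referring expressions, aggregation, realization). Every function here is a hypothetical stand-in: the real system uses McKeown's schemas and FUF/SURGE, which this does not attempt to reproduce. The facts in the usage note are borrowed from the generated example in this deck.

```python
def pronominalize(facts, pronoun="He"):
    """Referring expressions: name an entity on first mention, use a pronoun afterwards."""
    seen, out = set(), []
    for entity, verb, obj in facts:
        subject = pronoun if entity in seen else entity
        seen.add(entity)
        out.append((entity, subject, verb, obj))
    return out


def aggregate(clauses):
    """Aggregation: merge consecutive clauses about the same entity with the same verb."""
    merged = []
    for entity, subject, verb, obj in clauses:
        if merged and merged[-1][0] == entity and merged[-1][2] == verb:
            e, s, v, o = merged[-1]
            merged[-1] = (e, s, v, f"{o} and {obj}")  # keep the earlier subject
        else:
            merged.append((entity, subject, verb, obj))
    return merged


def realize(clauses):
    """Surface realization: a trivial template instead of a unification grammar."""
    return " ".join(f"{subject} {verb} {obj}." for _, subject, verb, obj in clauses)


def generate(facts):
    """Run the stages in the order the slide lists them."""
    result = facts
    for stage in (pronominalize, aggregate, realize):
        result = stage(result)
    return result
```

Calling `generate` on the facts ("Usama Bin Laden", "is", "a civil engineer"), ("Usama Bin Laden", "is", "a constructor") and ("Usama Bin Laden", "was born in", "Saudi Arabia") yields "Usama Bin Laden is a civil engineer and a constructor. He was born in Saudi Arabia."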

SLIDE 13

Generated Example: Osama Bin Laden

  • overview:

– name of the person:
∗ He is Usama Bin Laden.
– place of birth:
∗ He was born in Saudi Arabia.
– nationality of the person:
∗ He was a national of Saudi Arabia.
∗ He does not currently have a nationality.
– occupation:
∗ He is a terrorist.
∗ He is the leader of Al-Qaeda.
∗ He is a civil engineer.
∗ He is a constructor.
– education received:
∗ He attended the primary school in Jeddah, Saudi Arabia.
∗ He attended the secondary school in Jeddah, Saudi Arabia.
∗ To study security, the CIA gave him training according to Hazhir Teimourian.

SLIDE 14

Conclusions

  • PROGENIE

– Addresses an existing requirement of intelligence and law enforcement personnel

  • Status

– Prototype Learning Component implemented in an earlier domain
∗ New version, acquired Content Selection rules
– Generation Component, five operational modules
– Knowledge Component, under construction