
Automatic Identification of Document Translations in Large Multilingual Document Collections
RANLP Conference, Borovets, Bulgaria, 11 September 2003


SLIDE 1

Automatic Identification of Document Translations in Large Multilingual Document Collections EC - Joint Research Centre - IPSC --- Ralf Steinberger 1

RANLP'2003, Bulgaria, 11.09.03

RANLP Conference, Borovets, Bulgaria, 11 September 2003
Bruno Pouliquen, Ralf Steinberger & Camelia Ignat
Joint Research Centre, Ispra, Italy
http://www.jrc.it/langtech

In a Nutshell

Spanish text: Resolución sobre los residuos radioactivos

English text: Resolution on radioactive waste

Eurovoc thesaurus descriptors, here displayed in English

6621020304 52160104

SLIDE 2

Agenda

  • Who we are and what we do
  • Eurovoc Thesaurus
  • Automatic assignment of thesaurus descriptors to text
      • Training Phase
      • Assignment Phase
  • Document Similarity Calculation and Translation Identification
  • Application Areas of the Technology


Goal of JRC’s Language Technology work

IDoRA System: Intelligent Document Retrieval and Analysis

  • Retrieval of potentially relevant texts
  • Text analysis and extraction of information from texts
  • Visualisation of the contents

SLIDE 3

Focus of JRC’s Language Technology work

  • Applications using more statistics and fewer language-specific resources
  • Multilingual and cross-lingual applications
  • Also for languages of EU Candidate Countries
  • Many languages; few human resources


Eurovoc Thesaurus

http://europa.eu.int/celex/eurovoc

  • Multilingual list of terms covering many different subject areas (wide coverage)
  • Developed by the European Parliament (EP) and others
  • Actively used to index (catalogue) and retrieve documents in large collections (fine-grained classification and cataloguing system)
  • Hierarchically organised into a maximum of 8 levels
      • top level: 21 fields
      • next level: 127 micro-thesauri
      • total: 5933 descriptors (version 3.0)
  • 5877 reciprocal relations (BT, NT: broader term, narrower term)
  • 2730 reciprocal associations (RT: related term)
SLIDE 4

Eurovoc (Top Level and Detail)

Top level:

04 Politics
08 International Relations
10 European Communities
12 Law
16 Economics
20 Trade
24 Finance
28 Social Questions
32 Education and Communications
36 Science
40 Business and Competition
44 Employment and Working Conditions
48 Transport
52 Environment
56 Agriculture, Forestry and Fisheries
60 Agri-Foodstuffs
64 Production, Technology and Research
66 Energy
68 Industry
72 Geography
76 International Organisations

Detail:

28 SOCIAL QUESTIONS
    2806 family
    2811 migration
    2816 demography and population
    2821 social framework
    2826 social affairs
    2831 culture and religion
        arts
        cultural policy
        culture
            acculturation
            civilization
            cultural difference
            cultural identity
                RT: protection of minorities (1236)
                RT: socio-cultural group (2821)
            cultural pluralism
            popular culture
            regional culture
        religion
    2836 social protection
    2841 health
    2846 construction and town planning


Eurovoc Users

Documentation Centres and Libraries of:

  • European Parliament
  • DG OPOCE
  • Belgium: Senate, La Chambre
  • Portugal: Assembleia da República
  • Sweden: Riksdag
  • Spain: El Senado, Congreso de los Diputados
  • Switzerland: Assemblée Fédérale
  • Czech Republic: Chamber of Deputies, Euro Info Centre, European Documentation Centre, Info Centre of the EU, Supreme Audit Office, Parliamentary Library
  • Lithuanian Seimas
  • Polish Sejm
  • Slovenian Državni zbor
  • Romanian Camera Deputatilor
  • Russian Duma
  • Albanian Parliament
  • Croatia
  • Ukraine

SLIDE 5

Eurovoc Languages

Used by the EP and DG OPOCE for all 11 official EU languages

Also exists for: Albanian, Czech, Croatian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Russian, Slovak, Slovenian

Countries considering using Eurovoc: Armenia, Bosnia-Herzegovina, Bulgaria, Estonia, France, Georgia, Iceland, Macedonia, Turkey

Most multilingual thesaurus in existence? (currently 22 languages)


Automatic Indexing: Challenge

Descriptors are mostly abstract multi-word concepts, e.g.

  • PROTECTION OF MINORITIES
  • FISHERY MANAGEMENT
  • CONSTRUCTION AND TOWN PLANNING
  • SIMPLIFICATION OF FORMALITIES
  • PLUTONIUM
  • FRANCE

Searching for descriptors in the text (baseline) is not a solution: maximum recall ~ 30%, maximum precision ~ 7%

Keyword assignment as opposed to keyword extraction
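To make the baseline concrete, here is a minimal sketch of literal descriptor search, assuming a case-insensitive whole-phrase match. The function name and the example sentence are hypothetical and not taken from the JRC system.

```python
import re

def find_literal_descriptors(text, descriptors):
    """Return the descriptors whose label occurs verbatim in the text.

    Hypothetical baseline: case-insensitive match of the whole label.
    """
    lowered = text.lower()
    found = []
    for label in descriptors:
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.append(label)
    return found

descriptors = ["FISHERY MANAGEMENT", "PLUTONIUM", "PROTECTION OF MINORITIES"]
# Invented example: the text concerns fisheries, but never uses the label itself.
text = "The resolution limits cod quotas and mentions plutonium shipments."
baseline_hits = find_literal_descriptors(text, descriptors)
print(baseline_hits)
```

Only the concrete single-word descriptor is found here; the abstract multi-word concept is missed because its label never appears literally, which is the motivation for assignment rather than extraction.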

SLIDE 6

JRC's Statistical / Associative Approach

1. Training phase: identify many (statistically or semantically) related words (associates) for each descriptor.

2. Assignment phase: assign a descriptor if many of its associates are present in the text.

Example descriptor: FISHERY MANAGEMENT


Training: Text Normalisation

Linguistic pre-processing = normalisation of the text

  • Lemmatisation (base-form reduction of words) and lower-casing: Transporting → transport
  • Mark-up of multi-word expressions: 'plant' → 'green_plant' vs. 'power_plant'
  • Stop word lists to avoid words that are not content-bearing
      • general: are, they, having, in_spite_of, interesting
      • domain-specific: question, answer, commission, article
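The three normalisation steps can be sketched as follows. This is a toy illustration only: the lemma dictionary, the multi-word table and the stop word set are invented stand-ins for real linguistic resources, and `normalise` is a hypothetical helper name.

```python
import re

# Toy resources, invented for illustration; a real system would use
# a lemmatiser and curated word lists.
LEMMAS = {"transporting": "transport", "plants": "plant"}
MULTI_WORDS = {
    ("power", "plant"): "power_plant",
    ("green", "plant"): "green_plant",
    ("in", "spite", "of"): "in_spite_of",
}
STOP_WORDS = {"are", "they", "having", "in_spite_of", "interesting", "the", "to",
              "question", "answer", "commission", "article"}

def normalise(text):
    """Lower-case, lemmatise, mark up multi-word expressions, drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    marked = []
    i = 0
    while i < len(lemmas):
        # Try the longest multi-word expression first.
        for n in (3, 2):
            key = tuple(lemmas[i:i + n])
            if key in MULTI_WORDS:
                marked.append(MULTI_WORDS[key])
                i += n
                break
        else:
            marked.append(lemmas[i])
            i += 1
    return [tok for tok in marked if tok not in STOP_WORDS]

normalised = normalise("They are transporting fuel to the power plants in spite of the question")
print(normalised)
```

Lemmatising before the multi-word look-up lets "power plants" match the entry for "power plant"; whether the original system applies the steps in this order is an assumption.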

SLIDE 7

Training: Produce Associate Lists

  • Use a large collection of manually indexed documents (training corpus).
  • For each descriptor D1, take all documents indexed with D1:
      • identify the statistically salient words in each of these texts;
      • join these lists of statistically salient words, e.g. for RADIOACTIVE MATERIALS:

        radioactive ukraine resolution plutonium deuterium parliament nuclear blottnitz ...
      + plutonium deuterium assembly nuclear schmidt radioactive korea iaea ...
      + Illegal_traffic chernobyl radioactive ukrainian plutonium lithium dangerous mox ...
      = radioactive (3) plutonium (3) nuclear (2) deuterium (2) Illegal_traffic (1) chernobyl (1) ...

  • Normalise the weights according to a number of different criteria.
  • Result of training: weighted associate lists for all descriptors.
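The joining step for RADIOACTIVE MATERIALS can be reproduced with a simple counter. This is a sketch under assumptions: the three word lists are the truncated ones shown on the slide (lower-cased), the function name is hypothetical, and the subsequent weight normalisation is not shown because the slides do not specify it.

```python
from collections import Counter

# Salient words of three documents indexed with RADIOACTIVE MATERIALS,
# as far as they are visible on the slide (the lists are truncated there).
salient_words_per_document = [
    ["radioactive", "ukraine", "resolution", "plutonium", "deuterium",
     "parliament", "nuclear", "blottnitz"],
    ["plutonium", "deuterium", "assembly", "nuclear", "schmidt",
     "radioactive", "korea", "iaea"],
    ["illegal_traffic", "chernobyl", "radioactive", "ukrainian", "plutonium",
     "lithium", "dangerous", "mox"],
]

def join_salient_lists(word_lists):
    """Count in how many documents each salient word occurs."""
    counts = Counter()
    for words in word_lists:
        counts.update(set(words))  # each document contributes at most once per word
    return counts

associates = join_salient_lists(salient_words_per_document)
for word in ("radioactive", "plutonium", "nuclear", "deuterium", "chernobyl"):
    print(word, associates[word])
```

The raw counts match the joined list on the slide: radioactive (3), plutonium (3), nuclear (2), deuterium (2), chernobyl (1).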


Associate List: RADIOACTIVE MATERIALS

SLIDE 8

Associate List: FISHERY MANAGEMENT

(Figure: associate list for FISHERY MANAGEMENT, with fishery-related and management-related associates marked.)


Assignment Phase

  • Normalise the new document (lemmatise, multi-word mark-up)
  • Produce a lemma frequency list (excluding stop words)
  • Calculate the similarity between the lemma frequency list and the descriptor associate lists, using statistical formulae, e.g.

$$\mathrm{COSINE}(d,t) = \frac{\sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}}{\sqrt{\left(\sum_{l \in d} \mathrm{TFIDF}_{l,d}^{2}\right) \cdot \left(\sum_{l \in t} \mathrm{TFIDF}_{l,t}^{2}\right)}}$$

...
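A minimal sketch of the assignment phase follows, assuming the cosine measure over sparse weight vectors. For brevity it uses raw lemma frequencies for the document instead of TF.IDF weights, and the associate weights are invented; the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(doc_weights, assoc_weights):
    """Cosine of two sparse vectors given as {lemma: weight} dictionaries."""
    shared = set(doc_weights) & set(assoc_weights)
    dot = sum(doc_weights[l] * assoc_weights[l] for l in shared)
    norm = math.sqrt(sum(w * w for w in doc_weights.values())
                     * sum(w * w for w in assoc_weights.values()))
    return dot / norm if norm else 0.0

def rank_descriptors(lemmas, associate_lists):
    """Rank descriptors by similarity between the document and their associates."""
    doc = Counter(lemmas)  # simplification: raw frequencies, not TF.IDF
    scores = {d: cosine(doc, assoc) for d, assoc in associate_lists.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented associate weights for two descriptors.
associate_lists = {
    "RADIOACTIVE MATERIALS": {"radioactive": 3.0, "plutonium": 3.0, "nuclear": 2.0},
    "FISHERY MANAGEMENT": {"fishery": 3.0, "quota": 2.0, "management": 1.0},
}
lemmas = ["resolution", "radioactive", "waste", "nuclear", "radioactive"]
ranking = rank_descriptors(lemmas, associate_lists)
print(ranking[0][0])
```

The descriptor label itself never has to occur in the text: the document is matched through its associates, here "radioactive" and "nuclear".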

SLIDE 9

Formulae tested for descriptor assignment

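$$\mathrm{TFIDF}_{l,d} = \mathrm{TF}_{l,d} \cdot \left(\log_2\frac{N}{\mathrm{DF}_l} + 1\right)$$

$$\mathrm{COSINE}(d,t) = \frac{\sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}}{\sqrt{\left(\sum_{l \in d} \mathrm{TFIDF}_{l,d}^{2}\right) \cdot \left(\sum_{l \in t} \mathrm{TFIDF}_{l,t}^{2}\right)}}$$

$$\mathrm{Okapi}_{d,t} = \sum_{l \in d \cap t} \frac{\mathrm{TF}_{l,d}}{\mathrm{TF}_{l,d} + \frac{|d|}{M}} \cdot \log\frac{N - \mathrm{DF}_l}{\mathrm{DF}_l}$$

$$\Phi = 0.61\,\frac{\mathrm{COSINE}}{\max(\mathrm{COSINE})} + 0.21\,\frac{\mathrm{Okapi}}{\max(\mathrm{Okapi})} + 0.18\,\frac{\mathrm{Sproduct}}{\max(\mathrm{Sproduct})}$$

$$\mathrm{Sproduct}(d,t) = \sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}$$

  • TF.IDF (Term Frequency, Inverse Document Frequency) considers the occurrence frequency of lemma (l) in the meta-text ($\mathrm{TF}_{l,t}$) and the number of descriptors (d) for which the lemma is an associate ($\mathrm{DF}_l$).
  • Cosine uses TF.IDF; it computes the angle between two multi-dimensional vectors (that of the document (t) and that of the descriptor associate list).
  • Okapi considers the occurrence frequency of the lemma as an associate ($\mathrm{DF}_l$); the number of associates in the associate list (size, |d|); the average size of descriptor associate lists (M); and the total number of descriptors used (N).
  • The '622' mixed formula ($\Phi$) uses all of the above.
  • 'Scalar Product' adds the products of the TF.IDF values of associates and text lemmas.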
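The weighting and the mixed formula can be sketched in a few lines. The sketch assumes that TF.IDF is computed as TF · (log2(N / DF) + 1) and that the '622' score is a weighted sum of the three scores, each divided by its maximum over all descriptors, with weights 0.61, 0.21 and 0.18. Function names and example values are hypothetical.

```python
import math

def tfidf(tf, df, n_descriptors):
    """Assumed weighting: TF * (log2(N / DF) + 1)."""
    return tf * (math.log2(n_descriptors / df) + 1)

def mixed_score(cosine, okapi, sproduct):
    """Combine three {descriptor: score} dictionaries into one mixed score.

    Each score is divided by its maximum over all descriptors before
    the weights 0.61, 0.21 and 0.18 are applied.
    """
    def normalised(scores):
        top = max(scores.values())
        return {d: (s / top if top else 0.0) for d, s in scores.items()}

    c, o, s = normalised(cosine), normalised(okapi), normalised(sproduct)
    return {d: 0.61 * c[d] + 0.21 * o[d] + 0.18 * s[d] for d in cosine}

# Invented scores for two descriptors A and B.
phi = mixed_score(
    cosine={"A": 0.8, "B": 0.4},
    okapi={"A": 2.0, "B": 4.0},
    sproduct={"A": 10.0, "B": 5.0},
)
print(tfidf(3, 2, 8))
print(phi)
```

Dividing by the maximum puts the three measures on a common 0 to 1 scale, so that the weights express their relative influence regardless of each measure's raw range.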


Manual Evaluation of the Assignment

SLIDE 10

Manual Evaluation of Automatic Assignment

Correct descriptors compared to the benchmark of manual assignment:

  • English: 65 / 78 = 83%
  • Spanish: 69 / 87 = 80%


Languages Currently Covered

  • System is currently optimised for English and Spanish, partially for French
  • System is trained for another seven languages without pre-processing: De, It, Pt, Nl, Da, Sv, Fi

(Bar chart: results per language (En, Es, Da, De, Fi*, Fr, It, Nl, Pt, Sv*), without linguistic pre-processing vs. with pre-processing; scale from 10 to 60.)

SLIDE 11

Document Similarity Calculation

Spanish text: Resolución sobre los residuos radioactivos

English text: Resolution on radioactive waste

6621020304 52160104


Results for Similarity Calculation and Translation Spotting

Task: find Spanish translations of an English source document in a parallel text collection

1) Simple document similarity (DS)
2) DS considering the length of documents
3) Different text type
4) Mixed-language search space
5) DS correcting monolingual bias (83%)

SLIDE 12

Is there a Translation?

  • Setting a threshold; juggling precision and recall
  • Searching for a translation where there is none:
      • Searching in T2 for documents of T1: 4.15% noise
  • Best threshold depends on:
      • Document set
      • Requirement: high recall or high precision
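Translation spotting with a threshold can be sketched as follows, assuming each document is represented by a language-independent vector of Eurovoc descriptor codes with weights and compared by cosine similarity. The codes are those shown on the slides; the weights, document identifiers, function names and the threshold value are invented.

```python
import math

def cosine(a, b):
    """Cosine of two sparse {descriptor_code: weight} vectors."""
    dot = sum(w * b[code] for code, w in a.items() if code in b)
    norm = math.sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def spot_translation(source_vec, candidates, threshold):
    """Return (doc_id, similarity) of the best candidate, or (None, similarity)
    if even the best candidate stays below the threshold."""
    best_id, best_sim = None, 0.0
    for doc_id, vec in candidates.items():
        sim = cosine(source_vec, vec)
        if sim > best_sim:
            best_id, best_sim = doc_id, sim
    if best_sim >= threshold:
        return best_id, best_sim
    return None, best_sim

english_doc = {"6621020304": 0.9, "52160104": 0.7}
spanish_docs = {
    "es-1": {"6621020304": 0.8, "52160104": 0.6},
    "es-2": {"2831": 0.9, "52160104": 0.1},
}
match, similarity = spot_translation(english_doc, spanish_docs, threshold=0.5)
print(match)
```

Raising the threshold rejects more false matches when no translation exists (higher precision) at the cost of missing some genuine translations (lower recall), so its value has to be tuned per document set and requirement.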


Application Areas

  • Translation spotting, e.g. to produce a parallel corpus
  • Finding similar documents to a given text, independent of language
  • Identification of cross-lingual document plagiarism
  • Cross-lingual classification and clustering
  • Multilingual document maps

Map produced with ThemeScape, by CARTIA Inc.