Collecting Aligned Textual Corpora from the Hidden Web

SLIDE 1

ailab.ijs.si

Boštjan Pajntar

bostjan.pajntar@ijs.si

Collecting Aligned Textual Corpora from the Hidden Web

SLIDE 2

Aligned Parallel Corpus

Definition (wikipedia):

“A parallel text is a text placed alongside its translation or translations”

Usage:

- Translation Memory
- Machine Translation
- Natural Language Processing

Standards:

- TMX – Translation Memory eXchange
- TBX – TermBase eXchange
- UTX – Universal Terminology eXchange
- (SRX, GMX-GILT, OLIF, XLIFF, TransWS, ...)

SLIDE 3

But Where to Get the Data?

- Non-English professional websites
- Huge amounts of translated text
- Generally good-quality translations

We call this the Hidden Web.

SLIDE 4

Problems

Translation Memory is hard / expensive to obtain

Idea: Automatic harnessing of existing data

Data should have very high precision

What precision is needed?

No standard fully supports automatic:

- Harnessing of the data
- Cleaning of the data

SLIDE 5

Proposed Solution

[Pipeline diagram. Processing stages: Crawling, Candidates Extraction, Filtering, Parsing, Parallel Corpora Extraction. Data shown: WEB, Relational Database, List of HTML candidates, List of text candidates, Parallel Corpora.]

Available at: http://kameleon.ijs.si/t4me
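
The slides do not say how candidate pages are found. One common heuristic, shown here only as a sketch under that assumption, is to pair URLs that are identical except for a language code in the path. The language codes (`en`, `sl`), the function name `pair_candidates`, and the example URLs are all hypothetical and not taken from the presentation.

```python
import re
from collections import defaultdict

# Hypothetical language markers; a real crawler would configure these per site.
LANG_PATTERN = re.compile(r"(?<=/)(en|sl)(?=/|$)")


def pair_candidates(urls):
    """Pair URLs that differ only in their language path segment."""
    groups = defaultdict(dict)
    for url in urls:
        match = LANG_PATTERN.search(url)
        if match is None:
            continue  # no language marker, so not a candidate
        # Swap the language code for a placeholder to get a language-neutral key.
        key = url[:match.start()] + "{lang}" + url[match.end():]
        groups[key][match.group(1)] = url
    # Keep only pages that were seen in both languages.
    return [(g["en"], g["sl"]) for g in groups.values() if "en" in g and "sl" in g]


if __name__ == "__main__":
    crawled = [
        "http://example.org/en/about",
        "http://example.org/sl/about",
        "http://example.org/en/news",
    ]
    for en_url, sl_url in pair_candidates(crawled):
        print(en_url, "<->", sl_url)
```

A URL heuristic like this only proposes candidates; the later filtering and parsing stages would still have to confirm that the two pages are translations of each other.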

SLIDE 6

Discussion on Standards

We build on TMX:

- Is this the right choice?
- Source language must be defined!
- An optional parameter to define the source of each segment
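
To make the two TMX points concrete, here is a minimal sketch of how a harvested pair could be written out. TMX 1.4 requires a `srclang` attribute on the `<header>`, and a `<prop>` element inside a `<tu>` can carry custom metadata. The property type `x-source-url`, the tool name, and the function `build_tmx` are assumptions made for illustration; the slides do not specify how the segment source is recorded.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, srclang, tgtlang):
    """Build a small TMX 1.4 document from (source, target, source_url) triples."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-harvester",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": srclang,  # the source language must be declared here
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text, src_url in pairs:
        tu = ET.SubElement(body, "tu")
        # Custom property recording where the segment was harvested (assumed name).
        prop = ET.SubElement(tu, "prop", type="x-source-url")
        prop.text = src_url
        for lang, text in ((srclang, src_text), (tgtlang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Hello", "Pozdravljeni", "http://example.org/en/")], "en", "sl"))
```

Keeping the origin in a `<prop>` leaves the file valid for existing TMX tools, which is one argument for building on an existing standard rather than defining a new one.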

Proposals for automatic harnessing of TM:

- Provide a new standard
- Build on an existing one
- Ideas?

SLIDE 7

Future Work

Optimizing Crawling:

- Two-phase crawling
- Character encodings
- Enhanced candidates extraction

Optimizing Extraction:

- Segmentation
- Language identification
- Enhanced filtering
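
The slides do not describe the filter itself. As one illustration of what a filtering step could look like, the sketch below drops segment pairs whose character lengths differ too much, a simple and widely used sanity check for aligned text. The threshold `max_ratio=2.0` and the function names are assumptions, not values from the presentation.

```python
def length_ratio_ok(src, tgt, max_ratio=2.0):
    """Return True when neither segment is more than max_ratio times longer."""
    a, b = len(src.strip()), len(tgt.strip())
    if a == 0 or b == 0:
        return False  # an empty side cannot be a translation
    return max(a, b) / min(a, b) <= max_ratio


def filter_pairs(pairs, max_ratio=2.0):
    """Keep only the (source, target) pairs that pass the length-ratio check."""
    return [(s, t) for s, t in pairs if length_ratio_ok(s, t, max_ratio)]


if __name__ == "__main__":
    candidates = [
        ("Good morning", "Dobro jutro"),
        ("Contact", "Za več informacij nas pokličite ali pišite"),
        ("", "x"),
    ]
    for src, tgt in filter_pairs(candidates):
        print(src, "|", tgt)
```

A stricter threshold raises precision at the cost of discarding more valid pairs, which ties back to the open question of what precision is actually needed.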

Web service / Web application

- Translation Memory distribution
- Filtering (Web 2.0 style)