Collecting Aligned Textual Corpora from the Hidden Web (Boštjan Pajntar)

Aug 04, 2023

SLIDE 1

ailab.ijs.si

Boštjan Pajntar

bostjan.pajntar@ijs.si

Collecting Aligned Textual Corpora from the Hidden Web

SLIDE 2

Aligned Parallel Corpus

Definition (wikipedia):

“A parallel text is a text placed alongside its translation or translations”

Usage:

Translation Memory
Machine Translation
Natural Language Processing

Standards:

TMX – Translation Memory eXchange
TBX – TermBase eXchange
UTX – Universal Terminology eXchange
(SRX, GMX-GILT, OLIF, XLIFF, TransWS, ...)

SLIDE 3

But Where to Get the Data?

Non-English professional websites
Huge amounts of translated text
Generally quality translations
We call this the Hidden Web

SLIDE 4

Problems

Translation Memory is hard / expensive to obtain

Idea: Automatic harnessing of existing data

Data should have very high precision

What precision is needed?

No standard fully supports automatic:

Harnessing of the data
Cleaning of the data

SLIDE 5

Proposed Solution

[Pipeline diagram]

Stages: Crawling, Candidates Extraction, Filtering, Parsing, Parallel Corpora Extraction

Data: WEB, List of HTML candidates, List of text candidates, Relational Database, Parallel Corpora

Available at: http://kameleon.ijs.si/t4me
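The slide does not say how the candidates extraction stage decides which pages belong together. As an illustration only, the sketch below uses one common heuristic: pages whose URLs differ only in a language path segment are treated as translation candidates. The language codes, the function names and the example.com URLs are assumptions made for this example, not part of the presented tool.

```python
from collections import defaultdict

# Hypothetical set of language codes to look for in URL paths.
LANG_CODES = {"en", "sl", "de"}

def language_key(url):
    """Return (URL with the language segment replaced by a placeholder, language),
    or None when no path segment is a known language code."""
    parts = url.split("/")
    for i, part in enumerate(parts):
        if part in LANG_CODES:
            key = "/".join(parts[:i] + ["{lang}"] + parts[i + 1:])
            return key, part
    return None

def pair_candidates(urls):
    """Group URLs that differ only in their language path segment."""
    groups = defaultdict(dict)
    for url in urls:
        result = language_key(url)
        if result is not None:
            key, lang = result
            groups[key][lang] = url
    # Keep only pages that exist in at least two languages.
    return {key: langs for key, langs in groups.items() if len(langs) >= 2}

urls = [
    "http://example.com/en/about.html",
    "http://example.com/sl/about.html",
    "http://example.com/en/news.html",
]
candidates = pair_candidates(urls)
```

Here `candidates` holds a single group for `about.html`, while `news.html` is dropped because no second language version was seen. URL patterns alone are a weak signal, so a real pipeline would still need the later filtering stage.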

SLIDE 6

Discussion on Standards

We build on TMX:

Is this the right choice?
Source language must be defined!
An optional parameter to define the source of each segment
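To make the two TMX points concrete, here is a minimal sketch that writes a TMX 1.4 document with Python's standard `xml.etree.ElementTree`. The header carries the mandatory `srclang` attribute, and each `<tuv>` gets a `<prop>` recording where the segment was found. Reading "source of each segment" as the source URL is an interpretation, and the property type `x-source-url`, the tool name and the sample sentences are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(units, srclang):
    """units: list of dicts mapping language code -> (segment text, source URL)."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-harvester",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": srclang,                   # required by TMX
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for unit in units:
        tu = ET.SubElement(body, "tu")
        for lang, (text, url) in unit.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            # User-defined property (assumed name) placed before <seg>.
            prop = ET.SubElement(tuv, "prop", type="x-source-url")
            prop.text = url
            seg = ET.SubElement(tuv, "seg")
            seg.text = text
    return tmx

units = [{
    "en": ("Welcome to our website.", "http://example.com/en/index.html"),
    "sl": ("Dobrodošli na naši spletni strani.", "http://example.com/sl/index.html"),
}]
tmx_text = ET.tostring(build_tmx(units, srclang="en"), encoding="unicode")
```

For harvested web pages it is often unknown which language version is the original. TMX allows the special value `*all*` for `srclang` in that case, which is one reason the "source language must be defined" requirement deserves discussion.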

Proposals for automatic harnessing of TM:

Provide a new standard
Build on an existing one
Ideas?

SLIDE 7

Future Work

Optimizing Crawling:

Two-phase crawling
Character encodings
Enhanced candidates extraction
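Character encodings matter because crawled pages in languages such as Slovenian are not always served as UTF-8. A minimal sketch of a decoding fallback chain follows. The choice of `cp1250` as the second guess and the function name are assumptions for this example.

```python
def decode_page(raw, declared=None):
    """Decode raw bytes, trying the declared charset first, then fallbacks.

    Returns (text, encoding actually used).
    """
    candidates = [declared] if declared else []
    candidates += ["utf-8", "cp1250"]  # assumed fallback order
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or an unknown charset name: try the next one.
            continue
    # latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1"), "latin-1"

text, used = decode_page("Boštjan".encode("cp1250"))
```

This only catches encodings that fail loudly. A wrongly declared charset that happens to decode without error still produces garbled text, so a check on the decoded result would be needed on top of this.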

Optimizing Extraction:

Segmentation
Language identification
Enhanced filtering
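As a rough illustration of segmentation and filtering, the sketch below splits text into sentences with a naive punctuation rule and rejects segment pairs whose lengths differ too much. Both heuristics and the `max_ratio` threshold are assumptions chosen for this example. The slides do not describe the methods actually planned.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def plausible_pair(src, tgt, max_ratio=2.0):
    """Reject pairs whose character lengths differ by more than max_ratio."""
    if not src or not tgt:
        return False
    longer = max(len(src), len(tgt))
    shorter = min(len(src), len(tgt))
    return longer / shorter <= max_ratio

sentences = split_sentences("Hello world. How are you? Fine.")
keep = plausible_pair("Good morning.", "Dobro jutro.")
drop = plausible_pair("Yes.", "To je zelo dolg stavek brez ustreznice.")
```

A length-ratio check is cheap and favours precision over recall, which fits the requirement that the harvested data be very precise, but it cannot detect pairs of similar length that are not translations of each other.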

Web service / Web application

Translation Memory distribution
Filtering (Web 2.0 style)