What Data Is Needed? Why? Dr. Khalid Choukri (Evaluations and Language Resource Association)

SLIDE 1

ELRC Workshop in Deutschland, 29.09.2015

“What Data Is Needed? Why?”

  • Dr. Khalid Choukri

(Evaluations and Language Resource Association)

SLIDE 2

What types of data? Translation

  • In the previous session, we saw that the data-driven paradigm is the predominant approach
  • We “learn” from existing data
  • How Language Resources are produced:

– from documents and data to valuable Language Resources for MT

  • Why it is important that you help us with the data you have or know about
  • The focus is on data in all languages (EU/CEF).

SLIDE 3

What Data

SLIDE 4

Translations & Automated Translations

SLIDE 5

Translations

SLIDE 6

What types of data? Translation

SLIDE 7

What types of data? “Aligned” Translation

[Figure columns: English, French]

SLIDE 8

What types of data? “Aligned” Translation

SLIDE 9

Comparable Collections

Source: first sentences of the articles on telecommunications in the English, Greek and Spanish Wikipedias. The German page is slightly different, but these are (never) translations of one source!

English

Telecommunication occurs when the exchange of information between two or more entities (communication) includes the use of technology. Communication technology uses channels to transmit information (as electrical signals), either over a physical medium (such as signal cables), or in the form of electromagnetic waves. The word is often used in its plural form, telecommunications, because it involves many different technologies.

Greek

Με τον γενικό όρο τηλεπικοινωνίες, (telecommunications), χαρακτηρίζεται η κάθε μορφής ενσύρματη ή ασύρματη, ηλεκτρομαγνητική, ηλεκτρική, κ.λπ., ακουστική και οπτική επικοινωνία που πραγματοποιείται ανεξαρτήτως απόστασης. Στους σύγχρονους καιρούς, αυτή η διαδικασία σχεδόν πάντα περιλαμβάνει την αποστολή ηλεκτρομαγνητικών κυμάτων ή ηλεκτρικών σημάτων από κατάλληλες ηλεκτρονικές συσκευές, όπως το τηλέφωνο ή ο ασύρματος, αλλά παλαιότερα περιελάμβανε τη χρήση ακουστικών σημάτων, όπως τυμπάνων, ή οπτικών, όπως ο σηματοφόρος καπνός ή η λάμψη της φωτιάς.

Spanish

Una telecomunicación es toda transmisión y recepción de señales de cualquier naturaleza, típicamente electromagnéticas, que contengan signos, sonidos, imágenes o, en definitiva, cualquier tipo de información que se desee comunicar a cierta distancia. Por metonimia, también se denomina telecomunicación (o telecomunicaciones, indistintamente) a la disciplina que estudia, diseña, desarrolla y explota aquellos sistemas que permiten dichas comunicaciones; de forma análoga, la ingeniería de telecomunicaciones resuelve los problemas técnicos asociados a esta disciplina.

SLIDE 10

Dictionaries / Terminologies / Ontologies

ID | FR | ES | EL
6905 | abandon scolaire | abandono escolar | διακοπή της σχολικής φοίτησης
920 | abats | despojo | παραπροϊόντα σφαγίων
1857 | abattage d'animaux | sacrificio de animales | σφαγή ζώων
6621 | abrogation | derogación | κατάργηση
5075 | Abruzzes | Abruzos | Αβρουζία
5339 | absentéisme | absentismo | συστηματική απουσία από την εργασία
5984 | abstentionnisme | abstencionismo | αποχή
2 | abus de confiance | abuso de confianza | απιστία
96 | abus de droit | abuso de derecho | κατάχρηση δικαιώματος
186 | abus de pouvoir | abuso de poder | κατάχρηση εξουσίας
280 | accès à l'éducation | acceso a la educación | πρόσβαση στην εκπαίδευση
372 | accès à l'emploi | acceso al empleo | πρόσβαση στην αγορά εργασίας

SLIDE 11

Where can we find such data? Digital World

  • Archives
  • Internet

SLIDE 12

Digital world … Internet

SLIDE 13

Internet Era & Digital Data

SLIDE 14

Of course, we need digital textual data!

SLIDE 15

Various Formats

SLIDE 16

Documented Data (Meta-data)

Dublin Core Metadata Element Set

  • 1. Title
  • 2. Creator
  • 3. Subject
  • 4. Description
  • 5. Publisher
  • 6. Contributor
  • 7. Date
  • 8. Type
  • 9. Format
  • 10. Identifier
  • 11. Source
  • 12. Language
  • 13. Relation
  • 14. Coverage
  • 15. Rights
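
As an illustration of what "documenting" a resource could look like, here is a minimal Python sketch that serialises a few of the fifteen Dublin Core elements as XML. The `metadata` wrapper element, the function name `build_dc_record` and the sample values are assumptions made for this example; they do not describe any particular repository's schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

def build_dc_record(fields):
    """Return an XML string with one dc:* element per supplied field."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name in DC_ELEMENTS:       # keep the order used on the slide
        if name in fields:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample values, for illustration only
record = build_dc_record({
    "title": "Hypothetical FR-ES terminology list",
    "language": "fr, es",
    "format": "TMX",
    "rights": "to be cleared with the data owner",
})
print(record)
```

Even a handful of fields (title, language, format, rights) already lets a resource be found and reused; the remaining elements can be filled in as the information becomes available.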

SLIDE 17

How LRs are produced

  • Let us see some examples of raw data (HTML with tables, pictures, etc.) and how they become LRs

– Discover and identify sources
– Clear IPR and get the data (download, harvest, crawl, …)
– Clean the data (e.g. detect and remove the “boilerplate”, “templates”, pictures, HTML tags, etc.; convert the format)
– Example of tools (Boilerpipe)
– Document the data
– Align the translations when identified and break them into “sentences”
– Compute some alignment confidence
– Share
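
The "clean the data" step can be sketched with the Python standard library alone. The example below strips HTML tags, ignores script and style content, and drops very short text blocks as a crude stand-in for boilerplate removal. It is not Boilerpipe and does not reproduce its algorithm; the class name, the `min_words` threshold and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text blocks, skipping <script> and <style> content."""
    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "td", "tr", "table", "li", "h1", "h2", "h3", "br"}

    def __init__(self):
        super().__init__()
        self.blocks, self._buf, self._skip_depth = [], [], 0

    def flush(self):
        # Normalise whitespace and store the block if anything is left
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

def clean_html(html, min_words=5):
    """Return text blocks with at least min_words whitespace-separated tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [b for b in parser.blocks if len(b.split()) >= min_words]

sample = """<html><head><style>p {color: red}</style></head><body>
<div>Home | Contact</div>
<p>Telecommunication occurs when the exchange of information includes the use of technology.</p>
<script>track();</script>
</body></html>"""
print(clean_html(sample))
```

A length threshold is only a rough heuristic: it would also discard short but legitimate content such as headings, which is one reason dedicated tools are used in practice.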

SLIDE 18

A Language Resource Factory

  • How can this process be turned into a factory of LR production (automation of the procedure)?
  • Some simple illustrations
  • We would rather start from the digital world

– OCR may be considered for the less-resourced languages

SLIDE 19

Many web sites…

SLIDE 20

… are rich in multilingual content

SLIDE 21

How can we obtain this content…

SLIDE 22

SLIDE 23

… and convert it to valuable Language Resources for Machine Translation?

SLIDE 24

From a web page to the Factory

  • How does this process scale up?
  • Identify a “useful” source (a good candidate for multilingual data)
  • Review and visit all the links (the URLs referenced in each page)
  • “Click on each link” and move forward
  • Get each page and its “potentially” associated one in the other language
  • Identify the “domains”, “genre”, etc. if possible
  • Get rid of the “noise” (ads, format, boilerplate, etc.)
  • Align (documents/files, chapters, paragraphs, sentences)
  • Check the accuracy of the alignment
  • Use … and share
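
The "visit all the links and move forward" part of this list amounts to a breadth-first traversal of a site. Below is a minimal sketch, assuming a `fetch` callable that returns the HTML of a URL (or `None`). To keep the example self-contained and runnable offline, `fetch` is backed by a small in-memory dictionary; the `example.org` pages are invented for the illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first visit of same-host pages reachable from the seed URL."""
    host = urlparse(seed).netloc
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages

# Toy in-memory "website" (hypothetical content)
SITE = {
    "http://example.org/": '<a href="/en/">EN</a> <a href="/news/">News</a>',
    "http://example.org/en/": '<a href="/en/news/">News</a> <a href="http://other.example/">x</a>',
    "http://example.org/news/": "<p>Nachrichten</p>",
    "http://example.org/en/news/": "<p>News</p>",
}
pages = crawl("http://example.org/", SITE.get)
print(sorted(pages))
```

A real crawler would also need to respect robots.txt, rate limits and the IPR clearance mentioned earlier; those aspects are left out of this sketch.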

SLIDE 25

A Journey in the meandering lines of Internet

SLIDE 26

(automatically) Follow all referenced links

SLIDE 27

referenced links … Automated process

  • http://portal.elda.org/
  • http://portal.elda.org/en/
  • http://portal.elda.org/news/rss/
  • http://portal.elda.org/login/
  • http://portal.elda.org/en/login/
  • http://portal.elda.org/reset/
  • http://portal.elda.org/about/elra/contact/
  • http://portal.elda.org/en/about/elra/contact/
  • http://portal.elda.org/tag/85/
  • http://portal.elda.org/en/tag/85/
  • http://portal.elda.org/tag/86/
  • http://portal.elda.org/en/tag/86/

SLIDE 28

ILSP Focused Crawler

  • Research prototype for acquiring general or domain-specific, monolingual and bilingual corpora
  • Input:
  • Domain definitions (lists of terms)
  • Seed URLs
  • Modules (open source libraries/toolkits):

– Page Fetching/Text Extraction
– Normalization and Metadata Extraction
– Boilerplate Detection (Boilerpipe)
– Language Detection (covering > 50 languages)
– Text Classification
– Exact and near de-duplication
– Detection of pairs of parallel documents
– Sentence alignment (Hunalign and others)

  • Generates lists of

– document pairs and
– segment pairs in TMX files
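
To make the last output concrete, the following sketch writes segment pairs into a small TMX 1.4 document with the Python standard library. It is not the crawler's own exporter: the function name, the header values (for instance the tool name `lr-factory-sketch`) and the choice of sample pairs are assumptions. The two sample pairs are taken from the FR/ES terminology table shown earlier in the deck.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def segments_to_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX string with one translation unit per (source, target) pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "lr-factory-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

tmx_text = segments_to_tmx(
    [
        ("accès à l'éducation", "acceso a la educación"),
        ("accès à l'emploi", "acceso al empleo"),
    ],
    "fr",
    "es",
)
print(tmx_text)
```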

SLIDE 29

… it integrates technologies to crawl a website (a part of it or multiple pages) …

SLIDE 30

… identify the language of each crawled page …

SLIDE 31

… identify the language of each crawled page
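
As a toy illustration of language identification, the sketch below counts hits against tiny stopword lists. This is an assumed, deliberately simple technique and not the method used by the crawler, whose detector covers more than 50 languages; the word lists and the function name are invented for the example.

```python
import re

# Very small stopword lists, for illustration only
STOPWORDS = {
    "en": {"the", "of", "and", "or", "in", "is", "when", "between"},
    "es": {"el", "la", "de", "que", "y", "es", "una", "se"},
    "fr": {"le", "la", "les", "des", "et", "est", "une", "du"},
    "de": {"der", "die", "das", "und", "ist", "von", "eine", "nicht"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("Telecommunication occurs when the exchange of information "
                     "between two or more entities includes the use of technology."))
print(guess_language("Una telecomunicación es toda transmisión y recepción "
                     "de señales de cualquier naturaleza."))
```

Text in a language without a list here (for example the Greek passage shown earlier) yields `None`, which shows why real detectors rely on much richer statistical models.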

SLIDE 32

… extract several types of data descriptors (metadata)

SLIDE 33

… and optionally classify each page as relevant or not to a user-defined domain

SLIDE 34

It can detect boilerplate text …

SLIDE 35

… HTML structure and/or URL similarity to detect document pairs

SLIDE 36

… HTML structure and/or URL similarity to detect document pairs
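
URL similarity can be illustrated with the portal.elda.org links listed on the "referenced links" slide, where many addresses differ only by an `/en/` path segment. The sketch below groups URLs that become identical once a leading language code is removed. The function names, the `"default"` label for the version without a language code, and the assumption that the code is always the first path segment are choices made for this example only.

```python
from urllib.parse import urlparse

def language_neutral_key(url, lang_codes=("en",)):
    """Return ((host, path without leading language code), language or None)."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    lang = None
    if parts and parts[0] in lang_codes:
        lang = parts.pop(0)
    return (parsed.netloc, "/".join(parts)), lang

def pair_by_url(urls, default_lang="default", lang_codes=("en",)):
    """Group URLs by language-neutral key; keep groups with several versions."""
    groups = {}
    for url in urls:
        key, lang = language_neutral_key(url, lang_codes)
        groups.setdefault(key, {})[lang or default_lang] = url
    return [g for g in groups.values() if len(g) > 1]

urls = [
    "http://portal.elda.org/", "http://portal.elda.org/en/",
    "http://portal.elda.org/news/rss/",
    "http://portal.elda.org/login/", "http://portal.elda.org/en/login/",
    "http://portal.elda.org/reset/",
    "http://portal.elda.org/about/elra/contact/",
    "http://portal.elda.org/en/about/elra/contact/",
    "http://portal.elda.org/tag/85/", "http://portal.elda.org/en/tag/85/",
    "http://portal.elda.org/tag/86/", "http://portal.elda.org/en/tag/86/",
]
pairs = pair_by_url(urls)
for pair in pairs:
    print(pair)
```

Pages such as `/news/rss/` and `/reset/` have no counterpart in the list and are left unpaired, which is the kind of case where URL patterns alone fall short and the page structure has to be compared instead.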

SLIDE 37

Sometimes URLs are not enough for finding document pairs…

SLIDE 38

… segment content into paragraphs …

SLIDE 39

… and sentences, and align them
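
Sentence segmentation and alignment can be sketched as follows. The splitter is naive, and the aligner is a simplified length-based dynamic programme loosely in the spirit of classic length-based methods. It is not Hunalign, and the bead penalties are arbitrary values chosen for the illustration.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Allowed "beads": (source sentences used, target sentences used) -> penalty.
# The penalty values are arbitrary assumptions.
BEADS = {(1, 1): 0, (1, 0): 30, (0, 1): 30, (2, 1): 10, (1, 2): 10}

def align(src, tgt):
    """Align two sentence lists by character length; return (pairs, cost)."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(s_len - t_len) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the chosen beads
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((" ".join(src[pi:i]), " ".join(tgt[pj:j])))
        i, j = pi, pj
    return pairs[::-1], cost[n][m]

src = split_sentences("Hello. How are you today? Fine.")
tgt = split_sentences("Bonjour. Comment allez-vous aujourd'hui? Bien.")
pairs, total_cost = align(src, tgt)
for s, t in pairs:
    print(s, "<->", t)
```

The accumulated cost is one possible basis for an alignment confidence: the lower the cost per sentence, the more plausible the pairing. How the scores shown on the slides are computed is not stated in the deck.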

SLIDE 40

Our tool supports all EU languages!

EN-GA

SLIDE 41

It supports all EU languages!

EN-FR

Score: 5.038181

SLIDE 42

It supports all EU languages!

EN-LV

SLIDE 43

A better alternative

  • This process can be turned into a factory of LR production (automation of the procedure)

– Identify sources of data
– Browse through the page links

  • BUT what we can get is only the “visible” part; there is much more data in your organizations

SLIDE 44

Visible data versus existing ones

SLIDE 45

Our contributions … Deep web

SLIDE 46

A better alternative

  • Such documents exist already:

– At the various documentation centers (translated reports, leaflets, brochures, speeches, web pages, etc.)
– At the Language Service Providers (LSPs), to whom translation work is subcontracted

  • Help us identify and liaise with both sources

– (see next Panel interactions)

SLIDE 47

Your involvement is essential so please let us work together

BRING YOUR OWN Language Resources

SLIDE 48

Your involvement is essential so please let us work together

SLIDE 49

SLIDE 50

  • How to help upload the data
  • See the information on the REPOSITORY set-up for this
  • How much data is needed?

SLIDE 51

A commonly used metric

Number of pages of text / million words

SLIDE 52

CONCLUSIONS

  • How data is produced: repurposing and repackaging existing data
  • Why it is important: the data-driven paradigm is very efficient

– results improve as more data become available

  • Let us not underestimate the value of our resources
  • How you can contribute to and benefit from CEF.AT

– (next sessions)