SLIDE 1

PAN@FIRE 2013: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track

Parth Gupta¹, Paul Clough², Paolo Rosso¹, Mark Stevenson², and Rafael E. Banchs³

¹ Technical University of Valencia (UPV), Spain
² University of Sheffield, UK
³ Institute for Infocomm Research (I2R), Singapore

http://www.dsic.upv.es/grupos/nle/clinss.html

December 4, 2013

Parth Gupta (UPV, Spain) Overview of CL!NSS Track December 4, 2013 1 / 23

SLIDE 2

Outline

1 Motivation
2 Task Description
3 Corpus
4 Evaluation
5 Participation Overview
6 References

SLIDE 3

Motivation

Cross-language NLP and IR heavily rely on parallel and comparable data
Parallel data is precious but scarce
Most of the available data is quasi-comparable (not topically aligned)
The technologies to extract parallel or comparable fragments from quasi-comparable data will be very useful in such scenarios

Current Scene

Not all languages have parallel data, and what is available is too small to rely on
The comparable corpus (Wikipedia) is not reliable in many languages
In fact, many languages do not have enough data

SLIDE 4

Two Questions:

1 What can be considered a constant source of text across languages?
2 ... that can contain parallel or comparable fragments?

Answer

Wikipedia articles: often, people create pages by translating English pages!
News stories: journalistic text re-use!

Which languages to work on?

Resource Poor Languages

SLIDE 5

Background - Web and Languages

Language Web Representation (a)

Rank  Language    Percentage
1     English     54.9%
2     Russian     6.1%
3     German      5.3%
4     Spanish     4.8%
5     Chinese     4.4%
6     French      4.3%
7     Japanese    4.2%
8     Arabic      3.0%
9     Portuguese  2.3%
10    Polish      1.8%
...
36    Latvian     0.1%
37    Estonian    0.1%

(a) Wikipedia page: “Languages used on the Internet”

SLIDE 6

Background - Web and Languages

Language Population (a)

Rank  Language    Speakers (millions)  % of world
1     Mandarin    955                  14.1
2     Spanish     407                  5.85
3     English     359                  5.52
4     Hindi       311                  4.46
5     Arabic      293                  4.23
6     Portuguese  216                  3.08
7     Bengali     206                  3.05
8     Russian     154                  2.42
9     Japanese    126                  1.92
10    Punjabi     102                  1.44

(a) The estimates used for this list are those of Nationalencyklopedin and are based on estimates published in 2010 (source: Wikipedia).

SLIDE 7

Motivation (contd.)

How do such algorithms perform? [Platt et al., 2010]

SLIDE 8

Wikipedias and News data

Wikipedia  Size
English    4,392,107
Spanish    1,061,460
German     1,658,515
...
Hindi      109,046
Tamil      57,828

Year  NT (1) Size  TOI (2) Size
2011  117,411      243,773
2012  128,610      254,036

(1) Navbharat Times: Hindi Daily
(2) Times of India: English Daily

SLIDE 9

Task Description

Observation

News stories covering the same event published in different languages may be rich sources of parallel and comparable text. Some fragments in these stories are parallel, for example, personal quotes and translated versions of the same content.

Definitions [Barker and Gaizauskas, 2012]

Focal Event: the main event or events which provide a focus for the news story

◮ e.g. Romney vs. Obama in Ohio: With superior ground operations, the president widens his lead

Background Event: an event that plays a supporting role in the text, providing context for the focal events

◮ e.g. Probably the last encounter between the two

News Event: a group of related events, broader than and including the focal event, which may be reported over time in different news text installments

◮ e.g. Presidential election polls

SLIDE 10

Task Description

Statement

For each t ∈ T, find s ∈ S covering the same focal event and news event

Source Collection: S = L1, L2, · · ·, Ln
Target Collection: T = English Articles

Link each story t in T to the stories s in S that share the same news event or focal event, for each language L
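The task statement above can be read as a ranking problem: for every target story, order the source stories by how likely they are to cover the same event. The following is only a minimal sketch under assumptions, not the track's or any participant's system. The names `rank_sources` and `token_overlap` are hypothetical, and the toy token-overlap score only works when both texts are in the same language; a real cross-language run would need translation or some cross-language representation in its place.

```python
from typing import Callable, Dict, List, Tuple


def rank_sources(
    targets: Dict[str, str],
    sources: Dict[str, str],
    similarity: Callable[[str, str], float],
    top_k: int = 10,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each target story t in T, rank source stories s in S by similarity(t, s)."""
    ranked: Dict[str, List[Tuple[str, float]]] = {}
    for t_id, t_text in targets.items():
        scored = [(s_id, similarity(t_text, s_text)) for s_id, s_text in sources.items()]
        # Highest score first; ties are broken by document id for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        ranked[t_id] = scored[:top_k]
    return ranked


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard overlap of lowercased whitespace tokens (hypothetical placeholder)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

Passing the similarity function as a parameter keeps the ranking loop independent of how cross-language similarity is actually estimated.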

SLIDE 11

Flow Diagram

[Flow diagram: Pair(A,B) branches into "Same News Event" and "Different News Event"; "Same News Event" branches further into "Same News Event, Same Focal Event" and "Same News Event, Different Focal Event". Year 2012/13 Task: Story Detection]

SLIDE 12

Target (english-document-00006.txt)
  Title: There’s lot more to talk than my 50th Test ton: Tendulkar

Source1 (hindi-document-24799.txt)
  Title: [Hindi title lost in extraction] "There are many things except my 50th century: Tendulkar"
  Relevance Level: 2 (same focal event)

Source2 (hindi-document-08018.txt)
  Title: [Hindi title lost in extraction] "Sachin makes fifty in century"
  Relevance Level: 1 (same news event)

Table: Example English-Hindi text pairs describing the same news event but different focal events

SLIDE 13

Corpus Statistics

Table: CL!NSS 2012 corpus statistics. The statistics are shown for the source partition D_hi (Hindi) and a target collection D_en. The column headers stand for: |D| number of documents in the corpus (partition), |D_tokens| total number of tokens, |D_voc| total size of vocabulary (unique terms). k = thousand, M = million.

Partition  |D|    |D_tokens|  |D_voc|
D_en       25     9.3k        2.5k
D_hi       50691  15.6M       143k

Metadata

◮ Title of the news story
◮ Date of publication
◮ Content of the story

SLIDE 14

Evaluation Framework

Relevance

The relevance level of each source news story for a given test query is one of 2, 1, or 0, where

◮ 2 = “same news event + same focal event”
◮ 1 = “same news event + different focal event”
◮ 0 = “different news event”

Measures

NDCG@k, k = 1, 5, 10
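To make the measure concrete, here is a small sketch of NDCG@k over the graded relevance levels 2, 1, 0. The slides do not say which DCG variant the track used, so the exponential gain (2^rel − 1) with a log2(rank + 1) discount is an assumption here, as are the function names `dcg_at_k` and `ndcg_at_k` and the choice to treat unjudged documents as relevance 0.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[int], k: int) -> float:
    """Discounted cumulative gain over the first k relevance levels (assumed variant)."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, int], k: int) -> float:
    """NDCG@k for one query; qrels maps judged document ids to relevance 2, 1 or 0."""
    # Unjudged documents are treated as non-relevant (an assumption).
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    # The ideal ranking lists all judged documents by decreasing relevance.
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

For a run, the per-query values would be averaged over all target stories for each k in 1, 5, 10.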

SLIDE 15

Evaluation: Relevance Judgment Tool

SLIDE 16

Relevance Overview

SLIDE 17

Timeline

6 May, 2013    Release of training corpus
4 Sept, 2013   Release of test corpus
27 Oct, 2013   Submission of runs
10 Nov, 2013   Release of qrels (result notification)
15 Nov, 2013   Working notes due
05 Dec, 2013   CL!NSS @ FIRE in New Delhi!

SLIDE 18

Participation Overview

Submission details

Teams were asked to submit their results as a ranked list for each language pair. Each team could submit up to 3 runs to try different approaches or configurations.

Participation

Teams          2012  2013
Registered     10    16
Participated   3     8
Runs           8     23
Working notes  2     6

SLIDE 19

Results

SLIDE 20

Lessons Learnt

Sometimes manually determining the focal/news events is quite difficult.
The scores achieved this year are quite high: NDCG@1 of 0.78 vs. last year’s best of 0.32.
Incorporating meta-information explicitly in similarity estimation helps.
Carefully selecting query terms from target documents also helps to improve performance.
Although the approaches treat the problem as ranking, more sophisticated modeling of stories would certainly help in determining same focal events.
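One way to picture the point about meta-information is to blend content similarity with the title and publication date that the corpus provides as metadata. The sketch below is purely illustrative and is not taken from any participant's system: the linear date decay, the 7-day window, the weights, and the names `date_proximity` and `combined_score` are all assumptions.

```python
from datetime import date
from typing import Tuple


def date_proximity(d1: date, d2: date, window_days: int = 7) -> float:
    """1.0 for same-day publication, decaying linearly to 0.0 at window_days apart."""
    gap = abs((d1 - d2).days)
    return max(0.0, 1.0 - gap / window_days)


def combined_score(
    text_sim: float,
    title_sim: float,
    d_target: date,
    d_source: date,
    weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),  # assumed, untuned
) -> float:
    """Weighted sum of content similarity, title similarity and date proximity."""
    w_text, w_title, w_date = weights
    return (
        w_text * text_sim
        + w_title * title_sim
        + w_date * date_proximity(d_target, d_source)
    )
```

In practice the weights and the date window would have to be tuned on the training corpus rather than fixed by hand.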

SLIDE 21

CL!NSS Programme

Date          Time   Details           Speaker/s
4th December  12:00  Overview Talk     Parth Gupta
5th December  15:30  Participant Talk  Amogh Param
              15:45  Participant Talk  Piyush Arora
              16:00  Participant Talk  Aarti Kumar
              16:15  Participant Talk  Sujoy Das

SLIDE 22

Thank You! (on behalf of the CL!NSS Team)

http://www.dsic.upv.es/grupos/nle/clinss.html

Supported By

SLIDE 23

References I

Barker, E. and Gaizauskas, R. J. (2012). Assessing the comparability of news texts. In LREC.

Platt, J. C., Toutanova, K., and Yih, W.-t. (2010). Translingual document representations from discriminative projections. In EMNLP, pages 251–261.