LREC 2004, Lisbon, May 2004
Progress Report from the Linguistic Data Consortium: recent activities in resource creation and distribution and the development of tools and standards
Christopher Cieri, Mark Liberman
{ccieri,myl}@ldc.upenn.edu
University of Pennsylvania
[Figure: chart by year, 1992 to 2003; series: Experimental, Regular; vertical axis marked 5000 to 25000]
[Figure: chart with shares Commercial 19%, Government 5%, Non-Profit 76%]
and generate readable transcripts, adapted for downstream processing
balance; 2742 hours of audio of which 2035 have been transcribed
retrieval and summarization of multilingual, multimodal news translated back into input language
the same input text at the sentence level; with human assessments of adequacy and fluency
– Arbitrary length audio files
– AG-compliant XML
– User-defined tag set
– Functions:
  » Listen to audio
  » Segment easily
  » Transcribe
  » Code
  » Output results in table format for further analysis
– Free and extensible via distributed source code
– Fillers: filled pauses and discourse markers
– Edit disfluencies
  » Type: repetition, revision, restart, complex
  » Structure: original, interruption point, editing term, correction
– SUs: semantic/syntactic units
  » Sentence-level: statement, question, backchannel, incomplete
  » Phrase-level
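The categories listed above could be represented in code roughly as follows. This is only an illustrative sketch, not the annotation format used in the project: the class names (`EditType`, `SUType`, `EditDisfluency`), the field names, and the example utterance are all hypothetical, chosen to mirror the type and structure labels on the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EditType(Enum):
    # Edit disfluency types named on the slide.
    REPETITION = "repetition"
    REVISION = "revision"
    RESTART = "restart"
    COMPLEX = "complex"


class SUType(Enum):
    # Sentence-level SU types named on the slide.
    STATEMENT = "statement"
    QUESTION = "question"
    BACKCHANNEL = "backchannel"
    INCOMPLETE = "incomplete"


@dataclass
class EditDisfluency:
    # Hypothetical record following the slide's structure:
    # original, interruption point, editing term, correction.
    edit_type: EditType
    original: str                      # words the speaker abandons
    correction: str                    # words that replace the original
    editing_term: Optional[str] = None  # optional filler such as "uh I mean"

    def interruption_point(self) -> int:
        # Assumption: the interruption point falls right after the
        # original, so it is expressed here as a word offset.
        return len(self.original.split())

    def verbatim(self) -> str:
        # Full spoken sequence, including the disfluent material.
        parts = [self.original, self.editing_term, self.correction]
        return " ".join(p for p in parts if p)

    def cleaned(self) -> str:
        # Readable rendering that keeps only the correction.
        return self.correction


example = EditDisfluency(
    edit_type=EditType.REVISION,
    original="to Boston",
    correction="to Denver",
    editing_term="uh I mean",
)
print(example.verbatim())
print(example.cleaned())
```

Keeping the original, editing term, and correction as separate fields means both a verbatim and a cleaned-up rendering can be derived from the same record.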
Entities PER, ORG, FAC Relations ROLE.member-
Events