LREC 2004, Lisbon, May 2004
The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text Christopher Cieri, David Miller, Kevin Walker {ccieri,damiller,walkerk}@ldc.upenn.edu University of Pennsylvania Linguistic Data Consortium and Department of Linguistics
– greater volumes of data in more languages
– with more sophisticated annotation
– for use in an expanding number of disciplines
– requiring standards, tools and best practices

– specific projects in data collection, annotation and publications
– incorporating annotation, research and tool development

– more intensive collaboration between researchers and data providers
– yielding more data creators, researchers with better appreciation for data creation and data creators with better appreciation of data uses
Arabic > Chinese) and volumes of data for training and testing
corpora
volume
– 2000 hours
– each subject makes 1-3 calls
– maximum call length is 10 minutes

– Sex: m/f
– Age: 16-29, 30-49, 50+
– Region: North, Midland, South, West, Canada, Other (?)
– also monitor handset, education, occupation in collection
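The collection targets above imply a rough lower bound on the number of calls and subjects. The sketch below works through that arithmetic. It assumes that the 2000 hours counts conversation time rather than per-side time, and that every call has exactly two subjects; the slides state neither, so treat the results as illustrative only.

```python
import math

TARGET_HOURS = 2000        # from the slide
MAX_CALL_MINUTES = 10      # from the slide
MIN_CALLS_PER_SUBJECT = 1  # from the slide
MAX_CALLS_PER_SUBJECT = 3  # from the slide
SPEAKERS_PER_CALL = 2      # assumption: two-party telephone calls


def min_calls(target_hours, max_call_minutes):
    """Fewest calls that reach the target if every call runs to the maximum length."""
    return math.ceil(target_hours * 60 / max_call_minutes)


def subject_range(calls, speakers_per_call, min_per_subject, max_per_subject):
    """Fewest and most subjects that could fill the given number of calls."""
    sides = calls * speakers_per_call
    fewest = math.ceil(sides / max_per_subject)  # everyone makes the maximum number of calls
    most = sides // min_per_subject              # everyone makes the minimum number of calls
    return fewest, most


calls = min_calls(TARGET_HOURS, MAX_CALL_MINUTES)
fewest, most = subject_range(
    calls, SPEAKERS_PER_CALL, MIN_CALLS_PER_SUBJECT, MAX_CALLS_PER_SUBJECT
)
print(calls, fewest, most)
```

Calls shorter than the 10-minute maximum would push all of these numbers up.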
– 7 days per week
– noon (EST) to midnight (PST)

– availability schedules of participants
  » given by day and hour
  » robot operator called at least once in each available block
– caller activity
  » in Fisher, callers had little motivation to initiate calls
  » Mixer offers incentives for call-ins and volume is much higher
  » platform functioned well in both cases
  » non-participation = de-selection
– total platform activity (energy)
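The rule that the robot operator calls at least once in each available block can be sketched as follows. This is a hypothetical illustration, not the LDC platform's code: the names `available_blocks` and `plan_calls` are invented here, a "block" is assumed to mean a run of consecutive available hours on one day, and time zones are ignored.

```python
from itertools import groupby


def available_blocks(hours):
    """Group hour numbers into runs of consecutive hours."""
    hours = sorted(set(hours))
    blocks = []
    # Within a run of consecutive hours, (hour - position) stays constant.
    for _, run in groupby(enumerate(hours), key=lambda pair: pair[1] - pair[0]):
        blocks.append([hour for _, hour in run])
    return blocks


def plan_calls(availability):
    """Return one (day, hour) call attempt per available block."""
    plan = []
    for day, hours in availability.items():
        for block in available_blocks(hours):
            # Assumption: attempt the call in the first hour of each block.
            plan.append((day, block[0]))
    return plan


# Hypothetical participant schedule, given by day and hour (24-hour clock).
schedule = {"Mon": [18, 19, 20], "Tue": [12, 13, 21]}
attempts = plan_calls(schedule)
print(attempts)
```

A real scheduler would also need to weigh caller activity and total platform load, which the slide lists as further inputs.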
[Figure: two charts, titles not recoverable. First chart: vertical axis 2000 to 18000, horizontal axis dates from 11/27/2002 to 12/10/2003 at two-week intervals. Second chart: vertical axis 1000 to 9000, categories 1, 2, 3.]
Group
– 16-29: 38%
– 30-49: 45%
– 50+: 17%

[Chart: Male and Female by age group (16-29, 30-49, 50+)]
Distribution by Region
– North: 24%
– Midland: 26%
– South: 19%
– West: 17%
– Canada: 1%
– Non-USA: 3%
– Non-Native: 10%

[Chart: distribution by region (North, Midland, South, West, Canada, Non-USA, Non-Native)]
– 160 seconds, 4 segments
– Grade: A, C, F

– Language: Is it English? Is it understandable?
– Speaker: Does the speaker seem to belong to the registered age and gender?
– Channel: Do noise, echo, distortion levels interfere with comprehension?
– Call Content: Is the discussion directed speech on the assigned topic?
– complete, verbatim
– without punctuation, special symbols, talker/background noise
– with limited interjections, non-lexemes
– (( )) for unclear speech, a dash (-) for truncated speech
– annotators may insert other special symbols, punctuation if natural

– Segmentation: 3xRT to 0xRT (automatic or forced-aligned)
– Transcription: 5xRT
– Post-processing: 1xRT (QC on spelling, format, numbers)

– spelled acronyms, numbers, spacing, proper names, disfluencies

– all new EARS English and Arabic training data is QTr style
– most English produced by WordWave under contract to BBNT
– LDC provides some English QTr and all Levantine Arabic
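The real-time (xRT) factors listed above translate into labor estimates, where 1xRT means one hour of work per hour of audio. The sketch below assumes that the three stages are sequential and that their costs simply add, which the slides do not state. It borrows the 2000-hour collection target from the earlier slide purely for illustration.

```python
# Real-time factors from the slide: hours of labor per hour of audio.
RATES_MANUAL = {"segmentation": 3, "transcription": 5, "post_processing": 1}
# Automatic or forced-aligned segmentation reduces that stage to 0xRT.
RATES_AUTO = {**RATES_MANUAL, "segmentation": 0}


def labor_hours(audio_hours, rates):
    """Total labor, assuming the stage costs are additive."""
    return audio_hours * sum(rates.values())


AUDIO_HOURS = 2000  # illustrative: the collection target from the earlier slide

manual = labor_hours(AUDIO_HOURS, RATES_MANUAL)
auto = labor_hours(AUDIO_HOURS, RATES_AUTO)
print(manual, auto, manual - auto)
```

Under these assumptions, automating segmentation cuts the total from 9xRT to 6xRT, a saving of one third.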
– similar model
– >1000 hours new collection
– subjects allowed to make up to 20 calls