Parsing transcripts of speech
Andrew Caines¹, Michael McCarthy² & Paula Buttery¹
¹University of Cambridge ²University of Nottingham
Speech-Centric NLP, 7 September 2017

Background
Speech (can be) very different from writing
◮ filled pauses: um he’s a closet yuppie is what he is
◮ repetitions: I played, I played against um
◮ false starts: You’re happy to – welcome to include it
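To make the first two disfluency types concrete, here is a minimal Python sketch that strips filled pauses and collapses immediately repeated word sequences before a transcript is passed to a parser. It is an illustration only, not the method used in this work: the filled-pause list, the function names and the `max_n` limit are all assumptions, and false starts are left untouched because they cannot be detected reliably by surface matching.

```python
import re

# Assumed, non-exhaustive list of filled pauses (hypothetical; adjust per corpus).
FILLED_PAUSES = {"um", "uh", "er", "erm"}


def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[\w'’]+", text.lower())


def strip_filled_pauses(tokens):
    """Remove tokens that appear in the filled-pause list."""
    return [t for t in tokens if t not in FILLED_PAUSES]


def collapse_repetitions(tokens, max_n=3):
    """Drop the first copy of any immediately repeated n-gram (n <= max_n)."""
    out = []
    i = 0
    while i < len(tokens):
        skipped = False
        for n in range(max_n, 0, -1):
            # Compare the n-gram starting here with the n-gram that follows it.
            if tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                i += n  # skip the first copy, keep the second
                skipped = True
                break
        if not skipped:
            out.append(tokens[i])
            i += 1
    return out


def clean(text):
    """Apply both steps to a raw transcript string."""
    return collapse_repetitions(strip_filled_pauses(tokenize(text)))
```

Applied to the repetition example above, `clean("I played, I played against um")` returns `['i', 'played', 'against']`. A rule-based filter like this will also collapse legitimate repetitions (for example "very very"), so it is only a rough preprocessing baseline.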
◮ PTB Switchboard Corpus of transcribed telephone
◮ UD English Web Treebank (EWT)
◮ UD English LinES (LinES), parallel corpus of English novels
◮ UD Treebank of Learner English (TLE), subset of CLC-FCE
[Figure: unlabelled attachment score (y-axis, 0.0 to 1.0) by unit length in tokens (x-axis, bins 1-10 to 71-80); panels: SWB, EWT, LinES, TLE]
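For readers unfamiliar with the metric, unlabelled attachment score (UAS) is the proportion of tokens whose predicted head matches the gold head. The sketch below computes UAS per unit-length bin, using the same 10-token bins as the axis. It assumes each unit is given as a pair of head-index lists (gold, predicted) and pools tokens within each bin; whether the original analysis pooled tokens or averaged per unit is not stated, so that choice and the names `length_bin` and `uas_by_length` are assumptions.

```python
from collections import defaultdict


def length_bin(n_tokens, width=10):
    """Map a unit length to a bin label such as '1-10' or '11-20'."""
    lo = ((n_tokens - 1) // width) * width + 1
    return f"{lo}-{lo + width - 1}"


def uas_by_length(units, width=10):
    """Compute token-pooled UAS per length bin.

    units: iterable of (gold_heads, pred_heads) pairs, one pair per unit,
    where each list holds one head index per token.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in units:
        if len(gold) != len(pred):
            raise ValueError("gold and predicted head lists differ in length")
        if not gold:
            continue  # skip empty units
        label = length_bin(len(gold), width)
        correct[label] += sum(g == p for g, p in zip(gold, pred))
        total[label] += len(gold)
    return {label: correct[label] / total[label] for label in total}
```

With toy input such as `[([2, 0, 2], [2, 0, 1]), ([0, 1], [0, 1])]`, both units fall in the `1-10` bin and four of the five heads match, giving a UAS of 0.8 for that bin.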
[Figure: UAS Δ(PCFG_WSJ_SWB_40, CoreNLP) (y-axis, 0.0 to 1.0) by unit length in tokens (x-axis, bins 1-10 to 71-80); panels: SWB, EWT, LinES, TLE]
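The Δ on this axis compares two parsers bin by bin. As a small illustration, the sketch below subtracts one parser's per-bin UAS from another's for the bins both have. The direction of subtraction (first argument minus second) and the function name `uas_delta` are assumptions; the slides do not spell out how the Δ was computed.

```python
def uas_delta(uas_a, uas_b):
    """Per-bin difference uas_a - uas_b over the bins present in both.

    uas_a, uas_b: dicts mapping bin labels such as '1-10' to UAS values.
    Bins are returned in ascending order of their lower bound.
    """
    shared = uas_a.keys() & uas_b.keys()
    ordered = sorted(shared, key=lambda label: int(label.split("-")[0]))
    return {label: uas_a[label] - uas_b[label] for label in ordered}
```

A positive value for a bin means the first parser attaches more heads correctly than the second on units of that length.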
◮ Cambridge English Language Assessment
◮ Sebastian Schuster & Chris Manning re UD corpora
◮ SCNLP organisers
◮ 3 anonymous reviewers