

SLIDE 1

Exploring Measures of “Readability” for Spoken Language

Analyzing linguistic features of subtitles to identify age-specific TV programs

Sowmya Vajjala and Detmar Meurers
University of Tübingen, Germany

The 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations @ EACL, Gothenburg, April 27, 2014

Outline:
◮ Introduction
◮ Our Approach: The Corpus, Features, Tools and Resources
◮ Experiments: Setup and General Results, Feature Selection, Ablation Test Results, Confusion Matrix, Effect of Text Size
◮ Conclusions

SLIDE 2

The talk in a nutshell

◮ Idea: investigate if features from readability assessment can be used to characterize age-specific TV programs
  ◮ based on a corpus of BBC subtitles
  ◮ using a text classification approach
◮ We show that the authentic materials targeting specific age groups exhibit
  ◮ a broad range of linguistic and psycholinguistic characteristics
  ◮ indicative of the complexity of the language used.
◮ Our approach reaches an accuracy of 95.9%.

SLIDE 3

Motivation

◮ Reading, listening and watching TV are all ways to obtain information.
◮ Some TV programs are also created for particular age groups (similar to graded readers).
◮ Audio-visual presentation and language are important factors in creating age-specific TV programs.
◮ How characteristic of the targeted age group is the language by itself?
◮ We hypothesize that the linguistic complexity of the subtitles is a good predictor.
◮ We explore this hypothesis using features from automatic readability assessment.

SLIDE 4

Our Approach: Overview

◮ Corpus: BBC subtitles (Van Heuven et al. 2014)
  ◮ TV programs targeting different age groups
◮ Features: range of properties, mostly from Second Language Acquisition and psycholinguistic research
◮ Modeling: three-class text classification
◮ Evaluation: accuracy, with 10-fold cross-validation

SLIDE 5

The BBC Subtitles Corpus

◮ The BBC started subtitling all scheduled programs on its main channels in 2008.
◮ Van Heuven et al. (2014) compiled a subtitles corpus from nine BBC TV channels.
◮ Subtitles of four channels are annotated: CBeebies, CBBC, News and Parliament.
◮ Corpus in numbers:

  Program Category            Age group    # texts   avg. tokens per text   avg. sentence length (in words)
  CBeebies                    < 6 years    4846      1144                   4.9
  CBBC                        6–12 years   4840      2710                   6.7
  Adults (News + Parliament)  > 12 years   3776      4182                   12.9

◮ We use a balanced set consisting of 3776 texts per class.
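The balanced set can be obtained by downsampling each class to the size of the smallest one (Adults, with 3776 texts). A minimal stdlib sketch, using hypothetical placeholder text IDs rather than the real subtitle files:

```python
import random

def balance_classes(texts_by_class, seed=42):
    # Downsample every class to the size of the smallest one,
    # matching the 3776-texts-per-class setup described above.
    rng = random.Random(seed)
    n = min(len(texts) for texts in texts_by_class.values())
    return {label: rng.sample(texts, n)
            for label, texts in texts_by_class.items()}

# Placeholder IDs standing in for the actual subtitle texts:
corpus = {
    "CBeebies": [f"cbeebies_{i}" for i in range(4846)],
    "CBBC":     [f"cbbc_{i}" for i in range(4840)],
    "Adults":   [f"adult_{i}" for i in range(3776)],
}
balanced = balance_classes(corpus)  # 3776 texts in each class
```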

SLIDE 6

Features 1

◮ Lexical Features
  ◮ lexical richness features from Second Language Acquisition (SLA) research
    e.g., type-token ratio, noun variation, ...
  ◮ POS density features
    e.g., # nouns/# words, # adverbs/# words, ...
  ◮ traditional features and formulae
    e.g., # characters per word, Flesch-Kincaid score, ...
◮ Syntactic Features
  ◮ syntactic complexity features from SLA research
    e.g., # dependent clauses/clause, average clause length, ...
  ◮ other parse tree features
    e.g., # NPs per sentence, avg. parse tree height, ...

= Features from Vajjala & Meurers (2012)
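Two of the lexical measures named above, type-token ratio and average characters per word, are straightforward to compute. A toy sketch with naive regex tokenization (the paper's actual pipeline tokenizes and tags with the Stanford Tagger):

```python
import re

def lexical_features(text):
    # Naive tokenization; a POS tagger is used in the real pipeline.
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "avg_chars_per_word": sum(len(t) for t in tokens) / len(tokens),
    }

feats = lexical_features("The cat sat on the mat and the dog sat too")
# 8 distinct types over 11 tokens -> type_token_ratio = 8/11
```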

SLIDE 7

Features 2

◮ Morphological properties of words
  ◮ e.g., does the word contain a stem along with an affix? abundant = abound + -ant
◮ Age of Acquisition (AoA)
  ◮ average age of acquisition of the words in a text
◮ Other psycholinguistic features
  ◮ e.g., word abstractness
  ◮ avg. number of senses per word (obtained from WordNet)
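The AoA feature amounts to a lexicon lookup averaged over the rated words of a text. A sketch with a hypothetical three-word toy lexicon standing in for Kuperman et al. (2012)'s ratings:

```python
def mean_aoa(tokens, aoa_ratings):
    # Average the ratings of the words found in the lexicon,
    # silently skipping unrated words.
    rated = [aoa_ratings[t] for t in tokens if t in aoa_ratings]
    return sum(rated) / len(rated) if rated else None

# Hypothetical toy ratings; the real resource covers ~30,000 words.
toy_aoa = {"dog": 2.5, "cat": 2.3, "parliament": 11.8}
score = mean_aoa(["the", "dog", "and", "the", "cat"], toy_aoa)
# only "dog" and "cat" are rated -> (2.5 + 2.3) / 2 = 2.4
```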

SLIDE 8

Implementation details

Tools, Resources and Algorithms used

◮ Tools:
  ◮ for lexical features: Stanford Tagger (Toutanova et al. 2003)
  ◮ for syntactic features: Berkeley Parser (Petrov & Klein 2007) and Tregex pattern matcher (Levy & Andrew 2006)
  ◮ for classification: algorithms implemented in WEKA (http://www.cs.waikato.ac.nz/ml/weka/)
◮ Resources:
  ◮ CELEX Lexical Database (http://celex.mpi.nl)
  ◮ Kuperman et al. (2012)'s AoA ratings
  ◮ MRC Psycholinguistic Database (http://ota.oucs.ox.ac.uk/headers/1054.xml)
  ◮ WordNet Database (http://wordnet.princeton.edu)

SLIDE 9

Classification Experiments

◮ We explored several classification algorithms (SMO, J48 decision tree, logistic regression).
  ◮ SMO marginally outperformed the others (by 1–1.5%).
  ◮ So all further experiments were performed with SMO.
◮ Random baseline: 33%
◮ Sentence length baseline: 71.4%
◮ Accuracy using the full set of 152 features: 95.9%
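A single-feature baseline like the sentence-length one can be pictured as simple thresholding. The sketch below is illustrative only: the thresholds are assumptions, placed midway between the per-class mean sentence lengths reported for the corpus (4.9, 6.7 and 12.9 words), whereas the 71.4% baseline above comes from a trained classifier.

```python
def sentence_length_baseline(avg_sentence_length):
    # Hypothetical cutoffs midway between the per-class means
    # (4.9, 6.7 and 12.9 words) -- an assumption for illustration,
    # not the classifier actually evaluated in the paper.
    if avg_sentence_length < (4.9 + 6.7) / 2:    # < 5.8
        return "CBeebies"
    if avg_sentence_length < (6.7 + 12.9) / 2:   # < 9.8
        return "CBBC"
    return "Adults"

label = sentence_length_baseline(7.5)  # -> "CBBC"
```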

SLIDE 10

Feature Selection

We explored two feature selection approaches to understand which features contribute the most to classification accuracy:

1. Select features individually based on Information Gain (IG)
   ◮ implemented as InfoGainAttributeEval in WEKA
2. Select a subset of features that do not correlate with each other but are highly predictive
   ◮ implemented as CfsSubsetEval (Hall 1999) in WEKA
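Information gain for a discrete (or discretized) feature is the class entropy minus the class entropy remaining after splitting on that feature. A stdlib sketch of the quantity InfoGainAttributeEval ranks features by (WEKA additionally discretizes numeric features first):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # H(class) minus the weighted class entropy within each
    # feature-value subset.
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels)
                  if f == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# A perfectly predictive feature recovers the full class entropy:
ig = information_gain(["short", "short", "long", "long"],
                      ["kids", "kids", "adults", "adults"])  # -> 1.0 bit
```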

SLIDE 11

Results for Top 10 IG features

Rank  Feature                                               Accuracy
1     avg. AoA (Kuperman et al. 2012)                       82.4%
2     avg. # PPs in a sentence                              74.0%
3     avg. # instances where the lemma has stem and affix   77.7%
4     avg. parse tree height                                73.4%
5     avg. # NPs in a sentence                              73.0%
6     avg. # instances of affix substitution                74.3%
7     avg. # prepositions in a sentence                     72.0%
8     avg. # instances where a lemma is not a count noun    68.3%
9     avg. # clauses per sentence                           72.5%
10    sentence length                                       71.4%

Accuracy with all 10 features together: 84.5%.

SLIDE 12

Feature selection with CfsSubsetEval

◮ 41 of the total 152 features are selected.
◮ The full list of selected features is provided in the paper.
◮ Classification accuracy with the 41 features: 93.9%
  → only 2% less than classification with all the features

SLIDE 13

Feature selection: Result Summary

Feature Subset (#)         Accuracy   SD on 10-fold CV
All Features (152)         95.9%      0.37
Cfs on all features (41)   93.9%      0.59
Top-10 IG features (10)    84.5%      0.70

The avg. SD over the test sets in all CV folds is given to make comparisons in terms of statistical significance possible.

SLIDE 14

Ablation Test Results

Features                                    Acc.     SD
All Features                                95.9%    0.37
All − AoA Kup Lem                           95.9%    0.37
All − All AoA Features                      95.6%    0.58
All − psych                                 95.8%    0.31
All − celex                                 94.7%*   0.51
All − celex − psych                         93.6%*   0.66
All − celex − psych − lex (= syntax only)   77.5%*   0.99
lex                                         93.1%*   0.70
celex                                       90.0%*   0.79
psych                                       84.5%*   1.12

The * indicates results that are statistically different from the model with all features.

SLIDE 15

Confusion matrix for the model with all features

classified as →   CBeebies   CBBC   Adults
CBeebies (0–6)    3619       156    1
CBBC (6–12)       214        3526   36
Adults (12+)      2          58     3716
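The overall accuracy follows directly from the diagonal of this matrix. A minimal sketch that recovers the 95.9% figure reported earlier:

```python
def accuracy_from_confusion(matrix):
    # Rows are true classes, columns are predictions;
    # diagonal cells count correct classifications.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

cm = [[3619, 156,    1],   # CBeebies
      [214, 3526,   36],   # CBBC
      [2,     58, 3716]]   # Adults
acc = accuracy_from_confusion(cm)  # -> 0.9588..., i.e. 95.9%
```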

SLIDE 16

Effect of Text Size on Classification Accuracy

[Figure: classification accuracy (in percent) vs. max. text size (100–900 words), plotted for the All Features, PSYCH, LEX, SYN and CELEX feature sets]

◮ Longer texts support higher accuracy.
◮ But even with 100 words per text, we reach >80%.

SLIDE 17

Summary

◮ The rich (psycho)linguistic feature set performs very well, achieving a classification accuracy of 95.9%.
◮ Single most predictive feature: AoA (accuracy: 82.4%)
  ◮ But removing this feature did not affect the accuracy.
⇒ The age-specific nature of authentic material is reflected in a wide range of linguistic and psychological properties.
◮ For practical tasks, accuracies above 90% can also be achieved with feature subsets and relatively short texts.
◮ Longer texts supported more accurate predictions.

SLIDE 18

Outlook

◮ Explore the effects of using a parser tuned to spoken language, to see if the syntactic features can be improved.
◮ More qualitative error analysis of misclassified cases.
◮ Perform a cross-genre evaluation comparing written and spoken texts in terms of their complexity.

SLIDE 19

Thank you!

◮ Questions? :-)
◮ Contact: sowmya@sfs.uni-tuebingen.de

SLIDE 20

References

Hall, M. A. (1999). Correlation-based Feature Selection for Machine Learning. Ph.D. thesis, The University of Waikato, Hamilton, New Zealand.

Kuperman, V., H. Stadthagen-Gonzalez & M. Brysbaert (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods 44(4), 978–990. URL http://crr.ugent.be/archives/806.

Levy, R. & G. Andrew (2006). Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In 5th International Conference on Language Resources and Evaluation. Genoa, Italy: European Language Resources Association (ELRA), pp. 2231–2234.

Petrov, S. & D. Klein (2007). Improved Inference for Unlexicalized Parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Rochester, New York, pp. 404–411.

Toutanova, K., D. Klein, C. D. Manning & Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL. Edmonton, Canada, pp. 252–259.

Vajjala, S. & D. Meurers (2012). On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition. In Proceedings of the 7th Workshop on Innovative Use of NLP for Building Educational Applications (BEA) at NAACL-HLT. Montréal, Canada: ACL, pp. 163–173. URL http://aclweb.org/anthology/W12-2019.pdf.

Van Heuven, W. J., P. Mandera, E. Keuleers & M. Brysbaert (2014). Subtlex-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, pp. 1–15. URL http://dx.doi.org/10.1080/17470218.2013.850521.