types2
Exploring word-frequency differences in corpora
Tanja Säily & Jukka Suomela
Comparing word frequencies
Corpus linguists do this all the time
Significance of differences observed?
– Bag-of-words tests (e.g. chi-square, log-likelihood ratio test) assume words occur randomly in texts,
– Tests based on resampling: assumption-free, yield confidence intervals
– Not conducive to rapid exploration
– Need to go back to the concordances & metadata
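As a rough illustration of the resampling-based tests mentioned above, the sketch below compares the type count of one subcorpus with random subcorpora of comparable size drawn from the whole corpus. It is a minimal sketch under stated assumptions, not the types2 implementation: the sample layout (dicts with `group`, `words` and `types` keys) and the function names are hypothetical.

```python
import random


def type_count(samples):
    """Number of distinct types across the given samples."""
    types = set()
    for s in samples:
        types |= s["types"]
    return len(types)


def permutation_p_value(samples, group, iterations=10_000, seed=0):
    """Estimate how often a random subcorpus of comparable size contains
    at most as many types as the subcorpus for `group` (lower tail).

    Assumed (hypothetical) sample layout:
    {"group": str, "words": int, "types": set of str}.
    """
    rng = random.Random(seed)
    observed = [s for s in samples if s["group"] == group]
    target_words = sum(s["words"] for s in observed)
    observed_types = type_count(observed)

    at_most = 0
    for _ in range(iterations):
        order = samples[:]
        rng.shuffle(order)
        # Accumulate whole samples in random order until the running
        # word count reaches the size of the observed subcorpus.
        words, types = 0, set()
        for s in order:
            if words >= target_words:
                break
            words += s["words"]
            types |= s["types"]
        if len(types) <= observed_types:
            at_most += 1

    # Add-one correction so the estimate is never exactly zero.
    return (at_most + 1) / (iterations + 1)
```

A small value suggests that the group uses fewer types than random subcorpora of similar size. Whole samples (for example, all hits from one speaker or one letter) are shuffled rather than individual words, so the sketch does not assume that words occur randomly in texts.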
figures, linked data
– BNCweb, WordSmith Tools…
corpus metadata
– Sociolinguistic variation in their productivity?
– Productivity ~ type frequency
– Demographically sampled spoken component, both gender & social class known: 2.6 Mw
– BNCweb (Lancaster University)
– MorphoQuantics (Laws & Ryder 2014)
[Workflow diagram. Labels: BNC; BNCweb; MorphoQuantics; raw search results; word, POS, class, lemma; relevant hits; metadata, word counts; database; types2; plots, results, web pages]
– Generating hypotheses
– Interpreting results
focus on tools & occupations?
corpora & concordancers
http://morphoquantics.co.uk
derivational morphology in adult speech: A corpus analysis using MorphoQuantics. Language Studies Working Papers: University of Reading, Vol. 6, 3–17.
accumulation curves. http://users.ics.aalto.fi/suomela/types2/
affixes with BNC & MorphoQuantics data. https://github.com/suomela/bnc-affix
– Sociolinguistic variation in their productivity?
Correspondence
– Long 18th century, 1680–1800: 2.2 Mw
– WordSmith Tools (Mike Scott)
– Pruned down to relevant hits in Excel
“If you find any thing that has the least appearance of coxcombicality, affectation, importance, conceit, &c., have no mercy upon it.”
Thomas Twining to Richard Twining, 1788