CORPUS APPROACHES TO THE LANGUAGE OF
INTERDISCIPLINARY RESEARCH ARTICLES
Aug 2013 – Aug 2015 ESRC-funded project: ES/K007300/1
Paul Thompson
with Susan Hunston, Akira Murakami, and Dominik Vajn
BACKGROUND
¢ Substantial amount of work carried out on academic
discourse in corpus linguistics (e.g., Biber; Hyland).
¢ The research has typically drawn clear boundaries
between disciplines (e.g., history vs biology) or between levels of research (e.g., pure vs applied).
¢ In recent times there has been an expansion in
‘interdisciplinary research’
¢ Little work, nevertheless, on the linguistic nature of
interdisciplinary research as opposed to general research discourse and disciplinary research discourse.
Ø to achieve a fuller understanding of the
interdisciplinary journal published by Elsevier
¢ * Full holdings of a successful IDR journal,
675 articles
¢ * Holdings of 5 IDR journals (interdisciplinary)
¢ Surveys, interviews with editors, board members,
Data-driven approach – collect data and see what comes out of it
Look for monodisciplinary journals by finding those with a high clustering coefficient; multidisciplinary journals are those with a low clustering coefficient.
Journals with connections to journals which are themselves well-connected to one another are said to have a high clustering coefficient.
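The clustering-coefficient idea can be illustrated with a minimal sketch. It assumes an undirected graph of journal-to-journal links and uses the standard local clustering coefficient (the share of a journal's neighbours that are linked to each other). The journal names and edges below are hypothetical, and the project's own network data and software may have differed.

```python
from itertools import combinations

def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to check
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return linked / (k * (k - 1) / 2)

# Hypothetical undirected links between journals (illustrative only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"),
         ("M", "A"), ("M", "X"), ("M", "Y"), ("M", "Z")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

coefficients = {j: local_clustering(graph, j) for j in sorted(graph)}
```

In this toy graph, journal B sits in a tightly knit group (its neighbours are linked to each other), whereas journal M links to journals that are not linked to one another, which is the pattern the slide associates with multidisciplinary journals.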
R & Sketch Engine
¢ Categories
1. Empirical
2. Policy discussion
3. Research agenda / Research Framework
4. Other
¢ Disagreements were resolved by negotiation.
[Chart: Empirical, Policy, Agenda/Framework, and Other articles by year, 1991–2010]
¢ Corpus data can be explored ‘top-down’ or
¢ Texts can be grouped by text-external criteria
¢ We have taken different bottom-up approaches
Multidimensional Analysis [6 new dimensions]
Keyword analysis
Topic modelling
¢ Following slides present topic modelling and
¢ All the full research papers in the 11 journals
¢ Main text only
¢ 11,703 papers
¢ 56 million words
¢ Probabilistic TM is a machine learning technique that
automatically identifies “topics” in a given corpus (Blei, 2012)
¢ Latent Dirichlet allocation (LDA)
¢ Automatically identifies “topics” in a given corpus:
keywords in each topic
distribution of topics in each document
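To make the two outputs concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling on a toy corpus. This is an illustration only: the project used the topicmodels package in R, whose estimation settings are not given here. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four toy documents (built from stems shown elsewhere in these slides) are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of tokens.

    Returns (phi, theta): phi[k][w] is the probability of word w in topic k,
    theta[d][k] is the share of topic k in document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def update(d, w, k, step):
        doc_topic[d][k] += step
        topic_word[k][w] += step
        topic_total[k] += step

    # Start from a random topic assignment for every token.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            update(d, w, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)  # take the token out of the counts
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, assign[d][i], +1)

    phi = [{w: (topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
            for w in vocab} for k in range(n_topics)]
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    return phi, theta

# Toy corpus of already-stemmed tokens (illustrative only).
docs = [
    "dna primer sequenc pcr gene clone".split(),
    "pcr dna gene primer amplifi".split(),
    "climat chang scenario adapt polici".split(),
    "polici climat adapt vulner chang".split(),
]
phi, theta = lda_gibbs(docs, n_topics=2)
top_words = [sorted(topic, key=topic.get, reverse=True)[:3] for topic in phi]
```

Here `top_words` corresponds to the "keywords in each topic" and each row of `theta` to the "distribution of topics in each document".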
¢ Methodologically,
Bag-of-words approach → single words
¢A research paper
¢Better to divide papers into multiple
¢This allows the investigation of topic
¢ Punctuation, numbers, and the standard list of stop
words were removed.
¢ All the words were stemmed using the Porter stemmer
(e.g., require → requir, analysis → analysi).
¢ Only terms that appear in at least 0.1% of all
the documents were targeted.
¢ Each text was tagged with information on where in
the paper the paragraph(s) appeared,
e.g., 70% from the beginning of the paper.
¢ This created 140,816 ‘texts’ with an average length of
248 words per text (SD = 57.6) after removal of the stop words
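The preprocessing steps above can be sketched as follows. This is a rough illustration under several assumptions. The stop-word list is a tiny stand-in for the standard list. The `stem` argument is a placeholder where a Porter stemmer would be plugged in (the identity function is used here). The way relative position is computed, from the chunk's index, is a guess at what "70% from the beginning" means. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the project used a standard list.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "was", "were", "is"}

def tokenize(text, stem=lambda w: w):
    """Lower-case, keep alphabetic tokens only, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())  # drops punctuation and numbers
    return [stem(w) for w in words if w not in STOP_WORDS]

def filter_rare_terms(token_lists, min_doc_share=0.001):
    """Keep only terms that occur in at least min_doc_share of the texts."""
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    threshold = min_doc_share * len(token_lists)
    keep = {term for term, n in doc_freq.items() if n >= threshold}
    return [[t for t in tokens if t in keep] for tokens in token_lists]

def relative_position(index, n_chunks):
    """Start of a chunk as a percentage of the way through its paper."""
    return 100.0 * index / n_chunks

# Hypothetical three-chunk paper.
paper = [
    "Soil samples were collected in 2009.",
    "The samples were dried and filtered.",
    "Results suggest that pollution is rising.",
]
tokens = [tokenize(chunk) for chunk in paper]
# A 50% threshold is used only because the toy corpus is so small.
filtered = filter_rare_terms(tokens, min_doc_share=0.5)
positions = [relative_position(i, len(paper)) for i in range(len(paper))]
```

With the 0.1% threshold and 140,816 texts, a term would need to occur in roughly 141 or more texts to be kept.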
¢ topicmodels package in R
¢ No agreed way to automatically determine the
¢ We built topic models with 10, 20, 30, . . . ,
¢ Each topic in the model with 100 topics looked
¢ Identify prominent topics at different positions
¢ Compare prominent topics across papers or
¢ Cluster papers according to topic distribution
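One way to identify topics that are prominent at different positions is to average each topic's probability within position bins. The sketch below assumes that each text comes with a position (as a percentage of the way through its paper) and a topic distribution from the model. The function name, the binning scheme, and the toy numbers are hypothetical, not the project's actual procedure.

```python
def mean_topic_by_position(texts, n_topics, n_bins=10):
    """Average topic probabilities within equal-width position bins.

    texts: iterable of (position_percent, topic_probabilities) pairs.
    Returns one list of mean probabilities per bin, or None for empty bins.
    """
    sums = [[0.0] * n_topics for _ in range(n_bins)]
    counts = [0] * n_bins
    for position, theta in texts:
        # A position of exactly 100% falls into the last bin.
        b = min(int(position / 100 * n_bins), n_bins - 1)
        counts[b] += 1
        for k, p in enumerate(theta):
            sums[b][k] += p
    return [[s / c for s in row] if c else None
            for row, c in zip(sums, counts)]

# Hypothetical texts with two topics: (position in %, topic distribution).
texts = [
    (0, [0.8, 0.2]),
    (30, [0.6, 0.4]),
    (70, [0.1, 0.9]),
    (100, [0.3, 0.7]),
]
means = mean_topic_by_position(texts, n_topics=2, n_bins=2)
```

In this toy example the first topic dominates the first half of papers and the second topic dominates the second half. The same per-text distributions could also be averaged per paper or per journal for comparison or clustering.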
countri develop global world nation econom industri intern million year popul growth environment import major trade decad centuri economi
Each line is a different topic – there are a hundred in total
¢ Topic 3: use, dna, primer, sequenc, clone, pcr, fragment,
rna, cdna, probe, perform, hybrid, isol, amplifi, min, total, follow, gene, cycl, product
¢ Topic 40: min, buffer, use, extract, incub, contain, solut,
centrifug, assay, protein, describ, determin, wash, mixtur, supernat, gel, prepar, homogen, acid, follow
¢ Topic 55: sampl, use, extract, standard, determin, analysi,
column, analyz, method, analys, digest, analyt, concentr, abc, dri, acid, filter, mass, min, detect
¢ Topic 97: collect, use, place, sampl, water, chamber, day,
dri, remov, week, solut, filter, contain, experi, store, replic, plastic, diamet, pot, tube
¢Topic 50: will, can, may, need, requir,
¢Topic 53: may, suggest, like, might,
¢ Topic 1: pollut, deposit, sourc, atmospher, air, area, lichen,
element, moss, main, monitor, load, industri, isotop, contribut, anthropogen, particul, dust, concentr, major
¢ Topic 7: countri, develop, global, world, nation, econom,
industri, intern, million, year, popul, growth, environment, import, major, trade, mani, decad, centuri, economi
¢ Topic 32: etaal [=et al], sediment, lake, river, contamin,
concentr, water, mercuri, effluent, organ, studi, estuari, environ, wastewat, bay, mehg, figa, pollut, aquat, sourc
¢ Topic 72: studi, rice, mani, china, recent, howev, wide,
includ, also, sever, import, high, common, various, major, larg, well, report, varieti, although
TOPICS PROMINENT AT THE BEGINNING AND THE END OF PAPERS
¢ Topic 31: rural, social, communiti, local, cultur, econom, polit, place, within, new, way, discours, relat, peopl, argu, particular, ident, societi, natur, construct
¢ Topic 33: method, use, approach, techniqu, can, applic, appli, develop, requir, propos, base, provid, limit, altern, advantag, allow, howev, combin, work, need
¢ Topic 58: research, paper, section, discuss, focus, approach, work, develop, analysi, issu, understand, framework, scienc, literatur, process, address, knowledg, review, studi, provid
¢ Topic 81: chang, climat, scenario, adapt, impact, vulner, futur, global, assess, capac, project, polici, uncertainti, respons, will, current, region, rise, warm, ipcc
¢ Topic 87: process, system, biolog, agent, physic, theori, inform, organ, natur, can, concept, dynam, intern, quantum, principl, mechan, one, environ, space, idea
area urban citi popul local hous household rural resid locat park counti migrat district residenti access build town centr villag
farm agricultur farmer product food produc market practic household incom local livestock labour manag consum convent econom coffe activ
Topic revolves around agriculture from a ‘human’ point of view
100
veget graze grassland grass pastur cover manag intens studi cattl product year area biomass stock anim nativ forag meadow livestock
crop yield wheat harvest maiz system year grain product weed rotat cultiv fertil manag fallow practic field tillag corn cereal
Water
etaal sediment lake river contamin concentr water mercuri effluent studi estuari environ wastewat bay mehg figa pollut aquat sourc
Predominantly Environmental Pollution
Different senses of water:
Resource and part of ecological system
water irrig reservoir potenti surfac suppli storag capac condit use avail releas limit qualiti system also demand evapor inflow balanc
84
rainfal river runoff event stream catchment flood basin watersh hydrolog flow area discharg wetland rain drainag eros slope storm intens
¢ Topic models are useful in exploring the
¢ “Topics” identified in topic models are generally
¢ Topic models help us identify keywords at
chang climat scenario adapt impact vulner futur global assess capac project polici uncertainti respons will current region rise warm ipcc
research paper section discuss focus approach work develop analysi issu understand framework scienc literatur process address knowledg review studi provid
How about Global Environmental Change? Which topics are typical of GEC?
¢ Semantically unspecific abstract nouns that
¢ Variously called:
‘general nouns’ (Halliday & Hasan 1976; Mahlberg 2005)
‘Vocabulary 3 items’ (Winter 1977)
‘enumerables’ and ‘advance labelling’ (Tadros 1985)
‘anaphoric nouns’ (Francis 1986)
‘carrier nouns’ (Ivanič 1991)
‘labels’ (Francis 1994)
‘shell nouns’ (Hunston & Francis 2000; Schmid 2000)
‘signalling nouns’ (Flowerdew 2003)
¢ Frequently the head of definite or demonstrative
¢ Shell-noun phrases expedite cognitive processing
¢ They also offer writers the opportunity to express
¢ Scenario (6882), model (6851), study (6369),
Top four journals in terms of relative use of these shell nouns are three ID journals and one ‘transdisciplinary’
¢ 3463 for example
¢ 244 an example
¢ 64 another example
¢ 60 one example
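Counts like these can be obtained by tallying the word that immediately precedes "example". The sketch below is a simplified illustration in Python, not the Sketch Engine query the project may have used. The function names and the sample sentence are hypothetical, and `per_million` shows the usual normalisation needed before comparing journals of different sizes.

```python
import re
from collections import Counter

def count_example_glosses(text):
    """Count which word immediately precedes 'example' in the text."""
    pattern = re.compile(r"\b(\w+)\s+example\b", re.IGNORECASE)
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

def per_million(count, corpus_size):
    """Normalise a raw count to a rate per million words."""
    return count * 1_000_000 / corpus_size

# Hypothetical sample text.
sample = ("For example, lichens accumulate metals. Another example is moss. "
          "Take, for example, rivers.")
gloss_counts = count_example_glosses(sample)
rate = per_million(5, 1000)
```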
rephrasing, explaining, elaborating, and exemplifying
[Chart; panel label: Multidisciplinary journals]
The same four journals that make relatively high use of shell nouns also make relatively high use of code glosses.
¢ All these toxic substances should have direct impact
reducing protoplast survival [29,14]. Driselase, for instance, has often been used after purification [29]. PS [monodisciplinary]
¢ Public activity generally occurred in the aftermath of
the Montreal Protocol, and it simply accelerated the movement which had already been put in place. In the UK, for instance, Friends of the Earth ... In the USA ... [GEC]
¢ We are talking about gradual and slow processes,
where both positive and negative effects may only be seen years and decades after changes in emissions. Take, for instance, the Norwegian situation ... [GEC]
In GEC, the examples are typically extended elaborations on propositions
research articles: topic-modelling
used to identify probable topics in sections of texts, enabling us to assess 'topic' relations to places within texts
something about what writers write about, and how, in different parts of texts
some shell nouns in ID articles, suggesting that such texts are more outward-facing. The use of metadiscourse is constrained by text length, however.
¢ Visit: idrd-bham.info/
¢ Or: @IDRD_bham