Look for monodisciplinary journals by finding those with a high - PowerPoint PPT Presentation


  1. CORPUS APPROACHES TO THE LANGUAGE OF INTERDISCIPLINARY RESEARCH ARTICLES. Aug 2013 – Aug 2015. ESRC-funded project: ES/K007300/1. Paul Thompson with Susan Hunston, Akira Murakami, and Dominik Vajn

  2. BACKGROUND
     - A substantial amount of work has been carried out on academic discourse in corpus linguistics (e.g., Biber; Hyland).
     - The research has typically drawn clear boundaries between disciplines (e.g., history vs biology) or between levels of research (e.g., pure vs applied).
     - In recent times there has been an expansion in ‘interdisciplinary research’.
     - Little work has been done, nevertheless, on the linguistic nature of interdisciplinary research as opposed to general research discourse and disciplinary research discourse.

  3. MAIN AIM
     - To achieve a fuller understanding of the distinctive features of discourse practices in interdisciplinary research and of how they differ from discourse practices in conventional disciplines.
     - Global Environmental Change: a successful interdisciplinary journal published by Elsevier.

  4. DATA
     - Full holdings of a successful IDR journal, Global Environmental Change, 1990-2010: 675 articles
     - Holdings of 5 IDR journals (interdisciplinary) and 5 specialist journals (monodisciplinary), 2001-2010
     - Surveys, interviews with editors, board members, authors
     - Data-driven approach: collect data and see what comes out of it

  5. Journals with connections to journals which are themselves well connected to one another are said to have a high clustering coefficient. We look for monodisciplinary journals by finding those with a high clustering coefficient; multidisciplinary journals are those with a low clustering coefficient.
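
A minimal sketch of the idea on slide 5, assuming an undirected, unweighted journal network and the standard local clustering coefficient (the share of a journal's neighbour pairs that are themselves linked). The slides do not say how the network was built or which variant of the coefficient was used, so the function name `local_clustering` and the toy journal names are hypothetical.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Local clustering coefficient of `node` in an undirected graph.

    `graph` maps each node to the set of its neighbours.
    """
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair can be linked
    # Count the neighbour pairs that are themselves connected.
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network: journal names and links are invented.
journals = {
    "MonoA": {"MonoB", "MonoC", "MonoD"},
    "MonoB": {"MonoA", "MonoC", "MonoD"},
    "MonoC": {"MonoA", "MonoB", "MonoD"},
    "MonoD": {"MonoA", "MonoB", "MonoC", "Hub"},
    "Hub": {"MonoD", "X", "Y"},
    "X": {"Hub"},
    "Y": {"Hub"},
}

coefficients = {name: local_clustering(journals, name) for name in journals}
```

In this toy network "MonoA" sits in a tightly knit group and gets a high coefficient, while "Hub" links journals that are not linked to each other and gets a low one.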

  6. R & Sketch Engine

  7. PAPER LABELLING
     - Categories:
       1. Empirical
       2. Policy discussion
       3. Research agenda / Research framework
       4. Other
     - Disagreements were resolved by negotiation.

  8. AGREED LABELLING [Chart: agreed labels per year, 1991-2010; series: Empirical, Policy, Agenda/Framework, Other; vertical axis 0 to 45.]

  9. INTRODUCTION
     - Corpus data can be explored ‘top-down’ or ‘bottom-up’.
     - Texts can be grouped by text-external or by text-internal criteria.
     - We have taken different bottom-up approaches to the data:
       - Multidimensional Analysis [6 new dimensions]
       - Keyword analysis
       - Topic modelling
     - The following slides present topic modelling and the results of our analyses.

  10. CORPUS
      - All the full research papers in the 11 journals
      - Main text only
      - 11,703 papers
      - 56 million words

  11. FEATURES OF TOPIC MODELS
      - Probabilistic topic modelling is a machine learning technique that automatically identifies “topics” in a given corpus (Blei, 2012).
      - Latent Dirichlet allocation (LDA) identifies:
        - the keywords in each topic
        - the distribution of topics in each document (a document consists of multiple topics)
      - Methodologically, it is a bag-of-words approach, so it works on single words.
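
To make "a document consists of multiple topics" concrete, here is a small sketch of the mixture view that underlies LDA: the probability of a word in a document is the sum, over topics, of the topic's share in the document times the word's probability in that topic. The topic names, words, and numbers below are invented for illustration and are not taken from the project's model.

```python
def word_probability(word, doc_topics, topic_words):
    """p(word | document) as a mixture over topics.

    `doc_topics` maps topic -> share of the document;
    `topic_words` maps topic -> {word: probability within the topic}.
    """
    return sum(
        share * topic_words[topic].get(word, 0.0)
        for topic, share in doc_topics.items()
    )


# Invented toy model with two topics and a three-word vocabulary.
topic_words = {
    "lab": {"dna": 0.5, "sampl": 0.3, "water": 0.2},
    "field": {"dna": 0.0, "sampl": 0.4, "water": 0.6},
}
doc_topics = {"lab": 0.25, "field": 0.75}

probabilities = {
    w: word_probability(w, doc_topics, topic_words)
    for w in ("dna", "sampl", "water")
}
```

Word order plays no part in the calculation, which is what the bag-of-words assumption means in practice.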

  12. DIVISION OF PAPERS
      - A research paper:
        - is longer than a typical document targeted in topic models
        - can contain multiple topics
      - It is therefore better to divide papers into multiple parts.
      - This also allows the investigation of topic transition within papers.
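
The division step could be sketched as follows. The slides do not state the exact chunking rule, so this assumes that consecutive paragraphs are grouped until a minimum word count is reached, and that each chunk is tagged with the share of the paper's words that precede it. The function name `split_paper` and the `min_words` parameter are hypothetical.

```python
def split_paper(paragraphs, min_words=300):
    """Group consecutive paragraphs into chunks of at least `min_words` words.

    Returns (position, text) pairs, where position is the share of the
    paper's words that precede the chunk (0.0 = start of the paper).
    """
    total = max(1, sum(len(p.split()) for p in paragraphs))
    chunks, current, current_len, seen = [], [], 0, 0
    for paragraph in paragraphs:
        current.append(paragraph)
        current_len += len(paragraph.split())
        if current_len >= min_words:
            chunks.append((seen / total, " ".join(current)))
            seen += current_len
            current, current_len = [], 0
    if current:  # leftover paragraphs form a final, shorter chunk
        chunks.append((seen / total, " ".join(current)))
    return chunks


# Toy paper of four very short "paragraphs", chunked at five words.
toy_chunks = split_paper(["a b c d", "e f", "g h i j", "k l"], min_words=5)
```

Keeping the position with each chunk is what later makes it possible to ask which topics are prominent near the start or the end of a paper.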

  13. DETAILS
      - Punctuation, numbers, and the standard list of stop words were removed.
      - All the words were stemmed using the Porter stemmer (e.g., require → requir, analysis → analysi).
      - Only terms that appear in at least 0.1% of all the documents were targeted.
      - Each text was tagged with information on where in the paper the paragraph(s) appeared (e.g., 70% from the beginning of the paper).
      - This created 140,816 ‘texts’ with an average length of 248 words per text (SD = 57.6) after removal of the stop words.
      - The models were built with the topicmodels package in R.
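
A rough sketch of the cleaning and term-filtering steps, written in plain Python rather than the R workflow the project used. The stop list here is a tiny invented stand-in for "the standard list", and stemming is left out to keep the example dependency-free (a Porter stemmer is available, for example, as `PorterStemmer` in NLTK). The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the project used a standard, much longer one.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "was", "were", "is"}


def tokenize(text):
    """Lowercase, keep alphabetic runs only (this drops punctuation and
    numbers), then remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def filter_rare_terms(docs, min_share=0.001):
    """Keep only terms that occur in at least `min_share` of the documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    threshold = min_share * len(docs)
    return [[t for t in doc if doc_freq[t] >= threshold] for doc in docs]


texts = [
    "The samples were dried in 2010.",
    "Samples of water, and the buffer.",
    "Water was filtered.",
]
docs = [tokenize(t) for t in texts]
# With three toy documents, 0.5 means "appears in at least two of them".
kept = filter_rare_terms(docs, min_share=0.5)
```

The default `min_share=0.001` corresponds to the 0.1% threshold on the slide; the toy call uses a larger value only because the example has three documents.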

  14. NUMBER OF TOPICS
      - There is no agreed way to automatically determine the number of topics.
      - We built topic models with 10, 20, 30, ..., 90, 100 topics.
      - Each topic in the model with 100 topics looked interpretable, so we used 100 topics.

  15. BY-TEXT TOPIC DISTRIBUTION

  16. THINGS WE CAN DO
      - Identify prominent topics at different positions of a paper
      - Compare prominent topics across papers or across journals
      - Cluster papers according to topic distribution
      - Inter alia
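
The first two items can be sketched as simple averaging of the by-text topic proportions, grouped either by paper or by position in the paper. This assumes each text carries a paper identifier, a relative position between 0 and 1, and a list of topic proportions; the field names, the ten-bin split, and the numbers are illustrative assumptions, not the project's actual data layout.

```python
def mean_topics(rows, key):
    """Average topic proportions over all texts that share the same group.

    `rows` is a list of dicts with 'paper', 'position' (0.0 to 1.0) and
    'topics' (a list of proportions); `key` maps a row to its group label.
    """
    sums, counts = {}, {}
    for row in rows:
        group = key(row)
        if group not in sums:
            sums[group] = [0.0] * len(row["topics"])
            counts[group] = 0
        sums[group] = [s + t for s, t in zip(sums[group], row["topics"])]
        counts[group] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}


def decile(row):
    """Position bin 0 to 9; a position of exactly 1.0 falls in the last bin."""
    return min(int(row["position"] * 10), 9)


# Invented toy data: three texts from two papers, two topics.
rows = [
    {"paper": "p1", "position": 0.0, "topics": [0.75, 0.25]},
    {"paper": "p1", "position": 0.95, "topics": [0.25, 0.75]},
    {"paper": "p2", "position": 0.05, "topics": [0.25, 0.75]},
]

by_paper = mean_topics(rows, key=lambda r: r["paper"])
by_position = mean_topics(rows, key=decile)
```

The by-paper averages could then feed a clustering step, and the by-position averages show which topics rise or fall over the course of a paper.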

  17. BY-PAPER TOPIC DISTRIBUTION

  18. BY-PAPER TOPIC DISTRIBUTION: Topic 7

  19. countri develop global world nation econom industri intern million year popul growth environment import major trade decad centuri economi

  20. WITHIN-PAPER TOPIC DISTRIBUTION. Each line is a different topic; there are a hundred in total.

  21. WITHIN-PAPER TOPIC DISTRIBUTION

  22. KEYWORDS OF THE FOUR TOPICS
      - Topic 3: use, dna, primer, sequenc, clone, pcr, fragment, rna, cdna, probe, perform, hybrid, isol, amplifi, min, total, follow, gene, cycl, product
      - Topic 40: min, buffer, use, extract, incub, contain, solut, centrifug, assay, protein, describ, determin, wash, mixtur, supernat, gel, prepar, homogen, acid, follow
      - Topic 55: sampl, use, extract, standard, determin, analysi, column, analyz, method, analys, digest, analyt, concentr, abc, dri, acid, filter, mass, min, detect
      - Topic 97: collect, use, place, sampl, water, chamber, day, dri, remov, week, solut, filter, contain, experi, store, replic, plastic, diamet, pot, tube

  23. TOPICS PROMINENT TOWARDS THE END OF PAPERS

  24. TOPICS PROMINENT TOWARDS THE END OF PAPERS
      - Topic 50: will, can, may, need, requir, must, target, howev, limit, possibl, current, futur, like, make, becom, potenti, necessari, provid, exist, exampl
      - Topic 53: may, suggest, like, might, howev, possibl, evid, support, appear, associ, although, result, also, seem, find, hypothesi, strong, fact, explain, occur

  25. TOPICS PROMINENT AT THE BEGINNING OF PAPERS

  26. TOPICS PROMINENT AT THE BEGINNING OF PAPERS
      - Topic 1: pollut, deposit, sourc, atmospher, air, area, lichen, element, moss, main, monitor, load, industri, isotop, contribut, anthropogen, particul, dust, concentr, major
      - Topic 7: countri, develop, global, world, nation, econom, industri, intern, million, year, popul, growth, environment, import, major, trade, mani, decad, centuri, economi
      - Topic 32: etaal [=et al], sediment, lake, river, contamin, concentr, water, mercuri, effluent, organ, studi, estuari, environ, wastewat, bay, mehg, figa, pollut, aquat, sourc
      - Topic 72: studi, rice, mani, china, recent, howev, wide, includ, also, sever, import, high, common, various, major, larg, well, report, varieti, although

  27. TOPICS PROMINENT AT THE BEGINNING AND THE END OF PAPERS

  28. TOPICS PROMINENT AT THE BEGINNING AND THE END OF PAPERS
      - Topic 31: rural, social, communiti, local, cultur, econom, polit, place, within, new, way, discours, relat, peopl, argu, particular, ident, societi, natur, construct
      - Topic 33: method, use, approach, techniqu, can, applic, appli, develop, requir, propos, base, provid, limit, altern, advantag, allow, howev, combin, work, need
      - Topic 58: research, paper, section, discuss, focus, approach, work, develop, analysi, issu, understand, framework, scienc, literatur, process, address, knowledg, review, studi, provid
      - Topic 81: chang, climat, scenario, adapt, impact, vulner, futur, global, assess, capac, project, polici, uncertainti, respons, will, current, region, rise, warm, ipcc
      - Topic 87: process, system, biolog, agent, physic, theori, inform, organ, natur, can, concept, dynam, intern, quantum, principl, mechan, one, environ, space, idea

  29. area urban citi popul local hous household rural resid locat park counti migrat district residenti access build town centr villag

  30. farm agricultur farmer product food organ produc market practic household incom local livestock labour manag consum convent econom coffe activ. Annotation: Topic revolves around agriculture from a ‘human’ point of view.

  31. Topics 100 and 11: veget graze crop grassland yield grass wheat harvest pastur maiz cover system manag year intens grain studi product cattl weed product rotat cultiv year fertil area manag biomass fallow stock practic anim field nativ tillag forag corn meadow cereal livestock

  32. Water. Topic 32: etaal sediment lake river contamin concentr water mercuri effluent organ studi estuari environ wastewat bay mehg figa pollut aquat sourc. Annotation: Predominantly Environmental Pollution.

  33. Topics 19 and 84: water rainfal irrig river reservoir runoff potenti event surfac stream suppli catchment storag flood capac basin condit watersh use hydrolog avail flow releas area limit discharg qualiti wetland system rain also drainag demand eros evapor slope inflow storm balanc intens. Annotation: Different senses of water: resource and part of ecological system.

  34. CONCLUSION
      - Topic models are useful in exploring the content of the papers in the corpus.
      - “Topics” identified in topic models are generally interpretable based on domain knowledge.
      - Topic models help us identify keywords at different positions in papers.
