SLIDE 1

Adams 1

Introduction

This project explores the use of computational technologies to improve research and archiving methodologies, specifically for Early Modern ballads. I embarked on this project based on the possibility of unifying my disparate interests in literature, programming, linguistics, and geography, and all of these components are represented. This research is primarily exploratory, and hopefully will be useful for other researchers hoping to find methods to explore specific questions about ballads. All software used for this project is open-source, and therefore downloadable and usable for free. Mentored by Dr. Kris McAbee. Additional help and resources from Dr. Jeremy Ecke and Dr. Jess Porter. 


SLIDE 2

Ballads

Broadside ballads are early print artifacts from England around the 17th century. They were cheaply produced and intended for a popular audience, so they are not frequently studied. Their cultural and literary importance, however, may be underestimated, as they were often composed for public events (notably, executions and royal news) of at least some historical note, and their uniquely “un-literary” or popular and sensational qualities can yield much information about people and language during the time of their composition. Ballads also make for interesting subjects of computational research because of their digital availability from the English Broadside Ballad Archive (EBBA) as well as their odd characteristics and conventions. For example, they often do not attribute authors, but many identify a printer/publisher; computational methods for verifying authorship claims are well-established for other types of texts, and one can easily imagine using the data available in ballads in the same way. (I will discuss this methodology and its possibilities in further detail below.)


SLIDE 3

Plain Text and XML

EBBA stores scanned ballads (as seen on the previous slide), as well as plain text and XML transcriptions, shown here. The plain text is useful for legibility and simple search functions, but the XML can store extra data about the text, like keywords for enhanced search-ability and formatting information like italics and columns. Good XML can help the researcher, looking for very specific features, as well as the casual user, looking for a smooth search/browse experience. EBBA specifically uses the standards of the Text Encoding Initiative (TEI) to mark up ballads. TEI is defined by its maintaining organization as “an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.” Hence, useful metadata can be stored in headers (e.g. the inclusion of “keywords” and “notes” above) and text information can be marked within the text (e.g. the “title” designation and formatting cues like “col[umn]” and “l[ine]”) in a standardized but flexible markup framework that has been designed for the archiving of literary texts.
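To make the header/text distinction concrete, here is a minimal sketch in Python (rather than R) of reading metadata and verse lines out of a TEI-style file. The sample record is invented and heavily simplified; real EBBA files are richer and normally declare the TEI namespace, which I omit here as an assumption for brevity. The function name `read_ballad` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-style record (not an actual EBBA file).
SAMPLE_TEI = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Ballad Title</title></titleStmt>
    </fileDesc>
    <profileDesc>
      <textClass>
        <keywords>
          <term>birds</term>
          <term>fire</term>
        </keywords>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <lg>
        <l>First line of the ballad,</l>
        <l>second line of the ballad.</l>
      </lg>
    </body>
  </text>
</TEI>"""


def read_ballad(xml_string):
    """Return the title, keywords, and verse lines of a TEI-style string."""
    root = ET.fromstring(xml_string)
    # Metadata lives in the header ...
    title = root.findtext(".//titleStmt/title", default="")
    keywords = [term.text.strip() for term in root.iter("term") if term.text]
    # ... while each verse line is marked within the text as <l>.
    lines = ["".join(line.itertext()).strip() for line in root.iter("l")]
    return {"title": title, "keywords": keywords, "lines": lines}


record = read_ballad(SAMPLE_TEI)
print(record["title"])
print(record["keywords"])
```

The same record serves both audiences: the keywords support search and browsing, while the line-level markup supports more specific research queries.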


SLIDE 4

The Part of Speech Tagger

The first technology tool I explored was the Part of Speech Tagger from Stanford’s Natural Language Processing Group (the POS Tagger). It assigns grammatical categories based on syntactic context, operating similarly to the way that a human reader can assign a part of speech. In the example above, you can see the tagged text from part of a ballad. It is easy to identify or search for proper nouns (identified by the affixed _NNP label and bolded here for emphasis); using this software saves a lot of time for the researcher and allows a much broader corpus to be analyzed. You can also see in this example that the proper noun identifications are not completely correct. The human reader can discern that “Birds” and “Stares” (stars) should not be marked as proper nouns. The software is currently “trained” to identify parts of speech using the patterns of modern, journalistic English, not Early Modern poetic usages. Fortunately, it can be “retrained” on a new corpus, taking input from a human reader to learn new patterns of probability for its part of speech assignments.
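Searching tagged output for proper nouns can be sketched in a few lines of Python. This assumes the tagger's output has been saved as plain text in the word_TAG form described above; the sample string is invented for illustration, and the function names are hypothetical.

```python
def parse_tagged(tagged_text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # Split on the LAST underscore, in case the word itself contains one.
        word, sep, tag = token.rpartition("_")
        if sep:  # skip anything that carries no tag
            pairs.append((word, tag))
    return pairs


def proper_nouns(pairs):
    """Return the words tagged as singular or plural proper nouns."""
    return [word for word, tag in pairs if tag in ("NNP", "NNPS")]


# Invented sample, imitating the mis-tagging discussed above.
SAMPLE = ("The_DT Birds_NNP flew_VBD over_IN Corke_NNP ,_, "
          "and_CC the_DT Stares_NNP fell_VBD ._.")

pairs = parse_tagged(SAMPLE)
print(proper_nouns(pairs))
```

A list like this still needs a human reader's check: here “Birds” and “Stares” would be returned alongside the genuine place name.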

SLIDE 5

Dispersion Plot of a Key Word

The next couple of slides show an application of the POS Tagger to EBBA research and introduce another useful tool, R Statistics. R is a “language and environment for statistical computing and graphics”; it allows the user to explore and manipulate data (in this case, text and information about text) and includes a variety of tools for displaying findings graphically. Using just R, you can examine, chart, and compare the dispersion of key words in a piece of text. Here, I compare the appearance of the word “bird” in two ballads that give accounts of the same event (a strange day in Cork, during which the city is overrun by flocks of birds and then burnt). One clearly focuses on the birds, and the other mentions the birds early in the narrative and moves on. While this is interesting narrative information, the usefulness of this method is limited, since you can only examine one word or phrase at a time. Adding another step before processing text in R enriches the research possibilities, as we will see.
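The idea behind a dispersion plot can be sketched without any plotting library. The example below is in Python rather than R, and the two short texts are invented stand-ins for the ballads, so it only illustrates the method: record the token offsets at which the key word occurs, then mark them along a strip representing the whole text.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dispersion(tokens, keyword):
    """Token offsets at which the keyword (or its plain plural) occurs."""
    targets = {keyword, keyword + "s"}
    return [i for i, tok in enumerate(tokens) if tok in targets]


def strip_plot(positions, n_tokens, width=40):
    """One-line text plot: '|' marks where the word occurs, '.' elsewhere."""
    cells = ["."] * width
    for pos in positions:
        cells[pos * width // n_tokens] = "|"
    return "".join(cells)


# Invented stand-ins for the two ballads compared on this slide.
ballad_a = "the birds came and the birds fought and all the birds fell dead in the street"
ballad_b = "strange birds were seen then fire took the town and many houses burnt down"

for name, text in (("A", ballad_a), ("B", ballad_b)):
    tokens = tokenize(text)
    hits = dispersion(tokens, "bird")
    print(name, strip_plot(hits, len(tokens)))
```

A text that keeps returning to the word shows marks along the whole strip; a text that mentions it once and moves on shows a single early mark.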


SLIDE 6

Dispersion Plot of a Word Type

When we run the POS Tagger first and then apply the same comparative method, we can examine not just single key words but parts of speech. Here, I compare noun usage in the same two ballads as in the previous analysis. This does not yield results that are as immediately interesting, but by adding other factors for comparison (other parts of speech, sentence length, etc.), we can use these tools to examine and compare the writing style of ballads computationally. This type of processing can be used for authorship research, a particularly interesting possibility for ballads, which are mostly un-“signed” but often come from the same handful of printing houses in London and are therefore likely to have common writers or a “house style.”
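As a rough sketch of working with word types instead of single words (in Python, with an invented tagged sample and hypothetical function names), the tagged tokens can be reduced to the positions of one part of speech and to a profile of relative tag frequencies. The profile is only one crude ingredient of a stylistic comparison, not an authorship method in itself.

```python
from collections import Counter


def parse_tagged(tagged_text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("_")
        if sep:
            pairs.append((word, tag))
    return pairs


def tag_positions(pairs, prefix):
    """Offsets of tokens whose tag starts with prefix ('NN' covers all nouns)."""
    return [i for i, (_, tag) in enumerate(pairs) if tag.startswith(prefix)]


def tag_profile(pairs):
    """Relative frequency of each tag in the text."""
    counts = Counter(tag for _, tag in pairs)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Invented sample in the tagger's word_TAG output form.
SAMPLE = "The_DT birds_NNS fought_VBD in_IN the_DT air_NN"

pairs = parse_tagged(SAMPLE)
noun_positions = tag_positions(pairs, "NN")
profile = tag_profile(pairs)
print(noun_positions)
print(profile)
```

The noun positions can be charted exactly like key-word positions, and profiles from many ballads can be compared with one another.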


SLIDE 7

Topic Modeling

The next tool I explored was topic modeling, which is really a series of tools: the POS Tagger, R, and R extensions MALLET and WordCloud. This is a process that uses the co-occurrence of words in “chunks” of text to construct topics. Topics are visualized in “word clouds” and can be human-labelled. Topic labels can then be reapplied to ballad metadata, based on each ballad’s match to a given topic’s words, to improve the searchability of the archive.
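The last step, reapplying human-chosen labels to ballads, can be sketched as a simple word-overlap match. This Python example does not perform topic modeling itself; it assumes the topics and their words have already been produced and labelled, and the threshold `min_hits`, the ballad identifiers, and the sample texts are all hypothetical.

```python
def label_ballads(ballads, topics, min_hits=2):
    """Attach each topic label to the ballads containing at least
    min_hits distinct words from that topic."""
    labels = {}
    for ballad_id, text in ballads.items():
        words = set(text.lower().split())
        labels[ballad_id] = sorted(
            label
            for label, topic_words in topics.items()
            if len(words & topic_words) >= min_hits
        )
    return labels


# Human-chosen labels mapped to (a few of) each topic's words.
TOPICS = {
    "hunting": {"huntsman", "tally", "fox"},
    "birds": {"birds", "flocks", "squadrons", "legions"},
}

# Invented ballad snippets, without punctuation for simplicity.
BALLADS = {
    "ballad-1": "the huntsman cried tally ho and the fox ran",
    "ballad-2": "great flocks of birds darkened the sky",
}

labels = label_ballads(BALLADS, TOPICS)
print(labels)
```

The modeling software can score each ballad against each topic far more precisely than this overlap count, but the principle of writing a label back into the metadata is the same.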

SLIDE 8

Word Cloud: Hunting

This is an example of a word cloud generated by topic modeling. The topic contains words like “huntsman,” “tally,” “fox,” etc.; the words have been ranked by the topic modeling software, and the ranking is represented visually by size and proximity to the center. The words are all clearly related and easily identifiable as a topic or theme, but I would label this topic with a word that does not actually appear in the cloud: “hunting,” or something similar. This is a good example of an opportunity for metadata improvement in the archive, then, since users who might be searching for ballads about “hunting” would not find any based on the plain text of the ballads in the group processed here. Using the process I described before, then, would enhance usability of the archive by improving search computationally, and therefore quickly (with minimal human work).
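The search gap described here is easy to demonstrate. In this minimal Python sketch, the archive records, their field names, and the `search` function are all hypothetical; the point is only that a query for “hunting” fails against the plain text but succeeds once a topic label has been stored as metadata.

```python
def search(archive, query):
    """Return (ids matching in the plain text, ids matching in topic labels)."""
    q = query.lower()
    text_hits = [b["id"] for b in archive if q in b["text"].lower().split()]
    label_hits = [b["id"] for b in archive if q in b["topics"]]
    return text_hits, label_hits


# Invented records; "topics" holds labels added after topic modeling.
ARCHIVE = [
    {"id": "ballad-1",
     "text": "the huntsman cried tally ho and the fox ran",
     "topics": ["hunting"]},
    {"id": "ballad-2",
     "text": "great flocks of birds darkened the sky",
     "topics": ["birds"]},
]

text_hits, label_hits = search(ARCHIVE, "hunting")
print(text_hits)
print(label_hits)
```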

SLIDE 9

Word Cloud: Flocks of Birds

Sometimes, too, the topics themselves reveal interesting data for the researcher. This topic, about birds or flocks of birds, appeared in the group I processed (ballads already categorized as relating to “animals and nature”). You can see in the word cloud an undertone of more than just birds, though: words like “squadrons,” “troupes,” “legions,” and “swords” suggest that the ballads describe or conceptualize birds in military terms. This would be an interesting thread to follow with a close reading of some of the ballads that match this topic.


SLIDE 10

Named Entity Recognizer

The final tool I used is the Named Entity Recognizer (NER, also from Stanford NLP). It works much like the POS Tagger, using syntax clues to label names of locations, persons, and organizations. Also like the POS Tagger, it can be retrained. This tool could have interesting applications to ballad research as well, providing the researcher with an easy method of finding shared references to people or places, by searching across many ballads at once and eliminating false matches. As a simple example, the tagged text on this slide contains the location “Corke,” which is also a word denoting a material, but would be usefully separated as a location from other matches to “corke” by tagging the corpus with NER and using the more precise search term “<LOCATION>Corke</LOCATION>”.
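The “Corke” example can be sketched in Python as follows, assuming the NER output has been saved with inline tags of the form shown above. The sample sentence is invented, and the function name `entities` is hypothetical.

```python
import re

# Matches inline entity tags such as <LOCATION>Corke</LOCATION>.
ENTITY = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>")


def entities(tagged_text, kind=None):
    """Return (kind, name) pairs, optionally restricted to one entity kind."""
    return [
        (m.group(1), m.group(2))
        for m in ENTITY.finditer(tagged_text)
        if kind is None or m.group(1) == kind
    ]


# Invented sentence containing both senses of the word.
SAMPLE = ("The city of <LOCATION>Corke</LOCATION> was burnt, "
          "though a corke floats on water.")

plain_matches = re.findall(r"corke", SAMPLE, flags=re.IGNORECASE)
location_matches = entities(SAMPLE, "LOCATION")
print(len(plain_matches))
print(location_matches)
```

The plain search returns both occurrences, while the tagged search returns only the place name.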

SLIDE 11

Recap: Immediate Applications

To recap some of the applications of these tools to ballad research and archiving: topic modeling and NER could be particularly useful for search and metadata improvement; the POS Tagger can be used to examine and quantify language or literary features across many ballads at once, which can generate data applicable to authorship research; the NER can be used to discover shared references to characters/people and places, suggesting intertextual connections among ballads.

SLIDE 12

Recap: Future Directions for Exploration

Some specific possibilities are suggested by these tools and their application to EBBA:

Researchers might retrain the POS Tagger and the NER on the Early Modern language of the ballads to improve the accuracy of data.

One could explore connections that emerge from the NER data, including mapping location references found across ballads and examining the language surrounding those references to inform the mapping (e.g. Are references to London positive or negative? What other features do they share?) or use something like social network visualization to similarly “map” shared characters/names.

The archive itself could enrich the ballads’ XML metadata, using the NER to mark locations and names and topic modeling to improve theme/topic information available to search.

Research could focus on exploring or grouping linguistic “signatures” based on POS Tagger data and correlating these signatures to available ballad origin information (printers’ names and locations, etc.).
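As a first step toward the mapping and network ideas above, the NER results for each ballad can be reduced to a set of names, and ballads can then be linked wherever those sets overlap. This Python sketch uses invented ballad identifiers and entity sets; the resulting pairs are the kind of edge list that a mapping or network visualization tool could take as input.

```python
from collections import Counter
from itertools import combinations


def shared_entities(ballad_entities):
    """Map each pair of ballads to the entities they have in common."""
    links = {}
    for a, b in combinations(sorted(ballad_entities), 2):
        common = ballad_entities[a] & ballad_entities[b]
        if common:
            links[(a, b)] = sorted(common)
    return links


def reference_counts(ballad_entities):
    """Count in how many ballads each entity is mentioned."""
    return Counter(name for names in ballad_entities.values() for name in names)


# Invented data: the locations found in each ballad.
LOCATIONS = {
    "ballad-1": {"Corke", "London"},
    "ballad-2": {"Corke"},
    "ballad-3": {"London", "York"},
}

links = shared_entities(LOCATIONS)
counts = reference_counts(LOCATIONS)
print(links)
print(counts.most_common(2))
```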


SLIDE 13

Resources: Works Cited or Consulted

Gould, Peter, and Rodney White. Mental Maps. New York: Penguin, 1974. Print.
Guldi, Jo. “What is the Spatial Turn?” and assorted links (e.g., “The Spatial Turn in Literature”). Spatial Humanities. Scholar’s Lab, University of Virginia Library. <http://spatial.scholarslab.org/spatial-turn/what-is-the-spatial-turn/>. Web.
Jockers, Matthew. Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press, 2013. Print.
Jockers, Matthew. Text Analysis with R for Students of Literature. New York: Springer, 2014. E-book.
Liu, Alan. “Where is Cultural Criticism in the Digital Humanities?” Alan Liu’s Blog. <http://liu.english.ucsb.edu/where-is-cultural-criticism-in-the-digital-humanities/>. Web.
Moretti, Franco. Distant Reading. London: Verso, 2013. Print.
Moretti, Franco. Graphs, Maps, Trees: Abstract Models for Literary History. London: Verso, 2005. Print.
Posner, Miriam. “Some things to think about before you exhort everyone to code.” Miriam Posner’s Blog. <http://miriamposner.com/blog/some-things-to-think-about-before-you-exhort-everyone-to-code/>. Web.
Ramsay, Stephen. “On Building.” Stephen Ramsay’s Blog and Papers. <http://stephenramsay.us/text/2011/01/11/on-building/>. Web.
“The Academic Call to Code and the Networked Self.” Cacophony: Communication Across the Curriculum. Bernard L. Schwartz Communication Institute. <http://cac.ophony.org/2012/02/06/the-academic-call-to-code-and-the-networked-self/>. Web.
“Ballad Culture and Printing.” The English Broadside Ballad Archive. University of California at Santa Barbara. <http://ebba.english.ucsb.edu/page/ballad-culture>. Web. (Essay series by multiple contributors; several of the individual essays were consulted throughout the project.)
“Pepys Categories.” The English Broadside Ballad Archive. University of California at Santa Barbara. <http://ebba.english.ucsb.edu/page/Pepys-categories>. Web. (Essay series by multiple contributors; several of the individual essays were consulted throughout the project.)

SLIDE 14

“TEI-XML.” The English Broadside Ballad Archive. University of California at Santa Barbara. <http://ebba.english.ucsb.edu/page/tei-xml>. Web.

Resources: Software Tools

The R Foundation. R Statistics. <https://www.r-project.org/about.html>.
Stanford Natural Language Processing Group. Named Entity Recognizer. <http://nlp.stanford.edu/software/CRF-NER.shtml>.
Stanford Natural Language Processing Group. Part of Speech Tagger. <http://nlp.stanford.edu/software/tagger.shtml>.
Text Encoding Initiative Consortium. TEI-XML. “About” <http://www.tei-c.org/About/> and “Stylesheets” <http://www.tei-c.org/Tools/Stylesheets/>.