introduction



  1. Adams. Introduction. This project explores the use of computational technologies to improve research and archiving methodologies, specifically for Early Modern ballads. I embarked on it hoping to unify my disparate interests in literature, programming, linguistics, and geography, and all of these components are represented. The research is primarily exploratory, and I hope it will be useful to other researchers looking for methods to explore specific questions about ballads. All software used for this project is open-source, and therefore free to download and use. Mentored by Dr. Kris McAbee. Additional help and resources from Dr. Jeremy Ecke and Dr. Jess Porter.


  2. Ballads. Broadside ballads are early print artifacts from England, dating from around the 17th century. They were cheaply produced and intended for a popular audience, so they are not frequently studied. Their cultural and literary importance, however, may be underestimated: they were often composed for public events of at least some historical note (notably, executions and royal news), and their uniquely “un-literary,” popular, and sensational qualities can yield much information about people and language at the time of their composition. Ballads also make interesting subjects for computational research because of their digital availability from the English Broadside Ballad Archive (EBBA) and because of their odd characteristics and conventions. For example, they often do not attribute authors, but many identify a printer/publisher; computational methods for verifying authorship claims are well established for other types of texts, and one can easily imagine using the data available in ballads in the same way. (I will discuss this methodology and its possibilities in further detail below.)


  3. Plain Text and XML. EBBA stores scanned ballads (as seen on the previous slide), as well as plain text and XML transcriptions, shown here. The plain text is useful for legibility and simple search functions, but the XML can store extra data about the text, such as keywords for enhanced searchability and formatting information like italics and columns. Good XML can help both the researcher looking for very specific features and the casual user looking for a smooth search/browse experience. EBBA specifically uses the standards of the Text Encoding Initiative (TEI) to mark up ballads. TEI is defined by its maintaining organization as “an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.” Hence, useful metadata can be stored in headers (e.g. the inclusion of “keywords” and “notes” above) and text information can be marked within the text (e.g. the “title” designation and formatting cues like “col[umn]” and “l[ine]”) in a standardized but flexible markup framework that has been designed for the archiving of literary texts.
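To make the header/body distinction concrete, here is a minimal Python sketch that reads keywords, a title, and column/line structure out of a TEI-style fragment. The fragment is invented for illustration and is not copied from EBBA; the element names (`keywords`, `term`, `note`, `title`, `col`, `l`) follow the cues mentioned above, and the nesting is an assumption. Real TEI files normally declare an XML namespace, which this sketch leaves out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style record (not an actual EBBA transcription).
SAMPLE_TEI = """<TEI>
  <teiHeader>
    <keywords><term>birds</term><term>fire</term></keywords>
    <note>Illustrative record only.</note>
  </teiHeader>
  <text>
    <title>A Sample Ballad Title</title>
    <col n="1">
      <l>First line of the first column</l>
      <l>Second line of the first column</l>
    </col>
    <col n="2">
      <l>First line of the second column</l>
    </col>
  </text>
</TEI>"""


def summarize(xml_text):
    """Return the title, header keywords, and lines grouped by column."""
    root = ET.fromstring(xml_text)
    keywords = [term.text for term in root.iter("term")]
    title = root.findtext("text/title")
    columns = {
        col.get("n"): [line.text for line in col.findall("l")]
        for col in root.iter("col")
    }
    return {"title": title, "keywords": keywords, "columns": columns}


if __name__ == "__main__":
    print(summarize(SAMPLE_TEI))
```

The keywords come from the header and the lines from the body, which is the separation that lets an archive search metadata without touching the transcription itself.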


  4. The Part of Speech Tagger. The first tool I explored was the Part of Speech Tagger from Stanford’s Natural Language Processing Group (the POS Tagger). It assigns grammatical categories based on syntactic context, much as a human reader assigns a part of speech. In the example above, you can see the tagged text from part of a ballad. It is easy to identify or search for proper nouns (identified by the affixed _NNP label and bolded here for emphasis); using this software saves the researcher a lot of time and allows a much broader corpus to be analyzed. You can also see in this example that the proper noun identifications are not completely correct. The human reader can discern that “Birds” and “Stares” (stars) should not be marked as proper nouns. The software is currently “trained” to identify parts of speech using the patterns of modern, journalistic English, not Early Modern poetic usages. Fortunately, it can be “retrained” on a new corpus, taking input from a human reader to learn new patterns of probability for its part of speech assignments.
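As a rough illustration of how tagged output can be searched, the Python sketch below splits `word_TAG` tokens and counts those tagged as proper nouns. The sample line is invented to mirror the mis-tagging described above, not taken from the slide, and the function names are hypothetical helpers.

```python
from collections import Counter


def parse_tagged(tagged_text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # rpartition splits on the last underscore, so words that
        # themselves contain an underscore are still handled.
        word, sep, tag = token.rpartition("_")
        if sep:
            pairs.append((word, tag))
    return pairs


def proper_nouns(tagged_text):
    """Count words tagged NNP (singular) or NNPS (plural) proper noun."""
    return Counter(
        word for word, tag in parse_tagged(tagged_text)
        if tag in ("NNP", "NNPS")
    )


# Invented sample: "Birds" and "Stares" are deliberately mis-tagged.
SAMPLE = ("The_DT Birds_NNP flew_VBD over_IN Corke_NNP "
          "and_CC the_DT Stares_NNP fell_VBD ._.")

if __name__ == "__main__":
    print(proper_nouns(SAMPLE))
```

A list like this is also a convenient starting point for spotting systematic errors (capitalized common nouns, for instance) before retraining.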

  5. Dispersion Plot of a Key Word. The next couple of slides show an application of the POS Tagger to EBBA research and introduce another useful tool, R, which is a “language and environment for statistical computing and graphics.” R allows the user to explore and manipulate data, in this case text and information about text, and includes a variety of tools for displaying findings graphically. Using just R, you can examine, chart, and compare the dispersion of key words in a piece of text. Here, I compare the appearance of the word “bird” in two ballads that give accounts of the same event (a strange day in Cork, during which the city is overrun by flocks of birds and then burnt). One clearly focuses on the birds, and the other mentions the birds early in the narrative and moves on. While this is interesting narrative information, the usefulness of this methodology is limited, since you can only examine one word/phrase at a time. Adding another step before processing text in R enriches the research possibilities, as we will see.
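The idea behind a dispersion plot can be sketched in a few lines. The project itself used R; the Python version below is only an illustration of the technique, with invented sample text and hypothetical function names. It records where in the text (as a fraction from 0 to 1) each token beginning with the key word occurs, then draws a crude one-line strip.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dispersion(text, keyword):
    """Relative positions (0 = start, approaching 1 = end) of tokens
    that begin with `keyword`, so 'bird' also matches 'birds'."""
    tokens = tokenize(text)
    return [i / len(tokens) for i, tok in enumerate(tokens)
            if tok.startswith(keyword)]


def strip_plot(positions, width=40):
    """Render positions as a one-line text strip, '|' marking a hit."""
    cells = ["-"] * width
    for p in positions:
        cells[min(int(p * width), width - 1)] = "|"
    return "".join(cells)


if __name__ == "__main__":
    # Invented stand-ins for two accounts of the same event.
    ballad_a = ("birds came at dawn and more birds came at noon "
                "and birds stayed till night")
    ballad_b = "birds came at dawn and then the city burnt to the ground"
    for name, text in (("A", ballad_a), ("B", ballad_b)):
        print(name, strip_plot(dispersion(text, "bird")))
```

Comparing the two strips side by side gives the same kind of reading as the slide: one text returns to the word throughout, the other uses it once and moves on.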


  6. Dispersion Plot of a Word Type. When we apply the same comparative method after first running the POS Tagger, we can examine not just single key words but parts of speech. Here, I compare noun usage in the same two ballads as in the previous analysis. This does not yield results that are as immediately interesting, but by adding other factors for comparison (other parts of speech, sentence length, etc.) we can use these tools to examine and compare the writing style of ballads computationally. This type of processing can be used for authorship research, a particularly interesting possibility for ballads, which are mostly un-“signed” but often come from the same handful of printing houses in London, and are therefore likely to share writers or a “house style.”
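One way to turn tagged text into comparable style measurements is sketched below in Python. It assumes `word_TAG` output with Penn Treebank-style tags (nouns beginning `NN`, verbs beginning `VB`, sentence ends tagged `.`); the three features chosen and the sample text are illustrative assumptions, not the measurements used in the project.

```python
def style_profile(tagged_text):
    """Compute a few simple style features from 'word_TAG' text."""
    tags = [tok.rpartition("_")[2] for tok in tagged_text.split()
            if "_" in tok]
    # Word tags start with a letter; punctuation tags do not.
    word_tags = [t for t in tags if t[:1].isalpha()]
    if not word_tags:
        return {"noun_share": 0.0, "verb_share": 0.0,
                "mean_sentence_length": 0.0}
    nouns = sum(t.startswith("NN") for t in word_tags)
    verbs = sum(t.startswith("VB") for t in word_tags)
    # Treat each '.' tag as one sentence end; avoid dividing by zero.
    sentences = max(1, tags.count("."))
    return {
        "noun_share": nouns / len(word_tags),
        "verb_share": verbs / len(word_tags),
        "mean_sentence_length": len(word_tags) / sentences,
    }


# Invented sample text.
SAMPLE = "The_DT birds_NNS came_VBD ._. The_DT city_NN burnt_VBD ._."

if __name__ == "__main__":
    print(style_profile(SAMPLE))
```

Profiles like this, computed per ballad, could then be compared across printing houses; which features actually discriminate between writers would have to be established empirically.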


  7. Topic Modeling. The next tool I explored was topic modeling, which is really a series of tools: the POS Tagger, R, and the R packages mallet (an interface to the MALLET toolkit) and wordcloud. The process uses the co-occurrence of words in “chunks” of text to construct topics. Topics are visualized in “word clouds” and can be labelled by a human reader. Topic labels can then be applied back to ballad metadata, based on how well each ballad matches a given topic’s words, to improve the searchability of the archive.
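The last step, applying human-chosen labels back to ballads, can be sketched as a simple word-overlap match. This Python example is a deliberate simplification: a topic model normally reports per-document topic proportions, which would be the better basis for labelling. The topic word lists reuse words mentioned in these slides, while the ballad texts, identifiers, and the `min_hits` threshold are invented for illustration.

```python
import re


def label_ballads(ballads, topics, min_hits=2):
    """Attach each topic label whose word list shares at least
    `min_hits` distinct words with the ballad text."""
    labels = {}
    for ballad_id, text in ballads.items():
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        labels[ballad_id] = sorted(
            label for label, words in topics.items()
            if len(tokens & set(words)) >= min_hits
        )
    return labels


# Human-assigned labels mapped to top-ranked topic words.
TOPICS = {
    "hunting": ["huntsman", "tally", "fox"],
    "birds (military imagery)": ["squadrons", "troupes", "legions", "swords"],
}

# Invented ballad snippets with hypothetical identifiers.
BALLADS = {
    "ballad-1": "The huntsman cried tally ho and the fox ran",
    "ballad-2": "Squadrons of birds came in legions over the town",
    "ballad-3": "A new song of love",
}

if __name__ == "__main__":
    for ballad_id, found in label_ballads(BALLADS, TOPICS).items():
        print(ballad_id, found)
```

Note that "ballad-1" receives the label "hunting" even though that word never appears in its text.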

  8. Word Cloud: Hunting. This is an example of a word cloud generated by topic modeling. The topic contains words like “huntsman,” “tally,” “fox,” etc.; the words have been ranked by the topic modeling software, and the ranking is represented visually by size and proximity to the center. The words are all clearly related and easily identifiable as a topic or theme, but I would label this topic with a word that does not actually appear in the cloud: “hunting,” or something similar. This is a good example of an opportunity for metadata improvement in the archive, since users searching for ballads about “hunting” would not find any based on the plain text of the ballads in the group processed here. Applying the labelling process described before would therefore enhance the usability of the archive by improving search computationally, and so quickly (with minimal human work).

  9. Word Cloud: Flocks of Birds. Sometimes, too, the topics themselves reveal interesting data for the researcher. This topic, about birds or flocks of birds, appeared in the group I processed (ballads already categorized as relating to “animals and nature”). The word cloud carries an undertone of more than just birds, though: words like “squadrons,” “troupes,” “legions,” and “swords” suggest that the ballads describe or conceptualize birds in military terms. This would be an interesting thread to follow with a close reading of some of the ballads that match this topic.


  10. Named Entity Recognizer. The final tool I used is the Named Entity Recognizer (NER, also from Stanford NLP). It works much like the POS Tagger, using syntactic clues to label names of locations, persons, and organizations. Also like the POS Tagger, it can be retrained. This tool could have interesting applications to ballad research as well, giving the researcher an easy method of finding shared references to people or places by searching across many ballads at once and eliminating false matches. As a simple example, the tagged text on this slide contains the location “Corke,” which is also a word denoting a material; tagging the corpus with NER and using the more precise search term “<LOCATION>Corke</LOCATION>” would usefully separate the location from other matches to “corke.”
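The difference between a plain-text search and an entity-aware search can be shown with a short Python sketch. It assumes the corpus has already been tagged with inline tags in the form used in the search term above; the sample sentence and function names are invented for illustration.

```python
import re


def find_entities(tagged_text, entity_type):
    """Return every span wrapped in <TYPE>...</TYPE> for the given type."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(entity_type)))
    return pattern.findall(tagged_text)


def plain_matches(text, word):
    """Count case-insensitive whole-word matches, ignoring any tags."""
    return len(re.findall(r"\b{}\b".format(re.escape(word)),
                          text, re.IGNORECASE))


# Invented sample: one place name and one reference to the material.
SAMPLE = ("Strange news from <LOCATION>Corke</LOCATION> , "
          "where a corke floated upon the water")

if __name__ == "__main__":
    print(plain_matches(SAMPLE, "corke"))      # place and material together
    print(find_entities(SAMPLE, "LOCATION"))   # the place name only
```

The plain search cannot tell the city from the material, while the tagged search returns only the span the recognizer marked as a location.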

  11. Recap: Immediate Applications. To recap some of the applications of these tools to ballad research and archiving: topic modeling and NER could be particularly useful for search and metadata improvement; the POS Tagger can be used to examine and quantify language or literary features across many ballads at once, which can generate data applicable to authorship research; the NER can be used to discover shared references to characters/people and places, suggesting intertextual connections among ballads.
