SLIDE 1

Standardization and the tension between the unique and the homogeneous in the Digital Humanities • Arle Lommel • 10 September 2014 • KU Leuven Digital Humanities Summer School • Slide 1 of 46

Standardization and the tension between the unique and the homogeneous in the Digital Humanities

Arle Lommel


German Research Center for Artificial Intelligence (DFKI)

SLIDE 2

Before I begin, I thought I would tell you a little bit about myself and why I find the topic today (standards for localization and translation and their relationship to cultural knowledge) interesting. By academic training I am a linguist and folklorist. Particularly in the latter regard, I am educated in ethnology and cultural semiotics. I am deeply interested in musical instruments, a topic I will return to as a parallel to my primary topic a little later on. At the same time, for over 15 years I have been working in the field of translation technology in one capacity or another, and for that entire time I have been involved in the development of localization standards in various bodies, including the now-defunct Localization Industry Standards Association, OASIS, ASTM, ISO Technical Committee 37 (from which I know Dr. Hendrik Kockaert), the Globalization and Localization Association, the W3C, and other bodies, all of which I have been actively involved in at some point. These are all *technical* bodies, and in them I have been involved in defining XML tags and formats, developing methods for transferring digital data between computer applications, and defining abstract representations of the data needed to ensure that translators can have access to the information they need to do their jobs. This information includes details about the terms that particular companies use in their documentation, relevant texts that have previously been translated, information on expectations for translation, and a host of other details that impact the work of translators.

My current work, for the past two and a half years at the German Research Center for Artificial Intelligence in Berlin, has been on standardizing the representation of errors in translations, a topic that is of interest to individuals here at KU Leuven who are active in the same field.

A bit about me

  • Linguist and folklorist/ethnographer
  • Working at an artificial intelligence center
  • Developing specifications for translation quality
  • 15+ years of experience in translation/localization standardization

SLIDE 3

(While I will not go into the details of my work, I would like to show a picture of how complex categorizations of translation problems can be. We like to call this the bowl of spaghetti. This may be complex, but we actually do use a subset of it for research into machine translation.)

Translation quality = Spaghetti?

SLIDE 4

At some level, you may well wonder what I, a geek who deals in bits and bytes and arcana about digital resources, could have to say about the digital humanities. After all, if the digital humanities have a unified goal…

What is the goal of digital humanities?

SLIDE 5

…it surely must have to do with the preservation of diversity and rescuing products of human culture that would otherwise disappear. While digital humanities certainly involves doing other, interesting things with the output of human culture, the foundational aspect is the ability to represent and preserve such diverse items as literary works, spoken texts, paintings, music, and dance. Surely, as culture flowers about us and forever escapes our control and attempts to define it, standardization works contrary to this goal.

I don’t have an answer, but surely it must involve the preservation and dissemination of diverse cultural knowledge.

SLIDE 6

Why would standardization be good for that which cannot and should not be standardized? Does standardization aid and abet those who would extract and exploit culture or try to impose homogeneity?

But standardization is the application of homogeneity.

SLIDE 7

Put another way, does standardization end up just abetting cultural imperialism by allowing those in the center to exploit those elsewhere by forcing them to do things in one way?

Does standardization abet cultural imperialism?

SLIDE 8

In my efforts I have, not infrequently, been accused of facilitating a process that turns human cultural output into a commodity, that reduces individuals to interchangeable widgets. Much of my work has been around machine translation, and this topic strikes fear into the hearts of many translators, who believe that the end goal of efforts in this field is to replace them entirely, taking their hard intellectual work and converting it to a mere mechanical process.

Does standardization turn cultural knowledge into a commodity?

SLIDE 9

Within the last year, I and a colleague were publicly accused at a conference of being like the inventors of nuclear weapons and told that our standardization efforts, however well intentioned, were morally on par with launching a missile at a defenseless city. Setting aside the hyperbolic nature of this claim, it gets at a real tension.

I am morally on par with the makers of the atomic bomb!*

*I don’t really believe this

SLIDE 10

Culture cannot be standardized and cannot be converted into units of production, and yet standards have the potential to do just this.

Defining things in a standard fashion is a controversial action.

SLIDE 11

As part of the same work that led to the comparison of me and my colleague to the makers of weapons of mass destruction, I have worked on providing a universal (or standard) definition of translation quality. Those not involved with translation are often surprised to discover that translation theory provides them with no operative way of telling whether a translation is good or not. I am sure that this fact surprises no one here, but companies buying translation are surprised, because they have money on the line and need a way to tell whether the work of a translator will result in customer satisfaction or disapproval. Their urgent business need runs directly contrary to any theory that says that translation quality cannot be measured. So the approach taken in my work draws directly on functionalist approaches to translation and states that translation quality is relative to expectations negotiated between the party buying the translation and the one providing it, and that the translation must meet purpose-driven minimum levels of accuracy and fluency, taking into account the needs of end users.

A quality translation demonstrates
 required accuracy and fluency for the audience and purpose and
 complies with all other specifications negotiated between the requester and provider,
 taking into account end-user needs.
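The negotiated-specifications view of quality above can be sketched as a small data structure. Everything here (the names, fields, and pass/fail check) is illustrative only, not drawn from any published standard:

```python
from dataclasses import dataclass, field

@dataclass
class TranslationSpec:
    """Negotiated expectations for one translation job (illustrative names)."""
    audience: str                  # who will read the target text
    purpose: str                   # what the translation must accomplish
    min_accuracy: float            # purpose-driven floor, 0.0-1.0
    min_fluency: float             # purpose-driven floor, 0.0-1.0
    other_requirements: list[str] = field(default_factory=list)

def meets_spec(accuracy: float, fluency: float, spec: TranslationSpec) -> bool:
    """Quality is relative: the same scores can pass one spec and fail another."""
    return accuracy >= spec.min_accuracy and fluency >= spec.min_fluency

# A gisting job tolerates lower fluency than a publishable marketing text.
gist = TranslationSpec("analyst", "gisting", min_accuracy=0.9, min_fluency=0.5)
publish = TranslationSpec("customers", "marketing", min_accuracy=0.9, min_fluency=0.95)
```

The same pair of scores can satisfy one specification and fail the other, which is exactly the point: quality is relative to the negotiated expectations, not to a single absolute scale.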

SLIDE 12

From my perspective, such a definition is an improvement over any approach that maintains that all translations must strive for the same Platonic ideal of perfect translation. It is thus a radical reinterpretation of quality away from absolutist approaches that do not consider purpose. This approach puts quality squarely in the center of a complex dynamic with multiple parties, but gives priority to the relationship between the requester of the translation and its provider. It allows for various requirements.

A radical definition of quality for many people

SLIDE 13

Someone who needs a translation of an intelligence intercept to understand Russia's plans for Ukraine in the next four hours has very different requirements from someone looking to translate the works of Immanuel Kant. Someone translating a service manual for a technician who will work on your car has very different needs from someone who needs an advertising campaign to be translated.

Translating an intelligence intercept is very different from translating a financial document

SLIDE 14

While I believe that such a definition should not be controversial, this definition has publicly been denounced as nonsense by some very prominent translators, since it opens the door to relative quality. I will admit that in some cases it allows really bad translation to be considered sufficient for purpose, and it is this possibility that leads some people to believe that our efforts are to impose a low standard. Without dismissing such concerns, I can only observe that they are truly academic concerns when faced with a paying customer who wants to know whether the translation meets requirements. And I can only respond that opening the discussion about what level of quality is appropriate is not imposing a low standard. So every day my work deals with the dilemma of how to relate standardization to cultural knowledge. And these factors certainly are in tension.

Defining things in a standard fashion is a controversial action.

SLIDE 15

But before going further into the specifics here, I would like to take a detour to address what I consider to be the sine qua non of digital humanities. That is the ability to represent text in a digital fashion. I would argue that if we cannot do this, none of the rest of what we do matters. If we cannot represent texts, digital humanities as such would not exist.

Digital representation of text is a sine qua non for digital humanities.

SLIDE 16

Even if we exchange images and sound, digital humanities requires the ability to exchange multilingual metadata about these objects in various writing systems, so digital text stands at the heart of digital humanities. If we fail to at least allow for this possibility, we are, in fact, engaging in the very cultural imperialism that digital humanities should resist. This fact is increasingly recognized. For example, the Europeana project, with which many of you are familiar, has added extensive multilingual search capabilities for metadata to allow users searching in Dutch, to take one example, to locate items with metadata in Polish. And here, in general much less controversially than with my example of defining translation quality, we see similar issues at work. To exchange text in a digital fashion, we must standardize its representation, at times doing away with considerable variation. It is more than an abstract question whether a digital text of Don Quixote is the same text as a specific edition, printed on a specific paper, with a specific font. If we want to make texts digitally accessible, we unavoidably lose some aspects of those texts, even as we gain tremendous capabilities.

Even for audio and visual data we must exchange multilingual, multiscript metadata

SLIDE 17

And we do gain considerably by being able to represent text in a standardized fashion. We are able to search texts near-instantaneously. We are able to discover texts and examine them in new ways that are impossible with non-digital texts. We can examine how individual authors use pronouns, what clusters of words characterize their writing, how often they echo other writers, and so on. While these tasks are not entirely impossible without digital texts, they differ in both ease and outcome with digital texts.

Advantages of digital text

  • Searching
  • Discovering
  • Gaining insights into authors
  • Discovering social trends
  • These things are possible without digital texts, but MUCH harder

SLIDE 18

A few years ago I became interested in how early Christian writers conceived of madness and what they meant when they called someone mad. Just a few decades ago, answering this question would easily have consumed a lifetime of reading thousands of works hidden in obscure libraries, writing information on cards, and carefully comparing these cards to arrive at conclusions. And such a process almost certainly would have missed examples. This sort of work would have required the finest of research libraries and the efforts of many poor graduate students who would really rather be looking at some other topic.

An example… madness in early Christian writings

[Figure 1. Number and density of tokens of … by individual author. Scatter plot; axes: number of tokens (100–600) and density (20–180). Labeled outlying authors: John Chrysostom, Augustine, Alexander of Alexandria, Aristides the Philosopher, Arnobius, Vincent of Lérins, Dionysius the Great, Athanasius, Commodianus, Minucius Felix, Jerome, Leo the Great, Lactantius.]

SLIDE 19

And yet, because texts have been digitized and placed online, this topic becomes much easier to approach. Standardizing text makes it appropriable and queryable. To pursue my research I found a site where an individual had placed thousands of 19th-century translations of the writings of the early Church fathers. I downloaded these texts, performed some bulk clean-up (to remove HTML markup and wrappers), and then created a corpus of many megabytes of these texts. Using simple tools, I was able to identify every reference to madness, including variant terms, in the corpus and navigate to them to understand the referents. I found some thousands of instances and compared them, correlating the references to author and topic. The results I found showed that madness for these writers had a specific nuance lost on readers today: these authors used madness to refer to individuals who denied trinitarian concepts of God. They were mad because they got the nature of the universe wrong. Madness, thus, was used to denote those who disagreed with the world view of these proto-Orthodox writers.

An example… madness in early Christian writings

| Referent | All (#) | All (%) | Excl. John Chrysostom (#) | (%) | John Chrysostom (#) | (%) |
|---|---|---|---|---|---|---|
| Heresies and schism | 334 | 20.6% | 330 | 29.1% | 4 | 0.8% |
| Literal madness | 229 | 14.1% | 127 | 11.2% | 102 | 20.9% |
| General term of disapprobation / unclear reference | 148 | 9.1% | 100 | 8.8% | 48 | 9.8% |
| Pagan gods and daimons, religion, and beliefs (including astrology) | 139 | 8.6% | 115 | 10.2% | 24 | 4.9% |
| Sexual desire, covetousness, and “improper” desire | 119 | 7.3% | 37 | 3.3% | 82 | 16.8% |
| Irrational, defying “common sense” | 88 | 5.4% | 49 | 4.3% | 39 | 8.0% |
| Disbelief or false belief (excluding heresy) | 59 | 3.6% | 38 | 3.4% | 21 | 4.3% |
| Excess of emotion | 57 | 3.5% | 40 | 3.5% | 17 | 3.5% |
| Related to Judaism | 48 | 3.0% | 20 | 1.8% | 28 | 5.7% |
| Stage, circus, dancing, gladiatorial combat, or other public spectacles | 43 | 2.7% | 39 | 3.4% | 4 | 0.8% |
| Cruel actions or leaders | 36 | 2.2% | 18 | 1.6% | 18 | 3.7% |
| Opposition to Christ/Christianity | 33 | 2.0% | 24 | 2.1% | 9 | 1.8% |
| Blasphemy, sacrilege, impiety | 31 | 1.9% | 16 | 1.4% | 15 | 3.1% |
| Persecution of Christians | 26 | 1.6% | 20 | 1.8% | 6 | 1.2% |
| Relating to devils or possessed individuals | 25 | 1.5% | 6 | 0.5% | 19 | 3.9% |
| Reckless/imprudent behavior | 20 | 1.2% | 15 | 1.3% | 5 | 1.0% |
| Opinion of non-believers concerning Christians | 19 | 1.2% | 17 | 1.5% | 2 | 0.4% |
| General unrighteousness or sin | 18 | 1.1% | 7 | 0.6% | 11 | 2.2% |

Table 1. Referents of MAD in the New Advent corpus with more than one percent of the total for MAD. (A full listing of categories and their counts is presented in Appendix B.)
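The bulk clean-up and variant-term search described above can be sketched in a few lines. The corpus, file names, and variant list here are made up for illustration; the real corpus was thousands of downloaded translations:

```python
import re

# Hypothetical miniature corpus standing in for the downloaded HTML files.
corpus = {
    "chrysostom_hom1.html": "<p>What madness is this? They are mad who deny...</p>",
    "augustine_conf.html": "<p>...the madman raged against the truth.</p>",
}

# Crude bulk clean-up: drop the HTML tags, keep the running text.
TAG = re.compile(r"<[^>]+>")
texts = {name: TAG.sub(" ", html) for name, html in corpus.items()}

# One pattern catches the variant terms: mad, madness, madman, madmen, madly.
MAD = re.compile(r"\bmad(?:ness|man|men|ly)?\b", re.IGNORECASE)

# Every hit, keyed by source file, so each token can be traced to its author.
hits = {name: MAD.findall(text) for name, text in texts.items()}
counts = {name: len(found) for name, found in hits.items()}
```

Correlating hits with author and referent (as in Table 1) is then a matter of reading each match in context and tagging it, which is exactly the step that still needs a human.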
SLIDE 20

The specifics of my research are not of importance here, but rather the fact that by standardizing the digital representation of text, I was able to conduct research that would have taken decades in a matter of a few days of solid work. As a result I was able to ask new kinds of questions and obtain results that it is unlikely anyone would ever have bothered to research otherwise. Standardization at one level, by removing the variation and physical contingencies of specific texts, allowed diversification at another level. Such results, of course, surprise no one in the digital humanities.

A lifetime’s work becomes a few weekends

SLIDE 21

But even since the advent of digital texts, there have been major difficulties. Look at the image on screen. Whether you instantly recognize what I am showing you or not is probably a fairly reliable indicator of when you first started dealing with digital texts.

The Lord’s Prayer in… ?

Fair vor, ˛˙ sem er · himnum. Helgist ˛itt nafn, til komi ˛itt rÌki, veri ˛inn vilji svo · jˆru sem · himni. Gef oss Ì dag vort daglegt brau

  • g fyrirgef oss vorar skuldir,

svo sem vÈr og fyrirgefum vorum skuldunautum. Eigi lei ˛˙ oss Ì freistni, heldur frelsa oss fr· illu. [fivÌ a ˛itt er rÌki, m·tturinn og d˝rin a eilÌfu.]

SLIDE 22

What you see is the text of the Lord's Prayer in Icelandic. It should look something like this, but what has happened is that I have taken a text stored in ISO Latin-1 encoding, a way of representing texts for many European languages, and I have opened it as a Mac Roman file. The result is that many of the characters are garbled and unrecognizable.

The Lord’s Prayer in Icelandic

Faðir vor, þú sem er á himnum. Helgist þitt nafn, til komi þitt ríki, verði þinn vilji svo á jörðu sem á himni. Gef oss í dag vort daglegt brauð og fyrirgef oss vorar skuldir, svo sem vér og fyrirgefum vorum skuldunautum. Eigi leið þú oss í freistni, heldur frelsa oss frá illu. [Því að þitt er ríkið, mátturinn og dýrðin að eilífu.]
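The garbling shown on the previous slide is easy to reproduce: the same bytes, written as ISO Latin-1 and then decoded as Mac Roman (using Python's codec names), turn þ, ú, and á into the stray diacritic marks seen above. A minimal sketch:

```python
# Icelandic text saved as ISO Latin-1, then read back with the wrong decoder.
correct = "Faðir vor, þú sem er á himnum."
stored = correct.encode("latin-1")      # the bytes on disk
garbled = stored.decode("mac_roman")    # wrong decoder at read time
# þ (0xFE) becomes ˛, ú (0xFA) becomes ˙, á (0xE1) becomes ·
```

Decoded with the right codec, the same bytes round-trip perfectly; nothing in the bytes themselves says which reading is intended.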

SLIDE 23

Today we often take it for granted that we can send texts around the world digitally and others will be able to read and use them. But not so many years ago, before the mid-2000s, you could not assume that. The problem was that computers were limited to representing texts with eight binary digits per character, which theoretically allowed you to have 256 characters at your disposal (but in practical terms allowed you access to 232, since 24 characters were reserved for system functions). As a result you had to make decisions about what characters you could and could not see in a given text, and the same underlying representation (essentially a number or code point) could apply to different characters, depending on the fonts you used.

Before the mid 2000s digital text was ambiguous
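The ambiguity can be shown with a single byte. In ISO Latin-1 the byte 0xDE is the Icelandic letter Þ; in Mac Roman the very same byte is the "fi" ligature. A sketch using Python's codecs:

```python
# One stored byte, two incompatible readings.
b = bytes([0xDE])
latin = b.decode("latin-1")     # 'Þ' (LATIN CAPITAL LETTER THORN)
mac = b.decode("mac_roman")     # 'ﬁ' (LATIN SMALL LIGATURE FI)
```

Nothing in the file records which reading was meant; that knowledge lived in the font, outside the text itself.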

SLIDE 24

The situation was even worse because everyone wanted to get their texts into digital form, so user communities sprang up around home-made fonts for all sorts of languages. And not infrequently there were multiple such communities, each with their own custom fonts that used totally different encodings. A result of this was that texts were standardized into digital ghettos, where only tools that knew what they were could access them and everyone else would see them as piles of junk. You could not simply search for all occurrences of a word that happened to use any characters other than the basic characters of the English language, since you had no way of knowing what to look for.

Change the font and your text is unintelligible

SLIDE 25

The result of this situation was that texts were very difficult to work with. In the 1990s I started working on the publication of linguistic research as an editor. I estimate that a full 60% of my time for every volume I made was spent on designing and tweaking fonts to display the information in articles. Although these volumes were digital, they were not really useful in digital humanities because they could not be queried, and anyone who wanted to manipulate them had to have my custom fonts. But there were no other options at the time.

In the 1990s, 60% of my editing work was dealing with fonts and encodings!

SLIDE 26

Already in the 1980s some forward-thinking software engineers had seen that the use of hundreds of official and unofficial encodings was a recipe for disaster and a huge barrier to extending the benefits of computing outside of the English-speaking countries and the French-, Italian-, German-, and Spanish-speaking countries that could share a common character set. Their truly brilliant idea was to create one character encoding to replace all of the existing ones. This character set, if universally adopted, would eliminate garbled characters and enable users to mix languages and writing systems in single documents without worrying about whether changing the font would make the text unusable. This proposed character set, which of course became Unicode, was intended to be truly universal, and as such it provided a straightforward way to address 65,536 characters, with mechanisms to extend the number of characters into the millions. They also provided a way for people using plain ASCII text (which comprises much of the world's digital text) to be compatible with Unicode's UTF-8 format, the first 128 code points of which duplicate the ASCII range.

The goal: One standard for representing characters of all writing systems.
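That ASCII compatibility is easy to check: plain ASCII bytes are already valid UTF-8, while everything beyond ASCII gets an unambiguous multi-byte sequence. A small sketch:

```python
# The first 128 UTF-8 code points coincide with ASCII, so old files just work.
ascii_bytes = "plain ASCII text".encode("ascii")
assert ascii_bytes.decode("utf-8") == "plain ASCII text"

# Beyond ASCII, UTF-8 uses multi-byte sequences: one scheme for every script.
assert "þ".encode("utf-8") == b"\xc3\xbe"  # two bytes, the same everywhere
```

This is the property that let the world migrate gradually: existing ASCII text needed no conversion at all.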

SLIDE 27

The first mainstream consumer-facing Unicode-capable application I know of was Adobe InDesign, released in 1998. I was an early adopter of it for the simple reason that using InDesign overnight eliminated all the tedious font work I had to do in my publication work…

1998: Adobe InDesign comes out: that 60% turns to <5% overnight.

SLIDE 28

…and enabled me to make searchable and indexable PDFs. Similarly, Unicode provided a way to move forward and allow all languages to be represented equally on the Internet. Or at least that is the long-term result. The reality, of course, falls a little short. It can take years to document a script and get it into Unicode, and then years more before word processors and other tools begin to support it. But at least Unicode provides a way to get there. Unicode, a technical standard, is what enables digital humanities to exist as a single discipline at all, rather than as a set of more or less aligned disciplines bounded by particular languages and writing systems. If you have come of age as scholars since Unicode entered the mainstream, be glad, for your life as scholars has been enriched immeasurably.

Plus my work now becomes searchable and discoverable!

SLIDE 29

Now, with all that background out of the way, I would like to turn to an example of how even the issue of standardizing the representation of texts can be very difficult. To address this issue, consider this image of a writing system, one that I imagine few of you know. The image on screen is of Székely rovás írás, also known as Old Hungarian runes or Szekler runes.

A strange writing system

SLIDE 30

This writing system, derived from Turkic runes, was once actively used in Transylvania to write the Hungarian language. It was largely displaced by the Latin alphabet by the 1700s and was essentially extinct by the early 1800s. It was revived by Hungarian nationalists around the time of the First World War, but never regained mainstream status. In the mid-1990s a proposal was made to add this writing system to Unicode. At least three separate proposals from different user communities were submitted by 1999. All of them largely agreed as to the characters to be encoded, although they differed in some details. However, the groups disagreed about what the script was called and on the importance of those details, and unfortunately they could not come to an agreement on these relatively minor technical points.

A strange writing system

  • Rovás írás = Szekler runes
  • Derived from Turkic rune scripts
  • Used through start of 19th century
  • Revived in early 20th century
  • Proposal to add characters to Unicode presented in 1997
  • But controversy over names and details
SLIDE 31

As a result, it took 15 years, and one side essentially walking away from the table, before this writing system's encoding could be standardized.

It took over 15 years to add this writing system to Unicode

SLIDE 32

This long fight means that this writing system was effectively kept beyond the pale of access for scholars for a very long time because the two primary factions could not agree on what to call the script and on a few other details. So why would details about how to represent a writing system that nobody actively uses for daily communication arouse such passions? One of the reasons is that the script comes from Transylvania. Although now part of Romania, Transylvania holds a special place in concepts of Hungarianness, and the loss of Transylvania to Romania at the end of the First World War remains, almost 100 years after the fact, a major issue in Hungarian politics. Because Transylvania is so important to Hungarians, a writing system from Transylvania was bound to arouse strong feelings. In a sense, then, the issues about the writing system were more about who would control the standardization of some aspect of that culture. If, as I discussed earlier, writing provides a means to control conceptualizations of the world and to make items appropriable, the real battle here was about control over how Transylvania would be represented in Hungarian society.

During this time Hungarian runes might as well not have existed for scholars

SLIDE 33

The act of standardizing is always an ideological activity, even if the ideology is benign and unobjectionable. In standardizing one way of doing something, other ways are deprecated or made less legitimate. Standardization thus is exclusionary, because if it defines how one must do something, it also defines how one may not do that same thing. Standardization also decides which variations matter and which do not. In the Hungarian case, this factor was crucial, because one of the parties in the debate (the losing party, as it happens) had plans to develop additional characters for many local variants of the writing system, even though Unicode would generally not get into those particular details, and their particular technical solutions were needed to support a much more ambitious goal.

Standardization always involves an ideology

slide-34
SLIDE 34

In its history Unicode has made other controversial choices. One of the most controversial was known as "CJK unification." This process refers to the historical fact that there are four (or five, if older Vietnamese is counted) writing systems in Asia that share close historical roots: Chinese (Traditional and Simplified), Japanese, and Korean. All of these writing systems use characters that are historically related. Japanese and Korean historically borrowed Chinese characters (although modern Korean uses relatively few of them), often borrowing them at multiple times in slightly different variations. These characters would thus be written in different ways in each language. In an effort to keep the number of characters in Unicode's basic multilingual plane to the number it could contain, the Unicode Consortium made the decision to unify the historically related characters from each writing system under one abstract character, treating the local variations essentially as stylistic variants. This decision was motivated largely by practical concerns: it is highly unlikely Unicode would have succeeded if implementers found themselves facing almost four times as many characters as they otherwise did. The drawback of this pragmatic decision is that these variations are significant: using a Simplified Chinese character in place of a Traditional Chinese character, or vice versa, can have profound political implications. In some extreme cases, using the wrong variation may even make a text unintelligible. As a result, we find ourselves in a position something like what I described in the pre-Unicode days: you need the right font to ensure the text is what you want it to be. (However, note that using Unicode does at least ensure that the changes that may happen from changing fonts are reversible and intelligible: there is no guessing with a text whether a particular digital representation means one thing or another.) Note that unification applied only to CJKV characters, not to any of the other writing systems out there. So, for instance, Greek α, Latin a, Cyrillic а, and Hebrew א were not unified, even though historically they all derive from Phoenician alf.
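This contrast is easy to verify in any Unicode-aware environment. A minimal Python sketch (the example character 直 "straight/honest" is my choice to match the slide's "honest" example; the slides themselves contain no code):

```python
import unicodedata

# CJK unification: one abstract character, one code point, even though
# Chinese and Japanese fonts render the glyph differently.
honest = "直"
print(f"U+{ord(honest):04X}")  # a single code point shared across CJK locales

# The historically related "a" letters, by contrast, were never unified:
# each script keeps its own code point.
for ch in ["a", "α", "а", "א"]:  # Latin, Greek, Cyrillic, Hebrew
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Because the four letters have distinct code points, no font change can silently turn one into another; for unified CJK characters, the rendered shape depends on the font or locale chosen.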

CJK Unification

Honest (Chinese vs. Japanese) Iron

slide-35
SLIDE 35

Here again we see that standardization must make choices about what is significant and what is not. But without making these choices, text cannot be exchanged and the enterprise of digital humanities cannot take off. What is significant in the case of Unicode is that these choices are made in an open and participatory fashion. They are not imposed from on high, or implemented by force of market dominance. The architecture is open to all, and anyone may make suggestions and get involved (although the intellectual barriers to being truly involved are not insignificant, given the complexity of the topic). What is important is that the broader community is able to have a say in how these decisions are made. Various parties come together with competing demands. While the Hungarian script example I cited is an extreme example of what happens when the parties involved will not compromise, most of the time the various parties make their needs known and then work collaboratively to find a solution that addresses the needs of everyone. Compromises are made, but the end result is something that does not impose a single, biased vision on all parties. Of course, this does not mean everyone walks away perfectly happy, but they do walk away with something they can live with and that meets their needs.

Standardization involves tradeoffs, but these are made openly and with broad participation.

slide-36
SLIDE 36

So, at a basic level, the digital humanities are a product of technical standardization, but they use technical standardization to preserve cultural knowledge and unique features. Earlier I mentioned that in the bad old days of the 1990s most computer processes really only supported ISO Latin 1, which essentially meant that people who did not speak a handful of influential Indo-European languages were second-class citizens in the digital world. They could not interact with computer systems in their own languages, could not type their languages on most computers, and certainly did not enjoy the resources that most of us now take for granted. But with the standardization of how text in the world's languages is represented, the barriers are coming down.

In the 1990s software was almost exclusively developed in English and then, after the English version was produced, it was essentially reverse engineered to work with other languages in a process called localization. The level of engineering required varied by the language in question, and when you got a local version of your software depended on what language you spoke. As a result, if a product came out in 1995 in the U.S., you might have seen the Dutch version in 1996, the Japanese version in 1997 or 1998, and the Arabic version (if it appeared at all) in 1999 or 2000 (by which time the English version was five or six major versions beyond what you were getting). And if you spoke Basque, Swahili, Maltese, Quechua, or any of the other hundreds or thousands of smaller languages not privileged to belong to the ruling elites of old European colonial powers, you probably are still waiting today for products people in Germany or Spain had in 1995. Furthermore, it was highly likely that some of the functions did not work properly because the software could not handle your language properly.
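The ISO Latin 1 limitation can be demonstrated directly; a small Python sketch (the sample strings are my own illustrations, not from the slides):

```python
# ISO Latin 1 (ISO 8859-1) covers only a handful of Western European
# languages; text in any other script simply cannot be encoded at all.
for text in ["Übersetzung", "ελληνικά", "日本語"]:
    try:
        text.encode("latin-1")
        print(f"{text!r}: representable in Latin-1")
    except UnicodeEncodeError:
        print(f"{text!r}: NOT representable in Latin-1")

# A Unicode encoding such as UTF-8 round-trips all of them losslessly.
for text in ["Übersetzung", "ελληνικά", "日本語"]:
    assert text.encode("utf-8").decode("utf-8") == text
```

Only the German word survives the Latin-1 round trip; Greek and Japanese text could not even be stored, which is exactly the "second-class citizen" situation described above.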

In the “bad old days”…

  • Software developed in English
  • Reverse engineered for other languages
  • Typical delays:
  • Dutch: 6 months
  • Japanese: 2–3 years
  • Arabic: 4–5 years
  • Basque, Swahili, Urdu: ∞
slide-37
SLIDE 37

Particularly smart developers realized that their products would be sold in other languages, so they took some up-front engineering steps to ensure that products could be localized, but these developers were in the minority in the 1990s. Fortunately this situation is now the norm, and it is increasingly possible for companies to release products (software, websites, or devices) in hundreds of languages, often simultaneously with the source release. Some of the world's largest companies now deal with over 200 languages. This is only possible because so much has now been standardized in so many technical aspects. This sounds great at one level, but some of you might see it as just a recapitulation of old colonialisms with a kinder, gentler face, still supporting the power of old centers of power. And I could not say that you are wrong in this criticism. But at the same time, the standardization of text encodings and of ways of dealing with text opens up the way for influence to run the other way. Skype, for instance, started in Estonia and only later came to the U.S. A few days ago I read about a new app for phones that has strangers from other countries call you to wake you up in the morning. While I would not use this app, it is the U.S./European adaptation of a popular Russian app. Increasingly we are seeing that the standardization of text is allowing more and more cultural output to flow back from the periphery to the center, not as appropriated knowledge, but rather as a reverse of the colonial flow. These scenarios are not about extraction of goods and resources, but rather about commerce that increasingly obliterates the notion of center and periphery. Standardization at the basic level of how to represent texts is therefore an example of how technical standardization enables or inhibits certain expressions of cultural knowledge, but a standardization *process* that brings competing demands into conversation with one another provides a way to break down traditional barriers.

Today

  • Many products ship simultaneously in hundreds of languages
  • Only possible because of standards
  • Does this recapitulate cultural power hierarchies?
  • Not necessarily…
  • Skype
  • App to have strangers wake you up
  • Standards are eroding the center/periphery distinction

slide-38
SLIDE 38

I would now like to turn to another example of how standardization can be problematic. Recently there has been considerable interest in the International Organization for Standardization in defining best practices for translation providers. The proposed standards would provide a way for translation providers to be certified as following certain processes that are intended to provide assurance that their output meets quality expectations. (This approach, of controlling inputs to achieve quality output, is generally known as quality assurance.)

Most of the proposed standard is widely accepted and not controversial. However, in the ISO context one point has led to vigorous disagreement. The standard proposes to define the qualifications of translators working in conforming projects. As it was proposed, there were three ways to be considered qualified: education (a master's in translation), experience (more than five years of full-time experience), or certification by a government body as a translator. For those of you who have grown up in countries with civil law systems, these requirements probably seem natural and reasonable, but the last option has held up development of this standard for some time now.

International Standard for translation projects

  • Qualification on one of three levels:
  • Education
  • Experience
  • Certification
  • Certification by government bodies
slide-39
SLIDE 39

The reason is that in common law jurisdictions the government is not authorized to certify individuals at all. This requirement would not work in the U.S., U.K., and some other countries. The fear, thus, was that by standardizing a practice applicable only to certain cultures, translation providers in the U.S. and England would be put at a competitive disadvantage: translators in those countries, who might be equally skilled as those in another country, might not be able to bid for jobs that those in countries like France, Belgium, or Germany could. (The issue was further compounded by the fact that the U.S. has few master's programs in translation and that highly skilled translators increasingly do not work full time as translators but rather offer translation as part of a broader portfolio of skills.)

The requested change from the U.S. and U.K. was that the requirement be amended to allow for certification by competent national translators' associations (with careful control over how competency was determined). This request, in turn, was rejected by many European delegations because they were worried that it would open the door for commercial organizations to self-certify their translators and undermine the meaning of the standard. Thus there were two cultural practices, common and civil law, that were in direct conflict. Because these two items could not be reconciled, the end result was that certification was dropped entirely as an option, leaving both sides at an equal disadvantage. Here the outcome of the standardization process was less positive than in the case of Unicode, but it still was better than either side imposing requirements that could not apply to the other.

Government certification not possible in US and UK.

slide-40
SLIDE 40

Earlier I promised to use an analogy to music.

Finally some music

slide-41
SLIDE 41

One of my passions is Hungarian music. I play Hungarian bagpipes, an instrument I am sure most of you did not even know existed (and which you may wish you did not know about after hearing this clip). You have no need to worry, though: as the old joke goes, a gentleman knows how to play his bagpipe but refrains from doing so, so at least you will not hear me perform.

Hungarian bagpipes

slide-42
SLIDE 42

In Hungary there is a proverb that says that there is not room for two bagpipes in one pub. The meaning is that if you have two strong personalities in one project, chaos will result. This saying came about because these instruments were, for many years, true folk instruments, made to no standard of shape or size. As a result, bagpipes were purely solo instruments, since any two bagpipes were essentially guaranteed not to play in the same pitch or tuning, and two bagpipes in one pub would have been a truly horrifying experience. (This may also explain why it was popularly believed that bagpipe players had to go to hell to really learn to play…) This instrument had essentially died out by the 1940s.

Proverb: There is not room for two bagpipes in one pub.

slide-43
SLIDE 43

However, starting in the 1970s the Hungarian bagpipe was revived, and makers started to standardize the construction of the instrument. Today most Hungarian bagpipes play in the same pitch (A) and use a tuning relatively close to that used by concert instruments. As a result, bagpipe players now form ensembles and routinely play in groups with multiple bagpipes. Other features of the instruments have been standardized as well, creating a new sort of bagpipe that is very similar to the old one, but which can function in new cultural situations. This process of standardization has resulted in the loss of certain cultural knowledge (particularly around older tunings), but by accepting certain losses, the instrument has made dramatic gains in other areas. These changes were made organically by the bagpiping community as a whole in response to their current needs and desires, so they see it as a net gain and not a loss. There was a long process to arrive at the present situation.

Today…

slide-44
SLIDE 44

So it is with standardization. The process itself is what prevents standardization from becoming simple imposition. The process of standardization is what helps ensure that cultural knowledge is not lost or steamrolled under the dictates of particular companies or countries.

The process, slowly but surely, is what prevents standards from being impositions

slide-45
SLIDE 45

Turning back to the digital humanities: without standards we cannot have digital humanities. While I have focused on standards related to representing multilingual texts, translation quality, and so forth, similar issues apply of course to standards for images, video, and other digital humanities assets. The technical standards developed have profound implications for what you can and cannot do in the digital humanities. But by being involved in the standardization process, we are slowly moving in the right direction, and our capabilities to share and exchange information and to discover new and profound insights into the objects of our studies are growing.

Without standards there would be no digital humanities

slide-46
SLIDE 46

Thank you. If you would like to contact me about any of these materials, please feel free to reach out to me at arle.lommel@gmail.com.

Thank you

arle.lommel@gmail.com