Overview 1. A step back: a look at the data we had to work with 2. - - PowerPoint PPT Presentation
Overview 1. A step back: a look at the data we had to work with 2. - - PowerPoint PPT Presentation
1/19 Overview 1. A step back: a look at the data we had to work with 2. Requirements for the application 3. Basic choices made for data storage 4. Specific technologies we use 5. A detailed example: storage of the tagging 6. A note on
2/19
Data we had to work with
- Sound files (.aiff) of the fieldwork interviews
- Transcripts (plain text files) made from these sound files with
the PRAAT transcription program (www.praat.org)
3/19
Requirements for the application
- Text of transcripts must be searcheable for words, arbitrary
substrings, and with wildcards
- It must be possible to assign POS tagging to individual words
- It must be possible to lemmatize individual words
- It must be possible for one word to have more than one
tagging/annotation associated with it
- It must be possible to draw maps from the results of searches in the data
- It must be possible to hear the original audio of text fragments which
are the results of searches
4/19
- It should also be easy to link to external information (for example,
geographical coordinates of the localities of the interviews)
So how should the data be stored?
- … in such a way that textual searches are fast enough
- … in such a way that it is possible to uniquely identify the smallest
parts of the text, i.e. individual words, while preserving their context at all levels: sentences, question-answer pairs, and complete interviews
- The most flexible and economic way of achieving this seemed to us
to use a relational database for storage of the transcripts.
5/19
From PRAAT transcripts to a relational database: a quick overview
6/19
waveform
- f sound
transcription of sound: separate speakers in separate tiers, divided into intervals
transcription in PRAAT
7/19
- PRAAT saves its transcripts as plain text files, containing, for each
tier, intervals with transcribed text and start and end times, assigned by the transcriber
- scripts were written to split the PRAAT files into individual words
which were read into the database
- the basic table structure for the transcripts looks like this:
word interval tier (speaker) file (interview) word_id interval_id interval_id tier_id tier_id file_id file_id
start time end time locality
8/19
Specific technologies we used
- We want to make our data accessible in an open standard format and we
preferably use open source software
- On the storage end, data should be stored in such a way that it is always
possible to get it out again of whatever storage mechanism one chooses. The storage mechanism should not be tied to a specific operating system. Since we wanted a relational database backend for speed and flexibility reasons, the best choice in this regard would be a non-platform-specific
- pen source database system.
- The database we use at the Meertens Institute is MySQL,
version 4.1.11 at the time of writing. In the database for the SAND, we use InnoDB tables so that we can use foreign key constraints and transactions to guarantee data consistency.
9/19
Specific technologies (continued)
- The end result: a web application, delivering web
pages in standard XHTML: maximum accessability for end users. The only software needed is a modern web browser, on any operating system.
- The application is written in PHP (version 4.3.6 at this
moment, upgrade to 5.0 imminent). PHP is used to translate search questions from web pages into SQL queries for the database, and SQL results from the database back into HTML pages.
- Within the PHP layer, programming logic and
presentation (HTML) are separated by means of the Smarty templating system: smarty.php.net.
10/19
Specific technologies (continued): maps
- The maps are SVG (Scalable Vector Graphics) images. SVG is an
- pen standard (SVG images are actually XML files).
- We have a base map of the Dutch language area in the so-called
“Rijksdriehoeksstelsel” (National Triangulation System), and a database with about 4500 localities and their coordinates in the same system. Searches in the SAND database always return localities, so it is straightforward to display search results as symbols on the map.
- In addition to SVG images, maps can be displayed as JPEG
images (generated by the PHP GD library).
11/19
- The original (.aiff) sound files are converted to QuickTime movies, and
served by QuickTime Streaming Server. With this setup it is possible to play a small slice of the file with a specific starting and ending time.
Specific technologies (continued): sound
- Since searches in the SAND database always return one or more
intervals (from the original transcripts), and intervals have start- and endtimes which are also saved in the database, it is straightforward to construct an HTML <object> with “starttime” and “endtime” parameters which plays, using the QuickTime plugin, just that specific slice of the sound file which corresponds with the found interval.
- We use Apple’s QuickTime server, but there is also an open source
version called Darwin Streaming Server. It can be downloaded from www.opensource.apple.com.
12/19
Use of a relational database: a detailed example storage of the tagging The text of interval id 3817 (from the municipality of Aalter) is: k weet datij zal moete were keren. (“I know that he will have to return”.) The word moete is word id 12345, and this word has the following tag:
V(-e,eind2,inf,mod)
Main category: V = verb; with four attributes defined: -e (inflexion is -e); eind2 (“end2”, it is the second highest verb, hierarchically, in the verb cluster); inf (it is a verbally used infinitive), mod (it is a modal auxiliary verb).
13/19
storage of the tagging (continued) This tag, V(-e,eind2,inf,mod), is broken up into its constituent parts in the database and linked to the word id in two database tables:
- sandtag_toegekend_woordsoort (“sandtag_assigned_category”)
- sandtag_toegekend_attribuut (“sandtag_assigned_attribute”)
If we look up word id 12345 in these tables, we find this:
14/19
sandtag_toegekend_woordsoort +----------+---------------+ | woord_id | woordsoort_id | +----------+---------------+ | 12345 | 4 | +----------+---------------+
storage of the tagging (category)
Category is assigned to word 12345 by means of an id number (4) which refers to another table (sand_woordsoort, “sand_category”): a small table which contains all word categories used in the SAND tag set.
sandtag_woordsoort +---------------+------------+-----------------+ | woordsoort_id | woordsoort | woordsoort_kort | +---------------+------------+-----------------+ | 4 | V-infin | V | +---------------+------------+-----------------+
Category id 4 is ‘V-infin’ (infinite verb), the short form of which, for readability purposes, is just ‘V’.
15/19
sandtag_toegekend_attribuut +----------+--------------+-----------+ | woord_id | attribuut_id | waarde_id | +----------+--------------+-----------+ | 12345 | 1 | 3 | | 12345 | 2 | 18 | | 12345 | 14 | 98 | | 12345 | 11 | 57 | +----------+--------------+-----------+
The attributes are stored as attribute-value pairs, as usual with id numbers. These numbers refer to two tables: sandtag_attribuut (“sandtag_attribute”) and sandtag_waarde (“sandtag_value”). If we look up id numbers found above in these tables, we find:
storage of the tagging (attributes)
sandtag_waarde +-----------+--------+ | waarde_id | waarde | +-----------+--------+ | 3 | -e | | 18 | eind2 | | 98 | inf | | 57 | mod | +-----------+-------- sandtag_attribuut +--------------+-----------+ | attribuut_id | attribuut | +--------------+-----------+ | 1 | infl | | 2 | pos | | 14 | infin | | 11 | aux | +--------------+-----------+
16/19
storage of the tagging (continued) Apart from the three tables mentioned earlier (for categories, attributes and values), there are also two tables which list the legal combinations of categories, attributes and values, whether attributes are optional or required, and the order in which these elements are to be presented. The complete SAND tag set is described in these five tables. The relational database principle of storing data and tagging in separate tables, which are linked by id numbers, makes interesting things possible …
17/19
flexibility
- adding a translation of the tagging system in other languages is
possible without disturbing the data or the existing tagging. Just add a column:
+--------------+-----------+--------------+ | attribuut_id | attribuut | attribuut_en | +--------------+-----------+--------------+ | 1 | infl | infl | | 2 | pos | pos | | 3 | casus | case | | 4 | pers | pers | | 5 | getal | number | | 6 | genus | gender | | 7 | functie | function |
This has already been done for English.
18/19
flexibility (continued)
- it is possible to add as many annotations/taggings to a single
word as you want, without disturbing the existing tagging. Just add another table with the word id as a foreign key. Examples: lemmatizing is done in a separate table:
+----------+----------+ | woord_id | lemma_id | +----------+----------+ | 12345 | 125 | +----------+----------+
There is a separate table with tags according to the CGN (Corpus Gesproken Nederlands) system:
+----------+-----------+ | woord_id | cgntag_id | +----------+-----------+ | 12345 | T314 | +----------+-----------+