Overview 1. A step back: a look at the data we had to work with 2. - - PowerPoint PPT Presentation

overview
SMART_READER_LITE
LIVE PREVIEW

Overview 1. A step back: a look at the data we had to work with 2. - - PowerPoint PPT Presentation

1/19 Overview 1. A step back: a look at the data we had to work with 2. Requirements for the application 3. Basic choices made for data storage 4. Specific technologies we use 5. A detailed example: storage of the tagging 6. A note on


slide-1
SLIDE 1

1/19

1. A step back: a look at the data we had to work with 2. Requirements for the application 3. Basic choices made for data storage 4. Specific technologies we use 5. A detailed example: storage of the tagging 6. A note on long-term archiving

Overview

slide-2
SLIDE 2

2/19

Data we had to work with

  • Sound files (.aiff) of the fieldwork interviews
  • Transcripts (plain text files) made from these sound files with

the PRAAT transcription program (www.praat.org)

slide-3
SLIDE 3

3/19

Requirements for the application

  • Text of transcripts must be searcheable for words, arbitrary

substrings, and with wildcards

  • It must be possible to assign POS tagging to individual words
  • It must be possible to lemmatize individual words
  • It must be possible for one word to have more than one

tagging/annotation associated with it

  • It must be possible to draw maps from the results of searches in the data
  • It must be possible to hear the original audio of text fragments which

are the results of searches

slide-4
SLIDE 4

4/19

  • It should also be easy to link to external information (for example,

geographical coordinates of the localities of the interviews)

So how should the data be stored?

  • … in such a way that textual searches are fast enough
  • … in such a way that it is possible to uniquely identify the smallest

parts of the text, i.e. individual words, while preserving their context at all levels: sentences, question-answer pairs, and complete interviews

  • The most flexible and economic way of achieving this seemed to us

to use a relational database for storage of the transcripts.

slide-5
SLIDE 5

5/19

From PRAAT transcripts to a relational database: a quick overview

slide-6
SLIDE 6

6/19

waveform

  • f sound

transcription of sound: separate speakers in separate tiers, divided into intervals

transcription in PRAAT

slide-7
SLIDE 7

7/19

  • PRAAT saves its transcripts as plain text files, containing, for each

tier, intervals with transcribed text and start and end times, assigned by the transcriber

  • scripts were written to split the PRAAT files into individual words

which were read into the database

  • the basic table structure for the transcripts looks like this:

word interval tier (speaker) file (interview) word_id interval_id interval_id tier_id tier_id file_id file_id

start time end time locality

slide-8
SLIDE 8

8/19

Specific technologies we used

  • We want to make our data accessible in an open standard format and we

preferably use open source software

  • On the storage end, data should be stored in such a way that it is always

possible to get it out again of whatever storage mechanism one chooses. The storage mechanism should not be tied to a specific operating system. Since we wanted a relational database backend for speed and flexibility reasons, the best choice in this regard would be a non-platform-specific

  • pen source database system.
  • The database we use at the Meertens Institute is MySQL,

version 4.1.11 at the time of writing. In the database for the SAND, we use InnoDB tables so that we can use foreign key constraints and transactions to guarantee data consistency.

slide-9
SLIDE 9

9/19

Specific technologies (continued)

  • The end result: a web application, delivering web

pages in standard XHTML: maximum accessability for end users. The only software needed is a modern web browser, on any operating system.

  • The application is written in PHP (version 4.3.6 at this

moment, upgrade to 5.0 imminent). PHP is used to translate search questions from web pages into SQL queries for the database, and SQL results from the database back into HTML pages.

  • Within the PHP layer, programming logic and

presentation (HTML) are separated by means of the Smarty templating system: smarty.php.net.

slide-10
SLIDE 10

10/19

Specific technologies (continued): maps

  • The maps are SVG (Scalable Vector Graphics) images. SVG is an
  • pen standard (SVG images are actually XML files).
  • We have a base map of the Dutch language area in the so-called

“Rijksdriehoeksstelsel” (National Triangulation System), and a database with about 4500 localities and their coordinates in the same system. Searches in the SAND database always return localities, so it is straightforward to display search results as symbols on the map.

  • In addition to SVG images, maps can be displayed as JPEG

images (generated by the PHP GD library).

slide-11
SLIDE 11

11/19

  • The original (.aiff) sound files are converted to QuickTime movies, and

served by QuickTime Streaming Server. With this setup it is possible to play a small slice of the file with a specific starting and ending time.

Specific technologies (continued): sound

  • Since searches in the SAND database always return one or more

intervals (from the original transcripts), and intervals have start- and endtimes which are also saved in the database, it is straightforward to construct an HTML <object> with “starttime” and “endtime” parameters which plays, using the QuickTime plugin, just that specific slice of the sound file which corresponds with the found interval.

  • We use Apple’s QuickTime server, but there is also an open source

version called Darwin Streaming Server. It can be downloaded from www.opensource.apple.com.

slide-12
SLIDE 12

12/19

Use of a relational database: a detailed example storage of the tagging The text of interval id 3817 (from the municipality of Aalter) is: k weet datij zal moete were keren. (“I know that he will have to return”.) The word moete is word id 12345, and this word has the following tag:

V(-e,eind2,inf,mod)

Main category: V = verb; with four attributes defined: -e (inflexion is -e); eind2 (“end2”, it is the second highest verb, hierarchically, in the verb cluster); inf (it is a verbally used infinitive), mod (it is a modal auxiliary verb).

slide-13
SLIDE 13

13/19

storage of the tagging (continued) This tag, V(-e,eind2,inf,mod), is broken up into its constituent parts in the database and linked to the word id in two database tables:

  • sandtag_toegekend_woordsoort (“sandtag_assigned_category”)
  • sandtag_toegekend_attribuut (“sandtag_assigned_attribute”)

If we look up word id 12345 in these tables, we find this:

slide-14
SLIDE 14

14/19

sandtag_toegekend_woordsoort +----------+---------------+ | woord_id | woordsoort_id | +----------+---------------+ | 12345 | 4 | +----------+---------------+

storage of the tagging (category)

Category is assigned to word 12345 by means of an id number (4) which refers to another table (sand_woordsoort, “sand_category”): a small table which contains all word categories used in the SAND tag set.

sandtag_woordsoort +---------------+------------+-----------------+ | woordsoort_id | woordsoort | woordsoort_kort | +---------------+------------+-----------------+ | 4 | V-infin | V | +---------------+------------+-----------------+

Category id 4 is ‘V-infin’ (infinite verb), the short form of which, for readability purposes, is just ‘V’.

slide-15
SLIDE 15

15/19

sandtag_toegekend_attribuut +----------+--------------+-----------+ | woord_id | attribuut_id | waarde_id | +----------+--------------+-----------+ | 12345 | 1 | 3 | | 12345 | 2 | 18 | | 12345 | 14 | 98 | | 12345 | 11 | 57 | +----------+--------------+-----------+

The attributes are stored as attribute-value pairs, as usual with id numbers. These numbers refer to two tables: sandtag_attribuut (“sandtag_attribute”) and sandtag_waarde (“sandtag_value”). If we look up id numbers found above in these tables, we find:

storage of the tagging (attributes)

sandtag_waarde +-----------+--------+ | waarde_id | waarde | +-----------+--------+ | 3 | -e | | 18 | eind2 | | 98 | inf | | 57 | mod | +-----------+-------- sandtag_attribuut +--------------+-----------+ | attribuut_id | attribuut | +--------------+-----------+ | 1 | infl | | 2 | pos | | 14 | infin | | 11 | aux | +--------------+-----------+

slide-16
SLIDE 16

16/19

storage of the tagging (continued) Apart from the three tables mentioned earlier (for categories, attributes and values), there are also two tables which list the legal combinations of categories, attributes and values, whether attributes are optional or required, and the order in which these elements are to be presented. The complete SAND tag set is described in these five tables. The relational database principle of storing data and tagging in separate tables, which are linked by id numbers, makes interesting things possible …

slide-17
SLIDE 17

17/19

flexibility

  • adding a translation of the tagging system in other languages is

possible without disturbing the data or the existing tagging. Just add a column:

+--------------+-----------+--------------+ | attribuut_id | attribuut | attribuut_en | +--------------+-----------+--------------+ | 1 | infl | infl | | 2 | pos | pos | | 3 | casus | case | | 4 | pers | pers | | 5 | getal | number | | 6 | genus | gender | | 7 | functie | function |

This has already been done for English.

slide-18
SLIDE 18

18/19

flexibility (continued)

  • it is possible to add as many annotations/taggings to a single

word as you want, without disturbing the existing tagging. Just add another table with the word id as a foreign key. Examples: lemmatizing is done in a separate table:

+----------+----------+ | woord_id | lemma_id | +----------+----------+ | 12345 | 125 | +----------+----------+

There is a separate table with tags according to the CGN (Corpus Gesproken Nederlands) system:

+----------+-----------+ | woord_id | cgntag_id | +----------+-----------+ | 12345 | T314 | +----------+-----------+

slide-19
SLIDE 19

19/19

Long-term archiving For long-term archiving, a binary file format (even an open format like MySQL’s database files) is probably not a good idea. Fortunately, creating a textual representation from which the database can be rebuilt is easy, with the mysqldump command line tool:

% mysqldump sand > sand.sql

This gives a series of SQL statements to recreate the original database. Or if you would rather use XML as an archiving format:

% mysqldump sand --xml > sand.xml

This will dump the database as well-formed XML.