Archive for December 2009

 
 

New ANNEX and TROVA user interfaces

by Thomas Koller

ANNEX is a web-based annotation exploration tool to display and play back annotated resource bundles (incl. video, audio and annotated text) stored on local or remote language archive servers. TROVA is a web-based search tool to search for simple or complex annotations (incl. regular expressions) on resources residing on local or remote language archive servers.
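To give a rough idea of what a regular-expression search over annotations involves, here is a minimal Python sketch. It is not TROVA's implementation: the `Annotation` record, its field names and the `search_annotations` function are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record; the field names are assumptions."""
    tier: str       # name of the tier the annotation belongs to
    start_ms: int   # start time in milliseconds
    end_ms: int     # end time in milliseconds
    text: str       # annotation value


def search_annotations(annotations, pattern, tier=None):
    """Return annotations whose text matches the regular expression.

    If `tier` is given, only annotations on that tier are considered.
    """
    regex = re.compile(pattern)
    return [
        a for a in annotations
        if (tier is None or a.tier == tier) and regex.search(a.text)
    ]


# Invented sample data, not taken from any archive.
SAMPLE = [
    Annotation("words", 0, 400, "the dog"),
    Annotation("words", 400, 900, "the dogs bark"),
    Annotation("gloss", 0, 400, "DET dog"),
]

hits = search_annotations(SAMPLE, r"\bdogs?\b", tier="words")
```

Restricting the search to one tier corresponds loosely to a single layer search; a real tool would also have to resolve which resources on which servers to search.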

In autumn 2008 we started to redesign the user interfaces for ANNEX and TROVA to make them more usable and to allow more functionality to be added at later stages. We decided to use a programming technology called Flex for the redesign. A web application developed with Flex runs in the Flash Player browser plugin, which is already installed in the vast majority of web browsers because popular web sites such as YouTube rely on it.

Changing the Order of Tiers

Using the Flash Player as the delivery technology for ANNEX and TROVA has a number of advantages for ANNEX users. First of all, video-based resource bundles can now be played back not only on Windows and Mac computers but also on Linux systems. Linux support for audio-based resource bundles requires some additional changes on the server side and will be available in the near future.

With Flex-based ANNEX and TROVA, we can now provide a more homogeneous look & feel. HTML-based web pages often look somewhat different in different browsers because each browser renders HTML in its own way. This can also cause as yet unknown problems over time, as any single browser may change the way it renders HTML. Flex-/Flash Player-based applications, in contrast, look and work the same way across browsers and operating systems.

Trova Single Layer Search


Using Flash Player as the delivery technology also noticeably improves the ‘timeline’, ‘waveform’ and ‘combined’ data views in ANNEX. In the previous ANNEX version, these data views were displayed as static image files, which had two disadvantages. First, when a user wanted to change the order of tiers in the ‘timeline’ or ‘combined’ view, they had to select appropriate values in two dropdown lists located in another part of the ANNEX screen. In the new ANNEX version, the user can change the order of tiers directly via drag & drop. Second, the static data view images were not updated when the video or audio playhead reached the time displayed on the right margin of the current data view (i.e. playback went on but the data view stayed the same). In the new ANNEX version, the data view is updated automatically when the playhead reaches that time. The user is therefore always presented with an up-to-date data view, whatever the length of the resource being played back.
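The two behaviours described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Flex code used in ANNEX; the function names and the assumption that the data view always shows a fixed-length time window are ours.

```python
def visible_window(playhead_s, window_s):
    """Return (start, end) of the data-view page that contains the playhead.

    Assumes the view shows a fixed-length window of `window_s` seconds.
    When the playhead reaches the right margin (the end of the window),
    the next page is returned, so the view follows the playback.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    page = int(playhead_s // window_s)
    start = page * window_s
    return (start, start + window_s)


def move_tier(tiers, src, dst):
    """Return a new tier order with the tier at index `src` moved to `dst`.

    This mirrors what a drag & drop gesture has to do to the tier list.
    """
    reordered = list(tiers)
    tier = reordered.pop(src)
    reordered.insert(dst, tier)
    return reordered


# Hypothetical tier names, dragging the last tier to the top.
new_order = move_tier(["speaker", "gloss", "translation"], 2, 0)
```

With a 10-second window, a playhead at 9.9 s is still on the first page, while a playhead at exactly 10 s moves the view to the page from 10 s to 20 s.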

The new ANNEX version provides context-sensitive help for different parts of the graphical user interface (such as the ‘Video display’ and ‘Media information’ panels). To access the help content for a panel, the user can either press the H key on their keyboard (this displays the help content for the panel below the mouse cursor) or drag the question mark (located at the top of the screen) onto a user interface panel. The help content for that panel is displayed as soon as the question mark has been dropped.

Annex Help Texts


Another important improvement in ANNEX and TROVA is the addition of a font chooser dropdown list. The user is now able to apply any font installed on their computer to the annotation text of the currently selected resource(s). This is particularly useful for the display of languages with uncommon fonts. A newly selected font will immediately be applied to currently displayed annotation text without having to reload ANNEX or TROVA.

Soon, there will also be standalone and embedded versions of ANNEX available. The embedded version will make it possible to embed a smaller ANNEX version with a preselected resource in any web page (similar to the way YouTube videos can be embedded in other web pages). This YouTube-like feature helps resource authors to showcase their work without making readers leave the author's web site. Instead of pointing directly to the IMDI browser on a language archive server, authors can then describe their resource on their own web page in any way they like. Only freely available language resources can be used with the embedded ANNEX version.

Choosing Fonts in Annex

A standalone desktop-based version of ANNEX will make it possible to use ANNEX when, for example, an Internet connection is not available. This can be useful while travelling on planes or trains, or as a fallback strategy for the presentation of language resources at workshops or conferences. It can also be useful for working with data-rich language resource bundles which could otherwise prove too demanding for proper display in ANNEX when used over the web.

The REPLIX Project

by Willem Elbers

The REPLIX project is studying and implementing the next level of grid-based replication and synchronization at a logical level by using iRODS. REPLIX is a joint project between DEISA, represented by Rechenzentrum Garching, and CLARIN and DOBES, both represented by the MPI for Psycholinguistics.

Goals

The two main goals are data preservation and authenticity control:

1) When we talk about data preservation, we mean guaranteeing that future generations can access and use the data we are archiving now. This includes managing different copies of the data and associated metadata at different physical locations, which is called replication. Metadata in this context includes system metadata (such as file size, creation date, etc.), complex user metadata (anything defined by the user, including the relations defined by the user) and access restrictions (which user has access to which files and operations).

2) When we talk about authenticity control, we mean making sure that the information remains authentic. This applies not only to the data files but also to the metadata associated with them. Since the data and metadata are replicated, the authenticity of each copy needs to be controlled. Moreover, access to files is also part of authenticity control: only authorized editors should be able to edit the information and associated metadata.

Current Infrastructure


Illustration 1: Current Infrastructure

The current infrastructure takes care of replication at a physical level, using tools like rsync and the Andrew File System (AFS). At the moment this is similar to copying files from one location to another. For future use this approach is too limited, since replication places source collections in different contexts, which AFS or rsync cannot handle properly. To ensure the consistency of the collections, a new approach is needed.

To be able to identify the archived objects in a unique way, the MPI uses the Handle System. The Handle System creates persistent identifiers (PIDs) and associates them with file properties (such as a reliable checksum). The use of PIDs ensures that each archived object can be identified by a single identifier, now and in the future.
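As an illustration of how a PID with an associated checksum can be used to check a file, consider the following Python sketch. It does not use the Handle System API; the `PidRecord` structure, the example identifier and the choice of SHA-256 as the checksum are assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PidRecord:
    """Hypothetical record tying a persistent identifier to file properties."""
    pid: str        # persistent identifier of the archived object
    location: str   # where a copy of the object currently lives
    checksum: str   # checksum stored alongside the PID (SHA-256 assumed)


def sha256_hex(data: bytes) -> str:
    """Compute the checksum of the file content as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_record(record: PidRecord, data: bytes) -> bool:
    """True if the content still matches the checksum stored with the PID."""
    return sha256_hex(data) == record.checksum


# Invented identifier and path, for illustration only.
content = b"example annotation file content"
record = PidRecord(
    pid="hdl:example/0001",
    location="/archive/example.eaf",
    checksum=sha256_hex(content),
)
still_authentic = matches_record(record, content)
```

Because the checksum travels with the identifier rather than with a particular file path, the same check can be applied to any copy of the object, wherever it is stored.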

The current infrastructure consists of one central archive, located at the MPI in Nijmegen. The central archive is replicated (at the file level) to two large data centers, each managing two copies of the archive. Around the world, several satellite archives exist. Researchers use these satellite archives to ingest the information they collect. The first step in the current preservation process ensures proper ingestion into the central archive. The second step ensures proper ingestion into the two remote archives. This is shown in Illustration 1.

Future Infrastructure

REPLIX is researching ways to overcome the limitations of existing replication and authenticity control methods. More specifically, REPLIX is researching how we can use iRODS to create a solution in which information is ingested into the archive and replicated while ensuring the integrity and authenticity of the data and metadata. The solution should also take care of synchronizing the central archive to the backup data centers. This step also has to verify the integrity and authenticity of the data.


Illustration 2: iRODS Solution

Although it is being tested on the MPI infrastructure, the approach should generalize easily, so that any community can deposit information into a central zone and take advantage of the preservation facilities. Before achieving this, we will first explore iRODS in general. Then we will create a setup that synchronizes the central archive to one of the two data centers. The next step is to include both data centers, and finally the satellite centers have to be included.

iRODS is a storage grid which uses rules to enforce policies on the actions performed on the data inside the storage grid, or to execute policies at regular intervals. One of the policies could be replication inside the storage grid: as soon as a file is ingested into the storage grid, it is automatically replicated onto several storage resources (hard discs, tapes, …). Another policy could make sure the file remains authentic by checking the file hashes and repairing any damaged replica(s). It is also possible to create a connection between two or more storage grids. Each storage grid manages its own data collection, and policies can be created to synchronize information between the storage grids.
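The two example policies can be imitated with a small in-memory model. The Python sketch below does not use iRODS or its rule language; `ToyGrid`, its methods and the use of SHA-256 are assumptions chosen only to make the idea of "replicate on ingest" and "verify and repair" concrete.

```python
import hashlib


class ToyGrid:
    """In-memory stand-in for a storage grid with several storage resources."""

    def __init__(self, resource_names):
        # One dict per storage resource, mapping path -> file content.
        self.resources = {name: {} for name in resource_names}
        # Checksum recorded once, at ingest time.
        self.checksums = {}

    def ingest(self, path, data):
        """Policy 1: on ingest, record a checksum and replicate everywhere."""
        self.checksums[path] = hashlib.sha256(data).hexdigest()
        for store in self.resources.values():
            store[path] = data

    def verify_and_repair(self, path):
        """Policy 2: check every replica and repair damaged ones.

        Returns the names of the resources whose replica had to be repaired.
        """
        expected = self.checksums[path]
        good_copy = None
        damaged = []
        for name, store in self.resources.items():
            copy = store.get(path)
            if copy is not None and hashlib.sha256(copy).hexdigest() == expected:
                good_copy = copy
            else:
                damaged.append(name)
        if good_copy is None:
            raise RuntimeError(f"no intact replica left for {path}")
        for name in damaged:
            self.resources[name][path] = good_copy
        return damaged


grid = ToyGrid(["disk", "tape"])
grid.ingest("session1.wav", b"audio bytes")
grid.resources["tape"]["session1.wav"] = b"audio bytez"  # simulate damage
repaired = grid.verify_and_repair("session1.wav")
```

In a real storage grid the second policy would run at regular intervals rather than on demand, and the replicas would live on separate physical resources.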

Ideally, the iRODS policies should use the PID system to identify the information in all storage grids. Based on these identifiers and the associated information, such as a checksum, the policies can then perform synchronization between multiple (n) storage grids. This synchronization process will verify the integrity and authenticity of the synchronized data. This is shown in Illustration 2.
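A simplified view of such PID-based synchronization is to compare catalogues that map each PID to its checksum. The following Python sketch is an assumption about how this comparison could look, not part of iRODS or REPLIX; the function names and the example identifiers are hypothetical.

```python
def plan_sync(source, target):
    """Compare two {pid: checksum} catalogues.

    Returns the PIDs that are missing in the target ("copy") and the PIDs
    whose checksum in the target differs from the source ("repair").
    """
    to_copy = sorted(pid for pid in source if pid not in target)
    to_repair = sorted(
        pid for pid in source
        if pid in target and target[pid] != source[pid]
    )
    return {"copy": to_copy, "repair": to_repair}


def plan_sync_all(source, targets):
    """Plan synchronization from one source grid to n target grids."""
    return {name: plan_sync(source, catalogue)
            for name, catalogue in targets.items()}


# Invented catalogues; identifiers and checksums are placeholders.
central = {"hdl:example/1": "aa", "hdl:example/2": "bb", "hdl:example/3": "cc"}
backup_a = {"hdl:example/1": "aa", "hdl:example/2": "XX"}
backup_b = dict(central)

plan = plan_sync_all(central, {"backup_a": backup_a, "backup_b": backup_b})
```

Here `backup_a` would receive a copy of the third object and a repaired copy of the second, while `backup_b` is already in sync. The sketch treats the central catalogue as the reference; deciding which copy is authentic when checksums disagree is a policy question it does not address.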

ISO 639-3 as new standard within the MPI Archive

by Alexander König

As ISO 639-3, the ISO standard for language codes, is widely accepted nowadays, the MPI has decided to adopt these codes as the new standard for metadata stored in its archive. In the process of moving to this new standard, we have gone through all the metadata files currently stored in the archive and replaced older code schemes, like the two-letter ISO 639-1 variants, with their 639-3 equivalents. This is another step in harmonising the linguistic metadata in the archive to make it easier for researchers to use.
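The kind of replacement involved can be sketched in Python as follows. This is not the script that was run on the archive; the function name is hypothetical and the mapping table contains only four well-known pairs, whereas a real conversion would load the complete official code tables.

```python
# Tiny excerpt of the ISO 639-1 -> ISO 639-3 mapping, for illustration only.
ISO639_1_TO_3 = {
    "en": "eng",  # English
    "de": "deu",  # German
    "nl": "nld",  # Dutch
    "fr": "fra",  # French
}


def to_iso639_3(code):
    """Convert a two-letter ISO 639-1 code to its ISO 639-3 equivalent.

    Three-letter codes are passed through unchanged; this sketch assumes
    they are already ISO 639-3 and does not validate them.
    """
    normalized = code.strip().lower()
    if len(normalized) == 3:
        return normalized
    try:
        return ISO639_1_TO_3[normalized]
    except KeyError:
        raise ValueError(f"no ISO 639-3 mapping known for {code!r}") from None


converted = [to_iso639_3(c) for c in ["nl", "EN", "deu"]]
```

Raising an error for unknown codes, instead of guessing, keeps unmapped metadata values visible so that they can be checked by hand.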

A Letter from the Editor

by Alexander König

Welcome to the brand new LAT News! On this site we, the Technical Group of the Max Planck Institute for Psycholinguistics, are going to write about all the things we are doing in Language Archiving Technology. The topics will range from the archiving of (endangered) languages through language documentation and data and media management to computer tools, e.g. the LAT tools we are developing at the MPI.

We deliberately chose the form of a blog because of its interactive nature. In the beginning we will probably still tinker with the form of the articles a bit until we find the type of writing that best suits you, our readers. To fine-tune this and to ensure that you are happy with what we are providing, we invite you to give us your feedback, either by commenting directly here on the blog or by sending us an email.

We also warmly welcome submissions of news, reviews, and articles from anyone working in the area of language technology, language documentation or archiving. Please contact us if you want to participate.

For those who prefer to read the articles in a newsletter style, we will create PDF versions three times a year, containing all the articles that have been published here on the blog during the preceding four months. For technical reasons these PDFs will only contain the articles and none of the comments. If you would like to subscribe to the PDF version, just send us an email stating “subscribe”. You can also subscribe to the blog by means of an RSS feed.