RELISH workshop on lexicon standards and lexicon tools

by Jacquelijn Ringersma

On August 4 and 5, the RELISH project organized a workshop on lexicon standards and lexicon tools at the MPI in Nijmegen. The workshop brought together field linguists and NLP experts to discuss the approaches, standards, tools and interoperability of lexical resources. The aim of the workshop was to create a shared understanding of the requirements for lexicon tools and, where possible, to design concrete steps towards further harmonization.

In the RELISH project (Rendering Endangered Languages lexicons Interoperable through Standards Harmonization), funded by NEH and DFG, the MPI works together with the University of Frankfurt and Eastern Michigan University. The project aims to unify two major collections of digitized lexicons of endangered languages in order to create a searchable virtual archive.

In the workshop, there were presentations from field linguists and from members of the NLP community. The presentations showed that there is some difference in focus and approach. Where the field linguist aims at a content-rich resource which can be used both for research purposes and for dissemination to the speech community, NLP searches for an infrastructure covering “all” language resources and tools. As a logical result, standardization and interoperability seem to be more important for the NLP community, although certainly not irrelevant for the field linguist. However, the information sharing on the subject of standards and interoperability was felt to be very useful by both ‘parties’.

In the workshop there were also presentations on LMF and ISOcat (the ISO standards for lexical resources) and LIFT and GOLD (the USA standards for lexical resources). The presentations and interactions showed that on both sides of the Atlantic interesting moves have been made towards standardization and that the difference between the two does not seem to be as wide as the mentioned ocean.

In the final 6 months of the RELISH project the parties involved will work on bridging the gap between LMF/ISOcat and LIFT/GOLD and develop an interchange format. Since RELISH brings together organizations that have been instrumental in promoting both endangered languages documentation and standards-development in Europe and the US, the success of RELISH will provide impetus for other standards-harmonization efforts, as well as offer the scientific research community integrated access to important new digital materials.

Presentations of the workshop are available from the Event page on the MPI website.

Get your data archived!

by Jacquelijn Ringersma and Paul Trilsbeek

Language documentation is a field in linguistics which went through a “technology driven” change over the last 10 to 15 years. Linguists have been going into the field for decades making sound recordings of languages and linguistic events. However, the miniaturization of recording equipment made it much easier to make large quantities of high quality audio recordings. In addition, the arrival of affordable, high quality video equipment permitted an extension of documentation work from audio to the visual dimension. The latter made it possible to document the languages within their natural and cultural context, which triggered the establishment of a branch within linguistics where the creation of a rich multimedia corpus for languages that are threatened with extinction became the main goal. In addition to collecting large amounts of primary audio and video recordings, numerous derived resources are produced: annotations and transcriptions, lexica, grammars, field notes etc.

The DoBeS (Dokumentation Bedrohter Sprachen/Documentation of Endangered Languages) programme, which started about 10 years ago, was among the first funding initiatives for endangered languages documentation projects. An important aspect of this programme was the establishment of a central, specialized archive to take care of long-term preservation of the valuable material that was collected by the documentation projects. The central archive, which is based at the Max Planck Institute for Psycholinguistics, was made an essential part of the programme because it had become clear that large amounts of recordings about languages and cultures were in danger of being lost forever. Old tapes and films that are not stored in specialized climatized rooms rapidly degrade over time, but the situation is even worse for modern digital storage media such as DVDs and hard disks. Even if the media survive, the technology changes so fast that it is very unlikely that there will be equipment around to read today’s storage media 20 years from now. A specialized digital archive will continuously migrate the stored material to the latest storage technology and will also migrate the stored file formats should they become obsolete.

Some researchers have their doubts about storing their resources in an online archive. The arguments presented to us take the form of: (1) Once my material is in there, I will not be able to get it out; or (2) Other researchers will use my material without giving me the credit and do all kinds of nice things with it. However, when you store material in the MPI archive, you maintain full control over access to the data through an online access management system (AMS). You are the owner of the data, and you will remain the owner of the data. You decide to whom you grant access. This opens up opportunities to give access to members of the speech communities or the relatives of those recorded.

The MPI archive accepts deposits from linguists who do not have an affiliation with the MPI or DoBeS. Storing your data in the MPI archive has the advantage that the data is stored in an organized manner and that you can use online tools to search through your data. You can also use online tools to visualize your data in an attractive manner. But most important, we will safeguard your data by making various backup copies in the Netherlands and Germany, by always using the latest state of the art in storage technology and by migrating to newer file formats should the current ones become obsolete in the future.

If you are interested in storing your language data in the MPI archive, please inquire about the conditions with one of the archive managers: Paul Trilsbeek or Jacquelijn Ringersma.

The International CLARA Summer School

by Thomas Koller

The Max Planck Institute for Psycholinguistics is proud to offer an international CLARA summer school on “Advanced Resource Creation, Archiving and Usage” in Nijmegen (Netherlands). The summer school topics will be taught by experienced external specialists and MPI experts. It will take place at the Institute from July 5th to July 16th, 2010.

The summer school is part of the European CLARA project (Common Language Resources and their Applications). CLARA is a Marie Curie Initial Training Network which aims to offer early-stage researchers the opportunity to improve their research skills, to join established research teams and to enhance their career prospects.

Participating in this summer school will allow young researchers to get a deep understanding of modern methodologies and technologies to create, archive and use sharable language resources. The aim is to train young researchers in how to use modern technology to create language resources, in particular when the source material consists of multimedia streams. Additionally they will learn how the resulting complex resource types can be archived, how they can be accessed and analyzed via state-of-the-art (web) applications and how they can be enriched.

The CLARA summer school has already attracted a varied and interesting group of young researchers and is fully booked.

More information on the CLARA summer school can be found at the MPI website.

Embeddable Annex

by Thomas Koller

The MPI developers recently made a new Annex feature available which allows users to embed a smaller-sized customised version of Annex into any web page. This new feature has since then been warmly welcomed by researchers inside and outside our institute as it is a great way to easily show research results to outsiders.

The embeddable version of Annex only supports access to freely accessible annotation resource bundles, i.e. resource bundles which can be accessed from the IMDI browser without user login. This restriction helps to avoid authentication issues and effectively protects resources with restricted access.

This new feature can be accessed directly from Annex by clicking on embed in the menu. Then a small dialog pops up where the user can customise the HTML snippet before copying it to the clipboard and pasting it into a webpage. This works pretty much the same way as the similar YouTube feature which users may already be familiar with. The following options are available:

  • Show border around embedded Annex application: the creator can select a border width, a border color and a border type (solid, dotted or dashed)
  • Size of the embedded Annex application: 4 predefined sizes are available. The user can also set a custom size directly in the HTML markup. It should be noted, however, that the layout and component sizes of the embedded Annex application have been optimised for the 4 predefined sizes, so a custom size set in the HTML snippet can lead to a suboptimal-looking Annex instance.
  • Default view: text or subtitle. Any other default view (such as timeline or grid) will be ignored and the ‘text’ view will be used instead.
  • Tier text font: This setting may be helpful if the user wants the embedded Annex to display an annotation resource with special characters which may not be contained in a standard font on the user’s computer. If the ‘Tier text font’ parameter is set with a font name which is not available on the user’s computer, then the embedded Annex application will automatically fall back to a standard font. The end user also has the option to change the tier text font and the font size at any time via a dropdown list.

The embedded Annex application has a Start Full ANNEX button in its top right corner. When the end user clicks this button, a new browser tab will open the full Annex version showing the same annotation resource.
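
The embed options listed above can be illustrated with a small Python sketch that assembles such a snippet. This is not the markup that Annex itself generates: the iframe form, the example URL and the query parameter names `view` and `font` are all assumptions made for illustration. Only the behaviour comes from the description above, namely the three border types and the fallback to the ‘text’ view.

```python
from html import escape
from urllib.parse import urlencode

# Views accepted as a default according to the description above;
# anything else falls back to "text".
SUPPORTED_VIEWS = {"text", "subtitle"}
BORDER_TYPES = {"solid", "dotted", "dashed"}


def build_embed_snippet(viewer_url, width, height, view="text",
                        border_width=0, border_color="#000000",
                        border_type="solid", tier_font=None):
    """Return an HTML snippet for a hypothetical embeddable viewer."""
    if view not in SUPPORTED_VIEWS:
        view = "text"  # e.g. "timeline" or "grid" are ignored
    if border_type not in BORDER_TYPES:
        raise ValueError(f"unsupported border type: {border_type}")

    # Hypothetical query parameter names, chosen for this sketch only.
    params = {"view": view}
    if tier_font:
        params["font"] = tier_font
    src = f"{viewer_url}?{urlencode(params)}"

    style = f"border: {border_width}px {border_type} {border_color};"
    return (f'<iframe src="{escape(src, quote=True)}" '
            f'width="{width}" height="{height}" '
            f'style="{escape(style, quote=True)}"></iframe>')


if __name__ == "__main__":
    print(build_embed_snippet("https://example.org/viewer", 640, 480,
                              view="grid", border_width=2,
                              border_type="dashed"))
```

Treating an unsupported view as ‘text’ rather than raising an error mirrors the forgiving behaviour described for the default view option.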

The CLARIN-NL metadata tutorial

by Dieter van Uytvanck

On Friday May 27, about 25 people gathered at the Max Planck Institute in Nijmegen to attend a workshop on the practical use of the Component Metadata Infrastructure (CMDI) for the description of language resources. CMDI is the metadata part of CLARIN, a European initiative to create a Common Language Resources Infrastructure.

After a short introduction to metadata in general and a historical sketch, the concepts behind CMDI were introduced. The core ideas behind the new metadata format are modularity, reusability, and the use of data categories. A special session was dedicated to the use of ISOcat, the reference implementation of a data category registry. The idea behind this is to have a dependable definition of what is meant by a data category such as Part of Speech. This way it doesn’t matter what you call it or how you spell it in your particular metadata schema: the connection to similar schemata is always clear.
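
The role of a shared data category can be sketched in a few lines of Python. The category identifiers and the two schemata below are invented for illustration and are not real ISOcat entries. The point is only that two differently named fields become comparable once both point to the same category.

```python
# Hypothetical data category identifiers (not real registry entries).
PART_OF_SPEECH = "DC-partOfSpeech"
LEMMA = "DC-lemma"

# Two hypothetical schemata that name their fields differently
# but link each field to a shared data category.
SCHEMA_A = {"PoS": PART_OF_SPEECH, "Lemma": LEMMA}
SCHEMA_B = {"part-of-speech": PART_OF_SPEECH, "headword": LEMMA}


def translate_field(field, source_schema, target_schema):
    """Find the field in target_schema that shares a category with field."""
    category = source_schema[field]
    for name, target_category in target_schema.items():
        if target_category == category:
            return name
    return None  # the target schema has no field for this category


if __name__ == "__main__":
    print(translate_field("PoS", SCHEMA_A, SCHEMA_B))
```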

After these more general introductions, the specific CMDI software was presented.

First the Component Registry was shown. It is a web application that can be used for inspecting, searching, creating and editing CMDI metadata components. Afterwards it was illustrated how to create CMDI metadata files using a version of Arbil that has been modified to interact directly with the Component Registry. Both Arbil and the Component Registry are developed by the Max Planck Institute for Psycholinguistics and were presented by their respective developers. Although both applications are still under development, it was clear that they can already be used for the production of CMDI metadata.

All slides of the presentations can be downloaded from the CLARIN NL website.

More information about CMDI, including links to the software so you can try it out yourself, can be found on the main CLARIN site.

ANNEX and ELAN – A Comparison

by Thomas Koller and Han Sloetjes

ANNEX and ELAN are two closely related applications designed for handling digital media files and associated annotation files. While ELAN, as a desktop application, is used for the creation of rich annotations on audio and video recordings, ANNEX is a web-based viewer which allows users to study annotated resources once they have been properly stored on the archive server.

This short article aims at highlighting on the one hand what features they have in common and on the other hand what features are unique to each tool.

ELAN is a local tool (desktop application) for the creation of annotations to audio and/or video recordings. It is a combination of a media player with viewer and editor components for annotations. The annotation documents are stored in the XML-based ELAN Annotation Format (EAF). ELAN is written in the Java programming language and is available for Windows, Mac OS X and Linux. On Windows and Mac the media playback is delegated to an available high performance native media framework: DirectX/DirectShow on Windows and QuickTime on Mac. On Linux JMF is used. The list of supported file types depends on the available media player frameworks.
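
Because EAF is XML, annotation documents can be read with any standard XML library. The following Python sketch parses a heavily simplified EAF-like fragment. Real EAF files contain more elements (header, linguistic types and so on), and the element and attribute names used here should be checked against the published EAF schema before relying on them.

```python
import xml.etree.ElementTree as ET

# A heavily simplified EAF-like document, written for this sketch.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1250"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="2480"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>world</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_annotations(eaf_text):
    """Return (tier id, start ms, end ms, text) for each aligned annotation."""
    root = ET.fromstring(eaf_text)
    # Time slots map symbolic ids to media times in milliseconds.
    slots = {slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
             for slot in root.iter("TIME_SLOT")}
    annotations = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            annotations.append((
                tier.get("TIER_ID"),
                slots[ann.get("TIME_SLOT_REF1")],
                slots[ann.get("TIME_SLOT_REF2")],
                ann.findtext("ANNOTATION_VALUE", default=""),
            ))
    return annotations


if __name__ == "__main__":
    for row in read_annotations(EAF_SAMPLE):
        print(row)
```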

ELAN main window

Although there is limited support for streaming media via the RTSP protocol, most commonly the media files are accessed directly on a local hard drive or the local network. This guarantees high accuracy in media playback, especially in (repeated) playback of fragments of the media, which is usually a basic step in the process of segmenting the media. The annotation boundaries can be determined with millisecond precision. ELAN supports simultaneous, synchronized playback of up to 4 video files. The annotation documents are stored locally as well. The variant of the TROVA search engine that is distributed with ELAN can query the contents of physical directory structures. To that end it creates temporary in-memory indexes for the content of selected folders and files. The search is limited to EAF files. The ELAN window offers several customizable views on the annotation data, all synchronized with the media player. All viewers are editors at the same time. Many operations are provided for manipulating tiers and annotations.
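
The idea of a temporary in-memory index over the annotations of selected files can be sketched as follows. This is not how TROVA is implemented. It is a minimal inverted index in Python, assuming the annotations have already been read from the files into (tier, text) pairs, and the file names are made up.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted index: lower-cased word -> set of (file, tier).

    documents maps a file name to a list of (tier, annotation text) pairs.
    """
    index = defaultdict(set)
    for filename, annotations in documents.items():
        for tier, text in annotations:
            for token in text.lower().split():
                index[token].add((filename, tier))
    return index


def search(index, word):
    """Return the sorted (file, tier) pairs in which the word occurs."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    # Hypothetical file names and annotation content.
    docs = {
        "session1.eaf": [("utterance", "the dog barks"), ("gloss", "dog")],
        "session2.eaf": [("utterance", "a dog sleeps")],
    }
    print(search(build_index(docs), "Dog"))
```

The index lives only in memory and is rebuilt for each selection of files, which matches the temporary nature of the indexes described above.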

ANNEX is written as an ELAN-compliant browser-based tool (web application) that supports media playback via HTTP pseudostreaming and the Flash Player browser plugin. For freely accessible language resources ANNEX can also be embedded in any web page by pasting a simple HTML snippet into the page (comparable to the way YouTube supports embedding of videos into web pages). Alongside the media player it contains several customizable viewer components for annotations. By default both the media files and the annotation files are streamed from the MPI online archive; there is no need for downloading files in order to be able to view their contents. ANNEX is seamlessly integrated with the archive access management tools and interacts with available web services, for example the ones exposed by the lexicon tool LEXUS. Other tools in turn can make parameterized calls to ANNEX.

ANNEX works with the online version of TROVA, which creates an index for a whole LAMUS archive using the Postgres database system. This version of TROVA supports not only EAF but also Shoebox, CHAT and generic XML, HTML and text files.

Comparison Matrix
Feature | ANNEX | ELAN
Number of synchronized videos | 1 | 4
Media file types | MPG for video files, WAV for audio files | Depending on the media framework of the particular platform
Waveform for audio | .wav only | .wav only
Media playback precision | Depends on keyframe rate | Milliseconds
Streaming media support | Pseudostreaming for audio and video files | Limited, via RTSP
Annotation formats | EAF, Toolbox/Shoebox, CHAT. Will be converted to a single XML format for transfer. | EAF, import of Toolbox, CHAT, Praat, Transcriber, CSV
Annotation editing | No | Yes
Number of tiers | Unlimited | Unlimited
Font usage | Any font available on the system | Any font available on the system supported by Java
Search options | TROVA search engine, search in entire (accessible part of) archive | Single file search and multiple file search (TROVA) in local corpus
Technology | Flash, XML, QuickTime (temporarily for resources with master audio file) | Java, XML
Tool interaction, API | Support for parameterized calls to ANNEX | Extension mechanism for particular parts of the application

The Language Archive (TLA)

by Peter Wittenburg & Wolfgang Klein

The digital era changed a few characteristics of data management fundamentally. For highly persistent carriers such as the old clay tablets or even some papyrus rolls, it was obvious that they survived thousands of years and still contain the information the creators wanted to convey. For the analogue electro-magnetic storage media introduced during the last century, it soon became obvious that the lifetime of carriers is very limited, and we realized that every copying activity entailed a decrease in quality. As a consequence, a UNESCO survey found that about 80% of the material on cultures and languages in the ethno-linguistic domain is highly endangered. It was good practice to store master tapes in air-conditioned Faraday cages. For most of the recordings, however, this was impossible, and the implicit “don’t touch” policy created a logistical problem aside from the cost aspect, since the old players were no longer around after a few years.

The digital era in turn changed the challenges again, since (a) copying is comparatively easy and, if done carefully, does not lead to a quality decrease and (b) the stored material needs to be touched regularly as a matter of principle, in order to migrate the carriers, the formats and the encodings and so maintain interpretability. Digital holdings are inherently dynamic and need a 2-tier framework for life-cycle management: (1) data centers that take care of bit-stream preservation and (2) community centers that know about format and encoding principles. The worldwide debate about the loss of our scientific and cultural memory gives an impression of the urgency of the lifecycle management problem.

This was the background for the Max Planck Society and the MPI for Psycholinguistics to establish a new unit with the name “The Language Archive (TLA)” to take care of the long-term preservation of the huge treasure which is enclosed in its large digital archive and which has been created in a wide range of initiatives and sub-disciplines. As prominent examples we would like to mention the resources about language studies from MPI researchers, the archive about endangered languages created by the DoBeS programme and the digital human-ethological archive from Eibl-Eibesfeldt. A plan has been submitted for 25 years of persistence of such a unit, which is to offer the necessary services of a digital archive such as deposit, access, searching, visualization and preservation and, beyond these, to look after a number of critical characteristics such as integrity, authenticity, usability, discoverability and interoperability. TLA will carry out this task in collaboration with the two big computer centres of the Max Planck Society, which will focus on bit-stream preservation and in future also on giving access to the material following the agreed principles.

To fulfill its mission TLA will have archiving experts who know about metadata, formats, standards and encoding principles as they are used in our domain and who can deploy curation strategies, software experts who can maintain the existing code base and develop new functionality and system experts who will interact with the storage system managers of the MPI and the computer centres to take care of the bit-stream preservation and proper security. We see digital archiving with its many facets as a networking task as well, i.e. we will participate in relevant collaborations to be able to apply state-of-the-art methods. One such network is the worldwide network of regional centres for language material which will be supported in the future.

Thanks to the proven Language Archiving Technology (LAT) software suite, which has been developed over the last decades, the archive can be open to any serious language and cultural material that is relevant to researchers. Based on open legal and ethical rules, material can be deposited and accessed via the web using a variety of tools. The archive will continue to participate in national and European projects to maintain the existing software and to provide new advanced functionality, and to establish professional research infrastructures that will improve data lifecycle management and the access to language and cultural material.

TLA will formally start its operation on 1 September 2010, led by Wolfgang Klein and Peter Wittenburg.

LEXUS and ViCoS: a software ‘couple’ in the LAT suite

by Jacquelijn Ringersma

LEXUS is our online tool for the creation of multimedia lexica and encyclopedic dictionaries. LEXUS is targeted at linguists involved in language documentation, but is also actively used by researchers in Sign Language research. LEXUS is based on the ISO recommendation for Language Resource Management (ISO TC37/SC4), providing a Lexical Markup Framework (LMF) lexicon structure and a concept naming registry (ISOcat). With LEXUS, users can create lexica from scratch, but also import lexica created in Toolbox or other XML-based tools. Lexica using LMF and ISOcat are interoperable with each other, allowing for multi-lexicon searches and merging of lexica. Users may customize views of the word list and lexical entries. Standard functionality, like sorting or filtering of word lists, is already available and we are currently working on paper output options. One of the major strengths of the online tool is that users may share their lexica with other users, either on a read-only or read/write basis.

ViCoS is an extension of LEXUS, with which users can create relations between lexical entries, using fuzzily defined relation types. The result of this network of relations can be a conceptual space, where each word is represented as an element in a network of other related words. Relations can be ‘universal’ (e.g. A_is_a_B) or specifically defined for a particular lexicon (A_eats_B). In its current version ViCoS can only be used from the LEXUS user interface, since the words are the basis of the conceptual space. Future plans for ViCoS envisage that the tool will be central in the creation of a customized ‘eScience environment’, a user-defined workspace where users can link any type of resource into new organizational layers.
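
A conceptual space of this kind is essentially a graph with typed edges. The Python sketch below is not the ViCoS data model. It only illustrates, with made-up words and relation names, how a ‘universal’ relation such as is_a and a lexicon-specific relation such as eats can coexist in one network.

```python
from collections import defaultdict


class ConceptualSpace:
    """A minimal network of words connected by named relations."""

    def __init__(self):
        # word -> list of (relation type, related word)
        self._edges = defaultdict(list)

    def relate(self, source, relation, target):
        """Record that source stands in the given relation to target."""
        self._edges[source].append((relation, target))

    def related(self, word, relation=None):
        """Return words related to word, optionally for one relation type."""
        return [target for rel, target in self._edges.get(word, [])
                if relation is None or rel == relation]


if __name__ == "__main__":
    space = ConceptualSpace()
    space.relate("dog", "is_a", "animal")   # 'universal' relation
    space.relate("dog", "eats", "bone")     # lexicon-specific relation
    print(space.related("dog"))
    print(space.related("dog", relation="eats"))
```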

LEXUS and ViCoS training and support

Recently we did a LEXUS/ViCoS training session in the Winter School Saami Language Documentation and Revitalization in Bodø, Norway. Some 25 participants were trained in creating lexica, adding multimedia fragments, customizing views and creating conceptual spaces. Although the training was basic and could not cover the full functionality of LEXUS and ViCoS, most users were enthusiastic about the tools and registered as LEXUS users after the training.

At the Saami Winter School (photo by Lena Karvovskaya)

If you are interested in using the tools, you may request a LEXUS user account by sending an e-mail to Jacquelijn Ringersma. We have regular LEXUS and ViCoS training in the DoBeS training weeks, or in summer schools and language documentation workshops.

Archiving workshop in India

by Jacquelijn Ringersma

From February 5 to February 8, there was a workshop on documentation and archiving in Guwahati, Assam (India). 22 participants were trained in the recording of audio and video, handling of audio and video files, and use of the LAT software. Two members of the MPI’s technical group were among the workshop trainers.

Participants trying out the video equipment

The archiving workshop was organised by DoBeS, in collaboration with Guwahati University and the Phonogrammarchiv (Austria). Its purpose was to train local linguists in best practices and current methods of documenting languages and cultures. The workshop was financed by the Volkswagen Foundation, within the framework of the DoBeS project: The Traditional Songs and Poetry of Upper Assam.

Strengthen local capacity

The project aims at multifaceted linguistic and ethnographic documentation of the Tangsa, Tai and Singpho communities in Margherita (North-East India). The Guwahati workshop contributes to the project by strengthening the local capacity. Among the trainees were students and PhDs of Guwahati University and staff members of the National Folklore Support Centre (NFSC).

Arbil

by Peter Withers

Arbil (short for “Archive Builder”) is an application for arranging research material and associated metadata into a format appropriate for archiving. It is basically the successor to the IMDI Editor, which a lot of our readers probably know. This old tool is now almost ten years old and a lot has happened in software engineering in that time. Therefore, instead of simply updating the IMDI Editor to incorporate user suggestions and modern software architecture principles, the MPI has decided to create a completely new application that will replace the IMDI Editor altogether.

The most obvious difference to the old IMDI Editor is that Arbil has a tabular display, which allows a comparative view on the data and the ability to copy-paste between matching sets of data.

Arbil main screen

While Arbil is primarily a tool to enter metadata, it also has functions to help organise the collected material and create a local well-organised corpus before it is archived. These functions include the ability to search for and compare metadata, and to open the resource files in associated applications, e.g. ELAN for annotations or a media player to watch videos.

Arbil's embedded image viewer

Arbil has been designed as a local application so that it can also be used offline, for instance in remote field sites. The metadata and resource files can be entered in part or as a whole; once an Internet connection is available, the previously entered data and associated structures can be exported from Arbil and then transferred to the main archive via LAMUS. As the idea behind Arbil is to deal with a complete corpus or corpus branch, Arbil can mirror branches from the main archive on your local computer so that they can still be referred to offline and in the field.

Unlike the IMDI Editor, Arbil incorporates the functionality of a number of separate tools while being designed around the workflow of the user. This means the application is meant to be the one program around which all your archiving work will be centered. In Arbil the metadata is viewed in tables, which can contain a single node of metadata as a list of fields, or many different nodes each with its fields as a separate row in the table. This tabular view of the data allows multiple metadata nodes to be compared across the rows of the table. If the metadata node is editable then these fields can be edited in any table in which they are viewed. Drag and drop is used extensively both for constructing a hierarchical corpus tree structure and for adding nodes to tables for viewing and editing. Bulk editing of metadata can be done primarily via copy and paste, which allows a string of text to be pasted into multiple fields of multiple rows, or to paste multiple fields into the matching fields of multiple rows.
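
The tabular view and the bulk paste described above can be sketched in Python as follows. This is an illustration of the idea rather than Arbil’s implementation, and the field names are invented. Each metadata node becomes one row, and the columns are the union of the fields that occur in the nodes.

```python
def build_table(nodes):
    """Turn metadata nodes (dicts of field -> value) into columns and rows."""
    columns = []
    for node in nodes:
        for field in node:
            if field not in columns:
                columns.append(field)  # keep first-seen order
    rows = [[node.get(column, "") for column in columns] for node in nodes]
    return columns, rows


def paste_value(nodes, field, value):
    """Bulk edit: paste one value into the same field of several nodes."""
    for node in nodes:
        node[field] = value


if __name__ == "__main__":
    # Hypothetical metadata nodes.
    sessions = [
        {"Name": "session1", "Country": "India"},
        {"Name": "session2", "Genre": "song"},
    ]
    paste_value(sessions, "Country", "India")
    columns, rows = build_table(sessions)
    print(columns)
    for row in rows:
        print(row)
```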

Tabular view of multiple IMDI files

The basic user interface of Arbil is highly customisable. For example, you can decide which columns of a table should be visible and save various table layouts to quickly switch between different views to accommodate different tasks. Furthermore, columns can be resized and reordered, and the table can be sorted on any column. Rows can easily be added and then dragged from one table to another, and the cells can be highlighted based on matching text.

With all the various metadata fields it is not always easy to see at a glance what a particular field is intended for and which fields are specifically required. For this reason a description of the intended usage for each field is displayed in a tool tip. Similarly, an indication of which fields are required is given by a textual and colour highlight when the data is not filled in. Likewise, in the case of fields requiring specific formatting, such as date fields, the cell and the table will be highlighted when the formatting is incorrect.
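
The two checks described above, required fields and field formatting, can be sketched in Python. The field names and the YYYY-MM-DD date pattern are assumptions for illustration and are not taken from the IMDI specification or from Arbil’s code.

```python
from datetime import datetime

# Hypothetical field lists, chosen for this sketch only.
REQUIRED_FIELDS = ("Name", "Date")
DATE_FIELDS = ("Date",)


def validate(node):
    """Return a dict of field -> problem for one metadata node."""
    problems = {}
    for field in REQUIRED_FIELDS:
        if not node.get(field, "").strip():
            problems[field] = "required field is empty"
    for field in DATE_FIELDS:
        value = node.get(field, "").strip()
        if value:
            try:
                # Assumed date layout: year-month-day.
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems[field] = "expected a date formatted as YYYY-MM-DD"
    return problems


if __name__ == "__main__":
    print(validate({"Name": "session1", "Date": "2010-07-05"}))
    print(validate({"Name": "", "Date": "05/07/2010"}))
```

A user interface could use the returned dictionary to decide which cells to highlight.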

Arbil window style

In contrast to the IMDI Editor, Arbil is more of a team player, which means that there are various import and export facilities. Any valid IMDI files can be imported into Arbil. If the metadata refers to resource files, they can optionally be imported at the same time, for instance when migrating or merging from one computer to another. All of the IMDI data and the associated resources within Arbil can be exported into a self-contained directory. The exported files can then be uploaded into LAMUS, where any new corpus branches can be uploaded and existing sessions that have been edited can be replaced. During both the import and export processes, all of the metadata is validated and a list of warnings is given if there are any errors. The textual data entered into Arbil can be exported in formats other than IMDI; for instance, the contents of a table can be copied and pasted into a text editor or into a spreadsheet. A custom style-sheet can be used to transform and export the data in a particular format. The tables used in Arbil can also be embedded in a web page, where the table’s contents, size, columns and highlighting will be displayed in the resulting web page.

Arbil continues to be actively developed to extend these features further and to make the user experience as pleasant as possible. Why don’t you try it out yourself?