In the pipeline: LEXAN, an advanced Annotation Framework for ELAN

by Herman Stehouwer & Sebastian Drude

As all linguistic field workers know, transcribing and further annotating audio and video recordings and other texts is a very expensive and time-consuming procedure. For a single hour of recording of a lesser-documented language, it can take more than a hundred hours of expert time to create useful linguistic annotations such as “basic annotation” (a transcription and a translation) and “basic glossing”: additional information on individual units – usually morphs, sometimes words – such as an individual gloss (an indication of meaning or function) and perhaps categorical information such as part-of-speech tags (or their equivalents on the morphological level). More advanced glossing can take even longer.

Furthermore, information on the lexical units encountered in the texts needs to be transferred to a lexical tool. After all, one of the goals of field work is often to create a usable lexicon describing the endangered language.

Currently, this work is best supported by tools like (The Field Linguist’s) Toolbox or the FieldWorks Language Explorer (FLEx), neither of which properly supports media files. Many users have asked for support for advanced annotation tasks in ELAN, ideally using LEXUS to build, access and expand a lexical database. Making this possible is the objective of TLA’s newest project, called LEXAN, a modular annotation support framework coupled to a new interface in ELAN. It will support different “annotyzers”, i.e. modules that produce annotation suggestions for the researcher, including machine-learning modules.

The “annotyzers” will work on a tier or set of tiers, the “source tier[s]”, as chosen by the user, and typically produce an additional tier or a group of tiers, the “target tier[s]”, with content generated based on the source tiers and additional data, e.g. lexical data.

A first annotyzer-like functionality of ELAN (without requiring interaction with a lexicon yet) would be the possibility to copy one entire source tier, for instance a detailed transcript, or a literal translation. The created target tier can then serve as a starting point for preparing another tier with similar but edited content, for instance a cleaner adapted version of the orthographic transcript, or an idiomatic free translation.

Similarly, a basic tokenizer would copy the individual words (recognized by spaces and perhaps hyphens or similar punctuation) on one source tier – containing an orthographical representation of a sentence – into separate annotation units on a new (target) word-tier which can then be corrected (e.g., cells can be joined in the case of composed words such as black board, or on the contrary split in the case of clitics which may orthographically be parts of more comprehensive words).
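To make the idea concrete, here is a minimal sketch of such a tokenizer in Python. It is not LEXAN code: the function names `tokenize` and `join_cells` and the representation of a tier as a plain list of strings are assumptions made purely for illustration.

```python
import re


def tokenize(sentence, split_hyphens=True):
    """Split an orthographic sentence into word tokens.

    Words are separated by whitespace; hyphens optionally count as
    separators too. Surrounding punctuation is stripped from each token.
    """
    pattern = r"[\s\-]+" if split_hyphens else r"\s+"
    tokens = []
    for raw in re.split(pattern, sentence.strip()):
        word = raw.strip(".,;:!?\"'()")
        if word:
            tokens.append(word)
    return tokens


def join_cells(tokens, index):
    """Manual correction: merge the token at `index` with its right neighbour."""
    merged = tokens[index] + " " + tokens[index + 1]
    return tokens[:index] + [merged] + tokens[index + 2:]


# Hypothetical source-tier annotation, tokenized into a word tier.
words = tokenize("The teacher cleaned the black board.")
print(words)
# The researcher then joins "black" and "board" into one cell.
print(join_cells(words, 4))
```

Splitting a cell (for clitics) would be the mirror image of `join_cells`; in both cases the tokenizer only proposes a first segmentation, and the corrections remain with the researcher.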

As a possible next step, already making use of interaction with a lexicon, an annotyzer would use the annotations on the word-tier to build an “intermediate” database of individual inflected word forms. Each entry in this database would have at least a field which contains the citation form of the lexical word for each given inflected word form, possibly together with a semantic label (lexical gloss) and a disambiguating homonym index in case that two lexical words with identical citation forms exist. Some of these fields would be obtained from the lexicon once the citation form has been determined, and the citation form itself and other information (such as a “complete gloss” of the inflected word form which includes semantic effects of inflectional categories and the like) could be written back to new target tiers in ELAN. Although much of this information would still have to be added by hand the first time an inflected word form occurs, this simple setting would already help to: a) create lexical entries for new lexical units, b) reduce writing when the form occurs a second, third etc. time, and c) encourage and support consistency.
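The following Python sketch illustrates how such an intermediate database might behave. The class and field names (`WordFormEntry`, `IntermediateDatabase`, `annotate`) are hypothetical and chosen only for this example; the actual data model of LEXAN is not described here.

```python
from dataclasses import dataclass


@dataclass
class WordFormEntry:
    """One entry per inflected word form (field names are assumptions)."""
    citation_form: str
    lexical_gloss: str = ""
    homonym_index: int = 0
    complete_gloss: str = ""


class IntermediateDatabase:
    def __init__(self):
        self._entries = {}

    def store(self, word_form, entry):
        """Remember the analysis entered by hand for a word form."""
        self._entries[word_form.lower()] = entry

    def lookup(self, word_form):
        return self._entries.get(word_form.lower())

    def annotate(self, word_tier):
        """Fill target tiers from known forms and list forms still to be done."""
        citation_tier, gloss_tier, unknown = [], [], []
        for form in word_tier:
            entry = self.lookup(form)
            if entry is None:
                citation_tier.append("")
                gloss_tier.append("")
                if form not in unknown:
                    unknown.append(form)
            else:
                citation_tier.append(entry.citation_form)
                gloss_tier.append(entry.complete_gloss)
        return citation_tier, gloss_tier, unknown


db = IntermediateDatabase()
db.store("walked", WordFormEntry("walk", "move on foot", 1, "walk-PST"))
citations, glosses, todo = db.annotate(["She", "walked", "and", "walked"])
print(citations, glosses, todo)
```

Once "walked" has been analysed by hand, every later occurrence is filled in automatically, which is where the savings in writing and the gain in consistency come from.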

Many users acquainted with Toolbox or FLEx would expect the future LEXAN to offer a “glossing” functionality like the one they know from these tools. This would include a parser module (generic or language-specific, based on pure string matching or more advanced and context-sensitive, static or with learning capacities, etc.) which would split up the individual inflected word forms on a source word-tier into individual morphs on a new target morph-tier. This morph-tier would then serve as a source for adding further target tiers with annotations such as glosses (indication of lexical meaning or functional/categorical effects) and perhaps part-of-speech-like tags (on the morpheme level). In the lexicon, this functionality would presuppose corresponding fields in all entries, such as a part-of-speech label for each morph and a gloss, which are probably the most common fields in lexical databases in field research anyway (in addition to the citation and variant forms of the morph and possibly a way to distinguish different but related senses which are given as lexicographical definitions or translation equivalents). Again, correct parses and glosses would be stored in the intermediate database so that they can be re-used and referred to.
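As an illustration of the simplest variant mentioned above, a pure string-matching parser, consider the following Python sketch. The toy lexicon, its English entries and the functions `parse` and `gloss` are invented for this example and do not reflect how Toolbox, FLEx or LEXAN are implemented.

```python
# Toy lexicon: morph -> (gloss, part-of-speech label). Suffixes carry a hyphen.
LEXICON = {
    "walk": ("walk", "v"),
    "house": ("house", "n"),
    "-ed": ("PST", "v.sfx"),
    "-s": ("PL", "n.sfx"),
    "-ing": ("PROG", "v.sfx"),
}


def _parse_suffixes(rest, lexicon):
    """Segment the remainder of a word into known suffixes, longest first."""
    if not rest:
        return []
    for cut in range(len(rest), 0, -1):
        key = "-" + rest[:cut]
        if key in lexicon:
            tail = _parse_suffixes(rest[cut:], lexicon)
            if tail is not None:
                return [key] + tail
    return None


def parse(word, lexicon=LEXICON):
    """Return the morphs of `word` (stem plus suffixes), or None if no parse."""
    for cut in range(len(word), 0, -1):  # try the longest stem first
        stem, rest = word[:cut], word[cut:]
        if stem not in lexicon:
            continue
        suffixes = _parse_suffixes(rest, lexicon)
        if suffixes is not None:
            return [stem] + suffixes
    return None


def gloss(morphs, lexicon=LEXICON):
    """Look up the gloss of each morph for a target gloss tier."""
    return [lexicon[m][0] for m in morphs]


morphs = parse("walked")
print(morphs, gloss(morphs))
```

A parser of this kind has no notion of context or ambiguity; forms it cannot segment (here `parse` returns `None`) would be left for the researcher to analyse by hand.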

It is a well-known fact that general parsers work better for some languages and less well for others (for instance, morphological parsers usually score high with predominantly isolating and agglutinative languages and less well with inflectional and polysynthetic languages). It is also true that glossing schemes and set-ups are based on specific types of linguistic theories – for instance, the setting presented above (which corresponds to the default functionalities of Toolbox and FLEx) is clearly tied to an “item-and-arrangement” (less so “item-and-process”) reasoning on language structure. In principle, an infrastructure such as the one proposed here should strive to be as interoperable with different linguistic theories as possible, which would imply that “word-and-paradigm” theories could also fruitfully use the tools and functionalities. The proposal of an “intermediate” database with one entry for every individual (inflected) word form goes in that direction, allowing, for instance, forms to be characterized with respect to their functional categories without assigning these categories to individual morphs. Of course, to be fully functional for arbitrary theories and language types, complex (multiple-word) forms must also be covered, which presupposes the development of modules (parsers and the like) that recognize syntactic structures and that are able to cope with, say, discontinuous word forms.

More sophisticated and complete annotations on the morphological, syntactic and even other levels (phonetic/phonological, intonational) can be added by additional annotyzers as corresponding modules become available – for instance, morphological or syntactic constituent structures or grammatical relations could be generated (semi)automatically and represented in corresponding tiers in ELAN.

 

 


Figure: A schematic view of the architecture of LEXAN

Summary of the 2011 CLARA Summer School

by Przemek Lenkiewicz

The CLARA Summer School on Infrastructure Tool Development took place at the Max Planck Institute for Psycholinguistics from 5 to 12 July.

Participants came from several institutions, including the University of Bielefeld, the Technical University of Aachen, Gießen University and the Technical School of Mittelhessen. Some members of the Max Planck staff also participated in parts of the summer school, especially those requiring less technical expertise. Together they formed a very inspiring and productive group that managed to carry out the tasks planned for the event and also came up with some new ideas for useful developments, which were also carried out during the summer school.

On the first day Przemek Lenkiewicz opened the summer school and introduced participants to the agenda and all extra activities. Participants were also encouraged to present themselves and their work, giving an idea of how they use ELAN and what they were hoping to learn at this event.

Later Han Sloetjes, the main developer of ELAN, presented the annotation tool and introduced its mechanisms for creating and integrating extensions (recognizers). Some users said that although they had used ELAN for quite a long time, they had not been aware that its functionality can be extended, or that doing so is so simple. Han spent the whole day with the participants to clear up any doubts they might have. He also joined on the following days and participated in the development sessions.

Stefano Masneri with participants

Days 2-4 of the event were about signal processing techniques. Stefano Masneri of Fraunhofer HHI Berlin and Dr. Rolf Bardeli of Fraunhofer IAIS Sankt Augustin introduced the participants to the basics of video and audio processing. In the afternoon hands-on sessions participants developed some simple video/audio processing algorithms, such as histogram calculations for both audio and video, color-to-greyscale conversion and image flipping. More advanced functionality was developed as well, such as detecting a person’s hand in a video using an edge detector as the base, or detecting fricatives in a speech recording using thresholding.
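For readers who did not attend, the simple exercises can be sketched in a few lines of Python. This is not the code written at the summer school: the image representation (rows of RGB tuples), the function names and the luminance weights (the widely used 0.299/0.587/0.114 weighting) are assumptions for illustration.

```python
def to_greyscale(image):
    """Convert rows of (r, g, b) tuples (0..255) to rows of grey values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def histogram(grey, bins=4):
    """Count grey values per bin; bins split the 0..255 range evenly."""
    counts = [0] * bins
    width = 256 / bins
    for row in grey:
        for value in row:
            counts[min(int(value // width), bins - 1)] += 1
    return counts


# A tiny 2x2 test image: black, white, red, blue.
image = [[(0, 0, 0), (255, 255, 255)], [(255, 0, 0), (0, 0, 255)]]
grey = to_greyscale(image)
print(grey)
print(flip_horizontal(grey))
print(histogram(grey))
```

The more advanced tasks build on the same kind of per-pixel or per-sample operations: an edge detector compares neighbouring grey values, and thresholding marks the samples or frames whose measured value exceeds a chosen limit.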

The last two days of the summer school were led by Przemek Lenkiewicz and Eric Auer. In a brainstorming session with the participants we defined two recognizers that they were interested in developing: one for automatically importing eye-tracking data into ELAN and representing it as annotations and curves, and one for comparing two tiers based on the similarity of their annotations. Both recognizers were successfully developed by the end of the summer school.
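How a tier comparison could work is sketched below in Python. The similarity measure actually used by the recognizer is not described in this report; the sketch assumes annotations given as `(start_ms, end_ms, text)` tuples, pairs them by time overlap, and scores the texts with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def overlap(a, b):
    """Length in ms of the time overlap between two annotations."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def compare_tiers(tier_a, tier_b):
    """For each annotation on tier A, find the best-overlapping annotation
    on tier B and report the similarity of their texts (0.0 to 1.0)."""
    results = []
    for ann in tier_a:
        best = max(tier_b, key=lambda other: overlap(ann, other), default=None)
        if best is None or overlap(ann, best) == 0:
            results.append((ann[2], None, 0.0))
        else:
            ratio = SequenceMatcher(None, ann[2], best[2]).ratio()
            results.append((ann[2], best[2], ratio))
    return results


# Two hypothetical transcriptions of the same recording.
tier_a = [(0, 1000, "hello world"), (1000, 2000, "good morning")]
tier_b = [(0, 900, "hello word"), (5000, 6000, "unrelated")]
for row in compare_tiers(tier_a, tier_b):
    print(row)
```

Such a comparison could, for example, help to spot where two annotators disagree about the same stretch of a recording.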

Przemek Lenkiewicz and Eric Auer

Since the summer school included the weekend, the group met and explored Nijmegen for a while. On Monday July 11th we also had dinner together in a nice Dutch restaurant.

Additional pictures from the event can be found on this web page.

After the event the participants filled in a survey and rated the summer school very highly for its content, the way it was delivered and its overall organization. Given the good feedback, another Summer School on Infrastructure Tool Development might take place at Max Planck in summer 2012. Anyone interested in participating should contact Przemek Lenkiewicz.




Introduction of a new transcription mode in ELAN 4.1.0

by Aarthy Somasundaram & Han Sloetjes

In this new release of ELAN a completely new “Transcription Mode” and an improved “Segmentation Mode” are introduced. Both have been developed in close cooperation with ELAN users.
The Transcription Mode is built for high-speed transcription. Whereas the traditional Annotation mode can be seen as accuracy-oriented rather than productivity-oriented, the Transcription mode aims at increasing the speed and efficiency of transcription work. The user interface has been designed with convenient text entry in mind: the main element is a table containing the annotations of selected tier types, displayed in vertical order. Each cell in the table represents an annotation (or a position where a depending annotation can be created). The segments (annotations) need to be created first, in the Segmentation or Annotation mode, after which text can be typed into the (empty) segments in this mode. Operation in this mode is very much keyboard-oriented. Selecting an annotation plays the corresponding segment automatically and brings it into edit mode, ready for you to start typing. Press the TAB key to replay. After editing, hit ENTER (or use the navigation keys) to jump to the next annotation, which plays that segment automatically and lets you start typing right away, and so on. Activating a cell silently creates child annotations if they don’t exist yet: merely clicking an empty cell (or moving there using the keyboard) creates an annotation and opens it for editing. All this reduces transcription work to just listening and typing, making it easy for the transcriber.

Figure 1: The Transcription Mode

 

“On-the-fly Segmentation” has been moved into the main window as the new Segmentation mode (instead of in a separate dialog). It is now easier to switch between tiers while the media is playing. Segments are created by keyboard strokes and can be modified by dragging with the mouse. This mode introduces a preliminary step-and-repeat playback mode.

Apart from that, some new multiple file processing functions have been added, such as annotations from overlaps and annotation statistics. An option to add a group of tiers for a new participant has been implemented, as has an option to delete multiple tiers in one action. Customization of the program has been improved by the introduction of new preference elements.

The new version can be downloaded at the ELAN web site where you will also find the updated manual, detailing how to use the new modes and other new functionalities.

Semantic interoperability of linguistic resources now and in the future

by Menzo Windhouwer

Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Thanks to the curation efforts of archive managers, primary data such as audio and video recordings can remain accessible in that future. For a lexicon or a grammatical description, however, curation is not so easy. The semantics of the terminology used by the creators of these resources can drift, i.e., the terms might come to have a (slightly) different meaning. Future users may therefore have a hard time interpreting the resource correctly, or may even come to wrong conclusions based on wrong assumptions. A possible solution is to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, is taking that route.

ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Each data category becomes uniquely identifiable by a so-called persistent identifier. As the name of this identifier indicates, data categories in this registry are meant to stay around for a very long time. Future researchers should thus be able to take a resource from an archive and resolve these identifiers to get to the semantic descriptions of the data categories used in the resource. These descriptions should then help the researcher to interpret the resource.

However, adding data category identifiers to resources can already help us now. Because data categories can be reused by various resources, they provide hints on which resources are semantically close to each other, i.e., they can help researchers to find more interesting resources based on semantic closeness. In this way, islands of resources that use domain-specific or application-specific terminology can be connected, as the specification allows various terms to be declared for the same data category.
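The principle can be illustrated with a small Python sketch. The resource names, the local terms and the identifiers (`DC-EXAMPLE-1` and so on) are placeholders invented for this example; they are not real ISOcat persistent identifiers, and the function `semantic_links` is not part of any ISOcat tool.

```python
from itertools import combinations

# Each resource maps its own local terminology to a data category identifier.
# All identifiers below are hypothetical placeholders.
resources = {
    "lexicon_A": {"PoS": "DC-EXAMPLE-1", "gloss": "DC-EXAMPLE-2"},
    "corpus_B": {"word class": "DC-EXAMPLE-1"},
    "grammar_C": {"tone": "DC-EXAMPLE-3"},
}


def semantic_links(resources):
    """Return, for each pair of resources, the data categories they share."""
    links = {}
    for (name_a, terms_a), (name_b, terms_b) in combinations(
        sorted(resources.items()), 2
    ):
        shared = set(terms_a.values()) & set(terms_b.values())
        if shared:
            links[(name_a, name_b)] = sorted(shared)
    return links


print(semantic_links(resources))
```

Although "PoS" and "word class" are different terms, both resources point to the same data category, so the lexicon and the corpus are recognized as semantically related, while the third resource remains unconnected.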

ISOcat is the Data Category Registry for the ISO Technical Committee 37, which develops many standards for linguistic resources. Standards like the Lexical Markup Framework (LMF; ISO 24613:2008) and the Linguistic Annotation Framework (LAF; ISO/DIS 24612), which is still in preparation, rely on the use of data categories taken from this registry to turn an abstract model into a model that is actually useful for a specific resource (type). The ISO committee is working towards sets of standardized data categories for various domains, e.g., metadata and morphosyntax. This work is reflected in ISOcat as publicly accessible Thematic Views. However, every linguist can actually create her own data categories, share them with others and offer them for standardization. This grassroots approach aims at providing a standardized core useful for a broad range of linguists, as well as reusable data categories for, and maintained by, specific groups of linguists.

Tools provided by The Language Archive are starting to interact with ISOcat. In ELAN, items in a controlled vocabulary can be taken from ISOcat. LEXUS, which allows the construction of LMF-compliant lexica, can interact with ISOcat to select data categories to actually instantiate the abstract LMF data model. The Component Registry allows elementary elements and values in component metadata to link to ISOcat data categories. While these are just first steps and more will be needed, the ultimate goal is to support the semantic interoperability of linguistic resources, and thus research, now and in the (far) future.

Some news from the AVATecH project

by Przemek Lenkiewicz

The AVATecH project is an interesting initiative of the Max Planck Gesellschaft and Fraunhofer Gesellschaft. It aims at developing solutions that allow the automated annotation of media recorded by linguistic researchers. Such functionality is highly desired, and expectations are correspondingly high.

The project has recently passed two very important milestones. The first was the AVATecH Expert Workshop, which took place in November. For two days the project participants interacted with each other and with potential users of their solutions, presenting the status of the development and integration of their work and gathering feedback and further suggestions from the linguists. Experts from different fields (audio/video processing, gesture and sign language research, field research) were also present to see the status of the work and to get an idea of what may soon be available for their purposes. Naturally they contributed numerous valuable comments.

After the status of the work had been presented and suggestions gathered, all project participants continued working on their solutions, and a second milestone was reached: delivering the first automated annotation functionality to the ELAN tool and making it available to Max Planck researchers. This functionality covers the following initial possibilities:

  • The audio part aims at providing functionality that is needed in a major part of annotation work: detecting how many persons are speaking in the audio recording and creating the appropriate number of tiers; detecting who is speaking when and creating annotations for the corresponding parts of the recording; and aligning the recording with a transcription from a text file.
  • The video part provides the following functionality: detecting shots and subshots in the recording; creating representative keyframes for the given shots and subshots; estimating the color ranges that represent human skin in the recording; and tracing the position of the hands and head of the speaker. Further functionality will be built on top of the last-mentioned recognizer: the positions of the hands and head, together with time information, will serve to estimate the speed of hand movement, the relation of the hands to each other and to the speaker’s body, etc.
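As a rough illustration of the last point, the Python sketch below estimates hand speed from a sequence of tracked positions. The input format (time in milliseconds plus x/y pixel coordinates) and the function name `hand_speeds` are assumptions for this example; the output format of the actual AVATecH recognizers is not described here.

```python
import math


def hand_speeds(track):
    """Estimate speed between consecutive tracked positions.

    `track` is a list of (time_ms, x, y) tuples; the result is a list of
    (time_ms, speed) tuples with speed in pixels per second.
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = (t1 - t0) / 1000.0
        if dt <= 0:
            continue  # skip duplicate or out-of-order time stamps
        distance = math.hypot(x1 - x0, y1 - y0)
        speeds.append((t1, distance / dt))
    return speeds


# Hypothetical track: the hand moves 5 pixels, then rests.
track = [(0, 0, 0), (500, 3, 4), (1000, 3, 4)]
print(hand_speeds(track))
```

The same positions could be used to compute the distance between the two hands, or between a hand and the head, at each point in time.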

The MPI team is currently working on integrating these features with ELAN and providing manuals for researchers on how to use them.

New release of ELAN – Version 4.0.0

by Aarthy Somasundaram

Toward the end of last year a new version of ELAN was released, containing lots of new features and improved functionalities, a new media player solution for Windows and fixes for a number of issues and bugs in previous versions.

A first implementation of interaction with LEXUS, the MPI-developed web-based lexicon tool for creating and editing lexical databases, has been added. A new lexicon viewer allows the user to look up values in an online lexicon and to apply a value to the selected annotation.

ELAN has been facing many codec-related problems, especially with MPEG-1 and MPEG-2 files. To eliminate some of them, a new player for Windows has been developed based on DirectShow (JDS, Java-Direct Show).
To use this player, it is necessary to select it first in the Platform/OS tab in the “Edit Preferences” window.

This version extends its support for controlled vocabularies with externally defined closed controlled vocabularies (located e.g. on the web). The list of supported file formats for importing controlled vocabularies has been extended with .txt and .csv. The file format of externally defined closed controlled vocabulary files is .ecv, which is close to the EAF format.

To make life easier and to increase the work speed of ELAN users, several improvements have been made to get things done with fewer steps and clicks.  A few tier-based operations, like removing multiple annotations or annotation values from selected tiers or creating depending annotations recursively on all depending tiers, can be performed much faster and with more ease of use. Now it is also possible to automatically create depending annotations, when an annotation is created on a tier with dependent tiers. The merge transcriptions function is extended with options for appending one file to the other, making the merging process more versatile.

Further support for audio and video recognizers, as developed in e.g. the AVATecH Project, has been implemented. To learn more about this project, visit the AVATecH website.

You can download the new version at the ELAN web site where you will also find the updated manual detailing how to use the new functionalities.

ANNEX and ELAN – A Comparison

by Thomas Koller and Han Sloetjes

ANNEX and ELAN are two closely related applications designed for handling digital media files and associated annotation files. While ELAN, as a desktop application, is used for the creation of rich annotations on audio and video recordings, ANNEX is a web-based viewer that allows users to study annotated resources once they have been properly stored on the archive server.

This short article aims at highlighting on the one hand what features they have in common and on the other hand what features are unique to each tool.

ELAN is a local tool (desktop application) for the creation of annotations on audio and/or video recordings. It is a combination of a media player with viewer and editor components for annotations. The annotation documents are stored in the XML-based ELAN Annotation Format (EAF). ELAN is written in the Java programming language and is available for Windows, Mac OS X and Linux. On Windows and Mac the media playback is delegated to an available high-performance native media framework: DirectX/DirectShow on Windows and QuickTime on Mac. On Linux JMF is used. The list of supported file types depends on the available media player frameworks.
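Because EAF is XML-based, annotation documents can also be read by other programs. The Python sketch below reads time-aligned annotations from a heavily simplified EAF-like document using the standard library. The element and attribute names are assumed from the EAF schema and should be checked against the official format description; the sample document omits the header, linguistic types, reference annotations and everything else a real file contains.

```python
import xml.etree.ElementTree as ET

# Heavily simplified sample; a real EAF file contains much more.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1250"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcript">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_text):
    """Return {tier id: [(start_ms, end_ms, text), ...]} for aligned annotations."""
    root = ET.fromstring(eaf_text)
    # Map time slot ids to millisecond values (slots without a value are skipped).
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        annotations = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = times.get(ann.get("TIME_SLOT_REF1"))
            end = times.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            annotations.append((start, end, text))
        tiers[tier.get("TIER_ID")] = annotations
    return tiers


print(read_tiers(EAF_SAMPLE))
```

Note that the annotations refer to time slots rather than storing times directly, which is why the time slots are collected first.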

ELAN main window

Although there is limited support for streaming media via the RTSP protocol, most commonly the media files are accessed directly on a local hard drive or the local network. This guarantees high accuracy in media playback, especially in (repeated) playback of fragments of the media, which is usually a basic step in the process of segmenting the media. The annotation boundaries can be determined with millisecond precision. ELAN supports simultaneous, synchronized playback of up to 4 video files. The annotation documents are stored locally as well. The variant of the TROVA search engine that is distributed with ELAN can query the contents of physical directory structures. To that end it creates temporary in-memory indexes for the content of selected folders and files. The search is limited to EAF files. The ELAN window offers several customizable views on the annotation data, all synchronized with the media player. All viewers are editors at the same time. Many operations are provided for manipulating tiers and annotations.

ANNEX is written as an ELAN-compliant browser-based tool (web application) that supports media playback via HTTP pseudostreaming and the Flash Player browser plugin. For freely accessible language resources ANNEX can also be embedded in any web page by pasting a simple HTML snippet into the page (comparable to the way YouTube supports embedding of videos into web pages). Alongside the media player it contains several customizable viewer components for annotations. By default both the media files and the annotation files are streamed from the MPI online archive; there is no need to download files in order to view their contents. ANNEX is seamlessly integrated with the archive access management tools and interacts with available web services, for example the ones exposed by the lexicon tool LEXUS. Other tools in turn can make parameterized calls to ANNEX.

ANNEX works with the online version of TROVA, which creates an index for a whole LAMUS archive using the Postgres database system. This version of TROVA supports not only EAF but also Shoebox, CHAT and generic XML, HTML and text files.

Comparison Matrix
Feature | ANNEX | ELAN
Number of synchronized videos | 1 | 4
Media file types | MPG for video files, WAV for audio files | Depending on the media framework of the particular platform
Waveform for audio | .wav only | .wav only
Media playback precision | Depends on keyframe rate | milliseconds
Streaming media support | Pseudostreaming for audio and video files | Limited, via rtsp
Annotation formats | EAF, Toolbox/Shoebox, Chat. Will be converted to a single XML format for transfer. | EAF, import of Toolbox, Chat, Praat, Transcriber, CSV
Annotation editing | No | Yes
Number of tiers | Unlimited | Unlimited
Font usage | Any font available on the system | Any font available on the system supported by Java
Search options | TROVA search engine, search in entire (accessible part of) archive | Single file search and multiple file search (TROVA) in local corpus
Technology | Flash, XML, Quicktime (temporarily for resources with master audio file) | Java, XML
Tool interaction, API | Support for parameterized calls to ANNEX | Extension mechanism for particular parts of the application