Archive for May 2011

 
 

Introduction of a new transcription mode in ELAN 4.1.0

by Aarthy Somasundaram & Han Sloetjes

In this new release of ELAN a completely new “Transcription Mode” and an improved “Segmentation Mode” are introduced. Both have been developed in close cooperation with ELAN users.
The Transcription Mode is built for high-speed transcription. Where the traditional Annotation Mode can be seen as accuracy-oriented rather than productivity-oriented, the Transcription Mode aims to increase the speed and efficiency of transcription work. The user interface has been designed with convenient text entry in mind: the main element is a table containing the annotations of selected tier types, displayed in vertical order. Each cell in the table represents an annotation (or a position where a dependent annotation can be created). The segments (annotations) need to be created first, in the Segmentation or Annotation Mode, after which text can be typed into the (empty) segments in this mode.
Operation in this mode is very much keyboard-oriented. Selecting an annotation plays the corresponding segment automatically and brings it into edit mode, ready for you to start typing. Press the TAB key to replay the segment. After editing, hit ENTER (or use the navigation keys) to jump to the next annotation, which again plays its segment automatically and lets you start typing right away, and so on. Activating a cell silently creates child annotations if they don’t exist yet: merely clicking an empty cell (or moving there with the keyboard) creates an annotation and opens it for editing. All this reduces transcription work to just listening and typing, making it easy for the transcriber.

Figure 1: The Transcription Mode

 

“On-the-fly Segmentation” has been moved out of its separate dialog and into the main window as the new Segmentation Mode. It is now easier to switch between tiers while the media is playing. Segments are created with keystrokes and can be modified by dragging with the mouse. This mode also introduces a preliminary step-and-repeat playback mode.

Apart from that, some new multiple-file processing functions have been added, such as annotations from overlaps and annotation statistics. New options allow you to add a group of tiers for a new participant and to delete multiple tiers in one action. Customization of the program has been improved by the introduction of new preference elements.

The new version can be downloaded from the ELAN web site, where you will also find the updated manual, detailing how to use the new modes and other new functionality.

Semantic interoperability of linguistic resources now and in the future

by Menzo Windhouwer

Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Primary data, like audio and video recordings, can remain accessible in that future through the curation efforts of archive managers. However, for a lexicon or a grammatical description curation is not so easy. The semantics of the terminology used by the creators of these resources may have drifted, i.e., the terms might now have a (slightly) different meaning. Future users could therefore have a hard time interpreting the resource correctly, or even draw wrong conclusions based on wrong assumptions. A possible solution is to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, is taking that route.

ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Each data category becomes uniquely identifiable by a so-called persistent identifier. As the name of this identifier indicates, data categories in this registry are meant to stay around for a very long time. Future researchers should thus be able to take a resource from an archive and resolve these identifiers to get to the semantic descriptions of the data categories used in the resource. These descriptions should then help them interpret the resource.

However, adding data category identifiers to resources can already help us now. Because data categories can be reused by various resources, they provide hints about which resources are semantically close together, i.e., they can help researchers find more interesting resources based on semantic closeness. In this way, islands of resources that use domain- or application-specific terminology can be connected, as the specification allows various terms to be declared for the same data category.
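
The idea of semantic closeness through shared data categories can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the resource names, the local terms and the identifiers such as "DC-0001" are hypothetical placeholders rather than real ISOcat persistent identifiers, and the Jaccard overlap used as a closeness score is just one simple choice, not a measure prescribed by ISOcat.

```python
from itertools import combinations

# Hypothetical resources: each maps its own local term to a data category
# identifier. The identifiers are placeholders, not real ISOcat PIDs.
resources = {
    "lexicon_a": {"pos": "DC-0001", "gender": "DC-0002", "number": "DC-0003"},
    "lexicon_b": {"partOfSpeech": "DC-0001", "genus": "DC-0002"},
    "corpus_c": {"speaker": "DC-0004"},
}


def shared_categories(a, b):
    """Return the data category identifiers used by both resources."""
    return set(a.values()) & set(b.values())


def closeness(a, b):
    """Jaccard overlap of the data categories of two resources (0.0 to 1.0)."""
    ids_a, ids_b = set(a.values()), set(b.values())
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


def terms_for_category(resources, category_id):
    """List the (resource, local term) pairs that refer to one data category."""
    return sorted(
        (name, term)
        for name, mapping in resources.items()
        for term, cat in mapping.items()
        if cat == category_id
    )


def rank_pairs(resources):
    """Rank all pairs of resources from most to least semantically close."""
    pairs = [
        (closeness(r1, r2), n1, n2)
        for (n1, r1), (n2, r2) in combinations(resources.items(), 2)
    ]
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    for score, n1, n2 in rank_pairs(resources):
        print(f"{n1} / {n2}: {score:.2f}")
    print(terms_for_category(resources, "DC-0001"))
```

In this sketch the two lexica use different local terms ("pos" and "partOfSpeech") but end up closest to each other because both terms point to the same data category, which is the kind of connection between terminology islands described above.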

ISOcat is the Data Category Registry for ISO Technical Committee 37, which develops many standards for linguistic resources. Standards like the Lexical Markup Framework (LMF; ISO 24613:2008) and the Linguistic Annotation Framework (LAF; ISO/DIS 24612), which is in preparation, rely on data categories taken from this registry to turn an abstract model into a model that is actually useful for a specific resource (type). The ISO committee is working towards sets of standardized data categories for various domains, e.g., metadata and morphosyntax. This work is reflected in ISOcat as publicly accessible Thematic Views. However, every linguist can actually create her own data categories, share them with others and offer them for standardization. This grassroots approach aims at providing a standardized core useful for a broad range of linguists, alongside reusable data categories for, and maintained by, specific groups of linguists.

Tools provided by The Language Archive are starting to interact with ISOcat. In ELAN, items in a controlled vocabulary can be taken from ISOcat. LEXUS, which allows the construction of LMF-compliant lexica, can interact with ISOcat to select the data categories that instantiate the abstract LMF data model. The Component Registry allows elementary elements and values in component metadata to link to ISOcat data categories. These are just first steps and more will be needed, but the ultimate goal is to support the semantic interoperability of linguistic resources, and thus research, now and in the (far) future.