In the pipeline: LEXAN, an advanced Annotation Framework for ELAN

by Herman Stehouwer & Sebastian Drude

As all linguistic field workers know, transcribing and further annotating audio and video recordings and other texts is a very expensive and time-consuming procedure. For a single hour of a recording of a lesser-documented language it can take more than a hundred hours of expert time to create useful linguistic annotations such as “basic annotation” (a transcription and a translation) and “basic glossing”: additional information on individual units (usually morphs, sometimes words) such as an individual gloss (indication of meaning or function) and perhaps categorical information such as part-of-speech tags (or their equivalents on the morphological level). More advanced glossing can take even longer.

Furthermore, information on the lexical units encountered in the texts needs to be transferred to a lexical tool. After all, one of the goals of field work is often to create a usable lexicon describing the endangered language.

Currently, this work is supported best by tools like (The Field Linguist’s) Toolbox or the FieldWorks Language Explorer (FLEx), neither of which offers proper support for media files. Many users have asked for support for advanced annotation tasks in ELAN, ideally using LEXUS to build, access and expand a lexical database. Making this possible is the objective of TLA’s newest project called LEXAN, a modular annotation support framework coupled to a new interface in ELAN. It will support different “annotyzers”, i.e. modules that produce annotation suggestions for the researcher, including machine-learning modules.

The “annotyzers” will work on a tier or set of tiers, the “source tier[s]”, as chosen by the user, and typically produce an additional tier or a group of tiers, the “target tier[s]”, with content generated based on the source tiers and additional data, e.g. lexical data.

A first annotyzer-like functionality of ELAN (without requiring interaction with a lexicon yet) would be the possibility to copy one entire source tier, for instance a detailed transcript, or a literal translation. The created target tier can then serve as a starting point for preparing another tier with similar but edited content, for instance a cleaner adapted version of the orthographic transcript, or an idiomatic free translation.

Similarly, a basic tokenizer would copy the individual words (recognized by spaces and perhaps hyphens or similar punctuation) on one source tier – containing an orthographical representation of a sentence – into separate annotation units on a new (target) word-tier which can then be corrected (e.g., cells can be joined in the case of composed words such as black board, or on the contrary split in the case of clitics which may orthographically be parts of more comprehensive words).
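
The basic tokenizer described above could be sketched roughly as follows. This is only an illustration and not LEXAN or ELAN code: the function names `tokenize_annotation` and `tokenize_tier`, the representation of a tier as a plain list of strings, and the choice of spaces and hyphens as the only separators are all assumptions made for the example.

```python
import re

# Assumption: words are runs of characters other than whitespace and hyphens.
# Real orthographies may need a different set of separators.
WORD_PATTERN = re.compile(r"[^\s\-]+")


def tokenize_annotation(text):
    """Split one sentence-level annotation into word-level units."""
    words = []
    for match in WORD_PATTERN.finditer(text):
        # Strip surrounding punctuation such as commas and full stops.
        word = match.group().strip(".,;:!?\"'()")
        if word:
            words.append(word)
    return words


def tokenize_tier(source_tier):
    """Map every annotation on the source tier to a list of word units."""
    return [tokenize_annotation(text) for text in source_tier]


# Hypothetical source tier with two orthographic sentences.
source_tier = ["The black board is old.", "She re-reads it, slowly."]
word_tier = tokenize_tier(source_tier)
```

In this sketch "black" and "board" end up in separate units, so joining them into one unit (or splitting off clitics) would remain a manual correction step, as described above.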

As a possible next step, already making use of interaction with a lexicon, an annotyzer would use the annotations on the word-tier to build an “intermediate” database of individual inflected word forms. Each entry in this database would have at least a field which contains the citation form of the lexical word for each given inflected word form, possibly together with a semantic label (lexical gloss) and a disambiguating homonym index in case two lexical words with identical citation forms exist. Some of these fields would be obtained from the lexicon once the citation form has been determined, and the citation form itself and other information (such as a “complete gloss” of the inflected word form which includes semantic effects of inflectional categories and the like) could be written back to new target tiers in ELAN. Although much of this information would still have to be added by hand the first time an inflected word form occurs, this simple setting would already help to: a) create lexical entries for new lexical units, b) reduce writing when the form occurs a second, third etc. time, and c) encourage and support consistency.
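
As a minimal sketch of the “intermediate” database idea, the following Python fragment stores one entry per inflected word form and re-uses it when the form occurs again. All names (`FormEntry`, `IntermediateDatabase`, `annotate`) and the English sample data are hypothetical; the sketch also assumes that each inflected form maps to exactly one entry, which real data with ambiguous forms would not satisfy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FormEntry:
    citation_form: str        # citation form of the lexical word
    lexical_gloss: str = ""   # semantic label taken from the lexicon
    homonym_index: int = 1    # distinguishes identical citation forms
    complete_gloss: str = ""  # gloss of the inflected form as a whole


class IntermediateDatabase:
    """Maps inflected word forms to the information entered for them."""

    def __init__(self) -> None:
        self._entries: Dict[str, FormEntry] = {}

    def store(self, word_form: str, entry: FormEntry) -> None:
        self._entries[word_form] = entry

    def suggest(self, word_form: str) -> Optional[FormEntry]:
        return self._entries.get(word_form)


def annotate(words: List[str], db: IntermediateDatabase) -> List[Optional[str]]:
    """Return a citation-form suggestion per word, or None if still unknown."""
    suggestions = []
    for word in words:
        entry = db.suggest(word)
        suggestions.append(entry.citation_form if entry else None)
    return suggestions


db = IntermediateDatabase()
# Entered by hand the first time the form occurs.
db.store("walked", FormEntry("walk", "move on foot", 1, "walk.PST"))
target_tier = annotate(["walked", "talked", "walked"], db)
```

Here the second occurrence of "walked" is filled in from the stored entry, while "talked" stays empty until the researcher supplies the information.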

Many users acquainted with Toolbox or FLEx would expect the future LEXAN to offer a “glossing” functionality like the one they know from these tools. This would include a parser module (generic or language-specific, based on pure string-matching or more advanced and context-sensitive, static or with learning capacities etc.) which would split up the individual inflected word forms on a source word-tier into individual morphs on a new target morph-tier. This morph-tier would then serve as a source for adding further target tiers with annotations such as glosses (indication of lexical meaning or functional/categorical effects) and perhaps part-of-speech-like tags (on the morpheme level). In the lexicon, this functionality would presuppose corresponding fields in all entries such as a part-of-speech label for each morph and a gloss, which are probably the most common fields in lexical databases in field research anyway (in addition to the citation and variant forms of the morph and possibly a way to distinguish different but related senses which are given as lexicographical definitions or translation equivalents). Again, correct parses and glosses would be stored in the intermediate database so that they can be re-used and referred to.
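
To illustrate the simplest case mentioned above, a pure string-matching parser, the sketch below enumerates every way of splitting a word form into morphs found in a small lexicon and derives a gloss line from each split. It is an assumption-laden toy: the `MORPHS` table, the function names and the English sample data are invented for the example, and no context, allomorphy or learning is involved.

```python
from typing import Dict, List

# Hypothetical toy morph lexicon: morph form -> gloss.
MORPHS: Dict[str, str] = {
    "walk": "walk",
    "ed": "PST",
    "s": "3SG",
    "un": "NEG",
    "lock": "lock",
}


def parse(word: str, morphs: Dict[str, str] = MORPHS) -> List[List[str]]:
    """Return every way to split `word` into morphs known to the lexicon."""
    if not word:
        return [[]]  # the empty string has exactly one parse: no morphs
    parses = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in morphs:
            # Combine this morph with every parse of the remainder.
            for rest in parse(word[end:], morphs):
                parses.append([prefix] + rest)
    return parses


def gloss(morph_sequence: List[str], morphs: Dict[str, str] = MORPHS) -> str:
    """Join the glosses of the morphs with hyphens."""
    return "-".join(morphs[m] for m in morph_sequence)


candidates = parse("unlocked")
glosses = [gloss(candidate) for candidate in candidates]
```

A word with several possible splits would yield several candidates, among which the researcher (or a more advanced module) would have to choose; a word with no possible split yields an empty list.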

It is a well-known fact that general parsers work better for some languages and less well for others (for instance, morphological parsers usually score high with predominantly isolating and agglutinative languages and less well with inflectional and polysynthetic languages). It is also true that glossing schemes and set-ups are based on specific types of linguistic theories; for instance, the setting presented above (which corresponds to the default functionalities of Toolbox and FLEx) is clearly tied to an “item-and-arrangement” (less so “item-and-process”) reasoning on language structure. In principle, an infrastructure such as the one proposed here should strive to be as interoperable with different linguistic theories as possible, which would imply that “word-and-paradigm” theories could also fruitfully use the tools and functionalities. The proposal of an “intermediate” database with one entry for every individual (inflected) word form goes in that direction, allowing, for instance, forms to be characterized with respect to their functional categories without assigning these categories to individual morphs. Of course, to be fully functional for arbitrary theories and language types, complex (multiple-word) forms must also be covered, which presupposes the development of modules (parsers and the like) that recognize syntactic structures and that are able to cope with, say, discontinuous word forms.

More sophisticated and complete annotations on the morphological, syntactic and even other levels (phonetic/phonological, intonational) can be added by additional annotyzers as corresponding modules become available – for instance, morphological or syntactic constituent structures or grammatical relations could be generated (semi)automatically and represented in corresponding tiers in ELAN.

 

 


Figure: A schematic view of the architecture of LEXAN

LEXUS on the TextGrid infrastructure – exploring new potentialities

by André Moreira

A new LEXUS interface is now part of the TextGrid Laboratory environment.

Since 2010, TLA, together with the Institut für Deutsche Sprache (IDS), has been developing the required technology to integrate LEXUS into the TextGrid Laboratory. Starting from the last TextGrid Laboratory Beta release on the 7th of December 2011, this work is now available to the public.

TextGrid is a joint research project, part of the D-Grid initiative, and is funded by the German Federal Ministry of Education and Research (BMBF). It aims to support access to and exchange of data in the arts and humanities by means of modern information technology (the grid).
TextGrid serves as a virtual research environment for philologists, linguists, musicologists and art historians. As a single point of entry to the virtual research environment, the TextGrid Laboratory provides integrated access to specialized tools, services and content.

The TextGrid Laboratory is a cross-platform, highly modular software application based on the Eclipse Rich Client Platform (RCP). The modularity is brought to the user via various available plug-ins, which can be installed to expand the Laboratory functionality. Each of these plug-ins is usually a different tool, and together they create “a single point of entry to the virtual research environment”. By default, the application bundles a set of plug-ins that are available right after installation, and LEXUS is now among them.


Figure 1 - TextGrid Laboratory architecture portrait

The LEXUS plug-in itself aims to emphasize the possibilities of using the LEXUS web service for the language resources commonly encountered in the TextGrid environment, as well as to demonstrate the usability of the TextGrid environment for non-standard languages such as those commonly found in LEXUS.
From the user's point of view it is a very simple plug-in which allows the user to search for occurrences of a certain word or character sequence in a lexical database. As in LEXUS, the search can be conducted in specific data categories of a lexicon and different search setups can be used, e.g. searching for all the lexical entries that start with a certain prefix (fig. 2).
When displaying the results, LEXUS presents the full structure and data of the lexical entry containing the match, as well as a custom HTML view for that specific entry. This HTML view can be customized by the data manager (lexicon owner) through the regular LEXUS interface, thus enabling very flexible control over the layout of the lexical entries.
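
The kind of search described above (a prefix search restricted to one data category) can be illustrated with a small local stand-in. This sketch does not call the LEXUS web service and says nothing about its actual interface; the function `search_prefix`, the dictionary layout of an entry, the data category names and the sample entries are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumption: an entry is a flat mapping from data category name to value.
LexicalEntry = Dict[str, str]


def search_prefix(lexicon: List[LexicalEntry], category: str, prefix: str) -> List[LexicalEntry]:
    """Return the entries whose value in `category` starts with `prefix`."""
    return [
        entry for entry in lexicon
        if entry.get(category, "").startswith(prefix)
    ]


# Made-up sample entries, loosely modelled on a syllabification lexicon.
lexicon = [
    {"lemma": "aufstehen", "syllabification": "auf-ste-hen"},
    {"lemma": "laufen", "syllabification": "lau-fen"},
    {"lemma": "Aufgabe", "syllabification": "Auf-ga-be"},
]
hits = search_prefix(lexicon, "lemma", "auf")
```

Note that this comparison is case-sensitive, so "Aufgabe" is not returned for the prefix "auf"; whether a real search should fold case is a design decision that the sketch leaves open.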


Figure 2 - TextGrid-LEXUS plug-in screenshot. Searching for the prefix 'auf' in the German syllabification lexicon.

Technically, the LEXUS-TextGrid implementation is divided into two main components: the TextGrid Laboratory plug-in, providing the user with a full SWT-based user interface, and a SOAP web service made available through the LEXUS back-end running on TLA servers.
Even though the web service was developed within the scope of TextGrid, it also allows other clients to interface with it, as already happens with the LEXUS plug-in for ELAN.

For the time being, every TextGrid Laboratory user will have two lexica available out of the box to search and explore. These are very simple lexica, which were made available for demonstration purposes: one contains the syllabification of most known German words, and the other contains a sample set of lexical entries from the endangered language Wichita.


Figure 3 - TextGrid-LEXUS plug-in screenshot. Searching the Wichita lexicon. Note the custom HTML layout on the right.

In the future we plan to extend the functionality made available by LEXUS in the TextGrid Laboratory, for instance by assigning to each user a private LEXUS workspace so that the user can also have private lexica, in addition to the lexica made available for every TextGrid Laboratory user.
Moreover, plans exist to further integrate ANNEX into the TextGrid Laboratory, enabling more TLA software functionality in the Laboratory workbench.

Semantic interoperability of linguistic resources now and in the future

by Menzo Windhouwer

Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Primary data, like audio and video recordings, can remain accessible in this future thanks to the curation efforts of the archive managers. However, for a lexicon or a grammatical description curation is not so easy. The semantics of the terminology used by the creators of these resources may have drifted, i.e., the terms might now have a (slightly) different meaning. Future users may therefore have a hard time interpreting the resource in the right way, or may even come to wrong conclusions based on wrong assumptions. A possible solution would be to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, is taking that route.

ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Each data category becomes uniquely identifiable by a so-called persistent identifier. As the name of this identifier indicates, data categories in this registry are meant to stay around for a very long time. Future researchers should thus be able to take a resource from an archive and resolve these identifiers to get to the semantic descriptions of the data categories used in the resource. These descriptions should then help these researchers to interpret the resource.

However, adding data category identifiers to resources can already help us now. Because data categories can be reused by various resources, they provide hints on which resources are semantically close together, i.e., they can help researchers to find more interesting resources based on semantic closeness. In these cases, islands of resources using domain- or application-specific terminology can be connected, as the specification allows the declaration of the use of various terms for the same data category.
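
One simple way to turn shared data categories into a hint about semantic closeness is to measure the overlap between the sets of data category identifiers used by two resources. The sketch below uses the Jaccard index for this; that measure is an assumption chosen for illustration and is not prescribed by ISOcat. The resource names and the `dc:` identifiers are hypothetical placeholders for real persistent identifiers.

```python
from typing import Dict, Set


def closeness(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets of data category identifiers."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical resources and identifiers; real resources would carry
# persistent identifiers issued by the registry.
resources: Dict[str, Set[str]] = {
    "lexicon_A": {"dc:partOfSpeech", "dc:gloss", "dc:lemma"},
    "lexicon_B": {"dc:partOfSpeech", "dc:gloss", "dc:gender"},
    "corpus_C": {"dc:speaker", "dc:recordingDate"},
}


def most_similar(name: str) -> str:
    """Return the other resource sharing the largest proportion of categories."""
    others = [other for other in resources if other != name]
    return max(others, key=lambda other: closeness(resources[name], resources[other]))


nearest = most_similar("lexicon_A")
```

Because the comparison works on identifiers rather than on local field names, two lexica that label the same data category with different terms would still be recognized as close.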

ISOcat is the Data Category Registry for the ISO Technical Committee 37, which develops many standards for linguistic resources. Standards like the Lexical Markup Framework (LMF; ISO 24613:2008) and the Linguistic Annotation Framework (LAF; ISO/DIS 24612), which is still in preparation, rely on the use of data categories taken from this registry to turn an abstract model into a model that is actually useful for a specific resource (type). The ISO committee is working towards sets of standardized data categories for various domains, e.g., metadata and morphosyntax. This work is reflected in ISOcat as publicly accessible Thematic Views. However, every linguist can actually create her own data categories, share them with others and offer them for standardization. This grassroots approach aims at providing a standardized core useful for a broad range of linguists, as well as reusable data categories for, and maintained by, specific groups of linguists.

Tools provided by The Language Archive are starting to interact with ISOcat. In ELAN, items in a controlled vocabulary can be taken from ISOcat. LEXUS, which allows the construction of LMF-compliant lexica, can interact with ISOcat to select data categories to actually instantiate the abstract LMF data model. The Component Registry allows elementary elements and values in component metadata to link to ISOcat data categories. While these are just first steps and more will be needed, the ultimate goal is that this will support the semantic interoperability of linguistic resources and thus research now and in the (far) future.

New release of ELAN – Version 4.0.0

by Aarthy Somasundaram

Toward the end of last year a new version of ELAN was released, containing many new features and improved functionalities, a new media player solution for Windows and fixes for a number of issues and bugs in previous versions.

A first implementation of interaction with LEXUS, the MPI-developed web-based lexicon tool for creating and editing lexical databases, has been added. A new lexicon viewer allows the user to look up values in an online lexicon and to apply a value to the selected annotation.

ELAN has been facing many codec-related problems, especially with MPEG-1 and MPEG-2 files. To eliminate some of them, a new player for Windows has been developed based on DirectShow (JDS, Java-DirectShow).
To use this player, it is necessary to select it first in the Platform/OS tab in the “Edit Preferences” window.

This version extends its support for controlled vocabularies with externally defined closed controlled vocabularies (located e.g. on the web). The list of supported file formats for importing controlled vocabularies has been extended with .txt and .csv. The file format of externally defined closed controlled vocabulary files is .ecv, which is close to .eaf.

To make life easier and to increase the work speed of ELAN users, several improvements have been made to get things done with fewer steps and clicks. A few tier-based operations, like removing multiple annotations or annotation values from selected tiers or creating dependent annotations recursively on all dependent tiers, can be performed much faster and more easily. It is now also possible to create dependent annotations automatically when an annotation is created on a tier with dependent tiers. The merge transcriptions function has been extended with options for appending one file to the other, making the merging process more versatile.

Further support for audio and video recognizers, as developed in e.g. the AVATecH Project, has been implemented. To learn more about this project, visit the AVATecH website.

You can download the new version at the ELAN web site where you will also find the updated manual detailing how to use the new functionalities.

RELISH workshop on lexicon standards and lexicon tools

by Jacquelijn Ringersma

On August 4 and 5, the RELISH project organized a workshop on lexicon standards and lexicon tools at the MPI in Nijmegen. The workshop brought together field linguists and NLP experts to discuss the approaches, standards, tools and interoperability of lexical resources. The aim of the workshop was to create an understanding of the requirements for lexicon tools and, if possible, to design concrete steps towards further harmonization.

In the RELISH project (Rendering Endangered Languages Lexicons Interoperable through Standards Harmonization), funded by NEH and DFG, the MPI works together with the University of Frankfurt and Eastern Michigan University. The project aims to unify two major collections of digitized lexicons of endangered languages in order to create a searchable virtual archive.

In the workshop, there were presentations from field linguists and from members of the NLP community. The presentations showed that there is some difference in focus and approach. Where the field linguist aims at a content-rich resource which can be used both for research purposes and for dissemination to the speech community, NLP searches for an infrastructure covering “all” language resources and tools. As a logical result, standardization and interoperability seem to be more important for the NLP community, although they are certainly not irrelevant for the field linguist. However, the information sharing on the subject of standards and interoperability was felt to be very useful by both ‘parties’.

In the workshop there were also presentations on LMF and ISOcat (the ISO standards for lexical resources) and LIFT and GOLD (the USA standards for lexical resources). The presentations and interactions showed that on both sides of the Atlantic interesting moves have been made towards standardization and that the difference between the two does not seem to be as wide as the mentioned ocean.

In the final 6 months of the RELISH project the parties involved will work on bridging the gap between LMF/ISOcat and LIFT/GOLD and develop an interchange format. Since RELISH brings together organizations that have been instrumental in promoting both endangered languages documentation and standards-development in Europe and the US, the success of RELISH will provide impetus for other standards-harmonization efforts, as well as offer the scientific research community integrated access to important new digital materials.

Presentations of the workshop are available from the Event page on the MPI website.

LEXUS and ViCoS: a software ‘couple’ in the LAT suite

by Jacquelijn Ringersma

LEXUS is our online tool for the creation of multimedia lexica and encyclopedic dictionaries. LEXUS is targeted at linguists involved in language documentation, but is also actively used by researchers in Sign Language research. LEXUS is based on the ISO recommendation for Language Resource Management (ISO TC37/SC4), providing a Lexical Markup Framework (LMF) lexicon structure and a concept naming registry (ISOcat). With LEXUS, users can create lexica from scratch, but also import lexica created in Toolbox or other XML-based tools. Lexica using LMF and ISOcat are interoperable with each other, allowing for multi-lexicon searches and merging of lexica. Users may customize views of the word list and lexical entries. Standard functionality, like sorting or filtering of word lists, is already available and we are currently working on paper output options. One of the major strengths of the online tool is that users may share their lexica with other users, either on a read-only or read/write basis.

ViCoS is an extension of LEXUS, with which users can create relations between lexical entries, using fuzzily defined relation types. The result of this network of relations can be a conceptual space, where each word is represented as an element in a network of other related words. Relations can be ‘universal’ (e.g. A_is_a_B) or specifically defined for a particular lexicon (A_eats_B). In its current version ViCoS can only be used from the LEXUS user interface, since the words are the basis of the conceptual space. Future plans for ViCoS envisage that the tool will be central in the creation of a customized ‘eScience environment’, a user-defined workspace where users can link any type of resource into new organizational layers.
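
The idea of a conceptual space built from typed relations between lexical entries can be pictured as a small labelled graph. The following sketch is not ViCoS code; the class `ConceptualSpace`, its methods and the sample words are hypothetical, and relation types are represented as plain strings for simplicity.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class ConceptualSpace:
    """A network of lexical entries connected by typed relations."""

    def __init__(self) -> None:
        # entry -> list of (relation type, related entry)
        self._edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def relate(self, source: str, relation: str, target: str) -> None:
        self._edges[source].append((relation, target))

    def related(self, source: str, relation: str) -> List[str]:
        """Return all entries linked to `source` by the given relation type."""
        return [target for rel, target in self._edges.get(source, []) if rel == relation]


space = ConceptualSpace()
space.relate("dog", "is_a", "animal")   # a 'universal' relation type
space.relate("cat", "is_a", "animal")
space.relate("dog", "eats", "meat")     # a lexicon-specific relation type
```

Asking `space.related("dog", "is_a")` then returns the entries that "dog" is linked to by that relation type, which is the basic operation needed to navigate such a space.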

LEXUS and ViCoS training and support

Recently we held a LEXUS/ViCoS training session at the Winter School Saami Language Documentation and Revitalization in Bodø, Norway. Some 25 participants were trained in creating lexica, adding multimedia fragments, customizing views and creating conceptual spaces. Although the training was basic and could not cover the full functionality of LEXUS and ViCoS, most users were enthusiastic about the tools and registered as LEXUS users after the training.

At the Saami Winter School (photo by Lena Karvovskaya)

If you are interested in using the tools, you may request a LEXUS user account by sending an e-mail to Jacquelijn Ringersma. We have regular LEXUS and ViCoS training in the DoBeS training weeks, or in summer schools and language documentation workshops.