Semantic interoperability of linguistic resources now and in the future

by Menzo Windhouwer

Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Primary data, like audio and video recordings, can remain accessible in that future thanks to the curation efforts of archive managers. However, for a lexicon or a grammatical description curation is not so easy. The semantics of the terminology used by the creators of these resources may have drifted, i.e., the terms might now have a (slightly) different meaning. Future users may therefore have a hard time interpreting the resource correctly, or may even draw wrong conclusions based on wrong assumptions. A possible solution is to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, takes that route.

ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Each data category becomes uniquely identifiable by a so-called persistent identifier. As the name of this identifier indicates, data categories in this registry are meant to stay around for a very long time. Future researchers should thus be able to take a resource from an archive and resolve these identifiers to reach the semantic descriptions of the data categories used in the resource. These descriptions should then help them interpret the resource.

However, adding data category identifiers to resources can already help us now. Because data categories can be reused by various resources, they provide hints about which resources are semantically close together, i.e., they can help researchers find more relevant resources based on semantic closeness. In this way, islands of resources that use domain- or application-specific terminology can be connected, as the specification allows different terms to be declared for the same data category.
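
The idea of semantic closeness can be illustrated with a minimal sketch. It assumes that each resource is described by the set of data category identifiers it uses and measures closeness as the share of identifiers two resources have in common (Jaccard overlap). The resource names, the identifiers and the choice of measure are all hypothetical illustrations, not part of ISOcat itself.

```python
# Hypothetical resources, each described by the data category identifiers it uses.
# Real ISOcat persistent identifiers are URIs; short made-up labels are used here.
resources = {
    "lexicon_a": {"dc:partOfSpeech", "dc:lemma", "dc:gloss"},
    "lexicon_b": {"dc:partOfSpeech", "dc:lemma", "dc:gender"},
    "corpus_c": {"dc:speaker", "dc:recordingDate"},
}


def closeness(a, b):
    """Jaccard overlap: shared data categories divided by all categories used."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_closeness(target, resources):
    """Return the other resources, ordered from semantically closest to farthest."""
    base = resources[target]
    scores = [
        (name, closeness(base, categories))
        for name, categories in resources.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by_closeness("lexicon_a", resources):
        print(name, score)
```

In this sketch the two lexica share two of four data categories and therefore rank as closer to each other than to the corpus, even if their creators used different local terms for those categories.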

ISOcat is the Data Category Registry for the ISO Technical Committee 37, which develops many standards for linguistic resources. Standards like the Lexical Markup Framework (LMF; ISO 24613:2008) and the Linguistic Annotation Framework (LAF; ISO/DIS 24612), which is in preparation, rely on data categories taken from this registry to turn an abstract model into a model that is actually useful for a specific resource (type). The ISO committee is working towards sets of standardized data categories for various domains, e.g., metadata and morphosyntax. This work is reflected in ISOcat as publicly accessible Thematic Views. However, every linguist can create her own data categories, share them with others and offer them for standardization. This grassroots approach aims at providing a standardized core useful for a broad range of linguists, together with reusable data categories created for and maintained by specific groups of linguists.

Tools provided by The Language Archive are starting to interact with ISOcat. In ELAN, items in a controlled vocabulary can be taken from ISOcat. LEXUS, which allows the construction of LMF-compliant lexica, can interact with ISOcat to select the data categories that instantiate the abstract LMF data model. The Component Registry allows elementary elements and values in component metadata to link to ISOcat data categories. While these are just first steps and more will be needed, the ultimate goal is to support the semantic interoperability of linguistic resources, and thus research, now and in the (far) future.

RELISH workshop on lexicon standards and lexicon tools

by Jacquelijn Ringersma

On August 4 and 5, the RELISH project organized a workshop on lexicon standards and lexicon tools at the MPI in Nijmegen. The workshop brought together field linguists and NLP experts to discuss the approaches, standards, tools and interoperability of lexical resources. The aim of the workshop was to build a shared understanding of the requirements for lexicon tools and, where possible, to design concrete steps towards further harmonization.

In the RELISH project (Rendering Endangered Languages lexicons Interoperable through Standards Harmonization), funded by NEH and DFG, the MPI works together with the University of Frankfurt and Eastern Michigan University. The project aims to unify two major collections of digitized lexicons of endangered languages in order to create a searchable virtual archive.

In the workshop, there were presentations from field linguists and from members of the NLP community. The presentations showed that there is some difference in focus and approach. Where the field linguist aims at a content-rich resource which can be used both for research purposes and for dissemination to the speech community, NLP searches for an infrastructure covering “all” language resources and tools. As a logical consequence, standardization and interoperability seem to be more important for the NLP community, although they are certainly not irrelevant for the field linguist. However, the information sharing on the subject of standards and interoperability was felt to be very useful by both ‘parties’.

In the workshop there were also presentations on LMF and ISOcat (the ISO standards for lexical resources) and on LIFT and GOLD (the US standards for lexical resources). The presentations and interactions showed that interesting moves towards standardization have been made on both sides of the Atlantic, and that the gap between the two does not seem to be as wide as that ocean.

In the final 6 months of the RELISH project the parties involved will work on bridging the gap between LMF/ISOcat and LIFT/GOLD and develop an interchange format. Since RELISH brings together organizations that have been instrumental in promoting both endangered languages documentation and standards-development in Europe and the US, the success of RELISH will provide impetus for other standards-harmonization efforts, as well as offer the scientific research community integrated access to important new digital materials.

Presentations of the workshop are available from the Event page on the MPI website.

The CLARIN-NL metadata tutorial

by Dieter van Uytvanck

On Friday May 27, about 25 people gathered at the Max Planck Institute in Nijmegen to attend a workshop on the practical use of the Component Metadata Infrastructure (CMDI) for the description of language resources. CMDI is the metadata part of CLARIN, a European initiative to create a Common Language Resources and Technology Infrastructure.

After a short introduction to metadata in general and a historical sketch, the concepts behind CMDI were introduced. The core ideas behind the new metadata format are modularity, reusability, and the use of data categories. A special session was dedicated to the use of ISOcat, the reference implementation of a data category registry. The idea behind this is to have a dependable definition of what is meant by a data category such as Part of Speech. This way it doesn’t matter what you call it or how you spell it in your particular metadata schema: the connection to similar schemata is always clear.
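
How a shared data category connects differently named fields can be shown with a small sketch. It assumes that each metadata schema records, for each of its field names, the identifier of the data category that field refers to. The schema contents and the identifiers below are hypothetical placeholders, not real ISOcat persistent identifiers or CMDI components.

```python
# Hypothetical data category identifiers (placeholders, not real ISOcat identifiers).
PART_OF_SPEECH = "DC-pos-example"
LEMMA = "DC-lemma-example"

# Two hypothetical schemata: field name -> data category identifier.
schema_a = {"PartOfSpeech": PART_OF_SPEECH, "Lemma": LEMMA, "Comment": "DC-note-example"}
schema_b = {"pos": PART_OF_SPEECH, "headword": LEMMA}


def equivalent_fields(schema_x, schema_y):
    """Map each field in schema_x to the fields in schema_y with the same data category."""
    by_category = {}
    for name, category in schema_y.items():
        by_category.setdefault(category, []).append(name)
    return {
        name: by_category[category]
        for name, category in schema_x.items()
        if category in by_category
    }


if __name__ == "__main__":
    print(equivalent_fields(schema_a, schema_b))
```

Here "PartOfSpeech" and "pos" are matched because they point to the same data category, while "Comment" has no counterpart in the second schema and is left out.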

After these more general introductions, the specific CMDI software was presented.

First the Component Registry was shown. It is a web application that can be used for inspecting, searching, creating and editing CMDI metadata components. Afterwards it was illustrated how to create CMDI metadata files using a version of Arbil that has been modified to interact directly with the Component Registry. Both Arbil and the Component Registry are developed by the Max Planck Institute for Psycholinguistics and were presented by their respective developers. Although both applications are still under development, it was clear that they can already be used for the production of CMDI metadata.

All slides of the presentations can be downloaded from the CLARIN-NL website.

More information about CMDI, including links to the software so you can try it out yourself, can be found on the main CLARIN site.