Semantic interoperability of linguistic resources now and in the future

by Menzo Windhouwer

Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Primary data, like audio and video recordings, can remain accessible in that future thanks to the curation efforts of archive managers. However, for a lexicon or a grammatical description curation is not so easy. The semantics of the terminology used by the creators of these resources may have drifted, i.e., the terms might now have a (slightly) different meaning. Future users may therefore have a hard time interpreting the resource correctly, or may even draw wrong conclusions based on wrong assumptions. A possible solution is to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, is taking that route.

ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Each data category becomes uniquely identifiable by a so-called persistent identifier. As the name of this identifier indicates, data categories in this registry are meant to stay around for a very long time. Future researchers should thus be able to take a resource from an archive and resolve these identifiers to get to the semantic descriptions of the data categories used in the resource. These descriptions should then help them to interpret the resource.

Adding data category identifiers to resources can, however, already help us now. Because data categories can be reused by various resources, they provide hints on which resources are semantically close together, i.e., they can help researchers to find more interesting resources based on semantic closeness. In this way islands of resources using domain- or application-specific terminology can be connected, as the specification allows various terms to be declared for the same data category.
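The idea of semantic closeness can be illustrated with a small sketch. It assumes that each resource is described by the set of data category identifiers it uses, and it ranks resources by the overlap of those sets (Jaccard similarity). This measure, the resource names and the "DC-n" identifiers are all illustrative assumptions; they are not taken from ISOcat itself.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of data categories two resources have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical resources, each described by the data category identifiers
# it uses. "DC-n" values are placeholders, not real persistent identifiers.
resources = {
    "lexicon_a": {"DC-1", "DC-2", "DC-3"},
    "lexicon_b": {"DC-2", "DC-3", "DC-4"},
    "corpus_c": {"DC-9"},
}


def closest(name: str, catalogue: dict[str, set[str]]) -> list[tuple[str, float]]:
    """Rank all other resources by their overlap with the named one."""
    target = catalogue[name]
    scores = [
        (other, jaccard(target, categories))
        for other, categories in catalogue.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = closest("lexicon_a", resources)
```

Here lexicon_b ranks first for lexicon_a because the two share two of the four data categories they use between them, while corpus_c shares none.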

ISOcat is the Data Category Registry for the ISO Technical Committee 37, which develops many standards for linguistic resources. Standards like the Lexical Markup Framework (LMF; ISO 24613:2008) and the Linguistic Annotation Framework (LAF; ISO/DIS 24612), which is in preparation, rely on the use of data categories taken from this registry to turn an abstract model into a model that is actually useful for a specific resource (type). The ISO committee is working towards sets of standardized data categories for various domains, e.g., metadata and morphosyntax. This work is reflected in ISOcat as publicly accessible Thematic Views. However, every linguist can also create her own data categories, share them with others and offer them for standardization. This grass-roots approach aims at providing a standardized core useful for a broad range of linguists, together with reusable data categories for, and maintained by, specific groups of linguists.

Tools provided by The Language Archive are starting to interact with ISOcat. In ELAN, items in a controlled vocabulary can be taken from ISOcat. LEXUS, which allows the construction of LMF-compliant lexica, can interact with ISOcat to select the data categories that instantiate the abstract LMF data model. The Component Registry allows elementary elements and values in component metadata to link to ISOcat data categories. These are just first steps and more will be needed, but the ultimate goal is to support the semantic interoperability of linguistic resources, and thus research, now and in the (far) future.

OAI Tools at The Language Archive

by Lari Lampen

An old man looks back on his life, spent in the ultimately futile pursuit of knowledge. Born, lived and soon to die within an immense – perhaps infinite – library, his world is made up of hexagonal arrays of bookshelves separated by tiny corridors. However, the most significant items found in this universal library or library universe are of course books. This multitude of bookshelves is filled with an unfathomable number of books filled with mostly incomprehensible sequences of letters that occasionally manage to spell a few words, much like the output of a million monkeys with typewriters. Librarians travel the endless corridors looking for a book, the catalogue of catalogues, which would reveal the locations of meaningful books.

The setting is that of “The Library of Babel” (1941), arguably the most famous of the seminal short stories of Jorge Luis Borges, a parable on the difficulty of fishing for meaning from a virtually endless ocean of data. The library universe of the universal library is practically devoid of meaning: while every possible book in every (alphabetic) language is included in it, the entirety of the library contributes nothing to anyone seeking useful information, simply because it is impossible to find anything.

The books in Borges’s vision of the universal library are not stored in any particular order; and while there are letters on the spine of each book, “these letters do not indicate or prefigure what the pages will say”. The crucial thing missing from this picture is not data, of which there is an abundance, but signposts, shelf labels, meaningful book titles or anything else describing the data contained in the endless profusion of books: in short, metadata.

The Max Planck Institute for Psycholinguistics hosts a substantial archive of language data, but it has its own unexplored corners, records that have only ever been accessed a handful of times, if even that. As the archive grows, finding relevant information becomes harder. Moreover, ours is but one of a number of repositories one needs to dig through when looking for data on, say, speakers of a particular language in a given area. Trawling through the different archives can be time-consuming and awkward, so it has been necessary to develop a method of sharing metadata between archives.

The mechanism by which this is achieved is called the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Proprietors of corpora become providers, serving metadata records via OAI-PMH; these records are then collected by harvesters to be processed further as required. The Language Archive here at the MPI makes its records available as an OAI provider. In turn we harvest metadata from around 60 other repositories of language data. These automated processes take place silently on servers, allowing end users to view information on harvested records alongside TLA-hosted records in a single tree structure.
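For readers curious about what a harvester actually does, here is a minimal sketch in Python. It builds a ListRecords request and reads the record identifiers and the resumption token (used for paging) out of a response. The base URL, the sample response and the function names are made up for illustration; this is not the harvester used by The Language Archive.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH provider."""
    if resumption_token:
        # When continuing a paged list, the token is sent on its own.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# A hand-written, heavily shortened response from a hypothetical provider.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("https://example.org/oai")
identifiers, token = parse_list_records(SAMPLE)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A real harvester would fetch each URL over HTTP and repeat the request with the returned token until the provider sends no further token.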

At the moment, adoption of OAI-PMH is still in its infancy relative to the scale it is hoped to attain eventually, but the protocol is already helping to provide a uniform view into a number of language corpora, making it at least slightly easier to find what you want. It may not be the catalogue of catalogues, but it is a start.

Metadata Workshop

by Dieter van Uytvanck

On September 7 and 8 a workshop on the use of metadata within European research infrastructures was held at the MPI in Nijmegen. Representatives from a broad range of fields (from high-energy physics through biodiversity to linguistics) gathered to explain their particular views on metadata.

It soon became clear that although the differences between closely related disciplines can be overcome, there are huge gaps between others. While in the humanities metadata is generally carefully hand-crafted, this is completely infeasible for the enormous amounts of data produced by sensors in the physics world.

Despite all the differences between the communities, some common goals for the future were identified. Among them are the need to build an infrastructure using reusable metadata components and the need for access to shared ontologies and vocabularies.

To bring together all conclusions of the workshop, a document was written that is meant as the basis of a proposal to the European Commission for collaboration in the field of metadata. This can be found here.

More information and the presentations of both days are available at the workshop’s website.

The CLARIN-NL metadata tutorial

by Dieter van Uytvanck

On Friday May 27, about 25 people gathered at the Max Planck Institute in Nijmegen to attend a workshop on the practical use of the Component Metadata Infrastructure (CMDI) for the description of language resources. CMDI is the metadata part of CLARIN, a European initiative to create a Common Language Resources and Technology Infrastructure.

After a short introduction to metadata in general and a historical sketch, the concepts behind CMDI were introduced: the core ideas behind the new metadata format are modularity, reusability, and the use of data categories. A special session was dedicated to the use of ISOcat, the reference implementation of a data category registry. The idea behind this is to have a dependable definition of what is meant by a data category such as Part of Speech. This way it doesn’t matter what you call it or how you spell it in your particular metadata schema; the connection to similar schemata is always clear.
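How differently named fields stay connected can be sketched as follows. The example assumes that each schema records, per field, the identifier of the data category it refers to. The schema names, field names and "DC-..." identifiers are placeholders invented for this illustration, not real CMDI components or ISOcat identifiers.

```python
# Hypothetical mapping from (schema, field name) to a shared data category.
# The "DC-..." values stand in for persistent identifiers.
FIELD_TO_CATEGORY = {
    ("schema_a", "PartOfSpeech"): "DC-PartOfSpeech",
    ("schema_b", "pos"): "DC-PartOfSpeech",
    ("schema_b", "lang"): "DC-Language",
}


def same_concept(field_a, field_b):
    """True if both fields are linked to the same data category."""
    category_a = FIELD_TO_CATEGORY.get(field_a)
    category_b = FIELD_TO_CATEGORY.get(field_b)
    # Fields without a known data category never match anything.
    return category_a is not None and category_a == category_b


match = same_concept(("schema_a", "PartOfSpeech"), ("schema_b", "pos"))
mismatch = same_concept(("schema_a", "PartOfSpeech"), ("schema_b", "lang"))
```

The comparison never looks at the field names themselves, which is exactly why their spelling does not matter.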

After these more general introductions, the specific CMDI software was presented.

First the Component Registry was shown. It is a web application that can be used for inspecting, searching, creating and editing CMDI metadata components. Afterwards it was illustrated how to create CMDI metadata files using a version of Arbil that has been modified to interact directly with the Component Registry. Both Arbil and the Component Registry are developed by the Max Planck Institute for Psycholinguistics and were presented by their respective developers. Although both applications are still under development, it was clear that they can already be used for the production of CMDI metadata.

All slides of the presentations can be downloaded from the CLARIN-NL website.

More information about CMDI, including links to the software so you can try it out yourself, can be found on the main CLARIN site.

Arbil

by Peter Withers

Arbil (short for “Archive Builder”) is an application for arranging research material and associated metadata into a format appropriate for archiving. It is basically the successor to the IMDI Editor, which a lot of our readers probably know. That tool is now almost ten years old, and a lot has happened in software engineering in that time. Therefore, instead of simply updating the IMDI Editor to incorporate user suggestions and modern software architecture principles, the MPI has decided to create a completely new application that will replace the IMDI Editor altogether.

The most obvious difference from the old IMDI Editor is that Arbil has a tabular display, which allows a comparative view of the data and makes it possible to copy and paste between matching sets of data.

Arbil main screen

While Arbil is primarily a tool to enter metadata, it also has functions to help organise the collected material and create a local well-organised corpus before it is archived. These functions include the ability to search for and compare metadata, and to open the resource files in associated applications, e.g. ELAN for annotations or a media player to watch videos.

Arbil's embedded image viewer

Arbil has been designed as a local application so that it can also be used offline, for instance at remote field sites. The metadata and resource files can be entered in part or as a whole; once an Internet connection is available, the previously entered data and associated structures can be exported from Arbil and then transferred to the main archive via LAMUS. As the idea behind Arbil is to deal with a complete corpus or corpus branch, Arbil can mirror branches from the main archive on your local computer so that they can still be referred to offline and in the field.

Unlike the IMDI Editor, Arbil incorporates the functionality of a number of separate tools while being designed around the workflow of the user. This means the application is meant to be the one program around which all your archiving work will be centered. In Arbil the metadata is viewed in tables, which can contain a single node of metadata as a list of fields, or many different nodes, each with its fields as a separate row in the table. This tabular view of the data allows multiple metadata nodes to be compared across the rows of the table. If a metadata node is editable, its fields can be edited in any table in which they are viewed. Drag and drop is used extensively, both for constructing a hierarchical corpus tree structure and for adding nodes to tables for viewing and editing. Bulk editing of metadata is done primarily via copy and paste, which allows a string of text to be pasted into multiple fields of multiple rows, or multiple fields to be pasted into the matching fields of multiple rows.

Tabular view of multiple IMDI files

The basic user interface of Arbil is highly customisable. For example, you can decide which columns of a table should be visible and save various table layouts to switch quickly between different views that suit different tasks. Furthermore, columns can be resized and reordered, and a table can be sorted on any column. Rows can easily be added and then dragged from one table to another, and the cells can be highlighted based on matching text.

With all the various metadata fields it is not always easy to see at a glance what a particular field is intended for and which fields are specifically required. For this reason a description of the intended usage of each field is displayed in a tool tip. Similarly, required fields that have not been filled in are indicated by a textual and colour highlight. Likewise, for fields requiring specific formatting, such as date fields, the cell and the table will be highlighted when the formatting is incorrect.
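The kind of check that lies behind such highlighting can be sketched in a few lines of Python. This is only an illustration of the principle and not Arbil's own code: the field names, the choice of required fields and the YYYY-MM-DD date pattern are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical field names; which fields are required is an assumption.
REQUIRED_FIELDS = {"Name", "Date"}
DATE_FIELDS = {"Date"}


def validate(record: dict[str, str]) -> dict[str, str]:
    """Return a {field: problem} mapping for fields that need highlighting."""
    problems = {}
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            problems[field] = "required field is empty"
    for field in DATE_FIELDS:
        value = record.get(field, "").strip()
        if value:
            try:
                # Assumed pattern: year-month-day separated by hyphens.
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems[field] = "expected a date as YYYY-MM-DD"
    return problems


ok = validate({"Name": "Session 1", "Date": "2010-09-07"})
bad = validate({"Name": "", "Date": "07/09/2010"})
```

An empty result means nothing needs highlighting; otherwise each entry names a field together with the reason it should be flagged.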

Arbil window style

In contrast to the IMDI Editor, Arbil is more of a team player, which means that there are various import and export facilities. Any valid IMDI file can be imported into Arbil. If the metadata refers to resource files, they can optionally be imported at the same time, for instance when migrating or merging from one computer to another. All of the IMDI data and the associated resources within Arbil can be exported into a self-contained directory. The exported files can then be uploaded into LAMUS, where any new corpus branches can be added and existing sessions that have been edited can be replaced. During both the import and export processes all of the metadata is validated, and a list of warnings is given if there are any errors. The textual data entered into Arbil can be exported in formats other than IMDI; for instance, the contents of a table can be copied and pasted into a text editor or into a spreadsheet. A custom style sheet can be used to transform and export the data in a particular format. The tables used in Arbil can also be embedded in a web page, where the table’s contents, size, columns and highlighting will be displayed.

Arbil continues to be actively developed to extend these features further and to make the user experience as pleasant as possible. Why don’t you try it out yourself?

ISO 639-3 as new standard within the MPI Archive

by Alexander König

As the ISO standard for language codes, ISO 639-3, is widely accepted nowadays, the MPI has decided to adopt these codes as the new standard for metadata stored in its archive. In the process of moving to this new standard we have gone through all the metadata files currently stored in the archive and replaced older code schemes, like the two-letter ISO 639-1 variants, with their 639-3 equivalents. This was another step in harmonising the linguistic metadata in the archive to make it easier for researchers to use.
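In essence, such a conversion is a table lookup. The sketch below shows the principle with a handful of well-known code pairs; the function name and the handling of unknown codes are assumptions made for this example, and it is not the script that was run on the archive.

```python
# A small excerpt of the ISO 639-1 to ISO 639-3 correspondence.
ISO_639_1_TO_3 = {
    "en": "eng",  # English
    "nl": "nld",  # Dutch
    "de": "deu",  # German
    "fr": "fra",  # French
    "es": "spa",  # Spanish
}


def to_iso_639_3(code):
    """Return the three-letter code, or None if the code is not in the table."""
    code = code.strip().lower()
    if len(code) == 3:
        # Assumed to be a 639-3 code already; its validity is not checked here.
        return code
    # Unknown two-letter codes yield None so they can be reviewed by hand.
    return ISO_639_1_TO_3.get(code)


converted = [to_iso_639_3(c) for c in ["nl", "EN", "deu", "xx"]]
```

Returning None rather than guessing keeps doubtful codes visible, so that they can be corrected manually.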