Archive for the Category CLARIN

 
 

Upcoming Projects with TLA participation

by Peter Wittenburg

Participation in externally funded projects is very important for the TLA (The Language Archive) team for the usual reasons: (1) to secure funding to maintain existing software and add new functionality, both of which are essential for sustaining software; (2) to participate in open competitions to show and improve competence; (3) to open up new opportunities in a dynamic IT landscape. In this respect TLA has been very successful in recent months, although considerable effort was needed to form stable consortia and to produce proper proposals. We were part of 6 proposals, of which 5 were accepted. It is a pity that the CLARICLE proposal, which was meant to support the CLARIN ERIC in its construction efforts, was not accepted.

CLARIN D (BMBF)
Common Language and Technology Research Infrastructure
2011 – 2016

The follow-up project for the German D-SPIN (CLARIN) has been granted and will officially start on 1 May 2011. The new CLARIN D will participate in building the language resource and tools infrastructure and is therefore part of the European CLARIN ERIC initiative, which will become a legal entity in 2011. In this initiative TLA will become one of the strong centers, improve some of the frameworks already started, and add new ones that turn out to be important for building and maintaining a useful research infrastructure enabling e-Humanities. Since we have reported frequently about CLARIN, we refer to the web-site for further information.

DASISH (EC)
Data Service Infrastructure for the Social Sciences and Humanities
2011 – 2014

This project brings together all 5 ESFRI research infrastructure initiatives in the social sciences and humanities (SSH), each represented by some of its centers: CLARIN, DARIAH, CESSDA, ESS, SHARE. The goal is to determine areas of possible synergy in infrastructure development and to work on a few concrete joint activities. The rationale is that (a) duplicate development should be prevented, (b) initiatives should mutually benefit from the advanced work of the others, and (c) joint integrated domains should be established where this makes sense for SSH users. Joint activities will run along the following dimensions: understanding the different architectural solutions, assessing and improving data and metadata quality, setting up a tools and services forum, improving the quality of survey data, locating and improving data preservation and curation services, developing a joint shared data access and enrichment framework (AAI, PIDs, joint metadata, workflow implementations, joint annotation framework), working jointly on legal and ethical aspects, carrying out extensive training and education work, and disseminating the results.
For TLA this is a very interesting opportunity to disseminate resources and tools to other disciplines and to integrate good components from others into the CLARIN infrastructure. This project is expected to start after the summer of 2011.

INNET (EC)
Innovative Networking in Infrastructure for Endangered Languages
2011 – 2014

This project will strengthen our international activities, which were started in the DOBES project on the one hand and in CLARIN on the other. Together with the University of Cologne and colleagues from Poznan and Budapest we will start the following activities in the area of endangered language documentation and archiving: (1) set up 3 new regional archives and run annual workshops with all experts active in the current and coming regional centers; (2) organize best practice meetings with international guests and summer schools; (3) work out educational material to take into schools to get pupils’ attention. The CLARIN agreements will be relevant to all infrastructure aspects.
For TLA it is an excellent opportunity to extend its archiving network, and it is of course of great importance for spreading the CLARIN messages. More about this project will be said in a separate article. This project is expected to start in June/July 2011.

EUDAT (EC)
European Data Infrastructure
2011 – 2014

EUDAT is a first consequence of the report “Riding the Wave” of the EC’s High Level Expert Group on Scientific Data in so far as it brings together 13 community driven infrastructure initiatives and 10 data centers to build a first prototype of a Collaborative Data Infrastructure (CDI). In such a CDI the community infrastructures take care of user oriented services on data, while the data centers take care of common horizontal data services which are the same or at least very similar for all research disciplines. Both need to address topics such as data curation and the establishment of trust between all stakeholders. CLARIN is one of the communities selected for this project of strategic relevance. It has been understood worldwide that efforts to preserve research data and to keep them accessible need to be strengthened. Therefore EUDAT will focus on professional and robust common services such as: (1) providing an easy deposit for all involved researchers, (2) setting up a distributed architecture allowing the participating centers to easily store large data volumes for preservation and access purposes (which includes a safe replication of data), (3) working on a policy-rules based replication at the logical level of collections, (4) testing generic web services execution frameworks. This project is expected to start on 1 October 2011.

Radieschen (DFG)
Rahmenbedingungen einer disziplinübergreifenden Forschungsdateninfrastruktur
2011 – 2014

This project can be compared with the EUDAT project in so far as it tries to define the basis and roadmap for a future data infrastructure for the research domain in Germany. While EUDAT is already meant to come up with concrete services, Radieschen will conduct many interviews with experts from different stakeholders. These will be analyzed along a few major dimensions with the goal of suggesting how the Collaborative Data Infrastructure can be realized in Germany with its federal organization structure. This project will start on 1 May 2011.


Semantic interoperability of linguistic resources now and in the future

by Menzo Windhouwer

Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Thanks to the curation efforts of archive managers, primary data such as audio and video recordings can still be accessible in that future. However, for a lexicon or a grammatical description curation is not so easy. The semantics of the terminology used by the creators of these resources may have drifted, i.e., the terms might now have a (slightly) different meaning. Future users could therefore easily have a hard time interpreting the resource in the right way, or even come to wrong conclusions based on wrong assumptions. A possible solution would be to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, is taking that route.

ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Each data category becomes uniquely identifiable by a so-called persistent identifier. As the name of this identifier indicates, data categories in this registry are meant to stay around for a very long time. Future researchers should thus be able to take a resource from an archive and resolve these identifiers to get to the semantic descriptions of the data categories used in the resource. These descriptions should then help the researcher to interpret the resource.

However, adding data category identifiers to resources can already help us now. Because data categories can be reused by various resources, they provide hints on which resources are semantically close together, i.e., they can help researchers to find more interesting resources based on semantic closeness. In these cases islands of resources using domain or application specific terminology can be connected, as the specification allows various terms to be declared for the same data category.
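
As a rough illustration of how shared data categories can hint at semantic closeness, the following Python sketch compares resources by the data category identifiers they reference. The resource names, local terms and "DC-..." identifier strings are invented placeholders and not real ISOcat entries, and the overlap score (Jaccard) is only one plausible choice, not something ISOcat itself prescribes.

```python
# Hypothetical resources: each maps its own local term to a data category
# identifier. The "DC-..." strings are placeholders, not real ISOcat PIDs.
lexicon_a = {"pos": "DC-0001", "gender": "DC-0002", "lemma": "DC-0003"}
lexicon_b = {"partOfSpeech": "DC-0001", "headword": "DC-0003", "tone": "DC-0004"}
corpus_c = {"speaker": "DC-0005", "tone": "DC-0004"}


def shared_categories(res_x, res_y):
    """Return the data category identifiers used by both resources."""
    return set(res_x.values()) & set(res_y.values())


def closeness(res_x, res_y):
    """Jaccard overlap of the identifier sets: 0.0 (disjoint) to 1.0 (identical)."""
    ids_x, ids_y = set(res_x.values()), set(res_y.values())
    union = ids_x | ids_y
    return len(ids_x & ids_y) / len(union) if union else 0.0
```

In this toy data the two lexica use different local names ("pos" versus "partOfSpeech") yet share two identifiers, so they come out closer to each other than lexicon_a does to corpus_c, which shares none with it.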

ISOcat is the Data Category Registry for the ISO Technical Committee 37, which develops many standards for linguistic resources. Standards like the Lexical Markup Framework (LMF; ISO 24613:2008) and the Linguistic Annotation Framework (LAF; ISO/DIS 24612), which is still in preparation, rely on the use of data categories taken from this registry to turn an abstract model into a model that is actually useful for a specific resource (type). The ISO committee is working towards sets of standardized data categories for various domains, e.g., metadata and morphosyntax. This work is reflected in ISOcat as publicly accessible Thematic Views. However, every linguist can also create her own data categories, share them with others and offer them for standardization. This grass roots approach aims at providing a standardized core useful for a broad range of linguists, together with reusable data categories for, and maintained by, specific groups of linguists.

Tools provided by The Language Archive are starting to interact with ISOcat. In ELAN, items in a controlled vocabulary can be taken from ISOcat. LEXUS, which allows the construction of LMF compliant lexica, can interact with ISOcat to select data categories to actually instantiate the abstract LMF data model. The Component Registry allows elementary elements and values in component metadata to link to ISOcat data categories. While these are just first steps and more will be needed, the ultimate goal is to support the semantic interoperability of linguistic resources and thus research now and in the (far) future.

The VLO – Faceted Browser

by Patrick Duin

The Virtual Language Observatory (VLO) is an alternative way of browsing and searching different archives all over the world. We are happy to announce that the faceted browser, the part of the VLO used for browsing resources, has recently been updated and improved.

A faceted browser is a way to browse and search the data in the various archives based on the facets available in the data. These facets are certain searchable aspects of the data. For instance, figure 1 shows two facets, Country and Language. It tells us that there are 15,548 records that have “Netherlands” as the value for the country name.

Figure 1: Different Facets

When the user clicks on a facet value the user interface updates the other facets accordingly. So for example if we click “Netherlands”, we get figure 2.

Figure 2: Country = "Netherlands"

The language facet is updated and now only shows the languages of records that have “Netherlands” as the value for Country.
By clicking and selecting more facets the number of records can be narrowed down even more.
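
The narrowing behaviour described above can be sketched in a few lines of Python. This is only an illustration of the general idea of faceted counting and filtering; the sample records and the function names `select` and `facet_counts` are made up for the example and say nothing about how the VLO itself is implemented.

```python
from collections import Counter

# Invented sample records; the real VLO covers metadata from many archives.
RECORDS = [
    {"country": "Netherlands", "language": "Dutch"},
    {"country": "Netherlands", "language": "Frisian"},
    {"country": "Netherlands", "language": "Dutch"},
    {"country": "Brazil", "language": "Portuguese"},
]


def select(records, **chosen):
    """Keep only records that match every chosen facet value."""
    return [r for r in records if all(r.get(f) == v for f, v in chosen.items())]


def facet_counts(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(r[facet] for r in records if facet in r)


# Clicking "Netherlands" corresponds to filtering first, then recounting
# the remaining facets over the narrowed record set.
narrowed = select(RECORDS, country="Netherlands")
print(facet_counts(RECORDS, "country"))
print(facet_counts(narrowed, "language"))
```

Recomputing the counts over the filtered records is what makes the Language facet drop values such as "Portuguese" once a country has been selected.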

Selecting a result record gives a more detailed view of the data and if possible a link to the original context (archive) and links to resources associated with that record. See figure 3.

Figure 3: Result View

The resources can be accessed directly. The link goes straight to the archive providing the resource, so authentication and authorization may be required.

Metadata Workshop

by Dieter van Uytvanck

On September 7 and 8 a workshop was organized at the MPI in Nijmegen about the use of metadata within European research infrastructures. Representatives from a broad range of fields (from high-energy physics through biodiversity to linguistics) gathered to explain their particular views on metadata.

It soon became clear that although the differences between closely related disciplines can be overcome, there are huge gaps between others. While in the humanities the metadata is generally carefully hand-crafted, this is completely infeasible for the enormous amounts of data resulting from sensors in the physics world.

Despite all the differences between the communities, some common goals for the future were identified. Among them were the need to build an infrastructure using re-usable metadata components and the need for access to shared ontologies and vocabularies.

Bringing together all conclusions of the workshop, a document was authored, meant as the basis of a proposal to the European Commission for collaboration in the field of metadata. This can be found here.

More information and the presentations of both days are available at the workshop’s website.

The CLARIN-NL metadata tutorial

by Dieter van Uytvanck

On Friday May 27, about 25 people gathered at the Max Planck Institute in Nijmegen to attend a workshop on the practical use of the Component Metadata Infrastructure (CMDI) for the description of language resources. CMDI is the metadata part of CLARIN, a European initiative to create a Common Language Resources Infrastructure.

After a short introduction about metadata in general and a sketch of its history, the concepts behind CMDI were introduced: the core ideas behind the new metadata format are modularity, reusability, and the use of data categories. A special session was dedicated to the use of ISOcat, the reference implementation of a data category registry. The idea behind this is to have a dependable definition of what is meant by a data category such as Part of Speech. This way it doesn’t matter what you call it or how you spell it in your particular metadata schema; the connection to similar schemata is always clear.
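
To make the last point concrete, here is a small Python sketch in which two metadata schemata name the same fields differently but link them to the same data categories, so a record can be carried over from one to the other. The schema contents, the identifier strings and the `translate` helper are assumptions made for this illustration; they are not actual CMDI profiles, ISOcat identifiers or CMDI tooling.

```python
# Hypothetical schemata: field name -> data category identifier.
# The identifiers are placeholders, not actual ISOcat entries.
SCHEMA_A = {"PoS": "DC-partOfSpeech", "Lang": "DC-languageName"}
SCHEMA_B = {"part_of_speech": "DC-partOfSpeech", "language": "DC-languageName"}


def translate(record, source_schema, target_schema):
    """Rename the fields of a record from one schema to another via data categories."""
    category_to_target = {cat: name for name, cat in target_schema.items()}
    translated = {}
    for field, value in record.items():
        category = source_schema.get(field)
        target_field = category_to_target.get(category)
        if target_field is not None:  # skip fields without a shared category
            translated[target_field] = value
    return translated
```

For example, `translate({"PoS": "noun", "Lang": "Dutch"}, SCHEMA_A, SCHEMA_B)` yields a record keyed by `part_of_speech` and `language`, while fields that have no data category in common are simply left out.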

After these more general introductions, the specific CMDI software was presented.

First the Component Registry was shown. It is a web application that can be used for inspecting, searching, creating and editing CMDI metadata components. Afterwards it was illustrated how to create CMDI metadata files using a version of Arbil that has been modified to interact directly with the Component Registry. Both Arbil and the Component Registry are developed by the Max Planck Institute for Psycholinguistics and were presented by their respective developers. Although both applications are still under development, it was clear that they can already be used for the production of CMDI metadata.

All slides of the presentations can be downloaded from the CLARIN NL website.

More information about CMDI, including links to the software so you can try it out yourself, can be found on the main CLARIN site.