LEXUS on the TextGrid infrastructure – exploring new potentialities

by André Moreira

A new LEXUS interface is now part of the TextGrid Laboratory environment.

Since 2010, TLA and the Institut für Deutsche Sprache (IDS) have been developing the technology required to integrate LEXUS into the TextGrid Laboratory. With the latest TextGrid Laboratory beta release on 7 December 2011, this work is now available to the public.

TextGrid is a joint research project, part of the D-Grid initiative, and is funded by the German Federal Ministry of Education and Research (BMBF). It aims to support access to and exchange of data in the arts and humanities by means of modern information technology (the grid).
TextGrid serves as a virtual research environment for philologists, linguists, musicologists and art historians. As a single point of entry to the virtual research environment, the TextGrid Laboratory provides integrated access to specialized tools, services and content.

The TextGrid Laboratory is a cross-platform, highly modular software application based on the Eclipse Rich Client Platform (RCP). This modularity reaches the user through plug-ins, which can be installed to expand the Laboratory’s functionality. Each plug-in is usually a different tool, and together they create “a single point of entry to the virtual research environment”. By default, the application bundles a set of plug-ins that are available right after installation, and LEXUS is now among them.

Figure 1 - TextGrid Laboratory architecture portrait

The LEXUS plug-in aims to highlight the possibilities of using the LEXUS web service for the language resources commonly encountered in the TextGrid environment, and to demonstrate the usability of the TextGrid environment for non-standard languages such as those commonly found in LEXUS.
From the user’s point of view, it is a very simple plug-in that searches a lexical database for occurrences of a word or character sequence. As in LEXUS, the search can be restricted to specific data categories of a lexicon, and different search setups can be used, e.g. searching for all lexical entries that start with a certain prefix (fig. 2).
When displaying the results, LEXUS presents the full structure and data of the lexical entry containing the match, as well as a custom HTML view for that specific entry. This HTML view can be customized by the data manager (lexicon owner) through the regular LEXUS interface, thus enabling very flexible control over the layout of the lexical entries.
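
To make the search options more concrete, here is a minimal sketch of searching a lexicon by data category and search mode. It is not the LEXUS implementation or its API: the `LexicalEntry` class, the `search` function, the mode names and the sample entries are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    # Maps a data category name (e.g. "lemma") to its value.
    # Hypothetical structure, far simpler than a real LEXUS entry.
    fields: dict = field(default_factory=dict)

# Hypothetical search setups; each takes (value, term) and returns a bool.
MATCHERS = {
    "starts_with": str.startswith,
    "ends_with": str.endswith,
    "contains": str.__contains__,
}

def search(lexicon, term, category=None, mode="contains"):
    """Return entries whose value in `category` matches `term`.

    If `category` is None, every data category of an entry is searched.
    Matching ignores case.
    """
    match = MATCHERS[mode]
    needle = term.lower()
    hits = []
    for entry in lexicon:
        if category is None:
            values = list(entry.fields.values())
        elif category in entry.fields:
            values = [entry.fields[category]]
        else:
            values = []
        if any(match(value.lower(), needle) for value in values):
            hits.append(entry)
    return hits

# Invented sample data in the spirit of a syllabification lexicon.
LEXICON = [
    LexicalEntry({"lemma": "Aufgabe", "syllabification": "Auf-ga-be"}),
    LexicalEntry({"lemma": "laufen", "syllabification": "lau-fen"}),
    LexicalEntry({"lemma": "aufstehen", "syllabification": "auf-ste-hen"}),
]

prefix_hits = search(LEXICON, "auf", category="lemma", mode="starts_with")
print([entry.fields["lemma"] for entry in prefix_hits])
```

In this sketch, a prefix search for 'auf' in the lemma category skips "laufen", while a "contains" search would also return it.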

Figure 2 - TextGrid-LEXUS plug-in screenshot. Searching for the prefix 'auf' in the German syllabification lexicon.

Technically, the LEXUS-TextGrid implementation is divided into two main components: the TextGrid Laboratory plug-in, which provides the user with a full SWT-based user interface, and a SOAP web service made available through the LEXUS back end running on TLA servers.
Even though the web service was developed within the scope of TextGrid, it also allows other clients to interface with it, as the LEXUS plug-in for ELAN already does.
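
For readers unfamiliar with SOAP, the sketch below shows roughly what a client-side search request could look like. It assumes SOAP 1.1, and everything specific to the service is hypothetical: the `urn:example:lexicon-search` namespace, the `search` operation and the parameter names `lexiconId`, `term` and `mode` are invented for illustration and are not taken from the actual LEXUS web service.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; NOT the real LEXUS namespace.
SVC_NS = "urn:example:lexicon-search"

def build_search_request(lexicon_id, term, mode):
    """Build a SOAP envelope for a hypothetical 'search' operation."""
    ET.register_namespace("soap", SOAP_NS)
    ET.register_namespace("lex", SVC_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(body, f"{{{SVC_NS}}}search")
    # Parameter names are assumptions made for this sketch.
    for name, value in (("lexiconId", lexicon_id), ("term", term), ("mode", mode)):
        ET.SubElement(operation, f"{{{SVC_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")

print(build_search_request("demo-lexicon", "auf", "starts_with"))
```

Because the request is plain XML sent over HTTP, any client that can build such an envelope can talk to a SOAP service, which is what allows more than one front end to share the same back end.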

For the time being, every TextGrid Laboratory user has two lexica available out of the box to search and explore. These are very simple lexica, made available for demonstration purposes: one contains the syllabification of most known German words, and the other contains a sample set of lexical entries from Wichita, an endangered language.

Figure 3 - TextGrid-LEXUS plug-in screenshot. Searching the Wichita lexicon. Note the custom HTML layout on the right.

In the future we plan to extend the functionality made available by LEXUS in the TextGrid Laboratory, for instance by assigning to each user a private LEXUS workspace so that the user can also have private lexica, in addition to the lexica made available for every TextGrid Laboratory user.
Moreover, plans exist to further integrate ANNEX into the TextGrid Laboratory, enabling more TLA software functionality in the Laboratory workbench.

AVATecH promotion video

by Przemek Lenkiewicz

In the AVATecH project we are now ready to share our initial results with the research community. The first recognizers are being tested by MPI researchers, and their valuable feedback is being recorded to help us further improve our work and deliver tools that can save researchers a lot of time.

In order to spread the word about AVATecH and get more researchers interested, we have created this short movie clip that introduces the principal ideas of the project and shows some of our results.

The video is in German. English subtitles should be shown automatically; if not, click on the small CC icon at the bottom.

Upcoming Projects with TLA participation

by Peter Wittenburg

Participation in externally funded projects is very important for the TLA (The Language Archive) team for the usual reasons: (1) to secure funding to maintain existing software and add new functionality, both of which are essential to sustaining software; (2) to participate in open competitions to demonstrate and improve our competence; (3) to open up new opportunities in a dynamic IT landscape. In this respect TLA has been very successful in recent months, although the effort needed to form stable consortia and to arrive at proper proposals was considerable. We were part of 6 proposals, of which 5 were accepted. It is a pity that the CLARICLE proposal, which was meant to support the CLARIN ERIC in its construction efforts, was not accepted.

CLARIN D (BMBF)
Common Language and Technology Research Infrastructure
2011 – 2016

The follow-up project to the German D-SPIN (CLARIN) project has been granted and will officially start on 1 May 2011. The new CLARIN D will participate in building the infrastructure for language resources and tools and is therefore part of the European CLARIN ERIC initiative, which will become a legal entity in 2011. In this initiative TLA will become one of the strong centers, improve some of the frameworks already started, and add new ones that will turn out to be important for building and maintaining a useful research infrastructure enabling e-Humanities. Since we have reported frequently on CLARIN, we refer to the website for further information.

DASISH (EC)
Data Service Infrastructure for the Social Sciences and Humanities
2011 – 2014

This project brings together all 5 ESFRI research infrastructure initiatives in the social sciences and humanities (SSH), each represented by some of its centers: CLARIN, DARIAH, CESSDA, ESS and SHARE. The goal is to identify areas of possible synergy in infrastructure development and to work on a few concrete joint activities. The rationale is that (a) duplicate development should be avoided, (b) the initiatives should benefit from each other’s advanced work, and (c) joint integrated domains should be established where this makes sense for SSH users. Joint activities will run along the following dimensions: understanding the different architectural solutions; assessing and improving data and metadata quality; setting up a tools and services forum; improving the quality of survey data; locating and improving data preservation and curation services; developing a joint shared data access and enrichment framework (AAI, PIDs, joint metadata, workflow implementations, joint annotation framework); working jointly on legal and ethical aspects; carrying out extensive training and education work; and disseminating the results.
For TLA this is a very interesting opportunity to disseminate resources and tools to other disciplines and to integrate good components from others into the CLARIN infrastructure. The project is expected to start after the summer of 2011.

INNET (EC)
Innovative Networking in Infrastructure for Endangered Languages
2011 – 2014

This project will strengthen the international activities that were started in the DOBES project on the one hand and in CLARIN on the other. Together with the University of Cologne and colleagues from Poznan and Budapest, we will start the following activities in the area of endangered language documentation and archiving: (1) set up 3 new regional archives and run annual workshops with all experts active in the current and upcoming regional centers; (2) organize best-practice meetings with international guests, as well as summer schools; (3) develop educational material for use in schools to attract pupils’ attention. The CLARIN agreements will be relevant to all infrastructure aspects.
For TLA this is an excellent opportunity to extend its archiving network, and it is of course of great importance for spreading the CLARIN messages. More about this project will follow in a separate article. The project is expected to start in June/July 2011.

EUDAT (EC)
European Data Infrastructure
2011 – 2014

EUDAT is a first consequence of the report “Riding the Wave” by the EC’s High Level Expert Group on Scientific Data, in that it brings together 13 community-driven infrastructure initiatives and 10 data centers to build a first prototype of a Collaborative Data Infrastructure (CDI). In such a CDI, the community infrastructures take care of user-oriented data services, while the data centers take care of common horizontal data services that are the same, or at least very similar, for all research disciplines; both need to address topics such as data curation and the establishment of trust between all stakeholders. CLARIN is one of the communities selected for this project of strategic relevance. It has been understood worldwide that efforts to take care of research data, in terms of both preservation and continued accessibility, need to be strengthened. EUDAT will therefore focus on professional and robust common services such as: (1) providing an easy deposit service for all researchers involved; (2) setting up a distributed architecture that allows the participating centers to easily store large data volumes for preservation and access purposes (which includes safe replication of data); (3) working on policy-rule-based replication at the logical level of collections; (4) testing generic frameworks for executing web services. The project is expected to start on 1 October 2011.

Radieschen (DFG)
Rahmenbedingungen einer disziplinübergreifenden Forschungsdateninfrastruktur
2011 – 2014

This project is comparable to EUDAT in that it tries to define the basis and roadmap for a future research data infrastructure in Germany. While EUDAT is already meant to deliver concrete services, Radieschen will conduct many interviews with experts from different stakeholder groups. These will be analyzed along a few major dimensions, with the goal of suggesting how the Collaborative Data Infrastructure can be realized in Germany with its federal organizational structure. The project will start on 1 May 2011.

Some news from the AVATecH project

by Przemek Lenkiewicz

The AVATecH project is an interesting initiative of the Max Planck Gesellschaft and the Fraunhofer Gesellschaft. It aims to develop solutions for the automated annotation of media recorded by linguistic researchers. Such solutions are highly desired, so expectations are high.

The project has recently passed two very important milestones. The first was the AVATecH Expert Workshop, which took place in November. For two days, the project participants interacted with each other and with potential users of their solutions, presenting the status of development and integration and gathering feedback and further suggestions from linguists. Experts from different fields (audio/video processing, gesture and sign language research, field research) were also present to see the status of the work and to get an idea of what may soon be available for their purposes. Naturally, they contributed numerous valuable comments.

After the status of the work had been presented and suggestions gathered, all project participants continued working on their solutions, and the second milestone was reached: delivering the first automated annotation functionality to the ELAN tool and making it available to Max Planck researchers. This functionality covers the following initial possibilities:

  • The audio part provides functionality needed for a major part of annotation work: detecting how many persons are speaking in an audio recording and creating the appropriate number of tiers; detecting who is speaking when and creating annotations at the corresponding parts of the recording; aligning the recording with a transcription from a text file.
  • The video part provides the following functionality: detecting shots and subshots in the recording; creating representative keyframes for the given shots and subshots; estimating the color ranges that represent human skin in the recording; tracking the position of the speaker’s hands and head. Further functionality will be built on top of the last-mentioned recognizer: the positions of the hands and head, together with time information, will serve to estimate the speed of hand movement, the relation of the hands to each other and to the speaker’s body, etc.
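
As an illustration of how tracked positions and time information can be turned into movement features, here is a minimal sketch. It assumes a track is simply a list of (time in seconds, x, y) samples in pixel coordinates; that data layout and the function names are our own assumptions for this example, not the output format of the AVATecH recognizers.

```python
import math

def speeds(track):
    """Speed between consecutive samples of one tracked body part.

    `track` is a list of (time_s, x, y) tuples with increasing timestamps.
    The result is in pixels per second, one value per pair of samples.
    """
    result = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        result.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return result

def distance(sample_a, sample_b):
    """Distance between two body parts sampled at the same time."""
    _, xa, ya = sample_a
    _, xb, yb = sample_b
    return math.hypot(xa - xb, ya - yb)

# Invented sample data: the right hand moves, then rests.
right_hand = [(0.0, 0, 0), (0.5, 3, 4), (1.0, 3, 4)]
head = [(0.0, 3, 40), (0.5, 3, 40), (1.0, 3, 40)]

print(speeds(right_hand))               # speed per interval
print(distance(right_hand[1], head[1])) # hand-to-head distance at t = 0.5 s
```

Features like these (speed per interval, distance between hands, distance from hand to head) are the kind of derived values that further recognizers could build on.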

The MPI team is currently working on integrating these features with ELAN and providing manuals for researchers on how to use them.

The CLARA Project

by Przemek Lenkiewicz

Recently the Max Planck Institute started its participation in a very interesting project called CLARA. The name stands for Common Language Resources and their Applications. It is a European project that runs under the Initial Training Network framework of the Marie-Curie Actions.

CLARA offers positions for researchers, both PhD students and postdocs. The project will train a new generation of researchers who will be able to cooperate across national boundaries on the establishment of a common language resources infrastructure and its exploitation for the construction of the next generation of language models with wide theoretical and applied significance. The work of CLARA researchers will focus on two main goals:

  • to develop the next generation of data-intensive language models and applications by integrating approaches across language and country boundaries;
  • to contribute to the establishment of a pan-European infrastructure for language resources.

Recent advances in technology and widespread research efforts have expanded the size of corpora and the extent of their annotations. From corpora as basic resources, other resources are being derived, e.g. lexicons, frequency lists, word nets, term banks, etc. Although a large number of language resources have been produced to date, many scientific and organizational challenges remain, including the following:

  • Theories and modeling approaches have not yet been applied to a wide range of languages;
  • The gap between academic models and the needs of industrial actors who aim at real life applications remains to be bridged;
  • There is a lack of appropriate documentation for many resources. Moreover there is no good overview of available resources for some European languages;
  • Since some resources are developed for specific purposes, there is a challenge to convert them so they can be reused for other purposes;
  • The long term preservation of language resources needs to be secured;
  • Efficiency issues in accessing language resources in very large repositories must be addressed.

CLARA researchers are meant to address these challenges by means such as:

  • further work on standardization of coding and annotation practices;
  • development of registries and documentation systems for language resources;
  • transfer and integration of single-purpose resources to interoperable, reusable and extendable forms.

The Max Planck Institute is hosting three researchers of the CLARA project: two PhD students and one postdoc. Their work will be organized as a contribution to the AVATecH project, which aims to develop methods for automated annotation and thus addresses the CLARA project’s areas of interest.

People involved:
Peter Wittenburg – Scientist in charge.
Perry Janssen – Administrative contact.
Przemek Lenkiewicz – Experienced Researcher, Scientific contact.
Hugo García Blanco – Early Stage Researcher.
Binyam Gebrekidan Gebre – Early Stage Researcher.

The AVATecH Project

by Binyam Gebrekidan Gebre

The AVATecH project (Advancing Video Audio Technology in Humanities Research) aims at investigating, developing and applying advanced technology for semi-automatic annotation of collected audio-visual recordings used in humanities research. Currently, even the simplest annotations of, for example, recorded dialogs take too much time and effort. By making the annotation process more efficient through the use of automatic detectors, more data can be annotated in less time, opening up new possibilities for search and corpus analysis and enabling better theory building.

Initial research will focus on the creation of detector components which, given media recordings, generate lists of segments and annotations. Such detectors can be invoked from within annotation tools such as the widely used and proven ELAN software and from a batch-processing framework, to process a number of recordings in one go.
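
The idea of a detector that turns a recording into a list of segments, and of running it over many recordings in one go, can be sketched as follows. This is a toy example under our own assumptions: the `Segment` type, the `detect_activity` threshold detector and the `run_batch` helper are invented for illustration and do not reflect the actual AVATecH detectors or their interfaces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start_ms: int
    end_ms: int
    label: str

def detect_activity(levels, frame_ms=100, threshold=0.5):
    """Toy detector: `levels` holds one loudness value per frame.

    Consecutive frames at or above `threshold` are merged into one
    segment labeled "active".
    """
    segments = []
    start = None  # index of the first frame of the current active run
    for i, level in enumerate(levels):
        if level >= threshold and start is None:
            start = i
        elif level < threshold and start is not None:
            segments.append(Segment(start * frame_ms, i * frame_ms, "active"))
            start = None
    if start is not None:
        # The recording ended while still active.
        segments.append(Segment(start * frame_ms, len(levels) * frame_ms, "active"))
    return segments

def run_batch(detector, recordings):
    """Apply one detector to many recordings; returns name -> segments."""
    return {name: detector(data) for name, data in recordings.items()}

# Invented sample data standing in for two recordings.
recordings = {
    "session_a": [0.1, 0.9, 0.8, 0.2, 0.7],
    "session_b": [0.1, 0.2],
}
for name, segments in run_batch(detect_activity, recordings).items():
    print(name, segments)
```

Keeping the detector a plain function from recording data to segments is what lets the same component be called both interactively and from a batch loop.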

The project is organized in two major phases:
1. First, low hanging fruit detectors will be identified that can operate on a selected collection of typical audio/video material. They will be integrated into ELAN so that the developers can interact with researchers during the evaluation.
2. Second, more advanced and complex detector tasks will be tackled after the results of the low hanging fruit detectors have been evaluated.

Head and Hands Tracking

The detectors developed will be made available via interactive annotation tools and batch processing. In this project, two Max Planck Institutes (the MPI for Psycholinguistics in Nijmegen and the MPI for Social Anthropology in Halle) and two Fraunhofer Institutes (the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS in Sankt Augustin and the Fraunhofer Heinrich Hertz Institute HHI in Berlin) are cooperating in different capacities. The Max Planck Institutes act as experts for the research-driven questions resulting from an analysis of the AV material and for user-friendly interaction tools. The Fraunhofer Institutes act as experts for digital sound and video processing methods. More information on AVATecH can be found on the project’s homepage.