LEXUS on the TextGrid infrastructure – exploring new potentialities

by André Moreira

A new LEXUS interface is now part of the TextGrid Laboratory environment.

Since 2010, TLA, together with the Institut für Deutsche Sprache (IDS), has been developing the technology required to integrate LEXUS into the TextGrid Laboratory. As of the latest TextGrid Laboratory Beta release on the 7th of December 2011, this work is available to the public.

TextGrid is a joint research project, part of the D-Grid initiative, and is funded by the German Federal Ministry of Education and Research (BMBF). It aims to support access to and exchange of data in the arts and humanities by means of modern information technology (the grid).
TextGrid serves as a virtual research environment for philologists, linguists, musicologists and art historians. As a single point of entry to the virtual research environment, the TextGrid Laboratory provides integrated access to specialized tools, services and content.

The TextGrid Laboratory is a cross-platform, highly modular software application based on the Eclipse Rich Client Platform (RCP). This modularity reaches the user through plug-ins, which can be installed to expand the Laboratory's functionality. Each plug-in usually provides a different tool, and together they create “a single point of entry to the virtual research environment”. By default, the application bundles a set of plug-ins that are available right after installation, and LEXUS is now among them.

Figure 1 - TextGrid Laboratory architecture portrait

The LEXUS plug-in aims to show what the LEXUS web service can do for the language resources commonly encountered in the TextGrid environment, and to demonstrate that the TextGrid environment is usable for non-standard languages such as those commonly found in LEXUS.
From the user's point of view, it is a very simple plug-in that searches a lexical database for occurrences of a certain word or character sequence. As in LEXUS, the search can be restricted to specific data categories of a lexicon, and different search setups can be used, e.g. searching for all the lexical entries that start with a certain prefix (fig. 2).
When displaying the results, LEXUS presents the full structure and data of the lexical entry containing the match, as well as a custom HTML view for that specific entry. This HTML view can be customized by the data manager (lexicon owner) through the regular LEXUS interface, thus enabling very flexible control over the layout of the lexical entries.
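
The prefix search described above can be illustrated with a small sketch. This is not the LEXUS implementation or its web-service API: the in-memory lexicon, the data category names ("lemma", "syllabification") and the function `search_prefix` are all assumptions made for illustration only.

```python
# Hypothetical illustration only: LEXUS itself runs the search on its
# back-end. Here a lexicon is simply a list of entries, and each entry
# maps a data category name to a value.

def search_prefix(lexicon, prefix, categories=None):
    """Return the entries in which a selected data category starts with prefix.

    categories: optional set of data category names to search in;
    None means that every data category of an entry is searched.
    """
    prefix = prefix.lower()
    hits = []
    for entry in lexicon:
        for name, value in entry.items():
            if categories is not None and name not in categories:
                continue
            if value.lower().startswith(prefix):
                hits.append(entry)
                break  # one matching data category is enough
    return hits

# Invented sample entries, loosely modelled on a syllabification lexicon.
lexicon = [
    {"lemma": "aufstehen", "syllabification": "auf-ste-hen"},
    {"lemma": "Haus", "syllabification": "Haus"},
    {"lemma": "laufen", "syllabification": "lau-fen"},
    {"lemma": "Aufgabe", "syllabification": "Auf-ga-be"},
]

matches = search_prefix(lexicon, "auf", categories={"lemma"})
print([entry["lemma"] for entry in matches])  # ['aufstehen', 'Aufgabe']
```

The sketch returns whole entries rather than only the matching value, mirroring the way the plug-in shows the full lexical entry that contains a match.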

Figure 2 - TextGrid-LEXUS plug-in screenshot. Searching for the prefix 'auf' in the German syllabification lexicon.

Technically, the LEXUS-TextGrid implementation is divided into two main components: the TextGrid Laboratory plug-in, which provides the user with a full SWT-based user interface, and a SOAP web service made available through the LEXUS back-end running on TLA servers.
Even though the web service was developed within the scope of TextGrid, other clients can also interface with it, as the LEXUS plug-in for ELAN already does.

For the time being, every TextGrid Laboratory user has two lexica available out of the box to search and explore. These are very simple lexica, made available for demonstration purposes: one contains the syllabification of most known German words, and the other contains a sample set of lexical entries from Wichita, an endangered language.

Figure 3 - TextGrid-LEXUS plug-in screenshot. Searching the Wichita lexicon. Note the custom HTML layout on the right.

In the future we plan to extend the functionality made available by LEXUS in the TextGrid Laboratory, for instance by assigning to each user a private LEXUS workspace so that the user can also have private lexica, in addition to the lexica made available for every TextGrid Laboratory user.
Moreover, plans exist to further integrate ANNEX into the TextGrid Laboratory, enabling more TLA software functionality in the Laboratory workbench.

Statistical Language Models for Alternative Sequence Selection

by Herman Stehouwer

Is there a need to limit certain aspects of statistical language models?

Is it necessary to pre-limit the size of the n-gram?

Is it useful to use linguistic annotation within alternative sequence selection tasks?

According to a new study by Herman Stehouwer, the size of the n-gram can be completely flexible depending on the situation. The study also finds that the addition of certain linguistic annotations, specifically part-of-speech annotations and dependency-parses, did not aid the model in making decisions.

The study compares the ability of a language model to select the correct alternative from sets of alternatives in hundreds of experiments. These experiments were performed for three different alternative sequence selection tasks, for four different annotations (and also for no annotation), and for four different ways to combine the annotation with the text. The results of the study form the basis of the thesis “Statistical Language Models for Alternative Sequence Selection”. This thesis will be defended on the 7th of December at 18:00 in the Aula of Tilburg University.
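
To make the task concrete, here is a minimal sketch of alternative sequence selection with a statistical language model. It is not the method used in the thesis: it assumes a fixed bigram model with add-one smoothing, and the training sentences and function names are invented for illustration. Given a left context, a set of alternatives and a right context, it picks the alternative that gives the whole sequence the highest probability.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Log-probability of a token sequence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def select_alternative(left, alternatives, right, unigrams, bigrams):
    """Return the alternative whose full sequence scores highest."""
    return max(
        alternatives,
        key=lambda alt: log_prob(left + [alt] + right, unigrams, bigrams),
    )

# Tiny invented training corpus (an assumption, not data from the study).
corpus = [
    "she is taller than her brother",
    "he ate more than i did",
    "we ate and then we left",
    "first we talked then we left",
]
unigrams, bigrams = train_bigrams(corpus)

choice = select_alternative(
    "he is taller".split(), ["then", "than"], ["me"], unigrams, bigrams
)
print(choice)  # than
```

A fixed n-gram size is used here only to keep the sketch short; the study's finding is that the size does not have to be fixed in advance.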

To coincide with the defense, a colloquium on language modeling is being organized, with invited talks by Colin de la Higuera, Louis ten Bosch, and Antal van den Bosch. For more information on the colloquium, you can send an e-mail to herman.stehouwer [at] mpi.nl or look at its website.

West Ambrym in the Humboldt-Box

by Lena Karvovskaya and Soraya Hosni

The DoBeS Project “Languages of Southwest Ambrym” is happy to invite you to an exhibit in the newly opened exhibition-center Humboldt-Box in the heart of Berlin. The exhibit “Sprachdokumentation auf Südwest-Ambrym” (Flyer with more information) will be open to the public from 1st of July till 31st of December 2011.

The project team members wanted the installation to present the different ways in which culture, language and knowledge are transmitted in written societies (books and recordings) and oral societies (sand drawing and storytelling). The highlights of the installation are sandroings, a unique form of art practiced in Vanuatu. An example of such a performance is shown in a short film, “The Liliwi masks story”, projected on the ground. The film shows an elderly man drawing complex geometric figures in the sand with one continuous finger movement, which ends up forming a specific picture. The drawing is followed by a story or a description. Together, the drawing and the narration make up a sandroing performance; in the Liliwi masks story, the sand drawing illustrates the narrative.

A typical sandroing

The exhibit shows an original sandroing left by Abel Taho, who came from Ambrym to be our guest in Berlin. Visitors can also try the performance themselves: all you need to do is follow the instructions given by a young girl on the video, in which Joelyne teaches German children how to draw a breadfruit. At the installation you can also watch a film on the process of linguistic fieldwork, which shows how the recordings are transcribed and translated and how a dictionary is compiled. There is also a beautiful illustration for the dictionary by local artist Joebang Maaseng.

For those who want to see and hear more about the “Languages of Southwest Ambrym”, there is a video channel on YouTube where Soraya Hosni shares her work. At the moment it contains the film about language documentation, the video of the Liliwi sandroing performance and two films with instructions on how to make a sandroing yourself. The channel will be regularly updated with new films.

Visitors at the Ambrym exhibition

The project “Languages of Southwest Ambrym” is also presented to the broader public through “Science movies”, the videoblog of the Volkswagen Foundation. “Wer spricht noch Daakaka?” is a series of 10 shorts, filmed by Susanne Fuchs and Soraya Hosni, in which we follow them on their journey from Berlin to Ambrym. We learn about daily life on the island, from preparing meals and basic hygiene to how houses are built and marriages are celebrated. We can admire the unique volcanic landscape and tropical vegetation, and we can also learn how the “Languages of Southwest Ambrym” team conducts linguistic and ethnographic fieldwork and collaborates with local leaders, schools and children to make the most of the research and contribute to the survival of the Ambrym language and culture for future generations.

The Project “Languages of Southwest Ambrym” started in August 2009. It investigates three language varieties spoken on Ambrym, a volcanic island in the northern part of Vanuatu: Daakaka, Daakiye and Dal kalaen. The goal of the project is to document the linguistic and cultural heritage of the people of Ambrym. During extensive fieldwork sessions, the team members make recordings of custom stories and cultural practices. Among other things, the project has created a collection of sandroings. Each drawing has been documented together with the language performance.

The team members are: Prof. Dr. Manfred Krifka, Soraya Hosni, Kilu von Prince, Dr. Susanne Fuchs and Lena Karvovskaya (student assistant). To learn more about the Project “Languages of Southwest Ambrym” visit the official websites at the MPI or at the ZAS.

The CLARA Project

by Przemek Lenkiewicz

Recently the Max Planck Institute started its participation in a very interesting project called CLARA. The name stands for Common Language Resources and their Applications. It is a European project that runs under the Initial Training Network framework of the Marie-Curie Actions.

CLARA offers posts for researchers at both PhD and postdoctoral level. The project will train a new generation of researchers who will be able to cooperate across national boundaries on the establishment of a common language resources infrastructure and its exploitation for the construction of the next generation of language models with wide theoretical and applied significance. The work of CLARA researchers will focus on two main goals:

  • to develop the next generation of data-intensive language models and applications by integrating approaches across language and country boundaries;
  • to contribute to the establishment of a pan-European infrastructure for language resources.

Recent advances in technology and widespread research efforts have expanded the size of corpora and the extent of their annotations. From corpora as basic resources, other resources are being derived, e.g. lexicons, frequency lists, word nets, term banks, etc. Although a large number of language resources have been produced to date, many scientific and organizational challenges remain, including the following:

  • Theories and modeling approaches have not yet been applied to a wide range of languages;
  • The gap between academic models and the needs of industrial actors who aim at real life applications remains to be bridged;
  • There is a lack of appropriate documentation for many resources. Moreover there is no good overview of available resources for some European languages;
  • Since some resources are developed for specific purposes, there is a challenge to convert them so they can be reused for other purposes;
  • The long term preservation of language resources needs to be secured;
  • Efficiency issues in accessing language resources in very large repositories must be addressed.

CLARA researchers are meant to address these challenges by means such as:

  • further work on standardization of coding and annotation practices;
  • development of registries and documentation systems for language resources;
  • transfer and integration of single-purpose resources to interoperable, reusable and extendable forms.

The Max Planck Institute is hosting three researchers of the CLARA project: two PhD candidates and one postdoc. Their work will be organized as a contribution to the AVATecH project, which aims at developing methods for automated annotation creation and thus addresses the areas of interest of the CLARA project.

People involved:
Peter Wittenburg – Scientist in charge.
Perry Janssen – Administrative contact.
Przemek Lenkiewicz – Experienced Researcher, Scientific contact.
Hugo García Blanco – Early Stage Researcher.
Binyam Gebrekidan Gebre – Early Stage Researcher.