West Ambrym in the Humboldt-Box

by Lena Karvovskaya and Soraya Hosni

The DoBeS project “Languages of Southwest Ambrym” is happy to invite you to an exhibit in the newly opened Humboldt-Box exhibition center in the heart of Berlin. The exhibit “Sprachdokumentation auf Südwest-Ambrym” (Flyer with more information) will be open to the public from 1 July to 31 December 2011.

The project team wanted the installation to present the different ways in which culture, language and knowledge are transmitted in written societies (books and recordings) and oral societies (sand drawing and storytelling). The highlights of the installation are sandroings, a unique form of art practiced in Vanuatu. An example of such a performance is shown in the short film “The Liliwi masks story”, which is projected on the ground. The film shows an elderly man drawing complex geometric figures in the sand with one continuous movement of a single finger, so that the lines end up forming a specific picture. The drawing is followed by a story or a description; together they make up a sandroing performance. In the Liliwi masks story, the sand drawing illustrates the narrative.

A typical sandroing

The exhibit shows an original sandroing left by Abel Taho, who came from Ambrym to be our guest in Berlin. Visitors can also try the performance themselves: all you need to do is follow the instructions given by a young girl in the video, Joelyne, who teaches German children how to draw a breadfruit. At the installation you can also watch a film on the process of linguistic fieldwork, which shows how recordings are transcribed and translated and how a dictionary is compiled. There is also a beautiful illustration for the dictionary by the local artist Joebang Maaseng.

For those who want to see and hear more about the “Languages of Southwest Ambrym”, there is a video channel on YouTube where Soraya Hosni shares her work. At the moment it contains the film about language documentation, the video of the Liliwi sandroing performance and two films with instructions on how to make a sandroing yourself. The channel will be updated regularly with new films.

Visitors at the Ambrym exhibition

The project “Languages of Southwest Ambrym” is also presented to the broader public through “Science movies”, the videoblog of the Volkswagen Foundation. “Wer spricht noch Daakaka?” is a series of 10 short films by Susanne Fuchs and Soraya Hosni, in which we follow them on their journey from Berlin to Ambrym. We learn about daily life on the island, from preparing meals and basic hygiene to how houses are built and marriages are celebrated. We can admire the unique volcanic landscape and tropical vegetation, but we also learn how the “Languages of Southwest Ambrym” team conducts linguistic and ethnographic fieldwork and collaborates with local leaders, schools and children to make the most of the research and to contribute to the survival of the language and culture of Ambrym for future generations.

The project “Languages of Southwest Ambrym” started in August 2009. It investigates three language varieties spoken on Ambrym, a volcanic island in the northern part of Vanuatu: Daakaka, Daakiye and Dal kalaen. The goal of the project is to document the linguistic and cultural heritage of the people of Ambrym. During extensive fieldwork sessions the team members record custom stories and cultural practices. Among other things, the project has created a collection of sandroings, each of which has been documented together with the accompanying language performance.

The team members are: Prof. Dr. Manfred Krifka, Soraya Hosni, Kilu von Prince, Dr. Susanne Fuchs and Lena Karvovskaya (student assistant). To learn more about the Project “Languages of Southwest Ambrym” visit the official websites at the MPI or at the ZAS.

Semantic interoperability of linguistic resources now and in the future

by Menzo Windhouwer

Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Thanks to the curation efforts of archive managers, primary data such as audio and video recordings can remain accessible in that future. For a lexicon or a grammatical description, however, curation is not so easy. The semantics of the terminology used by the creators of these resources may have drifted, i.e., the terms might now have a (slightly) different meaning. Future users may therefore have a hard time interpreting the resource correctly, or may even draw wrong conclusions based on wrong assumptions. A possible solution is to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, is taking that route.

ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Each data category becomes uniquely identifiable by a so-called persistent identifier. As the name of this identifier indicates, data categories in this registry are meant to stay around for a very long time. Future researchers should thus be able to take a resource from an archive and resolve these identifiers to obtain the semantic descriptions of the data categories used in the resource. These descriptions should then help them to interpret the resource.

Adding data category identifiers to resources can already help us today, however. Because data categories can be reused by various resources, they provide hints about which resources are semantically close to each other, i.e., they can help researchers find further relevant resources based on semantic closeness. In this way, islands of resources that use domain- or application-specific terminology can be connected, as the specification allows different terms to be declared for the same data category.
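The idea of connecting resources through shared data categories can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the resource names, the local terms and the identifier URLs are hypothetical and are not real ISOcat persistent identifiers, and no ISOcat service is contacted.

```python
# Hypothetical persistent identifiers (NOT real ISOcat identifiers).
POS = "https://example.org/datcat/part-of-speech"
GLOSS = "https://example.org/datcat/gloss"
TONE = "https://example.org/datcat/tone"

# Each (hypothetical) resource maps its own local term to a shared data category.
resources = {
    "lexicon_a": {"PoS": POS, "meaning": GLOSS},
    "corpus_b": {"word class": POS, "tone": TONE},
    "lexicon_c": {"translation": GLOSS},
}


def categories(name):
    """Return the set of data category identifiers used by a resource."""
    return set(resources[name].values())


def shared_categories(a, b):
    """Return the data categories that two resources have in common."""
    return categories(a) & categories(b)


def related_resources(name):
    """List other resources sharing at least one data category, closest first."""
    scores = {
        other: len(shared_categories(name, other))
        for other in resources
        if other != name
    }
    related = [other for other, score in scores.items() if score > 0]
    # Sort by descending overlap, then alphabetically for a stable order.
    return sorted(related, key=lambda other: (-scores[other], other))


print(related_resources("lexicon_a"))
```

Although "PoS" and "word class" are different local terms, both point to the same identifier, so the two resources are recognized as related.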

ISOcat is the Data Category Registry for the ISO Technical Committee 37, which develops many standards for linguistic resources. Standards like the Lexical Markup Framework (LMF; ISO 24613:2008) and the Linguistic Annotation Framework (LAF; ISO/DIS 24612), which is still in preparation, rely on data categories taken from this registry to turn an abstract model into a model that is actually useful for a specific resource (type). The ISO committee is working towards sets of standardized data categories for various domains, e.g., metadata and morphosyntax. This work is reflected in ISOcat as publicly accessible Thematic Views. However, every linguist can also create her own data categories, share them with others and offer them for standardization. This grass roots approach aims at providing a standardized core that is useful for a broad range of linguists, together with reusable data categories for, and maintained by, specific groups of linguists.

Tools provided by The Language Archive are starting to interact with ISOcat. In ELAN, items in a controlled vocabulary can be taken from ISOcat. LEXUS, which allows the construction of LMF-compliant lexica, can interact with ISOcat to select the data categories that instantiate the abstract LMF data model. The Component Registry allows elementary elements and values in component metadata to link to ISOcat data categories. These are just first steps and more will be needed, but the ultimate goal is to support the semantic interoperability of linguistic resources, and thus research, now and in the (far) future.

The (non)sense of high definition audio

by Paul Trilsbeek

Field linguists often ask me whether they shouldn’t be recording audio in high definition, 24 bit 96 kHz format, because their recorder has this option and the higher the quality the better, right? Well, not really. I’ll try to explain why it doesn’t make much sense to do so and why we even convert all audio recordings that we receive at The Language Archive to 16 bit 44.1 or 48 kHz.

When the digital audio CD standard was developed, it was argued that a digital representation of the audio signal using 16 bits and a sampling frequency of 44.1 kHz was sufficient to capture all the details a human being would be able to hear in a musical recording. For most types of music that is actually the case; only some highly dynamic music with both very loud and very quiet passages might not fit in the 96 dB of dynamic range that 16 bits of audio resolution offer. Nonetheless, companies selling audio equipment, such as Philips and Sony, saw the need to introduce newer formats such as the Super Audio CD and the DVD-Audio format at the end of the nineties, quite possibly driven by the idea of having consumers replace their perfectly fine CD players with the latest state of the art. Both turned out to be commercial failures. Still, high definition audio has gained some ground in the recording industry and, in recent years, also in “prosumer” audio recording equipment.

Before I go into the question of whether humans can actually hear a difference between HD and regular CD-quality audio, let me give some arguments why, from a technical point of view, it makes little to no sense for field linguists to record in 24 bit 96 kHz or higher.

Many cheap portable audio recording devices these days offer the possibility to record in 24 bit at 96 kHz. Recording with a sampling frequency of 96 kHz means that in theory you can record frequencies up to 48 kHz, more than double the highest frequency that (young) human beings can hear and way beyond the highest frequency components that are present in a speech signal (about 7 kHz). The built-in microphones in these types of recorders, however, do not capture anything above 16 kHz at most, so in order to record higher frequencies one needs to use an external microphone. There are microphones on the market that record frequencies up to 40 or 50 kHz, but these are not the kind of microphones a linguist would typically take into the field, even if they were within their budget (>3000 € a piece). The same is true for the dynamic range. 16 bit recordings can have a theoretical dynamic range of 96 dB, while 24 bit recordings can have a dynamic range of 144 dB. The background noise in a very quiet room has a sound pressure level of about 20-30 dB, and the human pain threshold lies around 130 dB. Human speech has a dynamic range of about 40 dB. Very good microphones have a dynamic range of about 120 dB, but the type of microphone a linguist is likely to be using in the field does not have a dynamic range higher than about 75 dB. From a technical point of view, recording high definition audio only makes sense with the very best recording equipment, for example in a recording studio or in a high-end digitization facility.
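The theoretical figures quoted above follow from two standard relations: the highest representable frequency is half the sampling frequency (the Nyquist limit), and the dynamic range of linear PCM with N bits is 20·log10(2^N), roughly 6 dB per bit. A minimal sketch, with function names chosen here purely for illustration:

```python
import math


def nyquist_khz(sample_rate_khz):
    """Highest frequency (kHz) representable at a given sampling frequency."""
    return sample_rate_khz / 2


def dynamic_range_db(bits):
    """Theoretical dynamic range (dB) of linear PCM with the given bit depth."""
    return 20 * math.log10(2 ** bits)


for rate in (44.1, 48, 96):
    print(f"{rate} kHz sampling: up to {nyquist_khz(rate):.2f} kHz")

for bits in (16, 24):
    print(f"{bits} bit: about {dynamic_range_db(bits):.0f} dB")
```

These are upper bounds set by the format alone; as the text explains, the microphone and the rest of the recording chain usually impose much tighter limits.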

Some argue that recording in 24 bit would allow one to leave more “headroom” for unexpected peaks when setting the recording level. This is only true, though, for the level of the analog line-level signal that goes into the analog-to-digital converter of the recorder. Most portable audio recorders only allow one to adjust the input gain of the microphone preamplifier, which should be adjusted properly anyhow to achieve a good signal-to-noise ratio, regardless of whether one records in 16 or 24 bit.

Some analogue carriers can actually reproduce sound beyond the limits of the digital audio CD specification. 1/4 inch open reel audio tape being recorded/played on a studio recorder with Dolby SR noise reduction could achieve a dynamic range of over 100 dB for example. Commercially produced vinyl records can in some cases contain frequencies of up to 50 kHz. For archives dealing with these kinds of materials, it would make sense to digitize them in high definition formats in order to truthfully capture the originals.

It is still debated whether humans can actually hear the difference between CD-quality and high definition audio. Audiophiles claim that the presence of frequencies above the human hearing limit does have an influence on the frequencies that we do hear. Blind listening tests however have shown that even expert listeners were at chance level when having to judge whether a recording was high definition or not (Meyer and Moran, 2007). In order to rule out possible differences in the recordings themselves, the same high definition recordings were played both with and without a device in the chain to reduce the recordings to regular CD quality. The rest of the playback setup (loudspeakers, amplifiers, cables, etc.) was left identical.

The main disadvantages of recording with high sampling frequencies and bit depths are that the recordings take up more storage space and that they are less compatible with audio software and hardware. Recordings made in 24 bit/96 kHz take up about 3 times as much storage space as CD-quality recordings, and even though flash memory cards are getting cheaper every month, this is still a drastic reduction in recording capacity for no real-world benefit in terms of quality. Recording in 24 bit at normal sampling frequencies (44.1 kHz/48 kHz) would create files that are 50% larger than 16 bit files, which isn’t too dramatic and could be justified when using very high grade microphones and recording equipment. The fact that not all audio software and hardware can play back high definition formats may cause problems when working with the files on a computer. As an archive, we would therefore need to create additional copies in standard CD quality so that everyone can use the files. Instead of creating duplicate files in different qualities, we have chosen to normalize high definition audio and convert it to regular 16 bit at 44.1 or 48 kHz. The normalization step before the conversion makes sure that we use the maximum 96 dB of dynamic range that 16 bits offer, which is more than enough to retain the full quality of the recordings we receive.
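The storage figures, and the idea behind normalizing before reducing to 16 bit, can be sketched as follows. This is only an illustration under simplifying assumptions and not the archive's actual conversion procedure: the function names are made up here, samples are taken to be floating point values between -1 and 1, and resampling and dithering are left out entirely.

```python
def bytes_per_second(sample_rate_hz, bits, channels=1):
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * bits / 8 * channels


def normalize_to_16_bit(samples):
    """Peak-normalize float samples and quantize them to signed 16 bit integers.

    The loudest sample is scaled to the 16 bit maximum (32767), so the full
    range of the target format is used. No dithering is applied.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return [0] * len(samples)  # silence (or empty input) stays as it is
    scale = 32767 / peak
    return [round(s * scale) for s in samples]


hd = bytes_per_second(96_000, 24)
print(hd / bytes_per_second(48_000, 16))   # 24 bit/96 kHz vs. 16 bit/48 kHz
print(hd / bytes_per_second(44_100, 16))   # 24 bit/96 kHz vs. 16 bit/44.1 kHz
print(bytes_per_second(48_000, 24) / bytes_per_second(48_000, 16))

quiet_recording = [0.1, -0.2, 0.5, -0.05]
print(normalize_to_16_bit(quiet_recording))
```

The first ratio is exactly 3, the second slightly above 3, and the third is 1.5, which corresponds to the 50% size increase for 24 bit at an unchanged sampling frequency.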

References:

E. Brad Meyer and David R. Moran (2007). “Audibility of a CD-Standard A/D/A Loop Inserted into High-Resolution Audio Playback”, Journal of the Audio Engineering Society, 55-9, pp. 775-779.

Get your data archived!

by Jacquelijn Ringersma and Paul Trilsbeek

Language documentation is a field in linguistics which has gone through a “technology driven” change over the last 10 to 15 years. Linguists have been going into the field for decades, making sound recordings of languages and linguistic events. However, the miniaturization of recording equipment has made it much easier to make large quantities of high quality audio recordings. In addition, the arrival of affordable, high quality video equipment has permitted an extension of documentation work from audio to the visual dimension. The latter made it possible to document languages within their natural and cultural context, which triggered the establishment of a branch within linguistics whose main goal is the creation of rich multimedia corpora for languages that are threatened with extinction. In addition to collecting large amounts of primary audio and video recordings, numerous derived resources are produced: annotations and transcriptions, lexica, grammars, field notes etc.

The DoBeS (Dokumentation Bedrohter Sprachen/Documentation of Endangered Languages) programme, which started about 10 years ago, was among the first funding initiatives for endangered language documentation projects. An important aspect of this programme was the establishment of a central, specialized archive to take care of the long-term preservation of the valuable material collected by the documentation projects. The central archive, which is based at the Max Planck Institute for Psycholinguistics, was made an essential part of the programme because it had become clear that large amounts of recordings of languages and cultures were in danger of being lost forever. Old tapes and films that are not stored in specialized climate-controlled rooms degrade rapidly over time, but the situation is even worse for modern digital storage media such as DVDs and hard disks. Even if the media survive, technology changes so fast that it is very unlikely that there will be equipment around to read today’s storage media 20 years from now. A specialized digital archive will continuously migrate the stored material to the latest storage technology and will also migrate the stored file formats should they become obsolete.

Some researchers have their doubts about storing their resources in an online archive. The arguments presented to us take the form of: (1) Once my material is in there, I will not be able to get it out; or (2) Other researchers will use my material without giving me the credit and do all kinds of nice things with it. However, when you store material in the MPI archive, you maintain full control over access to the data through an online access management system (AMS). You are the owner of the data, and you will remain the owner of the data. You decide to whom you grant access. This also opens up opportunities to give access to members of the speech communities or to the relatives of those recorded.

The MPI archive accepts deposits from linguists who do not have an affiliation with the MPI or DoBeS. Storing your data in the MPI archive has the advantage that the data is stored in an organized manner and that you can use online tools to search through your data. You can also use online tools to visualize your data in an attractive manner. Most importantly, we will safeguard your data by making various backup copies in the Netherlands and Germany, by always using the latest state of the art in storage technology and by migrating to newer file formats should the current ones become obsolete in the future.

If you are interested in storing your language data in the MPI archive, please inquire about the conditions with one of the archive managers: Paul Trilsbeek or Jacquelijn Ringersma.