Archive for the Category MPI Archive

 
 

The Language Archive officially launched

by Sebastian Drude

On Tuesday, 11 October 2011, “The Language Archive” (TLA), the new unit of the Max Planck Institute for Psycholinguistics, was officially launched at a public event with more than 150 guests and speeches by eminent representatives from Germany and the Netherlands.

Many more people showed up than expected: there were not even enough seats for all the guests at the launch of TLA in the headquarters of the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) at the Gendarmenmarkt in the centre of Berlin. The BBAW is one of the three supporting institutions of TLA, together with the Dutch Koninklijke Nederlandse Akademie van Wetenschappen (KNAW) and the German Max-Planck-Gesellschaft (MPG).

The guests were offered coffee and snacks but, above all, plenty of content: five eminent representatives of the major stakeholders of the new unit gave fascinating talks on different topics, all related to the ongoing and future activities of TLA. On the one hand, these were the representatives of the three supporting institutions: Wolfgang Klein for the MPG, Angelika Storrer for the BBAW, and Theo Mulder for the KNAW. On the other hand, Wilhelm Krull represented the Volkswagenstiftung, the funding agency that has supported the programme “Documentation of Endangered Languages” (DOBES) since 2000; the programme itself was represented by Nikolaus P. Himmelmann. The DOBES archive is in many respects the core of the archive hosted by TLA. After the talks, Paul Trilsbeek provided a look into the archive itself.

The full program and topics of the speeches

Welcome and objectives for the language archive
Prof. Dr. Wolfgang Klein
Director at the Max Planck Institute for Psycholinguistics

Language research and language documentation in the digital age
Prof. Dr. Angelika Storrer
Zentrum Sprache of the BBAW

E-science: a major challenge for the humanities
Prof. Dr. Theo Mulder
Director of Research at the KNAW

Documentation of endangered languages: a task for science and society
Dr. Wilhelm Krull
Secretary General of the VolkswagenStiftung

How linguistics found (and is finding) its way to empirical data
Prof. Dr. Nikolaus P. Himmelmann
University of Cologne

A look into the archive
(interactive presentation)

West Ambrym in the Humboldt-Box

by Lena Karvovskaya and Soraya Hosni

The DoBeS Project “Languages of Southwest Ambrym” is happy to invite you to an exhibit in the newly opened exhibition centre Humboldt-Box in the heart of Berlin. The exhibit “Sprachdokumentation auf Südwest-Ambrym” (Flyer with more information) will be open to the public from 1 July to 31 December 2011.

The project team members wanted the installation to present the different ways in which culture, language and knowledge are transmitted in written societies (books and recordings) and oral societies (sand drawing and storytelling). The highlights of the installation are sandroings, a unique form of art practised in Vanuatu. An example of such a performance is shown in the short film “The Liliwi masks story”, which is projected on the ground. The film shows an elderly man drawing complex geometric figures in the sand with one continuous movement of a single finger, so that the lines end up forming a specific picture. The drawing is followed by a story or a description; together they make up a sandroing performance. In the Liliwi masks story, the sand drawing illustrates the narrative.

A typical sandroing

The exhibit shows an original sandroing left by Abel Taho from Ambrym when he was our guest in Berlin. Visitors can also try the performance themselves: all you need to do is follow the instructions given by a young girl in the video, Joelyne, who teaches German children how to draw a breadfruit. In addition, you can watch a film at the installation on the process of linguistic fieldwork, which shows how the recordings are transcribed and translated and how a dictionary is compiled. There is also a beautiful illustration for the dictionary by the local artist Joebang Maaseng.

For those who want to see and hear more about the “Languages of Southwest Ambrym”, there is a video channel on YouTube where Soraya Hosni shares her work. At the moment it contains the film about language documentation, the video of the Liliwi sandroing performance and two films with instructions on how to make a sandroing yourself. The channel will be updated regularly with new films.

Visitors at the Ambrym exhibition

The project “Languages of Southwest Ambrym” is also presented to the broader public through “Science movies”, the videoblog of the Volkswagen Foundation. “Wer spricht noch Daakaka?” (“Who still speaks Daakaka?”) is a series of 10 shorts, filmed by Susanne Fuchs and Soraya Hosni, in which we follow them on their journey from Berlin to Ambrym. We learn about daily life on the island, from preparing meals and basic hygiene to how houses are built and marriages are celebrated. We can admire the unique volcanic landscape and tropical vegetation, but we also learn how the “Languages of Southwest Ambrym” team conducts linguistic and ethnographic fieldwork and collaborates with local leaders, schools and children to make the most of the research and contribute to the survival of the Ambrym language and culture for future generations.

The Project “Languages of Southwest Ambrym” started in August 2009. It investigates three language varieties spoken on Ambrym, a volcanic island in the northern part of Vanuatu: Daakaka, Daakiye and Dal kalaen. The goal of the project is the documentation of the linguistic and cultural heritage of the people of Ambrym. During extensive fieldwork sessions the team members make recordings of custom stories and cultural practices. Among other things, the project has created a collection of sandroings. Each drawing has been documented together with the accompanying language performance.

The team members are: Prof. Dr. Manfred Krifka, Soraya Hosni, Kilu von Prince, Dr. Susanne Fuchs and Lena Karvovskaya (student assistant). To learn more about the Project “Languages of Southwest Ambrym” visit the official websites at the MPI or at the ZAS.

The (non)sense of high definition audio

by Paul Trilsbeek

Field linguists often ask me whether they shouldn’t be recording audio in high definition, 24 bit 96 kHz format, because their recorder has this option and the higher the quality the better, right? Well, not really. I’ll try to explain why it doesn’t make much sense to do so and why we even convert all audio recordings that we receive at The Language Archive to 16 bit 44.1 or 48 kHz.

When the digital audio CD standard was developed, it was argued that a digital representation of the audio signal using 16 bits and a sampling frequency of 44.1 kHz was sufficient to capture all the details a human being would be able to hear in a musical recording. For most types of music that is actually the case; only some highly dynamic music, with both very loud and very quiet passages, might not fit into the 96 dB of dynamic range that 16 bits of audio resolution offer. Nonetheless, companies selling audio equipment, such as Philips and Sony, saw the need to introduce newer formats such as the Super Audio CD and DVD-Audio at the end of the nineties, quite possibly driven by the idea of having consumers replace their perfectly fine CD players with the latest state of the art. Both formats turned out to be commercial failures. Still, high definition audio has gained some ground in the recording industry and, in recent years, also in “prosumer” audio recording equipment.

Before I go into the question of whether humans can actually hear a difference between HD and regular CD-quality audio, let me give some arguments why, from a technical point of view, it makes little to no sense for field linguists to record in 24 bit 96 kHz or higher.

Many cheap portable audio recording devices these days offer the possibility to record in 24 bit at 96 kHz. Recording with a sampling frequency of 96 kHz means that in theory you can record frequencies up to 48 kHz, more than double the highest frequency that (young) human beings can hear and way beyond the highest frequency components present in a speech signal (about 7 kHz). The built-in microphones in these types of recorders, however, do not capture anything above 16 kHz at most, so in order to record higher frequencies one needs to use an external microphone. There are microphones on the market that record frequencies up to 40 or 50 kHz, but these are not the kind of microphones a linguist would typically take into the field, even if they were within budget (>3000 € apiece). The same is true for the dynamic range. 16 bit recordings have a theoretical dynamic range of 96 dB; 24 bit recordings have a theoretical dynamic range of 144 dB. The background noise in a very quiet room has a sound pressure level of about 20-30 dB, and the human pain threshold lies at around 130 dB. Human speech has a dynamic range of about 40 dB. Very good microphones have a dynamic range of about 120 dB; however, the type of microphone a linguist is likely to use in the field does not have a dynamic range higher than about 75 dB. From a technical point of view, recording high definition audio only makes sense with top-quality recording equipment, for example in a recording studio or in a high-end digitization facility.
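
For readers who want to check these figures, here is a minimal sketch of the arithmetic. It assumes the usual rule of thumb that linear PCM offers 20·log10(2^bits) dB of theoretical dynamic range (roughly 6 dB per bit) and that the highest recordable frequency is half the sampling frequency. The function names are made up for this illustration.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits)


def highest_frequency_khz(sample_rate_khz: float) -> float:
    """Highest frequency that can be represented: half the sampling frequency."""
    return sample_rate_khz / 2


for bits in (16, 24):
    print(f"{bits} bit: about {dynamic_range_db(bits):.0f} dB")

for rate in (44.1, 48, 96):
    print(f"{rate} kHz sampling: up to {highest_frequency_khz(rate)} kHz")
```

These are theoretical upper limits of the format only; as argued above, the microphone and preamplifier usually set much lower limits in the field.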

Some argue that recording in 24 bit would allow one to leave more “headroom” for unexpected peaks when setting the recording level. This is only true, though, for the level of the analog line-level signal that goes into the analog-to-digital converter of the recorder. Most portable audio recorders only allow one to adjust the input gain of the microphone preamplifier, which should be adjusted properly anyhow to achieve a good signal-to-noise ratio, regardless of whether one records in 16 or 24 bit.

Some analogue carriers can actually reproduce sound beyond the limits of the digital audio CD specification. For example, 1/4 inch open reel audio tape recorded and played on a studio recorder with Dolby SR noise reduction can achieve a dynamic range of over 100 dB. Commercially produced vinyl records can in some cases contain frequencies of up to 50 kHz. For archives dealing with these kinds of materials, it makes sense to digitize them in high definition formats in order to capture the originals faithfully.

It is still debated whether humans can actually hear the difference between CD-quality and high definition audio. Audiophiles claim that the presence of frequencies above the human hearing limit does have an influence on the frequencies that we do hear. Blind listening tests however have shown that even expert listeners were at chance level when having to judge whether a recording was high definition or not (Meyer and Moran, 2007). In order to rule out possible differences in the recordings themselves, the same high definition recordings were played both with and without a device in the chain to reduce the recordings to regular CD quality. The rest of the playback setup (loudspeakers, amplifiers, cables, etc.) was left identical.

The main disadvantages of recording with high sampling frequencies and bit depths are that the recordings take up more storage space and that they are less compatible with audio software and hardware. Recordings made in 24bit/96kHz take up roughly 3 times as much storage space as CD quality recordings, and even though flash memory cards are getting cheaper every month, this is still a drastic reduction in recording capacity for no real-world benefit in terms of quality. Recording in 24 bit at normal sampling frequencies (44.1kHz/48kHz) creates files that are 50% larger than 16 bit files, which isn’t too dramatic and could be justified when using very high grade microphones and recording equipment. The fact that not all audio software and hardware can play back high-definition formats may cause problems when working with the files on a computer. As an archive, we would therefore need to create additional copies in standard CD quality, so that everyone can use the files. Instead of creating duplicate files in different qualities, we have chosen to normalize and convert high definition audio to regular 16 bit at 44.1 or 48 kHz. The normalization step before the conversion makes sure that we use the maximum 96 dB of dynamic range that 16 bits offer, which is more than enough to retain the full quality of the recordings we receive.
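
The storage comparison and the normalization step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: uncompressed PCM, two channels, and samples given as floating point numbers. The function names are invented for the example and this is not the conversion tool we actually run; real converters typically also resample and apply dither, which is left out here.

```python
def bytes_per_second(bits: int, sample_rate_hz: int, channels: int = 2) -> int:
    """Data rate of uncompressed PCM audio in bytes per second."""
    return bits // 8 * sample_rate_hz * channels


def normalize_to_int16(samples: list[float]) -> list[int]:
    """Scale samples so the loudest peak reaches 16 bit full scale, then round."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return [0 for _ in samples]
    scale = 32767 / peak
    return [int(round(s * scale)) for s in samples]


cd_48 = bytes_per_second(16, 48_000)
print(bytes_per_second(24, 96_000) / cd_48)  # size ratio of 24 bit/96 kHz to 16 bit/48 kHz
print(bytes_per_second(24, 48_000) / cd_48)  # size ratio of 24 bit/48 kHz to 16 bit/48 kHz
print(normalize_to_int16([0.0, 0.5, -0.5]))
```

Normalizing before reducing the bit depth means that the loudest sample in the recording maps to the largest 16 bit value, so none of the available range is left unused.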

References:

E. Brad Meyer and David R. Moran (2007). “Audibility of a CD-Standard A/D/A Loop Inserted into High-Resolution Audio Playback”, Journal of the Audio Engineering Society, 55-9, pp. 775-779.

OAI Tools at The Language Archive

by Lari Lampen

An old man looks back on his life, spent in the ultimately futile pursuit of knowledge. Born, lived and soon to die within an immense – perhaps infinite – library, his world is made up of hexagonal arrays of bookshelves separated by tiny corridors. However, the most significant items found in this universal library or library universe are of course books. This multitude of bookshelves is filled with an unfathomable number of books filled with mostly incomprehensible sequences of letters that occasionally manage to spell a few words, much like the output of a million monkeys with typewriters. Librarians travel the endless corridors looking for a book, the catalogue of catalogues, which would reveal the locations of meaningful books.

The setting is that of “The Library of Babel” (1941), arguably the most famous of the seminal short stories of Jorge Luis Borges, a parable on the difficulty of fishing for meaning from a virtually endless ocean of data. The library universe of the universal library is practically devoid of meaning: while every possible book in every (alphabetic) language is included in it, the entirety of the library contributes nothing to anyone seeking useful information, simply because it is impossible to find anything.

The books in Borges’s vision of the universal library are not stored in any particular order; and while there are letters on the spine of each book, “these letters do not indicate or prefigure what the pages will say”. The crucial thing missing from this picture is not data, of which there is an abundance, but signposts, shelf labels, meaningful book titles or anything else describing the data contained in the endless profusion of books: in short, metadata.

The Max Planck Institute for Psycholinguistics hosts a substantial archive of language data, but it has its own unexplored corners, records that have only ever been accessed a handful of times, if even that. As the archive grows, finding relevant information becomes harder. Moreover, ours is but one of a number of repositories one needs to dig through when looking for data on, say, speakers of a particular language in a given area. Trawling through the different archives can be time-consuming and awkward, so it has been necessary to develop a method of sharing metadata between archives.

The mechanism by which this is achieved is called the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Proprietors of corpora become providers, serving metadata records via OAI-PMH; these records are then collected by harvesters to be processed further as required. The Language Archive here at the MPI makes its records available as an OAI provider. In turn, we harvest metadata from around 60 other repositories of language data. These automated processes take place silently on servers, allowing end users to view information on harvested records alongside TLA-hosted records in a single tree structure.
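
To give a flavour of what a harvester does, here is a minimal sketch in Python. It builds a `ListRecords` request and pulls the record identifiers out of a response. The base URL, the record identifiers and the function names are placeholders invented for this example, not the address of our provider, and the sample response is heavily abbreviated. A real harvester would also fetch the URL over HTTP and follow resumption tokens for long result lists, both of which are left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_identifiers(response_xml: str) -> list[str]:
    """Extract the identifier of every record header in a response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter(OAI_NS + "identifier")]


# Abbreviated, made-up response for illustration only
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords>"
    "<record><header><identifier>oai:example.org:rec-1</identifier></header></record>"
    "<record><header><identifier>oai:example.org:rec-2</identifier></header></record>"
    "</ListRecords>"
    "</OAI-PMH>"
)

print(list_records_url("https://example.org/oai"))
print(record_identifiers(SAMPLE_RESPONSE))
```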

At the moment, adoption of OAI-PMH is still in its infancy relative to the scale it is hoped to attain eventually, but the protocol is already helping to provide a uniform view into a number of language corpora, making it at least slightly easier to find what you want. It may not be the catalogue of catalogues, but it is a start.

Get your data archived!

by Jacquelijn Ringersma and Paul Trilsbeek

Language documentation is a field in linguistics that has gone through a “technology driven” change over the last 10 to 15 years. Linguists have been going into the field for decades, making sound recordings of languages and linguistic events. However, the miniaturization of recording equipment made it much easier to make large quantities of high quality audio recordings. In addition, the arrival of affordable, high quality video equipment permitted an extension of documentation work from the audio to the visual dimension. The latter made it possible to document languages within their natural and cultural context, which triggered the establishment of a branch within linguistics in which the creation of a rich multimedia corpus for languages that are threatened with extinction became the main goal. In addition to collecting large amounts of primary audio and video recordings, numerous derived resources are produced: annotations and transcriptions, lexica, grammars, field notes etc.

The DoBeS (Dokumentation Bedrohter Sprachen/Documentation of Endangered Languages) programme, which started about 10 years ago, was among the first funding initiatives for endangered language documentation projects. An important aspect of this programme was the establishment of a central, specialized archive to take care of the long-term preservation of the valuable material collected by the documentation projects. The central archive, which is based at the Max Planck Institute for Psycholinguistics, was made an essential part of the programme because it had become clear that large amounts of recordings of languages and cultures were in danger of being lost forever. Old tapes and films that are not stored in specialized climate-controlled rooms degrade rapidly over time, but the situation is even worse for modern digital storage media such as DVDs and hard disks. Even if the media survive, technology changes so fast that it is very unlikely that there will be equipment around to read today’s storage media 20 years from now. A specialized digital archive will continuously migrate the stored material to the latest storage technology and will also migrate the stored file formats should they become obsolete.

Some researchers have doubts about storing their resources in an online archive. The arguments we hear are typically: (1) once my material is in there, I will not be able to get it out; or (2) other researchers will use my material without giving me credit and do all kinds of nice things with it. However, when you store material in the MPI archive, you maintain full control over access to the data through an online access management system (AMS). You are the owner of the data, and you will remain the owner of the data. You decide to whom you grant access. This also opens up opportunities to give access to members of the speech communities or the relatives of those recorded.

The MPI archive accepts deposits from linguists who do not have an affiliation with the MPI or DoBeS. Storing your data in the MPI archive has the advantage that the data is stored in an organized manner and that you can use online tools to search through your data and to visualize it in an attractive manner. Most importantly, we will safeguard your data by making various backup copies in the Netherlands and Germany, by always using the latest state of the art in storage technology and by migrating to newer file formats should the current ones become obsolete in the future.

If you are interested in storing your language data in the MPI archive, please inquire about the conditions with one of the archive managers: Paul Trilsbeek or Jacquelijn Ringersma.

The Language Archive (TLA)

by Peter Wittenburg & Wolfgang Klein

The digital era fundamentally changed a few characteristics of data management. For highly durable carriers such as the old clay tablets, or even some papyrus rolls, it is obvious that they survived thousands of years and still contain the information their creators wanted to convey. For the analogue electro-magnetic storage media introduced during the last century, it soon became obvious that the lifetime of the carriers is very limited, and we realized that every copy involved a loss of quality. As a consequence, a UNESCO survey found that about 80% of the material on cultures and languages in the ethno-linguistic domain is highly endangered. It was good practice to store master tapes in air-conditioned Faraday cages; however, for most recordings this was impossible, and the implicit “don’t touch” policy created a logistical problem on top of the cost, since the old players were no longer around after a few years.

The digital era in turn changed the challenges again, since (a) copying is comparatively easy and, if done carefully, does not lead to a loss of quality, and (b) as a matter of principle, the stored material needs to be touched regularly in order to migrate the carriers, the formats and the encodings and so maintain interpretability. Digital holdings are inherently dynamic and need a two-tier framework for life-cycle management: (1) data centers that take care of bit-stream preservation and (2) community centers that know about format and encoding principles. The worldwide debate about the loss of our scientific and cultural memory gives an impression of the urgency of the life-cycle management problem.

This was the background for the Max Planck Society and the MPI for Psycholinguistics to establish a new unit with the name “The Language Archive (TLA)” to take care of the long-term preservation of the huge treasure which is enclosed in its large digital archive and which has been created in a wide range of initiatives and sub-disciplines. As prominent examples we would like to mention the resources about language studies from MPI researchers, the archive about endangered languages created by the DOBES program and the digital human-ethological archive from Eibl-Eibesfeldt. A plan has been submitted for 25 years of persistence of such a unit to offer the necessary services of a digital archive such as deposit, access, searching, visualization and preservation and beyond these to look after a number of critical characteristics such as integrity, authenticity, usability, discoverability and interoperability. TLA will carry out this task in collaboration with the two big computer centres of the Max-Planck-Society which will focus on bit-stream preservation and in future also on giving access to the material following the agreed principles.

To fulfill its mission, TLA will have archiving experts who know about the metadata, formats, standards and encoding principles used in our domain and who can deploy curation strategies; software experts who can maintain the existing code base and develop new functionality; and system experts who will interact with the storage system managers of the MPI and the computer centres to take care of bit-stream preservation and proper security. We also see digital archiving, with its many facets, as a networking task, i.e. we will participate in relevant collaborations in order to apply state-of-the-art methods. One such network is the worldwide network of regional centres for language material, which will be supported in the future.

Thanks to the proven Language Archiving Technology (LAT) software suite, which has been developed over the last decades, the archive can be open to any serious language and cultural material that is of relevance for researchers. Based on open legal and ethical rules, material can be deposited and accessed via the web using a variety of tools. The archive will continue to participate in national and European projects to maintain the existing software, to provide new advanced functionality, and to establish professional research infrastructures that will improve data life-cycle management and access to language and cultural material.

TLA will formally start operation on 1 September 2010, led by Wolfgang Klein and Peter Wittenburg.

Lossless Video Compression with MJPEG 2000

by Paul Trilsbeek

Many compression algorithms in current video codecs are lossy, meaning that they throw away information that is considered less relevant for the perception of the image in order to reduce bandwidth. This information cannot be reconstructed afterwards, which is one of the reasons why archives like ours do not like to use lossy compression formats: one never knows whether the information that is thrown away might be relevant for future use of the data. Another reason is that every time compression is applied to an already lossy compressed signal, the signal degrades. This happens both when the same compression algorithm is applied again and when transcoding to a new (future) compression standard. Since codecs and file formats generally have a limited lifetime, an archive that wants to keep its content interpretable in the long run would see that content degrade over time because of the necessary conversion steps to the latest state of the art.

For these reasons, archives generally want to preserve data in as uncompressed a form as possible. The rapid deterioration of physical audiovisual carriers such as celluloid film and many deprecated video formats has triggered broadcast and film archives, but also the film industry, to start massive digitization projects. Due to the high costs of these operations and the high economic or cultural value of the material, it makes sense to store the digitized material in the highest possible quality: even if it were possible at all to repeat such digitization operations in the future (e.g. to account for new compression standards), it would be a big waste of time and money. The storage costs for these formats are substantial at the moment but will decrease with the introduction of newer, higher-capacity storage technology.

A codec that is being widely used at the moment in the film and video archiving world is Motion JPEG 2000. This codec allows for lossless – i.e. reversible – compression of moving images. It is also used as the standard for Digital Cinema, albeit currently in a lossy variant. In the lossless variant, compression ratios of about 1:2 can be achieved, but the main reason for moving to lossless or uncompressed storage of video is, as stated, to prevent future degradation of the signal if current codecs become obsolete.
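
The difference between reversible and lossy compression, and the way losses pile up when lossy material is transcoded, can be illustrated with a toy example in Python. Note the assumptions: this is not Motion JPEG 2000. The lossless round trip uses zlib from the Python standard library as a stand-in, and the "lossy codec" is a deliberately crude quantizer invented for this sketch.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress with zlib; the result equals the input."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(data: bytes, step: int) -> bytes:
    """Toy lossy codec: round every byte down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)


def max_error(original: bytes, copy: bytes) -> int:
    """Largest difference between corresponding bytes."""
    return max(abs(a - b) for a, b in zip(original, copy))


frame = bytes(range(256))                 # stand-in for one video frame

restored = lossless_round_trip(frame)
generation_1 = lossy_round_trip(frame, 4)         # today's lossy codec
generation_2 = lossy_round_trip(generation_1, 6)  # transcoded to a "future" codec

print(restored == frame)
print(max_error(frame, generation_1), max_error(frame, generation_2))
```

The lossless copy is bit-identical to the original, whereas the second lossy generation is further away from the original than the first.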

The MPI is currently digitizing a large collection of valuable videotapes from the German behavioral scientist Irenäus Eibl-Eibesfeldt. These recordings were originally made on 16mm film and have been transferred to Betacam SP (broadcast quality) video, which was a costly and time-consuming process. The Betacam SP tapes are now being digitized in lossless MJPEG2000 format in order to retain the highest possible quality for this material. In addition, MPEG2 and H.264 distribution copies are created to make the material more accessible over lower-bandwidth data connections.

The REPLIX Project

by Willem Elbers

The REPLIX project is studying and implementing the next level in grid-based replication and synchronization at a logical level by using iRODS. REPLIX is a joint project between DEISA, represented by Rechenzentrum Garching, and CLARIN and DOBES, both represented by the MPI for Psycholinguistics.

Goals

The two main goals are data preservation and authenticity control:

1) When we talk about data preservation, we mean guaranteeing that future generations can access and use the data we are archiving now. This includes managing different copies of the data and associated metadata at different physical locations, which is called replication. Metadata in this context includes system metadata (such as file size, creation date, etc.), complex user metadata (anything defined by the user, but also the relations defined by the user) and access restrictions (which user has access to which files and operations).

2) When we talk about authenticity control, we mean making sure the information remains authentic: not only the data files, but also the metadata associated with them. Since the data and metadata are replicated, the authenticity of each copy needs to be controlled. Moreover, access to files is also part of authenticity control. Only authorized editors should be able to edit the information and associated metadata.

Current Infrastructure

Illustration 1: Current Infrastructure

The current infrastructure takes care of replication at a physical level (using tools like rsync and Andrew File System (AFS)). At the moment this is similar to copying files from one location to another. For future use, this approach is too limited since replication causes source collections to be placed in different contexts, which cannot be properly handled by AFS or rsync. In order to ensure consistency of the collections a new approach is needed.

To be able to identify the archived objects in a unique way, MPI uses the handle system. The handle system creates persistent identifiers (PID) and associates them with file properties (such as a reliable checksum). The use of PIDs ensures the identification of archived objects now and in the future by a single identifier.

The current infrastructure consists of one central archive, located at the MPI in Nijmegen. The central archive is replicated (at the file level) to two large data centers, each managing two copies of the archive. Around the world, several satellite archives exist. Researchers use these satellite archives to ingest the information they collect. The first step in the current preservation process ensures proper ingestion into the central archive. The second step ensures proper ingestion into the two remote archives. This is shown in Illustration 1.

Future Infrastructure

REPLIX is researching possibilities to overcome the limitations of existing replication and authenticity control methods. To be more specific, REPLIX is researching how we can use iRODS to create a solution where information is ingested into the archive and replicated ensuring the integrity and authenticity of the data and metadata. The solution should also take care of synchronization of the central archive to the backup data centers. This step also has to verify the integrity and authenticity of the data.

The iRODS Solution

Illustration 2: iRODS Solution

Although tested for the MPI infrastructure, the approach should be easy to generalize so that any community can deposit information into a central zone and take advantage of the preservation facilities. Before we get there, we will first explore iRODS in general. We will then create a setup to synchronize the central archive with one of the two data centers. The next step is to include both data centers, and finally the satellite centers.

iRODS is a storage grid which uses rules to enforce policies on the actions performed on the data inside the storage grid, or to execute policies at regular intervals. One of the policies could be replication inside the storage grid: as soon as a file is ingested into the storage grid, it is automatically replicated onto several storage resources (hard discs, tapes, …). Another policy could make sure the file remains authentic by checking the file hashes and repairing any damaged replica(s). It is also possible to create a connection between two or more storage grids. Each storage grid manages its own data collection, and policies can be created to synchronize information between the storage grids.
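
The two example policies, replication on ingest and checksum-based repair, can be sketched in plain Python. To be clear about the assumptions: this is not iRODS rule code and does not use any iRODS API. Storage resources are modelled as simple dictionaries, SHA-256 is picked as the hash purely for the example, and all names are invented.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest used to check that a replica is intact."""
    return hashlib.sha256(data).hexdigest()


def ingest(object_id: str, data: bytes, resources: dict[str, dict[str, bytes]]) -> str:
    """Replication policy: store the object on every resource, return its checksum."""
    for store in resources.values():
        store[object_id] = data
    return checksum(data)


def verify_and_repair(object_id: str, expected: str,
                      resources: dict[str, dict[str, bytes]]) -> list[str]:
    """Authenticity policy: replace damaged or missing replicas with a healthy one."""
    healthy = [name for name, store in resources.items()
               if object_id in store and checksum(store[object_id]) == expected]
    if not healthy:
        raise RuntimeError(f"no intact replica of {object_id} left")
    good_copy = resources[healthy[0]][object_id]
    repaired = []
    for name, store in resources.items():
        if name not in healthy:
            store[object_id] = good_copy
            repaired.append(name)
    return repaired


resources = {"disk": {}, "tape": {}}
reference = ingest("rec-1", b"audio data", resources)
resources["tape"]["rec-1"] = b"audio dat\x00"   # simulate a damaged replica
print(verify_and_repair("rec-1", reference, resources))
```

In a real storage grid such a check would be triggered by a rule at regular intervals rather than called by hand.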

Ideally, the iRODS policies should use the PID system to identify the information in all storage grids and based on these identifiers, and the associated information such as a checksum, perform synchronization between multiple (n) storage grids. This synchronization process will verify the integrity and authenticity of the synchronized data. This is shown in Illustration 2.
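
As a rough sketch of this idea, the following Python code synchronizes two storage grids by comparing checksums per PID. It is an illustration only and does not use iRODS or the handle system: grids are plain dictionaries from PID to file content, the PIDs are made-up placeholders, and SHA-256 is an arbitrary choice of checksum for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest stored alongside the PID of an archived object."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(source_index: dict[str, str], target_index: dict[str, str]) -> list[str]:
    """Return the PIDs whose checksum is missing or different in the target."""
    return sorted(pid for pid, digest in source_index.items()
                  if target_index.get(pid) != digest)


def synchronize(source: dict[str, bytes], target: dict[str, bytes]) -> list[str]:
    """Copy changed objects to the target grid and verify every copy."""
    source_index = {pid: checksum(data) for pid, data in source.items()}
    target_index = {pid: checksum(data) for pid, data in target.items()}
    copied = plan_sync(source_index, target_index)
    for pid in copied:
        target[pid] = source[pid]
        if checksum(target[pid]) != source_index[pid]:
            raise RuntimeError(f"integrity check failed for {pid}")
    return copied


central = {"hdl:example/1": b"session A", "hdl:example/2": b"session B"}
data_center = {"hdl:example/1": b"session A"}
print(synchronize(central, data_center))
```

Because objects are identified by PID and compared by checksum, the same procedure works regardless of where a file happens to sit in the directory structure of each grid.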

ISO639-3 as new standard within the MPI Archive

by Alexander König

As the ISO standard for language codes, ISO 639-3, is widely accepted nowadays, the MPI has decided to adopt these codes as the new standard for metadata stored in its archive. In the process of moving to this new standard we have gone through all the metadata files currently stored in the archive and replaced older code schemes, like the two-letter ISO 639-1 variants, with their 639-3 equivalents. This was another step in harmonising the linguistic metadata in the archive to make it easier to use for researchers.
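
In essence, such a replacement is a table lookup. The sketch below is not the script we used; it only illustrates the principle with a handful of well-known code pairs, and it makes the simplifying assumption that any three-letter code it encounters is already a valid ISO 639-3 code. The function and table names are invented for this example.

```python
# A few ISO 639-1 codes and their ISO 639-3 equivalents (excerpt only)
ISO_639_1_TO_3 = {
    "de": "deu",
    "en": "eng",
    "fr": "fra",
    "nl": "nld",
}


def to_iso_639_3(code: str) -> str:
    """Map a two-letter ISO 639-1 code to its ISO 639-3 equivalent."""
    code = code.strip().lower()
    if len(code) == 3:
        # Assumption for this sketch: three-letter codes are already ISO 639-3
        return code
    if code not in ISO_639_1_TO_3:
        raise ValueError(f"no ISO 639-3 equivalent known for {code!r}")
    return ISO_639_1_TO_3[code]


print([to_iso_639_3(code) for code in ("DE", "en", "nld")])
```

Codes that cannot be mapped automatically raise an error, so that they can be checked by hand instead of being guessed.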