Archive for the Category LAT Software

 
 

In the pipeline: LEXAN, an advanced Annotation Framework for ELAN

by Herman Stehouwer & Sebastian Drude

As all linguistic field workers know, transcribing and further annotating audio and video recordings and other texts is a very expensive and time-consuming procedure. For a single hour of a recording of a lesser-documented language it can take more than a hundred hours of expert time to create useful linguistic annotations such as “basic annotation” (a transcription and a translation) and “basic glossing”: additional information on individual units – usually morphs, sometimes words – such as an individual gloss (indication of meaning or function) and perhaps categorical information such as part-of-speech tags (or their equivalents on the morphological level). More advanced glossing can take even longer.

Furthermore, information on the lexical units encountered in the texts needs to be transferred to a lexical tool. After all, one of the goals of field work is often to create a usable lexicon describing the endangered language.

Currently, this work is supported best by tools like (The Field Linguist’s) Toolbox or the FieldWorks Language Explorer (FLEx), neither of which offers proper support for media files. Many users have asked for support for advanced annotation tasks in ELAN, ideally using LEXUS to build, access and expand a lexical database. Making this possible is the objective of TLA’s newest project, LEXAN, a modular annotation support framework coupled to a new interface in ELAN. It will support different “annotyzers”, i.e. modules that produce annotation suggestions for the researcher, including machine-learning modules.

The “annotyzers” will work on a tier or set of tiers, the “source tier[s]”, as chosen by the user, and typically produce an additional tier or a group of tiers, the “target tier[s]”, with content generated based on the source tiers and additional data, e.g. lexical data.

A first annotyzer-like functionality of ELAN (without requiring interaction with a lexicon yet) would be the possibility to copy one entire source tier, for instance a detailed transcript, or a literal translation. The created target tier can then serve as a starting point for preparing another tier with similar but edited content, for instance a cleaner adapted version of the orthographic transcript, or an idiomatic free translation.

Similarly, a basic tokenizer would copy the individual words (recognized by spaces and perhaps hyphens or similar punctuation) on one source tier – containing an orthographical representation of a sentence – into separate annotation units on a new (target) word-tier, which can then be corrected (e.g., cells can be joined in the case of compound words such as black board, or conversely split in the case of clitics that are orthographically parts of larger words).
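The tokenizer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LEXAN code: the function name `tokenize_annotation`, the `(start_ms, end_ms, text)` tuple layout and the decision to divide the source interval evenly among the words are all invented here.

```python
import re

def tokenize_annotation(start_ms, end_ms, text):
    """Split one sentence-level annotation into word-level annotations.

    Words are separated at whitespace; word-internal hyphens are kept.
    The source interval is divided evenly among the words, which is an
    assumption of this sketch and would normally be corrected by hand.
    """
    words = re.findall(r"\S+", text)
    if not words:
        return []
    span = (end_ms - start_ms) / len(words)
    result = []
    for i, word in enumerate(words):
        w_start = start_ms + round(i * span)
        w_end = start_ms + round((i + 1) * span)
        # Remove punctuation at the edges of the token only.
        cleaned = word.strip(".,;:!?\"()")
        if cleaned:
            result.append((w_start, w_end, cleaned))
    return result

# Hypothetical source annotation: 0 to 3000 ms, one sentence.
words = tokenize_annotation(0, 3000, "The black board fell.")
```

Here "black" and "board" end up in separate cells, which is exactly the kind of case a user would then join by hand.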

As a possible next step, already making use of interaction with a lexicon, an annotyzer would use the annotations on the word-tier to build an “intermediate” database of individual inflected word forms. Each entry in this database would have at least a field containing the citation form of the lexical word for the given inflected word form, possibly together with a semantic label (lexical gloss) and a disambiguating homonym index in case two lexical words with identical citation forms exist. Some of these fields would be obtained from the lexicon once the citation form has been determined, and the citation form itself and other information (such as a “complete gloss” of the inflected word form, which includes semantic effects of inflectional categories and the like) could be written back to new target tiers in ELAN. Although much of this information would still have to be added by hand the first time an inflected word form occurs, this simple setting would already help to: a) create lexical entries for new lexical units, b) reduce writing when the form occurs a second, third etc. time, and c) encourage and support consistency.
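A minimal Python sketch of such an intermediate database follows. The class and field names (`WordFormEntry`, `IntermediateDB`, `annotate_word_tier`) are hypothetical, and the sketch assumes one entry per word form, which ignores ambiguous forms.

```python
from dataclasses import dataclass

@dataclass
class WordFormEntry:
    citation_form: str        # citation form of the lexical word
    lexical_gloss: str = ""   # semantic label taken from the lexicon
    homonym_index: int = 0    # disambiguates identical citation forms
    complete_gloss: str = ""  # gloss of the whole inflected form

class IntermediateDB:
    """Maps inflected word forms to stored entries (simplified: one entry per form)."""

    def __init__(self):
        self._entries = {}

    def lookup(self, word_form):
        return self._entries.get(word_form)

    def store(self, word_form, entry):
        self._entries[word_form] = entry

def annotate_word_tier(word_tier, db):
    """Return a citation-form target tier plus the forms still to be entered by hand."""
    citation_tier = []
    unknown = []
    for form in word_tier:
        entry = db.lookup(form)
        if entry is None:
            citation_tier.append("")  # left empty for the researcher
            if form not in unknown:
                unknown.append(form)
        else:
            citation_tier.append(entry.citation_form)
    return citation_tier, unknown

db = IntermediateDB()
db.store("walked", WordFormEntry("walk", "move on foot", 0, "walk-PST"))
citation_tier, unknown = annotate_word_tier(["she", "walked", "walked"], db)
```

The second and third occurrences of a form are filled in from the stored entry, which corresponds to points b) and c) above.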

Many users acquainted with Toolbox or FLEx would expect the future LEXAN to offer a “glossing” functionality like the one they know from these tools. This would include a parser module (generic or language-specific, based on pure string matching or more advanced and context-aware, static or with learning capacities, etc.) which would split up the individual inflected word forms on a source word-tier into individual morphs on a new target morph-tier. This morph-tier would then serve as a source for adding further target tiers with annotations such as glosses (indication of lexical meaning or functional/categorical effects) and perhaps part-of-speech-like tags (on the morpheme level). In the lexicon, this functionality would presuppose corresponding fields in all entries, such as a part-of-speech label and a gloss for each morph, which are probably the most common fields in lexical databases in field research anyway (in addition to the citation and variant forms of the morph and possibly a way to distinguish different but related senses, which are given as lexicographical definitions or translation equivalents). Again, correct parses and glosses would be stored in the intermediate database so that they can be re-used and referred to.
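As an illustration of the simplest case, a pure string-matching parser, the following Python sketch enumerates every way of splitting a word form into morphs from a known inventory. The function name `segment` and the toy English inventory are assumptions made for this example; a real parser module would also need to handle allomorphy and ranking of competing parses.

```python
def segment(word, morphs):
    """Return every way to split `word` into a sequence of known morphs."""
    if word == "":
        return [[]]  # exactly one parse of the empty string: no morphs
    parses = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in morphs:
            # Try to parse the remainder and prepend the matched morph.
            for rest in segment(word[i:], morphs):
                parses.append([prefix] + rest)
    return parses

# Toy morph inventory with glosses (hypothetical lexicon content).
lexicon = {"walk": "walk", "ed": "PST", "s": "3SG", "un": "NEG", "lock": "lock"}

parses = segment("unlocked", lexicon)
# Pair each morph with its gloss, as on a morph-tier and a gloss-tier.
glossed = [[(m, lexicon[m]) for m in parse] for parse in parses]
```

An empty result means the form cannot be parsed with the current inventory and has to be analysed by hand.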

It is well known that general parsers work better for some languages than for others (for instance, morphological parsers usually score high with predominantly isolating and agglutinative languages and less well with inflectional and polysynthetic languages). It is also true that glossing schemes and set-ups are based on specific types of linguistic theories – for instance, the setting presented above (which corresponds to the default functionalities of Toolbox and FLEx) is clearly tied to an “item-and-arrangement” (less so “item-and-process”) reasoning on language structure. In principle, an infrastructure such as the one proposed here should strive to be as interoperable with different linguistic theories as possible, which would imply that “word-and-paradigm” theories could also fruitfully use the tools and functionalities. The proposal of an “intermediate” database with one entry for every individual (inflected) word form goes in that direction, allowing, for instance, forms to be characterized with respect to their functional categories without assigning these categories to individual morphs. Of course, to be fully functional for arbitrary theories and language types, complex (multiple-word) forms must also be covered, which presupposes the development of modules (parsers and the like) that recognize syntactic structures and are able to cope with, say, discontinuous word forms.

More sophisticated and complete annotations on the morphological, syntactic and even other levels (phonetic/phonological, intonational) can be added by additional annotyzers as corresponding modules become available – for instance, morphological or syntactic constituent structures or grammatical relations could be generated (semi)automatically and represented in corresponding tiers in ELAN.

 

 


Figure: A schematic view of the architecture of LEXAN

LEXUS on the TextGrid infrastructure – exploring new potentialities

by André Moreira

A new LEXUS interface is now part of the TextGrid Laboratory environment.

Since 2010, TLA, together with the Institut für Deutsche Sprache (IDS), has been developing the technology required to integrate LEXUS into the TextGrid Laboratory. As of the latest TextGrid Laboratory Beta release on 7 December 2011, this work is available to the public.

TextGrid is a joint research project, part of the D-Grid initiative, and is funded by the German Federal Ministry of Education and Research (BMBF). It aims to support access to and exchange of data in the arts and humanities by means of modern information technology (the grid).
TextGrid serves as a virtual research environment for philologists, linguists, musicologists and art historians. As a single point of entry to the virtual research environment, the TextGrid Laboratory provides integrated access to specialized tools, services and content.

The TextGrid Laboratory is a cross-platform, highly modular software application based on the Eclipse RCP platform. The modularity reaches the user through plug-ins, which can be installed to expand the Laboratory’s functionality. Each of these plug-ins is usually a different tool, and together they create “a single point of entry to the virtual research environment”. By default, the application bundles a set of plug-ins that are available right after installation, and LEXUS is now among them.


Figure 1 - TextGrid Laboratory architecture portrait

The LEXUS plug-in itself aims to emphasize the possibilities of using the LEXUS web service for the language resources commonly encountered in the TextGrid environment, as well as to demonstrate the usability of the TextGrid environment for non-standard languages such as those commonly found in LEXUS.
From the user’s point of view it is a very simple plug-in that allows the user to search for occurrences of a certain word or character sequence in a lexical database. As in LEXUS, the search can be conducted in specific data categories of a lexicon, and different search setups can be used, e.g. searching for all the lexical entries that start with a certain prefix (fig. 2).
When displaying the results, LEXUS presents the full structure and data of the lexical entry containing the match, as well as a custom HTML view for that specific entry. This HTML view can be customized by the data manager (lexicon owner) through the regular LEXUS interface, thus enabling very flexible control over the layout of the lexical entries.
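The search setups described above can be illustrated with a small in-memory sketch in Python. It does not call the LEXUS web service; the function `search_lexicon`, the mode names and the sample data categories `lemma` and `syllabification` are all assumptions made for the example.

```python
def search_lexicon(entries, query, data_category, mode="contains"):
    """Return the entries whose value in `data_category` matches `query`.

    `mode` selects the search setup: "starts_with", "contains" or "exact".
    """
    matches = []
    for entry in entries:
        value = entry.get(data_category, "")
        if mode == "starts_with":
            hit = value.startswith(query)
        elif mode == "contains":
            hit = query in value
        elif mode == "exact":
            hit = value == query
        else:
            raise ValueError(f"unknown search mode: {mode}")
        if hit:
            # The whole entry is returned, not just the matching field.
            matches.append(entry)
    return matches

# Hypothetical sample entries from a syllabification lexicon.
entries = [
    {"lemma": "aufgabe", "syllabification": "auf-ga-be"},
    {"lemma": "kaufen", "syllabification": "kau-fen"},
    {"lemma": "haus", "syllabification": "haus"},
]
prefix_hits = search_lexicon(entries, "auf", "lemma", mode="starts_with")
substring_hits = search_lexicon(entries, "auf", "lemma", mode="contains")
```

Returning the full entry mirrors the behaviour described above, where the complete structure of the matching lexical entry is shown.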


Figure 2 - TextGrid-LEXUS plug-in screenshot. Searching for the prefix 'auf' in the German syllabification lexicon.

Technically, the LEXUS-TextGrid implementation is divided into two main components: the TextGrid Laboratory plug-in, which provides the user with a full SWT-based user interface, and a SOAP web service made available through the LEXUS back-end running on TLA servers.
Even though the web service was developed within the scope of TextGrid, it also allows other clients to interface with it, as already happens with the LEXUS plug-in for ELAN.

For the time being, every TextGrid Laboratory user will have two lexica available out of the box to search and explore. These are very simple lexica, made available for demonstration purposes. One contains the syllabification of most known German words, and the other contains a sample set of lexical entries from the endangered Wichita language.


Figure 3 - TextGrid-LEXUS plug-in screenshot. Searching the Wichita lexicon. Note the custom HTML layout on the right.

In the future we plan to extend the functionality made available by LEXUS in the TextGrid Laboratory, for instance by assigning to each user a private LEXUS workspace so that the user can also have private lexica, in addition to the lexica made available for every TextGrid Laboratory user.
Moreover, plans exist to further integrate ANNEX into the TextGrid Laboratory, enabling more TLA software functionality in the Laboratory workbench.

The Language Archive officially launched

by Sebastian Drude

On Tuesday, 11 October 2011, “The Language Archive” (TLA), the new unit of the Max Planck Institute for Psycholinguistics, was officially launched at a public event with more than 150 guests and speeches by eminent representatives from Germany and the Netherlands.

Many more people showed up than expected: there were not even enough seats for all guests at the launch of TLA in the headquarters of the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) at the Gendarmenmarkt in the center of Berlin. The BBAW is one of the three supporting institutions of TLA, together with the Dutch Koninklijke Nederlandse Akademie van Wetenschappen (KNAW) and the German Max-Planck-Gesellschaft (MPG).

The guests were offered coffee and snacks, but first and foremost a great deal of content: five eminent representatives of the major stakeholders of the new unit gave fascinating talks on different topics, all related to the ongoing and future activities of TLA. These were, on the one hand, the representatives of the three supporting institutions: Wolfgang Klein for the MPG, Angelika Storrer for the BBAW, and Theo Mulder for the KNAW. On the other hand, Wilhelm Krull represented the Volkswagenstiftung, the funding agency that has supported the programme “Documentation of Endangered Languages” (DOBES) since 2000; the programme itself was represented by Nikolaus P. Himmelmann. The DOBES archive is in many respects the core of the archive hosted by TLA. After the talks, Paul Trilsbeek provided a look into the archive itself.

The full program and topics of the speeches

Welcome and objectives for the language archive
Prof. Dr. Wolfgang Klein
Director at the Max Planck Institute for Psycholinguistics

Language research and language documentation in the digital age
Prof. Dr. Angelika Storrer
Zentrum Sprache of the BBAW

E-science: a major challenge for the humanities
Prof. Dr. Theo Mulder
Research Director of the KNAW

Documentation of endangered languages – a task for science and society
Dr. Wilhelm Krull
Secretary General of the VolkswagenStiftung

How linguistics found (and is finding) its way to empirical methods
Prof. Dr. Nikolaus P. Himmelmann
University of Cologne

A look into the archive
(interactive presentation)


Introduction of a new transcription mode in ELAN 4.1.0

by Aarthy Somasundaram & Han Sloetjes

In this new release of ELAN a completely new “Transcription Mode” and an improved “Segmentation Mode” are introduced. Both have been developed in close cooperation with ELAN users.
The Transcription Mode is built for high-speed transcription. Where the traditional Annotation Mode can be seen as accuracy-oriented rather than productivity-oriented, the Transcription Mode aims to increase the speed and efficiency of transcription work. The user interface has been designed with convenient text entry in mind: the main element is a table containing the annotations of selected tier types, displayed in vertical order. Each cell in the table represents an annotation (or a position where a depending annotation can be created). The segments (annotations) need to be created first, in the Segmentation or Annotation Mode, after which text can be typed into the (empty) segments in this mode. Operation in this mode is very much keyboard-oriented. Selecting an annotation plays the corresponding segment automatically and brings it into edit mode, ready for you to start typing. Press the TAB key to replay. After editing, hit ENTER (or use the navigation keys) to jump to the next annotation, which plays that segment automatically and lets you start typing right away, and so on. Activating a cell silently creates child annotations if they don’t exist yet: merely clicking an empty cell (or moving there using the keyboard) creates an annotation and opens it for editing. All this reduces transcription work to just listening and typing, making it easy for the transcriber.

Figure 1: The Transcription Mode

 

“On-the-fly Segmentation” has been moved out of its separate dialog and into the main window as the new Segmentation Mode. It is now easier to switch between tiers while the media is playing. Segments are created by keystrokes and can be modified by dragging with the mouse. This mode introduces a preliminary step-and-repeat playback mode.

Apart from that, some new multiple-file processing functions have been added, such as annotations from overlaps and annotation statistics. An option to add a group of tiers for a new participant has been implemented, as well as an option to delete multiple tiers in one action. Customization of the program has been improved by the introduction of new preference elements.

The new version can be downloaded at the ELAN web site where you will also find the updated manual, detailing how to use the new modes and other new functionalities.

New release of ELAN – Version 4.0.0

by Aarthy Somasundaram

Toward the end of last year a new version of ELAN was released, containing many new features and improved functionalities, a new media player solution for Windows, and fixes for a number of issues and bugs in previous versions.

A first implementation of interaction with LEXUS, the MPI-developed web-based lexicon tool for creating and editing lexical databases, has been added. A new lexicon viewer allows the user to look up values in an online lexicon and to apply a value to the selected annotation.

ELAN has been facing many codec-related problems, especially with MPEG-1 and MPEG-2 files. To eliminate some of them, a new player for Windows has been developed based on DirectShow (JDS, Java-DirectShow).
To use this player, it is necessary to select it first in the Platform/OS tab in the “Edit Preferences” window.

This version extends the support for controlled vocabularies to externally defined closed controlled vocabularies (located, e.g., on the web). The list of supported file formats for importing controlled vocabularies has been extended with .txt and .csv. The file format of externally defined closed controlled vocabulary files is .ecv, which is close to .eaf.

To make life easier and to increase the working speed of ELAN users, several improvements have been made so that things get done with fewer steps and clicks. A few tier-based operations, like removing multiple annotations or annotation values from selected tiers or creating depending annotations recursively on all depending tiers, can now be performed much faster and more easily. It is also possible to create depending annotations automatically when an annotation is created on a tier with dependent tiers. The merge transcriptions function has been extended with options for appending one file to the other, making the merging process more versatile.

Further support for audio and video recognizers, as developed in e.g. the AVATecH Project, has been implemented. To learn more about this project, visit the AVATecH website.

You can download the new version at the ELAN web site where you will also find the updated manual detailing how to use the new functionalities.

Embeddable Annex

by Thomas Koller

The MPI developers recently made a new Annex feature available which allows users to embed a smaller-sized customised version of Annex into any web page. This new feature has since then been warmly welcomed by researchers inside and outside our institute as it is a great way to easily show research results to outsiders.

The embeddable version of Annex only supports access to freely accessible annotation resource bundles, i.e. resource bundles which can be accessed from the IMDI browser without user login. This restriction helps to avoid authentication issues and effectively protects resources with restricted access.

This new feature can be accessed directly from Annex by clicking on embed in the menu. Then a small dialog pops up where the user can customise the HTML snippet before copying it to the clipboard and pasting it into a webpage. This works pretty much the same way as the similar YouTube feature which users may already be familiar with. The following options are available:

  • Show border around embedded Annex application: the creator can select a border width, a border color and a border type (solid, dotted or dashed)
  • Size of the embedded Annex application: 4 predefined sizes are available. The user can also set any custom size directly in the HTML markup. It should be noted, however, that the layout and component sizes of the embedded Annex application have been optimised for the 4 predefined sizes, so a custom size set in the HTML snippet can lead to a non-optimal looking Annex instance.
  • Default view: text or subtitle. Any other default view (such as timeline or grid) will be ignored; the ‘text’ view will be set instead.
  • Tier text font: This setting may be helpful if the user wants the embedded Annex to display an annotation resource with special characters which may not be contained in a standard font on the user’s computer. If the ‘Tier text font’ parameter is set with a font name which is not available on the user’s computer, then the embedded Annex application will automatically fall back to a standard font. The end user also has the option to change the tier text font and the font size at any time via a dropdown list.

The embedded Annex application has a Start Full ANNEX button in its top right corner. When the end user clicks this button, a new browser tab will open the full Annex version showing the same annotation resource.

The CLARIN-NL metadata tutorial

by Dieter van Uytvanck

On Friday, May 27, about 25 people gathered at the Max Planck Institute in Nijmegen to attend a workshop on the practical use of the Component Metadata Infrastructure (CMDI) for the description of language resources. CMDI is the metadata part of CLARIN, a European initiative to create a Common Language Resources and Technology Infrastructure.

After a short introduction to metadata in general and a historical sketch, the concepts behind CMDI were introduced: the core ideas behind the new metadata format are modularity, reusability, and the use of data categories. A special session was dedicated to the use of ISOcat, the reference implementation of a data category registry. The idea behind this is to have a dependable definition of what is meant by a data category such as, for example, Part of Speech. This way it doesn’t matter what you call it or how you spell it in your particular metadata schema; the connection to similar schemata is always clear.
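The data category idea can be made concrete with a small Python sketch. Two metadata schemata name their fields differently but point to the same data category, so matching fields can be found without comparing names. The identifiers (`dc:partOfSpeech` and so on) are placeholders invented for this example, not real ISOcat identifiers, and `shared_fields` is a hypothetical helper.

```python
# Each schema maps its own field names to a data category identifier.
# The identifiers below are placeholders, not real registry entries.
SCHEMA_A = {"POS": "dc:partOfSpeech", "Lang": "dc:languageName"}
SCHEMA_B = {"part-of-speech": "dc:partOfSpeech", "title": "dc:resourceTitle"}

def shared_fields(schema_a, schema_b):
    """Return {category: (field name in A, field name in B)} for shared categories."""
    by_category = {}
    for name, category in schema_a.items():
        by_category.setdefault(category, [None, None])[0] = name
    for name, category in schema_b.items():
        by_category.setdefault(category, [None, None])[1] = name
    # Keep only the categories that occur in both schemata.
    return {
        category: tuple(names)
        for category, names in by_category.items()
        if None not in names
    }

links = shared_fields(SCHEMA_A, SCHEMA_B)
```

The field names `POS` and `part-of-speech` never have to match; the link is established through the shared category alone.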

After these more general introductions, the specific CMDI software was presented.

First the Component Registry was shown. It is a web application that can be used for inspecting, searching, creating and editing CMDI metadata components. Afterwards it was illustrated how to create CMDI metadata files using a version of Arbil that has been modified to interact directly with the Component Registry. Both Arbil and the Component Registry are developed by the Max Planck Institute for Psycholinguistics and were presented by their respective developers. Although both applications are still under development, it was clear that they can already be used for the production of CMDI metadata.

All slides of the presentations can be downloaded from the CLARIN NL website.

More information about CMDI, including links to the software so you can try it out yourself, can be found on the main CLARIN site.

ANNEX and ELAN – A Comparison

by Thomas Koller and Han Sloetjes

ANNEX and ELAN are two closely related applications designed for handling digital media files and associated annotation files. While ELAN, as a desktop application, is used for the creation of rich annotations on audio and video recordings, ANNEX is a web-based viewer that allows users to study annotated resources once they have been properly stored on the archive server.

This short article aims at highlighting on the one hand what features they have in common and on the other hand what features are unique to each tool.

ELAN is a local tool (desktop application) for the creation of annotations to audio and/or video recordings. It is a combination of a media player with viewer and editor components for annotations. The annotation documents are stored in the XML-based ELAN Annotation Format (EAF). ELAN is written in the Java programming language and is available for Windows, Mac OS X and Linux. On Windows and Mac the media playback is delegated to an available high-performance native media framework: DirectX/DirectShow on Windows and QuickTime on Mac. On Linux JMF is used. The list of supported file types depends on the available media player frameworks.

ELAN main window

Although there is limited support for streaming media via the RTSP protocol, most commonly the media files are accessed directly on a local hard drive or the local network. This guarantees high accuracy in media playback, especially in (repeated) playback of fragments of the media, which is usually a basic step in the process of segmenting the media. The annotation boundaries can be determined with millisecond precision. ELAN supports simultaneous, synchronized playback of up to 4 video files. The annotation documents are stored locally as well. The variant of the TROVA search engine that is distributed with ELAN can query the contents of physical directory structures. To that end it creates temporary in-memory indexes for the content of selected folders and files. The search is limited to EAF files. The ELAN window offers several customizable views on the annotation data, all synchronized with the media player. All viewers are editors at the same time. Many operations are provided for manipulating tiers and annotations.
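To make the idea of a temporary in-memory index over EAF content more tangible, here is a minimal Python sketch. It is not how TROVA is implemented; the function `build_index` and the word-level indexing strategy are assumptions of this example. The sample document is a heavily reduced EAF fragment whose element and attribute names are assumed to follow the EAF schema; real files carry more structure, and time slots without a `TIME_VALUE` are not handled here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Reduced sample document; a real EAF file contains more elements and attributes.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1250"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="2480"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcript">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello there</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>hello again</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

def build_index(eaf_text):
    """Map each lower-cased word to the (tier, start_ms, end_ms) of its annotations."""
    root = ET.fromstring(eaf_text)
    # Resolve time slot ids to millisecond values.
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    index = defaultdict(list)
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            value = ann.findtext("ANNOTATION_VALUE", default="")
            start = times[ann.get("TIME_SLOT_REF1")]
            end = times[ann.get("TIME_SLOT_REF2")]
            for word in value.lower().split():
                index[word].append((tier.get("TIER_ID"), start, end))
    return index

index = build_index(SAMPLE_EAF)
```

Because annotation boundaries are stored in milliseconds, a hit in the index can be turned directly into a media fragment to play back.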

ANNEX is written as an ELAN-compliant browser-based tool (web application) that supports media playback via HTTP pseudostreaming and the Flash Player browser plugin. For freely accessible language resources ANNEX can also be embedded in any web page by pasting a simple HTML snippet into the page (comparable to the way YouTube supports embedding of videos into web pages). Alongside the media player it contains several customizable viewer components for annotations. By default both the media files and the annotation files are streamed from the MPI online archive; there is no need to download files in order to view their contents. ANNEX is seamlessly integrated with the archive access management tools and interacts with available web services, for example the ones exposed by the lexicon tool LEXUS. Other tools in turn can make parameterized calls to ANNEX.

ANNEX works with the online version of TROVA, which creates an index for a whole LAMUS archive using the Postgres database system. This version of TROVA supports not only EAF but also Shoebox, CHAT and generic XML, HTML and text files.

Comparison Matrix
Feature | ANNEX | ELAN
Number of synchronized videos | 1 | 4
Media file types | MPG for video files, WAV for audio files | Depending on the media framework of the particular platform
Waveform for audio | .wav only | .wav only
Media playback precision | Depends on keyframe rate | milliseconds
Streaming media support | Pseudostreaming for audio and video files | Limited, via rtsp
Annotation formats | EAF, Toolbox/Shoebox, Chat. Will be converted to a single XML format for transfer. | EAF, import of Toolbox, Chat, Praat, Transcriber, CSV
Annotation editing | No | Yes
Number of tiers | Unlimited | Unlimited
Font usage | Any font available on the system | Any font available on the system supported by Java
Search options | TROVA search engine, search in entire (accessible part of) archive | Single file search and multiple file search (TROVA) in local corpus
Technology | Flash, XML, Quicktime (temporarily for resources with master audio file) | Java, XML
Tool interaction, API | Support for parameterized calls to ANNEX | Extension mechanism for particular parts of the application

LEXUS and ViCoS: a software ‘couple’ in the LAT suite

by Jacquelijn Ringersma

LEXUS is our online tool for the creation of multimedia lexica and encyclopedic dictionaries. LEXUS is targeted at linguists involved in language documentation, but is also actively used by researchers in Sign Language research. LEXUS is based on the ISO recommendation for Language Resource Management (ISO TC37/SC4), providing a Lexical Markup Framework (LMF) lexicon structure and a concept naming registry (ISOcat). With LEXUS, users can create lexica from scratch, but also import lexica created in Toolbox or other XML-based tools. Lexica using LMF and ISOcat are interoperable with each other, allowing for multi-lexicon searches and merging of lexica. Users may customize views of the word list and lexical entries. Standard functionality, like sorting or filtering of word lists, is already available, and we are currently working on paper output options. One of the major strengths of the online tool is that users may share their lexica with other users, either on a read-only or read/write basis.

ViCoS is an extension of LEXUS, with which users can create relations between lexical entries, using fuzzily defined relation types. The result of this network of relations can be a conceptual space, where each word is represented as an element in a network of other related words. Relations can be ‘universal’ (e.g. A_is_a_B) or specifically defined for a particular lexicon (A_eats_B). In its current version ViCoS can only be used from the LEXUS user interface, since the words are the basis of the conceptual space. Future plans for ViCoS envisage that the tool will be central in the creation of a customized ‘eScience environment’, a user-defined workspace where users can link any type of resource into new organizational layers.
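A conceptual space of this kind is essentially a graph with typed edges. The following Python sketch shows one way to model it; the class `ConceptualSpace` and its methods are hypothetical and say nothing about how ViCoS stores its relations.

```python
from collections import defaultdict

class ConceptualSpace:
    """A network of lexical entries linked by named relation types."""

    def __init__(self):
        # source entry -> set of (relation type, target entry)
        self._edges = defaultdict(set)

    def relate(self, source, relation, target):
        """Add a relation such as ("dog", "is_a", "animal")."""
        self._edges[source].add((relation, target))

    def related(self, source, relation=None):
        """Return targets linked from `source`, optionally for one relation type."""
        edges = self._edges.get(source, set())
        return sorted(t for r, t in edges if relation is None or r == relation)

space = ConceptualSpace()
space.relate("dog", "is_a", "animal")  # a 'universal' relation type
space.relate("dog", "eats", "meat")    # a lexicon-specific relation type
everything = space.related("dog")
food = space.related("dog", "eats")
```

Because relation types are plain labels, a lexicon owner can introduce new ones (such as `eats`) without changing the data structure.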

LEXUS and ViCoS training and support

Recently we held a LEXUS/ViCoS training session at the Winter School Saami Language Documentation and Revitalization in Bodø, Norway. Some 25 participants were trained in creating lexica, adding multimedia fragments, customizing views and creating conceptual spaces. Although the training was basic and could not cover the full functionality of LEXUS and ViCoS, most users were enthusiastic about the tools and registered as LEXUS users after the training.

At the Saami Winter School (photo by Lena Karvovskaya)

If you are interested in using the tools, you may request a LEXUS user account by sending an e-mail to Jacquelijn Ringersma. We have regular LEXUS and ViCoS training in the DoBeS training weeks, or in summer schools and language documentation workshops.

Arbil

by Peter Withers

Arbil (short for “Archive Builder”) is an application for arranging research material and associated metadata into a format appropriate for archiving. It is basically the successor to the IMDI Editor, which a lot of our readers probably know. That tool is now almost ten years old, and a lot has happened in software engineering in that time. Therefore, instead of simply updating the IMDI Editor to incorporate user suggestions and modern software architecture principles, the MPI has decided to create a completely new application that will replace the IMDI Editor altogether.

The most obvious difference to the old IMDI Editor is that Arbil has a tabular display, which allows a comparative view on the data and the ability to copy-paste between matching sets of data.

Arbil Screenshot 1

Arbil main screen

While Arbil is primarily a tool to enter metadata, it also has functions to help organise the collected material and create a local well-organised corpus before it is archived. These functions include the ability to search for and compare metadata, and to open the resource files in associated applications, e.g. ELAN for annotations or a media player to watch videos.

Arbil Screenshot 2

Arbil's embedded image viewer

Arbil has been designed as a local application so that it can also be used offline, for instance at remote field sites. The metadata and resource files can be entered in part or as a whole; once an Internet connection is available, the previously entered data and associated structures can be exported from Arbil and then transferred to the main archive via LAMUS. As the idea behind Arbil is to deal with a complete corpus or corpus branch, Arbil can mirror branches from the main archive on your local computer so that they can still be referred to offline and in the field.

Unlike the IMDI Editor, Arbil incorporates the functionality of a number of separate tools while being designed around the workflow of the user. This means the application is meant to be the one program around which all your archiving work will be centered. In Arbil the metadata is viewed in tables, which can contain a single node of metadata as a list of fields, or many different nodes each with its fields as a separate row in the table. This tabular view of the data allows multiple metadata nodes to be compared across the rows of the table. If the metadata node is editable then these fields can be edited in any table in which they are viewed. Drag and drop is used extensively both for constructing a hierarchical corpus tree structure and for adding nodes to tables for viewing and editing. Bulk editing of metadata can be done primarily via copy and paste, which allows a string of text to be pasted into multiple fields of multiple rows, or to paste multiple fields into the matching fields of multiple rows.

Arbil Screenshot 3

Tabular view of multiple IMDI files

The basic user interface of Arbil is highly customisable. For example, you can decide which columns of a table should be visible and save various table layouts to switch quickly between views suited to different tasks. Furthermore, columns can be resized and reordered, and a table can be sorted on any column. Rows can easily be added and then dragged from one table to another, and the cells can be highlighted based on matching text.

With all the various metadata fields it is not always easy to see at a glance what a particular field is intended for and which fields are required. For this reason a description of the intended usage of each field is displayed in a tool tip. Similarly, required fields are indicated by a textual and colour highlight when the data is not filled in. Likewise, for fields requiring specific formatting, such as date fields, the cell and the table are highlighted when the formatting is incorrect.
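The kind of check that drives such highlighting can be sketched in Python as follows. This is an illustration only: the required field names, the `YYYY-MM-DD` date format and the function `validate` are assumptions of this example, not Arbil's actual rules.

```python
from datetime import datetime

# Hypothetical set of required fields for a metadata node.
REQUIRED = {"Name", "Date"}

def validate(node):
    """Return {field name: problem description} for one metadata node."""
    problems = {}
    for field in REQUIRED:
        if not node.get(field, "").strip():
            problems[field] = "required field is empty"
    date = node.get("Date", "").strip()
    if date:
        try:
            # Assumed date format; rejects impossible dates such as month 13.
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            problems["Date"] = "expected YYYY-MM-DD"
    return problems

bad_date = validate({"Name": "Session 1", "Date": "2011-13-40"})
missing_name = validate({"Name": "", "Date": "2011-10-11"})
```

A user interface would then highlight every cell whose field name appears in the returned dictionary.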

Arbil Screenshot 4

Arbil window style

In contrast to the IMDI Editor, Arbil is more of a team player, which means that there are various import and export facilities. Any valid IMDI file can be imported into Arbil. If the metadata refers to resource files, they can optionally be imported at the same time, for instance when migrating or merging from one computer to another. All of the IMDI data and the associated resources within Arbil can be exported into a self-contained directory. The exported files can then be uploaded into LAMUS, where new corpus branches can be added and existing sessions that have been edited can be replaced. During both the import and export processes, all of the metadata is validated and a list of warnings is given if there are any errors. The textual data entered into Arbil can be exported in formats other than IMDI; for instance, the contents of a table can be copied and pasted into a text editor or a spreadsheet. A custom style sheet can be used to transform and export the data in a particular format. The tables used in Arbil can also be embedded in a web page, where the table’s contents, size, columns and highlighting will be displayed as they are in Arbil.

Arbil continues to be actively developed to extend these features further and to make the user experience as pleasant as possible. Why don’t you try it out yourself?