Reclustering tokenized tier annotations

Posted by rueter at 2011-04-06 15:15  
I am a relatively new user of ELAN, and I would like to know whether the annotations on a tokenized tier can be clustered and the resulting clusters be referred to in a referring tier.

(1) I have time-aligned utterances with transcriptions of the speech in tier Q-Spch

(2) I have time-aligned (time subdivision) words with transcriptions of the individual word forms in tier Q-Words, (parent: Q-Spch; derived by tokenization from same)

(3) The Q-Words-Orth tier is symbolically associated with the Q-Words tier (parent: Q-Words; copied from same). Punctuation marks have been added, with a space between each word form and the punctuation that follows it.

(4) The Q-Words-Orth-Token tier has word forms and punctuation all in separate annotations.
(parent: Q-Words-Orth; derived by tokenization from same)

(5) I want to make a new tier in which contiguous tokens in (4) can be clustered, i.e. word forms and punctuation might be joined to form normative-type sentences.

Utterances (1) and words (2) have been seen as primary tiers for further time subdivision, whereas orthographically represented words (3)-(4) and normative sentences (5) are to be symbolically associated.

Since utterances do not necessarily correspond to normative sentences in length or break points, I am hoping for future developments that will allow reclustering. Essentially, this would allow reference to real subsets of all time subdivisions in a mutual parent, or reference to real subsets of all time subdivisions in adjacent parents.

Hence a selection in (5) might contain references to some, but not necessarily all, of the tokens of a selection in (1), as manifest in (2)-(4).
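To make the request concrete, here is a minimal sketch in Python of the structure described in (1)-(5). It is not ELAN code and does not use the EAF file format; the names `Token`, `tokenize`, `render` and `utterances_of`, and the sample sentences, are hypothetical and exist only for illustration. It assumes that each sentence on tier (5) is a contiguous run of tokens from tier (4).

```python
from dataclasses import dataclass

PUNCT = {".", ",", "?", "!"}  # assumed punctuation inventory, for illustration only


@dataclass(frozen=True)
class Token:
    utterance: int  # index of the ancestor utterance on tier (1)
    text: str       # word form or punctuation mark on tier (4)


# Tier (3): orthographic words grouped by utterance, with punctuation
# separated from the word form by a space.
orth_words = [
    ["Well", "I", "went", "home .", "Then", "it"],  # utterance 0
    ["rained", "and", "we", "stayed", "in ."],      # utterance 1
]


def tokenize(words_by_utterance):
    """Tier (4): split every orthographic word on whitespace."""
    return [
        Token(u, piece)
        for u, words in enumerate(words_by_utterance)
        for word in words
        for piece in word.split()
    ]


def render(tokens, span):
    """Join the tokens of one cluster, attaching punctuation to the left."""
    out = ""
    for i in span:
        text = tokens[i].text
        out += text if (not out or text in PUNCT) else " " + text
    return out


def utterances_of(tokens, span):
    """Utterances on tier (1) that a cluster draws its tokens from."""
    return sorted({tokens[i].utterance for i in span})


tokens = tokenize(orth_words)

# Tier (5): each sentence is a contiguous range of token indices.
# The second sentence starts inside utterance 0 and ends in utterance 1.
sentences = [range(0, 5), range(5, 13)]

for span in sentences:
    print(render(tokens, span), utterances_of(tokens, span))
```

The second cluster uses only part of the tokens of utterance 0 together with all tokens of utterance 1, which is the case that a plain parent-child tier dependency cannot express.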

Based on the presentation of linguistic type stereotypes in chapter 5, section 1 of the manual, I have assumed that work on reclustering tokenized elements might still be underway. I would perhaps have preferred to see the orthographic tier as a child of the phonetic tier, but I don't know whether the example given is representative of a printed text that is read.

Is tier (5) going to be possible in the near future? The presence of orthographic sentences would help in the use of morphological analyzers and syntactic parsers which might be available for written forms of a language but not a phonetic transcription. 

Thank you for your troubles
Jack Rueter

Re: Reclustering tokenized tier annotations

Posted by hasloe at 2011-04-07 16:50  

If I understand the question correctly then this would be what in some (older) papers about the ELAN data model is referred to as "multiple_ref" or "co-reference" annotations. This has been part of the design from the start, but unfortunately has never been implemented so far. That work is indeed still underway and hopefully we'll be able to implement it this year (but no promises can be made here).

-Han

Re: Re: Reclustering tokenized tier annotations

Posted by rueter at 2011-04-08 16:14  

Thanks, Han!

The multiple-reference annotation sounds like what I would like.
(a) break up transcribed utterances into transcribed words (time subdivision)
(b) assign normative orthographic representations to the individual transcribed words (symbolic association)
(c) cluster normative orthographic representations in a multiple-reference tier where annotations do not necessarily align with utterance annotations in (a).

It would really be nice if multiple-reference annotation could be a part of this year's development.
Thanks again,
Jack Rueter

 
