Inter-rater reliability
Could you explain how the reliability score in Compare Annotators is calculated? Do you know if it's similar to Krippendorff's alpha, which is 1 minus the ratio of the observed disagreement to the disagreement one would expect if the coding of units were attributable to chance?
It would be great if we could say that our team's high "Compare Annotators" score (over 0.9 for all compared tiers) roughly translates to a high score on other measures of inter-rater reliability that are accepted in academia. Otherwise, would we need to re-compare using other methods?
Thanks!
-Samantha
Re: Inter-rater reliability
I don't think the Compare Annotators function is similar to any of the inter-rater reliability measures accepted in academia. Including some commonly used methods for calculating agreement (kappa or alpha) in ELAN is on our wishlist, but they are not there yet. It is, by the way, my impression that there is no common agreement on exactly how to deal with differences in the time alignment of the raters' annotations.
Compare Annotators in ELAN just compares the alignment of annotations on two tiers. For two overlapping annotations, it calculates the ratio of their overlap (AND) to their "merged" interval (OR), i.e. the total extent covered by the two together. The closer the value is to 1, the better the agreement.
It iterates over the annotations on the first tier, finds the overlapping annotation(s) on the second tier, and calculates the above ratio for the second-tier annotation with the largest overlap. If there are two annotations within the interval, the ratio is not calculated for the smaller one; it gets a value of 0 instead. (So the average decreases rapidly with such differences in segmentation.)
It only compares the segmentation/alignment; the annotation content is ignored. You can export the results and filter on the annotation content if you want.
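To make the description above concrete, here is a rough Python sketch of that calculation. It is not ELAN's actual code: the function names, the (begin, end) millisecond tuples, and the choice to give a first-tier annotation with no overlapping partner a value of 0 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: an annotation is (begin_ms, end_ms); content is ignored.
Interval = Tuple[int, int]


def overlap_length(a: Interval, b: Interval) -> int:
    """Length of the overlap (AND) of two intervals, 0 if they do not overlap."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_ratio(a: Interval, b: Interval) -> float:
    """Overlap (AND) divided by the merged extent (OR) of two intervals."""
    overlap = overlap_length(a, b)
    if overlap == 0:
        return 0.0
    merged = max(a[1], b[1]) - min(a[0], b[0])
    return overlap / merged


def compare_tiers(tier1: List[Interval], tier2: List[Interval]) -> List[float]:
    """One value per matched pair, plus a 0 for every extra overlapping annotation."""
    values: List[float] = []
    for a in tier1:
        candidates = [b for b in tier2 if overlap_length(a, b) > 0]
        if not candidates:
            # Assumption: an annotation without any partner counts as 0.
            values.append(0.0)
            continue
        # Only the second-tier annotation with the largest overlap gets a ratio.
        best = max(candidates, key=lambda b: overlap_length(a, b))
        values.append(overlap_ratio(a, best))
        # Every other overlapping annotation contributes a 0 value.
        values.extend([0.0] * (len(candidates) - 1))
    return values


tier1 = [(0, 1000), (2000, 3000)]
tier2 = [(100, 1000), (2000, 2400), (2500, 3000)]
values = compare_tiers(tier1, tier2)  # [0.9, 0.5, 0.0]
average = sum(values) / len(values)   # about 0.47
```

In this example the second rater split the second annotation in two. The larger piece scores 0.5 and the smaller piece scores 0, which is how one difference in segmentation pulls the average down from 0.9 to about 0.47.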
So, this is explicitly not a commonly accepted inter-rater agreement measure.
-Han