Performance Evaluation of Analysers¶

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science.

GATE provides a variety of tools for automatic evaluation. The Annotation Diff tool compares two annotation sets within a document. Corpus QA extends Annotation Diff to an entire corpus. The Corpus Benchmark tool also provides functionality for comparing annotation sets over an entire corpus.

Preparing GATE Developer¶

• Restart GATE, or close all documents and processing resources (PRs) to tidy up.
• Download the hands-on material.
• Load the annie-hands-on/news-texts into a corpus.
• Take a look at the annotations.
• There is a set called “Key”. This is a set of annotations against which we want to evaluate ANNIE. In practice, they could be manual annotations, or annotations from another application.
• Load ANNIE and run it
• You should have annotations in the Default set from ANNIE, and in the Key set, against which we can compare them.

Annotation Diff Tool¶

The Annotation Diff tool enables two sets of annotations in one or two documents to be compared, in order either to compare a system-annotated text with a reference (hand-annotated) text, or to compare the output of two different versions of the system (or two different systems).

It can be accessed both from GATE Developer and from GATE Embedded. Annotation Diff compares sets of annotations with the same type. The comparison takes both the annotation offsets and the annotation features into consideration.

All annotations from the key set are compared with the ones from the response set, and those found to have the same start and end offsets are displayed on the same line in the table. Then, Annotation Diff evaluates whether the features of each annotation from the response set subsume the features of the corresponding annotation from the key set, as specified by the feature names you provide.
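The two steps described above (pairing by identical offsets, then checking feature subsumption) can be sketched roughly as follows. This is an illustrative sketch only, not GATE's implementation: annotations are modelled as plain dictionaries, and the feature names `locType` and `rule` are hypothetical.

```python
def pair_by_offsets(key_annots, response_annots):
    """Pair key and response annotations that share the same start and end offsets."""
    responses_by_span = {(r["start"], r["end"]): r for r in response_annots}
    pairs = []
    for k in key_annots:
        r = responses_by_span.get((k["start"], k["end"]))
        if r is not None:
            pairs.append((k, r))
    return pairs


def features_subsume(response_features, key_features, feature_names):
    """True if the response carries the same value as the key for every
    listed feature that the key annotation defines."""
    return all(
        response_features.get(name) == key_features[name]
        for name in feature_names
        if name in key_features
    )


# Hypothetical data: one key annotation and one response annotation.
key = [{"start": 0, "end": 9, "features": {"locType": "city"}}]
response = [{"start": 0, "end": 9, "features": {"locType": "city", "rule": "Loc1"}}]

pairs = pair_by_offsets(key, response)
matches = [features_subsume(r["features"], k["features"], ["locType"]) for k, r in pairs]
```

Extra features on the response (here `rule`) do not prevent a match, because only the listed feature names are compared.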

Annotation Diff Exercise¶

• Open the document “ft-airlines-27-jul-2001.xml”
• Open the Annotation Diff (Tools → Annotation Diff, or click the icon)
• For the Key set (containing the manual annotations) select the Key annotation set
• For the Response set (containing annotations from ANNIE) select the Default annotation set
• For the Type option select the Organization annotation.
• Click on “Compare”
• Scroll down the list to see correct, partially correct, missing and false positive annotations

[Figure: Annotation Diff output]

Comparing the individual annotations¶

• In the Annotation Diff, colour codes indicate whether the annotation pair shown is correct, partially correct, missing (false negative) or false positive.
• You can sort the columns however you like.

[Figure: comparing individual annotations in Annotation Diff]

Measuring success¶

• In information extraction (IE), we classify the annotations in one of 4 ways:
Correct = things annotated correctly
e.g. annotating “Sheffield” as a Location
Missing = things not annotated that should have been
e.g. not annotating “Sheffield” as a Location
False positive (spurious) = things annotated wrongly
e.g. annotating “Sheffield” as a Location in “Sheffield United F.C.”
Partially correct = the annotation type is correct, but the span is wrong
e.g. annotating just “Trump” as a Person (too short) or annotating “Unfortunately Donald Trump” as a Person (too long)
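The four categories above can be illustrated with a small sketch. It assumes annotations are `(start, end, type)` tuples and uses a simple two-pass matching (exact matches first, then overlapping spans of the same type); this is a simplification for illustration, not the matching algorithm GATE itself uses. The example offsets refer to the hypothetical text "Unfortunately Donald Trump visited Sheffield".

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def classify(key, response):
    """Count correct, partially correct, missing and spurious annotations."""
    counts = {"correct": 0, "partial": 0, "missing": 0, "spurious": 0}
    remaining_response = list(response)
    remaining_key = []

    # Pass 1: identical span and type counts as correct.
    for k in key:
        if k in remaining_response:
            remaining_response.remove(k)
            counts["correct"] += 1
        else:
            remaining_key.append(k)

    # Pass 2: same type with an overlapping span counts as partially correct;
    # a key annotation with no counterpart at all is missing.
    for k in remaining_key:
        match = next(
            (r for r in remaining_response if r[2] == k[2] and overlaps(k, r)),
            None,
        )
        if match is not None:
            remaining_response.remove(match)
            counts["partial"] += 1
        else:
            counts["missing"] += 1

    # Whatever is left in the response has no counterpart in the key.
    counts["spurious"] = len(remaining_response)
    return counts


key = [(14, 26, "Person"), (35, 44, "Location")]       # "Donald Trump", "Sheffield"
response = [(21, 26, "Person"),                        # just "Trump": too short
            (35, 44, "Location"),                      # exact match
            (27, 34, "Organization")]                  # "visited": wrong
counts = classify(key, response)
```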

Finding Precision, Recall and F-measure¶

[Figure: precision, recall and F-measure scores in Annotation Diff]

Precision¶

How many of the entities your application found were correct?

[Figure: precision formula]
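In terms of the categories from “Measuring success”, precision is conventionally written as below. This is the standard textbook definition, given here on the assumption that it matches the tool's strict score; partially correct annotations are left out for simplicity.

$$
\text{Precision} = \frac{\text{Correct}}{\text{Correct} + \text{False positive}}
$$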

Recall¶

• How many of the entities that exist did your application find?
• Sometimes recall is called coverage

[Figure: recall formula]
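In terms of the categories from “Measuring success”, recall is conventionally written as below. As with any such summary, this is the standard textbook definition rather than a statement of the tool's exact computation; partially correct annotations are left out for simplicity.

$$
\text{Recall} = \frac{\text{Correct}}{\text{Correct} + \text{Missing}}
$$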

F-Measure¶

• Precision and recall tend to trade off against one another.
• If you specify your rules precisely to improve precision, you may get a lower recall.
• If you make your rules very general, you get good recall, but low precision.
• This makes it difficult to compare applications, or to check whether a change has improved or worsened the results overall.
• F-measure combines precision and recall into one measure.
• Usually, precision and recall are equally weighted.
• This is known as F1, which is the harmonic mean of precision and recall.
• To use F1, set the value of the F-measure weight to 1; this is the default setting in the Annotation Diff tool.

[Figure: F-measure formula]
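A minimal sketch of the weighted F-measure follows. It uses the standard F-beta definition, (1 + β²)·P·R / (β²·P + R); that the tool's F-measure weight plays the role of β in exactly this form is an assumption. With β = 1 the expression reduces to the harmonic mean of precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 gives F1, the harmonic mean of P and R."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both scores are zero; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


f1 = f_measure(0.5, 1.0)                        # equal weighting
f_recall_heavy = f_measure(0.5, 1.0, beta=2.0)  # beta > 1 favours recall
```

Because recall is the higher of the two scores in this example, weighting recall more heavily gives a larger value than F1.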

Annotation Diff defaults to F1¶

[Figure: F-measure weight set to 1 by default in Annotation Diff]

Statistics can mean what you want them to...¶

How we want to measure partially correct annotations may differ, depending on our goal. The most usual way is to consider them to be “half right”.
In GATE, there are 3 different ways to measure them:
• Strict: Only perfectly matching annotations are counted as correct.
• Lenient: Partially matching annotations are counted as correct. This makes your scores look better :)
• Average: Strict and lenient scores are averaged (this is the same as counting a half weight for every partially correct annotation).
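The three modes can be sketched as a single computation in which a partially correct annotation earns a credit of 0, 1 or 0.5. The sketch assumes that precision is taken over all response annotations (correct + partial + spurious) and recall over all key annotations (correct + partial + missing); the counts in the example are made up.

```python
def precision_recall_by_mode(correct, partial, missing, spurious):
    """Precision and recall under strict, lenient and average scoring."""
    n_response = correct + partial + spurious  # everything the system produced
    n_key = correct + partial + missing        # everything in the key set

    def ratio(numerator, denominator):
        # 0.0 for an empty set is an arbitrary choice made for this sketch.
        return numerator / denominator if denominator else 0.0

    result = {}
    for mode, credit in (("strict", 0.0), ("lenient", 1.0), ("average", 0.5)):
        matched = correct + credit * partial
        result[mode] = {
            "precision": ratio(matched, n_response),
            "recall": ratio(matched, n_key),
        }
    return result


# Made-up counts: 6 correct, 2 partially correct, 2 missing, 2 spurious.
scores = precision_recall_by_mode(correct=6, partial=2, missing=2, spurious=2)
```

With these counts, the average precision lies midway between the strict and lenient values, which is what a half weight per partially correct annotation implies.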

[Figure: strict, lenient and average scores in Annotation Diff]

Creating a Gold Standard with the Annotation Diff Tool¶

In order to create a gold standard set from two sets, you need to show the ‘Adjudication’ panel at the bottom. It will insert two checkbox columns in the central table. Tick boxes in the columns ‘K(ey)’ and ‘R(esponse)’, then enter a target set name in the text field and use the ‘Copy selection to target’ button to copy all selected annotations to the target annotation set.

There is a context menu for the checkboxes to tick them quickly.

Each time you copy the selection to the target set, the copied rows are hidden in further comparisons. In this way, you will see only the annotations that haven’t been processed. At the end of the gold standard creation you should have an empty table.

To see the copied rows again, select the ‘Statistics’ tab at the bottom and click the ‘Compare’ button.

[Figure: copying annotations to a target set in Annotation Diff]