Annotating and Normalizing Dates:

Many information extraction tasks benefit from or require the extraction of accurate date information.

While ANNIE does produce Date annotations, no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation.

The PR in the Tagger_DateNormalizer plugin attempts to fill this gap by normalizing dates against the date of the document in order to tie each Date annotation to a specific date.

This includes normalizing dates such as April 1st, today, yesterday, and next Tuesday, as well as converting fully specified dates (ones in which the day, month and year are specified) into a common format.

Init parameters:

Different cultures and countries have different conventions for writing dates, and different languages use different words for the days of the week and the months of the year. The locale init parameter tells the PR which conventions to expect:

● locale: the locale used for parsing dates, in the form lang(_country(_variant)?)?. If left blank, the default locale of the JVM running GATE will be used.

For example, to specify British English (where the day usually comes before the month in a date) use en_GB, or for American English (where the month usually comes before the day) use en_US.

If you need to override the locale for an individual document, you can do this by setting a document feature called locale to a string encoded as above. If neither the initialization parameter nor the document feature is present, or if they do not represent a valid locale, then the default locale of the JVM running GATE will be used.
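The locale encoding described above can be illustrated with a small Java sketch. This is not the plugin's own code: the class LocaleSketch and the helper parseLocale are hypothetical names, and the sketch only assumes that the string is split on underscores and that a blank value falls back to the JVM default.

```java
import java.util.Locale;

// Hypothetical illustration of the lang(_country(_variant)?)? encoding;
// not taken from the Date Normalizer source.
class LocaleSketch {

    // Turns "lang", "lang_country" or "lang_country_variant" into a Locale.
    // A null or blank string falls back to the default locale of the JVM.
    static Locale parseLocale(String encoded) {
        if (encoded == null || encoded.trim().isEmpty()) {
            return Locale.getDefault();
        }
        String[] parts = encoded.trim().split("_", 3);
        String language = parts[0];
        String country = parts.length > 1 ? parts[1] : "";
        String variant = parts.length > 2 ? parts[2] : "";
        // This constructor is deprecated in recent JDKs but is available
        // in every Java version.
        return new Locale(language, country, variant);
    }

    public static void main(String[] args) {
        Locale british = parseLocale("en_GB");
        Locale american = parseLocale("en_US");
        System.out.println(british.getLanguage() + " / " + british.getCountry());
        System.out.println(american.getLanguage() + " / " + american.getCountry());
        System.out.println(parseLocale("").equals(Locale.getDefault()));
    }
}
```

The sketch does no validation of the language or country codes, so it does not reproduce the "not a valid locale" fallback mentioned above.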

Once initialized and added to a pipeline, the Date Normalizer has the following runtime parameters that can be used to control its behaviour.

Runtime parameters:

● annotationName: the annotation type created by this PR, defaults to Date.
● dateFormat: the format that dates should be normalized to. The format of this parameter is the same as that used by Java's SimpleDateFormat, whose documentation describes the full range of possible formats (note that you must use MM for month and not mm, which means minutes). This defaults to dd/MM/yyyy. Note that this parameter is only required if the numericOutput parameter is set to false.
● failOnMissingInputAnnotations: if the input annotations (Tokens) are missing, should this PR fail or just do nothing? Defaults to true so that obvious mistakes in pipeline configuration are caught at an early stage.
● inputASName: the annotation set used as input to this PR.
● normalizedDocumentFeature: if set then the normalized version of the document date will be stored in a document feature with this name. This parameter defaults to normalized-date although it can be left blank to suppress storage of the document date.
● numericOutput: if true, then instead of formatting the normalized dates as String features of the Date annotations, they are converted into a numeric representation. Specifically, each date is first converted to the form yyyyMMdd and then cast to a Double. This is useful because dates can then be sorted numerically (which is fast). If false, the formatting string in the dateFormat parameter is used to create a string representation. This defaults to false.
● outputASName: the annotation set to which new annotations will be added.
● sourceOfDocumentDate: a list of the names of annotations, annotation features (encoded as Annotation.feature), and document features to inspect when trying to determine the date of the document. The PR works through the list, getting the text of the feature, or the text under the annotation if no feature is specified, and then parsing this to find a fully specified date, i.e. one where the day, month and year are all present. Once a date is found, processing of the list stops and that date is used as the date of the document.

If you specify an annotation that can occur multiple times in a document, the instances are sorted by a numeric priority feature (which defaults to 0) or by their order within the document. The idea here is that there are multiple ways to determine the date of a document, but most are domain specific, and this allows previous PRs in an application to determine the document date.

This defaults to an empty list, which is taken to mean that the document was written on the day it is being processed. The same assumption applies if no fully specified date can be found once the whole list has been processed.
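The "first fully specified date wins, otherwise fall back to today" behaviour of sourceOfDocumentDate can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the plugin's implementation: DocumentDateSketch, parseFullDate and resolveDocumentDate are hypothetical names, the candidate texts are assumed to have already been collected from the listed annotations and features, the strict dd/MM/yyyy parser is only a stand-in for the PR's own date parsing, and priority-based sorting is left out.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not taken from the Date Normalizer source.
class DocumentDateSketch {

    // Stand-in parser (an assumption): accepts only fully specified dates
    // written as dd/MM/yyyy and returns null for anything else.
    static Date parseFullDate(String text) {
        if (text == null) {
            return null;
        }
        SimpleDateFormat format = new SimpleDateFormat("dd/MM/yyyy", Locale.ROOT);
        format.setLenient(false);
        try {
            return format.parse(text.trim());
        } catch (ParseException e) {
            return null;
        }
    }

    // Works through the candidates in order and stops at the first one that
    // yields a fully specified date; otherwise returns the fallback.
    static Date resolveDocumentDate(List<String> candidateTexts, Date fallback) {
        for (String text : candidateTexts) {
            Date parsed = parseFullDate(text);
            if (parsed != null) {
                return parsed;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("April", "", "12/03/2011", "25/12/2012");
        // new Date() models "the day the document is being processed".
        Date documentDate = resolveDocumentDate(candidates, new Date());
        System.out.println(new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT).format(documentDate));
    }
}
```

In this run the partial date "April" and the empty string are skipped, and the third candidate is used because it is the first one with a day, month and year.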

Note

A common mistake is to think you can use a date annotated by this PR as the document date. The document date is determined before the document is processed, so any annotation you wish to use to represent the document date must exist before this PR executes.
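As an illustration of the dateFormat and numericOutput parameters described earlier, the following Java sketch formats one date both ways. It is a minimal sketch rather than the plugin's code: DateOutputSketch, asString and asNumeric are hypothetical names, and the only behaviour assumed is what the parameter descriptions state (a SimpleDateFormat pattern for string output, and yyyyMMdd converted to a Double for numeric output).

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical illustration; not taken from the Date Normalizer source.
class DateOutputSketch {

    // String output: format the date with a SimpleDateFormat pattern.
    static String asString(Date date, String dateFormat) {
        return new SimpleDateFormat(dateFormat, Locale.ROOT).format(date);
    }

    // Numeric output: render as yyyyMMdd, then convert to a Double.
    static Double asNumeric(Date date) {
        return Double.valueOf(new SimpleDateFormat("yyyyMMdd", Locale.ROOT).format(date));
    }

    public static void main(String[] args) {
        // 1 April 2012, 09:30 in the JVM's default time zone.
        Date date = new GregorianCalendar(2012, Calendar.APRIL, 1, 9, 30).getTime();

        System.out.println(asString(date, "dd/MM/yyyy")); // prints 01/04/2012
        // mm means minutes, so the month is silently replaced by 30.
        System.out.println(asString(date, "dd/mm/yyyy")); // prints 01/30/2012
        System.out.println(asNumeric(date));              // prints 2.0120401E7
    }
}
```

Because later dates always produce larger yyyyMMdd values, comparing the numeric form orders dates chronologically without any parsing.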

The annotations created by this PR have the following features:

● normalized: the normalized date in the format specified through the relevant runtime parameters of the PR.
● inferred: an integer which specifies which parts of the date had to be inferred. The value is a bit mask created from the following flags: day = 1, month = 2, and year = 4. You can find which (if any) flags are set by using the code (inferred & FLAG) == FLAG, e.g. to see if the day of the month had to be inferred you would use (inferred & 1) == 1.
● complete: if no part of the date had to be inferred (i.e. inferred = 0) then this will be true, false otherwise.
● relative: can take the values past, present or future to show how this specific date relates to the document date.
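The bit mask stored in the inferred feature can be decoded as in the following Java sketch. The flag values come from the description above; the class InferredFlagsSketch, its constants and its helper methods are hypothetical names, and the value 5 is just an example input rather than output from the PR.

```java
// Hypothetical illustration of decoding the inferred bit mask.
class InferredFlagsSketch {
    static final int DAY = 1;
    static final int MONTH = 2;
    static final int YEAR = 4;

    // True if the given flag is set in the bit mask.
    static boolean isInferred(int inferred, int flag) {
        return (inferred & flag) == flag;
    }

    // Mirrors the complete feature: nothing had to be inferred.
    static boolean isComplete(int inferred) {
        return inferred == 0;
    }

    public static void main(String[] args) {
        int inferred = DAY | YEAR; // example value 5: day and year inferred

        System.out.println("day inferred:   " + isInferred(inferred, DAY));   // true
        System.out.println("month inferred: " + isInferred(inferred, MONTH)); // false
        System.out.println("year inferred:  " + isInferred(inferred, YEAR));  // true
        System.out.println("complete:       " + isComplete(inferred));        // false
    }
}
```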

Infer the Results:

[Figure: Date Normalizer output]