The Segment Processing PR
What is it?
• A PR (processing resource) that lets you process labelled
sections of a document independently, one at a time
• Useful:
  – for very large documents
  – when you want annotations in different sections to be
    independent of each other
  – when you only want to process certain sections within
    a document
Processing large documents
• If you have a very large document, processing it may be very slow
• One solution is to chop it up into smaller documents and process
each one separately, using a datastore to avoid keeping all the
documents in memory at once
• But this means you then need to merge all the documents back
afterwards
• The Segment Processing PR does all of this in a single step, by
processing each labelled section separately
• This is quicker than processing the whole document at once,
because storing a lot of annotations (even if they are not being
accessed) slows down the processing
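The idea of processing each labelled section separately, with no merge step afterwards, can be sketched roughly as follows. This is a minimal Python illustration, not GATE's implementation: `find_capitalised` is a hypothetical toy annotator, and `process_by_segment` simply runs it on one section at a time and shifts the resulting offsets back into whole-document coordinates.

```python
import re
from typing import Callable, List, Tuple

Annotation = Tuple[int, int, str]  # (start, end, type)


def find_capitalised(text: str) -> List[Annotation]:
    """Toy annotator (hypothetical): mark each capitalised word as a 'Name'."""
    return [(m.start(), m.end(), "Name")
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def process_by_segment(
    text: str,
    segments: List[Tuple[int, int]],
    annotate: Callable[[str], List[Annotation]],
) -> List[Annotation]:
    """Run `annotate` on each (start, end) section in turn."""
    results: List[Annotation] = []
    for seg_start, seg_end in segments:
        # The annotator only ever sees the text of this one section.
        local = annotate(text[seg_start:seg_end])
        # Shift section-relative offsets back into document coordinates,
        # so the results land in the original document directly.
        results.extend((seg_start + s, seg_start + e, t) for s, e, t in local)
    return results


doc = "Alice met Bob. ### Carol left."
segments = [(0, 14), (19, 30)]  # the two labelled sections; "###" is skipped
annotations = process_by_segment(doc, segments, find_capitalised)
print([doc[s:e] for s, e, _ in annotations])
```

Because the offsets are translated as each section finishes, there is nothing to stitch back together at the end, which is the point the slide makes about avoiding a separate merge.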
Processing Sections Independently
• Another problem with large documents can arise when you
want to handle each section separately
• You may not want annotations to be co-referenced across
sections, for instance if a web page has profiles of different
people with similar names
• Using the Segment Processing PR enables you to handle each
section separately, without breaking up the document
• It also enables you to use different PRs for each section, using
a conditional controller
• For example, some documents may have sections in different
languages
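To make the co-reference problem concrete, here is a small Python sketch under stated assumptions. The matcher `link_mentions` is a deliberately naive, hypothetical stand-in for a co-reference component (it is not how GATE resolves names): it links every one-word mention to the first full name in scope that ends with that word. Running it over the whole document and then per section shows how the scope changes the result.

```python
from typing import Dict, List, Tuple

Mention = Tuple[str, int, str]  # (section id, offset, surface text)


def link_mentions(mentions: List[Mention]) -> Dict[int, str]:
    """Toy matcher (hypothetical): link each one-word mention to the first
    full name in scope whose last word equals that mention."""
    full_names = [text for _, _, text in mentions if " " in text]
    links: Dict[int, str] = {}
    for _, offset, text in mentions:
        if " " in text:
            continue  # full names are antecedents, not things to resolve
        for name in full_names:
            if name.split()[-1] == text:
                links[offset] = name
                break
    return links


def link_per_section(mentions: List[Mention]) -> Dict[int, str]:
    """Apply the same matcher, but to each section in isolation."""
    by_section: Dict[str, List[Mention]] = {}
    for mention in mentions:
        by_section.setdefault(mention[0], []).append(mention)
    links: Dict[int, str] = {}
    for section_mentions in by_section.values():
        links.update(link_mentions(section_mentions))
    return links


mentions = [
    ("profile-1", 0, "John Smith"),
    ("profile-1", 20, "Smith"),
    ("profile-2", 100, "Jane Smith"),
    ("profile-2", 130, "Smith"),
]
whole_document = link_mentions(mentions)
per_section = link_per_section(mentions)
print(whole_document)
print(per_section)
```

With the whole document in scope, the "Smith" in the second profile is attached to John Smith; handled per section, it is attached to Jane Smith.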
Problematic co-references

Getting rid of the junk
• Another very common problem is that some documents contain
lots of “junk” that you don't want to process, e.g. HTML files
that contain JavaScript, contents lists, footers, etc.
• There are a number of ways to deal with this; you may
need to experiment to find the best solution for each case
  – Use the Segment Processing PR to process only the
    section(s) you are interested in and ignore the junk
  – Use the AnnotationSetTransfer PR, though this works
    slightly differently
  – Use the Boilerpipe PR, which works best for ignoring
    standard kinds of boilerplate
How does Segment Processing PR work?
• The PR is part of the Alignment Plugin
• A new application needs to be created, containing the
Segment PR
• The PR then takes, as one of its parameters, an instance of
the application that you want to run on the document (e.g.
ANNIE)
• You can add other PRs before or after the Segment PR, if
you want them to run over the whole document rather than
the specified section(s)
Running ANNIE on a title segment

Segment Processing Parameters
• The Segment Processing PR calls the ANNIE application
• ANNIE is set to run only on the text covered by the span of the
“title” annotation in the Original markups annotation set
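The configuration above (an inner application restricted to the span of one annotation type in one annotation set) can be modelled with a short Python sketch. Everything here is an assumption made for illustration: the `Document` and `Annotation` classes, the `run_on_segments` function, and the parameter names `controller`, `segment_annotation_type`, `input_as_name` and `output_as_name` are hypothetical and have not been checked against the PR's real parameter names. A toy tokeniser stands in for ANNIE.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    start: int
    end: int
    type: str


@dataclass
class Document:
    text: str
    # Annotation set name -> annotations in that set.
    annotation_sets: Dict[str, List[Annotation]] = field(default_factory=dict)


def tokenise(text: str) -> List[Annotation]:
    """Toy stand-in for the wrapped application: one Token per word."""
    return [Annotation(m.start(), m.end(), "Token")
            for m in re.finditer(r"\w+", text)]


def run_on_segments(
    doc: Document,
    controller: Callable[[str], List[Annotation]],
    segment_annotation_type: str,
    input_as_name: str,
    output_as_name: str = "Output",
) -> Document:
    """Run `controller` only on text covered by the chosen segment type."""
    output = doc.annotation_sets.setdefault(output_as_name, [])
    # Copy the list so the loop is safe even if input and output sets match.
    for seg in list(doc.annotation_sets.get(input_as_name, [])):
        if seg.type != segment_annotation_type:
            continue  # every other section is left untouched
        for ann in controller(doc.text[seg.start:seg.end]):
            output.append(
                Annotation(seg.start + ann.start, seg.start + ann.end, ann.type)
            )
    return doc


doc = Document(
    text="BBC News\nSome body text here.",
    annotation_sets={
        "Original markups": [Annotation(0, 8, "title"), Annotation(9, 29, "body")]
    },
)
run_on_segments(doc, tokenise, "title", "Original markups")
tokens = [doc.text[a.start:a.end] for a in doc.annotation_sets["Output"]]
print(tokens)
```

Only the words inside the title span receive Token annotations; the body text is never passed to the wrapped application.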
Annotation Result

• Green shading shows the title, which spans the section to be
annotated
• The only NE found is the Organization “BBC News” in the title
• Tokens in the rest of the document are not annotated