The Segment Processing PR¶

What is it?¶

● A processing resource (PR) that lets you process labelled sections of a document independently, one at a time
● Useful:
  ● for very large documents
  ● when you want annotations in different sections to be independent of each other
  ● when you only want to process certain sections within a document

Processing large documents¶

• If you have a very large document, processing it may be very slow
• One solution is to chop it up into smaller documents and process each one separately, using a datastore to avoid keeping all the documents in memory at once
• But this means you then need to merge all the documents back afterwards
• The Segment Processing PR avoids the separate split and merge steps: it processes each labelled section separately while keeping the document in one piece
• This is quicker than processing the whole document in one go, because a large number of stored annotations slows processing down, even if they are not being accessed
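
The idea of processing each labelled section separately and putting the results back into the one document can be sketched as below. This is a minimal toy model in plain Python, not the GATE API: `Annotation`, `find_capitalised` and `process_segments` are hypothetical names invented for illustration, and the "application" is just a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    start: int  # character offset into the full document
    end: int


def find_capitalised(text):
    """Toy stand-in for an application: annotate each capitalised word."""
    return [Annotation("CapWord", m.start(), m.end())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def process_segments(text, segments, app):
    """Run `app` on each segment's text only, then shift the resulting
    offsets so they are valid in the full document again."""
    merged = []
    for seg in segments:
        chunk = text[seg.start:seg.end]
        for ann in app(chunk):
            merged.append(Annotation(ann.type,
                                     ann.start + seg.start,
                                     ann.end + seg.start))
    return merged


doc = "Alice met Bob. Carol met Dave."
sections = [Annotation("section", 0, 14), Annotation("section", 15, 30)]
annotations = process_segments(doc, sections, find_capitalised)
found = [doc[a.start:a.end] for a in annotations]
```

The application only ever sees one section's text at a time, yet every annotation ends up with offsets into the original document, so no separate merge step is needed.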

Processing Sections Independently¶

• Another problem with large documents can arise when you want to handle each section separately
• You may not want annotations to be co-referenced across sections, for instance if a web page has profiles of different people with similar names
• Using the Segment Processing PR enables you to handle each section separately, without breaking up the document
• It also enables you to use different PRs for each section, using a conditional controller
• For example, some documents may have sections in different languages
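
To see why section-by-section processing matters for co-reference, consider the toy sketch below. It is an illustration only: `resolve_surnames` is a hypothetical, deliberately naive resolver (it links a bare surname to the first full name with that surname), and it is not how any GATE co-reference component works.

```python
import re


def resolve_surnames(text):
    """Naive co-reference: link each bare surname to the first full name
    in `text` that ends with that surname."""
    mentions = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)?", text)
    full_names = [m for m in mentions if " " in m]
    links = []
    for mention in mentions:
        if " " in mention:
            continue  # full names are referents, not things to resolve
        referent = next((f for f in full_names
                         if f.split()[1] == mention), None)
        if referent is not None:
            links.append((mention, referent))
    return links


profiles = [
    "John Smith is a baker. Smith lives in Leeds.",
    "Jane Smith is a pilot. Smith flies often.",
]

# Whole document at once: both bare "Smith" mentions link to John Smith.
whole_document = resolve_surnames(" ".join(profiles))

# One profile at a time: each "Smith" stays within its own section.
per_section = [link for p in profiles for link in resolve_surnames(p)]
```

Run over the whole text, the second profile's "Smith" is wrongly linked to "John Smith"; run per section, it is linked to "Jane Smith".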

Problematic co-references¶

[Figure: problematic co-references]

Getting rid of the junk¶

• Another very common problem is that some documents contain lots of “junk” that you don't want to process, e.g. HTML files that contain JavaScript, contents lists, footers etc.
• There are a number of ways to exclude this junk: you may need to experiment to find the best solution for each case
• Options include:
  • Segment Processing
  • AnnotationSetTransfer
  • Boilerpipe

How does Segment Processing PR work?¶

• The PR is part of the Alignment Plugin
• You need to create a new application that contains the Segment Processing PR
• The PR then takes, as one of its parameters, an instance of the application that you want to run on the document (e.g. ANNIE)
• You can add other PRs before or after the Segment PR, if you want them to run over the whole document rather than the specified section(s)
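
The arrangement described above, with some PRs seeing the whole document and the wrapped application seeing only the segments, can be sketched as follows. This is a toy model in plain Python, not GATE code: `make_pr`, `run_pipeline` and the PR names `cleanup`, `inner_app` and `export` are all hypothetical, and each "PR" simply records the text it was given.

```python
def make_pr(name, trace):
    """Build a toy PR that only records which text it was asked to process."""
    def pr(text):
        trace.append((name, text))
    return pr


def run_pipeline(text, segments, before, inner_app, after):
    """PRs in `before` and `after` run over the whole document;
    `inner_app` runs once per segment, on that segment's text only."""
    for pr in before:
        pr(text)
    for start, end in segments:
        inner_app(text[start:end])
    for pr in after:
        pr(text)


trace = []
doc = "HEADER body text FOOTER"
segments = [(7, 16)]  # character span of "body text"

run_pipeline(
    doc,
    segments,
    before=[make_pr("cleanup", trace)],
    inner_app=make_pr("inner_app", trace),
    after=[make_pr("export", trace)],
)
```

After the run, `trace` shows that `cleanup` and `export` received the full text while `inner_app` received only the segment.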

Running ANNIE on a title segment¶

[Figure: Segment Processing PR on title]

Segment Processing Parameters¶

• Segment Processing PR calls the ANNIE application
• ANNIE is set to run only on the text covered by the span of the “title” annotation in the Original markups annotation set
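
The effect of these settings can be imitated with a small toy model. The sketch below is plain Python, not the GATE API: the parameter names `input_set` and `segment_type`, the function `run_on_segments`, the stand-in `toy_ner` and the sample text are all assumptions made for illustration.

```python
ORG_NAME = "BBC News"


def toy_ner(text):
    """Toy stand-in for ANNIE: mark every occurrence of ORG_NAME
    as an Organization, with offsets relative to `text`."""
    spans = []
    pos = text.find(ORG_NAME)
    while pos != -1:
        spans.append(("Organization", pos, pos + len(ORG_NAME)))
        pos = text.find(ORG_NAME, pos + 1)
    return spans


def run_on_segments(text, annotation_sets, input_set, segment_type, app):
    """Run `app` only on text covered by `segment_type` annotations
    found in the annotation set named `input_set`."""
    found = []
    for ann_type, start, end in annotation_sets.get(input_set, []):
        if ann_type != segment_type:
            continue
        for t, s, e in app(text[start:end]):
            found.append((t, s + start, e + start))  # back to document offsets
    return found


doc_text = "BBC News: markets rise. The BBC News site reported gains."
annotation_sets = {"Original markups": [("title", 0, 23)]}

title_only = run_on_segments(doc_text, annotation_sets,
                             "Original markups", "title", toy_ner)
whole_document = toy_ner(doc_text)
```

Restricting the run to the `title` span yields a single Organization, while running the same toy application over the whole text would also annotate the mention in the body.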

Annotation Result¶

[Figure: Segment Processing PR result]

• Green shading shows the title, which spans the section to be annotated
• The only named entity (NE) found is the Organization “BBC News” in the title
• Tokens in the rest of the document are not annotated