The Segment Processing PR

What is it?

● A PR (processing resource) that enables you to process labelled sections of a document independently, one at a time
● Useful for

    ● very large documents
    ● when you want annotations in different sections to be independent of each other
    ● when you only want to process certain sections within a document

Processing large documents

• If you have a very large document, processing it may be very slow
• One solution is to chop it up into smaller documents and process each one separately, using a datastore to avoid keeping all the documents in memory at once
• But this means you then need to merge all the documents back together afterwards
• The Segment Processing PR avoids both the splitting and the merging: it processes each labelled section separately, within the original document
• This is quicker than processing the whole document in one go, because storing a lot of annotations (even if they are not being accessed) slows down the processing
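
The idea of processing each labelled section separately and ending up with one annotated document can be illustrated with a small Python sketch. This is a toy model, not GATE code: the `Annotation` tuple, the `tokenise` pipeline and the `process_segments` helper are all hypothetical names made up for the illustration. The point it shows is that the pipeline only ever sees one section's text, while the resulting annotations are shifted back to offsets in the full document, so no separate merge step is needed.

```python
from typing import Callable, List, Tuple

# Hypothetical annotation shape for this sketch: (type, start offset, end offset)
Annotation = Tuple[str, int, int]


def tokenise(text: str) -> List[Annotation]:
    """Toy pipeline: one Token annotation per whitespace-separated word."""
    annotations = []
    position = 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        annotations.append(("Token", start, end))
        position = end
    return annotations


def process_segments(
    text: str,
    segments: List[Annotation],
    pipeline: Callable[[str], List[Annotation]],
) -> List[Annotation]:
    """Run the pipeline on each segment alone, then map offsets back."""
    results = []
    for _, seg_start, seg_end in segments:
        # The pipeline sees only this segment's text
        local = pipeline(text[seg_start:seg_end])
        # Shift segment-relative offsets to document offsets
        results.extend((t, s + seg_start, e + seg_start) for t, s, e in local)
    return results


document = "Title here. Body text follows."
sections = [("section", 0, 11), ("section", 12, 30)]
merged = process_segments(document, sections, tokenise)
print([document[s:e] for _, s, e in merged])
```

Because the offsets are already relative to the original document, the annotations from every section land in the same place they would have if the whole document had been processed at once.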

Processing sections independently

• Another problem with large documents can arise when you want to handle each section separately
• You may not want annotations to be co-referenced across sections, for instance if a web page has profiles of different people with similar names
• Using the Segment Processing PR enables you to handle each section separately, without breaking up the document
• It also enables you to use different PRs for each section, using a conditional controller
• For example, some documents may have sections in different languages

Problematic co-references

[Figure: problematic co-references]

Getting rid of the junk

• Another very common problem is that some documents contain lots of “junk” that you don't want to process, e.g. HTML files contain JavaScript, contents lists, footers, etc.
• There are a number of ways to exclude the junk; you may need to experiment to find the best solution for each case

    – Using the Segment Processing PR, which enables you to process only the section(s) you are interested in and ignore the junk
    – Using the AnnotationSetTransfer PR, though this works slightly differently
    – Using the Boilerpipe PR, which works best for ignoring standard kinds of boilerplate

How does the Segment Processing PR work?

• The PR is part of the Alignment Plugin
• You need to create a new application containing the Segment Processing PR
• The PR takes, as one of its parameters, an instance of the application that you want to run on the document (e.g. ANNIE)
• You can add other PRs before or after the Segment Processing PR, if you want them to run over the whole document rather than just the specified section(s)

Running ANNIE on a title segment

[Screenshot: Segment Processing PR set up to run ANNIE on the title segment]

Segment Processing Parameters

• The Segment Processing PR calls the ANNIE application
• ANNIE is set to run only on the text covered by the span of the “title” annotation in the Original markups annotation set
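
The effect of these settings can be sketched in Python as follows. This is a toy model under stated assumptions, not the PR's implementation: `run_on_segments` and its argument names are hypothetical and are not the PR's actual parameter names, `find_capitalised` is a crude stand-in for ANNIE, and the body sentence is made-up sample text. It shows the inner application being run only on spans of a given annotation type taken from a named annotation set.

```python
import re


def find_capitalised(text):
    """Toy stand-in for ANNIE: mark runs of capitalised words as entities."""
    pattern = r"[A-Z]\w*(?: [A-Z]\w*)*"
    return [("Entity", m.start(), m.end()) for m in re.finditer(pattern, text)]


def run_on_segments(text, annotation_sets, set_name, segment_type, inner_app):
    """Run inner_app only on spans of segment_type found in set_name."""
    output = []
    for ann_type, start, end in annotation_sets.get(set_name, []):
        if ann_type != segment_type:
            continue  # everything outside the chosen segments is skipped
        for t, s, e in inner_app(text[start:end]):
            # Shift segment-relative offsets back to document offsets
            output.append((t, s + start, e + start))
    return output


# Made-up sample document: a title line followed by one body sentence
text = "BBC News\nShares rose in London today."
annotation_sets = {
    "Original markups": [("title", 0, 8), ("body", 9, 37)],
}

found = run_on_segments(
    text, annotation_sets, "Original markups", "title", find_capitalised
)
print([(t, text[s:e]) for t, s, e in found])
```

Only the text under the title span is passed to the inner application, so capitalised words in the body, such as "London" here, are never seen by it.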

Annotation Result

[Screenshot: annotation result after running ANNIE on the title segment]

• Green shading shows the title, which spans the section to be annotated
• The only named entity (NE) found is the Organization “BBC News” in the title
• Tokens in the rest of the document are not annotated