August 12, 2009

Balisage 2009 - Streamabilty of XProc Pipelines


Norm Walsh (Mark Logic) gave a talk on streamability of XProc pipelines. XProc lets users define a sequence of atomic operations to apply to a series of documents, using control structures similar to conditionals, iteration, and exception handlers. XProc: An XML Pipeline Language is presently a W3C Candidate Recommendation that is near and dear to Norm since he’s been working on it for awhile. He hinted it should become a Recommendation this fall or certainly by Christmas. As per W3C policy, there must be 2 implementations before a specification is finalized. One of those implementations is by Walsh himself, called XML Calabash which is built on Saxon 9.

Streaming would provide a sliding window in a single pass with output beginning before all input has been seen. Little in said about streaming in the spec, but it is clear it could improve end-to-end performance in certain situations and would be essential for processing documents larger than physical memory. Although there are no explicit requirements for steps to be streaming in the spec, implementations will add value by enabling this.

Norm indicated that certain XProc instructions such a p:count are streamable, wheras others such as p:exec, p:http-request, p:validate-with-relaxng, p:validate-with-schematron, p:validate-with-xml-schema, p:xquery, and p:xslt cannot be streamable. His paper discusses data he collected collected by XML Calabash between 21 Dec 2008 and 11 Jul 2009 representing more than 294,000 pipeline runs. (His implementation has an opt-out, phone home feature so he can collect certain usage data.) In his Submitted Paper, Walsh concluded:
The preliminary analysis performed when this paper was proposed suggested that less than half “real world” pipelines would benefit from a streaming implementation.
The data above seems to indicate that the benefits may be considerably larger than that. Although it is clear that there are pipelines for which streaming wouldn't offer significant advantages, it's equally clear that for essentially any set of pipelines of a given length, there are pipelines which would be almost entirely streamable.
Perhaps the most interesting aspect of this analysis is the fact that as pipeline runs grow longer, they appear to become more and more amenable to streaming. That is to say, it appears that a pipeline that runs to 300 steps is, on average, more likely to benefit from streaming than one that's only 100 steps long. We have not yet had a chance to investigate why this is the case.

No comments: