Showing posts with label streaming XML. Show all posts

May 21, 2010

Balisage 2010 - XML Conference - Schedule Posted!


"Balisage: The Markup Conference" (http://www.balisage.net) is an annual peer-reviewed XML conference: how to create markup; what it means; hierarchies and overlap; modeling; taxonomies; transformation; query, searching, and retrieval; presentation and accessibility; making systems that make markup dance (or dance faster to a different tune in a smaller space).

Come to lovely Montreal, Canada from August 3rd to 6th for four action-packed days of angle brackets! Here’s a baker’s dozen (or so) sampled from the much larger list of Balisage 2010 presentations:

  • gXML, a new approach to cultivating XML trees in Java
  • Java integration of XQuery — an information unit oriented approach
  • Reverse modeling for domain-driven engineering of publishing technology
  • Managing semantics in XML vocabularies
  • XML pipeline processing in the browser
  • Where XForms meets the glass: Bridging between data and interaction design
  • Schema component paths for schema analysis
  • A streaming XSLT processor
  • Multi-structured documents and the emergence of annotations vocabularies
  • Processing arbitrarily large XML using a persistent DOM
  • Automatic upconversion using XProc
  • Scripting documents with XQuery
  • XQuery design patterns
  • Parallel processing and your XML data

Want to travel on the weekend so you can talk about angle brackets for an extra day? Then register for the pre-conference symposium on August 2nd, “XML for the Long Haul: Issues in the Long-term Preservation of XML”.

Schedule At-a-Glance: http://www.balisage.net/2010/At-A-Glance.html

Detailed schedule with descriptions: http://www.balisage.net/2010/Program.html

XML for the Long Haul: http://www.balisage.net/longhaul/index.html

Tower of Modern Babel Contest - Chance to win an Apple 15" (i5) MacBook Pro, Apple MacBook Air or USD $2000: http://www.balisage.net/contest.html

Sponsors include: Mark Logic, oXygen XML Editor, and the FLWOR Foundation. Co-sponsors include: W3C, OASIS, Dublin Core Metadata Initiative, XML Guild, Text Encoding Initiative (TEI), Washington Area SGML/XML Users Group, Philadelphia XML Users Group, and many more. Balisage 2010 is a production of Mulberry Technologies, Inc., a Washington area XML and SGML consultancy.

August 13, 2009

Balisage 2009 - Stream and Scream

This is the promised Part 2 of the blog entry about Mike Kay’s XML/XSLT processing optimization talks from August 12th. His second talk, entitled XSLT Screaming in XSLT 2.1 - Saxon-EE 9.2, was actually impromptu. Kay gave us an unofficial preview of XSLT 2.1, which isn’t yet a public working draft from the W3C. Although only the minor version number is changing, we learned that the changes in 2.1 will be substantial.

Mike defined XSLT streaming as processing source documents without building a tree in memory, making it possible to handle much larger documents and reducing latency. Apparently, implementors haven’t taken advantage of streaming yet. The new XSLT specification will define a subset of the language that is streamable (presumably like the XProc spec does). Boldface is used to highlight new XSLT instructions or attributes below.
  • xsl:stream href="uri"
  • xsl:mode streamable="yes" name="stream1"
  • xsl:template match=... mode="stream1"
The specification will define exactly what is streamable within a template. For example: no sorting, no sideways navigation, only one downward selection, no ancestor::x/child::y, etc.
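
To make the idea of processing without building a tree a little more concrete, here is a minimal sketch in Python rather than XSLT. It is only an analogy, not how Saxon or XSLT 2.1 works: it uses the standard library's iterparse to handle each element as soon as its end tag is seen, then discards it. The element name item, the id attribute, and the sample input are all assumptions made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_item_ids(source):
    """Yield the id of each <item> as soon as its end tag has been parsed."""
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    for event, elem in events:
        if event == "end" and elem.tag == "item":
            yield elem.get("id")
            # Discard children already handled so memory use stays bounded.
            # This assumes every <item> is a direct child of the root.
            root.clear()


# Hypothetical sample input; a real source would be a large file on disk.
SAMPLE = b'<items><item id="a1"/><item id="a2"/><item id="a3"/></items>'

ids = list(stream_item_ids(io.BytesIO(SAMPLE)))
```

Because each item is dropped once it has been handled, the loop can only move forward through the document. That is roughly why restrictions such as no sorting and no sideways navigation come with streaming.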

Other new XSLT instructions include:
  • xsl:iterate - syntax like xsl:for-each, but with the semantics of tail recursion. For example, if you are producing a document listing all bank transactions, it could be generated with a running balance. You can pass parameters to the next iteration using xsl:next-iteration and xsl:with-param.
  • xsl:merge - merge multiple streamed input files; also
  • xsl:merge-source, xsl:merge-input, xsl:merge-key, xsl:merge-action
  • xsl:copy-of and xsl:snapshot - retains ancestors and attributes
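
The running-balance example for xsl:iterate can be sketched outside XSLT. The Python below is an analogy only, and the function names and the amounts (in cents) are made up for illustration. The first function carries the balance from one pass of a loop to the next, which is roughly what xsl:next-iteration with xsl:with-param expresses. The second writes the same computation as tail recursion to show that the two are equivalent.

```python
def running_balance(amounts, opening=0):
    """Loop form: the balance is the state handed to the next iteration."""
    balance = opening
    for amount in amounts:
        balance = balance + amount
        yield amount, balance


def running_balance_recursive(amounts, balance=0, out=()):
    """Tail-recursive form: the new balance is passed as a parameter."""
    if not amounts:
        return list(out)
    head, *tail = amounts
    new_balance = balance + head
    return running_balance_recursive(tail, new_balance, out + ((head, new_balance),))


# Hypothetical transactions in cents: a deposit, a withdrawal, a deposit.
TRANSACTIONS = [10000, -2550, 1025]

rows = list(running_balance(TRANSACTIONS))
rows_recursive = running_balance_recursive(TRANSACTIONS)
```

The loop form needs only the current transaction and the current balance at any moment, so it fits naturally with streamed input.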
Specifically with regard to Saxon-EE 9.2, Kay highlighted the following functions and instructions:
  • saxon:stream() function -- xsl:stream, mainly for documents larger than physical memory; lazy evaluation
  • saxon:iterate -- helpful as an alternative to recursion that some programmers can understand more easily
  • saxon:mode streamable="yes", but presently with only a subset of the XSLT 2.1 use cases implemented
If anyone caught instructions or details I missed, feel free to add comments below.

August 12, 2009

Balisage 2009 - Streamability of XProc Pipelines


Norm Walsh (Mark Logic) gave a talk on streamability of XProc pipelines. XProc lets users define a sequence of atomic operations to apply to a series of documents, using control structures similar to conditionals, iteration, and exception handlers. XProc: An XML Pipeline Language is presently a W3C Candidate Recommendation that is near and dear to Norm since he’s been working on it for a while. He hinted it should become a Recommendation this fall or certainly by Christmas. As per W3C policy, there must be 2 implementations before a specification is finalized. One of those implementations is Walsh’s own XML Calabash, which is built on Saxon 9.

Streaming would provide a sliding window over the input in a single pass, with output beginning before all input has been seen. Little is said about streaming in the spec, but it is clear that it could improve end-to-end performance in certain situations and would be essential for processing documents larger than physical memory. Although the spec has no explicit requirement that steps be streaming, implementations will add value by enabling this.
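
As a rough illustration of output beginning before all input has been seen, here is a small Python sketch built from generators. It is not XProc and says nothing about how XML Calabash is implemented. The step names and the log are assumptions invented for the example. One step passes items through one at a time. The other has to buffer everything because it sorts.

```python
def source(docs, log):
    """Produce documents one at a time, noting when each one is read."""
    for doc in docs:
        log.append(f"read {doc}")
        yield doc


def upper_step(items):
    """Streamable step: each item is emitted as soon as it arrives."""
    for item in items:
        yield item.upper()


def sort_step(items):
    """Non-streamable step: sorting must consume all input before emitting."""
    yield from sorted(items)


def sink(items, log):
    """Write results, noting when each one leaves the pipeline."""
    out = []
    for item in items:
        log.append(f"wrote {item}")
        out.append(item)
    return out


streamed_log = []
streamed = sink(upper_step(source(["b", "a", "c"], streamed_log)), streamed_log)

buffered_log = []
buffered = sink(sort_step(source(["b", "a", "c"], buffered_log)), buffered_log)
```

In the streamed run the log alternates between reads and writes. In the buffered run every read happens before the first write, which is the latency and memory cost that streaming avoids.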

Norm indicated that certain XProc steps such as p:count are streamable, whereas others such as p:exec, p:http-request, p:validate-with-relaxng, p:validate-with-schematron, p:validate-with-xml-schema, p:xquery, and p:xslt cannot be. His paper discusses data collected by XML Calabash between 21 Dec 2008 and 11 Jul 2009, representing more than 294,000 pipeline runs. (His implementation has an opt-out "phone home" feature so he can collect certain usage data.) In his submitted paper, Walsh concluded:
The preliminary analysis performed when this paper was proposed suggested that less than half “real world” pipelines would benefit from a streaming implementation.
The data above seems to indicate that the benefits may be considerably larger than that. Although it is clear that there are pipelines for which streaming wouldn't offer significant advantages, it's equally clear that for essentially any set of pipelines of a given length, there are pipelines which would be almost entirely streamable.
Perhaps the most interesting aspect of this analysis is the fact that as pipeline runs grow longer, they appear to become more and more amenable to streaming. That is to say, it appears that a pipeline that runs to 300 steps is, on average, more likely to benefit from streaming than one that's only 100 steps long. We have not yet had a chance to investigate why this is the case.