September 12, 2010

Abbey Road on the River - All You Need is Time

(And money doesn't hurt either.) The Gaylord Hotel at the National Harbor (south of DC and across the Potomac River from Alexandria, VA) hosted the World's Largest Beatles-Inspired Music Festival from September 2nd to 5th, 2010. The event, called Abbey Road on the River, drew about one thousand Beatles fans of all ages, shapes and sizes -- and nearly as many vendors. Haven't seen any official attendance stats, but I actually expected more people even though it was Labor Day weekend here in the US. I went only for Saturday day and night (that's noon to 2am) but two friends took advantage of a package deal to stay one night at the Gaylord so they could get up in the morning and hear even more Beatles cover bands.

Check out the band lineup - talent from all over the world: Scotland (2!), Puerto Rico, Germany, Canada, Nova Scotia, Italy, Norway, and of course, England, not to mention more than a dozen states and several local bands. There were usually 3 or more bands playing at once in the 5 different areas: lawn, pier, hotel, terrace, and flagpole. (Don't ask.) Did they all sound just like the Beatles? Hardly - I still think 1964 The Tribute is the closest to the real thing, but they weren't at AROTR. But being carbon copies (can we still use that term in the digital age?) wasn't exactly the point. It was to bask in the glory of the music from the greatest band in the world. And fortunately the weather cooperated completely.

Candlestick Park and Itchycoo Park, both from Scotland, were lots of fun, covering Wings and Beatles solo in addition to the main Beatles catalog. Itchycoo did great versions of Live and Let Die and Lola (the Kinks).

The Jukebox, the Puerto Rican band, was one of the youngest and for some odd reason seemed to attract a lot of pretty, young, dancing girls. Must be their moptop haircuts, I guess. What was really special for the rest of us, however, was when a cute, 2-year-old boy holding a 10" wooden replica of a Beatles guitar ($30, made in China) got into the act. Egged on (oops - I mean "encouraged") by his parents, this kid who wasn't completely steady on his feet (too much wine) stood in front of the band and started strumming the fake guitar. And here I thought my blowup guitars were cool! This kid had everyone cracking up! The band even posed for a picture with the little boy afterward. Hoping to get a copy to post here later.

Another interesting band from Richmond, VA played a really great selection of 60's songs -- The English Channel - "all English, all the time". Their You Never Give Me Your Money Abbey Road medley was absolutely wonderful - truly very exciting, especially at the climax with battling guitars and drum solo. Their keyboard player is quite good; he imitated the piccolo trumpet solo on Penny Lane and did some cool Moody Blues textures.

Vendor food was a rip-off: $12 for a sub-standard, medium-sized Philly cheese steak and a half-ice, half-liquid lemonade. Better to walk a few blocks to the many restaurants with incredibly long waits (sit at the bar). Every imaginable Beatles t-shirt was available, as were buttons, pins, photos, books, aforementioned replica guitars (I bought George's Rickenbacker), lunchboxes, purses, and navel lint. Okay, maybe not the lunchboxes.

I was especially delighted to meet Bruce Spizer, the author of 7 detailed books about Beatles records on various labels that I own. Got his autograph and bought 3 little Meet the Beatles booklets for friends. Spizer is working on an 8th book covering the UK records.

Another highlight of my day was a special appearance by Pete Best, the original Beatles drummer. He was interviewed live for a podcast by the Fab Fourum. [Haven't found the link to that particular podcast but here is a 5-minute excerpt on YouTube.] Pete, sometimes called the most handsome Beatle (in his day; remember the leather jacket photos?), shared some previously undisclosed details of his 1960-62 stint with the boys. He talked about his abrupt dismissal from the band - still no clear reason. Pete played during the ill-fated Decca audition on New Year's Day 1962. He told us that the main reason the Beatles were rejected by Decca's Mike Smith (infamous for his prediction that "guitar groups are on the way out") was that Smith was personal friends with Brian Poole, whose band the Tremeloes beat the Beatles for the single available recording contract. [McCartney has supposedly said that it wasn't until later in 1962 when they signed with EMI that they were really ready for such a record deal.] When asked by the audience who his favorite Beatle was, he answered John. Autographed Pete Best photo: $20. Experience: Priceless.

Jimmy Pou, who plays George, did a solo set. Formerly of 1964 The Tribute and Beatlemania fame, Jimmy played recordings of himself as backup to his live guitar playing. Great set but would you believe he skipped Here Comes the Sun? He also played instrumental Beatles from finger-pickin' Steven King in between songs.

But the absolute high point of the day/night was the late night marathon performance by Hal Bruce and the Hard Dazed Knights which began at 10:10pm. As promised, Canadian Hal Bruce (AROTR musical director) and his merry men played all 214 of the Beatles songs in one continuous medley without a break (except to change guitars)! The keyboardist was absent due to a family emergency so Hal occasionally switched to electric piano. The medley was also in chronological order by UK release which made it even more fascinating. Before they completed the first album (14 songs in about 10 minutes), the dancing began. My friends started dancing early on and I eventually joined in, dancing well over an hour straight. The medley continued until 12:35, so it was 214 songs in 2 hours and 25 minutes. OMG, was it tremendous! There was some video recording made; it would be so cool if it was released eventually.

Conclusion: One day of just 14 hours was not enough to hear even a fraction of these bands. Next year I'm going to spring for the hotel so I can at least attend 2 days. Who wants to join me?

See the Abbey Road on the River Wikipedia page.

P.S., I Love You -- 50+ clips from the event by gaylordnational on YouTube. yeah, Yeah, YEAH!

July 31, 2010

The Case of the Disappearing Disk Space

For the past 2 months or so, my old 2004 desktop running Windows XP has mysteriously been losing disk space nearly every day -- between 100 MB and 500 MB a day. (I currently have less than 10 GB free.) I get Windows Updates automatically, have Norton Internet Security 2010, Webroot Antivirus with Spy Sweeper, and Prevx. While I've invested in a new Windows 7 laptop and transferred all my important files and apps over, I still would like to use my desktop, but the dwindling free space has me concerned that someone is hijacking my computer for filesharing or some other nefarious purpose. I never visit filesharing sites myself and am super-careful about email and surfing. There seems to be disk activity even when I'm not doing much myself. And just to make life interesting, Windows Updates tells me there are 16 Security Updates for Microsoft Office 2003 that I cannot install.

Anyone have suggestions about how to discover where the new files are going or what is causing the disappearing space? I've tried a Windows file search for files created in the last month, but haven't found what I wanted. Anyone have experience with System Mechanic or WinCleaner?
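For what it's worth, here is a minimal sketch of one way to hunt for the culprit: walk the drive and list the largest files modified in the last N days. It is only a sketch under some assumptions: Python 3 is installed, the function name recent_files and its parameters are mine, and files the account cannot read are silently skipped.

```python
import os
import time


def recent_files(root, days=30, top=20):
    """Return up to `top` (size_in_bytes, path) pairs for files under `root`
    that were modified within the last `days` days, largest first."""
    cutoff = time.time() - days * 86400  # seconds in a day
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # locked, unreadable, or vanished file: skip it
            if info.st_mtime >= cutoff:
                found.append((info.st_size, path))
    found.sort(reverse=True)  # biggest files first
    return found[:top]
```

Calling recent_files("C:\\", days=7) and printing the pairs should show which folders are growing. Modification time is used rather than creation time so that log files that keep getting appended to show up as well.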

I'm leery about using any of the free/shareware disk space cleaners. I've googled for disappearing disk space and many of the hits are flagged by NIS as suspicious or are just plain old. A year ago I purchased a so-called registry fixer called Registry Easy which resulted in more problems. Recently NIS flagged this as SecurityRisk.ADH and removed it, but who knows what damage it did.

Oh -- one more thing. The past few days, with the latest Firefox, I'm getting "The URL is not valid and cannot be loaded" popup error message at seemingly random page loads (although often on Amazon).

Is it time to ship my desktop to the nearest landfill? Man, I hate this kind of stuff! I'll buy a drink for anyone local who provides a solution or for you remote folks, I'll buy you something from Amazon or iTunes.

July 28, 2010

Obama and McCartney - At the White House

There is so much that was special about seeing Sir Paul McCartney perform at the White House. First, there is the award: The Library of Congress Gershwin Prize for Popular Song. The honor was bestowed upon him directly by a president he so admires. Then there was the all-star list of celebs performing McCartney's songs: Stevie Wonder (We Can Work it Out and later Ebony and Ivory with Paul), Elvis Costello (Penny Lane), Jonas Brothers (Drive My Car), Herbie Hancock with Corinne Bailey Rae (Blackbird), Dave Grohl (a blistering version of Band on the Run), Faith Hill (The Long and Winding Road), Emmylou Harris (an emotional For No One), Lang Lang (classical piano) and Jack White (medley of Mother Nature's Son and That Would Be Something). Jerry Seinfeld was hilarious in his mock mini-roast of Paul about his lyrics and song titles. ("Well, she was just seventeen/You know what I mean....Do we really know what you mean, Paul? We have a pretty good idea but....")

Sir Paul opened with Got to Get You Into My Life and later played Let it Be and Eleanor Rigby. He apologized to the President before playing Michelle with more than a nod to The First Lady -- what's her name? Afterwards he said, "I could be the first guy to be punched out by a president". The show closed with Hey Jude, ending in audience participation with all performers and the Obama family on stage doing the "na, na, na, na-na-na-nas". But what was really special was seeing senators and congresspeople singing along to every song. Rocking the suits, man! Yeah -- and a major from the Marine Bugle Corps playing a note-perfect piccolo trumpet (?) during Penny Lane. And Mary McCartney whistling a cat call at her 68-year-old dad. Gotta love it! And the earlier black and white recital of Paul playing Yesterday on acoustic guitar accompanied by 3 violins and a cello.

Did I mention that President Obama called Sir Paul McCartney the "most successful songwriter in history"? As for American culture, "we stole you, Paul". Music is one of the things that helps us through hard times. He said The Beatles "blew the walls down" and changed everything about music in a few short years.

When Paul accepted the award, he said not only was the award a big honor, but it was even more so being presented to him by "this president". Regarding the Gulf oil spill and other problems, Paul said there were a "billion" people rooting for The Prez. And there are probably that many of us rooting for you, too, Paul!

Great show! Watch it now on

July 22, 2010

Bring Back the Plastic!

Yes, I know it's not PC, but I'm so tired of buying CDs with all-cardboard sleeves. It's too easy to crush, bend, tear, and otherwise destroy. Sorry, I don't care if brown is the new green. Yes, I know you wonderful recording artists are trying so hard to save our environment. (How's your mansion doing, anyway, and your road crew entourage? Use electricity much?) But look at the opportunity you have now! Just sponsor scarfing up all that petroleum from the Gulf and recycle it as plastic jewel cases! If the plastic is cracked or otherwise ruined, I can just replace the jewel case. No more stinking cardboard, please. It's rough on the CDs too, and I prefer my music unscratched.

May 21, 2010

Balisage 2010 - XML Conference - Schedule Posted!

"Balisage: The Markup Conference" is an annual peer-reviewed XML conference: how to create markup; what it means; hierarchies and overlap; modeling; taxonomies; transformation; query, searching, and retrieval; presentation and accessibility; making systems that make markup dance (or dance faster to a different tune in a smaller space).

Come to lovely Montreal, Canada from August 3rd to 6th for four action-packed days of angle brackets! Here’s a baker's dozen (or so) sampling from the much larger list of Balisage 2010 presentations:

  • gXML, a new approach to cultivating XML trees in Java
  • Java integration of XQuery — an information unit oriented approach
  • Reverse modeling for domain-driven engineering of publishing technology
  • Managing semantics in XML vocabularies
  • XML pipeline processing in the browser
  • Where XForms meets the glass: Bridging between data and interaction design
  • Schema component paths for schema analysis
  • A streaming XSLT processor
  • Multi-structured documents and the emergence of annotations vocabularies
  • Processing arbitrarily large XML using a persistent DOM
  • Automatic upconversion using XProc
  • Scripting documents with XQuery
  • XQuery design patterns
  • Parallel processing and your XML data

Want to travel on the weekend so you can talk about angle brackets for an extra day? Then register for the pre-conference symposium on August 2nd, “XML for the Long Haul: Issues in the Long-term Preservation of XML”.

Schedule At-a-Glance:

Detailed schedule with descriptions:

XML for the Long Haul:

Tower of Modern Babel Contest - Chance to win an Apple 15" (i5) MacBook Pro, Apple MacBook Air or USD $2000:

Sponsors include: Mark Logic, oXygen XML Editor, and the FLWOR Foundation. Co-sponsors include: W3C, OASIS, Dublin Core Metadata Initiative, XML Guild, TEI Encoding Initiative, Washington Area SGML/XML Users Group, Philadelphia XML Users Group, and many more. Balisage 2010 is a production of Mulberry Technologies, Inc., a Washington area XML and SGML consultancy.

August 19, 2009

Balisage 2009 - Best Practices - or Not?

{If it could talk, this blog entry would be begging for your comments via the "comments" link below the article.}

On August 13th, the Balisage Conference 2009 hosted a spirited panel discussion featuring David Chesnutt, Chet Ensign, Betty Harvey (Electronic Commerce Connection), Laura Kelly (National Library of Medicine), and Mary McRae (OASIS). Harvey shared her "Top 10 Mistakes in DTDs" from 1998, most of which are still applicable today. We learned a bit about the standardization process at OASIS from Mary McRae, who differentiated between the codified rules which are enforced, informative guidelines which are essentially only recommendations, and very informal oral exchange of practices (which she emphasized should be recorded in written form).

McRae mentioned the OASIS Technical Committee Process which, among many other things, defines the OASIS Naming Guidelines in two parts: Part 1: Filenames, URIs, Namespaces and Part 2: Metadata and Versioning. For example, OASIS requires a RDDL-like file at the URI given for a namespace. This is in accord with the W3C’s Architecture of the World Wide Web “good practice” for namespace documents:
The owner of an XML namespace name SHOULD make available material intended for people to read and material optimized for software agents in order to meet the needs of those who will use the namespace vocabulary.
McRae mentioned Adoption Technical Committees, I believe in reference to DITA Help, DITA Localization, UDDI, SAML, and OpenDocument Format (ODF). She also indicated that all OASIS specifications are either published in DocBook, Word, OpenOffice, DITA or XHTML.

An issue that generated divergent opinions was the question of which version of an XML Schema is the normative one -- the version stored in a separate file that can be validated and directly used, or the version that has been pasted into a Word document or other word processing format? Of course, the two should be identical and ideally pulled from the same source (i.e., the schema file could be imported into the document), but this isn’t always the case. The separate file has the advantage of lending itself to code review, whereas the version embedded in a normative specification gives the impression that it too is normative.

Does imposing best practices on developers restrict creativity and productive competition? Are Naming and Design Rules (NDRs) inherently evil? Again, this question solicited differing opinions, although it seems that the majority who offered comments were against such impositions.

Laura Kelly's experience at the US National Library of Medicine taught her that the scope of any project is smaller than we like to admit, so perhaps draconian rules are counter-productive. When it comes to XML encoding, focus on giving the users what they should be asking for -- how to mark up data so it will work best in their systems. Developers really only want to know "have I tagged my data correctly?" and "what does the data look like?"

David R. Chesnutt offered some lessons learned from his SGML work with the Model Editions Partnership (MEP) circa 1997. This project used a "subset of the SGML markup system developed by the Text Encoding Initiative (TEI)".

{Post comments below.}

August 16, 2009

Airport Blues - You Can't Miss It

After Balisage 2009, I took a taxi to the Montreal airport because I didn't leave in time for the shuttle. (True Confession: I didn't want to schlep 3 bags to another hotel to catch the shuttle.) Arrived 2.5 hours early, so I sat in the ticketing area and had a leisurely "bag" lunch courtesy of the conference caterers (if you can call Mediterranean pasta, Caesar salad and cheesecake cup a "bag" lunch -- thanks, Linda and Chris!). When I went to check in, I found out my plane was delayed an hour and would not meet up with my Philadelphia connection, so I was looking at arriving 2 hours later -- at 10 pm and it was only 2 pm! Not at all what I wanted, so they booked me on a flight to Toronto which was leaving in less than an hour. The catch was I'd have to reclaim my bags in Toronto and go through US Customs in the 1.25 hours between the 2 flights. Although risky (never been to the Toronto airport), it would get me home around 7 pm which sounded great.

Security check in Montreal was going really slowly. Didn't think I'd make the 3 pm flight. They wanted all my neatly packed electronics in see-thru pouches removed. What a pain! Trip to Toronto wasn't bad with 3 seats per aisle. Was able to read "Duel" by Richard Matheson, a gripping short story about road rage (and more). And then the fun began.

Here I am in (one of) Toronto's airports looking for my checked baggage. Getting off the plane, they told me "you can't miss it". Well, I beg to differ! Perhaps that's true when you pass it every day but not when you walk into a huge open area with signs everywhere and people hustling in every direction! I went completely through the area and started down another hallway and then stopped to ask a porter who didn't speak much English (thankfully, enough though). So I re-traced my steps and found the small sign that said "Baggage for US Customs" or something like that. Fortunately found my bag without much difficulty. Then I came to the long line for customs 20 seconds too late to miss a gaggle of at least 30 Japanese tourists who all proceeded to crowd in front of me. (There was a second security check somewhere in Toronto.)

Then I'm told my gate for the connection is on the other side of the airport and I need to catch a shuttle; this is just 15 minutes before boarding time. I'm given directions to get to the shuttle (down these stairs, turn left, go down another flight of stairs, then outside -- you can't miss it ;-). I manage to find it just a minute before it was ready to leave. The large shuttle is moving at what seems like a fast clip, making wide swings that almost send me flying with my laptop and medical device. It dawns on me that I can't recall the exact gate number (which wasn't written on my ticket) and I'm not sure if the shuttle makes multiple stops but there was no way to ask the driver who was behind glass. When we arrive at the stop, everyone got off so I figured it was one stop fits all, so I got off. I found a departure/arrival screen and found my flight. I have just enough time to buy bottled water since by now I am dehydrated from rushing around. I arrive at my gate in time to find out my seat wasn't exactly assigned so I had to take a window seat next to a woman who was fatter than me and seemed annoyed that she had to get up so I could scrunch myself into the small prop plane space. She refused to let me share the arm rest and had this blanket (!) on her lap which kept flapping onto my leg. Very uncomfortable!

I had planned to read the original version of "Nightmare at 20,000 Feet" also by Matheson (later re-worked for the original Twilight Zone with a young and handsome William Shatner and then the TZ movie). Thought it would be cool to read it on a plane. Turns out my window seat was right near the propeller just like the main character in the story. As we prepared for take off, the stewardess announced we'd be cruising at 21,000 feet. How cool! (Un)fortunately it was not dark and there was no gremlin or banshee on the wing of the plane. I checked. More than once.

So we arrived safely in Baltimore but I was one hour ahead of my airport shuttle reservation and my cell phone (useless in Canada) was completely dead. I was also far from where the shuttle would be. A helpful information desk lady pointed me in the right direction and gave me the airport shuttle phone number (which of course had been conveniently stored in my dead-as-a-doornail cell phone). When I reached the shuttle area, I figured what the heck -- I have no change anyway, so let's see if I can convince the next shuttle to take me. In this I was successful. And I made it home an hour ahead of my original schedule! Whoopie! Party at my house! You can't miss it!

August 14, 2009

Balisage 2009 - XForms and Genericode at NARA

On Thursday, August 13th, Quyen L. Nguyen (National Archives and Records Administration) and Betty Harvey (Electronic Commerce Connection) presented their paper entitled Agile Business Objects Management Application for Electronic Records Archive Transfer Process. [Submitted Paper]. The U.S. National Archives and Records Administration (NARA) processes a very high volume of documents from most government agencies in its Electronic Records Archive (ERA), designed for long term preservation and access to digital objects. Quyen Nguyen explained that ERA has many challenges including dealing with a number of different media types (data types) and an ever-increasing volume of submissions. Archival Business Object requirements include the standard CRUD capabilities plus Versioning and Searching. NARA made the decision years ago to store documents in XML and transform to PDF. XML is used at business, communication and storage levels.

Management of Authority Lists (aka controlled vocabularies, aka code lists) is a big issue for NARA and other agencies. List changes should not require coding changes or re-compiling. NARA submission forms have fields that are conditionally optional or required, with inter-dependency between fields. Depending on state, some fields need to be insensitive to input and other fields need to be displayed or hidden.

The traditional HTML with JSP approach is too subject to code list changes. Schema changes cause a recompile of JSP code. It is harder to programmatically determine when to validate data input. Use of xsd:enumeration and annotations is inadequate for representing complex, multi-column code lists.

In contrast, according to Betty Harvey, XForms offer NARA many benefits such as modularity, reuse, separate evolvability, consistency of error messages, data integrity, performance, easier data exchange as XML, etc. The ERA team has implemented a comprehensive XForms solution by leveraging genericode from the OASIS Code List Representation Technical Committee. Their solution, which includes an Orbeon XForms server, provides an intuitive archive submission authoring system. The verbose genericode files are processed into smaller “fat free” versions via a custom XSLT. User form interactions can control dynamic form changes, such as which code list to display, fields that appear or are hidden, and so on. XForms Bind is pretty powerful and permits variables in XPath expressions.
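To give a feel for what slimming down a genericode file might involve, here is a rough sketch. It is not the ERA team's transform (theirs is a custom XSLT that I haven't seen); I've swapped in Python, and the sample code list, the column ids "code" and "name", and the output element names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical genericode code list (values invented for this sketch).
SAMPLE = """<gc:CodeList xmlns:gc="http://docs.oasis-open.org/codelist/ns/genericode/1.0/">
  <Identification><ShortName>MediaTypes</ShortName><Version>1.0</Version></Identification>
  <ColumnSet>
    <Column Id="code" Use="required"><ShortName>Code</ShortName></Column>
    <Column Id="name" Use="required"><ShortName>Name</ShortName></Column>
  </ColumnSet>
  <SimpleCodeList>
    <Row>
      <Value ColumnRef="code"><SimpleValue>TXT</SimpleValue></Value>
      <Value ColumnRef="name"><SimpleValue>Textual record</SimpleValue></Value>
    </Row>
    <Row>
      <Value ColumnRef="code"><SimpleValue>IMG</SimpleValue></Value>
      <Value ColumnRef="name"><SimpleValue>Still image</SimpleValue></Value>
    </Row>
  </SimpleCodeList>
</gc:CodeList>"""


def slim_code_list(gc_xml, key_col="code", label_col="name"):
    """Reduce a genericode document to one <code value="...">label</code>
    element per row, dropping identification and column metadata."""
    root = ET.fromstring(gc_xml)
    slim = ET.Element("codes")  # hypothetical "fat free" output format
    for row in root.iter("Row"):
        # Map each column id to the text of its SimpleValue for this row.
        values = {
            v.get("ColumnRef"): v.findtext("SimpleValue", default="")
            for v in row.findall("Value")
        }
        item = ET.SubElement(slim, "code", value=values.get(key_col, ""))
        item.text = values.get(label_col, "")
    return ET.tostring(slim, encoding="unicode")
```

The appeal of keeping the lists in a file like this is the one made in the talk: a form can load the small list at run time, so changing a code list means editing data, not recompiling anything.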

Use of genericode in particular and, to a lesser extent, XForms are of interest to me personally so I know I’ll be reading their paper. Betty Harvey has also made her XForm Controls examples available. She also prepared her slides in XML using XSLT to conform to the W3C Slidy presentation library.

August 13, 2009

Balisage 2009 - Stream and Scream

This is the promised Part 2 of the blog entry about Mike Kay’s XML/XSLT processing optimization talks from August 12th. His second talk, entitled XSLT Screaming in XSLT 2.1 - Saxon-EE 9.2, was actually an impromptu. Kay gave us an unofficial preview of XSLT 2.1, which isn’t yet a public working draft from the W3C. Despite only the change in minor version number, we learned that the changes in 2.1 will be substantial.

Mike defined XSLT streaming as processing source documents without building a tree in memory, making it possible to handle much larger documents and reducing latency. Apparently, implementors haven’t taken advantage of streaming yet. The new XSLT specification will define a subset of the language that is streamable (presumably like the XProc spec does). Boldface is used to highlight new XSLT instructions or attributes below.
  • xsl:stream href="uri"
  • xsl:mode streamable="yes" name="stream1"
  • xsl:template match=... mode="stream1"
Exactly what is streamable within a template will be defined. For example, no sorting, no sideways navigation, only one downward selection, no ancestor::x/child::y, etc.
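To make the "no tree in memory" idea concrete, here is a small sketch in Python rather than XSLT (the 2.1 syntax isn't public yet, so I won't guess at it). It uses the standard library's incremental parser to count elements in a single pass; the function name and the "entry" tag are my own inventions.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count <tag> elements in one forward pass over `source`
    (a filename or binary file object) without keeping their content."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop this element's text, attributes, and children once it is
        # complete. Empty shells stay attached to their parents, so memory
        # use is small but not strictly constant.
        elem.clear()
    return count


# Usage with an in-memory document standing in for a very large file.
doc = io.BytesIO(b"<log><entry/><entry/><other/><entry/></log>")
print(count_elements(doc, "entry"))
```

Notice what the loop cannot do: once an element has been cleared there is no going back to it, which is roughly why sorting and sideways or backward navigation fall outside a streamable subset.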

Other new XSLT instructions include:
  • xsl:iterate - syntax like xsl:for-each, but with semantics of tail recursion. For example, if you are producing a document with all bank transactions, it could be generated with a running total of balance. You can pass parameters to next iteration using xsl:next-iteration and xsl:with-param.
  • xsl:merge - merge multiple streamed input files; also
  • xsl:merge-source, xsl:merge-input, xsl:merge-key, xsl:merge-action
  • xsl:copy-of and xsl:snapshot - retains ancestors and attributes
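As I understood the xsl:iterate bank example, each step hands a value (the balance so far) on to the next step instead of recursing. Here is the same idea sketched in Python, not XSLT; the function name and the (date, amount) tuple layout are assumptions of mine.

```python
def with_running_balance(transactions, opening=0):
    """Yield (date, amount, balance) for each (date, amount) pair,
    carrying the balance forward from one iteration to the next."""
    balance = opening
    for date, amount in transactions:
        balance += amount  # the value passed on to the "next iteration"
        yield date, amount, balance


# Usage: amounts are whole cents to sidestep floating-point rounding.
statement = [("2009-08-10", 10000), ("2009-08-11", -2550), ("2009-08-12", 400)]
for row in with_running_balance(statement, opening=5000):
    print(row)
```

Because only the current balance is carried along, the input never has to be held in memory all at once, which is presumably why this instruction pairs so well with streaming.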
Specifically with regard to SAXON-EE 9.2, Kay highlighted the following functions and instructions:
  • saxon:stream() function -- xsl:stream, mainly for documents larger than physical memory; lazy evaluation
  • saxon:iterate -- helpful as an alternative to recursion that some programmers can understand more easily
  • saxon:mode streamable="yes", but presently with only a subset of the XSLT 2.1 use cases implemented
If anyone caught instructions or details I missed, feel free to add comments below.

Balisage 2009 - Pull, Push, Stream and Scream

On August 12, Mike Kay (Saxonica) presented two back-to-back topics related to XML/XSLT pipeline processing optimization. The first talk, You pull, I’ll push: On the polarity of pipelines, [Submitted Paper] compared and contrasted the control flow in the pipeline, which can run either with the data flow ("push") or against it ("pull"). That is, in “push”, control flow and data flow in the same direction, whereas in “pull”, control flow and data flow in opposite directions. In the main loop, data is pulled on input and then pushed. Kay discussed other combinations, such as the fully streamable case of pull, pull, control, push, push pipelines. In branch and merge pipelines, pull is needed for multiple inputs, whereas push is needed for multiple outputs. Schema validation in Saxon is written in push style because it forks. This led Kay to say there is no clear winner between push and pull; each is appropriate in different situations.
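Here is how I picture the two polarities, as a toy sketch in Python. None of this comes from Saxon or from Kay's paper; the stage names are invented. In the pull version the consumer at the end drives everything by asking for the next item; in the push version the producer at the front drives by calling send() on whatever is downstream.

```python
# Pull: each stage is a generator; control starts at the consumer and
# runs against the direction the data moves.
def pull_upper(source):
    for item in source:
        yield item.upper()


def run_pull(items):
    return list(pull_upper(iter(items)))  # list() does the pulling


# Push: each stage exposes send(); control starts at the producer and
# runs in the same direction as the data.
class Collector:
    def __init__(self):
        self.items = []

    def send(self, item):
        self.items.append(item)


class PushUpper:
    def __init__(self, *downstream):
        self.downstream = downstream  # several targets: forking is easy

    def send(self, item):
        for target in self.downstream:
            target.send(item.upper())


def run_push(items):
    first, second = Collector(), Collector()
    stage = PushUpper(first, second)  # one input, two outputs
    for item in items:  # the producer does the pushing
        stage.send(item)
    return first.items, second.items
```

The fork in run_push is the easy direction for push: one stage simply calls two downstream targets. Doing the same with generators would need buffering, which fits the remark about push being needed for multiple outputs.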

Mike Kay’s paper discusses various combinations and approaches such as the other “JSP” (Jackson Structured Programming), the concept of inversion, and coroutines, which involve multiple stacks in a single thread; 2 programs are written as if they each own the control loop. Kay relates these concepts to XSLT processors and concludes:
As the usage of XML increases and more and more users find themselves applying languages like XSLT and XQuery to multi-gigabyte datasets, a technology that can remove the problems caused by pipeline polarity clashes has great potential.

[Will add his second talk here when I'm not so sleepy.]

August 12, 2009

Balisage 2009 - GODDAGS and EARMARKS, Just Ducky And We Love It

Fabio Vitali both figuratively and literally gave an animated talk addressing the problem of overlapping markup and the problem of modeling documents as trees. The title of his presentation, Towards markup support for full GODDAGs and beyond: the EARMARK approach, does little to convey how entertaining he made the subject. Let's just say he didn't duck and run for cover.

The fact that Vitali used a song by my all-time favorite band, the Fab Four, certainly got my attention. And I love it! His case study was a karaoke application which he postulated poses interesting markup challenges. First, the selected song requires pronoun changes based on the gender of the singer. Lines are displayed twice for a one-line lookahead. Chord changes do not exactly match line changes. And the final challenge is embedded fun facts that pop up at appropriate points in the song.

Vitali's paper discusses his approach to these challenges -- EARMARK (Extreme Annotational RDF Markup), an OWL ontology with RDF triples. See his Submitted Paper and also his EARMARK site. See also the earlier work by C. M. Sperberg-McQueen and Claus Huitfeldt, GODDAG: A Data Structure for Overlapping Hierarchies.
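To see why the karaoke case doesn't fit a single tree, here is a small sketch of the overlap problem itself. This is not EARMARK (which is an OWL ontology), just plain character ranges over a text; the annotation names and offsets are invented. Two ranges "cross" when they share some characters but neither contains the other, and a crossing pair cannot both be elements in one well-formed XML tree.

```python
from itertools import combinations


def crosses(a, b):
    """True if half-open ranges a and b partially overlap: they share
    characters, but neither one fully contains the other."""
    (a0, a1), (b0, b1) = a, b
    return a0 < b0 < a1 < b1 or b0 < a0 < b1 < a1


def crossing_pairs(annotations):
    """Return the name pairs whose ranges cannot nest in a single tree."""
    return [
        (n1, n2)
        for (n1, r1), (n2, r2) in combinations(annotations.items(), 2)
        if crosses(r1, r2)
    ]


# Hypothetical offsets: two lyric lines, one word, and a chord that starts
# in the middle of line 1 and ends in the middle of line 2.
annotations = {
    "line1": (0, 22),
    "line2": (23, 46),
    "chordG": (14, 30),
    "word": (0, 5),
}
print(crossing_pairs(annotations))
```

The word nests happily inside line1, but the chord crosses both lines, which is the "chord changes do not exactly match line changes" situation from the talk.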

Balisage 2009 - Streamability of XProc Pipelines

Norm Walsh (Mark Logic) gave a talk on streamability of XProc pipelines. XProc lets users define a sequence of atomic operations to apply to a series of documents, using control structures similar to conditionals, iteration, and exception handlers. XProc: An XML Pipeline Language is presently a W3C Candidate Recommendation that is near and dear to Norm since he’s been working on it for a while. He hinted it should become a Recommendation this fall or certainly by Christmas. As per W3C policy, there must be 2 implementations before a specification is finalized. One of those implementations is by Walsh himself, called XML Calabash, which is built on Saxon 9.

Streaming would provide a sliding window in a single pass with output beginning before all input has been seen. Little is said about streaming in the spec, but it is clear it could improve end-to-end performance in certain situations and would be essential for processing documents larger than physical memory. Although there are no explicit requirements for steps to be streaming in the spec, implementations will add value by enabling this.

Norm indicated that certain XProc instructions such as p:count are streamable, whereas others such as p:exec, p:http-request, p:validate-with-relaxng, p:validate-with-schematron, p:validate-with-xml-schema, p:xquery, and p:xslt cannot be streamable. His paper discusses data collected by XML Calabash between 21 Dec 2008 and 11 Jul 2009 representing more than 294,000 pipeline runs. (His implementation has an opt-out, phone home feature so he can collect certain usage data.) In his Submitted Paper, Walsh concluded:
The preliminary analysis performed when this paper was proposed suggested that less than half “real world” pipelines would benefit from a streaming implementation.
The data above seems to indicate that the benefits may be considerably larger than that. Although it is clear that there are pipelines for which streaming wouldn't offer significant advantages, it's equally clear that for essentially any set of pipelines of a given length, there are pipelines which would be almost entirely streamable.
Perhaps the most interesting aspect of this analysis is the fact that as pipeline runs grow longer, they appear to become more and more amenable to streaming. That is to say, it appears that a pipeline that runs to 300 steps is, on average, more likely to benefit from streaming than one that's only 100 steps long. We have not yet had a chance to investigate why this is the case.