Sunday, August 28, 2011

Printing XML: Why CSS Is Better than XSL

Longtime readers will remember the battles between XSL and CSS that took place in these columns in 1999 and that were memorialized in “XSL and CSS: One Year Later.” Since then, the two languages have coexisted in relative peace: CSS is now used to style most web sites, XSLT (the transformation part of XSL) is used by many on the server side, and XSL-FO (the formatting part of XSL) has found a niche in the printing industry.

A recent entry in the blog of a web luminary may signal the start of a second round of hostilities. Norman Walsh, a member of the W3C’s Technical Architecture Group and co-author of the W3C’s Web Architecture document (WebArch), recently blogged:

    … web browsers suck at printing. … And CSS is never going to fix it. Did you hear me? CSS is never going to fix it.

It’s unclear whether this statement is a prediction or a threat. Or just blogging on a bad day. Anyway, the pronouncement of CSS’s printing ineptness gives us a splendid opportunity to explain why CSS is a better language than XSL for most printing needs. As we have just used CSS to style a 400-page book which will be published later this year (Cascading Style Sheets: Designing for the Web by Håkon Lie and Bert Bos, 3rd ed., forthcoming from Addison-Wesley), this is not purely an academic exercise in stylesheet linguistics. So, would-be authors should continue reading.

The Problem

Both camps agree that a printed document is, in many ways, more difficult to format than an on-screen presentation. A printed document must be split into numbered pages, with added headers and footers. Page margins must be specified, and they may be different on left and right pages. References that appear as hyperlinks on-screen often include page numbers on paper.

The disagreement starts with how best to express all this. Walsh’s solution is to write a 1000-line XSL transformation that generates XSL-FO, which is subsequently turned into PDF.
We will argue that it’s much easier for most authors to express styling in CSS; in the case of the WebArch document, one can reuse the existing CSS stylesheets (200 lines or so) and add some print-specific lines. And, although browsers tend to focus on dynamic screens rather than on printing, products like Prince happily combine CSS with XML and produce beautiful PDF documents.

(Some disclosure at this point is appropriate. We, the authors, have been actively involved in shaping CSS and are now working hard to build software, namely Opera and Prince, that supports CSS.)

The Flavors

Before going into the print-specific features, let’s compare the basic flavors of XSL and CSS. Consider this fragment from Walsh’s XSL transform:

The purpose of this code is to select certain elements (specified in the match attribute) and to set certain formatting properties on these elements (e.g., font-size). Using CSS, this can be written:

    div.head p.copyright {
      margin-top: 8pt;
      margin-bottom: 8pt;
      font-size: 75%
    }

Compare the two fragments. Which do you find more readable? Which language would be easier to learn?

Explaining this XSL snippet to a non-programmer would also be awkward:

    always

The CSS equivalent, however, is more intuitive:

    ol li:first-of-type { page-break-after: avoid }

Printing with CSS

As we all know, simple tools cannot always perform advanced tasks. Even if CSS were able to simplify some fragments, it wouldn’t do much good if the language had inherent limitations that made it impossible to describe advanced features. The question becomes, then, whether there are any inherent limitations in CSS that could make it unfit for producing printed documents.

The answer is no. CSS2, which became a W3C Recommendation in 1998, introduced the concept of pages in CSS. By using it, one can set page breaks (even Internet Explorer supports this) and page margins.
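As a minimal sketch of those CSS2 features, the rules below set page margins that differ on left and right pages and force a page break before each top-level heading. The margin values and the choice of h1 as the element that starts a new page are illustrative assumptions, not taken from the WebArch stylesheet.

```css
/* Illustrative values only: the margins and the h1 selector are
   assumptions, not part of the WebArch stylesheet. */

/* Default margin for every page */
@page {
  margin: 25mm;
}

/* Left-hand pages: wider margin on the inner (binding) edge */
@page :left {
  margin-left: 20mm;
  margin-right: 30mm;
}

/* Right-hand pages: mirror image of the left-hand pages */
@page :right {
  margin-left: 30mm;
  margin-right: 20mm;
}

/* Start each top-level section on a new page */
h1 {
  page-break-before: always;
}
```

The :left and :right page selectors cover the case, mentioned earlier, where margins differ between facing pages.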
More recently, a W3C Candidate Recommendation (called CSS3 Paged Media Module) added functionality to describe headers, footers, and more. Let’s start with a simple example:

    @page { size: A4 portrait; }

This simple statement tells the formatter that the resulting PDF document should be of size "A4" (which is common outside North America), and that the orientation should be portrait. To change the size of the generated PDF document, one simply changes "A4" into another size.

Peeking inside the XSL sheet again, we find two 40-line switch statements to enable similar functionality. One of the statements is reprinted in full below for entertainment purposes:

    210mm 11in 8.5in 2378mm 1682mm 1189mm 841mm 594mm 420mm 297mm
    210mm 148mm 105mm 74mm 52mm 37mm 1414mm 1000mm 707mm 500mm
    353mm 250mm 176mm 125mm 88mm 62mm 44mm 1297mm 917mm 648mm
    458mm 324mm 229mm 162mm 114mm 81mm 57mm 40mm 11in

As the alert reader will already have inferred, the statement lists the heights of many different paper sizes. As such, it is interesting reading. However, we do not understand why this list belongs in a stylesheet. CSS provides a simple and elegant alternative by naming the different sizes in the specification rather than in each stylesheet.

Another example that shows the elegant simplicity of CSS is that of page numbering. Page numbers are commonly printed on the "outside" of a page so that they are easily visible when flipping through a book. So, on a right page the page number should be on the right side, and on a left page it should be on the left side. On the first page, there should be no page number.
In CSS, you can express this with:

    @page :left {
      @bottom-left { content: counter(page); }
    }
    @page :right {
      @bottom-right { content: counter(page); }
    }
    @page :first {
      @bottom-right { content: normal; }
    }

The statements, while not pure English prose, are easily understandable for anyone who has read this far, and it would be a simple exercise for the reader to move the page number from the bottom of each page to the top.

Because of size constraints, we’re not going to show you how page numbers are expressed in XSL. We challenge you to find it and then try explaining it to the first person you meet.

Reuse and Cascading

One reason why the web took off in the early 1990s was the manner in which HTML is authored. By looking at the source code of other documents, web authors could easily get started in web publishing. In a sense, HTML is the most successful open source movement.

CSS also encourages reuse of code and has formalized how it works through the cascading rules. For authors, this means they can take an existing stylesheet and add their own rules to it instead of writing a new one themselves.

One case in point is how to express page breaks for printed documents. Typically, you want to avoid page breaks after headings, and this can be expressed by adding a simple rule:

    h1, h2, h3, h4, h5, h6 {
      page-break-after: avoid;
    }

Here, the first line lists the elements to which the second line applies. As a result, the formatter will avoid page breaks after these elements.

XSL has no concept of cascading and cannot easily express the above example. Instead of grouping elements, one has to add a rule to each element’s template. Here is what the template for h1 elements looks like:

(XSL has chosen another name for the property, i.e., keep-with-next instead of page-break-after.)

Likewise, it is easy in CSS to remove text decorations (e.g., underlining) on all elements:

    * { text-decoration: none }

Table of Contents

Many documents start with a table of contents (TOC).
On-screen, the TOC is clickable and takes the user to the requested section. Paper, being more static in nature, needs references that can be followed manually. A TOC on paper, therefore, lists the number of the page where the section can be found. Expressing this in CSS results in a slightly more complex rule than the examples you have seen so far. Consider this:

    ul.toc a:after {
      content: target-counter(attr(href), page);
    }

In English, the rule would read as follows: inside ul elements of class toc, all a elements should be trailed (:after) by some generated content. The generated content is the page number where the target of the link is found. The link is expressed in the href attribute of the a element.

One reason for the added complexity is that CSS, contrary to a common misconception, has been designed to work with generic XML as well as HTML. In HTML, links are expressed in href attributes on a elements. In generic XML, however, links can be anywhere, and their position must be specified.

Another common feature of TOCs on paper is a dotted line between section titles and the respective page numbers. This is called a leader in typesetting terminology and can be expressed in CSS as follows:

    ul.toc a:after {
      content: leader('.') target-counter(attr(href), page);
    }

Compared with this three-line CSS solution, expressing TOCs in the WebArch XSL stylesheet takes more than 50 lines. In fairness, the XSL code also expresses other properties for TOCs (for example, that page breaks should be avoided). The CSS syntax in the above examples is still at the draft stage.

By combining the print-specific CSS stylesheet described above with the WebArch document, a nicely formatted PDF document can be created.

Multi-Column Layouts

On paper, content is often laid out in multiple columns. Stylesheets must be able to express this.
Using CSS, one can easily create multi-column layouts:

    body {
      column-count: 2;
      column-gap: 8mm;
    }

The content of the body element will now be poured into two columns, between which there is an 8mm gap. Multi-column layouts are also available in XSL, but the obligatory verbosity/complexity warnings apply.

Conclusions

So can CSS do everything better than XSL? Not quite. XSL is a Turing-complete language which, in principle, can be used for all programming tasks and is particularly suited for document transformations. Styling documents is only one of many things XSL can do.

CSS, on the other hand, has been developed with only one task in mind: styling documents. On the web, CSS is the style sheet language of choice. However, the usefulness of CSS is not limited to screens. If you want to transfer web content, be it XML or HTML, onto paper, there are good reasons to use CSS. CSS is radically simpler than XSL, and it is suitable both on-screen and on paper. This means that you probably don't have to write a stylesheet at all but can reuse an existing one. Finally, by using CSS you can preserve the semantics of your content all the way to
