Introduction
The alphabet soup of information technology standards can be hard to swallow. Why XML? Don't we have enough with HTML, SGML, and all the other MLs?
Marking The Schools
A decade ago I had the job of helping design a census of Australian schools. This was a serious business, as Federal Government funding to the schools (more than ten billion dollars a year) was based on what the census found.
A complex computer program was required to divide the schools into dozens of categories, so they could be sent the right census booklet to complete. Current information about each school had to be printed and collated with the booklets to be sent.
Part way into the design it occurred to me it would be easier to print a custom designed booklet for each school. The custom booklet could include the information currently on file and only ask for the information needed for that school. As only a few questions were relevant to the average school, the booklets could be much smaller and simpler.
The largest laser printer then available was used for the printing task, taking output from a database on a magnetic tape. But how to tell the laser printer what to print where? An uncomfortable mix of proprietary commands in the laser printer's language and mainframe line printer commands was hand coded into a custom program. This worked, but proved difficult to maintain.
GML, SGML, HTML and XML
What was needed for the census was a general mark-up language. IBM created such a language, the Generalised Markup Language (GML). This was the genesis of the Standard Generalised Mark-up Language (SGML), which became an ISO standard in 1986. SGML was part of the ill-fated Government Open Systems Interconnection Profile (GOSIP), which was supposed to revolutionise Government communications, but collapsed under its own complexity in the mid 1990s.
SGML has been used by specialist publishers and database developers in niche applications. However, it was too complex for the average person and there was no need to communicate electronic documents widely. The Internet and the web changed all that with Hypertext Mark-up Language (HTML) in 1989. HTML demonstrated that a simple language could be very powerful, when combined with the Internet. But HTML is an inelegant combination of mark-up features, mixing structural elements, such as <H1> (for a level 1 heading), with presentational ones, such as <i> (for italics). Also HTML's structure is fixed, and web browser developers have had to add their own non-standard extensions to provide missing features.
The Extensible Markup Language (XML), from the World Wide Web Consortium, attempts to overcome HTML's limitations, but be simpler than SGML. This it almost accomplishes. In practice a useful document is not written in XML alone. It needs a Document Type Definition (DTD), which defines the syntax that can be used to write a particular class of documents, and software that can interpret that syntax to carry out some function, such as displaying a web page.
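As a rough illustration of that division of labour, the sketch below pairs a tiny, made-up DTD for a school census record with a few lines of Python that interpret it. The element names (school, name, enrolment) and the function booklet_line are invented for this example and do not come from any real census system. Python's standard parser reads the DTD but does not validate the document against it, so this shows the idea rather than a full validating setup.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the DTD in the DOCTYPE declares which elements
# a "school" record may contain. The element names are invented here.
RECORD = """<!DOCTYPE school [
  <!ELEMENT school (name, enrolment)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT enrolment (#PCDATA)>
]>
<school>
  <name>Example Primary School</name>
  <enrolment>240</enrolment>
</school>"""


def booklet_line(xml_text: str) -> str:
    """Interpret a school record and return one line of booklet text."""
    school = ET.fromstring(xml_text)
    name = school.findtext("name")
    enrolment = int(school.findtext("enrolment"))
    return f"{name}: {enrolment} students on file"


print(booklet_line(RECORD))
```

The DTD plays the part of the agreed syntax, and booklet_line plays the part of the software that carries out a function with it.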
One use of XML is XHTML, which provides a more carefully formatted implementation of HTML using the XML syntax. XHTML is designed to provide a bridge between the existing web and new features. XHTML's stricter definition will require some minor adjustments by web authors, such as writing tags in lower case (<i> for italics, not <I>). In return for these minor inconveniences, XHTML allows extensions to be easily added for special applications.
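The stricter rules can be seen with any XML parser. The sketch below uses Python's standard library to check whether a fragment is well-formed XML; the helper name is_well_formed and the sample fragments are invented for illustration. It checks only well-formedness, not validity against the XHTML DTD, so an all upper-case fragment would still pass here even though XHTML itself requires lower case.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical loose HTML: <I> closed by </i>, and a <br> that is never closed.
loose = "<p>Some <I>italic</i> text<br></p>"
# The same fragment written the XHTML way.
strict = "<p>Some <i>italic</i> text<br /></p>"

print(is_well_formed(loose))   # False
print(is_well_formed(strict))  # True
```

XML is case sensitive, so the parser treats <I> and </i> as different tags and rejects the first fragment.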
Some examples of standards that use XML syntax, or are designed to work with it, are:
Resource Description Framework (RDF) for web-based metadata to allow content ratings, search engine data collection, business metadata, digital library collections, and distributed authoring.
Cascading Style Sheets (CSS) for adding styles, such as font size, font type and colour, to Web documents.
Electronic forms for e-commerce, such as the standard remittance advice format required to implement the Commonwealth Government's electronic procurement strategy.
Synchronised Multimedia Integration Language (SMIL) for integrating streaming audio, video, still images and text into a web presentation.
Open eBook Publication Structure for publishing of paper and electronic books.
Proposals for standard syntax for metadata, styles, multimedia and e-books are not new. What is new is that with XML these standards can be expressed using the same basic notation, and applications for them can be built with the same functional building blocks. Previously a video editing program would be built from different software from a book typesetting system. With XML, the same editor can handle a book or a movie. An XML enabled web browser will already have the ability to interpret an XML document built in, and will need only applets to render the document's components, so that the browser can present it as a book or a movie.
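A small sketch of the "same building blocks" idea: one generic function that walks any XML document, applied to two unrelated vocabularies. Both sample documents and their element names are invented for this example. They are not real Open eBook or SMIL markup.

```python
import xml.etree.ElementTree as ET

# Two invented documents in different vocabularies.
BOOK = "<book><title>Census Guide</title><chapter>Forms</chapter></book>"
SHOW = "<presentation><video src='intro.mpg'/><audio src='talk.wav'/></presentation>"


def outline(xml_text: str) -> list[str]:
    """List element names in document order, indented by nesting depth."""
    lines = []

    def walk(element, depth):
        lines.append("  " * depth + element.tag)
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return lines


# The same code handles both the "book" and the "movie".
print("\n".join(outline(BOOK)))
print("\n".join(outline(SHOW)))
```

Nothing in outline knows about books or presentations. Only the rendering step would need to differ between the two.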
What are the specific limitations of the PDF format?
Portable Document Format (PDF) is commonly used for publishing electronic documents where proprietary word processing and HTML formats are inadequate. However, PDF's origins are as a page description language and it inherits limitations from the printed page metaphor. PDF comes from a publishing model where the document creator decides what the final document will look like. Readers passively accept the content and format of the document as given; they are locked out of changing the document or how it is displayed. The web has encouraged a more interactive mode, where readers can adjust the layout of the document to suit them and select parts of a document to use, and where anyone can be a publisher.
While extensions for the web have been added to PDF, it is still primarily for producing static documents consisting of fixed size pages, with the font, text size and location of images fixed by the originator. PDF documents are designed to be self contained, so they can be sent in one file. PDF is designed as the format for the final published work, rather than the format the document would be edited in.
However, an electronic document read on screen needs to have the font size, style and screen layout dynamically adjusted to suit the display device and the person reading it. Screens have a lower resolution than a printed page, so larger font sizes and simpler font designs are needed for easy reading. The whole of a page designed for printing will not fit on today's computer screens, so the reader has to scroll clumsily back and forth across each line to read it. Readers with disabilities have the added problem of needing to increase the font size further, or to use Braille devices.
In contrast, web browsers designed for HTML don't have a page size built in. Text wraps to fit the screen size. The font style and size can be selected by the reader. Options allow the requirements of disabled readers to be met.
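The difference between a fixed page and reflowed text can be sketched in a few lines. The example below uses Python's standard textwrap module to fit the same text to two different widths, standing in for two screen sizes. The function name reflow and the widths are arbitrary choices for illustration, and a real browser lays out text by pixel width and font rather than by character count.

```python
import textwrap

TEXT = ("Web browsers designed for HTML do not have a page size built in. "
        "Text wraps to fit the screen size.")


def reflow(text: str, width: int) -> str:
    """Wrap text so that no line is longer than `width` characters."""
    return textwrap.fill(text, width=width)


# The same content, laid out for a narrow and a wide display.
for width in (30, 60):
    print(f"--- {width} columns ---")
    print(reflow(TEXT, width))
```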
Web pages use hypertext links to break up a large document into components. PDF documents can use the book metaphor of chapters, but all chapters are stored within the one large document. This can make reading one chapter of a large document a very slow process. Also HTML's simple text tags allow for the analysis and conversion of documents, for example the automated translation of web pages into other languages using the AltaVista Babel Fish Translation Service.
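To show why simple text tags make analysis easy, the sketch below pulls the readable text out of an HTML fragment using Python's standard html.parser module. The class TextExtractor, the function extract_text and the sample page are invented for this example. It covers only the extraction step that a conversion or translation service would need first, and says nothing about how any particular service works.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only runs of text that contain something other than whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> list[str]:
    """Return the text runs found in an HTML fragment, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


PAGE = "<h1>Why XML?</h1><p>The alphabet soup of <i>standards</i> can be hard to swallow.</p>"
print(" ".join(extract_text(PAGE)))
```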
XML for Common File Formats
On 19 July, Sun Microsystems, Inc. announced it would release the source code of its StarOffice (TM) Suite to the open source community under the GNU General Public License (GPL). Part of this is to define a set of XML-based file formats for word processing, spreadsheets and presentation tools. Combined with the capabilities of XML enabled web browsers, such as the open-source Mozilla, this provides the possibility of low cost software generating portable file formats. A document created in a presentation tool could then be presented using a web browser. There would be no need to convert the file from one proprietary format to another, or download a special viewer program; the web browser would display the document directly. It also creates the possibility of more flexible document formats, such as integrating a printable text document and a slide show in the one file, or displaying database records as a document. An ambitious example of attempting such a system is Mozilla.org's proposal for an open source combined word processor and Web editor, based on the Mozilla web browser.
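The idea of displaying database records as a document can be sketched as follows. The records, the element names and the function rows_to_xml are all invented for illustration, and the output is a generic XML document rather than the StarOffice file format, which is not described here.

```python
import xml.etree.ElementTree as ET

# Invented records standing in for rows read from a database.
ROWS = [
    {"name": "Example Primary School", "enrolment": 240},
    {"name": "Sample High School", "enrolment": 810},
]


def rows_to_xml(rows) -> str:
    """Serialise a list of dict records as one XML document."""
    root = ET.Element("schools")
    for row in rows:
        school = ET.SubElement(root, "school")
        # Each field becomes a child element named after the column.
        for field, value in row.items():
            ET.SubElement(school, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(rows_to_xml(ROWS))
```

Once the records are in this form, any XML-aware tool can read them without knowing which database they came from.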
Will XML Succeed?
XML is not certain to succeed. If fast and flexible editing and display software can be built for XML, it has a chance. However, the flexibility of the format may be its undoing. Anyone can easily invent an XML DTD and propose it as a standard. In theory any XML browser should be able to display the document. In reality document formats must be carefully designed and widely agreed to be useful. Well meaning groups, and companies trying to gain market share, could create a Babel of incompatible, overlapping and unimplementable XML standards. Examples of XML standards which perhaps need more thought are the DTD for Digital Talking Books, the Building Construction Extensible Markup Language (bcXML) and the Schools Interoperability Framework (SIF).
Tom Worthington is Director of the ACS Publishing Board, an e-business consultant to the federal government and a Visiting Fellow at the Australian National University. See: www.tomw.net.au
This material was prepared for the unit Information Technology in Electronic Commerce (COMP3410) at the Australian National University, semester 2, 2000. Accompanying documents discuss The Eighteen Character Problem and Metadata.
Further Information
- Also slides at: http://www.tomw.net.au/2000/yxml.ppt
- Published in Information Age magazine, September 2000
- Revised version YXML? - What is wrong with PDF? for Open Publish 2001, 31 July 2001, Sydney, with slides and Media Release: "Open Source Challenge to E-document Companies"
- Computing 3410
Comments and corrections to: webmaster@tomw.net.au
Copyright © Tom Worthington 2000.