An Introduction to XML Technologies

Postscript, 23 May 2003

These are a few thoughts after the presentation. My main message on XML is that it is useful, but is relatively new. E-publishing can use XML with confidence, as can simple e-commerce. But the efficiency of XML applications for very large document stores and high volume, high reliability, high security e-commerce is not proved.

The Ms-Word 11 and OpenOffice.Org 1.1 Betas show that XML will make it easier to interface office automation applications to publishing and e-commerce systems. This is despite press speculation that not all the XML functionality in the Ms-Word Beta will be in all versions of the final product, and despite limitations in OOO 1.1. But XML is not yet shown to work for interchange between OA applications or using formal document standards.

Last year I speculated that free open source XML tools could be used to build an e-publishing system for the ACS, equivalent to SGML ones which have cost millions of dollars. This year I am sufficiently confident to start a pilot of such a system.

What had been holding me up from building an e-publishing system was the feeling that “real” systems were not built by tacking together a few free XML utilities. But it now seems this is how real commercial systems and large government XML systems are being built.

This is similar to the conceptual leap which had to be made with SMTP e-mail and the web. These used simple text based formats and so looked like toys for academic experiments, not suitable for serious work. But simple email and HTML have proved useful for large scale business and government systems. XML should prove similarly useful, but will still need a lot of work.

Slides available

First see: Quick Evaluation of MS-Word 11 XML Features

What is XML Really?

There is much myth and mystery in the IT business as to what XML is and what it can do. This has created both false expectations amongst some business people as to what XML can do and skepticism amongst IT professionals as to whether it is of value.

At a technical level XML is very simple:

The Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web.

From: Extensible Markup Language (XML) Activity Statement, 1996-2003 W3C® (MIT, ERCIM, Keio), URL: http://www.w3.org/XML/Activity.html

Here is an example of XML in an Ms-Office document:

<o:DocumentProperties>

<o:Title>Role of Interface Manipulation Style and Scaffolding on Cognition and Concept Learning in Learnware</o:Title>

<o:Author>Jono Hardjowirogo</o:Author>

</o:DocumentProperties>
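As a rough illustration of how a program might read the fragment above, here is a minimal Python sketch using the standard library's ElementTree. The fragment in the text does not declare the "o:" prefix, so the namespace URI used here is an assumption added to make the example parse; the function name read_properties is also just for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "o:" prefix; the fragment in the text
# omits the declaration, and a parser will reject an undeclared prefix.
OFFICE_NS = "urn:schemas-microsoft-com:office:office"

fragment = f"""
<o:DocumentProperties xmlns:o="{OFFICE_NS}">
  <o:Title>Role of Interface Manipulation Style and Scaffolding on Cognition and Concept Learning in Learnware</o:Title>
  <o:Author>Jono Hardjowirogo</o:Author>
</o:DocumentProperties>
"""


def read_properties(xml_text):
    """Return a dict mapping each property's local name to its text."""
    root = ET.fromstring(xml_text)
    props = {}
    for child in root:
        # ElementTree stores qualified names as "{namespace-uri}localname".
        local_name = child.tag.split("}", 1)[-1]
        props[local_name] = (child.text or "").strip()
    return props


properties = read_properties(fragment)
print(properties["Author"])
```

The tags only delimit the data. It is the program that decides that "Author" is a person's name.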

If giving a presentation on XML, W3C suggests starting with these 10 points. I have interspersed my comments:

1. XML is for structuring data

Structured data includes things like spreadsheets, address books, configuration parameters, financial transactions, and technical drawings. XML is a set of rules (you may also think of them as guidelines or conventions) for designing text formats that let you structure your data. {But is XML expressive enough? } XML is not a programming language {XSLT is}, and you don't have to be a programmer to use it or learn it {Don't you?}. XML makes it easy for a computer to generate data, read data, and ensure that the data structure is unambiguous. XML avoids common pitfalls in language design: it is extensible, platform-independent, and it supports internationalization and localization. XML is fully Unicode-compliant.

2. XML looks a bit like HTML

Like HTML, XML makes use of tags (words bracketed by '<' and '>') and attributes (of the form name="value"). While HTML specifies what each tag and attribute means, and often how the text between them will look in a browser, XML uses the tags only to delimit pieces of data, and leaves the interpretation of the data completely to the application that reads it. In other words, if you see "<p>" in an XML file, do not assume it is a paragraph. Depending on the context, it may be a price, a parameter, a person, a p... (and who says it has to be a word with a "p"?). {The result is that semantics are scattered in program code?}

3. XML is text, but isn't meant to be read

Programs that produce spreadsheets, address books, and other structured data often store that data on disk, using either a binary or text format. One advantage of a text format is that it allows people, if necessary, to look at the data without the program that produced it; in a pinch, you can read a text format with your favorite text editor. Text formats also allow developers to more easily debug applications. Like HTML, XML files are text files that people shouldn't have to read, but may when the need arises. Less like HTML, the rules for XML files are strict. A forgotten tag, or an attribute without quotes makes an XML file unusable, while in HTML such practice is tolerated and is often explicitly allowed. The official XML specification forbids applications from trying to second-guess the creator of a broken XML file; if the file is broken, an application has to stop right there and report an error. {If computers are meant to read the XML, then why not make it more machine readable: ie: binary?}

4. XML is verbose by design

Since XML is a text format and it uses tags to delimit the data, XML files are nearly always larger than comparable binary formats. That was a conscious decision by the designers of XML. The advantages of a text format are evident (see point 3), and the disadvantages can usually be compensated at a different level. Disk space is less expensive than it used to be, and compression programs like zip and gzip can compress files very well and very fast. In addition, communication protocols such as modem protocols and HTTP/1.1, the core protocol of the Web, can compress data on the fly, saving bandwidth as effectively as a binary format. {Compression and uncompression consume computer resources. Also the text then has to be parsed after being uncompressed (effectively compressing it again). A binary format would do this better.}

5. XML is a family of technologies

XML 1.0 is the specification that defines what "tags" and "attributes" are. Beyond XML 1.0, "the XML family" is a growing set of modules that offer useful services to accomplish important and frequently demanded tasks. {The family is getting a bit large. Are all members equally useful?} Xlink describes a standard way to add hyperlinks to an XML file. XPointer and XFragments are syntaxes in development for pointing to parts of an XML document. An XPointer is a bit like a URL, but instead of pointing to documents on the Web, it points to pieces of data inside an XML file. CSS, the style sheet language, is applicable to XML as it is to HTML. XSL is the advanced language for expressing style sheets. It is based on XSLT, a transformation language used for rearranging, adding and deleting tags and attributes. The DOM is a standard set of function calls for manipulating XML (and HTML) files from a programming language. XML Schemas 1 and 2 help developers to precisely define the structures of their own XML-based formats. There are several more modules and tools available or under development. Keep an eye on W3C's technical reports page.

6. XML is new, but not that new

Development of XML started in 1996 and it has been a W3C Recommendation since February 1998, which may make you suspect that this is rather immature technology. In fact, the technology isn't very new. Before XML there was SGML, developed in the early '80s, an ISO standard since 1986, and widely used for large documentation projects. {An ISO standard is not necessarily a guarantee of success.} The development of HTML started in 1990. The designers of XML simply took the best parts of SGML, guided by the experience with HTML, and produced something that is no less powerful than SGML, and vastly more regular and simple to use. Some evolutions, however, are hard to distinguish from revolutions... And it must be said that while SGML is mostly used for technical documentation and much less for other kinds of data, with XML it is exactly the opposite.

7. XML leads HTML to XHTML

There is an important XML application that is a document format: W3C's XHTML, the successor to HTML. XHTML has many of the same elements as HTML. The syntax has been changed slightly to conform to the rules of XML. A document that is "XML-based" inherits the syntax from XML and restricts it in certain ways (e.g, XHTML allows "<p>", but not "<r>"); it also adds meaning to that syntax (XHTML says that "<p>" stands for "paragraph", and not for "price", "person", or anything else). {XHTML might be seen as having all the complexities of XML with none of the benefits of HTML.}

8. XML is modular

XML allows you to define a new document format by combining and reusing other formats. Since two formats developed independently may have elements or attributes with the same name, care must be taken when combining those formats (does "<p>" mean "paragraph" from this format or "person" from that one?). To eliminate name confusion when combining formats, XML provides a namespace mechanism. XSL and RDF are good examples of XML-based formats that use namespaces. XML Schema is designed to mirror this support for modularity at the level of defining XML document structures, by making it easy to combine two schemas to produce a third which covers a merged document structure. {Just because you combine two XML document formats doesn't mean that software designed for either will operate on the combination.}

9. XML is the basis for RDF and the Semantic Web

W3C's Resource Description Framework (RDF) is an XML text format that supports resource description and metadata applications, such as music playlists, photo collections, and bibliographies. For example, RDF might let you identify people in a Web photo album using information from a personal contact list; then your mail client could automatically start a message to those people stating that their photos are on the Web. Just as HTML integrated documents, menu systems, and forms applications to launch the original Web, RDF integrates applications and agents into one Semantic Web. Just like people need to have agreement on the meanings of the words they employ in their communication, computers need mechanisms for agreeing on the meanings of terms in order to communicate effectively. Formal descriptions of terms in a certain area (shopping or manufacturing, for example) are called ontologies and are a necessary part of the Semantic Web. RDF, ontologies, and the representation of meaning so that computers can help people do work are all topics of the Semantic Web Activity. {But how many real applications use RDF or the Semantic Web?}

10. XML is license-free, platform-independent and well-supported

By choosing XML as the basis for a project, you gain access to a large and growing community of tools (one of which may already do what you need!) and engineers experienced in the technology. {How mature are the tools? The big vendors are on the XML standard bodies so are they doing their usual “extend and conquer”?} Opting for XML is a bit like choosing SQL for databases: you still have to build your own database and your own programs and procedures that manipulate it, and there are many tools available and many people who can help you. And since XML is license-free, you can build your own software around it without paying anybody anything. The large and growing support means that you are also not tied to a single vendor. XML isn't always the best solution, but it is always worth considering.

From “XML in 10 points”, Bert Bos, W3C, 27 Mar 1999 (Revised 13 Nov. 2001), URL: http://www.w3.org/XML/1999/XML-in-10-points
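Point 3 above says that a forgotten tag or an unquoted attribute makes an XML file unusable, and that the application has to stop and report an error. The following is a minimal Python sketch of that behaviour using the standard library parser. The sample documents and the function name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The parser stops at the first error; it does not try to repair the input.
        return False, str(err)


good = '<price currency="AUD">10.50</price>'
missing_end_tag = '<order><price currency="AUD">10.50</order>'
unquoted_attribute = '<price currency=AUD>10.50</price>'

for sample in (good, missing_end_tag, unquoted_attribute):
    ok, message = is_well_formed(sample)
    print(ok, message)
```

An HTML browser would typically display something for all three samples. The XML parser accepts only the first.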

XML is very similar to SGML. What has brought XML to prominence may not be so much any of its inherent characteristics, but the success of the Internet, the Web and open source software development:

The Internet: XML uses pre-existing Internet standards, such as URLs to refer to XML documents in other documents,
The Web: XML is proposed as an improved replacement for HTML and other web formats,
Open Source: XML software is provided free as part of various open source projects (mostly web related).

XML started, like SGML, as a publishing format, but has later been applied as a data format for e-commerce and then as a programming language. The later the proposed use, the less accepted XML is.

Like HTML, an XML document consists of data surrounded by <tags>. The tags can be nested. Unlike HTML, there is not a fixed set of tags (such as <h1>, <h2>, <b>, <i>); the tags for a particular document are defined in a separate document (or not at all).

An XML document is only text. Any binary data has to be encoded as text or referenced via a URL. If a simple XML format is used for data interchange, then there is no need for complex XML software. As XML is just text, any computer programming language can be used to create or read an XML document. However, complex XML documents, or large numbers of different formats, may call for specialized XML software.

XML is not necessarily the best format to use. As an example the National Center for Biotechnology Information (NCBI) will supply the structure of the SARS virus in XML format, but they store it in Abstract Syntax Notation 1 (ASN.1). ASN.1, which is used in the telecommunications industry, is more compact than XML and easier to encode. But XML has become more popular, so NCBI support it, even though they see limitations with it:
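To show what "just text" means in practice, here is a small Python sketch which writes an XML document using nothing but string operations, with binary data carried as base64 text. The element names (attachment, name, data) and both function names are hypothetical, chosen only for this example.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def make_attachment(name, payload):
    """Build a small XML document using only string operations."""
    # Binary data cannot appear directly in XML, so encode it as base64 text.
    encoded = base64.b64encode(payload).decode("ascii")
    return (
        "<attachment>"
        f"<name>{escape(name)}</name>"  # escape() turns & and < into entities
        f'<data encoding="base64">{encoded}</data>'
        "</attachment>"
    )


def read_attachment(xml_text):
    """Parse the document and decode the binary payload again."""
    root = ET.fromstring(xml_text)
    name = root.findtext("name")
    payload = base64.b64decode(root.findtext("data"))
    return name, payload


document = make_attachment("logo & banner.bin", bytes([0, 255, 16, 32]))
name, payload = read_attachment(document)
print(document)
```

The one thing string building does not do for you is escaping of characters such as "&" and "<", which is why even this simple writer borrows escape() from the library.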

ASN.1 does not require that a name be unique except within a structure, similar to C or C++. XML however requires that all names be unique across the DTD, unless they are attributes which must come from a limited repertoire. Many XML parsers rely on this so that callback functions are associated with a tag, not a tag within context. ...

NCBI Data in XML, (undated), URL: <http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/ncbixml.txt>
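The point about callbacks being tied to a tag rather than a tag in context can be sketched with Python's standard SAX parser. The parser calls startElement with only the tag name, so if the same name is used in two places the program has to track the context itself, here with a stack. The sample record and its values are invented for illustration and are not NCBI data.

```python
import xml.sax


class ContextHandler(xml.sax.ContentHandler):
    """Collect text per element path so one tag name can be told apart by context."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.values = {}  # path such as "record/gene/name" -> text

    def startElement(self, name, attrs):
        # The callback receives the bare tag name, with no context.
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        if content.strip():
            path = "/".join(self.stack)
            self.values[path] = self.values.get(path, "") + content


# Invented sample: <name> appears under two different parents.
sample = b"""<record>
  <gene><name>example-gene</name></gene>
  <organism><name>example-organism</name></organism>
</record>"""

handler = ContextHandler()
xml.sax.parseString(sample, handler)
print(handler.values)
```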

Agreeing a Format

Many of the benefits and problems attributed to XML actually come from the process of standardization of the data format. Before two parties can use XML (or any data interchange format) to communicate information, they must agree a common set of data to communicate. It is reasonably simple to encode data in XML format, but only after the parties have agreed on the data to encode. Where there is not an agreement and two different XML encodings are used, there is no XML standard.

Defining a format

Until about a year ago XML used the same DTD standard for defining XML document formats as SGML. DTDs are defined using their own syntax, which was developed for structuring documents, not data. It is difficult to read and write. The DTD for OpenOffice XML files is thousands of lines long, was not actually used by the OOO software and contained syntax errors in the first version (which were reported to the OOO project and fixed).

There is now the option of defining XML formats in XML Schema, which uses an XML syntax. XML Schema has features more suited to data processing and can be processed using XML tools, but is still a complex language to use.
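Because an XML Schema is itself an XML document, it can be read with the same tools as the documents it describes. The following Python sketch parses a small schema with ElementTree and lists the elements it declares. The invoice schema is hypothetical, and the code only inspects the schema; it does not validate documents against it (the Python standard library has no schema validator).

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema for illustration only. A DTD could declare the same
# three child elements, but could not say that "issued" must be a date.
schema_text = f"""
<xs:schema xmlns:xs="{XSD_NS}">
  <xs:element name="invoice">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="number" type="xs:positiveInteger"/>
        <xs:element name="issued" type="xs:date"/>
        <xs:element name="total" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def declared_elements(xsd_text):
    """List (name, type) for every xs:element declaration, in document order."""
    root = ET.fromstring(xsd_text)
    return [
        (element.get("name"), element.get("type"))
        for element in root.iter(f"{{{XSD_NS}}}element")
    ]


elements = declared_elements(schema_text)
for name, type_name in elements:
    print(name, type_name)
```

Even this tiny schema takes a dozen lines, which gives some idea of why XML Schema is still a complex language to use.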

Verbosity

XML encourages verbose tags to be used to make the encoding understandable. For example, <DocumentProperties> from Microsoft Office XML. But tags can be cryptic, such as <ABN> from the ATO's Business Activity Statement XML format. It has been assumed that this verbose text format results in inefficient data storage and transmission. Some XML based formats, such as WML, define a short byte code option for ease of transmission. However, as they are text documents with a regular structure, XML documents compress well. As will be seen later, an Ms-Word XML document can compress smaller than the native “DOC” format. Many data transmission systems use compression by default, so there may be little benefit in preprocessing the XML documents.

XHTML

XHTML is a derivative of HTML which has been tidied up to conform to the XML syntax. This allows XHTML to be processed by XML tools and to display reasonably well with HTML browsers. However, XHTML is more difficult to produce manually and XML tools will not correctly process documents which do not strictly conform to the XHTML standard. In practice web browsers will need to continue to accept various versions of HTML and common misformatting of HTML documents. A web browser which only displayed strictly correct XHTML documents would be of little practical value.
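The claim that regular, verbose XML compresses well is easy to try. This Python sketch builds a deliberately repetitive XML document and compresses it with gzip from the standard library. The tag names and the generated content are invented, so the ratio it prints says nothing about real Ms-Word files; it only shows the general effect.

```python
import gzip


def build_sample(rows):
    """Build a deliberately verbose XML document with repeated tags."""
    parts = ["<DocumentProperties>"]
    for i in range(rows):
        parts.append(
            f"<Property><PropertyName>Field{i}</PropertyName>"
            f"<PropertyValue>{i * 7}</PropertyValue></Property>"
        )
    parts.append("</DocumentProperties>")
    return "".join(parts).encode("utf-8")


raw = build_sample(1000)
compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)

# Sizes in bytes before and after, and the compressed size as a fraction.
print(len(raw), len(compressed), round(ratio, 3))
```

The long tag names are repeated on every row, which is exactly the kind of redundancy a general purpose compressor removes.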

XML Word processing formats

Newer versions of word processors are providing XML output. However, two word processors with XML formats cannot necessarily exchange documents. As an example, Ms-Word 11 Beta and OOO 1.1 Beta both use XML document formats, but these formats are very different. Exchange of documents will require translation, with some features not translating well, or at all. Creation of such translations is a little easier with XML, as the XSLT language can be used to write the transformation (essentially a macro language in XML syntax to operate on XML documents). Ms-Word 11 comes with XSLT facilities built in; OOO relies on having a Java environment with XSLT (such as Sun's free Java 2 SDK 1.4.1_02). However, someone has to write a to and a from transformation for each pair of word processing formats. In theory one set of transformations could be written to and from one universal common standard format, but such standardization does not yet exist. Also XSLT is a new computer programming language and implementations are likely to be slow, even compared to languages such as Java.
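The difference between pairwise translation and translation via a common format is simple arithmetic, sketched here in Python. With n formats, each ordered pair needs its own transformation, giving n × (n − 1), whereas a common format needs one transformation to it and one from it per format, giving 2 × n. The function names are made up for this example.

```python
def pairwise_transformations(n_formats):
    """One transformation in each direction for every pair of formats."""
    return n_formats * (n_formats - 1)


def hub_transformations(n_formats):
    """One transformation to and one from a common format, per format."""
    return 2 * n_formats


# Columns: number of formats, pairwise count, common-format count.
for n in (2, 3, 5, 10):
    print(n, pairwise_transformations(n), hub_transformations(n))
```

For two word processors the pairwise approach is cheaper (2 transformations against 4), the two break even at three formats, and by ten formats it is 90 against 20. So a common format only pays off once more than a few formats are involved.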

Undocumented XML Formats

Even where XML is used, extensive manual work may be needed to have a system export or import the format. As an example the Australian Taxation Office (ATO) has defined XML formats for interchange of tax data, such as Business Activity Statements. However, the ATO does not provide an XML based definition of the format which can be input to a system. Instead examples of the format and tables of definitions for the data fields are provided. The system developer has to manually translate this into XML Schema format to use (and then check it). Software companies who wish to claim XML compatibility for marketing purposes need not provide any documentation at all about the XML format they use.

Layers of Languages

From the establishment of XML several years ago, we have seen the creation of a bewildering array of XML based languages. In general these languages have acronyms ending in ML. One of XML's advantages is that anyone can create an XML based language. One of the disadvantages is that it is difficult to distinguish formal official XML standards, issued by bodies such as W3C, from de-facto standards and company products. But the fewer letters in the acronym, the more likely it is to be a recognized standard.

Not all formal XML standards are widely used and some (such as “Web Services”) are yet to be formally standardized. Using XML technologies becomes an exercise in selecting combinations of versions, subsets and extensions of standards, which will work on particular computer systems.

What is emerging from XML are layers of meaning in linked documents. An XML document can contain references to documents which define its structure, possible data values, how to display and transform its contents. As the document references in XML are URLs to on-line documents also in XML, these can be read by software to automatically find out what to do with the document.
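As a sketch of how software might follow these links, the Python below reads a document and pulls out the URL of its stylesheet (from the xml-stylesheet processing instruction) and of its schema (from the xsi:noNamespaceSchemaLocation attribute). The document and the example.org URLs are placeholders invented for this example, and the code only finds the references; it does not fetch them.

```python
from xml.dom import minidom

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical document; the example.org URLs are placeholders.
document_text = f"""<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://example.org/report.xsl"?>
<report xmlns:xsi="{XSI_NS}"
        xsi:noNamespaceSchemaLocation="http://example.org/report.xsd">
  <title>Quarterly figures</title>
</report>"""


def linked_documents(xml_text):
    """Return the stylesheet and schema references found in the document."""
    dom = minidom.parseString(xml_text)
    links = {}
    # Processing instructions before the root element are children of the document node.
    for node in dom.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            links["stylesheet"] = node.data
    schema = dom.documentElement.getAttributeNS(XSI_NS, "noNamespaceSchemaLocation")
    if schema:
        links["schema"] = schema
    return links


links = linked_documents(document_text)
print(links)
```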

Database Related Standards
Web Services Description Language (WSDL) Version 1.2		W3C Working Draft, 9 July 2002	http://www.w3.org/TR/wsdl12
Web Services Description Language (WSDL) Version 1.2	Bindings	W3C Working Draft, 9 July 2002	http://www.w3.org/TR/wsdl12-bindings
SOAP Version 1.2: A lightweight protocol for exchanging structured information in a decentralized, distributed environment.	Part 0: Primer	W3C Working Draft, 26 June 2002	http://www.w3.org/TR/soap12-part0
	Part 1: Messaging Framework	W3C Working Draft, 26 June 2002	http://www.w3.org/TR/soap12-part1
	Part 2: Adjuncts	W3C Working Draft, 26 June 2002	http://www.w3.org/TR/soap12-part2
XML Schema: Extended capabilities for describing the structure and datatypes of XML documents, beyond DTDs.	Part 0: Primer	W3C Recommendation, 2 May 2001	http://www.w3.org/TR/xmlschema-0/
	Part 1: Structures	W3C Recommendation, 2 May 2001	http://www.w3.org/TR/xmlschema-1/
	Part 2: Datatypes	W3C Recommendation, 2 May 2001	http://www.w3.org/TR/xmlschema-2/

Document Related Standards
Extensible Stylesheet Language (XSL) Version 1.0 (aka XSL-FO)		W3C Recommendation, 15 October 2001	http://www.w3.org/TR/xsl/
XSL Transformations (XSLT) Version 1.0		W3C Recommendation, 16 November 1999	http://www.w3.org/TR/xslt
Extensible Markup Language (XML) 1.0 (Second Edition)		W3C Recommendation, 6 October 2000	http://www.w3.org/TR/REC-xml

From: Documents and databases, Tom Worthington, December 2002, URL: http://www.tomw.net.au/2002/ebcwxml.html

Opportunities with XML

The major benefit of XML comes not from any of its technical features, but as a way to focus the IT industry on adopting usable standards for data interchange. Arguments about what syntax should be used are difficult to sustain with the simple XML syntax being available and widely supported. Standards committees can get on with the important and difficult task of defining the data they want to interchange. If committees do not agree on what XML syntax to use, then we will not have the benefits of standardization, even if XML is used.

Assuming the Ms-Word 11 and OOO 1.1 Betas become released products and XSLT transformations are written for them, it should be much easier to exchange word processing documents. It should also be much easier to convert those documents into other formats. It may be possible to replace WP, PDF and HTML with one common format, suitable for creating documents, on screen display and high quality printing.

Assuming standardized XML e-commerce formats can be agreed, it should be possible to use these formats within systems, as well as between them. This should make it possible to build complex systems using low cost standardized components.

The Extreme Programming equivalent in XML is xtUML. UML is a diagramming language for describing computer systems, which can be represented in XML format. xtUML is a version of UML where the diagrams are automatically converted into computer software and executed. UML diagrams are converted into executable code using “model compilers”, which may be written in XML format, using XSLT to do the translation.