Documents and databases

Making sense of developments in eBusiness, eCommerce, ePublishing and eLaw

Tom Worthington FACS

Visiting Fellow, Department of Computer Science, Australian National University, Canberra, and Director of Publications for the Australian Computer Society

For the Information Industry Outlook Conference (IO2002), Canberra, 9th November;
Twilight Forum, ACS Western Sydney Chapter, 12th November; and
Canberra, 4th December, 2002.
This document is Version 5.0 2 December 2002: http://www.tomw.net.au/2002/ebcwxml.html
Version 4.1 11 November 2002 (as presented in Canberra and Sydney) is at: http://www.tomw.net.au/2002/ebcwxml2.html

Summary

This presentation discusses the use of web based electronic document and transaction formats for e-commerce. A simple enhancement of web services transactions is proposed to bridge the gap between the needs of small and large businesses, to make e-commerce practical. The convergence of electronic document formats on XML is demonstrated using open source software. It is argued that XML can now be used to replace both the Microsoft Word document format for word processing and PDF for electronic publishing. A proposal to use these techniques to implement the Australian government's e-commerce strategy is presented.

Preface: What I Have Been Doing Recently

Those reading or attending my presentations will understand that they try to use personal experience to illustrate the technical complexities. The last major update to the story was in mid 1999, with the publication by the ACS of Net Traveller; as it says in the introduction:

This book is about how the Internet and the World Wide Web became a part of my everyday life - for business and pleasure. It consists of edited versions of web pages and other on-line documents prepared during my last five years working and living on-line... I hope this work will dispel some myths about how inevitable technological developments are and how infallible technologists are.

The release of the book marked a change in my working life. After nineteen years in government, (nine on IT policy at the Department of Defence), I had become an independent consultant and a Visiting Fellow at the Faculty of Engineering and Information Technology, Australian National University.

Apart from the loss of a secure income and not having to go to so many meetings, not much changed. The challenge was still to attempt to act as a conduit between my highly technical information technology colleagues and my business orientated clients. As well there was the opportunity to teach this process to IT and e-commerce students.

Three of the courses I helped teach were:

These would not seem to have much to do with each other. The first is about the design of web pages and web sites, but for IT people, rather than publishers or graphic designers. The second is a grab bag of topics which no one else could be found for and which I happened to have some experience in. The third is a service course for non-IT people, to teach them the basics of using presentation tools. Curiously, the issues and technologies in these courses have converged in the last three years. This doesn't mean that the same technology can be used for all, or that it is easier; in some ways it is harder.

Introduction

In its report "Accelerating the Uptake of E-Commerce by SMEs: A Report and Action Plan" ( July 2002 ) the SME E-Commerce Taskforce published an action plan for accelerating the uptake of e-commerce. The suggested steps for Australian small and medium sized enterprises (SMEs) were:

  1. Get on-line and get email

  2. Get Internet banking

  3. Get a website, initially to advertise the business phone number and email address

  4. Get an interactive dynamic e-commerce system integrated with traditional business systems

  5. Get voice and data systems integrated.

The first three steps are good simple advice, but step 4 is an absurdly large leap. This is the equivalent of telling a new aerospace company: "first build a balsa wood glider, then a space shuttle". There need to be more steps for an easier transition by SMEs between a simple web site and an interactive dynamic e-commerce system. Also the last step of voice and data integration doesn't appear to relate to e-commerce and seems to have been included because the list came from a telecommunications vendor.

A more appropriate list might be obtained by dropping the last item and inserting the extra step of using manual electronic documents. Those manual e-documents can then be integrated into an automated system. This makes the introduction of e-commerce something which can be done gradually as resources allow:

Suggested steps for small business e-commerce:

  1. Internet access

  2. Internet banking

  3. Check statements

  4. Website

  5. Start replacing paper documents with electronic ones

  6. Progressively automate the processing of the electronic documents

The 2002 Yellow Pages E-business report suggests that small businesses aren't adopting more than e-mail and simple web sites because they can't see how to make money out of it. SMEs need to be given a solid case which shows how to reduce costs or increase profits with e-commerce. Offering complex, unproven new technology will just increase business skepticism.

Using the Internet for business is much harder than it looks. This is made worse by "experts" recommending complicated implementations. Small business can be shown how to save money by using the Internet to do simple things like replacing paperwork with electronic documents. They will then be ready to do something more complex, with integrated e-commerce. New XML technologies, including Web Services, can make that transition possible. Open Source software can ease the implementation issues.

The Australian Taxation Office has an ambitious project to introduce electronic processing of taxation forms using XML technology. This could be made more tangible to the small business by making the forms used directly printable.

Demonstrations:

  1. Electronic Document Conversion with OpenOffice.Org

    • The OpenOffice word processor can be used to convert MS-Word documents to XML and HTML format while retaining structural information.

    • MS-Word -> OOO XML -> HTML

  2. Web Services

    • Amazon provides a database query service using web services XML documents and XSLT transformations.

    • Amazon DB -> Web Services XML -> XSLT Transformation -> HTML

    • Data returned from the Amazon database can be transformed for display with XSLT to produce a formatted HTML document using recent XSLT capable browsers, such as Mozilla 1 and IE 6. No additional software is required. No HTML need be stored or transmitted as it is created in the browser.

  3. Formatting the eBAS with XSL

    • The Australian Taxation Office (ATO) provides specifications for electronic versions of tax forms, including the Business Activity Statement (BAS) in relation to the Goods and Services Tax (GST). This is a demonstration of one of these, the BAS, transformed into printable documents.
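The second demonstration's pipeline (Web Services XML -> transformation -> HTML) can be sketched as follows. This is only an illustration under stated assumptions: the element names (`results`, `book`, `title`, `price`) and the sample data are hypothetical and are not Amazon's actual response format, and Python's standard library stands in for the browser's XSLT engine, so the code shows the shape of the transformation rather than XSLT itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical query result; NOT the real Amazon response format.
RESULT_XML = """\
<results>
  <book><title>Example Title A</title><price>25.00</price></book>
  <book><title>Example Title B</title><price>40.50</price></book>
</results>"""


def results_to_html(xml_text):
    """Turn data-only XML into an HTML table, much as a stylesheet would."""
    source = ET.fromstring(xml_text)
    table = ET.Element("table")
    header = ET.SubElement(table, "tr")
    for heading in ("Title", "Price"):
        ET.SubElement(header, "th").text = heading
    # One table row per <book> element in the source data.
    for book in source.findall("book"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "td").text = book.findtext("title")
        ET.SubElement(row, "td").text = book.findtext("price")
    return ET.tostring(table, encoding="unicode", method="html")


html = results_to_html(RESULT_XML)
print(html)
```

The point of the sketch is that the data carries no presentation at all; the HTML exists only after the transformation has run, which in the demonstration happens inside the browser.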

Documents or databases?

document, n. ...

Something written, inscribed, etc., which furnishes evidence or information upon any subject, as a manuscript, title-deed, tomb-stone, coin, picture, etc.

Database ...

A structured collection of data held in computer storage; esp. one that incorporates software to make it accessible in a variety of ways; transf., any large collection of information.

From: OED Online, SECOND EDITION, 1989

Documents and databases represent two extremes in the aims and methods of electronic commerce. At the one extreme we have electronic documents which are fixed in content and format, are individual distinct entities, can be displayed using software from different suppliers, are expected to last for years and outlive the software which created them. At the other extreme a database has content which changes, can be displayed in different ways, may only be of value for minutes or months and may depend on one version of database software. This is not to say that all documents are fixed and all databases fluid, but is a useful generalisation.

Where technology offers the features of both documents and databases, it may still be useful to partition the application into these categories to make it easier to understand. As an example, a dynamic web site can display different content to each user each time they look at it. But this can confuse the user, who goes back to look at something again and finds it has changed, or tells an acquaintance to look at it and they see something different. Where there is a legal dispute, a company needs to be able to reproduce what was on its e-commerce web site at a particular time, with sufficient fidelity to satisfy a court.

The web designer can help lessen confusion by identifying which parts of a web site are relatively unchanging and which parts are dynamic. A static web page is usually seen as less sophisticated and less useful than a dynamic one, but if the need is for evidence or information about a subject at a particular point in time, then it may be more suitable. Some organisations log a copy of each web page as viewed, effectively turning the dynamic site into a series of static documents.
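Logging each page as viewed might look something like the sketch below. The function names, the record layout and the sample pages are all assumptions made for illustration; a real system would also need durable storage, and nothing here is a claim about what a court would accept.

```python
import hashlib
from datetime import datetime, timezone


def snapshot(page_log, url, rendered_html, served_at=None):
    """Append a fixed copy of a page, exactly as served, to the log."""
    record = {
        "url": url,
        "served_at": served_at or datetime.now(timezone.utc),
        # The digest lets a later reader check the copy has not been altered.
        "sha256": hashlib.sha256(rendered_html.encode("utf-8")).hexdigest(),
        "content": rendered_html,
    }
    page_log.append(record)
    return record


def as_served(page_log, url, when):
    """Return the latest snapshot of url taken at or before 'when', if any."""
    matches = [r for r in page_log if r["url"] == url and r["served_at"] <= when]
    return max(matches, key=lambda r: r["served_at"]) if matches else None


# Hypothetical dynamic page whose content changed between two views.
log = []
snapshot(log, "/prices", "<p>Widget: $10</p>", datetime(2002, 11, 1, tzinfo=timezone.utc))
snapshot(log, "/prices", "<p>Widget: $12</p>", datetime(2002, 12, 1, tzinfo=timezone.utc))
page = as_served(log, "/prices", datetime(2002, 11, 15, tzinfo=timezone.utc))
print(page["content"])
```

Each logged record is a static document in the sense used above: fixed in content, individually identifiable and tied to a point in time.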

This is not to say that static is good and dynamic bad. A PDF file can provide a reasonable facsimile of a static printed document, complete with the page numbers, columns and signatures. But if the need is for something which can be read on a small screen, then a transformation of the information is required. The latest version of PDF provides more options for display on small screens and for interfacing to display devices for the disabled.

Flexible documents

At the one extreme HTML provides a way to create simple electronic documents which can display on a variety of systems, including small wireless devices, TV displays and on special devices for the disabled. But HTML doesn't provide fine control over the format of the document, especially when printed.

At the other extreme PDF provides a format for close control over the look of a document, as to layout, font and such like, but less flexibility. While recent improvements in PDF do allow more options for flowing text to make it more readable and to structure the document in an XML-like format, this requires extra work from the author and so far few people have bothered. In practice two versions of a document have to be produced: the web version for on-screen display and the PDF one for printing. Even where these two versions are automatically generated from the one common source, they involve extra effort for the people creating and reading them.

XML now provides formatting options to allow the HTML-like flexibility, plus the fine formatting control of PDF. OpenOffice.org's XML based file format is not perfect, but it does provide an efficient way to package up all the elements of an XML document (including images) into one compressed file. This provides the prospect of formats which can be edited in a word processor, displayed as a web page, transformed for a hand held device or printed with specific styles.
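The packaging idea can be sketched in a few lines. An OpenOffice.org document is a zip archive holding a `content.xml` file alongside other parts such as images; the sketch below builds and reads such a package in memory. The XML shown is a deliberately simplified stand-in (the real format uses namespaced elements and several more files), and the file name `Pictures/logo.png` and its bytes are placeholders.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Simplified, hypothetical content; not the real OpenOffice.org schema.
CONTENT_XML = "<document><p>First paragraph.</p><p>Second paragraph.</p></document>"


def build_package(content_xml, image_bytes):
    """Bundle the XML text and an image into one compressed file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("content.xml", content_xml)
        package.writestr("Pictures/logo.png", image_bytes)
    return buffer.getvalue()


def read_paragraphs(package_bytes):
    """Open the package and pull the paragraph text out of content.xml."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))
    return [p.text for p in root.iter("p")]


package_bytes = build_package(CONTENT_XML, b"placeholder image bytes")
print(read_paragraphs(package_bytes))
```

Because the package is an ordinary zip file and the content is ordinary XML, general purpose tools can get at the text without the word processor being installed.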

If, as reported, Microsoft includes XML support in the next version of its Office suite and it doesn't work with Windows 9x, then OpenOffice.org may provide a viable alternative:

Microsoft's newest Office suite -- still tagged with the mundane Office 11 moniker during beta testing -- will only work with Windows 2000, XP, and later releases, Microsoft has confirmed.

... Among Office 11's most touted new features is its reliance on the open XML as a native file format...

Office 11 Won't Work With Older Microsoft Oses, By Gregg Keizer, TechWeb News, Thursday, October 31, 2002, 9:48 p.m. ET, URL: http://www.internetwk.com/breakingNews/INW20021031S0016

OASIS has announced a committee to work on an Open Office XML Format based on OpenOffice.Org:

The resulting file format must meet the following requirements:

  1. it must be suitable for office documents containing text, spreadsheets, charts, and graphical documents,

  2. it must be compatible with the W3C Extensible Markup Language (XML) v1.0 and W3C Namespaces in XML v1.0 specifications,

  3. it must retain high-level information suitable for editing the document,

  4. it must be friendly to transformations using XSLT or similar XML-based languages or tools,

  5. it should keep the document's content and layout information separate such that they can be processed independently of each other, and

  6. it should 'borrow' from similar, existing standards wherever possible and permitted.

Since the OpenOffice.org XML format specification meets these criteria and has proven its value in real life, this TC will use it as the basis for its work.

From: OASIS TC Call For Participation: Open Office XML TC, Karl Best, Mon, 04 Nov 2002 08:32:27 -0500, URL: http://lists.oasis-open.org/archives/tc-announce/200211/msg00001.html

Self Documenting Transaction processing Systems

Web Services provides a way to define a set of transactions and the format of the data on which those transactions can be performed. The XML document used to define a web service includes the list of transactions which are available from the particular transaction processing system, plus the data types they use. The format is self documenting, as any external definitions used are included in the document as URLs.

In theory you could point your system at a web services definition and the system could extract all the information it needed to interact with the system. In practice security concerns and implementation issues may intervene. Also the size and complexity of the system may make any simple automated interface difficult. As an example the HR-XML Consortium Specifications currently are 55 Mbytes in 1624 files.
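As a rough illustration of "pointing a system at a definition", the sketch below reads a tiny service description and lists the transactions it offers. It assumes the WSDL 1.1 element names (`definitions`, `portType`, `operation`) and namespace rather than the 1.2 working draft; the service itself, its operation names and the `example.com` address are hypothetical.

```python
import xml.etree.ElementTree as ET

# WSDL 1.1 namespace; the 1.2 working draft differs in detail.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hypothetical, heavily cut down service description.
SERVICE_DESCRIPTION = """\
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             name="RemittanceService"
             targetNamespace="http://example.com/remittance">
  <portType name="RemittancePort">
    <operation name="SendRemittanceAdvice"/>
    <operation name="QueryPaymentStatus"/>
  </portType>
</definitions>"""


def list_operations(wsdl_text):
    """Return the names of the transactions a service says it supports."""
    root = ET.fromstring(wsdl_text)
    operations = root.findall("wsdl:portType/wsdl:operation", WSDL_NS)
    return [operation.get("name") for operation in operations]


operations = list_operations(SERVICE_DESCRIPTION)
print(operations)
```

A real description would also carry the message and data type definitions, which is where the size and complexity mentioned above come from.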

Are Documents and Databases different?

The same XML syntax is used to define document formats and transaction processing formats, but is this a good idea? Are documents and databases used for different human purposes and therefore have different requirements? Are the technical requirements different? As an example, transactions are small, while documents are large; this affects the way they are defined and the design of the processing systems to handle them.

In computer programming there is a well established distinction between declarative programming languages and imperative ones. The details are a bit technical, but imperative languages are the ones that are commonly used, like Java, JavaScript and Basic, where changing values are assigned to variables (such as total = price + shipping charge). The designer of the program explicitly says what the program does at each step and which step follows which. When something goes awry a lot of time is spent "debugging", to find where the program was up to and which variable was changed.

Declarative programs are less commonly used, but in these you give a variable a value once and it can't change. To create new values, new variables are created. The order in which operations are carried out doesn't matter and is left to the software to decide. If you can work out how to write such programs, they have few places for bugs to occur.
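The difference can be shown with the total example. Python is itself an imperative language, so the second function below only imitates the single assignment style; the function names and the amounts (in cents) are made up for illustration.

```python
def total_imperative(prices, shipping):
    """Imperative style: one variable is changed step by step."""
    total = 0
    for price in prices:
        total = total + price      # 'total' holds a different value each time round
    total = total + shipping       # and is changed once more here
    return total


def total_declarative(prices, shipping):
    """Declarative flavour: each name is given a value once and never changed."""
    goods = sum(prices)            # what the value is, not how to step towards it
    return goods + shipping


order = (1995, 500)                # hypothetical item prices, in cents
print(total_imperative(order, 650), total_declarative(order, 650))
```

In the first version a bug hunt means asking what `total` held at each step; in the second there is no intermediate state to inspect.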

Early web sites and electronic documents were created by graphic designers and print publication designers with tools similar to those for desk top publishing. The documents had static structure and content. With these WYSIWYG web editors you could be reasonably confident what you see on your screen would be what the user of the document would see later.

Dynamic web sites added imperative computer programming logic. Commands retrieved data from databases and created web pages based on the results. Complex web sites are complex programming operations and are debugged like computer programs.

New web technologies are making web sites more like using a declarative programming language. Think of a document in XML as a computer program. To run the program you submit it to a web browser program. The web browser reads the XML text and interprets it. The XML can involve references to CSS style sheets and other items which have to be processed to create the final result (a document displayed or printed). A simple HTML web page can appear very differently, depending on the interaction of several different Cascading Style Sheets (CSS). The individual CSS statements are very short and easy to understand and don't have any variables, but the behaviour of a few together can be remarkably difficult to understand.

Support for CSS has been patchy, with different versions of browsers having different CSS bugs. Even without bugs in browsers, what a series of CSS statements will do can be difficult to predict. This shouldn't be surprising, as laying out a set of boxes to fit on a screen is like solving a set of equations, or executing a declarative program.
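A much simplified model of how competing style rules are resolved gives a feel for why a few short statements can interact in surprising ways. In this sketch specificity is reduced to a single number and rule origin and importance are ignored, so it is an assumption-laden toy rather than the real cascade; the rules themselves are invented.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    specificity: int   # simplified to one number; real CSS compares several counts
    order: int         # position across all style sheets, later is larger
    prop: str
    value: str


def computed_style(rules):
    """Pick one winning value per property: higher specificity, then later order."""
    winners = {}
    for rule in rules:
        current = winners.get(rule.prop)
        if current is None or (rule.specificity, rule.order) > (current.specificity, current.order):
            winners[rule.prop] = rule
    return {prop: rule.value for prop, rule in winners.items()}


# Hypothetical rules that all apply to the same paragraph.
RULES = [
    Rule(1, 0, "color", "black"),      # p { color: black }       first sheet
    Rule(11, 1, "color", "red"),       # p.note { color: red }    first sheet
    Rule(1, 2, "color", "grey"),       # p { color: grey }        second sheet
    Rule(1, 3, "font-size", "10pt"),   # p { font-size: 10pt }    second sheet
]

style = computed_style(RULES)
print(style)
```

The author of the second sheet may expect grey text, but the more specific rule in the first sheet still wins; nothing in either sheet, read on its own, says so.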

If CSS is difficult to understand, then who will be able to cope with newer XML technologies? XML provides the power to define your own electronic document format and, with XSLT, to transform documents into other XML documents, HTML or something else. Better tools to create XML documents may help, but XML may need to be simplified to be widely usable.

If XML is so difficult to use then why bother to do it? If we can get these standards to work we can create the e-commerce equivalent of the web: a global e-commerce system composed of relatively simple interlinked individual systems which are each self documenting. Start anywhere in the system and you will be able to find out what that part of the system uses and is used by. Request a new component and it can automatically build the links needed to be part of the global system. Just as the web made hypertext systems practical and popular, XML based technologies such as web services might make e-commerce systems practical.

The latest web browsers, including Mozilla 1 and IE 6, have good support for CSS. They also have direct support for XSLT. This allows XML documents and Web Services transactions to be displayed on screen, or printed, in a human readable, well formatted form.

The Law Recognises electronic documents and databases

Recently the High Court of Australia considered the curly question of whether the Migration Act's definition of documents included electronic documents stored in a database, and how you "give" someone a document which is stored in a database:

  1. ... The ordinary dictionary meaning of "document" is a printed or written paper containing information. That definition of "document" is not apt to cover the sequence of electronic impulses in the electronic circuits of a computer disc that store information. ... No violence is done to the object or language of s 418(3) by holding that "document" includes information that is stored in a computer or a fax machine and which can be printed out by pressing one or more keys or buttons. No reason appears for thinking that Parliament intended to distinguish between information stored on paper and information stored in the electronic impulses of a computer that can be printed on paper by pressing a key or keys on the computer's keyboard. Statutes are always speaking to the present. If we can, we should give the words of a statute - which after all are only the means of conveying ideas and information to the public - a meaning that covers contemporary processes and accords with the object of the enactment[25]. ...

  2. "Documents" may include electronic documents: What, then, does the word "document" mean in such a context? Today, in ordinary speech, one can readily refer to a "document" in a database, although such a document may never have been reduced to tangible form. Typically, a database will yield information that appears in paginated format....
  3. ... Electronic "documents" could perhaps be "given" by separate identification and annexure to an electronic transmission. Yet even that was not done in the present case. Merely making such "documents" (or some of them) "available" in a mass of undifferentiated material in a database of constantly changing content does not comply with the language and particular design of the Act ...

Muin v Refugee Review Tribunal; Lie v Refugee Review Tribunal [2002] HCA 30 (8 August 2002), Last Updated: 11 September 2002, HIGH COURT OF AUSTRALIA

I am not a lawyer, but what the High Court seems to be saying is that information stored in a computer, which can be printed out to look like the paper documents we are used to, is a document. The documents can be stored in a database and you can provide an identifier to "give" someone the document, but just telling them it is in the database somewhere is not sufficient.

The High Court didn't say if it wanted documents printed in a particular font or with page numbers, but the decision itself is published as a web page with no font style or size specified and with paragraph numbers, rather than page numbers. As the High Court web site has links to these documents, it could be assumed the court is happy with this format.

XML Schema: Documents and Databases Combine?

Both electronic document formats and database data formats can now be defined using XML Schema, in XML syntax. The result is that both documents and transactions can be defined in the same way and, in theory, processed by the same software. However, in practice the needs of documents and databases are different. Transaction processing systems are designed to deal with small, relatively simple transactions at a very high speed, whereas document systems tend to deal with more complex large documents but at a speed an individual user can cope with.

However, it should be possible to create combined documents, so that an electronic invoice can be generated by a transaction processing system, displayed and printed out as a well formatted document, and also read automatically by a transaction processing system.
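A combined document of this kind might be handled as in the sketch below, where the same XML text is both totalled by software and laid out for a person to read. The invoice format, element names and amounts are hypothetical and do not follow any published standard.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Hypothetical invoice format, invented for this illustration.
INVOICE_XML = """\
<invoice number="INV-001">
  <line><description>Widgets</description><amount>120.00</amount></line>
  <line><description>Delivery</description><amount>15.50</amount></line>
</invoice>"""


def invoice_total(xml_text):
    """Machine view: add up the line amounts exactly, using decimal arithmetic."""
    root = ET.fromstring(xml_text)
    amounts = (Decimal(line.findtext("amount")) for line in root.findall("line"))
    return sum(amounts, Decimal("0"))


def invoice_as_text(xml_text):
    """Human view: lay the same data out as a printable plain text page."""
    root = ET.fromstring(xml_text)
    rows = [f"Invoice {root.get('number')}"]
    for line in root.findall("line"):
        rows.append(f"{line.findtext('description'):<20}{line.findtext('amount'):>10}")
    rows.append(f"{'Total':<20}{str(invoice_total(xml_text)):>10}")
    return "\n".join(rows)


total = invoice_total(INVOICE_XML)
printable = invoice_as_text(INVOICE_XML)
print(printable)
```

Only one copy of the invoice exists; the print-out and the automated processing are two readings of it, so they cannot drift apart.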

Definitions Needed

The Internet and the web show the benefits of making information available in a simple standardised format. One of the problems with XML is that it allows new formats to be easily defined. If it is to be of practical value then we need a few standard formats widely used. These might be existing pre-XML formats simply translated into the new syntax.

Web Services Related Standards

W3C provide a very useful table to compare XML protocols. As with all good standards development, W3C has been taking technologies developed by industry and turning them into standards. W3C started at the bottom end developing document standards and has more recently been working its way up into data definitions, structure, transaction formats and discovery services.

The web services standards are relatively new. There tends to be a heavy overlap of the companies involved. SOAP was developed by a consortium of Ariba, Inc., Commerce One, Inc., Compaq, HP, IBM, Microsoft, SAP and other major companies and is now being standardised by W3C. BizTalk was developed by Microsoft. WSDL was developed by Ariba, IBM and Microsoft. Beyond W3C's technical brief there are other standards which describe specific commercial transactions, such as ebXML from UN/CEFACT and OASIS.

Making the situation more confusing is the overlap between business domains and technical standards. Early work mixed up the development of what sort of business information could be described (for example a payment advice note) and the format in which the information was encoded (such as in XML). Also many of the standards documents are difficult to find, being stored in large PDF documents or at web addresses which change (where is the document defining Microsoft's BizTalk?).

The W3C standards publication process has greatly improved this situation by providing well formatted web documents which are easily found at fixed URLs and by avoiding addressing the business domain. It is easy to find a W3C standard using a web search, to copy a section out of it and paste it (complete with formatting) into a document and to cite the URL of the standard with a reasonable expectation it will still be there when someone goes looking for it. What is needed is for those proposing business standards to follow W3C's lead, by providing documents addressing the business domain and which can be used easily.

Database Related Standards

Web Services Description Language (WSDL) Version 1.2

W3C Working Draft 9 July 2002

http://www.w3.org/TR/wsdl12

An XML language for describing Web services

Bindings

W3C Working Draft 9 July 2002

http://www.w3.org/TR/wsdl12-bindings

Defines bindings of WSDL to SOAP, HTTP and MIME.

SOAP Version 1.2

Part 0: Primer

W3C Working Draft 26 June 2002

http://www.w3.org/TR/soap12-part0

Part 1: Messaging Framework

W3C Working Draft 26 June 2002

http://www.w3.org/TR/soap12-part1

A lightweight protocol for exchanging structured information in a decentralized, distributed environment.

Part 2: Adjuncts

W3C Working Draft 26 June 2002

http://www.w3.org/TR/soap12-part2

A set of adjuncts that MAY be used with the SOAP messaging framework.

XML Schema

Part 0: Primer

W3C Recommendation 2 May 2001

http://www.w3.org/TR/xmlschema-0/

Part 1: Structures

W3C Recommendation 2 May 2001

http://www.w3.org/TR/xmlschema-1/

Extended capabilities for describing the structure of XML 1.0 documents, beyond DTDs.

Part 2: Datatypes

W3C Recommendation 2 May 2001

http://www.w3.org/TR/xmlschema-2/

Superset of capabilities in XML 1.0 DTDs for specifying datatypes on elements and attributes.

Document Related Standards

Extensible Stylesheet Language (XSL) Version 1.0 (aka XSL-FO)

W3C Recommendation 15 October 2001

http://www.w3.org/TR/xsl/

A language for expressing stylesheets

XSL Transformations (XSLT) Version 1.0

W3C Recommendation 16 November 1999

http://www.w3.org/TR/xslt

A language for transforming XML documents into other XML documents.

Extensible Markup Language (XML) 1.0 (Second Edition)

W3C Recommendation 6 October 2000

http://www.w3.org/TR/REC-xml

A subset of SGML designed for ease of implementation and for interoperability with both SGML and HTML.

Conclusion

New web browsers provide good support for advanced XML technologies. This is sufficient to support XML electronic documents and transaction formats for business purposes. OpenOffice.Org shows potential for manipulation of XML documents, but lacks support for the newer XML standards. The Commonwealth Government can quickly introduce simple XML based e-commerce transactions, beginning with a remittance advice.

Further Information

Comments and corrections to: webmaster@tomw.net.au

Copyright © Tom Worthington. 2002