Metadata and Electronic Document Management

Searching for a Common Understanding

These are notes on website design for the ANU course "Information Technology in Electronic Commerce" (COMP3410/COMP6341). This section of the course is prepared and presented by Tom Worthington FACS HLM, a Visiting Fellow in the Department of Computer Science at the Australian National University (and Director of Tomw Communications Pty Ltd).

Two topics are introduced: metadata and data management (digital library, electronic document management). Use of the technology for practical e-commerce and e-publishing applications is emphasised using case studies and anecdotes drawing on the author's personal experience. There are three lectures on metadata, three lectures on Data Management, a tutorial on Metadata and a tutorial on Data Management, assignments and examination questions. There have been minor revisions to the material from 2003, with the lecture notes reformatted to be more easily printed.

This document is intended to serve both as a live group presentation and as accompanying lecture notes for individual use. The material may also be of use to those interested in the issues, but not undertaking formal study. However, it is not intended as an on-line course. Those wishing to use the material as part of a formal or for-profit course are invited to contact the author.

Getting there

This section is adapted from "Documents and databases: Making sense of developments in eBusiness, eCommerce, ePublishing and eLaw", Tom Worthington, Information Industry Outlook Conference, Canberra, 2002, URL: http://www.tomw.net.au/2002/ebcwxml.html

  1. Get on-line and get email
  2. Get Internet banking
  3. Get a website, initially to advertise the business phone number and email address
  4. Get an interactive dynamic e-commerce system integrated with traditional business systems
  5. Get voice and data systems integrated.

From "Accelerating the Uptake of E-Commerce by SMEs: A Report and Action Plan", SME E-Commerce Taskforce, July 2002, URL: http://www.setel.com.au/smeforum2002

The first three steps of this action plan for accelerating the uptake of e-commerce by Australian small and medium-sized enterprises (SMEs) are good, simple advice. But step 4 ("Get an interactive dynamic e-commerce system...") is an absurdly large leap. This is the equivalent of telling a new aerospace company to first build a wooden glider, then a space shuttle. There need to be more steps for an easier transition between a simple website and an e-commerce system.

The last step of voice and data integration doesn't appear to relate to e-commerce and seems to have been included because the list came from a telecommunications vendor. ;-)

Electronic publishing provides a transition step between e-mail and e-commerce. Business documents can first be transmitted electronically (e-publishing) and later be processed automatically (e-commerce).

Using the Internet for business is much harder than it looks. Small businesses can be shown how to save money by using the Internet to do simple things, like replacing paperwork with electronic documents. They will then be ready to do something more complex, with integrated e-commerce. New XML technologies can make that transition possible.

Steps for SME e-commerce:

  1. Internet access: email & web
  2. Internet banking
  3. Website for business details
  4. Electronic documents to replace paper
  5. Automate processing of e-documents
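Steps 4 and 5 can be sketched in a few lines of code. The example below is only an illustration: the invoice, its element names and its attribute names are invented for these notes and do not follow any e-commerce standard. It assumes a business has replaced a paper invoice with a small XML document (step 4) and then wants a program to total it without retyping (step 5).

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical invoice: element and attribute names are invented for
# illustration and are not taken from any real invoicing standard.
INVOICE_XML = """<invoice number="INV-001">
  <supplier>Example Pty Ltd</supplier>
  <line qty="2" unit_price="19.95">Widget</line>
  <line qty="1" unit_price="5.10">Postage</line>
</invoice>"""


def invoice_total(xml_text: str) -> Decimal:
    """Sum quantity times unit price over every <line> element."""
    root = ET.fromstring(xml_text)
    total = Decimal("0")
    for line in root.findall("line"):
        # Decimal avoids the rounding surprises of binary floating point.
        total += Decimal(line.get("qty")) * Decimal(line.get("unit_price"))
    return total


if __name__ == "__main__":
    print("Invoice total:", invoice_total(INVOICE_XML))
```

The same file can still be read by a person, which is what lets the electronic document act as the bridge between e-mail and automated processing.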

Documents or databases?

document, n. ...

Something written, inscribed, etc., which furnishes evidence or information upon any subject, as a manuscript, title-deed, tomb-stone, coin, picture, etc.

Database ...

A structured collection of data held in computer storage; esp. one that incorporates software to make it accessible in a variety of ways; transf., any large collection of information.

From: OED Online, SECOND EDITION, 1989

Documents and databases represent two extremes in the aims and methods of electronic commerce. At the one extreme we have electronic documents which are fixed in content and format, are individual distinct entities, can be displayed using software from different suppliers, are expected to last for years and outlive the software which created them. At the other extreme a database has content which changes, can be displayed in different ways, may only be of value for minutes or months and may depend on one version of database software. This is not to say that all documents are fixed and all databases fluid, but is a useful generalisation.

At the one extreme HTML provides a way to create simple electronic documents which can display on a variety of systems, including small wireless devices, TV displays and on special devices for the disabled. But HTML doesn't provide fine control over the format of the document, especially when printed.

At the other extreme PDF provides a format for close control over the look of a document, as to layout, font and such like, but less flexibility. While recent improvements in PDF do allow more options for flowing text to make it more readable and to structure the document in an XML-like format, this requires extra work from the author and so far few people have bothered. In practice two versions of a document have to be produced: the web version for on-screen display and the PDF one for printing. Even where these two versions are automatically generated from the one common source, they involve extra effort for the people creating and reading them.

XML The Answer?

XML now provides formatting options that allow HTML-like flexibility, plus the fine formatting control of PDF. OpenOffice.org's XML-based file format is not perfect, but it does provide a way to package up all the elements of an XML document (including images) into one compressed file. This provides the prospect of formats which can be edited in a word processor, displayed as a web page, transformed for a hand held device or printed with specific styles.
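The idea of packaging XML and images into one compressed file can be sketched with Python's standard zipfile module. This is a simplified illustration, not the real file format: the member names content.xml and Pictures/ follow the layout commonly seen in OpenOffice.org packages, but the XML here is an invented stand-in for the real schema, and the function names are hypothetical. A real package also carries further members, such as a manifest.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Simplified stand-in for the document content; NOT the real OpenOffice.org schema.
SAMPLE_CONTENT = "<document><p>Hello</p></document>"


def build_package(content_xml: str, image_bytes: bytes) -> bytes:
    """Pack the XML text and one image into a single compressed archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", content_xml)
        zf.writestr("Pictures/logo.png", image_bytes)
    return buf.getvalue()


def open_package(package: bytes):
    """Return the member names and the parsed XML root of a package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        names = zf.namelist()
        root = ET.fromstring(zf.read("content.xml"))
    return names, root


if __name__ == "__main__":
    # The image bytes are a placeholder, not a valid PNG.
    pkg = build_package(SAMPLE_CONTENT, b"placeholder image bytes")
    names, root = open_package(pkg)
    print(names)
    print(root.find("p").text)
```

Because the package is an ordinary ZIP archive, any tool that can read ZIP and XML can get at the text without the word processor that created it.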

In 2002 OASIS announced a committee to work on an office XML standard format, and set out these requirements for the format:

  1. it must be suitable for office documents containing text, spreadsheets, charts, and graphical documents,
  2. it must be compatible with the W3C Extensible Markup Language (XML) v1.0 and W3C Namespaces in XML v1.0 specifications,
  3. it must retain high-level information suitable for editing the document,
  4. it must be friendly to transformations using XSLT or similar XML-based languages or tools
  5. it should keep the document's content and layout information separate such that they can be processed independently of each other, and
  6. it should "borrow" from similar, existing standards ...

From: "OASIS TC Call For Participation: Open Office XML TC", Karl Best, 4 Nov 2002, URL: http://lists.oasis-open.org/archives/tc-announce/200211/msg00001.html
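Requirements 4 and 5 can be illustrated with a small transformation. The sketch below is not XSLT (Python's standard library has no XSLT processor); it uses plain ElementTree code as a stand-in for "similar XML-based languages or tools". The content document, the element names and the two layout tables are all invented for illustration. The point is that the content is written once and the layout is supplied separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical content document: element names are invented for illustration.
CONTENT = "<doc><title>Quarterly Report</title><para>Sales rose.</para></doc>"

# Layout is kept apart from content: each table maps a content element
# to the HTML tag used to display it on a particular kind of device.
SCREEN_LAYOUT = {"title": "h1", "para": "p"}
HANDHELD_LAYOUT = {"title": "b", "para": "span"}


def render(content_xml: str, layout: dict) -> str:
    """Transform the content into an HTML body using the given layout table."""
    source = ET.fromstring(content_xml)
    body = ET.Element("body")
    for child in source:
        out = ET.SubElement(body, layout[child.tag])
        out.text = child.text
    return ET.tostring(body, encoding="unicode")


if __name__ == "__main__":
    print(render(CONTENT, SCREEN_LAYOUT))
    print(render(CONTENT, HANDHELD_LAYOUT))
```

Changing the layout table changes the output without touching the content, which is the separation requirement 5 asks for.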

Open Office Specification

The committee decided the OpenOffice.org XML format specification met these criteria and had proven its value in real life, so used it as the basis for its work. The first draft was released in March 2004, but at 607 pages it is complex to implement.

This document defines an XML schema for office applications and its semantics. The schema is suitable for office documents, including text documents, spreadsheets, charts and graphical documents like drawings or presentations, but is not restricted to these kind of documents.

The schema retains high-level information suitable for editing document and is friendly to transformations using XSLT or similar XML-based languages or tools.

From: "Open Office Specification 1.0", Committee Draft 1, 22 Mar 2004, Document identifier: office-spec-1.0-cd-1.sxw, URL: http://www.oasis-open.org/committees/download.php/6037/office-spec-1.0-cd-1.pdf

Are e-documents legal?

In 2002 the High Court of Australia considered the difficult question of whether the Migration Act's definition of documents included electronic documents stored in a database:

  1. ... The ordinary dictionary meaning of "document" is a printed or written paper containing information. ... No violence is done to the object or language of s 418(3) by holding that "document" includes information that is stored in a computer or a fax machine and which can be printed out by pressing one or more keys or buttons. No reason appears for thinking that Parliament intended to distinguish between information stored on paper and information stored in the electronic impulses of a computer that can be printed on paper by pressing a key or keys on the computer's keyboard. ...

From: "Muin v Refugee Review Tribunal; Lie v Refugee Review Tribunal", 8 August 2002, High Court of Australia, http://www.austlii.edu.au/au/cases/cth/high_ct/2002/30.html

Identifying e-documents

The High Court also considered how you "give" someone a document which is stored in a database.

  1. "Documents" may include electronic documents: ... Today, in ordinary speech, one can readily refer to a "document" in a database, although such a document may never have been reduced to tangible form. Typically, a database will yield information that appears in paginated format....

  2. ... Electronic "documents" could perhaps be "given" by separate identification and annexure to an electronic transmission. Yet even that was not done in the present case. Merely making such "documents" (or some of them) "available" in a mass of undifferentiated material in a database of constantly changing content does not comply with the language and particular design of the Act ...

From: "Muin v Refugee Review Tribunal; Lie v Refugee Review Tribunal", 8 August 2002, High Court of Australia, http://www.austlii.edu.au/au/cases/cth/high_ct/2002/30.html
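One possible technical reading of "separate identification" is to extract a record from the changing database as a fixed, standalone document and to label it with an identifier and a digest. The sketch below is an assumption made for these notes, not something the court prescribed: the table, the column names and the snapshot function are hypothetical, and a SHA-256 digest is only one of several ways to identify a fixed version.

```python
import hashlib
import sqlite3
import xml.etree.ElementTree as ET


def snapshot(conn: sqlite3.Connection, doc_id: int):
    """Extract one database row as a standalone XML document plus its digest."""
    row = conn.execute(
        "SELECT id, title, body FROM docs WHERE id = ?", (doc_id,)
    ).fetchone()
    root = ET.Element("document", {"id": str(row[0])})
    ET.SubElement(root, "title").text = row[1]
    ET.SubElement(root, "body").text = row[2]
    data = ET.tostring(root, encoding="utf-8")
    # The digest identifies this exact version of the document.
    return data, hashlib.sha256(data).hexdigest()


# Hypothetical database with constantly changing content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute("INSERT INTO docs VALUES (1, 'Country report', 'Original text')")

before, digest_before = snapshot(conn, 1)
conn.execute("UPDATE docs SET body = 'Revised text' WHERE id = 1")
after, digest_after = snapshot(conn, 1)

print(digest_before != digest_after)  # the two versions are distinguishable
```

The snapshot taken before the update can be attached to a message and later checked against its digest, even though the database record itself has since changed.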

High Court Referencing Web Pages

The High Court didn't say if it wanted documents printed in a particular font or with page numbers, but the decision itself is published as a web page with no font style or size specified and with paragraph numbers, rather than page numbers. As the High Court web site has links to such documents, it could be assumed the court is happy with this format.

From: "Legal Links", High Court of Australia, 2001(?), URL: http://www.hcourt.gov.au/legal.html