Metadata and Electronic Document Management

Introduction

These are notes on website design for the ANU course "Information Technology in Electronic Commerce" (COMP3410/COMP6341). This section of the course is prepared and presented by Tom Worthington FACS HLM, a Visiting Fellow in the Department of Computer Science at the Australian National University (and Director of Tomw Communications Pty Ltd).

Metadata

The Oxford English Dictionary describes metadata as:

metadata n., a set of data that describes and gives information about other data...

[1968 Proc. IFIP 4th Congr.: Suppl. 10 I. 113/2 There are categories of information about each data set as a unit in a data set of data sets, which must be handled as a special meta data set.] 1987 Philos. Trans. Royal Soc. A. 322 373 The challenge is to accumulate data..from diverse sources, convert it to machine-readable form with a harmonized array of *metadata descriptors and present the resulting database(s) to the user. 1998 New Scientist 30 May 35/2 With XML, attaching metadata to a document is easy, at least in theory.

Oxford English Dictionary, (Online) Draft entry Dec. 2001, URL: http://dictionary.oed.com/cgi/entry/00307096/00307096se19

Metadata can be described more simply as "Data about Data". As an example, the "creator" of this document is "Tom Worthington". The data is "Tom Worthington" and the metadata is "creator".

Metadata is essential for e-commerce, as it provides standard data items to allow parties to communicate about their organisations, products, terms and conditions. The actual payment and the "money" itself consists of data in an agreed metadata format, in an electronic transaction. Without suitable metadata standards, e-commerce could not take place and "money" in our online financial systems would cease to exist.

Metadata can also be used to describe published documents. The use of metadata for e-commerce and for publishing has converged in the last few years with the use of the same XML based technology for both applications.

ANU home page Metadata

Here is an example of the metadata for the ANU home page:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name='area' content='Corporate Information Services' >
<meta name='contentStatus' content='official'>
<meta name='dc.creator' content='webmaster@anu.edu.au'>
<meta name='dc.creator.email' content='webmaster@anu.edu.au'>
<meta name='dc.date' content='2003-2-1'>
<meta name='dc.date.validTo' content='2003-12-31'>
<meta name='dc.description' content='The Australian National University's home page'>
<meta name='dc.publisher' content='Director, Corporate Information Services'> ...

From: The Australian National University, Marketing and Communications, 17 July 2003, URL: http://www.anu.edu.au/
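A program that harvests this kind of metadata only needs to read the name and content attributes of each meta element. The following is a minimal sketch using Python's standard html.parser module. The class name MetaExtractor and the abbreviated SAMPLE text are assumptions made for illustration, and the sample uses double quotes around attribute values.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.metadata[name] = attributes.get("content", "")


# Abbreviated, illustrative copy of the kind of header shown above.
SAMPLE = """
<meta name="area" content="Corporate Information Services">
<meta name="dc.creator" content="webmaster@anu.edu.au">
<meta name="dc.date.validTo" content="2003-12-31">
"""

extractor = MetaExtractor()
extractor.feed(SAMPLE)
for name, value in extractor.metadata.items():
    print(f"{name}: {value}")
```

Elements without a name attribute, such as the http-equiv entry, are skipped by this sketch.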

Tax Office e-commerce transaction

Here is an example of an e-commerce transaction. This is an Australian Taxation Office electronic tax form for the Goods and Services Tax (GST):

<FORM_PERIOD_LABEL_TEXT>July to September 2001</FORM_PERIOD_LABEL_TEXT>
<EFT_CODE> 51111 121 059 9059</EFT_CODE>
<BILLER_CODE>75556</BILLER_CODE>
<PAYG_WITHHOLDING>0</PAYG_WITHHOLDING>
<PAYG_INSTALMENT>12541</PAYG_INSTALMENT>
<DEFERRED_COMPANY_FUND_INSTALMENT>7879801 </DEFERRED_COMPANY_FUND_INSTALMENT>
<TOTAL_DEBITS>7892342</TOTAL_DEBITS>
<TOTAL_CREDITS>0</TOTAL_CREDITS>
<NET_AMOUNT_FOR_THIS_STATEMENT>7892342 </NET_AMOUNT_FOR_THIS_STATEMENT>
<GST_LABEL_TEXT>for the QUARTER from 1 Jul 2001 to 30 Sep 2001</GST_LABEL_TEXT>
<GST_ACCOUNTING_METHOD_LABEL_TEXT>Cash ...

From: Formatting the eBAS with XSL, Tom Worthington, 29 November 2002, URL: http://www.tomw.net.au/2002/atoxml.html
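Because each amount is labelled by its element name, software receiving such a form can check it without knowing the layout of the printed version. The following sketch uses Python's standard xml.etree.ElementTree module. The STATEMENT wrapper element is hypothetical, added only to make the fragment well-formed, and the relationships being checked (that the three instalment fields sum to the total debits, and that debits less credits gives the net amount) are assumptions, not something stated in the source.

```python
import xml.etree.ElementTree as ET

# The <STATEMENT> wrapper is hypothetical: the source shows only a fragment
# of the form, so a root element is added here to make it well-formed XML.
SAMPLE = """<STATEMENT>
  <PAYG_WITHHOLDING>0</PAYG_WITHHOLDING>
  <PAYG_INSTALMENT>12541</PAYG_INSTALMENT>
  <DEFERRED_COMPANY_FUND_INSTALMENT>7879801 </DEFERRED_COMPANY_FUND_INSTALMENT>
  <TOTAL_DEBITS>7892342</TOTAL_DEBITS>
  <TOTAL_CREDITS>0</TOTAL_CREDITS>
  <NET_AMOUNT_FOR_THIS_STATEMENT>7892342 </NET_AMOUNT_FOR_THIS_STATEMENT>
</STATEMENT>"""

root = ET.fromstring(SAMPLE)


def amount(tag):
    """Return the integer value of the named child element."""
    return int(root.findtext(tag).strip())


components = (
    amount("PAYG_WITHHOLDING")
    + amount("PAYG_INSTALMENT")
    + amount("DEFERRED_COMPANY_FUND_INSTALMENT")
)
net = amount("TOTAL_DEBITS") - amount("TOTAL_CREDITS")

# Assumed relationships between the fields (not stated in the source).
debits_consistent = components == amount("TOTAL_DEBITS")
net_consistent = net == amount("NET_AMOUNT_FOR_THIS_STATEMENT")
print(debits_consistent, net_consistent)
```

Both checks pass for the sample values, since 0 + 12541 + 7879801 = 7892342.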

Australian Government Entry Point index

The (Australian) Commonwealth Government Entry Point: www.fed.gov.au indexes over 500 Federal Government web sites using AGLS, with 1,000,000 pages using metadata:

Commonwealth Departments and agencies (including authorities) are required to:

From: Government Policy fed.gov.au, Department of Communications, Information Technology and the Arts, 2001, URL: http://fed.gov.au/KSP?phase=2fb0c7:f6f213b4c5:-6ad2&action=menuGovernance

The Politics of Data Standards

The common theme of this work is the creation, transmission, storage, discovery and display of information in electronic format. The subtitle "Searching for a Common Understanding" comes from the need for those creating electronic information to agree a common format for the information to be understood. The challenge is to create formats which are sufficiently expressive to be able to communicate what is needed, but simple enough to be implemented efficiently.

Those involved in creating a standard, and in using it, must have a common understanding of what is needed and what is enough. In implementing metadata and data management standards IT professionals need to keep the politics of standards development in mind. Most standards need to be profiled, to create a workable subset, before they can be used for practical purposes. Some standards need to be enhanced and others not used at all.

The World Wide Web Consortium (W3C) standard for Scalable Vector Graphics (SVG) provides a way to define images in web pages. As well as the expected features of shapes, filling, symbols, colours and patterns, there is the 'metadata' element:

<!ENTITY % metadataExt "" >

<!ELEMENT metadata (#PCDATA %metadataExt;)* >

<!ATTLIST metadata %stdAttrs; >

From 21.2 The 'metadata' element, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/

Simple definition politically complex

This apparently technically simple definition is made politically complex by a preceding paragraph:

Individual industries or individual content creators are free to define their own metadata schema but are encouraged to follow existing metadata standards and use standard metadata schema wherever possible to promote interchange and interoperability. If a particular standard metadata schema does not meet your needs, then it is usually better to define an additional metadata schema in an existing framework such as RDF and to use custom metadata schema in combination with standard metadata schema, rather than totally ignore the standard schema.

From 21.1 Introduction, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/

The important points here are: "...free to define their own metadata schema but are encouraged to follow existing metadata standards ... better to define an additional metadata schema in an existing framework ...". In some ways the ease of defining metadata using new web based tools has made the standardization process more difficult. It is very tempting, if an existing definition is not quite right, to define a new standard and hope that some tool will allow conversion between the standards. However, having many standards is a similar problem to having no standards at all.
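To make the recommendation concrete, the following sketch builds a small SVG document whose 'metadata' element carries Dublin Core values inside an RDF description, rather than an invented schema. It uses Python's standard xml.etree.ElementTree module. The title text, the circle and the choice of two elements are illustrative assumptions; the namespace URIs are the published ones for SVG, RDF and the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Ask the serialiser to use readable prefixes instead of ns0, ns1, ...
ET.register_namespace("", SVG)
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

svg = ET.Element(f"{{{SVG}}}svg", {"width": "100", "height": "100"})

# Standard schema (Dublin Core) in an existing framework (RDF).
metadata = ET.SubElement(svg, f"{{{SVG}}}metadata")
rdf = ET.SubElement(metadata, f"{{{RDF}}}RDF")
description = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
ET.SubElement(description, f"{{{DC}}}title").text = "Example drawing"
ET.SubElement(description, f"{{{DC}}}creator").text = "Tom Worthington"

# The drawing itself: a single illustrative shape.
ET.SubElement(svg, f"{{{SVG}}}circle", {"cx": "50", "cy": "50", "r": "40"})

document = ET.tostring(svg, encoding="unicode")
print(document)
```

A custom element could be added alongside the Dublin Core ones in its own namespace, which is the combination the specification suggests.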

Anecdote: AUSGILS to AGLS?

According to the official version of events, the Australian Government Locator Service (AGLS) metadata standard (discussed in detail later) was originally called "AUSGILS" and intended to be based on the U.S. Government Information Locator Service (GILS), but this was abandoned in favour of the Dublin Core metadata standard in 1997:

At the time of the IMSC it was thought that an Australian Government Locator Service would be a variant of the U.S. Government Information Locator Service (GILS). Consequently, for much of its gestation period what is now known as AGLS was referred to as AUSGILS. However, late last year when a workshop of experts convened to develop the AUSGILS standard it was decided to abandon the GILS framework and instead base the online locator service on the Dublin Core metadata standard.

From: Enabling Seamless Online Access to Government, Adrian Cunningham, National Archives of Australia, 26 August 1998, URL (archived copy): http://www.naa.gov.au/recordkeeping/gov_online/agls/Metadata_paper22sept98.html

AGILS to AGLS?

However, this author's recollection differs. The proposed standard was first called "AGILS" in an earlier architecture proposal:

META tag of HTML is used in the header section of HTML documents. Example: <META NAME="Date" CONTENT="1966-01-12">. The field identifiers from the selected meta-data set is used in the NAME field and the field value in CONTENT. The set of meta-data definitions being used (the meta-meta-data) should be included in a tag. Example: <meta name="metadata" content="AGILS">.

From: Architecture For Access To Government Information, Report of the IMSC -Technical Group, Commonwealth of Australia, 25 July 1996, URL (archived copy): http://www.defence.gov.au/imsc/imsctg/imsctg1c.htm#RTFToC87

This was done for political reasons, to suggest compatibility with the US Government standard. There was not necessarily any intention to achieve compatibility. The name was later shortened to AGLS.

Standards, Definitions and Dollars

Standards politics are very important to metadata and electronic document development in the real world. Few decisions are made based on the technical merits of proposals. There are few cases where metadata standards are developed from first principles. Selections are made from existing metadata standards, based on the level of support for those standards, and the perceived importance of those organisations and individuals supporting them. Standards are then adapted, extended, made into subsets or combined.

Thousands of millions of dollars in business for e-commerce and electronic publishing depend on decisions to be made over what standards to use. Previously separate standards for electronic commerce and documents are converging to use the same format (XML). The same format is proposed for use in areas such as television. How rapidly and how effectively will this convergence happen?

One example of where standards for document formats and commercial interests collide is the Portable Document Format (PDF). Developed by Adobe as an extension to the PostScript format for desk-top publishing, PDF has provided a popular electronic document format. However, PDF has a number of limitations as an on-screen format and for disabled users. Adobe have attempted to address these limitations with "Tagged Adobe PDF", which added some XML interoperability to the PDF format.

Adobe Acrobat 5.0 software introduces tagged Adobe PDF, an enhancement to the PDF specification that allows PDF files to contain logical document structure. Logical structure refers to the organization of a document, such as the title page, chapters, sections, and subsections. Tagged Adobe PDF documents can be reflowed to fit small-screen devices and offer better support for repurposing content. They also are more accessible to the visually impaired.

From Adobe PDF, Adobe Systems Incorporated, 2001, URL: http://www.adobe.com/products/acrobat/adobepdf.html

However, additional work is needed by document creators to use these new features. There is also an inherent contradiction between one of PDF's original selling points, providing an accurate representation of a printed document, and the aim of the enhancement, which is to allow that representation to be transformed. Adobe are not the only ones struggling with this problem. One possible solution is OpenOffice.org's XML Packages format. This packages up XML documents and supplementary binary format data, such as images, in ZIP file format.
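The packaging idea can be sketched with Python's standard zipfile module: the structured text travels as XML and the binary data travels beside it in the same archive. The member names, the XML content and the placeholder image bytes below are illustrative assumptions, not a statement of the exact OpenOffice.org package layout.

```python
import io
import zipfile

# Illustrative content only: a tiny XML document that refers to an image
# stored elsewhere in the same package.
CONTENT_XML = (
    "<document><title>Example</title>"
    "<image href='Pictures/logo.png'/></document>"
)
IMAGE_BYTES = b"\x89PNG placeholder bytes"  # stand-in for real image data

# Write the package to memory rather than to disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("content.xml", CONTENT_XML)
    package.writestr("Pictures/logo.png", IMAGE_BYTES)

# Read it back: the XML is still plain text, the image still raw bytes.
with zipfile.ZipFile(buffer) as package:
    names = package.namelist()
    xml_text = package.read("content.xml").decode("utf-8")
    image_data = package.read("Pictures/logo.png")

print(names)
print(xml_text)
```

Because the XML is kept separate from the images, it can be transformed for a different display without touching the binary data.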

Dublin Core

Dublin Core (DC) is a metadata standards project originating from a workshop held in Dublin, Ohio, USA in 1995. The "Dublin Core" metadata element set is a small set of metadata definitions intended for cross-domain information resources. However, DC has its origins in the work of librarians and so tends to work better for describing printed text than for other items, such as video.

The intention with DC is to provide a brief standard set of essential metadata items for resources: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.

Title: Typically, Title will be a name by which the resource is formally known.
Creator: Examples of Creator include a person, an organization, or a service. ...
Subject: ... keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.
Description: ... an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content. ...

Adapted from "Dublin Core Metadata Element Set", Version 1.1: Reference Description, DCMI, 2003-06-02, URL: http://dublincore.org/documents/dces/

Other examples of controlled vocabularies are the Internet Media Types (MIME), used to define computer media formats in the Format element, and language tags, such as "en-AU" for Australian English.
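The value of a controlled vocabulary is that software can generate and check the values. The following sketch uses Python's standard mimetypes module to derive a media type for the Format element, and a regular expression to test the shape of a language tag. The function names are hypothetical, and the pattern is deliberately simplified: it accepts only a two or three letter language code with an optional two letter country code, which is far less than the full language tag syntax allows.

```python
import mimetypes
import re

# Simplified pattern (an assumption for illustration): "en" or "en-AU".
LANGUAGE_TAG = re.compile(r"^[a-z]{2,3}(-[A-Z]{2})?$")


def media_type_for(filename):
    """Return the Internet Media Type guessed from the file extension."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


def looks_like_language_tag(tag):
    """Return True if the tag has the simplified shape described above."""
    return LANGUAGE_TAG.match(tag) is not None


print(media_type_for("thesis.pdf"))
print(looks_like_language_tag("en-AU"))
print(looks_like_language_tag("Australian English"))
```

A free-text value such as "Australian English" is readable by people but fails the check, which is the reason for preferring the coded form.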

Australian Digital Theses Program

The Australian Digital Theses Program provides a database of digitised theses produced at Australian Universities. Authors at ANU use a deposit form, the data from which is expressed as DC metadata and provided via a search facility.

Metadata Standards

Dublin Core metadata will be automatically generated out of the ADT Deposit form. This metadata will form the basis of the database of distributed digitised theses across the 7 participating institutions. ...

<meta name="DC.language" scheme="RFC3066" content="en">

*** English will be the default language. In order to add another language the Deposit form will need to be amended to add another field. As theses will be predominantly in English, this will remain the default and the issue of other languages and the appropriate scheme to use will be investigated at a future date if necessary.

From: "Metadata standard", Australian Digital Theses Program, UNSW Library 1997, Updated 12/09/03, URL: http://www.library.unsw.edu.au/thesis/adt-ADT/info/metadata.html
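Generating metadata from a form, as described in the quoted passage, amounts to mapping each form field to an element name and escaping the value. The following is a minimal sketch of that idea, not the ADT Program's actual software. The function name, the field names and the sample values are hypothetical; only the DC.language tag with the RFC3066 scheme and the default of "en" are taken from the example above.

```python
from html import escape


def to_meta_tags(form_fields, language="en"):
    """Turn deposit form fields into Dublin Core <meta> tags.

    English is the default language, as in the quoted standard.
    """
    tags = []
    for element, value in form_fields.items():
        content = escape(value, quote=True)  # protect quotes and ampersands
        tags.append(f'<meta name="DC.{element}" content="{content}">')
    tags.append(
        f'<meta name="DC.language" scheme="RFC3066" '
        f'content="{escape(language, quote=True)}">'
    )
    return tags


# Hypothetical deposit form data.
deposit = {"title": 'A Study of "Metadata"', "creator": "A. Student"}
for tag in to_meta_tags(deposit):
    print(tag)
```

The author fills in ordinary form fields and never sees the element names, which keeps the encoding consistent across institutions.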

Other Dublin Core Projects are listed at URL: http://dublincore.org/projects/subject.shtml

AGLS

The Australian Government Locator Service (AGLS) metadata standard is a set of 19 descriptive elements to improve the visibility and accessibility of services and information over the Internet. The AGLS standard is based on the 15 Dublin Core elements, plus four extra elements:

Element: Example
Function: <META NAME="AGLS.Function" CONTENT="School Education">
Availability: <META NAME="AGLS.Availability" CONTENT="Medical assistance is available by contacting the after hours hotline on ...">
Audience: <agls:audience>anglers</agls:audience>
Mandate: <META NAME="AGLS.Mandate.case" SCHEME="URI" CONTENT="http://...">

Compiled from AGLS Metadata Element Set, Part 2: Usage Guide, Version 1.3 , National Archives of Australia, 2002, URL: http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html
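The relationship between the two element sets can be written down directly: the 15 Dublin Core elements listed earlier plus the four AGLS additions give 19. The following sketch does so in Python. The function name is hypothetical, and the "DC." and "AGLS." prefixes follow the META NAME examples in these notes rather than any normative rule checked here.

```python
DUBLIN_CORE = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
]
AGLS_EXTRA = ["Function", "Availability", "Audience", "Mandate"]


def qualified_name(element):
    """Return the prefixed name used in a META NAME attribute."""
    if element in DUBLIN_CORE:
        return f"DC.{element}"
    if element in AGLS_EXTRA:
        return f"AGLS.{element}"
    raise ValueError(f"{element!r} is not in either element set")


print(len(DUBLIN_CORE) + len(AGLS_EXTRA))
print(qualified_name("Title"), qualified_name("Function"))
```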

AGLS Mandatory Elements

No elements are mandatory for DC, but AGLS requires five (or six) of the following:

From: AGLS Metadata Element Set, Part 2: Usage Guide, Version 1.3 , National Archives of Australia, 2002, URL: http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html

Qualifiers

Qualifiers are used to restrict the semantics of the relationship between the resource and the element value. AGLS encourages more use of qualifiers than DC, but does not require it:

Qualifiers are additions and extensions to the metadata elements that give metadata creators the option to refine the semantics of the element set, and add precision to the values of the metadata elements. For example, it may be useful to indicate that the value has been selected from a particular controlled vocabulary, such as a list of keywords, or is encoded using a particular convention - the format for dates is an important case - or in a particular natural language.

From: AGLS Metadata Element Set, Part 2: Usage Guide, Version 1.3 , National Archives of Australia, 2002, URL: http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html

AGLS Qualifiers

AGLS uses two types of qualifiers:

  1. Element refinements are represented in HTML <meta> syntax with qualifiers appended to the element names. For example: "DC.Type.documentType". Note that the "T" in "Type" in the example is in upper case, whereas the "d" of "document" is not. This is a somewhat odd practice in DC.

  2. Encoding schemes indicate how the value is to be interpreted if it has been chosen from a controlled vocabulary, or externally defined standard. For example:

<META NAME="DC.Date.modified" SCHEME="ISO8601" CONTENT="1998-08-27">.
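The two kinds of qualifier can be handled separately by software: the refinement is read from the dotted name, and the encoding scheme decides how the value is interpreted. The following is a minimal sketch under stated assumptions. The function names are hypothetical, only the ISO8601 scheme is handled, and only in the calendar date form (YYYY-MM-DD) used in the example; other schemes are passed through as plain text.

```python
from datetime import date


def split_name(name):
    """Split a qualified name into (prefix, element, refinement)."""
    parts = name.split(".")
    prefix, element = parts[0], parts[1]
    refinement = parts[2] if len(parts) > 2 else None
    return prefix, element, refinement


def interpret(value, scheme=None):
    """Interpret a value according to its encoding scheme, if known."""
    if scheme == "ISO8601":
        # Raises ValueError if the text is not a valid calendar date.
        return date.fromisoformat(value)
    return value  # unknown or absent scheme: leave as text


prefix, element, refinement = split_name("DC.Date.modified")
modified = interpret("1998-08-27", scheme="ISO8601")
print(prefix, element, refinement, modified.year)
```

A program that does not recognise the refinement can still fall back to treating the value as a plain Date, which is why qualifiers refine rather than replace the element.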

Metadata Tools

Metadata is rarely entered by the document author typing in text. When encoded in the header of an HTML document, the metadata is not displayed by a web browser. Specialized software, such as content management systems, or features in word processors are used to enter and display the metadata. The user of the system is likely to be unaware they are using a metadata standard or how it is encoded. Examples of these systems will be shown later.

The Distributed Systems Technology Centre (DSTC Pty Ltd) has produced Reg, a metadata tool for creating AGLS and Dublin Core metadata. Reg can be used to generate AGLS metadata syntax. This would be too cumbersome for creating real metadata, but is a useful way to learn about the process.

This is a demonstration of DSTC's Reg metadata editor. Reg allows you to:

Reg uses metadata schemas to customize itself for different metadata element sets. ...

"Reg - Metadata Editor", DSTC Pty Ltd, 1998, 2000, URL: http://metadata.net/cgi-bin/reg/demo.cgi.