Lecture notes on Metadata and Electronic Data Management. First presented for Course: COMP3410 "Information Technology in Electronic Commerce" at the Australian National University, 25 July 2000, revised: 2001, 2003, 2004, 2005 to 2007, 2006 to 2008. This is version 9 October 2009.
1 Introduction
Metadata, document and e-commerce standards add a layer over the web to provide global services for civil society, government and business. The systems of non-government, government and commercial organisations can be designed to inter-operate securely to provide services to the public. These services can use formats which are globally standardized, usable for decades and have legal standing. Services can be made available from hand held wireless devices, as well as desktop computers.
HTML markup is designed for human readable web pages. XML can be used to provide documents with sufficient structure to be processed by an automated system. The same XML documents can be transformed using XSLT and formatted using CSS to be read as a web page.
Building systems which can be read by both people and machines is challenging. Such formats need to be efficient for storage and transmission, while being able to be converted into a format for human reading (rendered). The format needs to be agreed by all those who use it (ideally worldwide) and fixed for long enough to be useful (ideally for decades), but adaptable for use.
Metadata provides a tool to make electronic documents more efficient and flexible. In many cases a short summary of a document (the metadata) can be used in place of the full document, saving on transmission and processing as well as saving time for the human reader. The metadata can be used to manipulate the information in documents to create new documents. The same encoding used for describing documents can be used by data processing systems to carry out electronic commerce.
The Internet has allowed lower cost access to information, placing pressure on organisations to provide access to the information. Systems such as Creative Commons provide a way to license information so that it can be provided freely, while retaining ownership.
Access to Public Sector Information is discussed in the Final Report of the Inquiry into Improving Access to Victorian Public Sector Information and Data (Parliament of Victoria, June 2009) and more recently at Public Sphere #2 - Government 2.0: Policy & Practice (Senator Lundy, June 2009).
Social networking software allows a computer system to help people interact in groups. While normally thought of for social purposes, it is now being adopted for business. LinkedIn provides a way for professionals to interact with each other and find colleagues. Naymz provides a reputation management service. It is likely such systems will be used within and between organisations, including government, to manage work, grant access to information, and work out remuneration for staff. This requires the metadata about people and their actions to be carefully encoded and stored.
HTML has only limited provision for metadata. Systems such as LinkedIn get around the problem using Microformats, which use HTML class names for the metadata element names. This allows the metadata to be included in the body of the HTML document, instead of the header, and requires less duplication of information.
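As a rough illustration of how class names can carry metadata in the body of a page, the following sketch collects text keyed by class name. The class names (fn, title, org) follow the hCard microformat; the sample markup, the names ClassFieldParser and extract_hcard, and the person described are invented for this example and are not taken from any real site.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not be pushed on the open-element stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ClassFieldParser(HTMLParser):
    """Collect text content keyed by the HTML class name of its element."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # class names treated as metadata fields
        self.stack = []            # one entry per open element: field name or None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in self.wanted), None)
        self.stack.append(field)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attribute the text to the nearest enclosing element that names a field.
        field = next((f for f in reversed(self.stack) if f), None)
        if field and data.strip():
            self.fields[field] = (self.fields.get(field, "") + data).strip()


# Invented sample: an hCard-style fragment in the body of a page.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Citizen</span>,
  <span class="title">Records Manager</span> at
  <span class="org">Example Agency</span>
</div>
"""


def extract_hcard(html_text):
    """Return the fn, title and org fields found in the markup."""
    parser = ClassFieldParser(["fn", "title", "org"])
    parser.feed(html_text)
    return parser.fields
```

Calling extract_hcard(SAMPLE) returns a dictionary with the three fields. The visible text of the page doubles as the metadata, which is the reduction in duplication mentioned above.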
In 2003 the High Court of Australia concluded that the Migration Act's definition of documents included electronic documents stored in a database.
The High Court also concluded electronic documents need to be separately identified.
1.1 Metadata
Metadata can be described as data about data:
metadata n., a set of data that describes and gives information about other data...
[1968 Proc. IFIP 4th Congr.: Suppl. 10 I. 113/2 There are categories of information about each data set as a unit in a data set of data sets, which must be handled as a special meta data set.] 1987 Philos. Trans. Royal Soc. A. 322 373 The challenge is to accumulate data..from diverse sources, convert it to machine-readable form with a harmonized array of *metadata descriptors and present the resulting database(s) to the user. 1998 New Scientist 30 May 35/2 With XML, attaching metadata to a document is easy, at least in theory.
Oxford English Dictionary, (Online) Draft entry Dec. 2001, URL: http://dictionary.oed.com/cgi/entry/00307096/00307096se19
In e-commerce, metadata provides standard data items to allow parties to communicate about their organisations, products, terms and conditions. Electronic payment details of a transaction, and the money itself, consist of data defined by metadata.
Metadata is also used to describe published documents. The same XML technology can be used to express metadata for e-commerce and for publishing.
This metadata is from the Australian Government home page. It was intended that data in this format would be inserted into the HEAD of all government web pages, to aid data retrieval.
The challenge is to create formats which are sufficiently expressive to be able to communicate what is needed, but simple enough to be implemented efficiently.
Creating and using metadata standards is both a technical and political process. Most standards need to be profiled, to create a workable subset, before they can be used for practical purposes. Some standards need to be enhanced and others should not be used at all.
Here is an example of an e-commerce transaction. This is an Australian Taxation Office electronic tax form for the Goods and Services Tax (GST). This is a different use of metadata, for defining the data in a financial transaction.
The World Wide Web Consortium (W3C) standard for Scalable Vector Graphics (SVG) provides a way to define images in web pages. As well as the expected features of shapes, filling, symbols, colours and patterns, there is the 'metadata' element.
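A minimal sketch of an SVG image carrying a 'metadata' element is given here. It assumes, for illustration only, that the metadata is a single Dublin Core title placed directly inside the element; the function name svg_with_title and the circle are invented for the example.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
DC = "http://purl.org/dc/elements/1.1/"


def svg_with_title(title):
    """Build a small SVG document whose metadata element holds a DC title."""
    ET.register_namespace("", SVG)    # serialize SVG as the default namespace
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{SVG}}}svg", {"width": "100", "height": "100"})
    # The metadata element holds elements from another XML namespace.
    metadata = ET.SubElement(root, f"{{{SVG}}}metadata")
    ET.SubElement(metadata, f"{{{DC}}}title").text = title
    # The drawing itself: one circle.
    ET.SubElement(root, f"{{{SVG}}}circle", {"cx": "50", "cy": "50", "r": "40"})
    return ET.tostring(root, encoding="unicode")
```

The image renders the same with or without the metadata element; the title is there for software that indexes or catalogues the image.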
Simple definition politically complex
The apparently technically simple definition of metadata for SVG is made politically complex by this paragraph in the standard.
The ease of defining metadata using new web based tools has made standardization more difficult. It is technically simple to define a new standard if an existing definition is not quite right. However, having many standards is as much a problem as having no standards at all.
According to this official version of events, the Australian Government Locator Service (AGLS) metadata standard (discussed later) was originally called "AUSGILS" and intended to be based on the U.S. Government Information Locator Service (GILS), but this was abandoned in favour of the Dublin Core metadata standard in 1997. However, the proposed standard was first called "AGILS" in an earlier architecture proposal:
META tag of HTML is used in the header section of HTML documents. Example: <META NAME="Date" CONTENT="1966-01-12">. The field identifiers from the selected meta-data set is used in the NAME field and the field value in CONTENT. The set of meta-data definitions being used (the meta-meta-data) should be included in a tag. Example: <meta name="metadata" content="AGILS">.
This was done for political reasons, to suggest compatibility with the US Government standard. The name was later shortened to AGLS.
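The META tag convention quoted from the architecture proposal can be sketched in a few lines. The function name meta_tags is invented for this example; the tag layout follows the two examples in the quotation, with the name of the metadata set emitted first.

```python
from html import escape


def meta_tags(record, scheme):
    """Render a metadata record as HTML META tags for a document header.

    record maps field identifiers (NAME) to field values (CONTENT);
    scheme names the metadata set in use (the meta-meta-data).
    """
    lines = ['<meta name="metadata" content="{}">'.format(escape(scheme, quote=True))]
    for name, value in record.items():
        lines.append('<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(value, quote=True)))
    return "\n".join(lines)
```

For example, meta_tags({"Date": "1966-01-12"}, "AGILS") produces the two tags shown in the quotation. Values are escaped so that quotation marks in a title or description do not break the HTML.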
Standards politics are important to metadata and electronic document development in the real world. Standards are selected based on the importance of the organisations and individuals supporting them, not technical merit. Standards are then adapted, extended, made into subsets or combined.
E-commerce and electronic publishing depend on decisions made on what standards to use. Previously separate standards for electronic commerce, documents and television are converging to use the same format (XML).
One example of where standards for document formats and commercial interests collide is the Portable Document Format (PDF). Developed by Adobe as an extension to the PostScript format for desk-top publishing, PDF has provided a popular electronic document format. However, PDF has a number of limitations as an on-screen format and for disabled users. Adobe have attempted to address these limitations with "Tagged Adobe PDF", which added some XML interoperability to the PDF format.
However, additional work is needed by document creators to use these features. There is also an inherent contradiction between one of PDF's original selling points, providing an accurate representation of a printed document, and the aim of the enhancement, which is to allow the representation to be transformed. Adobe are not the only ones struggling with this problem. One possible solution is OpenOffice.org's XML Packages format. This packages up XML documents and supplementary binary format data, such as images, in ZIP file format.
Dublin Core (DC) is a metadata standards project originating from a workshop held in Dublin, Ohio, USA in 1995. The "Dublin Core" metadata element set is a small set of metadata definitions intended for cross-domain information resources. However, DC has its origins in the work of librarians and so tends to work better for describing printed text than for other items, such as video.
The intention with DC is to provide a brief standard set of essential metadata items for resources: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.
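A small sketch of how software might check a record against the fifteen elements listed above follows. The function name check_record and the sample record are invented for the example; the check itself is only an illustration, since DC does not make any element mandatory.

```python
# The fifteen Dublin Core elements, in the order listed above.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def check_record(record):
    """Return (unknown, unused) for a metadata record.

    unknown: names in the record that are not Dublin Core elements.
    unused:  Dublin Core elements the record leaves out or leaves empty.
    """
    unknown = sorted(name for name in record if name not in DC_ELEMENTS)
    unused = [name for name in DC_ELEMENTS if not record.get(name)]
    return unknown, unused


# Invented sample record; "Author" is not a DC element (DC uses "Creator").
SAMPLE_RECORD = {
    "Title": "Annual Report",
    "Creator": "Example Agency",
    "Author": "J. Citizen",
}
```

Here check_record(SAMPLE_RECORD) reports "Author" as unknown and lists the thirteen elements not yet filled in.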
Other examples of controlled vocabulary are the Internet Media Types (MIME) for defining computer media formats in the format element, and language tags, such as "en-AU" for Australian English.
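The sketch below shows one way the Format and Language elements might be filled from controlled vocabularies. It uses Python's standard mimetypes module to look up the media type from a file name. The language tag pattern is a deliberate simplification (real language tags allow more subtags), and the function name format_and_language is invented for the example.

```python
import mimetypes
import re

# Simplified pattern for tags such as "en" or "en-AU".
LANG_TAG = re.compile(r"^[a-z]{2,3}(-[A-Z]{2})?$")


def format_and_language(filename, language):
    """Return Format and Language values drawn from controlled vocabularies."""
    media_type, _encoding = mimetypes.guess_type(filename)
    if media_type is None:
        raise ValueError("no known Internet Media Type for " + filename)
    if not LANG_TAG.match(language):
        raise ValueError("not a recognised language tag: " + language)
    return {"Format": media_type, "Language": language}
```

A call such as format_and_language("index.html", "en-AU") gives a Format of "text/html" and a Language of "en-AU", while a free-text value such as "Australian English" is rejected.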
The Australian Digital Theses Program provides a database of digitised theses produced at Australian universities. Authors at ANU use a deposit form, the data from which is expressed as DC metadata and provided via a search facility.
The Australian Government Locator Service (AGLS) metadata standard is a set of 19 descriptive elements to improve the visibility and accessibility of services and information over the Internet. The AGLS standard is based on the 15 Dublin Core elements, plus four extra elements:
No elements are mandatory for DC, but AGLS requires five (or six) of them.
Qualifiers are used to restrict the semantics of the relationship between the resource and the element value. AGLS encourages more use of qualifiers than DC, but does not require it.
AGLS uses two types of qualifiers.
Metadata is rarely entered by the document author typing in text. When encoded in the header of an HTML document the metadata is not displayed by a web browser. Specialized software, such as content management systems, or features in word processors are used to enter and display the metadata. The user of the system is likely to be unaware they are using a metadata standard or how it is encoded. Examples of these systems will be shown later.
The Distributed Systems Technology Centre (DSTC Pty Ltd) has produced a metadata tool to create AGLS and Dublin Core metadata. Rege can be used to generate AGLS metadata syntax. This would be too cumbersome for creating real metadata, but is a useful way to learn about the process.
1.2 Standards for E-commerce
Metadata for managing documents tends to have a few dozen elements for each document. Most elements are text fields, rather than numeric values or qualified values. Metadata for electronic commerce uses more elements, and more of them are qualified or numeric values.
Early E-commerce Standards: UN/EDIFACT and ANS X12
The United Nations agreed standards for world e-commerce called UN/EDIFACT. This is one of the two early, internationally cited families of standards for Electronic Data Interchange (EDI). The other is the USA's ANS X12 Syntax. In most cases the same metadata elements can be used with EDIFACT and ANS X12.
Standards were developed as electronic versions of commonly used business forms, such as invoices and Remittance Advice.
The Interim Report for the CEN/ISSS XML/EDI Pilot Project gave an example of an XML version of an EDIFACT National Payment Order.
Some elements used for the CEN/ISSS Payment Order.
Part of the XML document type definition (DTD) of the CEN/ISSS Payment Order
W3C provide a very useful table to compare XML protocols.
1.3 E-commerce Examples
Web Services Demonstration
Web services can be thought of as the transaction processing equivalent of the world wide web. The web provided a relatively easy and standardized way to create distributed hypertext. Web Services is a set of standards which aims to provide easy and standardized distributed transaction processing.
Formatting the eBAS with XSL
The Australian Taxation Office (ATO) provides specifications of electronic versions of tax forms, including the Business Activity Statement (BAS) in relation to the Goods and Services Tax (GST). This is a demonstration of how XML transactions can be transformed into printable documents.
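The idea of turning an XML transaction into a printable page can be sketched as follows. The transformation is done in plain Python rather than XSL, to keep the example self-contained. The element and attribute names (statement, item, label, description, period), the labels and the amounts are all invented for illustration and are not taken from the ATO specification.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample transaction; not the ATO's actual format.
SAMPLE_XML = """<statement period="2009-Q3">
  <item label="A1" description="Total sales">11000</item>
  <item label="A2" description="GST on sales">1000</item>
</statement>"""


def to_html(xml_text):
    """Transform the XML transaction into an HTML heading and table."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall("item"):
        rows.append("<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            escape(item.get("label", "")),
            escape(item.get("description", "")),
            escape(item.text or "")))
    heading = "<h1>Statement for {}</h1>".format(escape(root.get("period", "")))
    return heading + "\n<table>\n" + "\n".join(rows) + "\n</table>"
```

The same XML could be sent unchanged to a processing system, while to_html(SAMPLE_XML) gives a version a person can read or print.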
Research Data Australia
Research Data Australia is a directory of Australian research data collections which makes use of metadata to provide a marketplace for researchers.
2 Electronic Document Management
Electronic Document Management allows business to be conducted with legally recognised e-commerce transactions.
Electronic document management systems are more than just systems for tracking the location of electronic documents. Such systems should manage documents for their complete life cycle based on the value of the document to the agency's business. Just as there are standard procedures for the registration of paper documents and records, suitable procedures should be implemented to manage each electronic document throughout its life from creation to disposal...
From: Improving Electronic Document Management: Guidelines for Australian Government Agencies, Office of Government Information Technology, 1995, Archive copy at URL: http://www.defence.gov.au/imsc/edmsc/iedmtc.htm
The State Records of South Australia has a useful description of the process of: Records Creation to Archive.
In 1995 the Australian Government released Guidelines for Australian Government Agencies on Electronic Document Management. The guidelines identified seven requirements for e-document management:
The guidelines identified three design responses to the requirements:
The Recordkeeping Metadata Standard for Commonwealth Agencies (RKMS) defines 20 elements (eight mandatory) and 65 sub-elements for the record keeping systems used by Commonwealth government agencies. It is based on the Australian Government Locator Service (AGLS) metadata standard, but adds metadata items for maintaining government records:
... help agencies to identify, authenticate, describe and manage their electronic records in a systematic and consistent way to meet business, accountability and archival requirements. The standard is designed to be used as a reference tool by agency corporate managers, IT personnel and software vendors involved in the design, selection and implementation of electronic recordkeeping and related information management systems. ...
From: "Recordkeeping Metadata Standard for Commonwealth Agencies", Version 1.0, National Archives of Australia, 1999, URL: http://www.naa.gov.au/recordkeeping/control/rkms/summary.htm
2.1 The Digital Library
A digital library allows access to electronic documents, while respecting the intellectual property rights of the author.
An overview of e-publishing issues is provided in the Australian Government Information Management Office Web Publishing Guide:
However, it is more useful if the metadata is converted to Dublin Core format for use in non-library systems.
2.2 Electronic Publishing
Capturing Australia's Scholarly Publishing
At a roundtable in 2004 a thought experiment was outlined to transform the process of distribution of scholarly information in Australia. It proposed to allow research results to be available online to government funding bodies, universities where the research was conducted, industry and the public. This process can now be automated for open access publications, by digital libraries using metadata standards and XML.
The term "ingest" is used to describe the process of incorporating an electronic document and its metadata into an electronic archive. Usually only the metadata is converted, with the content of the paper remaining in the original format (usually PDF).
Journal and conference indexes are traditionally provided in formats such as Refer and BibTeX. These formats can be converted using XSLT to transform the metadata into the XML format used by OAI and OJS.
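As an illustration of this kind of conversion, the sketch below turns one simple BibTeX entry into Dublin Core XML of the kind exchanged over OAI (the oai_dc format). It is written in Python rather than XSLT, handles only flat "field = {value}" pairs, and the mapping from BibTeX fields to DC elements is an assumption made for the example. The function name and the sample entry are invented.

```python
import re
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from BibTeX fields to Dublin Core elements.
FIELD_MAP = {"title": "title", "author": "creator",
             "year": "date", "publisher": "publisher"}

# Invented sample entry.
SAMPLE_ENTRY = """@inproceedings{example2004,
  author = {Jane Citizen and John Smith},
  title = {An Example Paper},
  year = {2004}
}"""


def bibtex_to_oai_dc(entry):
    """Convert one simple BibTeX entry to an oai_dc XML string."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, value in re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", entry):
        element = FIELD_MAP.get(name.lower())
        if element is None:
            continue  # field has no DC equivalent in this sketch
        # BibTeX joins several authors with " and "; DC repeats the element.
        values = value.split(" and ") if name.lower() == "author" else [value]
        for v in values:
            ET.SubElement(root, f"{{{DC}}}{element}").text = v.strip()
    return ET.tostring(root, encoding="unicode")
```

For the sample entry, the result contains two dc:creator elements, one dc:title and one dc:date. A production converter would also need to handle nested braces, quoted values and LaTeX accents, which this sketch ignores.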
2.3 Electronic Document Management Issues
IFIP Digital Library
The digital library for the International Federation for Information Processing (IFIP) has abstracts of conference papers and the full text of some papers. Metadata standards are used and materials are provided using XML based interfaces. However, conferences are more than just the papers presented. How can the discussions be facilitated and represented in digital format?
Public Sphere
Senator Lundy set up Public Sphere, a series of public policy development events using the web, with wikis, blogs, instant messages, digital video and Google Docs. See:
- Senator Lundy describes her Public Sphere initiative (11 minute video).
Issues for metadata and e-documents:
- Complexity of tools and information: would metadata and XML help?
- How can the materials, including video, be archived for long-term use?