Metadata for Publishing
Tom Worthington FACS HLM
Visiting Fellow, Department of Computer Science, Australian National University, Canberra
For: COMP3410: Information Technology in Electronic Commerce, at The Australian National University
This document is Version 2.1 13 August 2003: http://www.tomw.net.au/2003/dm/mpubs.html
Introduction
This material is part of Metadata and Electronic Document Management: Searching for a Common Understanding, prepared for "Information Technology in Electronic Commerce" (COMP3410), at the Australian National University, semester 2, 2003.
Metadata
The Oxford English Dictionary describes metadata as:
metadata n., a set of data that describes and gives information about other data...
[1968 Proc. IFIP 4th Congr.: Suppl. 10 I. 113/2 There are categories of information about each data set as a unit in a data set of data sets, which must be handled as a special meta data set.] 1987 Philos. Trans. Royal Soc. A. 322 373 The challenge is to accumulate data..from diverse sources, convert it to machine-readable form with a harmonized array of *metadata descriptors and present the resulting database(s) to the user. 1998 New Scientist 30 May 35/2 With XML, attaching metadata to a document is easy, at least in theory.
Oxford English Dictionary, (Online) Draft entry Dec. 2001, URL: http://dictionary.oed.com/cgi/entry/00307096/00307096se19
Metadata can be described more simply as "Data about Data". As an example, the "creator" of this document is "Tom Worthington". The data is "Tom Worthington" and the metadata is "creator".
Metadata is essential for e-commerce, as it provides standard data items to allow parties to communicate about their organisations, products, terms and conditions. In an electronic transaction, the actual payment and the "money" itself consist of data in an agreed metadata format. Without suitable metadata standards, e-commerce could not take place and "money" in our online financial systems would cease to exist.
Metadata can also be used to describe published documents. The use of metadata for e-commerce and for publishing has converged in the last few years, with the same XML-based technology used for both applications.
Here is an example of the metadata for the ANU home page:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name='area' content='Corporate Information Services' >
<meta name='contentStatus' content='official'>
<meta name='dc.creator' content='webmaster@anu.edu.au'>
<meta name='dc.creator.email' content='webmaster@anu.edu.au'>
<meta name='dc.date' content='2003-2-1'>
<meta name='dc.date.validTo' content='2003-12-31'>
<meta name='dc.description' content='The Australian National University's home page'>
<meta name='dc.publisher' content='Director, Corporate Information Services'>
<meta name='dc.publisher.email' content='director.cis@anu.edu.au'>
<meta name='dc.subject' content='Australia, Canberra, university, Australian National University, Institute of Advanced Studies, research, undergraduate, graduate, students, CRICOS Provider Number: 00120C '>
<title>The Australian National University</title>
From: The Australian National University, Marketing and Communications, 17 July 2003, URL: http://www.anu.edu.au/
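Software that indexes web pages reads tags like these as name and value pairs. The following is a minimal sketch, using only the Python standard library, of how that might be done. The class name MetaCollector is an assumption made for illustration, and the sample is three of the tags quoted above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:  # skips http-equiv tags, which have no name
            self.pairs.append((name, attributes.get("content", "")))


# Three of the tags quoted above, copied unchanged.
SAMPLE = """
<meta name='contentStatus' content='official'>
<meta name='dc.creator' content='webmaster@anu.edu.au'>
<meta name='dc.date.validTo' content='2003-12-31'>
"""

collector = MetaCollector()
collector.feed(SAMPLE)
collector.close()
metadata = dict(collector.pairs)

for name, value in collector.pairs:
    print(f"{name}: {value}")
```

In each pair the name (such as "dc.creator") is the metadata and the content is the data it describes.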
Here is an example of an e-commerce transaction. This is an Australian Taxation Office electronic tax form for the Goods and Services Tax (GST):
<FORM_PERIOD_LABEL_TEXT>July to September 2001</FORM_PERIOD_LABEL_TEXT>
<EFT_CODE> 51111 121 059 9059</EFT_CODE>
<BILLER_CODE>75556</BILLER_CODE>
<PAYG_WITHHOLDING>0</PAYG_WITHHOLDING>
<PAYG_INSTALMENT>12541</PAYG_INSTALMENT>
<DEFERRED_COMPANY_FUND_INSTALMENT>7879801 </DEFERRED_COMPANY_FUND_INSTALMENT>
<TOTAL_DEBITS>7892342</TOTAL_DEBITS>
<TOTAL_CREDITS>0</TOTAL_CREDITS>
<NET_AMOUNT_FOR_THIS_STATEMENT>7892342 </NET_AMOUNT_FOR_THIS_STATEMENT>
<GST_LABEL_TEXT>for the QUARTER from 1 Jul 2001 to 30 Sep 2001</GST_LABEL_TEXT>
<GST_ACCOUNTING_METHOD_LABEL_TEXT>Cash </GST_ACCOUNTING_METHOD_LABEL_TEXT>
<GST_ACCOUNTING_METHOD_LABEL>01 </GST_ACCOUNTING_METHOD_LABEL>
<GST_OPTION_1>true</GST_OPTION_1>
...
From: Formatting the eBAS with XSL, Tom Worthington, 29 November 2002, URL: http://www.tomw.net.au/2002/atoxml.html
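Because the element names carry the meaning, a program can process the form without knowing its layout. The sketch below reads some of the amounts quoted above with Python's standard XML library. The FORM wrapper element is hypothetical (the excerpt does not show the real root element), and the check on the net amount is an assumption inferred from the labels, not from any ATO specification.

```python
import xml.etree.ElementTree as ET

# A few of the elements quoted above. The <FORM> wrapper is hypothetical:
# the excerpt does not show the document's real root element.
FRAGMENT = """<FORM>
<PAYG_WITHHOLDING>0</PAYG_WITHHOLDING>
<PAYG_INSTALMENT>12541</PAYG_INSTALMENT>
<DEFERRED_COMPANY_FUND_INSTALMENT>7879801 </DEFERRED_COMPANY_FUND_INSTALMENT>
<TOTAL_DEBITS>7892342</TOTAL_DEBITS>
<TOTAL_CREDITS>0</TOTAL_CREDITS>
<NET_AMOUNT_FOR_THIS_STATEMENT>7892342 </NET_AMOUNT_FOR_THIS_STATEMENT>
</FORM>"""

root = ET.fromstring(FRAGMENT)

# Element names are the metadata; element text is the data.
# strip() removes the trailing spaces present in some values.
amounts = {child.tag: int(child.text.strip()) for child in root}

for label, value in amounts.items():
    print(f"{label} = {value}")

# Assumed relationship, inferred from the labels only:
# net amount = total debits - total credits.
net = amounts["TOTAL_DEBITS"] - amounts["TOTAL_CREDITS"]
print("net amount consistent:", net == amounts["NET_AMOUNT_FOR_THIS_STATEMENT"])
```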
The (Australian) Commonwealth Government Entry Point: www.fed.gov.au indexes over 500 Federal Government web sites using AGLS, with 1,000,000 pages using metadata:
Commonwealth Departments and agencies (including authorities) are required to:
notify the Commonwealth Government Entry Point, www.fed.gov.au, of the existence of their information on the web so that appropriate web-site addresses may be included in the www.fed.gov.au regular indexing and updating routine; or
provide access to, and notification of, a Harvest Control List (HCL) pointing to AGLS Metadata information and services resources.
From: Government Policy – fed.gov.au, Department of Communications, Information Technology and the Arts, 2001, URL: http://fed.gov.au/KSP?phase=2fb0c7:f6f213b4c5:-6ad2&action=menuGovernance
The Politics of Data Standards
The common theme of this work is the creation, transmission, storage, discovery and display of information in electronic format. The subtitle "Searching for a Common Understanding" comes from the need for those creating electronic information to agree a common format for the information to be understood. The challenge is to create formats which are sufficiently expressive to be able to communicate what is needed, but simple enough to be implemented efficiently.
Those involved in creating a standard, and in using it, must have a common understanding of what is needed and what is enough. In implementing metadata and data management standards IT professionals need to keep the politics of standards development in mind. Most standards need to be profiled, to create a workable subset, before they can be used for practical purposes. Some standards need to be enhanced and others not used at all.
The World Wide Web Consortium (W3C) standard for Scalable Vector Graphics (SVG) provides a way to define images in web pages. As well as the expected features of shapes, filling, symbols, colours and patterns, there is the 'metadata' element:
<!ENTITY % metadataExt "" >
<!ELEMENT metadata (#PCDATA %metadataExt;)* >
<!ATTLIST metadata %stdAttrs; >
From 21.2 The 'metadata' element, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/
This apparently technically simple definition is made politically complex by a preceding paragraph:
Individual industries or individual content creators are free to define their own metadata schema but are encouraged to follow existing metadata standards and use standard metadata schema wherever possible to promote interchange and interoperability. If a particular standard metadata schema does not meet your needs, then it is usually better to define an additional metadata schema in an existing framework such as RDF and to use custom metadata schema in combination with standard metadata schema, rather than totally ignore the standard schema.
From 21.1 Introduction, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/
The important points here are: "...free to define their own metadata schema but are encouraged to follow existing metadata standards ... better to define an additional metadata schema in an existing framework ...". In some ways the ease of defining metadata using new web-based tools has made the standardization process more difficult. If an existing definition is not quite right, it is very tempting to define a new standard and hope that some tool will allow conversion between the standards. However, having many standards is a similar problem to having no standards at all.
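As an illustration of following an existing framework, the sketch below places Dublin Core elements inside RDF within an SVG 'metadata' element and reads them back with Python's standard XML library. The SVG document is hypothetical (the circle, title and creator are invented), and this is only one possible way of combining the three vocabularies.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical SVG document: the circle, title and creator are invented.
SVG_DOC = f"""<svg xmlns="{SVG_NS}" width="100" height="100">
  <metadata>
    <rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
      <rdf:Description rdf:about="">
        <dc:title>Example circle</dc:title>
        <dc:creator>A. N. Author</dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <circle cx="50" cy="50" r="40"/>
</svg>"""

root = ET.fromstring(SVG_DOC)

# ElementTree writes namespaced names as {namespace-uri}localname.
description = root.find(
    f"{{{SVG_NS}}}metadata/{{{RDF_NS}}}RDF/{{{RDF_NS}}}Description"
)

# Keep only the Dublin Core elements and drop the namespace from each name.
dc_prefix = f"{{{DC_NS}}}"
dc_values = {
    child.tag[len(dc_prefix):]: child.text
    for child in description
    if child.tag.startswith(dc_prefix)
}
print(dc_values)
```

A program that understands Dublin Core can read the title and creator without knowing anything about SVG, which is the interoperability the specification is encouraging.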
Anecdote:
According to the official version of events, the Australian Government Locator Service (AGLS) metadata standard (discussed in detail later) was originally called "AUSGILS" and intended to be based on the U.S. Government Information Locator Service (GILS), but this was abandoned in favour of the Dublin Core metadata standard in 1997:
At the time of the IMSC it was thought that an Australian Government Locator Service would be a variant of the U.S. Government Information Locator Service (GILS). Consequently, for much of its gestation period what is now known as AGLS was referred to as AUSGILS. However, late last year when a workshop of experts convened to develop the AUSGILS standard it was decided to abandon the GILS framework and instead base the online locator service on the Dublin Core metadata standard.
From: Enabling Seamless Online Access to Government, Adrian Cunningham, National Archives of Australia, 26 August 1998, URL (archived copy): http://www.naa.gov.au/recordkeeping/gov_online/agls/Metadata_paper22sept98.html
However, this author's recollection differs. The proposed standard was first called "AGILS" in an earlier architecture proposal:
META tag of HTML is used in the header section of HTML documents. Example: <META NAME="Date" CONTENT="1966-01-12">. The field identifiers from the selected meta-data set is used in the NAME field and the field value in CONTENT. The set of meta-data definitions being used (the meta-meta-data) should be included in a tag. Example: <meta name="metadata" content="AGILS">.
From: Architecture For Access To Government Information, Report of the IMSC -Technical Group, Commonwealth of Australia, 25 July 1996, URL (archived copy): http://www.defence.gov.au/imsc/imsctg/imsctg1c.htm#RTFToC87
This was done for political reasons, to suggest compatibility with the US Government standard. There was not necessarily any intention to achieve compatibility. The name was later shortened to AGLS.
Standards, Definitions and Dollars
Standards politics are very important to metadata and electronic document development in the real world. Few decisions are made on the technical merits of proposals, and metadata standards are rarely developed from first principles. Instead, selections are made from existing metadata standards, based on the level of support for those standards and the perceived importance of the organisations and individuals supporting them. Standards are then adapted, extended, made into subsets or combined.
Thousands of millions of dollars of business in e-commerce and electronic publishing depend on decisions about which standards to use. Previously separate standards for electronic commerce and documents are converging on the same format (XML), and the same format is proposed for areas such as television. How rapidly and how effectively will this convergence happen?
One example of where standards for document formats and commercial interests collide is the Portable Document Format (PDF). Developed by Adobe as an extension to the PostScript format for desktop publishing, PDF has provided a popular electronic document format. However, PDF has a number of limitations as an on-screen format and for disabled users. Adobe have attempted to address these limitations with "Tagged Adobe PDF", which adds some XML interoperability to the PDF format.
Adobe Acrobat 5.0 software introduces tagged Adobe PDF, an enhancement to the PDF specification that allows PDF files to contain logical document structure. Logical structure refers to the organization of a document, such as the title page, chapters, sections, and subsections. Tagged Adobe PDF documents can be reflowed to fit small-screen devices and offer better support for repurposing content. They also are more accessible to the visually impaired.
From Adobe PDF, Adobe Systems Incorporated, 2001, URL: http://www.adobe.com/products/acrobat/adobepdf.html
However, additional work is needed by document creators to use these new features. There is also an inherent contradiction between one of PDF's original selling points, providing an accurate representation of a printed document, and the aim of the enhancement, allowing that representation to be transformed. Adobe are not the only ones struggling with this problem. One possible solution is OpenOffice.org's XML Packages format. This packages up XML documents and supplementary binary format data, such as images, in ZIP file format.
Dublin Core
Dublin Core (DC) is a metadata standards project originating from a workshop held in Dublin, Ohio, USA in 1995. The "Dublin Core" metadata element set is a small set of metadata definitions intended for cross-domain information resources. However, DC has its origins in the work of librarians and so tends to work better for describing printed text than for other items, such as video.
The intention with DC is to provide a brief standard set of essential metadata items for resources:
The Elements
Element Name: Title
Label: Title
Definition: A name given to the resource.
Comment: Typically, Title will be a name by which the resource is formally known.
Element Name: Creator
Label: Creator
Definition: An entity primarily responsible for making the content of the resource.
Comment: Examples of Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.
Element Name: Subject
Label: Subject and Keywords
Definition: A topic of the content of the resource.
Comment: Typically, Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.
Element Name: Description
Label: Description
Definition: An account of the content of the resource.
Comment: Examples of Description include, but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.
Element Name: Publisher
Label: Publisher
Definition: An entity responsible for making the resource available
Comment: Examples of Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity.
Element Name: Contributor
Label: Contributor
Definition: An entity responsible for making contributions to the content of the resource.
Comment: Examples of Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.
Element Name: Date
Label: Date
Definition: A date of an event in the lifecycle of the resource.
Comment: Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and includes (among others) dates of the form YYYY-MM-DD.
Element Name: Type
Label: Resource Type
Definition: The nature or genre of the content of the resource.
Comment: Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the DCMI Type Vocabulary [DCT1]). To describe the physical or digital manifestation of the resource, use the FORMAT element.
Element Name: Format
Label: Format
Definition: The physical or digital manifestation of the resource.
Comment: Typically, Format may include the media-type or dimensions of the resource. Format may be used to identify the software, hardware, or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. Recommended best practice is to select a value from a controlled vocabulary (for example, the list of Internet Media Types [ MIME] defining computer media formats).
Element Name: Identifier
Label: Resource Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Formal identification systems include but are not limited to the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).
Element Name: Source
Label: Source
Definition: A Reference to a resource from which the present resource is derived.
Comment: The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to identify the referenced resource by means of a string or number conforming to a formal identification system.
Element Name: Language
Label: Language
Definition: A language of the intellectual content of the resource.
Comment: Recommended best practice is to use RFC 3066 [ RFC3066] which, in conjunction with ISO639 [ISO639]), defines two- and three-letter primary language tags with optional subtags. Examples include "en" or "eng" for English, "akk" for Akkadian", and "en-GB" for English used in the United Kingdom.
Element Name: Relation
Label: Relation
Definition: A reference to a related resource.
Comment: Recommended best practice is to identify the referenced resource by means of a string or number conforming to a formal identification system.
Element Name: Coverage
Label: Coverage
Definition: The extent or scope of the content of the resource.
Comment: Typically, Coverage will include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity). Recommended best practice is to select a value from a controlled vocabulary (for example, the Thesaurus of Geographic Names [TGN]) and to use, where appropriate, named places or time periods in preference to numeric identifiers such as sets of coordinates or date ranges.
Element Name: Rights
Label: Rights Management
Definition: Information about rights held in and over the resource.
Comment: Typically, Rights will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the Rights element is absent, no assumptions may be made about any rights held in or over the resource.
Reformatted from: Dublin Core Metadata Element Set, Version 1.1: Reference Description, DCMI, 2003-06-02, URL: http://dublincore.org/documents/dces/
Encoding the date value
Metadata items may be free text fields, but will typically be constrained in some way. As an example, the recommended encoding for the Date element is a profile of ISO 8601 (the International Standard for the representation of dates). DC gives the example of the form YYYY-MM-DD.
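A constraint like this can be checked by software when the metadata is entered. The following is a minimal sketch in Python which accepts only the full YYYY-MM-DD form; the function name parse_full_date is an assumption made for illustration, and the other forms permitted by the ISO 8601 profile are deliberately not handled.

```python
import re
from datetime import date

# Only the full-date form YYYY-MM-DD is checked here; other forms allowed
# by the ISO 8601 profile are out of scope for this sketch.
FULL_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_full_date(value):
    """Return a date for a YYYY-MM-DD string, or None if it does not conform."""
    match = FULL_DATE.fullmatch(value.strip())
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # for example month 13, or 30 February
        return None


for candidate in ("2003-12-31", "2003-2-1", "2003-02-30"):
    print(candidate, "->", parse_full_date(candidate))
```

The second value is the dc.date from the ANU home page example earlier: it is readable by a person, but it is not in the YYYY-MM-DD form because the month and day are not padded to two digits.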
Controlled vocabulary
Some elements are recommended to be used with a value from a controlled vocabulary. As an example, for the Type element, the DCMI Type Vocabulary includes "Event" and "Text":
Term Name: Event ...
Definition: An event is a non-persistent, time-based occurrence. Metadata for an event provides descriptive information that is the basis for discovery of the purpose, location, duration, responsible agents, and links to related events and resources. The resource of type event may not be retrievable if the described instantiation has expired or is yet to occur. Examples - exhibition, web-cast, conference, workshop, open-day, performance, battle, trial, wedding, tea-party, conflagration. ...
Term Name: Text ...
Definition: A text is a resource whose content is primarily words for reading. For example - books, letters, dissertations, poems, newspapers, articles, archives of mailing lists. Note that facsimiles or images of texts are still of the genre text. ...
From DCMI Type Vocabulary, DCMI Usage Board, 2003-02-12, URL: http://dublincore.org/documents/dcmi-type-vocabulary/
Other examples of controlled vocabularies are the Internet Media Types (MIME), used to define computer media formats in the Format element, and language tags, such as "en-AU" for Australian English.
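Checking a value against a controlled vocabulary is a simple membership test. The sketch below, in Python, uses only the two DCMI Type terms quoted above, so the list is deliberately partial. Treating the terms as case-sensitive is an assumption of this sketch.

```python
# Partial list: only the two DCMI Type terms quoted above. A real check
# would load the complete published vocabulary.
KNOWN_TYPES = {"Event", "Text"}


def check_type(value, vocabulary=KNOWN_TYPES):
    """Return True if value is exactly one of the vocabulary terms."""
    return value in vocabulary


for candidate in ("Text", "text", "Essay"):
    status = "ok" if check_type(candidate) else "not in vocabulary"
    print(f"{candidate}: {status}")
```

The benefit of the restriction is that a search for resources of type "Text" finds all of them, rather than missing those described as "essay", "document" or "article".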
Australian Digital Theses Program
The Australian Digital Theses Program provides a database of digitised theses produced at Australian universities. Authors at ANU use a deposit form, the data from which is expressed as DC metadata and provided via a search facility.
Other Dublin Core Projects are listed at URL: http://dublincore.org/projects/subject.shtml
Australian Government Locator Service (AGLS)
The Australian Government Locator Service (AGLS) metadata standard is a set of 19 descriptive elements to improve the visibility and accessibility of services and information over the Internet. The AGLS standard is based on the 15 Dublin Core elements, plus four extra elements:
Element | Description | Example |
---|---|---|
Function | The business function to which the resource relates | <META NAME="AGLS.Function" CONTENT="School Education"> |
Availability | How the resource can be obtained or accessed, or contact information | <META NAME="AGLS.Availability" CONTENT="Medical assistance is available by contacting the after hours hotline on 1800 123456"> |
Audience | The target audience of the resource | <agls:audience>anglers</agls:audience> |
Mandate | A specific legal instrument which requires a resource to be created or made available | <META NAME="AGLS.Mandate.case" SCHEME="URI" CONTENT="http://www.austlii.edu.au/au/cases/cth/irc/990003.html"> |
Compiled from AGLS Metadata Element Set, Part 2: Usage Guide, Version 1.3, National Archives of Australia, 2002, URL: http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html
No elements are mandatory for DC, but AGLS requires five (or six) of the following:
Creator
Publisher (note: this element is not mandatory for descriptions of services)
Title
Date
Subject OR Function
Identifier OR Availability
From: AGLS Metadata Element Set, Part 2: Usage Guide, Version 1.3 , National Archives of Australia, 2002, URL: http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html
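The mandatory element rule above can be expressed as a short check. This is a sketch only, in Python: the function name missing_mandatory, the use of a plain dictionary for a record, and the sample values are all assumptions made for illustration and are not part of AGLS.

```python
def missing_mandatory(record, is_service=False):
    """List the mandatory AGLS elements (or alternatives) absent from record.

    record maps element names to values; empty values count as absent.
    """
    def present(name):
        return bool(str(record.get(name, "")).strip())

    missing = []
    for name in ("Creator", "Title", "Date"):
        if not present(name):
            missing.append(name)
    # Publisher is not mandatory for descriptions of services.
    if not is_service and not present("Publisher"):
        missing.append("Publisher")
    # Either element of each pair satisfies the requirement.
    for first, second in (("Subject", "Function"), ("Identifier", "Availability")):
        if not (present(first) or present(second)):
            missing.append(f"{first} OR {second}")
    return missing


# Hypothetical record describing a service; all values are invented.
record = {
    "Creator": "Example Agency",
    "Title": "After hours medical hotline",
    "Date": "2003-08-13",
    "Function": "Health",
    "Availability": "Telephone the after hours hotline",
}
print(missing_mandatory(record, is_service=True))
print(missing_mandatory(record))
```

The same record passes as a description of a service but fails as a description of a document, because only the latter requires a Publisher. This is the "five (or six)" in the rule.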
Qualifiers
Qualifiers are used to restrict the semantics of the relationship between the resource and the element value. AGLS encourages more use of qualifiers than DC, but does not require it:
Qualifiers are additions and extensions to the metadata elements that give metadata creators the option to refine the semantics of the element set, and add precision to the values of the metadata elements. For example, it may be useful to indicate that the value has been selected from a particular controlled vocabulary, such as a list of keywords, or is encoded using a particular convention – the format for dates is an important case – or in a particular natural language.
From: AGLS Metadata Element Set, Part 2: Usage Guide, Version 1.3 , National Archives of Australia, 2002, URL: http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html
AGLS uses two types of qualifiers:
Element refinements are represented in HTML <meta> syntax with qualifiers appended to the element names. For example: "DC.Type.documentType". Note that the "T" in "Type" in the example is in upper case, whereas the "d" of "document" is not. This is a somewhat odd practice in DC.
Encoding schemes indicate how the value is to be interpreted if it has been chosen from a controlled vocabulary, or externally defined standard. For example:
<META NAME="DC.Date.modified" SCHEME="ISO8601" CONTENT="1998-08-27">.
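Software reading such a tag has to separate the element, its refinement and the encoding scheme. The following Python sketch does this for the example above. The names split_name and QualifiedMeta are assumptions made for illustration, and splitting the name on full stops is a simplification of the naming convention.

```python
from html.parser import HTMLParser


def split_name(name):
    """Split 'DC.Date.modified' into (schema, element, refinement or None)."""
    parts = name.split(".", 2)
    if len(parts) < 2:
        raise ValueError(f"no schema prefix in {name!r}")
    schema, element = parts[0], parts[1]
    refinement = parts[2] if len(parts) == 3 else None
    return schema, element, refinement


class QualifiedMeta(HTMLParser):
    """Collect (schema, element, refinement, scheme, content) from META tags."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # HTMLParser lowercases attribute names
        name = attributes.get("name") or ""
        if tag == "meta" and "." in name:
            self.items.append(
                split_name(name)
                + (attributes.get("scheme"), attributes.get("content"))
            )


parser = QualifiedMeta()
parser.feed('<META NAME="DC.Date.modified" SCHEME="ISO8601" CONTENT="1998-08-27">')
parser.close()
print(parser.items)
```

A program that does not understand the "modified" refinement can still treat the value as a plain DC Date, and the scheme tells it how the value "1998-08-27" is to be interpreted.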
Metadata Tools
Metadata is rarely entered by the document author typing in text. When encoded in the header of a HTML document, the metadata is not displayed by a web browser. Specialized software, such as content management systems or features in word processors, is used to enter and display the metadata. The user of the system is likely to be unaware that they are using a metadata standard, or of how it is encoded. Examples of these systems will be shown later.
The Distributed Systems Technology Centre (DSTC Pty Ltd) has produced a tool, Reggie, which can be used to generate AGLS and Dublin Core metadata syntax. This would be too cumbersome for creating real metadata, but is a useful way to learn about the process.
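The kind of output such tools produce can be sketched in a few lines of Python. This is not how Reggie or any particular product works; the function name meta_tags and its scheme_for parameter are assumptions made for illustration. The values are this document's own title, author and version date.

```python
from html import escape


def meta_tags(record, scheme_for=None):
    """Render a dict of metadata names and values as HTML <meta> lines."""
    scheme_for = scheme_for or {}
    lines = []
    for name, value in record.items():
        scheme = scheme_for.get(name)
        scheme_attr = f' scheme="{escape(scheme, quote=True)}"' if scheme else ""
        # escape() protects quotes and apostrophes inside attribute values.
        lines.append(
            f'<meta name="{escape(name, quote=True)}"{scheme_attr} '
            f'content="{escape(value, quote=True)}">'
        )
    return "\n".join(lines)


# Values taken from this document's own title, author and version date.
record = {
    "DC.Title": "Metadata for Publishing",
    "DC.Creator": "Tom Worthington",
    "DC.Date": "2003-08-13",
}
print(meta_tags(record, scheme_for={"DC.Date": "ISO8601"}))
```

Generating the tags by program, rather than by hand, also avoids syntax slips such as an unescaped apostrophe inside a quoted value.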
Copyright © Tom Worthington 2000 - 2003