The Power of Data

by Tom Worthington FACS

Visiting Fellow Department of Computer Science, Australian National University

For Enterprise Data Management by marcus evans, 23rd & 24th July 2002, Sydney, Australia

Discover how to exploit the power of data

Slides available.

Summary

  1. Metadata is Money
  2. Data Standards Are Political
  3. Data Standards Are Like Soup
  4. XML Schema is a Soup Mix
  5. SOAP in the Soup?
  6. Electronic documents are legal
  7. Open Source Powers XML
  8. The Web is a naming scheme, protocols and hypertext
  9. Interactive TV Isn't Difficult
  10. Metadata is the Killer iTV Application

Metadata is Money

Metadata (data about data) is essential for e-commerce, as it provides standard data items that allow parties to communicate about their organisations, products, terms and conditions. In an electronic transaction, the payment, and the "money" itself, also consists of data in an agreed metadata format. Without suitable metadata standards, e-commerce could not take place and "money" in our online financial systems would cease to exist.

The Oxford English Dictionary describes metadata as:

metadata n., a set of data that describes and gives information about other data...

[1968 Proc. IFIP 4th Congr.: Suppl. 10 I. 113/2 There are categories of information about each data set as a unit in a data set of data sets, which must be handled as a special meta data set.] 1987 Philos. Trans. Royal Soc. A. 322 373 The challenge is to accumulate data..from diverse sources, convert it to machine-readable form with a harmonized array of *metadata descriptors and present the resulting database(s) to the user. 1998 New Scientist 30 May 35/2 With XML, attaching metadata to a document is easy, at least in theory.

Oxford English Dictionary, (Online) Draft entry Dec. 2001, URL: http://dictionary.oed.com/cgi/entry/00307096/00307096se19

Data Standards Are Political

There is a need for those creating electronic information to agree a common format for the information to be understood. The challenge is to create formats which are sufficiently expressive to be able to communicate what is needed, but simple enough to be implemented efficiently. The politics of standards development needs to be kept in mind. Most standards need to be profiled, to create a workable subset, before they can be used for practical purposes. Some standards need to be enhanced and others not used at all.

The World Wide Web Consortium (W3C) standard for Scalable Vector Graphics (SVG) provides a way to define images in web pages. As well as the expected features of shapes, filling, symbols, colors and patterns, there is the 'metadata' element:

<!ENTITY % metadataExt "" >

<!ELEMENT metadata (#PCDATA %metadataExt;)* >

<!ATTLIST metadata %stdAttrs; >

From 21.2 The 'metadata' element, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/

This apparently technically simple definition is made politically complex by a preceding paragraph:

Individual industries or individual content creators are free to define their own metadata schema but are encouraged to follow existing metadata standards and use standard metadata schema wherever possible to promote interchange and interoperability. If a particular standard metadata schema does not meet your needs, then it is usually better to define an additional metadata schema in an existing framework such as RDF and to use custom metadata schema in combination with standard metadata schema, rather than totally ignore the standard schema.

From 21.1 Introduction, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/

The important points here are: "...free to define their own metadata schema but are encouraged to follow existing metadata standards ... better to define an additional metadata schema in an existing framework ...". In some ways the ease of defining metadata using new web-based tools has made the standardization process more difficult. It is very tempting, if an existing definition is not quite right, to define a new standard and hope that some tool will allow conversion between the standards. However, having many standards is a similar problem to having no standards at all.
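As a minimal sketch of the approach the SVG specification encourages, the Python below reads Dublin Core metadata wrapped in RDF from an SVG 'metadata' element. The sample image, its title and creator values, and the function name read_svg_metadata are invented for illustration; only the namespace URIs come from the published standards.

```python
import xml.etree.ElementTree as ET

# Namespace URIs published by W3C and the Dublin Core Metadata Initiative.
NS = {
    "svg": "http://www.w3.org/2000/svg",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample image: the title and creator values are invented.
SVG_SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="">
        <dc:title>Company logo</dc:title>
        <dc:creator>Example Pty Ltd</dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <circle cx="50" cy="50" r="40"/>
</svg>"""


def read_svg_metadata(svg_text):
    """Return Dublin Core element names and values found in <metadata>."""
    root = ET.fromstring(svg_text)
    found = {}
    for description in root.findall("svg:metadata/rdf:RDF/rdf:Description", NS):
        for child in description:
            # ElementTree stores qualified names as "{namespace}local".
            namespace, _, local = child.tag[1:].partition("}")
            if namespace == NS["dc"]:
                found[local] = (child.text or "").strip()
    return found


svg_metadata = read_svg_metadata(SVG_SAMPLE)
print(svg_metadata)
```

Because the metadata uses an existing framework (RDF) and an existing vocabulary (Dublin Core), a generic tool can read it without knowing anything about the image itself.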

Data Standards Are Like Soup

Soup is a way to use up leftovers in the kitchen. Similarly, data standards should be used to cook together all the bits of data around the enterprise. Don't go out and buy expensive new ingredients (data), just mix together what you have.

  1. NAA (2000a) The Australian Government Locator Service: Summary, National Archives of Australia, Commonwealth of Australia, 2000, URL: http://www.naa.gov.au/recordkeeping/gov_online/agls/summary.html

  2. OCLC (2000) The Dublin Core Metadata Initiative, OCLC Online Computer Library Center, Inc., 2000, URL: http://purl.oclc.org/dc/

  3. NIST (1998) Federal Procurement Code List One (FP1), National Institute of Standards and Technology, 1998, URL: http://snad.ncsl.nist.gov/dartg/edi/fededi-coding.html

  4. UCC (1999) UCC: Voluntary Interindustry Commerce Standard (VICS EDI), Uniform Code Council, Inc., 2000, URL: http://www.uc-council.com/e_commerce/ec_voluntary_interindustry_com.html

  5. UNECE (19xx) UN/EDIFACT Draft Directory, United Nations Trade Division, 19xx, URL: http://www.unece.org/trade/untdid/welcom1.htm
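The "soup" idea can be sketched in a few lines of Python: records that already exist in different systems are relabelled with a shared vocabulary rather than re-created. The local field names, the sample records and the function to_dublin_core are all hypothetical; title, creator and date are genuine Dublin Core element names.

```python
# Hypothetical mapping from local field names to Dublin Core element names.
FIELD_MAP = {
    "doc_title": "title",
    "heading": "title",
    "author": "creator",
    "written_by": "creator",
    "created": "date",
    "issue_date": "date",
}


def to_dublin_core(record):
    """Rename known fields to Dublin Core names; set unknown fields aside."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        if field in FIELD_MAP:
            mapped[FIELD_MAP[field]] = value
        else:
            unmapped[field] = value
    return mapped, unmapped


# Invented records standing in for two separate enterprise systems.
records_sample = [
    {"doc_title": "Annual report", "author": "Finance", "created": "2002-06-30"},
    {"heading": "Price list", "written_by": "Sales",
     "issue_date": "2002-07-01", "sku_count": 120},
]

combined = [to_dublin_core(record)[0] for record in records_sample]
print(combined)
```

Fields with no equivalent in the standard (here sku_count) are kept separately, which is where a profile or extension of the standard would be needed.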

XML Schema is a Soup Mix

XML Schema Instance

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
    </shipTo> ...
</purchaseOrder>
  
From: XML Schema Part 0: Primer, W3C Recommendation, 2 May 2001

XML Schema

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

...

 <xsd:complexType name="USAddress">
  <xsd:sequence>
   <xsd:element name="name"   type="xsd:string"/>
   <xsd:element name="street" type="xsd:string"/>
   <xsd:element name="city"   type="xsd:string"/>
   <xsd:element name="state"  type="xsd:string"/>
   <xsd:element name="zip"    type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN"
     fixed="US"/>
 </xsd:complexType>

...

</xsd:schema>

From: XML Schema Part 0: Primer, W3C Recommendation, 2 May 2001
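To make the link between the instance and the schema concrete, the Python below hand-checks a shipTo element against the three constraints visible in the USAddress type: the order of the child elements, the fixed country attribute and the decimal zip. This is not an XML Schema validator (the Python standard library does not include one); the function name check_us_address and the regular expression are assumptions made for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Child elements of USAddress, in the order the xsd:sequence requires.
US_ADDRESS_SEQUENCE = ["name", "street", "city", "state", "zip"]

# Lexical form of xsd:decimal: optional sign, digits, optional fraction.
XSD_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")

ORDER_SAMPLE = """<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
    </shipTo>
</purchaseOrder>"""


def check_us_address(element):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    child_names = [child.tag for child in element]
    if child_names != US_ADDRESS_SEQUENCE:
        problems.append(f"expected children {US_ADDRESS_SEQUENCE}, found {child_names}")
    # fixed="US": the attribute may be omitted, but if present it must be "US".
    if element.get("country", "US") != "US":
        problems.append("country attribute is fixed to 'US'")
    zip_text = (element.findtext("zip") or "").strip()
    if not XSD_DECIMAL.fullmatch(zip_text):
        problems.append(f"zip {zip_text!r} is not a decimal number")
    return problems


order_root = ET.fromstring(ORDER_SAMPLE)
print(check_us_address(order_root.find("shipTo")))
```

In practice a schema-aware validator would do this work from the schema itself; the point of the sketch is only that each line of the schema corresponds to a testable rule.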

SOAP in the Soup?

SOAP Version 1.2 (SOAP) is a lightweight protocol intended for exchanging structured information in a decentralized, distributed environment. It uses XML technologies to define an extensible messaging framework providing a message construct that can be exchanged over a variety of underlying protocols. The framework has been designed to be independent of any particular programming model and other implementation specific semantics.

From: SOAP Version 1.2 Part 1: Messaging Framework, W3C Working Draft 26 June 2002

While XML Schema defines details of data within applications, SOAP defines the way the applications may communicate with each other (and makes use of XML Schema). The catch is that SOAP tries to achieve simplicity by leaving out features needed for real-world applications, such as reliability and security. When these are added with extensions, SOAP becomes much more complex.

Example 1: SOAP message containing a header block and a body
<env:Envelope xmlns:env="http://www.w3.org/2002/06/soap-envelope">
 <env:Header>
  <n:alertcontrol xmlns:n="http://example.org/alertcontrol">
   <n:priority>1</n:priority>
   <n:expires>2001-06-22T14:00:00-05:00</n:expires>
  </n:alertcontrol>
 </env:Header>
 <env:Body>
  <m:alert xmlns:m="http://example.org/alert">
   <m:msg>Pick up Mary at school at 2pm</m:msg>
  </m:alert>
 </env:Body>
</env:Envelope>
  

From: SOAP Version 1.2 Part 1: Messaging Framework, W3C Working Draft 26 June 2002
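A receiving application might pull the header and body values out of the message above roughly as follows. This is a sketch using only Python's standard XML parser, not a SOAP toolkit; the function name read_alert is invented, and the namespace URIs are those of the June 2002 Working Draft example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the Working Draft example.
NS = {
    "env": "http://www.w3.org/2002/06/soap-envelope",
    "n": "http://example.org/alertcontrol",
    "m": "http://example.org/alert",
}

SOAP_SAMPLE = """<env:Envelope xmlns:env="http://www.w3.org/2002/06/soap-envelope">
 <env:Header>
  <n:alertcontrol xmlns:n="http://example.org/alertcontrol">
   <n:priority>1</n:priority>
   <n:expires>2001-06-22T14:00:00-05:00</n:expires>
  </n:alertcontrol>
 </env:Header>
 <env:Body>
  <m:alert xmlns:m="http://example.org/alert">
   <m:msg>Pick up Mary at school at 2pm</m:msg>
  </m:alert>
 </env:Body>
</env:Envelope>"""


def read_alert(message_text):
    """Extract the alert control header values and the alert text."""
    envelope = ET.fromstring(message_text)
    if envelope.tag != "{%s}Envelope" % NS["env"]:
        raise ValueError("not a SOAP envelope in the expected namespace")
    return {
        "priority": envelope.findtext("env:Header/n:alertcontrol/n:priority", namespaces=NS),
        "expires": envelope.findtext("env:Header/n:alertcontrol/n:expires", namespaces=NS),
        "msg": envelope.findtext("env:Body/m:alert/m:msg", namespaces=NS),
    }


alert = read_alert(SOAP_SAMPLE)
print(alert)
```

Note that nothing here deals with reliability or security, which is the gap in the basic protocol discussed above.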

PS: If you are wondering what SOAP stands for, it doesn't. ;-)

1.1 Notational Conventions

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [3].

This specification uses a number of namespace prefixes throughout; they are listed in Table 1. Note that the choice of any namespace prefix is arbitrary and not semantically significant (see [10]).

Table 1: Prefixes and Namespaces used in this specification.

Prefix | Namespace | Notes
env | "http://www.w3.org/2002/06/soap-envelope" | A normative XML Schema [4], [5] document for the "http://www.w3.org/2002/06/soap-envelope" namespace can be found at http://www.w3.org/2002/06/soap-envelope.
flt | "http://www.w3.org/2002/06/soap-faults" | A normative XML Schema [4], [5] document for the "http://www.w3.org/2002/06/soap-faults" namespace can be found at http://www.w3.org/2002/06/soap-faults.
upg | "http://www.w3.org/2002/06/soap-upgrade" | A normative XML Schema [4], [5] document for the "http://www.w3.org/2002/06/soap-upgrade" namespace can be found at http://www.w3.org/2002/06/soap-upgrade.
enc | "http://www.w3.org/2002/06/soap-encoding" | Defined by Part 2 [1].

Namespace names of the general form "http://example.org/..." and "http://example.com/..." represent application or context-dependent URIs [6].

All parts of this specification are normative, with the exception of examples and sections explicitly marked as "Non-Normative".

1.2 Relation to Other Specifications

A SOAP message is specified as an XML Information Set [10]. While all SOAP message examples in this document are shown using XML 1.0 [8] syntax, other representations MAY be used to transmit SOAP messages between nodes (see 4. SOAP Protocol Binding Framework).

Some of the information items defined by this document are identified using XML namespace [7] names (see 5. SOAP Message Construct). In particular, this document defines the following namespace names:

Normative XML Schema [4], [5] documents for these namespace names can be found by dereferencing the namespace names above.

SOAP does not require that XML Schema processing (assessment or validation) be performed to establish the correctness or 'schema implied' values of element and attribute information items defined by this specification. The values associated with element and attribute information items defined in this specification MUST be carried explicitly in the transmitted SOAP message except where stated otherwise (see 5. SOAP Message Construct).

SOAP attribute information items have types described by XML Schema: Datatypes [5]. Unless otherwise stated, all lexical forms are supported for each such attribute, and lexical forms representing the same value in the XML Schema value space are considered equivalent for purposes of SOAP processing, e.g. the boolean lexical forms "1" and "true" are interchangeable. For brevity, text in this specification refers only to one lexical form for each value, e.g. "if the value of the mustUnderstand attribute information item is 'true'".

Specifications for the processing of application-defined data carried in a SOAP message but not defined by this specification MAY call for additional validation of the SOAP message in conjunction with application-level processing. In such cases, the choice of schema language and/or validation technology is at the discretion of the application.

SOAP uses XML Base [11] for determining a base URI for relative URI references used as values in information items defined by this specification (see 6. Use of URIs in SOAP).

The media type "application/soap+xml" [13] SHOULD be used for XML 1.0 serializations of the SOAP message infoset.
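The point in the quoted text about lexical forms, that "1" and "true" are interchangeable, can be illustrated with a small Python sketch. The function name parse_xsd_boolean is invented; the four accepted forms are those XML Schema defines for its boolean datatype.

```python
# The four lexical forms XML Schema defines for the boolean datatype.
XSD_BOOLEAN_FORMS = {"true": True, "1": True, "false": False, "0": False}


def parse_xsd_boolean(text):
    """Map an xsd:boolean lexical form to its value, ignoring outer whitespace."""
    try:
        return XSD_BOOLEAN_FORMS[text.strip()]
    except KeyError:
        raise ValueError(f"{text!r} is not an xsd:boolean lexical form") from None


# Two senders writing the same mustUnderstand value in different forms.
same_value = parse_xsd_boolean("1") == parse_xsd_boolean("true")
print(same_value)
```

Comparing parsed values rather than raw strings is what lets two implementations that write the attribute differently still interoperate.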

Electronic documents are legal

In 1995 a government committee, which I chaired, made recommendations for electronic document management in Australian Government Agencies (OGO 1995):

Whatever strategy is adopted, the document management system must:

  • provide adequate context information for documents;
  • provide means to prove the authenticity of documents used as evidence;
  • provide for the disposal of records in conformance with the Archives Act 1983;
  • be robust against organisational or technological change;
  • provide levels of support for different types of document that accord with agency policy; and
  • provide links between paper and electronic documents.
From Improving Electronic Document Management: Guidelines For Australian Government Agencies (OGO 1995)

Since that time, the uncontrolled use of e-mail and poor advice from lawyers have resulted in government and private organisations getting into a mess with electronic documents.

Open Source Powers XML

Technical and administrative solutions are available, and courts are unlikely to accept the excuse: "the computer ate my records". One interesting technology is the OpenOffice.org package format. OpenOffice is an Open Source project that aims to design and build free software to rival Microsoft Office. One spinoff of the project is a portable format for electronic documents, which builds on public XML standards developed for the web. Ian Barnes at the Australian National University has demonstrated that it is relatively simple to manipulate the OpenOffice.org format.
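To give a feel for why the format is easy to manipulate, the Python below treats a document as what it is: a ZIP archive with the text held in an XML file named content.xml. This is a sketch and not Ian Barnes's method. The sample package is a cut-down stand-in built in memory rather than a real document, and the two namespace URIs are assumed to match the OpenOffice.org 1.0 file format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs assumed to match the OpenOffice.org 1.0 file format.
OFFICE_NS = "http://openoffice.org/2000/office"
TEXT_NS = "http://openoffice.org/2000/text"

# Cut-down stand-in for content.xml; a real file holds much more.
CONTENT_SAMPLE = f"""<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <office:body>
    <text:p>First paragraph.</text:p>
    <text:p>Second paragraph.</text:p>
  </office:body>
</office:document-content>"""


def build_sample_package():
    """Create an in-memory ZIP archive holding only content.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", CONTENT_SAMPLE)
    buffer.seek(0)
    return buffer


def read_paragraphs(package_file):
    """Return the text of each paragraph element in content.xml."""
    with zipfile.ZipFile(package_file) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


paragraphs = read_paragraphs(build_sample_package())
print(paragraphs)
```

The same read_paragraphs function could be pointed at a file on disk; no office software is needed to get at the content.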

The Web is a naming scheme, protocols and hypertext

There is no formal definition of what the World Wide Web (Web) is. However, the World Wide Web Consortium (W3C), which develops web specifications, describes it as:

... a network of information resources. The Web relies on three mechanisms to make these resources readily available to the widest possible audience:

  1. A uniform naming scheme for locating resources on the Web (e.g., URIs).
  2. Protocols, for access to named resources over the Web (e.g., HTTP).
  3. Hypertext, for easy navigation among resources (e.g., HTML).

W3C (1999)

With this in mind, web content can be manipulated for use on devices other than desktop computers.
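The three mechanisms can be seen side by side in a short Python sketch: a URI is split into its parts, the scheme names the protocol, and the hypertext links in a page are resolved into further URIs. The page address and HTML fragment are invented examples, and no network access is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the targets of <a href> links as absolute URIs."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URI.
                    self.links.append(urljoin(self.base_uri, value))


# Invented page address and content.
PAGE_URI = "http://www.example.org/guide/index.html"
PAGE_HTML = ('<p>See the <a href="tv.html">TV guide</a> and '
             '<a href="http://www.w3.org/">W3C</a>.</p>')

uri_parts = urlsplit(PAGE_URI)   # 1. naming scheme
protocol = uri_parts.scheme      # 2. protocol used to fetch the resource
collector = LinkCollector(PAGE_URI)
collector.feed(PAGE_HTML)        # 3. hypertext links to other resources
print(protocol, uri_parts.netloc, uri_parts.path, collector.links)
```

None of this depends on the page being shown on a desktop browser, which is why the same content can be repurposed for other devices.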

Interactive TV Isn't Difficult

[Image: Neon Technology NTV - 500 entry level set-top]

Advanced set-top-boxes (STBs) for digital television provide web-like features for the viewer. These allow the viewer to look at the electronic program guide and other information. While previously based on proprietary formats, more recent boxes use some form of HTML. Some Internet TV boxes, such as the Neon Technology NTV - 500, use the TV just as a display device and use a conventional dial-up telephone line for Internet access. Canberra's Transact broadband system currently uses Motorola Streammaster 5000 STBs.

Web-t and iTV characteristics

Web telephones and iTV share common characteristics:

Web content can be generated to adapt to specific devices. However, the target device may not be known, or may change, or resources may not be available to tailor content closely for a large range of devices. W3C's accessibility guidelines were specifically developed with the needs of small devices in mind, as well as disabled users. It is therefore possible to use the guidelines to design content to suit a generic hand held web device, iTV and larger screens.

Typical Small Device

While it is possible to create web sites for specific telephone and TV web devices, a reasonable target to aim for is a device with a Quarter VGA (QVGA) screen (320 x 240). Web pages designed for this size can display on a conventional larger PC screen, as well as a PDA or a TV. There is no exact size in pixels for a TV set, as TV standards differ across the world and the edge of the screen may be masked, so not all the image is displayed. Small devices may have limited or no Java; they are more likely to have ECMAScript. While CSS may be supported, it may not produce a readable display on a small screen.

Metadata is the Killer iTV Application

The TV broadcasting industry is undergoing a difficult period where it attempts to adopt interactive features:

AUSTRALIA'S free-to-air television networks have agreed to use a new interactive standard, but they are far from unified about how to go about introducing it. The Nine Network has criticized the slow advancement of the interactive television standard, MHP, but insists it is still committed to the embryonic technology. Nine's director of digital services Kim Anderson expressed frustration with the development of MHP, or Multimedia Home Platform, yesterday, and said her network had proceeded with developments in the DVB-HTML standard in the meantime...

From: Confusion over iTV standard, Kate Mackenzie, The Australian, 27 February 2002

TV metadata can be extremely valuable:

Within the TV-Anytime environment, the most visible parts of metadata are the attractors/descriptors used e.g. in Electronic Program Guides (EPG), or in Web pages to describe content. This is the information that the consumer, or intelligent agents, will use to search and select content available from a variety of internal and external sources.

Another important set of metadata consists of describing user preferences, representing user consumption habits, and defining other information (e.g. demographics models) for targeting a specific audience.

The TV-Anytime Metadata Specification also allows describing segmented content. Segmentation Metadata is used to edit content for partial recording and non-linear viewing. In this case, metadata is used to navigate within a piece of segmented content.

From the TV-Anytime Metadata Specification working group, 2001

TV-Anytime uses XML syntax both to define what metadata will be used (the schema) and to encode the data itself (instances). The metadata structure is defined using the MPEG-7 Description Definition Language (ISO 2001), which is based on XML Schema.

An example of a metadata definition is:

<complexType name="EventInformationType">

<sequence>
<element name="PublishedTime" type="dateTime" minOccurs="0"/>
<element name="PublishedDuration" type="duration" minOccurs="0"/>
<element name="Live" type="tva:FlagType" minOccurs="0"/>
<element name="Repeat" type="tva:FlagType" minOccurs="0"/>
<element name="FirstShowing" type="tva:FlagType" minOccurs="0"/>
<element name="LastShowing" type="tva:FlagType" minOccurs="0"/>
<element name="Free" type="tva:FlagType" minOccurs="0"/>
</sequence>

</complexType>

From TV-Anytime Metadata Specification Version 1.2, WD498 Part-A.doc in ftp://tva:tva@ftp.bbc.co.uk/pub/Plenary/TV096r2.zip
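To show how an instance of this definition might be consumed, for example by a program guide, the Python below reads a hand-written EventInformation element. The sample values are invented, the namespace is omitted for brevity, and the sketch assumes that a flag carries its setting in a value attribute; check these details against the specification before relying on them.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Invented instance; the use of a "value" attribute on flags is an assumption.
EVENT_SAMPLE = """<EventInformation>
  <PublishedTime>2002-07-23T19:30:00+10:00</PublishedTime>
  <PublishedDuration>PT30M</PublishedDuration>
  <Live value="true"/>
  <Repeat value="false"/>
  <Free value="true"/>
</EventInformation>"""

# Flag elements named in the EventInformationType definition.
FLAG_NAMES = ["Live", "Repeat", "FirstShowing", "LastShowing", "Free"]


def read_event(xml_text):
    """Return the elements that are present; every element is optional."""
    root = ET.fromstring(xml_text)
    event = {}
    published = root.findtext("PublishedTime")
    if published is not None:
        event["PublishedTime"] = datetime.fromisoformat(published.strip())
    duration = root.findtext("PublishedDuration")
    if duration is not None:
        # Kept as the original duration string, e.g. "PT30M".
        event["PublishedDuration"] = duration.strip()
    for name in FLAG_NAMES:
        flag = root.find(name)
        if flag is not None:
            event[name] = flag.get("value") in ("true", "1")
    return event


event = read_event(EVENT_SAMPLE)
print(event)
```

Because every element has minOccurs="0", the reader has to cope with any of them being absent, which is why missing flags are simply left out of the result.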

Information display on a TV or PDA need not be limited to entertainment, but could be used for individually tailored customer education and sales. Information from traditional data repositories could be combined with audio and video to provide an individual customer experience.

For More Detail

This presentation is prepared from the following material prepared for the Australian National University:

And the presentation Metadata: the `killer application' for digital broadcasting? at the Australian Broadcasting Authority 2002 Conference, 29 April 2002, Canberra: http://www.tomw.net.au/2002/mka.html

Speaker

Tom Worthington is an independent electronic business consultant and author of the book Net Traveller. He is one of the architects of the Commonwealth Government's Internet and web strategy and was the first Web Master for the Australian Department of Defence. In 1999 he was elected a Fellow of the Australian Computer Society for his contribution to the development of public Internet policy. Tom is a director and past President of the Australian Computer Society and a voting member of the Association for Computing Machinery.

This document is Version 2.0 – 14 June 2002: http://www.tomw.net.au/2002/edm.html

Comments and corrections to: webmaster@tomw.net.au

Copyright © Tom Worthington 2002