Metadata and Electronic Document Management for Electronic Commerce

Capturing Australia's Scholarly Publishing

Tom Worthington

Version of 15 August 2008

This item on "Capturing Australia's Scholarly Publishing" is part of a segment on "Metadata and Electronic Document Management for Electronic Commerce", first presented for the Australian National University course "Information Technology in Electronic Commerce" (COMP3410/COMP6341).

This document is intended to serve both as slides for live group presentation and as accompanying lecture notes for individual use. The slides and these notes are provided in the one HTML document, using HTML Slidy.

Capturing Scholarly Publishing

Thought experiment

A researcher, and ARC grant recipient, at an Australian university completes an article

Following peer review the article is accepted by an international proprietary journal

A post print copy of the article is also lodged with the university's open access digital repository ...

From: "Governmental Policy Frameworks", Dr Evan Arthur, Department of Education, Science and Training, 2004, URL: http://www.humanities.org.au/NSCF/PowerPoints/NSCF%20(Arthur).ppt

At a roundtable in 2004 a thought experiment was outlined to transform the distribution of scholarly information in Australia. It proposed making research results available online to government funding bodies, the universities where the research was conducted, industry and the public. This process can now be automated for open access publications, using digital libraries built on metadata standards and XML.

Automated Distribution

... These actions lead to automatic updating of

the researcher's open access publication list

the university's open access record of staff research activity

the ARC's open access record of research activity related to its grants

a gateway site providing sophisticated, industry tailored access to research activities in Australian research institutions

the publicly accessible data warehouse which provides input into quality assessments of Australian research institutions

From: "Governmental Policy Frameworks", Dr Evan Arthur, Department of Education, Science and Training, 2004, URL: http://www.humanities.org.au/NSCF/PowerPoints/NSCF%20(Arthur).ppt

It was proposed that a researcher lodging their article in a university repository would automatically update institutional, government and public lists of research publishing.

Automating Capture

Journal publishes metadata for all papers in machine readable format on-line
Institution archive scans metadata for its authors
Institution publishes its authors' metadata
ARC ingests metadata from the institution (checks against publisher)
Gateways provide industry tailored indexes to research

The "thought experiment" can be simplified if the article to be lodged is already online with the required metadata. The step of lodging the article can then be replaced by an automated scan.

To simplify the harvest process, the OAI Static Repository format is available for participating journals to publish their list of articles.

The papers can be automatically harvested from the digital library. The metadata can automatically populate publication lists, research gateways and quality assessment data warehouses.
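The scan step can be sketched as follows. This is a minimal illustration only, assuming the journal's metadata has already been parsed into simple records. The record layout, the sample names and the function `records_for_institution` are hypothetical, and a real archive would need more robust matching (variant spellings, author identifiers).

```python
# Hypothetical sketch: select the journal records written by an institution's staff.
# The record layout and the names below are invented for illustration.

def normalise(name):
    """Lower-case a 'Family, Given' name and collapse whitespace for comparison."""
    return " ".join(name.lower().split())


def records_for_institution(records, staff):
    """Return the records that have at least one author in the staff list."""
    staff_names = {normalise(name) for name in staff}
    return [
        record
        for record in records
        if any(normalise(author) in staff_names for author in record["authors"])
    ]


journal_records = [
    {"title": "Paper one", "authors": ["Smith, Jane", "Lee, Kim"]},
    {"title": "Paper two", "authors": ["Jones, Alex"]},
]

for record in records_for_institution(journal_records, ["smith,  jane"]):
    print(record["title"])
```

The matched records are what the institution would then republish and what a funding body could check against the publisher's own list.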

Automated Capture

Arrow harvests ACS DL
IFIP DL expanding this globally

As a first step, the ACS Digital Library exports metadata from publications to the Arrow Discovery Service using OAI metadata and XML standards. This is being expanded to a global system with the IFIP Digital Library.

Masticating Documents

%0 Conference Proceedings
%A Aa, Tom Vander
%A Eeckhout, Lieven ...
%D 2002
%T Optimizing a 3D Image Reconstruction...
%O Seventh Asia-Pacific ... Conference (ACSAC2002)
%E Lai, Feipei
%E Morris, John
%I ACS
%C Melbourne, Australia
%P 119-126
%S Conferences in Research and Practice ...
%K CRPITVol6
%O confpapers/CRPITV6Aa.pdf

From: "Refer file of all papers", CRPIT, 2004, URL: http://crpit.com/CRPIT.refer

Same metadata in BibTeX

@inproceedings{CRPIT-6-119-126,
   Author = {Aa, Tom Vander and Eeckhout...},
   Title = {Optimizing a 3D Image ...},
   BookTitle = {Seventh Asia-Pacific ... },
   Editor = {Lai, Feipei and Morris, John},
   Series= {Conferences in Research ...},
   Address= {Melbourne, Australia},
   Publisher = {ACS},
   Volume = {6},
   Pages = {119-126},
   Year = {2002} }

From: "BiBTeX file of all papers", CRPIT, 2004, URL: http://crpit.com/CRPIT.bib

The term "ingest" is used to describe the process of incorporating an electronic document and its metadata into an electronic archive. "Masticate" therefore seems a suitable term for the preceding step of breaking the document into ingestible items.

Extracting the metadata is a much easier task than that of converting the entire content of a paper to an e-publishing format. The metadata required is not much more than already provided for bibliographic services.

Journal and conference indexes are traditionally provided in formats such as Refer and BibTeX. These formats can be converted to those needed by digital repositories. The resulting metadata files can be placed on the publisher's web site for harvesting tools used by readers and by archives. The archives can use XSLT to transform the metadata into other formats as required.
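As a rough illustration of such a conversion, the sketch below reads one Refer record and writes a BibTeX entry. It is not the InterBib tool or any other existing converter. The tag-to-field mapping and the function names are assumptions chosen to match the two slides above, and elided values ("...") are carried through unchanged.

```python
# Hypothetical sketch of a Refer to BibTeX conversion for a single record.
# The tag-to-field mapping is an assumption based on the slide examples.

SAMPLE = """%0 Conference Proceedings
%A Aa, Tom Vander
%A Eeckhout, Lieven ...
%D 2002
%T Optimizing a 3D Image Reconstruction...
%E Lai, Feipei
%E Morris, John
%I ACS
%C Melbourne, Australia
%P 119-126
%S Conferences in Research and Practice ...
"""

FIELD_FOR_TAG = {"T": "Title", "S": "Series", "C": "Address",
                 "I": "Publisher", "P": "Pages", "D": "Year"}


def parse_refer(text):
    """Parse one Refer record into a dict mapping tag letter to a list of values."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("%") and len(line) > 2:
            record.setdefault(line[1], []).append(line[2:].strip())
    return record


def to_bibtex(record, key):
    """Format a parsed Refer record as an @inproceedings entry."""
    fields = []
    if "A" in record:  # repeated %A lines become one "and"-separated field
        fields.append(("Author", " and ".join(record["A"])))
    if "E" in record:
        fields.append(("Editor", " and ".join(record["E"])))
    for tag, name in FIELD_FOR_TAG.items():
        if tag in record:
            fields.append((name, record[tag][0]))
    body = ",\n".join(f"   {name} = {{{value}}}" for name, value in fields)
    return f"@inproceedings{{{key},\n{body} }}"


print(to_bibtex(parse_refer(SAMPLE), "CRPIT-6-119-126"))
```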

Transforming metadata

Refer to BibTeX using InterBib
BibTeX to BibXML using bib2xml or online version

While Refer and BibTeX contain most of the needed bibliographic information, an XML format is more convenient for use in XML based systems. There are utility programs available to convert between bibliographic formats and to XML versions of these formats (such as BibXML). The XML versions can then be transformed using XSLT into other XML formats.

BibXML to RSS

<?xml version="1.0" ?>
<rss version="2.0">
<channel>
<title>CRPIT</title>
<link>http://crpit.com/</link>
<description>Conferences in Research and Practice in Information Technology</description>
<language>en</language>
<item>
<title>Fast Segmentation of Large Images</title>
<link>http://crpit.com/confpapers/CRPITV16Crisp.pdf</link>
</item>

Converted using XSLT

XSLT used for BibXML to RSS

<xsl:template match="REF">
<item>
<title><xsl:value-of select="TITLE" /></title>
<link>http://crpit.com/<xsl:apply-templates select="UNRECOGNIZED/ITEM[TAG='note']"/>
</link>
</item>
</xsl:template>

From: "XSLT used", Tom Worthington, 2004, URL: crpit.xsl

The XML version of the metadata can then be made available and converted further.
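The same mapping can be approximated without XSLT. The sketch below uses Python's standard `xml.etree.ElementTree` to turn a BibXML-like record into an RSS item. Only the REF, TITLE, UNRECOGNIZED, ITEM and TAG element names come from the stylesheet excerpt. The REFS root element and the VALUE child are assumptions made so the example is self-contained, and the real BibXML layout may differ.

```python
import xml.etree.ElementTree as ET

# Assumed input layout: the REFS wrapper and VALUE element are hypothetical.
BIBXML = """<REFS>
  <REF>
    <TITLE>Fast Segmentation of Large Images</TITLE>
    <UNRECOGNIZED>
      <ITEM><TAG>note</TAG><VALUE>confpapers/CRPITV16Crisp.pdf</VALUE></ITEM>
    </UNRECOGNIZED>
  </REF>
</REFS>"""


def bibxml_to_rss(bibxml, base_url="http://crpit.com/"):
    """Build an RSS 2.0 document with one item per REF element."""
    source = ET.fromstring(bibxml)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "CRPIT"
    ET.SubElement(channel, "link").text = base_url
    for ref in source.iter("REF"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ref.findtext("TITLE", default="")
        # The paper's relative path is held in the ITEM whose TAG is "note".
        for entry in ref.findall("UNRECOGNIZED/ITEM"):
            if entry.findtext("TAG") == "note":
                path = entry.findtext("VALUE", default="")
                ET.SubElement(item, "link").text = base_url + path
    return ET.tostring(rss, encoding="unicode")


print(bibxml_to_rss(BIBXML))
```

XSLT remains the more natural choice when the archive already works with XML tooling. The Python version is shown only to make the element-by-element mapping explicit.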

RSS

<?xml version="1.0" ?>
<rss version="2.0">
<channel>
<title>ACM Queue</title>
<link>http://www.acmqueue.com/</link>
<description>Tomorrow's Computing Today</description>
<language>en-us</language>
<item>
<title>Samba Does Windows-to-Linux Dance</title>
<link>http://acmqueue.com/?...pid=171</link>
<description>Mounting remote Linux ...</description>
</item>

From: "RSS feed", Queue magazine, ACM, 2004, URL: http://acmqueue.com/rss.rdf

RSS (Really Simple Syndication) is a Web content syndication format usually used for news items, but it can also make research papers more widely available. As an example, the ACM "Queue" magazine has a "feed" button on the home page.

Atom (IETF RFC 4287) provides a more advanced, standardised and feature rich syndication format than RSS. The ANU E Press provides a custom feed in RSS or Atom of Weblog entries, New Products and Product Reviews.
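A reader or archive consuming such a feed only needs a small amount of code. The sketch below, using Python's standard library, lists the title and link of each item in an RSS 2.0 document. The feed text is the slide excerpt closed off to make it well-formed, with the elided link kept as shown, and the function name `list_items` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide excerpt, with closing tags added so that it parses.
FEED = """<?xml version="1.0" ?>
<rss version="2.0">
<channel>
<title>ACM Queue</title>
<link>http://www.acmqueue.com/</link>
<description>Tomorrow's Computing Today</description>
<language>en-us</language>
<item>
<title>Samba Does Windows-to-Linux Dance</title>
<link>http://acmqueue.com/?...pid=171</link>
<description>Mounting remote Linux ...</description>
</item>
</channel>
</rss>"""


def list_items(feed_xml):
    """Return a (title, link) pair for every item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""),
         item.findtext("link", default="").strip())
        for item in root.iter("item")
    ]


for title, link in list_items(FEED):
    print(title, link)
```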

OAI Static Repository

<ListRecords metadataPrefix="oai_dc">
    <oai:record> 
      <oai:header>
        <oai:identifier>oai:arXiv:cs/0112017...
        <oai:datestamp>2001-12-14</oai:datestamp>
      </oai:header>
      <oai:metadata>
        <oai_dc:dc ...
          <dc:title>Using Structural Metadata ...
          <dc:creator>Dushay, Naomi</dc:creator>
          <dc:subject>Digital Libraries</dc:subject> 
          <dc:description>With the increasing ...
        </oai_dc:dc>
      </oai:metadata>
    </oai:record>

From: "Specification for an OAI Static Repository and an OAI Static Repository Gateway Protocol", Version 2.0 of 2002-06-14, URL: http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm

The OAI Static Repository Gateway Protocol, while more complex than RSS, is conceptually similar. The details of a list of published documents can be provided in a static file which can be harvested by a remote system. This file can simply be placed on the publisher's web site, alongside the Refer, BibTeX and RSS files.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is conceptually similar to the web services interface provided by Amazon.com to their list of publications.
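To show what a harvester does with such a file, the sketch below extracts the identifier, datestamp and Dublin Core fields from each record using Python's standard library. The sample is a simplified, completed version of the slide excerpt: the surrounding Repository element is reduced to the namespace declarations, truncated values are kept as shown, and the function name `harvest` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Simplified static repository file; elided text from the slide is kept as "...".
SAMPLE = """<Repository xmlns="http://www.openarchives.org/OAI/2.0/static-repository"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords metadataPrefix="oai_dc">
    <oai:record>
      <oai:header>
        <oai:identifier>oai:arXiv:cs/0112017...</oai:identifier>
        <oai:datestamp>2001-12-14</oai:datestamp>
      </oai:header>
      <oai:metadata>
        <oai_dc:dc>
          <dc:title>Using Structural Metadata ...</dc:title>
          <dc:creator>Dushay, Naomi</dc:creator>
          <dc:subject>Digital Libraries</dc:subject>
        </oai_dc:dc>
      </oai:metadata>
    </oai:record>
  </ListRecords>
</Repository>"""


def harvest(xml_text):
    """Return one dict of header and Dublin Core values per OAI record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter("{%s}record" % NS["oai"]):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "datestamp": record.findtext("oai:header/oai:datestamp", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in record.findall(".//dc:creator", NS)],
        })
    return records


for entry in harvest(SAMPLE):
    print(entry["datestamp"], entry["creators"], entry["title"])
```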

Repository Explorer

Repository Name ACS Digital Library
Base URL http://dl.acs.org.au/index.php/index/oai
Protocol Version 2.0
Admin Email dl@tomw.net.au
Earliest Datestamp 2006-12-05T00:40:05Z
Deleted Record Handling no
Granularity YYYY-MM-DDThh:mm:ssZ
Compression gzip
Compression deflate
Other Information
description: 
   oai-identifier: 
      scheme: oai
      repositoryIdentifier: acs.ojs.journals.sfu.ca
      delimiter: :
      sampleIdentifier: oai:acs.ojs.journals.sfu.ca:article/1
Archive Self-Description, for http://dl.acs.org.au/index.php/index/oai, Repository Explorer, 2008-08-15T02:32:07Z

Tools such as the Open Archives Initiative Repository Explorer allow demonstration access to a digital library's OAI interface. The explorer can first ask the repository which metadata formats it supports, and then request the records of electronic documents in one of those formats.
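The two requests correspond to the OAI-PMH verbs ListMetadataFormats and ListRecords, sent as ordinary HTTP query parameters. The sketch below only builds the request URLs and does not contact the server. The helper name `oai_request` is hypothetical, and the base URL is the one recorded in 2008 above, which may no longer respond.

```python
from urllib.parse import urlencode

# Base URL as reported by the Repository Explorer in 2008.
BASE_URL = "http://dl.acs.org.au/index.php/index/oai"


def oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# Step 1: ask which metadata formats the repository can supply.
formats_url = oai_request(BASE_URL, "ListMetadataFormats")

# Step 2: request all records in one of those formats (here Dublin Core).
records_url = oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(formats_url)
print(records_url)
```

The XML returned by each request could then be fetched with any HTTP client and parsed in the same way as a static repository file.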

Metadata and Electronic Document Management for Electronic Commerce by Tom Worthington is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Australia License.