Metadata and Electronic Document Management

Introduction

As part of ANU Course COMP3410 in 2001 a case study was presented on the electronic publishing requirements of the Australian Computer Society (ACS). This case study was supplemented in 2002 with a discussion of the concept of Preflight systems and an investigation of the generation of XML document templates. In 2003 requirements for a prototype system were added. In the first half of 2004 an ANU student built a prototype system for the ACS in response to these requirements. Additional features have been detailed to enhance the system in further student projects. The system would allow authors to submit papers on-line and have them automatically converted into an XML-based format ready for publication, and the accompanying metadata would allow the documents to be automatically included in digital repositories.

It is proposed to interface the prototype ACS system to the DSPACE digital repository under construction at the ANU. A first step would be to have the metadata for a paper automatically extracted from the paper when lodged. The next step would be to have the list of all papers by ANU authors automatically included in the ANU repository as soon as they are published.

The metadata could automatically populate publication lists, research gateways and quality assessment data warehouses. After it was demonstrated to work with one journal, the software and expertise could be made available to allow any scholarly journal to be similarly incorporated. This data would be available to government and other organisations assessing research efforts as well as researchers and the general public.

It is proposed to transform the process of distribution of scholarly information by providing the metadata and the publications in formats which can be distributed and reformatted by popular web tools. This would allow many more potential readers to see the content in a more useful format than a facsimile of a black and white paper journal.

Capturing Australia's Scholarly Output

At a roundtable on "Changing Research Practices in the Digital Information and Communication Environment", held in June 2004 at the National Archives of Australia in Canberra, a thought experiment was conducted:

Thought experiment

From: "Governmental Policy Frameworks", Dr Evan Arthur, Department of Education, Science and Training, 2004, URL:http://www.humanities.org.au/NSCF/PowerPoints/NSCF%20(Arthur).ppt

Automated Capture

It was proposed that lodging the article would automatically update archives with the publication details:

  • ... These actions lead to automatic updating of
    • the researcher's open access publication list
    • the university's open access record of staff research activity
    • the ARC's open access record of research activity related to its grants
    • a gateway site providing sophisticated, industry tailored access to research activities in Australian research institutions
    • the publicly accessible data warehouse which provides input into quality assessments of Australian research institutions
From: "Governmental Policy Frameworks", Dr Evan Arthur, Department of Education, Science and Training, 2004, URL:http://www.humanities.org.au/NSCF/PowerPoints/NSCF%20(Arthur).ppt

    Such a system could be demonstrated with the enhanced Xpub prototype. Articles would then be in a format suitable for repositories.

    Simpler Archive Proposed

    The "thought experiment" can be simplified if an article to be lodged already includes the required metadata embedded in it. This metadata can automatically populate the repository.

    The step of lodging the article can be eliminated if the article (with metadata) is available from the publisher's repository. This would require keeping a list of accepted publications to be harvested automatically. Such a list and harvest system is used by the Australian Government to create the index for its web sites. The author's name and affiliation would have to be correctly identified in the metadata so that the paper could be correctly attributed.

    1. Journal publishes metadata for all papers in machine readable format on-line
    2. Institution archive scans metadata for its authors
    3. Institution publishes its authors' metadata
    4. ARC ingests metadata from the institution (checks against publisher)
    5. Gateways provide industry tailored indexes to research
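
Steps 1 and 2 of the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Author` and `Paper` classes, the `affiliation` field and the `journal.example` URLs are hypothetical, and the Refer and BibTeX files discussed later do not carry an affiliation field at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    name: str
    affiliation: str  # hypothetical field, not present in the Refer/BibTeX examples


@dataclass(frozen=True)
class Paper:
    title: str
    url: str
    authors: tuple


def papers_for_institution(published, institution):
    """Step 2: keep papers with at least one author from the institution."""
    wanted = institution.casefold()
    return [
        paper for paper in published
        if any(a.affiliation.casefold() == wanted for a in paper.authors)
    ]


# Step 1: placeholder metadata as a journal might publish it on-line.
journal_feed = [
    Paper("Paper A", "http://journal.example/a.pdf",
          (Author("Smith, J", "ANU"), Author("Lee, K", "MIT"))),
    Paper("Paper B", "http://journal.example/b.pdf",
          (Author("Wong, P", "MIT"),)),
]

anu_papers = papers_for_institution(journal_feed, "anu")
print([paper.title for paper in anu_papers])
```

Matching on an exact, case-insensitive affiliation string is the simplest possible rule; a real archive would need a more robust way of identifying its authors.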

    To simplify the harvest process, the OAI Static Repository format could be used by participating journals to publish their list of articles.

    The information in the university archive could then be used to automatically construct a gateway site of industry tailored research activities and for quality assessments of Australian research institutions. The university list of publications could be automatically audited by software which checks each claimed paper against the publisher's official archive.
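
The audit described above amounts to checking each claimed paper against the publisher's list. A minimal sketch, assuming papers are identified by their URL (the `audit_claims` function and the `journal.example` addresses are hypothetical):

```python
def audit_claims(claimed_urls, publisher_urls):
    """Split an institution's claimed papers into verified and unverified.

    A claim is verified when the same URL appears in the publisher's list.
    """
    official = set(publisher_urls)
    verified = [url for url in claimed_urls if url in official]
    unverified = [url for url in claimed_urls if url not in official]
    return verified, unverified


# Placeholder data: the publisher lists two papers, the university claims two.
publisher_list = ["http://journal.example/a.pdf", "http://journal.example/b.pdf"]
university_claims = ["http://journal.example/a.pdf", "http://journal.example/z.pdf"]

verified, unverified = audit_claims(university_claims, publisher_list)
print("verified:", verified)
print("unverified:", unverified)
```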

    In order to obtain a record of related research activity, there would need to be a further item of metadata matching articles to grants. This could be done in the university archive, or in the original paper with an acknowledgement.

    DSPACE

    DSpace is an open source software platform that enables institutions to:

    From: "DSpace System Documentation", Tansley, Mick Bass, Margret Branschofsky, Greg McClellan, David Stuve, Version: 1.1.1-1, 17-Sep-2003, MIT and Hewlett Packard, URL: http://dspace.org/technology/system-docs/introduction.html

    DSpace has been implemented at the ANU and is accepting collections of information. In addition to searches of the ANU's collection, searches of other institutions are possible. However, the limiting factor is the manual work needed to import information to the system.

    Modifications to DSPACE's Ingest Process

    The batch item importer is an application, which turns an external SIP (an XML metadata document with some content files) into an "in progress submission" object. The Web submission UI is similarly used by an end-user to assemble an "in progress submission" object.

    From: "DSpace System Documentation", Tansley, Mick Bass, Margret Branschofsky, Greg McClellan, David Stuve, Version: 1.1.1-1, 17-Sep-2003, MIT and Hewlett Packard, URL: http://dspace.org/technology/system-docs/functional.html#ingest

    To implement the "thought experiment" the DSPACE Web submission UI would be modified to examine an individual document submitted and extract its metadata. This could then be displayed to the user for correction. A step would be added before the batch item importer to manufacture a SIP by transforming the XML metadata on the publisher's web site. DSPACE code is available online.
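
The extraction step could look something like the following sketch. It does not use any DSPACE code or API. It assumes, purely for illustration, that the submitted document is XHTML with Dublin Core metadata embedded as `<meta name="DC.…">` elements; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> values from a document."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            # Repeated elements (e.g. several creators) accumulate in a list.
            self.metadata.setdefault(name[3:].lower(), []).append(content)


def extract_metadata(xhtml_text):
    parser = DublinCoreMetaParser()
    parser.feed(xhtml_text)
    parser.close()
    return parser.metadata


# Placeholder document standing in for a submitted paper.
sample = """<html><head>
<title>Example paper</title>
<meta name="DC.title" content="Example paper" />
<meta name="DC.creator" content="Smith, Jane" />
<meta name="DC.creator" content="Lee, Kim" />
</head><body><p>Text</p></body></html>"""

extracted = extract_metadata(sample)
print(extracted)
```

The resulting dictionary is what would be displayed to the user for correction before the submission proceeds.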

    Making Scholarly Work Interesting and Accessible

    <!-- Default stylesheet -->
    <link rel="stylesheet" href="../../all.css" type="text/css" />
    <!-- For print devices -->
    <link rel="stylesheet" href="../../print.css" media="print" type="text/css" />
    <!-- For projection devices -->
    <link rel="stylesheet" href="projection.css" media="projection" type="text/css" />
    <!-- For handheld devices -->
    <link rel="stylesheet" href="../../handheld.css" media="handheld" type="text/css" />

    XHTML Basic Code, From: Website Design, Tom Worthington, 2004, URL: http://www.tomw.net.au/2004/wd/index.html

    Before going on to show how to extract the metadata from scholarly works for inclusion in an archive, the question has to be asked: will anyone ever read them? Dull paper journals are not markedly improved by conversion to a dull electronic format. Rows of dusty shelves in a library are not improved by rows of dusty electronic entries in a digital archive.

    It is suggested that the opportunity be taken to make scholarly work more interesting to a wider audience. PDF formatting of papers has been limited to a fixed paper-like format. Web based formatting allows their style to be changed using CSS, to suit screen based reading, mobile devices, and presentation to an audience on a large screen. Documents can be machine translated into other languages.

    Masticating Documents

    The term "ingest" is used in the DSPACE system to describe the process of incorporating an electronic document and its metadata into an electronic archive. Therefore "masticate" seems a suitable term to describe the preceding step of breaking the document into ingestible items. This is the step which Xpub would carry out for ACS publications. However, it needs to be kept in mind that extracting the metadata is a much easier task than converting the entire content of the paper to an e-publishing format. The metadata required is not much more than that already provided for bibliographic services.

    Journal and conference indexes are traditionally provided in formats such as Refer and BibTeX. These formats can be converted to those needed by digital repositories (new documents would have this metadata captured on creation). The resulting metadata files can be placed on the publisher's web site for harvesting tools used by readers and by archives.

    %0 Conference Proceedings
    %A Aa, Tom Vander
    %A Eeckhout, Lieven ...
    %D 2002
    %T Optimizing a 3D Image Reconstruction...
    %O Seventh Asia-Pacific ... Conference (ACSAC2002)
    %E Lai, Feipei
    %E Morris, John
    %I ACS
    %C Melbourne, Australia
    %P 119-126
    %S Conferences in Research and Practice ...
    %K CRPITVol6
    %O confpapers/CRPITV6Aa.pdf
    

    From: "Refer file of all papers", CRPIT, 2004, URL: http://crpit.com/CRPIT.refer
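
Reading a Refer file such as the one above is straightforward, because each line is a `%` tag followed by a value and records are separated by blank lines. The parser below is a minimal sketch (the `parse_refer` name is hypothetical, and continuation lines are simply ignored). The sample uses only those fields of the record above that are shown in full.

```python
def parse_refer(text):
    """Parse Refer records into a list of {tag: [values]} dictionaries."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("%") and len(line) >= 2:
            tag, value = line[1], line[2:].strip()
            # Tags such as %A and %O may repeat, so values are kept in lists.
            current.setdefault(tag, []).append(value)
    if current:
        records.append(current)
    return records


sample = """\
%0 Conference Proceedings
%A Aa, Tom Vander
%D 2002
%E Lai, Feipei
%E Morris, John
%I ACS
%C Melbourne, Australia
%P 119-126
%K CRPITVol6
%O confpapers/CRPITV6Aa.pdf
"""

records = parse_refer(sample)
print(records[0]["A"], records[0]["O"])
```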

    Same metadata in BibTex

    @inproceedings{CRPIT-6-119-126,
       Author = {Aa, Tom Vander and Eeckhout...},
       Title = {Optimizing a 3D Image ...},
       BookTitle = {Seventh Asia-Pacific ... },
       Editor = {Lai, Feipei and Morris, John},
       Series= {Conferences in Research ...},
       Address= {Melbourne, Australia},
       Publisher = {ACS},
       Volume = {6},
       Pages = {119-126},
       Year = {2002} }

    From: "BiBTeX file of all papers", CRPIT, 2004, URL: http://crpit.com/CRPIT.bib

    Transforming metadata

    While Refer and BibTex contain most of the needed bibliographic information, it would be more convenient in an XML format for use in XML based systems. There are utility programs available to convert between bibliographic formats and to XML versions of these formats (such as BibXML). The XML versions can then be transformed using XSLT into other XML formats.

    As an example:

    1. Refer to Bibtex using InterBib
    2. BibTex to BibXML using bib2xml or online version

    There are a number of practical problems to be overcome with such a process. The ACS CRPIT BibTex file does not contain the URI of the published paper. Therefore the Refer file is used. The Refer file is first converted to BibTex and then to BibXML. The Refer version contains only a partial URI (such as "confpapers/CRPITV6Aa.pdf"), so the rest of the URI needs to be added.
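
Completing the partial URI is a one-line operation in most languages. A Python sketch, assuming the base address `http://crpit.com/` (the prefix used for the CRPIT links elsewhere in this document); the `complete_uri` helper is hypothetical:

```python
from urllib.parse import urljoin

BASE_URI = "http://crpit.com/"  # assumption: the publisher's site root


def complete_uri(partial, base=BASE_URI):
    """Resolve a partial path from the Refer file against the site root.

    Values that are already absolute URIs are returned unchanged.
    """
    return urljoin(base, partial)


print(complete_uri("confpapers/CRPITV6Aa.pdf"))
# prints http://crpit.com/confpapers/CRPITV6Aa.pdf
```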

    BibXML to RSS

    <?xml version="1.0" ?>
    <rss version="2.0">
    <channel>
    <title>CRPIT</title>
    <link>http://crpit.com/</link>
    <description>Conferences in Research and Practice in Information Technology</description>
    <language>en</language>
    <item>
    <title>Fast Segmentation of Large Images</title>
    <link>http://crpit.com/confpapers/CRPITV16Crisp.pdf </link>
    </item>
    

    Converted using XSLT

    XSLT used for BibXML to RSS

    <xsl:template match="REF">
    <item>
    <title><xsl:value-of select="TITLE" /></title>
    <link>http://crpit.com/<xsl:apply-templates select="UNRECOGNIZED/ITEM[TAG='note']"/>
    </link>
    </item>
    

    XSLT used, Tom Worthington, 2004, URL: crpit.xsl
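
For comparison, the same kind of RSS file can be produced without XSLT. The following Python sketch builds a feed with the standard library; it is an alternative illustration, not the method used for CRPIT, and the `build_rss` function is hypothetical. The channel and item values are those of the RSS example above.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, channel_description, items):
    """Return an RSS 2.0 document as a string; items are (title, link) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    ET.SubElement(channel, "language").text = "en"
    for title, link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


feed = build_rss(
    "CRPIT",
    "http://crpit.com/",
    "Conferences in Research and Practice in Information Technology",
    [("Fast Segmentation of Large Images",
      "http://crpit.com/confpapers/CRPITV16Crisp.pdf")],
)
print(feed)
```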

    The XML version can then be made available to harvesting programs to collect the metadata. RSS is intended for use directly by readers. OAI Static Repository is for archives, which then make the data available to readers.

    RSS

    RSS (Really Simple Syndication) is a Web content syndication format usually used for news items, but applying it to papers might make them more accessible. As an example, the ACM "Queue" magazine has an RSS "feed" button on the bottom right of its home page.

    <?xml version="1.0" ?>
    <rss version="2.0">
    <channel>
    <title>ACM Queue</title>
    <link>http://www.acmqueue.com/</link>
    <description>Tomorrow's Computing Today</description>
    <language>en-us</language>
    <item>
    <title>Samba Does Windows-to-Linux Dance</title>
    <link>http://acmqueue.com/?...pid=171</link>
    <description>Mounting remote Linux ...</description>
    </item>
    

    From: "RSS feed", Queue magazine, ACM, 2004, URL: http://acmqueue.com/rss.rdf

    RSS does not contain sufficient metadata to be used for an institutional repository. But it could be useful in encouraging wider readership of papers.

    OAI Static Repository

    OAI Static Repository is a more complete (and complex) XML format designed for digital archives. This would require a more complex XSLT transformation to create.

    <ListRecords metadataPrefix="oai_dc">
        <oai:record> 
          <oai:header>
            <oai:identifier>oai:arXiv:cs/0112017...
            <oai:datestamp>2001-12-14</oai:datestamp>
          </oai:header>
          <oai:metadata>
            <oai_dc:dc ...
              <dc:title>Using Structural Metadata ...
              <dc:creator>Dushay, Naomi</dc:creator>
              <dc:subject>Digital Libraries</dc:subject> 
              <dc:description>With the increasing ...
            </oai_dc:dc>
          </oai:metadata>
        </oai:record>
    

    From: "Specification for an OAI Static Repository and an OAI Static Repository Gateway Protocol", Version 2.0 of 2002-06-14, URL: http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm

    OAI Static Repository, while more complex than RSS, is conceptually similar. The details of a list of published documents are provided in a static file which can be harvested by a remote system. This file can simply be placed on the publisher's web site, alongside the Refer, BibTex and RSS files.
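
A single record of the kind shown in the excerpt above can be generated with a few lines of Python. This is a partial sketch only: it builds one `oai:record` element with Dublin Core metadata and omits the enclosing Static Repository elements. The `build_record` function, the identifier and the title, creator and date values are placeholders.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Use the conventional prefixes when the tree is serialised.
for prefix, uri in (("oai", OAI), ("oai_dc", OAI_DC), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def build_record(identifier, datestamp, title, creators):
    """Build one oai:record element carrying oai_dc metadata."""
    record = ET.Element(f"{{{OAI}}}record")
    header = ET.SubElement(record, f"{{{OAI}}}header")
    ET.SubElement(header, f"{{{OAI}}}identifier").text = identifier
    ET.SubElement(header, f"{{{OAI}}}datestamp").text = datestamp
    metadata = ET.SubElement(record, f"{{{OAI}}}metadata")
    dc = ET.SubElement(metadata, f"{{{OAI_DC}}}dc")
    ET.SubElement(dc, f"{{{DC}}}title").text = title
    for creator in creators:
        ET.SubElement(dc, f"{{{DC}}}creator").text = creator
    return record


# Placeholder values, not taken from any real repository.
record = build_record("oai:journal.example:paper-1", "2004-01-01",
                      "Example paper", ["Smith, Jane", "Lee, Kim"])
print(ET.tostring(record, encoding="unicode"))
```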

    Commercial Applications

    While scholarly publishing aims to further knowledge rather than to make a commercial gain, its requirements are similar to those of commercial publishing. The same technology can be used for large commercial e-publishing systems, such as the one sought in a recent tender by the OECD:

    OECD ... releases approximately 150 new books a year in English and 100 in French... and publishes periodicals, loose-leaf titles, interactive statistical databases and one abstracting and indexing service

    OECD is seeking a supplier who can:

    From: "Call for Tender for Online Publishing Service", OECD/ICT/EXD/PCM/PAC/MKT/04/139, OECD, 2004, URL: http://www.oecd.org/dataoecd/22/22/33638479.pdf

    Sales of publications contributed 12 million Euro to the OECD in 2003.