Metadata and Electronic Document Management

Introduction

As part of COMP3410 in 2001 a case study was presented on the electronic publishing requirements of the Australian Computer Society (ACS). The ACS has been publishing for thirty years in support of its mission and objects.

"The ACS", Australian Computer Society, 2003, URL: http://www.acs.org.au/static/national/purpose.htm

This work was supplemented the following year with a discussion of the concept of Preflight systems and an investigation of the generation of XML from ACM and IEEE document templates using the OpenOffice.org product. In 2003 a draft of this document was presented, discussing what is needed to build a prototype system for the ACS. In 2004 an ANU student built a prototype system in response to these requirements for the ACS. This document gives some background on the issues with scholarly publishing and guidance on how the prototype might be expanded.

How free and open should access to scholarly research be?

The issue of how free and open access to scholarly research should be, and how to make it that way, was explored on ABC Radio in 2001:

On the eighth day God created the Internet so that eventually everyone would know everything. But mankind didn't want to share, and created new technologies to control the miracle of the Internet, and knowledge became a commodity.

Scientists are the first to rebel, and 26 000 have signed a petition. After the first of September they'll refuse to cooperate unless scientific knowledge is set free.

From: "Knowledge Indignation: Road Rage on the Information Superhighway", Background Briefing, ABC Radio National, August 12th 2001, URL: http://www.abc.net.au/rn/talks/bbing/stories/s345514.htm

Petition from Public Library of Science

The petition referred to is from the Public Library of Science. An advocacy group made up of 11 people from US-based academic institutions and one from a UK institution proposed the establishment of international online public libraries of science with the complete text of all published scientific articles:

We believe that the permanent, archival record of scientific research and ideas should neither be owned nor controlled by publishers, but should belong to the public, and should be made freely available.

We support the establishment of international online public libraries of science that contain the complete text of all published scientific articles in searchable and interlinked formats.

From: "Open Letter", Public Library of Science, Patrick O. Brown and Michael Eisen, 2001, URL: http://www.plos.org/about/letter.html.

The group claimed 26,144 researchers from 170 countries signed the open letter urging publishers to allow research reports from their journals to be publicly available. The web site for the group is maintained by Patrick O. Brown of Stanford University School of Medicine and the Howard Hughes Medical Institute, and Michael Eisen of the Lawrence Berkeley National Lab and University of California at Berkeley.

There was no subsequent boycotting of traditional publishers, but there has been a change in the way research publishing is done. An interesting question is the position of information technology researchers on this, given their role in creating the technology used for electronic publishing.

E-Publishing at ACS

The Australian Computer Society (ACS) publishes:

Some editions of some publications are made available free on-line in PDF or web format. However, there is no overall digital library. The ACS is now considering publishing strategies, including e-publishing.

Open Archives Initiative

Activities such as the Open Archives Initiative are attempting to construct a virtual library of material using distributed document archives and shared metadata:

Digital Library Federation Encourages Use of Open Archives Initiative The Digital Library Federation (DLF) is supporting the development of a small number of Internet gateways through which users will access distributed digital library holdings as if they were part of a single uniform collection. The gateways will be built using the OAI Metadata Harvesting Protocol. DLF gateways will contribute to a practical evaluation of the OAI's harvesting technique and its application within libraries to encourage digital collection managers to expose metadata and build services.

From: Open Archives Initiative, URL: http://www.openarchives.org/, 2001

Organisations now considering electronic publication strategies can consider an integrated approach using newer XML tools to create and maintain content. The ACS has a tradition of providing the content of its journal free for non-profit use. This could be extended into an electronic edition in a format suitable for direct citation, annotated with metadata in a format suitable for harvesting by specialised virtual library tools as well as traditional web search engines. The content could also be available for use in multimedia conference and training formats.
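As a rough illustration of what metadata harvesting involves, the sketch below builds the kind of HTTP request a harvester would send to a repository under version 2.0 of the OAI protocol (a "ListRecords" request for Dublin Core records). The repository address and the function name are hypothetical placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical repository address, used only for illustration.
BASE_URL = "http://archive.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if from_date:
        params.append(("from", from_date))  # harvest only records changed since this date
    if set_spec:
        params.append(("set", set_spec))    # harvest only one named set of records
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2004-01-01"))
```

A publisher exposing its papers this way would only need to answer such requests with XML records; the gateways described above would do the rest.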

ACM Digital Library

A pioneer of e-publishing for IT has been the Association for Computing Machinery (ACM):

The Association for Computing Machinery (ACM) is a professional society that publishes research journals and magazines in computer science. It also organizes a wide variety of conferences, many of which publish proceedings. ACM is typical of the publishers that have moved rapidly into electronic publication of conventional journals. In 1993, the ACM decided that its future publication process would be a computer system that creates a database of journal articles, conference proceedings, magazines and newsletters, all marked up in SGML. Subsequently, ACM also decided to convert large numbers of its older journals and build a digital library covering its publications from 1985. The digital library will eventually extend back to ACM's foundation in 1948.

From: "Preservation of Scientific Serials: Three Current Examples", William Y. Arms, The Journal of Electronic Publishing, December 1999, Volume 5, Issue 2, ISSN 1080-2711, URL: http://www.press.umich.edu/jep/05-02/arms.html

The ACM collection was made available on-line in 1997. The web interface allows the contents pages of the journals to be browsed and the metadata searched. New content is created in SGML, then web, PDF and print versions are generated from that. The online service is by paid subscription to members, non-members and institutions, or by sales per article. The service has proved popular and ACM is considering discontinuing some print titles.

ACM journals accept articles in a number of electronic formats using supplied templates. The PDF versions of documents generated are close in format to the print editions, but the HTML versions use a different format more suited to on-line viewing. Graphics are shown as small thumbnail versions, with links to high resolution versions.

E-publishing not easy

The minutes of the ACM Publications Board show the considerable complexities and manual processing steps which had to be overcome:

... the current track 1 production process:

1. The paper is received from EIC, and is logged into the system.

2. The paper is converted from whatever original format into SGML (requires intervention). For mathematics, ACM requires that minimum customization be inserted into LaTeX. ....

3. The SGML is copy edited (by the managing editors). ... email notification to the lead author to let them know to expect a galley in one week and that they will have 48 hours to respond to the galley. ...

From: "Minutes of the Publications Board Meeting", ACM, May 5, 2000, URL: http://www.acm.org/pubs/minutes_05-05-00.html

E-publishing still problematic

Some of these issues were to do with limitations in electronic publishing software, which are still apparent today:

4. The reference section is created separately from the SGML file because it has to be citation-linked ....

5. ... Proof is sent to the author before any tweaking takes place. After feedback from the author, layout is tweaked ...

6. Problems in layout: tables with multiple columns which have different widths (the auto-table generator makes all columns of equal widths, so these must be tweaked by hand during composition).

7. Illustrations and figures are processed separately. If received figures are in TIF or EPS, they can be electronically processed and inserted during composition. Many times, the EPS file is non-standard ...

From: "Minutes of the Publications Board Meeting", ACM, May 5, 2000, URL: http://www.acm.org/pubs/minutes_05-05-00.html

Formats

Given the rapid development of XML, it was considered better for the ACS to wait until the technology was more widely available, rather than implement an SGML/PDF system which would then have to be replaced.

IEEE Xplore, the online delivery system for all the IEEE's journals, magazines, conference proceedings, and standards, is now bigger and better than ever, thanks to its latest release, launched in December. ...

Another enhancement is full-text HTML formats for issues of IEEE Spectrum and Proceeding of the IEEE going back to January 2002. PDF versions are still available, but articles presented in HTML are easier to navigate, Williams says.

From: "Upgrade Makes IEEE Xplore Easier to Explore", Erica Vonderheid, IEEE, 23 February 2004

JRPIT PDF example

The ACS's Journal of Research and Practice in Information Technology (JRPIT) is published in a relatively efficient PDF format (only 39 kbytes for a 10 page paper with one photo).

PDF example

"The Future of Open Source Software", Bill Appelbe, JRPIT, Volume 35, No. 4, 2003, URL: http://www.acs.org.au/jrpit/JRPITVolumes/JRPIT35/JRPIT35.4.227.pdf

JRPIT PDF Detail

Zooming in far enough to read the text results in lines dropping off the right-hand side of the screen:

Detail from PDF example

Detail from "The Future of Open Source Software", Bill Appelbe, JRPIT, Volume 35, No. 4, 2003, URL: http://www.acs.org.au/jrpit/JRPITVolumes/JRPIT35/JRPIT35.4.227.pdf

Defining an XML Format

Since at least 1994, ACM has been working on systems to convert papers to a structured electronic format (originally SGML and later XML). However, the structure used has not been made public and no generally accepted format for publication of IT papers exists. It is therefore necessary to define a format. Rather than define a new XML format, a subset of XHTML was proposed.

While XML would seem the obvious encoding to use, this would require additional processing to create a document which can be viewed on pre-XML web browsers. This could be done by storing the XML document and converting it to HTML for display. However, HTML was originally designed for publishing scientific research papers, so it would seem reasonable to use this format for IT papers. Using HTML would remove the need to transform documents for display, allowing one version to be used for creation, storage, on-screen display and printing.

This proposal concerns the management of general information about accelerators and experiments at CERN. It discusses the problems of loss of information about complex evolving systems and derives a solution based on a distributed hypertext system.

From: "Information Management: A Proposal", Tim Berners-Lee, CERN, March 1989, URL: http://www.w3.org/History/1989/proposal-msw.html

XHTML is a version of HTML modified slightly to conform to XML syntax. Older web browsers which do not support XML directly can display XHTML documents reasonably well. With styling added through CSS, this can provide a high quality display on advanced web browsers, while still being readable on older devices. Additional formatting commands can be used to provide a printed display similar to a PDF document. Web browsers with limited formatting capabilities (because they are older, for hand held devices or for the disabled) will ignore the advanced formatting but still render a readable web page.

Some advanced features, such as MathML for mathematical equations, will not render on pre-XML web browsers. The conventional approach to this has been to render equations as an image (usually in GIF format) for older browsers. However, this requires generating multiple versions of the document. Instead it will be assumed that IT professionals working on advanced IT concepts will have more modern browsers with MathML and similar features. Those without these features will still be able to follow the discussion of the equations, from reading the accompanying text in the paper, even if they can't see the equations.

Preflight

The process of checking electronic documents is called "preflight" in the publishing industry:

Preflight - a term used to test the validity and completeness of a prepared DTP document, ready for supply to a bureau service provider.

From: "Glossary of Terms", Goprint, 2001 (Archived copy)

Academic publishers appear, in general, not to make use of preflight processes. Using them could reduce the manual effort required by the publisher and also the work required of authors. As an example, submission processes require information which is already included in the text of the paper (such as title, author and affiliation) to be entered again in a separate on-line form. Examples are the ANU Digital Theses Deposit Procedures and the E-Print Repository Deposit Procedures. If preflight processes were used, this information could be extracted from the text of the document and presented to the author for checking.
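A minimal sketch of such a preflight check follows. It assumes the paper has already been reduced to a list of (style name, text) pairs, one per paragraph, and that the template marks bibliographic details with the style names shown. The style names and the function are illustrative assumptions, not those of any particular publisher.

```python
# Styles a submission is assumed to need; these names are illustrative only.
REQUIRED_STYLES = ("Title", "Author", "Affiliation", "Abstract")


def preflight(paragraphs):
    """Check a list of (style_name, text) pairs.

    Returns (metadata, problems): metadata maps each required style to the
    first non-empty text found; problems lists the required styles not found.
    """
    metadata = {}
    for style, text in paragraphs:
        text = text.strip()
        if style in REQUIRED_STYLES and text and style not in metadata:
            metadata[style] = text
    problems = [f"missing or empty: {s}" for s in REQUIRED_STYLES if s not in metadata]
    return metadata, problems


if __name__ == "__main__":
    paper = [
        ("Title", "This Is the Title of the Paper"),
        ("Author", "A. N. Author"),
        ("Abstract", "A short summary."),
        ("Primary Head", "1. INTRODUCTION"),
    ]
    found, problems = preflight(paper)
    print(found)     # details to show the author for checking
    print(problems)  # what must be fixed before submission
```

The extracted details would pre-fill the submission form, and the list of problems would be returned to the author before the paper reaches an editor.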

Most authors will be unfamiliar with the discipline of using a template and it is not clear if they can be easily educated in its use. However, if popular journals use the same (or similar) style sheets and this speeds up submission, it may be possible. Also, use of a style sheet should allow creation of adaptable documents, rather than just print-like PDF documents.

Using OpenOffice.org to Translate Documents

An example of a document converted using OpenOffice.org is "ICT Development in Australia - A Strategic Policy Review", prepared for the Australian Computer Society by Professor Houghton. The web adaptation of the report was created from the MS-Word version. This was done by first importing the MS-Word document into OpenOffice.org and saving it as HTML. The HTML was run through the "Tidy" utility to replace formatting commands throughout the document with styles. The table of contents was then manually relinked to the document sections and ALT text placed on images.

Using OpenOffice.org to produce HTML has limitations. A better approach may be to use OpenOffice.org's internal XML format as an intermediate format. This retains more information about the original MS-Word document than is present in an HTML translation.

As an example the Microsoft Word Style Files for ACM Journals and IEEE Transactions were converted to OpenOffice format:

Template Translation to OpenOffice XML Format

                  ACM                         IEEE
Template          instructions.ms_word        TRANS-JOUR.DOC
Size (Kbytes)     23                          111
Converted File    instructions.ms_word.sxw    TRANS-JOUR.sxw
Size (Kbytes)     10                          39
XML               icontent.xml                tjcontent.xml

An OpenOffice file is a ZIP-compressed archive containing a directory of files. The text of the word processing document is stored in a file named "content.xml" in the archive. Images and other binary files are stored in sub-directories.
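Because the container is an ordinary ZIP archive, the document text can be pulled out with standard tools. The sketch below uses Python's zipfile module. So that it runs without a real document, it first builds a tiny stand-in archive in memory; the stand-in is not a valid OpenOffice file and its contents are made up for illustration.

```python
import io
import zipfile


def read_content_xml(source):
    """Return the text of content.xml from an OpenOffice document.

    `source` may be a file name or a file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read("content.xml").decode("utf-8")


def make_sample_document():
    """Build a tiny stand-in for a .sxw file in memory (not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", "<doc>This Is the Title of the Paper</doc>")
        archive.writestr("Pictures/figure1.png", b"")  # placeholder for an image
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_content_xml(make_sample_document()))
```

With a real converted template, the call would instead be read_content_xml("TRANS-JOUR.sxw").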

WP Styles Translate to XML

Styles from the original style sheet are reflected in text styles in the translated XML documents:

...

<text:p text:style-name="Title">This Is the Title of the Paper</text:p>

...

<text:p text:style-name="Primary Head">1. INTRODUCTION </text:p>

From: "Publishing Options for Academic Material", Tom Worthington, 2002, URL: http://www.tomw.net.au/2002/epo.html
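Since the style names survive as attributes, a program can pick out the parts of a paper by style. The following sketch lists each paragraph with its style name using Python's standard XML parser. The namespace address bound to the "text:" prefix is an assumption based on OpenOffice.org 1.x files and should be checked against a real content.xml; the wrapping "body" element is a simplification of the real file structure.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for the "text:" prefix in OpenOffice.org 1.x files;
# check the root element of a real content.xml before relying on it.
TEXT_NS = "http://openoffice.org/2000/text"

# Simplified sample: a real file has a larger root element and more content.
SAMPLE = f"""<body xmlns:text="{TEXT_NS}">
  <text:p text:style-name="Title">This Is the Title of the Paper</text:p>
  <text:p text:style-name="Primary Head">1. INTRODUCTION </text:p>
</body>"""


def styled_paragraphs(xml_text):
    """Return (style name, text) for every text:p element, in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        style = para.get(f"{{{TEXT_NS}}}style-name", "")
        result.append((style, "".join(para.itertext()).strip()))
    return result


if __name__ == "__main__":
    for style, text in styled_paragraphs(SAMPLE):
        print(f"{style}: {text}")
```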

Xpub

xpub is an XML-based electronic publishing system for scientific papers.

A prototype version of xpub is being produced as part of a 3rd Year Software Engineering Individual Project (COMP37X0) in the Department of Computing Science at The Australian National University. The project is being completed by Tim Wilson-Brown with supervision by Ian Barnes and Tom Worthington. The project is based on a concept developed by Ian and Tom...

The xpub project is being supported by a scholarship from the Australian Computer Society (ACS) Foundation.

From: "Welcome to the xpub website", Tim Wilson-Brown, 2004, URL: http://xml.anu.edu.au/index.php3?config=xpub&dummy=index.

Tim Wilson-Brown's "Xpub" is a prototype XML-based Preflight electronic publishing system for ACS papers. This was produced as part of a 3rd Year Software Engineering Individual Project (COMP37X0) in the Department of Computing Science at The Australian National University, using the ACS case study as its specification.

Xpub uses a modified version of the ACS's MS-Word template to make papers with styles, then OpenOffice.org to convert the documents to OpenOffice.org's XML format. XSLT is then used to transform the document to a more logical XML structure, with formatting removed (similar to what the Tidy program does for HTML). Further XSLT transformations then convert the logical XML to displayable XHTML Basic, with references to a standard CSS style sheet for formatting.

The initial Xpub prototype is able to recognize styles, such as the abstract. These are converted to logical equivalents, such as <abstract>, then to renderable <p class="abstract"></p>. Extraneous formatting, such as font styles, colours and sizes, is removed in this process. The only font information comes from the standard style sheet.
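The two-stage conversion can be sketched as follows. This is not Xpub's code: Xpub uses XSLT, whereas this sketch uses Python's standard XML library to show the same idea. The style names, the logical element names, the mapping tables and the OpenOffice.org namespace address are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "http://openoffice.org/2000/text"  # assumed OpenOffice.org 1.x namespace

# Hypothetical mapping from template style names to logical element names.
STYLE_TO_LOGICAL = {"Title": "title", "Abstract": "abstract", "Primary Head": "heading"}

# Hypothetical mapping from logical elements to XHTML (tag, class).
LOGICAL_TO_XHTML = {
    "title": ("h1", None),
    "abstract": ("p", "abstract"),
    "heading": ("h2", None),
    "para": ("p", None),
}


def to_logical(content_root):
    """Stage 1: keep the text and the role implied by the style, drop formatting."""
    paper = ET.Element("paper")
    for para in content_root.iter(f"{{{TEXT_NS}}}p"):
        style = para.get(f"{{{TEXT_NS}}}style-name", "")
        element = ET.SubElement(paper, STYLE_TO_LOGICAL.get(style, "para"))
        element.text = "".join(para.itertext()).strip()  # flattens font spans
    return paper


def to_xhtml(paper):
    """Stage 2: map logical elements to XHTML elements with class names."""
    body = ET.Element("body")
    for element in paper:
        tag, css_class = LOGICAL_TO_XHTML[element.tag]
        out = ET.SubElement(body, tag)
        if css_class:
            out.set("class", css_class)
        out.text = element.text
    return body


SAMPLE = f"""<doc xmlns:text="{TEXT_NS}">
<text:p text:style-name="Abstract">A <text:span text:style-name="T1">short</text:span> summary.</text:p>
</doc>"""

if __name__ == "__main__":
    logical = to_logical(ET.fromstring(SAMPLE))
    print(ET.tostring(logical, encoding="unicode"))
    print(ET.tostring(to_xhtml(logical), encoding="unicode"))
```

Keeping the logical form as a separate stage means other outputs, such as slides or a mobile version, could be generated from it without going back to the word processing file.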

Xpub has a number of limitations in converting more complex structures, other formats and in reliability, which may be addressed as further student projects in courses "Project and Operations Management" (ENGN3221_S2) and "Software Engineering Projects" (COMP4500, 3100/3500, 3110/6311).

More Requirements for Xpub 2

  1. Metadata: The bibliographic information from the document in a standard format (such as DC).
  2. Images: Document images converted to a suitable format and resolution.
  3. References: citations in the paper converted to a standard format and linked to the electronic document, where possible.
  4. Equations: Mathematical equations converted.
  5. Schema: A formal DTD or XML Schema defined for the intermediate and final formats of the document.
  6. Robust: The software made more reliable.
  7. Archive interface: Ensuring the document and metadata are in a format suitable for inclusion in scholarly archives. To demonstrate this the system could be interfaced to the ANU DSpace archive.
  8. Web Services interface: Allow conversion to be invoked using a web services transaction, similar to Amazon.com's implementation.
  9. Automatic indexes and tables of contents: A table of contents should be generated from the headings in the document and an index from key words in the document. These should display differently (or not at all) depending on the display device. For example the table of contents should be displayed by default in a window by a web browser, but hidden when the document is printed. The table should have keyboard shortcuts for use on a mobile phone or an accessible browser.
  10. Automatic generation of slides from paper: The format should include the option of identifying key points so that a presentation version can be generated automatically. The "slides" could be activated by a suitable viewer.
  11. Format for mobile device: An additional style sheet would be provided for automatic activation by a PDA, mobile phone or other hand held device to render the document for a small screen.
  12. Multiple publications: The software could be generalised to work with other scholarly publication formats, beginning with those of the ACM and IEEE.
  13. Cross platform: The software could be adapted to work with other operating systems, apart from Linux, such as MS-Windows and Unix (Mac).
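As an illustration of the first requirement, the sketch below writes bibliographic details as Dublin Core (DC) entries for the head of an XHTML page, following the common convention of "DC." prefixed meta names with a link to the Dublin Core element set. The function name is hypothetical, and the sample values are taken from the JRPIT paper cited earlier.

```python
from xml.sax.saxutils import quoteattr

DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def dublin_core_meta(fields):
    """Return XHTML <link>/<meta> lines for a list of (element, value) pairs."""
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}" />']
    for element, value in fields:
        # quoteattr adds the surrounding quotes and escapes &, < and >.
        lines.append(f'<meta name="DC.{element}" content={quoteattr(value)} />')
    return lines


if __name__ == "__main__":
    paper = [
        ("title", "The Future of Open Source Software"),
        ("creator", "Bill Appelbe"),
        ("date", "2003"),
    ]
    print("\n".join(dublin_core_meta(paper)))
```

The same details, once extracted, could also be supplied to a scholarly archive (requirement 7) rather than being retyped by the author.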

Projects

The tasks might be usefully divided into a number of projects:

  1. Input Formats and Templates
  2. Translator
  3. Document format standardisation
  4. Output formats (slides, mobile phone)
  5. Interface to an archive, such as DSpace

The client for ACS publications would be the ACS (Publications section, journal and conference editors). However, there may be scope for input from others.