XML and Office data

Tom Worthington

Tomw Communications Pty Ltd

tom.worthington@tomw.net.au

For: Open Publish 2003, Tuesday 29 July 2003 [1]


Abstract

Standards group OASIS have formed a technical committee to define a file format for office documents containing text, spreadsheets, charts, and graphics. The standard is to be based on the OpenOffice.org XML format specification. The features and limitations of this file format and the feasibility of it becoming a standard are discussed.

Introduction

Publishers in both traditional and electronic media use specialised tools to design, create and disseminate information. However, much of that information comes from authors who have only an office automation tool, typically a word processor. Much effort has been spent by the publishing industry in taking word processing documents and converting them into well structured electronic documents for input to their systems. Office automation systems which store documents in XML, such as Microsoft Office 2003 and OpenOffice.Org, are now becoming available. In theory this would allow authors to produce material in a format which could be easily imported into a publishing system. But the office automation systems use different XML formats for their documents.


Microsoft Office 2003 offers the potential of a de facto industry standard for XML office documents, due to the wide use of previous versions of Ms-Office. An alternative is OpenOffice.org's XML format. The standards group OASIS have formed a technical committee to define a formal standard, based on the OpenOffice.org XML file format, for text, spreadsheets, charts, and graphics.


The rest of this presentation will discuss the features of Ms-Office 2003 and OOO's XML formats, concentrating on word processing documents.

About OASIS

OASIS is a not-for-profit, global consortium that drives the development, convergence and adoption of e-business standards. Members themselves set the OASIS technical agenda, using a lightweight, open process expressly designed to promote industry consensus and unite disparate efforts. OASIS produces worldwide standards for security, Web services, XML conformance, business transactions, electronic publishing, topic maps and interoperability within and between marketplaces.
From: OASIS - Who We Are – Mission, Organization for the Advancement of Structured Information Standards, undated, URL: <http://www.oasis-open.org/who/>

OASIS started out as “SGML Open” in 1993 and changed its name in 1998. It concentrates on business level format standards, above the technical level handled by W3C. OASIS is essentially a vendor based organisation: of the eight directors, six are from IT companies (BEA Systems, Intel Corporation, Hewlett Packard, Microsoft, Sun Microsystems, IBM Corporation), one is from the US government (The Federal Reserve System) and one is the OASIS CEO [2].

Ms-Word 11

Microsoft is expected to release an upgrade of Microsoft Word as part of Microsoft Office during 2003. In a brief evaluation of Microsoft Word 11 in the Office System Beta 2 Kit 2003 [3], it looked and worked like other versions of Word, but allowed for working with XML. The results reported here relate to Ms-Word's ability to work with XML documents. It should be noted that the features of the released version of Word may differ from the Beta.

Some XML features of Ms-Word 11:

A 1 kbyte “hello world” text document took 24 kbytes when saved in DOC format, 12 kbytes in Ms-Word's XML format and 2 kbytes converted to HTML. Opening the source code of the XML version of the test document showed a neatly structured, Word-centric document. Ms-Word namespaces are declared first, then office document properties (such as author, title, last saved) are listed, followed by Word font declarations, a very long set of list declarations (for one short numbered list in the document), styles for the heading and paragraph used (and for a table not used) and the docPr document print settings.


After all the document definitions comes the small body of the document with the actual content. All structures in the body were qualified with "w" as in "w:body". There is a section and subsection around everything, and proofreading error declarations (<w:proofErr>) around text which failed a spelling check.
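As a rough illustration of the body structure described above, the following Python sketch pulls the paragraph text out of w:body and ignores the proofing markers. The namespace URIs, the wx:sect and wx:sub-section wrapper names, and the sample document are assumptions based on the released Word 2003 format, not taken from the Beta evaluated here.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (released Word 2003 format; the Beta may differ).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"

# Hypothetical, hand-written sample; real files carry many more declarations.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body>
    <wx:sect>
      <wx:sub-section>
        <w:p><w:r><w:t>Hello world</w:t></w:r></w:p>
        <w:p>
          <w:r><w:t>A </w:t></w:r>
          <w:proofErr w:type="spellStart"/>
          <w:r><w:t>mispelt</w:t></w:r>
          <w:proofErr w:type="spellEnd"/>
          <w:r><w:t> word</w:t></w:r>
        </w:p>
      </wx:sub-section>
    </wx:sect>
  </w:body>
</w:wordDocument>"""


def body_paragraphs(xml_text):
    """Return the text of each w:p inside w:body, ignoring all other markup."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{W}}}body")
    paragraphs = []
    for p in body.iter(f"{{{W}}}p"):
        # proofErr markers are empty elements, so collecting only the
        # w:t text nodes leaves them out without special handling.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return paragraphs


print(body_paragraphs(SAMPLE))
```

Because the section wrappers and proofing markers carry no text of their own, a converter that only needs the content can walk the w:p elements directly.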


For a document consisting of a few lines of text this is a lot of baggage, but it is logically laid out. The production release of the program might be more concise and it should be possible to write custom programs to remove unused declarations, and summarise used ones (in much the way the Tidy [4] program cleans up Ms-Word generated HTML). One disappointment is that the MS-IE 6.0 web browser did not render the XML document: although IE 6 can display XML documents, for Ms-Word XML documents it just displayed the XML source code.
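A custom clean-up program of the kind suggested above might look something like the following Python sketch, which drops style declarations that nothing in the body refers to. The element and attribute names (w:styles, w:style, w:styleId, w:pStyle, w:rStyle, w:tblStyle, w:basedOn, w:default) and the namespace URI are assumptions based on the released Word 2003 format, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"  # assumed URI


def q(name):
    """Qualify a local name with the (assumed) WordML namespace."""
    return f"{{{W}}}{name}"


def prune_unused_styles(root):
    """Remove w:style declarations that the body never refers to.

    Returns the list of removed style IDs.
    """
    styles = root.find(q("styles"))
    body = root.find(q("body"))
    if styles is None or body is None:
        return []
    # Style IDs referenced by paragraphs, runs and tables in the body.
    used = set()
    for tag in ("pStyle", "rStyle", "tblStyle"):
        for ref in body.iter(q(tag)):
            used.add(ref.get(q("val")))
    # A used style may inherit from another one, so keep its parents too.
    by_id = {s.get(q("styleId")): s for s in styles.findall(q("style"))}
    pending = list(used)
    while pending:
        style = by_id.get(pending.pop())
        parent = style.find(q("basedOn")) if style is not None else None
        if parent is not None and parent.get(q("val")) not in used:
            used.add(parent.get(q("val")))
            pending.append(parent.get(q("val")))
    removed = []
    for style_id, style in by_id.items():
        # Default styles apply to content that names no style, so keep them.
        if style_id not in used and style.get(q("default")) != "on":
            styles.remove(style)
            removed.append(style_id)
    return removed


# Hypothetical sample: one heading in use, one table style never used.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:styles>
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal"/>
    <w:style w:type="paragraph" w:styleId="Heading1">
      <w:basedOn w:val="Normal"/>
    </w:style>
    <w:style w:type="table" w:styleId="TableGrid"/>
  </w:styles>
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

root = ET.fromstring(SAMPLE)
removed = prune_unused_styles(root)
print(removed)
```

The same approach could in principle be extended to the font and list declarations, provided the references to them are traced in the same way.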

MS-Word and XML

In "Electronic Publishing Options for Academic Material" [5] I described some simple tests converting Ms-Word documents to XML using the OpenOffice.Org (OOO) word processor. I then decided to convert the same documents using Ms-Word 11. The two documents were examples of formatting information technology articles for submission to ACM and IEEE journals. Both documents are in .DOC format and so there was no problem opening them in Ms-Word 11. When saved as XML, the ACM document was 28 kbytes and the IEEE document was 146 kbytes. These are similar to the sizes of the original .DOC versions (23 and 111 kbytes).


As with translation by OpenOffice.org, the names of the styles in the original Ms-Word document survived the translation to Ms-Word XML. This is useful where parts of the document need to be automatically extracted by an electronic publishing system. The title in the original documents was marked "title"; the XML versions had w:styleId="Title". However, Ms-Word redefines styles which have spaces in their names, so "Heading 1" (with a space) becomes "Heading1" (no space). This will take a little more work for any conversion program.
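The extra work for a conversion program can be sketched in Python as follows: map the style name used in the word processor to the identifier found in the XML, then select the paragraphs carrying it. Only the removal of spaces comes from the test described above; the namespace URI, the w:pPr/w:pStyle structure and the sample document are assumptions based on the released Word 2003 format.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"  # assumed URI


def style_id(style_name):
    """Approximate the observed renaming: spaces are dropped from the name."""
    return style_name.replace(" ", "")


def paragraphs_with_style(root, style_name):
    """Return the text of every paragraph that uses the named style."""
    wanted = style_id(style_name)
    found = []
    for p in root.iter(f"{{{W}}}p"):
        ref = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        if ref is not None and ref.get(f"{{{W}}}val") == wanted:
            found.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return found


# Hypothetical sample with a title, a heading and an unstyled paragraph.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr><w:r><w:t>XML and Office data</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
    <w:p><w:r><w:t>Body text.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

document = ET.fromstring(SAMPLE)
print(paragraphs_with_style(document, "Title"))
print(paragraphs_with_style(document, "Heading 1"))
```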


Images in the Word XML File


The large size of the IEEE XML file was a surprise. Most of the space in the original was not text, but images. Therefore it would be expected that the XML for the text would be small. But Ms-Word stores the images inside the XML document as “binData”, which is base 64 encoded. This format is commonly used for sending binary files by e-mail as a sequence of characters. This differs from OOO's approach, which is to store images as separate binary files and then compress the text and images into one Zip archive.

The Microsoft approach to storing images in XML has the advantage of keeping the images in the one file with the text of the document. This would appear to have the disadvantage of making a much larger document file, as base 64 is an inefficient way to encode binary data (compared to OOO using native binary image files). But when compressed with the Zip format, as used by OOO, Word's XML files reduce to about the same size as the OOO version.
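The size argument can be checked with a small Python experiment. This is only a sketch: pseudo-random bytes stand in for an already compressed image, and zlib's deflate is used as a stand-in for the compression inside a Zip archive.

```python
import base64
import random
import zlib


def size_report(binary):
    """Return (raw size, base 64 size, size of the base 64 text after deflate)."""
    encoded = base64.b64encode(binary)
    compressed = zlib.compress(encoded, 9)
    return len(binary), len(encoded), len(compressed)


# Pseudo-random bytes as a stand-in for compressed image data (an assumption;
# real images will differ in detail).
rng = random.Random(0)
image_like = bytes(rng.getrandbits(8) for _ in range(30000))

raw, b64, zipped = size_report(image_like)
print(raw, b64, zipped)
```

Base 64 turns every 3 bytes into 4 characters, so the encoded form is a third larger than the original. Each character carries only 6 bits of information, however, so deflate can recover most of that overhead, which is consistent with the compressed Word XML ending up close to the size of the OOO file.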

OpenOffice.Org (OOO)

OpenOffice.Org (OOO) is an open source development of Sun Microsystems' StarOffice office automation product. Sun purchased StarOffice and attempted to sell the product as a rival to Microsoft Office, with limited success. Version 5.2, released in June 2000 [6], was the last pre-open source version. OpenOffice.org 1.0, which is available free on-line, has the same source code and file formats as StarOffice 6.0 (sold by Sun with some additional features).


OpenOffice files are stored as a Zip archive containing a directory of compressed files. The text of the word processing document is stored in a file labeled "content.xml" in the archive. Images and other binary files are stored in sub-directories. OOO comes with well developed filters for converting to and from Microsoft Office formats.
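Because the package is an ordinary Zip archive, the content can be reached with standard tools. The Python sketch below builds a tiny stand-in package in memory and reads it back; only the name "content.xml" comes from the format as described above, while the XML inside it and the "Pictures/figure1.png" member are hypothetical placeholders.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_package(source):
    """Return (text found in content.xml, names of the other archive members)."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("content.xml"))
        others = [name for name in package.namelist() if name != "content.xml"]
    # Join all character data, whatever the element names happen to be.
    text = " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())
    return text, others


# Build a stand-in package in memory; a real file would be opened by path.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("content.xml",
                     "<doc><p>Hello world</p><p>Second paragraph</p></doc>")
    package.writestr("Pictures/figure1.png", b"placeholder image bytes")
buffer.seek(0)

text, others = read_package(buffer)
print(text)
print(others)
```

The same function should accept the path of a saved word processing document in place of the in-memory buffer.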

In theory it should be possible to open the files which OpenOffice creates, using XML capable desktop publishing software. In practice, this does not work reliably. The DTD defining the structure of OpenOffice XML files is intended for documentary purposes only and does not appear to have been used to generate or verify the code. Attempting to open a word processing document generated by OOO 1.0 in Adobe FrameMaker Version 7.0 resulted in syntax errors in the DTD. These errors were subsequently corrected by the project:

Revision 1.71.2.1 May 31 2002 of drawing.mod <http://www.openoffice.org/source/browse/xml/xmloff/dtd/drawing.mod> shows two occurrences of: <!ATTLIST draw:text-box %draw-transform; >
From: Issue 6697, 2002-08-02, Project Issue Tracking: openoffice.org
Revision 1.31 May 6 2002 of chart.mod <http://www.openoffice.org/source/browse/xml/xmloff/dtd/chart.mod> has an occurrence of: fo:direction (ltr|ttb)
From: Issue 6698, 2002-08-02, Project Issue Tracking: openoffice.org

Also, OOO's format does not take advantage of recent XML based formats for specifying structure (such as XML Schema) or formatting. The result is that, while there are numerous articles about converting from OOO XML to other XML formats, these examples do not use the OOO DTD and only convert a subset of the format. An example of how OOO can be used is to translate documents created using a template in Ms-Word format [7].

The latest Beta release of OOO (version 1.1) has introduced an XSLT transformation capability for converting incoming XML document formats and creating saved XML formats (such as DocBook and XHTML). However, this facility is not as easy to use as Ms-Word 11's XSLT facility and the current filters supplied appear to be incomplete.

OASIS Standard Draft

OASIS issued a call for participation in a new “Open Office XML Format Technical Committee” in November 2002 [8]. The aim is to create a standard XML file format for office applications:



The resulting file format must meet the following requirements:
it must be suitable for office documents containing text, spreadsheets, charts, and graphical documents,
it must be compatible with the W3C Extensible Markup Language (XML) v1.0 and W3C Namespaces in XML v1.0 specifications,
it must retain high-level information suitable for editing the document,
it must be friendly to transformations using XSLT or similar XML-based languages or tools,
it should keep the document's content and layout information separate such that they can be processed independently of each other, and
it should 'borrow' from similar, existing standards wherever possible and permitted.
From: OASIS Open Office XML Format TC, OASIS, 16 December 2002, URL: <http://www.oasis-open.org/committees/office/charter.php>

The TC is basing its work on OOO's format. The first phase of the work is essentially to tidy up the existing format and document it. The second is to extend the specification. The work of the committee is documented in a publicly available list archive [9]. This includes threaded discussions of the options selected, for example in discussing which metadata options to use:

From: Philip Boutros <Philip.Boutros@stellent.com>
To: office@lists.oasis-open.org
Date: Fri, 24 Jan 2003 14:38:10 -0600
I thought I'd start the ball rolling on metadata in advance of Monday's call. Please forgive the "schema by example" nature of my examples. I would have presented the suggestions in a schema language (DTD, XSD, RelaxNG) but I'm not sure we've decided on one yet. ...
Option 1 Leave it alone.
Option 2 Leave the existing predefined metadata (meta:generator, dc:creator, etc.) as they are but extend meta:user-defined so it can contain more than just text types...
<meta:user-defined-date name="checkin-date">2003-01-24T13:47:12</meta:user-defined-date>
<meta:user-defined-text name="foo">Some text</meta:user-defined-text>
...
From: Metadata options, Philip Boutros, office@lists.oasis-open.org, Fri, 24 Jan 2003 14:38:10 -0600, URL: <http://lists.oasis-open.org/archives/office/200301/msg00030.html>
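To show what typed metadata of the kind proposed in Option 2 would offer a consuming program, here is a Python sketch that reads the two example elements into native values. It illustrates the quoted proposal only, not an agreed format; the namespace URIs and the office:meta wrapper element are assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

META = "http://openoffice.org/2000/meta"  # assumed URI for the meta: prefix

# The two elements are from the quoted message; the wrapper is an assumption.
SAMPLE = f"""<office:meta xmlns:office="http://openoffice.org/2000/office"
             xmlns:meta="{META}">
  <meta:user-defined-date name="checkin-date">2003-01-24T13:47:12</meta:user-defined-date>
  <meta:user-defined-text name="foo">Some text</meta:user-defined-text>
</office:meta>"""

# One converter per type suffix; unknown suffixes fall back to plain text.
CONVERTERS = {"date": datetime.fromisoformat, "text": str}


def user_defined(xml_text):
    """Map each user-defined name to a value converted by its type suffix."""
    values = {}
    prefix = f"{{{META}}}user-defined-"
    for element in ET.fromstring(xml_text):
        if element.tag.startswith(prefix):
            kind = element.tag[len(prefix):]
            convert = CONVERTERS.get(kind, str)
            values[element.get("name")] = convert(element.text or "")
    return values


metadata = user_defined(SAMPLE)
print(metadata)
```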

The current proposal is to divide the format up into “Work Packages”:


1 Document Structure
1.1 Top Level Elements
1.2 First Level Elements
1.3 Body
2 Meta Information
3 Text Content
3.1 Paragraph Level Structure
3.1.1 Basics
3.1.2 Lists
3.2 Text Sections & Indices
3.2.1 Basics
3.2.2 Indices
3.2.3 Index Source Elements
3.3 Inside a Paragraph
3.3.1 Basics
3.3.2 Footnotes & Endnotes
3.3.3 Rubies
3.3.4 Text Fields
3.3.5 Text Marks
3.4 Change Tracking
3.5 Text Declarations
4 Tables
4.1 Basic Table Model
4.2 Advanced Table Model
4.3 Table Change Tracking
4.4 Other Table Elements
5 Styles
5.1 Style Structure: Styles, Auto-Styles, and Special Styles
5.2 Basics
5.3 Style Properties
5.3.1 Style Property Elements
5.4 Special Styles
5.4.1 Footnotes & Endnotes
5.4.2 Bibliography
5.4.3 Line Numbering
5.4.4 List Styles
5.4.5 Outline Style
5.5 Font Declarations
5.6 Number Styles
6 Graphical Content
6.1 Shapes
6.2 Embedded Objects
6.3 Other Graphical Elements
7 Business Charts
8 Forms
9 Events and Scripts
10 Settings
From: Work Packages Proposal, Daniel Vogelheim, office@lists.oasis-open.org, Thu, 23 Jan 2003 19:52:01 +0100, URL: <http://lists.oasis-open.org/archives/office/200301/msg00027.html>

At its January 2003 meeting the TC decided to use the Relax-NG schema language for schema creation. The initial schema was automatically converted from the OOO DTDs.


As an example, the document title (from the dc or Dublin Core metadata set) is defined as:


<define name="dc.title">
  <element name="dc:title">
    <ref name="dc.title-attlist"/>
    <ref name="cString"/>
  </element>
</define>
From: DTD conversions, Daniel Vogelheim, office@lists.oasis-open.org, Mon, 20 Jan 2003 18:43:19 +0100, URL: <http://lists.oasis-open.org/archives/office/200301/msg00022.html>
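Since a Relax-NG schema in this syntax is itself XML, it can be inspected with ordinary XML tools. The Python sketch below lists which named patterns each definition refers to. The enclosing grammar element is an assumption added so that the fragment parses on its own; the namespace URI is the one defined by the Relax-NG specification.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"  # Relax-NG structure namespace

# The define is the fragment quoted above; the grammar wrapper is assumed.
SAMPLE = f"""<grammar xmlns="{RNG}">
  <define name="dc.title">
    <element name="dc:title">
      <ref name="dc.title-attlist"/>
      <ref name="cString"/>
    </element>
  </define>
</grammar>"""


def references(schema_text):
    """Map each define name to the names of the patterns it refers to."""
    grammar = ET.fromstring(schema_text)
    return {
        define.get("name"): [ref.get("name")
                             for ref in define.iter(f"{{{RNG}}}ref")]
        for define in grammar.iter(f"{{{RNG}}}define")
    }


print(references(SAMPLE))
```

A listing like this gives a quick view of how the converted schema hangs together, for instance to find definitions that nothing refers to.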

The TC aimed to complete the first phase (codification of the existing OOO format) by mid 2003. But it is not clear from the committee minutes if this schedule is still expected to be met.

Conclusion

OASIS Standard Viable?

It would be easy to dismiss the OASIS Open Office standard as just a creature of Sun Microsystems, given that much of the work of the committee is being undertaken by Sun staff. However, the best IT standards come from the codification of accepted industry technology. As an example, SGML was derived from IBM's GML.


The custom for Internet standards is to have two working inter-operable implementations (preferably open source). This demonstrates that the standard can actually be implemented in practice. Having an open source implementation available allows rapid implementation. The OASIS Open Office standard has only one implementation so far, that of OpenOffice.org. This precludes interoperability testing. OpenOffice.org provides filters for converting from other office formats, notably Microsoft Office, to OOO. What might be useful is if a translation from the new Ms-Office XML formats to OOO XML format could be undertaken without using OOO code.

OASIS Standard Useful in Practice?

Even if there are no other implementations of the OASIS Open Office format, the standards work will be of value. OpenOffice.org is a useful tool, as an entry level office automation product and as a tool for converting between office formats. As well as being run interactively in a GUI environment, OOO can be invoked under program control to convert between formats. The most likely use for this is to convert between older Ms-Office formats and XML. OOO's XML can be converted to other formats using XSLT transformations. What has held up use of this is the poor quality of the documentation for the OOO format (and of the OOO non-interactive mode). The OASIS work will be of value if only for improving the documentation of the OOO format, so it can be transformed into other formats.


Biographical Notes

Tom Worthington is a Visiting Fellow in the Department of Computer Science, Faculty of Engineering and Information Technology at the Australian National University. He is an electronic business consultant, author of the book Net Traveller and an information technology professional, with 17 years' experience.

Tom was an expert witness on the accessibility of the Sydney Olympic Web Site in the Human Rights and Equal Opportunity Commission in August 2000. He is currently contracted by Macromedia Inc. to adapt their accessibility resources for Australian use and teaches web design and e-commerce.



[1] Slides for this presentation are at URL: <http://www.tomw.net.au/2003/xmlstd.ppt>

[2] OASIS Board of Directors, OASIS, undated, URL: <http://www.oasis-open.org/who/bod.php>

[3] Quick Evaluation of MS-Word 11 XML Features, Tom Worthington, 21 May 2003, URL: <http://www.tomw.net.au/2003/mswxml.html>

[4] Clean up your Web pages with HTML TIDY, W3C, 2003, URL: <http://www.w3.org/People/Raggett/tidy/>

[5] Electronic Publishing Options for Academic Material, For: Computing 3410 Students, The Australian National University, Tom Worthington, 20 August 2002, URL: <http://www.tomw.net.au/2002/epo.html>

[6] About Us: OpenOffice.org, OpenOffice.org, 2003-07, URL: <http://www.openoffice.org/about.html>

[7] Electronic Publishing Options for Academic Material, Tom Worthington, 2002, URL: <http://www.tomw.net.au/2002/epo.html>

[8] OASIS TC Call For Participation: Open Office XML TC, From: Karl Best <karl.best@oasis-open.org>, 04 Nov 2002 08:32:27 -0500, URL: <http://lists.oasis-open.org/archives/tc-announce/200211/msg00001.html>

[9] Monthly Archives for office, OASIS, 2002-2003, URL: <http://lists.oasis-open.org/archives/office/>