Quick Evaluation of MS-Word 11 XML Features

By Tom Worthington FACS

Notes for ACS Canberra Branch meeting, 21 May 2003

Prepared 18 April 2003

  1. Installing

  2. MS-Word and XML

  3. Opening an OOO Document in Ms-Word

Installing

This is a quick evaluation of Microsoft Word 11 in the Office System Beta 2 Kit 2003 (commonly known as "Office 11"). In particular I wanted to look at Ms-Word's ability to work with XML documents.

The Kit

The Beta arrived in a small folder with a 28 page booklet. There were three CD-ROM disks, which did not seem excessive, but these do not contain the software; they are just to help you with the evaluation. An inner folder holds another 12 CD-ROMs with the actual software. That is a lot of disks, but not all of them would be needed for the average installation.

The product

The large range of applications and features in Office seems to be daunting for everyone, including Microsoft. The booklet says:

"While some overlap does exist among these applications, each of them has been designed to address distinct usage scenarios, and based on their needs, customers can choose the appropriate application."

This seems to be an admission that a number of separately developed applications have been packaged together, rather than having a coherent whole. However, if your needs fit one of the "usage scenarios", then that shouldn't be a problem. Like many happy computer users I spend most of my time in the e-mail and word processor, rarely venture into other applications and so don't have many integration problems.

New applications included in Office are "OneNote", an integrated jotting pad and "SharePoint" for collaboration. But I didn't look at these, just Ms-Word. Microsoft describe "InfoPath" for working with XML forms. It wasn't made clear in the overview if this was a separate application or a feature built into the traditional word processor, spreadsheet and other applications.

Installation

The Beta was installed on a Dell Latitude with a Pentium III processor running Windows 2000 (Office 11 doesn't run on Ms-Windows 98, requiring Windows 2000, NT, or XP).

The installation instructions were sparse, saying to insert the disk for the application required. Like a typical user I avoided looking at any of the demo or instructional disks. So I inserted the "office" disk, but was told I needed to install Windows 2000 service pack 3 first.

Upgrading the operating system

When I checked on-line, service pack 3 was only 620 kbytes, but there were about 33 other "critical" updates recommended (more than 30 Mbytes), almost all to do with security. As this was just a trial (and the disk would be wiped clean after) I only selected the service pack. The download and upgrade (using a broadband connection) took 17 minutes, most of which was installation time.

Back to Office Installation

It is a little disconcerting that simply inserting the CD-ROM begins the software installation. I assume if I pressed "cancel" it would stop, but would prefer a positive "GO".

The product requires a 25 character product key, contained in the booklet. The number is difficult to read with Q looking like O and R like P. But the number worked first time.

I chose to do a full installation of the new applications and keep the old ones, needing 575MB. This seems a relatively modest amount of space and the installation only took 3 minutes. I forgot to remove the installation disk when rebooting the system, but it didn't do any harm.

In trying to start MS-Office I noticed an immediate problem. The shortcut was called "New Office Document", but immediately below this was a shortcut for "Open Office Document". Microsoft will need to remember to describe their product as "Microsoft Office" to distinguish it from “office” products from other vendors.

Opening MS-Word then presented a screen offering to activate the product via the Internet or telephone. I selected Internet which only took a couple of seconds.

I then typed in a short document and saved it in DOC, XML and HTML formats. Ms-Word looked and worked like other versions of Word I have used. A 1 kbyte text document took 24 kbytes for DOC, 12 kbytes for XML and 2 kbytes for HTML.

Opening the source code of the XML version of my first test document showed a neatly structured, Word-centric document. Ms-Word name spaces are declared, then office document properties (such as author, title, last saved) are listed, followed by Word font declarations, a very long set of list declarations (for my one short numbered list), styles for the heading and paragraph I used (and for a table I didn't use) and docPr document print settings (I didn't set any, so these must be the defaults).

After all the document definitions came the small body of the document with the actual content. All structures in the body were qualified with "w" as in "w:body". There was a section and subsection around everything and proofreading error declarations around text which failed a spelling check <w:proofErr>.
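The layout described above can be explored with a few lines of Python. This is only a sketch: the sample document is hand-written and heavily trimmed, not real Ms-Word output, and the name space URIs and the w:p, w:r, w:t and wx:sect element names are assumptions taken from the released Word 2003 format, which may differ from the Beta. Only w:body and w:proofErr are named in the notes above.

```python
import xml.etree.ElementTree as ET

# Assumed name space URIs (released Word 2003; the Beta may use others).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"

# Hand-written, heavily trimmed sample; not actual Ms-Word output.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body>
    <wx:sect>
      <w:p>
        <w:r><w:t>A short test </w:t></w:r>
        <w:proofErr w:type="spellStart"/>
        <w:r><w:t>documnet</w:t></w:r>
        <w:proofErr w:type="spellEnd"/>
      </w:p>
      <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
    </wx:sect>
  </w:body>
</w:wordDocument>"""


def paragraphs(xml_text):
    """Return the plain text of each w:p paragraph found inside w:body."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{W}}}body")
    result = []
    for para in body.iter(f"{{{W}}}p"):
        # Join the w:t text runs; w:proofErr markers carry no text.
        result.append("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
    return result


print(paragraphs(SAMPLE))
```

The point of the sketch is that the content is easy to reach once the "w" qualified names are mapped to their name space, despite the amount of declaration that precedes the body.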

For a document consisting of a few lines of text this was a lot of baggage, but it is logically laid out. One disappointment is that the MS-IE 6.0 web browser didn't render the document, just displayed the XML source code. Even though IE 6 can display an XML document, it can't do this with MS-Word XML documents.

From starting the laptop to looking at an XML document (while writing these notes) took one hour and three minutes. This was a reasonable amount of time. Next I tried some more complex documents.

MS-Word and XML

Having installed MS-Office and checked Ms-Word could save a simple document in XML, I then set about trying a more complex document.

In "Electronic Publishing Options for Academic Material" I described some simple tests converting Ms-Word documents to XML using the OpenOffice.Org (OOO) word processor. So I decided to convert the same documents using Ms-Word 11.

The two documents were examples of formatting information technology articles for submission to ACM and IEEE journals. Both documents are in .DOC format and so there was no problem opening them in MS-Word 11. When saved as XML, the ACM document was 28 kbytes and the IEEE document was 146 kbytes. These are similar to the sizes of the original .DOC versions (23 and 111 kbytes).

As with translation by OpenOffice.org, the names of the styles in the original Ms-Word document survived the translation to XML. This is useful where parts of the document need to be automatically extracted by an electronic publishing system. The title in the original documents was marked "title", and the XML versions had w:styleId="Title". However, Ms-Word redefines styles which have spaces in their names, so "Heading 1" (with a space) becomes "Heading1" (no space). This will take a little more work for any conversion program.
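The extra work for a conversion program could be as small as the following sketch. It assumes the only changes are the ones observed above (spaces removed, and a possible difference in letter case); the list of paragraphs is made up for illustration and does not come from the ACM or IEEE documents.

```python
def style_id(style_name):
    """Derive the styleId seen in the XML by dropping spaces from the name."""
    return "".join(style_name.split())


# Hypothetical (styleId, text) pairs, as a program might collect from the XML.
SAMPLE_PARAGRAPHS = [
    ("Title", "A Sample Paper"),
    ("Heading1", "Introduction"),
    ("Normal", "Body text."),
]


def text_with_style(paragraphs, style_name):
    """Return the text of paragraphs whose styleId matches the style name.

    The comparison ignores case, since "title" appeared as "Title".
    """
    wanted = style_id(style_name).lower()
    return [text for sid, text in paragraphs if sid.lower() == wanted]


print(text_with_style(SAMPLE_PARAGRAPHS, "Heading 1"))
print(text_with_style(SAMPLE_PARAGRAPHS, "title"))
```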

Images in the XML File

The large size of the IEEE XML file was a surprise. Most of the space in the original was not text, but images. Therefore it would be expected that the XML for the text would be small. But Ms-Word stores the images inside the XML document as “binData”, which is base 64 encoded. This format is commonly used for sending binary files by e-mail as a sequence of characters. This differs from OOO's approach, which is to store images as separate binary files and then compress the text and images into one Zip archive.

The Microsoft approach to storing images in XML has the advantage of keeping the images in the one file with the text of the document. This would appear to have the disadvantage of making a much larger document file, as base 64 is an inefficient way to encode binary data (compared to OOO using native binary image files). But when compressed with the Zip format, as used by OOO, Word's XML files reduce to about the same size as the OOO version.
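The size effect can be reproduced with a small experiment. This is a sketch under stated assumptions: random bytes stand in for an already compressed image, and Python's zlib module (the same "deflate" method commonly used inside Zip archives) stands in for the Zip step. It does not use the actual ACM or IEEE files.

```python
import base64
import random
import zlib


def size_report(n_bytes=30000, seed=0):
    """Return (raw, base 64, deflated base 64) sizes in bytes."""
    rng = random.Random(seed)
    # Random bytes stand in for an already compressed image (an assumption).
    raw = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    encoded = base64.b64encode(raw)       # 4 output characters per 3 input bytes
    deflated = zlib.compress(encoded, 9)  # deflate, as commonly used by Zip
    return len(raw), len(encoded), len(deflated)


raw_len, b64_len, zip_len = size_report()
print(f"raw {raw_len}, base 64 {b64_len}, base 64 after deflate {zip_len}")
```

Base 64 uses only 64 different characters, so each 8 bit character carries 6 bits of information. A general purpose compressor can squeeze most of that one third overhead back out, which is consistent with the file sizes observed above.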

XSL Transformations

When saving a document as XML in Ms-Word the option "apply transform" appears. I didn't have a transformation document (XSL) to hand for the ACM or IEEE formats, so used one I had prepared for demonstrating the formatting of tax forms in "Formatting the eBAS with XSL".

Opening the Business Activity Statement tax form in Ms-Word displayed the default "data only" view:

ATO BAS Tax Form Displayed in MS-Word in "Data Only" Format

This shows XML tags as boxed labels, around the data values. The data can be edited and saved as "data only" (the default from the "data only" format), preserving the XML structure, without Ms-Word formatting. While useful for technically orientated people, this view would not make much sense to the average user, who would be expecting to see a tax form.

However, "atobas.xsl" appears on the screen as an alternative view option to "data only", when opening the document. This is the name of an XSL Transformation file cited in the XML document. MS-Word automatically detected this XSLT file and provided it as an option, without my having to take any action. By selecting this option "atobas.xsl" is retrieved from the web, the transformation applied to the data file and the resulting HTML document displayed in a few seconds. The result looks much as it does in an XSL aware web browser, such as MS-IE 6:
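A program can find the cited transformation in much the same way. The sketch below assumes the XML document cites "atobas.xsl" with the standard xml-stylesheet processing instruction, which is the usual mechanism but is not confirmed above. The BAS root element is a hypothetical stand-in for the real form, and this is not a description of how MS-Word itself does the detection.

```python
import re
from xml.dom import Node, minidom

# Hand-written sample; the BAS root element is hypothetical.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/xsl" href="atobas.xsl"?>\n'
    "<BAS><BILLER_CODE>75556</BILLER_CODE></BAS>"
)


def stylesheet_hrefs(xml_text):
    """Return the href of each xml-stylesheet instruction before the root."""
    doc = minidom.parseString(xml_text)
    hrefs = []
    for node in doc.childNodes:
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The instruction's data looks like attributes; pick them apart.
            pairs = re.findall(r"([\w-]+)\s*=\s*([\"'])(.*?)\2", node.data)
            attrs = {name: value for name, _quote, value in pairs}
            if "href" in attrs:
                hrefs.append(attrs["href"])
    return hrefs


print(stylesheet_hrefs(SAMPLE))
```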

ATO BAS Form Transformed By XSLT in MS-Word

The display is not identical to the web browser version; for example the address spread right across the screen on one line, rather than wrapped in a box on the left side of the screen. But colours and other layout appeared similar to the browser display:


ATO BAS data Transformed by IE6

While this transformation looks impressive at first, its limitations soon become apparent. After editing the data in "data only" format, there doesn't seem to be any easy way to flip back to the transformed view. If the document is saved and reopened, it can then be transformed, but once transformed it cannot be converted back to data (unless you have another transformation which does that).

Saving the transformed document, MS-Word warns the document name-space will be overwritten with the Ms-Word name-space. This warning message would be incomprehensible to the average Ms-Word user (even many IT professionals would not know what a name-space was). What this actually does is to save the document with the MS-Word XML formatting, so it effectively becomes a less structured word processing document, rather than a data form. The XML tags which previously delimited data (such as <BILLER_CODE>75556</BILLER_CODE>) become text in the document.

In effect MS-Word applies an XSL transformation when you open an XML document, or when you save an XML document. Once you do this you are stuck with the transformed document (undo doesn't work). On the plus side my example transformation generates HTML, which appeared the same when displayed in a browser, as when the browser does the transformation.

Ms-Word would not appear to be suitable for filling in the ATO BAS form, as the data-only format where the data can be edited looks nothing like the tax form. What would be needed is a transformation to a display format and then another transformation to turn the result back into a data file.
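The pair of transformations suggested above can be illustrated in miniature. This sketch uses plain Python rather than XSLT, and a much simplified form: only BILLER_CODE comes from the real document, while the BAS root element, the AMOUNT field and the layout of the display form are invented for illustration.

```python
import xml.etree.ElementTree as ET

# BILLER_CODE is from the real form; BAS and AMOUNT are hypothetical.
DATA = "<BAS><BILLER_CODE>75556</BILLER_CODE><AMOUNT>120.00</AMOUNT></BAS>"


def to_form(data_xml):
    """Data file to display form: one labelled input per data element."""
    data = ET.fromstring(data_xml)
    form = ET.Element("form", {"id": data.tag})
    for field in data:
        label = ET.SubElement(form, "label")
        label.text = field.tag
        ET.SubElement(label, "input",
                      {"name": field.tag, "value": field.text or ""})
    return ET.tostring(form, encoding="unicode")


def to_data(form_xml):
    """Display form back to data file: one element per input."""
    form = ET.fromstring(form_xml)
    data = ET.Element(form.get("id"))
    for field in form.iter("input"):
        ET.SubElement(data, field.get("name")).text = field.get("value")
    return ET.tostring(data, encoding="unicode")


form = to_form(DATA)
edited = form.replace('value="75556"', 'value="12345"')  # simulate an edit
print(to_data(edited))
```

The reverse step only works because the display form keeps the name of every data element alongside its value. A transformation which discards those names, as happens when the tags become text, cannot be undone.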

Having the transformation function in Ms-Word is useful, but not a lot more useful than a standalone transformation function.

Opening an OOO Document in Ms-Word

Ms-Word can open XML documents and transform them, so could it open an XML document from another word processor, such as OOO? The first attempt to open the content of the OOO conversion of the ACM document failed with "Error processing system resource 'office.dtd'". This is reasonable, as OOO's DTD is not correctly referenced in OOO documents. The OOO software doesn't actually use the formal definition of an OOO document; it is just supplied for documentation purposes. In theory an OOO document could be transformed on opening to an XML format Ms-Word could read, but so far I haven't seen a suitable transformation.

In the absence of a transformation to modify the OOO XML, I used a manually modified version of the document. This has the OOO DTD inserted into the document and was successfully opened in Adobe FrameMaker Version 7.0. FrameMaker was used to detect syntax errors in the OOO DTD, which were reported to the OOO project and fixed.

Opening the OOO document in Ms-Word's "data only" mode worked. However, the results are disappointing. OOO's non-standard style declarations appear in the text of the document, rather than modifying the text:


OOO Styles Displayed in MS-Word

The text of the document shows up reasonably well, but Ms-Word doesn't know how to deal with OOO's structures, such as lists:


OOO List Displayed in MS-Word

With a suitable XSLT transformation and XML styles, it should be possible for MS-Word to convert to and from OOO format. However, the limitations of the OOO format may make this a less than perfect process. The reverse process can't be tested as the current version of OOO (Version 1.0.2) has no XML import facilities.


Tom Worthington © 2003