Introduction
With an XML-based file format, WordprocessingML, Word 2003 provides new opportunities for using XSL transformation to convert data and documents to and from Word. This article presents a utility template for writing CodeProject articles in Word 2003, with an XSL stylesheet for converting the native document to a concise HTML syntax representative of the CodeProject submission template. This article is not intended to serve as an introduction to XSL transformation, nor necessarily as a primer on WordprocessingML. Rather, this article offers XSL examples for transforming a Word document with single- and multi-line paragraph styles, character formatting, images, hyperlinks, and tables.
Background
I like using Word for writing articles. There are numerous features (outlining, revision tracking, and proofing tools, to name a few) to assist the writer. Historically, though, Word has had its problems as a rich-text HTML editor. Its functions over the years for saving a document as HTML have produced notoriously complex and verbose markup. For its part, Word 2003 offers both a full-fidelity HTML save format and a "filtered HTML" format. The former produces as garrulous a syntax as previous versions; the latter, though cleaner, still handles too many formats (such as a simple list item) using a <span> tag rather than the suitable HTML (<li>). Though I prefer an editor that generates more standard HTML, I still wish to benefit from all of Word's features. Writing CodeProject articles, based on the CodeProject submission template, is an excellent case where I want Word's power but simple HTML output, using standard heading <h2>, paragraph <p>, and list item <li> tags, among others.
Word 2003 opens the door to this possibility by offering WordprocessingML as an XML-based save format. Originally called WordML, WordprocessingML provides a complete grammar for representing a Word document as XML. With it and an appropriate XSL stylesheet, document transformation to a simpler HTML format is attainable. The template and companion XSL stylesheet described in this article serve as a utility to convert a Word 2003 document into a simpler HTML syntax for CodeProject articles.
For the reader not familiar with XML or XSL transformation, try the W3Schools tutorials on XML and XSL. For an introduction to and reference for WordprocessingML, try the following from Microsoft:
Using the Template
The template includes a custom toolbar, styles in the Bob-loves-orange CodeProject colors, and some VBA code. Because of the code, security issues must be considered when using the template.
Setting Up
Copy the template CodeProject Article.dot to your local templates directory. To find this location, open Word's Tools menu, choose Options, and look under "User Templates" on the File Locations tab.
A typical location for the templates folder is "driveLetter:\Documents and Settings\user\Application Data\Microsoft\Templates".
Security Issues
Depending on your security settings, you may receive a warning (or the code may be disabled entirely) when attempting to use the template. To view your security settings in Word, open the Tools menu and choose Macro, then Security.
The template is not signed, so its code may be disabled if the security level is set higher than Medium. To use the template, ensure one of the following:
- On the Trusted Publishers tab of the Security window, check the box labeled Trust all installed add-ins and templates. This allows use of the template provided it has been copied to the User Templates file location.
- Set the Security Level to Medium and when opening the template, choose to enable macros.
- Sign the template with your own security certificate, potentially including that certificate among the Trusted Publishers list. Refer to Word 2003 documentation for more information on code signing.
The First Time: Setting Options
To create a new document using the template, open the File menu and choose New... In the New Document task pane, under Templates, click On my computer..., then select the CodeProject Article icon. Upon first use, the Options dialog displays.
In the XSL Transform Stylesheet box, enter the full path of the companion XSL stylesheet, or click Browse to locate the file. This path must be set for the XSL transformation to function correctly. Check the box Open the .html file after XSL transform at your discretion.
These options are stored as custom properties in the template itself, so there are no additional registry settings or external files used.
Toolbar Functions
The XSL transformation employed here is largely based on the use of paragraph, character, and table styles. Specific style names are easy to match in the XSL stylesheet, and the template encourages the use of these styles through the functions on its custom toolbar.
Function | Description
Heading 2 | Apply the Heading2 style to the selected paragraph. Heading2 renders as an <h2> tag upon transformation.
Heading 3 | Apply the Heading3 style to the selected paragraph. Heading3 renders as an <h3> tag upon transformation.
Code Block | Apply the pre style to the selected paragraph(s). When transformed, blocks using the pre style are rendered within <pre>...</pre> tags.
Normal | Apply the Normal style to the selected paragraph(s). Normal paragraphs render as <p> tags.
BulletList | Apply the BulletList style to the selected paragraph(s). This style name is interpreted upon transformation as a <ul> block of <li> items.
NumberList | Apply the NumberList style to the selected paragraph(s). This style name is interpreted upon transformation as an <ol> block of <li> items.
Bold, Italic, Underline | Standard bold, italic, and underline character formatting, transformed to <b>, <i>, and <u> tags.
Code formatting | Character formatting for variables or class names; this style name transforms to a <code> tag.
Table style: Border0 | Apply the TableBorder0 table style to the selected table. Upon transformation, this renders a border="0" attribute in the <table> tag.
Table style: Border1 | Apply the TableBorder1 table style to the selected table. Upon transformation, this renders a border="1" attribute in the <table> tag.
Insert Hyperlink | Standard Word 2003 command for inserting hyperlinks, with the utility of including a new window [^] link. Destinations may be external to the document, or internal bookmarks. (Note: proofing errors within hyperlinks may interfere with the rendering of hyperlinks to XML; see Additional Considerations for more information.)
Insert Download | Custom command for inserting a download file hyperlink, such as those that appear above an article. In addition to constructing the link, the DownloadList paragraph style is applied, which when transformed renders a <ul class='download'> tag.
Insert Linked Picture | Opens the standard Word Insert Picture dialog, and then ensures that the inserted picture is linked and not embedded. Upon transformation, a linked picture is rendered as an <img> tag with a src attribute set to the path of the picture relative to the document. If the picture is in the same folder as the document, src is set to the file name only; if in a sibling folder of the document, src is set to folder\picFileName.xxx.
Apply XSL Transformation | Saves the current document in its original format (typically .doc), then saves again using XSL transformation, generating a file with the same name as the original but with an .html extension. Once transformed, the document is reset so additional saves retain the original format.
Options | Opens the Options dialog, allowing the path to the XSL stylesheet to be set. These options are stored directly in the template as custom properties.
The XSL Stylesheet
The file CPArticleTransform.xsl provides the XSL stylesheet used for this transformation. This file can be saved anywhere on the drive with the template; as mentioned, the template's Options dialog provides a box to enter the full stylesheet path.
Namespaces and Outer Templates
WordprocessingML incorporates a number of namespaces, which we will include as attributes in the root <xsl:stylesheet> tag.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
>
Among this listing, the following prefixes are particularly important in our transformation:
- xsl: serves as an alias for the namespace defining XSL transformation; stylesheet commands will be prefixed with xsl.
- w: alias for the WordprocessingML namespace; when matching most nodes specific to the Word document, we'll prefix using w. For example, to match a Word paragraph tag, we'll look for <w:p>.
- v: alias for the VML namespace, used by Word to represent images.
- wx: alias for the Word 2003 auxiliary namespace; section and sub-section tags will be prefixed with wx.
- aml: alias for the Annotation Markup Language namespace; bookmarks are represented as <aml:annotation> tags.
The root node of a Word document, represented through WordprocessingML, is the <w:wordDocument> element. Our template for matching this root node of the document is as follows:
<!-- Match the root node of the Word document -->
<xsl:template match="/w:wordDocument">
  <html>
    <head>
      <title>The Code Project</title>
      <style>
        BODY, P, TD { font-family: Verdana, Arial, Helvetica, sans-serif;
                      font-size: 10pt }
        H2,H3,H4,H5 { color: #ff9900; font-weight: bold; }
        H2 { font-size: 13pt; }
        H3 { font-size: 12pt; }
        H4 { font-size: 10pt; color: black; }
        PRE { BACKGROUND-COLOR: #FBEDBB;
              FONT-FAMILY: "Courier New", Courier, mono;
              WHITE-SPACE: pre; }
        CODE { COLOR: #990000;
               FONT-FAMILY: "Courier New", Courier, mono; }
      </style>
      <link rel="stylesheet" type="text/css"
            href="http://www.codeproject.com/styles/global.css" />
    </head>
    <body>
      <!-- Render the document body -->
      <xsl:apply-templates select="w:body" />
    </body>
  </html>
</xsl:template>
With this template, we set up the article HTML and issue the <xsl:apply-templates> instruction to render the document body.
In WordprocessingML, a <w:body> tag serves as a container for section and sub-section nodes, represented as <wx:sect> and <wx:sub-section>. These in turn serve as containers for paragraphs, represented by the <w:p> tag. It is at the paragraph level that the heart of our processing begins, so for <w:body>, <wx:sect>, and <wx:sub-section> matches, we simply issue the <xsl:apply-templates> instruction to dive further down into the element hierarchy.
<!-- Container elements: simply pass processing down to the children -->
<!-- Document body -->
<xsl:template match="w:body">
  <xsl:apply-templates select="*" />
</xsl:template>
<!-- Section -->
<xsl:template match="wx:sect">
  <xsl:apply-templates select="*" />
</xsl:template>
<!-- Sub-section -->
<xsl:template match="wx:sub-section">
  <xsl:apply-templates select="*" />
</xsl:template>
Single-line Paragraph Formatting
Once inside the body of the document, we use a template matching the <w:p> tag, which represents an individual paragraph. As the template is based on the use of styles in Word, locating heading paragraphs is a straightforward matter. Among other children, a paragraph may contain a <w:pPr> tag, which stands for "paragraph properties". The <w:pPr> tag may contain a <w:pStyle> tag if a paragraph style is in use. The name of the style is found in its w:val attribute. Therefore, to match a paragraph with the Heading2 style, we can use the following XPath syntax:
w:pPr/w:pStyle[@w:val='Heading2']
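To make the match concrete, here is a minimal sketch of the kind of WordprocessingML fragment this expression is meant to select. The heading text is invented for illustration, and a paragraph saved by Word would typically carry more properties than shown.

```xml
<!-- Illustrative input only: a paragraph using the Heading2 style -->
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:pPr>
    <!-- The style name that the XPath test looks for -->
    <w:pStyle w:val="Heading2" />
  </w:pPr>
  <w:r>
    <w:t>Using the Template</w:t>
  </w:r>
</w:p>
```

Evaluated with this <w:p> as the context node, the expression selects the <w:pStyle> element, so a test built on it succeeds.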
The <w:p> template looks for several different heading paragraph styles from within an <xsl:choose> tag. The <xsl:otherwise> condition applies a simple <p> tag in the output.
<xsl:template match="w:p">
  <!-- Choose the output tag based on the paragraph style -->
  <xsl:choose>
    <!-- Heading styles -->
    <xsl:when test="w:pPr/w:pStyle[@w:val='Heading2']">
      <h2><xsl:apply-templates select="*" /></h2>
    </xsl:when>
    <xsl:when test="w:pPr/w:pStyle[@w:val='Heading3']">
      <h3><xsl:apply-templates select="*" /></h3>
    </xsl:when>
    <xsl:when test="w:pPr/w:pStyle[@w:val='Heading4']">
      <h4><xsl:apply-templates select="*" /></h4>
    </xsl:when>
    <xsl:when test="w:pPr/w:pStyle[@w:val='Heading5']">
      <h5><xsl:apply-templates select="*" /></h5>
    </xsl:when>
    . . .
    <!-- Any other paragraph renders as a simple <p> -->
    <xsl:otherwise>
      <p>
        <!-- Carry over the paragraph alignment, if one is specified -->
        <xsl:choose>
          <xsl:when test="w:pPr/w:jc/@w:val">
            <xsl:attribute name="align">
              <xsl:value-of select="w:pPr/w:jc/@w:val" />
            </xsl:attribute>
          </xsl:when>
        </xsl:choose>
        <!-- Process the paragraph content -->
        <xsl:apply-templates select="*" />
      </p>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
Multi-line Paragraph Formatting
A more complex situation arises when using lists or <pre> sections. In these cases, each line (ended with a carriage return) is considered a new paragraph to Word, and would have its own paragraph style information. Though we can still identify each by its style name (e.g. "BulletList", or "pre"), we need to treat the multiple lines as a single group, surrounded with, say, a <ul> or <pre> container.
For these cases, we will still test for the style name as we did before. Once found, we'll test the preceding paragraph to see if it matches the same style. If it doesn't, we can assume we are beginning the multi-paragraph block. In the case of a BulletList for example, we will then apply a transform like the following:
<ul>
  <xsl:apply-templates select="." mode="insideBulletList"/>
</ul>
The mode attribute here is the key to making this work. We will continue applying templates, thus continuing to match <w:p> tags. However, by specifying a mode we can change the operational <w:p> template to one specifically designed for, say, a bullet list. Recall that our original <w:p> template was defined without a mode:
<xsl:template match="w:p">
  . . .
</xsl:template>
We'll define another template to match <w:p> tags, but include the mode attribute to handle paragraph processing differently inside a BulletList.
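<!-- Match a paragraph while inside a BulletList block -->
<xsl:template match="w:p" mode="insideBulletList">
  <!-- Output the current paragraph as a list item -->
  <li><xsl:apply-templates /></li>
  <!-- Continue with the next sibling only if it is also a BulletList paragraph -->
  <xsl:apply-templates
    select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"
    mode="insideBulletList" />
</xsl:template>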
A paragraph match here outputs the list item <li> tag, then applies the same template to the sibling that immediately follows, provided it shares the paragraph style name "BulletList". So back in the original <w:p> template, as an <xsl:when> condition in the original <xsl:choose> instruction, the following handles BulletList formatting:
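. . .
<!-- BulletList paragraphs -->
<xsl:when test="w:pPr/w:pStyle[@w:val='BulletList']">
  <xsl:choose>
    <!-- If the preceding paragraph is also a BulletList item, this one has
         already been output as part of that list, so do nothing -->
    <xsl:when
      test="preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"/>
    <!-- Otherwise this paragraph begins a new list -->
    <xsl:otherwise>
      <ul>
        <xsl:apply-templates select="." mode="insideBulletList"/>
      </ul>
    </xsl:otherwise>
  </xsl:choose>
</xsl:when>
. . .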
This block reflects the pattern also used for NumberList, DownloadList, and pre paragraph styles.
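As a sketch of how that pattern might carry over to numbered lists, the reduced stylesheet below repeats the two pieces for the NumberList style. The mode name insideNumberList is an assumption made for this example, and the default paragraph template is trimmed to the one branch of interest; the companion stylesheet may organize this differently.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- Default-mode paragraph template, reduced to the NumberList branch -->
  <xsl:template match="w:p">
    <xsl:choose>
      <xsl:when test="w:pPr/w:pStyle[@w:val='NumberList']">
        <xsl:choose>
          <!-- Already emitted by the list that an earlier sibling started -->
          <xsl:when
            test="preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']]" />
          <!-- First item of a run: open the <ol> and switch modes -->
          <xsl:otherwise>
            <ol>
              <xsl:apply-templates select="." mode="insideNumberList" />
            </ol>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <xsl:otherwise>
        <p><xsl:apply-templates select="*" /></p>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Hypothetical mode name: emit one item, then chain to the next sibling -->
  <xsl:template match="w:p" mode="insideNumberList">
    <li><xsl:apply-templates /></li>
    <xsl:apply-templates
      select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']]"
      mode="insideNumberList" />
  </xsl:template>

</xsl:stylesheet>
```

Only the style name, the container tag, and the mode name change from the BulletList version, which is what makes the approach easy to repeat for each multi-line style.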
Runs and Character Formatting
In WordprocessingML, the tag <w:r> identifies a run of content. These tags are children of <w:p> tags and represent containers of content with consistent character formatting. Text, linked images, and line breaks are all examples of content nested inside a <w:r> tag. The <w:r> tag may also contain a <w:rPr> tag to enclose the properties (including character formatting) of the run. As multiple character formats may be applied to a run, we must adhere to proper hierarchical nesting of formatting tags in the output. To accomplish this, we will call a recursive template when matching a <w:r> tag, passing the child formatting tags of the <w:rPr> run property parent along with an index that starts at the first of them.
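<!-- Match a run of content -->
<xsl:template match="w:r">
  <!-- Apply character formatting recursively, starting with the
       first child of the run properties -->
  <xsl:call-template name="recurseRunProps">
    <xsl:with-param name="nodeCount" select="1" />
    <xsl:with-param name="propNodes" select="w:rPr/*" />
    <!-- Run content: every child of the run except its properties -->
    <xsl:with-param name="runContent" select="*[not(self::w:rPr)]" />
  </xsl:call-template>
</xsl:template>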
The recursive template recurseRunProps checks to see if it has been passed a valid node, and if so tries to match supported character formatting. If a supported formatting tag is caught, an <xsl:call-template> instruction is issued to execute recurseRunProps again with the next formatting child, nested within the appropriate output formatting tags. If the passed node is not a supported formatting tag, recurseRunProps is still called with the next formatting child, if any. When the recursion has ended, an <xsl:apply-templates> instruction is performed to process the inner run content.
The following shows the pattern in recurseRunProps for matching <w:b> bold formatting tags. Italic, underline, and <code> character formats follow the same pattern.
<!-- Recursively nest output formatting tags around the run content -->
<xsl:template name="recurseRunProps">
  <xsl:param name="nodeCount" />
  <xsl:param name="propNodes" />
  <xsl:param name="runContent" />
  <!-- The formatting node for this level of recursion, if any -->
  <xsl:variable name="curNode" select="$propNodes[$nodeCount]" />
  <!-- Is there a formatting node left to examine? -->
  <xsl:choose>
    <xsl:when test="$curNode">
      <!-- Try to match a supported formatting tag -->
      <xsl:choose>
        <!-- Bold: wrap the remaining formatting and content in <b> -->
        <xsl:when test="name($curNode)='w:b' ">
          <b>
            <xsl:call-template name="recurseRunProps">
              <xsl:with-param name="propNodes" select="$propNodes" />
              <xsl:with-param name="nodeCount" select="$nodeCount+1" />
              <xsl:with-param name="runContent" select="$runContent" />
            </xsl:call-template>
          </b>
        </xsl:when>
        . . .
        <!-- Unsupported formatting: move on to the next node without output -->
        <xsl:otherwise>
          <xsl:call-template name="recurseRunProps">
            <xsl:with-param name="propNodes" select="$propNodes" />
            <xsl:with-param name="nodeCount" select="$nodeCount+1" />
            <xsl:with-param name="runContent" select="$runContent" />
          </xsl:call-template>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:when>
    <!-- No formatting nodes remain: process the run content itself -->
    <xsl:otherwise>
      <xsl:apply-templates select="$runContent" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
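To illustrate the nesting, consider a run that is both bold and italic. The fragment below is a hand-written sample rather than output captured from Word, and it assumes the italic branch of recurseRunProps mirrors the bold branch shown above, matching <w:i> and emitting <i>.

```xml
<!-- Illustrative input only: one run carrying both bold and italic -->
<w:r xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:rPr>
    <w:b />
    <w:i />
  </w:rPr>
  <w:t>nested formatting</w:t>
</w:r>
```

Under that assumption, the first call wraps its output in <b>, the second in <i>, and the third finds no further formatting node and processes the text, so the result should be <b><i>nested formatting</i></b>.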
Run Content: Text, Line Breaks, and Images
Following the recursive application of character formatting, we process the content of a run. Each type of content is supported through its own template matching a WordprocessingML tag. Regular text, represented by a <w:t> tag, is rendered with an <xsl:value-of> instruction.
<!-- Match regular text -->
<xsl:template match="w:t">
  <!-- Output the text as is -->
  <xsl:value-of select="." />
</xsl:template>
Line breaks (created in Word by pressing [Shift]+[Enter]) are also simple to address with a template matching the <w:br> tag:
<!-- Match a line break -->
<xsl:template match="w:br">
  <!-- Output an HTML line break -->
  <br />
</xsl:template>
Images are a little more complicated. Our output should be an <img> tag with a src attribute pointing to a file relative to the HTML document itself. To support this, we must insert linked pictures in the Word document rather than embedded pictures. Linked pictures are identified with <w:pict> tags in WordprocessingML. We can pull the linked file source name from the src attribute of the w:pict/v:shape/v:imagedata child tag. Pictures in WordprocessingML are described with VML syntax, hence the v: prefix. In VML, image dimensions are represented through a CSS style attribute. We use that to add a style attribute to the output <img> tag.
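<!-- Match a linked picture -->
<xsl:template match="w:pict">
  <!-- Output an <img> tag that points at the linked file -->
  <img>
    <!-- The source path comes from the VML imagedata element -->
    <xsl:attribute name="src">
      <xsl:value-of select="v:shape/v:imagedata/@src" />
    </xsl:attribute>
    <!-- Carry over the VML style (image dimensions), if present -->
    <xsl:if test="v:shape/@style">
      <xsl:attribute name="style">
        <xsl:value-of select="v:shape/@style" />
      </xsl:attribute>
    </xsl:if>
  </img>
</xsl:template>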
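For reference, the sketch below shows the parts of a linked picture that this template reads. The file name and dimensions are invented for the example, and markup saved by Word typically includes further elements and attributes that the template ignores.

```xml
<!-- Illustrative input only: the elements and attributes the template uses -->
<w:pict xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:v="urn:schemas-microsoft-com:vml">
  <v:shape style="width:200pt;height:100pt">
    <!-- For a linked picture, src holds the path to the picture file -->
    <v:imagedata src="images\toolbar.gif" />
  </v:shape>
</w:pict>
```

Given this input, the template should emit an <img> tag whose src is images\toolbar.gif and whose style is width:200pt;height:100pt.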
Hyperlinks
In WordprocessingML, a hyperlink is represented with a <w:hlink> tag. If present, a w:dest attribute indicates an external destination. Without it, a destination internal to the document is assumed. The w:bookmark attribute then contains the name of a destination bookmark.
Document bookmarks are represented by Word as a pair of <aml:annotation> tags, one with a w:type attribute of "Word.Bookmark.Start", the other with a w:type value of "Word.Bookmark.End". The .Start bookmark tag also has a w:name attribute representing the bookmark name. It is this value that will match the w:bookmark value in the <w:hlink> tag.
Whether the destination is external or internal, the <w:hlink> tag will nest its display text as inner content.
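<!-- Match a hyperlink -->
<xsl:template match="w:hlink">
  <!-- External destination; empty for links internal to the document -->
  <xsl:variable name="dest">
    <xsl:value-of select="@w:dest" />
  </xsl:variable>
  <!-- Output the anchor tag -->
  <a>
    <!-- Build the href attribute -->
    <xsl:attribute name="href">
      <xsl:choose>
        <!-- A bookmark is appended to the destination after a '#' -->
        <xsl:when test="@w:bookmark">
          <xsl:value-of select="concat($dest, '#', @w:bookmark)" />
        </xsl:when>
        <!-- No bookmark: use the destination alone -->
        <xsl:otherwise>
          <xsl:value-of select="$dest" />
        </xsl:otherwise>
      </xsl:choose>
    </xsl:attribute>
    <!-- Carry over the target window, if specified -->
    <xsl:if test="@w:target">
      <xsl:attribute name="target">
        <xsl:value-of select="@w:target" />
      </xsl:attribute>
    </xsl:if>
    <!-- Process the display text of the link -->
    <xsl:apply-templates />
  </a>
</xsl:template>
<!-- Match a bookmark annotation -->
<xsl:template match="aml:annotation">
  <xsl:if test="@w:type='Word.Bookmark.Start'">
    <!-- Only the start tag carries the bookmark name;
         output it as a named anchor -->
    <a>
      <xsl:attribute name="name">
        <xsl:value-of select="@w:name" />
      </xsl:attribute>
    </a>
  </xsl:if>
</xsl:template>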
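The following hand-written fragment sketches the three cases these templates handle: a bookmark pair, an external link, and an internal link to that bookmark. The bookmark name, link text, and aml:id values are invented for the example.

```xml
<!-- Illustrative input only -->
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
     xmlns:aml="http://schemas.microsoft.com/aml/2001/core">
  <!-- Bookmark pair marking a destination named "Summary" -->
  <aml:annotation aml:id="0" w:type="Word.Bookmark.Start" w:name="Summary" />
  <aml:annotation aml:id="0" w:type="Word.Bookmark.End" />
  <!-- External link that opens in a new window -->
  <w:hlink w:dest="http://www.codeproject.com" w:target="_blank"><w:r><w:t>CodeProject</w:t></w:r></w:hlink>
  <!-- Internal link to the bookmark above -->
  <w:hlink w:bookmark="Summary"><w:r><w:t>See the summary</w:t></w:r></w:hlink>
</w:p>
```

With these templates, the start annotation should become a named anchor for Summary, the first link an anchor with href="http://www.codeproject.com" and target="_blank", and the second an anchor with href="#Summary", because the empty destination is concatenated with '#' and the bookmark name.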
Tables
Table formatting has been kept simple in this stylesheet. The template contains two table styles, TableBorder0 and TableBorder1, which are interpreted in the XSL instructions to apply either "0" or "1" for the output table border attribute.
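<!-- Match a table -->
<xsl:template match="w:tbl">
  <table>
    <xsl:attribute name="border">
      <!-- TableBorder0 renders border="0";
           any other table style renders border="1" -->
      <xsl:choose>
        <xsl:when test="w:tblPr/w:tblStyle/@w:val = 'TableBorder0'">0</xsl:when>
        <xsl:otherwise>1</xsl:otherwise>
      </xsl:choose>
    </xsl:attribute>
    <!-- Assumed completion (the original listing is truncated here):
         process the table rows, then close the table -->
    <xsl:apply-templates />
  </table>
</xsl:template>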
We supply the following template to match the <w:tr> table row tags:
<!-- Match a table row -->
<xsl:template match="w:tr">
  <tr valign="top">
    <xsl:apply-templates />
  </tr>
</xsl:template>
Finally, we process individual cells within a row by matching the <w:tc> tag with a template. The formatting supported here includes background color and vertical alignment.
<!-- Match a table cell -->
<xsl:template match="w:tc">
  <td>
    <!-- Background color, if the cell is shaded -->
    <xsl:choose>
      <xsl:when test="w:tcPr/w:shd/@w:fill">
        <!-- Word stores the fill color without the leading '#' -->
        <xsl:attribute name="bgColor">
          <xsl:value-of select="concat('#', w:tcPr/w:shd/@w:fill)" />
        </xsl:attribute>
      </xsl:when>
    </xsl:choose>
    <!-- Vertical alignment, if specified -->
    <xsl:choose>
      <xsl:when test="w:tcPr/w:vAlign/@w:val">
        <xsl:attribute name="valign">
          <xsl:value-of select="w:tcPr/w:vAlign/@w:val" />
        </xsl:attribute>
      </xsl:when>
    </xsl:choose>
    <!-- Process the cell content -->
    <xsl:apply-templates />
  </td>
</xsl:template>
Additional Considerations
Word 2003 does a good job of representing a document with full fidelity in WordprocessingML; too good a job, in fact. Proofing errors, for example, may render as tags whether or not the options to display such errors are enabled. This can impact the XSL transformation.
Proofing Errors in Lists
When spelling or grammar errors exist at the beginning of a list item, a <w:proofErr> tag may result as a sibling tag to the list item. In our transformation, we assume that contiguous list items are siblings, and use the following-sibling and preceding-sibling XPath axes to render them. The appearance of the <w:proofErr> tag effectively interrupts the list, causing a new list to begin with the next list item. To avoid this problem, right-click those spelling and grammar errors in list items and choose to either fix or ignore them prior to transformation.
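A possible stylesheet-side alternative, which is not part of the companion stylesheet as described, is to skip <w:proofErr> elements when looking at neighboring siblings. The sketch below assumes proofing marks are the only elements that interrupt a list, and reduces the stylesheet to the BulletList handling alone.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- Proofing marks produce no output of their own -->
  <xsl:template match="w:proofErr" />

  <!-- A BulletList paragraph encountered in the normal flow -->
  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='BulletList']">
    <xsl:choose>
      <!-- Nearest preceding sibling, ignoring proofing marks, is a list item:
           this paragraph was already emitted by that list -->
      <xsl:when
        test="preceding-sibling::*[not(self::w:proofErr)][1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]" />
      <xsl:otherwise>
        <ul>
          <xsl:apply-templates select="." mode="insideBulletList" />
        </ul>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Emit one item, then chain to the next sibling that is not a proofing mark -->
  <xsl:template match="w:p" mode="insideBulletList">
    <li><xsl:apply-templates /></li>
    <xsl:apply-templates
      select="following-sibling::*[not(self::w:proofErr)][1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"
      mode="insideBulletList" />
  </xsl:template>

</xsl:stylesheet>
```

The extra predicate filters proofing marks out of the sibling axis before the positional [1] is applied, so the test always sees the nearest sibling of real content. Fixing or ignoring the errors in Word remains the simpler route.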
Hyperlinks as Field Codes
The existence of proofing errors may cause hyperlinks to render differently as well. I have seen hyperlinks represented in WordprocessingML as combinations of <w:fldChar> and <w:instrText>HYPERLINK ...</w:instrText> tags, rather than as <w:hlink> tags, if there are spelling or grammar errors in the text of the link. As with list items, to avoid this problem right-click on those spelling/grammar errors and fix or ignore them prior to transformation.
Summary
This article presents an XSL stylesheet for transforming a Word 2003 document into a simple HTML syntax, at the same time offering a Word 2003 template for CodeProject article authors. By making heavy use of Word styles, and by matching specific WordprocessingML tags, common HTML may be rendered without the verbosity typical of Word's Save as HTML command. Certain transformation issues are resolved through resourceful XSL application. For example, the problem of dealing with multiple-paragraph blocks is resolved by using the mode attribute of the <xsl:apply-templates> instruction, and a recursive template is applied for proper hierarchical nesting of character formatting. With room for further development, I hope this template serves as a useful tool and XSL example for the CodeProject community.