Introduction
This application uses the Open XML SDK to find MERGEFIELDs in Microsoft Word documents and replace them with the provided data. It also supports adding tables of data. This is a very fast and stable way of generating Microsoft Word documents server-side.
The main code consists of a single class with a few methods that do all the work. I've provided a front-end to test the functionality of the class.
To be able to run the application, you must download and install the aforementioned SDK. As the SDK is written in .NET 3.5, the entire library only works in .NET 3.5 and above.
Background
For a customer project, I needed the ability to inject data from an XML file into a standardized document format. The customer still used Microsoft Office 2000 but had installed the Compatibility Pack on all his PCs.
I didn't want to use Microsoft Word through OLE automation because it was a server-side process that ran unattended. As Microsoft doesn't recommend using Microsoft Office in such scenarios, it wasn't an option. But I remembered that the new docx format is just a zipped archive of loose XML files that can be edited. After some searching on the Internet, I found the Open XML SDK that provided a lot of help in parsing the Microsoft Word document structure. Finally, I've written a piece of code that fills a Microsoft Word docx file with the data from the XML file. This resulted in the required document with data.
Using this mechanism also gave me the additional advantage that the customer himself could edit the layout of the template. Although it wasn't a requirement, it saved me a lot of time afterwards.
Using the Front-End
Along with the source code, a front-end application has been provided to allow you to test the functionality.
This application has been written using WPF and uses the datagrid from the WPF toolkit. To be able to run the testing application, you'll need to download and install the WPF toolkit from CodePlex.
Of course, before being able to test anything, you'll need a docx template. I've added a sample template to the zip file, but you can just as well provide your own (see the following chapter for details about the template).
In the main window, you must start by providing the full path of the template in the textbox above (as long as this field is empty, the Generate button will be disabled).
Add your fields and the data in the grid in the center of the window. To add tabular data, click on the 'Add Table' button and define the tablename and column names (max. 5). Click on OK and provide the data for the table. Repeat this for each table.
Finally, click on Generate. Your report should appear automatically.
The docx Template
First of all, you'll need a Microsoft Word docx document with a number of MERGEFIELDs that act as placeholders for your data. The mergefields contain the name (code) of the data that you want to add, for example:
{MERGEFIELD CAND_NAME \* MERGEFORMAT}
There are also 3 suffixes that can be used:
- dp: Deletes the paragraph if the data field is empty or wasn't provided
- dr: (only in tables) Deletes the row if the data field is empty or wasn't provided
- dt: (only in tables) Deletes the whole table if the data field is empty or wasn't provided
The suffixes are added to the field name, with a preceding '#'. For example:
{MERGEFIELD CAND_NAME#dp \* MERGEFORMAT}
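As a rough illustration of this naming convention, the sketch below splits a field name into its data code and its optional suffix. The `ParsedField` type and the `FieldNameParser.Parse` method are hypothetical names invented for this example; the library's own parsing may well differ.

```csharp
using System;

// Hypothetical helper types for illustration; not part of the library.
public sealed class ParsedField
{
    public string Name;    // data code, e.g. "CAND_NAME"
    public string Suffix;  // "dp", "dr" or "dt", or null when no suffix is present
}

public static class FieldNameParser
{
    // Splits "CAND_NAME#dp" into Name = "CAND_NAME" and Suffix = "dp".
    public static ParsedField Parse(string fieldName)
    {
        if (fieldName == null) throw new ArgumentNullException("fieldName");

        ParsedField result = new ParsedField();
        int hash = fieldName.IndexOf('#');
        if (hash < 0)
        {
            // No '#': the whole string is the data code.
            result.Name = fieldName.Trim();
            result.Suffix = null;
        }
        else
        {
            result.Name = fieldName.Substring(0, hash).Trim();
            result.Suffix = fieldName.Substring(hash + 1).Trim();
        }
        return result;
    }
}

public static class FieldNameParserDemo
{
    public static void Main()
    {
        ParsedField f = FieldNameParser.Parse("CAND_NAME#dp");
        Console.WriteLine(f.Name + " / " + f.Suffix);
    }
}
```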
If you want to add tabular data to the Word document, you must add a Table to the docx document. The cells of the Table contain mergefields that indicate the datafields that must be placed there. These mergefields are formatted as: TBL_nameoftable_nameoffield. For example:
{MERGEFIELD TBL_LANG_NAME \* MERGEFORMAT}
The mergefield above tells the application that this cell contains the value of the Name-column in the selected record of the Lang-datatable. The application will add a row to the Table for each record found in the datatable. (Suffixes are not supported for tabular data. Each tablecell can only contain 1 mergefield.)
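The TBL_nameoftable_nameoffield convention can be sketched as a small parser. This is only an illustration: `TableFieldName.TryParse` is a hypothetical helper, and it assumes that the table name itself contains no underscore, so the first underscore after the `TBL_` prefix separates table from column.

```csharp
using System;

// Hypothetical helper for illustration; not part of the library.
public static class TableFieldName
{
    private const string Prefix = "TBL_";

    // Splits "TBL_LANG_NAME" into table = "LANG" and column = "NAME".
    // Assumption: the table name contains no underscore.
    public static bool TryParse(string fieldName, out string table, out string column)
    {
        table = null;
        column = null;
        if (fieldName == null || !fieldName.StartsWith(Prefix, StringComparison.Ordinal))
            return false;

        string rest = fieldName.Substring(Prefix.Length);
        int sep = rest.IndexOf('_');
        if (sep <= 0 || sep == rest.Length - 1)
            return false; // the table part or the column part is missing

        table = rest.Substring(0, sep);
        column = rest.Substring(sep + 1);
        return true;
    }
}

public static class TableFieldNameDemo
{
    public static void Main()
    {
        string table, column;
        if (TableFieldName.TryParse("TBL_LANG_NAME", out table, out column))
            Console.WriteLine("table=" + table + ", column=" + column);
    }
}
```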
Note: The application will fill loose mergefields that are placed in the header/footer of the document, but there's no support for tabular data in headers/footers.
Using the Code
There's only one (public) method that can be invoked on the FormFiller class: GetWordReport.
This method accepts 3 parameters:
- filename: Full path of the template docx file
- dataset: A DataSet containing the tabular data that must be added to the template. Each datatable in the dataset must be named according to the names used in the template (see above). If the template contains a field TBL_LANG_NAME, the datatable must be called 'LANG' and must contain a column 'NAME'. This parameter can be null if there's no tabular data.
- values: A Dictionary of strings where the key is the fieldname and the value is the data that must be placed in the Microsoft Word document.
If all goes well, the filled-in template is returned as an array of bytes.
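To make the expected inputs concrete, here is a minimal sketch that prepares a DataSet and a Dictionary in the shape described above. The `ReportInputs` class and the sample data are made up for this example. The call to GetWordReport is left as a comment because it needs the library and a template on disk, and whether the method is static is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper for illustration; only the shapes of the inputs matter.
public static class ReportInputs
{
    // One table named "LANG" with a "NAME" column, matching a template
    // field called TBL_LANG_NAME. The rows are sample data.
    public static DataSet BuildDataSet()
    {
        DataSet ds = new DataSet();
        DataTable lang = ds.Tables.Add("LANG");
        lang.Columns.Add("NAME", typeof(string));
        lang.Rows.Add("English");
        lang.Rows.Add("Dutch");
        return ds;
    }

    // Key = field name in the template, value = text to insert.
    public static Dictionary<string, string> BuildValues()
    {
        Dictionary<string, string> values = new Dictionary<string, string>();
        values["CAND_NAME"] = "Jane Doe";
        return values;
    }
}

public static class ReportInputsDemo
{
    public static void Main()
    {
        DataSet ds = ReportInputs.BuildDataSet();
        Dictionary<string, string> values = ReportInputs.BuildValues();
        Console.WriteLine(ds.Tables["LANG"].Rows.Count + " row(s), " + values.Count + " value(s)");

        // With the library referenced, the call would look roughly like this
        // (placeholder paths; a static method is assumed):
        // byte[] report = FormFiller.GetWordReport(@"C:\path\to\template.docx", ds, values);
        // System.IO.File.WriteAllBytes(@"C:\path\to\report.docx", report);
    }
}
```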
A Few Highlights in the Code
Opening the Template
Opening the docx file is very easy with the SDK. Only the following code is required:
    using (MemoryStream stream = new MemoryStream(filebytes))
    {
        using (WordprocessingDocument docx = WordprocessingDocument.Open(stream, true))
        {
            ...
        }
    }
(The filebytes variable is the read-in docx template.)
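The history at the end of this article mentions a fix for a MemoryStream that wasn't expandable. A stream created with `new MemoryStream(byte[])` has a fixed size, so any change that makes the document grow fails. The sketch below shows one possible way around this, copying the bytes into a default-constructed stream; `ToExpandableStream` is a hypothetical helper and not necessarily how the library solves it.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the library.
public static class TemplateStreams
{
    // Copies the template bytes into a resizable stream.
    public static MemoryStream ToExpandableStream(byte[] filebytes)
    {
        MemoryStream stream = new MemoryStream(); // default constructor: resizable
        stream.Write(filebytes, 0, filebytes.Length);
        stream.Position = 0; // rewind so a reader starts at the beginning
        return stream;
    }
}

public static class TemplateStreamsDemo
{
    public static void Main()
    {
        byte[] filebytes = new byte[] { 1, 2, 3 }; // stand-in for the template bytes

        using (MemoryStream fixedSize = new MemoryStream(filebytes))
        {
            fixedSize.Position = fixedSize.Length;
            try
            {
                fixedSize.WriteByte(4); // would have to grow the buffer
            }
            catch (NotSupportedException)
            {
                Console.WriteLine("fixed-size stream cannot grow");
            }
        }

        using (MemoryStream expandable = TemplateStreams.ToExpandableStream(filebytes))
        {
            expandable.Position = expandable.Length;
            expandable.WriteByte(4); // grows without complaint
            Console.WriteLine("expandable length: " + expandable.Length);
        }
    }
}
```

After the document has been saved, `stream.ToArray()` returns the filled-in document as a byte array.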
Providing a Run-object for the Data
In the OpenXML document, you can't just add text that contains plain hard returns or tabs. These must be replaced by the correct XML tags to be displayed correctly in Microsoft Word.
The mergefields in the OpenXML are represented as SimpleField (<fldSimple>) elements and can contain child Run (<r>) elements. The text of the field is represented as a child Text (<t>) element inside the Run element. A Run element can also have a RunProperties (<rPr>) element with additional layout information about the displayed text, which we don't want to lose, because we'd like our data to keep the same layout as the mergefield has in the template.
So, if we want to replace a mergefield with our text we must make sure that:
- tabs and returns in our data are rendered correctly, and
- the formatting of the mergefield is preserved
The code in the FormFiller.GetRunElementForText method does exactly this:
    internal static Run GetRunElementForText(string text, SimpleField placeHolder)
    {
        // Take the formatting of the first run found in the placeholder, if any.
        string rpr = null;
        if (placeHolder != null)
        {
            foreach (RunProperties placeholderrpr in placeHolder.Descendants<RunProperties>())
            {
                rpr = placeholderrpr.OuterXml;
                break;
            }
        }

        Run r = new Run();
        if (!string.IsNullOrEmpty(rpr))
            r.Append(new RunProperties(rpr));

        if (string.IsNullOrEmpty(text)) return r;

        // Every line break becomes a Break element.
        string[] split = text.Split(new string[] { "\n" }, StringSplitOptions.None);
        bool first = true;
        foreach (string s in split)
        {
            if (!first) r.Append(new Break());
            first = false;

            // Every tab becomes a TabChar element.
            bool firsttab = true;
            string[] tabsplit = s.Split(new string[] { "\t" }, StringSplitOptions.None);
            foreach (string tabtext in tabsplit)
            {
                if (!firsttab) r.Append(new TabChar());
                r.Append(new Text(tabtext));
                firsttab = false;
            }
        }
        return r;
    }
This method checks if there's a RunProperties element in the given mergefield. If there is, the content is preserved (.OuterXml) and added to the newly instantiated Run element. The data is inspected for tabs/returns and the correct elements are added to the data (Break and TabChar elements).
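To see what kind of markup this produces, the sketch below mirrors the same splitting logic with LINQ to XML instead of the SDK classes, so it runs without the SDK installed. `RunXmlSketch.BuildRun` is a hypothetical stand-in, and the copying of the run properties is left out for brevity.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in that builds the raw XML the SDK classes represent.
public static class RunXmlSketch
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a <w:r> element in which line breaks become <w:br/>
    // and tabs become <w:tab/>, with the text pieces in <w:t> elements.
    public static XElement BuildRun(string text)
    {
        XElement run = new XElement(W + "r");
        if (string.IsNullOrEmpty(text)) return run;

        string[] lines = text.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0) run.Add(new XElement(W + "br"));

            string[] cells = lines[i].Split('\t');
            for (int j = 0; j < cells.Length; j++)
            {
                if (j > 0) run.Add(new XElement(W + "tab"));
                run.Add(new XElement(W + "t", cells[j]));
            }
        }
        return run;
    }
}

public static class RunXmlSketchDemo
{
    public static void Main()
    {
        XElement run = RunXmlSketch.BuildRun("a\tb\nc");
        Console.WriteLine(run);
        Console.WriteLine("text elements: " + run.Elements(RunXmlSketch.W + "t").Count());
    }
}
```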
Saving the Template
Once all the fields have been filled in, the changes must be explicitly saved back into the document (it doesn't happen automatically).
docx.MainDocumentPart.Document.Save();
Processing Headers and Footers
The headers and footers aren't placed in the same XML file as the main document (it's a different 'document part' in the package). The code that is discussed above won't find MERGEFIELDs that are placed in the header or footer. For this, a loop over the header- and footerparts is required. Below is an example of a loop over the headers of the document:
    foreach (HeaderPart hpart in docx.MainDocumentPart.HeaderParts)
    {
        ...
        hpart.Header.Save();
    }
Points of Interest
The suffixes (see above) allow you to delete paragraphs, rows and tables. If this is done while iterating over the elements, the loop suddenly stops (without throwing any error whatsoever). For example, suppose there are 10 mergefields in the document and you're iterating over them using the following statement:
    foreach (var field in docx.MainDocumentPart.Document.Descendants<SimpleField>())
    {
        ...
    }
Suppose you decide to delete element 5. For example, the following code searches the parent Paragraph (<p>) element of the mergefield and deletes it (also deleting the field itself):
    Paragraph p = GetFirstParent<Paragraph>(field);
    if (p != null)
        p.Remove();
You'll never reach elements 6 to 10. The loop will quit without any indication that you've missed 5 elements.
To solve this, you'll notice that the code contains 2 loops: the first loop fills the mergefields with the data and keeps a list of the empty mergefields, and the second loop deletes all those empty mergefields.
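The two-loop idea can be sketched independently of the SDK. The example below uses LINQ to XML with simplified element names (`p` and `field` are stand-ins for the real paragraph and field elements), and `TwoPassDelete` is a hypothetical class written for this illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical illustration of "collect first, delete afterwards".
public static class TwoPassDelete
{
    // Removes every <p> that contains a <field> without a value.
    // Returns the number of paragraphs removed.
    public static int RemoveEmptyParagraphs(XElement body)
    {
        // Pass 1: only inspect, and remember what has to go.
        List<XElement> toRemove = new List<XElement>();
        foreach (XElement field in body.Descendants("field"))
        {
            if (string.IsNullOrEmpty(field.Value))
            {
                XElement p = field.Ancestors("p").FirstOrDefault();
                if (p != null && !toRemove.Contains(p)) toRemove.Add(p);
            }
        }

        // Pass 2: the enumeration above is finished, so removing is safe.
        foreach (XElement p in toRemove) p.Remove();
        return toRemove.Count;
    }
}

public static class TwoPassDeleteDemo
{
    public static void Main()
    {
        XElement body = XElement.Parse(
            "<body>" +
            "<p><field>a</field></p>" +
            "<p><field/></p>" +
            "<p><field>c</field></p>" +
            "<p><field></field></p>" +
            "</body>");

        int removed = TwoPassDelete.RemoveEmptyParagraphs(body);
        Console.WriteLine("removed " + removed + ", kept " + body.Elements("p").Count());
    }
}
```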
Update provided by M. Chale
The library now supports tags for UPPER, LOWER, FirstCap and Caps. UPPER and LOWER modify the entire string to be uppercase or lowercase; FirstCap capitalizes the first letter while making everything else lowercase; and Caps title-cases words, capitalizing the first letter of every word. Note that the Caps routine is a bit naive, only capitalizing letters that directly follow spaces. The library also supports text that should appear before or after the data. They will be inserted with the same formatting as the rest of the MergeField, provided the field is not blank and marked #dp.
A sample field with formatting: MERGEFIELD MYFIELD \* UPPER \b before \f after
Thanks to Michael Chale for this update.
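The four formatting tags can be approximated as follows. This is a sketch based only on the description above, not the code from the update: `FieldFormat.Apply` is a hypothetical name, matching the tag case-insensitively is an assumption, and so is the choice to leave all other letters untouched in the Caps case.

```csharp
using System;
using System.Text;

// Hypothetical re-implementation for illustration; the library's code may differ.
public static class FieldFormat
{
    // Applies one of the tags UPPER, LOWER, FirstCap or Caps to a value.
    // Unknown tags return the value unchanged.
    public static string Apply(string value, string format)
    {
        if (string.IsNullOrEmpty(value) || format == null) return value;

        switch (format.ToUpperInvariant())
        {
            case "UPPER":
                return value.ToUpperInvariant();
            case "LOWER":
                return value.ToLowerInvariant();
            case "FIRSTCAP":
                // First letter uppercase, everything else lowercase.
                return char.ToUpperInvariant(value[0]) + value.Substring(1).ToLowerInvariant();
            case "CAPS":
                return NaiveCaps(value);
            default:
                return value;
        }
    }

    // Naive title-casing: only the first character and characters that
    // directly follow a space are capitalized.
    private static string NaiveCaps(string value)
    {
        StringBuilder sb = new StringBuilder(value.Length);
        bool capitalizeNext = true; // the first character starts a word
        foreach (char c in value)
        {
            sb.Append(capitalizeNext ? char.ToUpperInvariant(c) : c);
            capitalizeNext = (c == ' ');
        }
        return sb.ToString();
    }
}

public static class FieldFormatDemo
{
    public static void Main()
    {
        Console.WriteLine(FieldFormat.Apply("hello WORLD", "FirstCap"));
        Console.WriteLine(FieldFormat.Apply("jean-luc van damme", "Caps"));
    }
}
```

The second call shows the limitation mentioned above: the letter after the hyphen stays lowercase because it doesn't follow a space.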
Update for Microsoft Word 2010
Since Microsoft Word 2010, the SimpleField element is no longer used. It has been replaced with a number of Run elements where one (or more) contain a FieldCode element with the field instruction. The code of the library has been modified to replace these with the old-style SimpleField, thus remaining backwards compatible with Microsoft Word 2007 documents.
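For readers who haven't seen the newer markup: such a field is a sequence of runs delimited by `fldChar` elements of type begin, separate and end, with the instruction stored in `instrText` elements (the FieldCode class in the SDK). The sketch below only reads the instruction back with LINQ to XML. It doesn't perform the replacement with a SimpleField, it ignores nested fields, and `ComplexFieldSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

// Hypothetical illustration; does not handle nested fields.
public static class ComplexFieldSketch
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the instruction text of every begin/.../end field found
    // among the runs of one paragraph.
    public static List<string> GetFieldInstructions(XElement paragraph)
    {
        List<string> result = new List<string>();
        StringBuilder current = null; // non-null while reading an instruction

        foreach (XElement run in paragraph.Elements(W + "r"))
        {
            foreach (XElement child in run.Elements())
            {
                if (child.Name == W + "fldChar")
                {
                    string type = (string)child.Attribute(W + "fldCharType");
                    if (type == "begin")
                    {
                        current = new StringBuilder();
                    }
                    else if (current != null && (type == "separate" || type == "end"))
                    {
                        // The instruction part is complete.
                        result.Add(current.ToString().Trim());
                        current = null;
                    }
                }
                else if (child.Name == W + "instrText" && current != null)
                {
                    // The instruction may be spread over several runs.
                    current.Append(child.Value);
                }
            }
        }
        return result;
    }
}

public static class ComplexFieldSketchDemo
{
    public static void Main()
    {
        string xml =
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
            "<w:r><w:fldChar w:fldCharType='begin'/></w:r>" +
            "<w:r><w:instrText> MERGEFIELD CAND_NAME \\* MERGEFORMAT </w:instrText></w:r>" +
            "<w:r><w:fldChar w:fldCharType='separate'/></w:r>" +
            "<w:r><w:t>CAND_NAME</w:t></w:r>" +
            "<w:r><w:fldChar w:fldCharType='end'/></w:r>" +
            "</w:p>";

        foreach (string instruction in ComplexFieldSketch.GetFieldInstructions(XElement.Parse(xml)))
            Console.WriteLine(instruction);
    }
}
```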
History
- 2009-07-29: Submitted to CodeProject
- 2009-08-12: Mergefields in headers and footers will now also be processed
- 2009-08-14: Small update in source: formatting of mergefields in tables is now also repeated (bold, italic, ...)
- 2009-09-15: Updated source: MemoryStream wasn't expandable and table row properties weren't copied. Fixed both issues.
- 2010-06-14: Michael Chale added support for formatting the fields. I've updated the solution for VS2010.
- 2010-08-02: Updated library to work with Microsoft Word 2010 generated documents
- 2011-05-30: Added a couple of bugfixes to the library