Introduction
GitHub: https://github.com/zaagan/Docx-Html-Docx.git
I would have simply uploaded this whole article from my docx file in just a few seconds, if only this WYSIWYG editor that I wrote this article on had an Upload from Docx button also. Well, I could have just used the Paste from Word button. But to paste from a Word document, we need to have a Microsoft Office Package installed on the system (in Windows).
This article is the solution to that problem and also to help C# developers to perform Docx-HTML-Docx Conversion. The resources found in this article have been collected from many different places and solutions provided by many awesome developers around the globe and combined into one small sample application such that developers don't have to dwell around looking for solutions to common problems.
For now, we will look into how the conversion is done. In the next chapter to this article, we will be creating our very own CKEditor plug in to upload from Docx (Coming soon :D).
Requirements
- DocumentFormat.OpenXml.dll (2.6.0.0) [ For Docx to Html Conversion ]
- DocumentFormat.OpenXml.dll (2.5.5631.0) [ For Html to Docx Conversion ]
We actually didn't have to include two different sets of the same DLLs, but it was mandatory due to some DLL issues - OpenXmlPowerTools.dll
- System.IO.Packaging.dll (1.0.0.0)
- HtmlToOpenXml (1.6.0.0)
System.Drawing
[ Add Reference ] System.IO.Compression
[ Add Reference ] - CKEditor (4.6.1 Standard) - Your choice
Note: You can also find the above mentioned DLLs in the project that I have attached along with this article.
Background
Docx to HTML is becoming a very common requirement these days, mainly if you have a CMS or are building one and your WYSIWYG editor wants this feature. You can also find many questions regarding Docx to Html conversion in StackOverflow if you have noticed.
This editor I wrote my article on also has its own Paste from Word button. It would have been much better, if it had a feature to directly upload from a docx file alongside it. I hope this feature will soon be available in all the WYSIWYG editors out there.
Moving on to what this article intends to do is as shown in the figure below:
Well, if you didn't know what a Docx file is, then it is simply a packaged file just like our normal zip file with its own set of standardized structure. If you try uncompressing a docx file with a Decompressor or a Zip extractor, this is what you get:
For full details of the packaging structuring, you can head on to the following link:
Using the Code
Converting a Docx File data to an HTML content is as simple as shown by the following code:
DocxToHTML.Converter.HTMLConverter converter = new DocxToHTML.Converter.HTMLConverter();
string htmlContent = converter.ConvertToHtml(YOUR-DOCX-FILE);
If you are building an ASP.NET application, you could have just sent the converted HTML content to the client but for demo purposes, I have shown the output in a CKEditor control inside a WinForm WebBrowser control.
One thing we need to look for while parsing the docx content is to check for broken hyperlinks which might result in an exception. The following code intends to handle that exception.
string htmlText = string.Empty;
try
{
htmlText = ParseDOCX(fileInfo);
}
catch (OpenXmlPackageException e)
{
if (e.ToString().Contains("Invalid Hyperlink"))
{
using (FileStream fs = new FileStream(fullFilePath,
FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
}
htmlText = ParseDOCX(fileInfo);
}
}
return htmlText;
Actual parsing is done here by this method:
private string ParseDOCX(FileInfo fileInfo)
{
try
{
byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
using (MemoryStream memoryStream = new MemoryStream())
{
memoryStream.Write(byteArray, 0, byteArray.Length);
using (WordprocessingDocument wDoc =
WordprocessingDocument.Open(memoryStream, true))
{
int imageCounter = 0;
var pageTitle = fileInfo.FullName;
var part = wDoc.CoreFilePropertiesPart;
if (part != null)
pageTitle = (string)part.GetXDocument()
.Descendants(DC.title)
.FirstOrDefault() ?? fileInfo.FullName;
WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
{
AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
PageTitle = pageTitle,
FabricateCssClasses = true,
CssClassPrefix = "pt-",
RestrictToSupportedLanguages = false,
RestrictToSupportedNumberingFormats = false,
ImageHandler = imageInfo =>
{
++imageCounter;
string extension = imageInfo.ContentType.Split('/')[1].ToLower();
ImageFormat imageFormat = null;
if (extension == "png") imageFormat = ImageFormat.Png;
else if (extension == "gif") imageFormat = ImageFormat.Gif;
else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
else if (extension == "tiff")
{
extension = "gif";
imageFormat = ImageFormat.Gif;
}
else if (extension == "x-wmf")
{
extension = "wmf";
imageFormat = ImageFormat.Wmf;
}
if (imageFormat == null) return null;
string base64 = null;
try
{
using (MemoryStream ms = new MemoryStream())
{
imageInfo.Bitmap.Save(ms, imageFormat);
var ba = ms.ToArray();
base64 = System.Convert.ToBase64String(ba);
}
}
catch (System.Runtime.InteropServices.ExternalException)
{ return null; }
ImageFormat format = imageInfo.Bitmap.RawFormat;
ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
.First(c => c.FormatID == format.Guid);
string mimeType = codec.MimeType;
string imageSource =
string.Format("data:{0};base64,{1}", mimeType, base64);
XElement img = new XElement(Xhtml.img,
new XAttribute(NoNamespace.src, imageSource),
imageInfo.ImgStyleAttribute,
imageInfo.AltText != null ?
new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
return img;
}
};
XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
var html = new XDocument(new XDocumentType("html", null, null, null),
htmlElement);
var htmlString = html.ToString(SaveOptions.DisableFormatting);
return htmlString;
}
}
}
catch
{
return "File contains corrupt data";
}
}
The Uri fixing code goes like this:
private static string FixUri(string brokenUri)
{
string newURI = string.Empty;
if (brokenUri.Contains("mailto:"))
{
int mailToCount = "mailto:".Length;
brokenUri = brokenUri.Remove(0, mailToCount);
newURI = brokenUri;
}
else
{
newURI = " ";
}
return newURI;
}
The HTML to Docx Conversion can be viewed in the link below:
Sources
I would like to thank each and every individual for his/her contribution and the helpful solutions to various problems that were encountered related to this topic.
I am a Software Engineer and a developer by profession.
The following things describes me:
1. I am a real time/run time developer : Basically learn and implement things on the fly.
2. Definitely not a technology racist : Love to work and explore any technology that i come across.
3. Prefer Beer over H2O : Code with passion and for fun
I am interested in Research & Development based tasks, exploring, experimenting and trying out new things.
Technologies i have been using up until now are C#, ASP.NET, Win Services, Web Services, Restful Web API, Windows Application, Windows Phone Application, Store Apps, couple of JavaScript frameworks, Xamarin Forms, NodeJS, React, ReactNative, AngularJS, SQL Server, MongoDB, Postgres etc.