
C# Docx to HTML to Docx

22 Dec 2016 | CPOL | 3 min read
Converting Docx To Html to Docx

Introduction

GitHub: https://github.com/zaagan/Docx-Html-Docx.git

I could have uploaded this whole article from my docx file in a few seconds, if only the WYSIWYG editor I wrote it in also had an Upload from Docx button. I could have used its Paste from Word button instead, but pasting from a Word document requires a Microsoft Office package to be installed on the system (on Windows).

This article is a solution to that problem, and it also aims to help C# developers perform Docx-HTML-Docx conversion. The resources used here were collected from many different places and from solutions shared by many awesome developers around the globe, then combined into one small sample application so that developers don't have to hunt around for solutions to common problems.

For now, we will look at how the conversion is done. In a follow-up to this article, we will create our very own CKEditor plugin to upload from Docx (coming soon :D).

Requirements

  1. DocumentFormat.OpenXml.dll (2.6.0.0) [ For Docx to Html Conversion ]
  2. DocumentFormat.OpenXml.dll (2.5.5631.0) [ For Html to Docx Conversion ]
    Ideally one version of this DLL would be enough, but two different versions had to be included because of some DLL issues.
  3. OpenXmlPowerTools.dll
  4. System.IO.Packaging.dll (1.0.0.0)
  5. HtmlToOpenXml (1.6.0.0)
  6. System.Drawing [ Add Reference ]
  7. System.IO.Compression [ Add Reference ]
  8. CKEditor (4.6.1 Standard) - Your choice

Note: You can also find the above-mentioned DLLs in the project attached to this article.

Background

Docx to HTML conversion is becoming a very common requirement these days, especially if you run or are building a CMS and want this feature in its WYSIWYG editor. You can also find many questions about Docx to HTML conversion on Stack Overflow.

The editor I wrote this article in also has its own Paste from Word button. It would have been much better if it also had a feature to upload directly from a docx file. I hope this feature will soon be available in all the WYSIWYG editors out there.

What this article sets out to do is shown in the figure below:

Image 2

If you don't know what a Docx file is, it is simply a packaged file, just like a normal zip file, with its own standardized internal structure. If you uncompress a docx file with a zip extractor, this is what you get:

Image 3
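
Since a docx file is a zip package, you can also look inside it from code. The following is only a minimal sketch: the class and method names (`DocxPackageInspector`, `ListPartNames`) are made up for illustration and are not part of the sample project. It uses nothing but `System.IO.Compression`, which is already in the requirements list.

```csharp
using System.Collections.Generic;
using System.IO.Compression;

// Hypothetical helper, not part of the article's sample project.
public static class DocxPackageInspector
{
    // Returns the full name of every entry stored in the .docx zip package.
    public static List<string> ListPartNames(string docxPath)
    {
        var names = new List<string>();

        // On .NET Framework, ZipFile also needs a reference to
        // System.IO.Compression.FileSystem.
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }

        return names;
    }
}
```

For a document saved by Word, the returned list typically includes entries such as `[Content_Types].xml` and `word/document.xml`.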

For full details of the packaging structure, you can head to the following link:

Using the Code

Converting the contents of a Docx file to HTML is as simple as the following code:

C#
// Replace YOUR-DOCX-FILE with the full path of the .docx file to convert.
DocxToHTML.Converter.HTMLConverter converter = new DocxToHTML.Converter.HTMLConverter();
string htmlContent = converter.ConvertToHtml(YOUR-DOCX-FILE);

If you are building an ASP.NET application, you could just send the converted HTML content to the client, but for demo purposes, I have shown the output in a CKEditor control inside a WinForms WebBrowser control.

Image 4

One thing to watch for while parsing the docx content is broken hyperlinks, which can cause an exception. The following code handles that exception by repairing the links and parsing again.

C#
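string htmlText = string.Empty;
try
{
    // First attempt: parse the document as it is.
    htmlText = ParseDOCX(fileInfo);
}
catch (OpenXmlPackageException e)
{
    // A malformed hyperlink in the package surfaces as an OpenXmlPackageException.
    if (e.ToString().Contains("Invalid Hyperlink"))
    {
        // Rewrite the broken URIs in the file on disk, then parse again.
        using (FileStream fs = new FileStream(fullFilePath, 
                                   FileMode.OpenOrCreate, FileAccess.ReadWrite))
        {
            UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
        }
        htmlText = ParseDOCX(fileInfo);
    }
}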

return htmlText;
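
The snippet above has a "try, repair, retry once" shape. Here is a library-free sketch of just that control flow, so it can be run without the Open XML DLLs. Everything in it is an assumption made for illustration: `RetryAfterRepair`, `parse` and `repair` are hypothetical names, and `InvalidOperationException` stands in for `OpenXmlPackageException`. Unlike the snippet above, it lets unrelated errors propagate instead of returning an empty string.

```csharp
using System;

// Hypothetical helper illustrating the control flow only.
public static class RetryAfterRepair
{
    // Runs parse(). If it fails with an error whose text mentions
    // "Invalid Hyperlink", runs repair() once and parses again.
    // Any other error is passed on to the caller.
    public static string Run(Func<string> parse, Action repair)
    {
        try
        {
            return parse();
        }
        catch (InvalidOperationException e) when (e.ToString().Contains("Invalid Hyperlink"))
        {
            repair();        // e.g. rewrite the broken URIs in the file
            return parse();  // second attempt; a failure here is not retried
        }
    }
}
```

Retrying only once keeps a document that cannot be repaired from looping forever.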

The actual parsing is done by this method:

C#
private string ParseDOCX(FileInfo fileInfo)
{
    try
    {
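         // Copy the file into an expandable MemoryStream so the package can be
         // opened for editing without modifying the file on disk.
         byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
         using (MemoryStream memoryStream = new MemoryStream())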
         {
              memoryStream.Write(byteArray, 0, byteArray.Length);
              using (WordprocessingDocument wDoc = 
                                        WordprocessingDocument.Open(memoryStream, true))
              {
                    int imageCounter = 0;
                    var pageTitle = fileInfo.FullName;
                    var part = wDoc.CoreFilePropertiesPart;
                    if (part != null)
                        pageTitle = (string)part.GetXDocument()
                                                .Descendants(DC.title)
                                                .FirstOrDefault() ?? fileInfo.FullName;
                    
                    WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                    {
                         AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                         PageTitle = pageTitle,
                         FabricateCssClasses = true,
                         CssClassPrefix = "pt-",
                         RestrictToSupportedLanguages = false,
                         RestrictToSupportedNumberingFormats = false,
                         ImageHandler = imageInfo =>
                         {
                             ++imageCounter;
                             string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                             ImageFormat imageFormat = null;
                             if (extension == "png") imageFormat = ImageFormat.Png;
                             else if (extension == "gif") imageFormat = ImageFormat.Gif;
                             else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                             else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                             else if (extension == "tiff")
                             {
                                 extension = "gif";
                                 imageFormat = ImageFormat.Gif;
                             }
                             else if (extension == "x-wmf")
                             {
                                  extension = "wmf";
                                  imageFormat = ImageFormat.Wmf;
                             }

                             if (imageFormat == null) return null;

                             string base64 = null;
                             try
                             {
                                  using (MemoryStream ms = new MemoryStream())
                                  {
                                        imageInfo.Bitmap.Save(ms, imageFormat);
                                        var ba = ms.ToArray();
                                        base64 = System.Convert.ToBase64String(ba);
                                  }
                             }
                             catch (System.Runtime.InteropServices.ExternalException)
                             { return null; }

                             ImageFormat format = imageInfo.Bitmap.RawFormat;
                             ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                       .First(c => c.FormatID == format.Guid);
                             string mimeType = codec.MimeType;

                             string imageSource = 
                                    string.Format("data:{0};base64,{1}", mimeType, base64);
                             
                             XElement img = new XElement(Xhtml.img,
                                   new XAttribute(NoNamespace.src, imageSource),
                                   imageInfo.ImgStyleAttribute,
                                   imageInfo.AltText != null ?
                                        new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                             return img;                            
                          }   
                    };

                    XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                    var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                                htmlElement);
                    var htmlString = html.ToString(SaveOptions.DisableFormatting);
                    return htmlString;
              }
         }
    }
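    catch (OpenXmlPackageException)
    {
        // Rethrow so the caller can repair invalid hyperlinks and retry;
        // a bare catch here would swallow the exception the caller waits for.
        throw;
    }
    catch
    {
        return "File contains corrupt data";
    }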
}
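
The image handler above embeds each picture directly in the HTML as a `data:` URI, so no image files need to be stored separately. The string-building step can be shown on its own; this is just a sketch, and `DataUriBuilder.Build` is a hypothetical helper name, not something from the sample project or from OpenXmlPowerTools.

```csharp
using System;

// Hypothetical helper, not part of the article's sample project.
public static class DataUriBuilder
{
    // Builds a data: URI of the form "data:<mime type>;base64,<payload>".
    // The MIME type should describe the bytes that are actually encoded.
    public static string Build(string mimeType, byte[] imageBytes)
    {
        if (string.IsNullOrEmpty(mimeType))
            throw new ArgumentException("A MIME type is required.", "mimeType");
        if (imageBytes == null)
            throw new ArgumentNullException("imageBytes");

        return string.Format("data:{0};base64,{1}",
                             mimeType, Convert.ToBase64String(imageBytes));
    }
}
```

For example, `DataUriBuilder.Build("image/png", new byte[] { 1, 2, 3 })` returns `data:image/png;base64,AQID`.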

The URI fixing code goes like this:

C#
private static string FixUri(string brokenUri)
{
     string newURI = string.Empty;
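     if (brokenUri.Contains("mailto:"))
     {
         // Drop the "mailto:" scheme and keep the rest.
         // Remove(0, n) assumes "mailto:" is at the very start of the string.
         int mailToCount = "mailto:".Length;
         brokenUri = brokenUri.Remove(0, mailToCount);
         newURI = brokenUri;
     }
     else
     {
        // Any other broken URI is replaced with a single space.
        newURI = " ";
     }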
     return newURI;
}
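
`FixUri` checks for `mailto:` with `Contains` but then removes characters from position 0, so it relies on the prefix being at the very start of the string. If you want to be a little more defensive, a variant along these lines could be used. It is only a sketch: `UriRepair.FixUriDefensive` is a hypothetical name, and treating a `mailto:` that appears in the middle of the string like any other broken URI is my assumption about the desired behaviour.

```csharp
using System;

// Hypothetical variant of FixUri, not part of the article's sample project.
public static class UriRepair
{
    private const string MailtoPrefix = "mailto:";

    // Strips a leading "mailto:" (case-insensitive) and returns the rest.
    // Every other broken URI, including null, becomes a single space,
    // matching the fallback used by FixUri.
    public static string FixUriDefensive(string brokenUri)
    {
        if (brokenUri != null &&
            brokenUri.StartsWith(MailtoPrefix, StringComparison.OrdinalIgnoreCase))
        {
            return brokenUri.Substring(MailtoPrefix.Length);
        }

        return " ";
    }
}
```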

The HTML to Docx conversion can be viewed at the link below:

Sources

I would like to thank every individual for their contributions and helpful solutions to the various problems encountered on this topic.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer, Assurance IQ
Nepal
I am a Software Engineer and a developer by profession.

The following things describe me:
1. I am a real time/run time developer: I basically learn and implement things on the fly.
2. Definitely not a technology racist: I love to work with and explore any technology that I come across.
3. Prefer Beer over H2O: Code with passion and for fun.

I am interested in Research & Development based tasks, exploring, experimenting and trying out new things.
The technologies I have used so far are C#, ASP.NET, Win Services, Web Services, RESTful Web API, Windows Application, Windows Phone Application, Store Apps, a couple of JavaScript frameworks, Xamarin Forms, NodeJS, React, React Native, AngularJS, SQL Server, MongoDB, Postgres, etc.

Comments and Discussions

 
Praise: works like a charm. Docx to PDF
DaaJason, 6-Jun-21 0:43
5 years later, your code still works great! I had to install version 2.7.2 of DocumentFormat.OpenXml from NuGet instead of the current latest version, which is fine by me.

One of the things I think it does very well is convert the images to their data:image format. Some libraries require storing off images and pointing to their path.

I am using this as an intermediate step to convert a docx to pdf without using $1k+ 3rd party tools (Aspose) in case anyone is looking for the same. This allows my clients to use docx as a template for pdf attachments sent as emails. Just a little text replacement of the docx contents and it's a great way to let THEM customize their content.
I am using itext.htmltopdf nuget package for the html to pdf conversion.

Re: works like a charm. Docx to PDF
nullnull002, 21-Jan-24 15:30

Question: DocxToHtml detail
csugden, 26-Aug-19 3:59
