Helpful Toolchain for Article Writing

29 Jun 2017, CPOL, 13 min read
This toolchain (v.2) helped me accelerate article writing, reduce the number of mistakes and save tons of time; some of my recommendations could help other authors.

Epigraph:

I wanted to say only one new word to the world. As I failed to do so, I became a writer.

Stanisław Jerzy Lec

Writing is Annoying

After suffering for a while from a rough writing process, I started to think about making it less painful. Indeed, article writing easily gets bogged down in tons of annoying detail: making sure that all anchors are unique and match the HREF, all links are correct, styles are consistent, and so on. It’s easy to forget a million subtle things and get distracted from the main topics and goals of the article itself. Postponing all the annoying work does not help much; it only becomes harder and harder to force yourself to get to this part of the work.

Actually, the CodeProject article editor supports a consistent editing process, but it lacks the one most important feature: an auto-generated Table of Contents (TOC) and auto-generated cross-references, especially between the TOC and the headings. I just have to have this feature, period. Besides, with a slow internet connection, on-line editing would be prohibitively slow.

So, my efforts to reduce the pain and make writing more stable and enjoyable started to pay off: I accelerated my work a lot, removed many distractions and now feel pretty good about it. I hope the result of my research can help many. This is a pretty boring kind of research, and there is no need for everyone to go this way; it’s much better to use someone else’s experience. My toolchain is far from perfect, but I think it is not the worst balance between the tooling effort and article writing itself.

At the same time, I would be very grateful if someone suggested some improvements or offered some informative criticism and perhaps some better ideas.

The toolchain is cross-platform for those who would use plain text as the source (probably the best option), but Windows-specific for those who would like to start with MS Word.

Pandoc and Wiki

After some searching and some trial and error, which did not take much time, I ended up using the open-source Pandoc converter to convert from Wiki to HTML.

CodeProject style, as well as any reasonable style of article writing, can be matched very well by Wiki. In both approaches, a strict, moderate and consistent style is encouraged. All articles should look pretty much alike; the colors and effects should not shout. The article author should try to stand out with ideas and good taste, not with cheap effects.

So, the first option is to write Wiki text as the article source.

TWiki

Pandoc supports only a few Wiki input formats, and the support is far from comprehensive. Please see the documentation files I included in the ZIP file you can download from this page.

I’ve chosen TWiki, not the most popular Wiki markup, because I find it the closest to the required CodeProject-specific features, and it does the required minimum. One very attractive feature is the ease of rendering one of the most frequently needed CodeProject elements: the in-line code fragment. It is rendered as a code element and looks in the source like =this text=. Isn’t that simple?
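To make the mapping concrete, here is a minimal JavaScript sketch of what the in-line code rule amounts to. It is not how Pandoc parses TWiki; the function names `escapeHtml` and `twikiInlineCode` and the simplified regular expression are assumptions made only for illustration.

```javascript
// Escapes the characters that have a special meaning in HTML.
function escapeHtml(text) {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: converts TWiki in-line code (=text=) to a code element.
// Simplification: the opening "=" must start the line or follow whitespace,
// and the closing "=" must be followed by whitespace, punctuation or line end.
function twikiInlineCode(line) {
    return line.replace(/(^|\s)=(\S(?:.*?\S)?)=(?=[\s.,;:!?)]|$)/g,
        (match, lead, code) => lead + "<code>" + escapeHtml(code) + "</code>");
}

console.log(twikiInlineCode("Call =doc.quit(0)= when done."));
// Call <code>doc.quit(0)</code> when done.
```

An ordinary assignment such as `x = y = z` is left alone, because the character after an opening `=` must not be a space.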

A really big problem is block code samples. For some reason, Pandoc produces an inner code element inside the pre element, and it badly spoils the formatting of the whole code fragment. The best work-around I found was also possible with TWiki. My sample markup is shown in the section CodeProject-Specific; it also solves the problem of the anchor to the code sample, which would be a problem even with manual HTML coding.
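The work-around can be generated rather than typed by hand. The sketch below emits the markup pattern shown in the CodeProject-Specific section of this article: a TWiki literal block holding an empty i element with the anchor id, followed by the pre element. The function name `twikiCodeBlock` is hypothetical, and I assume the code sample has to be HTML-escaped because it ends up inside raw HTML.

```javascript
// Hypothetical helper: wraps a code sample in the <literal> pattern used in
// this article. The empty <i> element carries the anchor id, because the id
// does not work on the pre element itself.
function twikiCodeBlock(anchorId, lang, code) {
    const escaped = code
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
    return '<literal><i id="' + anchorId + '"></i><pre lang="' + lang + '">\n' +
        escaped + "\n</pre></literal>";
}

console.log(twikiCodeBlock("someAnchor", "c#", "if (a < b) SomeCSharpCode(forExample);"));
```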

It’s important to remember that CodeProject requires the top-level headings to be h2. In TWiki, it’s written as

---++Top-Level Article Section Heading
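In TWiki, the number of plus signs after the three hyphens gives the heading level, which is why two of them produce h2. A minimal sketch of that rule in JavaScript follows; the function name `twikiHeadingToHtml` is an assumption, and TWiki extras such as headings excluded from the TOC are ignored.

```javascript
// Hypothetical helper: maps a TWiki heading line to an HTML heading element.
// The number of "+" signs after "---" is the heading level (1 to 6).
function twikiHeadingToHtml(line) {
    const match = /^---(\+{1,6})\s*(.+?)\s*$/.exec(line);
    if (!match) return null; // not a heading line
    const level = match[1].length;
    return "<h" + level + ">" + match[2] + "</h" + level + ">";
}

console.log(twikiHeadingToHtml("---++Top-Level Article Section Heading"));
// <h2>Top-Level Article Section Heading</h2>
```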

With the proper Pandoc option, the Table of Contents (TOC) is built automatically, and all headings are generated with automatic anchors matching the TOC href values. Presently, TWiki does not support an auto-numbered TOC, but this is not too bad, because the CodeProject article style is designed for interactive reading. Also, it’s important to keep some headings out of the TOC; this is done with a Pandoc parameter specifying the maximum level of headings in the TOC. I’ll explain it with the script code.
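The following sketch shows the idea behind an auto-generated TOC with a depth limit: every heading gets a unique anchor, and only headings up to the maximum level are listed. It is an illustration, not Pandoc's implementation. The identifier rules are a simplified, ASCII-only approximation of the ones described in the Pandoc manual, the names `slug` and `buildToc` are hypothetical, and the TOC is flat rather than nested.

```javascript
// Simplified Pandoc-style automatic identifier (ASCII only): drop punctuation
// except "_", "-" and ".", replace spaces with hyphens, lowercase everything,
// and strip anything before the first letter.
function slug(text) {
    const id = text
        .replace(/[^A-Za-z0-9_\-. ]/g, "")
        .trim()
        .replace(/ +/g, "-")
        .toLowerCase()
        .replace(/^[^a-z]+/, "");
    return id || "section";
}

// headings: array of { level, text }; returns a flat TOC limited to maxDepth.
// Each heading object receives a unique "id" property, listed or not.
function buildToc(headings, maxDepth) {
    const used = new Set();
    const items = [];
    for (const heading of headings) {
        const base = slug(heading.text);
        let id = base;
        for (let n = 1; used.has(id); n += 1) id = base + "-" + n;
        used.add(id);
        heading.id = id;
        if (heading.level <= maxDepth) {
            items.push('<li><a href="#' + id + '">' + heading.text + "</a></li>");
        }
    }
    return "<ul>\n" + items.join("\n") + "\n</ul>";
}

console.log(buildToc([
    { level: 2, text: "Pandoc and Wiki" },
    { level: 3, text: "TWiki" },
    { level: 2, text: "Script/Batch" }
], 2));
```

With a maximum depth of 2, the level 3 heading still gets its anchor but does not appear in the list.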

It does not make much sense to show more examples of the TWiki format. I’ve shown three cases; that should be enough. To make sense of the rest, it’s much easier to refer to the documentation.

Script/Batch

We are spending some time to save time in the future, not to waste it, right? All of the above can only save time if nearly all the further steps are performed automatically by clicking just once on some script. The scripts I have right now are Windows-specific, but it’s just a matter of translating a few lines of code into some other scripting language for other platforms.

This is what I can suggest for Windows, a batch file wiki2Html.bat:

@echo off

:: modify next two lines to point to the directories
:: where Pandoc is installed and where the input files are:
set tool=c:/app/Media/eBookAuthoring/Pandoc/pandoc.exe
set data=./

for %%f in (%data%*.wiki) do call:proc %%~nf
goto:eof

:proc
%tool% -s -S --read=twiki --toc --toc-depth 5 -B title.txt -H style.css -o %1.html %1.wiki
goto:eof

The proc part is a real subroutine with a return. I need it because it is the most convenient way to add more lines to the handler of each item.

With the Pandoc options I use, 4 files are involved. Evidently, “%1.wiki” is the input file and “%1.html” is the output. The option -s means standalone: it produces a whole well-formed HTML document, not a fragment of it. With this option, it is important to have hooks into the HTML structure. The option -H specifies a file whose content is inserted as a child of the head element. In my sample, this is not really CSS, but a whole style element with CSS inside it. This file won’t be used in CodeProject, but it’s good to have some style set for rendering a preview before publishing. Another hook, -B (title.txt), specifies the code placed at the very top of the HTML body, above the TOC. This is the place to put the article name, the author’s name, a picture, an epigraph, the heading “Contents” or “Table of Contents”, and the like.

Look at my sample of the file “style.css”. The file “title.txt” is a sample taken from the present article. These files can be found in the ZIP file downloaded from this page.

The parameters --toc and --toc-depth specify the presence of the TOC and the maximum heading depth used in it.
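For other platforms, the batch file can be translated into a few lines of another scripting language. Here is a hedged sketch in JavaScript for Node.js that builds the same Pandoc argument list as wiki2Html.bat. The function names `pandocArgs` and `wikiBaseNames` are hypothetical, and the sketch only assembles the arguments; it does not run Pandoc.

```javascript
// Builds the Pandoc argument list used in wiki2Html.bat for one input file.
// baseName is the file name without extension, as %%~nf in the batch file.
function pandocArgs(baseName, tocDepth) {
    return [
        "-s", "-S",                 // standalone document, smart typography
        "--read=twiki",             // input format
        "--toc", "--toc-depth", String(tocDepth),
        "-B", "title.txt",          // inserted at the top of the body
        "-H", "style.css",          // inserted into the head element
        "-o", baseName + ".html",
        baseName + ".wiki"
    ];
}

// Picks the base names of all *.wiki files from a list of file names.
function wikiBaseNames(fileNames) {
    return fileNames
        .filter((name) => name.toLowerCase().endsWith(".wiki"))
        .map((name) => name.slice(0, -".wiki".length));
}

console.log(pandocArgs("article", 5).join(" "));
```

In a real script, the file names could come from `fs.readdirSync(".")`, and each argument list could be passed to `child_process.execFileSync` together with the path to the Pandoc executable; both functions are part of the Node.js standard library.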

The option -S (--smart) is used to simplify input of the important Unicode characters: -- is rendered as an en-dash, --- as an em-dash, ... as a single ellipsis character; ASCII apostrophes U+0027 (also used as single quotation marks) and quotation marks U+0022 are rendered as the Unicode pairs ‘, ’ and “, ”; the choice between the left and right member of the pair depends on the context. See also: Take Care of Typography.
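The kind of substitution described here can be sketched in a few lines of JavaScript. This is a rough heuristic of my own for illustration, not the algorithm used by Pandoc or markdown-it: a quotation mark is treated as opening when it starts the text or follows whitespace, an opening bracket, a dash or another opening quote, and as closing otherwise. The names `isOpening` and `smartPunctuation` are hypothetical.

```javascript
// Decides whether the quotation mark at the given index opens a quotation.
function isOpening(text, index) {
    return index === 0 || /[\s(\[{\u2013\u2014\u2018\u201C]/.test(text[index - 1]);
}

// Replaces ASCII sequences with typographic characters.
function smartPunctuation(text) {
    return text
        .replace(/---/g, "\u2014")      // em-dash (must be handled before "--")
        .replace(/--/g, "\u2013")       // en-dash
        .replace(/\.\.\./g, "\u2026")   // ellipsis
        .replace(/"/g, (m, i, s) => (isOpening(s, i) ? "\u201C" : "\u201D"))
        .replace(/'/g, (m, i, s) => (isOpening(s, i) ? "\u2018" : "\u2019"));
}

console.log(smartPunctuation("\"Hello\" -- it's ... done---ok"));
```

The order of the first two replacements matters: if the two-hyphen rule ran first, a triple hyphen would become an en-dash followed by a stray hyphen.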

I described this behavior in detail in my documentation for the Visual Studio Code extension named “Extensible Markdown Converter”. This extension is based on the node.js module “markdown-it”, one of the products where this behavior is implemented.

Most Painful Component: MS Word

Nothing in the process irritates me as much as Word. It is hard to control. Even if all auto-corrections are completely turned off, it is Word that decides to add an extra blank space or what character should come between lines, not me. And yet, I still keep using it, for one single reason: it has a grammar checker, not only a spell checker. Perhaps the best way would be to use Word only for final proof-reading, but then it would be good to have some other editor just to show wrong spelling while we type.

Anyway, I’m going to show the hard way, where the source text is always .DOCX. The thing is: the easiest way would be to copy and paste from Word to the Wiki text, but even this quick operation badly distracts from writing, because we need to re-render the text often.

For this purpose, I developed a script based on ActiveXObject scripting (sigh…). This is quite an obsolete technology, but on Windows, .bat and Windows Script Host (WSH) are the only two technologies that work without a need to install anything. This script performs just one step: it extracts pure text from a Word document, so it can be added to a general script. The best form for such things is .WSF. This is JavaScript (or VBScript) written in a thin XML wrapper; such a file will execute on just one click. This is how my script looks:

<job>
<script language="JScript" src="wsh.js"></script>
<script language="JScript">
var options = {
    ext: ".out.txt",
    title: "Convert MS Word to plain text",
    last: undefined
}; //options

(function() {

    // JScript has no built-in String.prototype.replaceAll; emulate it with split/join
    String.prototype.replaceAll = function(search, replacement) {
        var target = this;
        return target.split(search).join(replacement);
    };

    // FileSystem and Shell are helpers from wsh.js
    var files = FileSystem.requireInputFilesGetOneOutputFile(1, options.ext);
    if (!files.files) {
        Shell.errorBox(files, options.title);
        return;
    } //if
    var fname = FileSystem.expandFileName(files.files[0]);

    var word = new ActiveXObject("Word.Application");
    word.Visible = false;
    // open the document once and convert its whole content to a string
    var txt = "" + word.Documents.Open(fname).Content;
    // replace VTAB (code point 11) with CR+LF
    txt = txt.replaceAll(String.fromCharCode(11), String.fromCharCode(13) + String.fromCharCode(10));
    word.Quit(0);
    FileSystem.writeAllText(files.outputFile, txt);

})();
</script></job>

The part to note here is the txt.replaceAll line. It is needed to solve a very annoying problem: the exported Word text contains a VTAB character (code point 11) between some lines, and the difference is not visible to the naked eye in a standard editing mode.
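The same replacement, isolated from the Word automation, can be tried in any JavaScript engine. The sketch below is only an illustration; the function names `fixWordLineBreaks` and `findVerticalTabs` are hypothetical. The second function reports where the invisible characters sit, which helps when checking an exported file.

```javascript
// Replaces every vertical tab (code point 11, "\v") with CR+LF.
// split/join is used because it also works in old JScript engines.
function fixWordLineBreaks(text) {
    return text.split("\v").join("\r\n");
}

// Returns the positions of all vertical tabs in the text.
function findVerticalTabs(text) {
    const positions = [];
    for (let i = 0; i < text.length; i += 1) {
        if (text.charCodeAt(i) === 11) positions.push(i);
    }
    return positions;
}

const exported = "first line\vsecond line";
console.log(findVerticalTabs(exported));
console.log(JSON.stringify(fixWordLineBreaks(exported)));
```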

This script uses another script, “wsh.js”, written in JavaScript, but this is just file system operations and command line parsing. Please refer to the code downloadable from this page. With the WSH script in place, the master script (this variant comes in the downloadable ZIP file under the name docx.wiki2Html.bat) looks like this:

@echo off

:: modify next two lines to point to the directories
:: where Pandoc is installed and where the input files are:
set tool=c:/app/Media/eBookAuthoring/Pandoc/pandoc.exe
set data=./

set tmpFile=____tmp.txt

for %%f in (%data%*.docx) do call:proc %%~nf
goto:eof

:proc
.\process\Docx2Text.wsf %1.docx -o:%tmpFile%
%tool% -s -S --read=twiki --toc -B title.txt -H style.css -o %1.html %tmpFile%
del %tmpFile% 
goto:eof

So this is the file which does the entire job on one click. It can be done often, without closing Word.

One of the most annoying problems here is this one: Word stores Unicode only in UTF-16LE, and Pandoc only works with UTF-8, but the worst thing is that "Scripting.FileSystemObject" only accepts ASCII or UTF-16 and breaks when it is fed UTF-8. As we often saw in the past, Microsoft is not compatible with Microsoft. I know that WSH is an obsolete technology, but this fact cannot dismiss the warm place in hell reserved for some. This is the reason to avoid Unicode in Word. For articles in English, this is not a serious problem, because the few characters needed can be written as HTML entities. The utilities for converting all characters to HTML entities are useless, because first we need to obtain the text.
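For those who do not mind installing something, the encoding mismatch itself is easy to bridge outside of WSH. The sketch below uses the Node.js Buffer class to re-encode UTF-16LE bytes as UTF-8; it is an alternative I am suggesting, not a part of the toolchain described here, and the function name `utf16leToUtf8` is hypothetical.

```javascript
// Re-encodes UTF-16LE bytes as UTF-8 bytes, dropping a leading byte order mark.
function utf16leToUtf8(bytes) {
    let text = bytes.toString("utf16le");
    if (text.charCodeAt(0) === 0xFEFF) text = text.slice(1);
    return Buffer.from(text, "utf8");
}

// Sample input: BOM followed by a word with one non-ASCII letter.
const utf16Sample = Buffer.from("\uFEFFna\u00EFve", "utf16le");
const utf8Bytes = utf16leToUtf8(utf16Sample);
console.log(utf8Bytes.length, utf8Bytes.toString("utf8"));
```

In a real script, the input would come from `fs.readFileSync` and the result would go to `fs.writeFileSync`, both from the Node.js standard library.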

Visual Studio

Fortunately, most of us use software development tools, which are accurate and convenient, but not designed for writing articles. Well, not exactly. We only need a good text editor with a spell checker. There is nothing special about these requirements. Eventually, I found that I can add a spell checker to one of the tools I use on a regular basis: Microsoft Visual Studio 2015. So, I finally got rid of Microsoft Word (well, more exactly, I got rid of keeping any Word documents; I only use Word for final proof-reading of the article, but don’t create any documents on disk).

The spell checker I’ve chosen for now is called “Visual Studio Spell Checker (VS2013/VS2015)”.

How does it feel? Well, not everything works as promised; there are some customer issue reports on the product, but it does the minimum without much fuss, first of all “check-as-you-type”. Even though there is no grammar checking, the spell checking makes the entire work much more efficient. This is software development, right? The predefined rules already handle computer-specific subjects reasonably well. For example, words in camel case and other words that can be recognized as computer language constructs are ignored. Settings are easy to understand and can be stored locally with the project.

I was able to change the options and observe their effect. Unfortunately, not everything worked out: regular expressions to ignore did not work, but adding words to a local vocabulary was just fine. In other words, the tool is quite usable.

Of course, it’s possible to include the wiki file in any of the projects related to the article and do it all in one solution. If the development tools are different, it could be better to keep the use of Visual Studio to a minimum. I found it convenient enough to have only one solution, without any projects; all article items can be added to the solution as “solution items”.

Please see the sample “Article.sln” and “SampleArticle.html” (and two more files explained above) in the ZIP file provided.

CodeProject-Specific

The CodeProject editor handles the HTML code produced by the process described above nicely in the Source (HTML) mode (a button on the right of the toolbox). Only it’s better to avoid small hot fixes in the WYSIWYG mode; the results can be unpredictable. It is better to make them in HTML, carefully preserving well-formed HTML; otherwise the submission script will try to fix it, also with unpredictable results.

There are just a few subtle peculiarities:

  1. Lists are either bulleted or numbered. The table of contents (TOC) is generated by Pandoc without auto-numbering, so it will be rendered with bullets. To avoid them, the attribute style="list-style: none;" can be added to the ul element.
  2. A nested TOC will be rendered with top and bottom margins that look inconsistent around the inner ul element. This can be fixed by adding the attribute style="list-style: none; margin-top:0; margin-bottom:0;" to the inner ul.
  3. All anchors should be created using the attribute id. It does not work for the pre element, probably because CodeProject renders it dynamically, with Hide/Show/Copy operations. The workaround is shown below, starting from the TWiki text:
<literal><i id="someAnchor"></i><pre lang="c#">

SomeCSharpCode(forExample);

</pre></literal>
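The first two adjustments in the list above are mechanical, so they can be scripted. Here is a minimal JavaScript sketch, assuming that the TOC fragment contains plain ul tags without attributes; the function name `styleTocLists` is hypothetical, and a regular expression is only good enough for such a simple, machine-generated fragment.

```javascript
// Adds the style attributes described above to the TOC lists:
// the outer ul loses its bullets, inner ul elements also lose their margins.
function styleTocLists(tocHtml) {
    let depth = 0;
    return tocHtml.replace(/<(\/?)ul>/g, (tag, slash) => {
        if (slash) {
            depth -= 1;
            return tag; // closing tags stay as they are
        }
        depth += 1;
        return depth === 1
            ? '<ul style="list-style: none;">'
            : '<ul style="list-style: none; margin-top:0; margin-bottom:0;">';
    });
}

console.log(styleTocLists("<ul><li>A<ul><li>B</li></ul></li></ul>"));
```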

This is all the required post-processing, but I would be very happy if the available CodeProject styles made it unnecessary.

And now, a couple of useful CodeProject-specific markup idioms:

Downloading ZIP file:

<ul class="download">
    <li><a href="MyFile.zip">Download source code &mdash; 112.1 KB</a></li>
</ul>

The CodeProject editor can add such markup automatically, based on the file offered for download, but some authors miss it.

Also note the use of &mdash; instead of a plain hyphen.

Block quote:

 

<blockquote class="FQ" id="epigraph">
<div class="FQA">Master said:</div>

<p><i>The wise phrase to convince anyone</i></p>

<dl>
    <dd><a href="https://author/url.org">Some Author</a></dd>
</dl>
</blockquote>

Take Care of Typography

Just take care of that. The dash character is not ‘-’, it is &mdash; or &ndash;; and " is not the best or the typographically standard quotation mark: it makes automatic text search harder, as the left mark is the same as the right one. The standard Unicode &lsquo;, &rsquo;, &ldquo; and &rdquo; are different and look more cultured. Even minus is not ‘-’ but &minus;, with better visibility; ‘-’ is only the hyphen. I listed almost all the characters needed for the majority of articles. These rules are not hard to observe, especially if suitable software is properly used.
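When the text has to stay pure ASCII, as with the Word-based variant of the toolchain, the characters listed above can be written as named HTML entities. A small JavaScript sketch of that substitution follows; the names `ENTITIES` and `toNamedEntities` are hypothetical, and the table covers only the characters mentioned in this section.

```javascript
// The characters named in this section, mapped to their HTML entities.
const ENTITIES = {
    "\u2014": "&mdash;",
    "\u2013": "&ndash;",
    "\u2018": "&lsquo;",
    "\u2019": "&rsquo;",
    "\u201C": "&ldquo;",
    "\u201D": "&rdquo;",
    "\u2212": "&minus;"
};

// Replaces those characters with named entities; everything else is untouched.
function toNamedEntities(text) {
    return text.replace(/[\u2014\u2013\u2018\u2019\u201C\u201D\u2212]/g,
        (ch) => ENTITIES[ch]);
}

console.log(toNamedEntities("\u201Ca \u2212 b\u201D"));
// &ldquo;a &minus; b&rdquo;
```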

Consult Character Map (“charmap” on Windows, “gucharmap” on most Linux distributions) or the list of HTML character entities. Be careful: not all of them are rendered on all systems, but the small subset I listed above is perfectly fine.

Versions

Initial version

March 22, 2017

V. 2

March 28, 2017

Added advice on Visual Studio and sample VS 2015 solution.

New Toolchain

June 29, 2017

The new toolchain is published in a separate article: All in One Toolchain for Article Writing with Visual Studio Code.

As this new toolchain is all in one, and based on Visual Studio Code, which is open-source, good to have anyway and loads orders of magnitude faster than Visual Studio 2015, it may render the toolchain offered in the present article nearly obsolete.

However, the code related to Microsoft Office Word still might be of some value, as well as useful information on the tools like TWiki and Pandoc.

Final Words

The present article is written using the toolchain described in the present article. :-)

Happy writing!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect
United States
Physics, physical and quantum optics, mathematics, computer science, control systems for manufacturing, diagnostics, testing, and research, theory of music, musical instruments… Contact me: https://www.SAKryukov.org
