How to Insert Missing Spaces After Punctuation Marks in .docx Files using C# and the DocX Library

6 Jan 2014, CPOL, 5 min read
Insert missing spaces following punctuation marks

Not Again!

This tip extends an earlier one, which shows how to add a space after any period that lacks one and how to reduce runs of extra spaces to just one.

Oops-a-Daisy!

I realized that periods aren't the only punctuation marks that need to have a space after them, but don't always. And so, here's the code necessary to insert spaces so that text like this, from Mark Twain's The Adventures of Huckleberry Finn:

A harem's a bo'd'n-house,I reck'n.Mos' likely dey has rackety times in de nussery.En I reck'n de wives quarrels considable;en dat 'crease de racket.Yit dey say Sollermun de wises' man dat ever live'.I doan' take no stock in dat.Bekase why:would a wise man want to live in de mids' er sich a blim-blammin' all de time?No—'deed he wouldn't.A wise man 'ud take en buil' a biler-factry;en den he could shet _down_ de biler-factry when he want to res'.

...will have the missing spaces added so that it becomes:

A harem's a bo'd'n-house, I reck'n. Mos' likely dey has rackety times in de nussery. En I reck'n de wives quarrels considable; en dat 'crease de racket. Yit dey say Sollermun de wises' man dat ever live'. I doan' take no stock in dat. Bekase why: would a wise man want to live in de mids' er sich a blim-blammin' all de time? No—'deed he wouldn't. A wise man 'ud take en buil' a biler-factry; en den he could shet _down_ de biler-factry when he want to res'.

Show Me the Code

So now for the code to fix up those sorts of sentences. Once you've downloaded the DocX library, referenced it in your C# Visual Studio project, and added a "using Novacode;" directive at the top of your source file, you can add code like that shown below to insert the missing spaces in a document.

First, add these consts to the top of your class:

C#
const int FIRST_CAP_POS = 65;
const int LAST_CAP_POS = 90;
const int FIRST_LOWER_POS = 97;
const int LAST_LOWER_POS = 122;
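
These four values are the ASCII codes for 'A', 'Z', 'a', and 'z'. As a small sketch (the class AsciiRangeCheck and the method CharsInRange are hypothetical names, not part of the tip's code), the following prints the characters each pair of constants spans, which makes it easy to confirm that the loops cover exactly the 26 capital and 26 lowercase letters:

```csharp
using System;

public static class AsciiRangeCheck
{
    const int FIRST_CAP_POS = 65;
    const int LAST_CAP_POS = 90;
    const int FIRST_LOWER_POS = 97;
    const int LAST_LOWER_POS = 122;

    // Builds a string of the characters whose codes run from first to last, inclusive.
    public static string CharsInRange(int first, int last)
    {
        var chars = new char[last - first + 1];
        for (int i = first; i <= last; i++)
        {
            chars[i - first] = (char)i;
        }
        return new string(chars);
    }

    public static void Main()
    {
        Console.WriteLine(CharsInRange(FIRST_CAP_POS, LAST_CAP_POS));
        Console.WriteLine(CharsInRange(FIRST_LOWER_POS, LAST_LOWER_POS));
    }
}
```

Accented letters, digits, and other non-ASCII characters fall outside both ranges, so punctuation followed by one of those is left alone.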

Then, create the LetSquarshedSentencesBreathe() method (you can give it a more serious/"professional"/staid name if you'd like) as shown below:

C#
// Inserts a space between the given punctuation mark and any ASCII letter
// that immediately follows it, then saves the document in place.
private void LetSquarshedSentencesBreathe(string filename, string punctuation)
{
    using (DocX document = DocX.Load(filename))
    {
        // Capital letters: e.g. ".A" becomes ". A"
        for (int i = FIRST_CAP_POS; i <= LAST_CAP_POS; i++)
        {
            char c = (char)i;
            string originalStr = string.Format("{0}{1}", punctuation, c);
            string newStr = string.Format("{0} {1}", punctuation, c);
            document.ReplaceText(originalStr, newStr);
        }
        // Lowercase letters: e.g. ",a" becomes ", a"
        for (int i = FIRST_LOWER_POS; i <= LAST_LOWER_POS; i++)
        {
            char c = (char)i;
            string originalStr = string.Format("{0}{1}", punctuation, c);
            string newStr = string.Format("{0} {1}", punctuation, c);
            document.ReplaceText(originalStr, newStr);
        }
        document.Save();
    }
}
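
To see what those two loops do without opening a Word file, the same rule can be tried on an in-memory string. This is only a sketch: the class SpacingSketch and the method InsertSpaceAfter are hypothetical names, and it uses a plain regular expression instead of DocX's ReplaceText(), so it illustrates the rule rather than the library call.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpacingSketch
{
    // Inserts one space between the given punctuation mark and an
    // immediately following ASCII letter (A-Z or a-z), mirroring the
    // two loops in LetSquarshedSentencesBreathe().
    public static string InsertSpaceAfter(string text, string punctuation)
    {
        // The lookahead (?=[A-Za-z]) requires a letter after the mark
        // without consuming it, so only the mark itself is replaced.
        string pattern = Regex.Escape(punctuation) + "(?=[A-Za-z])";
        return Regex.Replace(text, pattern, m => m.Value + " ");
    }

    public static void Main()
    {
        string text = "Sit on a potato pan,Otis.Otis is an elevator.What?Oh...";
        foreach (string punc in new[] { ".", ",", ";", ":", "?", "!" })
        {
            text = InsertSpaceAfter(text, punc);
        }
        Console.WriteLine(text);
    }
}
```

Because the rule is purely textual, it also applies to abbreviations and addresses: something like "e.g." or "www.example.com" would gain spaces too, both in this sketch and in the letter-by-letter ReplaceText() approach.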

Finally, call GeneralCleanup(), which is shown below, from a button click handler:

C#
private void GeneralCleanup()
{
    List<string> puncMarks = new List<string>();
    Cursor.Current = Cursors.WaitCursor;
    try
    {
        string filename = string.Empty;
        DialogResult result = openFileDialog1.ShowDialog();
        if (result == DialogResult.OK)
        {
            filename = openFileDialog1.FileName;
        }
        else
        {
            MessageBox.Show("No file selected - adios!");
            return;
        }
        // Add all the punctuation marks that you want to follow with a space
        puncMarks.Add(".");
        puncMarks.Add(",");
        puncMarks.Add(";");
        puncMarks.Add(":");
        puncMarks.Add("?");
        puncMarks.Add("!");
        puncMarks.Add("»");
        // Add others if wanted (parentheses, angle brackets,
        // square brackets, braces, quotation marks, etc.)

        // Each call loads, modifies, and saves the file itself, so the
        // document does not need to be opened here as well.
        foreach (string punc in puncMarks)
        {
            LetSquarshedSentencesBreathe(filename, punc);
        }
    }
    finally
    {
        Cursor.Current = Cursors.Default;
    }
    MessageBox.Show("General cleanup done!");
}

GeneralCleanup() prompts for a file, builds the list of punctuation marks that should be followed by a space, and then loops through them, calling LetSquarshedSentencesBreathe() for each one. That method finds every place where a punctuation mark is scrunched up against the letter that follows it and inserts a space between them, so that they can be proud to strut around with their fellow characters that inhabit Strunk and White's The Elements of Style.

One Step Forward and 0.Squirch Steps Back

This should work just fine on regular text. However (there's always a however), if the text being cleaned up this way contains quotation marks, it could actually cause problems by inserting spaces between punctuation marks and their containing quotation mark, so that a sentence like this, from To the Person Sitting in Darkness by Mark Twain:

Is it, perhaps, possible that there are two kinds of Civilization -- one for home consumption and one for the heathen market?" ... Can we afford Civilization?"

...once "fixed," would look like this:

Is it, perhaps, possible that there are two kinds of Civilization -- one for home consumption and one for the heathen market? " ... Can we afford Civilization? "

Obviously, that's not what we want (the spaces between the question and quotation marks). So, now we need another method to fix any that have been misfixed or malfixed in that way. The following should do that:

Here Comes the New Code, Suspiciously Similar to the Old Code

C#
private void RemoveSpacesBetweenPunctuationAndQuotation(string filename, string punctuation)
{
    bool superfluousSpacesFound = true;
    string textToFind = string.Format("{0} \"", punctuation);
    string replacementText = string.Format("{0}\"", punctuation);
    using (DocX document = DocX.Load(filename))
    {
        List<int> multipleSpacesLocs;
        while (superfluousSpacesFound)
        {
            document.ReplaceText(textToFind, replacementText);
            multipleSpacesLocs = document.FindAll(textToFind);
            superfluousSpacesFound = multipleSpacesLocs.Count > 0;
        }
        document.Save();
    }
}

...and then add a call to RemoveSpacesBetweenPunctuationAndQuotation() following each call to LetSquarshedSentencesBreathe() in GeneralCleanup() so that the section is now:

C#
foreach (string punc in puncMarks)
{
    LetSquarshedSentencesBreathe(filename, punc);
    RemoveSpacesBetweenPunctuationAndQuotation(filename, punc);
}

No Rest for the Weary

Apparently there are a lot of helper functions that could be written to make automatic fixing of funky formatting in files freely forgettable (IOW, just run a utility to fix common formatting problems and not have to fuss with them). These are a few helper functions to foment that fortuitous fantod.

Update to article (Use of RemoveSpacesBetweenPunctuationAndQuotation() helper method Considered Harmful) 

Actually, I was right the first time/I was wrong about being wrong the first time. Reason: the LetSquarshedSentencesBreathe() method only changes text like this:

Sit on a potato pan,Otis.Otis is an elevator.What?Oh...

...with, for example:

Sit on a potato pan, Otis. Otis is an elevator. What? Oh...

We need to remember that LetSquarshedSentencesBreathe() only adds spaces to punctuation marks that are followed by alpha characters. So sentences like this:

She opened her eyes and uttered the palindrome, "Sit on a potato pan, Otis." I then scratched my stubble.

...are not a problem anyway/after all, since they will not be changed to:

She opened her eyes and uttered the palindrome, "Sit on a potato pan, Otis. " I then scratched my stubble.

This is because quotation marks are not alpha characters. So, the RemoveSpacesBetweenPunctuationAndQuotation() method is probably moot (not mute, although perhaps I should have been "mute" about it). In fact, RemoveSpacesBetweenPunctuationAndQuotation() could be a problem: although it will fix up any quotation mark at the end of a sentence (as in "Sit on a potato pan, Otis. "), it will hosify quotation marks at the beginning of a sentence or quotation, because it removes the space between a punctuation mark and any quotation mark that follows it, opening or closing. So, if you had this:

Petulant Petula (not Petunia) sang, "Don't sleep in the subway, darling." Huh. "What in the whirled?"

...it would be changed to:

Petulant Petula (not Petunia) sang,"Don't sleep in the subway, darling." Huh."What in the whirled?"

So, use of the RemoveSpacesBetweenPunctuationAndQuotation() method should be considered harmful, or at least potentially harmful. Use it at your own risk.
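
One way to reduce that risk, offered as a sketch rather than as part of the original tip, is to remove the space only when the quotation mark is itself followed by whitespace or the end of the text, which is a rough sign that it is a closing quote. The names QuoteSpacingSketch, RemoveSpaceBeforeAnyQuote, and RemoveSpaceBeforeClosingQuote are hypothetical, and the code works on a plain string with a regular expression rather than through DocX.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteSpacingSketch
{
    // Same rule as RemoveSpacesBetweenPunctuationAndQuotation(): removes the
    // space before every straight double quote that follows the punctuation mark.
    public static string RemoveSpaceBeforeAnyQuote(string text, string punctuation)
    {
        return text.Replace(punctuation + " \"", punctuation + "\"");
    }

    // Narrower heuristic: only acts when the quote is followed by whitespace
    // or the end of the text, i.e. when it looks like a closing quote.
    public static string RemoveSpaceBeforeClosingQuote(string text, string punctuation)
    {
        string pattern = Regex.Escape(punctuation) + " +\"(?=\\s|$)";
        return Regex.Replace(text, pattern, m => punctuation + "\"");
    }

    public static void Main()
    {
        string text = "Petula sang, \"Don't sleep in the subway, darling. \" Huh. \"What in the whirled?\"";
        string broad = text;
        string narrow = text;
        foreach (string punc in new[] { ".", ",", "?", "!" })
        {
            broad = RemoveSpaceBeforeAnyQuote(broad, punc);
            narrow = RemoveSpaceBeforeClosingQuote(narrow, punc);
        }
        Console.WriteLine(broad);   // opening quotes lose their leading space
        Console.WriteLine(narrow);  // only the stray space before the closing quote is removed
    }
}
```

This is still a heuristic: it only looks at straight double quotes and at the character after the quote, so unusual spacing or typographic (curly) quotes would need their own handling.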

Get Back Here! You Don't Get Off That Easy!

If you enjoyed this tip, demonstrate your exuberance by rearing your head back and yelling "Yee-haw!" at the top of your livers. If you didn't enjoy it, OTOH, go to the mirror, look into it, and slap the first person who shows up.

BTW, the code for this and the related tips is included in the file attached to this tip for download.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Founder Across Time & Space
United States
I am in the process of morphing from a software developer into a portrayer of Mark Twain. My monologue (or one-man play, entitled "The Adventures of Mark Twain: As Told By Himself" and set in 1896) features Twain giving an overview of his life up till then. The performance includes the relating of interesting experiences and humorous anecdotes from Twain's boyhood and youth, his time as a riverboat pilot, his wild and woolly adventures in the Territory of Nevada and California, and experiences as a writer and world traveler, including recollections of meetings with many of the famous and powerful of the 19th century - royalty, business magnates, fellow authors, as well as intimate glimpses into his home life (his parents, siblings, wife, and children).

Peripatetic and picaresque, I have lived in eight states; specifically, besides my native California (where I was born and where I now again reside) in chronological order: New York, Montana, Alaska, Oklahoma, Wisconsin, Idaho, and Missouri.

I am also a writer of both fiction (for which I use a nom de plume, "Blackbird Crow Raven", as a nod to my Native American heritage - I am "½ Cowboy, ½ Indian") and nonfiction, including a two-volume social and cultural history of the U.S. which covers important events from 1620-2006: http://www.lulu.com/spotlight/blackbirdcraven
