FormatString - Smart String Formatting

Ivo Beltchev

Nov 25, 2006

CPOL

Smart string formatting and other string utilities

Download source - 18.48 KB

Introduction

In this article, we are going to talk about string formatting. The standard way of doing this in C is the old sprintf function. It has various flaws and is showing its age. C++ and STL introduce the iostreams and the << operator. While convenient for simple tasks, their formatting features are clunky and underpowered.

On the other hand, we have the .NET Framework with its String class, which has the formatting function String.Format. It is safer and easier to use than sprintf, but it can only be used from managed code. This article will show the main problems of sprintf and will offer an alternative that can be used from native C++ code.

What are the Problems with sprintf?

sprintf is Prone to Buffer Overflows

There are different versions of sprintf that provide different degrees of buffer overflow protection. The basic flavor of sprintf provides none: it will happily write past the end of the given buffer and will probably crash the program. The _snprintf function will not write past the end of the buffer, but it also will not zero-terminate the result if there is no space left. The program will not crash immediately but will most likely crash later. The newer sprintf_s function fixes the buffer overflow problems, but it is only available in Visual Studio 2005 and up.

String.Format allocates the output buffer itself from the managed heap and can make it as big as it needs to.
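The truncation behavior described above can be sketched in portable code with the standard C++11 std::vsnprintf, which (unlike _snprintf) always zero-terminates when the buffer size is non-zero. The helper name BoundedFormat is hypothetical and is not part of the article's sources; it only illustrates the buffer-size half of the problem and is still not type-safe.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>

// Hypothetical helper (not from the article's sources).
// Formats into a fixed buffer and reports whether the result was truncated.
// The buffer is zero-terminated in both cases.
inline bool BoundedFormat(char *buf, int size, const char *format, ...)
{
    assert(buf && size > 0);
    va_list args;
    va_start(args, format);
    // std::vsnprintf writes at most size-1 characters plus the terminator
    // and returns the length the complete result would have had.
    int needed = std::vsnprintf(buf, static_cast<std::size_t>(size), format, args);
    va_end(args);
    return needed >= 0 && needed < size;
}
```

A caller can check the return value to detect truncation instead of silently continuing with a cut-off string.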

sprintf is Not Type-safe

The sprintf function uses the ellipsis syntax (...) to accept a variable number of arguments. The downside is that the function has no direct information about the arguments' types and can't perform any validation. It assumes that the argument count and types match the formatting string. This can lead to hard-to-spot bugs. For example:

std::string userName("user1");
int userData=0;
char buf[100];

// These will compile and often run, but will produce wrong results

// the types of the arguments don't match the format
sprintf(buf,"user %d, data %s",userName.c_str(),userData);

// the string is missing .c_str()
sprintf(buf,"user %s, data %d",userName,userData);

In String.Format the formats of the arguments are optional. If the argument is a string it will be printed as a string, if it is a number it will be printed as a number.

// The .NET equivalent:
String.Format("user {0}, data {1}",userName,userData);

sprintf has Localization Problems

The sprintf function requires that the order of the arguments is exactly the same as the order of the format specifiers. The bad news is that different languages have different word order. The program needs to provide the arguments in different order to accommodate different languages. For example:

// English
sprintf(buf,"The population of %s is %d people.","New York",20000000);
// But maybe in some other language it has to be:
sprintf(buf,"%d people live in %s.",20000000,"New York"); // the order is different

String.Format wins in this case too. Its format items explicitly specify which argument to use and can do that in any order.

// The .NET equivalent - same code can be used for both languages,
// just the formatting string needs to change:
String.Format("The population of {0} is {1} people.","New York",20000000);
String.Format("{1} people live in {0}.","New York",20000000);

The FormatString Function

The FormatString function is a smart and type-safe alternative to sprintf that can be used by native C++ code. It is used like this:

FormatString(buffer, buffer_size_in_characters, format, arguments...);

The function has two versions - a char version and a wchar_t version.

The format string contains items similar to String.Format:

{index[,width][:format][@comment]}

index is the zero-based index in the argument list. If the index is past the last argument, FormatString will assert.

width is the optional width of the result. If width is less than zero, the result will be left-aligned. The width can also be given in the form '*<index>', where <index> is the index of another argument in the list that provides the width value.

format is the optional format of the result. The available formats depend on the argument type. If the format is not supported for the given argument, FormatString will assert.

comment is ignored. It can be a hint that describes the meaning of the argument, or provides examples to aid the localization of the formatting string.
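To make the item syntax concrete, here is a minimal parsing sketch. It is not the parser from FormatString.cpp: the FormatItem struct and the ParseFormatItem and ReadNumber functions are hypothetical names, and the sketch assumes the format part never contains '@' or '}'.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical representation of one parsed item; not the article's type.
struct FormatItem
{
    int index = -1;       // zero-based argument index
    int width = 0;        // 0 = no width, negative = left-aligned
    int widthIndex = -1;  // argument index for the '*<index>' form, or -1
    std::string format;   // text after ':' (may be empty)
    std::string comment;  // text after '@' (ignored when formatting)
};

// Reads a non-negative decimal number at s[pos] and advances pos.
// Returns -1 if there is no digit at pos.
static int ReadNumber(const std::string &s, std::size_t &pos)
{
    assert(pos <= s.size());
    if (pos >= s.size() || !std::isdigit(static_cast<unsigned char>(s[pos])))
        return -1;
    int value = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
        value = value * 10 + (s[pos++] - '0');
    return value;
}

// Parses a string that consists of exactly one item, such as
// "{0,-8:X0@user id}". Returns false on a syntax error.
bool ParseFormatItem(const std::string &item, FormatItem &out)
{
    out = FormatItem();
    if (item.size() < 3 || item.front() != '{' || item.back() != '}')
        return false;
    const std::string body = item.substr(1, item.size() - 2);
    std::size_t pos = 0;

    out.index = ReadNumber(body, pos);
    if (out.index < 0) return false;

    if (pos < body.size() && body[pos] == ',')
    {
        ++pos;
        if (pos < body.size() && body[pos] == '*')
        {
            // the width comes from another argument
            ++pos;
            out.widthIndex = ReadNumber(body, pos);
            if (out.widthIndex < 0) return false;
        }
        else
        {
            bool negative = pos < body.size() && body[pos] == '-';
            if (negative) ++pos;
            int w = ReadNumber(body, pos);
            if (w < 0) return false;
            out.width = negative ? -w : w;
        }
    }

    if (pos < body.size() && body[pos] == ':')
    {
        ++pos;
        std::size_t end = body.find('@', pos);
        if (end == std::string::npos) end = body.size();
        out.format = body.substr(pos, end - pos);
        pos = end;
    }

    if (pos < body.size() && body[pos] == '@')
    {
        out.comment = body.substr(pos + 1);
        pos = body.size();
    }
    return pos == body.size();
}
```

For "{0,-8:X0@user id}" this yields index 0, width -8, format "X0" and comment "user id".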

The result of FormatString always fits in the provided buffer and is always zero-terminated. Special cases like the buffer ending in the middle of a double-byte character or in the middle of a surrogate pair are also handled.
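The surrogate pair case can be illustrated with a small sketch. This is not the article's implementation: SafeTruncateLength is a hypothetical name, and the sketch uses char16_t so that it does not depend on the platform's wchar_t size. It covers only UTF-16, not the double-byte (DBCS) case.

```cpp
#include <cassert>

// Hypothetical helper (not from the article's sources).
// Returns the largest length <= maxLen at which the text can be cut
// without separating the two halves of a UTF-16 surrogate pair.
inline int SafeTruncateLength(const char16_t *text, int len, int maxLen)
{
    assert(text && len >= 0 && maxLen >= 0);
    if (len <= maxLen) return len;
    int cut = maxLen;
    // A high (leading) surrogate is in the range 0xD800..0xDBFF.
    // If the last kept unit is one, its partner would be cut off, so drop it.
    if (cut > 0 && text[cut - 1] >= 0xD800 && text[cut - 1] <= 0xDBFF)
        --cut;
    return cut;
}
```

The function only inspects the last code unit that would be kept, so the cost is constant regardless of the string length.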

Since the { and } characters are used to define format items, they need to be escaped in the format string as {{ and }}.

Available Formats

For 8, 16, 32 and 64 bit integers, including 32 and 64 bit pointers

c - a character. It is an ANSI or UNICODE character, depending on the type of the format string
d[+][0] - a signed integer. '+' will force the + sign for positive values. '0' will add leading zeros
u[0] - unsigned integer. '0' will add leading zeros
x[0] - lower case hex integer. '0' will add leading zeros
X[0] - upper case hex integer. '0' will add leading zeros
n - localized integer number (uses GetNumberFormat but with no fractional digits)
f - localized file size (uses StrFormatByteSize)
k - localized file size in KB (uses StrFormatKBSize)
t[<number>] - localized time interval in ms (uses StrFromTimeInterval with optional number of significant digits between 1 and 6)

The default format for signed integers is 'd' and for unsigned integers is 'u'.

For floats and doubles

f[<number>] - fixed point (with optional number of fractional digits)
f*<index> - fixed point. <index> is an index of another argument that provides the number of fractional digits
e or E - exponential format. Supports the number of fractional digits same as the 'f' format
g or G - chooses between 'f' and 'e'/'E', whichever is shorter. Same rules apply for the fractional digits
$ - localized currency (uses GetCurrencyFormat)
n[<number>] or n*<index> - localized number (uses GetNumberFormat with optional number of fractional digits)

The default format for floats or doubles is 'f'.

For ANSI strings, including std::string

The char version of FormatString doesn't support any formats for ANSI strings. The wchar_t version supports:

<number> - a code page to be used when converting the ANSI string to UNICODE
*<index> - index of another argument that provides the code page

If a code page is not given, the default (CP_ACP) is used.

For UNICODE strings, including std::wstring

The wchar_t version of FormatString doesn't support any formats for UNICODE strings. The char version supports:

<number> - a code page to be used when converting the UNICODE string to ANSI
*<index> - index of another argument that provides the code page

If a code page is not given, the default (CP_ACP) is used.

For SYSTEMTIME (Passed as const SYSTEMTIME &)

d[l/f][format] - short date format (uses GetDateFormat). 'l' - converts the time from UTC to local. 'f' - same as 'l' but uses the file system rules *. format - optional format passed to GetDateFormat
D[l/f][format] - long date format
t[l/f][format] - time format, no seconds (uses GetTimeFormat)
T[l/f][format] - time format

* 'l' uses SystemTimeToTzSpecificLocalTime to convert from UTC to local time. 'f' uses FileTimeToLocalFileTime instead. The difference is that FileTimeToLocalFileTime uses the current daylight savings settings instead of the settings at the given date. This is incorrect but is more consistent with the way Windows displays the local file times. If STR_USE_WIN32_TIME is not defined, then the localtime function is used regardless of whether 'l' or 'f' is specified. localtime produces results consistent with the file system (and FileTimeToLocalFileTime). You can read why the file system behaves this way here: The Old New Thing: Why Daylight Savings Time is nonintuitive.

The default format for SYSTEMTIME is 'd'.

Examples

char buf[100];

// The order of the arguments can change
FormatString(buf,100,"{1} people live in {0}.","New York",20000000);
    -> 20000000 people live in New York.

// Signed values are printed as signed
FormatString(buf,100,"{0}",-1);
    -> -1

// Unsigned values are printed as unsigned
FormatString(buf,100,"{0}",(unsigned int)-1);
    -> 4294967295

// The same argument can be used more than once
FormatString(buf,100,"{0}, 0x{0,8:X0}",1);
    -> 1, 0x00000001

// UNICODE text can be converted to ANSI
FormatString(buf,100,"{0}",L"test");
    -> test

// Localized integer number
FormatString(buf,100,"{0:n}",12345678);
    -> 12,345,678

// Time interval
FormatString(buf,100,"{0:t3}",12345678);
    -> 3 hr, 25 min

// Floating point number
FormatString(buf,100,"{0}",12345.678);
    -> 12345.678000

// Localized floating point number
FormatString(buf,100,"{0:n*1}",12345.678,2);
    -> 12,345.68

// Show current time
SYSTEMTIME st;
GetSystemTime(&st);
FormatString(buf,100,"{0:dl}  {0:tl}",st);
    -> 11/25/2006  1:26 PM

// Use custom date format
FormatString(buf,100,"{0:ddddd',' MMM dd yy}",st);
    -> Saturday, Nov 25 06

How It Works

The FormatString function has 10 optional arguments arg1, ... arg10 of type const CFormatArg & like this:

class CFormatArg
{
public:
    CFormatArg( void );
    CFormatArg( char x );
    CFormatArg( unsigned char x );
    CFormatArg( short x );
    CFormatArg( unsigned short x );
    ..........
    
    enum
    {
        TYPE_NONE=0,
        TYPE_INT=1,
        TYPE_UINT=2,
        .....
    };

    union
    {
        int i;
        __int64 i64;
        double d;
        const char *s;
        const wchar_t *ws;
        const SYSTEMTIME *t;
    };
    int type;
    static CFormatArg s_Null;
};

int FormatString( char *string, int len, const char *format,
    const CFormatArg &arg1=CFormatArg::s_Null, ...,
    const CFormatArg &arg10=CFormatArg::s_Null );

The CFormatArg class contains constructors for each of the supported types. Each constructor sets the type member depending on the type of its argument. When the FormatString function is called with an actual argument, a temporary CFormatArg object is created that stores the value and the type of the argument. The FormatString function can then determine the number of arguments that are provided and has access to their types and values.
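The mechanism can be shown in miniature. The following is a sketch of the idea, not the article's code: MiniArg and MiniFormat are hypothetical names, only three argument types and three argument slots are supported, and items are limited to the plain "{N}" form with a single digit. How the real FormatString counts its arguments is not shown in the article; checking for the TYPE_NONE sentinel is an assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical miniature of the CFormatArg idea.
class MiniArg
{
public:
    enum Type { TYPE_NONE, TYPE_INT, TYPE_DOUBLE, TYPE_STRING };

    // Each constructor records the type of the value it was given.
    MiniArg() : type(TYPE_NONE), i(0) {}
    MiniArg(int x) : type(TYPE_INT), i(x) {}
    MiniArg(double x) : type(TYPE_DOUBLE), d(x) {}
    MiniArg(const char *x) : type(TYPE_STRING), s(x) {}

    Type type;
    union { int i; double d; const char *s; };
    static const MiniArg s_Null; // default value for unused slots
};

const MiniArg MiniArg::s_Null;

// Replaces every "{N}" with argument N, printed according to its type.
std::string MiniFormat(const char *format,
    const MiniArg &a0 = MiniArg::s_Null,
    const MiniArg &a1 = MiniArg::s_Null,
    const MiniArg &a2 = MiniArg::s_Null)
{
    const MiniArg *args[] = { &a0, &a1, &a2 };
    // Count the arguments by finding the first unused slot (assumption).
    int count = 0;
    while (count < 3 && args[count]->type != MiniArg::TYPE_NONE) ++count;

    std::string result;
    for (const char *p = format; *p; ++p)
    {
        if (p[0] == '{' && p[1] >= '0' && p[1] <= '9' && p[2] == '}')
        {
            int index = p[1] - '0';
            assert(index < count); // index past the last argument
            const MiniArg &arg = *args[index];
            char num[64];
            switch (arg.type)
            {
            case MiniArg::TYPE_INT:
                std::snprintf(num, sizeof(num), "%d", arg.i);
                result += num;
                break;
            case MiniArg::TYPE_DOUBLE:
                std::snprintf(num, sizeof(num), "%f", arg.d);
                result += num;
                break;
            case MiniArg::TYPE_STRING:
                result += arg.s;
                break;
            default:
                break;
            }
            p += 2; // skip the rest of the item
        }
        else
            result += *p;
    }
    return result;
}
```

Because the conversion to MiniArg happens at the call site, passing a type without a matching constructor (for example a std::string in this sketch) is a compile-time error rather than a run-time surprise.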

Dynamically Allocated Strings

Often you don't want to use a buffer of a fixed size, but one that is dynamically allocated. Use the FormatStringAlloc function instead:

char *string=FormatStringAlloc(allocator, format, arguments );

The first parameter is an object with a virtual member function responsible for allocating and growing the string buffer:

class CFormatStringAllocator
{
public:
    virtual bool Realloc( void *&ptr, int size );

    static CFormatStringAllocator g_DefaultAllocator;
};

bool CFormatStringAllocator::Realloc( void *&ptr, int size )
{
    void *res=realloc(ptr,size);
    if (ptr && !res) free(ptr);
    ptr=res;
    return res!=NULL;
}

The Realloc member function must reallocate the buffer pointed to by ptr with the given size (in bytes) and set ptr to the new address. The allocator will be called every 256 characters (approximately) to enlarge the buffer. The first time, Realloc is called with ptr=NULL. If an error occurs, Realloc must free the memory pointed to by ptr and return false or throw an exception. If Realloc returns false, then FormatStringAlloc terminates and returns NULL.

The default allocator uses the realloc function from the C run-time heap. To free the returned string, you need to call free(string). You can write your own allocator that uses a different heap or some other means of allocating memory. See further below for one example.
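As an illustration of the Realloc contract, here is a sketch of an allocator that refuses to grow past a fixed limit. CLimitedAllocator is a hypothetical class. The base class is repeated as a stand-in only so the sketch compiles on its own; in a real project it would come from FormatString.h, and the virtual destructor is an addition of this sketch.

```cpp
#include <cassert>
#include <cstdlib>

// Stand-in for the base class shown in the article.
class CFormatStringAllocator
{
public:
    virtual ~CFormatStringAllocator() {} // added for this sketch
    virtual bool Realloc( void *&ptr, int size )
    {
        void *res=std::realloc(ptr,static_cast<std::size_t>(size));
        if (ptr && !res) std::free(ptr);
        ptr=res;
        return res!=NULL;
    }
};

// Hypothetical allocator that caps the size of the formatted string.
class CLimitedAllocator : public CFormatStringAllocator
{
public:
    explicit CLimitedAllocator( int maxBytes ) : m_MaxBytes(maxBytes)
    {
        assert(maxBytes>0);
    }

    virtual bool Realloc( void *&ptr, int size )
    {
        if (size>m_MaxBytes)
        {
            // Follow the contract: free the old block and report failure.
            std::free(ptr);
            ptr=NULL;
            return false;
        }
        return CFormatStringAllocator::Realloc(ptr,size);
    }

private:
    int m_MaxBytes;
};
```

An instance of such a class would be passed as the first parameter of FormatStringAlloc, which then returns NULL once the limit is exceeded.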

Output to Stream

Often you don't want to output the formatted string to a buffer, but to a file, a text console, the Visual Studio debug window, etc. Use the FormatStringOut function instead:

bool success=FormatStringOut(output, format, arguments );

The first parameter is an object with a virtual member function responsible for outputting portions of the result. There are separate classes for char and wchar_t:

// char version
class CFormatStringOutA
{
public:
    virtual bool Output( const char *text, int len );

    static CFormatStringOutA g_DefaultOut;
};

bool CFormatStringOutA::Output( const char *text, int len )
{
    for (int i=0;i<len;i++)
        if (putchar(text[i])==EOF) return false;
    return true;
}

// wchar_t version
class CFormatStringOutW
{
public:
    virtual bool Output( const wchar_t *text, int len );

    static CFormatStringOutW g_DefaultOut;
};

bool CFormatStringOutW::Output( const wchar_t *text, int len )
{
    for (int i=0;i<len;i++)
        if (putwchar(text[i])==WEOF) return false;
    return true;
}

The Output member function will be called with each portion of the result. The len parameter is the number of characters. Note that the text is not guaranteed to be zero-terminated. Output must return false or throw an exception if there is an error. If Output returns false then FormatStringOut terminates and returns false.

The default implementations just use putchar/putwchar to send the text to the console. You can write your own output class for iostream, FILE*, Win32 HANDLE, etc.
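A custom output class might look like the sketch below, which collects the output in a std::string. CStdStringOut is a hypothetical class. The base class is repeated as a stand-in so the sketch compiles on its own; in a real project it would come from FormatString.h, and the virtual destructor is an addition of this sketch.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Stand-in for the char version of the base class shown in the article.
class CFormatStringOutA
{
public:
    virtual ~CFormatStringOutA() {} // added for this sketch
    virtual bool Output( const char *text, int len )
    {
        for (int i=0;i<len;i++)
            if (std::putchar(text[i])==EOF) return false;
        return true;
    }
};

// Hypothetical output class that appends every portion to a std::string.
class CStdStringOut : public CFormatStringOutA
{
public:
    virtual bool Output( const char *text, int len )
    {
        assert(text && len>=0);
        // text is not guaranteed to be zero-terminated, so use the length.
        m_Text.append(text,static_cast<std::size_t>(len));
        return true;
    }

    const std::string &GetText( void ) const { return m_Text; }

private:
    std::string m_Text;
};
```

An instance would be passed as the first parameter of FormatStringOut, and GetText would return the complete result after the call.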

Additional Functionality

Support for FILETIME, time_t and OLE time

The CFormatTime class derives from CFormatArg and allows you to use different date/time formats. You use it like this:

time_t t=time(NULL);
FormatString(buf, 100, "local time: {0:dl}  {0:tl}", CFormatTime(t));
    -> local time: 11/25/2006  1:26 PM

You can create your own classes that derive from CFormatArg to support more data types or add more formatting options.

Passing CFormatArg Argument List to Other Functions

FormatString.h defines 3 macros to be used with the argument list:

FORMAT_STRING_ARGS_H
FORMAT_STRING_ARGS_CPP and
FORMAT_STRING_ARGS_PASS

You can use them to create other functions that have variable argument list and call FormatString. For example, let's create a MessageBox function that can format the message:

// in your header file
int MessageBox( HWND parent, UINT type, LPCTSTR caption,
        LPCTSTR format, FORMAT_STRING_ARGS_H );

// in your cpp file
int MessageBox( HWND parent, UINT type, LPCTSTR caption,
        LPCTSTR format, FORMAT_STRING_ARGS_CPP )
{
    TCHAR *text=FormatStringAlloc(CFormatStringAllocator::g_DefaultAllocator,
            format,
            FORMAT_STRING_ARGS_PASS);
    int res=MessageBox(parent,text,caption,type);
    free(text);
    return res;
}

Calling with No Variable Arguments

If FormatString and its siblings are called with no variable arguments, the format string is directly copied to the output. In the example above, you can call MessageBox(parent, type, caption, text) and the text will be displayed in the message box directly without being parsed for any format items.

The CString Classes

The sample sources provide simple string container classes CStringA and CStringW. The strings stored in them have a reference count in the 4 bytes directly preceding the first character. When such an object is copied, the string is not duplicated; only the reference count is incremented (so-called copy-on-write with reference counting). When the object is destroyed, the reference count is decremented and, if it reaches 0, the memory is freed. The reference count is modified with InterlockedIncrement and InterlockedDecrement to be thread-safe.
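The memory layout can be sketched as follows. This is not the code from CString.cpp: all names are hypothetical, and the sketch swaps the Win32 InterlockedIncrement and InterlockedDecrement calls for the portable std::atomic, assuming that std::atomic<int> occupies 4 bytes as it does on common platforms.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical header stored directly in front of the characters.
struct RefHeader { std::atomic<int> refs; };

inline RefHeader *HeaderOf(char *chars)
{
    return reinterpret_cast<RefHeader*>(chars) - 1;
}

// Allocates [header][characters] and returns a pointer to the characters.
inline char *RefStringCreate(const char *text)
{
    assert(text);
    std::size_t len = std::strlen(text);
    void *mem = std::malloc(sizeof(RefHeader) + len + 1);
    if (!mem) return NULL;
    RefHeader *h = new (mem) RefHeader;
    h->refs.store(1);
    char *chars = reinterpret_cast<char*>(h + 1);
    std::memcpy(chars, text, len + 1);
    return chars;
}

// "Copies" the string by incrementing the count; no characters are copied.
inline char *RefStringAddRef(char *chars)
{
    HeaderOf(chars)->refs.fetch_add(1);
    return chars;
}

// Drops one reference. Returns true if the memory was freed.
inline bool RefStringRelease(char *chars)
{
    RefHeader *h = HeaderOf(chars);
    if (h->refs.fetch_sub(1) == 1)
    {
        h->~RefHeader();
        std::free(h);
        return true;
    }
    return false;
}
```

Because the pointer handed out points at the first character, it can be passed to any function that expects a plain zero-terminated string.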

The CString type is set to CStringA in ANSI configurations and to CStringW in UNICODE configurations. This allows you to use the configuration-dependent CString, while still being able to mix the ANSI and UNICODE types as needed.

The CString classes have a Format member function that formats a string and assigns the result to the object. This is done by calling FormatStringAlloc with a special allocator that allocates 4 bytes more than requested to store the reference count. The CString classes also define a cast operator CFormatArg, so they can be used directly as arguments to FormatString:

CString s;
s.Format(_T("{0}"),"test");
FormatStringOut(CFormatStringOutA::g_DefaultOut,"s=\"{0}\"\n",s);
    -> s="test"

The behavior of CString is very similar to the ATL/MFC strings and is provided here merely to demonstrate the use of custom memory allocators for FormatStringAlloc and the use of the CFormatArg cast operator. To use them in a real application, you may wish to add more functionality, like comparison operators, conversion operators/constructors between CStringA and CStringW, string manipulation functionality, etc. Or simply use the existing classes std::string or ATL::CString.

StringUtils.h

The source files contain a set of string utilities that can be used independently from FormatString. Most of them are wrappers for the system string functions. The functions come in pairs - one for ANSI and one for UNICODE, like this:

inline int Strlen( const char *str ) { return (int)strlen(str); }
inline int Strlen( const wchar_t *str ) { return (int)wcslen(str); }
int Strcpy( char *dst, int size, const char *src );
int Strcpy( wchar_t *dst, int size, const wchar_t *src );

The advantage of this approach over _tcslen and _tcscpy is that you can easily mix ANSI and UNICODE code and always use the same function name.
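One consequence of the overloaded names is that generic code can be written once for both character types. The sketch below repeats the two Strlen overloads from above and adds a hypothetical template, CopyTruncated, that is not part of StringUtils.h. Its return value (the number of characters copied) is an assumption of this sketch, not the documented behavior of Strcpy.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// The overloads from the article, repeated so the sketch is self-contained.
inline int Strlen( const char *str ) { return (int)std::strlen(str); }
inline int Strlen( const wchar_t *str ) { return (int)std::wcslen(str); }

// Hypothetical generic helper built on the overloads. Copies src into dst,
// truncates if needed and always zero-terminates.
// Returns the number of characters copied, not counting the terminator.
template<class Char>
int CopyTruncated( Char *dst, int size, const Char *src )
{
    assert(dst && src && size>0);
    int len=Strlen(src); // picks the char or wchar_t overload
    if (len>size-1) len=size-1;
    for (int i=0;i<len;i++) dst[i]=src[i];
    dst[len]=0;
    return len;
}
```

With the _tcslen family, the same template would not compile for both types in one build, because the macros resolve to a single character type per configuration.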

Other wrappers provide safe versions of strncpy, sprintf, strcat, etc. that don't write past the provided buffer and always leave the result zero-terminated. They all compile cleanly under VC 6.0, VS 2003 and VS 2005.

Output to STL Strings

These functions output the formatted result to an STL string:

std::string FormatStdString( const char *format, ... );
std::wstring FormatStdString( const wchar_t *format, ... );
void FormatStdString( std::string &string, const char *format, ... );
void FormatStdString( std::wstring &string, const wchar_t *format, ... );

Output to STL Streams

You can output formatted string to STL streams like this:

stream << StdStreamOut(format, parameters) << ...;

The Source Code

To use the source code, just drop the .h and .cpp files into your project:

StringUtils.h/StringUtils.cpp - a set of string helper functions. They can be used on their own.
FormatString.h/FormatString.cpp - the string formatting functionality. Requires StringUtils
CString.h/CString.cpp - the string container classes. Requires StringUtils and FormatString

Configuring the Source Code

StringUtils.h defines several macros that can be used to enable or disable parts of the functionality:

STR_USE_WIN32_CONV - If this macro is defined, the code will use the Win32 functions WideCharToMultiByte and MultiByteToWideChar to convert between char and wchar_t strings. Otherwise, it will use wcstombs and mbstowcs. The advantage of using the Win32 functions is that they support conversions between Unicode and different code pages, including UTF-8.
STR_USE_WIN32_NLS - If this macro is defined, the FormatString functions will use the Win32 functionality for formatting numbers, dates and times. Otherwise they will try to simulate their functionality to some extent.
STR_USE_WIN32_TIME - If this macro is defined, the FormatString functions will support the time types time_t, SYSTEMTIME, FILETIME and DATE. Otherwise only time_t will be supported.
STR_USE_WIN32_DBCS - If this macro is defined, the code will use IsDBCSLeadByte to handle DBCS characters. Otherwise isleadbyte will be used.
STR_USE_STL - If this macro is defined, the FormatString functions will support std::string and std::wstring as input parameters. Also FormatStdString and StdStreamOut will be defined that output to std::string, std::wstring, std::ostream and std::wostream.

With these macros, you can selectively enable only the functionality that you need and that is supported by your compiler or platform.

History

Nov, 2006 - First version
- FormatString implementation for char and wchar_t
- Support for numbers, strings and time formats
- Formatting to fixed sized buffers, dynamically allocated buffers and output streams
Dec, 2006 – Better portability and more functionality
- Added configuration macros
- Added support for STL strings and streams
- Added support for different sizes of wchar_t
- Added more robust handling of numeric formats thanks to Mihai Nita's suggestion
Feb, 2007
- Added conversion from UTC time to local time that is consistent with the file system (to be used with file times)