
A History of C and C++ Character Data Types

Nemanja Trifunovic


Nov 25, 2022

CPOL


Purpose, history and scenarios of use of character data types

Introduction

The topic of this article is something that is considered pretty basic in most programming languages: character and string types. However, this is not a beginner-level article on the usage of character data in C and C++. My goal here is to shed some light on the various character data types, their purpose, history and scenarios of use.

People who might benefit from reading this text include:

programmers with experience in other programming languages who are interested in learning more about C and C++
C and C++ programmers of various levels of experience who are not 100% certain when or how to use different character and string types
programming enthusiasts who may find it interesting to learn some history behind development of character and string data types in C and C++

The Original Char Data Type

The original char type was invented at New Jersey-based Bell Labs in 1971, when Dennis Ritchie started extending the B programming language to add types. char was the very first type he added to the language, which was first called NB ("new B") and was later renamed to C.

char is a type that represents the smallest addressable unit of the machine that can contain the basic character set. In addition to character data, it is often used as the "byte" type - for instance, binary data is often declared as char arrays.

For developers who are used to more recent programming languages such as Java or C#, the C version of char looks too flexible and under-specified. For instance, in Java, the char type is a primitive integral type, guaranteed to hold 16-bit unsigned integers representing UTF-16 code units. In contrast, the C/C++ char type:

has the size of one byte, but the number of bits is only guaranteed to be at least 8. There are, or at least used to be, real-life platforms where char had more than 8 bits - as many as 32;
can be either signed or unsigned. The default usually depends on the platform, and it can typically be changed via a compiler flag. signed char and unsigned char exist and are guaranteed to be signed and unsigned respectively, but both are distinct types from char;
can contain a character in any single-byte character set or a byte of a multi-byte character set encoded string.
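
The first two points can be checked on any given compiler. The following is a minimal sketch; the function names bits_per_char and plain_char_is_signed are made up for this illustration and are not part of any library:

```cpp
#include <climits>      // CHAR_BIT
#include <type_traits>  // std::is_signed, std::is_same

// Number of bits in a char on this platform; the standard only promises >= 8.
constexpr int bits_per_char() { return CHAR_BIT; }

// Whether plain char behaves as a signed type with the current compiler settings.
constexpr bool plain_char_is_signed() { return std::is_signed<char>::value; }

// char, signed char and unsigned char are three distinct types, even though
// plain char has the same range as one of the other two.
static_assert(!std::is_same<char, signed char>::value, "char is not signed char");
static_assert(!std::is_same<char, unsigned char>::value, "char is not unsigned char");
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
```

Because the three types are distinct, a function overloaded on char, signed char and unsigned char has three separate overloads.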

To illustrate the previous points - in Java:

 char a = (char)65;   // A
 char b = (char)174;  // ®
 char c = (char)1046; // Ж

We know that a contains a 16-bit unsigned integral value of 65 which represents Latin capital A; b is a 16-bit unsigned integral value of 174 which represents registered trade mark sign; c is a 16-bit unsigned integral value of 1046 which represents a Cyrillic capital letter zhe.

In C or C++, we have no such guarantees.

 char a = (char)65;

a would have at least 8 bits, so we can be certain it will have a value of 65, regardless of the signedness of char. What that value would represent is open to interpretation. For most ASCII-derived character sets, a would represent Latin capital A, just like in Java. However, on an IBM mainframe system which uses EBCDIC encoding, it will not represent a letter at all in the common code pages.

 char b = (char)174;

Assuming char is 8 bits wide, b would have a value of 174 if char is unsigned, or it would wrap around to something like -82 if char is signed. Furthermore, the actual character it represents will differ even among the various ASCII-derived character sets. For instance, in ISO 8859-1 (Latin-1), it will represent the registered trademark sign (just like in Java); in ISO 8859-2 (Latin-2), it would represent "Ž" (Z with caron); in ISO 8859-5 (Latin/Cyrillic), it will represent "Ў" (short U), etc.

 char c = (char)1046;

c is all but guaranteed to overflow with modern hardware architectures and its value will be meaningless.
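
The three assignments above can be reproduced with a small sketch. It assumes an 8-bit char and the usual wrap-around behavior on conversion (only guaranteed by the standard since C++20, but what mainstream compilers have always done); the helper names are hypothetical:

```cpp
#include <climits>  // CHAR_MIN

// The value of (char)value as seen through plain char.
// For 174 this is 174 if char is unsigned and -82 if it is an 8-bit signed type.
constexpr int as_plain_char(int value) { return static_cast<char>(value); }

// The same stored bits viewed as an unsigned byte, independent of char's signedness.
constexpr int as_byte(int value)
{
    return static_cast<unsigned char>(static_cast<char>(value));
}

// True when plain char is signed with the current compiler settings.
constexpr bool char_is_signed() { return CHAR_MIN < 0; }
```

Under those assumptions, as_byte(1046) yields 22: only the low 8 bits of 1046 survive, which has nothing to do with the letter Ж.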

In practice, we rarely assign integer values to chars and use character literals instead. For instance:

 char a = 'A';

That will work and assign an integral value to a. Now, can you guess what will be the actual numerical value stored in the variable? That depends on the compiler, its options, and the encoding of the source file. For most platforms in use nowadays, the value stored in a will be 65. If you compile the code on an IBM mainframe, chances are it will be 193, but it also depends on the compiler settings.

How about:

 char b = '®';

Depending on the compiler and source file encoding, it may end up successfully compiling and storing something like 174 into b, or causing a compiler error. For instance, clang 14.0, which I use on Linux, expects UTF-8 source files and it reports:

error: character too large for enclosing character literal type

Something like:

char c = 'Ж';

could theoretically work if the source file was saved as ISO-8859-5 code page *and* the compiler was set up to use that encoding. Clang again fails with the same error and for the same reason.

Another interesting characteristic of a char literal is that in C its type is int - not char. For instance, sizeof('a') is likely to return something like 4. In C++, the type of the literal is char, and sizeof('a') is guaranteed to be 1.
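
The C++ half of that statement can be verified at compile time. A minimal sketch follows; is_char_overload is a hypothetical name used only to show that overload resolution can tell a character literal from an integer:

```cpp
#include <type_traits>  // std::is_same

// In C++ a character literal has type char, so its size is exactly 1.
static_assert(std::is_same<decltype('a'), char>::value, "'a' is a char in C++");
static_assert(sizeof('a') == 1, "sizeof('a') is 1 in C++");

// Because the literal is a char, overloads can distinguish it from an int.
constexpr bool is_char_overload(char) { return true; }
constexpr bool is_char_overload(int)  { return false; }
```

This distinction is what allows a stream to print 'a' as a letter rather than as the number 97.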

C-Style Strings

Obviously, character data is most often used as strings of characters rather than individual ones. In C, a string is an array of characters, terminated by a character with value 0. There are some C libraries that implement so-called "Pascal-style" strings, where the character array is prefixed by its length, but they are rare as the language itself favors null-terminated strings. For instance, the type of the string literal "abc" will be char[4] in C (in C++, it is const char[4]) - it will include space for the trailing zero.
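
In C++, the array type of a string literal, including the slot for the terminator, can be inspected directly. This is only a sketch; the helper name terminator is made up:

```cpp
#include <type_traits>  // std::is_same

// Three letters plus the terminating zero: four chars in total.
static_assert(sizeof("abc") == 4, "the literal includes the trailing zero");

// A string literal is an lvalue, so decltype yields a reference to const char[4].
static_assert(std::is_same<decltype("abc"), const char (&)[4]>::value,
              "\"abc\" is a const char[4] in C++");

// The last element of the array is the terminator itself.
constexpr char terminator() { return "abc"[3]; }
```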

In the simplest case, a C-style string will be encoded with a single-byte character set, and in this case, each char corresponds to a "letter" that can be displayed on a screen. Obviously, a char array does not carry information about the character set, so it would have to be supplied separately for the text to be displayed correctly.

C-style strings can work well with multibyte encodings as long as they do not require embedded zeros. In that case, a single char contains a byte that could correspond either to an individual character or a part of a multi-byte encoded one.

The C Standard Library assumes strings are zero-terminated. For instance, a naive implementation of strlen() function could look like:

size_t strlen(const char *str)
{
    const char *c = str;
    while (*c != 0)
        ++c;
    return (c - str);
}

Let's look at how the above-mentioned characteristics of the char type affect strings of data. If we look at the strlen example, we can see that:

it is not affected by the size of char. Regardless of the number of bits, it will correctly count until it hits the 0;
it is not affected by the sign of char. It will return the same value regardless of whether char is signed or unsigned;
depending on the character set, the value returned by strlen may or may not be what the caller expected. Namely, the function will always return the number of chars in the array and that is usually the number of characters if the character set is single-byte. For multi-byte character sets, the number of chars returned will often differ from the user-perceived number of characters.
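
To make the last point concrete, here is a sketch that counts code points instead of chars. It assumes the input is valid UTF-8, and the function name utf8_code_points is hypothetical. Note that even code points are not always the same thing as user-perceived characters:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts Unicode code points in a zero-terminated string that is ASSUMED to be
// valid UTF-8. Every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
inline std::size_t utf8_code_points(const char* str)
{
    assert(str != nullptr);
    std::size_t count = 0;
    for (; *str != 0; ++str) {
        // Go through unsigned char so the test does not depend on char's signedness.
        if ((static_cast<unsigned char>(*str) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the Cyrillic letter Ж, which UTF-8 encodes as the two bytes 0xD0 0x96, strlen() reports 2 while this function reports 1.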

C++ String Class(es)

The C++ Standard Library provides the class template std::basic_string<>, which can be instantiated for various character types. It is declared in the <string> header along with typedefs for character type instantiations. The typedef for the char instantiation, std::basic_string<char>, is std::string.

Historically, the string class predates both templates and namespaces in C++, and before the C++ standard was adopted in 1998, it was just one of the many string classes in use. In its early days, the string class was often implemented with copy-on-write semantics, which led to various problems, especially in multi-threaded environments, and was eventually prohibited by the C++11 standard.

Nowadays, std::basic_string is widely adopted and used. Its implementations usually contain "small string optimization" - a technique where a small buffer embedded in the string object itself is used for short strings, avoiding a heap allocation. With the adoption of move semantics, strings play well with the C++ containers without introducing unnecessary copies. Using char* for strings in modern C++ rarely makes sense, except at the API level.
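
As a rough illustration of the "API level" remark, the sketch below keeps std::string everywhere and exposes a char pointer only at the call into a C-style function. Both c_api_length and make_greeting are made-up names, with c_api_length standing in for any C API that expects a zero-terminated string:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C API that takes a zero-terminated string.
inline std::size_t c_api_length(const char* text)
{
    assert(text != nullptr);
    return std::strlen(text);
}

// Ordinary C++ code builds and returns std::string by value;
// the returned local is moved or elided rather than copied.
inline std::string make_greeting(const std::string& name)
{
    std::string greeting = "Hello, " + name;
    return greeting;
}
```

A call such as c_api_length(make_greeting("C++").c_str()) hands the zero-terminated buffer to the C side without any manual memory management.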

wchar_t

In the late 1980s, an initiative was started to introduce a universal character set that would replace all the legacy character encodings based on 8-bit code units. The idea was to extend the popular ASCII character set from 7 to 16 bits, which was considered enough to cover the characters of all world languages. The new encoding standard was called Unicode and the first version was published in late 1991.

To support the new, "wide" characters, a new type was added to the C90 standard - wchar_t. It was defined as "an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales". "w" in wchar_t means "wide" to emphasize that wchar_t is generally (but not necessarily!) wider than char. The "_t" part comes from the fact that in C wchar_t is not a distinct compiler type but a typedef to another integral type, such as unsigned short. wchar_t is declared in the <wchar.h> header along with various functions to work with wide strings, such as wcslen(), wprintf(), etc.

In pre-standard C++, wchar_t also started as a typedef, but it was soon decided it had to be a distinct compiler type. Even after standardization, it remained tied to an "underlying type" which is one of the other integral types.

In practice (but not by the letter of any standard), wchar_t can be signed or unsigned depending on the platform, and it comes in two sizes:

on Microsoft Windows and IBM AIX, it is 16 bits
on virtually every other platform, it is 32 bits

This difference in sizes is an unfortunate historical accident - the early adopters of Unicode went with the 16-bit size, which was compatible with the Unicode 1.0 standard. After later versions of the Unicode standard introduced supplementary planes, wchar_t was used for the UTF-16 encoding form on Windows and AIX, and for the UTF-32 encoding form on other platforms where Unicode was adopted later.

Along with wchar_t, the new literals for wide characters were introduced: L'' for a wide character and L"" for a wide string.

The C++ Standard defines class std::wstring as an instantiation of the basic_string class template that uses wchar_t as the character type.
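
The practical effect of the two sizes shows up as soon as a character outside the Basic Multilingual Plane is involved. The following sketch uses hypothetical helper names and U+1F600 as the sample character:

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>
#include <string>

// Width of wchar_t in bits on this platform (typically 16 or 32).
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Number of wchar_t code units the compiler needs for U+1F600 in a wide
// literal, not counting the terminator: a surrogate pair with a 16-bit
// wchar_t, a single unit with a 32-bit one.
constexpr std::size_t grinning_face_units()
{
    return sizeof(L"\U0001F600") / sizeof(wchar_t) - 1;
}

// std::wstring reports its length in wchar_t units, just like wcslen() does.
inline std::size_t wide_length(const std::wstring& text) { return text.size(); }
```

The same source line therefore produces strings of different lengths on Windows and on Linux, which is one reason portable code tends to avoid wchar_t.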

C11 / C++11 Character Types

In C11, two new character types were introduced: char16_t and char32_t; both are declared in the <uchar.h> header and are typedefs to unsigned integral types. The former is used to store 16-bit characters and has to be at least 16 bits wide. The latter is used for 32-bit characters and has to be at least 32 bits wide.

Just like with wchar_t, new literals were introduced: u'' for char16_t and U'' for char32_t. Unlike with wchar_t, there are no new string functions equivalent to the ones for char that will work with new types; there is no strlen() for char16_t.

The third new type of literal introduced is the u8"" string literal. It works with the old char type and is used for the UTF-8 encoding form. The corresponding u8'' character literal followed later, in C++17.

With C11, wchar_t becomes basically useless (although not officially deprecated). The character types are meant to be used in the following scenarios:

char for UTF-8 Unicode encoding form, various single-byte and multi-byte legacy encodings, and as a byte type
char16_t for UTF-16 Unicode encoding form
char32_t for UTF-32 Unicode encoding form

Unsurprisingly, C++11 introduced two identically named character types. Unlike in C, they are distinct built-in types rather than typedefs, and their names are new keywords.

Two new typedefs for std::basic_string instantiations for the two new types were introduced in C++11:

std::u16string - a typedef for std::basic_string<char16_t>
std::u32string - a typedef for std::basic_string<char32_t>
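
To show how the two types relate, here is a sketch that converts a single UTF-32 code point held in a char32_t into a UTF-16 std::u16string. The function name to_utf16 is hypothetical, and the input is assumed to be a valid Unicode scalar value (at most 0x10FFFF and not a surrogate):

```cpp
#include <cassert>
#include <string>

// Encodes one code point as UTF-16.
// ASSUMPTION: cp is a valid Unicode scalar value; only the upper bound is checked.
inline std::u16string to_utf16(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the remaining 20 bits into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

The result can be compared against the compiler's own u"" literals: Ж (U+0416) stays a single code unit, while U+1F600 becomes two.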

C++20 char8_t

C++20 introduces a new character type specifically for UTF-8 encoded character data: char8_t. It has the same size and signedness as unsigned char but is a distinct type. The u8'' character literal and u8"" string literal have been changed to yield the new type. A new typedef std::u8string for std::basic_string<char8_t> was introduced.
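
Since the element type of a u8"" literal depends on the language version, code that has to build in both modes can check the __cpp_char8_t feature-test macro. A minimal sketch, with u8_element and zhe_utf8_units as made-up names:

```cpp
#include <cstddef>
#include <type_traits>

// Element type of a u8 string literal: char8_t when the compiler supports it,
// plain char otherwise.
using u8_element =
    std::remove_cv<std::remove_reference<decltype(u8"x"[0])>::type>::type;

#if defined(__cpp_char8_t)
static_assert(std::is_same<u8_element, char8_t>::value, "u8 literals use char8_t");
static_assert(!std::is_same<char8_t, unsigned char>::value, "char8_t is a distinct type");
static_assert(!std::is_signed<char8_t>::value, "char8_t is unsigned");
#else
static_assert(std::is_same<u8_element, char>::value, "u8 literals use plain char");
#endif

// Ж (U+0416) takes two UTF-8 code units whichever element type is in use.
constexpr std::size_t zhe_utf8_units()
{
    return sizeof(u8"\u0416") / sizeof(u8_element) - 1;
}
```

The encoded bytes are identical in both modes; only the static type changes, which is why code that assigns u8"" literals to const char* stops compiling under C++20.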

The upcoming C standard (probably C23) includes a proposal for char8_t, which is a typedef to unsigned char.

Conclusion

C and C++ character and string types reflect the long history of the languages.

The original char type is still in the widest use. In new code, it should be used for legacy single-byte and multibyte encodings, and for non-character binary data. It works well with UTF-8 encoded strings and can be used for them, especially with compilers that don't support the char8_t type.

wchar_t turned out to be a victim of changing Unicode specifications. There is no good reason to use it today in new code, even with ancient compilers.

char16_t should be used for UTF-16 encoded strings in places where various "widechar" typedefs have been used in the past.

char32_t should be used for UTF-32 encoded strings. Although it is very rare to see strings encoded as UTF-32 due to its memory inefficiency, individual code points are frequently UTF-32 encoded and char32_t is the ideal type for that purpose.

char8_t has been only recently introduced to C++ and only proposed for C. It is not clear whether its advantages over plain old char for encoding UTF-8 strings will be enough to see widespread use.

History

25th November, 2022: Initial version