This article discusses various character data types, their purpose, history and scenarios of use.
The topic of this article is something that is considered pretty basic in most programming languages: character and string types. However, this is not a beginner-level article on the usage of character data in C and C++. My goal here is to shed some light on the various character data types, their purpose, history, and scenarios of use.
People who might benefit from reading this text include:
- programmers with experience in other programming languages who are interested in learning more about C and C++
- C and C++ programmers of various levels of experience who are not 100% certain when or how to use the different character and string types
- programming enthusiasts who may find it interesting to learn some history behind development of character and string data types in C and C++
The Original Char Data Type
The char type was invented at New Jersey-based Bell Labs in 1971, when Dennis Ritchie started extending the B programming language to add types. char was the very first type he added to the language, which was first called NB ("new B") and was later renamed to C.
char is a type that represents the smallest addressable unit of the machine that can contain the basic character set. In addition to character data, it is often used as the "byte" type - for instance, binary data is often declared as an array of char.
For developers who are used to more recent programming languages such as Java or C#, the C version of char looks too flexible and under-specified. For instance, in Java, the char type is a primitive integral type, guaranteed to hold 16-bit unsigned integers representing UTF-16 code units. In contrast, the C/C++ char:
- has the size of one byte, but the number of bits is only guaranteed to be at least 8. There are, or at least used to be, real-life platforms where char had more than 8 bits - as many as 32;
- can be either signed or unsigned. The default usually depends on the platform, and it can typically be changed via a compiler flag. signed char and unsigned char exist and are guaranteed to be signed and unsigned respectively, but they are both distinct types from plain char;
- can contain a character in any single-byte character set, or one byte of a character encoded in a multi-byte character set.
To illustrate the previous points - in Java:
char a = (char)65;
char b = (char)174;
char c = (char)1046;
We know that:
- a contains a 16-bit unsigned integral value of 65, which represents the Latin capital letter A;
- b is a 16-bit unsigned integral value of 174, which represents the registered trade mark sign;
- c is a 16-bit unsigned integral value of 1046, which represents the Cyrillic capital letter zhe.
In C or C++, we have no such guarantees.
char a = (char)65;
a would have at least 8 bits, so we can be certain it will have a value of 65, regardless of the "signedness" of char. What that value represents is open to interpretation. For most ASCII-derived character sets, a would represent the Latin capital letter A, just like in Java. However, on an IBM mainframe system which uses an EBCDIC encoding, it will not represent anything meaningful.
char b = (char)174;
Assuming char is 8 bits wide, b could have a value of either 174, or it would overflow into something like -82, depending on whether the char type is unsigned or signed. Furthermore, the actual character it represents will differ even across the various ASCII-derived character sets. For instance, in ISO 8859-1 (Latin-1), it will represent the registered trademark sign (just like in Java); in ISO 8859-2 (Latin-2), it would represent "Ž" (Z with caron); in ISO 8859-5 (Latin/Cyrillic), it will represent "Ў" (short U), etc.
char c = (char)1046;
c is all but guaranteed to overflow on modern hardware architectures, and its value will be meaningless.
In practice, we rarely assign integer values to chars and use character literals instead. For instance:
char a = 'A';
That will work and assign an integral value to a. Now, can you guess what the actual numerical value stored in the variable will be? That depends on the compiler, its options, and the encoding of the source file. For most platforms in use nowadays, the value stored in a will be 65. If you compile the code on an IBM mainframe, chances are it will be 193, but that also depends on the compiler settings.
char b = '®';
Depending on the compiler and the source file encoding, it may end up successfully compiling and storing a single-byte value into b, or it may cause a compiler error. For instance, clang 14.0, which I use on Linux, expects UTF-8 source files and reports:
error: character too large for enclosing character literal type
char c = 'Ж';
This could theoretically work if the source file was saved in the ISO 8859-5 code page *and* the compiler was set up to use that encoding. Clang again fails with the same error, for the same reason.
Another interesting characteristic of a char literal is that in C its type is int - not char. For instance, sizeof('a') is likely to return something like 4. In C++, the type of the literal is char, so sizeof('a') is guaranteed to be 1.
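A minimal sketch of this difference - the program below is valid in both languages, and the exact value printed when compiled as C depends on the platform's sizeof(int):
#include <stdio.h>

int main(void)
{
    /* Compiled as C, this typically prints 4 (the size of int);
       compiled as C++, it is guaranteed to print 1. */
    printf("%zu\n", sizeof('a'));
    return 0;
}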
Obviously, character data is most often used as strings of characters rather than as individual ones. In C, a string is an array of characters, terminated by a character with the value 0. There are some C libraries that implement so-called "Pascal-style" strings, where the character array is prefixed by its length, but they are rare as the language itself favors null-terminated strings. For instance, the type of the string literal "abc" will be char[4] in C (const char[4] in C++) - it includes space for the trailing zero.
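A small sketch of the trailing zero being counted:
#include <stdio.h>

int main(void)
{
    char s[] = "abc";              /* the array is initialized from the string literal */
    printf("%zu\n", sizeof(s));    /* prints 4: 'a', 'b', 'c' and the terminating 0 */
    return 0;
}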
In the simplest case, a C-style string will be encoded with a single-byte character set, and in this case, each char corresponds to a "letter" that can be displayed on a screen. Obviously, a char array does not carry information about the character set, so it would have to be supplied separately for the text to be displayed correctly. C strings can work well with multibyte encodings as long as they do not require embedded zeros. In that case, a single char contains a byte that could correspond either to an individual character or to a part of a multi-byte encoded one.
The C Standard Library assumes strings are zero-terminated. For instance, a naive implementation of the strlen() function could look like this:
size_t strlen(const char *str)
{
    const char *c = str;
    while (*c != 0)     /* count until the terminating zero */
        c++;
    return (size_t)(c - str);
}
Let's look at how the above-mentioned characteristics of the char type affect strings of char data. If we look at the strlen example, we can see that:
- it is not affected by the size of char. Regardless of the number of bits, it will correctly count until it hits the terminating zero;
- it is not affected by the signedness of char. It will return the same value regardless of whether char is signed or unsigned;
- depending on the character set, the value returned by strlen may or may not be what the caller expected. Namely, the function always returns the number of chars in the array, and that is usually the number of characters if the character set is single-byte. For multi-byte character sets, the number of chars returned will often differ from the user-perceived number of characters.
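To illustrate that last point, here is a minimal sketch; it assumes the source file and the execution character set are both UTF-8:
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "Жарко";        /* 5 user-perceived characters, 10 bytes in UTF-8 */
    printf("%zu\n", strlen(s));     /* prints 10 - the number of chars, not characters */
    return 0;
}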
C++ String Class(es)
The C++ Standard Library provides the template class std::basic_string<>, which can be instantiated for various character types. It is declared in the <string> header, along with typedefs for the character type instantiations. The string class (nowadays a typedef for the char instantiation) predates both templates and namespaces in C++, and before the C++ standard was adopted in 1998, it was just one of the many string classes in use. In its early days, the string class was often implemented with copy-on-write semantics, which led to various problems, especially in multi-threaded environments, and copy-on-write was eventually prohibited by the C++11 standard.
std::basic_string is widely adopted and used. Its implementations usually employ the "small string optimization" - a technique where a small buffer inside the string object itself is used for short strings, avoiding a heap allocation. With the adoption of move semantics, strings play well with the C++ containers without introducing unnecessary copies. Using C-style strings in modern C++ rarely makes sense, except at the API level.
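To illustrate the point about move semantics, here is a minimal sketch: moving a std::string into a container hands over its heap buffer instead of copying it.
#include <string>
#include <utility>
#include <vector>

int main()
{
    std::string s = "a string long enough not to fit into the small-string buffer";
    std::vector<std::string> v;
    v.push_back(std::move(s));   // the heap buffer is transferred, not copied
    // s is left in a valid but unspecified state
    return 0;
}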
In the late 1980s, an initiative was started to introduce a universal character set that would replace all the legacy character encodings based on 8-bit code units. The idea was to extend the popular ASCII character set from 7 to 16 bits, which was considered enough to cover the characters of all the world's languages. The new encoding standard was called Unicode, and its first version was published in late 1991.
To support the new, "wide" characters, a new type was added to C90 standard -
wchar_t. It was defined as "an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales". "
wchar_t means "
wide" to emphasize that
wchar_t is generally (but not necessarily!) wider than
char. The "
_t" part comes from the fact that in C
wchar_t is not a distinct compiler type but a
typedef to another integral type, such as
wchar_t is declared in the wchar.h along with various functions to work with wide
strings, such as
In pre-standard C++, wchar_t also started as a typedef, but it was soon decided that it had to be a distinct compiler type. Even after standardization, it remained tied to an "underlying type", which is one of the other integral types.
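A small sketch of what being a distinct compiler type means in practice in C++ - wchar_t participates in overload resolution on its own, independently of its underlying type:
#include <iostream>

void print(int)      { std::cout << "int\n"; }
void print(wchar_t)  { std::cout << "wchar_t\n"; }

int main()
{
    print(L'Ж');   // selects the wchar_t overload, not the int one
    return 0;
}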
In practice (though not by the letter of any standard), wchar_t comes in two sizes:
- on Microsoft Windows and IBM AIX, it is 16 bits
- on virtually every other platform, it is 32 bits
This difference in sizes is an unfortunate historical accident - the early adopters of Unicode went with the 16-bit size, which was compatible with the Unicode 1.0 standard. After later versions of the Unicode standard introduced the supplementary planes, wchar_t ended up holding the UTF-16 encoding form on Windows and AIX, and the UTF-32 encoding form on the other platforms, where Unicode was adopted later.
Along with wchar_t, new literals for wide characters and strings were introduced: L'' for a wide character and L"" for a wide string.
The C++ Standard defines the class std::wstring as the instantiation of the basic_string class template that uses wchar_t as the character type.
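A minimal sketch of the wide literals and std::wstring, assuming a UTF-8 encoded source file that the compiler accepts:
#include <iostream>
#include <string>

int main()
{
    wchar_t wc = L'Ж';              // a single wide character
    std::wstring ws = L"Жарко";     // 5 code units whether wchar_t is 16 or 32 bits wide
    std::wstring smile = L"😀";     // 2 code units where wchar_t holds UTF-16, 1 where it holds UTF-32
    std::cout << sizeof wc << ' ' << ws.size() << ' ' << smile.size() << '\n';
    return 0;
}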
C11 / C++11 Character Types
In C11, two new character types were introduced: char16_t and char32_t; both are declared in the <uchar.h> header and are typedefs to unsigned integral types. The former is used to store 16-bit characters and has to be at least 16 bits wide. The latter is used for 32-bit characters and has to be at least 32 bits wide.
Just like with wchar_t, new literals were introduced: u'' for char16_t and U'' for char32_t. Unlike with wchar_t, there are no new string functions equivalent to the ones for char that work with the new types; there is no char16_t or char32_t counterpart of strlen(), for example.
The third new type of literal introduced is u8"". It works with the old char type and is used for the UTF-8 encoding form.
With the new types available, wchar_t becomes basically useless (although it is not officially deprecated). The character types are meant to be used in the following scenarios:
- char for the UTF-8 Unicode encoding form, for various single-byte and multi-byte legacy encodings, and as a byte type
- char16_t for the UTF-16 Unicode encoding form
- char32_t for the UTF-32 Unicode encoding form
Unsurprisingly, C++11 introduced two identically named character types. Unlike in C, they are distinct built-in types (and new keywords) rather than typedefs.
std::basic_string instantiations for the two new types were introduced in C++11:
- std::u16string - a typedef for std::basic_string<char16_t>
- std::u32string - a typedef for std::basic_string<char32_t>
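A minimal sketch of the new literals and string typedefs, again assuming a UTF-8 encoded source file:
#include <iostream>
#include <string>

int main()
{
    char16_t c16 = u'Ж';              // one UTF-16 code unit
    char32_t c32 = U'Ж';              // one UTF-32 code unit - a full code point
    std::u16string s16 = u"Жарко";    // 5 UTF-16 code units
    std::u32string s32 = U"Жарко";    // 5 UTF-32 code units
    std::cout << static_cast<unsigned>(c16) << ' ' << static_cast<unsigned>(c32) << ' '
              << s16.size() << ' ' << s32.size() << '\n';   // prints "1046 1046 5 5"
    return 0;
}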
C++20 introduces a new character type specifically for UTF-8 encoded character data: char8_t. It has the same size and signedness as unsigned char, but is a distinct type. The u8'' character literal and the u8"" string literal have been changed to produce the new type, and a new typedef, std::u8string, for std::basic_string<char8_t> was introduced.
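A minimal C++20 sketch of the new type, under the same source encoding assumption:
#include <iostream>
#include <string>

int main()
{
    char8_t c = u8'a';                // a single UTF-8 code unit; u8'' now yields char8_t
    std::u8string s = u8"Жарко";      // 10 UTF-8 code units - each Cyrillic letter takes two bytes
    std::cout << static_cast<unsigned>(c) << ' ' << s.size() << '\n';   // prints "97 10"
    return 0;
}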
The upcoming C standard (probably C23) includes a proposal for char8_t, which is a typedef for unsigned char.
C and C++ character and string types reflect the long history of the languages.
The char type is still the one in widest use. In new code, it should be used for legacy single-byte and multibyte encodings, and for non-character binary data. It works well with UTF-8 encoded strings and can be used for them, especially with compilers that don't yet support char8_t.
wchar_t turned out to be a victim of the changing Unicode specifications. There is no good reason to use it today in new code, even with ancient compilers.
char16_t should be used for UTF-16 encoded strings, in places where various home-grown 16-bit character typedefs have been used in the past.
char32_t should be used for UTF-32 encoded strings. Although it is very rare to see whole strings encoded as UTF-32 due to its memory inefficiency, individual code points are frequently represented in the UTF-32 encoding form, and char32_t is the ideal type for that purpose.
char8_t has only recently been introduced to C++ and has only been proposed for C. It is not clear whether its advantages over plain old char for encoding UTF-8 strings will be enough for it to see widespread use.
- 25th November, 2022: Initial version
Born in Kragujevac, Serbia. Now lives in the Boston area with his wife and daughters.
Wrote his first program at the age of 13 on a Sinclair Spectrum, became a professional software developer after he graduated.
Very passionate about programming and software development in general.