Secure Coding in C and C++: Strings
with Dan Plakosh and Jason Rafail
But evil things, in robes of sorrow, / Assailed the monarch's high estate.
Edgar Allan Poe, "The Fall of the House of Usher"
Strings (such as command-line arguments, environment variables, and console input) are of special concern in secure programming because they comprise most of the data exchanged between an end user and a software system. Graphical and Web-based applications make extensive use of text input fields, and, because of standards like XML, data exchanged between programs is increasingly in string form as well. As a result, weaknesses in string representation, string management, and string manipulation have led to a broad range of software vulnerabilities and exploits.
2.1 String Characteristics
Strings are a fundamental concept in software engineering, but they are not a built-in type in C or C++. C-style strings consist of a contiguous sequence of characters terminated by and including the first null character. A pointer to a string points to its initial character. The length of a string is the number of bytes preceding the null character, and the value of a string is the sequence of the values of the contained characters, in order.
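The distinction between a string's length and the storage it occupies can be sketched as follows. This is a minimal illustration, not part of the original text; the helper names c_string_length and c_string_storage are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Length of a C-style string: the number of bytes that precede
// the first null character.
std::size_t c_string_length(const char *s) {
    assert(s != nullptr);  // a null pointer does not point to a string
    return std::strlen(s);
}

// Storage occupied by the string: its length plus one byte for the
// terminating null character, which is part of the string.
std::size_t c_string_storage(const char *s) {
    assert(s != nullptr);
    return std::strlen(s) + 1;
}
```

Because the string ends at the first null character, an array such as "ab\0cd" holds a string of length 2 even though the array itself is 6 bytes long.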
A wide string is a contiguous sequence of wide characters terminated by and including the first null wide character. A pointer to a wide string points to its initial (lowest addressed) wide character. The length of a wide string is the number of wide characters preceding the null wide character, and the value of a wide string is the sequence of code values of the contained wide characters, in order.
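A short sketch of the same idea for wide strings, where length is counted in wide characters rather than bytes. The helper names are hypothetical, and the size of wchar_t is implementation-defined, so the byte count is expressed in terms of sizeof(wchar_t).

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Length of a wide string: the number of wide characters that precede
// the null wide character (not the number of bytes).
std::size_t wide_length(const wchar_t *ws) {
    assert(ws != nullptr);
    return std::wcslen(ws);
}

// Bytes of storage occupied, including the terminating null wide character.
std::size_t wide_storage_bytes(const wchar_t *ws) {
    assert(ws != nullptr);
    return (std::wcslen(ws) + 1) * sizeof(wchar_t);
}
```

Confusing the two quantities, for example by passing a character count where a byte count is expected, is a common source of undersized buffers.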
Strings in C++
C-style strings are still a common data type in C++ programs, but there have also been many attempts to create string classes. Most C++ developers have written at least one string class and a number of widely accepted forms exist. The standardization of C++ [ISO/IEC 98] has promoted the standard template class std::basic_string and its char and wchar_t instantiations: std::string and std::wstring.
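The relationship between the template and its two instantiations can be checked directly. The sketch below assumes a C++11 or later compiler (for static_assert and std::is_same); count_elements and first_element are hypothetical helpers written once against the template so that they work with both instantiations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are names for instantiations of
// std::basic_string over char and wchar_t, respectively.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is std::basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is std::basic_string<wchar_t>");

// size() counts elements of type CharT, whatever CharT is.
template <typename CharT>
std::size_t count_elements(const std::basic_string<CharT> &s) {
    return s.size();
}

// Returns the first element of a non-empty string.
template <typename CharT>
CharT first_element(const std::basic_string<CharT> &s) {
    assert(!s.empty());  // precondition: the string has at least one element
    return s[0];
}
```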
The basic_string class is less prone to errors that result in security vulnerabilities than C-style strings. Unfortunately, there is a mismatch between C++ string classes and C-style strings. Specifically, most C++ string classes are treated as atomic entities (usually passed by value or reference), while existing C library functions accept pointers to null-terminated character sequences. In the standard C++ string class, the internal representation does not have to be null-terminated [Stroustrup 97]. Some other string types, like Win32 LSA_UNICODE_STRING, do not have to be null-terminated either. As a result, there can be different ways to access string contents, determine the string length, and determine whether a string is empty.
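The mismatch can be illustrated with a small sketch. The helpers c_view_length and is_empty are hypothetical names introduced here. The example relies on two standard facts: c_str() returns a pointer to a null-terminated character sequence, and a std::string may itself contain embedded null characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as a C library function sees it: c_str() yields a pointer to a
// null-terminated sequence, and strlen stops at the first null character.
std::size_t c_view_length(const std::string &s) {
    return std::strlen(s.c_str());
}

// Two ways to ask "is this string empty?", one per representation.
bool is_empty(const std::string &s) {
    return s.empty();
}

bool is_empty(const char *s) {
    assert(s != nullptr);  // a null pointer is not an empty string
    return s[0] == '\0';
}
```

For a std::string built from the five characters 'a', 'b', '\0', 'c', 'd', size() reports 5 while c_view_length reports 2, so code that mixes the two views can silently disagree about where the string ends.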
Except in rare circumstances (in which there are no string literals and no interaction with the existing libraries that accept C-style strings, or in which one uses C-style strings only), it is virtually impossible to avoid having multiple string types within a C++ program. Usually this is limited to C-style strings and one string class, although it is often necessary to deal with multiple string classes within a legacy code base [Wilson 03].