Handling User Input in C
by Paul Hsieh
Introduction
User input of strings has been part of computer programs since the very early days of programming. While this is a no-brainer for most programming languages, the C language has great difficulty with it. The problem is compounded by the numerous imperfect rules of thumb (fgets(), then sscanf()) mixed with no end of inadequate solutions (gets() or scanf()). The purpose of this article is simply to analyze the problem, propose a no-compromise solution, then discuss input security issues.
gets()
Because C has neither a proper concept of variable-sized arrays nor bounds-protected arrays, it has no intrinsic way to accept arbitrary user input and store it in a single data object. Of course, while nothing prevented the C language from having library functions that could handle this kind of operation, no such function was ever added to it.
There is a common misconception among C programmers that gets() can be used for user input. Of course, it cannot. The source of the misconception is the fact that it can be used in totally controlled test scenarios with bounded inputs, and that when it fails due to “undefined behavior” by exceeding the buffer bounds, the common compiler implementation behaves as if no error has occurred. I.e., programmers who are not told otherwise, or who do not think of the C language in terms of its implementation, have basically no way of knowing that gets() is a totally unusable function.
For those that don't know, the flaw is simple. gets() requires that a buffer with a predeclared fixed size be provided prior to receiving the input. If the input length exceeds the buffer, this necessarily leads to undefined behavior. In fact, in most compilers this will simply overrun the buffer. I.e., the side effect of the undefined behavior is not under the control of the programmer, but rather of the user who provides the input.
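To make the flaw concrete, here is a minimal sketch of the problem (the 16-byte buffer is arbitrary, and this assumes a pre-C11 compiler that still declares gets()):

#include <stdio.h>

int main (void) {
    char buf[16];   /* fixed size chosen at compile time */

    /* gets() has no way of knowing that buf holds only 16 bytes; any input
       line of 16 or more characters writes past the end of buf.  Whether
       the overflow happens is decided by the user typing, not by this code. */
    gets (buf);

    printf ("%s\n", buf);
    return 0;
}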
This flaw is so serious that it was the source of the infamous “finger exploit” (finger is a Unix service), which so thoroughly tainted the reputation of finger that most Unix installations now turn it off entirely. I.e., “finger” got such a bad reputation that people won't touch it with a 10-foot pole.
In fact, my personal opinion is that the safest implementation of gets() is as follows:
#include <stdio.h>

char * gets_fixed (char * buf, const char * sourcefile) {
    remove (sourcefile);
    return "Attempted callsite source file removal for calling gets()";
}

#undef gets
#define gets(x) gets_fixed ((x), __FILE__)
If a programmer is unaware that using gets() is a bad thing, this should make them aware very quickly.
Now, while this disaster waiting to happen is fairly well known, the reaction from the C community has been far less than stellar. On the one hand, the ANSI C committee, which has revised the C standard a number of times since its first introduction, has never deprecated gets() or removed it from the standard, even though there are members of the committee who are aware of the problem. This is representative of the fact that the committee members are more sensitive to the needs of compiler vendors than to the needs of programmers. (Update: The committee apparently has seen the light, and will, in fact, deprecate gets() in the forthcoming C0x standard.)
On the other hand, the common retort from C programmers (even those considered experienced) is to say that fgets() should be used as an alternative. Of course, by itself, fgets() doesn't really handle user input per se. Besides having a bizarre termination condition (upon encountering \n or EOF, but not \0), the mechanism chosen for termination when the buffer has reached capacity is to simply halt the fgets() operation abruptly and \0-terminate the buffer. So if user input exceeds the length of the preallocated buffer, fgets() returns a partial result. To deal with this, programmers have a couple of choices: 1) simply live with truncated user input (there is no way to feed back to the user, while they are providing input, that it has been truncated), or 2) simulate a growable character array and fill it in with successive calls to fgets(). The first solution is almost always a very poor one for variable-length user input, because the buffer will inevitably be too large most of the time (it is trying to capture too many ordinary cases) and yet too small for unusual cases. The second solution is fine, except that it can be complicated to implement correctly (a sketch follows below). Neither deals with fgets()'s odd behavior with respect to '\0'.
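For the second choice, a minimal sketch of the growable-buffer approach might look like the following (the helper name readLine and the initial size of 64 bytes are arbitrary choices for this illustration, and, as noted, it inherits fgets()'s inability to cope with an embedded '\0'):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of arbitrary length from fp by repeatedly calling fgets()
   into a growing buffer.  Returns a malloced '\0'-terminated string (the
   caller frees it) or NULL on EOF/error. */
static char * readLine (FILE * fp) {
    size_t mlen = 64, slen = 0;
    char * s = (char *) malloc (mlen);

    if (!s) return NULL;
    s[0] = '\0';

    while (fgets (s + slen, (int) (mlen - slen), fp)) {
        slen += strlen (s + slen);                 /* blind to embedded '\0' */
        if (slen > 0 && s[slen-1] == '\n') return s;   /* got a whole line */
        if (slen + 1 >= mlen) {                        /* buffer filled; grow */
            char * t = (char *) realloc (s, mlen += mlen);
            if (!t) break;
            s = t;
        }
    }

    if (slen > 0) return s;    /* last line of a file with no trailing '\n' */
    free (s);
    return NULL;
}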
Exercise left to the reader: In order to determine how many bytes were really read by a call to fgets(), one might try scanning, just as fgets() does, for a '\n', skipping over any '\0' while not exceeding the size passed to fgets(). Explain why this is insufficient for the very last line of a stream. What weakness of ftell() prevents it from addressing this problem completely?
Exercise left to the reader: Solve the problem of determining the length of the data consumed by fgets() by overwriting the entire buffer with a non-zero value between each call to fgets().
So with fgets() we are left with the choice of writing a lot of code and living with a line termination condition that is inconsistent with the rest of the C library, or accepting an arbitrary cut-off. If this is not good enough, then what are we left with? scanf() mixes parsing with reading in a way that cannot be separated, and fread() will read past the end of the string. In short, the C library leaves us with nothing. We are forced to roll our own on top of fgetc() directly. So let's give it a shot.
getInputFrag()
First we need to establish the goals and properties we require of a mechanism that could be used for user input. Our goal is to rise above the confines of the available C mechanisms which are at the heart of the problem and try to solve the problem in a definitive way.
The mechanism must not allow any condition of buffer overflowing that cannot be avoided by the programmer while still being able to perform in normal circumstances.
The mechanism must allow for the creation of a standard '\0' terminated string that is a faithful representation of the input data.
The mechanism must allow the programmer to set policies regarding maximum input length (if any), input line terminators, and how aborting is handled. I.e., it should not impose such policies as a matter of arbitrary convention.
The mechanism should not automatically dictate the amount of memory used for capturing input outside of the control of the programmer.
The mechanism must not directly or indirectly leak the input data into memory that is outside of programmer control (for password security.)
The mechanism should not introduce new undefined behaviors that a programmer would not naturally expect.
The mechanism should allow the input length to be completely unbounded by any value (including actual available memory).
The mechanism should be thread safe and re-entrant (it should not use any static storage).
Based on these criteria, we can write a single function which can help us satisfy all these constraints:
#include <stdio.h>   /* for EOF */

/* Reads characters via cbGetc into buf until len characters have been
   stored, EOF is reached, or the previously stored character was '\0' or
   termchar.  When it stops early it appends a '\0'.  Returns the count of
   characters stored (excluding any appended '\0'), or -1 on bad arguments. */
long getInputFrag (/* @out@ */ char * buf, long len,
                   int termchar,
                   int (* cbGetc) (void * ctx), void * cbGetcCtx) {
    long i;
    int c;

    if (!buf || len < 0 || !cbGetc) return -1;

    /* Initialize c to a value that is neither '\0' nor termchar. */
    for (c = (~termchar) | 1, i = 0; i < len; i++) {
        if (c == (int) '\0' || c == termchar ||
            EOF == (c = cbGetc (cbGetcCtx))) {
            buf[i] = '\0';
            break;
        }
        buf[i] = (char) c;
    }

    return i;
}
The key idea here is to present finite input fragments, much like fgets() does; however, it always returns the length of each fragment read. So receiving long inputs boils down to successive calls to getInputFrag() and concatenating (or otherwise incrementally parsing) the results.
The abstracted cbGetc function is simply a stand-in for the fgetc() function. A programmer can use an alternative pseudo-IO function in its place (for something like sockets, a memory-simulated file, or a version of fgetc() that does not echo, as examples).
'\0' is never ignored (since as a matter of consistency, no other string-aware C library function, besides fgets(), ignores the '\0' terminator in C strings). The contents of the string may contain at most one occurrence of '\0' and it will appear as the last character of the buffer. The termchar parameter allows for an additional terminator character to be used if so desired. To disable the use of an additional terminator, simply pass '\0' as the terminator.
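As an illustration of plugging in an alternative to fgetc(), here is a hedged sketch of a callback that reads from an in-memory string rather than a FILE (the names memCtx, memGetc, and demo are invented for this example):

#include <stdio.h>   /* for EOF and printf */

/* Context for reading characters out of an ordinary C string. */
struct memCtx {
    const char * s;   /* remaining input */
};

static int memGetc (void * ctx) {
    struct memCtx * m = (struct memCtx *) ctx;
    if (*m->s == '\0') return EOF;              /* end of string acts as EOF */
    return (int) (unsigned char) *m->s++;
}

/* Read one '\n'-terminated fragment out of a string "file".  Passing '\0'
   as termchar instead of '\n' would disable the extra terminator. */
void demo (void) {
    struct memCtx m = { "first line\nsecond line\n" };
    char buf[16];
    long len = getInputFrag (buf, (long) sizeof (buf) - 1, (int) '\n',
                             memGetc, &m);
    if (len >= 0) {
        buf[len] = '\0';                        /* same convention as fgetstr() */
        printf ("%ld: %s", len, buf);
    }
}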
Exercise left to the reader: Although getInputFrag has multiple modes of dealing with line terminators, show that the input byte sequence is always reconstructable exactly as it was read from the sequence of cbGetc calls. In particular demonstrate how to determine if the last character of a file is a terminator character or not with O(1) operations.
This function may be a little bit too primitive to use in the raw as is. We would also want to reduce the potential for a lot of redundant wrapper implementations. So we should supply some sample wrappers which will give us the same input mechanisms that the C language standard library has without any of its defects:
#include <stdio.h>
#include <stdlib.h>

/* Work around an ANSI requirement that casting a function pointer does not
   change its parameters in any way. */
static int xfgetc (void * ctx) {
    return fgetc ((FILE *) ctx);
}

/* Very much like fgets except that it returns the length of the read string
   or -1 if there was a problem reading the string and it also terminates
   early if a '\0' is read.  A terminating '\0' is always added to the end. */
long fgetstr (char * buff, int n, FILE * fp) {
    long len;

    if (!buff) return -1;
    if (1 == n) {
        buff[0] = '\0';
        return 0;
    }
    if (n < 1 || !fp) return -1;

    len = getInputFrag (buff, n-1, (int) '\n', xfgetc, fp);
    if (len >= 0) buff[len] = '\0';
    return len;
}

#define getstr(buff, n) fgetstr ((buff), (n), stdin)

/* Fills *p with a malloced buffer that contains an input line.  The line is
   always '\0' terminated.  If p is NULL, it is not filled in.  The number of
   characters read is returned. */
size_t fgetstralloc (char ** p, FILE * fp) {
    size_t mlen, slen, dlen, len;
    char * s, * t;

    if (!fp) return (size_t) -1;

    if (!p) {
        char blk[64];
        for (slen = 0; ; slen += 64) {
            len = (size_t) getInputFrag (blk, 64, (int) '\n', xfgetc, fp);
            if (len != 64 || blk[len-1] == '\n' || blk[len-1] == '\0')
                return slen + len;
        }
    }

    mlen = 8;
    slen = 0;
    if (NULL == (s = (char *) malloc (mlen))) return (size_t) -1;

    for (;;) {
        mlen += mlen;
        if (NULL == (t = realloc (s, mlen))) break;  /* unsafe for passwords */
        len = (size_t) getInputFrag ((s = t) + slen, dlen = mlen - slen - 1,
                                     (int) '\n', xfgetc, fp);
        slen += len;
        if (len < dlen || t[slen-1] == '\n' || t[slen-1] == '\0') {
            s[slen] = '\0';
            *p = s;
            return slen;
        }
    }

    free (s);
    return (size_t) -1;
}

#define getstralloc(p) fgetstralloc ((p), stdin)
The only trick is in determining whether an earlier-encountered '\0' actually came from the input stream or was appended by these functions. However, that case can arise if and only if buff[len-1] == '\0', so there is no difficulty in determining this. The lengths are such that if one uses these functions repeatedly to consume an entire file, the sum of the lengths should correspond exactly to the length of the file (assuming it was opened in binary mode).
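A hedged sketch of that bookkeeping, walking an entire binary-mode file with fgetstr() (the 256-byte fragment size and the function name countBytes are arbitrary):

#include <stdio.h>

/* Consume a whole file with fgetstr(); the sum of the returned lengths
   should equal the file length.  buf[len-1] == '\0' is the test for a '\0'
   that was actually read from the stream, as opposed to the '\0' that
   fgetstr() always appends at buf[len]. */
long countBytes (FILE * fp) {
    char buf[256];
    long len, total = 0, nulsSeen = 0;

    while ((len = fgetstr (buf, (int) sizeof (buf), fp)) > 0) {
        if (buf[len-1] == '\0') nulsSeen++;   /* '\0' came from the stream */
        total += len;
    }
    return total;   /* nulsSeen counts embedded '\0' characters, if wanted */
}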
OK, now that the core functionality concerns have been addressed, we have to consider the cost. With all the features being supported by getInputFrag(), isn't there going to be a major speed penalty? The answer is that yes, there is a performance penalty, but it is isolated to the fact that we are consuming the raw input character by character via fgetc(). The rest of the overhead has trivial cost by comparison.
The performance impact, however, has to be put into proper context:
User input is very rarely the performance bottleneck of any application.
Many compiler libraries implement fgets() as a loop over fgetc(), so one would expect similar performance between fgets() and getInputFrag().
To improve the performance would require using a block read via fread() in the inner loop; however, it's not clear that there is any sensible way of doing this without introducing the potentially undesirable effect of reading past the end of the string. At the least, this would require the implementation of some sort of "unread" function, which requires an abstraction beyond the ANSI C FILE semantics. (The Better String Library does exactly this with bsStreams.)
As another note about performance, notice the size-doubling reallocation policy used in fgetstralloc(). This sort of memory allocation policy has minimal impact on performance and, perhaps more importantly, on fragmentation of the heap itself. A naive linear reallocation scheme will call realloc() far more often for long inputs (on the order of n/step calls rather than log n), which can easily leave the heap in a shredded mess.
Download the sources for the above.
Exercise left to the reader: If one grows the backing buffer for a dynamically sized string input in constant steps (i.e., linearly), show that the performance is actually O(n²) for inputs of length n. (Hint: what is the performance of realloc()?)
Exercise left to the reader: Make an input function based on getInputFrag() which dynamically allocates a buffer as necessary, but whose size does not exceed an upper bound given as a parameter. Keep in mind that the programmer needs to distinguish between the cases of being cut off by the upper bound versus the input terminating exactly on the upper bound.
Exercise left to the reader: Make an input function based on getInputFrag() which is suitable for reading passwords. The goal is to read the input into a sequence of buffers that are all cleaned up afterwards and that the caller does not have direct access to; i.e., the input must not be leaked indirectly to the heap, via realloc() for example.
String Libraries
The gets() function and its ilk from the ANSI C standard are hardly the only problem with string handling in the language. Most of the ANSI C string functions do not have any built-in buffer overrun detection or prevention mechanisms. Those that do (like fgets()) are restricted to passing the length of a buffer separately from the pointer to the buffer. Scarcely a function exists in the ANSI C library that does not have some configuration of parameters (which by themselves are legal instances of their respective types) that necessarily leads to undefined behavior. These scenarios are not well documented anywhere except in the ANSI C specification itself and are often not intuitive or expected. The ANSI C string library functions are no exception. This has led many programmers to write their own string libraries to try to insulate themselves from the C standard library minefield.
So, given that using a substitute for ANSI C strings is well motivated, one should weigh the pros and cons of writing one's own string library versus using an existing library versus trying to slog through with the ANSI C functionality. Personally, I am very biased on this issue – I wholeheartedly recommend using the Better String Library, which is an open source string library of my own creation. However, there are a number of other alternatives such as Vstr, Stralloc, c2lib, firestring, and the various C++ solutions, including the STL. Since most of these solutions have very liberal licensing agreements, one should consider looking into them before trying to write a string library oneself.
If one decides to use one of these libraries or make a new one, the input function getInputFrag() may have to be revisited. For example, the Better String Library changes the nature of strings so that they can include '\0' as a regular content character. So it would make sense to change the termination condition to account for this.
Input tainting
One of the security problems that occurs with string processing is when input strings are parsed and then concatenated into shell command strings that are passed to the system() function. The problem is that if the parsing is sufficiently naive, and the end user is sufficiently clever and aware, an input string can cause commands to be executed that the original programmer did not expect. The Perl programming language came up with a solution to this by adding a “tainted” attribute to strings. The idea is that if an operation mixes tainted and untainted strings, the result is always a tainted string; otherwise, use of homogeneous strings results in strings of the same taint state. In this way, the program can determine at run time whether or not a string has been mixed with unqualified strings (such as those that might be input by an end user) before using it for something like a shell command.
The problem with the Perl solution is that it's a purely runtime, dynamic solution. Because Perl is a dynamically typed language, this is really the only way it can work in that environment. But the real root cause of such “string poisons” is inevitably that the program itself has been written to erroneously mix tainted and untainted strings. I.e., one would like to know about this flaw prior to executing the program.
In compiled languages like C and C++, this leads us to one inevitable idea: untainted strings should in fact be strings of their own type, so that we can leverage the power of strong typing. One can imagine implementing an “untaintedStr” module with the following definition:
struct untaintedStr {
    char ofs[1];    /* base type for string type */
};
The accompanying include file would contain something like:
/* Direct use of system() should fail to build; since a macro cannot expand
   to a #error directive, expand it to a deliberately undeclared identifier
   whose name spells out the fix: use usSystem() instead. */
#define system(s) ERROR_use_usSystem_instead_of_calling_system_directly (s)

extern int usSystem (const struct untaintedStr * s);
extern struct untaintedStr * usPromote (const char * s, int memSz);
extern char * usDemote (const struct untaintedStr * s, int memSz);
extern int usFree (struct untaintedStr * s);
extern int usConcat (struct untaintedStr * s0, const struct untaintedStr * s1);
extern struct untaintedStr * usCopy (const struct untaintedStr * s1);
extern struct untaintedStr * usSubstr (const struct untaintedStr * s1, int pos, int len);
/* ... etc */
The point would be that the actual definition of struct untaintedStr would be hidden and only accessible inside the untaintedStr module, with just a gang of very simplistic wrapper functions. The strong typing of the compiler would then enforce a well-structured separation between ordinary char * strings and struct untaintedStr * strings. So flawed mixing of string types would be more obvious during development. Those misuses that slipped through would be isolated to invalid uses of usPromote(), or to neglecting to include the untainted include file.
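A hedged sketch of how this pays off at a call site (the header name, runCommand, and the memSz argument value are placeholders, and error checking is omitted; validating or escaping userArg before promotion is still the programmer's job):

#include "untaintedstr.h"   /* hypothetical header containing the above */

/* cmd is built only from untainted pieces; userArg arrives straight from
   something like getstralloc() and stays an ordinary char *. */
int runCommand (const struct untaintedStr * cmd, const char * userArg) {
    struct untaintedStr * arg, * full;
    int ret;

    /* usSystem (userArg);   <-- rejected by the compiler: wrong type       */
    /* system ("ls");        <-- any direct use is trapped by the #define   */

    /* The only way in for raw user data is an explicit, easily audited
       promotion, done only after userArg has been validated/escaped. */
    arg  = usPromote (userArg, 0);
    full = usCopy (cmd);
    usConcat (full, arg);

    ret = usSystem (full);

    usFree (arg);
    usFree (full);
    return ret;
}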
Of course, given the fairly serious difficulty of using the ANSI C standard library safely, one might in fact write such a wrapper as above on top of an alternate string library.