2.6 Input Validation
If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilisation.
Gerald Weinberg
Of all the stupid things, and there are many in software and security, perhaps the most stupid is the continuing inability of koders to remember to validate their input. From cross-site scripting to SQL injection attacks to so many other things, not properly validating input leads to a whole host of problems on your hosts. Perhaps my favorite reference on this is from Randall Munroe’s xkcd, in which he writes about a child named, in part, “Drop Tables”; see https://xkcd.com/327/. There are now libraries and guidelines galore on this topic, for every computer language you can imagine; all that is required is the will to use them.
Dear KV,
I work for a company that builds all kinds of different web applications. We do everything from blogs and news sites to mail and financial systems. It depends on what the customer wants.
Right now our biggest problem at work is the number of bugs we have that relate to input validation. These bugs are totally maddening because each time one of them is fixed some other problem pops up in the same code, and the checking code is getting very close to spaghetti. Is there any way out of this tangle without some mythical technology, like natural language understanding?
Input Invalid
Dear II,
You’ve come across one of the biggest programming problems since the day we stupidly let non-engineers, i.e., users, touch our nice toys. Of course, computers aren’t really very useful if they don’t do something for actual people, but it is a pain. Systems would be so much cleaner without people. Alas, user input is a fact of life, and one that we all have to work with every day. User input is also one of the biggest sources of security holes in software, as any reader of the Bugtraq mailing list can tell you.
The first rule of handling user input is “Trust no one!”, in particular your users. Although I’m sure 90% of them are perfectly nice people who go to their religious shrine of choice at the appointed time every week, or whatever it is perfectly nice people do (I don’t actually know any perfectly nice people, but I have heard about them), there is nevertheless the usual minority of thieves, jerks, and just plain idiots who will look at your nice web form as a place to steal money, play tricks, and generally cause havoc. The rest of the people, the perfectly nice ones, whom I’ve never met, won’t actually attack your system; they’ll just use it in a way they think is logical, and if their logic and your logic do not match, kaboom. Kode Vicious hates kabooms; they mean late nights and complaints from my doctor about alcohol and caffeine intake. I can’t help it if he’s stingy with the prescription meds, but let’s not get into that now.
The second rule is “Don’t trust yourself!” This is another way of saying that you should check your results to make sure you’re not missing anything. Just because you sent something to the user does not mean that they didn’t do something a bit odd to it before it came back to you. A quick example is a web form. If you depend on the data you sent in a web form to the user, you had better check the whole form, and not just the parts you expected the user to change with their browser. It’s a simple trick to exploit an error in form submission code by sending a slightly changed form with proper user input.
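As a minimal sketch of checking the whole form, and not just the parts the user was supposed to edit, the Python below compares a submission against what the server originally sent. The field names, the constants, and the `check_form` helper are all hypothetical, made up for illustration.

```python
# Hypothetical example: the field names and this helper are illustrative only.

# Fields the server filled in and that the user was never meant to change.
FIXED_FIELDS = {"account_id": "12345", "currency": "USD"}
# Fields the user is expected to fill in.
USER_FIELDS = {"amount"}


def check_form(submitted: dict[str, str]) -> list[str]:
    """Return a list of problems found in the submitted form (empty if none)."""
    problems = []
    expected = set(FIXED_FIELDS) | USER_FIELDS
    # Reject fields that were added or removed.
    for name in sorted(set(submitted) - expected):
        problems.append(f"unexpected field: {name}")
    for name in sorted(expected - set(submitted)):
        problems.append(f"missing field: {name}")
    # Reject changes to fields the user should not have touched.
    for name, value in FIXED_FIELDS.items():
        if name in submitted and submitted[name] != value:
            problems.append(f"tampered field: {name}")
    return problems
```

A form that comes back with `account_id` changed is reported as tampered even when the `amount` the user typed is perfectly valid. In a real application the fixed values would come from server-side session state rather than a module constant, and the user-supplied fields would still need their own validation.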
It sounds, from your description, as if the system you’re using was written using what is called a black list. A black list is a set of rules that says which things are bad. During the Cold War the United States maintained black lists to prevent people it didn’t like from getting jobs. Your name is on the list, sorry, no job. In the same way, software uses black lists to say which types of operations, in this case user input, are bad. The problem with black lists is that they are hard to maintain. They start off simple enough, saying things like, “Do not accept input with URLs in it.” But they quickly get out of hand, with lists of the names for “JavaScript,” of which there are many, and different types of tags to check for, and, and, and… I hope you get the idea. It is better to use white lists where this is possible.
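To see how a black list gets out of hand, here is a small sketch in Python. The two entries and the `passes_black_list` helper are hypothetical and deliberately naive; they are not taken from any real filter.

```python
# Hypothetical black list; the entries are illustrative, not a real filter.
BLACK_LIST = ["<script", "javascript:"]


def passes_black_list(text: str) -> bool:
    """Return True if none of the banned substrings appear in text."""
    return not any(bad in text for bad in BLACK_LIST)


# The obvious attack is caught...
print(passes_black_list("<script>alert(1)</script>"))       # False
# ...but a change of case gets through, so the list has to grow.
print(passes_black_list("<SCRIPT>alert(1)</SCRIPT>"))       # True
# So does a different tag with an event handler, so it grows again.
print(passes_black_list('<img src=x onerror="alert(1)">'))  # True
```

Each input that slips through means another entry or another special case in the checking code, which is one way that code ends up as spaghetti.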
A white list, unsurprisingly, is the opposite of a black list. White lists only contain the things that are allowed, and are often very short. An example is “accept only ASCII alphabetic characters.” White lists can be overly restrictive, but they have a distinct advantage over black lists in that the only time you have to change a white list is to make it more permissive. A black list is, by default, mostly permissive, with the few exceptions that are the entries in the list.
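The “accept only ASCII alphabetic characters” rule might be sketched in Python as follows. The `is_allowed` name is hypothetical; the check itself uses the standard `re` module.

```python
import re

# White list: one or more ASCII letters and nothing else.
ASCII_ALPHA = re.compile(r"[A-Za-z]+")


def is_allowed(text: str) -> bool:
    """Return True only if text consists entirely of ASCII letters."""
    return ASCII_ALPHA.fullmatch(text) is not None


print(is_allowed("Robert"))                                   # True
print(is_allowed("Robert'); DROP TABLE Students;--"))         # False
print(is_allowed("Zoë"))                                      # False
```

The last case shows the overly restrictive side: a perfectly legitimate name is rejected, and fixing that means widening the pattern, which is the only direction a white list ever has to change. An explicit pattern is used here rather than `str.isalpha()`, because `isalpha()` also accepts non-ASCII letters and so would not implement the rule as stated.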
My recommendation is to switch to using white lists, and to be very restrictive in what can be given to you by the user. Initially this seems a bit draconian, but it’s probably the best way to protect your code, both from users and from turning into spaghetti.
KV