What Else Is There?
What do REs leave out? They can't express general recursion. REs are a poor match, for example, for parsing general C source code. Remember how exciting it was when you first realized that you could automatically scan the programs you write for certain classes of errors, or for other metrics that management needed? It's reasonable to think at first that a clever RE might quickly pull out all your unused variables or the headers for all function definitions. In special cases (mostly when programs have been coded against very strict and simple style guides) this is possible.
In general, though, it's not possible. Single REs cannot parse common languages such as C, SQL, SNMP, XML, or even HTML. There's an RE to parse what looks like the simple case of e-mail addresses, but it's hundreds of characters long.
Don't let that discourage you excessively. If you need to parse a grammar, and don't quickly see how to make it fit an RE, keep these tips in mind:
REs often make good partners with procedural code. There are many cases (the e-mail address example is a good one) where an RE is difficult or impossible, but one or two simple REs partner well with a loop or recursion coded in C. Parsing is like other programming: The best coders know to pick tools that naturally express the problem at hand, or a part of it. REs are great for non-recursive patterns. Identify the non-recursive, or regular, parts of your parsing challenges, and exploit REs there.
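As a rough illustration of that partnership, here is a minimal sketch in Python rather than C, chosen only for brevity. One simple RE picks out the regular part (individual tokens), and an ordinary loop with a counter handles the nesting that the RE cannot express. The names TOKEN and max_nesting_depth are made up for this example.

```python
import re

# Regular part: a token is either a single parenthesis or a run of
# characters containing no parentheses. No recursion is needed here.
TOKEN = re.compile(r"[()]|[^()]+")

def max_nesting_depth(text):
    """Return the deepest parenthesis nesting in text, or None if unbalanced."""
    depth = 0
    deepest = 0
    # Non-regular part: procedural code tracks how deeply we are nested.
    for match in TOKEN.finditer(text):
        token = match.group()
        if token == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif token == ")":
            depth -= 1
            if depth < 0:
                return None  # a ")" appeared with no matching "("
    return deepest if depth == 0 else None

print(max_nesting_depth("f(a, g(b, (c)))"))
print(max_nesting_depth("(a))"))
```

The same division of labor carries over to C: a small RE or hand-written scanner for the tokens, and a loop or recursive function for the structure.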
For many problems, good purpose-built parsers are already available. I often receive requests for help with a specific RE syntax involved in what turns out to be an XML or SQL text. Some of these problems are narrow enough to be solvable with an RE. Almost without exception, though, it's simpler to invoke one of the publicly available parsers for XML or SQL. Don't reinvent those particular wheels.
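To make the point concrete, here is a small sketch that uses the XML parser in Python's standard library (xml.etree.ElementTree) instead of an RE. The sample document, the element names, and the item_names helper are invented for illustration. Note that the escaped text inside the note element looks like a tag to a naive RE, while the parser treats it correctly as plain text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the <note> contains escaped text that only
# looks like an <item> element.
SAMPLE = """<orders>
  <order id="17"><item>widget</item><note>see &lt;item&gt; docs</note></order>
  <order id="18"><item>gadget</item></order>
</orders>"""

def item_names(xml_text):
    """Return the text of every real <item> element, in document order."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.iter("item")]

print(item_names(SAMPLE))
```

An RE-based version would need to account for escaping, attributes, comments, and nesting; the parser already handles all of them.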
Analyze the problem well. Many parsing problems look "regular" because they're the output of a computing process: you're trying to extract data, for example, from the HTML some other computer has generated. This kind of "scraping" can be important; a significant portion of my own work fits this description. As SGML (think of SGML as a more complicated form of XML) specialist Joe English observes, though, you should always ask where your data originated, "And if the answer is 'the output of ...', the first thing you should ask is to get your hands on 'the input of ...'." In managerial terms, that's simply good analysis.