- Parsing LaTeX
- Building the Tokenizer
- Iterating Over an NSString
- Building the Parser
Building the Tokenizer
Traditional parsers use a tokenizer, which converts a stream of characters into a stream of tokens. The parser then constructs some kind of internal data structure out of this stream.
There are several classes in Cocoa that you can use for building simple tokenizers. The simplest is NSScanner. This class works very well for structured data. You initialize it from a string, and then read values. For example, if you had a file listing places of interest, with the latitude and longitude coordinates followed by the name of the place, you could parse it like this:
NSScanner *s = [NSScanner scannerWithString: fileContents];
NSCharacterSet *nl = [NSCharacterSet newlineCharacterSet];
while (![s isAtEnd])
{
    // 'long' is a reserved word in C, so the longitude is named lon
    double lat, lon;
    NSString *name;
    if ([s scanDouble: &lat] &&
        [s scanDouble: &lon] &&
        [s scanUpToCharactersFromSet: nl intoString: &name])
    {
        // Handle entry
    }
    else
    {
        // Report error, then stop; otherwise the loop would retry
        // the same bad input forever
        break;
    }
}
NSScanner doesn't work quite as well for data that doesn't have a well-defined structure, so it's not the right approach here. Another alternative is NSRegularExpression. At the time of writing, this class is supported only in GNUstep and recent versions of iOS, not in OS X. It provides regular-expression matching on input strings and is quite good for certain categories of tokenizer.
The requirements for my tokenizer were a bit different. Most of the input is plain text. This needs to be passed straight through to the parser, although a few TeXisms need removing; for example, open double quotes are indicated as `` in TeX. The parser, in fact, only cares about commands (which start with a backslash) and their arguments. For example, in the command \class{NSString}, the parser wants to be told that the command 'class' has been read, and that the argument is the string NSString. In fact, because commands might be nested, it gets separate messages for the beginning and end of each argument, with a text message in between.
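The quote clean-up mentioned above can be sketched in a few lines of plain C (which is also valid Objective-C). This is only an illustration, not the code used in the article: the function name tex_quotes_to_utf8 is hypothetical, the matching '' for closing quotes is the usual TeX convention rather than something stated above, and replacing the pairs with UTF-8 curly quotes is an assumption about what "removing" the TeXism should produce.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst, replacing the TeX pairs
 * `` and '' with UTF-8 curly double quotes. Output is truncated (but
 * always NUL-terminated) if dst is too small. Returns the number of
 * bytes written, not counting the terminator. */
size_t tex_quotes_to_utf8(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);
    size_t out = 0;
    while (*src != '\0') {
        const char *piece = src;   /* default: copy one byte unchanged */
        size_t piece_len = 1;
        size_t consumed = 1;
        /* src[1] is safe to read: src[0] is not the terminator. */
        if (src[0] == '`' && src[1] == '`') {
            piece = "\xE2\x80\x9C";          /* U+201C, opening quote */
            piece_len = 3;
            consumed = 2;
        } else if (src[0] == '\'' && src[1] == '\'') {
            piece = "\xE2\x80\x9D";          /* U+201D, closing quote */
            piece_len = 3;
            consumed = 2;
        }
        if (out + piece_len >= dst_size) {
            break;                 /* keep room for the terminator */
        }
        memcpy(dst + out, piece, piece_len);
        out += piece_len;
        src += consumed;
    }
    dst[out] = '\0';
    return out;
}
```

A single apostrophe, as in "it's", is left alone because only doubled characters are treated as quote marks.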
Because the scanner requirements are quite simple, I chose to write a very simple ad-hoc parser, which iterates over each character in an NSString in turn. It only ever needs a single character of read-ahead and very few characters are special, so it can be a simple loop containing a switch statement.
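As a rough sketch of that shape, here is a loop-and-switch tokenizer in plain C (which is also valid Objective-C). It walks a NUL-terminated C string rather than an NSString, and every name in it (EventKind, EventHandler, tokenize, trace_event, TRACE_SIZE) is hypothetical. The real code sends Objective-C messages to the parser; a callback stands in for those here. It also assumes command names consist only of ASCII letters, which is a simplification.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical event kinds, one per message the parser wants to receive. */
typedef enum { EV_COMMAND, EV_BEGIN_ARG, EV_END_ARG, EV_TEXT } EventKind;

/* Hypothetical callback: s/len carry the bytes for commands and text. */
typedef void (*EventHandler)(void *ctx, EventKind kind,
                             const char *s, size_t len);

/* Reports any plain text accumulated between start and end. */
static void flush_text(const char *start, const char *end,
                       EventHandler handler, void *ctx)
{
    if (end > start) {
        handler(ctx, EV_TEXT, start, (size_t)(end - start));
    }
}

void tokenize(const char *input, EventHandler handler, void *ctx)
{
    assert(input != NULL && handler != NULL);
    const char *text = input;  /* start of the pending plain-text run */
    const char *p = input;
    while (*p != '\0') {
        switch (*p) {
        case '\\': {
            /* A command: report pending text, then read the name. */
            flush_text(text, p, handler, ctx);
            const char *name = ++p;
            while (isalpha((unsigned char)*p)) {
                p++;
            }
            handler(ctx, EV_COMMAND, name, (size_t)(p - name));
            text = p;
            break;
        }
        case '{':
            flush_text(text, p, handler, ctx);
            handler(ctx, EV_BEGIN_ARG, NULL, 0);
            text = ++p;
            break;
        case '}':
            flush_text(text, p, handler, ctx);
            handler(ctx, EV_END_ARG, NULL, 0);
            text = ++p;
            break;
        default:
            p++;               /* ordinary character: extend the text run */
            break;
        }
    }
    flush_text(text, p, handler, ctx);
}

#define TRACE_SIZE 256

/* Example handler: appends a readable description of each event to a
 * char[TRACE_SIZE] buffer passed as ctx. */
void trace_event(void *ctx, EventKind kind, const char *s, size_t len)
{
    char *out = ctx;
    size_t used = strlen(out);
    size_t room = TRACE_SIZE - used;
    switch (kind) {
    case EV_COMMAND:
        snprintf(out + used, room, "command(%.*s) ", (int)len, s);
        break;
    case EV_TEXT:
        snprintf(out + used, room, "text(%.*s) ", (int)len, s);
        break;
    case EV_BEGIN_ARG:
        snprintf(out + used, room, "begin ");
        break;
    case EV_END_ARG:
        snprintf(out + used, room, "end ");
        break;
    }
}
```

Calling tokenize("See \\class{NSString} now", trace_event, trace) with a zeroed char trace[TRACE_SIZE] reports the text "See ", the command class, the start of an argument, the text NSString, the end of the argument, and finally the text " now". Because the tokenizer only announces where arguments begin and end, nested commands need no special handling; keeping track of the nesting is left to the parser.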