Iterating Over an NSString
Getting individual characters from an NSString is something that I quite often see people doing in a slow or inefficient way. There are two common (wrong) approaches.
The first approach is to call -characterAtIndex: in a loop. This works, but it involves a message send for every single character. Often this message send can cost as much as the rest of the work in the loop body.
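For contrast with what follows, here is a minimal sketch of that first pattern. The function name CountBackslashesSlowly is hypothetical and not from the original code; it only illustrates where the per-character message send happens.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, for illustration only: counts backslashes with one
// -characterAtIndex: message send per character.
static NSUInteger CountBackslashesSlowly(NSString *aString)
{
    NSUInteger count = 0;
    NSUInteger length = [aString length];
    for (NSUInteger i = 0; i < length; i++)
    {
        // One message send for every character in the string.
        if ([aString characterAtIndex: i] == '\\')
        {
            count++;
        }
    }
    return count;
}
```

The result is correct; the concern is only that the loop body does so little that the message send is likely to dominate.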
The other common approach is to call -UTF8String and then iterate over the characters as a C string. This is badly wrong. If you have any characters in the string that can't be represented in 7-bit ASCII, then they will be multi-byte characters, so you need lots of special casing for them. Additionally, the NSString probably doesn't use UTF-8 internally, so the initial call involves iterating over every character in the string, encoding it as UTF-8, storing it in a temporary (autoreleased) buffer, and returning that. The cost of the -UTF8String call may well be more than the entire cost of the rest of the code.
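A small sketch of the mismatch, assuming a test string containing one non-ASCII character. The helper names are hypothetical. The string is built from explicit UTF-16 code units so the example does not depend on the source file's encoding.

```objc
#import <Foundation/Foundation.h>
#include <string.h>

// Hypothetical helper: number of bytes in the temporary UTF-8 copy that
// -UTF8String produces.
static size_t UTF8ByteCount(NSString *aString)
{
    return strlen([aString UTF8String]);
}

// Hypothetical helper: "cafe" with an acute accent on the final letter
// (U+00E9), built from explicit UTF-16 code units.
static NSString *CafeString(void)
{
    const unichar units[] = { 'c', 'a', 'f', 0x00E9 };
    return [NSString stringWithCharacters: units length: 4];
}
```

Here -length reports four characters while the C string is five bytes long, because U+00E9 takes two bytes in UTF-8. Any byte-indexed loop over the C string therefore goes out of step with the indexes that NSString uses.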
Conceptually, an NSString is a string of UTF-16 characters. In theory, UTF-16 is a variable-length encoding, because characters outside the Basic Multilingual Plane are stored as two 16-bit code units. In practice you are unlikely to encounter such a sequence when processing text; there certainly weren't any in my input, which simplifies things a lot. The inner loop needs to iterate over these 16-bit characters (identified by the unichar type) and let the parser do something with each one.
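To make the theoretical case concrete, here is a sketch of a character that does need two unichar values. The helper names are hypothetical; U+1D11E (the musical G clef) is used only as a convenient example of a character outside the Basic Multilingual Plane.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if c is half of a UTF-16 surrogate pair.
static BOOL IsSurrogate(unichar c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

// Hypothetical helper: a string holding the single character U+1D11E,
// which UTF-16 encodes as the surrogate pair 0xD834 0xDD1E.
static NSString *GClefString(void)
{
    const unichar units[] = { 0xD834, 0xDD1E };
    return [NSString stringWithCharacters: units length: 2];
}
```

The string returned by GClefString() reports a length of 2 even though it holds one character. A parser that cared about such input could check IsSurrogate() in its inner loop; the code in this article assumes it never sees one.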
The correct way of doing this uses a pattern similar to fast enumeration. The -getCharacters:range: method lets you grab a group of characters from a string in a single call. Typically, you'll use something like this:
    #define BUFFER_SIZE 32

    NSRange range = { 0, BUFFER_SIZE };
    NSUInteger end = [aString length];
    while (range.location < end)
    {
        unichar buffer[BUFFER_SIZE];
        // Shorten the final chunk so that it stops at the end of the string.
        if (range.location + range.length > end)
        {
            range.length = end - range.location;
        }
        // Copy this chunk of UTF-16 characters with a single message send.
        [aString getCharacters: buffer range: range];
        range.location += BUFFER_SIZE;
        for (unsigned i=0 ; i<range.length ; i++)
        {
            unichar c = buffer[i];
            switch (c)
            {
                // Cases for different characters.
                default:
                    break;
            }
        }
    }
This lets you handle the special characters easily, with everything else falling through the default branch of the switch statement. Most of the time, this loop scanned long strings of characters that were just going to be passed through to the main body of the string. Rather than build a new string by collecting the characters, I just generated an NSRange and then called -substringWithRange: on the original string to generate it.
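The range-tracking idea can be sketched as follows. This is not the original tokenizer: the function name SplitOnSpecials, the choice of backslash and braces as the special characters, and the decision to return an array of strings are all assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>

#define BUFFER_SIZE 32

// Hypothetical helper: splits aString into runs of ordinary text and
// one-character tokens for '\\', '{', and '}'.
static NSArray *SplitOnSpecials(NSString *aString)
{
    NSMutableArray *tokens = [NSMutableArray array];
    NSUInteger end = [aString length];
    NSUInteger runStart = 0; // Index where the current plain-text run began.
    NSRange range = { 0, BUFFER_SIZE };
    while (range.location < end)
    {
        unichar buffer[BUFFER_SIZE];
        if (range.location + range.length > end)
        {
            range.length = end - range.location;
        }
        [aString getCharacters: buffer range: range];
        for (NSUInteger i = 0; i < range.length; i++)
        {
            NSUInteger index = range.location + i;
            switch (buffer[i])
            {
                case '\\':
                case '{':
                case '}':
                    // Emit the pending run as a substring; no characters
                    // are collected one at a time.
                    if (index > runStart)
                    {
                        NSRange run = NSMakeRange(runStart, index - runStart);
                        [tokens addObject: [aString substringWithRange: run]];
                    }
                    [tokens addObject:
                        [aString substringWithRange: NSMakeRange(index, 1)]];
                    runStart = index + 1;
                    break;
                default:
                    // Ordinary character: just keep extending the run.
                    break;
            }
        }
        range.location += range.length;
    }
    // Emit whatever plain text follows the last special character.
    if (end > runStart)
    {
        NSRange run = NSMakeRange(runStart, end - runStart);
        [tokens addObject: [aString substringWithRange: run]];
    }
    return tokens;
}
```

Given the input \emph{hi} there, this sketch yields six tokens: the backslash, emph, the opening brace, hi, the closing brace, and the trailing text. Because runStart is an index into the whole string rather than into the buffer, a run can span several buffer refills without any extra work.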
This highlights one of the strengths of NSString. Because it is an abstract class, it's possible to do a lot of optimizations behind the scenes. When you call -substringWithRange:, you get a new NSString object, which you can use just like any other string. Most of the time, creating this string has a constant cost; it is a small object that just holds a reference to the original string and a range.
This is why microbenchmarks of Objective-C are often misleading. The loose coupling that message sending provides comes with some overhead, but it also leaves room for optimizations like this one, and so encourages very efficient code.