1.5 Basic Data Structures
Now that we got into Hamlet, let's analyze the text a bit further. For example, for each Dramatis Persona, we'd like to collect some information, such as how many words they say in total, and how rich their vocabulary is. To do that, we need to be able to associate several data items with one persona.4 To group such information in one place, we can define a data structure as follows:
struct PersonaData { uint totalWordsSpoken; uint[string] wordCount; }
In D you get structs and then you get classes. They share many amenities but have different charters: structs are value types, whereas classes are meant for dynamic polymorphism and are accessed solely by reference. That way confusions, slicing-related bugs, and comments à la // No! Do NOT inherit! do not exist. When you design a type, you decide upfront whether it'll be a monomorphic value or a polymorphic reference. C++ famously allows defining ambiguous-gender types, but their use is rare, error-prone, and objectionable enough to warrant simply avoiding them by design.
In our case, we just need to collect some data and we have no polymorphic ambitions, so using struct is a good choice. Let's now define an associative array mapping persona names to PersonaData values:
PersonaData[string] info;
and all we have to do is fill info appropriately from hamlet.txt
. This needs some work because a character's paragraph may extend on several lines, so we need to do some simple processing to coalesce physical lines into paragraphs. To figure out how to do that, let's take a look at a short fragment from hamlet.txt
, dumped verbatim below (with leading spaces made visible for clarity):
␣␣Pol. Marry, I will teach you! Think yourself a baby ␣␣␣␣That you have ta'en these tenders for true pay, ␣␣␣␣Which are not sterling. Tender yourself more dearly, ␣␣␣␣Or (not to crack the wind of the poor phrase, ␣␣␣␣Running it thus) you'll tender me a fool. ␣␣Oph. My lord, he hath importun'd me with love ␣␣␣␣In honourable fashion. ␣␣Pol. Ay, fashion you may call it. Go to, go to!
Whether or not Polonius' enthusiasm about goto was a factor in his demise is, even to this day, a matter of speculation. Regardless of that, let's note how each character's line is preceded by exactly two spaces, followed by the (possibly contracted) character's name, followed by a period and a space, finally followed by the actual content of the line. If a logical line extends to multiple physical lines, the continuations are always preceded by exactly four spaces. We could do such simple pattern matching by using a regular expression engine (found in the std.regex module), but we want to learn arrays so let's match things "by hand." We only enlist the help of the Boolean function a.startsWith(b), defined by std.algorithm, which tells whether a starts with b.
The main driver reads input lines, concatenates them in logical paragraphs (ignoring everything that doesn't fit our pattern), passes complete paragraphs to an accumulator function, and at the end prints the desired information.
import std.algorithm, std.ctype, std.regex, std.range, std.stdio, std.string; struct PersonaData { uint totalWordsSpoken; uint[string] wordCount; } void main() { // Accumulates information about dramatic personae PersonaData[string] info; // Fill info string currentParagraph; foreach (line; stdin.byLine()) { if (line.startsWith(" ") && line.length > 4 && isalpha(line[4])) { // Persona is continuing a line currentParagraph ~= line[3 .. $]; } else if (line.startsWith(" ") && line.length > 2 && isalpha(line[2])) { // Persona just started speaking addParagraph(currentParagraph, info); currentParagraph = line[2 .. $].idup; } } // Done, now print collected information printResults(info); }
After we've equipped ourselves with information on how arrays work, the code should be self-explanatory, save for the presence of .idup. Why is it needed, and what if we forgot about it?
The foreach loop that reads from stdin deposits successive lines of text in the variable line. Because it would be wasteful to allocate a brand new string for each line read, the contents of line is reused every pass through the loop. As such, if you want to squirrel away the contents of a line, you better make a copy of it. Obviously currentParagraph is meant to indeed save text, so duplication is needed, hence the presence of .idup. Now, if we forgot .idup and subsequently the code would still compile and run, the results would be nonsensical and the bug rather hard to find. Having a part of a program modify data held in a different part of the program is very unpleasant to track down because it's a non-local effect (just how many .idups could one forget in a large program?). Fortunately, that's not the case because the types of line and currentParagraph reflect their respective capabilities: line has type char[], i.e., an array of characters that could be overwritten at any time; whereas currentParagraph has type string, which is an array of characters that cannot be individually modified. The two cannot refer to the same memory content because line would break the promise of currentParagraph. So the compiler refuses to compile the erroneous code and demands a copy, which you provide in the form of .idup, and everybody's happy. (The "i" in "idup" stands for "immutable.") On the other hand, when you copy string values around, there's no more need to duplicate the underlying data—they can all point to the same memory because it's known neither will overwrite it, which makes string copying at the same time safe and efficient. Better yet, strings can be shared across threads without problems because, again, there's never contention. Immutability is really cool indeed. If, on the other hand, you need to modify individual characters intensively, you may want to operate on char[], at least temporarily.
PersonaData as defined above is very simple, but in general structs can define not only data, but also other entities such as private sections, member functions, unittests, operators, constructors, and destructor. By default, each data member of a structure is initialized with its default initializer (zero for integral numbers, NaN for floating point numbers,5 and null for arrays and other indirect-access types). Let's now implement addParagraph that slices and dices a line of text and puts it into the associative array.
The line as served by main has the form "Ham. To be, or not to be-that is the question:" We need to find the first ". " to distinguish the persona's name from the actual line. To do so, we use the find function. a.find(b) returns the right-hand portion of a starting with the first occur-rence of b. (If no occurrence is found, find returns an empty string.) While we're at it, we should also do the right thing when collecting the vocabulary. First, we must convert the sentence to lowercase such that capitalized and non-capitalized words count as the same vocabulary element. That's easily taken care of with a call to tolower. Second, we must eliminate a strong source of noise: punctuation that makes for example "him." and "him" count as distinct words. To clean up the vocabulary, all we need to do is pass an additional parameter to split mentioning a regular expression that eliminates all chaff: regex("[ \t,.;:?]+"). With that argument, the split function will consider any sequence of the characters mentioned in between '[' and ']' as part of word separators. That being said, we're ready to do a lot of good stuff in just little code:
void addParagraph(string line, ref PersonaData[string] info) { // Figure out persona and sentence line = strip(line); auto sentence = line.find(". "); if (sentence.empty) { return; } auto persona = line[0 .. $ - sentence.length]; sentence = tolower(strip(sentence[2 .. $]));
// Get the words spoken auto words = split(sentence, regex("[ \t,.;:?]+")); // Insert or update information auto data = persona in info; if (data) { // heard this persona before data.totalWordsSpoken += words.length; foreach (word; words) ++data.wordCount[word]; } else { // first time this persona speaketh PersonaData newData; newData.totalWordsSpoken = words.length; foreach (word; words) newData.wordCount[word] = 1; info[persona] = newData; } }
The expression persona in info not only tells whether a given string is present as a key in an associative array, but also provides a handy pointer to the corresponding value in the array. In D there is no need (as it is in C and C++) to use '->' to access data referenced by a pointer—the regular field access operator '.' works unambiguously. If the key was not found, our code creates and inserts a brand new PersonaData, which concludes the addParagraph function.
Finally, let's implement printResults to print a quick summary for each persona:
void printResults(PersonaData[string] info) { foreach (persona, data; info) { writefln("%20s %6u %6u", persona, data.totalWordsSpoken, data.wordCount.length); } }
Ready for a test drive? Save and run!
Queen 1104 500 Ros 738 338 For 55 45 Fort 74 61 Gentlemen 4 3 Other 105 75 Guil 349 176 Mar 423 231 Capt 92 66 Lord 70 49 Both 44 24 Oph 998 401 Ghost 683 350 All 20 17 Player 16 14 Laer 1507 606 Pol 2626 870 Priest 92 66 Hor 2129 763 King 4153 1251 Cor., Volt 11 11 Both [Mar 8 8 Osr 379 179 Mess 110 79 Sailor 42 36 Servant 11 10 Ambassador 41 34 Fran 64 47 Clown 665 298 Gent 101 77 Ham 11901 2822 Ber 220 135 Volt 150 112 Rey 80 37
Now that's some fun stuff. Unsurprisingly, our friend "Ham" gets the lion's share by a large margin. Voltemand's (Volt) role is rather interesting: he doesn't have many words to say, but in these few words he does his best in displaying a solid vocabulary, not to mention the Sailor, who almost doesn't repeat himself. Also compare the well-rounded Queen with Ophelia: the Queen has about 10% more words to say than Ophelia, but her vocabulary is about 25% larger.
As you can see, the output has some noise in it (such as "Both [Mar"
), easy to fix by a diligent programmer and hardly affecting the important statistics. Nevertheless, fixing the last little glitches would be an instructive (and recommended) exercise.