- 7.1 Quadratic Behavior of Repeated List Search
- 7.2 Deleting or Adding Elements to the Middle of a List
- 7.3 Strings Are Iterables of Strings
- 7.4 (Often) Use enum Rather Than CONSTANT
- 7.5 Learn Less Common Dictionary Methods
- 7.6 JSON Does Not Round-Trip Cleanly to Python
- 7.7 Rolling Your Own Data Structures
- 7.8 Wrapping Up
7.2 Deleting or Adding Elements to the Middle of a List
An early discussion in this book, in Chapter 3, A Grab Bag of Python Gotchas, addresses how naive string concatenation within a loop might encounter quadratic complexity. That is to say, the overall time and computation needed to perform a sequence of N operations is O(N²).
Although in many situations the solution to a slowdown in (certain) string operations is to simply “use a list instead” (perhaps followed by a final "".join(thelist) to get back a string), lists have their own very similar danger. The problem here is in not understanding what is “cheap” and what is “expensive” for lists. Specifically, inserting or removing items from a list anywhere other than at the end is expensive.
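As a quick reminder of that earlier fix, a minimal sketch (with purely illustrative names):

# Accumulate pieces in a list, then join once at the end, rather than
# repeatedly concatenating onto a growing string inside the loop.
pieces = []
for n in range(5):
    pieces.append(f"chunk{n}")
text = "".join(pieces)    # 'chunk0chunk1chunk2chunk3chunk4'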
We first explore some details of exactly how lists are implemented in Python, then look at which other data structures would be good choices for which actual use cases.
Python gives you the ability to insert or remove items from anywhere within a list, and for some purposes it will seem like the obvious approach. Indeed, for a few operations on a relatively small list, the minor inefficiency is wholly unimportant.
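The examples that follow use a get_word() helper defined earlier in the book. If you want to run them without that helper, a hypothetical stand-in such as the following is good enough; it only needs to produce a random lowercase "word" on each call:

import random
import string

def get_word():
    # Hypothetical stand-in for the book's get_word() helper: the real one
    # presumably draws from an actual word list, but any random short string
    # exercises the list and dict examples equally well.
    length = random.randint(4, 12)
    return "".join(random.choices(string.ascii_lowercase, k=length))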
Inserting and removing words from middle of list
>>> words = [get_word() for _ in range(10)]
>>> words
['hennier', 'oughtness', 'testcrossed', 'railbus', 'ciclatoun',
 'consimilitudes', 'trifacial', 'mauri', 'snowploughing', 'ebonics']
>>> del words[3]  # ❶
>>> del words[7]
>>> del words[3]  # ❶
>>> words
['hennier', 'oughtness', 'testcrossed', 'consimilitudes', 'trifacial',
 'mauri', 'ebonics']
>>> words.insert(3, get_word())
>>> words.insert(1, get_word())
>>> words  # ❷
['hennier', 'awless', 'oughtness', 'testcrossed', 'wringings',
 'consimilitudes', 'trifacial', 'mauri', 'ebonics']
❶ The word deleted at the initial index 3 was railbus, but by the second deletion at index 3, ciclatoun had shifted into that position.
❷ The word wringings was inserted at index 3, but got moved to index 4 when awless was inserted at index 1.
For the handful of items inserted and removed from the small list in the example, the relative inefficiency is not important. However, even in the small example, keeping track of where each item winds up by index becomes confusing.
As the number of operations gets large, this approach becomes notably painful. The following toy function performs fairly meaningless insertions and deletions, always returning five words at the end. But the general pattern it uses is one you might be tempted towards in real-world code.
Asymptotic timing for insert-and-delete from list middle
>>> from random import randrange
>>> def insert_then_del(n):
...     words = [get_word() for _ in range(5)]
...     for _ in range(n):
...         words.insert(randrange(0, len(words)), get_word())
...     for _ in range(n):
...         del words[randrange(0, len(words))]
...     return words
...
>>> insert_then_del(100)
['healingly', 'cognitions', 'borsic', 'rathole', 'division']
>>> insert_then_del(10_000)
['ferny', 'pleurapophyses', 'protoavis', 'unhived', 'misinform']
>>> %timeit insert_then_del(100)
109 μs ± 2.42 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit insert_then_del(10_000)
20.3 ms ± 847 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit insert_then_del(1_000_000)
1min 52s ± 1.51 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Going from 200 operations (counting each insertion and deletion) to 20,000 operations takes on the order of 200x as long. At these sizes the lists themselves are small enough to matter little; the time is dominated by the calls to get_word(), and perhaps a bit by randrange(), although the list operations already account for roughly a 2x slowdown beyond linear scaling.
However, upon increasing the number of operations by another 100x, to 2 million, linear scaling would see an increase from 20 ms to about 2 seconds. Instead it jumps to nearly 2 minutes, or about a 55x slowdown from linear scaling. I watched my memory usage during the 15 minutes that %timeit took to run the timing seven times, and it remained steady.
It’s not that these operations use very much memory; rather, every time we insert one word near the middle of a million-word list, the interpreter must shift roughly 500,000 pointers up one position in the underlying array. Likewise, each deletion near the middle requires shifting the top 500,000 pointers back down. This gets much worse very quickly as the number of operations increases further.
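To see this asymmetry directly, here is a minimal sketch using the standard-library timeit module (rather than the IPython %timeit magic); the exact numbers will vary by machine, but the middle-of-list version should be dramatically slower:

from timeit import timeit

big = list(range(1_000_000))
mid = len(big) // 2

# Append/pop at the end: no existing pointers need to shift.
end_ops = timeit(lambda: (big.append(0), big.pop()), number=10_000)

# Insert/delete at the middle: ~500,000 pointers shift on every call.
mid_ops = timeit(lambda: (big.insert(mid, 0), big.pop(mid)), number=10_000)

print(f"end of list:    {end_ops:.4f}s for 10,000 insert/delete pairs")
print(f"middle of list: {mid_ops:.4f}s for 10,000 insert/delete pairs")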
7.2.1 More Efficient Data Structures
There is no one solution to the problem described here. On the other hand, there is exceedingly rarely an actual use case for the exact behavior implemented by code such as the preceding example. Trust me, code like that is not purely contrived for this book—I have encountered a great deal of code much like it in production systems (with the problem buried beneath a lot of other functionality).
If you merely need to insert and delete at either the end or the beginning of a concrete sequence, collections.deque gives you exactly what you need. It does not offer cheap insertion and deletion at arbitrary middle positions, but very often all you actually want is .appendleft() and .popleft() to accompany .append() and .pop().
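For instance, a minimal sketch of double-ended use (the task names are purely illustrative):

from collections import deque

tasks = deque(["parse", "validate", "store"])

tasks.append("report")        # add at the right end, O(1)
tasks.appendleft("urgent")    # add at the left end, also O(1)

print(tasks.popleft())        # -> 'urgent'  (consume FIFO-style from the left)
print(tasks.pop())            # -> 'report'  (consume LIFO-style from the right)
print(tasks)                  # deque(['parse', 'validate', 'store'])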
In some cases, sortedcontainers or pyrsistent may come closer to the performance characteristics you need, while still offering a sequence datatype. Generally, using these third-party containers will still only get you to O(N log N) rather than O(N), but that remains strikingly better than O(N²).
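As one hedged illustration, the third-party SortedList (pip install sortedcontainers) keeps elements ordered by value while allowing positional indexing and deletion in roughly logarithmic amortized time; note that this solves an ordered-by-value problem rather than an insert-at-arbitrary-position problem:

from sortedcontainers import SortedList  # third-party package

scores = SortedList([17, 3, 42, 8, 23])   # stored sorted: [3, 8, 17, 23, 42]

scores.add(12)      # the new value finds its own sorted position
del scores[0]       # positional deletion without shifting half the list
print(scores)       # SortedList([8, 12, 17, 23, 42])
print(scores[2])    # positional indexing still works -> 17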
Later in this chapter, in the section “Rolling Your Own Data Structures,” I show an example where creating a custom data structure actually can make sense. My pure-Python implementation of CountingTree is able to do exactly the “insert into the middle” action that is described in this section, and remains relatively efficient. For this narrow and specific use case, my custom data structure is actually pretty good.
However, instead of reaching for the above-mentioned collections—as excellent as each of them genuinely is—consider that this is probably a situation in which you (or the developer before you) misunderstood what the underlying problem actually requires.
For example, a somewhat plausible reason you might actually want to keep an order for items is that they represent some sort of priority for actions to be performed or data to be processed. A wonderful data structure in which to maintain such priorities is simply a Python dict. A plausible way of using this fast data structure is to keep your “words” (per the earlier example) as keys, and their priorities as values.
A priority is not exactly the same thing as an index position, but it is something that very quickly allows you to maintain a sequence for the data you wish to handle, while keeping insertion or deletion operations always at O(1). This means, of course, that performing N such operations is O(N), which is the best we might plausibly hope for. Constructing a sequence at the end of such operations is both cheap and easy, as the following example shows.
A collection of items with a million possible priorities
>>> from pprint import pprint
>>> from functools import partial
>>> priority = partial(randrange, 1, 1_000_000)
>>> words = {get_word(): priority() for _ in range(100_000)}
>>> words_by_priority = sorted(words.items(), key=lambda p: p[1])
>>> pprint(words_by_priority[:10])
[('badland', 8),
 ('weakliest', 21),
 ('sowarry', 28),
 ('actinobiology', 45),
 ('oneself', 62),
 ('subpanel', 68),
 ('alarmedly', 74),
 ('marbled', 98),
 ('dials', 120),
 ('dearing', 121)]
>>> pprint(words_by_priority[-5:])
[('overslow', 999976),
 ('ironings', 999980),
 ('tussocked', 999983),
 ('beaters', 999984),
 ('tameins', 999992)]
It’s possible—even likely—that the same priority will occasionally be assigned to more than one word. It’s also very uncommon to genuinely care which order two individual items out of 100,000 fall in. Even with duplicated priorities, though, no items are dropped; they are merely ordered arbitrarily relative to one another (and you could easily impose an order if you have a reason to, as the sketch below shows).
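If a deterministic order among equal priorities matters, a secondary sort key costs almost nothing. For instance, continuing the session above, ties could be broken alphabetically:

>>> # Sort by priority first, then alphabetically by word to break ties
>>> words_by_priority = sorted(words.items(), key=lambda item: (item[1], item[0]))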
Deleting items from the words dict is only slightly more involved than del words[n] was when words was a list. To be safe, you’d want to do something like:
>>> for word in ['producibility', 'scrambs', 'marbled']:
...     if word in words:
...         print("Removing:", word, words[word])
...         del words[word]
...     else:
...         print("Not present:", word)
...
Not present: producibility
Removing: scrambs 599046
Removing: marbled 98
The extra print() calls and the else clause are just for illustration; presumably if this approach is relevant to your requirements, you would omit them:
>>> for word in ['producibility', 'scrambs', 'marbled']:
...     if word in words:
...         del words[word]
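An equivalent spelling uses dict.pop() with a default value, so that absent keys are silently ignored rather than raising KeyError:

>>> for word in ['producibility', 'scrambs', 'marbled']:
...     _ = words.pop(word, None)  # default of None means no KeyError for absent keys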
This approach remains fast and scalable, and is quite likely much closer to the actual requirements of your software than was misuse of a list.