The Python Standard Library: Data Structures
- 2.1 collectionsContainer Data Types
- 2.2 arraySequence of Fixed-Type Data
- 2.3 heapqHeap Sort Algorithm
- 2.4 bisectMaintain Lists in Sorted Order
- 2.5 QueueThread-Safe FIFO Implementation
- 2.6 structBinary Data Structures
- 2.7 weakrefImpermanent References to Objects
- 2.8 copyDuplicate Objects
- 2.9 pprintPretty-Print Data Structures
Python includes several standard programming data structures, such as list, tuple, dict, and set, as part of its built-in types. Many applications do not require other structures, but when they do, the standard library provides powerful and well-tested versions that are ready to use.
The collections module includes implementations of several data structures that extend those found in other modules. For example, Deque is a double-ended queue that allows the addition or removal of items from either end. The defaultdict is a dictionary that responds with a default value if a key is missing, while OrderedDict remembers the sequence in which items are added to it. And namedtuple extends the normal tuple to give each member item an attribute name in addition to a numeric index.
For large amounts of data, an array may make more efficient use of memory than a list. Since the array is limited to a single data type, it can use a more compact memory representation than a general purpose list. At the same time, arrays can be manipulated using many of the same methods as a list, so it may be possible to replace lists with arrays in an application without a lot of other changes.
Sorting items in a sequence is a fundamental aspect of data manipulation. Python's list includes a sort() method, but sometimes it is more efficient to maintain a list in sorted order without resorting it each time its contents are changed. The functions in heapq modify the contents of a list while preserving the sort order of the list with low overhead.
Another option for building sorted lists or arrays is bisect. It uses a binary search to find the insertion point for new items and is an alternative to repeatedly sorting a list that changes frequently.
Although the built-in list can simulate a queue using the insert() and pop() methods, it is not thread-safe. For true ordered communication between threads, use the Queue module. multiprocessing includes a version of a Queue that works between processes, making it easier to convert a multithreaded program to use processes instead.
struct is useful for decoding data from another application, perhaps coming from a binary file or stream of data, into Python's native types for easier manipulation.
This chapter covers two modules related to memory management. For highly interconnected data structures, such as graphs and trees, use weakref to maintain references while still allowing the garbage collector to clean up objects after they are no longer needed. The functions in copy are used for duplicating data structures and their contents, including recursive copies with deepcopy().
Debugging data structures can be time consuming, especially when wading through printed output of large sequences or dictionaries. Use pprint to create easy-to-read representations that can be printed to the console or written to a log file for easier debugging.
And, finally, if the available types do not meet the requirements, subclass one of the native types and customize it, or build a new container type using one of the abstract base classes defined in collections as a starting point.
2.1 collections—Container Data Types
- Purpose Container data types.
- Python Version 2.4 and later
The collections module includes container data types beyond the built-in types list, dict, and tuple.
2.1.1 Counter
A Counter is a container that tracks how many times equivalent values are added. It can be used to implement the same algorithms for which other languages commonly use bag or multiset data structures.
Initializing
Counter supports three forms of initialization. Its constructor can be called with a sequence of items, a dictionary containing keys and counts, or using keyword arguments mapping string names to counts.
import collections print collections.Counter(['a', 'b', 'c', 'a', 'b', 'b']) print collections.Counter({'a':2, 'b':3, 'c':1}) print collections.Counter(a=2, b=3, c=1)
The results of all three forms of initialization are the same.
$ python collections_counter_init.py Counter({'b': 3, 'a': 2, 'c': 1}) Counter({'b': 3, 'a': 2, 'c': 1}) Counter({'b': 3, 'a': 2, 'c': 1})
An empty Counter can be constructed with no arguments and populated via the update() method.
import collections c = collections.Counter() print 'Initial :', c c.update('abcdaab') print 'Sequence:', c c.update({'a':1, 'd':5}) print 'Dict :', c
The count values are increased based on the new data, rather than replaced. In this example, the count for a goes from 3 to 4.
$ python collections_counter_update.py Initial : Counter() Sequence: Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1}) Dict : Counter({'d': 6, 'a': 4, 'b': 2, 'c': 1})
Accessing Counts
Once a Counter is populated, its values can be retrieved using the dictionary API.
import collections c = collections.Counter('abcdaab') for letter in 'abcde': print '%s : %d' % (letter, c[letter])
Counter does not raise KeyError for unknown items. If a value has not been seen in the input (as with e in this example), its count is 0.
$ python collections_counter_get_values.py a : 3 b : 2 c : 1 d : 1 e : 0
The elements() method returns an iterator that produces all items known to the Counter.
import collections c = collections.Counter('extremely') c['z'] = 0 print c print list(c.elements())
The order of elements is not guaranteed, and items with counts less than or equal to zero are not included.
$ python collections_counter_elements.py Counter({'e': 3, 'm': 1, 'l': 1, 'r': 1, 't': 1, 'y': 1, 'x': 1, 'Z': 0}) ['e', 'e', 'e', 'm', 'l', 'r', 't', 'y', 'x']
Use most_common() to produce a sequence of the n most frequently encountered input values and their respective counts.
import collections c = collections.Counter() with open('/usr/share/dict/words', 'rt') as f: for line in f: c.update(line.rstrip().lower()) print 'Most common:' for letter, count in c.most_common(3): print '%s: %7d' % (letter, count)
This example counts the letters appearing in all words in the system dictionary to produce a frequency distribution, and then prints the three most common letters. Leaving out the argument to most_common() produces a list of all the items, in order of frequency.
$ python collections_counter_most_common.py Most common: e: 234803 i: 200613 a: 198938
Arithmetic
Counter instances support arithmetic and set operations for aggregating results.
import collections c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b']) c2 = collections.Counter('alphabet') print 'C1:', c1 print 'C2:', c2 print '\nCombined counts:' print c1 + c2 print '\nSubtraction:' print c1 - c2 print '\nIntersection (taking positive minimums):' print c1 & c2 print '\nUnion (taking maximums):' print c1 | c2
Each time a new Counter is produced through an operation, any items with zero or negative counts are discarded. The count for a is the same in c1 and c2, so subtraction leaves it at zero.
$ python collections_counter_arithmetic.py C1: Counter({'b': 3, 'a': 2, 'c': 1}) C2: Counter({'a': 2, 'b': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1}) Combined counts: Counter({'a': 4, 'b': 4, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1}) Subtraction: Counter({'b': 2, 'c': 1}) Intersection (taking positive minimums): Counter({'a': 2, 'b': 1}) Union (taking maximums): Counter({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
2.1.2 defaultdict
The standard dictionary includes the method setdefault() for retrieving a value and establishing a default if the value does not exist. By contrast, defaultdict lets the caller specify the default up front when the container is initialized.
import collections def default_factory(): return 'default value' d = collections.defaultdict(default_factory, foo='bar') print 'd:', d print 'foo =>', d['foo'] print 'bar =>', d['bar']
This method works well, as long as it is appropriate for all keys to have the same default. It can be especially useful if the default is a type used for aggregating or accumulating values, such as a list, set, or even int. The standard library documentation includes several examples of using defaultdict this way.
$ python collections_defaultdict.py d: defaultdict(<function default_factory at 0x100d9ba28>, {'foo': 'bar'}) foo => bar bar => default value
See Also:
- defaultdict examples (http://docs.python.org/lib/defaultdict-examples.html) Examples of using defaultdict from the standard library documentation.
- Evolution of Default Dictionaries in Python (http://jtauber.com/blog/2008/02/27/evolution_of_default_dictionaries_in_python/) Discussion from James Tauber of how defaultdict relates to other means of initializing dictionaries.
2.1.3 Deque
A double-ended queue, or deque, supports adding and removing elements from either end. The more commonly used structures, stacks, and queues are degenerate forms of deques where the inputs and outputs are restricted to a single end.
import collections d = collections.deque('abcdefg') print 'Deque:', d print 'Length:', len(d) print 'Left end:', d[0] print 'Right end:', d[-1] d.remove('c') print 'remove(c):', d
Since deques are a type of sequence container, they support some of the same operations as list, such as examining the contents with __getitem__(), determining length, and removing elements from the middle by matching identity.
$ python collections_deque.py Deque: deque(['a', 'b', 'c', 'd', 'e', 'f', 'g']) Length: 7 Left end: a Right end: g remove(c): deque(['a', 'b', 'd', 'e', 'f', 'g'])
Populating
A deque can be populated from either end, termed "left" and "right" in the Python implementation.
import collections # Add to the right d1 = collections.deque() d1.extend('abcdefg') print 'extend :', d1 d1.append('h') print 'append :', d1 # Add to the left d2 = collections.deque() d2.extendleft(xrange(6)) print 'extendleft:', d2 d2.appendleft(6) print 'appendleft:', d2
The extendleft() function iterates over its input and performs the equivalent of an appendleft() for each item. The end result is that the deque contains the input sequence in reverse order.
$ python collections_deque_populating.py extend : deque(['a', 'b', 'c', 'd', 'e', 'f', 'g']) append : deque(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) extendleft: deque([5, 4, 3, 2, 1, 0]) appendleft: deque([6, 5, 4, 3, 2, 1, 0])
Consuming
Similarly, the elements of the deque can be consumed from both ends or either end, depending on the algorithm being applied.
import collections print 'From the right:' d = collections.deque('abcdefg') while True: try: print d.pop(), except IndexError: break print print '\nFrom the left:' d = collections.deque(xrange(6)) while True: try: print d.popleft(), except IndexError: break print
Use pop() to remove an item from the right end of the deque and popleft() to take from the left end.
$ python collections_deque_consuming.py From the right: g f e d c b a From the left: 0 1 2 3 4 5
Since deques are thread-safe, the contents can even be consumed from both ends at the same time from separate threads.
import collections import threading import time candle = collections.deque(xrange(5)) def burn(direction, nextSource): while True: try: next = nextSource() except IndexError: break else: print '%8s: %s' % (direction, next) time.sleep(0.1) print '%8s done' % direction return left = threading.Thread(target=burn, args=('Left', candle.popleft)) right = threading.Thread(target=burn, args=('Right', candle.pop)) left.start() right.start() left.join() right.join()
The threads in this example alternate between each end, removing items until the deque is empty.
$ python collections_deque_both_ends.py Left: 0 Right: 4 Right: 3 Left: 1 Right: 2 Left done Right done
Rotating
Another useful capability of the deque is to rotate it in either direction, to skip over some items.
import collections d = collections.deque(xrange(10)) print 'Normal :', d d = collections.deque(xrange(10)) d.rotate(2) print 'Right rotation:', d d = collections.deque(xrange(10)) d.rotate(-2) print 'Left rotation :', d
Rotating the deque to the right (using a positive rotation) takes items from the right end and moves them to the left end. Rotating to the left (with a negative value) takes items from the left end and moves them to the right end. It may help to visualize the items in the deque as being engraved along the edge of a dial.
$ python collections_deque_rotate.py Normal : deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) Right rotation: deque([8, 9, 0, 1, 2, 3, 4, 5, 6, 7]) Left rotation : deque([2, 3, 4, 5, 6, 7, 8, 9, 0, 1])
See Also:
- Deque (http://en.wikipedia.org/wiki/Deque) Wikipedia article that provides a discussion of the deque data structure.
- Deque Recipes (http://docs.python.org/lib/deque-recipes.html) Examples of using deques in algorithms from the standard library documentation.
2.1.4 namedtuple
The standard tuple uses numerical indexes to access its members.
bob = ('Bob', 30, 'male') print 'Representation:', bob jane = ('Jane', 29, 'female') print '\nField by index:', jane[0] print '\nFields by index:' for p in [ bob, jane ]: print '%s is a %d year old %s' % p
This makes tuples convenient containers for simple uses.
$ python collections_tuple.py Representation: ('Bob', 30, 'male') Field by index: Jane Fields by index: Bob is a 30 year old male Jane is a 29 year old female
On the other hand, remembering which index should be used for each value can lead to errors, especially if the tuple has a lot of fields and is constructed far from where it is used. A namedtuple assigns names, as well as the numerical index, to each member.
Defining
namedtuple instances are just as memory efficient as regular tuples because they do not have per-instance dictionaries. Each kind of namedtuple is represented by its own class, created by using the namedtuple() factory function. The arguments are the name of the new class and a string containing the names of the elements.
import collections Person = collections.namedtuple('Person', 'name age gender') print 'Type of Person:', type(Person) bob = Person(name='Bob', age=30, gender='male') print '\nRepresentation:', bob jane = Person(name='Jane', age=29, gender='female') print '\nField by name:', jane.name print '\nFields by index:' for p in [ bob, jane ]: print '%s is a %d year old %s' % p
As the example illustrates, it is possible to access the fields of the namedtuple by name using dotted notation (obj.attr) as well as using the positional indexes of standard tuples.
$ python collections_namedtuple_person.py Type of Person: <type 'type'> Representation: Person(name='Bob', age=30, gender='male') Field by name: Jane Fields by index: Bob is a 30 year old male Jane is a 29 year old female
Invalid Field Names
Field names are invalid if they are repeated or conflict with Python keywords.
import collections try: collections.namedtuple('Person', 'name class age gender') except ValueError, err: print err try: collections.namedtuple('Person', 'name age gender age') except ValueError, err: print err
As the field names are parsed, invalid values cause ValueError exceptions.
$ python collections_namedtuple_bad_fields.py Type names and field names cannot be a keyword: 'class' Encountered duplicate field name: 'age'
If a namedtuple is being created based on values outside of the control of the program (such as to represent the rows returned by a database query, where the schema is not known in advance), set the rename option to True so the invalid fields are renamed.
import collections with_class = collections.namedtuple( 'Person', 'name class age gender', rename=True) print with_class._fields two_ages = collections.namedtuple( 'Person', 'name age gender age', rename=True) print two_ages._fields
The new names for renamed fields depend on their index in the tuple, so the field with name class becomes _1 and the duplicate age field is changed to _3.
$ python collections_namedtuple_rename.py ('name', '_1', 'age', 'gender') ('name', 'age', 'gender', '_3')
2.1.5 OrderedDict
An OrderedDict is a dictionary subclass that remembers the order in which its contents are added.
import collections print 'Regular dictionary:' d = {} d['a'] = 'A' d['b'] = 'B' d['c'] = 'C' for k, v in d.items(): print k, v print '\nOrderedDict:' d = collections.OrderedDict() d['a'] = 'A' d['b'] = 'B' d['c'] = 'C' for k, v in d.items(): print k, v
A regular dict does not track the insertion order, and iterating over it produces the values in order based on how the keys are stored in the hash table. In an OrderedDict, by contrast, the order in which the items are inserted is remembered and used when creating an iterator.
$ python collections_ordereddict_iter.py Regular dictionary: a A c C b B OrderedDict: a A b B c C
Equality
A regular dict looks at its contents when testing for equality. An OrderedDict also considers the order the items were added.
import collections print 'dict :', d1 = {} d1['a'] = 'A' d1['b'] = 'B' d1['c'] = 'C' d2 = {} d2['c'] = 'C' d2['b'] = 'B' d2['a'] = 'A' print d1 == d2 print 'OrderedDict:', d1 = collections.OrderedDict() d1['a'] = 'A' d1['b'] = 'B' d1['c'] = 'C' d2 = collections.OrderedDict() d2['c'] = 'C' d2['b'] = 'B' d2['a'] = 'A' print d1 == d2
In this case, since the two ordered dictionaries are created from values in a different order, they are considered to be different.
$ python collections_ordereddict_equality.py dict : True OrderedDict: False
See Also:
- collections (http://docs.python.org/library/collections.html) The standard library documentation for this module.