Suppose you are implementing a service that works with data sets that represent routes on a map (e.g., collections of driving directions or logs of past trips). The road network is represented as a graph with nodes and edges, and each route is conceptually a collection of edges. You are tasked with choosing an appropriate data structure for routes that meets at least the following criteria:
What are your options in implementing a data structure for routes that meets these criteria, and what issues should you consider?
Both built-in and user-defined data structures in Python can be either mutable or immutable. This article explains why Python makes this distinction for built-in data structures, breaks down the independent characteristics that are often associated with immutable data structures, and explores several approaches you can employ when addressing the above use case.
Each of the built-in types found in Python is either mutable or immutable:
list
, dict
, and set
are mutable,tuple
, frozenset
, and range
are immutable,bool
, int
, float
, and str
are immutable, andbytes
are immutable but instances of bytearray
are mutable.Mutable types are accompanied by methods that modify instances of the corresponding data structure in-place (usually returning None
) while immutable types are usually accompanied by functions and methods that return a new instance of that type (such as string concatenation). But why does Python distinguish between mutable and immutable built-in types? The reasons are subtle and relate to an interplay between programming language design decisions and practical performance requirements. A brief overview is provided below, and you can find a detailed answer to this question in the Python documentation.
Some programming languages such as Haskell) have only immutable values (and thus all new values are necessarily copies or entirely new objects). One significant benefit of this approach is that code in Haskell is much easier and safer to refactor and transform (e.g., for purposes of optimization) because the context of an expression will never affect its meaning. As an example, consider the following for
loop.
for i in range(3):
x = 1 + 2 + 3 + 4
print(x)
Because the expression 1 + 2 + 3 + 4
is immutable (i.e., its value will not change depending on where it appears in a program), it can safely be moved up and outside of the for
loop without changing the behavior of the program. Modern interpreters and compilers routinely use such an approach to perform performance optimizations directly on the abstract syntax tree of the program.
immutable_value = 1 + 2 + 3 + 4
for i in range(3):
x = immutable_value
print(x)
Python set
instances support an extremely fast lookup/membership operation (which can be invoked using the infix in
operator) because the Python interpreter builds a hash table that contains hashes of all the individual elements of a set. As a reference point, consider the performance when searching for an element in a list.
import time
l = list(range(0,1000000))
start = time.perf_counter()
999999 in l
time.perf_counter() - start
When evaluating the in
operator in an expression such as 4 in {1, 2, 3, 4, 5}
, the interpreter hashes 4
and finds the hash value in the hash table for {1, 2, 3, 4, 5}
in nearly constant time. As shown below, this is significantly faster than performing a search through all of the elements in the set as in the above example (which is the only option without some sort of alternative comparison and sorting mechanism).
s = set(l)
start = time.perf_counter()
999999 in s
time.perf_counter() - start
Suppose that elements in a set
instance were mutable. This would mean that their hash would also be mutable, which would in term mean that the hash table for the set
instance would need to be updated. But how would the interpreter even know to update the hash table? Consider the example below.
try:
e = [1, 2, 3]
u = {e, "a"}
v = {e, True, False}
w = {e, 1.2, 2.3, 3.4}
e.pop()
except:
pass
Every time a statement such as e.pop()
is executed, the interpreter would need to check whether e
is a member of any sets (there are three such sets in this case) and would need to update the hash table for every one of them to accurately reflect that the hash value corresponding to e
is different. If the interpreter did not do this, then an expression such as [1, 2, 3] in u
could not be evaluated both correctly and efficiently.
It is worth noting that Python's set
and dict
data structures were intentionally designed this way under an assumption: programmers will usually want to perform lookup in set
and dict
instances based on the value and not the particular instance of a data structure. An alternate approach could have been to simply build the hash table using the memory address of the each element rather than its value. However, this approach begins to fall apart as soon as strings are used as elements: after the statements k = "abc!"
and d = {"abc!": 123}
, a programmer probably expects d[k] == 123
to be True
even though the address of the string instance "abc!"
in k = "abc!"
is different from the address of the distinct string instance "abc"
in d = {"abc!": 123}
. This can be confirmed using the built-in id
function (though you should be aware of interning to avoid confusion when testing your own examples).
k = "abc!"
d = {"abc!": 123}
id(k) == id(list(d.keys())[0])
The above explains why built-in immutable types exist and why Python requires them in certain contexts. This also indicates that to satisfy the criteria in the motivating scenario described in the introduction, the data structure you define for representing a route must be one that Python recognizes as being immutable.
In Python, there are a number of approaches available to you when you are defining a data structure for a use case such as the one described in the introduction.
One approach is simply to adopt a convention of using an existing built-in type to represent instances of your data structure. For example, a route can be represented as a frozenset
of tuple
instances (with two int
components in each tuple representing the endpoints of the edge). This may be advantageous if you are trying to avoid unnecessary clutter or would like to make it easier for other libraries or components to use your data without dealing with (or introducing within their own code) application-specific boilerplate.
A second approach is to define a derived class that inherits the features of a built-in type (as demonstrated in an another article on operator overloading). This has most of the benefits of the first approach above, but gives you more control over the interface of the data structure. This is useful if you would like to add custom methods, to modify how certain default methods inherited from the built-in type behave, to enforce type or value constraints on method arguments, to throw application-specific exceptions, or simply to provide more user-friendly and application-specific synonyms for existing functions and methods.
A third approach is to define a brand new class that conceals its internal representation. This has the benefit of encapsulation (allowing you to modify the internal representation of the actual route in the future), but requires a more careful approach on the conceptual side and more boilerplate code on the practical side.
A route could be represented using an instance of frozenset
containing instances of tuple
(with each tuple representing one edge) that in turn each contain two integers (with each integer representing one of the two nodes that an edge connects). Because integers and tuples are immutable, they can be elements inside frozenset
instances.
route_one = frozenset({(0, 1), (1, 2), (2, 3)})
route_two = frozenset({(1, 2), (0, 1), (2, 3)})
len({route_one, route_two}) # Two routes with the same edges.
Because frozenset
instances are immutable, all four criteria for the route data structure can be satisfied. In particular, routes can be deduplicated by inserting them into a set
and a mapping from routes to their distances can be implemented using a dict
. Because a frozenset
behaves like a mathematical set, the order of the edges does not matter.
distances = {route_one: 3} # No exception is raised.
len({route_one, route_two}) # Deduplication occurs.
Note that an instance of an immutable type can contain a mutable object inside it. However, in order for a value to be used as a key in a dict
instance or an element in a set
instance, all elements inside the immutable type must also be immutable (and so on, down to the leaves of the data structure instance). In the example below, the mutable object [True, False]
causes an exception even though it is inside an immutable frozenset
instance that is itself inside an immutable tuple
instance.
try:
{tuple("a", frozenset({[True, False]})): 0}
except Exception as e:
print(e)
It is possible to take advantage of Python's support for inheritance to define a class that is derived from one of the immutable types. This ensures that the derived class has the same attributes and methods as the base class. In the example below, the route
class inherits all the features of frozenset
. In addition, it has a method distance
that computes the distance of the route (defined as the number of hops that occur between two distinct nodes) and a custom definition of __repr__
to display an instance in a friendly way.
class route(frozenset):
def distance(self):
return sum([1 for e in self if e[0] != e[1]])
def __repr__(self):
return "route({" + ", ".join([str(e) for e in self]) + "})"
To create an instance of route
, it is sufficient to wrap an instance of frozenset
with the constructor.
route({(0,0), (0,1), (1,1)}).distance()
With respect to its immutability, instances of route
can be used in any context in which frozenset
can be used.
{route({(0,1), (1,2)}): 2}
If your data structure were simpler (e.g., a record with a fixed collection of named attributes), you could have your derived class inherit from a type generated using namedtuple
from the built-in collections library.
from collections import namedtuple
class record(namedtuple("record", "name age")):
pass
record("Alice", 32)
When creating your own class for the route data structure, you first need to determine which characteristics of built-in immutable types you would like your class to possess.
set
instance or as a key in a dict
instance, you must define appropriate methods to make this possible.To satisfy the first requirement, it is sufficient to provide definitions for the __hash__
and __eq__
methods. The Python interpreter will invoke these methods when building a hash table (e.g., for a set
or dict
instance) as well as when retrieving a value. The __eq__
method is required in addition to the __hash__
method because hashes are not guaranteed to be unique and the interpreter needs to be able to disambiguate between objects of your class in such scenarios. Note that the Python interpreter expects that two values that are equal according to __eq__
must have the same hash values according to __hash__
.
class route():
def __init__(self, edges):
self.es = edges
def __hash__(self):
es = sorted(list(set(self.es)))
import hashlib
return int(hashlib.sha256(str(es).encode()).hexdigest(), 16)
def __eq__(self, other):
return set(self.es) == set(other.es)
def distance(self):
return sum([1 for e in self.es if e[0] != e[1]])
It is now possible to use instances of route
as elements of set
instances and as keys for dict
instances.
route_one = route([(0,0), (0,1), (1,1)])
route_two = route([(0,1), (1,2), (2,3)])
route_three = route([(0,1), (1,2), (2,3)])
distances = {route_one: 3}
len({route_one, route_two, route_three})
Note that in the definition of the __hash__
method, a cryptographic hash function sha256
from the built-in hashlib
library is applied to a string version of a normalized representation (i.e., sorted and deduplicated) of a route. Normalization ensures that the order and multiplicity of edges does not change the hash of a route. Use of sha256
ensures that the hash of any instance of the same string will always be the same in any environment and under any version of Python. This is not guaranteed by the built-in hash
function, which may return different results across different Python sessions (even in the same environment). Such an inconsistency could be an issue if, for example, users of your data structure store instances of it on disk and load them again at a later time.
One way to satisfy the requirement that users cannot modify or extend instances of your data structure is to explicitly set the __slots__
attribute to a tuple containing only those attributes that you require for your implementation.
class route():
__slots__ = ("es")
def __init__(self, edges):
self.es = edges
It is now not possible to create new attributes for any route
instance.
try:
route_example = route([(0,1), (1,2)])
route_example.duration = 123
except Exception as e:
print(e)
Another option is to provide a definition for __setattr__
that raises an exception, ensuring that it is not possible to add new attributes or to assign new values to an existing attribute.
This article provides an overview of the distinction between mutable and immutable data structures in Python, why the distinction exists, and guidance on what options are available to a programmer when they are implementing an application-specific immutable data structure. The information about the namedtuple
factory function in the built-in collections library may be relevant to some scenarios. The third-party attrs library can help reduce the amount of boilerplate required to define new classes (even if you are not relying on inheritance) and has features that help introduce immutability. To learn more about the nuances of object and memory management within the Python interpreter, you can read about interning. Those more familiar with C may find it useful to explore the Memory Management documentation (targeting primarily at readers who are interested in implementing their own extensions to the Python interpreter).