Applications of Immutability

Suppose you are implementing a service that works with data sets that represent routes on a map (e.g., collections of driving directions or logs of past trips). The road network is represented as a graph with nodes and edges, and each route is conceptually a collection of edges. You are tasked with choosing an appropriate data structure for routes that meets at least the following criteria:

  • it should be possible to deduplicate large collections of routes,
  • it should be possible to use individual routes as keys into a dictionary (e.g., to build a cache that maps each route to its total distance or average trip time),
  • programmers should not be able to modify a route instance if they obtain a reference to it, and
  • the collections of edges found in a route may be supplied to your data structure's constructor in any order (even though they represent the same route for purposes of caching or deduplication).

What are your options in implementing a data structure for routes that meets these criteria, and what issues should you consider?

Both built-in and user-defined data structures in Python can be either mutable or immutable. This article explains why Python makes this distinction for built-in data structures, breaks down the independent characteristics that are often associated with immutable data structures, and explores several approaches you can employ when addressing the above use case.

Mutable and Immutable Built-in Types

Each of the built-in types found in Python is either mutable or immutable:

Mutable types are accompanied by methods that modify instances of the corresponding data structure in-place (usually returning None) while immutable types are usually accompanied by functions and methods that return a new instance of that type (such as string concatenation). But why does Python distinguish between mutable and immutable built-in types? The reasons are subtle and relate to an interplay between programming language design decisions and practical performance requirements. A brief overview is provided below, and you can find a detailed answer to this question in the Python documentation.

Some programming languages such as Haskell) have only immutable values (and thus all new values are necessarily copies or entirely new objects). One significant benefit of this approach is that code in Haskell is much easier and safer to refactor and transform (e.g., for purposes of optimization) because the context of an expression will never affect its meaning. As an example, consider the following for loop.

In [1]:
for i in range(3):
    x = 1 + 2 + 3 + 4
    print(x)
10
10
10

Because the expression 1 + 2 + 3 + 4 is immutable (i.e., its value will not change depending on where it appears in a program), it can safely be moved up and outside of the for loop without changing the behavior of the program. Modern interpreters and compilers routinely use such an approach to perform performance optimizations directly on the abstract syntax tree of the program.

In [2]:
immutable_value = 1 + 2 + 3 + 4
for i in range(3):
    x = immutable_value
    print(x)
10
10
10

Python set instances support an extremely fast lookup/membership operation (which can be invoked using the infix in operator) because the Python interpreter builds a hash table that contains hashes of all the individual elements of a set. As a reference point, consider the performance when searching for an element in a list.

In [3]:
import time
l = list(range(0,1000000))
start = time.perf_counter()
999999 in l
time.perf_counter() - start
Out[3]:
0.03784119999999991

When evaluating the in operator in an expression such as 4 in {1, 2, 3, 4, 5}, the interpreter hashes 4 and finds the hash value in the hash table for {1, 2, 3, 4, 5} in nearly constant time. As shown below, this is significantly faster than performing a search through all of the elements in the set as in the above example (which is the only option without some sort of alternative comparison and sorting mechanism).

In [4]:
s = set(l)
start = time.perf_counter()
999999 in s
time.perf_counter() - start
Out[4]:
0.00020169999999986032

Suppose that elements in a set instance were mutable. This would mean that their hash would also be mutable, which would in term mean that the hash table for the set instance would need to be updated. But how would the interpreter even know to update the hash table? Consider the example below.

In [5]:
try:
    e = [1, 2, 3]
    u = {e, "a"}
    v = {e, True, False}
    w = {e, 1.2, 2.3, 3.4}
    e.pop()
except:
    pass

Every time a statement such as e.pop() is executed, the interpreter would need to check whether e is a member of any sets (there are three such sets in this case) and would need to update the hash table for every one of them to accurately reflect that the hash value corresponding to e is different. If the interpreter did not do this, then an expression such as [1, 2, 3] in u could not be evaluated both correctly and efficiently.

It is worth noting that Python's set and dict data structures were intentionally designed this way under an assumption: programmers will usually want to perform lookup in set and dict instances based on the value and not the particular instance of a data structure. An alternate approach could have been to simply build the hash table using the memory address of the each element rather than its value. However, this approach begins to fall apart as soon as strings are used as elements: after the statements k = "abc!" and d = {"abc!": 123}, a programmer probably expects d[k] == 123 to be True even though the address of the string instance "abc!" in k = "abc!" is different from the address of the distinct string instance "abc" in d = {"abc!": 123}. This can be confirmed using the built-in id function (though you should be aware of interning to avoid confusion when testing your own examples).

In [6]:
k = "abc!"
d = {"abc!": 123}
id(k) == id(list(d.keys())[0])
Out[6]:
False

The above explains why built-in immutable types exist and why Python requires them in certain contexts. This also indicates that to satisfy the criteria in the motivating scenario described in the introduction, the data structure you define for representing a route must be one that Python recognizes as being immutable.

Defining an Immutable Data Structure

In Python, there are a number of approaches available to you when you are defining a data structure for a use case such as the one described in the introduction.

  • One approach is simply to adopt a convention of using an existing built-in type to represent instances of your data structure. For example, a route can be represented as a frozenset of tuple instances (with two int components in each tuple representing the endpoints of the edge). This may be advantageous if you are trying to avoid unnecessary clutter or would like to make it easier for other libraries or components to use your data without dealing with (or introducing within their own code) application-specific boilerplate.

  • A second approach is to define a derived class that inherits the features of a built-in type (as demonstrated in an another article on operator overloading). This has most of the benefits of the first approach above, but gives you more control over the interface of the data structure. This is useful if you would like to add custom methods, to modify how certain default methods inherited from the built-in type behave, to enforce type or value constraints on method arguments, to throw application-specific exceptions, or simply to provide more user-friendly and application-specific synonyms for existing functions and methods.

  • A third approach is to define a brand new class that conceals its internal representation. This has the benefit of encapsulation (allowing you to modify the internal representation of the actual route in the future), but requires a more careful approach on the conceptual side and more boilerplate code on the practical side.

Using Built-in Types

A route could be represented using an instance of frozenset containing instances of tuple (with each tuple representing one edge) that in turn each contain two integers (with each integer representing one of the two nodes that an edge connects). Because integers and tuples are immutable, they can be elements inside frozenset instances.

In [7]:
route_one = frozenset({(0, 1), (1, 2), (2, 3)})
route_two = frozenset({(1, 2), (0, 1), (2, 3)})
len({route_one, route_two}) # Two routes with the same edges.
Out[7]:
1

Because frozenset instances are immutable, all four criteria for the route data structure can be satisfied. In particular, routes can be deduplicated by inserting them into a set and a mapping from routes to their distances can be implemented using a dict. Because a frozenset behaves like a mathematical set, the order of the edges does not matter.

In [8]:
distances = {route_one: 3} # No exception is raised.
len({route_one, route_two}) # Deduplication occurs.
Out[8]:
1

Note that an instance of an immutable type can contain a mutable object inside it. However, in order for a value to be used as a key in a dict instance or an element in a set instance, all elements inside the immutable type must also be immutable (and so on, down to the leaves of the data structure instance). In the example below, the mutable object [True, False] causes an exception even though it is inside an immutable frozenset instance that is itself inside an immutable tuple instance.

In [9]:
try:
    {tuple("a", frozenset({[True, False]})): 0}
except Exception as e:
    print(e)
unhashable type: 'list'

Defining a Derived Class

It is possible to take advantage of Python's support for inheritance to define a class that is derived from one of the immutable types. This ensures that the derived class has the same attributes and methods as the base class. In the example below, the route class inherits all the features of frozenset. In addition, it has a method distance that computes the distance of the route (defined as the number of hops that occur between two distinct nodes) and a custom definition of __repr__ to display an instance in a friendly way.

In [10]:
class route(frozenset):
    def distance(self):
        return sum([1 for e in self if e[0] != e[1]])
    
    def __repr__(self):
        return "route({" + ", ".join([str(e) for e in self]) + "})"

To create an instance of route, it is sufficient to wrap an instance of frozenset with the constructor.

In [11]:
route({(0,0), (0,1), (1,1)}).distance()
Out[11]:
1

With respect to its immutability, instances of route can be used in any context in which frozenset can be used.

In [12]:
{route({(0,1), (1,2)}): 2}
Out[12]:
{route({(1, 2), (0, 1)}): 2}

If your data structure were simpler (e.g., a record with a fixed collection of named attributes), you could have your derived class inherit from a type generated using namedtuple from the built-in collections library.

In [13]:
from collections import namedtuple
class record(namedtuple("record", "name age")):
    pass
record("Alice", 32)
Out[13]:
record(name='Alice', age=32)

Defining a New Class

When creating your own class for the route data structure, you first need to determine which characteristics of built-in immutable types you would like your class to possess.

  1. If you would like to be able to use instances of the class inside a set instance or as a key in a dict instance, you must define appropriate methods to make this possible.
  2. If you would like to ensure that it is not possible to modify or extend an instance of your class once it has been created, you will need to redefine specific methods and attributes in a particular way. However, it is worth noting that this approach does not enforce immutability to the same extent as the approach of defining a class derived from an immutable type.

To satisfy the first requirement, it is sufficient to provide definitions for the __hash__ and __eq__ methods. The Python interpreter will invoke these methods when building a hash table (e.g., for a set or dict instance) as well as when retrieving a value. The __eq__ method is required in addition to the __hash__ method because hashes are not guaranteed to be unique and the interpreter needs to be able to disambiguate between objects of your class in such scenarios. Note that the Python interpreter expects that two values that are equal according to __eq__ must have the same hash values according to __hash__.

In [14]:
class route():
    def __init__(self, edges):
        self.es = edges

    def __hash__(self):
        es = sorted(list(set(self.es)))
        import hashlib
        return int(hashlib.sha256(str(es).encode()).hexdigest(), 16)

    def __eq__(self, other):
        return set(self.es) == set(other.es)

    def distance(self):
        return sum([1 for e in self.es if e[0] != e[1]])

It is now possible to use instances of route as elements of set instances and as keys for dict instances.

In [15]:
route_one = route([(0,0), (0,1), (1,1)])
route_two = route([(0,1), (1,2), (2,3)])
route_three = route([(0,1), (1,2), (2,3)])
distances = {route_one: 3}
len({route_one, route_two, route_three})
Out[15]:
2

Note that in the definition of the __hash__ method, a cryptographic hash function sha256 from the built-in hashlib library is applied to a string version of a normalized representation (i.e., sorted and deduplicated) of a route. Normalization ensures that the order and multiplicity of edges does not change the hash of a route. Use of sha256 ensures that the hash of any instance of the same string will always be the same in any environment and under any version of Python. This is not guaranteed by the built-in hash function, which may return different results across different Python sessions (even in the same environment). Such an inconsistency could be an issue if, for example, users of your data structure store instances of it on disk and load them again at a later time.

One way to satisfy the requirement that users cannot modify or extend instances of your data structure is to explicitly set the __slots__ attribute to a tuple containing only those attributes that you require for your implementation.

In [16]:
class route():
    __slots__ = ("es")
    
    def __init__(self, edges):
        self.es = edges

It is now not possible to create new attributes for any route instance.

In [17]:
try:
    route_example = route([(0,1), (1,2)])
    route_example.duration = 123
except Exception as e:
    print(e)
'route' object has no attribute 'duration'

Another option is to provide a definition for __setattr__ that raises an exception, ensuring that it is not possible to add new attributes or to assign new values to an existing attribute.

Further Reading

This article provides an overview of the distinction between mutable and immutable data structures in Python, why the distinction exists, and guidance on what options are available to a programmer when they are implementing an application-specific immutable data structure. The information about the namedtuple factory function in the built-in collections library may be relevant to some scenarios. The third-party attrs library can help reduce the amount of boilerplate required to define new classes (even if you are not relying on inheritance) and has features that help introduce immutability. To learn more about the nuances of object and memory management within the Python interpreter, you can read about interning. Those more familiar with C may find it useful to explore the Memory Management documentation (targeting primarily at readers who are interested in implementing their own extensions to the Python interpreter).