Suppose you are working on a Python API that provides access to a real-time data stream (perhaps from an array of sensors or from a web service that handles user requests). You would like to deliver to the consumers of your API a simple but flexible abstraction that allows them to operate on new items from the stream when they choose to do so. Furthermore, you would like the API to allow users to do the following three things: designate a default or fall-back stream from which items are drawn when a given stream is exhausted, interleave the items of two streams, and split the items of a stream into batches that can be processed in parallel.
What abstraction should you use? How much of it must be custom-built and how much can be done using native Python features? When working with data streams, state spaces, and other abstractions that represent large or unbounded structures, it can be tempting to custom-build solutions that may become increasingly complex and difficult to maintain. Understanding a range of features that are already available in a language or its built-in libraries can help mitigate this while saving significant time and effort (both your own and that of others who build upon your work).
Iterators and generators are powerful tools in the Python language that have compelling applications in a number of contexts. This article reviews how they are defined, how they can be used, how they are related to one another, and how they can help you work in an elegant and flexible way with data structures and data streams of an unknown or infinite size.
When discussing Python, the terms iterable, iterator, and generator often appear in similar contexts or are even used interchangeably. These language features also solve similar problems. This can lead to some confusion, but there is a reason that on occasion these are conflated.
One way to understand the term iterator is that it refers to any Python data structure that has an interface that supports iteration over objects one at a time. A data structure is iterable if there is at least one way to construct an iterator that traverses it in some way. On the other hand, a generator is a particular kind of data structure, defined in a specific way within Python, that maintains an internal state and constructs or retrieves zero or more objects or values one at a time.
Thus, a generator by virtue of its characteristics can have an interface that allows it to qualify as an iterator, which consequently also makes it an iterable. In fact, all generators are iterators and iterable. However, not all iterators or iterable data structures are generators because there exist other approaches for building a Python object that possesses the kind of interface an iterator or iterable is expected to have.
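These relationships can be verified directly using the abstract base classes provided by the standard library's collections.abc module. The minimal generator below (a hypothetical example, not one used elsewhere in this article) qualifies as both an iterator and an iterable.

```python
from collections.abc import Iterable, Iterator

def naturals():
    # A minimal generator function; calling it returns a generator iterator.
    n = 0
    while True:
        yield n
        n += 1

g = naturals()
isinstance(g, Iterator), isinstance(g, Iterable)
# → (True, True)
```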
If you want to implement an iterator data structure directly, you need to include a special method __next__ in the class definition, which will be invoked whenever the built-in next function is applied to an instance of that data structure. The skips data structure below can emit every other positive integer via its definition of a __next__ method.
class skips:
    def __init__(self):
        self.integer = 0
    def __next__(self):
        self.integer += 2
        return self.integer
Now it is possible to use the built-in next function to retrieve each item one at a time from an instance of the skips data structure.
ns = skips()
[next(ns), next(ns), next(ns)]
The number of items over which the data structure will iterate can be limited by raising the StopIteration exception when more items cannot (or should not) be returned.
class skips:
    def __init__(self, start, end):
        self.integer = start - 2
        self.end = end
    def __next__(self):
        self.integer += 2
        if self.integer > self.end:
            raise StopIteration
        return self.integer
It is then the responsibility of any code that uses an instance of this iterator to catch this exception and handle it appropriately. It is worth acknowledging that this is a somewhat unusual use of a language feature normally associated with catching errors (because an iterator being exhausted is not always an error condition).
ns = skips(0, 10)
while True:
    try:
        print(next(ns))
    except StopIteration:
        break
In Python, there is a distinction between an iterator and an iterable data structure. This distinction is worth maintaining for a variety of reasons; for example, a single iterable data structure may support several distinct kinds of iterators.
As an example, consider a data structure interval that represents all integers in some range. Users might be allowed to obtain two different kinds of iterators for an instance of this data structure: those that iterate over only the even integers and those that iterate over only the odd integers.
class interval:
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper
    def evens(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 0 else 1),
            self.upper
        )
    def odds(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 1 else 1),
            self.upper
        )
The example below illustrates how an iterator returned by one of the methods in the definition of interval can be used.
ns = interval(0, 10).odds()
while True: # Keep iterating and printing until exhaustion.
    try:
        print(next(ns))
    except StopIteration:
        break
So far in this article, the distinction between iterators and iterable data structures has been explicit for clarity. However, the convention that is supported (and sometimes expected) throughout Python is that an iterable data structure has a single iterator that can be used to iterate over it. This iterator is returned by a special method __iter__ that is included in the class definition. In the example below, the interval class supports the creation of an iterator that visits every integer in the interval.
class every:
    def __init__(self, start, end):
        self.integer = start - 1
        self.end = end
    def __next__(self):
        self.integer += 1
        if self.integer > self.end:
            raise StopIteration
        return self.integer

class interval:
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper
    def __iter__(self):
        return every(self.lower, self.upper)
Python's built-in iter function can be used to invoke __iter__ for an instance of this data structure.
ns = iter(interval(1, 3))
while True: # Keep iterating and printing until exhaustion.
    try:
        print(next(ns))
    except StopIteration:
        break
Because interval now implements the special method __iter__, instances can be supplied to any built-in function that expects an iterable, such as list or set.

list(interval(0, 10)), set(interval(0, 10))

This also includes comprehensions and for loops.
for n in interval(1, 4):
    print([k for k in interval(1, n)])
There is nothing stopping you from making the iterator itself an iterable by having it return itself, as in the variant below.
class every:
    def __init__(self, start, end):
        self.integer = start - 1
        self.end = end
    def __next__(self):
        self.integer += 1
        if self.integer > self.end:
            raise StopIteration
        return self.integer
    def __iter__(self):
        return self
This approach ensures that there is no ambiguity (from a programmer's perspective) about what will happen when built-in functions such as list are applied to an instance of the data structure.
list(every(0, 10))
This practice is common and is the cause of some of the confusion and conflation that occurs between iterators and iterables. In addition to the potential for confusion, users of such a data structure must be careful to use the iterator as an iterable only once (or, alternatively, the object must reset its internal state every time __iter__ is invoked).
ns = every(0, 10)
list(ns), list(ns) # Only returns contents the first time.
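As a sketch of the second alternative, a variant of every can store its starting value and reset its internal state each time __iter__ is invoked, allowing repeated traversals of the same instance.

```python
class every:
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.integer = start - 1
    def __next__(self):
        self.integer += 1
        if self.integer > self.end:
            raise StopIteration
        return self.integer
    def __iter__(self):
        # Resetting here makes every traversal start from the beginning.
        self.integer = self.start - 1
        return self

ns = every(0, 10)
list(ns), list(ns)  # Now returns the contents both times.
```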
Nevertheless, this can also be a useful practice. Going back to the example with evens and odds, ensuring the iterators returned by these methods are also iterable means they can be fed directly into contexts that expect an iterable.
class skips:
    def __init__(self, start, end):
        self.integer = start - 2
        self.end = end
    def __next__(self):
        self.integer += 2
        if self.integer > self.end:
            raise StopIteration
        return self.integer
    def __iter__(self):
        return self

class interval:
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper
    def evens(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 0 else 1),
            self.upper
        )
    def odds(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 1 else 1),
            self.upper
        )
The example below illustrates how this kind of interface can be used.
i = interval(0, 10)
list(i.evens()), set(i.odds())
Generators are data structures defined using either the yield statement (within a function body) or generator comprehension notation (also known as a generator expression). The example below defines a generator skips using each of these two approaches.
def skips(start, end):
    integer = start
    while integer <= end:
        yield integer
        integer += 2

def skips(start, end):
    return (
        integer
        for integer in range(start, end + 1)
        if (integer - start) % 2 == 0
    )
When a generator function is invoked, it returns an iterator (more precisely called a generator iterator). Such objects are both iterators and iterables. For example, as with any iterator, next can be applied directly to instances of this data structure.
ns = skips(0, 10)
next(ns), next(ns), next(ns)
As with any iterator, exhaustion can be detected by catching the StopIteration exception.
ns = skips(0, 2)
try:
    next(ns), next(ns), next(ns)
except StopIteration:
    print("Exhausted generator iterator.")
Finally, an instance of the data structure can be used in any context that expects an iterable.
list(skips(0, 10))
It is possible to confirm that the result of evaluating skips is indeed a generator by checking its type.
import types
isinstance(skips(0, 10), types.GeneratorType)
It is also possible to inspect its type to confirm that skips indeed evaluates to an iterator.

import collections.abc
isinstance(skips(0, 10), collections.abc.Iterator)
Scenarios involving data structures whose size is unknown or unbounded (such as streams, very large files, databases, and so on) are among the most compelling use cases for iterators and generators. You have already seen that an iterable can be defined to produce new objects or values indefinitely, so iterables are an effective way to represent and encapsulate such structures.
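For instance, a hypothetical stream of sensor readings can be wrapped in a generator that never terminates; a consumer then draws only as many items as it needs (here random.random stands in for reading from a real sensor).

```python
import itertools
import random

def sensor_stream():
    # Hypothetical unbounded stream; random.random is a stand-in for
    # retrieving a reading from a real sensor or service.
    while True:
        yield random.random()

# Only five readings are ever produced, despite the infinite loop.
list(itertools.islice(sensor_stream(), 5))
```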
Returning to the example described at the beginning of the article, recall that you are faced with creating a Python API for working with data streams that might (or might not) run out of items that can be drawn from them. The advantages of leveraging iterables and generators should be evident at this point, so suppose you move ahead with this option and implement an iterable to represent a data stream. How can you address the three specific requirements (i.e., default/fall-back streams, interleaving, and splitting for parallelism) using these features?
To satisfy the first requirement, you must allow a user to exhaust one iterable and then switch to another one. This is straightforward to do by constructing a generator that concatenates two iterables.
def concatenate(xs, ys):
    for x in xs:
        yield x
    for y in ys:
        yield y
Concatenating two instances of an iterable data structure is now straightforward.
list(concatenate(skips(0,5), skips(6,11)))
Notice that if the first iterable is never exhausted, the second one will never be used.
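The same behavior is also available as chain in the built-in itertools library, which accepts two or more iterables. The sketch below (which repeats the generator-based definition of skips so that it is self-contained) confirms that chain matches concatenate.

```python
from itertools import chain

def skips(start, end):
    # Generator emitting every other integer from start through end.
    integer = start
    while integer <= end:
        yield integer
        integer += 2

# chain draws from the second iterable only after the first is exhausted.
list(chain(skips(0, 5), skips(6, 11)))
# → [0, 2, 4, 6, 8, 10]
```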
To address the second requirement, first consider a simpler scenario. What if you would like to "line up" or "pair up" entries in two or more iterables? You can use the built-in zip function.
list(zip(skips(0,5), skips(6,11)))
Notice that the result of evaluating zip is indeed an iterator.
import collections.abc
isinstance(
    zip(skips(0,5), skips(6,11)),
    collections.abc.Iterator
)
Combining zip with comprehension syntax, you can now define a generator that interleaves two iterables (switching back and forth between emitting an item from one and then the other).
def interleave(xs, ys):
    return (
        z
        for (x, y) in zip(xs, ys)
        for z in (x, y)
    )
As with concatenation, interleaving is now concise and straightforward.
list(interleave(skips(0,5), skips(6,11)))
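Note that zip stops as soon as either input is exhausted, so interleave silently discards any leftover items from the longer iterable. If those leftover items should be emitted as well, a variant (given the hypothetical name interleave_all below) can be built on itertools.zip_longest with a sentinel fill value.

```python
from itertools import zip_longest

def interleave_all(xs, ys):
    _missing = object()  # Sentinel that cannot collide with real items.
    for (x, y) in zip_longest(xs, ys, fillvalue=_missing):
        if x is not _missing:
            yield x
        if y is not _missing:
            yield y

list(interleave_all([1, 3, 5, 7], [2, 4]))
# → [1, 2, 3, 4, 5, 7]
```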
Finally, how can you help users process items from a stream in parallel? Because you are already using iterables, users have some options available to them from the built-in itertools library.
One option is islice, which behaves in a manner similar to Python slice notation (such as xs[0:10] to extract the first ten entries from a list xs). Users can employ this function to extract items in batches and then either pass each item in a batch to its own separate thread or pass entire batches of items to separate threads. A basic batching method is presented below.
from itertools import islice

def batch(xs, size):
    ys = list(islice(xs, 0, size))
    while len(ys) > 0:
        yield ys
        ys = list(islice(xs, 0, size))
Notice that this method inherits the graceful behavior of slice notation when the boundaries of the slices do not line up exactly with the number of entries in the data structure instance.
list(batch(skips(0,21), 3))
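To sketch the second of these two approaches, entire batches can be handed to worker threads via the standard library's concurrent.futures module. The process function below is a hypothetical stand-in for whatever per-batch work a user might perform, and the definitions of skips and batch are repeated so that the sketch is self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def skips(start, end):
    integer = start
    while integer <= end:
        yield integer
        integer += 2

def batch(xs, size):
    ys = list(islice(xs, 0, size))
    while len(ys) > 0:
        yield ys
        ys = list(islice(xs, 0, size))

def process(ys):
    # Stand-in for the real work a worker thread performs on a batch.
    return sum(ys)

# Each batch is processed by a worker thread; map preserves batch order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, batch(skips(0, 21), 3)))

results
# → [6, 24, 42, 38]
```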
Can you define a generator that returns batches of batches (e.g., at most n batches, each of size at most k)?
Another option is to use the tee function, which can duplicate a single iterable into multiple iterables. However, tee only simulates this effect by buffering every item that has been drawn from one duplicate but not yet from the others. Thus, it may use a significant amount of memory, and the resulting iterables are not safe to share across threads or processes. It is best suited for situations in which the iterables are known to have a small number of items, as in the example below.
from itertools import tee
(a, b) = tee(skips(0,11), 2)
(list(a), list(b))
The example above is arguably implemented in a clearer and more familiar way by simply wrapping the iterable using list.
ns = list(skips(0,11))
(ns, ns)
This article presents the relationships and distinctions between a number of related Python language constructs, and illustrates via the use case of data streams how these constructs can be leveraged. You can visit the Python Wiki if you are looking for additional discussion and examples of both iterators and generators. The Python documentation also contains a Functional Programming HOWTO that discusses how iterators and generators offer new kinds of modularity and composability. Furthermore, the Python documentation entry for the built-in itertools library contains many built-in functions and recommended patterns that are specialized to particular scenarios.