Don’t Use Python Lists Where Generators Work Better!

Sunday, February 15, 2015

I recently learned about a subtle difference between lists and generators in Python.

The gist: when passing iterables, always prefer generators over lists!

Also: when returning iterables, always prefer generators over lists!

Most of the time, the intuitive way does the Right Thing. I noticed the subtlety in some cases where my intuition didn’t do the Right Thing.

In this post I explain the difference using several examples. The last example is an important one, demonstrating how fnmatch.filter() can misbehave!

Don’t use Python lists where generators do a better job!

Simple examples with builtin all and any functions

Consider these two similar ways to use the builtin all() function:

def foo(num):
    print 'hi from foo', num
    return num < 5

print 'Using all([ ... ])'
print all([foo(n) for n in range(10)])

print 'Using all( ... )'
print all(foo(n) for n in range(10))

The only difference is whether the iterable passed is a list or a generator. When using [ ... ], a list is constructed in-place. Without the [ ... ], a generator is passed.

Running this snippet:

Using all([ ... ])
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
hi from foo 6
hi from foo 7
hi from foo 8
hi from foo 9
False
Using all( ... )
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
False

Both ways returned the correct answer – False – but the behaviors are inherently different! In the list-based example, Python constructed the entire list before passing it into all(), so all 10 foo(n) calls were executed. In the generator-based example, Python passed the generator into all(), allowing it to generate only as many elements as it needed to return the correct answer!

This may seem like a minor nuance, but it is easy to see how this nuance may have significant performance and correctness implications.

The performance part is straightforward. If foo() is expensive, or there are very many elements, and foo() returns False for one of the early elements, then most of the computations are wasted cycles that could be spared.

The correctness aspect is less straightforward. If foo() changes the state of the system in some way (e.g. creating or deleting files), then the list-based example can cause unexpected state changes.
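To make the correctness point concrete, here is a minimal sketch (written in Python 3 syntax, unlike the snippets above; calls and check are made-up names) that counts how many times the side-effecting function actually runs:

```python
calls = []

def check(n):
    calls.append(n)  # stand-in for a real side effect, e.g. touching a file
    return n < 5

# List: the comprehension runs check() on all 10 values before all() sees any
all([check(n) for n in range(10)])
list_side_effects = len(calls)   # 10 side effects happened

calls.clear()

# Generator: all() stops pulling values at the first False result (n == 5)
all(check(n) for n in range(10))
gen_side_effects = len(calls)    # only 6 side effects happened
```

With the list, every side effect happens whether or not all() needed it; with the generator, only the first six do.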

The example applies to the any() function as well:

def foo(num):
    print 'hi from foo', num
    return num >= 5

print 'Using any([ ... ])'
print any([foo(n) for n in range(10)])

print 'Using any( ... )'
print any(foo(n) for n in range(10))

Running it:

Using any([ ... ])
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
hi from foo 6
hi from foo 7
hi from foo 8
hi from foo 9
True
Using any( ... )
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
True

More examples with for loops over list comprehensions

Consider these two less obvious examples of a for-loop over some dynamic expression.

def foo_gen(num):
    i = 0
    while i < num:
        print 'hi from foo', i, num
        yield i
        i += 1

print 'Using for x in [ ... ]'
for x in [n/2 for n in foo_gen(20) if n%2==0]:
    print 'hi from for loop, x:', x
    if x > 5:
        print 'breaking out from for loop'
        break

print ''

print 'Using for x in ( ... )'

for x in (n/2 for n in foo_gen(20) if n%2==0):
    print 'hi from for loop, x:', x
    if x > 5:
        print 'Breaking out of for loop'
        break

Both loops use my made-up foo_gen(), which is just a chatty version of the builtin xrange().

The difference between the for-loops is quite subtle. The first one builds a list comprehension with [ ... ], and the second one uses a generator expression with ( ... ).

What do you expect will happen when I run this? Let’s see:

Using for x in [ ... ]
hi from foo 0 20
hi from foo 1 20
hi from foo 2 20
hi from foo 3 20
hi from foo 4 20
hi from foo 5 20
hi from foo 6 20
hi from foo 7 20
hi from foo 8 20
hi from foo 9 20
hi from foo 10 20
hi from foo 11 20
hi from foo 12 20
hi from foo 13 20
hi from foo 14 20
hi from foo 15 20
hi from foo 16 20
hi from foo 17 20
hi from foo 18 20
hi from foo 19 20
hi from for loop, x: 0
hi from for loop, x: 1
hi from for loop, x: 2
hi from for loop, x: 3
hi from for loop, x: 4
hi from for loop, x: 5
hi from for loop, x: 6
breaking out from for loop

Using for x in ( ... )
hi from foo 0 20
hi from for loop, x: 0
hi from foo 1 20
hi from foo 2 20
hi from for loop, x: 1
hi from foo 3 20
hi from foo 4 20
hi from for loop, x: 2
hi from foo 5 20
hi from foo 6 20
hi from for loop, x: 3
hi from foo 7 20
hi from foo 8 20
hi from for loop, x: 4
hi from foo 9 20
hi from foo 10 20
hi from for loop, x: 5
hi from foo 11 20
hi from foo 12 20
hi from for loop, x: 6
Breaking out of for loop

Kudos to you if you expected this! When I started playing around with this subject, I could not foresee it!

With a fully-constructed list (the [ ... ] case), Python exhausts foo_gen() – all 20 values are yielded – to finish constructing the list, and only then does the for-loop begin, breaking out after 7 iterations. The for-loop never reaches beyond the 13th value that foo_gen() generated, but nonetheless – all values were generated.

With a "naked generator" (the ( ... ) case – officially called a generator expression), the for-loop runs while values are being generated, also breaking out after 7 iterations. The clear advantage in this case is that foo_gen() yields exactly 13 values – no more than actually needed to complete the for-loop.

Just like the simple all() / any() examples, the potential impact here is huge.
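This laziness also composes with itertools. Here is a sketch (in Python 3 syntax, so // for integer division; counting_gen is my own name) that takes the first 7 halved evens without a manual break, while recording how much of the input was actually consumed:

```python
import itertools

consumed = []

def counting_gen(num):
    # like foo_gen(), but records every value it yields
    for i in range(num):
        consumed.append(i)
        yield i

halves = (n // 2 for n in counting_gen(20) if n % 2 == 0)
first_seven = list(itertools.islice(halves, 7))
# first_seven == [0, 1, 2, 3, 4, 5, 6]
# only 13 of the 20 possible values were consumed
```

islice() pulls values from the generator expression one at a time and stops after the 7th, so the pipeline stays lazy end to end.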

A similar example with string splitting

If the previous example is too made-up for you, consider this example:

def my_split(my_str):
    start_pos = 0
    str_len = len(my_str)
    if str_len == 0:
        yield ''
        return
    end_pos = 0
    while end_pos < str_len:
        if my_str[end_pos] == '\n':
            print 'yielding:', my_str[start_pos:end_pos]
            yield my_str[start_pos:end_pos]
            start_pos = end_pos + 1
        end_pos += 1
    print 'yielding:', my_str[start_pos:end_pos]
    yield my_str[start_pos:end_pos]

my_test_str = 'foo\nbar\nbaz\n'

print 'Iterating over a list'
for part in list(my_split(my_test_str)):
    print 'got', part
    if part == 'bar':
        print 'breaking from loop'
        break

print ''
print 'Iterating over the raw generator'
for part in my_split(my_test_str):

    print 'got', part
    if part == 'bar':
        print 'breaking from loop'
        break

I implemented my own custom string-splitting function so I could embed some prints inside it. Note that the builtin str.split() returns a fully-constructed list, not a generator as I did here.

I hope you agree that this is a realistic example – iterating over a split string, potentially breaking out midway through.

Running it:

Iterating over a list
yielding: foo
yielding: bar
yielding: baz
yielding:
got foo
got bar
breaking from loop

Iterating over the raw generator
yielding: foo
got foo
yielding: bar
got bar
breaking from loop

By now I guess you could have guessed the behavior 🙂 .
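As an aside, the same lazy splitting can be written with less index bookkeeping using str.find() (a sketch in Python 3 syntax; iter_split is my own name, not a stdlib function):

```python
def iter_split(s, sep='\n'):
    # Lazily yield the pieces of s between occurrences of sep,
    # matching my_split()'s behavior, including the trailing ''.
    start = 0
    while True:
        end = s.find(sep, start)
        if end == -1:
            yield s[start:]
            return
        yield s[start:end]
        start = end + len(sep)

pieces = list(iter_split('foo\nbar\nbaz\n'))
# pieces == ['foo', 'bar', 'baz', '']
```

These are the same pieces my_split() yields, in the same order.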

fnmatch.filter returns a list, not a generator!

Consider this final example a Public Service Announcement, warning against problematic behavior of fnmatch.filter():

import fnmatch
my_files = 'a.foo\nb.foo\nc.bar\nd.foo'

print 'using fnmatch.filter'
for filename in fnmatch.filter(my_split(my_files), '*.foo'):
    print 'working on', filename

print ''
print 'using fnmatch.fnmatch with list comprehension'
for filename in [fn for fn in my_split(my_files) if fnmatch.fnmatch(fn, '*.foo')]:
    print 'working on', filename

print ''
print 'using fnmatch.fnmatch with generator comprehension'
for filename in (fn for fn in my_split(my_files) if fnmatch.fnmatch(fn, '*.foo')):
    print 'working on', filename

The output:

using fnmatch.filter
yielding: a.foo
yielding: b.foo
yielding: c.bar
yielding: d.foo
working on a.foo
working on b.foo
working on d.foo

using fnmatch.fnmatch with list comprehension
yielding: a.foo
yielding: b.foo
yielding: c.bar
yielding: d.foo
working on a.foo
working on b.foo
working on d.foo

using fnmatch.fnmatch with generator comprehension
yielding: a.foo
working on a.foo
yielding: b.foo
working on b.foo
yielding: c.bar
yielding: d.foo
working on d.foo

fnmatch.filter() is NOT well-behaved as far as generator inputs are concerned!

It appears that fnmatch.filter() turns a generator into a list internally, and returns a filtered list!

This can be a terribly Bad Thing to do in some cases, like:

  • As in the previous examples, if the for-loop breaks midway through, the filter has still processed all items.
  • This actually happened to me in production code: the iterable passed to fnmatch.filter() made paged API calls inside, so this behavior significantly hurt performance! It artificially increased the load and rate of API calls, and counteracted any attempt to parallelize background I/O with the CPU crunching in the loop body.

My recommendation here is to always prefer the generator-expression form of fnmatch.fnmatch() over fnmatch.filter().
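One way to package that recommendation is a tiny lazy wrapper (a sketch in Python 3 syntax; ifilter_names is my own name, not part of the stdlib):

```python
import fnmatch

def ifilter_names(names, pattern):
    # Lazy stand-in for fnmatch.filter(): yields matches one at a time,
    # consuming the input iterable only as far as the caller iterates.
    for name in names:
        if fnmatch.fnmatch(name, pattern):
            yield name

matched = list(ifilter_names(['a.foo', 'b.foo', 'c.bar', 'd.foo'], '*.foo'))
# matched == ['a.foo', 'b.foo', 'd.foo']
```

Looping over ifilter_names(my_split(my_files), '*.foo') interleaves yielding and working, just like the generator-expression version above.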

Summary

Always prefer generator-based Python expressions over lists!

Got more examples of lists vs. generators? Share them in the comments!

Additional cases of bad behavior similar to fnmatch.filter are especially welcome.
