Manipulating Python os.walk Recursion

Python's os.walk() function generates the file names and sub-directory names in a directory tree by walking the tree. For each directory in the tree, it yields a 3-tuple (dirpath, dirnames, filenames).
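To make the shape of those tuples concrete, here is a minimal sketch that builds a small throwaway tree and records what os.walk() yields for it. The helper names (`build_demo_tree`, `walk_snapshot`) and the file and directory names are made up for illustration.

```python
import os
import tempfile


def build_demo_tree(top):
    """Create top/a/x.txt, top/b/ and top/readme.txt (hypothetical names)."""
    os.makedirs(os.path.join(top, "a"))
    os.makedirs(os.path.join(top, "b"))
    for rel in ("readme.txt", os.path.join("a", "x.txt")):
        with open(os.path.join(top, rel), "w") as handle:
            handle.write("demo\n")


def walk_snapshot(top):
    """Return sorted (relative dirpath, dirnames, filenames) tuples."""
    snapshot = []
    for dirpath, dirnames, filenames in os.walk(top):
        rel = os.path.relpath(dirpath, top)
        # Copy and sort the lists so the result does not depend on
        # the order in which the file system reports entries.
        snapshot.append((rel, sorted(dirnames), sorted(filenames)))
    return sorted(snapshot)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        build_demo_tree(top)
        for entry in walk_snapshot(top):
            print(entry)
```

Each printed entry is one directory: its path relative to the starting point, the names of its immediate sub-directories, and the names of its files.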

It is not well-known that you can modify dirnames in the body of the os.walk() loop to manipulate the recursion!

I’ve seen programmers avoid os.walk() and hack together their own version of it using recursive calls to os.listdir(), with various path manipulations along the way. Rarely was the programmer unfamiliar with os.walk(). More often than not, the reason was that the programmer wanted more control over the recursion. Had the programmer known that os.walk() allows exactly that, she would probably have used it and saved time and sweat!

This specific feature is well documented in the Python os.walk docs. Seeing how under-used it is, I wanted to highlight it here, hoping it will serve someone out there 🙂 .


The case for manipulating directory tree recursion

Why would anyone want to manipulate the dir-tree recursion, anyway?

In fact, there are multiple valid reasons to do that! (also mentioned in the Python docs, by the way)

  1. Prune the directory tree being traversed, skipping specific sub-trees.
  2. Impose a specific order of visiting sub-directories.
  3. Tell os.walk() about directories that were created during the iteration.
  4. Tell os.walk() about directories that were renamed during the iteration.

Cool! How do I do it?

Just edit the dirnames list in-place, in the body of the loop!

For example, suppose you walk the tree starting at the current directory, like this:

import os

for dirpath, dirnames, filenames in os.walk('.'):
    print(dirpath, dirnames, filenames)

You can do any of these manipulations to change the behavior of the walk:

# Walk sub-directories in reverse order
for dirpath, dirnames, filenames in os.walk('.', topdown=True):
    dirnames.reverse()
    print(dirpath, dirnames, filenames)

# Prune the ".git" directory
for dirpath, dirnames, filenames in os.walk('.', topdown=True):
    # Slice assignment replaces the contents of the list that os.walk() reads
    dirnames[:] = [dirname for dirname in dirnames if dirname != '.git']
    print(dirpath, dirnames, filenames)

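# Skip directories that contain a file named "foo", and everything below them
for dirpath, dirnames, filenames in os.walk('.', topdown=True):
    if 'foo' in filenames:
        # Empty the list in place; a plain "del dirnames" would only
        # unbind the local name and leave the walk unchanged
        del dirnames[:]
        continue
    print(dirpath, dirnames, filenames)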
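The examples above cover pruning and ordering. The remaining reason from the list, telling os.walk() about a directory created during the walk, can be sketched in the same style. The function name `walk_with_created_dir` and the directory name `generated` are hypothetical, chosen only for this illustration.

```python
import os
import tempfile


def walk_with_created_dir(top, new_name="generated"):
    """Create top/new_name while visiting top, then let os.walk() descend into it."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        visited.append(os.path.relpath(dirpath, top))
        if dirpath == top and new_name not in dirnames:
            os.mkdir(os.path.join(top, new_name))
            # os.walk() listed the directory before we created new_name,
            # so it only learns about it if we add the name in place.
            dirnames.append(new_name)
    return visited


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "a"))
        print(walk_with_created_dir(top))
```

Without the `dirnames.append(...)` line the new directory would still exist on disk, but this walk would never visit it.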

You get the idea.
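One pitfall is worth spelling out: the edit has to change the list object that os.walk() handed you. Assigning a new list to the name dirnames changes nothing as far as the walk is concerned. The sketch below contrasts the two; the function names and the directory names in the demo are assumptions made for illustration.

```python
import os
import tempfile


def visit_rebinding(top, skip):
    """Buggy variant: rebinding the name leaves os.walk()'s list untouched."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        visited.append(os.path.relpath(dirpath, top))
        dirnames = [d for d in dirnames if d != skip]  # new list, no pruning
    return sorted(visited)


def visit_in_place(top, skip):
    """Working variant: slice assignment edits the list os.walk() reads."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        visited.append(os.path.relpath(dirpath, top))
        dirnames[:] = [d for d in dirnames if d != skip]
    return sorted(visited)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        os.makedirs(os.path.join(top, "skipme", "inner"))
        os.mkdir(os.path.join(top, "keep"))
        print("rebinding:", visit_rebinding(top, "skipme"))
        print("in place: ", visit_in_place(top, "skipme"))
```

The first function still visits skipme and everything under it; the second one does not. Other in-place operations, such as dirnames.remove(name), dirnames.sort() or del dirnames[:], work for the same reason.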

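It should be emphasized that this is effective only when topdown=True (the default)! With topdown=False, the tuple for a directory is yielded only after all of its sub-directories have been walked, so by then it is too late for changes to dirnames to matter. 🙂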
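A small experiment can confirm this. The sketch below applies the same ".git" pruning in both modes and records which directories are visited; the helper name `visited_dirs` and the demo tree are assumptions made for illustration.

```python
import os
import tempfile


def visited_dirs(top, topdown, skip=".git"):
    """Try to prune `skip` in place and return the relative paths visited."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=topdown):
        # In bottom-up mode this edit is harmless but has no effect:
        # the sub-directories were walked before this tuple was yielded.
        dirnames[:] = [d for d in dirnames if d != skip]
        visited.append(os.path.relpath(dirpath, top))
    return visited


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        os.makedirs(os.path.join(top, ".git", "objects"))
        os.mkdir(os.path.join(top, "src"))
        print("top-down: ", visited_dirs(top, topdown=True))
        print("bottom-up:", visited_dirs(top, topdown=False))
```

In top-down mode the ".git" sub-tree is skipped; in bottom-up mode it is still visited, and the starting directory comes last.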

Use Python os.walk with recursion manipulation to get the control you need
4 Comments
  • Jon Slavin
    September 3, 2015

    Thanks for this piece. However, I’m having trouble understanding how the pruning actually works. For example, I tried:
    for root, dirs, files in os.walk(TOPDIR, topdown=True):
        if len(dirs) > 1:
            dirs.sort()
            dirs = [dirs[-1]]  # or dirs = dirs[-1], doesn't seem to matter
        print('root =', root)

    This is for a directory tree with 4 levels of subdirectories below TOPDIR. I thought this would lead to printing the directories for the greatest sorted value directory name at each level (which correspond to year/month/day/hour) but instead it appears that the entire directory tree is printed. Can you explain why?

    Thanks,
    Jon

    • Itamar Ostricher
      September 3, 2015

      Thank you for the feedback, Jon!

      I think I can explain what’s going on with your example. The problem is in dirs = [dirs[-1]]. This doesn’t modify the list that os.walk() uses to decide where to recurse; it only creates a new one-element list and binds it to the name “dirs”. To actually modify the list, you need to use slice assignment: dirs[:] = [dirs[-1]].

      Can you try it and update here if this indeed was the issue?
