Running a Time-Limited Subprocess In Python (concurrency caveats inside!)

Tuesday, January 13, 2015

I tried to write a “simple” Python function. All it had to do was run a command line in a subprocess and enforce a timeout on that subprocess.

Turns out that this “simple” goal is not so simple. I considered multiple approaches to solve this, and ran into interesting concurrency issues.

In this post I describe a solution based on Python subprocess and various threading constructs. I demonstrate a concurrency inconsistency I discovered when testing my solution on Linux and OS X.

The conclusion, if that’s all you’re looking for, is that timer.is_alive() is not a safe way to test whether a timer has expired!

Note: My experience is based on Python 2.7 with the default subprocess module from the standard library. If you’re on Python 3.3+, a timeout argument was added to subprocess. You can also install python-subprocess32, which brings this joy to Python 2.4-2.7.
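To make that note concrete, here is a minimal sketch of the 3.3+ API (using sys.executable instead of an external sleep binary, so it runs anywhere). Note that communicate(timeout=...) raises the exception but does not kill the child by itself:

```python
import subprocess
import sys

# Spawn a child that would run for 10 seconds if left alone
proc = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(10)'])
try:
    # Raises subprocess.TimeoutExpired after 1 second
    proc.communicate(timeout=1.0)
except subprocess.TimeoutExpired:
    # Killing the child is still our job, then reap it to avoid a zombie
    proc.kill()
    proc.communicate()
```

After this runs, proc.returncode is non-zero (on POSIX, the negative signal number of the kill).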

Running a time-limited subprocess in Python – much concurrency fun!

Problem statement

To be clear, the function I want to implement is this:

def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes."""
    # Pseudo-code:
    #   run cmd in subprocess
    #   if subprocess completed naturally - return its exit code
    #   otherwise, if timeout expired - raise an exception

To be even clearer, here’s a short unittest program that tests for the expected behavior:

import unittest

class SubprocessTimeoutError(RuntimeError):
    """Error class for subprocess timeout."""
    pass

class CommandWithTimeoutTest(unittest.TestCase):
    def test_natural_success(self):
        """Try one that should complete naturally and successfully"""
        self.assertEqual(0, run_command_with_timeout(['sleep', '0.1'], 1.0))

    def test_natural_failure(self):
        """Try one that should complete naturally with an error code"""
        self.assertEqual(1, run_command_with_timeout(['sleep', '-w00t'], 1.0))

    def test_timeout(self):
        """Try one that should be killed after the timeout"""
        with self.assertRaises(SubprocessTimeoutError):
            run_command_with_timeout(['sleep', '10'], 1.0)

if '__main__' == __name__:
    unittest.main()

First attempt

An intuitive approach to solve this may be:

  1. Use Python subprocess to run the command in a subprocess.
  2. Use Python threading.Timer to enforce the timeout.

Here’s an implementation based on this approach:

import subprocess
from threading import Timer

def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes."""
    proc = subprocess.Popen(cmd)
    timer = Timer(timeout_sec, proc.kill)
    timer.start()
    proc.communicate()
    if timer.is_alive():
        # Process completed naturally - cancel timer and return exit code
        timer.cancel()
        return proc.returncode
    # Process killed by timer - raise exception
    raise SubprocessTimeoutError('Process #%d killed after %f seconds' % (proc.pid, timeout_sec))

Seems straightforward, doesn’t it?

I start the timer before calling the communicate method. The call to communicate() returns only after the process is finished (for whatever reason). If the timer expires before the process finishes naturally, proc.kill() is called, killing the process, and making the communicate() call return.

Once communicate() returns, I need to check why. To do that, I use the timer.is_alive() method to check if the timer is still running. If it is, it means proc.kill() wasn’t called, and the process exited naturally. In that case, I cancel the timer (so it doesn’t kill a non-existing process) and return the returncode. If the timer is not alive, it means proc.kill() was called, so I raise the Timeout error.
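To foreshadow the conclusion a bit: the one call that does guarantee the timer thread is finished is join(). A tiny demonstration (not part of the solution above, just an illustration of the contrast):

```python
from threading import Timer

fired = []
timer = Timer(0.05, lambda: fired.append(True))
timer.start()
# Right after the callback runs, is_alive() may report True or False -
# the docs only promise it flips "just after" run() terminates.
timer.join()  # returns only once the timer thread has fully terminated
assert not timer.is_alive()
assert fired == [True]
```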

Good enough?

Running the tests on my Mac:

itamar@legolas misc $ ./timercaveat.py
usage: sleep seconds
...
----------------------------------------------------------------------
Ran 3 tests in 1.122s

OK

Looks promising.

Not good enough!

Running on my Linux VM:

itamar@ubuntu-vm misc>./timercaveat.py
sleep: invalid option -- 'w'
Try 'sleep --help' for more information.
..F
======================================================================
FAIL: test_timeout (__main__.CommandWithTimeoutTest)
Try one that should be killed after the timeout
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./timercaveat.py", line 39, in test_timeout
    run_command_with_timeout(['sleep', '10'], 1.0)
AssertionError: SubprocessTimeoutError not raised

----------------------------------------------------------------------
Ran 3 tests in 1.112s

FAILED (failures=1)

Damn. That’s not good.

What happened?

Writing concurrent code is a bitch. When multithreaded code breaks, it usually happens in non-deterministic ways. Oddly enough, this issue seems to be deterministic but inconsistent across the two systems I’m using. Running the tests hundreds of times, I always get the same behavior.

Looking closer at the timer.is_alive() method:

This method returns True just before the `run()` method starts until just after the `run()` method terminates.

What do you mean just before and just after?! Can I get more specific guarantees please??

Apparently I can’t, so I’ll test it myself. To do that, I add some debug prints:


import subprocess
from threading import Timer
from time import sleep

def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes."""
    proc = subprocess.Popen(cmd)
    timer = Timer(timeout_sec, proc.kill)
    timer.start()
    proc.communicate()
    print 'timer alive?', timer.is_alive()
    sleep(0.001)
    print 'and now?', timer.is_alive()
    if timer.is_alive():
        # Process completed naturally - cancel timer and return exit code
        timer.cancel()
        return proc.returncode
    # Process killed by timer - raise exception
    raise SubprocessTimeoutError('Process #%d killed after %f seconds' % (proc.pid, timeout_sec))

Running just the problematic test:

On my Mac:

itamar@legolas misc $ ./timercaveat.py CommandWithTimeoutTest.test_timeout
timer alive? False
and now? False
.
----------------------------------------------------------------------
Ran 1 test in 1.009s

OK

On my Linux VM:

itamar@ubuntu-vm misc>./timercaveat.py CommandWithTimeoutTest.test_timeout
timer alive? True
and now? False
.
----------------------------------------------------------------------
Ran 1 test in 1.005s

OK

Aha! The is_alive() method changes to False “just after” the timer terminates, with “just after” being different across the two systems!

Pinpointing the problem?

It seems my code has a race condition between proc.communicate() and timer.is_alive(), when the process is killed by the timer.

On my Mac, the timer kills the process and sets is_alive to False before returning control to the main thread. This makes the test behave correctly.

On my Linux VM, the control returns to the main thread immediately after the timer kills the process, but before setting is_alive to False. This makes the test fail, unless the added sleep gives the timer a chance to change is_alive.

I found it hard to believe that the “consistent inconsistency” is a result of the OS (Mac vs. Linux). Thinking about it further, I realized it might be a result of the number of CPU cores.

On the Mac, with 4 cores, the timer can kill the process and change is_alive on one core while the main thread returns from proc.communicate() on another core. This means there’s a race condition between the two threads running on two cores, and for some reason the thread that changes the value of is_alive consistently wins the race.

On my Linux VM I allocated just 1 core. This means that when the timer expires, it will call proc.kill(), which will context switch (because kill is a syscall). Once the kernel finishes handling the syscall, it sees two threads in the ready state – the timer thread and the main thread. For some reason, it consistently schedules the main thread (maybe because the timer thread was the last to run). Since this happens on a single CPU core, the timer thread is still waiting (not yet changing is_alive) while the main thread calls is_alive() and gets the wrong value. This is the same race condition, but running on 1 CPU core makes the timer thread consistently lose the race.

To test this theory, I created a new Linux VM with 2 CPU cores, and ran the same test:

itamar@ubuntu-vm misc>grep -c ^processor /proc/cpuinfo
2
itamar@ubuntu-vm misc>./timercaveat.py CommandWithTimeoutTest.test_timeout
timer alive? False
and now? False
.
----------------------------------------------------------------------
Ran 1 test in 1.005s

OK

Theory confirmed! On the 2-core Linux VM the behavior is consistent with the Mac!
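As an aside, the core count that the grep reads from /proc/cpuinfo can also be queried portably from Python, which is handy when reproducing this experiment on different machines:

```python
import multiprocessing

# Portable equivalent of `grep -c ^processor /proc/cpuinfo`
cores = multiprocessing.cpu_count()
print(cores)
```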

OK, but you still have a race condition!

You may claim that this whole thing is pointless, because I had a real race condition in my original solution. You’d be right, of course. But it’s still interesting to learn from mistakes of others, isn’t it? 🙂

To keep this post from getting too long, I’ll just show you two different fixes – one that introduces other race conditions, and one that solves it (hopefully).

A fix that still has race conditions


import subprocess
from threading import Timer

def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes."""
    proc = subprocess.Popen(cmd)
    def kill_proc():
        timer.expired = True
        proc.kill()
    timer = Timer(timeout_sec, kill_proc)
    timer.expired = False
    timer.start()
    proc.communicate()
    if timer.expired:
        # Process killed by timer - raise exception
        raise SubprocessTimeoutError('Process #%d killed after %f seconds' % (proc.pid, timeout_sec))
    # Process completed naturally - cancel timer and return exit code
    timer.cancel()
    return proc.returncode

The threading.Timer object doesn’t really have an expired attribute. I artificially inject this attribute, to use it as a flag for myself. In effect, this replaces is_alive() with expired that I explicitly set before killing the process.
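A side note on the injected attribute: threading.Event is the more idiomatic construct for this kind of cross-thread flag (set() in the timer thread, is_set() in the main thread) – though swapping it in would not remove any of the races in question. A sketch of just the flag mechanics:

```python
from threading import Event

expired = Event()             # thread-safe flag object
assert not expired.is_set()   # initial state, like `timer.expired = False`
expired.set()                 # what kill_proc would do before proc.kill()
assert expired.is_set()
```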

Can you spot all the race conditions in this so-called solution? Try to, in the comments! 🙂

A fix that actually fixes (I hope)

import subprocess
from threading import Thread

def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes."""
    proc = subprocess.Popen(cmd)
    proc_thread = Thread(target=proc.communicate)
    proc_thread.start()
    proc_thread.join(timeout_sec)
    if proc_thread.is_alive():
        # Process still running - kill it and raise timeout error
        try:
            proc.kill()
        except OSError:
            # The process finished between the `is_alive()` and `kill()`
            return proc.returncode
        # OK, the process was definitely killed
        raise SubprocessTimeoutError('Process #%d killed after %f seconds' % (proc.pid, timeout_sec))
    # Process completed naturally - return exit code
    return proc.returncode

You might want to refactor this a bit to allow for proper use of communicate() – with support for read/write buffers.
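For illustration, here’s one way that refactor might look – a sketch of my own (the function name and return shape are hypothetical, not from the post) that captures stdout through the worker thread, and uses sys.executable so it doesn’t depend on external binaries:

```python
import subprocess
import sys
from threading import Thread

class SubprocessTimeoutError(RuntimeError):
    pass

def run_with_timeout_capturing(cmd, timeout_sec):
    """Variant of the fix above that also returns captured stdout."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    result = {}

    def target():
        # communicate() drains the pipe, so a chatty child can't deadlock
        # on a full stdout buffer
        result['stdout'], _ = proc.communicate()

    proc_thread = Thread(target=target)
    proc_thread.start()
    proc_thread.join(timeout_sec)
    if proc_thread.is_alive():
        try:
            proc.kill()
        except OSError:
            # The process finished between the `is_alive()` and `kill()`
            return proc.returncode, result.get('stdout')
        proc_thread.join()  # let the worker reap the killed child
        raise SubprocessTimeoutError(
            'Process #%d killed after %f seconds' % (proc.pid, timeout_sec))
    return proc.returncode, result['stdout']

# Natural completion: exit code 0 plus the captured output
code, out = run_with_timeout_capturing(
    [sys.executable, '-c', 'print("hi")'], 5.0)
```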

This solution still has a race condition, but I deal with it using a try-except block around the proc.kill().

If you catch more corner cases, I’d love to hear about it!

(tip: try adding a test of the form self.assertEqual(0, run_command_with_timeout(['sleep', '1'], X)), where X is 1.0, 1.1, 0.9, 0.99, 0.999, etc.)

Another possible approach

One could argue that all this concurrency-fu is overkill for this scenario, and suggest doing this instead:

def run_command_with_timeout(cmd, timeout_sec):
    # TIMEOUT_RETCODE is the exit status the wrapper uses to signal expiry
    # (124 for GNU coreutils `timeout`)
    proc = subprocess.Popen(['timeout', str(timeout_sec)] + cmd)
    proc.communicate()
    if proc.returncode == TIMEOUT_RETCODE:
        raise SubprocessTimeoutError('Process #%d killed after %f seconds' % (proc.pid, timeout_sec))
    return proc.returncode

This approach uses the OS timeout command to run the requested command with a time limit.

I opted not to pursue this approach for two main reasons:

  1. It assumes that there is a timeout command in the underlying OS, which is not always true.
  2. It assumes that TIMEOUT_RETCODE is not a valid return code of the requested command, which is not generally true.
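Point 2 is easy to demonstrate: GNU timeout signals expiry with exit status 124, and nothing stops the wrapped command from exiting with 124 on its own. A sketch of the ambiguity (using sys.executable, so no timeout binary is needed):

```python
import subprocess
import sys

TIMEOUT_RETCODE = 124  # exit status GNU coreutils `timeout` uses on expiry

# A perfectly healthy command that happens to exit with status 124 on its
# own - the wrapper approach would misreport this as a timeout.
code = subprocess.call([sys.executable, '-c', 'raise SystemExit(124)'])
assert code == TIMEOUT_RETCODE
```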

Summary

I just wanted to run a command from Python in a subprocess with a timeout.

Sounds like something that should be simple, but I found out it isn’t.

I demonstrated a straightforward solution based on Python’s threading.Timer, and showed how it is prone to concurrency issues. The improved solution, based on a dedicated Python thread, should do better.

I’d love to hear about even better solutions, or more problems with the solution I presented! If you have something to say, go ahead and say it in the comments 🙂 .

3 Comments
  • MabraFoo
    November 5, 2015

    Excellent post. This is a tough problem that I only figured out with stack overflow.

    http://stackoverflow.com/questions/1191374/using-module-subprocess-with-timeout

    import subprocess, shlex
    from threading import Timer
    
    def run(cmd, timeout_sec):
      proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, 
        stderr=subprocess.PIPE)
      kill_proc = lambda p: p.kill()
      timer = Timer(timeout_sec, kill_proc, [proc])
      try:
        timer.start()
        stdout,stderr = proc.communicate()
      finally:
        timer.cancel()
    
    run("long.sh",60) #run my long.sh command with 60 limit
    
    • Itamar Ostricher
      November 5, 2015

      Thanks!

      Indeed, a tough problem to tackle.

      This is a nice solution. It’s one of the variants of the solution that the timer kills the subprocess instead of communicating with the subprocess.

      I personally didn’t like these variants, because it creates another race condition between the natural completion of the subprocess and the cancellation of the timer, in which the timer may expire and try to kill a non-existing process.

      In practice, I’ve been using the subprocess32 package to take advantage of the back-ported timeout feature from Python 3.2, instead of all these custom solutions 🙂

      • Itamar Ostricher
        November 5, 2015

        actually, I have the same issue in my solution. I addressed it by wrapping the kill in a try-except block, so if the process is already gone when the kill is called, nothing bad happens. this can be applied here too!
