Shell Foo: Parallelizing Multiple wget Downloads

Got a bunch of files to download? Got an open terminal session? Want to use wget to parallelize the download?

How about this:

echo http://dl.whatever.com/dl/file{1..1000} | xargs -n 1 -P 16 wget -q

Shell-Foo is a series of fun ways to take advantage of the powers of the shell. In the series, I highlight shell one-liners that I found useful or interesting. Most of the entries should work in bash on Linux, OS X, and other UNIX variants. Some probably work with other shells as well. Your mileage may vary.

Feel free to suggest your own Shell-Foo one-liners!

Explanation

The first part (before the pipe) generates the list of files to download. If it’s a simple numbered sequence, as in the example, it’s a piece of cake, but any method will do. In the general case, you could build a plain text file with the URLs to download, one per line, and cat it.
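
For example, here is a rough sketch of the plain-text-file route, using the same numbered-URL pattern from the example and a hypothetical urls.txt as the file name:

# urls.txt is a hypothetical file holding one URL per line
printf 'http://dl.whatever.com/dl/file%s\n' {1..1000} > urls.txt
cat urls.txt | xargs -n 1 -P 16 wget -q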

The xargs part is the fun part!

In its simple form, xargs foo reads whitespace-separated items from stdin and passes them as an argument list to foo. xargs wget would take the URL list and pass it to wget, which would download the files one by one. Maybe wget has some built-in connection reuse, but I’m not sure about it, and don’t want to rely on that.
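
For comparison, this is what the plain, sequential form looks like with the hypothetical urls.txt from above; wget receives the whole URL list as one argument list and works through it one file at a time:

cat urls.txt | xargs wget -q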

xargs accepts the -n N flag, which makes it collect up to N arguments at a time and run the command on each batch, so xargs -n 1 wget runs wget on each URL separately. Add the -P P flag (note the capital P), and xargs does that using a process pool of up to P parallel processes.
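
To see the batching in action, here is a toy run that substitutes echo for wget: six arguments with -n 2 yield three invocations, and -P 3 lets them run in parallel, so the output order may vary between runs.

echo {1..6} | xargs -n 2 -P 3 echo batch: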

I used wget with the -q flag to suppress its default progress report and force it to keep quiet. I used 16 processes, assuming that’s a reasonable number for a machine with 4 CPUs when the processes are network-bound rather than CPU-bound.
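
If you’d rather derive the pool size from the machine than hard-code 16, something like the sketch below works on Linux; nproc is GNU coreutils (on OS X, sysctl -n hw.ncpu reports the CPU count instead), and the factor of 4 is just a rough guess for network-bound work:

# hypothetical urls.txt as before; pool size = 4 x number of CPUs
cat urls.txt | xargs -n 1 -P "$(( $(nproc) * 4 ))" wget -q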

Got tips on improving this? Let me know via the comments!
