Shell Foo: Simple Distributed grep on a GCE Cluster

By Itamar Ostricher, Wednesday, March 4, 2015

You just finished running a distributed multi-phase pipeline on a cluster of 200 GCE VMs. Good for you! But something doesn’t look right, and you’d like to investigate by grepping the local log files across all nodes and doing something with the results. How would you do that?

Let’s say that if you had only one or two machines, you would ssh into each one and run this command:

$ grep ERROR /tmp/*.log.* >errors

This post describes a simple one-liner that scales that local grep up to the whole cluster.

Shell-Foo credit for this one: Eyal Fink.

Shell-Foo is a series of fun ways to take advantage of the powers of the shell. In the series, I highlight shell one-liners that I found useful or interesting. Most of the entries should work on bash on Linux, OS X and other UNIX-variants. Some probably work with other shells as well. Your mileage may vary.

Feel free to suggest your own Shell-Foo one-liners!

The solution

The solution assumes a simple naming pattern for the cluster VMs, e.g. “node-$num” with $num = 1..200. If that’s not the case, the same solution can be adapted using a text file with the names of the VMs, one per line.

ssh into one GCE VM and run this:

$ seq 200 | xargs -I nodeid -n 1 -P 8 gcutil ssh node-nodeid "grep ERROR /tmp/*.log.* | gsutil -q cp - gs://my-bucket/grep-result-nodeid"

Once this completes (might take a while), you have all grep results in a Google Cloud Storage bucket that you can process further any way you like! It’s magic!
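
For clusters without a numeric naming pattern, the text-file adaptation mentioned above might look roughly like the sketch below. The function name dgrep_nodes, the file name nodes.txt and the SSH_CMD override are all made up for illustration; only the xargs, gcutil ssh, grep and gsutil pieces come from the one-liner itself.

```shell
# dgrep_nodes FILE: run the distributed grep on every VM named in FILE
# (one VM name per line) instead of generating names with seq.
# SSH_CMD is a hypothetical override hook; it defaults to "gcutil ssh"
# and is left unquoted on purpose so that bash splits it into two words.
dgrep_nodes() {
  xargs -I nodename -P 8 ${SSH_CMD:-gcutil ssh} nodename \
    "grep ERROR /tmp/*.log.* | gsutil -q cp - gs://my-bucket/grep-result-nodename" \
    < "$1"
}

# Usage (assumes nodes.txt exists in the current directory):
#   dgrep_nodes nodes.txt
```

Running it as SSH_CMD=echo dgrep_nodes nodes.txt should print the remote command for each node instead of executing anything, which is a cheap way to check the substitution before touching 200 machines.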


Explanation

seq 200: write numbers 1, 2, …, 200 to STDOUT, one per line.
I explained xargs with the “-n” and “-P” flags, and later the “-I” flag, in earlier posts. Essentially, it runs up to 8 processes at a time, replacing every occurrence of nodeid in the command with the current number, so that gcutil ssh node-$n "grep ERROR /tmp/*.log.* | gsutil -q cp - gs://my-bucket/grep-result-$n" runs for every $n in the range [1,2,…,200].
gcutil ssh node-name command is the Google Compute Engine counterpart of ssh node-address command: it uses ssh to run command on the remote machine node-name. In practice, gcutil ssh is a fancy wrapper for ssh. Everything inside the double quotes (grep ... | gsutil ...) is the command to run on the remote machine, so both the glob and the pipe are evaluated there, not on the initiating machine.
I hope the grep command requires no explanation.
The output of grep (matching lines) is piped to the input of gsutil, which performs a copy (cp) from STDIN (using - as the first argument) to some Google Cloud Storage object (using the gs://bucket-name/object-name scheme). The -q flag just makes the command quiet, so there’s less data to send back to the initiating machine.
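
To see the substitution described above without touching any VM, one option is a dry run that puts echo in front of the command. This is just a sketch: it uses 3 nodes instead of 200 and leaves out -P so that the output order is predictable.

```shell
# Dry run: xargs runs "echo ..." once per input line, so each generated
# command is printed instead of executed. Nothing is sent to any VM.
# Note that echo drops the double quotes around the remote command.
seq 3 | xargs -I nodeid echo gcutil ssh node-nodeid \
  "grep ERROR /tmp/*.log.* | gsutil -q cp - gs://my-bucket/grep-result-nodeid"

# Prints:
# gcutil ssh node-1 grep ERROR /tmp/*.log.* | gsutil -q cp - gs://my-bucket/grep-result-1
# gcutil ssh node-2 grep ERROR /tmp/*.log.* | gsutil -q cp - gs://my-bucket/grep-result-2
# gcutil ssh node-3 grep ERROR /tmp/*.log.* | gsutil -q cp - gs://my-bucket/grep-result-3
```

Both occurrences of nodeid (in the VM name and in the object name) are replaced, and the glob stays unexpanded because xargs runs echo directly, without a shell. On newer Cloud SDK installs where gcutil is no longer available, gcloud compute ssh node-1 --command "..." should play the same role, though I have not adapted the one-liner to it here.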

That’s it. Naturally, this is a quick and simple way to do something like this. If your main objective is to process lots of text in a distributed system, you might want to consider something less blunt, like Apache Hadoop, or Apache Spark.

Bonus points if you can come up with a similar thing that works with AWS instead of Google Cloud Platform 🙂