Shell Foo: Merging Common Files In a Directory

Say you have two directories with a bunch of text files. How can you create a third directory that contains every file whose name appears in both input directories, such that every output file is a concatenation of the two input files with that name?

Shell-Foo is a series of fun ways to take advantage of the powers of the shell. In the series, I highlight shell one-liners that I found useful or interesting. Most of the entries should work on bash on Linux, OS X and other UNIX-variants. Some probably work with other shells as well. Your mileage may vary.

Feel free to suggest your own Shell-Foo one-liners!

My solution

comm -12 <(ls input1/) <(ls input2/) | \
    xargs -n 1 -P 8 -I fname sh -c \
    'cat input1/fname input2/fname >combined/fname'

Explanation

  1. comm takes two files, and prints the lines from the files in 3 columns:
    • The first column contains lines that appear in the first file, but not in the second.
    • The second column contains lines that appear in the second file, but not in the first.
    • The third column contains lines that appear in both files.
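  2. comm -12 suppresses the first two columns and prints just the third, resulting in a list of common lines. comm expects both inputs to be sorted, which the output of ls is by default.
  3. <(command) takes the output of command and “wraps” it as a file. It’s a neat bash shortcut for using temporary files like this: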
    ls input1/ >input1.ls
    ls input2/ >input2.ls
    comm -12 input1.ls input2.ls
    rm input1.ls input2.ls
      
  4. I wrote about xargs (with the -n and -P flags) before. The -I replstr flag tells xargs to run the command once per input line, replacing occurrences of “replstr” in the arguments list with that line from stdin. On BSD and OS X xargs, replacement happens in up to 5 arguments by default, and this can be controlled using the -R number flag. GNU xargs has no -R flag and applies no such limit.
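To see the pieces above in isolation, here is a minimal sketch that runs comm and xargs -I on made-up data. The scratch directory and the file names (left, right, a.txt and so on) are hypothetical stand-ins for the two ls listings, and plain files are used instead of <(...) so the snippet also runs in shells without process substitution.

```shell
# Scratch directory so nothing here touches real data.
demo=$(mktemp -d)
printf 'a.txt\nb.txt\nc.txt\n' >"$demo/left"    # stands in for: ls input1/
printf 'b.txt\nc.txt\nd.txt\n' >"$demo/right"   # stands in for: ls input2/

# All three columns: only-left, only-right, both (tab-indented).
comm "$demo/left" "$demo/right"

# -1 and -2 suppress the first two columns, leaving only the common lines.
common=$(comm -12 "$demo/left" "$demo/right")
printf '%s\n' "$common"

# -I runs the command once per input line, substituting the line for "fname".
echoed=$(printf '%s\n' "$common" | xargs -I fname echo "input1/fname input2/fname")
printf '%s\n' "$echoed"
```

Only b.txt and c.txt appear in both listings, so those are the two names that reach xargs.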

In case you’re wondering why I wrapped the entire command in sh -c '...', it’s because I want to redirect the output of every command separately, as opposed to redirecting outputs from all commands together to one file. To make this clearer, consider the “intuitive alternative”: xargs -n 1 -P 8 -I fname cat input1/fname input2/fname >combined/fname. This will run cat as expected, but the redirection is handled once by the calling shell, before xargs even starts, so “fname” in the target is never replaced. The result is that the output of all the cat commands ends up in a single file literally named combined/fname (possibly interleaved, since the commands run in parallel).
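The difference between the two redirections can be checked with a small experiment. This is only a sketch on throwaway data: the outer and inner directories and the file contents are invented for the demonstration, and -n and -P are left out to keep the output deterministic.

```shell
# Build two tiny input directories in a scratch location.
demo=$(mktemp -d)
mkdir "$demo/input1" "$demo/input2" "$demo/outer" "$demo/inner"
for name in b.txt c.txt; do
    echo "one $name" >"$demo/input1/$name"
    echo "two $name" >"$demo/input2/$name"
done
cd "$demo" || exit 1

# Redirection outside sh -c: the calling shell opens outer/fname once,
# before xargs runs, so "fname" is never substituted in the target.
printf 'b.txt\nc.txt\n' | xargs -I fname cat input1/fname input2/fname >outer/fname

# Redirection inside sh -c: each child shell opens its own target file.
printf 'b.txt\nc.txt\n' | xargs -I fname sh -c 'cat input1/fname input2/fname >inner/fname'

ls outer   # a single file literally named "fname"
ls inner   # one output file per input name
```

The outer directory ends up with one file holding all four lines, while inner gets a separate two-line file for each name.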

Extra tips

This can be easily generalized to any “combining function” (cat in this case). For example, to get a sorted combined file:

comm -12 <(ls input1/) <(ls input2/) | \
    xargs -n 1 -P 8 -I fname sh -c \
    'cat input1/fname input2/fname | sort >combined/fname'

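This is not portable to plain POSIX sh, because it relies on process substitution (<(command)) to feed the output of the two ls commands to comm. It needs bash or another shell that supports the syntax, such as zsh or ksh.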
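If process substitution is not available, one possible alternative is a plain loop. Note that this is a different technique from the one-liner above: it checks for each file of the first directory whether a file with the same name exists in the second, instead of using comm, and it runs sequentially rather than in parallel. The scratch directory and file contents below are hypothetical, for illustration only.

```shell
# Throwaway input directories: only b.txt exists in both.
demo=$(mktemp -d)
mkdir "$demo/input1" "$demo/input2" "$demo/combined"
echo "only in 1" >"$demo/input1/a.txt"
echo "from 1"    >"$demo/input1/b.txt"
echo "from 2"    >"$demo/input2/b.txt"
echo "only in 2" >"$demo/input2/c.txt"

for path in "$demo"/input1/*; do
    name=${path##*/}                        # strip the directory part
    if [ -f "$demo/input2/$name" ]; then    # the name exists in both directories
        cat "$path" "$demo/input2/$name" >"$demo/combined/$name"
    fi
done

ls "$demo/combined"
```

Because the file name is always passed as a quoted variable, this version also copes with names that contain spaces.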
