Get Safe Paths From Arbitrary Strings In Python

By Itamar Ostricher, Wednesday, March 30, 2016

Sometimes all you want to do with an arbitrary string is use it to create a file or a directory. Really, that’s all. Nothing too special about it, right?

Alas! This is the root of all evil!

Arbitrary strings are dangerous, and should be handled with the utmost care, as if they were explosives, or Frank Underwood’s new liver! (sorry)

Wait, but, how exactly are they to be handled? And why should you reimplement this apparently basic, but practically risky, functionality every time you need it?

This is exactly what my ostrich.utils.text.get_safe_path() OstrichLib function set out to solve once and for all 🙂

It’s already available in OstrichLib as of release v0.0. It’s also released to PyPI, meaning you can get it now with pip install ostrichlib. It’s tested (using Travis CI) against Python 2 & 3, and requires only the future library as an external dependency (which makes everyone happier with Python 2 / Python 3 compatibility). Detailed library documentation is available via Read the Docs. Hurray!

I would love to get some review from others for my solution, given the risky nature of the problem.

Using arbitrary strings as filenames? Hmmm…

What does it do?

Since it’s pretty straightforward, a demo should do:

itamar@legolas 20:37:06 ~>docker run -it --rm python:3.5 bash
root@292a5eeed6e9:/# pip install --upgrade pip ipython
Collecting pip
  Downloading pip-8.1.1-py2.py3-none-any.whl (1.2MB)
...
Successfully installed ipython-4.1.2 ipython-genutils-0.1.0 path.py-8.1.2 pexpect-4.0.1 pickleshare-0.6 pip-8.1.1 ptyprocess-0.5.1 setuptools-20.3.1 simplegeneric-0.8.1 traitlets-4.2.1 pip-8.1.1
root@292a5eeed6e9:/# pip install ostrichlib
Collecting ostrichlib
  Downloading ostrichlib-0.0.0.dev3-py2.py3-none-any.whl
Installing collected packages: ostrichlib
Successfully installed ostrichlib-0.0.0.dev3
root@292a5eeed6e9:/# ipython
Python 3.5.0 (default, Sep 14 2015, 20:19:17)
Type "copyright", "credits" or "license" for more information.

...

In [1]: from ostrich.utils.text import get_safe_path

In [2]: get_safe_path('foo.bar')
Out[2]: 'foo.bar'

In [3]: get_safe_path('a/b/c/../foo.bar')
Out[3]: 'a_b_c_.._foo.bar'

In [4]: get_safe_path('<1!2:3@4.{5}-6_7(8)9=0>')
Out[4]: '_1_2_3_4._5_-6_7_8_9=0_'

In [5]: get_safe_path("let's dö söme funky Ünicöde? Yeäh!")
Out[5]: 'let_s_do__so_me_funky_U_nico_de__Yea_h_'

In [6]: len(get_safe_path(''.join('a' for _ in range(5000))))
Out[6]: 255

In [7]: get_safe_path(' foo/bar.baz') == get_safe_path('foo$bar.baz ')
Out[7]: True

Essentially, it takes your shiny string, and replaces everything that’s not in a whitelist with underscores. The whitelist includes alphanumeric characters, -, ., and = – which are arbitrary characters I decided should be allowed. Surrounding spaces are stripped before conversions.

Some extra nifty things it does:

Unicode NFKD normalization before conversion. This means that evil Unicode characters that have ASCII look-alikes will be replaced with their ASCII counterparts! (see example 5 above)
Truncating the resulting string at 255 characters, which is the standard entry name length limit on most common filesystems. (see example 6 above)
Making sure that the resulting string isn’t empty, or just a bunch of dots (.) (which may cause path traversal). Such a condition will trigger an exception.
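To see why example 5 ends up with an underscore after each accented letter, it helps to look at what NFKD does on its own. The snippet below uses only the standard library's unicodedata module; the helper name nfkd is made up for illustration, and the string literals assume Python 3.

```python
import unicodedata


def nfkd(text):
    """Return the NFKD (compatibility decomposition) form of text."""
    return unicodedata.normalize('NFKD', text)


# 'ö' (U+00F6) decomposes into a plain 'o' followed by a combining mark.
decomposed = nfkd('\u00f6')
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER O', 'COMBINING DIAERESIS']

# Compatibility characters such as the 'fi' ligature (U+FB01) become plain ASCII.
print(nfkd('\ufb01'))
# fi
```

The combining mark is not in the whitelist, so it is presumably what turns into the underscore in 'do_' and 'so_me'.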

See the tests file for more.

This is all accomplished with very little code, as you can see in the source file (which contains more documentation than code), while being compatible with both Python 2 and Python 3 (as far as I can tell at least; that’s what Travis CI says).
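For a feel of how little code the described steps need, here is a minimal sketch. It is not the OstrichLib source: the function name sketch_safe_path, the exact regular expression, the ASCII-only whitelist, the order of the steps, and the choice of ValueError are all assumptions made so that the output matches the demo above.

```python
import re
import unicodedata

MAX_LEN = 255  # entry name length limit mentioned in the post

# Assumption: ASCII letters and digits, plus '-', '.', '=' and '_' itself.
_NOT_ALLOWED = re.compile(r'[^A-Za-z0-9.=_-]')


def sketch_safe_path(text, max_len=MAX_LEN):
    """Hypothetical re-implementation of the steps described in the post."""
    # Strip surrounding whitespace, then decompose look-alike characters.
    normalized = unicodedata.normalize('NFKD', text.strip())
    # Replace everything outside the whitelist with an underscore, then truncate.
    safe = _NOT_ALLOWED.sub('_', normalized)[:max_len]
    # Reject results that are empty or consist only of dots ('.', '..', ...).
    if not safe.strip('.'):
        raise ValueError('no safe path for %r' % (text,))
    return safe


print(sketch_safe_path('a/b/c/../foo.bar'))
# a_b_c_.._foo.bar
```

Note that a lone '..' is rejected, while '..' surrounded by other characters (as in the output above) is harmless because the slashes around it are gone.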

OstrichLib provides a utility Python function for converting an arbitrary string to a safe path part, weeeee

Discussion

As I wrote above, I am very interested in reviews from others, as I want this to be bulletproof.

Beyond that, it is important to understand that this function is still stupid. It takes a string, and returns another string. It has no awareness of files and directories, and does not want to be aware of such things. The fact that a string is safe to use as a filename does not mean that you should write a file there, or create a directory there. That’s an application decision. Do you want to first check for existence? Create parent directories? Guarantee uniqueness somehow? This is not in the scope of this function – it’s your responsibility.
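As one illustration of such an application decision, the sketch below creates parent directories and refuses to overwrite an existing file. The helper name create_new_file is hypothetical and not part of OstrichLib, and the code assumes Python 3.3 or later (for exist_ok and the 'x' open mode), so it does not carry over to Python 2 as written.

```python
import os


def create_new_file(parent_dir, safe_name, data):
    """Hypothetical helper: write data to parent_dir/safe_name, never overwriting."""
    # Application decision 1: create missing parent directories.
    os.makedirs(parent_dir, exist_ok=True)
    path = os.path.join(parent_dir, safe_name)
    # Application decision 2: 'x' mode raises FileExistsError if the path exists,
    # instead of silently replacing another input's file.
    with open(path, 'x') as handle:
        handle.write(data)
    return path
```

A different application might prefer to overwrite, or to pick a new name on conflict; the point is only that this logic lives outside the string conversion.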

Specifically, regarding uniqueness, it is important to understand that these conversions make it so that multiple input strings may result in the same safe output string, as you can see in example 7 above. Whether this is an issue or not depends, again, on the application.

If your application handles truly arbitrary strings, and needs to map them to truly unique paths in the same namespace (e.g. parent directory), then I can think of two possible approaches:

Use my function, but manage a mapping in your application from the safe name to the original name. If the function returns a result that’s already in that mapping, but is for a different original name, then append '.%d' % n to the safe name for increasing values of n, until the modified safe name is “available”. You can probably wrap this in a decorator to simplify the calling code. It’s like a smartass cache for the function result.
Why bother with these conversions at all? Just hash the crap out of the input strings!

root@7c65e882b3c1:/# ipython
Python 3.5.0 (default, Sep 14 2015, 20:19:17)
Type "copyright", "credits" or "license" for more information.

...

In [1]: from hashlib import sha1

In [2]: sha1('../foo/../bar.baz?!'.encode('utf8')).hexdigest()
Out[2]: 'a810bb4a5694ef259780b1be1fcb7776dc8d090e'

To hash or not to hash. That’s a silly question. Guess what. It depends!

The downsides of the hashing approach, as I see them:

You completely lose “visual” relation between the input and output strings. With the normalization approach, the output string usually looks like the input string.
While collisions are rare, they can happen. Do you want to rely on chance? Or will you implement the safety mechanism I described anyway?
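The mapping-based safety mechanism from the first approach could look roughly like the sketch below. Everything here is illustrative: make_unique_namer and stand_in_sanitizer are made-up names, the sanitizer is a simplified stand-in so the snippet runs without OstrichLib installed, and the mapping lives only in memory.

```python
import re


def stand_in_sanitizer(text):
    """Simplified stand-in for a real sanitizer such as get_safe_path."""
    return re.sub(r'[^A-Za-z0-9.=_-]', '_', text.strip())


def make_unique_namer(to_safe):
    """Wrap a sanitizer so that distinct originals never share a safe name."""
    safe_by_original = {}  # original string -> safe name assigned to it
    taken = set()          # safe names already handed out

    def unique_name(original):
        # Same original as before: return the name it already got.
        if original in safe_by_original:
            return safe_by_original[original]
        base = to_safe(original)
        candidate, n = base, 0
        # Append '.%d' % n for increasing n until the name is available.
        while candidate in taken:
            n += 1
            candidate = base + '.%d' % n
        taken.add(candidate)
        safe_by_original[original] = candidate
        return candidate

    return unique_name


namer = make_unique_namer(stand_in_sanitizer)
print(namer('foo/bar.baz'))  # foo_bar.baz
print(namer('foo$bar.baz'))  # foo_bar.baz.1
print(namer('foo/bar.baz'))  # foo_bar.baz
```

Two caveats with this sketch: the suffix can push a name past the 255 character limit, and the mapping has to be persisted somewhere if names must stay stable across runs.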