Pro-tip: Write Python like Python

My last post accused Python of being The Slow of the Internet, not because Python is bad but because bad Python is awful.

In many cases, Python is really not slow for the reasons you think it is

Python is a great glue language and a terrific scripting language, because it provides fantastic facilities for manipulating data in bulk. The terrible language that makes our day-to-day lives slower and more miserable is actually anti-Python.

There are two sides to the Python problem: non-engineers using it to write runtime descriptions of data manipulations performed by non-Python backends, and engineers writing it as an extension of their non-Python backends.

Between the two groups, nobody is really here for Python.

Non-engineers write their scripts in terms they understand. Be prepared for code such as:

import re
import subprocess

def work(files, pattern):
    matches = []
    for filename in files:
        if subprocess.call(f"grep -q BEGIN {filename}".split(" ")):
            for line in open(filename).read().split("\n"):
                if re.search(pattern, line):
                    matches.append(filename)
                    break
    return matches

# ...

def consumer(...):
    if len(work(list_of_350_000_files, r"\bEND\b")) > 1:
        return Success
    raise RuntimeError("No matches")

(based on actual code)
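
For contrast, here is a sketch of what the same job looks like if you let Python do the work instead of shelling out to grep for every file. This is my own sketch (keeping the apparent intent of only scanning files that contain BEGIN), not the fixed production code:

import re

def work(files, pattern):
    regex = re.compile(pattern)
    matches = []
    for filename in files:
        with open(filename) as f:
            text = f.read()
        # one read of the file replaces the per-file `grep -q BEGIN` subprocess
        # and the manual read().split("\n") loop
        if "BEGIN" in text and regex.search(text):
            matches.append(filename)
    return matches

No subprocesses, no reading each file twice, and the pattern is compiled once up front.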

Meanwhile…

Engineers write their scripts in terms they understand, aka the other language. Be prepared for code such as:

# Tools/Allocator/allocator.py
# this module provides fast allocation of large numbers of objects.
def Allocate(className, count, *args, **kwargs):
    array = [None] * count
    array[:] = (className(*args, **kwargs) for element_no in range(count))
    return array


# Types/Data/data.py
import random

index_no = 0


class Data:
    def __init__(self):
        self.index_no = -1
        self.seed = 0
        self._data = None
        self._saved = False

    def generate(self):
        self._data = data_generator_fn(self.index_no, self.seed)
        self._saved = False

    @property
    def data(self):
        return self._data

    @property
    def saved(self):
        return self._saved

    def save(self):
        if not self.saved:
            with open("%07x-%07x.data", "wb") as out:
                out.write(self.data)
            self._saved = True

def init(data, scope=1e6):
    global index_no
    data.index_no = index_no
    index_no += 1
    data.seed = random.uniform(0, scope)

# main.py
import random
import time

import Tools.Allocator
import Types.Data

def main(args):
    # allocate 1e6 Data elements quickly
    array = Tools.Allocator.Allocate(Types.Data.Data, 1e6, arg1, debug=False, dry_run=True)

    random.seed(time.time_ns())
    map(Types.Data.init, array)

    for data in array:
        data.generate()
        data.save()

There’s a lot going on in this one, which is, again, based on actual code.

I was pretty surprised by this one, because it showed some sensible awareness of performance. Then it found a pile of crap and danced in it, face down. It was written in a crunch under great pressure, and the author just hadn’t made the mental switch: he’d started out with a Pythonic anti-pattern and then applied a non-Python set of optimizations to it, incrementally making things worse and worse.

He’d also taken a C++-like top-down approach to it: “this code is going to be fairly trivial, it’ll do this and that, so I’ll need this function and this data, and then…” So he’d written most of it before he ever tried any of it.

I have the author’s consent to show this modified version after I introduced him to Jupyter. He’s not a Python programmer, but he does have to get his hands dirty with it sometimes, and he’s found working interactively with Python a much better way to build what he needs; using Jupyter from Visual Studio Code is awesome sauce.
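
For contrast with the code above, the idiomatic direction is little more than a single loop. This is my own sketch, not his rewrite, and data_generator_fn is a hypothetical stand-in for the real (elided) generator:

import os
import random
import time

def data_generator_fn(index_no, seed):
    # hypothetical stand-in for the real generator, which is elided above
    return os.urandom(64)

def main():
    random.seed(time.time_ns())
    for index_no in range(1_000_000):
        seed = random.uniform(0, 1e6)
        data = data_generator_fn(index_no, seed)
        with open("%07x-%07x.data" % (index_no, int(seed)), "wb") as out:
            out.write(data)

No allocator, no module-level counter, no two-pass init over a million objects: each piece of data exists only for as long as it takes to write it out.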

Back to my original claim that Python is The Slow of the Internet: you’d be amazed at how much of your daily thumb-twiddling is the result of things being run in Python. But they’re not slow because they’re written in Python; they’re slow because their code was written in Bad Python.

They’re not slow because they were written in Python; they’re not-fast because they were written in Python, but their actual, outright painful slowness is generally the result of people saying:

What I am doing is slow, but Python is slow, so it is slow because it is Python and I don’t need to worry about it.

— Everyone who used Python, ever.

I hope she won’t mind, but I once highlighted this to a friend in code she was responsible for. Know that there were two years of git history behind it:

def to_str(s):
    return str("%s" % str(s))

def string_repr(array):
    return " ".join(to_str(a) for a in array)

# ...

def singular_invocation_site_in_code_hotpath(something):
    if "OK" in string_repr(something.value):
        return True
    ...

Yes, at one point str(s) had returned something that was not a string (None). The rest had happened in baby steps of refactoring and normalization (and tests people felt OK skipping if the extra wrap was there to make it seem OK).
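
Unwound by hand (my reduction, not the notebook’s code), the whole chain collapses to almost nothing:

def singular_invocation_site_in_code_hotpath(something):
    # to_str(s) == str("%s" % str(s)) == str(str(s)) == str(s), so string_repr
    # is just a space-join of str() over the values
    if "OK" in " ".join(str(a) for a in something.value):
        return True
    ...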

I put together a Jupyter Notebook to show how this went:

Optimizations in Python (github.com)

The key result is this:

The last function, f4, and the last test each eliminate a function call, and what we’re seeing is a clear demonstration that it costs about 40ns to invoke a Python function on this machine.
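
If you want to reproduce that sort of number yourself, a rough timeit sketch (mine, not the notebook’s exact cells) looks like this; the difference between the two timings is the cost of the extra call frame:

import timeit

def direct(s):
    return str(s)

def wrapped(s):
    return direct(s)   # identical result, one extra call frame

n = 1_000_000
t_direct = timeit.timeit(lambda: direct("x"), number=n)
t_wrapped = timeit.timeit(lambda: wrapped("x"), number=n)
print("extra call costs ~%.0f ns" % ((t_wrapped - t_direct) / n * 1e9))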

But 40ns is nothing

This single piece I’ve chosen to highlight is a trivial 40ns operation that was part of a tool used minute-to-minute by developers company-wide. It was part of the startup scan of around 620,000 config files containing an average of roughly 300 significant lines each; that is, lines for which this condition was tested.

That’s 7 seconds of startup time spent looking for “OK” in lines of files.

Except it was spending 320ns on that trivial 40ns operation, adding nearly a minute to the startup time of the tool.
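
The arithmetic behind those two numbers is simple enough to spell out (using the figures above):

files, lines_per_file = 620_000, 300
checks = files * lines_per_file    # 186,000,000 "OK" checks per startup
print(checks * 40e-9)              # ~7.4 s  at the trivial 40ns cost
print(checks * 320e-9)             # ~59.5 s at the actual 320ns cost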

But this has all just been a distraction…

The real problem was that the comparison only needed doing on the first line of a file, and only if that line did not begin with a ‘#’.

Not only were we spending 8x more CPU cycles per operation, we were doing it 300x more often than we needed to by simple line count, and 186,000,000 times more often than we needed once you learn that there was only a single file with ‘OK’ on its first line, and it was in the word “BROKEN”…

The original use case for the feature had gone away. But people just expected Python to be slow, so they accepted that it took 5-15 minutes for the tool to start up, because it was written in Python.

The original usage was this:

if "OK" in to_str(s) and not s.beginswith("#"):

Recorded for posterity in the review commits was:

R: If you do the startswith first it can short-circuit out most files, and if we know it’s a string do you need to call a conversion function?
A: I think it reads better and if we’re worrying about performance we shouldn’t be doing it in Python
