My last post accused Python of being The Slow of the Internet, not because Python is bad but because bad Python is awful.
In many cases, Python is really not slow for the reasons you think it is.
Python is a great glue language and a terrific scripting language, because it provides fantastic facilities for manipulating large amounts of data. The terrible language that makes our day-to-day lives slower and more miserable is actually anti-Python.
There are two sides to the Python problem: non-engineers using it to write runtime descriptions of data manipulations performed by non-Python backends, and engineers writing it as a thin exposure of their non-Python backends.
Between the two groups, nobody is really here for Python.
Non-engineers write their scripts in terms they understand: be prepared for code such as
import re
import subprocess
def work(files, pattern):
    matches = []
    for filename in files:
        if subprocess.call(f"grep -q BEGIN {filename}".split(" ")):
            for line in open(filename).read().split("\n"):
                if re.search(pattern, line):
                    matches.append(filename)
                    break
    return matches
# ...
def consumer(...):
    if len(work(list_of_350_000_files, r"\bEND\b")) > 1:
        return Success
    raise RuntimeError("No matches")
(based on actual code)
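For contrast, the in-process equivalent is barely any longer and doesn't spawn a grep for every file. This is just my own minimal sketch, reusing the hypothetical names from the snippet above and assuming the intent was to only search files that contain BEGIN:

import re

def work(files, pattern):
    # Sketch: skip files that don't contain "BEGIN", then look for the
    # pattern, all without forking a subprocess for every file.
    compiled = re.compile(pattern)
    matches = []
    for filename in files:
        with open(filename) as fh:
            text = fh.read()
        if "BEGIN" not in text:
            continue
        if any(compiled.search(line) for line in text.splitlines()):
            matches.append(filename)
    return matches

Same shape, one read per file, and no 350,000 process launches.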
Meanwhile…
Engineers write their scripts in terms they understand, aka the other language. Be prepared for code such as:
# Tools/Allocator/allocator.py
# this module provides fast allocation of large numbers of objects.
def Allocate(className, count, *args, **kwargs):
    array = [None] * count
    array[:] = (className(*args, **kwargs) for element_no in range(count))
    return array

# Types/Data/data.py
import random

index_no = 0

class Data:
    def __init__(self):
        self.index_no = -1
        self.seed = 0
        self._data = None
        self._saved = False

    def generate(self):
        self._data = data_generator_fn(self.index_no, self.seed)
        self._saved = False

    @property
    def data(self):
        return self._data

    @property
    def saved(self):
        return self._saved

    def save(self):
        if not self.saved:
            with open("%07x-%07x.data", "wb") as out:
                out.write(self.data)
            self._saved = True

def init(data, scope=1e6):
    global index_no
    data.index_no = index_no
    index_no += 1
    data.seed = random.uniform(0, scope)

# main.py
import random
import time

import Tools.Allocator
import Types.Data

def main(args):
    # allocate 1e6 Data elements quickly
    array = Tools.Allocator.Allocate(Types.Data.Data, 1e6, arg1, debug=False, dry_run=True)
    random.seed(time.time_ns())
    map(Types.Data.init, array)
    for data in array:
        data.generate()
        data.save()
There’s a lot going on in this one, which is, again, based on actual code.
I was pretty surprised by this one, because it showed some sensible awareness of performance. Then it found a pile of crap and danced in it, face down. It was written in a crunch under great pressure and the author just hadn’t made the mental switch. He’d started out with a Pythonic anti-pattern, then applied a non-Python set of optimizations to it and incrementally made things worse and worse.
He’d also taken a C++-like top-down approach to it: this code is going to be fairly trivial, it’ll do this and that, so I’ll need this function and this data, and then … so he’d written most of it before he ever tried any of it.
I have the author’s consent to show this modified version after I introduced him to Jupyter. He’s not a Python programmer, but he does have to get his hands dirty with it sometimes, and he’s found working interactively with Python a much better way to build what he needs; using Jupyter from Visual Studio Code is awesome sauce.
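His actual modified version isn't reproduced here, but to give a flavour of the direction it took, here is my own minimal sketch of the same job with the allocator, the preallocated list, and the global counter stripped out (the generator function is passed in, since, as above, it lives elsewhere):

import random
import time

def generate_all(data_generator_fn, count=1_000_000, scope=1e6):
    # Sketch: generate each item and write it out as we go; no Allocate(),
    # no list of a million placeholder objects, no module-level index counter.
    random.seed(time.time_ns())
    for index_no in range(count):
        seed = random.uniform(0, scope)
        data = data_generator_fn(index_no, seed)
        with open("%07x-%07x.data" % (index_no, int(seed)), "wb") as out:
            out.write(data)

The point isn't this exact shape; it's that working interactively makes it obvious very quickly when a layer of machinery isn't paying for itself.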
Back to my original claim that Python is the slow of the internet: You’d be amazed at what parts of your daily thumb-twiddling are the result of things being run in Python. But they’re not slow because they’re written in Python, they’re slow because their code was written in Bad Python.
They’re not slow because they were written in Python; they’re not-fast because they were written in Python, but their actual outright painful slowness is generally the result of people saying
What I am doing is slow, but Python is slow, so it is slow because it is Python and I don’t need to worry about it.
— Everyone who used Python, ever.
I hope she won’t mind, but I once highlighted this to a friend in code they were responsible for, knowing there were two years of git history behind it:
def to_str(s):
    return str("%s" % str(s))
def string_repr(array):
    return " ".join(to_str(a) for a in array)
# ...
def singular_invocation_site_in_code_hotpath(something):
    if "OK" in something.string_repr(something.value):
        return True
    ...
Yes, str(s) at one point had returned something that was not a string (None). The rest had happened in baby steps of refactoring and normalization (and tests people felt ok eliding if the extra wrap was there to make it seem ok).
I put together a Jupyter Notebook to show how this went:
Optimizations in Python (github.com)
The key results are these:

The last function, f4, and the last test each eliminate a function call, and what we’re seeing is a clear demonstration that it costs about 40ns to invoke a Python function on this machine.
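If you want to sanity-check that number on your own machine, a quick timeit comparison (my own sketch, not the notebook's code) shows the same pattern: each extra layer of wrapping adds a roughly fixed per-call cost.

import timeit

def direct(s):
    return "OK" in s

def wrapped(s):
    return "OK" in str(s)

def double_wrapped(s):
    return "OK" in str("%s" % str(s))

line = "a config line that does not contain the magic word"
for fn in (direct, wrapped, double_wrapped):
    # Best of five runs of a million calls, reported as nanoseconds per call.
    per_call = min(timeit.repeat(lambda: fn(line), number=1_000_000)) / 1_000_000
    print(f"{fn.__name__}: {per_call * 1e9:.0f}ns")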
But 40ns is nothing
This single piece I’ve chosen to highlight is a trivial 40ns operation that was part of a tool used as a regular part of minute-to-minute activities by developers company-wide. It was part of the startup scan of around 620,000 config files containing an average of roughly 300 significant lines, that is, lines for which this condition was tested.
That’s 7 seconds of startup time spent looking for “OK” in lines of files.
Except it was spending 320ns on that trivial 40ns operation, adding nearly a minute to the startup time of the tool.
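The arithmetic is worth spelling out, because the numbers above fall straight out of it:

files = 620_000
lines_per_file = 300
checks = files * lines_per_file      # 186,000,000 checks per startup

print(checks * 40e-9)     # ~7.4 seconds at 40ns per check
print(checks * 320e-9)    # ~59.5 seconds at 320ns per check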
But this has all just been a distraction…
The real problem was that the comparison only needed doing on the first line of a file, and only if that line did not begin with a ‘#’.
Not only were we spending 8x more CPU cycles per operation, we were doing it 300x more often than we needed to by simple line count, and 186,000,000 times more often than we needed once you learn that there was only a single file which had ‘OK’ on its first line, and it was in the word “BROKEN”…
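For completeness, had the check still been needed at all, the cheap version is something like this (a sketch; I'm assuming the caller hands over an open file object):

def first_line_is_ok(fh):
    # Only the first line matters, and comment lines are skipped entirely,
    # so startswith() short-circuits most files before the "in" test runs.
    first = fh.readline()
    if first.startswith("#"):
        return False
    return "OK" in first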
The original use case for the feature had gone away. But people just expected Python to be slow, so they accepted that it took 5-15 minutes for the tool to start up, because it was written in Python.
The original usage was this:
if "OK" in to_str(s) and not s.beginswith("#"):
Recorded for posterity in the review comments was:
R: If you do the startswith first it can short-circuit out most files, and if we know it’s a string do you need to call a conversion function?
A: I think it reads better and if we’re worrying about performance we shouldn’t be doing it in Python