Python, the slow of the internet.

Unpopular Opinion: CPython is stupidly slow. CPython is the Python you’re using if you don’t know which Python you use.

Before Go, Python had taken a firm hold of the systems admin coding, and huge amounts of Linux tooling is written in Python.

During the Great Python 3 Migration of 2019, Python libraries bloated with people introducing bidirectional compatibility, generally by just grabbing some 3rd-party libraries to minimize the footprint of change.

I’m not going to rant about people not knowing the standard ‘dis‘ module exists, or they don’t know about timeit/%timeit… It’s not really an “optimization” issue tho.

Today’s Linux admin activities are agonizingly slow because so many Python developers hear adages about not optimizing Python code they think that you never need to worry about it, so they have no idea how expensive some very common practices are.

Sadly, CPython makes no-need-for-performance-thinking untrue in one really unfortunate detail, one detail that has been agonizingly inflated by the bloat of compatibility code:

Function call overhead :(

The code from this post is in a Jupyter notebook in my github, here.

If you want to interact with it (run it for yourself), you can either use an online notebook viewer (e.g https://nbviewer.jupyter.org/), or Visual Studio Code has really nice support for notebooks, now.

The golang example is here.

Let me start with an unpopular comparison. CPython is slow. If you’re really worried about performance of long-running Python processes, you may want to consider pypy or – and this is me saying this – jython.

For many mundane tasks, you might be horrified to find that CPython and cross-platform Powershell Core compete for performance…

Here’s a cheesy, contrived example:

def ridiculous():
   r = 0
   for n in range(100, 200):
     for i in range(1, n*1000):
       r += i
   return r
%timeit ridiculous()
print(ridiculous())  # check value
787 ms ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1159167525000

Or in Powershell:

function ridiculous () {
    $r = 0
    foreach ($n in 100..(200-1)) {
        foreach ($i in 1..(($n * 1000) - 1)) {
            $r += $i
        }
    }
    return $r
}
> measure-command { ridiculous; } | select totalmilliseconds

TotalMilliseconds
-----------------
         657.2722
> ridiculous
1159167525000

But there are no function calls here, and I’m claiming that function calls make Python slow.

class C:
    """A class to demonstrate overhead of invoking methods."""
    def m1(self, a: str, b: str) -> str:
        """The simplest case, just join the strings."""
        return a + ' ' + b

    def m2(self, a: str, b: str) -> str:
        """The overhead of invoking such a simple method."""
        return c.m1(a, b)

    def m3(self) -> str:
        """Eliminate the argument forwarding."""
        return c.m1('hello', 'world')

    def m4(self, a: str, b: str) -> str:
        """Add a second function call to emphasize the function invocation overhead."""
        return c.m2(b, a)

"""Non-method versions of the same."""
def f1(a: str, b: str) -> str:
    return a + ' ' + b

def f2(a: str, b: str) -> str:
    return f1(a, b)

def f3() -> str:
    return f1('hello', 'world')
    
def f4(a: str, b: str) -> str:
    return f2(b, a)

I’ve chosen trivial make-work, because we want something fairly innocuous, something that won’t entirely dwarf the cost of making a function call.

What do I mean by “function call overhead”? Unlike many languages, Python has fancy options for passing arguments that provide the facilities to do things like *args and **kwargs, named parameters, and now the feature that lets you force naming of parameters. There are decorators, which are really just a function call hooked between the caller and the destination function; and there are caveats of Python’s awesome dynamism/monkeypatching.

Ultimately what we want to see here is the difference between adding three strings, vs calling a helper to add them, vs calling a function that happens to use a helper (as would happen if we had a property, which is actually a function call).

# This may take upto a minute to start producing output.
print("Benchmarking, Please Wait...")
c = C()
%timeit -r 21 -n 10000000 c.m1('hello', 'world')
%timeit -r 21 -n 10000000 c.m2('hello', 'world')
%timeit -r 21 -n 10000000 c.m3()
%timeit -r 21 -n 10000000 c.m4('world', 'hello')

Benchmarking, Please Wait...
143 ns ± 2.61 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)
210 ns ± 4.05 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)
199 ns ± 5.64 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)
312 ns ± 6.75 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)
# Non-method, direct function all, versions of the same.
print("Benchmarking, Please Wait...")
%timeit -r 21 -n 10000000 f1('hello', 'world')
%timeit -r 21 -n 10000000 f2('hello', 'world')
%timeit -r 21 -n 10000000 f3()
%timeit -r 21 -n 10000000 f4('world', 'hello')

Benchmarking, Please Wait...
129 ns ± 3.47 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)
186 ns ± 6.24 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)
176 ns ± 4.72 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)
258 ns ± 4.75 ns per loop (mean ± std. dev. of 21 runs, 10000000 loops each)

Fussing over nanoseconds?

My goal here is to demonstrate the relative cost of function calls vs relatively minor operations. I’ve actually seen methods like the one above in a LOT of code (although it’s usually a ‘/’ and not a space), and they get invoked millions, billions of times in a run.

What we really care about is the ratios.

### Punch in the timings from above.
# Measure the overhead of the m4/f4 which uses 2 function calls to add the strings.
print(f"2 fn calls -> direct implementation {float(312) * 100 / 143:6.2f}%")

# Measure the overhead of calling a class-method vs a simple function.
print(f"m1->f1 {float(143) * 100 / 129:6.2f}%")
print(f"m2->f2 {float(210) * 100 / 186:6.2f}%")
print(f"m3->f3 {float(199) * 100 / 176:6.2f}%")
print(f"m4->f4 {float(312) * 100 / 258:6.2f}%")

2 fn calls -> direct implementation 218.18%
m1->f1 110.85%
m2->f2 112.90%
m3->f3 113.07%
m4->f4 120.93%

The m4 implementation spent as much time on the function overhead as it did doing the actual work of joining the strings, and while the example is contrived it’s derived from very real code I’ve seen in the wild:

class Resource:
  def __init__(self, protocol, host, path):
    self._protocol = protocol
    self._host = host
    self._path = path
    self._args = None

  @property
  def protocol(self): return self._protocol
  @property
  def host(self): return self._host
  @property
  def path(self): return self._path
  @property
  def args(self): return self._args or ""
  @property
  def prefix(self): return self.protocol + "://" + self.host

  @property
  def uri(self):
    # Cache the uri when we do produce it.
    return self.prefix + self.path + self.args

  @property
  def uri_test(self):
    for i in range(24):
      self.uri

print(Resource("https", "wiki.python.org", "moin/PythonSpeed/PerformanceTips").uri)


class Versus:
  def __init__(self, protocol, host, path):
    self.protocol = protocol
    self.host = host
    self.path = path
    self.args = None

  def prefix(self):
    return self.protocol + "://" + self.host

  def uri(self):
    return self.protocol + "://" + self.host + "/" + self.path + (self.args or "")

  def uri_test(self):
    for i in range(24):
      self.uri()

print(Versus("https", "wiki.python.org", "moin/PythonSpeed/PerformanceTips").uri())

The timings for this trivial work are kind of shocking:

%timeit -r 21 Resource("https", "wiki.python.org", "moin/PythonSpeed/PerformanceTips").uri_test
%timeit -r 21 Versus("https", "wiki.python.org", "moin/PythonSpeed/PerformanceTips").uri_test()

15.4 µs ± 463 ns per loop (mean ± std. dev. of 21 runs, 100000 loops each)
7.83 µs ± 273 ns per loop (mean ± std. dev. of 21 runs, 100000 loops each)

I’m guessing you probably didn’t expect a couple of dozen string constructions to take over 10 microseconds, did you?

I went out of my way to do it badly in GoLang:

package main

import (
	"fmt"
	"time"
)

type versus struct {
	protocol string
	host string
	path string
	args string
}

func newVersus(prot string, host string, path string) *versus {
    c := versus{protocol: prot, host: host, path: path, args: ""}
	return &c
}

func (v *versus) uri() string {
    return v.protocol + "://" + v.host + "/" + v.path + v.args
}

func (v *versus) testURI() int {
    t := 0
    for i := 0; i < 24; i++ {
	  t += len(v.uri())
	}
	return t
}

func main() {
	var totalTime int64
	const runs int64 = 25000
	t := 0
    for i := int64(0); i < runs; i++ {
      start := time.Now()
      v := newVersus("https", "wiki.python.org", "moin/PythonSpeed/PerformanceTips")
	  t += v.testURI()
	  elapsed := time.Since(start)
	  totalTime += elapsed.Nanoseconds()
	}
	
	fmt.Printf("%d runs took %dns, average %d per loop\n", runs, totalTime, totalTime/runs)
	fmt.Printf("accumulated %d total characters.\n", t)
}

And

25000 runs took 45002000ns, average 1800 per loop
accumulated 33600000 total characters.

“But, of course, go is faster … reasons”

I’m not focusing on complex operations. These are fundamentals. And if we were to start focusing on optimization, well clearly we’d want to start trying to eliminate function calls – it’s a low-hanging fruit.

A quick aside:

“Python 3 now provides format-literals, you should have used those!”

People assume they are an automatic speed gain, but it’s not always true.

def simple_vars(protocol, host, path, args):
    for i in range(24):
        s = protocol + "://" + host + "/" + path + (args or '')

%timeit -r 25 simple_vars("https", "wiki.python.org", "moin/PythonSpeed/PerformanceTips", None)
4.69 µs ± 445 ns per loop (mean ± std. dev. of 25 runs, 100000 loops each)

def f_in_simple_vars(protocol, host, path, args):
    for i in range(24):
        s = f"{protocol}://{host}/{path}{args or None}"

%timeit -r 25 f_in_simple_vars('https', 'wiki.python.org', 'moin/PythonSpeed/PerformanceTips', None)
5.24 µs ± 279 ns per loop (mean ± std. dev. of 25 runs, 100000 loops each)

What can be done?

Python is slow, you shouldn’t be fretting performance in it the majority of the time.

Consider using a different python implementation if you are running long-running processes or cpu-heavy processes. There’s “cython” if you don’t have to sweat distribution.

Remember that python cannot cache member lookups – it can’t trust that invoking them didn’t mutate them.

class Joiner:
    join_char = ','
    def put_together(self, a, b, c):
        self.result = a + self.join_char
        self.result += b + self.join_char
        self.result += c + self.join_char

import dis ; dis.dis(Joiner.put_together)


4           0 LOAD_FAST                1 (a)
              2 LOAD_FAST                0 (self)
              4 LOAD_ATTR                0 (join_char)
              6 BINARY_ADD
              8 LOAD_FAST                0 (self)
             10 STORE_ATTR               1 (result)

  5          12 LOAD_FAST                0 (self)
             14 DUP_TOP
             16 LOAD_ATTR                1 (result)
             18 LOAD_FAST                2 (b)
             20 LOAD_FAST                0 (self)
             22 LOAD_ATTR                0 (join_char)
             24 BINARY_ADD
             26 INPLACE_ADD
             28 ROT_TWO
             30 STORE_ATTR               1 (result)

  6          32 LOAD_FAST                0 (self)
             34 DUP_TOP
             36 LOAD_ATTR                1 (result)
             38 LOAD_FAST                3 (c)
             40 LOAD_FAST                0 (self)
             42 LOAD_ATTR                0 (join_char)
             44 BINARY_ADD
             46 INPLACE_ADD
             48 ROT_TWO
             50 STORE_ATTR               1 (result)
             52 LOAD_CONST               0 (None)
             54 RETURN_VALUE
# see also the repeated lookups of 'result'

Compare the following:

def join_these_things_v1(things, into):
    for thing in things:
      into.things += thing + into.separator

import dis
dis.dis(join_these_things_v1)

2           0 SETUP_LOOP              32 (to 34)
              2 LOAD_FAST                0 (things)
              4 GET_ITER
        >>    6 FOR_ITER                24 (to 32)
              8 STORE_FAST               2 (thing)

  3          10 LOAD_FAST                1 (into)
             12 DUP_TOP
             14 LOAD_ATTR                0 (things)
             16 LOAD_FAST                2 (thing)
             18 LOAD_FAST                1 (into)
             20 LOAD_ATTR                1 (separator)
             22 BINARY_ADD
             24 INPLACE_ADD
             26 ROT_TWO
             28 STORE_ATTR               0 (things)
             30 JUMP_ABSOLUTE            6
        >>   32 POP_BLOCK
        >>   34 LOAD_CONST               0 (None)
             36 RETURN_VALUE
def join_these_things_v2(things, into):
    into.things.extend(thing + into.separator for thing in things)

import dis
dis.dis(join_these_things_v2)

2           0 LOAD_DEREF               0 (into)
              2 LOAD_ATTR                0 (things)
              4 LOAD_METHOD              1 (extend)
              6 LOAD_CLOSURE             0 (into)
              8 BUILD_TUPLE              1
             10 LOAD_CONST               1 (<code object <genexpr> at 0x0000012ECF0DF8A0, file "<ipython-input-95-eed2b012bb4d>", line 2>)
             12 LOAD_CONST               2 ('join_these_things_v2.<locals>.<genexpr>')
             14 MAKE_FUNCTION            8
             16 LOAD_FAST                0 (things)
             18 GET_ITER
             20 CALL_FUNCTION            1
             22 CALL_METHOD              1
             24 POP_TOP
             26 LOAD_CONST               0 (None)
             28 RETURN_VALUE

Disassembly of <code object <genexpr> at 0x0000012ECF0DF8A0, file "<ipython-input-95-eed2b012bb4d>", line 2>:
  2           0 LOAD_FAST                0 (.0)
        >>    2 FOR_ITER                16 (to 20)
              4 STORE_FAST               1 (thing)
              6 LOAD_FAST                1 (thing)
              8 LOAD_DEREF               0 (into)
             10 LOAD_ATTR                0 (separator)
             12 BINARY_ADD
             14 YIELD_VALUE
             16 POP_TOP
             18 JUMP_ABSOLUTE            2
        >>   20 LOAD_CONST               0 (None)
             22 RETURN_VALUE

This small refactor to aggregate data rather than repeating operations on it gives us a quick and easy doubling:

class Test:
    things = []
    separator = ","

things = [str(i) for i in range(2000)]

%timeit join_these_things_v1(things, Test())
%timeit join_these_things_v2(things, Test())

504 µs ± 35.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
210 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Always try to write iterable handling functions, it allows you to avoid being repeatedly invoked:

def non_agg_total_percentage(a, b):
    return a * 100 / b

def non_agg_invoker(values):
    total = 0
    for a, b in values:
        total += non_agg_total_percentage(a, b)
    return total / len(values)

def agg_total_percentage(values):
    return sum((a * 100 / b) for a, b in values)

def agg_invoker(values):
    return agg_total_percentage(values) / len(values)

values = [(a, b) for a, b in zip(range(30000, 90000, 3), range(20000, 140000, 6))]
%timeit non_agg_invoker(values)
%timeit agg_invoker(values)

2.72 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.86 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you’re writing something you expect will regularly repeat a member method thousands or millions of times, consider a simple hoist:

def f1(values):
    for i in range(10000):
        values.append(i)

def f2(values):
    append = values.append
    for i in range(10000):
        append(i)

%timeit f1([])
%timeit f2([])

527 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
379 µs ± 7.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And would it really hurt you to compile your regexes occasionally? I know sometimes it clutters, but when you’re going to be matching it against every file name on a filesystem or every row in a database, in a template or yaml parser…

import re
creditsPattern = r"^.*? ([+-]?\d+) [Cc][Rr]"
creditsRe = re.compile(creditsPattern)
str1 = "Lorem ipsum, unfortunately I don't match. +123 Coins"
str2 = "Well +123 Charlie, I do match because I have -321 cRs."

def s1():
    for i in range(300000):
        m = re.match(creditsPattern, str1)
        m = re.match(creditsPattern, str2)

def s2():
    for i in range(300000):
        m = creditsRe.match(str1)
        m = creditsRe.match(str2)

def s3():
    creditsMatch = creditsRe.match
    for i in range(300000):
        m = creditsMatch(str1)
        m = creditsMatch(str2)

%timeit s1()
%timeit s2()
%timeit s3()

1.09 s ± 70.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
789 ms ± 7.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
769 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Leave a Reply

Name and email address are required. Your email address will not be published.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <pre> <q cite=""> <s> <strike> <strong> 

%d bloggers like this: