
High-Performance web apps in Python - Diagnosing performance issues

June 30, 2021 · 10 min read

Python code running slow? We show you how to get going fast!
Tim Armstrong, Founder, Consultant Engineer, Expert in Performance & Security

Let’s get something straight: Microservices, CDNs, Caches, Horizontal-Scaling, and Async are not going to make your web app/website faster! They are tools and bandaids, not cures. Taking paracetamol might reduce the pain of a toothache, but sooner or later you’re going to need to go to the dentist to fix the root cause of the pain.

Now, I’m not saying that any of that list of buzzwords won’t help at all. But, as Developers / Programmers / Engineers (or whatever label you prefer), we love to jump on the newest thing that promises us the world. As a result, we sometimes forget that these are just tools and not the “solutions” that they’re pitched as.

In this article, we’re going to take a look at how to use profiling to identify bottlenecks in web apps, using a practical example from the benchmark code of a popular Web Framework benchmark site.

Okay, so how do you diagnose and fix performance problems?

Well, that kind of depends.

A prominent case of misattribution, where there was no attempt to diagnose the underlying cause, is FastAPI's TechEmpower benchmarks. These unfortunately flawed benchmarks are even being used as a marketing point:

Fast: Very high performance, on par with NodeJS and Go (thanks to Starlette and Pydantic). One of the fastest Python frameworks available.

So let’s take a look at the code for TechEmpower’s Flask benchmark and compare that to their FastAPI benchmark and see if we can’t find the issue.

If you glance through these examples (at the time of writing I'm looking at #2afc55d), it's very clear that this isn't even remotely a fair comparison, as I'm sure most of you can see.

So let’s pull out a single example for comparison:

FastAPI - Single SQL(RAW):

async def single_database_query():
    row_id = randint(1, 10000)
    async with connection_pool.acquire() as connection:
        number = await connection.fetchval(READ_ROW_SQL, row_id)
    return UJSONResponse({'id': row_id, 'randomNumber': number})

Flask - Single SQL(RAW):

def get_random_world_single_raw():
    connection = dbraw_engine.connect()
    wid = randint(1, 10000)
    result = connection.execute("SELECT * FROM world WHERE id = " + str(wid)).fetchone()
    worlds = {'id': result[0], 'randomNumber': result[1]}
    connection.close()
    return json_response(worlds)

Immediately, one key difference stands out in these snippets of Python code, and it's a problem throughout the rest of these code bases. The FastAPI code is simply acquiring an existing database connection from an existing connection pool, whereas, for some inexplicable reason (one so bad that it almost looks like intentional sabotage, although it was added 8 years ago by someone seemingly copying from StackOverflow, so…), the Flask code is forced to create a brand-new connection each time.

Not only that, but the way the Flask version has been written is not safe: if the code fails after opening the connection, for any reason, we leave behind an orphaned connection, slowing down the database server and risking a “Too many open files” exception.
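For reference, here's a minimal sketch of the same snippet with guaranteed cleanup, assuming the same dbraw_engine and json_response helpers as the benchmark (SQLAlchemy connections can be used as context managers, so the connection is released even if the query raises):

def get_random_world_single_raw():
    wid = randint(1, 10000)
    # The context manager closes the connection even if execute() raises
    with dbraw_engine.connect() as connection:
        result = connection.execute("SELECT * FROM world WHERE id = " + str(wid)).fetchone()
    return json_response({'id': result[0], 'randomNumber': result[1]})

This doesn't fix the real problem (a new connection is still opened on every call), but at least nothing gets orphaned when something goes wrong.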

Profiling code

While profiling might seem unnecessary in this case, given the glaring issues, following the process through as we make our improvements reveals some things that might surprise you.
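If you want to follow along, one way to collect per-line timings like the ones below is line_profiler's kernprof (a sketch, not necessarily the exact tooling used here; the script name is a placeholder, and the @profile decorator is injected by kernprof at runtime):

# profile_flask_raw.py (placeholder name)
@profile  # defined by kernprof when run with -l; not a normal import
def get_random_world_single_raw():
    ...  # the benchmark function under test

if __name__ == '__main__':
    for _ in range(10_000):
        get_random_world_single_raw()

Running “kernprof -l -v profile_flask_raw.py” then prints hits, total time, and per-hit cost for every line of the decorated function.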

So, profiling the database code from the Flask version as it stands (Note: to remove network latency as a factor, for now, I've switched to using SQLite3):

Profile of code found in TechEmpower’s Flask Benchmark

We see that 10k runs take 1636ms in total, of which 514ms is spent opening the connection, 243ms closing the connection, and only 773ms actually executing the query.

This is probably in line with what a lot of readers are expecting, as it’s common knowledge at this point that this is not something you should do, but the sheer extent of the impact can be a surprise nonetheless.

Since we're looking at the Raw endpoint here, let's start by ditching SQLAlchemy in favour of something with fewer layers of abstraction (in this case the native sqlite3 module), and move to keeping a single connection alive rather than opening a new one every time the function runs.
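Roughly speaking, that improved version looks like this (a sketch: the database filename is a placeholder, and the world table layout is assumed to match the benchmark's):

import sqlite3
from random import randint

READ_ROW_SQL = "SELECT id, randomNumber FROM world WHERE id = ?"

# One connection, opened once at startup and reused for every call
connection = sqlite3.connect("hello_world.db")

def get_random_world_single_raw():
    wid = randint(1, 10000)
    result = connection.execute(READ_ROW_SQL, (wid,)).fetchone()
    return {'id': result[0], 'randomNumber': result[1]}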

Profile of improved code for TechEmpower’s Flask Benchmark

We now see that for 10k runs it takes only 70ms in total, of which 44ms is executing the SQL, 11ms is spent on randint, and 7ms is spent on fetching the result of the SQL query.

So then let's bring Async into the picture, as this is the popular thing to do these days. To do this, we'll take the native SQLite3 code that used connection sharing, swap SQLite3 for aiosqlite, and then make the necessary adjustments to run it as an async operation.
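The async equivalent ends up looking something like this (a sketch; aiosqlite mirrors the sqlite3 API, just awaited):

import asyncio
from random import randint
import aiosqlite

READ_ROW_SQL = "SELECT id, randomNumber FROM world WHERE id = ?"

async def get_random_world_single_raw(connection):
    wid = randint(1, 10000)
    async with connection.execute(READ_ROW_SQL, (wid,)) as cursor:
        result = await cursor.fetchone()
    return {'id': result[0], 'randomNumber': result[1]}

async def main():
    # Same idea as before: one shared connection for all 10k runs
    async with aiosqlite.connect("hello_world.db") as connection:
        for _ in range(10_000):
            await get_random_world_single_raw(connection)

asyncio.run(main())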

Profile of code equivalent to that found in TechEmpower’s FastAPI Benchmark

Okay, so the async version took almost 4x longer than our sync version, why is that? Well, there are two elements here:

  1. This query is incredibly simple
  2. There is no network or server latency because we’re using SQLite.

So, this is arguably providing an unfair advantage to the sync code, as there is no external resource to wait for. That's a fair argument, so let's fix it: we'll swap over to PostgreSQL for both and use a separate VM as a database server.
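On the sync side, that means swapping sqlite3 for psycopg2 while keeping the shared connection (a sketch; the hostname and credentials are placeholders for whatever the test bench actually uses):

import psycopg2
from random import randint

READ_ROW_SQL = "SELECT id, randomNumber FROM world WHERE id = %s"

# One long-lived connection to the separate database VM
connection = psycopg2.connect(host="db-server", dbname="hello_world",
                              user="benchmarkdbuser", password="benchmarkdbpass")

def get_random_world_single_raw():
    wid = randint(1, 10000)
    with connection.cursor() as cursor:  # the cursor is closed on exit
        cursor.execute(READ_ROW_SQL, (wid,))
        result = cursor.fetchone()
    return {'id': result[0], 'randomNumber': result[1]}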

Looking first at the improved sync version (now using Postgres):

Profile of improved code for TechEmpower’s Flask Benchmark - Using Postgres

We see it took 650ms, of which 623ms was spent in the execute method alone. Not bad, but that's a lot of waiting, and it brings the total execution time for our test program to 655ms. It seems pretty obvious, then, that handing over control to a second executor of some kind while we wait would shave off some time.

So let’s look at an Async version:

Profile of code equivalent to that found in TechEmpower’s FastAPI Benchmark - Using Postgres

Okay, so that shaved around 100ms off our function, which now took 549ms, of which 234ms was spent in the fetchval method (the equivalent of execute).

However, it's not quite that simple: looking more closely, we see that this method now only consumes 22.1% of the total execution time, which is now an absolutely massive 2425ms! This means that while we're spending less time executing our actual code, we're spending considerably more time managing its execution.

Note: You may have noticed that the “hit” numbers have increased, making it appear as though I've run it more than once to bias the results. This is not the case; it is an artefact of using Async, which messes with the way the profiler counts “Hits”. But don't take my word for it: all of the code is open-source, so you're welcome to run it yourself and verify my results.

Furthermore, this is now an unfair example favouring Async, which runs concurrently, while our sync code is being forced to run serially. So let's correct that by using the ThreadPoolExecutor from concurrent.futures, making our test code a more accurate analogue of running under Flask and thereby levelling the playing field (I'm not using any libraries that release the GIL, so this is actually a pretty fair comparison).
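Concretely, the test loop changes from a plain for-loop into something like this (a sketch; the pool size of 4 matches the core count, and psycopg2 connections may be shared between threads as long as each thread uses its own cursor):

from concurrent.futures import ThreadPoolExecutor

# Spread the 10k calls across worker threads, roughly analogous to a threaded WSGI server
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(get_random_world_single_raw) for _ in range(10_000)]
    results = [future.result() for future in futures]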

Profile of the improved code for TechEmpower’s Flask Benchmark running multi-threaded - Using Postgres

Now, due to the threading, we lose some visibility in the profiler I'm using, so we can't see how long each part of our executed code takes. But we can still see that the total time taken by all 10k runs is now 253ms, and our test program's total execution time is now 304ms. So not only is our database code now executing at a similar speed to the async implementation, but we're spending 10x less time in total!

Production analogues

Now, I'm sure some of you will be unsatisfied with this result so far, as you'd like to argue that threading and concurrency are not the same. I agree with you: they are not the same. Concurrency is collaborative (requiring tasks to be good neighbours), whereas threading relies on the OS and CPU to handle the multitasking via TDM and multicore computing.

So let's get back to the original benchmark code and apply the improvements from our profiling session. If we're right, we should now see performance as good as, if not better than (due to lower overheads), the FastAPI / Async code.

To keep things tightly scoped, we'll keep everything as close to identical as possible except for 3 key things:

  1. The web framework: obviously, as it is a Sync (Flask) vs Async (FastAPI) comparison after all.
  2. The database library: in the Flask corner we'll use psycopg2, and in the FastAPI corner we'll use asyncpg.
  3. The (WSGI / ASGI) webserver: here we'll use a selection of choices for Flask, as I don't know what the best combination will be, and for FastAPI we'll use exclusively Gunicorn + Uvicorn (the configuration preferred by FastAPI's developer); typical launch commands are sketched below.
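For anyone reproducing this, the launch commands look roughly like the following (a sketch; “app:app” is a placeholder for wherever the application object actually lives, and worker counts are varied across the runs):

# Gunicorn (plain):             gunicorn -w 4 app:app
# Gunicorn + Meinheld:          gunicorn -w 4 -k meinheld.gmeinheld.MeinheldWorker app:app
# Gunicorn + Uvicorn (FastAPI): gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app

# Bjoern is started from Python rather than from the command line:
import bjoern
from app import app  # placeholder import of the Flask application object

bjoern.run(app, "0.0.0.0", 8080)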

Our test bench is configured as follows:

  • DB Server - 8GiB RAM, 4 Cores (pinned, physical not hyper-threads) - Running: PostgreSQL 13.3 / Debian 9
  • Source Server - 8GiB RAM, 4 Cores (pinned, physical not hyper-threads) - Running: WRK2 / Debian 10
  • Target Server - 8GiB RAM, 4 Cores (pinned, physical not hyper-threads) - Running: Python3.7.3 / Debian 10

This leaves 4 cores and 8GiB of RAM available for the Host OS.

Final results - Comparing webservers using the fixed code

In the chart, Bjoern, Gunicorn+Meinheld, and Gunicorn are all running the same Flask code (with our fix applied), whereas Gunicorn+Uvicorn is running the original FastAPI code using the settings from TechEmpower's benchmark. So what can we see here? Well, at 4 threads, the improved Flask code running under Bjoern is on par with FastAPI running under Gunicorn+Uvicorn. But if we increase the process count, the OS can start engaging in preemptive multitasking, allowing our improved Flask code to pull ahead with a significant lead (whereas FastAPI actually takes a significant performance hit when we add more processes than there are cores, due mostly to the additional overheads of running Python async code).

Key Takeaways

Operating systems as we know them today are complex TDM (Time Division Multiplexing) solutions for multitasking, and they are very good at it (preemptive multitasking has been around since the 60s). Since the mass adoption of multicore computing and SMT (Simultaneous Multi-Threading), they have only gotten better, adding abilities like Turbo Boost: overclocking individual cores and directing the heavy load to them. A modern CPU and kernel can recognise when a task is IO-bound and swap out any waiting tasks for something that is ready to run.

What these results show is that profiling and fixing the root cause of poor code has more impact than whether you use Async or not.

The difference in the TechEmpower benchmarks has more to do with good code vs bad code than Library A vs Library B (or even Sync vs Async).

Now, I put a lot of emphasis in this article on the fact that you don't need Async, but I'm not saying that it's always bad. There are times when it's useful. As I mentioned in the opening, it is a good bandaid: use it while you fix the root cause, but make sure you do fix the root cause! If you don't then, well, prepare for unforeseen consequences…

If you still feel that you need Async after solving all of the performance issues you can find, I can recommend Gevent as a faster and cleaner alternative to Python's built-in async.
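For completeness, here's a minimal sketch of what serving the same Flask app with Gevent might look like (the “app” import is a placeholder for the Flask application object):

from gevent import monkey
monkey.patch_all()  # patch blocking stdlib calls so they yield to other greenlets

from gevent.pywsgi import WSGIServer
from app import app  # placeholder import

WSGIServer(("0.0.0.0", 8080), app).serve_forever()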

Now, if you'll excuse me, I have a PR that I feel compelled to write to fix the myriad of issues with various TechEmpower benchmarks.

Further Reading:

If you're into high-performance programming, consider reading my series on building high-performance databases with LMDB.


Post-Script

None of this is intended as an attack on anyone. I disagree with Sebastián Ramírez's use of the TechEmpower benchmarks as a marketing device, especially given that he is clearly a very capable programmer and I can't imagine he didn't notice the flaws in the competing examples he references. That said, Mike Wolfe's post makes a solid argument for why FastAPI is a good library, and I agree with the vast majority of his points. I can honestly say that I would use FastAPI, and have even written tutorials using it (links will follow when published). In principle, I am also a fan of what TechEmpower is trying to achieve with their benchmarks; however, they need stricter quality control and peer review to avoid obvious flaws in the benchmark code. I've pushed a PR correcting some of the issues identified in the Flask benchmarks, and followed that up with some additional fine-tuning after seeing the preliminary test results.
