
Building a database from scratch - 3

July 15, 2021

6 min read

An introduction to database fundamentals
Tim Armstrong, Founder, Consultant Engineer, Expert in Performance & Security

Hi, welcome to Part 3 of this series on building a database from scratch!

In Part 1, we discussed the high-level architecture and requirements for our custom database. Part 2 walked us through the data structures and how to write and read objects from the backing store - which is pretty cool, but in Part 1 we said we were going to store the samples by identifier+timestamp because neither the timestamp nor the identifier is unique on its own. So we'll need an index mechanism in order to keep track of all of the keys relating to an identity, and all of the keys relating to a timestamp.

Needless to say, if you haven't read the first two parts yet, then this part might be a bit of an odd place to jump in, so I highly recommend reading those first. Alternatively, you can pick up the code so far here.

A good indexing solution is critical to building an efficient database. The fundamental goal of indexing is to reduce search complexity - commonly with the aim of hitting O(1) time. A common upper limit for acceptable complexity in a database indexing system is O(N) - to be specific, with N referring to the number of returned records (not the number of stored records).

So how do we do this in LMDB?

Well, one of the first solutions that comes to mind when we talk about O(1) lookup times is a HashMap, which is rather fortunate since that's effectively what we're using as our backing store. What's even better is that it supports something called a "sub-database" as well as duplicate records (meaning we can build non-unique indexes).

So, let’s start by creating two sub-databases within the environment:

keys_by_identity = env.open_db(b'identity_index', dupsort=True)
keys_by_timestamp = env.open_db(b'timestamp_index', dupsort=True)

Here the dupsort argument instructs the database that we want to allow duplicate entries (and have them sorted), meaning that we don’t need to deal with building and managing our own extensible array structure for the indexes.
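
One detail that's easy to miss: named sub-databases are only available if the environment was opened with max_dbs set high enough to cover them. A minimal sketch, assuming the same backing store as Part 2 (the path and map_size here are placeholders):

import lmdb

# max_dbs must cover the main database plus the two index sub-databases
env = lmdb.open("/path/to/backing-store", map_size=2**30, max_dbs=3)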

So, to add an object we'll now need to insert it into all three places in a single transaction (otherwise the indexes would risk falling out of sync with the data). To do this, let's modify the put / write code from the previous tutorial.

ident_encoded = ident.encode()
input_sample = Sample(ident=ident_encoded, sample=sample, timestamp=timestamp)
with env.begin(write=True) as txn:
    # Composite key: identity + "/" + the timestamp in hundredths of a second (8-byte big-endian)
    key = input_sample.ident + b"/" + int(timestamp*100).to_bytes(8, byteorder="big")
    txn.put(key, bytes(input_sample))
    # Index entries: identity -> key, and whole-second timestamp -> key
    txn.put(input_sample.ident, key, db=keys_by_identity)
    txn.put(int(timestamp).to_bytes(8, byteorder="big"), key, db=keys_by_timestamp)
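
If it helps readability later on, the same write can be wrapped in a small helper. A minimal sketch (put_sample is my name for it, and it assumes the Sample structure from Part 2 plus the env and index handles defined above):

def put_sample(ident: str, timestamp: float, sample: float) -> bytes:
    # Build the record and its composite key, then write the record plus both index entries atomically
    input_sample = Sample(ident=ident.encode(), sample=sample, timestamp=timestamp)
    key = input_sample.ident + b"/" + int(timestamp * 100).to_bytes(8, byteorder="big")
    with env.begin(write=True) as txn:
        txn.put(key, bytes(input_sample))
        txn.put(input_sample.ident, key, db=keys_by_identity)
        txn.put(int(timestamp).to_bytes(8, byteorder="big"), key, db=keys_by_timestamp)
    return key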

Now if we wanted to retrieve all samples that originated from the apparatus with the identity "37d5edb2-ebce-429f-82bb-631f161f0ab5", all we need to do is get a cursor that points to the first of the duplicates in the keys_by_identity index, and then iterate over that, retrieving the objects as we go.

samples = []
with env.begin() as txn, txn.cursor(keys_by_identity) as cursor:
    # Position the cursor on the identity, then walk its duplicate entries
    cursor.set_key(ident.encode())
    for key in cursor.iternext_dup():
        samples.append(Sample.from_buffer_copy(txn.get(key)))

Likewise, we could retrieve all the samples that originated from all apparatus at a specific time slice by using a very similar construct:

samples = []
with env.begin() as txn, txn.cursor(keys_by_timestamp) as cursor:
    cursor.set_key(int(timestamp).to_bytes(8, byteorder="big"))
    for key in cursor.iternext_dup():
        samples.append(Sample.from_buffer_copy(txn.get(key)))

Optimising access

So as it stands, writes are in O(1) time and reads are in O(N) time, where N in this case is the number of records with that identity.

Hold on a second - O(N) for reads? Isn't there a way to beat O(N) time here?

So, the answer here is one of those “Yes, but actually no” situations.

We could use getmulti to pull all of the matching keys in one request, but then we'd still need to iterate over that list and pull the records. We could potentially fan this collection process out to a thread pool, but then we have the management overhead of the ThreadPoolExecutor, and we'd still need to serialise the final reduction process, so any potential improvement gained by performing a parallel read is effectively lost on thread management and serialisation. Furthermore, since we're dealing with an in-memory database, the wall-clock time consumed by this O(N) task is already so close to line-speed (the physical limits of the hardware) that inserting more Python into the situation will only slow it down.
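
To make that trade-off concrete, the fan-out version would look roughly like the sketch below (my own illustration, not part of the project's code). Note that each worker has to open its own read transaction, because LMDB transactions can't be shared between threads - which is exactly where the overhead creeps in:

from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    # Each worker opens its own (cheap) read transaction - but it's still overhead
    with env.begin() as txn:
        return Sample.from_buffer_copy(txn.get(key))

with env.begin() as txn, txn.cursor(keys_by_identity) as cursor:
    keys = [k for _, k in cursor.getmulti((ident.encode(),), dupdata=True)]

with ThreadPoolExecutor() as pool:
    samples = list(pool.map(fetch, keys))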

So yes, it's possible to make this better than O(N) on paper, but in practice it will be slower - at least in Python; it could be argued that it's technically possible in C/C++. If we were doing this in C or C++, however, then we wouldn't need to do this manual re-casting of the data in the first place (meaning getmulti would be more than sufficient).

Given that we are in Python, however, there is one way that we can squeeze out a little more speed, and that is to push the responsibility for managing the loop from Python to C by using the map function:

from operator import itemgetter

with env.begin() as txn:
    with txn.cursor(keys_by_identity) as cursor:
        keys = map(itemgetter(1), cursor.getmulti((ident.encode(),), dupdata=True))
        samples = tuple(map(Sample.from_buffer_copy, map(txn.get, keys)))

But this is massively sacrificing readability for very limited gains (approximately 3-4%) in a section of code that doesn't need that level of optimisation. If we did need that level of optimisation, then it's probably time to consider either using Cython (approximately a 10% speed improvement in this case) or, more likely, just migrating to C++ entirely.

So what optimisations could we make?

Well, we could potentially see an improvement in read and write times if we stopped using the backing store for indexing. But then we'd have to rebuild the indexes every time we restart the server, which takes some time (although at 1.7 seconds per million records when tested on a Ryzen 9 5900X, it's not that painful). Perhaps the more important issue, however, is that we'd have to serialise access to the dictionaries used for our "short-lived" indexes to prevent corruption caused by modification during iteration.
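
For illustration only (we won't actually do this), rebuilding an in-memory identity index on start-up would look something like the following sketch, which scans the main database once and regroups the keys by their identity prefix:

from collections import defaultdict

def rebuild_identity_index(env):
    # Scan every (key, value) pair in the main database and group the keys by identity
    index = defaultdict(list)
    with env.begin() as txn, txn.cursor() as cursor:
        for key, _ in cursor:
            ident, _, _ = key.partition(b"/")
            index[ident].append(key)
    return index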

Another avenue one might consider is improving the write speed by holding a write transaction open across multiple writes, since committing the write transaction is by far the most time-consuming part of our write operation. Unfortunately, if we tried to do this we'd never be able to read the written data from other transactions, and the data would never be persisted to disk.

However, if we're willing to take the risk of potential data loss, then there is one optimisation we can make that doesn't actually require any significant code changes - setting the backing store to use async writes. This means that in the event of a system crash we would lose any records written since the last time env.sync() was called, or since the OS last opportunistically wrote the dirty pages to disk on our behalf.

To do this you would set the sync and metasync arguments of lmdb.Environment to False (and, optionally, enable writemap together with map_async). However, by nature, this fundamentally violates the requirements set out for our project in Part 1 of this series, so unfortunately we can't take the >10x speed improvement it would provide. But it's a good thing to be aware of if we're ever able to relax those requirements in the future.
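
For completeness, the relaxed-durability configuration would look something like this sketch (again, not something we can use here; the path and map_size are placeholders):

env = lmdb.open(
    "/path/to/backing-store",
    map_size=2**30,
    max_dbs=3,
    sync=False,       # don't flush buffers to disk on every commit
    metasync=False,   # don't flush the meta page on every commit either
    writemap=True,    # write through a writeable memory map...
    map_async=True,   # ...and let the OS flush dirty pages asynchronously
)
# env.sync() can still be called periodically to force a flush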

That's it for Part 3 - in the next instalment we'll build the UDP service.

The code

If you want to take a look at the final code from this part of the tutorial it’s available here.
