Keeping database queries fast with Python
- Open Source Software
- Software Engineering
Written by Alexander G. Rakowski, a former member of our Software Engineering team at G-Research. He talks through how his team replaced its relational database with a custom NoSQL interface, accessible via several easy-to-learn Python commands.
21st Century NoSQL solutions
For many years, the team I worked in relied on a relational database (RDBMS) as its long-term storage technology. Despite not being trendy, RDBMSes have several advantages: various levels of transactional semantics; strong ACID properties; mature and efficient query optimisers; and perhaps above all, they offer an easy-to-learn yet powerful interface in the form of SQL.
However, as our systems grew, and more was demanded of them, scaling became the key factor holding us back.
To overcome this, we introduced a high-throughput, low-latency distributed datastore as our shiny new backend to replace the beleaguered SQL solution and reduce waiting time for users who had to access the database.
We’d done it! We were using NoSQL technology and had sprung into the 21st Century, where everything happens at the speed of thought, if not before.
Then the problems started…
NoSQL databases, like the one we adopted, are generic and offer only a low-level API that concerns itself with how to interact with the backend, not with what those interactions mean.
In other words, those APIs lack the domain context to convey semantics, so we needed to build that layer ourselves on top. More layers meant more time: time my team would spend coding it all up.
We work on a JVM stack, so it was natural to write our domain API for the JVM. However, G-Research has a long history with C#, so many of the downstream processing pipelines consuming our data wanted .NET clients as well. It was going to be complicated.
There were some notable differences in terms of both the low-level client libraries for the NoSQL backend and the behaviour of the JVM and CLR, but Java and C# are similar in many regards. As such, it was not an inordinate burden to write and maintain two separate domain-level client libraries.
Those two client libraries covered our most pressing use cases, but we also needed to run data quality validations. These help us ensure a consistently high standard of output and identify systematic issues as early as possible.
We also sometimes need to perform ad-hoc analyses on particular subsets of data, because our researchers and analysts may need easy access to our outputs for various purposes.
Although it’s perfectly possible to use JVM or .NET languages for these sorts of tasks, many people find it more convenient to spin up an interactive Python notebook for rapid, iterative development. So, I set out to write a client library in Python.
Avoid unnecessary repetition and redundancy – again
There’s an obvious answer to the problem at hand – write the Python client as a wrapper around the underlying datastore access layer. A C client is provided by the datastore vendors, and many Python modules use bindings into C libraries for performance reasons.
Yet, this didn’t seem to be the best decision.
Our proprietary format is non-trivial to implement, requiring a good deal of care to ensure correct semantics and consistency with the existing clients. Further, we would need to maintain all three of these clients, accounting for changes in the underlying data access libraries as well as in our own logic.
Finally, there was the question of whether our domain-level client should be written in C for performance reasons, with simple Python hooks; or written as much in Python as possible for maintainability since we are not a team of C experts.
There’s another factor to throw into the mix. Whilst some of the use cases for Python could be handled in simple notebooks (Jupyter, for example), others would want access from PySpark sessions. Thanks to some earlier demand for PySpark access into JVM-based functionality, I was already familiar with Py4J, and we already had a Java client.
A possible solution: Py4J
So we wanted to build a (C)Python wrapper hooking into our JVM client via Py4J.
Py4J works by running a Java server in one process and communicating with that (local) port from its Python component, which then exposes Java members and classes to the rest of the Python application.
If we only wanted access from PySpark, that would make things simpler: we could simply load a JAR into the Spark session and hook into the Py4J gateway provided by the SparkContext. However, we also needed this client to run in stand-alone mode. To this end, we had to create our own JVM instance to run the Py4J gateway.
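To make the wiring concrete, here is a minimal sketch of the Python side, assuming a Py4J GatewayServer is already listening on its default local port (25333); spawning the JVM process that hosts it is sketched further below:

```python
from py4j.java_gateway import JavaGateway, GatewayParameters

# Connect to a GatewayServer already listening on the given local port.
# The JVM process hosting it is spawned separately.
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))

# Java classes and objects are now available as Python proxies.
jvm_list = gateway.jvm.java.util.ArrayList()
jvm_list.append("hello from Python")
```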
On its own, this would be sufficient (but awkward) for writing the Python API. Py4J provides converters for basic types like arrays, lists, and maps, while user-defined types need to be handled explicitly.
Digging into and interacting with those custom types from Python is verbose, painful, and error prone. This is especially evident when it comes to ensuring the right types are used and when referencing entities by fully qualified paths. To get a sense of what this is like, imagine trying to write Java in Notepad!
I wrote a small adapter to simplify these inter-process interactions. I put it on the JVM side to handle the conversions from custom to Py4J-supported types, whilst maintaining the strong typing guarantees available on the JVM. This adapter is registered with the Py4J gateway, making it immediately available to Python callers without having to ensure package alignment between the two sides.
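In Py4J terms, the adapter is the gateway's entry point object, so reaching it from Python is a one-liner. A minimal sketch, where the method name is hypothetical rather than the real adapter API:

```python
# The JVM-side adapter is registered as the gateway's entry point, so it is
# reachable from Python without any fully qualified package path.
adapter = gateway.entry_point
records = adapter.fetchRecords("someKey")  # hypothetical adapter method
```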
Although Py4J handles the conversion of many common types, it needed a little help to accommodate some extra types like sets. Both processes (Python and JVM) also need to agree on a port number, or they won’t know where to talk to: hard-coded ports are a simple solution, but can we guarantee that the selected port will always be available? Instead, we chose to use a randomly selected port passed to the JVM gateway as a command-line parameter when we spawn it from Python.
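A sketch of that negotiation, with a hypothetical jar and main class standing in for the real ones: ask the operating system for a free ephemeral port, then hand it to the JVM on the command line.

```python
import socket
import subprocess

def find_free_port() -> int:
    # Bind to port 0 so the OS picks an available ephemeral port for us.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]

port = find_free_port()
# There is a small race between releasing the probe socket and the JVM
# binding the port, but in practice the window is tiny.
jvm_process = subprocess.Popen(
    ["java", "-cp", "client-all.jar", "com.example.PyGatewayMain", str(port)]
)
```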
During the client’s alpha-testing phase, we noticed that Python would sometimes complain that the JVM gateway was unavailable or, more accurately, that it was receiving no response from the expected port.
An interesting symptom was that this rarely happened on my workstation, hence having gone unnoticed thus far, but frequently afflicted one of my co-workers. The JVM process was definitely being created and we could interact with it manually after we saw this error being thrown, so what was going on?
It turned out the JVM could be a little slow to start up, perhaps taking a few seconds, but the Py4J probe to ensure connectivity would happen very shortly after the JVM process was launched: crucially, before the gateway was actually listening on the specified port. This was solved by retrying the initial connection attempt for a few seconds using tenacity, a Python library for retrying operations until they succeed.
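A sketch of that retry, assuming the first call through the gateway raises a Py4JNetworkError until the server is actually accepting connections:

```python
from py4j.java_gateway import JavaGateway, GatewayParameters
from py4j.protocol import Py4JNetworkError
from tenacity import retry, retry_if_exception_type, stop_after_delay, wait_fixed

@retry(
    retry=retry_if_exception_type(Py4JNetworkError),
    stop=stop_after_delay(10),
    wait=wait_fixed(0.5),
)
def connect_to_gateway(port: int) -> JavaGateway:
    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))
    # Py4J connects lazily, so force a trivial call; it raises a network error
    # until the JVM gateway is actually listening, which triggers a retry.
    gateway.jvm.System.currentTimeMillis()
    return gateway
```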
Making progress
The client’s architecture is split into three distinct concerns. Let’s cover each of these in turn.
How to interact with the Java server
The JVM interaction layer is responsible for launching the JVM instance with whatever arguments are necessary for mediating communications, and for handling communication errors. Essentially it is an application-specific wrapper around the Py4J gateway.
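Pulling the earlier sketches together, the interaction layer might look something like this; the names and shape are illustrative, not the real client:

```python
import subprocess

class JvmConnection:
    """Illustrative wrapper around the Py4J gateway and its JVM process."""

    def __init__(self, classpath: str = "client-all.jar"):
        self._port = find_free_port()                    # from the earlier sketch
        self._process = subprocess.Popen(
            ["java", "-cp", classpath, "com.example.PyGatewayMain", str(self._port)]
        )
        self._gateway = connect_to_gateway(self._port)   # retrying connect from above

    @property
    def adapter(self):
        # The JVM-side adapter registered as the gateway's entry point.
        return self._gateway.entry_point

    def close(self):
        self._gateway.shutdown()
        self._process.terminate()
```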
How to delegate from Python to the JVM
The delegation layer is perhaps the most interesting in terms of its responsibilities and logic.
I chose to define a decorator for any functions that wish to call into the backing JVM, rather than having those functions explicitly request those interactions and handle the results. This works nicely because the logic really is always the same: make the request; await the response; allow Py4J to perform any type conversions it needs to; and return the result without any further modifications.
Using decorators has a further advantage. We can use the name of the decorated function as the name of the JVM function to call, with a little work to switch from snake-case to camel-case. In turn, this simplifies the maintenance of the entire application, as the Python and JVM adapter APIs are identical, the aforementioned case conversion notwithstanding.
Python allows both positional and keyword arguments, whereas Java only permits positional arguments (although other JVM languages, like Kotlin, do support named parameters). Since the Py4J gateway adapter itself was written in Java, my delegating decorator accounts for this difference by allowing the parameter order to be explicitly specified and reordering arguments as appropriate.
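A condensed sketch of such a decorator, assuming a module-level connection object (`_connection`) like the JvmConnection sketched earlier; the real delegation layer naturally differs in detail:

```python
import functools

def _snake_to_camel(name: str) -> str:
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)

def call_jvm(arg_order=()):
    """Delegate the decorated function to the identically named (camel-cased)
    method on the JVM adapter. Illustrative reconstruction, not the real code."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map keyword arguments onto the fixed positional order Java expects.
            ordered = list(args) + [kwargs[name] for name in arg_order if name in kwargs]
            jvm_method = getattr(_connection.adapter, _snake_to_camel(func.__name__))
            # Py4J performs any supported type conversions on the result.
            return jvm_method(*ordered)
        return wrapper
    return decorator
```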
The user-facing API
As for the user-facing API, it consists simply of empty-bodied, decorated functions of the form:
```python
@call_jvm(arg_order)
def do_something():
    '''Docstring.'''
    pass
```
This approach serves multiple purposes: documentation and code do not clutter one another up; there is minimal domain logic to understand and maintain; and it should be obvious how to make new data access functionality available to end users without needing to understand the inner workings of the client.
Of course, this last point does rely on the functionality existing in both the core JVM client library and its adapter.
Keeping queries speedy
I had created a practical, production-ready Python client, but felt there was still room to improve the system’s usability.
To retrieve records, users need to request them by one or more key fields, where each key field has a well-known, but potentially large, number of valid values. Asking users to either type in these (long) identifiers, or find a catalogue of them from which to copy and paste, is inconvenient. Explicitly enumerating all the possible values as part of the library would conflict with the design goal of simplicity.
Fortunately, Python is very much a dynamic language, so we can request the full list of field values from the underlying JVM client at runtime and present this in some form of Python object, whether a list, enum, or something else entirely. As before, this use of delegation aligns with our design goals.
The question becomes how to represent those values through the API. To avoid users inadvertently attempting to request something the JVM layer does not know how to handle, I chose to enforce as much immutability as Python affords through its language features. At the same time, I wanted to avoid using `eval`, `exec`, or other arbitrary code execution tools, for safety and security reasons.
Instead, I used the following advanced features of Python to create what is essentially a dynamically-generated enum class. The API object storing the values is defined as a singleton using a metaclass. This is not strictly necessary, but given that the API object is effectively immutable (barring concerted effort to contravene this), there is no reason to have more than one.
An additional metaclass calls the JVM client to retrieve the valid field values, transforms them automatically into descriptors, and stores them as slots on the class. Specifically, the field values are defined as data descriptors: they implement the `__set__` dunder method in order to raise a custom error on any attempt to set a descriptor’s value.
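A condensed sketch of the technique; the JVM call and value handling are hypothetical, and the real implementation also sanitises attribute names and enforces the singleton via its own metaclass:

```python
class _ReadOnlyField:
    """Data descriptor exposing one valid field value; assignment raises."""

    def __init__(self, value):
        self._value = value

    def __get__(self, instance, owner):
        return self._value

    def __set__(self, instance, value):
        raise AttributeError("Field values are read-only")

class _ValidFieldsMeta(type):
    def __new__(mcls, name, bases, namespace):
        # At class-creation time, fetch the valid values from the JVM client
        # (hypothetical call) and attach each one as a data descriptor.
        for value in _connection.adapter.listValidFieldValues():
            namespace[str(value)] = _ReadOnlyField(value)
        return super().__new__(mcls, name, bases, namespace)

class ValidFields(metaclass=_ValidFieldsMeta):
    """Dynamically generated, effectively immutable enum of field values."""
```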
Using this metaprogramming-heavy approach, which performs all the requisite magic at class-creation time, users can start typing a field name into their notebook or IDE and hit TAB, or another appropriate hotkey, to trigger code-completion.
Example usage:
```python
from datastoreclient import validfields
validfields.na  # User hits TAB at this point
```
Result:
```python
from datastoreclient import validfields
validfields.name_of_field1  # Code completion offers a suitable suggestion
```
Final thoughts
There’s always a higher mountain to climb and this project is no exception. The Python client works well, but it can still be better shaped to certain use-cases.
Consider a PySpark session, wherein a JVM context already exists. Spawning a separate JVM is slower and less resource-efficient than hooking into the Spark context by loading a JAR. Not to mention the potential for issues caused by running on cluster nodes but outside any Spark container.
Solving this in a way that is transparent to users will involve applying a strategy pattern to detect which mode the client is running in and spawn the JVM client accordingly. Fortunately, this can all be neatly hidden away behind the aforementioned JVM interaction layer.
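A sketch of what that detection might look like, leaning on PySpark's private handles to the active context and its gateway (an assumption about the approach, not shipped code):

```python
def make_gateway():
    """Reuse an active SparkContext's JVM where one exists; otherwise fall
    back to spawning a stand-alone JVM (proposed approach, not shipped)."""
    try:
        from pyspark import SparkContext
        sc = SparkContext._active_spark_context  # private PySpark attribute
        if sc is not None:
            return sc._gateway  # the Py4J gateway PySpark already manages
    except ImportError:
        pass  # PySpark not installed: stand-alone mode
    return JvmConnection()  # from the earlier sketch
```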
Whilst Py4J is easy to interact with (with its automatic conversions and ability to register handlers for the server), the hard boundary between language-specific processes hurts performance and requires some care to get right. We can leverage Apache Arrow to alleviate these concerns, which should be particularly convenient in PySpark, given its existing integration there.
I’m sure more requirements will be identified, but that’s what makes engineering fun. Using Python’s advanced features helped me keep pace with the ever-larger storage solutions my team had to adopt, and helped me keep the system speedy.