Distinguished Speaker Series: Insights from Wes McKinney
Last month, Wes McKinney, creator (and benevolent dictator for life) of the Python Pandas project, gave a talk covering his work on Apache Arrow, an open source, multi-language developer framework for accelerating tabular data access and in-memory processing. Guillaume Rame, a Software Engineer at G-Research, gives us an insight into the value of the talk and what he has taken from it.
As a Software Engineer at G-Research, I am always looking for performance improvements in the tools I use, because faster tools mean I can deliver results to our quants more quickly, which is incredibly valuable to the business.
I am a heavy Pandas user. I often work with billion-row data frames stored in Parquet files, and without Pandas, processing this volume of data would be incredibly difficult. Wes, through Pandas, has already saved me a lot of time and spared me many headaches in my career.
That’s why I was pretty excited about hearing from the man behind Pandas, as well as learning about any potential new tools that could make the types of data processing I do faster and more efficient.
While Pandas is something I know a lot about and use frequently, before the talk I had not heard of Apache Arrow or how it could potentially help me and my team in the work we do. After the overview from Wes, that’s changed and Arrow is something that I am definitely going to investigate further.
One aspect that particularly jumped out at me during the talk was the Polars library. Polars is a DataFrame library similar to Pandas, but it is implemented in Rust (and who doesn’t like a bit of Rust?) using the Apache Arrow memory model.
In benchmarks, Polars is consistently several times faster than Pandas and more memory efficient, and it achieves this with a more uniform, type-safe API.
Based on those improvements, it is something that I can see replacing Pandas in my dataflows, especially as migrating from Pandas to Polars should be straightforward given the APIs are similar!
What I took from the talk
Within G-Research there is wide potential for applying Arrow, particularly within the work my team does. This is especially true when we consider the current industry trend towards more parallel hardware, including GPUs, which Arrow supports natively.
But is Arrow going to replace our existing applications, like the industry standard Apache Spark, or the hand-rolled (and optimised) implementations of many data processing pipelines that already exist? At the moment, I think it is unlikely, but it is something we will certainly explore further.
Overall, the talk was really insightful and we are fortunate to have speakers of Wes’ calibre come and talk to us, and give us lots of food for thought.