Wes McKinney – Apache Parquet & Arrow
Written by Alexander Scammon, Head of Open Source Development, G-Research.
When you’re known for a hit, the fans always want to hear you play it. It’s been some 57 years since The Rolling Stones released “Satisfaction,” and not even the death of its drummer or a global health crisis could keep the band from touring and playing it to sold-out stadiums.
When Wes McKinney came to G-Research for his Distinguished Speaker Series talk, everyone wanted to hear him talk about what he’s best known for – Pandas. It’s definitely a Beast of Burden for G-Research – we spend a lot of time working with it.
For us data-minded crew, the open-source library of data analysis tools for Python he created is roughly equivalent to the opening chords of Keith Richards' guitar. It accelerates our pulses, and we wanted to hear him expound on it. That’s the hit we know and love.
However, McKinney has recently developed something that may come to play an even larger role in the daily lives of our team. We work increasingly with columnar data from disparate environments, and this development potentially opens up a more powerful and useful way of working with big data.
We were listening.
McKinney talked to us about his work heading the Apache Parquet project and, now, the Apache Arrow project, which aims to make it easy for data-heavy engineers to work with columnar data in memory.
According to McKinney, Parquet was initially created to overcome networking limitations. The pipes weren’t wide enough to push such big data files through, he explained. But as file sizes have exploded and big data has kept getting bigger, CPUs have started to become the problem.
McKinney imagines that we may soon see a rise in specialised silicon explicitly fabricated to deal with big data application processing.
Apache Arrow makes it easy to move data across different environments without having to reformat it. McKinney talked about taking data from a variety of systems and formats: Spark, Pandas, Drill, Impala, HBase, Kudu, Cassandra and Parquet.
Then, instead of copying and converting the data each time you need to move it, you can transfer it into Arrow memory once and share it from there. That would certainly seem to reduce the number of read-write operations, improving both processing speed and data efficiency.
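To make the copy-versus-share distinction concrete, here is a minimal Python sketch using only the standard library. It is not the Arrow API: the `prices` column and the helper functions are hypothetical names chosen for illustration. It shows one contiguous column buffer being read by two consumers through views rather than copies, which is the general idea behind a shared in-memory columnar format.

```python
from array import array

# One column of float64 values stored contiguously
# (a stand-in for a columnar buffer; not an Arrow structure).
prices = array("d", [101.5, 102.0, 99.75, 100.25])


def column_view(column):
    """Return a zero-copy view over the column's underlying buffer."""
    return memoryview(column)


def mean(view):
    """Hypothetical consumer A: computes a mean straight from the shared buffer."""
    return sum(view) / len(view)


def maximum(view):
    """Hypothetical consumer B: reads the same buffer with no conversion step."""
    return max(view)


view_a = column_view(prices)
view_b = column_view(prices)

# Both views point at the same memory, so an update to the source
# is visible through each of them without any copying.
prices[0] = 200.0
shared = view_a[0] == view_b[0] == 200.0

avg = mean(view_a)
top = maximum(view_b)
```

In this sketch neither consumer ever receives its own copy of the column. A real cross-system exchange would also need an agreed description of types and layout, which is the part a common format such as Arrow is meant to standardise.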
One of the greatest advantages of Arrow, McKinney points out, is that it allows for chunking: big data can be broken into component parts and processed in smaller batches. That gives application developers greater flexibility.
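The chunking idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than Arrow's own implementation: `iter_batches`, the `table` dictionary and its column names are all hypothetical, and the list slices here copy data, which a real columnar library would try to avoid.

```python
def iter_batches(columns, batch_size):
    """Yield successive row batches from a column-oriented table.

    `columns` maps column names to equal-length lists. Each batch is a
    dict with the same keys, holding at most `batch_size` rows.
    """
    if not columns:
        return
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        stop = start + batch_size
        # Note: list slicing copies; this only demonstrates the batching logic.
        yield {name: values[start:stop] for name, values in columns.items()}


# Hypothetical five-row table stored column by column.
table = {
    "symbol": ["A", "B", "C", "D", "E"],
    "volume": [10, 20, 30, 40, 50],
}

batch_sizes = []
total_volume = 0
for batch in iter_batches(table, batch_size=2):
    # Each batch can be processed independently, so the whole table
    # never has to be handled in one pass.
    batch_sizes.append(len(batch["volume"]))
    total_volume += sum(batch["volume"])
```

Because each batch is self-contained, the per-batch work could just as easily be streamed or handed to separate workers, which is where the flexibility comes from.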
What does the future hold?
The Rolling Stones released “Satisfaction” in the summer of 1965, and it was an immediate hit, holding the number one spot on the charts for four weeks that summer and remaining a perennial favourite. But if the Rolling Stones had stopped there, we never would have gotten “Gimme Shelter,” four years later, which is, if not musically superior, then at least its equal.
Wes McKinney may be known best for his work with Pandas. It was a big hit for him, after all. But just as the Rolling Stones did not stop at their first hit, McKinney has kept going, and we may now be witnessing an even bigger development in the future of data storage and processing.