ParquetSharp – What we’ve learned and how it’s even better

16 August 2023
  • Open Source Software

By Martin Finkel and the GR Open Source Software team

G-Research has been using ParquetSharp in production since 2018. We've made numerous improvements to it thanks to what we have learned from running it at scale in our datacenter, and thanks to contributions from a community of engaged and enthusiastic users.

We initially wanted a Parquet implementation that would be:

  • Performant: It had to be fast enough for our applications. When we started working on ParquetSharp, we saw speed-ups of between 4x and 10x, depending on the shape of the data and whether it was being read or written
  • Callable from .NET Core: This enables closer integration with a wider variety of applications
  • Well maintained: ParquetSharp now has a community of contributors who help to keep it vibrant and useful
  • Close to the official Parquet reference implementations: We wanted it to follow the reference specifications as closely as possible to ensure interoperability
  • Flexible: It had to work with a variety of common programming languages and operating systems

Today we're focused on building on this momentum and broadening the reach of ParquetSharp by enhancing its compatibility, ensuring its portability across other common programming languages, benchmarking its performance, and writing extensive documentation.

It’s a lot of work, but we’re happy to contribute to this important component of the open source ecosystem.

Compatibility

If you’ve been using ParquetSharp for some time, you’ll no doubt have noticed the cross-platform compatibility improvements. It was important for us to open the benefits of Parquet file processing to programmers using a variety of different development platforms.

ParquetSharp now runs on macOS (ARM64 and x64), Linux (ARM64 and x64), and Windows (x64). Whichever of these you work on, it offers a consistent and reliable experience, unlocking the benefits of high-performance Parquet data manipulation in .NET.

Our hope is that it will enable greater collaboration and more progress in the ParquetSharp ecosystem by lowering barriers and reducing any friction that coders may experience when working together.

Portability

With this expanded platform support, developers can take advantage of ParquetSharp's ease of use for efficient Parquet file processing on any of these systems.

As for languages, ParquetSharp is a .NET library, so it is not limited to C#: it has already been used successfully with both F# and PowerShell.

Performance

Alongside ongoing work to support the latest version of the native Parquet library, the past few stable releases of ParquetSharp have adopted and exposed more of the latest performance-oriented .NET APIs.

Since memory layout and management are intimately linked to the Parquet data format, the latest, faster .NET memory APIs are a great match for ParquetSharp.

Adam Reeve has been leading the effort on that task, with various PRs improving the performance of many APIs, such as:

  • Read whole leaf-level arrays at once, when possible, rather than growing lists
  • Add WriteRowSpan method to ParquetRowWriter
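
As a rough sketch of how the second item might be used, rows that already sit in an array can be handed to the row-oriented writer as a span in a single call. The file name, column names and sample rows below are made up for illustration, and exact overloads may differ between ParquetSharp versions.

```csharp
using System;
using ParquetSharp.RowOriented;

// Hypothetical sample data; the tuple shape defines the two columns.
var rows = new (int Id, double Price)[]
{
    (1, 10.5),
    (2, 11.0),
    (3, 9.75),
};

// Column names are passed explicitly; "prices.parquet" is an illustrative path.
using (var writer = ParquetFile.CreateRowWriter<(int Id, double Price)>(
           "prices.parquet", new[] { "Id", "Price" }))
{
    // The array converts implicitly to a ReadOnlySpan of the tuple type.
    writer.WriteRowSpan(rows);
    writer.Close();
}

// Read the rows of the first row group back to check the round trip.
(int Id, double Price)[] readBack;
using (var reader = ParquetFile.CreateRowReader<(int Id, double Price)>("prices.parquet"))
{
    readBack = reader.ReadRows(0);
}

Console.WriteLine($"Read {readBack.Length} rows; first price = {readBack[0].Price}");
```

Passing a span lets the caller write a whole array, or a slice of one, in a single call rather than row by row.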

Performance is now measured using BenchmarkDotNet, the standard benchmarking tool in .NET.

Documentation

Since ParquetSharp uses the underlying Parquet data format, the upstream Parquet documentation is always relevant. However, .NET-specific APIs and usage sometimes require extra guidance and tips on best practices.

We have a growing set of documentation files on GitHub detailing various operations, such as reading and writing Parquet files, working with nested data, using the row-oriented API, and more.
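
To give a flavour of what those guides cover, here is a minimal sketch of writing and then reading a file with the column-oriented API. The file name, column names and values are assumptions chosen for illustration, and the documentation on GitHub remains the authoritative reference.

```csharp
using System;
using ParquetSharp;

// Hypothetical sample data and file name, used only for illustration.
var ids = new[] { 1, 2, 3 };
var prices = new[] { 10.5, 11.0, 9.75 };

var columns = new Column[]
{
    new Column<int>("Id"),
    new Column<double>("Price"),
};

// Write one row group, one column at a time, in schema order.
using (var file = new ParquetFileWriter("prices_columns.parquet", columns))
{
    using (var rowGroup = file.AppendRowGroup())
    {
        using (var idWriter = rowGroup.NextColumn().LogicalWriter<int>())
        {
            idWriter.WriteBatch(ids);
        }
        using (var priceWriter = rowGroup.NextColumn().LogicalWriter<double>())
        {
            priceWriter.WriteBatch(prices);
        }
    }
    file.Close();
}

// Read the "Price" column (index 1) of every row group back as an array.
double total = 0;
using (var file = new ParquetFileReader("prices_columns.parquet"))
{
    for (var i = 0; i < file.FileMetaData.NumRowGroups; ++i)
    {
        using var rowGroup = file.RowGroup(i);
        var numRows = checked((int) rowGroup.MetaData.NumRows);
        using var priceReader = rowGroup.Column(1).LogicalReader<double>();
        foreach (var price in priceReader.ReadAll(numRows))
        {
            total += price;
        }
    }
    file.Close();
}

Console.WriteLine($"Sum of prices: {total}");
```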

ParquetSharp.DataFrame

The .NET ecosystem provides a data structure named DataFrame in the namespace Microsoft.Data.Analysis. It supports “indexing, binary operations, sorting, selection and other APIs”, and is a popular structure in data-oriented .NET code. G-Research authored an extension to ParquetSharp that integrates seamlessly with the .NET DataFrame type. As always, you may download it directly from NuGet.
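
A round trip through the extension might look something like the sketch below. It assumes the ParquetSharp.DataFrame and Microsoft.Data.Analysis packages are installed and that the extension methods are named ToParquet and ToDataFrame, as we understand them from the project's README. The column names, values and file name are invented for the example.

```csharp
using System;
using Microsoft.Data.Analysis;
using ParquetSharp;

// Build a small in-memory DataFrame (hypothetical column names and values).
var frame = new DataFrame(
    new Int32DataFrameColumn("Id", new[] { 1, 2, 3 }),
    new DoubleDataFrameColumn("Price", new[] { 10.5, 11.0, 9.75 }));

// ToParquet is assumed to be the extension method added by ParquetSharp.DataFrame.
frame.ToParquet("prices_frame.parquet");

// ToDataFrame is assumed to be the matching extension on ParquetFileReader.
DataFrame loaded;
using (var reader = new ParquetFileReader("prices_frame.parquet"))
{
    loaded = reader.ToDataFrame();
    reader.Close();
}

Console.WriteLine($"Loaded {loaded.Rows.Count} rows and {loaded.Columns.Count} columns");
```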

Support our work, share your learnings

You can help make ParquetSharp better by contributing your own code to the project, by hunting for bugs or by running it yourself inside your own datacenter and submitting your feedback.

We learn from each other – so please, share what you find out, so that we all may continue to develop this technology and drive the future forward. Visit the project, give it a star, and join the community.
