Reflecting on Apache Arrow in 2022

Author: Will Jones
Published: December 28, 2022

In this post, I reflect on the recent history of Arrow and potential future directions. I’ve been working within the Arrow project for a little over a year and was only recently accepted as a committer. Other contributors, such as those on the Arrow Project Management Committee (PMC), might have more comprehensive perspectives. But I’ve seen few public posts from them about their visions for the project. So in their absence, I hope these candid thoughts are helpful. Few of them are unique, but most haven’t been discussed publicly. If nothing else, I hope the Arrow community recognizes the value of self-criticism and reflection on the project from time to time.

I’ll provide a short history of Arrow as I’ve seen it. (The project is large, so this history focuses on only a few pieces that have been relevant to my work.) Then I’ll describe some potential problems in the project and how the community might be able to address them.

A short history of Arrow

When Apache Arrow started in 2016, it was simply an in-memory format and an IPC format. Basic implementations were created for various languages, such as C++ and Java.

Then, to encourage developers to build libraries and systems using Arrow, the C++ library added more components. This started with basic compute functions and file readers. Part of the motivation here was that if the in-memory format is always the same, perhaps we could all collaborate on a common library for algorithms too. It would be more maintainable, after all, if R and Python shared the same CSV reader. Improvements to it could benefit both user bases, with minimal maintainer effort. Arrow became not only a format, but also “a multi-language toolbox for accelerated data interchange and in-memory processing”.

This logic continued to expand, adding remote filesystems, multi-file datasets, JIT-compiled compute functions (Gandiva), and eventually a query engine (later named Acero).

Up to this point, the primary audience for the Arrow libraries, such as the C++ libarrow and the Python PyArrow package, was other library developers. Data scientists were expected to use them indirectly. For example, they would use Pandas’ read_parquet() rather than pyarrow.parquet.read_table(), even though the former calls the latter under the hood. That changed with the R library, which used the arrow name but aimed at directly serving data scientists. At first, it was the most convenient way to read Parquet files in R, which previously required installing SparkR. Then, when support for datasets and later Acero (the query engine) was added, users could use dplyr APIs to query multi-file datasets. Whereas previously “Arrow is fast” meant the memory layout was designed to support efficient computation, now it seems to mean Arrow does efficient computation. This at times led to some confusion in the user base about what exactly Arrow is 1.
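
To make that relationship concrete, here is a minimal Python sketch of the two routes to the same data. The file path "data.parquet" is just a placeholder, not a real dataset.

```python
import pandas as pd
import pyarrow.parquet as pq

# End-user API: pandas delegates to PyArrow when the pyarrow engine is used
# (the default when PyArrow is installed).
df = pd.read_parquet("data.parquet")

# Developer-facing API: read into an Arrow Table directly, converting to a
# pandas DataFrame only if and when it is needed.
table = pq.read_table("data.parquet")
df_again = table.to_pandas()
```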

Meanwhile, the Rust implementation of Arrow took off quickly, starting in 2019. A query engine, DataFusion, was developed and donated 2. While Arrow C++ took a monorepo approach, the Rust community maintained separate crates for Arrow, Parquet, and DataFusion, and later split Arrow into smaller pieces. This modularity allowed components to be reused in new ways. For example, the SQL parser and query plan optimizer are decoupled enough from DataFusion’s internals that they could be used by Dask SQL 3. Additionally, whereas Acero initially didn’t have a separate identity from the C++ library, DataFusion had a separate name from the beginning, partly because it was donated rather than started within the project.

Not all of the Rust implementation went smoothly. When a rewrite of arrow-rs, known as arrow2, was created, an agreement couldn’t be reached to migrate the older version or donate the new version to the ASF 4. Thus the arrow2 and parquet2 crates are still developed in parallel to the ASF crates, though they provide inspiration for improvements to the original libraries.

Outside of the Arrow libraries, other C++ projects were excited to adopt Arrow. DuckDB implemented a “SQLite for analytics”, aimed at accelerating data science workloads that fit on a laptop. Velox was released as a distributed engine accelerator, designed to be plugged into Presto and Spark. Both of these use the Arrow in-memory format, but with some additional modifications to allow for efficient encodings 5. The Arrow project has since started efforts to incorporate those improvements into the official format.

While DuckDB and Velox are both C++ projects, they use relatively little of the C++ Arrow libraries. Both implement their own Parquet-to-Arrow readers, their own remote filesystems, and their own compute function libraries 6.

Another Arrow-native engine was created in Rust: Polars. While DuckDB and Velox implement their own arrays, Polars uses arrow2 arrays. It started as a project to create a Rust DataFrame library, but with its Python bindings it has quickly been positioning itself as a more performant Pandas replacement for ML data prep workloads 7.

Finally, I should not forget to mention that while the libraries have been regularly expanding, we have also built out a range of protocols for exchanging data: the C data interface, the C streaming interface, Flight, Flight SQL, and most recently ADBC. In a world of multiple implementations of Arrow, both in libraries (libarrow, arrow-rs, arrow2) and engines (DuckDB, Velox, Polars), these protocols are becoming especially important.
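
As a small illustration of why these protocols matter, here is a hedged Python sketch of handing a PyArrow table to DuckDB without going through an intermediate serialization step. It assumes a recent duckdb Python package, which can scan PyArrow tables referenced by name via replacement scans.

```python
import duckdb
import pyarrow as pa

# Build a small Arrow table with PyArrow.
tbl = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# DuckDB finds `tbl` in the calling scope and scans it directly; the data
# crosses the library boundary through Arrow's C interfaces rather than
# being copied into another format.
result = duckdb.query("SELECT y, x * 2 AS doubled FROM tbl").arrow()
print(result)
```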

The current state and future of Arrow C++

When thinking about the Arrow-native engines and the Arrow libraries, I can’t help but feel like the C++ ecosystem is a little too fractured right now. DuckDB and Velox implement their own Arrow utilities, while Acero is embedded within libarrow (Arrow C++). In the Rust ecosystem, DataFusion depends on arrow-rs and Polars on arrow2, so any improvements to the Arrow libraries flow into the engines built on top of them. While they all support efficient data interchange via Arrow (great!), the dream of shared algorithms and IO utilities seems far off. After all, how many Parquet-to-Arrow readers do we really need to be maintaining?

It’s possible this is an optimal outcome for now. After all, by deviating from the Arrow spec, DuckDB and Velox were able to innovate on the array encodings. One could argue a similar thing happened with arrow2, where a safer and more efficient implementation of Arrow was created. But in the long term, it does seem wasteful for these projects to maintain entirely separate algorithm libraries. The Arrow community should celebrate innovation when it happens and work hard to bring that innovation into the mainstream as it stabilizes.

Though some of the fracturing may have been a natural result of innovation, part of me thinks there still is more the C++ project could do to be more attractive as a dependency. Having Acero within libarrow is indeed convenient for rapid development, but I’m skeptical that it’s for the best. It lets us avoid challenges like build issues, missing public APIs, and misaligned release cycles. Yet those are all the same challenges our users experience; would it not be better if we felt those pains ourselves and had incentive to address them? I tend to think we would design better public APIs if we had to use them ourselves for our own query engine.

While I don’t think it’s likely we’ll get all three of those C++ engines depending fully on libarrow, it seems a worthwhile goal to get at least two of them using some of the basic components, with Acero simply pulled out into its own library.

For Velox and DuckDB, their unique array implementations would need to be brought over into libarrow. At the format level, this is in progress. RLE arrays were implemented in libarrow and were recently approved for addition to the format spec 8. Work has started on the string view type 9. Though even if these are implemented in libarrow, there may still be differences in the type systems that each engine wishes to keep (such as the separation of the concepts of data type and encoding). At the very least, this will make them more compatible over the C Data Interface, improving the possibility that they could reuse other parts of libarrow, such as the file readers and writers. For example, Velox currently uses libarrow’s Parquet writer, but chose to write its own reader so it could parse the data directly into its own array formats 10. If libarrow supported all the same array types, perhaps this would be unnecessary.

DuckDB might be unlikely to take the dependency, given its commitment to having as few dependencies as possible. That said, it’s not impossible that it could vendor a limited subset of Arrow that itself has few dependencies (arrays, datatypes, compute functions). Other parts, like filesystems and file readers, have far more dependencies, so they would be less appealing.

Who is libarrow’s and Acero’s audience?

Each of the libraries and engines I’ve mentioned has a specific audience:

  • DuckDB: data scientists who are currently using pandas, dplyr, or data.table to do dataframe manipulation. Mark Raasveldt discussed this audience in his 2020 talk for the CMU Database Group 11. He also emphasized keeping dependencies to a minimum, suggesting they also target those who wish to embed a SQL analytics engine into their C++ application, similar to how SQLite is used. The former market seems to be larger, though. The C++ API is limited, exposing mostly an interface for running queries rather than components for manipulating arrays and reading files. Emphasizes speed, ease of use, and a SQL interface.
  • Velox: developers of engines like Presto or Spark that want a highly optimized execution engine. Unlike DuckDB, Velox exposes a wider API and is meant to be used as a toolkit for those implementing distributed query engines. Emphasizes extensibility and speed.
  • Polars: data scientists who want a faster dataframe library, and Rustaceans who want a dataframe library. Emphasizes speed and a dataframe interface.
  • DataFusion: Rust developers who want to embed a SQL engine, or those who want to build a distributed engine. Sort of a combination of DuckDB’s and Velox’s audiences. Emphasizes modularity and extensibility.

Yet for Acero and libarrow (and its bindings), the audience hasn’t yet been made clear. Acero needs to define itself relative to other Arrow-based engines. And libarrow needs to focus on developers rather than end-users.

Engine      Language  Primary Audience  Primary Interface
DuckDB      C++       Data scientists   SQL
Polars      Rust      Data scientists   DataFrame
Velox       C++       Developers        ?
DataFusion  Rust      Developers        SQL
Acero       C++       ?                 DataFrame

Acero’s audience isn’t well defined yet, but it seems similar to Velox’s. Both are aimed at C++ developers, have open APIs for low-level functionality, plan for Substrait integration, and are designed to be extended. So how is Acero differentiated from Velox? A few possible answers:

  1. An emphasis on ordered execution (as-of joins, time series operations)
  2. Integration with C++ libarrow, and all the utilities that come with it. While any Arrow implementation is compatible across the C data interface, it’s not without overhead, and sticking just to the libarrow implementation may mean smaller binaries.

I’m probably not the one who will figure this out, but I hope a clearer answer will come forth in the next year.

What about libarrow, and its Python and R bindings? While the R library and to some degree the Python library have started to drift into serving end-users (e.g. data scientists), we should probably avoid this. The Arrow libraries should be used to build things that are end-user-facing, not be end-user-facing themselves. When we combine these two purposes, we create coupling that is bad for the ecosystem: as we innovate on our user-facing APIs, we keep the utilities they use private, preventing other developers from benefiting from them. Keeping our end-user-facing packages separate from our developer-facing ones means we get to “dog-food” our developer-facing APIs while we build our user-facing ones. Combining the two also confuses the Arrow brand, creating a set of users who think that “Arrow” is a computation engine, rather than a set of formats and protocols.

This is not to say that we shouldn’t have subprojects within Arrow that focus on end-users. Rather, we specifically shouldn’t use the “Arrow” brand for them. For example, I don’t think we should get rid of or regret the arrow R package. It’s been a very popular package and has been useful as a testbed for Acero. But we should consider moving the user-facing functionality (such as the dplyr queries) into a separate package (arrow-dplyr?) to give it a distinct identity. We could even consider working with other packages to use arrow as a backend; for example, I’ve heard it suggested that one day the readr package could use arrow::read_csv_arrow under the hood of readr::read_csv.

Practically, we should care less about the number of downloads and more about how many user-facing libraries are using libarrow and the quality of our relationships with those downstream projects.

Conclusion

I hope the future ahead is one in which data moves freely between systems: one with healthy competition between libraries and vendors as well as strong compatibility, so that the comparative advantage of each can be realized. We want both to encourage innovation in the community and to prioritize consolidation when it’s time to bring that innovation into the mainstream.

Footnotes

  1. H2O.ai db-benchmark discussion: “Consider renaming ‘Arrow’ case?”↩︎

  2. Blog post: “DataFusion: A Rust-native Query Engine for Apache Arrow”↩︎

  3. Dask SQL: How it Works↩︎

  4. Dev mailing list thread “[Discuss] [Rust] Arrow2/parquet2 going foward”↩︎

  5. Velox, for example, supports 5 different encodings: flat, constant, dictionary, bias, and sequence. See Velox vectors documentation.↩︎

  6. I’m less familiar with them, but it’s also worth mentioning Vaex and ClickHouse, two other Arrow-compatible engines written in C++. Vaex doesn’t use Arrow C++ in its codebase, but does rely on PyArrow for file reading, filesystem support, and NumPy integration. ClickHouse on the other hand does use libarrow for reading and writing Parquet and Arrow IPC, but isn’t Arrow-native so doesn’t use any of the compute functionality in Arrow.↩︎

  7. Polars user guide: “Coming from Pandas”↩︎

  8. Mailing list thread: “Re: [VOTE] Add RLE Arrays to Arrow Format”↩︎

  9. Mailing list thread: “[DISCUSS][Format] Starting to do some concrete work on the new “StringView” columnar data type”↩︎

  10. Velox GitHub discussion: [Design] Native Parquet Reader.↩︎

  11. DuckDB – The SQLite for Analytics (Mark Raasveldt, CWI) 2020↩︎