Arrow Development Environment Part 1: C++

Tags: arrow, dev-env
Author: Will Jones
Published: August 13, 2022

Getting set up to work on the Arrow C++ library and its bindings for R and Python can be a bit tricky. Arrow’s developer guide provides general shell commands to build Arrow and run some tasks, but it’s left to the reader to build a productive workflow around those commands. This post shows how to set up a complete working environment, ready to build, lint, test, and debug Arrow, using Conda and VS Code.

Why Conda? As part of my work on Arrow C++ libraries, I often want to create independent C++ environments for test projects, like what we can do for Python with virtual environments. That’s not something that comes built-in with C++ toolchains, but it’s made easy with Conda. Conda also allows you to manage environments with different versions of Python, which is helpful when debugging some issues.

Why VS Code? First, it’s popular and has a reasonable number of well-supported plugins. Second, it’s cross-platform, so I can use essentially the same setup on Windows, Mac, and Linux. People are understandably attached to their editors, though, so I provide the shell command equivalent of most VS Code operations in this tutorial.

This tutorial comes in four parts:

  1. Setting up the C++ environment. This is the one you are reading, and it needs to be completed before any of the other parts.
  2. Setting up the Python environment.
  3. Setting up the R environment. You can skip part 2 if you don’t need to work on Python.
  4. Using the debugger.

For each, I’ll first show how to set up the environment and then how to use it. Once you are done, you’ll know how to:

  1. Build the C++, Python, and R Arrow libraries;
  2. Run the unit tests;
  3. Run the code formatters and linters;
  4. Attach the LLDB debugger to unit tests and interactive sessions.

These instructions are primarily written for and tested on macOS, but should work similarly on Linux.

Installation

First, install the following:

  • A Conda installation. I recommend mambaforge, which uses Mamba (a faster dependency resolver) and conda-forge (a community-supported channel) by default.
  • Visual Studio Code, if you are using that editor.
  • direnv, a tool we’ll use to load up environment variables.

If you’ve installed VS Code, install the following extensions:

  • C/C++
  • CMake
  • CMake Tools
  • CodeLLDB

Next, clone the Arrow repo:

git clone https://github.com/apache/arrow.git
cd arrow
git remote rename origin upstream
# Now or later, add your fork as origin:
# git remote add origin git@github.com:<your_username>/arrow.git
git submodule update --init --recursive

Then create the Conda environment:

mamba remove -n arrow-dev --all
mamba create -y -n arrow-dev \
       --channel=conda-forge \
       --file ci/conda_env_unix.txt \
       --file ci/conda_env_cpp.txt \
       --file ci/conda_env_python.txt \
       --file ci/conda_env_sphinx.txt \
       clang_osx-arm64=14 \
       clang-tools=14 \
       python=3.10 \
       pandas

Note

On Linux, you will need to swap clang_osx-arm64=14 for clang=14 gxx_linux-64.

It’s important that we install clang-tools version 14, since that is the version Arrow uses for linting and formatting.
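
To confirm that the right formatter version ended up on your PATH, a small check like the one below can help. This is only a sketch: the function name clang_format_major is made up for this example, and it assumes that clang-format --version prints text containing something like "version 14.0.6".

```shell
# Hypothetical helper: pull the major version out of `clang-format --version` output.
# Assumes the output contains text like "clang-format version 14.0.6".
clang_format_major() {
    printf '%s\n' "$1" | sed -n 's/.*version \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

if command -v clang-format >/dev/null 2>&1; then
    major="$(clang_format_major "$(clang-format --version)")"
    if [ "$major" = "14" ]; then
        echo "clang-format major version is 14, as expected"
    else
        echo "clang-format major version is '${major}', expected 14" >&2
    fi
else
    echo "clang-format not found on PATH; is the arrow-dev environment active?" >&2
fi
```

Run it with the arrow-dev environment activated; if it reports a different major version, the formatter may produce output that differs from what Arrow's lint checks expect.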

Next, create the .envrc file that direnv will read, and open it in your editor:

touch .envrc
code .envrc

Add the following to .envrc:

# direnv evaluates .envrc with bash, so load Conda's bash hook
eval "$(conda shell.bash hook)"
conda activate arrow-dev

export CC=$(which clang)
export CXX=$(which clang++)
export CLANG_TOOLS_PATH=$CONDA_PREFIX/bin

export ARROW_HOME=$CONDA_PREFIX

export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
export ARROW_TEST_DATA="${PWD}/testing/data"

Close the editor and run:

direnv allow

The Conda environment should now be active. From now on, whenever you cd into your arrow directory, the environment will activate automatically and the necessary environment variables will be set. You should open VS Code from the CLI with these environment variables active by running code . from the arrow directory.
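
If you want to confirm that direnv loaded everything, a quick sanity check along these lines may be useful. It is a sketch, not part of Arrow's tooling: check_var is a hypothetical helper, and the list of names simply mirrors the variables exported in the .envrc above.

```shell
# Hypothetical helper: report whether an environment variable is set and non-empty.
check_var() {
    name="$1"
    # Indirect lookup that works in POSIX sh; $name must be a plain variable name.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
        echo "ok $name=$value"
    else
        echo "MISSING $name"
        return 1
    fi
}

# These names mirror the exports in .envrc, plus CONDA_PREFIX set by conda activate.
for var in CONDA_PREFIX CC CXX CLANG_TOOLS_PATH ARROW_HOME PARQUET_TEST_DATA ARROW_TEST_DATA; do
    check_var "$var" || true
done
```

Any line starting with MISSING suggests that direnv allow has not been run, or that the shell was started outside the arrow directory.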

CMakeUserPresets.json

To configure how the Arrow C++ project will be built, we’ll create a file cpp/CMakeUserPresets.json. By using a configuration file instead of manually passing command line args, you will be able to easily switch between several build presets within VS Code (or other editors with CMake integration).

Add the following contents to the file cpp/CMakeUserPresets.json:

{
    "version": 3,
    "cmakeMinimumRequired": {
      "major": 3,
      "minor": 21,
      "patch": 0
    },
    "configurePresets": [
        {
            "name": "user-base",
            "hidden": true,
            "binaryDir": "${sourceDir}/build/${presetName}"
        },
        {
            "name": "user-main",
            "inherits": ["ninja-debug-python", "features-filesystems", "user-base"],
            "cacheVariables": {
                "CMAKE_INSTALL_PREFIX": "<Replace with values of `echo $CONDA_PREFIX`>",
                "CMAKE_CXX_STANDARD": "17",
                "ARROW_BUILD_EXAMPLES": "ON"
            }
        }
    ]
}

Replace the value of CMAKE_INSTALL_PREFIX with the output of echo $CONDA_PREFIX run in your shell.
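
If you would rather not edit the value by hand, the substitution can be scripted. The following is a minimal sketch under a few assumptions: set_install_prefix is a hypothetical helper name, the CMAKE_INSTALL_PREFIX entry sits on a single line as in the file above, and the prefix path contains no |, &, or backslash characters (which sed would treat specially).

```shell
# Hypothetical helper: rewrite the CMAKE_INSTALL_PREFIX value in a presets file.
# $1: path to the presets file, $2: install prefix to write into it.
set_install_prefix() {
    presets_file="$1"
    prefix="$2"
    tmp_file="${presets_file}.tmp"
    # Write to a temporary file first; this avoids the differing `sed -i` flags
    # between macOS and GNU sed.
    sed "s|\"CMAKE_INSTALL_PREFIX\": \".*\"|\"CMAKE_INSTALL_PREFIX\": \"${prefix}\"|" \
        "$presets_file" > "$tmp_file" && mv "$tmp_file" "$presets_file"
}

# Only touch the real file when it exists and a Conda environment is active.
if [ -f cpp/CMakeUserPresets.json ] && [ -n "${CONDA_PREFIX:-}" ]; then
    set_install_prefix cpp/CMakeUserPresets.json "$CONDA_PREFIX"
    grep CMAKE_INSTALL_PREFIX cpp/CMakeUserPresets.json
fi
```

Run it from the root of the arrow checkout with the environment active; the final grep prints the updated line so you can check the result.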

The CMakeUserPresets.json file defines one or more build configurations. The user-base one is a hidden preset that holds configuration common to all of your presets, which each of them can inherit. Here, it sets the build directory location based on the preset name. The user-main build is general enough to build PyArrow, the R arrow package, and the C++ examples and unit tests. You may wish to create smaller builds while working on specific parts of the C++ codebase, though most of the time that will be unnecessary.

Configuring VS Code

Finally, to configure VS Code you’ll need to add three new files:

  1. .vscode/settings.json: Tell VS Code where the C++ source is located.
  2. .vscode/tasks.json: Tell VS Code about the build, test, and lint tasks.
  3. .vscode/launch.json: Tell VS Code how to launch and attach LLDB.

Create .vscode/settings.json with the contents:

{
    "cmake.sourceDirectory": "${workspaceFolder}/cpp",
    "cmake.buildDirectory": "${workspaceFolder}/cpp/build/"
}

Create .vscode/tasks.json with the contents:

{
    "version": "2.0.0",
    "tasks": [
        {
            "type": "process",
            "label": "Build C++",
            "command": "cmake",
            "args": [
                "--build",
                "cpp/build/user-main",
                "--target",
                "install",
            ],
            "group": "build",
        },
        {
            "type": "process",
            "label": "Test C++",
            "command": "ctest",
            "args": ["-j16"],
            "group": "test",
            "options": {"cwd": "${workspaceFolder}/cpp/build/user-main/"}
        },
        {
            "type": "process",
            "label": "Check C++",
            "command": "cmake",
            "args": [
                "--build",
                "cpp/build/user-main",
                "--target",
                "format",
                "lint",
                "clang-tidy",
                "lint_cpp_cli"
            ],
            "group": "test",
        },
    ]
}

Create .vscode/launch.json with the contents:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
      {
        "name": "Debug C++ Unit Test",
        "type": "lldb",
        "request": "launch",
        // You must change this manually to the desired test file
        "program": "${command:cmake.buildDirectory}/debug/parquet-arrow-test",
        // You can use this to set to a specific test
        "args": ["--gtest_filter=TestArrowReaderAdHoc.OldDataPageV2"],
        // The attributes below are CodeLLDB ("type": "lldb") options
        "stopOnEntry": false,
        "cwd": "${workspaceFolder}",
        "env": {},
        "terminal": "external"
      }
    ]
}

How to Use the Environment

Within VS Code

First, you should select the configure preset if not already selected (VS Code may prompt you automatically too):

   CMD + SHIFT + P > CMake: Select Configure Preset > user-main

Then, do a clean reconfigure:

   CMD + SHIFT + P > CMake: Delete Cache and Reconfigure

Next, you can build:

   CMD + SHIFT + B > Build C++

Run all tests:

   CMD + P, type “task” followed by a space, then select Test C++

Format and lint:

   CMD + P, type “task” followed by a space, then select Check C++

To use the debugger, see part 4.

From CLI

First, you will want to configure the build:

cmake --preset user-main cpp

Then you can run the build:

cmake --build cpp/build/user-main --target install -j16
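
The -j16 above is tuned to a particular machine. As a rough sketch, you could derive the job count from the number of CPU cores instead. The helper name detect_jobs and the fallback of 4 are assumptions made for this example; getconf _NPROCESSORS_ONLN is expected to be available on both macOS and Linux.

```shell
# Hypothetical helper: print the number of online CPU cores, falling back to 4.
detect_jobs() {
    jobs_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || true)"
    case "$jobs_count" in
        # Empty or non-numeric output: use a conservative default.
        ''|*[!0-9]*) jobs_count=4 ;;
    esac
    echo "$jobs_count"
}

# Print the resulting build command rather than running it.
echo "cmake --build cpp/build/user-main --target install -j$(detect_jobs)"
```

The snippet only prints the command, so you can inspect it before pasting it into your shell.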

You can then run the unit tests by changing into the build directory and running ctest:

pushd cpp/build/user-main
ctest -j16

You can run a single test file by directly accessing the binary (useful to know when you want to attach a debugger!):

./debug/arrow-array-test

Or run a single test within that file:

./debug/arrow-array-test --gtest_filter=TestArrayView.StructAsStructNested
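
When you run individual tests often, a tiny wrapper can save typing. The sketch below only assembles and prints the command line. arrow_test_cmd is a hypothetical name, and it assumes you are in the build directory so that test binaries live under ./debug as in the commands above. The filter is wrapped in single quotes because GoogleTest filters may contain * wildcards that the shell could otherwise expand.

```shell
# Hypothetical helper: build the command line for one Arrow test binary.
# $1: test binary name (for example arrow-array-test)
# $2: optional GoogleTest filter (for example 'TestArrayView.*')
arrow_test_cmd() {
    cmd="./debug/$1"
    if [ -n "${2:-}" ]; then
        # Quote the filter so wildcards reach the test binary unexpanded.
        cmd="$cmd '--gtest_filter=$2'"
    fi
    echo "$cmd"
}

arrow_test_cmd arrow-array-test
arrow_test_cmd arrow-array-test 'TestArrayView.*'
```

To discover the available test names for a filter, GoogleTest binaries also accept the --gtest_list_tests flag.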

Format and lint the C++ codebase (after building) from the repository root (run popd first if you are still in the build directory) with:

cmake --build cpp/build/user-main --target format lint lint_cpp_cli

Next Steps

Now that you’ve built the C++ Arrow library, you can move on to either the Python (part 2) or R (part 3) libraries.