Binding Apache Arrow to R – Notes from a data witch

So I have a new job.

In my previous job as an academic, a large part of my work – my favourite part, if I’m honest – involved creating open access resources to help people use modern open source tools for data analysis. In my totally different role in developer relations at Voltron Data, a large part of my work involves, um … [checks job description] … creating open access resources to help people use modern open source tools for data analysis. Well okay then!

I’d better get on that, I suppose?

Yes I have been binge watching *The Magicians* lately. My preemptive apologies to everyone for the gif spam. This is Eliot Waugh. All I can say at this point is that thanks to his magnificent performance I have developed a terribly awkward crush on Hale Appleman. Image via giphy, copyright syfy

I’ve been in my current role for a little over a week (or had been when I started writing this post!), and today my first contribution to Apache Arrow was merged. It was a very modest contribution: I wrote some code that determines whether any given year is a leap year. It precisely mirrors the behaviour of the leap_year() function in the lubridate package, except that it can be applied to Arrow data and it will behave itself when used in the context of a dplyr pipeline (more on that later). The code itself is not complicated, but it relies on a little magic and a deeper understanding of Arrow than I possessed two weeks ago.

Throughout this post I’ll use boldface to refer to specific R packages like dplyr or C++ libraries like libarrow

This post is the story of how I learned Arrow magic. ✨ 🏹 ✨

Why am I writing this?

The danger of sublimated trauma is a major theme in our story
– The Great God Ember (The Magicians: Season 2, Episode 3)

It might seem peculiar that I’m writing such a long post about such a tiny contribution to an open source project. After all, it doesn’t actually take a lot of work to figure out how to detect leap years. You can do it in one line of R code:

(year %% 4 == 0) & ((year %% 100 != 0) | (year %% 400 == 0))

This is a logical expression corresponding to the following rules. If the year is divisible by 4 then it is a leap year (e.g., 1996 was a leap year). But there’s an exception: if year is divisible by 100 then it isn’t a leap year (e.g., 1900 wasn’t a leap year). But there’s also an exception to the exception: if year is divisible by 400 then it is a leap year (e.g., 2000 was a leap year). Yes, the process of mapping the verbally stated rules onto a logical expression is kind of annoying, but it’s not conceptually difficult or unusual. There is no magic in leap year calculation, no mystery that needs unpacking and explaining.
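To make the three rules concrete, here is a small sketch that wraps the one-liner in a function and checks it against the example years mentioned above. The name is_leap_year() is a placeholder I made up for illustration; it is neither the lubridate function nor the Arrow binding discussed later.

```r
# Hypothetical helper wrapping the one-line rule (not lubridate::leap_year)
is_leap_year <- function(year) {
  (year %% 4 == 0) & ((year %% 100 != 0) | (year %% 400 == 0))
}

# 1996: divisible by 4
# 1900: divisible by 100 but not by 400
# 2000: divisible by 400
example_years <- c(1996, 1900, 2000)
is_leap_year(example_years)
# TRUE FALSE TRUE
```

Because the arithmetic and logical operators are vectorised, the same expression works on a whole column of years at once, which is what makes it a natural fit for a data frame (or Arrow) workflow.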

All this assumes years are counted using the Gregorian calendar. There are, of course, other calendars

The magic comes in when you start thinking about what the arrow package actually does. It lets you write perfectly ordinary R code for data manipulation that returns perfectly ordinary R data structures, even though the data have never been loaded into R and all the computation is performed externally using Apache Arrow. The code you write with arrow looks and feels like regular R code, but almost none of the work is being done by R. This is deep magic, and it is this magic that needs to be demystified.

Two key moments in “The Magicians” when Julia Wicker discovers she can do magic, defying the expectations of others. One moment occurs at the start of Season 1 as a novice, after she had been told she failed the magic exams at Brakebills University; another moment occurs at the end of Season 2 after all magic has supposedly been turned off by the Old Gods or something. The parallel between the two moments is striking. Oh and Quentin Coldwater is in both scenes too I guess. Whatevs. Image via giphy, copyright syfy

I have three reasons for wanting to unpack and explain this magic.

The first reason is personal: I’ve been a professional educator for over 15 years and it has become habit. The moment I learn a new thing my first impulse is to work out how to explain it to someone else.
The second reason is professional: I work for Voltron Data now, and part of my job is to make an educational contribution to the open source Apache Arrow project. Arrow is a pretty cool project, but there’s very little value in magnificent software if you don’t help people learn how to take advantage of it!
The third reason is ethical: a readable tutorial/explanation lowers barriers to entry. I mean, let’s be honest: the only reason I was able to work up the courage to contribute to Apache Arrow is that I work for a company that is deeply invested in open source software and in the Arrow project specifically. I had colleagues and friends I could ask for advice. If I failed I knew they would be there to help me. I had a safety net.

The last of these is huuuuuuugely important from a community point of view. Not everyone has the safety net that I have, and it makes a big difference. In a former life I’ve been on the other side of this divide: I’ve been the person with no support, nobody to ask for help, and I’ve run afoul of capricious gatekeeping in the open source world. It is a deeply unpleasant experience, and one I would not wish upon anyone else. We lose good people when this happens, and I really don’t want that!

The quote from the beginning of this section, the one about the danger of sublimated trauma, is relevant here: if we want healthy user communities it is our obligation on the inside to provide safe environments and delightful experiences. Our job is to find and remove barriers to entry. We want to provide that “safety net” that ensures that even if you fall (because we all fall sometimes), you don’t get hurt. Failing safely at something can be a learning experience; suffering trauma, however, is almost never healthy. So yeah, this matters to me. I want to take what I’ve learned now that I’m on the inside and make that knowledge more widely accessible.

Everyone deserves a safety net when first learning to walk the tightropes. It’s not a luxury, it’s a necessity

Before diving in, I should say something about the “assumed knowledge” for this post.

I’ll do my best to explain R concepts as I go, but the post does assume that the reader is comfortable in R and knows how to use dplyr for data manipulation. If you need a refresher on these topics, I cannot recommend “R for data science” highly enough. It’s a fabulous resource!
On the Arrow side it would help a little if you have some vague idea of what Arrow is about. I will of course explain as I go, but if you’re looking for a post that starts at the very beginning, I wrote a post on “Getting started with Apache Arrow” that does exactly this and discusses a lot of the basics.
Finally, a tiny R warning: later in the post I will do a little excursion into object oriented programming and metaprogramming in R, which will be familiar to some but not all readers. If you’re not comfortable with these topics, you should be okay to skim those sections and still get the important parts of this post. You don’t need them to understand the main ideas.

The Great God Ember. Capricious, chaotic, and utterly unreliable unless what you’re looking for is a whimsical death. Pretty much the opposite of what we’d hope for in a healthy open source community really! He is, however, a very entertaining character. Image via giphy, copyright syfy

What is Arrow?

In case you decided not to read the introductory “Getting started with Apache Arrow” post, here’s an extremely condensed version. Apache Arrow is a standard and open source library that represents tabular data efficiently in memory. More generally it refers to a collection of tools used to work with Arrow data. There are libraries supporting Arrow in many different programming languages, including C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. It’s pretty cool.

Using Arrow in R

A fundamental thing to understand about the arrow package in R is that it doesn’t implement the Apache Arrow standard directly. In fact, it tries very hard not to do any of the heavy lifting itself. There’s a C++ library that does that in a super efficient way, and the job of the R package is to supply bindings that allow the R user to interact with that library using a familiar interface. The C++ library is called libarrow. Although the long term goal is to make the integration so seamless that you can use the arrow R package without ever needing to understand the C++ library, my experience has been that most people want to know something about what’s happening under the hood. It can be unsettling to find yourself programming with tools you don’t quite understand, so I’ll dig a little deeper in this post.

Let’s start with the C++ library. The role of libarrow is to do all the heavy computational work. It implements all the Arrow standards for representing tabular data in memory, provides support for the Apache “Inter-Process Communication” (IPC) protocol that lets you efficiently transfer data from one application to another, and supplies various compute kernels that allow you to do some data wrangling when your data are represented as an Arrow table.¹ It is, fundamentally, the engine that makes everything work.

What about the R package? The role of arrow is to expose the functionality of libarrow to the R user, to make that functionality feel “natural” in R, and to make it easier for R users to write Arrow code that is smoothly interoperable with Arrow code written in other languages (e.g., Python). In order to give you the flexibility you need, the arrow package allows you to interact with libarrow at three different levels of abstraction:

There’s a heavily abstracted interface that uses the dplyr bindings supplied by arrow. This version strives to make libarrow almost completely invisible, hidden behind an interface that uses familiar R function names.
There’s a lightly abstracted interface you can access using the arrow_*() functions. This version exposes the libarrow functions without attempting to exactly mirror any particular R functions, and provides a little syntactic sugar to make your life easier.
Finally, there’s a minimally abstracted interface using call_function(). This version provides a bare bones interface, without any of the syntactic sugar.

Over the next few sections I’ll talk about these three levels of abstraction. So let’s load the packages we’re going to need for this post and dive right in!

library(tibble)
library(dplyr)
library(lubridate)
library(arrow)

Penny Adiyodi in the Neitherlands, diving head first into a fountain that transports him to a new and magical world. I cannot stress enough that Penny does not, by and large, make good choices. Impulse control is a virtue, but not one that he possesses in abundance. Image via giphy, copyright syfy

Using “arrowplyr” bindings

I think I have some special fish magicks.
– Josh Hoberman (The Magicians: Season 4, Episode 13)

When I wrote my Getting started with Apache Arrow post, I concluded with an illustration of how you can write dplyr code that will work smoothly in R even when the data themselves are stored in Arrow. Here’s a little recap of how that works, using a tiny data set I pulled from The Magicians Wikipedia page. Here’s what that data set looks like:

magicians <- read_csv_arrow("magicians.csv", as_data_frame = TRUE)
magicians

# A tibble: 65 × 6
   season episode title                                air_date   rating viewers
    <int>   <int> <chr>                                <date>      <dbl>   <dbl>
 1      1       1 Unauthorized Magic                   2015-12-16    0.2    0.92
 2      1       2 The Source of Magic                  2016-01-25    0.4    1.11
 3      1       3 Consequences of Advanced Spellcasti… 2016-02-01    0.4    0.9 
 4      1       4 The World in the Walls               2016-02-08    0.3    0.75
 5      1       5 Mendings, Major and Minor            2016-02-15    0.3    0.75
 6      1       6 Impractical Applications             2016-02-22    0.3    0.65
 7      1       7 The Mayakovsky Circumstance          2016-02-29    0.3    0.7 
 8      1       8 The Strangled Heart                  2016-03-07    0.3    0.67
 9      1       9 The Writing Room                     2016-03-14    0.3    0.71
10      1      10 Homecoming                           2016-03-21    0.3    0.78
# ℹ 55 more rows

In the code above I used the read_csv_arrow() function from the arrow package. If you’ve used the read_csv() function from readr this will seem very familiar: although Arrow C++ code is doing a lot of the work under the hood, the Arrow options have been chosen to mirror the familiar readr interface. The as_data_frame argument is specific to arrow though: when it is TRUE the data are imported into R as a data frame or tibble, and when it is FALSE the data are imported into Arrow. Strictly speaking I didn’t need to specify it in this example because TRUE is the default value. I only included it here so that I could draw attention to it.

Okay, now that we have the data let’s start with a fairly typical data analysis process: computing summary variables. Perhaps I want to know the average popularity and ratings for each season of The Magicians, and extract the year in which the season aired. The dplyr package provides me with the tools I need to do this, using functions like mutate() to create new variables, group_by() to specify grouping variables, and summarise() to aggregate data within group:

magicians %>% 
  mutate(year = year(air_date)) %>% 
  group_by(season) %>% 
  summarise(
    viewers = mean(viewers),
    rating = mean(rating), 
    year = max(year)
  )

# A tibble: 5 × 4
  season viewers rating  year
   <int>   <dbl>  <dbl> <dbl>
1      1   0.776  0.308  2016
2      2   0.788  0.323  2017
3      3   0.696  0.269  2018
4      4   0.541  0.2    2019
5      5   0.353  0.111  2020

All of these computations take place within R. The magicians data set is stored in R, and all the calculations are done using this data structure.

What can we do when the data are stored in Arrow? It turns out the code is almost identical, but the first thing I’ll need to do is load the data into Arrow. The simplest way to do this is to set as_data_frame = FALSE when calling read_csv_arrow():

arrowmagicks <- read_csv_arrow("magicians.csv", as_data_frame = FALSE)
arrowmagicks

Table
65 rows x 6 columns
$season <int64>
$episode <int64>
$title <string>
$air_date <date32[day]>
$rating <double>
$viewers <double>

The “arrowmagicks” variable name is a reference to the quote at the start of the section. For a while Josh was convinced he had been gifted with special magic because he had been a fish. It made sense at the time, I guess? It’s a weird show

When I do this, two things happen. First, a data set is created outside of R in memory allocated to Arrow: all of the computations will be done on that data set. Second, the arrowmagicks variable is created inside R, which consists of a pointer to the actual data along with some handy metadata.
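Here is a minimal sketch of that distinction, assuming only that the arrow package is installed. The file and its contents are made up for the example (it is not the real magicians.csv), and the names toy_table and toy_df are placeholders.

```r
library(arrow)

# Write a tiny CSV to a temporary file (made-up data, not the real data set)
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(season = c(1, 1, 2), viewers = c(0.92, 1.11, 0.8)),
  path,
  row.names = FALSE
)

# as_data_frame = FALSE keeps the data in Arrow; R holds a Table object
toy_table <- read_csv_arrow(path, as_data_frame = FALSE)
class(toy_table)

# Converting to a data frame copies the data into R's own memory
toy_df <- as.data.frame(toy_table)
class(toy_df)
```

The Table object on the R side is a handle to data that Arrow manages, while the result of as.data.frame() is an ordinary R data structure.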

The most natural way to work with this data in R is to make sure that both the arrow and dplyr packages are loaded, and then write regular dplyr code. You can do this because the arrow package supplies methods for dplyr functions, and these methods will be called whenever the input data is an Arrow Table. I’ll refer to data analyses that use this workflow as “arrowplyr pipelines”. Here’s an example of an arrowplyr pipeline:

I’ve chosen not to boldface the “arrowplyr” terminology. arrow is a package and dplyr is a package, but arrowplyr isn’t. It’s simply a convenient fiction

arrowmagicks %>% 
  mutate(year = year(air_date)) %>% 
  group_by(season) %>% 
  summarise(
    viewers = mean(viewers),
    rating = mean(rating), 
    year = max(year)
  )

Table (query)
season: int64
viewers: double
rating: double
year: int64

See $.data for the source Arrow object

It looks like a regular dplyr pipeline, but because the input is arrowmagicks (an Arrow Table object), the effect of this is to construct a query that can be passed to libarrow to be evaluated.

It’s important to realise that at this point, all we have done is define a query: no computations have been performed on the Arrow data. This is a deliberate choice for efficiency purposes: on the C++ side there are a lot of performance optimisations that are only possible because libarrow has access to the entire query before any computations are performed. As a consequence of this, you need to explicitly tell Arrow when you want to pull the trigger and execute the query.

Later in the post I’ll talk about Arrow Expressions, the tool that powers this trickery

There are two ways to trigger query execution, one using the compute() function and the other using collect(). These two functions behave slightly differently and are useful for different purposes. The compute() function runs the query, but leaves the resulting data inside Arrow:

arrowmagicks %>% 
  mutate(year = year(air_date)) %>% 
  group_by(season) %>% 
  summarise(
    viewers = mean(viewers),
    rating = mean(rating), 
    year = max(year)
  ) %>% 
  compute()

Table
5 rows x 4 columns
$season <int64>
$viewers <double>
$rating <double>
$year <int64>

This is useful whenever you’re creating an intermediate data set that you want to reuse in Arrow later, but don’t need to use this intermediate data structure inside R. If, however, you want the output to be pulled into R so that you can do R computation with it, use the collect() function:

arrowmagicks %>% 
  mutate(year = year(air_date)) %>% 
  group_by(season) %>% 
  summarise(
    viewers = mean(viewers),
    rating = mean(rating), 
    year = max(year)
  ) %>% 
  collect()

# A tibble: 5 × 4
  season viewers rating  year
   <int>   <dbl>  <dbl> <int>
1      1   0.776  0.308  2016
2      2   0.788  0.323  2017
3      3   0.696  0.269  2018
4      4   0.541  0.2    2019
5      5   0.353  0.111  2020

The nice thing for R users is that all of this feels like regular R code. Under the hood libarrow is doing all the serious computation, but at the R level the user really doesn’t need to worry too much about that. The arrowplyr toolkit works seamlessly and invisibly.
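To see the three stages side by side (define a query, compute it in Arrow, collect it into R), here is a small sketch on a made-up table. It assumes arrow and dplyr are installed; the toy data and the object names are mine, not part of the analysis above.

```r
library(arrow)
library(dplyr)

# A tiny made-up Arrow Table
toy_table <- Table$create(
  data.frame(season = c(1, 1, 2), viewers = c(0.9, 1.1, 0.8))
)

# Defining the pipeline only builds a query; nothing is computed yet
query <- toy_table %>%
  group_by(season) %>%
  summarise(viewers = mean(viewers))
class(query)

# compute() runs the query and keeps the result in Arrow as a Table
in_arrow <- compute(query)
class(in_arrow)

# collect() runs the query and returns the result to R as a tibble
in_r <- collect(query)
class(in_r)
```

Inspecting the class at each step is a quick way to check where your data currently lives.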

In our ideal world, the arrowplyr interface is all you would ever need to use. Internally, the arrow package would intercept all the R function calls you make, and replace them with an equivalent function that performs exactly the same computation using libarrow. You the user would never need to think about what’s happening under the hood.

Real life, however, is filled with leaky abstractions, and arrowplyr is no exception. Because it’s a huge project that is under active development, there’s a lot of functionality being introduced. As an example, the current version of the package (v6.0.1) has limited support for tidyverse packages like lubridate and stringr. It’s awesome that this functionality is coming online, but because it’s happening so quickly there are gaps. The small contribution that I made today was to fill one of those gaps: currently, you can’t refer to the leap_year() function from lubridate in an arrowplyr pipeline. Well, technically you can, but whenever arrow encounters a function it doesn’t know how to execute in Arrow it throws a warning, pulls the data into R, and completes the query using native R code. Here’s what that looks like:

arrowmagicks %>% 
  mutate(
    year = year(air_date), 
    leap = leap_year(air_date)
  ) %>% 
  collect()

# A tibble: 65 × 8
   season episode title                    air_date   rating viewers  year leap 
    <int>   <int> <chr>                    <date>      <dbl>   <dbl> <int> <lgl>
 1      1       1 Unauthorized Magic       2015-12-16    0.2    0.92  2015 FALSE
 2      1       2 The Source of Magic      2016-01-25    0.4    1.11  2016 TRUE 
 3      1       3 Consequences of Advance… 2016-02-01    0.4    0.9   2016 TRUE 
 4      1       4 The World in the Walls   2016-02-08    0.3    0.75  2016 TRUE 
 5      1       5 Mendings, Major and Min… 2016-02-15    0.3    0.75  2016 TRUE 
 6      1       6 Impractical Applications 2016-02-22    0.3    0.65  2016 TRUE 
 7      1       7 The Mayakovsky Circumst… 2016-02-29    0.3    0.7   2016 TRUE 
 8      1       8 The Strangled Heart      2016-03-07    0.3    0.67  2016 TRUE 
 9      1       9 The Writing Room         2016-03-14    0.3    0.71  2016 TRUE 
10      1      10 Homecoming               2016-03-21    0.3    0.78  2016 TRUE 
# ℹ 55 more rows

This is a bit of an oversimplification. The “warn and pull into R” behaviour shown here is what happens when the data is a Table object. If it is a Dataset object, arrow throws an error

An answer has been calculated, but the warning is there to tell you that the computations weren’t performed in Arrow. Realising that it doesn’t know how to interpret leap_year(), the arrow package has tried to “fail gracefully” and pulled everything back into R. The end result of all this is that the code executes as a regular dplyr pipeline and not as an arrowplyr one. It’s not the worst possible outcome, but it still makes me sad. 😭

Quentin from timeline 40 talking to Alice from timeline 23. Communication across incommensurate universes is difficult. In the show it requires a Tesla Flexion. In Arrow, we use dplyr bindings. Image via giphy, copyright syfy

Calling “arrow-prefix” functions

Okay, let’s dig a little deeper.

In the last section I talked about arrowplyr, a collection of dplyr bindings provided by the arrow package. These are designed to mimic their native R equivalents as seamlessly as possible to enable you to write familiar code. Internally, there’s quite a lot going on to make this magic work. In most cases, the arrow developers – which I guess includes me now! 🎉 – have rewritten the R functions that they mimic. We’ve done this in a way that the computations rely only on the C++ compute functions provided by libarrow, thereby ensuring that the data never have to enter R. The arrowplyr interface is the way you’d usually interact with Arrow in R, but there are ways in which you can access the C++ compute functions a little more directly. There are two different ways you can call these compute functions yourself. If you’re working within an arrowplyr pipeline it is (relatively!) straightforward, and that’s what I’ll talk about in this section. However, there is also a more direct method which I’ll discuss later in the post.

To see what compute functions are exposed by the C++ libarrow library, you can call list_compute_functions() from R:

list_compute_functions()

  [1] "abs"                             "abs_checked"                    
  [3] "acos"                            "acos_checked"                   
  [5] "add"                             "add_checked"                    
  [7] "all"                             "and"                            
  [9] "and_kleene"                      "and_not"                        
 [11] "and_not_kleene"                  "any"                            
 [13] "approximate_median"              "array_filter"                   
 [15] "array_sort_indices"              "array_take"                     
 [17] "ascii_capitalize"                "ascii_center"                   
 [19] "ascii_is_alnum"                  "ascii_is_alpha"                 
....

The actual output continues for quite a while: there are currently 251 compute functions, most of which are low level functions needed to perform basic computational operations.
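With a list that long, it helps to search it rather than scroll. A minimal sketch, assuming the arrow package is installed: the result of list_compute_functions() is an ordinary character vector, so base R tools like grep() work on it. The exact names and counts you see will depend on which version of libarrow you have.

```r
library(arrow)

# The compute function names come back as a plain character vector
compute_fns <- list_compute_functions()
length(compute_fns)

# Search for functions whose names mention "between"
grep("between", compute_fns, value = TRUE)
```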

Let’s imagine you’re writing dplyr code to work with datetime data in a Table object like arrowmagicks. If you were working with native R data like magicians, you could do something like this:

start_date <- as.Date("2015-12-16")

magicians %>% 
  mutate(days = air_date - start_date)

# A tibble: 65 × 7
   season episode title                          air_date   rating viewers days 
    <int>   <int> <chr>                          <date>      <dbl>   <dbl> <drt>
 1      1       1 Unauthorized Magic             2015-12-16    0.2    0.92  0 d…
 2      1       2 The Source of Magic            2016-01-25    0.4    1.11 40 d…
 3      1       3 Consequences of Advanced Spel… 2016-02-01    0.4    0.9  47 d…
 4      1       4 The World in the Walls         2016-02-08    0.3    0.75 54 d…
 5      1       5 Mendings, Major and Minor      2016-02-15    0.3    0.75 61 d…
 6      1       6 Impractical Applications       2016-02-22    0.3    0.65 68 d…
 7      1       7 The Mayakovsky Circumstance    2016-02-29    0.3    0.7  75 d…
 8      1       8 The Strangled Heart            2016-03-07    0.3    0.67 82 d…
 9      1       9 The Writing Room               2016-03-14    0.3    0.71 89 d…
10      1      10 Homecoming                     2016-03-21    0.3    0.78 96 d…
# ℹ 55 more rows

Here I’ve created a new days column that counts the number of days that have elapsed between the air_date for an episode and the start_date (December 16th, 2015) when the first episode of Season 1 aired. There are a lot of data analysis situations in which you might want to do something like this, but right now you can’t actually do this using the arrow dplyr bindings because temporal arithmetic is a work in progress. In the not-too-distant future users should be able to expect code like this to work seamlessly, but right now it doesn’t. If you try it right now, you get this error:

Improving support for date/time calculations is one of the things I’m working on

arrowmagicks %>% 
  mutate(days = air_date - start_date) %>% 
  collect()

# A tibble: 65 × 7
   season episode title                          air_date   rating viewers days 
    <int>   <int> <chr>                          <date>      <dbl>   <dbl> <drt>
 1      1       1 Unauthorized Magic             2015-12-16    0.2    0.92     …
 2      1       2 The Source of Magic            2016-01-25    0.4    1.11 3456…
 3      1       3 Consequences of Advanced Spel… 2016-02-01    0.4    0.9  4060…
 4      1       4 The World in the Walls         2016-02-08    0.3    0.75 4665…
 5      1       5 Mendings, Major and Minor      2016-02-15    0.3    0.75 5270…
 6      1       6 Impractical Applications       2016-02-22    0.3    0.65 5875…
 7      1       7 The Mayakovsky Circumstance    2016-02-29    0.3    0.7  6480…
 8      1       8 The Strangled Heart            2016-03-07    0.3    0.67 7084…
 9      1       9 The Writing Room               2016-03-14    0.3    0.71 7689…
10      1      10 Homecoming                     2016-03-21    0.3    0.78 8294…
# ℹ 55 more rows

Right now, there are no general purpose arithmetic operations in arrow that allow you to subtract one date from another. However, because I chose this example rather carefully to find an edge case where the R package is missing some libarrow functionality, it turns out there is actually a days_between() function in libarrow that we could use to solve this problem, and it’s not too hard to use it. If you want to call one of the libarrow functions inside your dplyr pipeline, all you have to do is add an arrow_ prefix to the function name. For example, the C++ days_between() function becomes arrow_days_between() when called within the arrow dplyr pipeline:

arrowmagicks %>% 
  mutate(days = arrow_days_between(start_date, air_date)) %>% 
  collect()

# A tibble: 65 × 7
   season episode title                          air_date   rating viewers  days
    <int>   <int> <chr>                          <date>      <dbl>   <dbl> <int>
 1      1       1 Unauthorized Magic             2015-12-16    0.2    0.92     0
 2      1       2 The Source of Magic            2016-01-25    0.4    1.11    40
 3      1       3 Consequences of Advanced Spel… 2016-02-01    0.4    0.9     47
 4      1       4 The World in the Walls         2016-02-08    0.3    0.75    54
 5      1       5 Mendings, Major and Minor      2016-02-15    0.3    0.75    61
 6      1       6 Impractical Applications       2016-02-22    0.3    0.65    68
 7      1       7 The Mayakovsky Circumstance    2016-02-29    0.3    0.7     75
 8      1       8 The Strangled Heart            2016-03-07    0.3    0.67    82
 9      1       9 The Writing Room               2016-03-14    0.3    0.71    89
10      1      10 Homecoming                     2016-03-21    0.3    0.78    96
# ℹ 55 more rows

Notice there’s no warning message here? That’s because the computations were done in Arrow and the data have not been pulled into R.

A slightly-evil digression

Marina, blatantly lying:
    “Hi. I’m Marina. I’m here to help.”
Josh, missing important memories:
    “So you’re like some powerful, benevolent White Witch?”
Marina, comically sincere:
    “Uh-huh.”

– The Magicians: Season 4, Episode 2

At this point in the show everybody except the currently-amnesic main characters knows that Marina has no interest in helping anyone except Marina. I love Marina so much

Okay, here’s a puzzle. In the previous section I used the arrow_days_between() function in the middle of a dplyr pipe to work around a current limitation in arrow. What happens if I try to call this function in another context?

today <- as.Date("2022-01-18")

arrow_days_between(start_date, today)

Error in arrow_days_between(start_date, today): could not find function "arrow_days_between"

It turns out there is no R function called arrow_days_between(). This is … surprising, to say the least. I mean, it really does look like I used this function in the last section, doesn’t it? How does this work? The answer to this requires a slightly deeper understanding of what the dplyr bindings in arrow do, and it’s kind of a two-part answer.

Part one: Object oriented programming

Let’s consider the mutate() function. dplyr defines mutate() as an S3 generic function, which allows it to display “polymorphism”: it behaves differently depending on what kind of object is passed to the generic. When you pass a data frame to mutate(), the call is “dispatched” to the mutate.data.frame() method supplied by (but not exported by) dplyr. The arrow package builds on this by supplying methods that apply for Arrow objects. Specifically, there are internal functions mutate.ArrowTabular(), mutate.Dataset(), and mutate.arrow_dplyr_query() that are used to provide mutate() functionality for Arrow data sets. In other words, the “top level” dplyr functions in arrow are S3 methods, and method dispatch is the mechanism that does the work.
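If S3 dispatch is unfamiliar, here is a toy illustration in base R. Everything in it is made up for the example: where_is_the_work_done() is a hypothetical generic standing in for mutate(), and "toy_arrow_table" is a pretend class standing in for an Arrow object. It shows the mechanism only, not how dplyr or arrow are actually written.

```r
# A hypothetical generic: UseMethod() picks a method based on the class of x
where_is_the_work_done <- function(x, ...) {
  UseMethod("where_is_the_work_done")
}

# Method used when x is a data frame
where_is_the_work_done.data.frame <- function(x, ...) {
  "in R"
}

# Method used when x has the (pretend) class "toy_arrow_table"
where_is_the_work_done.toy_arrow_table <- function(x, ...) {
  "in Arrow"
}

native_data <- data.frame(x = 1)
pretend_arrow_data <- structure(list(), class = "toy_arrow_table")

where_is_the_work_done(native_data)
# "in R"
where_is_the_work_done(pretend_arrow_data)
# "in Arrow"
```

The caller writes exactly the same function call in both cases; the class of the input decides which code runs.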

Part two: Metaprogramming

Now let’s consider the leap_year() function that my contribution focused on. Not only is this not a generic function, it’s not even a dplyr function. It’s a regular function in the lubridate package. So how is it possible for arrow to mimic the behaviour of lubridate::leap_year() without messing up lubridate? This is where the dplyr binding part comes in. Let’s imagine that I’d written an actual function called arrowish_leap_year() that performs leap year calculations for Arrow data. If I’d done this inside the arrow package² then I’d include a line like this to register a binding:

I’ll show you how to write your own “arrowish” functions later in the post

register_binding("leap_year", arrowish_leap_year)

Once the binding has been registered, whenever leap_year() is encountered within one of the arrow-supplied dplyr functions, R will substitute my arrowish_leap_year() function in place of the lubridate::leap_year() function that would normally be called. This is only possible because R has extremely sophisticated metaprogramming tools: you (the developer) can write functions that “capture” the code that the user input, and if necessary modify that code before R evaluates it. This is a very powerful tool for constructing domain-specific languages within R. The tidyverse uses it extensively, and the arrow package does too. The dplyr bindings inside arrow use metaprogramming tricks to modify the user input in such a way that – in this example – the user input is interpreted as if the user had called arrowish_leap_year() rather than leap_year().
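The general trick can be sketched in a few lines of base R. To be clear, this is a toy and not how arrow implements its bindings: toy_bindings and toy_eval() are names I invented, and the real package works with its own registry and Arrow Expressions. The sketch only shows the idea of capturing the user’s code and evaluating it somewhere that a different leap_year() is found first.

```r
# A stand-in "arrowish" function using the leap year rule from earlier
arrowish_leap_year <- function(year) {
  (year %% 4 == 0) & ((year %% 100 != 0) | (year %% 400 == 0))
}

# A hypothetical registry mapping user-facing names to replacements
toy_bindings <- list(leap_year = arrowish_leap_year)

# Capture the user's expression unevaluated, then evaluate it so that
# columns come from `data` and function names are looked up in the
# registry before anywhere else
toy_eval <- function(data, expr) {
  captured <- substitute(expr)
  binding_env <- list2env(toy_bindings, parent = parent.frame())
  eval(captured, envir = data, enclos = binding_env)
}

toy_data <- data.frame(year = c(1900, 2000, 2016))
result <- toy_eval(toy_data, leap_year(year))
result
# FALSE TRUE TRUE
```

The user types leap_year(year), but the function that actually runs is arrowish_leap_year(). Outside of toy_eval() the registry is not consulted, so the substitution has no effect on code elsewhere.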

Cooperative magic

Taken together, these two pieces give us the answer to our puzzle. The call to arrow_days_between() works in my original example because that call was constructed within the context of an arrow-supplied mutate() function. The interpretation of this code isn’t performed by dplyr; it is handled by arrow. Internally, arrow uses metaprogramming magic to ensure that arrow_days_between() is reinterpreted as a call to the libarrow days_between() function. But that metaprogramming magic doesn’t apply anywhere except the arrowplyr context. If you try to call arrow_days_between() from the R console or even in a regular dplyr pipeline, you get an error because technically speaking this function doesn’t exist.

I guess there’s a connection between slightly-evil-Julia burning down the talking trees and my slightly-evil digression? Sort of. I mean the truth is that I just love this scene and secretly wish I was her. Of all the characters Julia has the most personally transformative arc (in my opinion), in both good ways and bad. There’s a lot going on with her life, her person, and her body. I relate to that. Image via giphy, copyright syfy

Calling libarrow directly

The weirdness of that digression leads naturally to a practical question. Given that the “arrow-prefix” functions don’t actually exist in the usual sense of the term, and the corresponding bindings can only be called in an arrowplyr context, how the heck does an R developer call the libarrow functions directly? In everyday data analysis you wouldn’t want to do this very often, but from a programming perspective it matters: if you want to write your own functions that play nicely with arrowplyr pipelines, it’s very handy to know how to call libarrow directly.

So let’s strip back another level of abstraction!

Should you ever find yourself wanting to call libarrow compute functions directly from R, call_function() will become your new best friend. It provides a very minimal interface that exposes the libarrow functions to R. The “bare bones” nature of this interface has advantages and disadvantages. The advantage is simplicity: your code doesn’t depend on any of the fancy bells and whistles. Those are fabulous from the user perspective, but from a developer point of view you usually want to keep it simple. The price you pay for this is that you must pass appropriate Arrow objects. You can’t pass a regular R object to a libarrow function and expect it to work. For example:

call_function("days_between", start_date, today)

Error: Argument 1 is of class Date but it must be one of "Array", "ChunkedArray", "RecordBatch", "Table", or "Scalar"

This doesn’t work because start_date and today are R-native Date objects and do not refer to any data structures in Arrow. The libarrow functions expect to receive pointers to Arrow objects. To fix the previous example, all we need to do is create Arrow Scalars for each date. Here’s how we do that:

arrow_start_date <- Scalar$create(start_date)
arrow_today <- Scalar$create(today)

arrow_start_date

Scalar
2015-12-16

The arrow_start_date and arrow_today variables are R data structures, but they’re only thin wrappers. The actual data are stored in Arrow, and the R objects are really just pointers to the Arrow data. These objects are suitable for passing to the libarrow days_between() function, and this works:

call_function("days_between", arrow_start_date, arrow_today)

Scalar
2225

Huh. Apparently it took me over 2000 days to write a proper fangirl post about The Magicians. I’m really late to the pop culture party, aren’t I? Oh dear. I’m getting old.
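Scalars aren’t the only thing you can pass, by the way. The error message earlier lists Arrays among the acceptable inputs, so here is a small sketch of the same call applied to two Arrow Arrays. The dates are the first three air dates from the magicians data, and I’m assuming the arrow package is installed.

```r
library(arrow)

# Two Arrow Arrays of dates, built from ordinary R Date vectors
starts <- Array$create(as.Date(c("2015-12-16", "2016-01-25")))
ends   <- Array$create(as.Date(c("2016-01-25", "2016-02-01")))

# Same libarrow function as before, applied element by element
gaps <- call_function("days_between", starts, ends)

# Pull the result back into an ordinary R vector
as.vector(gaps)
```

The result comes back as an Arrow Array, so as.vector() is needed if you want the numbers in R.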

I’m getting lazier with these connections. Using a Library gif because I’m talking about the C++ library? I mean really, you’d think I’d be better than that wouldn’t you? But no. I am not. Image via giphy, copyright syfy

Arrow expressions

There’s one more foundational topic I should discuss before I can show you how to write arrowplyr-friendly functions, and that’s Arrow Expressions. When I introduced arrowplyr early in the post I noted that most of your code is used to specify a query, and that query doesn’t get evaluated until compute() or collect() is called. If you want to write code that plays nicely with this workflow, you need to ensure that your custom functions return an Arrow Expression.

The basic idea behind expressions is probably familiar to R users, since they are what powers the metaprogramming capabilities of the language and are used extensively throughout the tidyverse as well as base R. In base R, the quote() function is used to capture a user expression and eval() is used to force it to evaluate. Here’s a simple example where I use quote() to “capture” some R code and prevent it from evaluating:

head_expr <- quote(head(magicians, n = 3))
head_expr

head(magicians, n = 3)

If I wanted to be clever I could modify the code in head_expr before allowing R to pull the trigger on evaluating it. I could combine a lot of expressions together, change parts of the code as needed, and evaluate them wherever I wanted. As you might imagine, this is super useful for creating domain specific languages within R. But this isn’t a post about metaprogramming so let’s evaluate it now:

eval(head_expr)

# A tibble: 3 × 6
  season episode title                                 air_date   rating viewers
   <int>   <int> <chr>                                 <date>      <dbl>   <dbl>
1      1       1 Unauthorized Magic                    2015-12-16    0.2    0.92
2      1       2 The Source of Magic                   2016-01-25    0.4    1.11
3      1       3 Consequences of Advanced Spellcasting 2016-02-01    0.4    0.9
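If you’re curious what “modifying the code before evaluating it” might look like, here is a small base R sketch. The spells data frame is a hypothetical stand-in so that the example runs on its own.

```r
# A small stand-in data frame (hypothetical)
spells <- data.frame(id = 1:5, name = c("a", "b", "c", "d", "e"))

# Capture the call without evaluating it
expr <- quote(head(spells, n = 3))

# A captured call can be edited much like a list: change an argument...
expr$n <- 2
# ...or swap out the function being called
expr[[1]] <- as.name("tail")

expr       # the edited code, still unevaluated
eval(expr) # now run it: the last two rows of spells
```

Nothing about spells is touched until eval() is called, which is exactly the property that makes this useful for building queries.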

The example above uses native R code. It’s not tied to Arrow in any sense. However, the arrow package provides a mechanism for doing something similar in an Arrow context. For example, here’s me creating a character string as an Arrow Scalar:

fillory <- Scalar$create("A world as intricate as filigree")
fillory

Scalar
A world as intricate as filigree

Here’s me creating the corresponding object within an Arrow Expression:

fillory <- Expression$scalar(
  Scalar$create("A world as intricate as filigree")
)
fillory

Expression
"A world as intricate as filigree"

I suspect this would not seem particularly impressive on its own, but you can use the same idea to create function calls that can be evaluated later within the Arrow context:

ember <- Expression$create("utf8_capitalize", fillory)
ember

Expression
utf8_capitalize("A world as intricate as filigree")
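In a pipeline an Expression usually refers to a column rather than a literal value. Here is a minimal sketch of that, assuming the Expression$field_ref() constructor from the arrow package and a column named title like the one in the data used in this post.

```r
library(arrow)

# Refer to a column by name; nothing is looked up or computed yet
title_col <- Expression$field_ref("title")

# Wrap that reference in a deferred call to a libarrow compute function
shouty <- Expression$create("utf8_upper", title_col)
shouty
```

As with the scalar example, this only describes a computation. No data has been touched at this point.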

So close. We are so very close to the end now.

Okay look, I’ll level with you. At this point there is absolutely no connection between the gifs and the content. This post is getting very long and my brain is fried. I need a short break to appreciate the beautiful people, and Kings Idri and Eliot are both very beautiful people. Image via giphy, copyright syfy

Writing arrowplyr functions

At long last we have all the ingredients needed to write a function that can be used in an arrowplyr pipeline. Here’s a simple implementation of the base R toupper() function:

arrowish_toupper <- function(x) {
  Expression$create("utf8_upper", x)
}

As it happens arrowplyr pipelines already support the toupper() function, so there really wasn’t a need for me to write this. However, at present they don’t support the lubridate leap_year() function, which was the purpose of my very small contribution today. An Arrow friendly version of leap_year() looks like this:

arrowish_leap_year <- function(date) {
  year <- Expression$create("year", date)
  (year %% 4 == 0) & ((year %% 100 != 0) | (year %% 400 == 0))
}
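The logical rule on the last line of that function is the usual Gregorian one, and it can be sanity checked in plain R without involving Arrow at all. In this sketch plain_leap_year() is a hypothetical helper that takes ordinary numeric years rather than an Expression.

```r
# Same rule as above, applied to ordinary numeric years
plain_leap_year <- function(year) {
  (year %% 4 == 0) & ((year %% 100 != 0) | (year %% 400 == 0))
}

# Century years are the interesting cases: 1900 and 2100 are not
# leap years, but 2000 is
years <- c(1900, 2000, 2015, 2016, 2100)
data.frame(year = years, leap = plain_leap_year(years))
```

The only difference in the Arrow version is that year is an Expression, so the arithmetic is recorded as part of the query instead of being computed immediately.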

Before putting our functions into action, let’s see what happens when we try to write a simple data analysis pipeline without them:

arrowmagicks %>% 
  mutate(
    title = toupper(title),
    year = year(air_date), 
    leap = leap_year(air_date)
  ) %>% 
  collect()

# A tibble: 65 × 8
   season episode title                    air_date   rating viewers  year leap 
    <int>   <int> <chr>                    <date>      <dbl>   <dbl> <int> <lgl>
 1      1       1 UNAUTHORIZED MAGIC       2015-12-16    0.2    0.92  2015 FALSE
 2      1       2 THE SOURCE OF MAGIC      2016-01-25    0.4    1.11  2016 TRUE 
 3      1       3 CONSEQUENCES OF ADVANCE… 2016-02-01    0.4    0.9   2016 TRUE 
 4      1       4 THE WORLD IN THE WALLS   2016-02-08    0.3    0.75  2016 TRUE 
 5      1       5 MENDINGS, MAJOR AND MIN… 2016-02-15    0.3    0.75  2016 TRUE 
 6      1       6 IMPRACTICAL APPLICATIONS 2016-02-22    0.3    0.65  2016 TRUE 
 7      1       7 THE MAYAKOVSKY CIRCUMST… 2016-02-29    0.3    0.7   2016 TRUE 
 8      1       8 THE STRANGLED HEART      2016-03-07    0.3    0.67  2016 TRUE 
 9      1       9 THE WRITING ROOM         2016-03-14    0.3    0.71  2016 TRUE 
10      1      10 HOMECOMING               2016-03-21    0.3    0.78  2016 TRUE 
# ℹ 55 more rows

The internal arrow function that handles this is called “abandon_ship”. No, I don’t know why I felt the need to mention this

Yes, it returns the correct answer, but only because arrow detected a function it doesn’t understand and has “abandoned ship”. It pulled the data into R and let dplyr do all the work. Now let’s see what happens when we use our functions instead:

arrowmagicks %>% 
  mutate(
    title = arrowish_toupper(title),
    year = year(air_date),
    leap = arrowish_leap_year(air_date)
  ) %>% 
  collect()

# A tibble: 65 × 8
   season episode title                    air_date   rating viewers  year leap 
    <int>   <int> <chr>                    <date>      <dbl>   <dbl> <int> <lgl>
 1      1       1 UNAUTHORIZED MAGIC       2015-12-16    0.2    0.92  2015 FALSE
 2      1       2 THE SOURCE OF MAGIC      2016-01-25    0.4    1.11  2016 TRUE 
 3      1       3 CONSEQUENCES OF ADVANCE… 2016-02-01    0.4    0.9   2016 TRUE 
 4      1       4 THE WORLD IN THE WALLS   2016-02-08    0.3    0.75  2016 TRUE 
 5      1       5 MENDINGS, MAJOR AND MIN… 2016-02-15    0.3    0.75  2016 TRUE 
 6      1       6 IMPRACTICAL APPLICATIONS 2016-02-22    0.3    0.65  2016 TRUE 
 7      1       7 THE MAYAKOVSKY CIRCUMST… 2016-02-29    0.3    0.7   2016 TRUE 
 8      1       8 THE STRANGLED HEART      2016-03-07    0.3    0.67  2016 TRUE 
 9      1       9 THE WRITING ROOM         2016-03-14    0.3    0.71  2016 TRUE 
10      1      10 HOMECOMING               2016-03-21    0.3    0.78  2016 TRUE 
# ℹ 55 more rows

Everything works perfectly within Arrow. No ships are abandoned, the arrowplyr pipeline springs no leaks, and we all live happily ever after.

Sort of.

I mean, we’re all still alive.

That has to count as a win, right? 🎉

Eliot and Margo applaud your success. They are the best characters, and you are also the best because you have made it to the end of a long and strange blog post. Image via giphy, copyright syfy

Epilogue: Where’s the rest of the owl?

In case you don’t know the reference: how to draw an owl

The story I’ve told in this post is a little incomplete. I’ve shown you how to write a function like arrowish_leap_year() that can slot into a dplyr pipeline and operate on an Arrow data structure. But I haven’t said anything about precisely how register_binding() works, in part because the details of the metaprogramming magic are one of the mysteries I’m currently unpacking while I dig into the code base.

But that’s not the only thing I’ve left unsaid. I haven’t talked about unit tests, for example. I haven’t talked about the social/technical process of getting code merged into the Arrow repository. If you’ve made it to the end of this post and are curious about joining the Arrow developer community, these are things you need to know about. I’ll probably write something about those topics later on, but in the meantime here are some fabulous resources that might be handy:

Apache Arrow New Contributors Guide (thank you to Alenka Frim!)
Developers Guide to Writing Bindings (thank you to Nic Crane!)
Apache Arrow R Cookbook (thank you to Nic Crane again)

Enjoy! 🍰

Footnotes

I didn’t quite understand what “kernels” meant in this context until Nic Crane kindly explained it to me. The compute API contains a number of functions which are divided up into “kernels”, specialised functions designed to work on a specific data type. The C++ Arrow compute documentation explains this better.↩︎
My actual code didn’t bother to name my function. It’s just an anonymous function passed to register_binding().↩︎

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{navarro2022,
  author = {Navarro, Danielle},
  title = {Binding {Apache} {Arrow} to {R}},
  date = {2022-01-18},
  url = {https://blog.djnavarro.net/binding-arrow-to-r},
  langid = {en}
}

For attribution, please cite this work as:

Navarro, Danielle. 2022. “Binding Apache Arrow to R.” January 18, 2022. https://blog.djnavarro.net/binding-arrow-to-r.