Four ways to write assertion checks in R

It’s not 50 ways to leave your lover, but spend enough time talking about assertive programming in the bedroom and you’ll only have 49 more to discover
R
Author
Published

August 8, 2023

Once upon a time in a land far, far away, I was a bright young thing who wrote my data analyses with the kind of self-assured confidence that only a bright young thing can have. I trusted myself to write analysis code that does exactly what I wanted it to do. After all, I was a smart lady who knows her data and knows her analysis tools. In those halycon days of yore, before I’d been badly burned by sequentially arriving data that don’t have precisely the same structure every single time the data updates, I had the naivete to believe that if something changed in unexpected ways, I’d notice it.

Sweet summer child.

What I have learned since then, following the well-trodden path of every embittered old data analyst whose heart has shrivelled into a dark ball of data cynicism, is that none of this is true:

Worst of all: when my assumptions fail, my code can silently do the wrong thing and never throw an error. This happens very, very easily when data structure can change over time, or when code is reused in a new context. Which… happens a lot, actually.

Real world data are horrible.

Learning my lessons the hard way has taught me the importance of writing assertion checks. The idea behind an assertion check is very simple: write some code that makes sure that your code fails loudly by throwing an error as soon as an assumption is violated.1 As the saying goes, you want your analysis code to fail fast and fail loudly every time that something is not “as expected”.

So. Let’s talk about four different approaches to writing assertions in R.2

Just stopifnot(), Scott

Here’s a simplified version of a function that I use a lot in my generative art workflows. The identifier() function constructs a unique identifier for an output generated from a particular system:

identifier <- function(name, version, seed) {
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_") 
}

So let’s say I’m creating a piece from a version 1 system called “rtistry”, and using 203 as my random seed. The unique identifier for this piece would be as follows:

identifier(name = "rtistry", version = 1, seed = 203)
[1] "rtistry_01_0203"

The idea here is that:

  • The identifier should consist of exactly three parts, separated by underscores
  • The first part should be the name of the generative art system
  • The second part should specify the version of the system as a two-digit number
  • The third part should specify the RNG seed used to generate this piece as a four-digit number

For most of my systems this will produce a globally unique identifier, since I try to design them so that the only input parameter to the system is the RNG seed.

Notice, though, that there are some unstated – and unchecked! – assumptions about the kind of input that the function will receive. It’s implicitly assumed that name will be a character string that does not have any underscores, periods, or white spaces, and it’s also assumed that version and seed are both positive valued integers (or at least “integerish”) with upper bounds of 99 and 9999 respectively. Weirdness happens when I break those assumptions with my input:

identifier(name = "r tistry", version = 1.02, seed = 203)
[1] "r tistry_1.02_0203"

As a rule, of course, I don’t deliberately pass bad inputs to my functions, but if I want to be defensive about it, I should validate the inputs so that identifier() throws an error if I make a mistake and pass it input that violates the assumptions. The base R function stopifnot() is designed to solve exactly this problem:

identifier <- function(name, version, seed) {
  
  # throw error if any of the following assertions fail
  stopifnot(
    length(name) == 1,    # name must be a scalar
    length(version) == 1, # version must be a scalar
    length(seed) == 1,    # seed must be a scalar
    rlang::is_integerish(version),  # version must be a whole number
    rlang::is_integerish(seed),     # seed must be a whole number
    !stringr::str_detect(name, "[[:space:]._]"), # name can't have spaces, periods, or underscores 
    seed > 0,      # seed must be positive
    seed < 10000,  # seed must be less than 10000
    version > 0,   # version must be positive
    version < 100  # version must be less than 100
  )
  
  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_") 
}

Using stopifnot() in this way causes all of the following to error and throw informative error messages:

identifier("r tistry", 1, 203)
Error in identifier("r tistry", 1, 203): !stringr::str_detect(name, "[[:space:]._]") is not TRUE
identifier("rtistry", 1.02, 203)
Error in identifier("rtistry", 1.02, 203): rlang::is_integerish(version) is not TRUE
identifier("rtistry", 1, 20013)
Error in identifier("rtistry", 1, 20013): seed < 10000 is not TRUE

The error messages aren’t the prettiest, but they do the job. In each case you can look at the error message and figure out what went wrong when calling the identifier() function. That said, you can sort of see the limitations to stopifnot() by looking at my source code: because stopifnot() throws pretty generic error messages that you can’t customise, my first instinct when writing the function was to group all my assertions into a single stopifnot() call, and then – because there isn’t a lot of structure to my assertion code – I’ve added comments explaining what each assertion does. That’s… fine. But not ideal.

As it turns out, there are ways to provide more informative error messages with stopifnot(). You can write a stopifnot() assertion as a name-value pair:

stopifnot("`version` must be scalar" = length(version) == 1)

If this assertion is violated, the error message thrown by the stopifnot() function corresponds to the name of the assertion, as illustrated below:

version <- 1:3 
stopifnot("`version` must be scalar" = length(version) == 1)
Error: `version` must be scalar

It’s kind of clunky but it works.

Actually, I have a confession to make. I actually didn’t know this trick until I’d already posted the original version of this post to the internet, so I have Jim Gardner to thank for kindly called my attention to it.

Summary: stopifnot() is suprisingly effective. It’s very general, and works for any expression that yields TRUE or FALSE. There are no dependencies since it’s a base R function. It does have some downsides: dealing with error messages is a bit clunky, and the code isn’t always the prettiest, but nevertheless it does the job that needs doing.

Just assert_that(), Kat

The assertthat package is designed to provide a drop-in replacement for the stopifnot() function, one that allows you to compose your own error messages when an assertion fails. It does have a variety of other convenience functions, but to be honest the main advantage over stopifnot() is the superior control over the error message. In practice, I find that this functionality allows me to write assertion code that is (a) easier to read, and (b) produces better error messages when an assertion fails.

To illustrate, here’s the code I end up with when I revisit my generative art identifier() function using assertthat:

library(assertthat)

identifier <- function(name, version, seed) {
  
  assert_that(
    length(name) == 1,
    length(version) == 1,
    length(seed) == 1,
    msg = "`name`, `version`, and `seed` must all have length 1"
  )

  assert_that(   
    !stringr::str_detect(name, "[[:space:]._]"),
    msg = "`name` must not contain white space, periods, or underscores"
  )

  assert_that(
    rlang::is_integerish(version),
    version > 0,
    version < 100,
    msg = "`version` must be a whole number between 1 and 99"
  )
   
  assert_that(
    rlang::is_integerish(seed),
    seed > 0, 
    seed < 10000,
    msg = "`seed` must be a whole number between 1 and 9999"    
  )
  
  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_") 
}

Like stopifnot(), the assert_that() function allows you to construct arbitrary assertions, which I find useful. Additionally, the assert_that() function has some nice properties when compared to stopifnot(). Because it takes a msg argument that allows you to specify the error message, it gently encourages you to group together all the assertions that are of the same kind, and then write an informative message tailored to that subset of the assertion checks. This produces readable code because the error message is right there next to the assertions themselves, and the assertions end up being more organised than when I used stopifnot() earlier.

In any case, let’s have a look. First, let’s check that this works:

identifier("rtistry", 1, 203)
[1] "rtistry_01_0203"

Second, let’s check that all of these fail and throw readable error messages:

identifier("r tistry", 1, 203)
Error: `name` must not contain white space, periods, or underscores
identifier("rtistry", 1.02, 203)
Error: `version` must be a whole number between 1 and 99
identifier("rtistry", 1, 20013)
Error: `seed` must be a whole number between 1 and 9999

I find myself preferring this as a way of generating error messages when input arguments to a function don’t receive appropriate input. Because I know what I want the function to do, I’m able to write concise but informative error messages that are appropriate to the specific set of assertions that I’ve included within any particular assert_that() call.

Summary: The assertthat package has a pretty specific aim: to provide an assert_that() function works as a drop-in replacement for stopifnot() that allows custom error messages. Given that limited goal, it works nicely.

Just assert_*() it, Kit

The assertive package provides a large collection of assert_*() functions that are each tailored to a specific type of assertion, and designed to produce error messages that are tailored to that specific case. Here’s an example where I apply this approach to checking the inputs to the identifier() function:

library(assertive)

identifier <- function(name, version, seed) {

  assert_is_scalar(version)
  assert_is_scalar(name)
  assert_is_scalar(seed)
  
  assert_is_integer(version)
  assert_is_integer(seed)
  assert_all_are_positive(c(seed, version))
  assert_all_are_less_than(seed, 10000)
  assert_all_are_less_than(version, 100)
  
  assert_all_are_not_matching_regex(name, "[[:space:]._]")

  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_") 
}

I’d probably argue that this is the most readable version of the code yet. The assert_*() functions have such transparently informative names that there’s no need at all for comments. However, there are some downsides to this approach, which become a little more apparent when we look at the error messages that it throws when I pass bad inputs to the identifier() function:

identifier("r tistry", 1L, 203L)
Error in identifier("r tistry", 1L, 203L): is_not_matching_regex : name does not match "[[:space:]._]"
There was 1 failure:
  Position    Value                   Cause
1        1 r tistry matches '[[:space:]._]'
identifier("rtistry", 1.02, 203L)
Error in identifier("rtistry", 1.02, 203L): is_integer : version is not of class 'integer'; it has class 'numeric'.
identifier("rtistry", 1L, 20013L)
Error in identifier("rtistry", 1L, 20013L): is_less_than : seed are not all less than 10000.
There was 1 failure:
  Position Value                          Cause
1        1 20013 greater than or equal to 10000

Because I don’t have custom error message code in my assertions, the errors that get returned to the user are a little bit opaque. They’re more informative than the stopifnot() versions, and because each assertion throws its own error message tailored to that function, the results are rather better suited to the context. Even so, they’re still quite long and there’s some cognitive effort required by the user to figure out what happened.

There’s a second issue here. Notice that when I wanted to pass a good input for seed or version in this version of the function, I used explicitly integer-classed values (e.g., 203L not 203). There’s a reason I did that. The assert_is_integer() function uses is.integer() test for integer status, which returns TRUE only when passed an actual integer. It returns FALSE when passed an “integerish” double:

is.integer(203L)
[1] TRUE
is.integer(203)
[1] FALSE

Because my assertion is a check for integer status not “integerish” status, this version of the identifier() function is more strict about type checking than I really want it to be, and this fails:

identifier("rtistry", 1, 203)
Error in identifier("rtistry", 1, 203): is_integer : version is not of class 'integer'; it has class 'numeric'.

Now, to be fair, there are of course many situations where you really do want to be strict about type checking integers: the integer representation of 203L is a different underlying object to the floating point representation of 203, and while R is usually pretty chill about this, it’s important to keep in mind that doubles and integers are fundamentally different data types. That being said, it’s vanishingly rare for this to actually matter in my generative art process, and I’d prefer to let this one slide.

This kind of thing is where you can run into some difficulties using the assert_*() functions. If there isn’t a specific assertion function tailored for your use case (as occurs with “integerish” check in identifier()) you’re left with the dilemma of either choosing an assertion that isn’t quite right, or else falling back on a general-purpose assertion like assert_all_are_true(). For example, this works…

library(assertive)

identifier <- function(name, version, seed) {

  assert_is_scalar(version)
  assert_is_scalar(name)
  assert_is_scalar(seed)
  
  assert_all_are_true(rlang::is_integerish(c(seed, version)))
  assert_all_are_positive(c(seed, version))
  assert_all_are_less_than(seed, 10000)
  assert_all_are_less_than(version, 100)
  
  assert_all_are_not_matching_regex(name, "[[:space:]._]")

  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_") 
}

identifier("rtistry", 1, 203)
[1] "rtistry_01_0203"

…but it’s not quite as elegant as you might hope. Nevertheless, I’m not being critical here. It’s impossible to write a package like assertive in a way that covers every use case, and it’s pretty impressive that it has the breadth that it does.

Summary: Because it provides a huge number of well-named assertion functions, the assertive package tends to produce very readable code, and because each of those functions produces errors that are tailored to that check, the error messages tend to be useful too. It does get a little awkward when there isn’t an assertion for your use case, but usually there’s a way to work around that.

Just assertr, Carr

The assertr package solves a different problem to the other three methods discussed here. The other three approaches are general-purpose tools and – with various strengths and weaknesses – they’re designed to be used when checking an arbitrary input. The assertr package is more specialised: it focuses on checking a data input, specifically a tabular data object like a data frame or a tibble. Because it’s focused on that particular – and extremely important – special case, it’s able to provide a more powerful way of validating the content of a data frame.

In that sense, assertr is complementary to the other three approaches. For example, you could use assertr to check the data input to a function that takes a data frame as the primary argument, but then use (say) assert_that() to test the others.

To get started, I’ll load the packages I’m going to use in this section:

library(dplyr)
library(readr)
library(assertr)

The assertr package provides three primary verbs, verify(), assert(), and insist(). They all take a data set as the first argument and (by default) returns the original data set unaltered if the checks pass, which makes it include them as part of a data pipeline. There’s also two row-wise variants assert_rows() and insist_rows(). For the purposes of this post I’ll limit myself to talking about the simplest cases, verify() and assert().

Let’s start with verify(). The verify() function expects to receive an expression as the first non-data argument amd yields a logical value, which is then evaluated in the data context. If the expression evaluates to FALSE, an error is thrown.

Here’s a simple example using verify(). My data set comes from the List of Archibald Prize Winners wikipedia page. The Archibald Prize is a one of the most prestigious art prizes in Australia, awarded for painted portraits, and has been awarded (almost!) annually since 1921. My data set looks like this:

archibald <- read_csv("archibald.csv", show_col_types = FALSE)
archibald
# A tibble: 166 × 6
   prize           year  artist            title         subject n_finalists
   <chr>           <chr> <chr>             <chr>         <chr>         <dbl>
 1 Archibald Prize 1921  William McInnes   Desbrowe Ann… Harold…          45
 2 Archibald Prize 1922  William McInnes   Professor Ha… Willia…          53
 3 Archibald Prize 1923  William McInnes   Portrait of … Violet…          50
 4 Archibald Prize 1924  William McInnes   Miss Collins  Gladys…          40
 5 Archibald Prize 1925  John Longstaff    Maurice Mosc… Mauric…          74
 6 Archibald Prize 1926  William McInnes   Silk and Lac… Esther…          58
 7 Archibald Prize 1927  George W. Lambert Mrs Annie Mu… Annie …          56
 8 Archibald Prize 1928  John Longstaff    Dr Alexander… Alexan…          66
 9 Archibald Prize 1929  John Longstaff    The Hon W A … Willia…          75
10 Archibald Prize 1930  William McInnes   Drum-Major H… Harry …          67
# ℹ 156 more rows

To be precise, there are actually three different prizes included in the data set. There’s the original Archibald Prize (the famous one), and two more recent additions that are awarded using the same pool of entrants: the People’s Choice Award (which is what you’d think), and the Packing Room Prize (awarded by the staff who install the portraits in the gallery).

For my first analysis then, I want to do a simple tabulation: count the number of times any given artist has won a particular prize, and sort the results in descending count order. So the analysis part of my data pipeline would look like this:

archibald |> 
  count(artist, prize) |>
  arrange(desc(n))

However, I might want to verify() a few things first. I’d like to check that prize and artist both exist as columns in the data, and both contain character data. I can use the base R function exists() to check that the variables exist within the data context, and is.character() to check the variable type:

archibald |> 
  verify(exists("prize")) |>
  verify(exists("artist")) |>
  verify(is.character("prize")) |>
  verify(is.character("artist")) |>
  count(artist, prize) |>
  arrange(desc(n))
# A tibble: 118 × 3
   artist            prize                     n
   <chr>             <chr>                 <int>
 1 William Dargie    Archibald Prize           8
 2 William McInnes   Archibald Prize           7
 3 Ivor Hele         Archibald Prize           5
 4 John Longstaff    Archibald Prize           5
 5 Vincent Fantauzzo People's Choice Award     4
 6 Clifton Pugh      Archibald Prize           3
 7 Eric Smith        Archibald Prize           3
 8 Robert Hannaford  People's Choice Award     3
 9 William Dobell    Archibald Prize           3
10 William Pidgeon   Archibald Prize           3
# ℹ 108 more rows

In this case, all the verify() checks pass, so no errors are thrown and the analysis proceeds in the usual way. But suppose that the artist variable was actually supposed to be called painter:

archibald |> 
  verify(exists("prize")) |>
  verify(exists("painter")) |>
  verify(is.character("prize")) |>
  verify(is.character("painter")) |>
  count(painter, prize) |>
  arrange(desc(n))
verification [exists("painter")] failed! (1 failure)

    verb redux_fn         predicate column index value
1 verify       NA exists("painter")     NA     1    NA
Error: assertr stopped execution

There is no painter variable in the data set, so the assertion checks fail, and an error message is thrown. The form of the error message is rather elaborate though. There is a reason why assertr defaults to this strange-looking format: often there are multiple errors that appear in an assertion check, and by default assertr will group them into a table summarising all the issues.

There’s something a little repetitive about the validation code I wrote above. If my analysis pipeline involved many variables, it would be a bit obnoxious to write a separate verify() line to check that they all exist. For the column name checks, assertr provides a convenience function has_all_names() that you can use specifically for this purpose:3

archibald |> 
  verify(has_all_names("prize", "artist")) |>
  verify(is.character("prize")) |>
  verify(is.character("artist")) |>
  count(artist, prize) |>
  arrange(desc(n))
# A tibble: 118 × 3
   artist            prize                     n
   <chr>             <chr>                 <int>
 1 William Dargie    Archibald Prize           8
 2 William McInnes   Archibald Prize           7
 3 Ivor Hele         Archibald Prize           5
 4 John Longstaff    Archibald Prize           5
 5 Vincent Fantauzzo People's Choice Award     4
 6 Clifton Pugh      Archibald Prize           3
 7 Eric Smith        Archibald Prize           3
 8 Robert Hannaford  People's Choice Award     3
 9 William Dobell    Archibald Prize           3
10 William Pidgeon   Archibald Prize           3
# ℹ 108 more rows

For the type checking, however, there’s no equivalent convenience function and if you want to group multiple verify() checks what you want to do is use the assert() function. The first non-data argument to assert() specifies a predicate function that is applied to a set of columns.4 If the predicate function returns FALSE, the assert() function errors.

Rewriting the verify() code from our “successful” example as assert() checks gives us this:

archibald |> 
  verify(has_all_names("prize", "artist")) |>
  assert(is.character, prize, artist) |>
  count(artist, prize) |>
  arrange(desc(n))
# A tibble: 118 × 3
   artist            prize                     n
   <chr>             <chr>                 <int>
 1 William Dargie    Archibald Prize           8
 2 William McInnes   Archibald Prize           7
 3 Ivor Hele         Archibald Prize           5
 4 John Longstaff    Archibald Prize           5
 5 Vincent Fantauzzo People's Choice Award     4
 6 Clifton Pugh      Archibald Prize           3
 7 Eric Smith        Archibald Prize           3
 8 Robert Hannaford  People's Choice Award     3
 9 William Dobell    Archibald Prize           3
10 William Pidgeon   Archibald Prize           3
# ℹ 108 more rows

One thing I really like about the design of assertr is that pipe-friendly assertion checks make it possible to add your assertion checks at the appropriate point in the analysis pipeline. For instance, let’s suppose I want to look at the number of finalists in the Archibald Prize each year. The raw data only records n_finalists for the Archibald Prize, not the Packing Room Prize or the People’s Choice Award. Rows in the data corresponding to those latter prizes will always have NA values for n_finalists, but that isn’t a problem for my proposed analysis. The only missingness of possible concern to me is for the Archibald Prize proper. So I can write my assertion checks like this:

archibald |> 
  verify(has_all_names("prize", "n_finalists")) |>
  assert(is.character, prize) |>
  assert(is.numeric, n_finalists) |>
  filter(prize == "Archibald Prize") |>
  assert(\(x) !is.na(x), n_finalists) |>
  summarise(
    min_finalists = min(n_finalists),
    median_finalists = median(n_finalists),
    max_finlists = max(n_finalists)
  )
Column 'n_finalists' violates assertion 'function(x) !is.na(x)' 2 times
    verb redux_fn             predicate      column index value
1 assert       NA function(x) !is.na(x) n_finalists    13    NA
2 assert       NA function(x) !is.na(x) n_finalists    69    NA
Error: assertr stopped execution

Okay, so there is in fact a case where missingness is a problem in two rows of the data set, for the explicit subset of the data I care about. As it happens though, I simply don’t care when it’s only those two years, so for the purposes of this example I’ll filter those rows out before they even hit the assertion check, and unsurprisingly this runs without erroring:

archibald |> 
  verify(has_all_names("prize", "n_finalists")) |>
  assert(is.character, prize) |>
  assert(is.numeric, n_finalists) |>
  filter(prize == "Archibald Prize", !is.na(n_finalists)) |>
  assert(\(x) !is.na(x), n_finalists) |>
  summarise(
    min_finalists = min(n_finalists),
    median_finalists = median(n_finalists),
    max_finlists = max(n_finalists)
  )
# A tibble: 1 × 3
  min_finalists median_finalists max_finlists
          <dbl>            <dbl>        <dbl>
1            15               52          197

In addition to verify() and assert(), there are three other assertion functions in assertr. I’m not going to dive into those for the purposes of this post – that’s what the package documentation is there for! – but the TL;DR is as follows:

  • insist() works like assert() but it takes a “predicate generator” function instead of a “predicate” function, which makes it possible to specify an assertion check for a tidy selection of columns and have the predicate generator handle each column according to its own logic
  • assert_rows() is a row-wise version of assert()
  • insist_rows() is a row-wise version of insist()

Summary: My overall feeling is that assertr is probably the most powerful tool for assertion checks applied to tabular data. It lacks the generality of the other tools, true, but the special case that it works for is a really important one for data analysts. Data objects tend to have their own special issues, and pretty much every data analysis takes at least one data frame as an input, so it’s really convenient to have a specialised tool for that scenario.

Footnotes

  1. The idea is very similar to writing unit checks for software development. The difference is that unit tests are run at build time, whereas assertions apply at run time.↩︎

  2. It should be noted that these aren’t the only packages out there to support assertions in R. There are at least three others that I’m aware of but haven’t yet tried, and probably many others that I don’t know about. For what it’s worth, these are the other three I know of: the ensurer, checkmate, and tester packages can all be used for this purpose, and I’m sure I could come up with terrible rhymes for those too, but there’s a limit to how much effort I want to put into this post.↩︎

  3. In general, assertr doesn’t supply lots of convenience functions, but has_all_names() is an important special case because it’s used to check for the existence of columns, and that requires a special workflow. For type checking assertions, I can group together multiple verify() checks into a single assert() check that takes a tidy selection of columns. But for that to work the columns actually have to exist, so you can’t use assert() for existence checks! Hence (I presume) the inclusion of the has_all_names() convenience function.↩︎

  4. Column names are unquoted and are passed through the dots .... The documentation notes that the dots are passed to dplyr::select(), and accordingly the assert() function supports tidy selection.↩︎

Reuse

Citation

BibTeX citation:
@online{navarro2023,
  author = {Navarro, Danielle},
  title = {Four Ways to Write Assertion Checks in {R}},
  date = {2023-08-08},
  url = {https://blog.djnavarro.net/posts/2023-08-08_being-assertive},
  langid = {en}
}
For attribution, please cite this work as:
Navarro, Danielle. 2023. “Four Ways to Write Assertion Checks in R.” August 8, 2023. https://blog.djnavarro.net/posts/2023-08-08_being-assertive.