identifier <- function(name, version, seed) {
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_")
}
Once upon a time in a land far, far away, I was a bright young thing who wrote my data analyses with the kind of self-assured confidence that only a bright young thing can have. I trusted myself to write analysis code that did exactly what I wanted it to do. After all, I was a smart lady who knew her data and knew her analysis tools. In those halcyon days of yore, before I’d been badly burned by sequentially arriving data that don’t have precisely the same structure every single time the data update, I had the naivete to believe that if something changed in unexpected ways, I’d notice it.
Sweet summer child.
What I have learned since then, following the well-trodden path of every embittered old data analyst whose heart has shrivelled into a dark ball of data cynicism, is that none of this is true:
- I don’t know the tools as well as I think I do.
- I don’t know the data as well as I think I do.
- When the data change unexpectedly, I don’t always notice it.
Worst of all: when my assumptions fail, my code can silently do the wrong thing and never throw an error. This happens very, very easily when data structure can change over time, or when code is reused in a new context. Which… happens a lot, actually.
Real world data are horrible.
Learning my lessons the hard way has taught me the importance of writing assertion checks. The idea behind an assertion check is very simple: write some code that makes sure that your code fails loudly by throwing an error as soon as an assumption is violated.[1] As the saying goes, you want your analysis code to fail fast and fail loudly every time that something is not “as expected”.
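To make the "silently do the wrong thing" problem concrete, here is a minimal sketch in base R. The function names and the numbers are invented for illustration. The unchecked version relies on an unstated assumption that both inputs have the same length, and R's vector recycling means a violation produces a plausible-looking number instead of an error.

```r
# Hypothetical helper, invented for this illustration: a weighted total
# that silently assumes `values` and `weights` have the same length.
weighted_total <- function(values, weights) {
  sum(values * weights)
}

# The weights vector is too short, but R recycles it without complaint
# because 4 is a multiple of 2. No error, no warning, wrong answer.
weighted_total(c(10, 20, 30, 40), c(0.5, 2))

# The same helper with an assertion that fails fast and loudly
weighted_total_checked <- function(values, weights) {
  if (length(values) != length(weights)) {
    stop("`values` and `weights` must have the same length", call. = FALSE)
  }
  sum(values * weights)
}

# Capture the error message rather than halting the script
tryCatch(
  weighted_total_checked(c(10, 20, 30, 40), c(0.5, 2)),
  error = function(e) conditionMessage(e)
)
```

The checked version costs three lines and turns a quiet mistake into an immediate, readable failure.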
So. Let’s talk about four different approaches to writing assertions in R.[2]
Just stopifnot(), Scott
Here’s a simplified version of a function that I use a lot in my generative art workflows. The identifier() function constructs a unique identifier for an output generated from a particular system (its definition is the code at the very top of this post).
So let’s say I’m creating a piece from a version 1 system called “rtistry”, and using 203 as my random seed. The unique identifier for this piece would be as follows:
identifier(name = "rtistry", version = 1, seed = 203)
[1] "rtistry_01_0203"
The idea here is that:
- The identifier should consist of exactly three parts, separated by underscores
- The first part should be the name of the generative art system
- The second part should specify the version of the system as a two-digit number
- The third part should specify the RNG seed used to generate this piece as a four-digit number
For most of my systems this will produce a globally unique identifier, since I try to design them so that the only input parameter to the system is the RNG seed.
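One way to make the target format concrete is to write it down as a regular expression and test finished identifiers against it. The sketch below is an illustration only: is_valid_identifier() is a hypothetical helper name, and the pattern is one reading of the rules listed above, not something taken from the original function.

```r
# Hypothetical helper: TRUE when `id` has the form name_VV_SSSS, where the
# name contains no white space, periods, or underscores, VV is a two-digit
# version, and SSSS is a four-digit seed.
is_valid_identifier <- function(id) {
  grepl("^[^[:space:]._]+_[0-9]{2}_[0-9]{4}$", id)
}

is_valid_identifier("rtistry_01_0203")     # TRUE
is_valid_identifier("r tistry_1.02_0203")  # FALSE
```

A check like this could be applied to the output of a function as a final sanity test, in addition to validating the inputs.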
Notice, though, that there are some unstated – and unchecked! – assumptions about the kind of input that the function will receive. It’s implicitly assumed that name will be a character string that does not have any underscores, periods, or white spaces, and it’s also assumed that version and seed are both positive valued integers (or at least “integerish”) with upper bounds of 99 and 9999 respectively. Weirdness happens when I break those assumptions with my input:
identifier(name = "r tistry", version = 1.02, seed = 203)
[1] "r tistry_1.02_0203"
As a rule, of course, I don’t deliberately pass bad inputs to my functions, but if I want to be defensive about it, I should validate the inputs so that identifier() throws an error if I make a mistake and pass it input that violates the assumptions. The base R function stopifnot() is designed to solve exactly this problem:
identifier <- function(name, version, seed) {

  # throw error if any of the following assertions fail
  stopifnot(
    length(name) == 1,             # name must be a scalar
    length(version) == 1,          # version must be a scalar
    length(seed) == 1,             # seed must be a scalar
    rlang::is_integerish(version), # version must be a whole number
    rlang::is_integerish(seed),    # seed must be a whole number
    !stringr::str_detect(name, "[[:space:]._]"), # name can't have spaces, periods, or underscores
    seed > 0,                      # seed must be positive
    seed < 10000,                  # seed must be less than 10000
    version > 0,                   # version must be positive
    version < 100                  # version must be less than 100
  )

  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_")
}
Using stopifnot() in this way causes all of the following calls to fail with informative error messages:
identifier("r tistry", 1, 203)
Error in identifier("r tistry", 1, 203): !stringr::str_detect(name, "[[:space:]._]") is not TRUE
identifier("rtistry", 1.02, 203)
Error in identifier("rtistry", 1.02, 203): rlang::is_integerish(version) is not TRUE
identifier("rtistry", 1, 20013)
Error in identifier("rtistry", 1, 20013): seed < 10000 is not TRUE
The error messages aren’t the prettiest, but they do the job. In each case you can look at the error message and figure out what went wrong when calling the identifier() function. That said, you can sort of see the limitations to stopifnot() by looking at my source code: because stopifnot() throws pretty generic error messages that you can’t customise, my first instinct when writing the function was to group all my assertions into a single stopifnot() call, and then – because there isn’t a lot of structure to my assertion code – I’ve added comments explaining what each assertion does. That’s… fine. But not ideal.
As it turns out, there are ways to provide more informative error messages with stopifnot(). You can write a stopifnot() assertion as a name-value pair:
stopifnot("`version` must be scalar" = length(version) == 1)
If this assertion is violated, the error message thrown by the stopifnot() function corresponds to the name of the assertion, as illustrated below:
version <- 1:3
stopifnot("`version` must be scalar" = length(version) == 1)
Error: `version` must be scalar
It’s kind of clunky but it works.
Actually, I have a confession to make. I didn’t know this trick until I’d already posted the original version of this post to the internet, so I have Jim Gardner to thank for kindly calling my attention to it.
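For a sense of how the name-value trick scales up, here is a rough sketch of the whole set of input checks written as named assertions. This is an illustration and not the function used elsewhere in this post: identifier_named() is a hypothetical name, and to keep the block dependency-free it swaps in base R stand-ins (grepl(), round(), and sprintf()) for the stringr and rlang calls. Named arguments to stopifnot() need R 4.0.0 or later.

```r
# Hypothetical variant of identifier() using named stopifnot() assertions.
# Base R substitutes are used here in place of stringr and rlang.
identifier_named <- function(name, version, seed) {
  stopifnot(
    "`name`, `version`, and `seed` must all have length 1" =
      length(name) == 1 && length(version) == 1 && length(seed) == 1,
    "`name` must be a string with no white space, periods, or underscores" =
      is.character(name) && !grepl("[[:space:]._]", name),
    "`version` must be a whole number between 1 and 99" =
      is.numeric(version) && version == round(version) &&
        version > 0 && version < 100,
    "`seed` must be a whole number between 1 and 9999" =
      is.numeric(seed) && seed == round(seed) &&
        seed > 0 && seed < 10000
  )

  # zero-pad the version to two digits and the seed to four digits
  sprintf("%s_%02d_%04d", name, as.integer(version), as.integer(seed))
}

identifier_named("rtistry", 1, 203)

# Capture the message for a bad version rather than halting
tryCatch(
  identifier_named("rtistry", 1.02, 203),
  error = function(e) conditionMessage(e)
)
```

Because stopifnot() evaluates its arguments in order and stops at the first failure, the length check comes first so the later comparisons only ever see scalars.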
Summary: stopifnot() is surprisingly effective. It’s very general, and works for any expression that yields TRUE or FALSE. There are no dependencies since it’s a base R function. It does have some downsides: dealing with error messages is a bit clunky, and the code isn’t always the prettiest, but nevertheless it does the job that needs doing.
Just assert_that(), Kat
The assertthat package is designed to provide a drop-in replacement for the stopifnot() function, one that allows you to compose your own error messages when an assertion fails. It does have a variety of other convenience functions, but to be honest the main advantage over stopifnot() is the superior control over the error message. In practice, I find that this functionality allows me to write assertion code that is (a) easier to read, and (b) produces better error messages when an assertion fails.
To illustrate, here’s the code I end up with when I revisit my generative art identifier() function using assertthat:
library(assertthat)

identifier <- function(name, version, seed) {

  assert_that(
    length(name) == 1,
    length(version) == 1,
    length(seed) == 1,
    msg = "`name`, `version`, and `seed` must all have length 1"
  )

  assert_that(
    !stringr::str_detect(name, "[[:space:]._]"),
    msg = "`name` must not contain white space, periods, or underscores"
  )

  assert_that(
    rlang::is_integerish(version),
    version > 0,
    version < 100,
    msg = "`version` must be a whole number between 1 and 99"
  )

  assert_that(
    rlang::is_integerish(seed),
    seed > 0,
    seed < 10000,
    msg = "`seed` must be a whole number between 1 and 9999"
  )

  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_")
}
Like stopifnot(), the assert_that() function allows you to construct arbitrary assertions, which I find useful. Additionally, the assert_that() function has some nice properties when compared to stopifnot(). Because it takes a msg argument that allows you to specify the error message, it gently encourages you to group together all the assertions that are of the same kind, and then write an informative message tailored to that subset of the assertion checks. This produces readable code because the error message is right there next to the assertions themselves, and the assertions end up being more organised than when I used stopifnot() earlier.
In any case, let’s have a look. First, let’s check that this works:
identifier("rtistry", 1, 203)
[1] "rtistry_01_0203"
Second, let’s check that all of these fail and throw readable error messages:
identifier("r tistry", 1, 203)
Error: `name` must not contain white space, periods, or underscores
identifier("rtistry", 1.02, 203)
Error: `version` must be a whole number between 1 and 99
identifier("rtistry", 1, 20013)
Error: `seed` must be a whole number between 1 and 9999
I find myself preferring this as a way of generating error messages when input arguments to a function don’t receive appropriate input. Because I know what I want the function to do, I’m able to write concise but informative error messages that are appropriate to the specific set of assertions that I’ve included within any particular assert_that() call.
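As an aside, the msg argument is not the only route to a custom message in assertthat. The package also lets you attach a failure message to a predicate function with on_failure(), so the message travels with the check wherever it is reused. The sketch below is illustrative: is_whole_between() and check_version() are hypothetical names made up for this example, and the predicate is a simple base R stand-in for the rlang-based checks used above.

```r
library(assertthat)

# Hypothetical predicate (not part of assertthat): TRUE when `x` is a single
# whole number between `lower` and `upper`, inclusive.
is_whole_between <- function(x, lower, upper) {
  is.numeric(x) && length(x) == 1 && !is.na(x) &&
    x == round(x) && x >= lower && x <= upper
}

# Attach a failure message to the predicate. assert_that() supplies the
# failing call and the environment in which it was evaluated.
on_failure(is_whole_between) <- function(call, env) {
  paste0(
    "`", deparse(call$x), "` must be a whole number between ",
    eval(call$lower, env), " and ", eval(call$upper, env)
  )
}

# Hypothetical wrapper showing the predicate in use
check_version <- function(version) {
  assert_that(is_whole_between(version, 1, 99))
  invisible(version)
}

check_version(1)  # passes silently

# Capture the message for a bad input rather than halting
tryCatch(check_version(1.02), error = function(e) conditionMessage(e))
```

Whether this is nicer than msg is a matter of taste: msg keeps the message next to the assertion, while on_failure() avoids repeating the same message when one check is used in many places.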
Summary: The assertthat package has a pretty specific aim: to provide an assert_that() function that works as a drop-in replacement for stopifnot() and allows custom error messages. Given that limited goal, it works nicely.
Just assert_*() it, Kit
The assertive package provides a large collection of assert_*() functions that are each tailored to a specific type of assertion, and designed to produce error messages that are tailored to that specific case. Here’s an example where I apply this approach to checking the inputs to the identifier() function:
library(assertive)

identifier <- function(name, version, seed) {

  assert_is_scalar(version)
  assert_is_scalar(name)
  assert_is_scalar(seed)
  assert_is_integer(version)
  assert_is_integer(seed)
  assert_all_are_positive(c(seed, version))
  assert_all_are_less_than(seed, 10000)
  assert_all_are_less_than(version, 100)
  assert_all_are_not_matching_regex(name, "[[:space:]._]")

  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_")
}
I’d probably argue that this is the most readable version of the code yet. The assert_*() functions have such transparently informative names that there’s no need at all for comments. However, there are some downsides to this approach, which become a little more apparent when we look at the error messages that it throws when I pass bad inputs to the identifier() function:
identifier("r tistry", 1L, 203L)
Error in identifier("r tistry", 1L, 203L): is_not_matching_regex : name does not match "[[:space:]._]"
There was 1 failure:
Position Value Cause
1 1 r tistry matches '[[:space:]._]'
identifier("rtistry", 1.02, 203L)
Error in identifier("rtistry", 1.02, 203L): is_integer : version is not of class 'integer'; it has class 'numeric'.
identifier("rtistry", 1L, 20013L)
Error in identifier("rtistry", 1L, 20013L): is_less_than : seed are not all less than 10000.
There was 1 failure:
Position Value Cause
1 1 20013 greater than or equal to 10000
Because I don’t have custom error message code in my assertions, the errors that get returned to the user are a little bit opaque. They’re more informative than the stopifnot() versions, and because each assertion throws its own error message tailored to that function, the results are rather better suited to the context. Even so, they’re still quite long and there’s some cognitive effort required by the user to figure out what happened.
There’s a second issue here. Notice that when I wanted to pass a good input for seed or version in this version of the function, I used explicitly integer-classed values (e.g., 203L not 203). There’s a reason I did that. The assert_is_integer() function uses is.integer() to test for integer status, which returns TRUE only when passed an actual integer. It returns FALSE when passed an “integerish” double:
is.integer(203L)
[1] TRUE
is.integer(203)
[1] FALSE
Because my assertion is a check for integer status not “integerish” status, this version of the identifier() function is more strict about type checking than I really want it to be, and this fails:
identifier("rtistry", 1, 203)
Error in identifier("rtistry", 1, 203): is_integer : version is not of class 'integer'; it has class 'numeric'.
Now, to be fair, there are of course many situations where you really do want to be strict about type checking integers: the integer representation of 203L is a different underlying object to the floating point representation of 203, and while R is usually pretty chill about this, it’s important to keep in mind that doubles and integers are fundamentally different data types. That being said, it’s vanishingly rare for this to actually matter in my generative art process, and I’d prefer to let this one slide.
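The distinction between "is stored as an integer" and "has a whole-number value" can be seen directly in base R. The sketch below is only a rough illustration: is_whole_number() is a hypothetical helper written for this example, and it is a simple stand-in for the idea, not how rlang::is_integerish() is implemented.

```r
# Hypothetical helper: TRUE when every element of a numeric vector is a
# finite whole number, whatever its storage type.
is_whole_number <- function(x, tol = sqrt(.Machine$double.eps)) {
  is.numeric(x) && all(is.finite(x)) && all(abs(x - round(x)) < tol)
}

typeof(203L)           # "integer"
typeof(203)            # "double"

203L == 203            # TRUE: the values compare equal
identical(203L, 203)   # FALSE: the underlying objects differ

is_whole_number(203L)  # TRUE
is_whole_number(203)   # TRUE
is_whole_number(1.02)  # FALSE
```

A strict check like is.integer() asks about storage type, while an "integerish" check asks about the value, and it is the second question that matters for building the identifier.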
This kind of thing is where you can run into some difficulties using the assert_*() functions. If there isn’t a specific assertion function tailored for your use case (as occurs with the “integerish” check in identifier()) you’re left with the dilemma of either choosing an assertion that isn’t quite right, or else falling back on a general-purpose assertion like assert_all_are_true(). For example, this works…
library(assertive)

identifier <- function(name, version, seed) {

  assert_is_scalar(version)
  assert_is_scalar(name)
  assert_is_scalar(seed)
  assert_all_are_true(rlang::is_integerish(c(seed, version)))
  assert_all_are_positive(c(seed, version))
  assert_all_are_less_than(seed, 10000)
  assert_all_are_less_than(version, 100)
  assert_all_are_not_matching_regex(name, "[[:space:]._]")

  # the actual work of the function
  version <- stringr::str_pad(version, width = 2, pad = "0")
  seed <- stringr::str_pad(seed, width = 4, pad = "0")
  paste(name, version, seed, sep = "_")
}

identifier("rtistry", 1, 203)
[1] "rtistry_01_0203"
…but it’s not quite as elegant as you might hope. Nevertheless, I’m not being critical here. It’s impossible to write a package like assertive in a way that covers every use case, and it’s pretty impressive that it has the breadth that it does.
Summary: Because it provides a huge number of well-named assertion functions, the assertive package tends to produce very readable code, and because each of those functions produces errors that are tailored to that check, the error messages tend to be useful too. It does get a little awkward when there isn’t an assertion for your use case, but usually there’s a way to work around that.
Just assertr, Carr
The assertr package solves a different problem to the other three methods discussed here. The other three approaches are general-purpose tools and – with various strengths and weaknesses – they’re designed to be used when checking an arbitrary input. The assertr package is more specialised: it focuses on checking a data input, specifically a tabular data object like a data frame or a tibble. Because it’s focused on that particular – and extremely important – special case, it’s able to provide a more powerful way of validating the content of a data frame.
In that sense, assertr is complementary to the other three approaches. For example, you could use assertr to check the data input to a function that takes a data frame as the primary argument, but then use (say) assert_that() to test the others.
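As a rough sketch of that division of labour, the function below checks its data frame argument with assertr and its scalar argument with assertthat. Everything here is invented for illustration: frequent_winners() is a hypothetical helper and toy is a made-up data frame. The verify() and has_all_names() functions are described later in this section.

```r
library(assertr)
library(assertthat)

# Hypothetical helper: return the artists who appear at least `min_wins`
# times in `data`.
frequent_winners <- function(data, min_wins = 2) {

  # data frame input: checked with assertr
  data <- verify(data, has_all_names("artist", "prize"))

  # scalar input: checked with assertthat
  assert_that(
    is.count(min_wins),
    msg = "`min_wins` must be a single positive whole number"
  )

  wins <- table(data$artist)
  names(wins)[wins >= min_wins]
}

# toy data, invented for this example
toy <- data.frame(
  artist = c("A", "A", "B"),
  prize = c("X", "X", "Y")
)

frequent_winners(toy, min_wins = 2)
```

Each package handles the kind of input it was designed for, and neither check gets in the way of the other.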
To get started, I’ll load the packages I’m going to use in this section:
library(dplyr)
library(readr)
library(assertr)
The assertr package provides three primary verbs, verify(), assert(), and insist(). They all take a data set as the first argument and (by default) return the original data set unaltered if the checks pass, which makes it easy to include them as part of a data pipeline. There are also two row-wise variants, assert_rows() and insist_rows(). For the purposes of this post I’ll limit myself to talking about the simplest cases, verify() and assert().
Let’s start with verify(). As its first non-data argument, the verify() function expects an expression that yields a logical value, and that expression is evaluated in the data context. If the expression evaluates to FALSE, an error is thrown.
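Before moving to real data, here is a minimal sketch of that behaviour using a tiny data frame invented for this purpose. The tryCatch() wrapper is only there so that the failing check can be inspected without halting the script.

```r
library(assertr)

# toy data, invented for this example
toy <- data.frame(
  year = c(1921, 1922, 1923),
  n_finalists = c(45, 53, 50)
)

# The expression is evaluated inside the data, so `n_finalists` refers to
# the column. Every row passes, and verify() hands the data back.
checked <- verify(toy, n_finalists > 0)

# Here the first and third rows fail the check, so verify() throws an error.
outcome <- tryCatch(
  verify(toy, n_finalists > 50),
  error = function(e) "verification failed"
)
outcome
```

Because a passing verify() returns its input, it can sit between any two steps of a pipeline without changing what flows through.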
Here’s a simple example using verify(). My data set comes from the List of Archibald Prize Winners Wikipedia page. The Archibald Prize is one of the most prestigious art prizes in Australia, awarded for painted portraits, and has been awarded (almost!) annually since 1921. My data set looks like this:
archibald <- read_csv("archibald.csv", show_col_types = FALSE)
archibald
# A tibble: 166 × 6
prize year artist title subject n_finalists
<chr> <chr> <chr> <chr> <chr> <dbl>
1 Archibald Prize 1921 William McInnes Desbrowe Ann… Harold… 45
2 Archibald Prize 1922 William McInnes Professor Ha… Willia… 53
3 Archibald Prize 1923 William McInnes Portrait of … Violet… 50
4 Archibald Prize 1924 William McInnes Miss Collins Gladys… 40
5 Archibald Prize 1925 John Longstaff Maurice Mosc… Mauric… 74
6 Archibald Prize 1926 William McInnes Silk and Lac… Esther… 58
7 Archibald Prize 1927 George W. Lambert Mrs Annie Mu… Annie … 56
8 Archibald Prize 1928 John Longstaff Dr Alexander… Alexan… 66
9 Archibald Prize 1929 John Longstaff The Hon W A … Willia… 75
10 Archibald Prize 1930 William McInnes Drum-Major H… Harry … 67
# ℹ 156 more rows
To be precise, there are actually three different prizes included in the data set. There’s the original Archibald Prize (the famous one), and two more recent additions that are awarded using the same pool of entrants: the People’s Choice Award (which is what you’d think), and the Packing Room Prize (awarded by the staff who install the portraits in the gallery).
For my first analysis then, I want to do a simple tabulation: count the number of times any given artist has won a particular prize, and sort the results in descending count order. So the analysis part of my data pipeline would look like this:
archibald |>
  count(artist, prize) |>
  arrange(desc(n))
However, I might want to verify() a few things first. I’d like to check that prize and artist both exist as columns in the data, and both contain character data. I can use the base R function exists() to check that the variables exist within the data context, and is.character() to check the variable type:
archibald |>
  verify(exists("prize")) |>
  verify(exists("artist")) |>
  verify(is.character(prize)) |>   # unquoted, so the column itself is checked
  verify(is.character(artist)) |>
  count(artist, prize) |>
  arrange(desc(n))
# A tibble: 118 × 3
artist prize n
<chr> <chr> <int>
1 William Dargie Archibald Prize 8
2 William McInnes Archibald Prize 7
3 Ivor Hele Archibald Prize 5
4 John Longstaff Archibald Prize 5
5 Vincent Fantauzzo People's Choice Award 4
6 Clifton Pugh Archibald Prize 3
7 Eric Smith Archibald Prize 3
8 Robert Hannaford People's Choice Award 3
9 William Dobell Archibald Prize 3
10 William Pidgeon Archibald Prize 3
# ℹ 108 more rows
In this case, all the verify() checks pass, so no errors are thrown and the analysis proceeds in the usual way. But suppose that the artist variable was actually supposed to be called painter:
archibald |>
  verify(exists("prize")) |>
  verify(exists("painter")) |>
  verify(is.character(prize)) |>
  verify(is.character(painter)) |>
  count(painter, prize) |>
  arrange(desc(n))
verification [exists("painter")] failed! (1 failure)
verb redux_fn predicate column index value
1 verify NA exists("painter") NA 1 NA
Error: assertr stopped execution
There is no painter variable in the data set, so the assertion checks fail, and an error message is thrown. The form of the error message is rather elaborate though. There is a reason why assertr defaults to this strange-looking format: often there are multiple errors that appear in an assertion check, and by default assertr will group them into a table summarising all the issues.
There’s something a little repetitive about the validation code I wrote above. If my analysis pipeline involved many variables, it would be a bit obnoxious to write a separate verify() line to check that they all exist. For the column name checks, assertr provides a convenience function has_all_names() that you can use specifically for this purpose:[3]
archibald |>
  verify(has_all_names("prize", "artist")) |>
  verify(is.character(prize)) |>
  verify(is.character(artist)) |>
  count(artist, prize) |>
  arrange(desc(n))
# A tibble: 118 × 3
artist prize n
<chr> <chr> <int>
1 William Dargie Archibald Prize 8
2 William McInnes Archibald Prize 7
3 Ivor Hele Archibald Prize 5
4 John Longstaff Archibald Prize 5
5 Vincent Fantauzzo People's Choice Award 4
6 Clifton Pugh Archibald Prize 3
7 Eric Smith Archibald Prize 3
8 Robert Hannaford People's Choice Award 3
9 William Dobell Archibald Prize 3
10 William Pidgeon Archibald Prize 3
# ℹ 108 more rows
For the type checking, however, there’s no equivalent convenience function, and if you want to group multiple verify() checks, the tool for the job is the assert() function. The first non-data argument to assert() specifies a predicate function that is applied to a set of columns.[4] If the predicate function returns FALSE, the assert() function errors.
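The predicate does not have to come from base R. The assertr package ships a handful of its own, including not_na() and in_set(), and the sketch below tries both on a small data frame invented for this example. As before, tryCatch() is only used so that the failing check does not halt the script.

```r
library(assertr)

# toy data, invented for this example
toy <- data.frame(
  prize = c("Archibald Prize", "Archibald Prize", "Packing Room Prize"),
  n_finalists = c(45, 53, NA)
)

# in_set() builds a predicate that is TRUE for values in the allowed set.
# Every value of `prize` is allowed, so the data pass through.
checked <- assert(
  toy,
  in_set("Archibald Prize", "Packing Room Prize", "People's Choice Award"),
  prize
)

# not_na is itself a predicate. The third row has a missing value, so this
# assertion fails and assert() throws an error.
missing_check <- tryCatch(
  assert(toy, not_na, n_finalists),
  error = function(e) "assertion failed"
)
missing_check
```

Note that in_set() is called to create the predicate, whereas not_na is passed as a bare function name, in the same way as is.character.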
Rewriting the verify() code from our “successful” example as assert() checks gives us this:
archibald |>
  verify(has_all_names("prize", "artist")) |>
  assert(is.character, prize, artist) |>
  count(artist, prize) |>
  arrange(desc(n))
# A tibble: 118 × 3
artist prize n
<chr> <chr> <int>
1 William Dargie Archibald Prize 8
2 William McInnes Archibald Prize 7
3 Ivor Hele Archibald Prize 5
4 John Longstaff Archibald Prize 5
5 Vincent Fantauzzo People's Choice Award 4
6 Clifton Pugh Archibald Prize 3
7 Eric Smith Archibald Prize 3
8 Robert Hannaford People's Choice Award 3
9 William Dobell Archibald Prize 3
10 William Pidgeon Archibald Prize 3
# ℹ 108 more rows
One thing I really like about the design of assertr is that pipe-friendly assertion checks make it possible to add your assertion checks at the appropriate point in the analysis pipeline. For instance, let’s suppose I want to look at the number of finalists in the Archibald Prize each year. The raw data only records n_finalists for the Archibald Prize, not the Packing Room Prize or the People’s Choice Award. Rows in the data corresponding to those latter prizes will always have NA values for n_finalists, but that isn’t a problem for my proposed analysis. The only missingness of possible concern to me is for the Archibald Prize proper. So I can write my assertion checks like this:
archibald |>
  verify(has_all_names("prize", "n_finalists")) |>
  assert(is.character, prize) |>
  assert(is.numeric, n_finalists) |>
  filter(prize == "Archibald Prize") |>
  assert(\(x) !is.na(x), n_finalists) |>
  summarise(
    min_finalists = min(n_finalists),
    median_finalists = median(n_finalists),
    max_finalists = max(n_finalists)
  )
Column 'n_finalists' violates assertion 'function(x) !is.na(x)' 2 times
verb redux_fn predicate column index value
1 assert NA function(x) !is.na(x) n_finalists 13 NA
2 assert NA function(x) !is.na(x) n_finalists 69 NA
Error: assertr stopped execution
Okay, so there is in fact a case where missingness is a problem in two rows of the data set, for the explicit subset of the data I care about. As it happens though, I simply don’t care when it’s only those two years, so for the purposes of this example I’ll filter those rows out before they even hit the assertion check, and unsurprisingly this runs without erroring:
archibald |>
  verify(has_all_names("prize", "n_finalists")) |>
  assert(is.character, prize) |>
  assert(is.numeric, n_finalists) |>
  filter(prize == "Archibald Prize", !is.na(n_finalists)) |>
  assert(\(x) !is.na(x), n_finalists) |>
  summarise(
    min_finalists = min(n_finalists),
    median_finalists = median(n_finalists),
    max_finalists = max(n_finalists)
  )
# A tibble: 1 × 3
  min_finalists median_finalists max_finalists
          <dbl>            <dbl>         <dbl>
1            15               52           197
In addition to verify() and assert(), there are three other assertion functions in assertr. I’m not going to dive into those for the purposes of this post – that’s what the package documentation is there for! – but the TL;DR is as follows:
- insist() works like assert() but it takes a “predicate generator” function instead of a “predicate” function, which makes it possible to specify an assertion check for a tidy selection of columns and have the predicate generator handle each column according to its own logic
- assert_rows() is a row-wise version of assert()
- insist_rows() is a row-wise version of insist()
Summary: My overall feeling is that assertr is probably the most powerful tool for assertion checks applied to tabular data. It lacks the generality of the other tools, true, but the special case that it works for is a really important one for data analysts. Data objects tend to have their own special issues, and pretty much every data analysis takes at least one data frame as an input, so it’s really convenient to have a specialised tool for that scenario.
Footnotes
1. The idea is very similar to writing unit tests for software development. The difference is that unit tests are run at build time, whereas assertions apply at run time.
2. It should be noted that these aren’t the only packages out there to support assertions in R. There are at least three others that I’m aware of but haven’t yet tried, and probably many others that I don’t know about. For what it’s worth, these are the other three I know of: the ensurer, checkmate, and tester packages can all be used for this purpose, and I’m sure I could come up with terrible rhymes for those too, but there’s a limit to how much effort I want to put into this post.
3. In general, assertr doesn’t supply lots of convenience functions, but has_all_names() is an important special case because it’s used to check for the existence of columns, and that requires a special workflow. For type checking assertions, I can group together multiple verify() checks into a single assert() check that takes a tidy selection of columns. But for that to work the columns actually have to exist, so you can’t use assert() for existence checks! Hence (I presume) the inclusion of the has_all_names() convenience function.
4. Column names are unquoted and are passed through the dots (...). The documentation notes that the dots are passed to dplyr::select(), and accordingly the assert() function supports tidy selection.
Citation
@online{navarro2023,
author = {Navarro, Danielle},
title = {Four Ways to Write Assertion Checks in {R}},
date = {2023-08-08},
url = {https://blog.djnavarro.net/posts/2023-08-08_being-assertive/},
langid = {en}
}