On blogging reproducibly with renv

Some initial thoughts on how to deploy a distill blog in a reproducible fashion. It’s a little harder than it looks and I am still working out all the details. To make my life a little easier, I started writing a small package called “refinery”, which uses the renv package to manage a separate R environment for every post, and aims to prevent conflicts between renv and distill. I’m not sure it’s useful to anyone except me, but it makes me happy.

true
2021-09-30

I started my very first blog in the dark ages, when dialup internet was a thing and the 21st century was still shiny and new. There are very few hints that this blog ever existed, which is perhaps fortunate for me since it wasn’t very good. For all its flaws though, it was a useful thing to try: rather than use one of the big blogging platforms, I hosted my own static site using the university website, and it got me started thinking about other forms of professional communication besides the tiresome process of writing academic papers. Besides, writing blog posts isn’t just useful, it’s fun.

It is surprising, then, that I haven’t managed to keep any of my many blogs running consistently. I used to think this was a personal failing on my part, but I’ve come to realise that technical blogging is an extremely difficult thing to do cleanly. In my first post on this blog I outlined four principles that I’ve tried to adhere to over the last year or two, and I think they’ve served me well:

The first three are (I think) somewhat self explanatory. It’s the fourth one that I want to talk about here, because it’s a lot harder than it looks, and my initial post on this blog underestimated how tricky it can be to get this one right. I won’t be so arrogant as to claim that I’ve gotten it right now, but with the help of Kevin Ushey’s very excellent renv package, I’m slowly making progress!

Why is this so hard?

Running a programming blog based in R markdown is fundamentally hard, because of the very thing that makes R markdown attractive: the blog post is also the source code. This is a both a blessing and a curse. It’s a blessing because it forces you, the blogger, to write code that is readable to your audience. It forces you to write code that actually works: if the code doesn’t work, the post doesn’t knit. This is extremely valuable to you and to your audience. Having become addicted to literate programming tools such as R markdown, I would never want to go back to the bad old days where you wrote your code in scripts and pasted chunks of non-executable code into a document. Over and over again I found that this introduced horrible problems: I’d fix a bug in the source code, and then forget to update it in the document. With the advent of R markdown and the many tools that rely on it (distill, blogdown, bookdown, etc), I hope never to be forced to return to that nightmare.

However, there is a catch. There is always a catch. The catch in this cases is that managing your R environment is hard. Every time you write a new post, your R environment is likely to change. Packages will have been updated, and there is a chance that code you wrote in an old post will no longer run the same way now as it did back then. The passage of time means that eventually all your old posts break: they were written using a particular R environment that no longer exists on your computer. What’s worse is that every post has a unique environment. If you want to ensure that old posts still knit, then every post needs to be associated with its own reproducible R environment. In effect, you’re in a situation where you need to maintain many R projects (one per blog post), that are themselves contained within an encompassing R project (the blog itself). That’s not easy to do.

Image by Patrick Tomasso. Available by CC0 licence on [unsplash](https://unsplash.com/photos/Oaqk7qqNh_c).

Figure 1: Image by Patrick Tomasso. Available by CC0 licence on unsplash.

Some useful tools

The difficulty in managing the R environments in a blogging context is something that comes up a lot, and there are a few workarounds that make your life a little easier. For example, in a Distill blog like this one, you maintain manual control over when a post is rendered. Building the whole site with rmarkdown::render_site() won’t trigger a rebuild of posts, so it’s possible to rebuild the rest of the site without attempting to re-knit old posts. This is a very good thing, and in the early days of blogdown the fact that you didn’t have that protection was the source of a lot of problems (happily, they fixed that now!)

Another thing you can do to make things a little easier is to use utils::sessionInfo() or devtools::session_info(). Appending a call to one of these functions to your post will at least ensure that the reader of your post knows something about what the R environment was at the time you last knit the post:

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

loaded via a namespace (and not attached):
 [1] fansi_0.5.0         rprojroot_2.0.2     digest_0.6.28      
 [4] R6_2.5.1            jsonlite_1.7.2      magrittr_2.0.1     
 [7] evaluate_0.14       highr_0.9           stringi_1.7.4      
[10] rlang_0.4.11        renv_0.14.0         fs_1.5.0           
[13] refinery_0.0.0.9057 jquerylib_0.1.4     bslib_0.3.0        
[16] vctrs_0.3.8         rmarkdown_2.10      distill_1.2        
[19] tools_4.1.1         stringr_1.4.0       xfun_0.26          
[22] yaml_2.2.1          fastmap_1.1.0       compiler_4.1.1     
[25] htmltools_0.5.2     knitr_1.36          downlit_0.2.1      
[28] sass_0.4.0         

These are useful, and taken together it’s possible to run a blog that won’t break on you, but it’s still less than ideal. For example, one problem I used to encounter often is the “minor edit” headache. I would often want to revisit an old blog post – one that no longer knits because the R environment has changed – and add a brief note mentioning that the code doesn’t work with more recent versions of certain packages! This is something I think is important to do, so that anyone reading my old posts won’t try using the same code in an R environment that won’t run it. At a bare minimum that seems polite, but… in order to make the update, I would need to modify the post, which means I’d have to re-knit the post, but… as aforementioned, the post won’t knit. It’s a Catch 22: you can’t inform people that the post won’t knit unless you are able to knit the post.

Image by Andrew Neel. Available by CC0 licence on [unsplash](https://unsplash.com/photos/7-buK9DsYO4).

Figure 2: Image by Andrew Neel. Available by CC0 licence on unsplash.

Project environments with renv

I imagine there are many different ways to solve this problem, but the approach I’ve taken in this blog is to rely on the renv package by Kevin Ushey. The goal of renv is to allow you to create and manage reproducible R environments that you can associate with a project. This post isn’t the place to write a full tutorial on how to use renv, but to oversimplify somewhat, the renv package manages an R environment using two things the lockfile and the a local package library. For any given project, you can start using renv using renv::init().

The lockfile associated with an R project has the file name renv.lock and it consists of a collection of records that precisely specify the version of renv, the version of R, and detailed information about the packages used in the project. One nice property of renv is that the lockfile is capable of tracking packages installed from GitHub as well as CRAN. For example, here’s what an entry looks like for a package installed from CRAN:

"distill": {
  "Package": "distill",
  "Version": "1.2",
  "Source": "Repository",
  "Repository": "CRAN",
  "Hash": "5edf0b55f685c668d5e800051bc31f3d"
}

This entry tells you that this post (because I’m copying from the lockfile for this post) was generated using version 1.2 of the distill package, downloaded from CRAN. On the other hand, the version of cli that I’m currently using came from GitHub:

"cli": {
  "Package": "cli",
  "Version": "3.0.1.9000",
  "Source": "GitHub",
  "RemoteType": "github",
  "RemoteHost": "api.github.com",
  "RemoteRepo": "cli",
  "RemoteUsername": "r-lib",
  "RemoteRef": "HEAD",
  "RemoteSha": "154f3215e458728a2155217a7f4897da5b8edea0",
  "Hash": "3347d46b7c20b31f8d40491f57e65c38"
}

The complete lockfile is rather long, and it contains all the information that you need to recreate the R environment.1 For any given project, you can create a lockfile using renv::snapshot().

However, although the lockfile contains the description of the R environment, it doesn’t actually contain the packages. Without the actual packages, you can’t do very much, so the renv package creates a local package library for each project, which contains the actual package installations.2. Given a lockfile, you can update the corresponding library using renv::restore().

To learn more about renv, I strongly recommend reading the package documentation. It’s very good.

Image by Nadine Marfurt. Available by CC0 licence on [unsplash](https://unsplash.com/photos/FZepNVi6f0Q).

Figure 3: Image by Nadine Marfurt. Available by CC0 licence on unsplash.

Escaping the Catch 22

The usual intent when using renv is to maintain one R environment per project, which is not quite perfectly aligned with the needs of a blog. For the blogging situation, we want one R environment per post, and – importantly – we don’t want the renv infrastructure and the blog infrastructure to interfere with each other. It’s not too difficult to do this, but I found it a little finicky to get started. So, to make my life a little easier, I started writing refinery, a small package whose sole purpose is to make distill and renv play nicely together!

The package is very much a work in progress. It’s reached the point where I can start using it on a regular basis in my own blogging, but that’s as far as I’ve gotten. But, to give you a sense of some of the design choices I’ve made, here’s a quick run through. The intended blogging workflow is as follows:

Step 1: Create the post

As a general rule, I find it extremely helpful to create posts from a template file. In my blog there’s a _templates folder containing R markdown files that are pre-populated with information that rarely changes (e.g., my name doesn’t change very often). Actually, I only have one template for this blog, but in principle there can be as many as you like: my post template has author information pre-populated, contains instructions on which fields need to be updated, and so on. Using templates is a low-tech but effective way of improving reproducibility, because it will help to ensure that all posts adhere to a common structure.

So the first step is to create a new post from a template, and to that end the refinery package has a use_article_template() function:

refinery::use_article_template(
  template = "_templates/standard_post.Rmd",
  slug = "fabulous-blog-post", 
  date = "1999-12-31"
  renv_new = FALSE
)

At a minimum, you need to specify the template argument and the slug argument. If you don’t specify a date, today’s date will be used. The concept behind this function is not at all novel: it was inspired by and is deeply similar to the create_post_from_template() function from Ella Kaye’s distilltools package. The arguments are a little different, but it’s the same idea.

Where use_article_template() differs from other “new post” functions is that it contains a renv_new argument. If renv_new = TRUE (the default), then creating the post will also set up the infrastructure necessary to manage the R environment with renv. My usual approach is to stick with the default, and allow use_article_template() to take care of that step for me, but for expository purposes the code snippet above prevents that from happening. So we’ll have to do that manually in the next section.

In the meantime, however, the effect of calling use_article_template() is to create a post inside the _posts folder of your blog. In the example above, a new folder will be created here:

_posts/1999-12-31_fabulous-blog-post

Inside this folder will be an index.Rmd file that has been constructed from the post template.

Image by Nick Fewings. Available by CC0 licence on [unsplash](https://unsplash.com/photos/pSxBMq3cAOk).

Figure 4: Image by Nick Fewings. Available by CC0 licence on unsplash.

Step 2: Start using renv

Because I set renv_new = FALSE in the code snippet above, we currently don’t have any renv infrastructure associated with this post. To do that, we’d use the following command:

refinery::renv_new("1999-12-31_fabulous-blog-post")

Like everything else in the refinery package, this is just a convenience function. All of the heavily lifting is being performed by renv::init(). What the refinery::renv_new() does is make sure that the renv infrastructure doesn’t get lumped in with the distill infrastructure, and a few other little niceties.

Why separate renv from distill? I’m so glad you asked! The default behaviour of renv::init() is to create a renv folder inside your project directory. This makes perfect sense in the “one environment per project” scenario, but it’s awkward for a blog. If you define “the blog” as the project, then you’re right back where you started: there’s no way to have separate environments for each post. But if you define “the post” as the project, you run into a different problem: distill doesn’t know about renv, and if a post folder contains a renv folder, distill will search inside it looking for things that might be blog posts (and it sometimes finds them, which leads to chaos!) We don’t want that.

The solution adopted by the refinery package is to create a new top level folder called _renv,3 and then place all the renv infrastructure in there. For our hypothetical post above, the renv infrastructure would be stored in

_renv/_posts/1999-12-31_fabulous-blog-post

The lockfile and library files associated with our new blog post are stored in there, cleanly separated from anything that distill would be interested in peeking at!

Step 3: Loading the environment

The next step is to make sure that your blog post makes proper use of the renv infrastructure we’ve just created. To do that for the hypothetical post above, all you’d need to do is ensure that the R markdown file contains a line like this:

refinery::renv_load("1999-12-31_fabulous-blog-post")

What that will do is ensure that when the post is knit, all the R code is executed using the R environment associated with the post. Yet again, if you take a look at the source code you’ll see that the refinery package really isn’t doing very much work. This is a very thin wrapper around renv::load().

Step 4: Updating the R environment

When writing a new blog post, there are two main functions in the refinery package that I use to manage the R environment (and a third one I use to rage quit!)

The process works like this. When the renv infrastructure gets created using refinery::renv_new(), it includes a bare minimum of packages in the local package library: only renv, distill, refinery, and their dependencies are added. It doesn’t, for example, include dplyr.

As you’re writing your blog post, you might find yourself using dplyr functions, and when you go to knit that post… it won’t work, even if you have dplyr on your machine. That’s because dplyr is not yet listed in the lockfile and it’s not stored in the local package library. We can fix this with two lines of code. First, we can use refinery::renv_snapshot() to scan the current post: because Kevin Ushey is very smart and renv is a very good package, the renv::snapshot() function that does all the real work will automatically discover that dplyr is being used, and it will update the lockfile:

refinery::renv_snapshot("1999-12-31_fabulous-blog-post")

This updates the lockfile, but only the lockfile. What you can then do is use the updated lockfile to update the library. The command for that is refinery::renv_restore() which – shockingly – is in fact just a thin wrapper around renv::restore():

refinery::renv_restore("1999-12-31_fabulous-blog-post")

Once you’ve done that, your post will knit, your lockfile will record all the reproducibility information associated with your post, and you will be happy! (Maybe)

Step 5: Let your readers know!

One thing I’ve been doing on my blog is including a couple of additional appendices besides the usual ones that distill provides: a “last updated” appendix that contains the timestamp indicating when post was most recently re-knit, and a “details” appendix that contains two links: one that goes to the R markdown source for the blog post, and another one that goes to the renv lockfile for the post. For that, there’s a convenience function called insert_appendix(). There are two arguments you need to include: repo_spec is the usual “user/repo” specification for the GitHub repository, and name is the name of the folder containing the blog post. Something like this:

refinery::insert_appendix(
  repo_spec = "djnavarro/distill-blog"
  name = "1999-12-31_fabulous-blog-post"
)
Image by Brett Jordan Available by CC0 licence on [unsplash](https://unsplash.com/photos/pKlBjhV1USY).

Figure 5: Image by Brett Jordan Available by CC0 licence on unsplash.

So… what next?

One of the open questions I have is whether it’s worthwhile putting much more effort into the refinery package. As it stands I’m planning to improve the documentation a little (so that “me six months in the future” doesn’t hate “me today”), but in truth this is something I wrote for myself: I like having the refinery package around because it supports this blog, but that goal is now (I hope!) mostly accomplished. It may be that other folks running distill blogs would like to use these tools, in which case it might be valuable to do something more rigorous like, oh, write some unit tests and send it to CRAN.

For now though I’m happy where things stand. If things work as planned, this should give me the infrastructure I need to maintain this blog properly for as long as I want to, and when the world moves on and an old post is no longer accurate, it should be easy to edit the post noting that the code in the post won’t work any more, re-knit it using the original R environment, and continue blogging with fewer tears. At least, that’s the hope!

Last updated

2021-10-01 20:41:34 AEST

Details

source code, R environment


  1. It doesn’t give you complete information about the machine it’s running on though, and I’m not quite at the point that I’m willing to resort to docker yet!↩︎

  2. This is an oversimplification: renv tries to be efficient and maintains a cache that helps you avoid duplication. But as I said, I’m not going to dive into details here↩︎

  3. I decided to call the top-level folder _renv rather than renv to ensure that distill will ignore the folder unless you explicitly tell it otherwise. The _renv files won’t end up being copied to your website.↩︎

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/djnavarro/distill-blog/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Navarro (2021, Sept. 30). Notes from a data witch: On blogging reproducibly with renv. Retrieved from https://blog.djnavarro.net/on-blogging-reproducibly

BibTeX citation

@misc{navarro2021on,
  author = {Navarro, Danielle},
  title = {Notes from a data witch: On blogging reproducibly with renv},
  url = {https://blog.djnavarro.net/on-blogging-reproducibly},
  year = {2021}
}