Data serialisation in R – Notes from a data witch

I still alive, and that’s what matters. The traumatic experience of the last week is fading, leaving a pale residue of fear and the few scraps of writing that are the sole surviving documentation of these days. It is a tale of fright, a desperate agony, and like any good tragedy it starts with the hope and naive optimism of youth…

I’ve decided the time has come for me to do a deep dive into data serialisation in R. Serialisation is one of those terms that comes up from time to time in data science, and it’s popped up so many times on my twitter feed that I feel like I need a better grasp of how serialisation works in R. It’s a topic that folks who work with big data or have a computer science background likely understand quite well, but a lot of people who use R come from other backgrounds. If you’re a social scientist who mostly works with small CSV files, for example, there’s no particular reason why you’d have encountered this. In my case, I’ve worked as a mathematical psychologist and computational modeller for about 20 years, and until very recently I’ve never had never had to think about it in any detail. The issue only came up for me when I started reading about Apache Arrow (a topic for another post, perhaps) and realised that I needed to have a better understanding of what all this data serialisation business is about, and how R handles it.

This post is aimed at anyone who is in a similar situation to me!

Oh you sweet summer child. You really think you are prepared for the dark? That’s adorable.

Image by Andrey Zvyagintsev. Available by CC0 licence on unsplash.

What is serialisation?

In general serialisation refers to any process that takes an object stored in memory and converts into a stream of bytes that can be written to a file or transmitted elsewhere. Any time we write data to a file, we are “serialising” it according to some encoding scheme. Suppose, for instance, I have a data frame called art:

art

   resolution      series sys_id img_id   short_name format
1        1000 watercolour  sys02  img34 teacup-ocean    jpg
2        1000 watercolour  sys02  img34 teacup-ocean    png
3        2000 watercolour  sys02  img34 teacup-ocean    jpg
4        2000 watercolour  sys02  img34 teacup-ocean    png
5        4000 watercolour  sys02  img34 teacup-ocean    jpg
6        4000 watercolour  sys02  img34 teacup-ocean    png
7         500 watercolour  sys02  img34 teacup-ocean    jpg
8         500 watercolour  sys02  img34 teacup-ocean    png
9        8000 watercolour  sys02  img34 teacup-ocean    jpg
10       8000 watercolour  sys02  img34 teacup-ocean    png

This data frame is currently stored in memory on my machine, and it has structure. R represents this data frame as a list of length 6. Each element of this list is a pointer to another data structure, namely an atomic vector (e.g., numeric vector). The list is accompanied by additional metadata that tells R that this particular list is a data frame. The details of how this is accomplished don’t matter for this post. All that matters for now is that the in-memory representation of art is a structured object. It’s little more complicated than a stream of data, but if I want to save this data to a file it needs to be converted into one. The process of taking an in-memory structure and converting it to a sequence of bytes is called serialisation.

Serialisation doesn’t have to be fancy. The humble CSV file can be viewed as a form of serialisation for a data frame, albeit one that does not store all the metadata associated with the data frame. Viewed this way, write.csv() can be viewed as a serialisation function for tabular data:

write.csv(art, file = "art.csv", row.names = FALSE)

When I call this function R uses the art object to write text onto the disk, saved as the file “art.csv”. If I were to open this file in a text editor, I’d see this:

"resolution","series","sys_id","img_id","short_name","format"
1000,"watercolour","sys02","img34","teacup-ocean","jpg"
1000,"watercolour","sys02","img34","teacup-ocean","png"
2000,"watercolour","sys02","img34","teacup-ocean","jpg"
2000,"watercolour","sys02","img34","teacup-ocean","png"
4000,"watercolour","sys02","img34","teacup-ocean","jpg"
4000,"watercolour","sys02","img34","teacup-ocean","png"
500,"watercolour","sys02","img34","teacup-ocean","jpg"
500,"watercolour","sys02","img34","teacup-ocean","png"
8000,"watercolour","sys02","img34","teacup-ocean","jpg"
8000,"watercolour","sys02","img34","teacup-ocean","png"

Although this view is human-readable, it is slightly misleading. The text in shown above isn’t the literal sequence of bytes. It’s how those bytes are displayed when the have been unserialised and displayed on screen as UTF-8 plain text. To get a sense of what serialised text actually looks like we can use the charToRaw() function. The first few characters of the text file are "resolu" which looks like this when series of bytes:

charToRaw('"resolu')

[1] 22 72 65 73 6f 6c 75

The raw vector shown in the output above uses one byte to represent each character. For instance, the character "l" is represented with the byte 6c in the usual hexadecimal representation. We can unpack that byte into its consituent 8-bit representation using rawToBits()

"u" |>
  charToRaw() |>
  rawToBits()

[1] 01 00 01 00 01 01 01 00

(Note that the base pipe |> is rendered as a triangle-shaped ligature in Fira Code)

Returning to the “art.csv” data file, I can use file() and readBin() to define a simple helper function that opens a binary connection to the file, reads in the first 100 bytes (or whatever), closes the file, and then returns those bytes as a raw vector:

read_bytes <- function(path, max_bytes = 100) {
  con <- file(path, open = "rb")
  bytes <- readBin(con, what = raw(), n = max_bytes)
  close(con)
  return(bytes)
}

Here are the first 100 bytes of the “art.csv” file:

read_bytes("art.csv")

  [1] 22 72 65 73 6f 6c 75 74 69 6f 6e 22 2c 22 73 65 72 69 65 73 22 2c 22 73 79
 [26] 73 5f 69 64 22 2c 22 69 6d 67 5f 69 64 22 2c 22 73 68 6f 72 74 5f 6e 61 6d
 [51] 65 22 2c 22 66 6f 72 6d 61 74 22 0a 31 30 30 30 2c 22 77 61 74 65 72 63 6f
 [76] 6c 6f 75 72 22 2c 22 73 79 73 30 32 22 2c 22 69 6d 67 33 34 22 2c 22 74 65

The read.csv() function is similar to read_bytes() in spirit: when I call read.csv("art.csv"), R opens a connection to the “art.csv” file. It then reads that sequence of bytes into memory, and then closes the file. However, unlike my simple read_bytes() function, it does something useful with that information. The sequence of bytes gets decoded (unserialised), and the result is that R reconstructs the original art data frame:

art <- read.csv("art.csv")
art

   resolution      series sys_id img_id   short_name format
1        1000 watercolour  sys02  img34 teacup-ocean    jpg
2        1000 watercolour  sys02  img34 teacup-ocean    png
3        2000 watercolour  sys02  img34 teacup-ocean    jpg
4        2000 watercolour  sys02  img34 teacup-ocean    png
5        4000 watercolour  sys02  img34 teacup-ocean    jpg
6        4000 watercolour  sys02  img34 teacup-ocean    png
7         500 watercolour  sys02  img34 teacup-ocean    jpg
8         500 watercolour  sys02  img34 teacup-ocean    png
9        8000 watercolour  sys02  img34 teacup-ocean    jpg
10       8000 watercolour  sys02  img34 teacup-ocean    png

Thrilling stuff.

Do you feel that slow dread yet, my dear? Do you feel yourself slipping? You are on the edge of the cliff. You can still climb back to safety if you want. You don’t have to fall. The choice is still yours.

Image by Daniel Jensen. Available by CC0 licence on unsplash.

How does RDS serialisation work?

Data can be serialised in different ways. The CSV format works reasonably well for rectangular data structures like data frames, but doesn’t work well if you need to serialise something complicated like a nested list. The JSON format is a better choice for those cases, but it too has some limitations when it comes to storing R objects. To serialise an R object we need to store the metadata (classes, names, and other attributes) associated with the object, and if the object is a function there is a lot of other information relevant to its execution besides the source code (e.g., enclosing environment). Because R needs this information, it relies on the native RDS format to do the work. As it happens I have an “art.rds” file on disk that stores the same data frame in the RDS format. When I use readRDS() to unserialise the file, it recreates the same data frame:

readRDS("art.rds")

   resolution      series sys_id img_id   short_name format
1        1000 watercolour  sys02  img34 teacup-ocean    jpg
2        1000 watercolour  sys02  img34 teacup-ocean    png
3        2000 watercolour  sys02  img34 teacup-ocean    jpg
4        2000 watercolour  sys02  img34 teacup-ocean    png
5        4000 watercolour  sys02  img34 teacup-ocean    jpg
6        4000 watercolour  sys02  img34 teacup-ocean    png
7         500 watercolour  sys02  img34 teacup-ocean    jpg
8         500 watercolour  sys02  img34 teacup-ocean    png
9        8000 watercolour  sys02  img34 teacup-ocean    jpg
10       8000 watercolour  sys02  img34 teacup-ocean    png

However, when I read this file using read_bytes() it’s also clear that “art.rds” contains a very different sequence of bytes to “art.csv”:

read_bytes("art.rds")

  [1] 1f 8b 08 00 00 00 00 00 00 03 8b e0 62 60 60 60 66 60 61 64 62 60 66 05 32
 [26] 19 58 43 43 dc 74 2d 80 62 c2 40 0e 1b 10 f3 02 31 50 11 f3 0b 08 66 bf 00
 [51] c1 fc 0b 20 98 f1 0b 04 cb 3b 40 30 83 00 58 3d 0b 03 27 90 e6 2e 4f 2c 49
 [76] 2d 4a ce cf c9 2f 2d 1a 4a 42 a8 be 60 2d ae 2c 36 30 1a 18 0e 9a 4b 32 73

This is hardly surprising since RDS and CSV are different file formats. But while I have a pretty good mental model of what the contents of a CSV file look like, I don’t have a very solid grasp of what the format of an RDS file is. I’m curious.

Oh sweetie, I tried to warn you…

The `serialize()` function

To get a sense of how the RDS format works, it’s helpful to note that R has a serialize() function and an unserialize() function that provide low-level access to the same mechanisms that underpin saveRDS() and readRDS().

bytes <- serialize(art, connection = NULL)

As you can see, this is the same sequence of bytes returned by read_bytes()…

bytes[1:100]

  [1] 58 0a 00 00 00 03 00 04 03 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
 [26] 03 13 00 00 00 06 00 00 00 0d 00 00 00 0a 00 00 03 e8 00 00 03 e8 00 00 07
 [51] d0 00 00 07 d0 00 00 0f a0 00 00 0f a0 00 00 01 f4 00 00 01 f4 00 00 1f 40
 [76] 00 00 1f 40 00 00 00 10 00 00 00 0a 00 04 00 09 00 00 00 0b 77 61 74 65 72

…oh wait, no it’s not. What gives???? The “art.rds” file begins with 1f 8b 08 00, whereas serialize() returns a sequence of bytes that begins with 58 0a 00 00. These are not the same at all! Why is this happening???

RDS uses gzip compression

After digging a little into the help documentation, I realised that this happens because the default behaviour of saveRDS() is to write a compressed RDS file using gzip compression. In contrast, serialize() does not employ any form of compression. The art.rds file that I have stored on disk is that gzipped version, but it’s easy enough to save an uncompressed RDS file, simply by setting compress = FALSE:

saveRDS(art, file = "art_nozip.rds", compress = FALSE)

So now when I inspect the uncompressed file using read_bytes(), the output is the same one I obtained when I called serialize(art) earlier:

read_bytes("art_nozip.rds")

  [1] 58 0a 00 00 00 03 00 04 03 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
 [26] 03 13 00 00 00 06 00 00 00 0d 00 00 00 0a 00 00 03 e8 00 00 03 e8 00 00 07
 [51] d0 00 00 07 d0 00 00 0f a0 00 00 0f a0 00 00 01 f4 00 00 01 f4 00 00 1f 40
 [76] 00 00 1f 40 00 00 00 10 00 00 00 0a 00 04 00 09 00 00 00 0b 77 61 74 65 72

That’s a relief. I was getting very anxious there, but I feel a little better now. My sanity is restored.

…for now.

The `unserialize()` function

That was frustrating. Anyway getting back to the main thread, the inverse of the serialize() function is unserialize(). It’s very similar to the readRDS() function that you’d normally use to read an RDS file, but you can apply it to a raw vector like bytes. Once again we reconstruct the original data frame:

unserialize(bytes)

   resolution      series sys_id img_id   short_name format
1        1000 watercolour  sys02  img34 teacup-ocean    jpg
2        1000 watercolour  sys02  img34 teacup-ocean    png
3        2000 watercolour  sys02  img34 teacup-ocean    jpg
4        2000 watercolour  sys02  img34 teacup-ocean    png
5        4000 watercolour  sys02  img34 teacup-ocean    jpg
6        4000 watercolour  sys02  img34 teacup-ocean    png
7         500 watercolour  sys02  img34 teacup-ocean    jpg
8         500 watercolour  sys02  img34 teacup-ocean    png
9        8000 watercolour  sys02  img34 teacup-ocean    jpg
10       8000 watercolour  sys02  img34 teacup-ocean    png

Yay.

You can sense it can’t you? It will only get worse for you, my sweet. Look upon the grim visage of those that have passed this way before. Their lifeless bones are a warning.

Image by Chelms Varthoumlien. Available by CC0 licence on unsplash.

Serialising to plain text RDS

Okay, so what I’ve learned so far is that in most cases, an RDS file is just a gzipped version of … something. It’s the gzipped version of whatever the hell it is that serialize() creates. What I don’t yet know is how the serialize() function operates. What secret magic does it use? How does it construct this sequence of bytes? What do the contents of this file actually include?

I’ll start simple. Trying to understand how a complicated object is serialised might be painful, so I’ll set the art data frame to one side. Instead, I’ll serialise a numeric vector containing three elements, and … I guess I’ll set ascii = TRUE so that R uses UTF-8 to serialise the object to plain text format rather than … writing a binary file?

Clever girl. Yes, the default behaviour is binary serialization. Unless otherwise specified using the xdr argument, serialize() enforces a big-endian representation on the binary encoding. But you didn’t want to go there did you? It frightened you, didn’t it? The abyss stares back at you, sweetness, and you are beginning to attract its attention

bytes <- serialize(
  object = c(10.1, 2.2, 94.3), 
  connection = NULL,
  ascii = TRUE
)

When I print out the bytes vector I still don’t get text though?

bytes

 [1] 41 0a 33 0a 32 36 32 39 31 32 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38
[26] 0a 31 34 0a 33 0a 31 30 2e 31 0a 32 2e 32 0a 39 34 2e 33 0a

I was expecting text. Where is my text??? I dig a little deeper and realise my mistake. What I’m looking at here is the sequence of bytes that correspond to the UTF-8 encoded text. If I want to see that text using actual letters, I need to use rawToChar(). When I do that I see something that looks vaguely like data:

rawToChar(bytes)

[1] "A\n3\n262912\n197888\n5\nUTF-8\n14\n3\n10.1\n2.2\n94.3\n"

It is a little easier to read if I use cat() to print the output:

bytes |>
  rawToChar() |>
  cat()

It’s… not immediately obvious how this output should be interpreted? I don’t know what all these lines mean, but I recognise the last three lines: those are the three values stored in the vector I serialised. Now I just need to work out what the rest of it is all about.

But before I do, I’ll check that this is exactly the same text that I see if I create an RDS file using the following command and then open that file in a text editor:

saveRDS(
  object = c(10.1, 2.2, 94.3), 
  file = "numbers.rds", 
  ascii = TRUE, 
  compress = FALSE
)

Okay, it checks out. My excitement can barely be contained.

Wilting already, aren’t you? Poor little flower, you’ve been cut from the stem. You’re dead already but you don’t even know it. All that is left is to wither away under the blistering glare of knowledge.

Image by Daria Shevtsova. Available by CC0 licence on unsplash.

Interpreting the RDS format

All right, lets see if I can interpret the contents of an RDS file. Rather than tediously writing the file to disk using saveRDS() and then loading it again, I’ll cheat slightly and write a show_rds() function that serialises an object and prints the results directly to the R console:

show_rds <- function(object, header = TRUE) {
  rds <- object |>
    serialize(connection = NULL, ascii = TRUE) |>
    rawToChar() |>
    strsplit(split = "\n") |>
    unlist()
  if(header == FALSE) rds <- rds[-(1:6)]
  cat(rds, sep = "\n")
}

Just to make sure it’s doing what it’s supposed to I’ll make sure it gives the output I’m expecting. Probably a good idea given how many times I’ve been surprised so far…

show_rds(object = c(10.1, 2.2, 94.3))

Okay, phew. That looks good.

I guess my next task is to work out what all this output means. The last three lines are obvious: that’s the data! What about the line above the data? That line reads 3 and is followed by three data values. I wonder if that’s a coincidence? I’ll see what happens if I try to serialise just 2 numbers. Does that line change to 2?

show_rds(object = c(10.1, 2.2))

Yes. Yes it does. I am learning things.

Here’s what I know so far:

A
3
262402
197888
5
UTF-8
14
3      # the object has length 3
10.1   # first value is 10.1
2.2    # second value is 2.2
94.3   # third value is 94.3

Okay, so what’s next? The 14 in the preceding line. What does that mean?

I puzzled over this for a while, and ended up needing to consult an occult tome of dangerous lore – the R Internals Manual – to find a partial answer. On the very first page of the Infernals Manual there is a table listing the SEXPTYPE codes that R uses internally to specify what kind of entity is encoded by an R object. Here are a few of these SEXPTYPE codes:

Value	SEXPTYPE	Variable type
10	LGLSXP	logical
13	INTSXP	integer
14	REALSXP	numeric
16	STRSXP	character
19	VECSXP	list

So… when I serialise a plain numeric vector, the RDS file writes the number 14 to the file. In that case I will tentatively update my beliefs about the RDS file

A
3
262402
197888
5
UTF-8
14     # the object is numeric
3      # the object has length 3
10.1   # first value is 10.1
2.2    # second value is 2.2
94.3   # third value is 94.3

Oh no dear. You have strayed so far from the light already. That 14 carries much more meaning than your fragile mind is prepared to handle. Soon you will know better. Soon you will unravel entirely. You can feel it coming, can’t you?

Image by Roxy Aln Available by CC0 licence on unsplash.

The RDS header

At this point, I have annotated every part of the RDS file that corresponds to the actual object. Consulting the section of the Infernal Manual devoted to serialisation, I learn that the six lines at the beginning of the file are known as the RDS header. Reading further I learn that the first line specifies the encoding scheme (A for ASCII, X for binary big-endian). The second line specifies which version of the RDS file format is used. The third line indicates the version of R that wrote the file. Finally, the fourth line is the minimum version of R required to read the file.

If I annotate my RDS header to reflect this knowledge, I get this:

A       # use ASCII encoding
3       # use version 3 of the RDS format
262402  # written with R version 4.1.2
197888  # minimum R version that can read it is 3.5
5
UTF-8

I am confused. Where did those numbers come from? Why does version 4.1.2 correspond to the number 262402, and why does 3.5 get encoded as 197888? The Manual is silent, and my thoughts become bleak. Am I losing my mind? Is the answer obvious??? What mess have I gotten myself into?

In desperation, I look at the R source code which reveals unto me the magic formula:

encode_r_version <- function(major, minor, patch) {
  (major * 65536) + (minor * 256) + patch
}

Yessss. This all makes sense now…

encode_r_version(4, 1, 2)
encode_r_version(3, 5, 0)

[1] 262402
[1] 197888

…so much sense.

What about the other two lines in the header? Prior to RDS version 3 – which was released in R version 3.5 – those two lines didn’t exist in the header. Those are now used to specify the “native encoding” of the file, according to the Manual.

“But isn’t that ASCII????”, whispers a voice in my head. “Is that not what the A is for?”

Not quite. The RDS file format isn’t restricted to ASCII characters. In the usual case, the RDS file can encode any UTF-8 character and the native encoding line reads UTF-8. There is another possibility though: the file may use the Latin-1 alphabet. Because of this, there is some ambiguity that needs to be resolved. The RDS file needs to indicate which character set is used for the encoding.

My annotated header now looks like this:

A      # the file uses ASCII encoding
3      # the file uses version 3 of the RDS format
262402 # the file was written in R version 4.1.2
197888 # the minimum R version that can read it is 3.5
5
UTF-8  # the file encodes UTF-8 characters not Latin-1

Okay, that makes a certain kind of sense, but what’s the story behind that 5? What does that mean? What dark secret does it hide?

It took me so very long to figure this one out. As far as I can tell this line isn’t discussed in the R Internals Manual, but I worked it out by looking at the source code for serialize. That line reads 5 because it’s telling the parser that the string that follows on the next line (i.e., UTF-8) contains five characters. Presumably if I’d used Latin-1 encoding, the corresponding line would have been 7.

This is doing my head in, but I think I’m okay?

Are you sure? Really? You don’t sound too certain

Image by Liza Polyanskaya. Available by CC0 licence on unsplash.

Logical, integer, and numeric vectors

Now that I have a sense of how the RDS header works, I’ll set header = FALSE whenever I call show_rds() from now on. That way I won’t have to look at that same six lines of output over and over and they will no longer haunt my dreams.

Oh no my dear. Hiding won’t save you.

I think the time has come to look at how RDS encodes other kinds of data. For three of the four commonly used atomic vector types (logical, integer, and numeric), the RDS format looks exactly as I expected given what I learned earlier. As shown in the table above, the SEXPTYPE code for a logical vector is 10, so a logical vector with four elements looks like this:

show_rds(
  object = c(TRUE, TRUE, FALSE, NA), 
  header = FALSE
)

TRUE values are represented by 1 in the RDS file, and FALSE values are represented by 0. Missing values are represented as NA.

For an integer vector, the output is again familiar. The SEXPTYPE here is 13, so a vector of four integer looks like this:

show_rds(
  object = c(-10L, 20L, 30L, NA),
  header = FALSE
)

Numeric vectors I’ve already seen. They have SEXPTYPE of 14, so a numeric vector of length 3 starts with 14 on the first line, 3 on the second line, and then the numbers themselves appear over the remaining three lines. However, there is a catch. There always is when dealing with real numbers. Numeric values are subject to the vagaries of floating point arithmetic when represented in memory, and the encoding is not exact. As a consequence, it is entirely possible that something like this happens:

show_rds(
  object = c(10.3, 99.9, 100),
  header = FALSE
)

14
3
10.3
99.90000000000001
100

Floating point numbers always make my head hurt. It is best not to dwell too long upon them lest my grip on sanity loosen.

Too late. Far, far too late.

Image by Hoshino Ai. Available by CC0 licence on unsplash.

Character vectors

What about character vectors?

Adorable that you think these will be safer waters in which to swim my dear. A wiser woman would turn back now and return to the shallows. Yet there you go, drifting out to sea. Fool.

Let’s create a simple character vector. According to the table above, character vectors have SEXPTYPE 16, so I’d expect that a character vector with three elements would start with 16 on the first line and 3 on the second line, which would then be followed by the contents of each cell.

And that’s… sort of true?

show_rds(
  object = c("text", "is", "strange"),
  header = FALSE
)

16
3
262153
4
text
262153
2
is
262153
7
strange

The format of this output is roughly what I was expecting, except for the fact that each string occupies three lines. For instance, these three lines correspond to the word "strange":

262153
7
strange

This puzzled me at first. Eventually, I remembered that the source code for R is written in C, and C represents strings as an array. So where R treats the word "strange" a single object with length 1, C treats it as a string array containing 7 characters. In the R source code, the object encoding a string is called a CHARSXP. So lines two and three begin to make sense:

262153
7        # the string has "length" 7
strange  # the 7 characters in the string

What about the first line? Given everything I’ve seen previously it’s pretty tempting to guess that it means something similar to the SEXPTYPE codes that we’ve seen earlier. Perhaps in the same way that numeric is SEXPTYPE 14 and logical is SEXPTYPE 10, maybe there’s some sense in which a single string has a “SEXPTYPE” of 262153? That can’t be right though. According to the R Internals Manual, a CHARSXP object has a SEXPTYPE code of 9, not 262153. I must be misunderstanding something? Why is it 262153?

Frightened by the first wave, are you? All in good time my love. The secrets of 262153 will reveal themselves soon.

Image by Tim Marshall Available by CC0 licence on unsplash.

Lists

What about lists? Lists are more complicated than atomic vectors, because they’re just containers for other data structures that can have different lengths and types. As mentioned earlier, they have SEXPTYPE 19, so a list with three elements will of course start with 19 on the first line and 3 on the second line. Here’s an example:

show_rds(
  object = list(
    c(TRUE, FALSE), 
    10.2, 
    c("strange", "thing")
  ),
  header = FALSE
)

19
3
10
2
1
0
14
1
10.2
16
2
262153
7
strange
262153
5
thing

This output makes my brain hurt, but it does make sense if I stare at it long enough. It begins with the two lines specifying that it’s a list of length three. This is then followed by the RDS representation for the logical vector c(TRUE, FALSE), the RDS representation for the numeric vector 10.2, and finally the RDS representation for the character vector c("strange", "thing").

I have started using annotations and whitespace to make it clearer:

19 # it's a list
3  # of length 3

  10  # list entry 1 is logical
   2  # of length 2
   
    1       # value is TRUE
    0       # value is FALSE
      
  14  # list entry 2 is numeric 
   1  # of length 1
   
    10.2    # value is 10.2
    
  16  # list entry 3 is character
   2  # of length 2
   
    262153  # every string starts with this
         7  # this string has 7 characters
   strange  # values are: s, t, r, a, n, g, e
   
    262153  # every string starts with this
         5  # this string has 5 characters
     thing  # values are: t, h, i, n, g

I feel so powerful! My mind is now afire with knowledge! All the secrets of RDS will be mine…

…and the madness strikes at last. Pride comes before the fall, always.

Image by Moreno Matković. Available by CC0 licence on unsplash.

Object attributes

One of the key features of R is that vectors are permitted to have arbitrary metadata: names, classes, attributes. If an R object contains metadata, that metadata must be serialised too. That has some slightly surprising effects. Let’s start with this very simple numeric object with two elements:

show_rds(object = c(100, 200), header = FALSE)

As expected it has SEXPTYPE 14 (numeric), length 2, and the values it stores are 100 and 200. Nothing out of the ordinary here. But when I add a name to the object, the output is … complicated.

show_rds(object = c(a = 100, b = 200), header = FALSE)

I … don’t know what I am looking at here. First off, I seem to be having the same problem I had with character strings. If I take the first line of this output at face value I would think that a named numeric vector has SEXPTYPE 526. That can’t be right, can it?

It isn’t. In the same way that strings don’t have a SEXPTYPE of 262153 (the actual number is 9), the 526 here is a little misleading. This is a numeric vector and like all numeric vectors it is SEXPTYPE 14. You will learn the error of your ways very soon.

Setting that mystery aside, I notice that the RDS output is similar to the output we saw when converting a list to RDS. The output contains the numeric vector first (the data), which is then followed by a list that specifies the attributes linked to that object?

Not quite. You’re so close, but it’s a pairlist, not a list. The underlying data structure is different. Don’t let it worry your mind, sweet thing. Preserve your mind for the trials still to come.

For this object, there’s only one attribute that needs to be stored, corresponding to the names associated with each element of the vector. If I annotate the output again, I get this:

526     # Numeric vector 
2       # with two values

   100     # value 1 
   200     # value 2
   
1026    # Pairlist for attributes
1       # with one pair of entries

   262153  # The attribute is called "names"
   5       # 
   names   # 
   
   16      # The attribute has two values
   2       # 
   
      262153   # First value is "a"
           1   #
           a   # 

      262153   # Second value is "b"
           1   #
           b   #

254   # end of pairlist

The 254 marking the end of the pairlist confused me for a little while, but it isn’t arbitrary. It represents a NULL value in the RDS format:

show_rds(NULL, header=FALSE)

Yes, my dear. If you look at the relevant part of the R source code, you see that there are a collection of “administrative codes” that are used to denote special values in a SEXPTYPE-like fashion. NULL is the one you’d be most likely to encounter though. Perhaps best not to travel down that road tonight though? Wait until day. You’re getting tired.

Image by Kelly Sikkema. Available by CC0 licence on unsplash.

Type/flag packing

Throughout this post, I’ve given the impression that when R serialises an object to RDS format, the first thing it writes is the SEXPTYPE of that object. Technically I wasn’t lying, but this is an oversimplificiation that hides something important. It’s time to unpack this, and to do that I’ll have to dive into the R source code…

Decoding the SEXPTYPE

After digging around in the source code I found the answer. What R actually does in that first entry is write a single integer, and packs multiple pieces of information into the bits that comprise that integer. Only the first eight bits are used to define the SEXPTYPE. Other bits are used as flags indicating other things. Earlier on, I said that a value of 526 actually corresponds to a SEXPTYPE of 14. That becomes clearer when we take a look at the binary representation of 14 and 526. The first eight bits are identical:

intToBits(14)

 [1] 00 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[26] 00 00 00 00 00 00 00

intToBits(526)

 [1] 00 01 01 01 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[26] 00 00 00 00 00 00 00

To extract the SEXPTYPE, what we want to do is ignore all the later bits. I could write a function that uses intToBits() to unpack an integer into its binary representation, then sets all the bits except the first eight to 0, and then converts back to an integer …but there’s no need. The thing I just described is a “bitwise AND” operation:

decode_sexptype <- function(x) bitwAnd(x, 255)

decode_sexptype(14)

[1] 14

decode_sexptype(526)

[1] 14

When I said that those 262153 values we encounter every time a string is serialised actually correspond to a SEXPTYPE of 9, this is exactly what I was talking about:

decode_sexptype(262153)

[1] 9

The attributes pairlist, which gave us a value of 1026 when the RDS is printed out as text?

decode_sexptype(1026)

[1] 2

Those are SEXPTYPE 2, and if we check the R internals manual again, we see that this is indeed the code for a pairlist.

I feel triumphant, but broken.

Girl, same.

Image by Aimee Vogelsang. Available by CC0 licence on unsplash.

What’s in the other bits?

I fear that my mind is lost, but in case anyone uncover these notes and read this far, I should document what I have learned about the contents of the other bits. There are a few different things in there. The two you’d most likely encounter are the object flag (bit 9) and the attributes flag (bit 10). For example, consider the data frame below:

data.frame(
  a = 1, 
  b = 2
)

  a b
1 1 2

has an integer code of 787. Data frames are just lists with additional metadata, so it’s not surprising that when we extract the SEXPTYPE we get a value of 19:

decode_sexptype(787)

[1] 19

But data frames are also more than lists. They have an explicit S3 class ("data.frame") and they have other attributes too: "names" and "row.names". If we unpack the integer code 787 into its constituent bits we see that bit 9 and bit 10 are both set to 1:

intToBits(787)

 [1] 01 01 00 00 01 00 00 00 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[26] 00 00 00 00 00 00 00

Bit 9 is the “object flag”: it specifies whether or not the R data structure has a class attribute. Bit 10 is the more general one, and is called the “attribute flag”: it specifies whether or not the object has any attributes.

Okay but what’s up with 262153?

Who is asking me all these questions anyway?

It worries me that I’m now listening to the voices in my head, but okay fine. If we unpack the integer code 262153, we see that there’s something encoded in bit 19:

intToBits(262153)

 [1] 01 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
[26] 00 00 00 00 00 00 00

I haven’t found the part of the source code that sets this bit yet, but I’m pretty sure that the role of this bit is to flag whether or not the string should be added to the global string pool. In recent versions of R that’s true for all strings, so in practice every string has an integer code of 262153 rather than 9.

Image by Pelly Benassi. Available by CC0 licence on unsplash.

Are we done yet?

Well that depends on what you mean by asking the question. If you mean “have we described everything there is to know about the RDS format and how data serialisation works in base R?” then no, we’re absolutely not done. I haven’t said anything about how R serialises functions or expressions:

expr <- quote(sum(a, b, c))
fn <- function(x) x + 1

These are both R objects and you can save them to RDS files. So of course there’s a serialisation format for those but it’s not a lot of fun. I mean, if you squint at it you can kiiiiiinnnnda see what’s going on with the expression…

show_rds(expr, header = FALSE)

…but if I do the same thing to serialise the function it gets unpleasant. This has been quite an ordeal just getting this far, and I see no need to write about the serialisation of closures. Let someone else suffer through that, because my brain is a wreck.

So no, we are not “done”. The RDS format keeps some secrets still.

But if you mean “have we reached the point where the author is losing her mind and needs to rest?” then… oh my god yes I am utterly and completely done with this subject, and wish to spend the rest of my night sobbing quietly in darkness.

Let us never speak of this again.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{navarro2021,
  author = {Navarro, Danielle},
  title = {Data Serialisation in {R}},
  date = {2021-11-15},
  url = {https://blog.djnavarro.net/serialisation-with-rds},
  langid = {en}
}

For attribution, please cite this work as:

Navarro, Danielle. 2021. “Data Serialisation in R.” November 15, 2021. https://blog.djnavarro.net/serialisation-with-rds.