Software Design by Example

A book review, sort of. Not really. Look, Greg sent me a copy and I had fun reading it, okay?
Categories: Software Design, JavaScript, R, Regular Expressions
Author: Danielle Navarro
Published: May 31, 2023

The book I’m currently reading is Software Design by Example: A Tool-Based Introduction with JavaScript by Greg Wilson. Greg was kind enough to send me a review copy a little while back, and I’ve been slowly working my way through it.

In some ways I’m not the target audience for the book: it’s a book about software engineering that uses JavaScript for the worked examples, not a book designed to teach you JavaScript. I’m not the worst JavaScript coder in the world, but I’m not the strongest either, so it’s harder work for me than maybe it would have been if JavaScript were my primary language.

[Book cover for 'Software Design by Example']

Who is the book for?

Some years ago I took a very good course that Greg ran on how to teach technical concepts,1 and one thing he emphasised in that course is the importance of designing teaching materials with specific learning personas in mind. Having a small number of representative examples in mind when you write the content is incredibly useful when teaching, so it is no surprise that – literally on page 1 – the book states explicitly what the learner personas used to write the book were.

I’m going to reproduce them in full in this blog post as a reminder to myself that this is the right way to construct learner personas.

  • Aïsha started writing VB macros for Excel in an accounting course and never looked back. After spending three years doing front-end JavaScript work she now wants to learn how to build back-end applications. This material will fill in some gaps in her programming knowledge and teach her some common design patterns.
  • Rupinder is studying computer science at college. He has learned a lot about the theory of algorithms, and while he uses Git and unit testing tools in his assignments, he doesn’t feel he understands how they work. This material will give him a better understanding of these tools and how to design new ones.
  • Yim builds mobile apps for a living but also teaches two college courses: one on full-stack web development using JavaScript and Node and another titled “Software Design”. They are happy with the former, but frustrated that so many books about the latter subject talk about it in the abstract and use examples that their students can’t relate to. This material will fill those gaps and give them starting points for a wide variety of course assignments.

They’re detailed enough to make the teacher engage with the personas during the writing process, diverse enough to help catch things the writer might not have thought of, and provide a strong indication to the learner about whether the book is written for them. In my case, for instance, it’s pretty clear from the outset that I’m likely to struggle with the level of JavaScript involved. Indeed, when I look at the list of skills that the reader is expected to have (on pages 1-2) I’m fine on most things but I suspect I have a little less practical experience with JavaScript than is ideal for a reader of this book. I’m not terrible at it, I just don’t spend enough time with it to feel fluent.

That’s okay, of course. It’s often a good experience as a learner to read content that pushes you outside your comfort zone. What matters is that you know in advance which aspects are going to be hard work for you. Thankfully, Software Design by Example does that very well at the beginning.

Rewarding the cursory reader

Despite knowing from the outset that I’m ever-so-slightly outside the target audience for the book, I’m still finding it very rewarding. To understand why, it’s worth taking a moment to look at how the book uses the glossary at the end to help readers like myself who have never received a formal education in programming… I’m basically the Aïsha persona, except with R playing the same role for me that JavaScript plays for her.

Because I don’t have a formal background and – to put it gently – don’t belong to a demographic that can easily acquire the “culture” of software engineering, I very commonly have the experience in conversation that software engineers will use terms that they simply assume that “everybody knows”, and never take the time to explain them. The actual concepts are often very simple things, and often I’ve had experience with them without knowing the names, but no-one ever tells you what the words mean and – sadly – many people in the field have a tendency to make you feel like an idiot if you ask.

With that as the real world backdrop, I’m finding the glossary to be worth its weight in gold. Here’s a little snippet from the top of page 318:

  • query string. The portion of a URL after the question mark ? that specifies extra parameters for the HTTP request as name-value pairs.
  • race condition. A situation in which a result depends on the order in which two or more concurrent operations are carried out.
  • raise (an exception). To signal that something unexpected or unusual has happened in a program by creating an exception and handing it to the error-handling system, which then tries to find a point in the program that will catch it.
  • read-eval-print-loop (REPL). An interactive program that reads a command typed in by a user, executes it, prints the result, and then waits patiently for the next command. REPLs are often used to explore new ideas, or for debugging.

I can honestly say that at no point in my life has someone explained to me what a race condition is or what a REPL actually is. Seriously. I’ve worked in tech companies and people use those terms all the time but they never explain them. Very frustrating. So when I read entries like this in the glossary I find myself going “oh, so that’s what that means… okay, yes, I did already know this… but now I know what the name for it is”. I mean, race conditions are not at all unfamiliar to me – I encounter them quite a bit – but because software engineers have a tendency to refer to “race conditions” without ever saying what the term means, I’ve sat in a lot of very confusing conversations over the years that would have made a lot more bloody sense had I known the nomenclature or been in a position to “just ask” without being made to feel stupid.

I think that’s likely to be true for a lot of self-taught programmers who never studied computer science, but instead had to learn to code in order to solve practical problems. The mere act of reading a concise definition of each thing has the effect of making my mental model more precise, and better aligned with the mental models that other people in the field adopt. It’s a helpful way to learn the culture and avoid getting caught out by the various shibboleths that are sadly pervasive in the tech industry.2

There are other examples of this sort of thing throughout the book, historical anecdotes and other tidbits that make it a little easier for an outsider to make sense of the culture of software engineering. As an example, this little passage on page 145 makes sense of something I’ve never understood:

The coordinate system for screens puts (0, 0) in the upper left corner instead of the lower left. X increases to the right as usual, but Y increases as we go down, rather than up [The book has a little picture here]. This convention is a holdover from the days of teletype terminals that printed lines on rolls of paper.

These historical asides are really valuable. It feels a little bit like one of those “Magician’s Secrets Revealed!” shows. Knowing the terminology, conventions, and history behind a thing does so much of the work in making it all feel a bit more coherent.

Anyway, let’s dive a little deeper, shall we?

A worked example

My mum always liked Delia Smith
And I drank, drank, drank just to deal with my shit
I learned to tell little white lies
When I feel inadequate almost all the time

I’d like to think I’m deep
I’d like to think I’m deep
I’d like to think I’m deep
But I just skim the pages, so I can fake my speech
    - Sprints

A little honesty when writing blog posts is important, I feel. When reading the book I did not, in fact, attempt all the exercises or work through all the code examples. I tinkered with a few of the examples, read some parts thoroughly, and skimmed other parts. That’s pretty representative of how I read technical books, really. I’ll pick a few parts that I want to understand properly and do a deep dive in those sections, and read the rest of it in a much more superficial way.

The parts that I did read fairly deeply are Chapters 7 and 8, which talk about how to parse a regular expression, and how to write the code that does pattern matching using them. Actually, if I’m super honest the part I spent the most time with is Chapter 8, which shows you how to extract tokens and a parse tree for a restricted form of regular expression syntax. For that chapter, I worked through the examples and translated (most of) the code into R. In the rest of the post I’ll show the code that I wrote, tie it back to the structure of Chapter 8 in the book, and at the end I’ll say a little about what I learned from this exercise.

The subset of regular expression syntax that the book uses for this chapter has the following rules:

Token kind   Meaning                                       Characters used
Literal      A literal character                           a, b, c, etc.
Start        Beginning of string                           ^
End          End of string                                 $
Any          Zero or more of the previous thing            *
Or           Either the thing before or the thing after    |
Group        Collection of tokens to treat as one thing    ( and )

It’s a very small subset of the regular expression grammar available for pattern matching in R, JavaScript, and pretty much every language these days, but it’s a handy one. It’s certainly rich enough to make it an interesting exercise to write a parser. Not that it’s particularly important to write my own parser: the purpose of doing so is to wrap my head around the basics of how parsers work and nothing more. As Greg helpfully reminds us on page 99, if a format is commonly known there will be good tools already, and if you find yourself wanting to roll your own parser to interpret some new file format you just invented for something… there’s a good chance you shouldn’t do it. The world has enough file formats already.

Writing the tokenizer

If you want to write a regular expression parser, or any other parser for that matter, you actually have two distinct but interrelated problems to solve. Your data takes the form of an input string that you need to process. In English text, an input string might look like "Danielle hates regular expressions" but in a regular expression you’re more likely to have something that looks like "^caa*t$". To interpret these strings – in a purely syntactic sense, not a semantic one – you need to:

  • Carve the string up into a sequence of distinct (and possibly labelled) tokens. My English expression could be divided into a sequence of four tokens corresponding to the four words: "Danielle", "hates", "regular", "expressions". This tokenization process is not uniquely defined. For instance, I might choose to acknowledge that the "-s" ending in some words is in fact a distinct unit. While this is absolutely not the post to talk about inflexional morphology in linguistics, it’s important to recognise that in some contexts you might want to tokenize my sentence as "Danielle", "hate", "-s", "regular", "expression", "-s".

  • Organise (or parse) the tokens into a structured format that explicitly acknowledges the way they relate to each other grammatically. For my English sentence this parsing might be something like this [Danielle] [[hates] [regular expressions]], where I’m using the square brackets to illustrate the organisation: "regular expressions" and "Danielle" are both noun phrases, [hates] is a verb, and [hates regular expressions] is a verb phrase.

The difference between the two is visually apparent when you try to draw them. When you tokenize the input you end up with a list of tokens:

Danielle
hates
regular
expressions
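
In R terms, that word-level carve-up is nothing fancier than a strsplit() call. A throwaway illustration, just to make the first step concrete (this isn’t part of the tokenizer I build later in the post):

# split the sentence on spaces to recover the four word tokens listed above
strsplit("Danielle hates regular expressions", split = " ")[[1]]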

After parsing this list of tokens, you end up with a tree. There’s lots of ways you could visualise this tree, but something like this is good enough for my purposes:

[
  Danielle
]
[
  [
    hates
  ]
  [
    regular
    expressions
  ]
]

I could probably take this a bit further and annotate each part of the tree the way that linguists like to do, but it’s not necessary to get the basic idea.

The key thing to realise is that these two problems aren’t independent. If you tokenize the string in a way that isn’t suited to your problem, you’re going to make life harder when you try to write the parser. As it happens, Software Design by Example gives you a very gentle example of how this happens, which I’ll get to in a moment when I try to write some code that automatically parses regular expressions like ^caa*t$.

Preliminaries

In the book, all the examples are – obviously!!! – written in JavaScript, and my code is going to be written in R. This matters somewhat since R and JavaScript are different languages with different assumptions about how you write code.3 To make my life a little easier, I’m going to define some extremely informal R classes4 that don’t actually do much. In my code, a “token” is just a list that has three fields: the kind of token referred to, the loc (location) of the token within the original string, and (optionally) a value that specifies the character (or characters) that represent the token in that string. Relatedly, the “token list” class is literally just a list:

token_class.R
token <- function(kind, loc, value = NULL) {
  structure(
    list(kind = kind, loc = loc, value = value),
    class = "token"
  )
}

token_list <- function(...) {
  structure(
    list(...),
    class = "token_list"
  )
}

print.token <- function(x, ...) {
  cat("<Token at ",  x$loc, "> ", x$kind, sep = "")
  if(!is.null(x$value)) {
    cat(":", x$value)
  }
  cat("\n")
  return(invisible(x))
}

print.token_list <- function(x, ...) {
  if(length(x) == 0) {
    cat("<Empty token list>\n")
  } else {
    for(token in x) {
      print(token)
    }
  }
  return(invisible(x))
}

The reason I decided to write this code for this post is that it also defines print() methods for these objects that make the printed output a little prettier:

token(kind = "Noun", loc = 1, value = "Danielle")
<Token at 1> Noun: Danielle
token_list(
  token(kind = "Noun", loc = 1, value = "Danielle"),
  token(kind = "Verb", loc = 10, value = "hates")
)
<Token at 1> Noun: Danielle
<Token at 10> Verb: hates

Without the print methods, you’d see each token printed as a list, and the list of tokens printed as a list of lists. It’s not super-important, but it does help me stay sane as I write. But whatever, let’s move on and start writing a tokenizer for (very restricted!) regular expressions…

Version 1

A hallmark of good teaching – in my not-entirely-humble opinion – is when the instructor designs material in a way that gently encourages the learner to discover things on their own. Chapter 8 of Software Design by Example does this rather nicely at the beginning, by having the reader start out by writing a simple tokenizer that is later revealed to be “readable, efficient, and wrong”. At the risk of revealing the instructional magic that Greg is so terribly good at, I’ll explicitly state the flawed intuition that might lead you to write a tokenizer like this.

Suppose I were to think about my tokenizer by considering a regular expression like "^(cat|dog)$". There’s a real trap you can fall into here, because this regular expression has natural language words, and you might be tempted to write a tokenizer that treats "cat" and "dog" as tokens. Or, to be slightly more precise, you might decide to create tokens that allow “Literals” to contain multiple characters, like this:

tokenize("^(cat|dog)$")
<Token at 1> Start
<Token at 2> GroupStart
<Token at 3> Literal: cat
<Token at 6> Or
<Token at 7> Literal: dog
<Token at 10> GroupEnd
<Token at 11> End

Merging a sequence of literal characters into a single multi-character literal makes the output readable, and who doesn’t love it when the tokenizer informs you that you have a “Literal cat” in your string?

The book then walks you through the process of writing a tokenize() function that behaves exactly like this. Obviously, the original is in JavaScript, but here’s an R version:

tokenizer_1.R
simple <- list(
  "*" = "Any",
  "|" = "Or",
  "(" = "GroupStart",
  ")" = "GroupEnd"
)

tokenize <- function(text) {
  result <- token_list()
  n <- 0
  for (i in 1:nchar(text)) {
    chr <- substr(text, start = i, stop = i)

    # simple cases are always added as non-literal tokens
    if (chr %in% names(simple)) {
      result[[n + 1]] <- token(simple[[chr]], i)
      n <- n + 1

    # the ^ character is non-literal if position is 1
    } else if (chr == "^" & i == 1) {
      result[[n + 1]] <- token("Start", i)
      n <- n + 1

    # the $ character is non-literal if it's the last character
    } else if (chr == "$" & i == nchar(text)) {
      result[[n + 1]] <- token("End", i)
      n <- n + 1

    # literals at the start, or following a non-literal, create a new token
    } else if (n == 0 || result[[n]]$kind != "Literal") {
      result[[n + 1]] <- token("Literal", i, value = chr)
      n <- n + 1

    # literals that follow a literal are combined with it
    } else {
      result[[n]]$value <- paste0(result[[n]]$value, chr)
    }
  }
  return(result)
}

I won’t recapitulate all the steps that go into writing code like this – read the book if you want to know things at this level of detail – but suffice it to say that if I were better at JavaScript I’d have found it very easy to follow.

The point that matters here is that the reader is being encouraged to consider what happens to you later if you tokenize the input this way. Merging a sequence of literal characters into a single multi-character literal makes the output readable, and in this specific case the token list is very convenient if I later wanted to write a regular expression matcher that builds on top of this tokenizer. Using the base R grepl() function, you can see which strings match "^(cat|dog)$" and which don’t:

grepl("^(cat|dog)$", c("cat", "dog", "dag", "ca"))
[1]  TRUE  TRUE FALSE FALSE

If my tokenizer represents the literals in ^(cat|dog)$ as the two multicharacter literals "cat" and "dog" then – in this instance – it’s going to be easier for me to write a regex matcher that mirrors the behaviour of grepl().

Very nice.

Unfortunately, once you start thinking about regular expressions more generally, there’s a big problem with this tokenizer. It’s trying to be clever, by grouping multiple literals together without considering how those literals will be used later on by the parser, and it ends up being too greedy sometimes. Consider the regular expression "^caa*t$". This is a pattern that should match against "cat" and "caaaaat" but should not match "caacaat", as illustrated below:

grepl("^caa*t$", c("cat", "caaaaat", "caacaat"))
[1]  TRUE  TRUE FALSE

However, let’s look at the tokens produced by our tokenize() function:

tokenize("^caa*t$")
<Token at 1> Start
<Token at 2> Literal: caa
<Token at 5> Any
<Token at 6> Literal: t
<Token at 7> End

Yeah, we’re in big trouble here.

That’s the wrong way to tokenize this input: in this regular expression the * operator (i.e., the Any token in our token list) needs to be applied only to the preceding character, so it’s not correct to treat "caa" as a single literal string. To write a functioning parser on top of this tokenizer would be a nightmare, because the parser would need to inspect the internal contents of the tokens: in order to parse the * character (the Any token), it would need to grab the Literal caa token and split it into two parts, a prefix "ca" and a suffix "a", because * applies only to the suffix.
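
Just to make that concrete, here’s the sort of string surgery a parser built on top of this tokenizer would be forced into – an illustrative fragment only, not code I have any intention of keeping:

# hypothetical clean-up of the "Literal caa" token that the parser would need
lit <- "caa"
prefix <- substr(lit, 1, nchar(lit) - 1)       # "ca" stays an ordinary literal
suffix <- substr(lit, nchar(lit), nchar(lit))  # "a" becomes the child of Any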

What this is telling us is that we’ve chosen a poor tokenizing scheme. The book is quite gentle in leading the reader to this conclusion, but when you’re actually writing the code yourself you can’t avoid discovering it. If your parser has to break apart your tokens to organise the input, then really you should be rethinking the tokenizer.

Version 2

Okay so that doesn’t work. The second approach considered in the chapter simplifies the code a little and produces a token list where every literal character is treated as a distinct token. The code is only a minor modification of the previous version:

tokenizer_2.R
simple <- list(
  "*" = "Any",
  "|" = "Or",
  "(" = "GroupStart",
  ")" = "GroupEnd"
)

tokenize <- function(text) {
  result <- token_list()
  n <- 0
  for (i in 1:nchar(text)) {
    chr <- substr(text, start = i, stop = i)

    # simple cases are always added as non-literal tokens
    if (chr %in% names(simple)) {
      result[[n + 1]] <- token(simple[[chr]], i)
      n <- n + 1

    # the ^ character is non-literal if position is 1
    } else if (chr == "^" & i == 1) {
      result[[n + 1]] <- token("Start", i)
      n <- n + 1

    # the $ character is non-literal if it's the last character
    } else if (chr == "$" & i == nchar(text)) {
      result[[n + 1]] <- token("End", i)
      n <- n + 1

    # literals always create a new token
    } else {
      result[[n + 1]] <- token("Literal", i, value = chr)
      n <- n + 1
    }
  }
  return(result)
}

Here’s the sort of output we get from this version of the tokenizer:

tokenize("^caa*t$")
<Token at 1> Start
<Token at 2> Literal: c
<Token at 3> Literal: a
<Token at 4> Literal: a
<Token at 5> Any
<Token at 6> Literal: t
<Token at 7> End

This output is entirely correct, as long as our goal is only to extract and label the tokens in our regular expression. The output correctly labels each literal as a literal and assigns the appropriate label to the non-literals. What is very clear, though, when you look at the output from this version of tokenize() is that it is absolutely not a parser. We’ve lost the grouping structure that we had in our original version. The tokenizer has no way of expressing the idea that "ca" is a syntactically coherent unit in the expression ^caa*t$, or that "cat" is similarly coherent within ^(cat|dog)$. It’s a good tokenizer, but a bad parser.

That’s okay: it’s not meant to be a parser. The tokenizer does its job perfectly well, and importantly it provides output that a good parser can work with. It requires a little thought, but it’s going to work out okay because the tokenizer is reliable as a tokenizer. Each tool does one job and only that job: you don’t need your tokenizer to parse the syntax, and you don’t want your parser to mess around with the internal contents of the tokens.

Gosh… I wonder if that’s one of those software engineering principles?

Post-mortem

At this point any competent software engineer is probably screaming internally, because I have not written any unit tests for my code. This is terribly bad practice, and something I would never do when writing actual software.5 Suffice it to say Software Design by Example is at great pains to emphasize the importance of unit tests, and if you were actually following the chapter step by step you’d see that Greg does in fact introduce tests as he develops this example. I’ve been lazy in this blog post because… well, it’s a blog post. It’s neither intended to be software nor a chapter in a book on software engineering.
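
For what it’s worth, here’s a rough sketch of what a couple of checks on the version 2 tokenizer might look like if I had bothered. I’m using the testthat package because that’s the natural choice in R – the book’s tests are written in JavaScript, so this is my tooling choice, not Greg’s – and the file name tokenizer_tests.R is entirely hypothetical:

tokenizer_tests.R
library(testthat)

test_that("boundary characters get their own token kinds", {
  tokens <- tokenize("^a*$")
  expect_equal(tokens[[1]]$kind, "Start")
  expect_equal(tokens[[2]]$kind, "Literal")
  expect_equal(tokens[[2]]$value, "a")
  expect_equal(tokens[[3]]$kind, "Any")
  expect_equal(tokens[[4]]$kind, "End")
})

test_that("^ and $ are plain literals away from the string boundaries", {
  tokens <- tokenize("a^b$c")
  kinds <- vapply(tokens, function(t) t$kind, character(1))
  expect_true(all(kinds == "Literal"))
})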

Anyway… let’s return to the development of ideas in the chapter, yes?

Parsing the tokens

The second half of Chapter 8 in Software Design by Example focuses on the parser, and again I’m not going to try to recapitulate all the logic that the book walks you through. It takes the reader through an intuitive process of thinking about how you want to write the parser, but the key thing I want to highlight is that when you’re reading the book you get a strong sense of why you want to write the parser in two parts: there’s a “forward pass” where the parser sweeps through the token list from first to last, constructing the parts of the parse tree that it can handle on the basis of what it has seen so far, and then a subsequent clean-up phase where it sweeps back and fixes all the incomplete parts. When I came to implement it in R myself I made some small departures from the way Greg has done it in the book, but the essence is the same. What I’ll do here is present two versions of the parser, one that only does the forward pass (so you can see all the missing bits), and then a second version that does the clean-up afterwards.

Preliminaries

As before, I’ll do some preliminary work that isn’t really essential for the purposes of the book, but I find helpful for writing this blog post. Specifically, I’ll define a subtree() function that provides a “subtree” class. All it does is capture the fundamental structure of a tree: each node is defined by a parent element, and that parent can have zero, one, or more children. This isn’t really necessary for our parser, but it does allow me to define a print() method that makes the parse trees in this post look a little prettier:

parse_class.R
subtree <- function(parent, children = list()) {
  structure(
    list(parent = parent, children = children),
    class = "subtree"
  )
}

print.subtree <- function(x, ...) {
  if(length(x) == 0) {
    cat("<Empty parse_tree>\n")
  } else {
    print(x$parent)
    if(length(x$children) > 0) {
      for(child in x$children) {
        out <- capture.output(print(child))
        out <- paste("    ", out)
        cat(out, sep = "\n")
      }
    }
  }
  return(invisible(x))
}

Whatevs. That’s not really the point of the book, and frankly if I were writing a book on software design6 I would probably make the same choice that Greg has made: for the blog post I want pretty output because it’s supposed to be an easy read. For a book that is intended to help you think about software design? For that you actually want the reader to engage deeply with the “list of lists of lists of…” data structure that my print methods are glossing over.
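
That said, if you do want to peek behind the pretty printing, str() will happily show you the raw nesting. Here’s a quick look at a hand-built subtree, using the classes defined earlier in the post:

str(subtree(
  parent = token("Any", loc = 5),
  children = list(subtree(token("Literal", loc = 4, value = "a")))
))

What comes back is exactly the “list of lists of lists of…” structure I was just talking about, which is rather the point.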

Version 1

With that out of the way, let’s have a look at the code for a parser that only does the forward sweep. The key insight that the book walks you through is that there are three distinct types of action the parser takes at this step:

  • There are kinds of token (Literal, Start, End, and GroupStart) that the parser can process simply by appending the token to the bottom of the tree the moment it encounters them.
  • There are other kinds of token (GroupEnd and Any) that require the parser to restructure the tree, but they rely only on the tokens seen so far, so the parser can reorganise the tree on the fly during this first pass.
  • There is one annoying token (Or) that depends both on things the parser has already seen and on things that haven’t been observed yet as the parser sweeps forward. What we do here is a partial organisation: we process the bits we know about (e.g., the a on the left hand side of an a|b statement is known even when we’ve only read a| so far) but then leave a “Missing” placeholder token to express the fact that we know that the Or operator | has two children: we know a, but b will need to be filled in later.

Here’s the R code, which supplies an update_tree() function that sequentially modifies the parse tree whenever new tokens arrive:7

parser_1.R
list_reverse <- function(x) {
  x[length(x):1]
}

update_tree <- function(tree, token) {

  # For some kinds of token, we simply append them to the tree
  if(token$kind %in% c("Literal", "Start", "End", "GroupStart")) {
    tree[[length(tree) + 1]] <- subtree(token)
  }

  # When GroupEnd is encountered, find the most recent GroupStart and
  # make the tokens between them the children of a Group
  if (token$kind == "GroupEnd") {
    children <- list()
    while(TRUE) {
      last <- tree[[length(tree)]]
      tree <- tree[-length(tree)]
      if(last$parent$kind == "GroupStart") {
        break
      }
      children[[length(children) + 1]] <- last
    }
    tree[[length(tree) + 1]] <- subtree(
      parent = token("Group", last$parent$loc),
      children = list_reverse(children)
    )
  }

  # When Any is encountered, make the preceding token (or subtree)
  # the child of the Any token
  if (token$kind == "Any") {
    last <- tree[[length(tree)]]
    tree[[length(tree)]] <- subtree(
      parent = token,
      children = list(last)
    )
  }

  # When Or is encountered, create a subtree with two children. The
  # first (or left) child is taken by moving it from the previous
  # token/subtree in our list. The second child is tagged as "Missing"
  # and will be filled in later
  if (token$kind == "Or") {
    last <- tree[[length(tree)]]
    tree[[length(tree)]] <- subtree(
      parent = token,
      children = list(last, subtree(token(kind = "Missing", loc = 0)))
    )
  }

  return(tree)
}

parse <- function(text) {
  tree <- list()
  tokens <- tokenize(text)
  for(token in tokens) {
    tree <- update_tree(tree, token)
  }
  class(tree) <- "token_list" # allows pretty printing
  return(tree)
}

For expressions like "^caa*t$" that don’t have a |, this version of the parser constructs a completed parse tree:

parse("^caa*t$")
<Token at 1> Start
<Token at 2> Literal: c
<Token at 3> Literal: a
<Token at 5> Any
     <Token at 4> Literal: a
<Token at 6> Literal: t
<Token at 7> End

Notice that the ordering of the tokens has changed: to represent the subexpression a*, the parser creates an Any token corresponding to the * character and ensures that the Literal a token is a child of the Any operator.

A similar thing happens where GroupStart and GroupEnd tokens are collapsed into a single Group token that has all tokens within the group as children:

parse("(na)* hey yeah goodbye")
<Token at 5> Any
     <Token at 1> Group
          <Token at 2> Literal: n
          <Token at 3> Literal: a
<Token at 6> Literal:  
<Token at 7> Literal: h
<Token at 8> Literal: e
<Token at 9> Literal: y
<Token at 10> Literal:  
<Token at 11> Literal: y
<Token at 12> Literal: e
<Token at 13> Literal: a
<Token at 14> Literal: h
<Token at 15> Literal:  
<Token at 16> Literal: g
<Token at 17> Literal: o
<Token at 18> Literal: o
<Token at 19> Literal: d
<Token at 20> Literal: b
<Token at 21> Literal: y
<Token at 22> Literal: e

For expressions that contain an either/or operation, we end up with a parse tree that contains one or more Missing tokens. That’s my way of expressing the fact that the parser needs to come back and clean up afterwards:

parse("(cat)|(dog)")
<Token at 6> Or
     <Token at 1> Group
          <Token at 2> Literal: c
          <Token at 3> Literal: a
          <Token at 4> Literal: t
     <Token at 0> Missing
<Token at 7> Group
     <Token at 8> Literal: d
     <Token at 9> Literal: o
     <Token at 10> Literal: g

Notice that there are some kinds of regular expressions for which this clean-up might require us to dive deep into the tree to fix the incomplete parts:

parse("ab|((cd*)|ef)|g")
<Token at 1> Literal: a
<Token at 3> Or
     <Token at 2> Literal: b
     <Token at 0> Missing
<Token at 14> Or
     <Token at 4> Group
          <Token at 10> Or
               <Token at 5> Group
                    <Token at 6> Literal: c
                    <Token at 8> Any
                         <Token at 7> Literal: d
               <Token at 0> Missing
          <Token at 11> Literal: e
          <Token at 12> Literal: f
     <Token at 0> Missing
<Token at 15> Literal: g

Again, there’s a kind of principle here: each part of the tool has its own job to do. The forward pass does the parts that it can do, and delegates the unfinished work to the clean-up process. When reading the book, and especially when implementing it yourself, the learner is invited to think about the importance of carving up the software into sensible parts.

Seems like a thing worth knowing.

Version 2

At this point in the post I’m guessing that the reader is about ready to see the end product. The final version of the code adds a compress_tree() function that is called once the forward pass is complete. For simple trees all it really does is sweep up from bottom to top, but because there are regular expressions in which you can have Or tokens nested quite a long way into the parse tree it also sweeps up through the subtrees when they are encountered:

parser_2.R
list_reverse <- function(x) {
  x[length(x):1]
}

update_tree <- function(tree, token) {

  # For some kinds of token, we simply append them to the tree
  if(token$kind %in% c("Literal", "Start", "End", "GroupStart")) {
    tree[[length(tree) + 1]] <- subtree(token)
  }

  # When GroupEnd is encountered, find the most recent GroupStart and
  # make the tokens between them the children of a Group
  if (token$kind == "GroupEnd") {
    children <- list()
    while(TRUE) {
      last <- tree[[length(tree)]]
      tree <- tree[-length(tree)]
      if(last$parent$kind == "GroupStart") {
        break
      }
      children[[length(children) + 1]] <- last
    }
    tree[[length(tree) + 1]] <- subtree(
      parent = token("Group", last$parent$loc),
      children = list_reverse(children)
    )
  }

  # When Any is encountered, make the preceding token (or subtree)
  # the child of the Any token
  if (token$kind == "Any") {
    last <- tree[[length(tree)]]
    tree[[length(tree)]] <- subtree(
      parent = token,
      children = list(last)
    )
  }

  # When Or is encountered, create a subtree with two children. The
  # first (or left) child is taken by moving it from the previous
  # token/subtree in our list. The second child is tagged as "Missing"
  # and will be filled in later
  if (token$kind == "Or") {
    last <- tree[[length(tree)]]
    tree[[length(tree)]] <- subtree(
      parent = token,
      children = list(last, subtree(token(kind = "Missing", loc = 0)))
    )
  }

  return(tree)
}

has_children <- function(x) {
  !is.null(x$children) && length(x$children) > 0
}

compress_or_skip <- function(tree, location) {
  if (has_children(tree[[location - 1]])) {
    n <- length(tree[[location - 1]]$children)
    if (tree[[location - 1]]$children[[n]]$parent$kind == "Missing") {
      tree[[location - 1]]$children[[n]] <- tree[[location]]
      tree[[location]] <- NULL
    }
  }
  return(tree)
}

compress_tree <- function(tree) {
  if (length(tree) <= 1) {
    return(tree)
  }

  # Compress branches of the top-level tree
  loc <- length(tree)
  while (loc > 1) {
    tree <- compress_or_skip(tree, loc)
    loc <- loc - 1
  }

  # Recursively compress branches of children subtrees
  for (loc in 1:length(tree)) {
    tree[[loc]]$children <- compress_tree(tree[[loc]]$children)
  }

  return(tree)
}

parse <- function(text) {
  tree <- list()
  tokens <- tokenize(text)
  for(token in tokens) {
    tree <- update_tree(tree, token)
  }
  tree <- compress_tree(tree)
  class(tree) <- "token_list" # allows pretty printing
  return(tree)
}

For expressions that don’t have an Or token, the output is the same as before:

parse("^caa*t$")
<Token at 1> Start
<Token at 2> Literal: c
<Token at 3> Literal: a
<Token at 5> Any
     <Token at 4> Literal: a
<Token at 6> Literal: t
<Token at 7> End

However, what we see in this version is that the parse tree for expressions like "(cat)|(dog)" now places the right hand side of the either/or expression (i.e., the (dog) group) in the appropriate place:

parse("(cat)|(dog)")
<Token at 6> Or
     <Token at 1> Group
          <Token at 2> Literal: c
          <Token at 3> Literal: a
          <Token at 4> Literal: t
     <Token at 7> Group
          <Token at 8> Literal: d
          <Token at 9> Literal: o
          <Token at 10> Literal: g

No more of those “Missing” tokens. And because the compress_tree() function is recursively applied to the subtrees, it can also handle uglier expressions like "ab|((cd*)|ef)|g":

parse("ab|((cd*)|ef)|g")
<Token at 1> Literal: a
<Token at 3> Or
     <Token at 2> Literal: b
     <Token at 14> Or
          <Token at 4> Group
               <Token at 10> Or
                    <Token at 5> Group
                         <Token at 6> Literal: c
                         <Token at 8> Any
                              <Token at 7> Literal: d
                    <Token at 11> Literal: e
               <Token at 12> Literal: f
          <Token at 15> Literal: g

Neat. I mean… the thing itself is ugly as sin, but the fact that an amateur like myself can spend a day or two working through the relevant chapters in the book and write code that can handle parse trees like these ones… yeah, that’s very neat indeed.

Final thoughts?

Back when I was an academic, as opposed to whatever the hell it is I am in my middle aged unemployment era, there was this whole tradition of writing “critical” book reviews that would ostensibly count as scholarly output. This led to this weird thing where you’d read a book and enjoy it, but you’d feel this obligation to write a Proper Academic Review, and like most things in academia that have Prestige Points associated with them, these reviews would be extremely thorough, mind-meltingly boring, and pointlessly cruel. Because that’s what academia is designed to encourage.

Yeah nah. I’m too old for that shit, and I gave up tenure a lifetime8 ago.

Having read most of the book at a superficial level and done a deep dive into the parts that I felt like diving into, I have to say I rather like Software Design by Example. It’s not written the way I would write a book: anyone who knows me understands that I will never choose to write a 200 page book when I could write a 600 page book instead. Greg Wilson doesn’t write like me: the book is brief, it covers a lot of topics concisely, and yet still manages to convey a lot of the “folk knowledge” and other cultural bits and pieces that actually matter in the wild. Honestly that’s an awfully impressive achievement. I quite enjoyed it.

Totally worth the time I spent reading it. Would recommend to others.

Footnotes

  1. Something that you’d think I’d have been taught back when I was an academic and had to do it for a living. Universities, however, are utterly useless at this kind of thing. They tend to throw professors in the deep end with this expectation that someone who has made a career as a good researcher will automatically work out how to be a good teacher. Suffice it to say, this widespread practice has not been the best thing for higher education.↩︎

  2. As an aside, this is hardly unique to the tech world. Academia is just as bad. Probably worse, actually.↩︎

  3. Example: Greg’s code relies heavily on JavaScript .push() and .pop() methods that make it very easy to treat an array as a stack. I’m going to work with R lists that don’t have these little niceties. It’s going to make some of my code later a little ugly, I’m afraid. Oh well.↩︎

  4. This is not the place for me to talk about my love/hate relationship with the S3 object-oriented programming system in R. Suffice it to say, I have feelings. That’s enough for now.↩︎

  5. Okay, I’ll be honest, I did have a set of informal tests that I wrote to accompany my code, even for this blog post: I was too lazy to bother writing a proper test suite, but the initial drafts of this blog post had a bunch of regular expressions that would be tokenized/parsed every time I rendered the draft, which let me check that the code I was writing was not “total batshit”. I do have some standards, at least when it comes to code. The same cannot be said of my taste in men, but that’s a different story…↩︎

  6. Yeah no that will never happen. If you’re waiting for me to attempt something like what Greg has done, then – to quote Paul Kelly and Kev Carmody – you don’t stand the chance of a cinder in snow. I know my limits.↩︎

  7. This is analogous to the handle() JavaScript function that appears in the book. What can I say? I’m a perverse woman who likes to use her own names for things.↩︎

  8. Two years, but that’s a lifetime in tranny-years. We age fast, you know.↩︎

Citation

BibTeX citation:
@online{navarro2023,
  author = {Navarro, Danielle},
  title = {Software Design by Example},
  date = {2023-05-31},
  url = {https://blog.djnavarro.net/posts/2023-05-31_software-design-by-example},
  langid = {en}
}
For attribution, please cite this work as:
Navarro, Danielle. 2023. “Software Design by Example.” May 31, 2023. https://blog.djnavarro.net/posts/2023-05-31_software-design-by-example.