Plotting data in Julia

Part three of this three-part series of Julia posts, in which I try to figure out how to draw some nice plots
Categories: Julia, Data Visualisation
Author: Danielle Navarro
Published: March 3, 2024

And so we come to the third post in my extremely ill-considered foray into learning Julia. In the first part of the series I taught myself some of the foundations for writing Julia code, in the second part I discussed data wrangling in Julia, and so in the third and – dear god in heaven please – last of them, I’ll take a look at data visualisation using the Plots package, and using PalmerPenguins as my data set.

It will be brief this time, right? Right????

Cover of 'Sirius' by Olaf Stapledon

Sirius, by Olaf Stapledon. See, it’s a Penguin science fiction book so it keeps to the penguins theme of this post and the sci-fi theme of the series. I’m soooooo clever

Penguins data

It is hardly a deep insight to say this, but if you’re going to play around with data visualisation tools it does help somewhat to have some data that you can plot. In the last post I pretty much exhausted my ability to look at the Star Wars data set that comes bundled in with the dplyr R package, so I’ve picked something different this time. For no particular reason I decided to go with the Palmer Penguins data set that also exists in R, and has a Julia package in PalmerPenguins. However, due to a painful little episode part way through writing all these posts I mistakenly decided that the PalmerPenguins package was causing me problems in the context of this quarto blog post (it wasn’t… the actual problem was that I had my Julia environment configured incorrectly), so I ended up writing a copy of the data to a CSV file and using that instead. Sigh.

Okay, let’s start by loading the packages I’ll need for this post:

using CSV
using DataFrames
using Plots
using StatsPlots

Amazing. Thrilling. Et cetera. Now that we’re all overwhelmed by the feeling of rising anticipation, let’s load the data set from the CSV file:

penguins = DataFrame(CSV.File("penguins.csv"; missingstring = "NA")) 
344×8 DataFrame
Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex       year
     String15   String15   Float64?        Float64?       Int64?             Int64?       String7?  Int64
1    Adelie     Torgersen  39.1            18.7           181                3750         male      2007
2    Adelie     Torgersen  39.5            17.4           186                3800         female    2007
3    Adelie     Torgersen  40.3            18.0           195                3250         female    2007
4    Adelie     Torgersen  missing         missing        missing            missing      missing   2007
5    Adelie     Torgersen  36.7            19.3           193                3450         female    2007
6    Adelie     Torgersen  39.3            20.6           190                3650         male      2007
7    Adelie     Torgersen  38.9            17.8           181                3625         female    2007
8    Adelie     Torgersen  39.2            19.6           195                4675         male      2007
9    Adelie     Torgersen  34.1            18.1           193                3475         missing   2007
10   Adelie     Torgersen  42.0            20.2           190                4250         missing   2007
11   Adelie     Torgersen  37.8            17.1           186                3300         missing   2007
12   Adelie     Torgersen  37.8            17.3           180                3700         missing   2007
13   Adelie     Torgersen  41.1            17.6           182                3200         female    2007
⋮
333  Chinstrap  Dream      45.2            16.6           191                3250         female    2009
334  Chinstrap  Dream      49.3            19.9           203                4050         male      2009
335  Chinstrap  Dream      50.2            18.8           202                3800         male      2009
336  Chinstrap  Dream      45.6            19.4           194                3525         female    2009
337  Chinstrap  Dream      51.9            19.5           206                3950         male      2009
338  Chinstrap  Dream      46.8            16.5           189                3650         female    2009
339  Chinstrap  Dream      45.7            17.0           195                3650         female    2009
340  Chinstrap  Dream      55.8            19.8           207                4000         male      2009
341  Chinstrap  Dream      43.5            18.1           202                3400         female    2009
342  Chinstrap  Dream      49.6            18.2           193                3775         male      2009
343  Chinstrap  Dream      50.8            19.0           210                4100         male      2009
344  Chinstrap  Dream      50.2            18.7           198                3775         female    2009
319 rows omitted

Indeed, that is a lot of penguins. Precisely what I was looking for. I have some data, now I can start drawing some plots.
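A small aside on the missingstring argument in that call, since it is doing real work. The CSV file was written out from R, so missing values appear as the literal text NA. Here is a minimal sketch of what the argument changes, using a tiny made-up CSV held in a string (toy_csv and its three rows are invented for illustration and are not the real penguins file):

```julia
using CSV
using DataFrames

# A made-up three-row CSV standing in for penguins.csv (hypothetical data)
toy_csv = """
species,bill_length_mm
Adelie,39.1
Adelie,NA
Gentoo,46.1
"""

# With missingstring = "NA", the NA cell is read as `missing`
toy = DataFrame(CSV.File(IOBuffer(toy_csv); missingstring = "NA"))

# Without it, NA is just text, so the column can no longer be numeric
toy_raw = DataFrame(CSV.File(IOBuffer(toy_csv)))

println(eltype(toy.bill_length_mm))
println(eltype(toy_raw.bill_length_mm))
```

In the first version the column stays numeric with room for missing values, which matches the Float64? column type shown for the real data. In the second version the whole column falls back to a string type, which is not much use for plotting.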

Cover of 'The Handmaid's Tale'

The Handmaid’s Tale by Margaret Atwood, because obviously what every queer person and woman wants to be thinking about in 2024 is what’s probably going to happen to us in the near future

Scatter plots

As you might imagine, there’s more than one way you can go about constructing a data visualisation in Julia. For example, there’s the AlgebraOfGraphics system that appears to share some design philosophy with the ggplot2 package in R. There’s also Compose, which aims to be a modernised version of the grid package in R, and the Gadfly data visualisation system built on top of it. Any of these might have been better choices for me to explore in the first instance, but for whatever reason[1] I chose instead to look at the Plots package and its extension package StatsPlots.

The impression I get from playing around with Plots/StatsPlots is that its design has more in common with the base graphics system in R than with ggplot2. There’s a generic function plot() that you can use to construct plots, and lots of more specific functions that are essentially wrappers to plot(). For example, I’m going to start by drawing a scatter plot. I could use the wrapper function scatter() for this purpose, but you don’t actually need to do this because it’s effectively the same thing as calling plot() with different defaults. For learning purposes I find it helpful to understand what the different arguments to plot() actually do, so I’ll avoid using the convenience functions here.

Okay, so here’s a simple scatter plot that plots the bill length of each penguin against the bill depth:

plot(
  penguins.bill_length_mm, 
  penguins.bill_depth_mm, 
  seriestype=:scatter,
  size=(500,500)
)

In this code chunk, the first argument specifies the variable to be plotted on the x-axis and the second specifies the variable to be plotted on the y-axis. No surprises there. The key thing to note is that in the later arguments I’ve specified two plot attributes. The seriestype attribute is what gives me a scatter plot (each observation is plotted with a single discrete plot marker), whereas the size attribute is used to control the size of the image produced in the output (in pixels).

Cool. Well, I can definitely draw something. That feels like a win.
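As a quick sanity check on the earlier claim that scatter() is effectively plot() with a different default, here is a small sketch that builds the same plot both ways and inspects the seriestype attribute stored on the resulting series. The x and y values are made up for illustration, and I am assuming the plot[subplot][series][attribute] indexing that Plots provides for looking up attributes:

```julia
using Plots

# Made-up numbers purely for illustration (not the penguins data)
x = [39.1, 39.5, 40.3, 36.7]
y = [18.7, 17.4, 18.0, 19.3]

p1 = plot(x, y, seriestype = :scatter)  # generic function plus an attribute
p2 = scatter(x, y)                      # convenience wrapper

# Look up the series type of the first series in the first subplot
println(p1[1][1][:seriestype])
println(p2[1][1][:seriestype])
```

Both plots should report the same series type, which is the sense in which the wrapper is just plot() with the seriestype filled in for you.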

Using the @df macro

One thing that immediately irritates me about the code I wrote in the last section is that I was passing the raw vectors penguins.bill_length_mm and penguins.bill_depth_mm to the plotting function. When you’re working with a data frame that always feels clunky to me. What you really want to do is just use the column names :bill_length_mm and :bill_depth_mm. On the surface though that seems a little tricky to do, because the plot() function doesn’t necessarily need to be given data that are contained within a data frame.

This is where the @df macro comes in handy. Somehow, I have made it three posts into Julia and I have not yet actually had to use a macro for anything, but now is the time. I’ll talk about macros in just a moment, but for now let’s simply note that I can prefix my call to plot() with a magic bit of syntactic sugar, and now all of a sudden I can simply pass column names and everything works:

@df penguins plot(
  :bill_length_mm,
  :bill_depth_mm,
  seriestype=:scatter,
  group=:species,
  size=(500,500)
)

I’ve made a nice little change in this version of the plot too. By mapping the group attribute onto the :species column in the penguins data set, I’ve ended up with a plot that shows each of the three penguin species in a different colour and gives me a nice little explanatory legend.

I can take this line of thinking a little further and modify other relevant attributes:

@df penguins plot(
  :bill_length_mm,
  :bill_depth_mm,
  seriestype=:scatter,
  group=:species,
  title="Palmer Penguins",
  xlabel="Bill Length (mm)",
  ylabel="Bill Depth (mm)",
  size=(500,500)
)

Now I have a plot that has a mildly informative title, and axis labels that are nicer to read. Definitely making some progress now.
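It is worth noting that the group attribute belongs to plot() itself and does not depend on @df. A minimal sketch with plain vectors, using made-up values rather than the real penguins data, suggests that grouping simply splits the data into one series per distinct group value:

```julia
using Plots

# Made-up values for illustration only
bill_length = [39.1, 39.5, 46.1, 50.0, 49.3, 51.3]
bill_depth  = [18.7, 17.4, 13.2, 16.3, 19.9, 19.2]
species     = ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"]

p = plot(
  bill_length,
  bill_depth,
  seriestype = :scatter,
  group = species
)

# One series per distinct value of the grouping variable
println(length(p.series_list))
```

Each of those series gets its own colour and legend entry, which is where the three-colour plot and its legend come from.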

Cover of 'The Day of the Triffids'

The Day of the Triffids by John Wyndham. Well, it’s more optimistic than The Handmaid’s Tale I guess

Macros

Siiiiiiigh. Okay, I used a macro. I suppose I’d better take a quick look at how those things work, right? Macros are part of the metaprogramming toolkit in Julia, and can be used to manipulate Julia expressions passed by the user, making it possible to execute code that wouldn’t otherwise be valid Julia. In essence we’re talking about something qualitatively similar to “non-standard evaluation” in R.

Based on that description, you can kind of see what the @df macro is doing in the example plots from the last section. It’s taking a call to plot() that specifies only the column names that are associated with a particular vector of observations, and replacing them with the actual data values stored in the relevant data frame when the code is evaluated.

You can see this in action here:

@df penguins println(:bill_length_mm[1:3])
Union{Missing, Float64}[39.1, 39.5, 40.3]

In this code, what @df is doing is transforming :bill_length_mm[1:3] (which really shouldn’t work at all) into penguins.bill_length_mm[1:3]. So as a consequence, the code that actually gets executed here is something like println(penguins.bill_length_mm[1:3]), and you get sensible output.

Well, sort of. I haven’t quite looked into this in a lot of detail yet, and I’m not quite at the point where I’m really prepared to start writing macros of my own, but it does look like @df is slightly more sophisticated, which you can see by using the @macroexpand macro to see what actually gets executed in my little example above:

@macroexpand @df penguins println(:bill_length_mm[1:3])
:(((var"##297"->begin
          ((var"##bill_length_mm#298",), var"##299") = (StatsPlots).extract_columns_and_names(var"##297", :bill_length_mm)
          (StatsPlots).add_label(["(bill_length_mm)[13]"], println, var"##bill_length_mm#298"[1:3])
      end))(penguins))

Okay yeah, there’s a little bit more going on than meets the eye (presumably in part because @df is designed to work in the context of the Plots package), but the basic idea makes sense at least. I’m happy to move on and defer a proper dive into macros for a future occasion.
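To make the basic idea concrete without touching StatsPlots at all, here is a toy macro that does the naive version of the substitution: it walks the expression it is given and swaps every quoted symbol for a field lookup on a table. This is only a sketch of the general technique and not how @df is actually implemented. The names @with_columns, replace_symbols, and toy are all hypothetical, and the data are made up:

```julia
# Hypothetical helper: walk an expression and replace every quoted symbol
# (e.g. :a) with a field lookup on `tbl` (e.g. getproperty(tbl, :a)).
replace_symbols(ex, tbl) = ex   # numbers, plain names, etc. pass through unchanged
replace_symbols(ex::QuoteNode, tbl) =
  ex.value isa Symbol ? Expr(:call, :getproperty, tbl, ex) : ex
replace_symbols(ex::Expr, tbl) =
  Expr(ex.head, map(arg -> replace_symbols(arg, tbl), ex.args)...)

# Hypothetical macro: rewrite `body` so quoted symbols refer to columns of `tbl`
macro with_columns(tbl, body)
  esc(replace_symbols(body, tbl))
end

# A made-up "table": a named tuple of vectors standing in for a data frame
toy = (bill_length_mm = [39.1, 39.5, 40.3, 36.7], year = [2007, 2007, 2007, 2007])

# Same shape as the @df example: the symbol is indexed as if it were the column
@with_columns toy println(:bill_length_mm[1:3])

first_three = @with_columns(toy, :bill_length_mm[1:3])
longest = @with_columns toy maximum(:bill_length_mm)
```

One obvious limitation of the toy version is that it rewrites every quoted symbol it finds, so an attribute value like seriestype=:scatter would be mangled too. Presumably that is part of why the real @df does something more careful with column extraction.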

Cover of 'Earthsea: The First Four Books'

The Earthsea Quartet by Ursula Le Guin. Much better!

Violin plots

At this point I am getting exhausted. I have written far more on Julia than I intended to, so I’m going to do one more plot and call it a day. In my previous examples I used the seriestype argument to plot() to ensure that what I got out at the end was a scatter plot. I could alternatively set seriestype=:violin to get a violin plot.

Here’s an example, just to prove to myself that I understand plot() well enough to create various different kinds of data visualisations. In the code below I’ll first define a bill_lengths data frame that contains only the columns I need and – importantly – removes the missing values (because the violin series can’t handle missing data). Then I’ll use plot() to create a violin plot:

bill_lengths = penguins |> 
  d -> subset(d, :bill_length_mm => b -> .!ismissing.(b)) |>
  d -> select(d, [:species, :bill_length_mm])

@df bill_lengths plot(
  string.(:species),
  :bill_length_mm,
  seriestype=:violin,
  legend=false,
  xlabel="Species",
  ylabel="Bill Length (mm)",
  size=(500,500)
)

Yes, that’s what I wanted. Good enough.
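For what it’s worth, the DataFrames package also has a dropmissing() function that could do the row filtering in a single step. Here is a minimal sketch of that alternative on a small made-up data frame (toy and its values are hypothetical, not the real penguins data):

```julia
using DataFrames

# Made-up rows for illustration (hypothetical values)
toy = DataFrame(
  species = ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
  bill_length_mm = [39.1, missing, 46.1, 49.3],
  year = [2007, 2007, 2008, 2009]
)

# Keep two columns, then drop rows where bill_length_mm is missing
toy_lengths = dropmissing(select(toy, [:species, :bill_length_mm]), :bill_length_mm)

println(nrow(toy_lengths))
println(eltype(toy_lengths.bill_length_mm))
```

One difference from the subset() approach is that dropmissing() also narrows the column type by default, so the column comes out as plain Float64 instead of a type that still allows missing values.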

Cover of 'Science Fiction Omnibus'

A science fiction omnibus edited by Brian Aldiss. Yes I was getting lazy at this point

Wrap up

This has been a long series of posts, written all in one go. I was originally planning to write only the one blog post: I mean, all I wanted to do here was teach myself a tiny bit of Julia and scribble down a few notes. But learning a new language always involves introducing yourself to a lot of new concepts, and so the post got very long and needed to be broken down into parts.

Besides, I’ve really enjoyed learning the basics of Julia. It feels surprisingly comfortable to me, capturing a lot of what I really love about R and also reminding me of the bits about Matlab that I didn’t hate. But it’s also designed for performance in a way that both R and Matlab sometimes struggle with (e.g., in R the solution to “how do I make it fast?” is so often “rewrite the slow bit in C++”), so I guess I can see why a lot of people I admire have a lot of positive things to say about Julia.

So yeah. I’m exhausted. I’ve written too much. But I’m happy nevertheless.

Footnotes

  1. Honestly, at this point I don’t even know why I’m making the choices I’m making in this increasingly-unhinged series of posts

Citation

BibTeX citation:
@online{navarro2024,
  author = {Navarro, Danielle},
  title = {Plotting Data in {Julia}},
  date = {2024-03-03},
  url = {https://blog.djnavarro.net/posts/2024-03-03_julia-plots},
  langid = {en}
}
For attribution, please cite this work as:
Navarro, Danielle. 2024. “Plotting Data in Julia.” March 3, 2024. https://blog.djnavarro.net/posts/2024-03-03_julia-plots.