Plotting data in Julia – Notes from a data witch

And so we come to the third post in my extremely ill-considered foray into learning Julia. In the first part of the series I taught myself some of the foundations for writing Julia code, in the second part I discussed data wrangling in Julia, and so in the third and – dear god in heaven please – last of them, I’ll take a look at data visualisation using the Plots package, and using PalmerPenguins as my data set.

It will be brief this time, right? Right????

Cover of 'Sirius' by Olaf Stapledon — Sirius, by Olaf Stapledon. See, it’s a Penguin science fiction book so it keeps to the penguins theme of this post and the sci-fi theme of the series. I’m soooooo clever

Penguins data

It is hardly a deep insight to say this, but if you’re going to play around with data visualisation tools it does help somewhat to have some data that you can plot. In the last post I pretty much exhausted my ability to look at the Star Wars data set that comes bundled in with the dplyr R package, so I’ve picked something different this time. For no particular reason I decided to go with the Palmer Penguins data set that also exists in R, and has a Julia package in PalmerPenguins. However, due to a painful little episode part way through writing all these posts I mistakenly decided that the PalmerPenguins package was causing me problems in the context of this quarto blog post (it wasn’t… the actual problem was that I had my Julia environment configured incorrectly), so I ended up writing a copy of the data to a CSV file and used that instead. Sigh.

Okay, let’s start by loading the packages I’ll need for this post:

using CSV
using DataFrames
using Plots
using StatsPlots

Amazing. Thrilling. Et cetera. Now that we’re all overwhelmed by the feeling of rising anticipation, let’s load the data set from the CSV file:

penguins = DataFrame(CSV.File("penguins.csv"; missingstring = "NA"))

344×8 DataFrame

319 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007
2	Adelie	Torgersen	39.5	17.4	186	3800	female	2007
3	Adelie	Torgersen	40.3	18.0	195	3250	female	2007
4	Adelie	Torgersen	missing	missing	missing	missing	missing	2007
5	Adelie	Torgersen	36.7	19.3	193	3450	female	2007
6	Adelie	Torgersen	39.3	20.6	190	3650	male	2007
7	Adelie	Torgersen	38.9	17.8	181	3625	female	2007
8	Adelie	Torgersen	39.2	19.6	195	4675	male	2007
9	Adelie	Torgersen	34.1	18.1	193	3475	missing	2007
10	Adelie	Torgersen	42.0	20.2	190	4250	missing	2007
11	Adelie	Torgersen	37.8	17.1	186	3300	missing	2007
12	Adelie	Torgersen	37.8	17.3	180	3700	missing	2007
13	Adelie	Torgersen	41.1	17.6	182	3200	female	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
333	Chinstrap	Dream	45.2	16.6	191	3250	female	2009
334	Chinstrap	Dream	49.3	19.9	203	4050	male	2009
335	Chinstrap	Dream	50.2	18.8	202	3800	male	2009
336	Chinstrap	Dream	45.6	19.4	194	3525	female	2009
337	Chinstrap	Dream	51.9	19.5	206	3950	male	2009
338	Chinstrap	Dream	46.8	16.5	189	3650	female	2009
339	Chinstrap	Dream	45.7	17.0	195	3650	female	2009
340	Chinstrap	Dream	55.8	19.8	207	4000	male	2009
341	Chinstrap	Dream	43.5	18.1	202	3400	female	2009
342	Chinstrap	Dream	49.6	18.2	193	3775	male	2009
343	Chinstrap	Dream	50.8	19.0	210	4100	male	2009
344	Chinstrap	Dream	50.2	18.7	198	3775	female	2009

Indeed, that is a lot of penguins. Precisely what I was looking for. I have some data, now I can start drawing some plots.
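For what it’s worth, here is a minimal sketch of what the missingstring argument is doing in that call. It’s a toy example of my own: the csv_text string and the toy variable are made up for illustration, with a few values copied from the rows printed above, and I’m reading from an in-memory IOBuffer rather than a file so that the snippet stands alone. It also uses CSV.read(), which as far as I can tell is an equivalent way to get a data frame out of CSV:

```julia
using CSV
using DataFrames

# A tiny stand-in for penguins.csv (hypothetical, just three rows)
csv_text = """
species,bill_length_mm,sex
Adelie,39.1,male
Adelie,NA,NA
Adelie,36.7,female
"""

# Treat the string "NA" (the way R writes missing values) as missing
toy = CSV.read(IOBuffer(csv_text), DataFrame; missingstring = "NA")

eltype(toy.bill_length_mm)   # Union{Missing, Float64}
ismissing(toy.sex[2])        # true
```

Without the missingstring argument I would expect the "NA" entries to be read as ordinary text, which would stop bill_length_mm from being parsed as a numeric column.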

Cover of 'The Handmaids Tale' — The Handmaids Tale by Margaret Atwood, because obviously what every queer person and woman wants to be thinking about in 2024 is what’s probably going to happen to us in the near future

Scatter plots

As you might imagine, there’s more than one way you can go about constructing a data visualisation in Julia. For example, there’s the AlgebraOfGraphics system that appears to share some design philosophy with the ggplot2 package in R. There’s also Compose, which aims to be a modernised version of the grid package in R, and the Gadfly data visualisation system built on top of it. Any of these might have been better choices for me to explore in the first instance, but for whatever reason¹ I chose instead to look at the Plots package and its extension package StatsPlots.

The impression I get from playing around with Plots/StatsPlots is that its design has more in common with the base graphics system in R than with ggplot2. There’s a generic function plot() that you can use to construct plots, and lots of more specific functions that are essentially wrappers to plot(). For example, I’m going to start by drawing a scatter plot. I could use the wrapper function scatter() for this purpose, but you don’t actually need to do this because it’s effectively the same thing as calling plot() with different defaults. For learning purposes I find it helpful to understand what the different arguments to plot() actually do, so I’ll avoid using the convenience functions here.

Okay, so here’s a simple scatter plot that plots the bill length of each penguin against the bill depth:

plot(
  penguins.bill_length_mm, 
  penguins.bill_depth_mm, 
  seriestype=:scatter,
  size=(500,500)
)

In this code chunk, the first argument specifies the variable to be plotted on the x-axis and the second specifies the variable to be plotted on the y-axis. No surprises there. The key thing to note is that in the later arguments I’ve specified two plot attributes. The seriestype attribute is what gives me a scatter plot (each observation is plotted with a single discrete plot marker), whereas the size attribute is used to control the size of the image produced in the output (in pixels).
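To make the “scatter() is just plot() with different defaults” claim a bit more concrete, here’s a small sketch. The vectors x and y are hypothetical stand-ins for the data frame columns (the values are copied from the first few complete rows shown earlier), and p1 and p2 are names I’ve made up for the two plot objects:

```julia
using Plots

# Bill length and bill depth for four Adelie penguins
x = [39.1, 39.5, 40.3, 36.7]
y = [18.7, 17.4, 18.0, 19.3]

# The generic function, with the series type set explicitly
p1 = plot(x, y, seriestype = :scatter, size = (500, 500))

# The convenience wrapper, which sets the series type for me
p2 = scatter(x, y, size = (500, 500))
```

As far as I can tell the two calls should draw the same picture. The only difference is whether I name the series type myself or let the wrapper do it.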

Cool. Well, I can definitely draw something. That feels like a win.

Using the `@df` macro

One thing that immediately irritates me about the code I wrote in the last section is that I was passing the raw vectors penguins.bill_length_mm and penguins.bill_depth_mm to the plotting function. When you’re working with a data frame that always feels clunky to me. What you really want to do is just use the column names :bill_length_mm and :bill_depth_mm. On the surface though that seems a little tricky to do, because the plot() function doesn’t necessarily need to be given data that are contained within a data frame.

This is where the @df macro comes in handy. Somehow, I have made it three posts into Julia and I have not yet actually had to use a macro for anything, but now is the time. I’ll talk about macros in just a moment, but for now let’s simply note that I can prefix my call to plot() with a magic bit of syntactic sugar, and now all of a sudden I can simply pass column names and everything works:

@df penguins plot(
  :bill_length_mm,
  :bill_depth_mm,
  seriestype=:scatter,
  group=:species,
  size=(500,500)
)

I’ve made a nice little change in this version of the plot too. By mapping the group attribute onto the :species column in the penguins data set, I’ve ended up with a plot that shows each of the three penguin species in a different colour and gives me a nice little explanatory legend.
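It’s worth noting that the group attribute belongs to Plots rather than to @df, so it should also work when I pass raw vectors. Here’s a minimal sketch under that assumption: the three vectors are hypothetical stand-ins for the data frame columns, using three Adelie and three Chinstrap rows from the table printed earlier:

```julia
using Plots

# Three Adelie rows followed by three Chinstrap rows
bill_length = [39.1, 39.5, 40.3, 45.2, 49.3, 50.2]
bill_depth  = [18.7, 17.4, 18.0, 16.6, 19.9, 18.8]
species     = ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Chinstrap"]

# One series per distinct value of the grouping vector
p = plot(
  bill_length,
  bill_depth,
  seriestype = :scatter,
  group = species,
  size = (500, 500)
)
```

So what @df buys me is the convenience of writing :species rather than a vector of species labels; the grouping itself is handled by plot().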

I can take this line of thinking a little further and modify other relevant attributes:

@df penguins plot(
  :bill_length_mm,
  :bill_depth_mm,
  seriestype=:scatter,
  group=:species,
  title="Palmer Penguins",
  xlabel="Bill Length (mm)",
  ylabel="Bill Depth (mm)",
  size=(500,500)
)

Now I have a plot that has a mildly informative title, and axis labels that are nicer to read. Definitely making some progress now.

Cover of 'The Day of the Triffids' — The Day of the Triffids by John Wyndham. Well, it’s more optimistic than Handmaids Tale I guess

Macros

Siiiiiiigh. Okay, I used a macro. I suppose I’d better take a quick look at how those things work, right? Macros are part of the Metaprogramming toolkit in Julia, and can be used to manipulate Julia expressions passed by the user, making it possible for code that wouldn’t otherwise be valid Julia to be executed. In essence we’re talking about something qualitatively similar to “non-standard evaluation” in R.

Based on that description, you can kind of see what the @df macro is doing in the example plots from the last section. It’s taking a call to plot() that specifies only the column names that are associated with a particular vector of observations, and replacing them with the actual data values stored in the relevant data frame when the code is evaluated.

You can see this in action here:

@df penguins println(:bill_length_mm[1:3])

Union{Missing, Float64}[39.1, 39.5, 40.3]

In this code, what @df is doing is transforming :bill_length_mm[1:3] (which really shouldn’t work at all) into penguins.bill_length_mm[1:3]. So as a consequence, the code that actually gets executed here is something like println(penguins.bill_length_mm[1:3]), and you get sensible output.

Well, sort of. I haven’t quite looked into this in a lot of detail yet, and I’m not quite at the point where I’m really prepared to start writing macros of my own, but it does look like @df is slightly more sophisticated, which you can see by using the @macroexpand macro to see what actually gets executed in my little example above:

@macroexpand @df penguins println(:bill_length_mm[1:3])

:(((var"##297"->begin
          ((var"##bill_length_mm#298",), var"##299") = (StatsPlots).extract_columns_and_names(var"##297", :bill_length_mm)
          (StatsPlots).add_label(["(bill_length_mm)[13]"], println, var"##bill_length_mm#298"[1:3])
      end))(penguins))

Okay yeah, there’s a little bit more going on than meets the eye (presumably in part because @df is designed to work in the context of the Plots package), but the basic idea makes sense at least. I’m happy to move on and defer a proper dive into macros for a future occasion.
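To convince myself that the basic idea really is that simple, here’s a toy macro of my own that does the naive version of the trick. To be very clear, this is not how StatsPlots implements @df: the names @cols, replace_symbols, and toy are all hypothetical, and the sketch only uses base Julia. It walks the expression it is given and rewrites every quoted symbol such as :bill_length_mm into a column lookup on the table:

```julia
# Anything that isn't a quoted symbol or a nested expression is left alone
replace_symbols(x, tbl) = x

# A quoted symbol like :a becomes Base.getproperty(tbl, :a)
replace_symbols(q::QuoteNode, tbl) =
    q.value isa Symbol ? :(Base.getproperty($tbl, $q)) : q

# For nested expressions, rewrite each argument recursively
replace_symbols(ex::Expr, tbl) =
    Expr(ex.head, (replace_symbols(a, tbl) for a in ex.args)...)

# The macro receives the unevaluated expression and returns the rewritten one
macro cols(tbl, ex)
    esc(replace_symbols(ex, tbl))
end

# A named tuple standing in for the data frame
toy = (bill_length_mm = [39.1, 39.5, 40.3, 36.7],)

first_three = @cols toy first(:bill_length_mm, 3)   # [39.1, 39.5, 40.3]
```

The code that actually runs in the last line is first(Base.getproperty(toy, :bill_length_mm), 3). One obvious flaw in this naive version is that it rewrites every quoted symbol, including ones that are meant to stay as symbols (like the :scatter in seriestype=:scatter), which is presumably part of why the real @df has more going on.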

Cover of 'Earthsea: The First Four Books' — The Earthsea Quartet by Ursula Le Guin. Much better!

Violin plots

At this point I am getting exhausted. I have written far more on Julia than I intended to, so I’m going to do one more plot and call it a day. In my previous examples I used the seriestype argument to plot() to ensure that what I got out at the end was a scatter plot. I could alternatively set seriestype=:violin to get a violin plot.

Here’s an example, just to prove to myself that I understand plot() well enough to create various different kinds of data visualisations. In the code below I’ll first define a bill_lengths data frame that contains only the columns I need and – importantly – removes the missing values (because the violin series can’t handle missing data). Then I’ll use plot() to create a violin plot:

bill_lengths = penguins |> 
  d -> subset(d, :bill_length_mm => b -> .!ismissing.(b)) |>
  d -> select(d, [:species, :bill_length_mm])

@df bill_lengths plot(
  string.(:species),
  :bill_length_mm,
  seriestype=:violin,
  legend=false,
  xlabel="Species",
  ylabel="Bill Length (mm)",
  size=(500,500)
)

Yes, that’s what I wanted. Good enough.
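As an aside, the DataFrames package also has a dropmissing() function that I could have used for the filtering step instead of subset(). Here’s a minimal sketch on a toy data frame; the names toy and bill_lengths_alt are my own, and the values are rows 1, 3 and 4 of the table printed earlier:

```julia
using DataFrames

# Three Adelie penguins, one of which has no bill length recorded
toy = DataFrame(
  species = ["Adelie", "Adelie", "Adelie"],
  bill_length_mm = [39.1, 40.3, missing]
)

# Drop rows where bill_length_mm is missing, then keep the two columns I need
bill_lengths_alt = select(
  dropmissing(toy, :bill_length_mm),
  [:species, :bill_length_mm]
)

nrow(bill_lengths_alt)   # 2
```

A nice side effect is that dropmissing() by default also narrows the column type, so bill_length_mm ends up as a plain Float64 column rather than one that allows missing values.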

Cover of 'Science Fiction Omnibus' — A science fiction omnibus edited by Brian Aldiss. Yes I was getting lazy at this point

Wrap up

This has been a long series of posts, written all in one go. I was originally planning to write only the one blog post: I mean, all I wanted to do here was teach myself a tiny bit of Julia and scribble down a few notes. But learning a new language always involves introducing yourself to a lot of new concepts, and so the post got very long and needed to be broken down into parts.

Besides, I’ve really enjoyed learning the basics of Julia. It feels surprisingly comfortable to me, capturing a lot of what I really love about R and also reminding me of the bits about Matlab that I didn’t hate. But it’s also designed for performance in a way that both R and Matlab sometimes struggle with (e.g., in R the solution to “how do I make it fast?” is so often “rewrite the slow bit in C++”), so I guess I can see why a lot of people I admire have a lot of positive things to say about Julia.

So yeah. I’m exhausted. I’ve written too much. But I’m happy nevertheless.

Footnotes

¹ Honestly, at this point I don’t even know why I’m making the choices I’m making in this increasingly-unhinged series of posts.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{navarro2024,
  author = {Navarro, Danielle},
  title = {Plotting Data in {Julia}},
  date = {2024-03-03},
  url = {https://blog.djnavarro.net/posts/2024-03-03_julia-plots/},
  langid = {en}
}

For attribution, please cite this work as:

Navarro, Danielle. 2024. “Plotting Data in Julia.” March 3, 2024. https://blog.djnavarro.net/posts/2024-03-03_julia-plots/.