And so we come to the third post in my extremely ill-considered foray into learning Julia. In the first part of the series I taught myself some of the foundations for writing Julia code, in the second part I discussed data wrangling in Julia, and so in the third and – dear god in heaven please – last of them, I’ll take a look at data visualisation using the Plots package, with PalmerPenguins as my data set.
It will be brief this time, right? Right????
Penguins data
It is hardly a deep insight to say this, but if you’re going to play around with data visualisation tools it does help somewhat to have some data that you can plot. In the last post I pretty much exhausted my ability to look at the Star Wars data set that comes bundled with the dplyr R package, so I’ve picked something different this time. For no particular reason I decided to go with the Palmer Penguins data set that also exists in R, and has a Julia package in PalmerPenguins. However, due to a painful little episode part way through writing all these posts, I mistakenly decided that the PalmerPenguins package was causing me problems in the context of this quarto blog post (it wasn’t… the actual problem was that I had my Julia environment configured incorrectly), so I ended up writing a copy of the data to a CSV file and using that instead. Sigh.
Okay, let’s start by loading the packages I’ll need for this post:
using CSV
using DataFrames
using Plots
using StatsPlots
Amazing. Thrilling. Et cetera. Now that we’re all overwhelmed by the feeling of rising anticipation, let’s load the data set from the CSV file:
penguins = DataFrame(CSV.File("penguins.csv"; missingstring = "NA"))
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
 | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
4 | Adelie | Torgersen | missing | missing | missing | missing | missing | 2007 |
5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 |
10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 |
11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 |
12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 |
13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Indeed, that is a lot of penguins. Precisely what I was looking for. I have some data, now I can start drawing some plots.
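To see what missingstring = "NA" is doing in that call, here’s a minimal sketch using a tiny made-up CSV held in memory rather than the real penguins.csv file. The toy rows and the names csv_text and toy are my own inventions for illustration. The point is only that any NA entry gets parsed as missing, and the column type widens to allow it (which is what the question marks in the type row of the table indicate):

```julia
using CSV
using DataFrames

# A tiny made-up CSV (not the real penguins file), with "NA" marking gaps
csv_text = """
species,bill_length_mm,sex
Adelie,39.1,male
Adelie,NA,NA
Gentoo,46.1,female
"""

# Read from an in-memory buffer, treating the string "NA" as missing
toy = CSV.read(IOBuffer(csv_text), DataFrame; missingstring = "NA")

# Count the missing values in two of the columns
n_missing_bill = count(ismissing, toy.bill_length_mm)
n_missing_sex = count(ismissing, toy.sex)
```

Without the missingstring argument, I’d expect a column containing NA to be read as strings rather than numbers, which is not what you want when the next step is plotting.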
Scatter plots
As you might imagine, there’s more than one way you can go about constructing a data visualisation in Julia. For example, there’s the AlgebraOfGraphics system that appears to share some design philosophy with the ggplot2 package in R. There’s also Compose, which aims to be a modernised version of the grid package in R, and the Gadfly data visualisation system built on top of it. Any of these might have been better choices for me to explore in the first instance, but for whatever reason[1] I chose instead to look at the Plots package and its extension package StatsPlots.
The impression I get from playing around with Plots/StatsPlots is that its design has more in common with the base graphics system in R than with ggplot2. There’s a generic function plot() that you can use to construct plots, and lots of more specific functions that are essentially wrappers to plot(). For example, I’m going to start by drawing a scatter plot. I could use the wrapper function scatter() for this purpose, but you don’t actually need to do this because it’s effectively the same thing as calling plot() with different defaults. For learning purposes I find it helpful to understand what the different arguments to plot() actually do, so I’ll avoid using the convenience functions here.
Okay, so here’s a simple scatter plot that plots the bill length of each penguin against the bill depth:
plot(
    penguins.bill_length_mm,
    penguins.bill_depth_mm,
    seriestype=:scatter,
    size=(500,500)
)
In this code chunk, the first argument specifies the variable to be plotted on the x-axis and the second specifies the variable to be plotted on the y-axis. No surprises there. The key thing to note is that in the later arguments I’ve specified two plot attributes. The seriestype attribute is what gives me a scatter plot (each observation is plotted with a single discrete plot marker), whereas the size attribute is used to control the size of the image produced in the output (in pixels).
Cool. Well, I can definitely draw something. That feels like a win.
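As a quick sanity check on the claim that scatter() is effectively plot() with a different default, here’s a small sketch using made-up vectors rather than the penguins data. It builds the same plot both ways and reads the series type back out of each one. The indexing p[1][1][:seriestype] (first subplot, first series, then the attribute) is how I understand attribute lookup to work in Plots, so treat that part as an assumption on my side:

```julia
using Plots

# Toy data, made up purely for illustration
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.5, 3.0, 5.0]

# The generic function with an explicit series type...
p1 = plot(x, y, seriestype = :scatter, size = (500, 500))

# ...and the convenience wrapper, which sets the series type for me
p2 = scatter(x, y, size = (500, 500))

# Pull the series type back out of the first series in the first subplot
type1 = p1[1][1][:seriestype]
type2 = p2[1][1][:seriestype]
```

If the two really are equivalent, type1 and type2 should both be the symbol :scatter.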
Using the @df macro
One thing that immediately irritates me about the code I wrote in the last section is that I was passing the raw vectors penguins.bill_length_mm and penguins.bill_depth_mm to the plotting function. When you’re working with a data frame that always feels clunky to me. What you really want to do is just use the column names :bill_length_mm and :bill_depth_mm. On the surface though that seems a little tricky to do, because the plot() function doesn’t necessarily need to be given data that are contained within a data frame.
This is where the @df macro comes in handy. Somehow, I have made it three posts into Julia and I have not yet actually had to use a macro for anything, but now is the time. I’ll talk about macros in just a moment, but for now let’s simply note that I can prefix my call to plot() with a magic bit of syntactic sugar, and now all of a sudden I can simply pass column names and everything works:
@df penguins plot(
    :bill_length_mm,
    :bill_depth_mm,
    seriestype=:scatter,
    group=:species,
    size=(500,500)
)
I’ve made a nice little change in this version of the plot too. By mapping the group attribute onto the :species column in the penguins data set, I’ve ended up with a plot that shows each of the three penguin species in a different colour and gives me a nice little explanatory legend.
I can take this line of thinking a little further and modify other relevant attributes:
@df penguins plot(
    :bill_length_mm,
    :bill_depth_mm,
    seriestype=:scatter,
    group=:species,
    title="Palmer Penguins",
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
    size=(500,500)
)
Now I have a plot that has a mildly informative title, and axis labels that are nicer to read. Definitely making some progress now.
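For what it’s worth, my mental model of the group attribute is that it splits the data into one series per unique group value, and that’s why each species gets its own colour and legend entry. Here’s a minimal sketch with made-up data that counts the resulting series. Note that series_list is a field of the plot object rather than part of the documented interface as far as I know, so I’m assuming it’s safe to peek at for a quick check like this:

```julia
using Plots

# Toy data, made up purely for illustration: six points in three groups
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 3.5, 3.0, 5.0, 4.5, 6.0]
g = ["a", "a", "b", "b", "c", "c"]

p = plot(
    x,
    y,
    seriestype = :scatter,
    group = g,
    title = "Toy Groups",
    xlabel = "x",
    ylabel = "y"
)

# One series per unique value of g (assumes the series_list field exists)
n_series = length(p.series_list)
```

With three distinct labels in g, I’d expect n_series to be 3, which matches the three species in the penguins plot.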
Macros
Siiiiiiigh. Okay, I used a macro. I suppose I’d better take a quick look at how those things work, right? Macros are part of the Metaprogramming toolkit in Julia, and can be used to manipulate Julia expressions passed by the user, making it possible for code that wouldn’t otherwise be valid Julia to be executed. In essence we’re talking about something qualitatively similar to “non-standard evaluation” in R.
Based on that description, you can kind of see what the @df macro is doing in the example plots from the last section. It takes a call to plot() that specifies only the column names associated with particular vectors of observations, and replaces those names with the actual data values stored in the relevant data frame when the code is evaluated.
You can see this in action here:
@df penguins println(:bill_length_mm[1:3])
Union{Missing, Float64}[39.1, 39.5, 40.3]
In this code, what @df is doing is transforming :bill_length_mm[1:3] (which really shouldn’t work at all) into penguins.bill_length_mm[1:3]. So as a consequence, the code that actually gets executed here is something like println(penguins.bill_length_mm[1:3]), and you get sensible output.
Well, sort of. I haven’t quite looked into this in a lot of detail yet, and I’m not quite at the point where I’m really prepared to start writing macros of my own, but it does look like @df is slightly more sophisticated, which you can see by using the @macroexpand macro to see what actually gets executed in my little example above:
@macroexpand @df penguins println(:bill_length_mm[1:3])
:(((var"##297"->begin
((var"##bill_length_mm#298",), var"##299") = (StatsPlots).extract_columns_and_names(var"##297", :bill_length_mm)
(StatsPlots).add_label(["(bill_length_mm)[13]"], println, var"##bill_length_mm#298"[1:3])
end))(penguins))
Okay yeah, there’s a little bit more going on than meets the eye (presumably in part because @df is designed to work in the context of the Plots package), but the basic idea makes sense at least. I’m happy to move on and defer a proper dive into macros for a future occasion.
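To make the “replace column names with data” idea concrete, here’s a toy macro that does the naive version of it. To be very clear, this is not how StatsPlots implements @df (the expansion above shows that it does something more careful). The names @with_cols and replace_symbols are hypothetical ones I made up, and I’m using a named tuple as a stand-in for a data frame so the sketch needs nothing beyond base Julia:

```julia
# Walk an expression and swap every quoted symbol (e.g. :a) for `data.a`.
# Hypothetical helper, written only to illustrate the idea behind @df.
replace_symbols(ex, data) = ex   # numbers, plain names, etc. are left alone
replace_symbols(ex::QuoteNode, data) =
    ex.value isa Symbol ? Expr(:., data, ex) : ex
replace_symbols(ex::Expr, data) =
    Expr(ex.head, (replace_symbols(arg, data) for arg in ex.args)...)

# The macro receives unevaluated code, rewrites it, and hands it back
macro with_cols(data, ex)
    esc(replace_symbols(ex, data))
end

# A named tuple standing in for a data frame (made-up values)
toy = (bill_length_mm = [39.1, 39.5, 40.3, 36.7],)

# :bill_length_mm[1:3] is rewritten to toy.bill_length_mm[1:3] before it runs
first_three = @with_cols toy :bill_length_mm[1:3]
```

This naive version would mangle any expression that already contains a dotted access like x.y, since the y there is also stored as a quoted symbol, and I assume that’s one of the reasons a real implementation needs to be more sophisticated.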
Violin plots
At this point I am getting exhausted. I have written far more on Julia than I intended to, so I’m going to do one more plot and call it a day. In my previous examples I used the seriestype argument to plot() to ensure that what I got out at the end was a scatter plot. I could alternatively set seriestype=:violin to get a violin plot.
Here’s an example, just to prove to myself that I understand plot() well enough to create various different kinds of data visualisations. In the code below I’ll first define a bill_lengths data frame that contains only the columns I need and – importantly – removes the missing values (because the violin series can’t handle missing data). Then I’ll use plot() to create a violin plot:
bill_lengths = penguins |>
    d -> subset(d, :bill_length_mm => b -> .!ismissing.(b)) |>
    d -> select(d, [:species, :bill_length_mm])

@df bill_lengths plot(
    string.(:species),
    :bill_length_mm,
    seriestype=:violin,
    legend=false,
    xlabel="Species",
    ylabel="Bill Length (mm)",
    size=(500,500)
)
Yes, that’s what I wanted. Good enough.
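As an aside on the missing-value step, DataFrames also has a dropmissing() function that I could have used instead of subset(). Here’s a minimal sketch on a made-up three-row data frame (not the penguins data) comparing the two. As I understand it, the one practical difference is that dropmissing() also narrows the column type by default, whereas subset() leaves it as a type that still allows missing values:

```julia
using DataFrames

# Toy data frame, made up for illustration, with one missing bill length
toy = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo"],
    bill_length_mm = [39.1, missing, 46.1],
    body_mass_g = [3750, 3800, 5000]
)

# The approach used in the post: keep rows where the value is not missing
kept = subset(toy, :bill_length_mm => b -> .!ismissing.(b))

# The alternative: drop rows with a missing value in that column
dropped = dropmissing(toy, :bill_length_mm)

# Either way, select the two columns needed for the violin plot
kept_cols = select(kept, [:species, :bill_length_mm])
dropped_cols = select(dropped, [:species, :bill_length_mm])
```

Both versions end up with the same two rows, so for plotting purposes I don’t think it matters much which one you pick.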
Wrap up
This has been a long series of posts, written all in one go. I was originally planning to write only the one blog post: I mean, all I wanted to do here was teach myself a tiny bit of Julia and scribble down a few notes. But learning a new language always involves introducing yourself to a lot of new concepts, and so the post got very long and needed to be broken down into parts.
Besides, I’ve really enjoyed learning the basics of Julia. It feels surprisingly comfortable to me, capturing a lot of what I really love about R and also reminding me of the bits about Matlab that I didn’t hate. But it’s also designed for performance in a way that both R and Matlab sometimes struggle with (e.g., in R the solution to “how do I make it fast?” is so often “rewrite the slow bit in C++”), so I guess I can see why a lot of people I admire have a lot of positive things to say about Julia.
So yeah. I’m exhausted. I’ve written too much. But I’m happy nevertheless.
Footnotes
[1] Honestly, at this point I don’t even know why I’m making the choices I’m making in this increasingly-unhinged series of posts.
Citation
@online{navarro2024,
author = {Navarro, Danielle},
title = {Plotting Data in {Julia}},
date = {2024-03-03},
url = {https://blog.djnavarro.net/posts/2024-03-03_julia-plots/},
langid = {en}
}