Working with data in Julia – Notes from a data witch

This is the second of an impromptu three-part series in which, in a decision I am rapidly starting to regret as these posts get longer and longer, I decided it was time to teach myself how to use Julia. In the first part of the series I looked at some foundational concepts (types, functions, pipes, etc), though in a completely idiosyncratic way and ignoring concepts that I find boring (loops, conditionals). I mean, this is a blog where I write “notes to self”. It’s not a textbook.

Anyway… my thought for the second part of the series is to shift away from core programming concepts and instead look at a practical task that data analysts have to work with on a daily basis: wrangling rectangular data sets. In other words, I’m going to talk about data frames and tools for manipulating them.

Reimagined Mass Effect 1 cover showing the Normandy departing the Citadel — Mass Effect 1. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. The astute observer will notice that at no point in this post does the main text reference the *Mass Effect* games. But I don’t care, because they are awesome.

Creating data frames

Unlike R, Julia doesn’t come with a native class to represent data frames. Instead, there is the DataFrames package which provides the functionality needed to represent tabular data. The DataFrame() function allows you to manually construct a data frame, with a syntax that feels very familiar to an R user. Vectors passed as inputs to DataFrame() must all have one element for every row in the data frame, or else be length one.

Continuing the vague science fiction theme that started in the previous post, I’ll start by constructing a small data frame listing the novels from William Gibson’s Sprawl trilogy, which I enjoyed considerably more than Asimov’s Foundation series that (very very loosely) inspired the TV show of the same name. Anyway, here’s how you do that:

using DataFrames

sprawl = DataFrame(
  title = ["Neuromancer", "Count Zero", "Mona Lisa Overdrive"],
  published = [1984, 1986, 1988], 
  author = "William Gibson"
)

3×3 DataFrame

Row	title	published	author
	String	Int64	String
1	Neuromancer	1984	William Gibson
2	Count Zero	1986	William Gibson
3	Mona Lisa Overdrive	1988	William Gibson

Data frames have pretty print methods so the output looks quite nice here. But internally it’s essentially a collection of vectors, one for each column. For example, sprawl.title is a vector of three strings:

sprawl.title

3-element Vector{String}:
 "Neuromancer"
 "Count Zero"
 "Mona Lisa Overdrive"

In real life though, you don’t usually construct a data frame manually. It’s more typical to import a data frame from a CSV file or similar. To that end, we can take advantage of the CSV package to read data from a data file:

using CSV
starwars_csv = CSV.File("starwars.csv"; missingstring = "NA");

This starwars_csv object isn’t a data frame yet, it’s an object of type CSV.file. Data frames are columnar data structures (i.e., a collection of vectors, one per column), whereas a CSV.file is a rowwise data structure (i.e., a collection of CSV.row objects, one per row). You could test this for yourself by taking a look at the first element starwars_csv[1] to verify that it’s a representation of a single CSV row, but the output isn’t very interesting so I’m going to move on.

To convert this CSV.file object to a DataFrame object, we can simply pass it to DataFrame(), and this time around the data we end up with is a little bit richer than the last one (even if the Star Wars movies are incredibly boring compared to the infinitely superior Sprawl novels…)

starwars = DataFrame(starwars_csv)

87×11 DataFrame

62 rows omitted

Row	name	height	mass	hair_color	skin_color	eye_color	birth_year	sex	gender	homeworld	species
	String31	Int64?	Float64?	String15?	String31	String15	Float64?	String15?	String15?	String15?	String15?
1	Luke Skywalker	172	77.0	blond	fair	blue	19.0	male	masculine	Tatooine	Human
2	C-3PO	167	75.0	missing	gold	yellow	112.0	none	masculine	Tatooine	Droid
3	R2-D2	96	32.0	missing	white, blue	red	33.0	none	masculine	Naboo	Droid
4	Darth Vader	202	136.0	none	white	yellow	41.9	male	masculine	Tatooine	Human
5	Leia Organa	150	49.0	brown	light	brown	19.0	female	feminine	Alderaan	Human
6	Owen Lars	178	120.0	brown, grey	light	blue	52.0	male	masculine	Tatooine	Human
7	Beru Whitesun Lars	165	75.0	brown	light	blue	47.0	female	feminine	Tatooine	Human
8	R5-D4	97	32.0	missing	white, red	red	missing	none	masculine	Tatooine	Droid
9	Biggs Darklighter	183	84.0	black	light	brown	24.0	male	masculine	Tatooine	Human
10	Obi-Wan Kenobi	182	77.0	auburn, white	fair	blue-gray	57.0	male	masculine	Stewjon	Human
11	Anakin Skywalker	188	84.0	blond	fair	blue	41.9	male	masculine	Tatooine	Human
12	Wilhuff Tarkin	180	missing	auburn, grey	fair	blue	64.0	male	masculine	Eriadu	Human
13	Chewbacca	228	112.0	brown	unknown	blue	200.0	male	masculine	Kashyyyk	Wookiee
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
76	San Hill	191	missing	none	grey	gold	missing	male	masculine	Muunilinst	Muun
77	Shaak Ti	178	57.0	none	red, blue, white	black	missing	female	feminine	Shili	Togruta
78	Grievous	216	159.0	none	brown, white	green, yellow	missing	male	masculine	Kalee	Kaleesh
79	Tarfful	234	136.0	brown	brown	blue	missing	male	masculine	Kashyyyk	Wookiee
80	Raymus Antilles	188	79.0	brown	light	brown	missing	male	masculine	Alderaan	Human
81	Sly Moore	178	48.0	none	pale	white	missing	missing	missing	Umbara	missing
82	Tion Medon	206	80.0	none	grey	black	missing	male	masculine	Utapau	Pau'an
83	Finn	missing	missing	black	dark	dark	missing	male	masculine	missing	Human
84	Rey	missing	missing	brown	light	hazel	missing	female	feminine	missing	Human
85	Poe Dameron	missing	missing	brown	light	brown	missing	male	masculine	missing	Human
86	BB8	missing	missing	none	none	black	missing	none	masculine	missing	Droid
87	Captain Phasma	missing	missing	none	none	unknown	missing	female	feminine	missing	Human

Subsetting data frames I

The core tools for working with data frames in Julia feel quite familiar coming from either Matlab or R. You can subset a data frame by passing it numeric indices, for instance:

starwars[1:6, 1:5]

6×5 DataFrame

Row	name	height	mass	hair_color	skin_color
	String31	Int64?	Float64?	String15?	String31
1	Luke Skywalker	172	77.0	blond	fair
2	C-3PO	167	75.0	missing	gold
3	R2-D2	96	32.0	missing	white, blue
4	Darth Vader	202	136.0	none	white
5	Leia Organa	150	49.0	brown	light
6	Owen Lars	178	120.0	brown, grey	light

However, there are other methods for subsetting a data frame. You can also filter the rows of a data frame using logical expressions. Again, this is quite similar to how it works in base R. For instance, I can construct a boolean vector fair_skinned which indicates whether the corresponding row in starwars refers to a person with fair skin:¹

fair_skinned = starwars.skin_color .== "fair";

Now that I have these indices, I can create a subset of the data frame containing only those rows referring to fair skinned person (or robot, or…)

starwars[fair_skinned, 1:5]

17×5 DataFrame

Row	name	height	mass	hair_color	skin_color
	String31	Int64?	Float64?	String15?	String31
1	Luke Skywalker	172	77.0	blond	fair
2	Obi-Wan Kenobi	182	77.0	auburn, white	fair
3	Anakin Skywalker	188	84.0	blond	fair
4	Wilhuff Tarkin	180	missing	auburn, grey	fair
5	Han Solo	180	80.0	brown	fair
6	Wedge Antilles	170	77.0	brown	fair
7	Jek Tono Porkins	180	110.0	brown	fair
8	Boba Fett	183	78.2	black	fair
9	Mon Mothma	150	missing	auburn	fair
10	Arvel Crynyd	missing	missing	brown	fair
11	Qui-Gon Jinn	193	89.0	brown	fair
12	Finis Valorum	170	missing	blond	fair
13	Ric Olié	183	missing	brown	fair
14	Shmi Skywalker	163	missing	black	fair
15	Cliegg Lars	183	missing	brown	fair
16	Dooku	193	80.0	white	fair
17	Jocasta Nu	167	missing	white	fair

On the columns side, we also have more flexible options for subsetting a data frame. For example, instead of referring to columns using numerical indices, we can select the variables that we want to keep using their names:

starwars[1:6, [:name, :gender, :homeworld]]

6×3 DataFrame

Row	name	gender	homeworld
	String31	String15?	String15?
1	Luke Skywalker	masculine	Tatooine
2	C-3PO	masculine	Tatooine
3	R2-D2	masculine	Naboo
4	Darth Vader	masculine	Tatooine
5	Leia Organa	feminine	Alderaan
6	Owen Lars	masculine	Tatooine

Referring to columns by name is very handy in practice, and there’s some hidden Julia concepts here that I didn’t talk about in the last post. So with that in mind I’ll digress slightly to talk about…

Symbols

Looking at the syntax in the last code cell, it’s fairly clear that [:name, :gender, :homeworld] is a vector of three… somethings, but it’s not immediately obvious what :name actually is. Much like R (and also inherited from Lisp) Julia has extensive Metaprogramming capabilities because it has the ability to represent Julia code as data structures within the language itself. In the simplest case, we have Symbols like :name, which are constructed using the quotation operator : and used to represent object names. So as you can see, :name is an object of type Symbol:

typeof(:name)

Symbol

Symbols can be assigned to variables, and those variables can be used as part of expressions to be evaluated. In the code below I create a variable colname that stores the symbolic representation of a column name that I can invoke later:

colname = :title

:title

As a simple example of how symbols can be used in practice, here’s a Julia implementation of something like the pull() function in the R package dplyr, which allows the user to extract a single column from a data frame:

function pull(data::DataFrame, column::Symbol)
  getproperty(data, column)
end;

In this code I’m using the getproperty() function to do the same job that the . operator would do in an expression like sprawl.title. So here it is in action:

pull(sprawl, :title)

3-element Vector{String}:
 "Neuromancer"
 "Count Zero"
 "Mona Lisa Overdrive"

I know, it’s exciting right?

Okay yeah, at the moment this pull() function isn’t very useful at all – pull(sprawl, :title) is really not an improvement on sprawl.title – but a little bit later when I get around to talking about data wrangling pipelines it might turn out to be a little less silly.

Reimagined Mass Effect 2 cover showing the Normandy attacked by a Collector ship — Mass Effect 2. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. Still the strangest of the three games: the main storyline with the Collectors is a hot mess, but it has the best side quests in the series, and the best romance too (Thane, obviously…)

Subsetting data frames II

Anyway, getting back on track, the key thing to realise is that when I wrote [:name, :gender, :homeworld] earlier what I was really doing is constructing a vector of symbols, and it’s those symbols that I was using to select the columns that I wanted to retain. The DataFrames package also supplies a various selector functions that can be used to extract a subset of the columns. For example, Not() will select every column except the ones that are passed to Not(). So if I want to drop the hair color, eye color, sex, and homeworld columns, I could do this:

starwars[1:6, Not([:hair_color, :eye_color, :sex, :homeworld])]

6×7 DataFrame

Row	name	height	mass	skin_color	birth_year	gender	species
	String31	Int64?	Float64?	String31	Float64?	String15?	String15?
1	Luke Skywalker	172	77.0	fair	19.0	masculine	Human
2	C-3PO	167	75.0	gold	112.0	masculine	Droid
3	R2-D2	96	32.0	white, blue	33.0	masculine	Droid
4	Darth Vader	202	136.0	white	41.9	masculine	Human
5	Leia Organa	150	49.0	light	19.0	feminine	Human
6	Owen Lars	178	120.0	light	52.0	masculine	Human

The Between() selector does what you’d think. It returns all columns in between two named columns:

starwars[1:6, Between(:sex, :homeworld)]

6×3 DataFrame

Row	sex	gender	homeworld
	String15?	String15?	String15?
1	male	masculine	Tatooine
2	none	masculine	Tatooine
3	none	masculine	Naboo
4	male	masculine	Tatooine
5	female	feminine	Alderaan
6	male	masculine	Tatooine

There’s also an All() selector that returns all columns, but that’s not super exciting. More interesting, I think, is the Cols() selector which takes a predicate function as input.² The column names are passed to the function, and they are included in the output if that function returns true. So, for example, if I want to extract the columns in the data whose name ends in "color" I can do this:

starwars[1:6, Cols(x -> endswith(x, "color"))]

6×3 DataFrame

Row	hair_color	skin_color	eye_color
	String15?	String31	String15
1	blond	fair	blue
2	missing	gold	yellow
3	missing	white, blue	red
4	none	white	yellow
5	brown	light	brown
6	brown, grey	light	blue

I find myself liking these selector functions. Coming from the tidyverse style in R where tidyselect is used to govern column selection it feels… not terribly different. Superficially different, perhaps, but the combination of All(), Not(), Between(), and Cols() seems to provide a fairly powerful and (I think?) user-friendly way to select columns.

Reimagined Mass Effect 3 cover showing the Normandy facing a fleet fo reapers — Mass Effect 3. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. No, I will not be drawn into expressing a comment on the ending. I love ME3, in part because every Shepard I’ve ever played comes into this game already completely broken and makes unhinged choices because of it…

Data wrangling I: groupby, combine

Up to this point I haven’t really done any data wrangling with the starwars data. Okay, yeah, to some extent there’s some data wrangling implied by the discussion of subsetting in the previous sections, but in truth none of that is how you’d normally go about it in a more real-world context. So to that end I’ll talk about some of the data wrangling functions that DataFrames supplies.

Let’s start with something simple, and not very useful. Suppose what I want to do here is group the data by gender and sex, and then for every unique combination of gender and sex that appears in the data set have Julia pick one row at random and report the corresponding mass. To do that is a two step operation. First, I need to use groupby() to describe the groups, and then I need to call combine() to tell Julia what function to apply separately for each group. This does the trick:

combine(groupby(starwars, [:gender, :sex]), :mass => rand)

6×3 DataFrame

Row	gender	sex	mass_rand
	String15?	String15?	Float64?
1	masculine	male	74.0
2	masculine	none	32.0
3	masculine	hermaphroditic	1358.0
4	feminine	none	missing
5	feminine	female	missing
6	missing	missing	110.0

In the call to groupby(starwars, [:gender, :sex]) what Julia does is construct a grouped data frame (very similar to what you expect in R, really), and then this grouped data frame is passed to combine(). For each such group, we take the relevant subset of the :mass column and pass it to the rand() function, and by doing so a random mass is returned.

There’s some obvious limitations to note in my code here though. Firstly, I’m not using the pipe |> at all, and while it’s sort of fine in this context because there’s only two steps in my data wrangling exercise, the code is going to get very ugly very quickly if I try to do something fancier.³ So let’s start by fixing this.

As I mentioned in the first post in this series, one way I could transform this into a pipeline is to use the Pipe package, which supplies a pipe that behaves very similarly to the base pipe in R. However, I’m not going to do that. Instead, I’m going to adopt a workflow where I use the Julia base pipe together with anonymous functions. Here’s the same code expressed in this kind of pipeline:⁴

starwars |>
  d -> groupby(d, [:gender, :sex]) |>
  d -> combine(d, :mass => rand)

6×3 DataFrame

Row	gender	sex	mass_rand
	String15?	String15?	Float64?
1	masculine	male	missing
2	masculine	none	32.0
3	masculine	hermaphroditic	1358.0
4	feminine	none	missing
5	feminine	female	missing
6	missing	missing	85.0

I genuinely wasn’t expecting this when I first learned about the restrictiveness of the Julia pipe, but I think I really like this syntax. Because you have to define an anonymous function at each step in the pipeline, I find myself noticing that:

It’s only slightly more verbose than the R style, and has the advantage (to my mind) that you can use this workflow without having to think too much about Julia macros
The input argument (in this case d) serves the same role that the placeholder (_ for the R base pipe and the Julia “Pipe-package-pipe”, or . for the R magrittr pipe)
You have the ability to subtly remind yourself of the internal workings of your pipeline by naming the input argument cleverly. If the input to this step in the pipeline is a data frame I tend to call the input argument d, but if – as sometimes happens in real life – at some point in the pipeline I pull out a column and do a bit of processing on that before returning the results, I might find it handy to use something else to remind myself that this step is applied to string variables.

As regards that third point, here’s an example using the pull() function that I defined earlier that does exactly this:

starwars |>
  d -> subset(d, :skin_color => x -> x.=="fair") |>
  d -> pull(d, :name) |>
  n -> map(x -> split(x, " ")[1], n)

17-element Vector{SubString{String31}}:
 "Luke"
 "Obi-Wan"
 "Anakin"
 "Wilhuff"
 "Han"
 "Wedge"
 "Jek"
 "Boba"
 "Mon"
 "Arvel"
 "Qui-Gon"
 "Finis"
 "Ric"
 "Shmi"
 "Cliegg"
 "Dooku"
 "Jocasta"

Again, not the most exciting pipeline in the world – all I’m doing is returning the first names of all the fair-skinned characters – but it does highlight the fact that the combination of base pipe and anonymous function syntax in Julia works rather well if you’re inclined to write in this style.

In fact, the ability to name the input argument is especially helpful in the last line of the pipe where there are two separate functions being used, one of which is a call to map() applied to the :name column (and takes n as the input), and another that is used by map() when extracting the first name out of every name (where I’ve unimaginatively used x to name my input).

In any case, though it may not be to everyone’s tastes, I’ve found a pipe-centric style that I can use in Julia that I don’t mind, so it’s time to move on and look at some other functions available in the DataFrames package.

Data wrangling II: subset, select, sort

In the previous section I gave an example of a workflow that uses groupby() and combine() to compute summaries of a data frame. But there are other functions that come in very handy for data wrangling: I can use subset() to choose a subset of rows⁵, select() to choose a subset of columns, and sort() to order the rows according to some criterion. For example, here’s how I could find all the characters from Tattooine and sort them by weight:

starwars |> 
  d -> select(d, [:name, :mass, :homeworld]) |>
  d -> subset(d, :homeworld => h -> h.=="Tatooine", skipmissing=true) |>
  d -> sort(d, :mass)

10×3 DataFrame

Row	name	mass	homeworld
	String31	Float64?	String15?
1	R5-D4	32.0	Tatooine
2	C-3PO	75.0	Tatooine
3	Beru Whitesun Lars	75.0	Tatooine
4	Luke Skywalker	77.0	Tatooine
5	Biggs Darklighter	84.0	Tatooine
6	Anakin Skywalker	84.0	Tatooine
7	Owen Lars	120.0	Tatooine
8	Darth Vader	136.0	Tatooine
9	Shmi Skywalker	missing	Tatooine
10	Cliegg Lars	missing	Tatooine

Notice that this time I’ve been a little smarter about handling missing values. In the call to subset() I specified skipmissing=true to drop all cases where the homeworld is missing. The sort() function doesn’t have a skipmissing argument, so the results include the two cases where someone from Tatooine doesn’t have a stated weight. But hopefully it’s clear that I could easily subset the data again to remove any cases with missing values on the :mass column if I wanted to. In fact, the DataFrames package supplies functions dropmissing(), allowmissing(), and completecases() that could be used for that purpose. For example:

starwars |> 
  d -> select(d, [:name, :mass, :homeworld]) |>
  d -> subset(d, :homeworld => h -> h.=="Tatooine", skipmissing=true) |>
  d -> sort(d, :mass, rev=true) |>
  d -> dropmissing(d, :mass)

8×3 DataFrame

Row	name	mass	homeworld
	String31	Float64	String15?
1	Darth Vader	136.0	Tatooine
2	Owen Lars	120.0	Tatooine
3	Biggs Darklighter	84.0	Tatooine
4	Anakin Skywalker	84.0	Tatooine
5	Luke Skywalker	77.0	Tatooine
6	C-3PO	75.0	Tatooine
7	Beru Whitesun Lars	75.0	Tatooine
8	R5-D4	32.0	Tatooine

The missing :mass rows are now gone, and – just for my own personal amusement – this time I’ve sorted the results in order of descending weight by setting rev=true.

Star wars X-wing and tie-fighters in front of the Death Star — Star Wars. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. If I had more Mass Effect images to use here I would but alas, I do not.

Data wrangling III: stack

Okay, now it’s time to start thinking about how to reshape a data frame in Julia. Consider this, as the beginnings of a pipeline:

starwars |>
  d -> select(d, [:name, :eye_color, :skin_color, :hair_color])

87×4 DataFrame

62 rows omitted

Row	name	eye_color	skin_color	hair_color
	String31	String15	String31	String15?
1	Luke Skywalker	blue	fair	blond
2	C-3PO	yellow	gold	missing
3	R2-D2	red	white, blue	missing
4	Darth Vader	yellow	white	none
5	Leia Organa	brown	light	brown
6	Owen Lars	blue	light	brown, grey
7	Beru Whitesun Lars	blue	light	brown
8	R5-D4	red	white, red	missing
9	Biggs Darklighter	brown	light	black
10	Obi-Wan Kenobi	blue-gray	fair	auburn, white
11	Anakin Skywalker	blue	fair	blond
12	Wilhuff Tarkin	blue	fair	auburn, grey
13	Chewbacca	blue	unknown	brown
⋮	⋮	⋮	⋮	⋮
76	San Hill	gold	grey	none
77	Shaak Ti	black	red, blue, white	none
78	Grievous	green, yellow	brown, white	none
79	Tarfful	blue	brown	brown
80	Raymus Antilles	brown	light	brown
81	Sly Moore	white	pale	none
82	Tion Medon	black	grey	none
83	Finn	dark	dark	black
84	Rey	hazel	light	brown
85	Poe Dameron	brown	light	brown
86	BB8	black	none	none
87	Captain Phasma	unknown	none	none

Suppose what I want to do is transform this into a data set that has variables :name, :body_part, and :color. In other words I want to pivot this into a long-form data set where each character is represented by three rows, and has one row that specifies the colour of the relevant body part. We can do this with the stack() function:

starwars |>
  d -> select(d, [:name, :eye_color, :skin_color, :hair_color]) |>
  d -> stack(d, [:eye_color, :skin_color, :hair_color],
    variable_name=:body_part,
    value_name=:color
  )

261×3 DataFrame

236 rows omitted

Row	name	body_part	color
	String31	String	String31?
1	Luke Skywalker	eye_color	blue
2	C-3PO	eye_color	yellow
3	R2-D2	eye_color	red
4	Darth Vader	eye_color	yellow
5	Leia Organa	eye_color	brown
6	Owen Lars	eye_color	blue
7	Beru Whitesun Lars	eye_color	blue
8	R5-D4	eye_color	red
9	Biggs Darklighter	eye_color	brown
10	Obi-Wan Kenobi	eye_color	blue-gray
11	Anakin Skywalker	eye_color	blue
12	Wilhuff Tarkin	eye_color	blue
13	Chewbacca	eye_color	blue
⋮	⋮	⋮	⋮
250	San Hill	hair_color	none
251	Shaak Ti	hair_color	none
252	Grievous	hair_color	none
253	Tarfful	hair_color	brown
254	Raymus Antilles	hair_color	brown
255	Sly Moore	hair_color	none
256	Tion Medon	hair_color	none
257	Finn	hair_color	black
258	Rey	hair_color	brown
259	Poe Dameron	hair_color	brown
260	BB8	hair_color	none
261	Captain Phasma	hair_color	none

Data wrangling IV: unstack

We can also go the other way. Let’s start with a slightly different data frame called census, one that counts the number of characters of each species on each homeworld

census = starwars |> 
  d -> dropmissing(d, [:homeworld, :species]) |>
  d -> groupby(d, [:homeworld, :species]) |> 
  d -> combine(d, :name => (n -> length(n)) => :count)

51×3 DataFrame

26 rows omitted

Row	homeworld	species	count
	String15	String15	Int64
1	Tatooine	Human	8
2	Tatooine	Droid	2
3	Naboo	Droid	1
4	Alderaan	Human	3
5	Stewjon	Human	1
6	Eriadu	Human	1
7	Kashyyyk	Wookiee	2
8	Corellia	Human	2
9	Rodia	Rodian	1
10	Nal Hutta	Hutt	1
11	Naboo	Human	5
12	Kamino	Human	1
13	Trandosha	Trandoshan	1
⋮	⋮	⋮	⋮
40	Geonosis	Geonosian	1
41	Mirial	Mirialan	2
42	Serenno	Human	1
43	Concord Dawn	Human	1
44	Zolan	Clawdite	1
45	Ojom	Besalisk	1
46	Kamino	Kaminoan	2
47	Skako	Skakoan	1
48	Muunilinst	Muun	1
49	Shili	Togruta	1
50	Kalee	Kaleesh	1
51	Utapau	Pau'an	1

So now, if I wanted a version of this data set with one row per :homeworld and a column for each :species that contains the :count of the number of characters of that species on the corresponding world, I could use unstack() to pivot from long-form to wide-form data like this:

unstack(census, :species, :count, fill=0)

46×37 DataFrame

21 rows omitted

Row	homeworld	Human	Droid	Wookiee	Rodian	Hutt	Trandoshan	Mon Calamari	Ewok	Sullustan	Neimodian	Gungan	Toydarian	Dug	Zabrak	Twi'lek	Aleena	Vulptereen	Xexto	Toong	Cerean	Nautolan	Tholothian	Iktotchi	Quermian	Kel Dor	Chagrian	Geonosian	Mirialan	Clawdite	Besalisk	Kaminoan	Skakoan	Muun	Togruta	Kaleesh	Pau'an
	String15	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64
1	Tatooine	8	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	Naboo	5	1	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	Alderaan	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	Stewjon	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
5	Eriadu	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
6	Kashyyyk	0	0	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
7	Corellia	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
8	Rodia	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
9	Nal Hutta	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
10	Kamino	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0
11	Trandosha	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
12	Socorro	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
13	Bespin	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
35	Champala	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
36	Geonosis	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0
37	Mirial	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0
38	Serenno	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
39	Concord Dawn	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
40	Zolan	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0
41	Ojom	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0
42	Skako	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
43	Muunilinst	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
44	Shili	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
45	Kalee	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
46	Utapau	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1

Here I’ve specified fill=0 to indicate missing values should be replaced with zeros, which is very sensible in this case because if there are no characters with a particular species/homeworld combination there wouldn’t be a row in census. Also, because I can, here’s a version that appears in a pipeline where I return only a subset of species, and – in act of appalling xenophobia – consider only planets inhabited by at least one human character:

census|>
  d -> unstack(d, :species, :count, fill=0) |>
  d -> select(d, [:homeworld, :Human, :Droid, :Ewok, :Wookiee, :Hutt]) |>
  d -> subset(d, :Human => h -> h .> 0)

14×6 DataFrame

Row	homeworld	Human	Droid	Ewok	Wookiee	Hutt
	String15	Int64	Int64	Int64	Int64	Int64
1	Tatooine	8	2	0	0	0
2	Naboo	5	1	0	0	0
3	Alderaan	3	0	0	0	0
4	Stewjon	1	0	0	0	0
5	Eriadu	1	0	0	0	0
6	Corellia	2	0	0	0	0
7	Kamino	1	0	0	0	0
8	Socorro	1	0	0	0	0
9	Bespin	1	0	0	0	0
10	Chandrila	1	0	0	0	0
11	Coruscant	2	0	0	0	0
12	Haruun Kal	1	0	0	0	0
13	Serenno	1	0	0	0	0
14	Concord Dawn	1	0	0	0	0

Hm. Not sure what those results say about the willingness of humans to mix with other species in the Star Wars universe but it’s probably not good to reflect on it too much.

Darth Vader, with the slogan 'Don't be like your father' — Also Star Wars. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence.

Data wrangling V: transform

In all honesty I am getting exhausted with this post, and mildly irrited at the fact that I’ve spent so much time in the Star Wars universe rather than in a fictional universe that I actually enjoy. So it’s time to start wrapping this one up. There’s only one more topic I really want to mention and that’s the transform() function which you can use to add new columns to a data frame.

starwars[1:6, [:name]] |> 
  d -> transform(d, :name => (n -> n.=="Darth Vader") => :lukes_father)

6×2 DataFrame

Row	name	lukes_father
	String31	Bool
1	Luke Skywalker	false
2	C-3PO	false
3	R2-D2	false
4	Darth Vader	true
5	Leia Organa	false
6	Owen Lars	false

There. It’s done.

Wrap up

No. Just no. There was already a first post and there’s about to be a third post. I am not being paid for this and I do not have the energy to think of a witty and erudite way to wrap up the unloved middle child of the trilogy. So let us never speak of this again.

Footnotes

As an aside, notice that I’ve used .== rather than == as the equality test. This is because == is a scalar operator: it doesn’t work for vectors unless you broadcast it using .↩︎
In this context, a predicate function is just one that returns true or false.↩︎
The other issue is that my code doesn’t handle missing data gracefully, but that will come up later so I’m ignoring it for now.↩︎
At some point I want to take a look at Tidier.jl, but that’s a topic for the future.↩︎
There is also filter() which has a slightly different syntax.↩︎

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{navarro2024,
  author = {Navarro, Danielle},
  title = {Working with Data in {Julia}},
  date = {2024-03-02},
  url = {https://blog.djnavarro.net/posts/2024-03-02_julia-data-frames/},
  langid = {en}
}

For attribution, please cite this work as:

Navarro, Danielle. 2024. “Working with Data in Julia.” March 2, 2024. https://blog.djnavarro.net/posts/2024-03-02_julia-data-frames/.