This is the second of an impromptu three-part series in which, in a decision I am rapidly starting to regret as these posts get longer and longer, I decided it was time to teach myself how to use Julia. In the first part of the series I looked at some foundational concepts (types, functions, pipes, etc), though in a completely idiosyncratic way and ignoring concepts that I find boring (loops, conditionals). I mean, this is a blog where I write “notes to self”. It’s not a textbook.
Anyway… my thought for the second part of the series is to shift away from core programming concepts and instead look at a practical task that data analysts have to work with on a daily basis: wrangling rectangular data sets. In other words, I’m going to talk about data frames and tools for manipulating them.
Creating data frames
Unlike R, Julia doesn’t come with a native class to represent data frames. Instead, there is the DataFrames package which provides the functionality needed to represent tabular data. The DataFrame() function allows you to manually construct a data frame, with a syntax that feels very familiar to an R user. Vectors passed as inputs to DataFrame() must all have one element for every row in the data frame, or else be length one.
Continuing the vague science fiction theme that started in the previous post, I’ll start by constructing a small data frame listing the novels from William Gibson’s Sprawl trilogy, which I enjoyed considerably more than Asimov’s Foundation series that (very very loosely) inspired the TV show of the same name. Anyway, here’s how you do that:
usingDataFramessprawl =DataFrame( title = ["Neuromancer", "Count Zero", "Mona Lisa Overdrive"], published = [1984, 1986, 1988], author ="William Gibson")
3×3 DataFrame
Row
title
published
author
String
Int64
String
1
Neuromancer
1984
William Gibson
2
Count Zero
1986
William Gibson
3
Mona Lisa Overdrive
1988
William Gibson
Data frames have pretty print methods so the output looks quite nice here. But internally it’s essentially a collection of vectors, one for each column. For example, sprawl.title is a vector of three strings:
sprawl.title
3-element Vector{String}:
"Neuromancer"
"Count Zero"
"Mona Lisa Overdrive"
In real life though, you don’t usually construct a data frame manually. It’s more typical to import a data frame from a CSV file or similar. To that end, we can take advantage of the CSV package to read data from a data file:
This starwars_csv object isn’t a data frame yet, it’s an object of type CSV.file. Data frames are columnar data structures (i.e., a collection of vectors, one per column), whereas a CSV.file is a rowwise data structure (i.e., a collection of CSV.row objects, one per row). You could test this for yourself by taking a look at the first element starwars_csv[1] to verify that it’s a representation of a single CSV row, but the output isn’t very interesting so I’m going to move on.
To convert this CSV.file object to a DataFrame object, we can simply pass it to DataFrame(), and this time around the data we end up with is a little bit richer than the last one (even if the Star Wars movies are incredibly boring compared to the infinitely superior Sprawl novels…)
starwars =DataFrame(starwars_csv)
87×11 DataFrame
62 rows omitted
Row
name
height
mass
hair_color
skin_color
eye_color
birth_year
sex
gender
homeworld
species
String31
Int64?
Float64?
String15?
String31
String15
Float64?
String15?
String15?
String15?
String15?
1
Luke Skywalker
172
77.0
blond
fair
blue
19.0
male
masculine
Tatooine
Human
2
C-3PO
167
75.0
missing
gold
yellow
112.0
none
masculine
Tatooine
Droid
3
R2-D2
96
32.0
missing
white, blue
red
33.0
none
masculine
Naboo
Droid
4
Darth Vader
202
136.0
none
white
yellow
41.9
male
masculine
Tatooine
Human
5
Leia Organa
150
49.0
brown
light
brown
19.0
female
feminine
Alderaan
Human
6
Owen Lars
178
120.0
brown, grey
light
blue
52.0
male
masculine
Tatooine
Human
7
Beru Whitesun Lars
165
75.0
brown
light
blue
47.0
female
feminine
Tatooine
Human
8
R5-D4
97
32.0
missing
white, red
red
missing
none
masculine
Tatooine
Droid
9
Biggs Darklighter
183
84.0
black
light
brown
24.0
male
masculine
Tatooine
Human
10
Obi-Wan Kenobi
182
77.0
auburn, white
fair
blue-gray
57.0
male
masculine
Stewjon
Human
11
Anakin Skywalker
188
84.0
blond
fair
blue
41.9
male
masculine
Tatooine
Human
12
Wilhuff Tarkin
180
missing
auburn, grey
fair
blue
64.0
male
masculine
Eriadu
Human
13
Chewbacca
228
112.0
brown
unknown
blue
200.0
male
masculine
Kashyyyk
Wookiee
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
76
San Hill
191
missing
none
grey
gold
missing
male
masculine
Muunilinst
Muun
77
Shaak Ti
178
57.0
none
red, blue, white
black
missing
female
feminine
Shili
Togruta
78
Grievous
216
159.0
none
brown, white
green, yellow
missing
male
masculine
Kalee
Kaleesh
79
Tarfful
234
136.0
brown
brown
blue
missing
male
masculine
Kashyyyk
Wookiee
80
Raymus Antilles
188
79.0
brown
light
brown
missing
male
masculine
Alderaan
Human
81
Sly Moore
178
48.0
none
pale
white
missing
missing
missing
Umbara
missing
82
Tion Medon
206
80.0
none
grey
black
missing
male
masculine
Utapau
Pau'an
83
Finn
missing
missing
black
dark
dark
missing
male
masculine
missing
Human
84
Rey
missing
missing
brown
light
hazel
missing
female
feminine
missing
Human
85
Poe Dameron
missing
missing
brown
light
brown
missing
male
masculine
missing
Human
86
BB8
missing
missing
none
none
black
missing
none
masculine
missing
Droid
87
Captain Phasma
missing
missing
none
none
unknown
missing
female
feminine
missing
Human
Subsetting data frames I
The core tools for working with data frames in Julia feel quite familiar coming from either Matlab or R. You can subset a data frame by passing it numeric indices, for instance:
starwars[1:6, 1:5]
6×5 DataFrame
Row
name
height
mass
hair_color
skin_color
String31
Int64?
Float64?
String15?
String31
1
Luke Skywalker
172
77.0
blond
fair
2
C-3PO
167
75.0
missing
gold
3
R2-D2
96
32.0
missing
white, blue
4
Darth Vader
202
136.0
none
white
5
Leia Organa
150
49.0
brown
light
6
Owen Lars
178
120.0
brown, grey
light
However, there are other methods for subsetting a data frame. You can also filter the rows of a data frame using logical expressions. Again, this is quite similar to how it works in base R. For instance, I can construct a boolean vector fair_skinned which indicates whether the corresponding row in starwars refers to a person with fair skin:1
fair_skinned = starwars.skin_color .=="fair";
Now that I have these indices, I can create a subset of the data frame containing only those rows referring to fair skinned person (or robot, or…)
starwars[fair_skinned, 1:5]
17×5 DataFrame
Row
name
height
mass
hair_color
skin_color
String31
Int64?
Float64?
String15?
String31
1
Luke Skywalker
172
77.0
blond
fair
2
Obi-Wan Kenobi
182
77.0
auburn, white
fair
3
Anakin Skywalker
188
84.0
blond
fair
4
Wilhuff Tarkin
180
missing
auburn, grey
fair
5
Han Solo
180
80.0
brown
fair
6
Wedge Antilles
170
77.0
brown
fair
7
Jek Tono Porkins
180
110.0
brown
fair
8
Boba Fett
183
78.2
black
fair
9
Mon Mothma
150
missing
auburn
fair
10
Arvel Crynyd
missing
missing
brown
fair
11
Qui-Gon Jinn
193
89.0
brown
fair
12
Finis Valorum
170
missing
blond
fair
13
Ric Olié
183
missing
brown
fair
14
Shmi Skywalker
163
missing
black
fair
15
Cliegg Lars
183
missing
brown
fair
16
Dooku
193
80.0
white
fair
17
Jocasta Nu
167
missing
white
fair
On the columns side, we also have more flexible options for subsetting a data frame. For example, instead of referring to columns using numerical indices, we can select the variables that we want to keep using their names:
starwars[1:6, [:name, :gender, :homeworld]]
6×3 DataFrame
Row
name
gender
homeworld
String31
String15?
String15?
1
Luke Skywalker
masculine
Tatooine
2
C-3PO
masculine
Tatooine
3
R2-D2
masculine
Naboo
4
Darth Vader
masculine
Tatooine
5
Leia Organa
feminine
Alderaan
6
Owen Lars
masculine
Tatooine
Referring to columns by name is very handy in practice, and there’s some hidden Julia concepts here that I didn’t talk about in the last post. So with that in mind I’ll digress slightly to talk about…
Symbols
Looking at the syntax in the last code cell, it’s fairly clear that [:name, :gender, :homeworld] is a vector of three… somethings, but it’s not immediately obvious what :name actually is. Much like R (and also inherited from Lisp) Julia has extensive Metaprogramming capabilities because it has the ability to represent Julia code as data structures within the language itself. In the simplest case, we have Symbols like :name, which are constructed using the quotation operator : and used to represent object names. So as you can see, :name is an object of type Symbol:
typeof(:name)
Symbol
Symbols can be assigned to variables, and those variables can be used as part of expressions to be evaluated. In the code below I create a variable colname that stores the symbolic representation of a column name that I can invoke later:
colname =:title
:title
As a simple example of how symbols can be used in practice, here’s a Julia implementation of something like the pull() function in the R package dplyr, which allows the user to extract a single column from a data frame:
In this code I’m using the getproperty() function to do the same job that the . operator would do in an expression like sprawl.title. So here it is in action:
pull(sprawl, :title)
3-element Vector{String}:
"Neuromancer"
"Count Zero"
"Mona Lisa Overdrive"
I know, it’s exciting right?
Okay yeah, at the moment this pull() function isn’t very useful at all – pull(sprawl, :title) is really not an improvement on sprawl.title – but a little bit later when I get around to talking about data wrangling pipelines it might turn out to be a little less silly.
Subsetting data frames II
Anyway, getting back on track, the key thing to realise is that when I wrote [:name, :gender, :homeworld] earlier what I was really doing is constructing a vector of symbols, and it’s those symbols that I was using to select the columns that I wanted to retain. The DataFrames package also supplies a various selector functions that can be used to extract a subset of the columns. For example, Not() will select every column except the ones that are passed to Not(). So if I want to drop the hair color, eye color, sex, and homeworld columns, I could do this:
The Between() selector does what you’d think. It returns all columns in between two named columns:
starwars[1:6, Between(:sex, :homeworld)]
6×3 DataFrame
Row
sex
gender
homeworld
String15?
String15?
String15?
1
male
masculine
Tatooine
2
none
masculine
Tatooine
3
none
masculine
Naboo
4
male
masculine
Tatooine
5
female
feminine
Alderaan
6
male
masculine
Tatooine
There’s also an All() selector that returns all columns, but that’s not super exciting. More interesting, I think, is the Cols() selector which takes a predicate function as input.2 The column names are passed to the function, and they are included in the output if that function returns true. So, for example, if I want to extract the columns in the data whose name ends in "color" I can do this:
starwars[1:6, Cols(x ->endswith(x, "color"))]
6×3 DataFrame
Row
hair_color
skin_color
eye_color
String15?
String31
String15
1
blond
fair
blue
2
missing
gold
yellow
3
missing
white, blue
red
4
none
white
yellow
5
brown
light
brown
6
brown, grey
light
blue
I find myself liking these selector functions. Coming from the tidyverse style in R where tidyselect is used to govern column selection it feels… not terribly different. Superficially different, perhaps, but the combination of All(), Not(), Between(), and Cols() seems to provide a fairly powerful and (I think?) user-friendly way to select columns.
Data wrangling I: groupby, combine
Up to this point I haven’t really done any data wrangling with the starwars data. Okay, yeah, to some extent there’s some data wrangling implied by the discussion of subsetting in the previous sections, but in truth none of that is how you’d normally go about it in a more real-world context. So to that end I’ll talk about some of the data wrangling functions that DataFrames supplies.
Let’s start with something simple, and not very useful. Suppose what I want to do here is group the data by gender and sex, and then for every unique combination of gender and sex that appears in the data set have Julia pick one row at random and report the corresponding mass. To do that is a two step operation. First, I need to use groupby() to describe the groups, and then I need to call combine() to tell Julia what function to apply separately for each group. This does the trick:
In the call to groupby(starwars, [:gender, :sex]) what Julia does is construct a grouped data frame (very similar to what you expect in R, really), and then this grouped data frame is passed to combine(). For each such group, we take the relevant subset of the :mass column and pass it to the rand() function, and by doing so a random mass is returned.
There’s some obvious limitations to note in my code here though. Firstly, I’m not using the pipe |> at all, and while it’s sort of fine in this context because there’s only two steps in my data wrangling exercise, the code is going to get very ugly very quickly if I try to do something fancier.3 So let’s start by fixing this.
As I mentioned in the first post in this series, one way I could transform this into a pipeline is to use the Pipe package, which supplies a pipe that behaves very similarly to the base pipe in R. However, I’m not going to do that. Instead, I’m going to adopt a workflow where I use the Julia base pipe together with anonymous functions. Here’s the same code expressed in this kind of pipeline:4
starwars |> d ->groupby(d, [:gender, :sex]) |> d ->combine(d, :mass => rand)
6×3 DataFrame
Row
gender
sex
mass_rand
String15?
String15?
Float64?
1
masculine
male
missing
2
masculine
none
32.0
3
masculine
hermaphroditic
1358.0
4
feminine
none
missing
5
feminine
female
missing
6
missing
missing
85.0
I genuinely wasn’t expecting this when I first learned about the restrictiveness of the Julia pipe, but I think I really like this syntax. Because you have to define an anonymous function at each step in the pipeline, I find myself noticing that:
It’s only slightly more verbose than the R style, and has the advantage (to my mind) that you can use this workflow without having to think too much about Julia macros
The input argument (in this case d) serves the same role that the placeholder (_ for the R base pipe and the Julia “Pipe-package-pipe”, or . for the R magrittr pipe)
You have the ability to subtly remind yourself of the internal workings of your pipeline by naming the input argument cleverly. If the input to this step in the pipeline is a data frame I tend to call the input argument d, but if – as sometimes happens in real life – at some point in the pipeline I pull out a column and do a bit of processing on that before returning the results, I might find it handy to use something else to remind myself that this step is applied to string variables.
As regards that third point, here’s an example using the pull() function that I defined earlier that does exactly this:
starwars |> d ->subset(d, :skin_color => x -> x.=="fair") |> d ->pull(d, :name) |> n ->map(x ->split(x, " ")[1], n)
Again, not the most exciting pipeline in the world – all I’m doing is returning the first names of all the fair-skinned characters – but it does highlight the fact that the combination of base pipe and anonymous function syntax in Julia works rather well if you’re inclined to write in this style.
In fact, the ability to name the input argument is especially helpful in the last line of the pipe where there are two separate functions being used, one of which is a call to map() applied to the :name column (and takes n as the input), and another that is used by map() when extracting the first name out of every name (where I’ve unimaginatively used x to name my input).
In any case, though it may not be to everyone’s tastes, I’ve found a pipe-centric style that I can use in Julia that I don’t mind, so it’s time to move on and look at some other functions available in the DataFrames package.
Data wrangling II: subset, select, sort
In the previous section I gave an example of a workflow that uses groupby() and combine() to compute summaries of a data frame. But there are other functions that come in very handy for data wrangling: I can use subset() to choose a subset of rows5, select() to choose a subset of columns, and sort() to order the rows according to some criterion. For example, here’s how I could find all the characters from Tattooine and sort them by weight:
starwars |> d ->select(d, [:name, :mass, :homeworld]) |> d ->subset(d, :homeworld => h -> h.=="Tatooine", skipmissing=true) |> d ->sort(d, :mass)
10×3 DataFrame
Row
name
mass
homeworld
String31
Float64?
String15?
1
R5-D4
32.0
Tatooine
2
C-3PO
75.0
Tatooine
3
Beru Whitesun Lars
75.0
Tatooine
4
Luke Skywalker
77.0
Tatooine
5
Biggs Darklighter
84.0
Tatooine
6
Anakin Skywalker
84.0
Tatooine
7
Owen Lars
120.0
Tatooine
8
Darth Vader
136.0
Tatooine
9
Shmi Skywalker
missing
Tatooine
10
Cliegg Lars
missing
Tatooine
Notice that this time I’ve been a little smarter about handling missing values. In the call to subset() I specified skipmissing=true to drop all cases where the homeworld is missing. The sort() function doesn’t have a skipmissing argument, so the results include the two cases where someone from Tatooine doesn’t have a stated weight. But hopefully it’s clear that I could easily subset the data again to remove any cases with missing values on the :mass column if I wanted to. In fact, the DataFrames package supplies functions dropmissing(), allowmissing(), and completecases() that could be used for that purpose. For example:
starwars |> d ->select(d, [:name, :mass, :homeworld]) |> d ->subset(d, :homeworld => h -> h.=="Tatooine", skipmissing=true) |> d ->sort(d, :mass, rev=true) |> d ->dropmissing(d, :mass)
8×3 DataFrame
Row
name
mass
homeworld
String31
Float64
String15?
1
Darth Vader
136.0
Tatooine
2
Owen Lars
120.0
Tatooine
3
Biggs Darklighter
84.0
Tatooine
4
Anakin Skywalker
84.0
Tatooine
5
Luke Skywalker
77.0
Tatooine
6
C-3PO
75.0
Tatooine
7
Beru Whitesun Lars
75.0
Tatooine
8
R5-D4
32.0
Tatooine
The missing :mass rows are now gone, and – just for my own personal amusement – this time I’ve sorted the results in order of descending weight by setting rev=true.
Data wrangling III: stack
Okay, now it’s time to start thinking about how to reshape a data frame in Julia. Consider this, as the beginnings of a pipeline:
starwars |> d ->select(d, [:name, :eye_color, :skin_color, :hair_color])
87×4 DataFrame
62 rows omitted
Row
name
eye_color
skin_color
hair_color
String31
String15
String31
String15?
1
Luke Skywalker
blue
fair
blond
2
C-3PO
yellow
gold
missing
3
R2-D2
red
white, blue
missing
4
Darth Vader
yellow
white
none
5
Leia Organa
brown
light
brown
6
Owen Lars
blue
light
brown, grey
7
Beru Whitesun Lars
blue
light
brown
8
R5-D4
red
white, red
missing
9
Biggs Darklighter
brown
light
black
10
Obi-Wan Kenobi
blue-gray
fair
auburn, white
11
Anakin Skywalker
blue
fair
blond
12
Wilhuff Tarkin
blue
fair
auburn, grey
13
Chewbacca
blue
unknown
brown
⋮
⋮
⋮
⋮
⋮
76
San Hill
gold
grey
none
77
Shaak Ti
black
red, blue, white
none
78
Grievous
green, yellow
brown, white
none
79
Tarfful
blue
brown
brown
80
Raymus Antilles
brown
light
brown
81
Sly Moore
white
pale
none
82
Tion Medon
black
grey
none
83
Finn
dark
dark
black
84
Rey
hazel
light
brown
85
Poe Dameron
brown
light
brown
86
BB8
black
none
none
87
Captain Phasma
unknown
none
none
Suppose what I want to do is transform this into a data set that has variables :name, :body_part, and :color. In other words I want to pivot this into a long-form data set where each character is represented by three rows, and has one row that specifies the colour of the relevant body part. We can do this with the stack() function:
starwars |> d ->select(d, [:name, :eye_color, :skin_color, :hair_color]) |> d ->stack(d, [:eye_color, :skin_color, :hair_color], variable_name=:body_part, value_name=:color )
261×3 DataFrame
236 rows omitted
Row
name
body_part
color
String31
String
String31?
1
Luke Skywalker
eye_color
blue
2
C-3PO
eye_color
yellow
3
R2-D2
eye_color
red
4
Darth Vader
eye_color
yellow
5
Leia Organa
eye_color
brown
6
Owen Lars
eye_color
blue
7
Beru Whitesun Lars
eye_color
blue
8
R5-D4
eye_color
red
9
Biggs Darklighter
eye_color
brown
10
Obi-Wan Kenobi
eye_color
blue-gray
11
Anakin Skywalker
eye_color
blue
12
Wilhuff Tarkin
eye_color
blue
13
Chewbacca
eye_color
blue
⋮
⋮
⋮
⋮
250
San Hill
hair_color
none
251
Shaak Ti
hair_color
none
252
Grievous
hair_color
none
253
Tarfful
hair_color
brown
254
Raymus Antilles
hair_color
brown
255
Sly Moore
hair_color
none
256
Tion Medon
hair_color
none
257
Finn
hair_color
black
258
Rey
hair_color
brown
259
Poe Dameron
hair_color
brown
260
BB8
hair_color
none
261
Captain Phasma
hair_color
none
Data wrangling IV: unstack
We can also go the other way. Let’s start with a slightly different data frame called census, one that counts the number of characters of each species on each homeworld
census = starwars |> d ->dropmissing(d, [:homeworld, :species]) |> d ->groupby(d, [:homeworld, :species]) |> d ->combine(d, :name => (n ->length(n)) =>:count)
51×3 DataFrame
26 rows omitted
Row
homeworld
species
count
String15
String15
Int64
1
Tatooine
Human
8
2
Tatooine
Droid
2
3
Naboo
Droid
1
4
Alderaan
Human
3
5
Stewjon
Human
1
6
Eriadu
Human
1
7
Kashyyyk
Wookiee
2
8
Corellia
Human
2
9
Rodia
Rodian
1
10
Nal Hutta
Hutt
1
11
Naboo
Human
5
12
Kamino
Human
1
13
Trandosha
Trandoshan
1
⋮
⋮
⋮
⋮
40
Geonosis
Geonosian
1
41
Mirial
Mirialan
2
42
Serenno
Human
1
43
Concord Dawn
Human
1
44
Zolan
Clawdite
1
45
Ojom
Besalisk
1
46
Kamino
Kaminoan
2
47
Skako
Skakoan
1
48
Muunilinst
Muun
1
49
Shili
Togruta
1
50
Kalee
Kaleesh
1
51
Utapau
Pau'an
1
So now, if I wanted a version of this data set with one row per :homeworld and a column for each :species that contains the :count of the number of characters of that species on the corresponding world, I could use unstack() to pivot from long-form to wide-form data like this:
unstack(census, :species, :count, fill=0)
46×37 DataFrame
21 rows omitted
Row
homeworld
Human
Droid
Wookiee
Rodian
Hutt
Trandoshan
Mon Calamari
Ewok
Sullustan
Neimodian
Gungan
Toydarian
Dug
Zabrak
Twi'lek
Aleena
Vulptereen
Xexto
Toong
Cerean
Nautolan
Tholothian
Iktotchi
Quermian
Kel Dor
Chagrian
Geonosian
Mirialan
Clawdite
Besalisk
Kaminoan
Skakoan
Muun
Togruta
Kaleesh
Pau'an
String15
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
Int64
1
Tatooine
8
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
Naboo
5
1
0
0
0
0
0
0
0
0
3
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
Alderaan
3
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
4
Stewjon
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
5
Eriadu
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
6
Kashyyyk
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
7
Corellia
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
8
Rodia
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
9
Nal Hutta
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
10
Kamino
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
11
Trandosha
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
12
Socorro
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
13
Bespin
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
35
Champala
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
36
Geonosis
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
37
Mirial
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
38
Serenno
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
39
Concord Dawn
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
40
Zolan
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
41
Ojom
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
42
Skako
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
43
Muunilinst
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
44
Shili
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
45
Kalee
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
46
Utapau
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
Here I’ve specified fill=0 to indicate missing values should be replaced with zeros, which is very sensible in this case because if there are no characters with a particular species/homeworld combination there wouldn’t be a row in census. Also, because I can, here’s a version that appears in a pipeline where I return only a subset of species, and – in act of appalling xenophobia – consider only planets inhabited by at least one human character:
census|> d ->unstack(d, :species, :count, fill=0) |> d ->select(d, [:homeworld, :Human, :Droid, :Ewok, :Wookiee, :Hutt]) |> d ->subset(d, :Human => h -> h .>0)
14×6 DataFrame
Row
homeworld
Human
Droid
Ewok
Wookiee
Hutt
String15
Int64
Int64
Int64
Int64
Int64
1
Tatooine
8
2
0
0
0
2
Naboo
5
1
0
0
0
3
Alderaan
3
0
0
0
0
4
Stewjon
1
0
0
0
0
5
Eriadu
1
0
0
0
0
6
Corellia
2
0
0
0
0
7
Kamino
1
0
0
0
0
8
Socorro
1
0
0
0
0
9
Bespin
1
0
0
0
0
10
Chandrila
1
0
0
0
0
11
Coruscant
2
0
0
0
0
12
Haruun Kal
1
0
0
0
0
13
Serenno
1
0
0
0
0
14
Concord Dawn
1
0
0
0
0
Hm. Not sure what those results say about the willingness of humans to mix with other species in the Star Wars universe but it’s probably not good to reflect on it too much.
Data wrangling V: transform
In all honesty I am getting exhausted with this post, and mildly irrited at the fact that I’ve spent so much time in the Star Wars universe rather than in a fictional universe that I actually enjoy. So it’s time to start wrapping this one up. There’s only one more topic I really want to mention and that’s the transform() function which you can use to add new columns to a data frame.
No. Just no. There was already a first post and there’s about to be a third post. I am not being paid for this and I do not have the energy to think of a witty and erudite way to wrap up the unloved middle child of the trilogy. So let us never speak of this again.
Footnotes
As an aside, notice that I’ve used .== rather than == as the equality test. This is because == is a scalar operator: it doesn’t work for vectors unless you broadcast it using .↩︎
In this context, a predicate function is just one that returns true or false.↩︎
The other issue is that my code doesn’t handle missing data gracefully, but that will come up later so I’m ignoring it for now.↩︎
At some point I want to take a look at Tidier.jl, but that’s a topic for the future.↩︎
There is also filter() which has a slightly different syntax.↩︎
@online{navarro2024,
author = {Navarro, Danielle},
title = {Working with Data in {Julia}},
date = {2024-03-02},
url = {https://blog.djnavarro.net/posts/2024-03-02_julia-data-frames/},
langid = {en}
}