Working with data in Julia

Part two of a three-part series on Julia, in which the author teaches herself the basics of wrangling rectangular data in Julia
Julia
Data Wrangling
Author
Published

March 2, 2024

This is the second of an impromptu three-part series in which, in a decision I am rapidly starting to regret as these posts get longer and longer, I decided it was time to teach myself how to use Julia. In the first part of the series I looked at some foundational concepts (types, functions, pipes, etc), though in a completely idiosyncratic way and ignoring concepts that I find boring (loops, conditionals). I mean, this is a blog where I write “notes to self”. It’s not a textbook.

Anyway… my thought for the second part of the series is to shift away from core programming concepts and instead look at a practical task that data analysts have to work with on a daily basis: wrangling rectangular data sets. In other words, I’m going to talk about data frames and tools for manipulating them.

Reimagined Mass Effect 1 cover showing the Normandy departing the Citadel

Mass Effect 1. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. The astute observer will notice that at no point in this post does the main text reference the Mass Effect games. But I don’t care, because they are awesome.

Creating data frames

Unlike R, Julia doesn’t come with a native class to represent data frames. Instead, there is the DataFrames package which provides the functionality needed to represent tabular data. The DataFrame() function allows you to manually construct a data frame, with a syntax that feels very familiar to an R user. Vectors passed as inputs to DataFrame() must all have one element for every row in the data frame, or else be length one.

Continuing the vague science fiction theme that started in the previous post, I’ll start by constructing a small data frame listing the novels from William Gibson’s Sprawl trilogy, which I enjoyed considerably more than Asimov’s Foundation series that (very very loosely) inspired the TV show of the same name. Anyway, here’s how you do that:

using DataFrames

sprawl = DataFrame(
  title = ["Neuromancer", "Count Zero", "Mona Lisa Overdrive"],
  published = [1984, 1986, 1988], 
  author = "William Gibson"
)
3×3 DataFrame
Row title published author
String Int64 String
1 Neuromancer 1984 William Gibson
2 Count Zero 1986 William Gibson
3 Mona Lisa Overdrive 1988 William Gibson

Data frames have pretty print methods so the output looks quite nice here. But internally it’s essentially a collection of vectors, one for each column. For example, sprawl.title is a vector of three strings:

sprawl.title
3-element Vector{String}:
 "Neuromancer"
 "Count Zero"
 "Mona Lisa Overdrive"

In real life though, you don’t usually construct a data frame manually. It’s more typical to import a data frame from a CSV file or similar. To that end, we can take advantage of the CSV package to read data from a data file:

using CSV
starwars_csv = CSV.File("starwars.csv"; missingstring = "NA");

This starwars_csv object isn’t a data frame yet, it’s an object of type CSV.file. Data frames are columnar data structures (i.e., a collection of vectors, one per column), whereas a CSV.file is a rowwise data structure (i.e., a collection of CSV.row objects, one per row). You could test this for yourself by taking a look at the first element starwars_csv[1] to verify that it’s a representation of a single CSV row, but the output isn’t very interesting so I’m going to move on.

To convert this CSV.file object to a DataFrame object, we can simply pass it to DataFrame(), and this time around the data we end up with is a little bit richer than the last one (even if the Star Wars movies are incredibly boring compared to the infinitely superior Sprawl novels…)

starwars = DataFrame(starwars_csv)
87×11 DataFrame
62 rows omitted
Row name height mass hair_color skin_color eye_color birth_year sex gender homeworld species
String31 Int64? Float64? String15? String31 String15 Float64? String15? String15? String15? String15?
1 Luke Skywalker 172 77.0 blond fair blue 19.0 male masculine Tatooine Human
2 C-3PO 167 75.0 missing gold yellow 112.0 none masculine Tatooine Droid
3 R2-D2 96 32.0 missing white, blue red 33.0 none masculine Naboo Droid
4 Darth Vader 202 136.0 none white yellow 41.9 male masculine Tatooine Human
5 Leia Organa 150 49.0 brown light brown 19.0 female feminine Alderaan Human
6 Owen Lars 178 120.0 brown, grey light blue 52.0 male masculine Tatooine Human
7 Beru Whitesun Lars 165 75.0 brown light blue 47.0 female feminine Tatooine Human
8 R5-D4 97 32.0 missing white, red red missing none masculine Tatooine Droid
9 Biggs Darklighter 183 84.0 black light brown 24.0 male masculine Tatooine Human
10 Obi-Wan Kenobi 182 77.0 auburn, white fair blue-gray 57.0 male masculine Stewjon Human
11 Anakin Skywalker 188 84.0 blond fair blue 41.9 male masculine Tatooine Human
12 Wilhuff Tarkin 180 missing auburn, grey fair blue 64.0 male masculine Eriadu Human
13 Chewbacca 228 112.0 brown unknown blue 200.0 male masculine Kashyyyk Wookiee
76 San Hill 191 missing none grey gold missing male masculine Muunilinst Muun
77 Shaak Ti 178 57.0 none red, blue, white black missing female feminine Shili Togruta
78 Grievous 216 159.0 none brown, white green, yellow missing male masculine Kalee Kaleesh
79 Tarfful 234 136.0 brown brown blue missing male masculine Kashyyyk Wookiee
80 Raymus Antilles 188 79.0 brown light brown missing male masculine Alderaan Human
81 Sly Moore 178 48.0 none pale white missing missing missing Umbara missing
82 Tion Medon 206 80.0 none grey black missing male masculine Utapau Pau'an
83 Finn missing missing black dark dark missing male masculine missing Human
84 Rey missing missing brown light hazel missing female feminine missing Human
85 Poe Dameron missing missing brown light brown missing male masculine missing Human
86 BB8 missing missing none none black missing none masculine missing Droid
87 Captain Phasma missing missing none none unknown missing female feminine missing Human

Subsetting data frames I

The core tools for working with data frames in Julia feel quite familiar coming from either Matlab or R. You can subset a data frame by passing it numeric indices, for instance:

starwars[1:6, 1:5]
6×5 DataFrame
Row name height mass hair_color skin_color
String31 Int64? Float64? String15? String31
1 Luke Skywalker 172 77.0 blond fair
2 C-3PO 167 75.0 missing gold
3 R2-D2 96 32.0 missing white, blue
4 Darth Vader 202 136.0 none white
5 Leia Organa 150 49.0 brown light
6 Owen Lars 178 120.0 brown, grey light

However, there are other methods for subsetting a data frame. You can also filter the rows of a data frame using logical expressions. Again, this is quite similar to how it works in base R. For instance, I can construct a boolean vector fair_skinned which indicates whether the corresponding row in starwars refers to a person with fair skin:1

fair_skinned = starwars.skin_color .== "fair";

Now that I have these indices, I can create a subset of the data frame containing only those rows referring to fair skinned person (or robot, or…)

starwars[fair_skinned, 1:5]
17×5 DataFrame
Row name height mass hair_color skin_color
String31 Int64? Float64? String15? String31
1 Luke Skywalker 172 77.0 blond fair
2 Obi-Wan Kenobi 182 77.0 auburn, white fair
3 Anakin Skywalker 188 84.0 blond fair
4 Wilhuff Tarkin 180 missing auburn, grey fair
5 Han Solo 180 80.0 brown fair
6 Wedge Antilles 170 77.0 brown fair
7 Jek Tono Porkins 180 110.0 brown fair
8 Boba Fett 183 78.2 black fair
9 Mon Mothma 150 missing auburn fair
10 Arvel Crynyd missing missing brown fair
11 Qui-Gon Jinn 193 89.0 brown fair
12 Finis Valorum 170 missing blond fair
13 Ric Olié 183 missing brown fair
14 Shmi Skywalker 163 missing black fair
15 Cliegg Lars 183 missing brown fair
16 Dooku 193 80.0 white fair
17 Jocasta Nu 167 missing white fair

On the columns side, we also have more flexible options for subsetting a data frame. For example, instead of referring to columns using numerical indices, we can select the variables that we want to keep using their names:

starwars[1:6, [:name, :gender, :homeworld]]
6×3 DataFrame
Row name gender homeworld
String31 String15? String15?
1 Luke Skywalker masculine Tatooine
2 C-3PO masculine Tatooine
3 R2-D2 masculine Naboo
4 Darth Vader masculine Tatooine
5 Leia Organa feminine Alderaan
6 Owen Lars masculine Tatooine

Referring to columns by name is very handy in practice, and there’s some hidden Julia concepts here that I didn’t talk about in the last post. So with that in mind I’ll digress slightly to talk about…

Symbols

Looking at the syntax in the last code cell, it’s fairly clear that [:name, :gender, :homeworld] is a vector of three… somethings, but it’s not immediately obvious what :name actually is. Much like R (and also inherited from Lisp) Julia has extensive Metaprogramming capabilities because it has the ability to represent Julia code as data structures within the language itself. In the simplest case, we have Symbols like :name, which are constructed using the quotation operator : and used to represent object names. So as you can see, :name is an object of type Symbol:

typeof(:name)
Symbol

Symbols can be assigned to variables, and those variables can be used as part of expressions to be evaluated. In the code below I create a variable colname that stores the symbolic representation of a column name that I can invoke later:

colname = :title
:title

As a simple example of how symbols can be used in practice, here’s a Julia implementation of something like the pull() function in the R package dplyr, which allows the user to extract a single column from a data frame:

function pull(data::DataFrame, column::Symbol)
  getproperty(data, column)
end;

In this code I’m using the getproperty() function to do the same job that the . operator would do in an expression like sprawl.title. So here it is in action:

pull(sprawl, :title)
3-element Vector{String}:
 "Neuromancer"
 "Count Zero"
 "Mona Lisa Overdrive"

I know, it’s exciting right?

Okay yeah, at the moment this pull() function isn’t very useful at all – pull(sprawl, :title) is really not an improvement on sprawl.title – but a little bit later when I get around to talking about data wrangling pipelines it might turn out to be a little less silly.

Reimagined Mass Effect 2 cover showing the Normandy attacked by a Collector ship

Mass Effect 2. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. Still the strangest of the three games: the main storyline with the Collectors is a hot mess, but it has the best side quests in the series, and the best romance too (Thane, obviously…)

Subsetting data frames II

Anyway, getting back on track, the key thing to realise is that when I wrote [:name, :gender, :homeworld] earlier what I was really doing is constructing a vector of symbols, and it’s those symbols that I was using to select the columns that I wanted to retain. The DataFrames package also supplies a various selector functions that can be used to extract a subset of the columns. For example, Not() will select every column except the ones that are passed to Not(). So if I want to drop the hair color, eye color, sex, and homeworld columns, I could do this:

starwars[1:6, Not([:hair_color, :eye_color, :sex, :homeworld])]
6×7 DataFrame
Row name height mass skin_color birth_year gender species
String31 Int64? Float64? String31 Float64? String15? String15?
1 Luke Skywalker 172 77.0 fair 19.0 masculine Human
2 C-3PO 167 75.0 gold 112.0 masculine Droid
3 R2-D2 96 32.0 white, blue 33.0 masculine Droid
4 Darth Vader 202 136.0 white 41.9 masculine Human
5 Leia Organa 150 49.0 light 19.0 feminine Human
6 Owen Lars 178 120.0 light 52.0 masculine Human

The Between() selector does what you’d think. It returns all columns in between two named columns:

starwars[1:6, Between(:sex, :homeworld)]
6×3 DataFrame
Row sex gender homeworld
String15? String15? String15?
1 male masculine Tatooine
2 none masculine Tatooine
3 none masculine Naboo
4 male masculine Tatooine
5 female feminine Alderaan
6 male masculine Tatooine

There’s also an All() selector that returns all columns, but that’s not super exciting. More interesting, I think, is the Cols() selector which takes a predicate function as input.2 The column names are passed to the function, and they are included in the output if that function returns true. So, for example, if I want to extract the columns in the data whose name ends in "color" I can do this:

starwars[1:6, Cols(x -> endswith(x, "color"))]
6×3 DataFrame
Row hair_color skin_color eye_color
String15? String31 String15
1 blond fair blue
2 missing gold yellow
3 missing white, blue red
4 none white yellow
5 brown light brown
6 brown, grey light blue

I find myself liking these selector functions. Coming from the tidyverse style in R where tidyselect is used to govern column selection it feels… not terribly different. Superficially different, perhaps, but the combination of All(), Not(), Between(), and Cols() seems to provide a fairly powerful and (I think?) user-friendly way to select columns.

Reimagined Mass Effect 3 cover showing the Normandy facing a fleet fo reapers

Mass Effect 3. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. No, I will not be drawn into expressing a comment on the ending. I love ME3, in part because every Shepard I’ve ever played comes into this game already completely broken and makes unhinged choices because of it…

Data wrangling I: groupby, combine

Up to this point I haven’t really done any data wrangling with the starwars data. Okay, yeah, to some extent there’s some data wrangling implied by the discussion of subsetting in the previous sections, but in truth none of that is how you’d normally go about it in a more real-world context. So to that end I’ll talk about some of the data wrangling functions that DataFrames supplies.

Let’s start with something simple, and not very useful. Suppose what I want to do here is group the data by gender and sex, and then for every unique combination of gender and sex that appears in the data set have Julia pick one row at random and report the corresponding mass. To do that is a two step operation. First, I need to use groupby() to describe the groups, and then I need to call combine() to tell Julia what function to apply separately for each group. This does the trick:

combine(groupby(starwars, [:gender, :sex]), :mass => rand) 
6×3 DataFrame
Row gender sex mass_rand
String15? String15? Float64?
1 masculine male 74.0
2 masculine none 32.0
3 masculine hermaphroditic 1358.0
4 feminine none missing
5 feminine female missing
6 missing missing 110.0

In the call to groupby(starwars, [:gender, :sex]) what Julia does is construct a grouped data frame (very similar to what you expect in R, really), and then this grouped data frame is passed to combine(). For each such group, we take the relevant subset of the :mass column and pass it to the rand() function, and by doing so a random mass is returned.

There’s some obvious limitations to note in my code here though. Firstly, I’m not using the pipe |> at all, and while it’s sort of fine in this context because there’s only two steps in my data wrangling exercise, the code is going to get very ugly very quickly if I try to do something fancier.3 So let’s start by fixing this.

As I mentioned in the first post in this series, one way I could transform this into a pipeline is to use the Pipe package, which supplies a pipe that behaves very similarly to the base pipe in R. However, I’m not going to do that. Instead, I’m going to adopt a workflow where I use the Julia base pipe together with anonymous functions. Here’s the same code expressed in this kind of pipeline:4

starwars |>
  d -> groupby(d, [:gender, :sex]) |>
  d -> combine(d, :mass => rand)
6×3 DataFrame
Row gender sex mass_rand
String15? String15? Float64?
1 masculine male missing
2 masculine none 32.0
3 masculine hermaphroditic 1358.0
4 feminine none missing
5 feminine female missing
6 missing missing 85.0

I genuinely wasn’t expecting this when I first learned about the restrictiveness of the Julia pipe, but I think I really like this syntax. Because you have to define an anonymous function at each step in the pipeline, I find myself noticing that:

  • It’s only slightly more verbose than the R style, and has the advantage (to my mind) that you can use this workflow without having to think too much about Julia macros
  • The input argument (in this case d) serves the same role that the placeholder (_ for the R base pipe and the Julia “Pipe-package-pipe”, or . for the R magrittr pipe)
  • You have the ability to subtly remind yourself of the internal workings of your pipeline by naming the input argument cleverly. If the input to this step in the pipeline is a data frame I tend to call the input argument d, but if – as sometimes happens in real life – at some point in the pipeline I pull out a column and do a bit of processing on that before returning the results, I might find it handy to use something else to remind myself that this step is applied to string variables.

As regards that third point, here’s an example using the pull() function that I defined earlier that does exactly this:

starwars |>
  d -> subset(d, :skin_color => x -> x.=="fair") |>
  d -> pull(d, :name) |>
  n -> map(x -> split(x, " ")[1], n)
17-element Vector{SubString{String31}}:
 "Luke"
 "Obi-Wan"
 "Anakin"
 "Wilhuff"
 "Han"
 "Wedge"
 "Jek"
 "Boba"
 "Mon"
 "Arvel"
 "Qui-Gon"
 "Finis"
 "Ric"
 "Shmi"
 "Cliegg"
 "Dooku"
 "Jocasta"

Again, not the most exciting pipeline in the world – all I’m doing is returning the first names of all the fair-skinned characters – but it does highlight the fact that the combination of base pipe and anonymous function syntax in Julia works rather well if you’re inclined to write in this style.

In fact, the ability to name the input argument is especially helpful in the last line of the pipe where there are two separate functions being used, one of which is a call to map() applied to the :name column (and takes n as the input), and another that is used by map() when extracting the first name out of every name (where I’ve unimaginatively used x to name my input).

In any case, though it may not be to everyone’s tastes, I’ve found a pipe-centric style that I can use in Julia that I don’t mind, so it’s time to move on and look at some other functions available in the DataFrames package.

Data wrangling II: subset, select, sort

In the previous section I gave an example of a workflow that uses groupby() and combine() to compute summaries of a data frame. But there are other functions that come in very handy for data wrangling: I can use subset() to choose a subset of rows5, select() to choose a subset of columns, and sort() to order the rows according to some criterion. For example, here’s how I could find all the characters from Tattooine and sort them by weight:

starwars |> 
  d -> select(d, [:name, :mass, :homeworld]) |>
  d -> subset(d, :homeworld => h -> h.=="Tatooine", skipmissing=true) |>
  d -> sort(d, :mass)
10×3 DataFrame
Row name mass homeworld
String31 Float64? String15?
1 R5-D4 32.0 Tatooine
2 C-3PO 75.0 Tatooine
3 Beru Whitesun Lars 75.0 Tatooine
4 Luke Skywalker 77.0 Tatooine
5 Biggs Darklighter 84.0 Tatooine
6 Anakin Skywalker 84.0 Tatooine
7 Owen Lars 120.0 Tatooine
8 Darth Vader 136.0 Tatooine
9 Shmi Skywalker missing Tatooine
10 Cliegg Lars missing Tatooine

Notice that this time I’ve been a little smarter about handling missing values. In the call to subset() I specified skipmissing=true to drop all cases where the homeworld is missing. The sort() function doesn’t have a skipmissing argument, so the results include the two cases where someone from Tatooine doesn’t have a stated weight. But hopefully it’s clear that I could easily subset the data again to remove any cases with missing values on the :mass column if I wanted to. In fact, the DataFrames package supplies functions dropmissing(), allowmissing(), and completecases() that could be used for that purpose. For example:

starwars |> 
  d -> select(d, [:name, :mass, :homeworld]) |>
  d -> subset(d, :homeworld => h -> h.=="Tatooine", skipmissing=true) |>
  d -> sort(d, :mass, rev=true) |>
  d -> dropmissing(d, :mass)
8×3 DataFrame
Row name mass homeworld
String31 Float64 String15?
1 Darth Vader 136.0 Tatooine
2 Owen Lars 120.0 Tatooine
3 Biggs Darklighter 84.0 Tatooine
4 Anakin Skywalker 84.0 Tatooine
5 Luke Skywalker 77.0 Tatooine
6 C-3PO 75.0 Tatooine
7 Beru Whitesun Lars 75.0 Tatooine
8 R5-D4 32.0 Tatooine

The missing :mass rows are now gone, and – just for my own personal amusement – this time I’ve sorted the results in order of descending weight by setting rev=true.

Star wars X-wing and tie-fighters in front of the Death Star

Star Wars. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence. If I had more Mass Effect images to use here I would but alas, I do not.

Data wrangling III: stack

Okay, now it’s time to start thinking about how to reshape a data frame in Julia. Consider this, as the beginnings of a pipeline:

starwars |>
  d -> select(d, [:name, :eye_color, :skin_color, :hair_color])
87×4 DataFrame
62 rows omitted
Row name eye_color skin_color hair_color
String31 String15 String31 String15?
1 Luke Skywalker blue fair blond
2 C-3PO yellow gold missing
3 R2-D2 red white, blue missing
4 Darth Vader yellow white none
5 Leia Organa brown light brown
6 Owen Lars blue light brown, grey
7 Beru Whitesun Lars blue light brown
8 R5-D4 red white, red missing
9 Biggs Darklighter brown light black
10 Obi-Wan Kenobi blue-gray fair auburn, white
11 Anakin Skywalker blue fair blond
12 Wilhuff Tarkin blue fair auburn, grey
13 Chewbacca blue unknown brown
76 San Hill gold grey none
77 Shaak Ti black red, blue, white none
78 Grievous green, yellow brown, white none
79 Tarfful blue brown brown
80 Raymus Antilles brown light brown
81 Sly Moore white pale none
82 Tion Medon black grey none
83 Finn dark dark black
84 Rey hazel light brown
85 Poe Dameron brown light brown
86 BB8 black none none
87 Captain Phasma unknown none none

Suppose what I want to do is transform this into a data set that has variables :name, :body_part, and :color. In other words I want to pivot this into a long-form data set where each character is represented by three rows, and has one row that specifies the colour of the relevant body part. We can do this with the stack() function:

starwars |>
  d -> select(d, [:name, :eye_color, :skin_color, :hair_color]) |>
  d -> stack(d, [:eye_color, :skin_color, :hair_color],
    variable_name=:body_part,
    value_name=:color
  )
261×3 DataFrame
236 rows omitted
Row name body_part color
String31 String String31?
1 Luke Skywalker eye_color blue
2 C-3PO eye_color yellow
3 R2-D2 eye_color red
4 Darth Vader eye_color yellow
5 Leia Organa eye_color brown
6 Owen Lars eye_color blue
7 Beru Whitesun Lars eye_color blue
8 R5-D4 eye_color red
9 Biggs Darklighter eye_color brown
10 Obi-Wan Kenobi eye_color blue-gray
11 Anakin Skywalker eye_color blue
12 Wilhuff Tarkin eye_color blue
13 Chewbacca eye_color blue
250 San Hill hair_color none
251 Shaak Ti hair_color none
252 Grievous hair_color none
253 Tarfful hair_color brown
254 Raymus Antilles hair_color brown
255 Sly Moore hair_color none
256 Tion Medon hair_color none
257 Finn hair_color black
258 Rey hair_color brown
259 Poe Dameron hair_color brown
260 BB8 hair_color none
261 Captain Phasma hair_color none

Data wrangling IV: unstack

We can also go the other way. Let’s start with a slightly different data frame called census, one that counts the number of characters of each species on each homeworld

census = starwars |> 
  d -> dropmissing(d, [:homeworld, :species]) |>
  d -> groupby(d, [:homeworld, :species]) |> 
  d -> combine(d, :name => (n -> length(n)) => :count)
51×3 DataFrame
26 rows omitted
Row homeworld species count
String15 String15 Int64
1 Tatooine Human 8
2 Tatooine Droid 2
3 Naboo Droid 1
4 Alderaan Human 3
5 Stewjon Human 1
6 Eriadu Human 1
7 Kashyyyk Wookiee 2
8 Corellia Human 2
9 Rodia Rodian 1
10 Nal Hutta Hutt 1
11 Naboo Human 5
12 Kamino Human 1
13 Trandosha Trandoshan 1
40 Geonosis Geonosian 1
41 Mirial Mirialan 2
42 Serenno Human 1
43 Concord Dawn Human 1
44 Zolan Clawdite 1
45 Ojom Besalisk 1
46 Kamino Kaminoan 2
47 Skako Skakoan 1
48 Muunilinst Muun 1
49 Shili Togruta 1
50 Kalee Kaleesh 1
51 Utapau Pau'an 1

So now, if I wanted a version of this data set with one row per :homeworld and a column for each :species that contains the :count of the number of characters of that species on the corresponding world, I could use unstack() to pivot from long-form to wide-form data like this:

unstack(census, :species, :count, fill=0) 
46×37 DataFrame
21 rows omitted
Row homeworld Human Droid Wookiee Rodian Hutt Trandoshan Mon Calamari Ewok Sullustan Neimodian Gungan Toydarian Dug Zabrak Twi'lek Aleena Vulptereen Xexto Toong Cerean Nautolan Tholothian Iktotchi Quermian Kel Dor Chagrian Geonosian Mirialan Clawdite Besalisk Kaminoan Skakoan Muun Togruta Kaleesh Pau'an
String15 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
1 Tatooine 8 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 Naboo 5 1 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 Alderaan 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 Stewjon 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 Eriadu 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 Kashyyyk 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 Corellia 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 Rodia 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 Nal Hutta 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 Kamino 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0
11 Trandosha 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 Socorro 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 Bespin 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
35 Champala 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
36 Geonosis 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
37 Mirial 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
38 Serenno 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
39 Concord Dawn 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
40 Zolan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
41 Ojom 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
42 Skako 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
43 Muunilinst 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
44 Shili 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
45 Kalee 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
46 Utapau 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Here I’ve specified fill=0 to indicate missing values should be replaced with zeros, which is very sensible in this case because if there are no characters with a particular species/homeworld combination there wouldn’t be a row in census. Also, because I can, here’s a version that appears in a pipeline where I return only a subset of species, and – in act of appalling xenophobia – consider only planets inhabited by at least one human character:

census|>
  d -> unstack(d, :species, :count, fill=0) |>
  d -> select(d, [:homeworld, :Human, :Droid, :Ewok, :Wookiee, :Hutt]) |>
  d -> subset(d, :Human => h -> h .> 0)
14×6 DataFrame
Row homeworld Human Droid Ewok Wookiee Hutt
String15 Int64 Int64 Int64 Int64 Int64
1 Tatooine 8 2 0 0 0
2 Naboo 5 1 0 0 0
3 Alderaan 3 0 0 0 0
4 Stewjon 1 0 0 0 0
5 Eriadu 1 0 0 0 0
6 Corellia 2 0 0 0 0
7 Kamino 1 0 0 0 0
8 Socorro 1 0 0 0 0
9 Bespin 1 0 0 0 0
10 Chandrila 1 0 0 0 0
11 Coruscant 2 0 0 0 0
12 Haruun Kal 1 0 0 0 0
13 Serenno 1 0 0 0 0
14 Concord Dawn 1 0 0 0 0

Hm. Not sure what those results say about the willingness of humans to mix with other species in the Star Wars universe but it’s probably not good to reflect on it too much.

Darth Vader, with the slogan 'Don't be like your father'

Also Star Wars. By user lagota on Deviant Art, released under a CC-BY-NC-ND licence.

Data wrangling V: transform

In all honesty I am getting exhausted with this post, and mildly irrited at the fact that I’ve spent so much time in the Star Wars universe rather than in a fictional universe that I actually enjoy. So it’s time to start wrapping this one up. There’s only one more topic I really want to mention and that’s the transform() function which you can use to add new columns to a data frame.

starwars[1:6, [:name]] |> 
  d -> transform(d, :name => (n -> n.=="Darth Vader") => :lukes_father)
6×2 DataFrame
Row name lukes_father
String31 Bool
1 Luke Skywalker false
2 C-3PO false
3 R2-D2 false
4 Darth Vader true
5 Leia Organa false
6 Owen Lars false

There. It’s done.

Wrap up

No. Just no. There was already a first post and there’s about to be a third post. I am not being paid for this and I do not have the energy to think of a witty and erudite way to wrap up the unloved middle child of the trilogy. So let us never speak of this again.

Footnotes

  1. As an aside, notice that I’ve used .== rather than == as the equality test. This is because == is a scalar operator: it doesn’t work for vectors unless you broadcast it using .↩︎

  2. In this context, a predicate function is just one that returns true or false.↩︎

  3. The other issue is that my code doesn’t handle missing data gracefully, but that will come up later so I’m ignoring it for now.↩︎

  4. At some point I want to take a look at Tidier.jl, but that’s a topic for the future.↩︎

  5. There is also filter() which has a slightly different syntax.↩︎

Reuse

Citation

BibTeX citation:
@online{navarro2024,
  author = {Navarro, Danielle},
  title = {Working with Data in {Julia}},
  date = {2024-03-02},
  url = {https://blog.djnavarro.net/posts/2024-03-02_julia-data-frames},
  langid = {en}
}
For attribution, please cite this work as:
Navarro, Danielle. 2024. “Working with Data in Julia.” March 2, 2024. https://blog.djnavarro.net/posts/2024-03-02_julia-data-frames.