Mapping runkeeper data

Wherein the author attempts to convince herself that she hasn’t just been imagining things
R
Data Visualisation
Data Wrangling
Author
Published

January 5, 2026

One of my constant irritations, as someone who loves to run and loves to analyze data, is that despite the many wonderful apps and gadgets we have for taking detailed measurements of our exercise patterns, the analyses that get shown to us as end-users of this tech are… well, at best they are boring. We get shown some graphs counting the number of steps we’ve taken, or a map showing where we ran on a specific day, and that’s about as good as it gets. Sadly, these anodyne data visualisations are often mixed with analyses that make absolutely no statistical sense, and a whole lot of junk that is best characterised as noise.1 Fortunately, it’s usually possible to export your data from these platforms and then do whatever analyses you want.2

To that end, I decided the time has come to download my runkeeper data and use it to draw some maps I actually want to see, and to answer some questions that have been nagging at me lately.

Extracting the data

To my knowledge runkeeper doesn’t supply an API, but you can download your personal data here. The export arrives as a zip file containing one gpx file for each activity, and a bunch of csv files I’m not interested in. Because I have runkeeper data that goes back to 2014 and didn’t want to have a flat directory structure with several hundred gpx files, I organised mine by calendar year. I store these files in a private “runkeeper” repository (i.e., the data feels a bit too personal to cache in this public blog repo). These are the files for 2014:

runkeeper <- fs::path_norm(here::here("..", "runkeeper"))
fs::dir_tree(path = fs::path(runkeeper, "gpx", "2014"))
/home/danielle/GitHub/djnavarro/runkeeper/gpx/2014
├── 2014-09-13-075703.gpx
├── 2014-09-18-092104.gpx
├── 2014-09-28-060957.gpx
├── 2014-10-11-160417.gpx
├── 2014-10-12-160422.gpx
├── 2014-10-16-081259.gpx
├── 2014-10-23-122343.gpx
├── 2014-11-01-085017.gpx
├── 2014-11-08-085516.gpx
├── 2014-12-06-092452.gpx
├── 2014-12-09-190117.gpx
├── 2014-12-26-065104.gpx
└── 2014-12-27-133347.gpx

I didn’t run much back then, so there aren’t many files!
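
Incidentally, the reorganisation step is easy to script, because the runkeeper filenames begin with the date. A minimal sketch, assuming the zip was extracted into a flat "gpx" folder, might look something like this:

```r
# sort a flat folder of gpx files into per-year subfolders, relying
# on the "YYYY-MM-DD-HHMMSS.gpx" naming convention runkeeper uses
gpx_dir <- fs::path(runkeeper, "gpx")
gpx_files <- fs::dir_ls(gpx_dir, glob = "*.gpx")
years <- substr(fs::path_file(gpx_files), start = 1, stop = 4)
fs::dir_create(fs::path(gpx_dir, unique(years)))
fs::file_move(
  path = gpx_files,
  new_path = fs::path(gpx_dir, years, fs::path_file(gpx_files))
)
```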

Back in 2014 I was living in Adelaide, but my love for Sydney was already starting to take shape: the 2014-09-18-092104.gpx file corresponds to one of my very first runs in Sydney (possibly the first time I went running here). I remember that run quite well. I was visiting UNSW,3 and took advantage of the visit to explore the Sydney beaches. That run was the first time I’d tried running along the coastal walk from Coogee Beach to Bondi Beach. It’s not the easiest run (dodging pedestrians, running up and down a lot of steps), but it’s truly gorgeous.

Anyway. The gpx files follow an xml format, and you can see the basic structure by printing out the first few lines:

coastal_run_gpx <- fs::path(runkeeper, "gpx", "2014", "2014-09-18-092104.gpx")
cat(brio::read_lines(coastal_run_gpx)[1:25], sep = "\n")
<?xml version="1.0" encoding="UTF-8"?>
<gpx
  version="1.1"
  creator="Runkeeper - http://www.runkeeper.com"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://www.topografix.com/GPX/1/1"
  xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
  xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1">
<trk>
  <name><![CDATA[Running 9/18/14 9:21 am]]></name>
  <time>2014-09-17T23:21:04Z</time>
<trkseg>
<trkpt lat="-33.918281000" lon="151.260294000"><ele>25.0</ele><time>2014-09-17T23:21:04Z</time></trkpt>
<trkpt lat="-33.918277000" lon="151.260310000"><ele>26.3</ele><time>2014-09-17T23:21:06Z</time></trkpt>
<trkpt lat="-33.918217000" lon="151.260380000"><ele>27.2</ele><time>2014-09-17T23:21:15Z</time></trkpt>
<trkpt lat="-33.918120000" lon="151.260424000"><ele>28.0</ele><time>2014-09-17T23:21:19Z</time></trkpt>
<trkpt lat="-33.918029000" lon="151.260460000"><ele>28.6</ele><time>2014-09-17T23:21:23Z</time></trkpt>
<trkpt lat="-33.917941000" lon="151.260486000"><ele>29.1</ele><time>2014-09-17T23:21:27Z</time></trkpt>
<trkpt lat="-33.917852000" lon="151.260519000"><ele>29.9</ele><time>2014-09-17T23:21:31Z</time></trkpt>
<trkpt lat="-33.917770000" lon="151.260577000"><ele>30.7</ele><time>2014-09-17T23:21:35Z</time></trkpt>
<trkpt lat="-33.917695000" lon="151.260627000"><ele>31.5</ele><time>2014-09-17T23:21:39Z</time></trkpt>
<trkpt lat="-33.917622000" lon="151.260694000"><ele>32.4</ele><time>2014-09-17T23:21:43Z</time></trkpt>
<trkpt lat="-33.917547000" lon="151.260791000"><ele>33.2</ele><time>2014-09-17T23:21:47Z</time></trkpt>
<trkpt lat="-33.917494000" lon="151.260877000"><ele>34.0</ele><time>2014-09-17T23:21:50Z</time></trkpt>
<trkpt lat="-33.917468000" lon="151.260971000"><ele>34.0</ele><time>2014-09-17T23:21:53Z</time></trkpt>

Okay, that’s not too difficult. We can work with this. Here’s a simple function that parses a gpx file and extracts the four fields of interest: time, latitude, longitude, and elevation:

parse_gpx <- function(path) {
  path |> 
    xml2::read_xml() |> # read xml file
    xml2::xml_find_all(xpath = "/*/*/*[3]") |> # extract trkseg
    xml2::as_list() |> # convert to list
    purrr::pluck(1) |> # extract nested list
    purrr::map_dfr(\(x) { # convert to data frame
      tibble::tibble(
        id = fs::path_file(path) |> fs::path_ext_remove(), # unique run id
        time = lubridate::ymd_hms(x$time[[1]]), # time in UTC
        lat = as.numeric(attr(x, "lat")), # latitude
        lon = as.numeric(attr(x, "lon")), # longitude
        ele = as.numeric(x$ele[[1]]) # elevation
      )
    }) 
}

Applying this function gives us a tidy data set. Each column corresponds to one of the four measurements (plus an identifier column, in case we should ever decide to merge this with data from other runs), and each row is a single observation:

run <- parse_gpx(coastal_run_gpx)
run
# A tibble: 501 × 5
   id                time                  lat   lon   ele
   <chr>             <dttm>              <dbl> <dbl> <dbl>
 1 2014-09-18-092104 2014-09-17 23:21:04 -33.9  151.  25  
 2 2014-09-18-092104 2014-09-17 23:21:06 -33.9  151.  26.3
 3 2014-09-18-092104 2014-09-17 23:21:15 -33.9  151.  27.2
 4 2014-09-18-092104 2014-09-17 23:21:19 -33.9  151.  28  
 5 2014-09-18-092104 2014-09-17 23:21:23 -33.9  151.  28.6
 6 2014-09-18-092104 2014-09-17 23:21:27 -33.9  151.  29.1
 7 2014-09-18-092104 2014-09-17 23:21:31 -33.9  151.  29.9
 8 2014-09-18-092104 2014-09-17 23:21:35 -33.9  151.  30.7
 9 2014-09-18-092104 2014-09-17 23:21:39 -33.9  151.  31.5
10 2014-09-18-092104 2014-09-17 23:21:43 -33.9  151.  32.4
# ℹ 491 more rows

The gpx file stores time as UTC: in local Sydney time I started running at 9:21am on the morning of September 18th 2014, which corresponds to 2014-09-17 23:21:04 in UTC. If I were less lazy I’d annotate the data set to specify the timezone, but it’s not super relevant to my current project so at this stage I didn’t bother.4
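
If I ever did bother, it would be a one-liner: the timestamps are already parsed as UTC, so it’s just a matter of shifting them to the Sydney timezone with lubridate. Something like this:

```r
# express the UTC timestamps in local Sydney time (same instants,
# different tzone annotation)
run_local <- run |> 
  dplyr::mutate(time = lubridate::with_tz(time, tzone = "Australia/Sydney"))
```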

What I did end up doing was write some code to add two new computed columns that aren’t in the gpx file:

  • elapsed is the elapsed time (in minutes) from the start of the run to the time the measurement is taken
  • distance is the total distance run (in meters) up to the time the measurement is taken

Here’s the code for that:

units::units_options(
  sep = c("~", "~"), 
  group = c("", ""),  
  negative_power = FALSE, 
  parse = TRUE
)

segment_length <- function(lat, lon, ele = NULL) {

  # convert lat, lon data to an sfc geometry 
  # ignore elevation per https://github.com/r-spatial/sf/issues/2564
  path <- cbind(lon, lat) |>
    sf::st_linestring(dim = "XY") |> 
    sf::st_sfc(crs = "WGS84")
  
  points <- sf::st_cast(path, "POINT")
  points_lagged <- points[-length(points)]
  points_lagged <- c(points[1], points_lagged)

  # calculate the XY length of each segment of the run (in meters)
  seg_len_xy <- sf::st_distance(points, points_lagged, by_element = TRUE) 
  if (is.null(ele)) return(seg_len_xy)

  # if elevation is specified, calculate the XYZ distance
  ele_lagged <- dplyr::lag(ele, default = ele[1])
  seg_len_z <- units::as_units(ele - ele_lagged, "m")

  seg_len_xyz <- sqrt(seg_len_xy^2 + seg_len_z^2)
  return(seg_len_xyz)
}

run <- run |> 
  dplyr::mutate(
    elapsed = difftime(time, time[1], units = "mins"),
    distance = cumsum(segment_length(lat, lon, ele))
  )

run
# A tibble: 501 × 7
   id                time                  lat   lon   ele elapsed  distance
   <chr>             <dttm>              <dbl> <dbl> <dbl> <drtn>          m
 1 2014-09-18-092104 2014-09-17 23:21:04 -33.9  151.  25   0.00000…     0   
 2 2014-09-18-092104 2014-09-17 23:21:06 -33.9  151.  26.3 0.03333…     2.02
 3 2014-09-18-092104 2014-09-17 23:21:15 -33.9  151.  27.2 0.18333…    11.3 
 4 2014-09-18-092104 2014-09-17 23:21:19 -33.9  151.  28   0.25000…    22.9 
 5 2014-09-18-092104 2014-09-17 23:21:23 -33.9  151.  28.6 0.31666…    33.6 
 6 2014-09-18-092104 2014-09-17 23:21:27 -33.9  151.  29.1 0.38333…    43.7 
 7 2014-09-18-092104 2014-09-17 23:21:31 -33.9  151.  29.9 0.45000…    54.0 
 8 2014-09-18-092104 2014-09-17 23:21:35 -33.9  151.  30.7 0.51666…    64.6 
 9 2014-09-18-092104 2014-09-17 23:21:39 -33.9  151.  31.5 0.58333…    74.2 
10 2014-09-18-092104 2014-09-17 23:21:43 -33.9  151.  32.4 0.65000…    84.4 
# ℹ 491 more rows
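
One nice side effect of having these columns is that summary statistics fall straight out of the final row. As a quick sketch, here’s the average pace calculation:

```r
# average pace in minutes per kilometre, computed from the final row
last_row <- dplyr::slice_tail(run, n = 1)
as.numeric(last_row$elapsed) / (as.numeric(last_row$distance) / 1000)
```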

Mapping the data

Stage 1: Reproduce something like the runkeeper maps

Next step is to create a pretty map. I’ll use the leaflet R package and javascript library as the mapping tool, and I’ll pull the map tiles from stadia maps. To do so, I wrote a convenient wrapper function that constructs the URL template for a specific stadia maps style:

stadia_tile_url <- function(style = "stamen_toner", key = TRUE) {

  # URL template (without API key)
  base_url <- "https://tiles.stadiamaps.com/tiles"
  pattern  <- "{z}/{x}/{y}{r}.png"
  tile_url <- paste(base_url, style, pattern, sep = "/")
  if (!key) return(tile_url)

  # pull API key from private file
  api_key <- fs::path(runkeeper, ".Renviron") |> 
    brio::read_lines() |> 
    stringr::str_remove("^[^=]*=")

  # return the full URL template
  glue::glue("{tile_url}?api_key={api_key}")
}

# show URL template (without the API key)
stadia_tile_url(key = FALSE)
[1] "https://tiles.stadiamaps.com/tiles/stamen_toner/{z}/{x}/{y}{r}.png"

As you can see from the code above, I finally caved and set up an account with stadia maps, and to that end I now have an API key that I can supply. In many cases you don’t actually need one though.5 In my “runkeeper” repository the API key is provided as an environment variable via the .Renviron file,6 but since that environment variable doesn’t exist in the R environment for this blog post, the function reads it directly from the .Renviron file in the other repository.
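
For comparison, within the runkeeper repository itself, where the .Renviron file is loaded, the conventional approach would be to read the variable from the session environment with Sys.getenv(). A sketch, noting that the variable name STADIA_API_KEY is my own invention rather than anything mandated by stadia maps:

```r
# read the API key from the session environment, failing loudly if
# unset ("STADIA_API_KEY" is an arbitrary name of my own choosing)
get_stadia_key <- function() {
  api_key <- Sys.getenv("STADIA_API_KEY")
  if (identical(api_key, "")) stop("STADIA_API_KEY is not set")
  api_key
}
```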

In the map I want to create, I’d like to add some markers on the run corresponding to specific milestones. In runkeeper itself, the maps display distance markers (1km, 2km, etc.) overlaid on the route. Just to mix things up a little, I’ll display time markers (5min, 10min, etc.) on my map, with the distance information at the relevant time point included in the marker label. To do that, I’ll need a data frame that specifies the marker information:

run_markers <- run |> 
  dplyr::mutate(
    n5min = as.numeric(floor(elapsed/5)), 
    new5min = n5min - dplyr::lag(n5min, default = 0)
  ) |> 
  dplyr::filter(new5min == 1) |> 
  dplyr::select(lat, lon, elapsed, distance) |> 
  dplyr::mutate(label = paste0(
    format(round(elapsed)), 
    ", ", 
    format(round(distance))
  )) 

run_markers
# A tibble: 6 × 5
    lat   lon elapsed        distance label            
  <dbl> <dbl> <drtn>                m <chr>            
1 -33.9  151.  5.016667 mins     711. " 5 mins,  711 m"
2 -33.9  151. 10.000000 mins    1488. "10 mins, 1488 m"
3 -33.9  151. 15.033333 mins    2270. "15 mins, 2270 m"
4 -33.9  151. 20.116667 mins    3028. "20 mins, 3028 m"
5 -33.9  151. 25.066667 mins    3865. "25 mins, 3865 m"
6 -33.9  151. 30.000000 mins    4617. "30 mins, 4617 m"

So now I can create my map. In the map below, the base layer contains tiles supplied by stadia maps, and over the top of that I’ve plotted the route taken on my coastal run, and added markers to show key milestones during the run:

map <- leaflet::leaflet() |> 
  leaflet::addTiles(
    urlTemplate = stadia_tile_url(),
    options = leaflet::tileOptions(opacity = .5)
  ) |> 
  leaflet::addPolylines(
    data = run, 
    lat = ~lat, 
    lng = ~lon, 
    color = "red", 
    opacity = 0.5, 
    weight = 5
  ) |> 
  leaflet::addCircleMarkers(
    data = run_markers,
    lat = ~lat,
    lng = ~lon,
    label = ~label,
    color = "red", 
    opacity = 0.5, 
    radius = 5
  )

map


Neat. Obviously, this map could be refined further and made prettier, but as a proof of concept it serves its purpose.

Stage 2: Do something fun and show many runs at once

In the previous example I used the run data frame constructed from a single gpx file. To create a map that shows many runs at once, I’ll need to parse more gpx files. Happily, in my actual runkeeper repo I’ve already done this, and have converted each of the gpx files to a tidy csv that contains the elapsed time and cumulative distance fields. Given that, we don’t have to bother with the parsing step this time. Instead we can simply import the preprocessed data:

# paths to all csv files
csv_files <- fs::dir_ls(
  path = fs::path(runkeeper, "csv"), 
  recurse = TRUE, 
  type = "file"
)

# read everything into a single data frame
runs <- csv_files |> 
  purrr::map(\(file) {
    readr::read_csv(
      file = file,
      show_col_types = FALSE
    )   
  }) |> 
  dplyr::bind_rows()
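
The preprocessing script that generated those csvs isn’t shown here, but it involves nothing new: it loops the parse_gpx() function from earlier over every gpx file, adds the two computed columns, and writes the result out with readr. A sketch, using a flat csv folder for simplicity (my real repo mirrors the per-year structure):

```r
# convert each gpx file to a tidy csv with the computed columns
# (assumes parse_gpx() and segment_length() as defined earlier)
gpx_files <- fs::dir_ls(
  path = fs::path(runkeeper, "gpx"), 
  recurse = TRUE, 
  glob = "*.gpx"
)
fs::dir_create(fs::path(runkeeper, "csv"))
for (gpx in gpx_files) {
  gpx |> 
    parse_gpx() |> 
    dplyr::mutate(
      elapsed = as.numeric(difftime(time, time[1], units = "mins")),
      distance = as.numeric(cumsum(segment_length(lat, lon, ele)))
    ) |> 
    readr::write_csv(
      fs::path(runkeeper, "csv", fs::path_file(gpx)) |> 
        fs::path_ext_set("csv")
    )
}
```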

# function that detects if a run is in sydney (in a lazy way)
near_sydney <- function(lat, lon, tol = 2) {
  if (abs(mean(lat) + 33.865) > tol) return(FALSE)
  if (abs(mean(lon) - 151.21) > tol) return(FALSE)
  TRUE
}

# append some handy information, and sort by run length; retain
# only runs longer than 2km (anything shorter than that usually
# indicates injury or some other extraneous factor)
runs_sydney <- runs |> 
  dplyr::mutate(
    sydney = near_sydney(lat, lon),
    run_len = dplyr::last(distance),
    .by = "id"
  ) |> 
  dplyr::filter(sydney == TRUE, run_len > 2000) |> 
  dplyr::arrange(run_len, id, time)

runs_sydney
# A tibble: 272,386 × 9
   id          time                  lat   lon   ele elapsed distance sydney
   <chr>       <dttm>              <dbl> <dbl> <dbl>   <dbl>    <dbl> <lgl> 
 1 2024-05-09… 2024-05-09 02:48:50 -33.9  151.  22.5   0         0    TRUE  
 2 2024-05-09… 2024-05-09 02:48:57 -33.9  151.  22.4   0.117     9.83 TRUE  
 3 2024-05-09… 2024-05-09 02:49:01 -33.9  151.  22.2   0.183    21.4  TRUE  
 4 2024-05-09… 2024-05-09 02:49:05 -33.9  151.  21.9   0.25     31.7  TRUE  
 5 2024-05-09… 2024-05-09 02:49:10 -33.9  151.  21.6   0.333    42.7  TRUE  
 6 2024-05-09… 2024-05-09 02:49:16 -33.9  151.  21.5   0.433    54.3  TRUE  
 7 2024-05-09… 2024-05-09 02:49:21 -33.9  151.  21.3   0.517    65.2  TRUE  
 8 2024-05-09… 2024-05-09 02:49:26 -33.9  151.  20.9   0.6      75.2  TRUE  
 9 2024-05-09… 2024-05-09 02:49:31 -33.9  151.  20.4   0.683    86.9  TRUE  
10 2024-05-09… 2024-05-09 02:49:37 -33.9  151.  19.7   0.783    98.6  TRUE  
# ℹ 272,376 more rows
# ℹ 1 more variable: run_len <dbl>
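
That’s a lot of rows, so before mapping it can help to collapse the data to one row per run for a quick overview; something along these lines would do it:

```r
# one summary row per run: start time, duration, and total distance
runs_sydney |> 
  dplyr::summarise(
    start = min(time),
    mins = max(elapsed),
    metres = dplyr::last(distance),
    .by = "id"
  ) |> 
  dplyr::arrange(start)
```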

Using the runs_sydney data set, I can now draw a fun map showing all the places I’ve gone running in Sydney. In the map below, paths in blue correspond to my half-marathon runs; all other runs are shown in orange:

purrr::reduce(
  .x = runs_sydney |> 
    dplyr::group_by(run_len, id) |>
    dplyr::group_split(),
  .f = \(map, run) {
    map |> 
      leaflet::addPolylines(
        data = run, 
        lat = ~lat, 
        lng = ~lon, 
        color = ifelse(
          run$run_len[1] > 21000, 
          "#006699", 
          "#C76E00"
        ), 
        opacity = .6, 
        weight = 4
      )
  },
  .init = leaflet::leaflet() |> 
    leaflet::addTiles(
      urlTemplate = stadia_tile_url("stamen_watercolor"),
      options = leaflet::tileOptions(opacity = .3)
    )  
) |> 
  leaflet::setView(
    lat = -33.863, 
    lng = 151.235,
    zoom = 12
  )