In the last post I talked about something very grim, and to be honest it’s not at all what I wanted to be writing about yesterday. My intention when I woke up yesterday was to write about tabulizer, an R package you can use to extract tables from a pdf document. This isn’t my favourite data wrangling task: pdf is not a very good format in which to store data, but it’s awfully common to find yourself in a situation where the data you want to work with exists only as a table in a pdf document. Because this is a thing that happens, it’s nice to have tools that make it a little easier.
To extract a table, we must first create the universe
The tabulizer package works by supplying bindings to tabula-java, a Java library for extracting tables from pdfs. So if you want tabulizer to work in R, you need a working installation of Java, and you need the rJava package to provide the R-to-Java bindings.
My experience in the past has been that getting all of this set up can be a bit finicky. Happily for me, I’m on Ubuntu and Andrew Collier has a blog post that walks you through the process step by step. Following his guide, my first step was to install the Java runtime environment and the Java development kit:
sudo apt-get install -y default-jre default-jdk
This worked smoothly, so I moved on to the next step and ensured that R knows where to find Java:
sudo R CMD javareconf
Only now is it possible to install the rJava package:
install.packages("rJava")
Andrew’s post suggests that you need to restart RStudio after doing this, so I did that too. Having done so, I could finally install the tabulizer package itself:
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))
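Before moving on, it doesn’t hurt to verify that R can really talk to the Java virtual machine. As a minimal sanity check using rJava directly, .jinit() starts the JVM and the .jcall() line asks Java to report its own version; if both run without error, the bindings are working:

library(rJava)
.jinit()  # start the Java virtual machine
.jcall("java/lang/System", "S", "getProperty", "java.version")  # ask Java for its version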
Let there be tables
Now that I have the tabulizer package installed, I’ll load it along with the other packages I’ll be using in this post:

library(tabulizer)
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(janitor)
To check that it works, I’ll need a pdf file to work with. As a convenience, the tabulizer package comes with a bundled “data.pdf” file that we can use for this purpose:
<- system.file("examples", "data.pdf", package = "tabulizer") pdf_data
I’ve embedded a copy of the “data.pdf” file in this post, and as you can see it’s a very simple test case (by design). The file contains four tables, and only those four tables:
We can use this data file as a way to check that the package works and does what we expect. The workhorse function in the package is extract_tables(). We pass it the path to the pdf file as the first argument, and use the various other arguments to provide details about how the file should be processed. In this case, the only other argument I’ll specify is output = "data.frame", which tells the extract_tables() function to return a list of data frames rather than a list of matrices (the default behaviour). Let’s see if it’s working:
pdf_tables <- pdf_data |>
  extract_tables(output = "data.frame") |>
  map(as_tibble)
pdf_tables
[[1]]
# A tibble: 31 × 10
mpg cyl disp hp drat wt qsec vs am gear
<dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
1 21 6 160 110 3.9 2.62 16.5 0 1 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3
8 24.4 4 147. 62 3.69 3.19 20 1 0 4
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
# ℹ 21 more rows
[[2]]
# A tibble: 6 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
[[3]]
# A tibble: 6 × 6
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<int> <dbl> <dbl> <dbl> <dbl> <chr>
1 145 6.7 3.3 5.7 2.5 virginica
2 146 6.7 3 5.2 2.3 virginica
3 147 6.3 2.5 5 1.9 virginica
4 148 6.5 3 5.2 2 virginica
5 149 6.2 3.4 5.4 2.3 virginica
6 150 5.9 3 5.1 1.8 virginica
[[4]]
# A tibble: 14 × 1
supp
<chr>
1 VC
2 VC
3 VC
4 VC
5 VC
6 VC
7 VC
8 VC
9 VC
10 VC
11 VC
12 VC
13 VC
14 VC
That looks nice. With very little effort we’ve extracted all four tables from the pdf file, and returned a list of tibbles containing the data. Yay! 🎉
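If you’d rather double-check the result programmatically than eyeball the printed output, purrr makes that easy. As a quick sketch, map_int() counts the rows in each extracted table:

map_int(pdf_tables, nrow)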
Wild caught pdf files are trickier to work with
Okay, let’s try a harder example. One of the two reports I referred to in yesterday’s blog post is a survey of LGBTQ people conducted by Data For Progress. Unlike the test file, it contains additional text that is not part of any table, and the tables within the report have a lot of fancy formatting that isn’t present in the test file. I’ve cached a local copy of the pdf file as “dfp_lgbtq_survey.pdf”, and you can take a look yourself to see what we’re working with this time:
The data I used in that post comes from question 4, so I’ll try to extract the data for that table from the pdf file. This turns out to be a little harder to do. My first attempt tried to automatically pull all the tables from the second page by setting pages = 2, and this is what happened:
<- "dfp_lgbtq_survey.pdf"
pdf_file extract_tables(pdf_file, pages = 2)
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] "79" "85" "82" "70" "87" "89" "77" "79" "83" "69" "80"
[2,] "17" "12" "13" "29" "7" "6" "19" "14" "15" "27" "17"
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "Response" "Topline" "African" "or" "White" "Female" "Male"
[2,] "" "" "" "" "" "" ""
[3,] "" "" "American" "Latino/a" "" "" ""
[,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "" "" "identify as" "" "" "" "" "65+"
[2,] "binary" "transgender" "" "24" "39" "54" "64" ""
[3,] "" "" "transgender" "" "" "" "" ""
Okay, that’s definitely not the data we want. To make this work we’re going to have to give extract_tables() a little more information. One way to do this is to explicitly specify the area of the pdf file that contains the table to be extracted. To that end, it’s helpful to first call the get_page_dims() function, which gives us the dimensions of each page in the pdf document:
get_page_dims(pdf_file)
[[1]]
[1] 595 842
[[2]]
[1] 595 842
[[3]]
[1] 595 842
Now that we have the dimensions for each page we can specify a rectangular region as a vector containing the top, left, bottom and right coordinates of the rectangle:
region <- c(250, 0, 450, 595)
The command we want looks like this:
mat <- extract_tables(
  file = pdf_file,
  pages = 2,
  guess = FALSE,
  area = list(region)
)[[1]]
This time around, in addition to setting pages = 2, we’ve set guess = FALSE in order to stop extract_tables() from trying to automatically detect regions containing tabular data, and we’ve also passed a list of regions (in this case just the one region) as the area argument, thereby telling extract_tables() to look in that specific part of the document.
Let’s take a look at the result:
mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "" "" "Black or" "Hispanic" "" "" "" "" "" "Does not" "" "" "" "" ""
[2,] "" "" "" "" "" "" "" "Non-" "Identifies as" "" "18-" "25-" "40-" "55-" ""
[3,] "Response" "Topline" "African" "or" "White" "Female" "Male" "" "" "identify as" "" "" "" "" "65+"
[4,] "" "" "" "" "" "" "" "binary" "transgender" "" "24" "39" "54" "64" ""
[5,] "" "" "American" "Latino/a" "" "" "" "" "" "transgender" "" "" "" "" ""
[6,] "Yes, I have considered moving" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[7,] "" "27" "24" "28" "27" "26" "20" "44" "43" "24" "41" "28" "18" "17" "17"
[8,] "out of my community or state" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[9,] "No, I have not considered" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[10,] "moving out of my community or" "61" "57" "59" "64" "60" "75" "37" "40" "65" "43" "60" "74" "80" "70"
[11,] "state" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[12,] "I have already moved out of my" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[13,] "community or state as a result of" "5" "6" "4" "3" "3" "3" "13" "8" "4" "8" "4" "2" "1" "9"
[14,] "anti-LGBTQ+ legislation" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[15,] "Not sure" "7" "14" "8" "6" "11" "3" "6" "8" "7" "9" "8" "7" "2" "4"
[16,] "Weighted N" "1,036" "93" "217" "632" "426" "368" "135" "166" "870" "249" "425" "186" "93" "83"
It’s not quite organised the way we want, but it’s definitely the right data.
Even better, you don’t actually have to do all this messing about trying to figure out the precise region containing the table. If you have the shiny and miniUI packages installed, you can work interactively using a command like this:
extract_areas("dfp_lgbtq_survey.pdf", pages = 2)
After clicking and dragging to select the region of the page containing the table, R returns the same data contained in the mat matrix shown earlier.
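Since extract_areas() returns the same kind of list that extract_tables() does, you can also capture the interactively-selected table and keep working with it; a minimal sketch:

mat <- extract_areas("dfp_lgbtq_survey.pdf", pages = 2)[[1]]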
Cleaning the table
Once we have the data in this matrix form, it’s slightly tedious to wrangle it into the format we want, but it’s not conceptually difficult once we have a few helper functions to make our lives easier. The first step is to split the matrix into a list of matrices, each of which contains the data that should belong in a single row of the final data set. The row_split() function below takes a matrix as input, and splits it up into a list of matrices specified by the list argument rows, where each element of rows is a vector containing the indices of the rows that should be included in the relevant element of the output:
row_split <- function(x, rows) {
  lapply(rows, \(r) {
    if (length(r) == 1) return(matrix(x[r, ], nrow = 1))
    x[r, ]
  })
}

groups <- list(1:5, 6:8, 9:11, 12:14, 15, 16)

mat_split <- row_split(mat, rows = groups)
mat_split
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "" "" "Black or" "Hispanic" "" "" "" "" "" "Does not" "" "" "" "" ""
[2,] "" "" "" "" "" "" "" "Non-" "Identifies as" "" "18-" "25-" "40-" "55-" ""
[3,] "Response" "Topline" "African" "or" "White" "Female" "Male" "" "" "identify as" "" "" "" "" "65+"
[4,] "" "" "" "" "" "" "" "binary" "transgender" "" "24" "39" "54" "64" ""
[5,] "" "" "American" "Latino/a" "" "" "" "" "" "transgender" "" "" "" "" ""
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "Yes, I have considered moving" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[2,] "" "27" "24" "28" "27" "26" "20" "44" "43" "24" "41" "28" "18" "17" "17"
[3,] "out of my community or state" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[[3]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "No, I have not considered" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[2,] "moving out of my community or" "61" "57" "59" "64" "60" "75" "37" "40" "65" "43" "60" "74" "80" "70"
[3,] "state" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[[4]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "I have already moved out of my" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[2,] "community or state as a result of" "5" "6" "4" "3" "3" "3" "13" "8" "4" "8" "4" "2" "1" "9"
[3,] "anti-LGBTQ+ legislation" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[[5]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "Not sure" "7" "14" "8" "6" "11" "3" "6" "8" "7" "9" "8" "7" "2" "4"
[[6]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "Weighted N" "1,036" "93" "217" "632" "426" "368" "135" "166" "870" "249" "425" "186" "93" "83"
The second helper function is col_paste(), which takes a matrix with one or more rows as input and collapses it to a vector by pasting the contents of all cells in the same column together:
col_paste <- function(x, ...) {
  apply(x, 2, \(y) {as.vector(paste(y, ...))})
}
To illustrate the idea, let’s take mat_split[[1]], a five-row matrix that contains the data that should eventually become our column names, and convert it to a character vector using col_paste():

col_paste(mat_split[[1]], collapse = " ")
[1] " Response "
[2] " Topline "
[3] "Black or African American"
[4] "Hispanic or Latino/a"
[5] " White "
[6] " Female "
[7] " Male "
[8] " Non- binary "
[9] " Identifies as transgender "
[10] "Does not identify as transgender"
[11] " 18- 24 "
[12] " 25- 39 "
[13] " 40- 54 "
[14] " 55- 64 "
[15] " 65+ "
Finally, we can use the row_combine() function below, which takes a list of vectors and combines them into a matrix:
row_combine <- function(x, ...) {
  matrix(unlist(x), nrow = length(x), byrow = TRUE)
}
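To see what row_combine() does, here’s a tiny made-up example: a list of two character vectors becomes a two-row matrix, one row per vector.

row_combine(list(c("a", "b"), c("c", "d")))

     [,1] [,2]
[1,] "a"  "b" 
[2,] "c"  "d" 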
Equipped with these helpers, the following pipeline takes the raw output mat and converts it into a tibble dat containing the data in the format we want:
dat <- mat |>
  row_split(groups) |>                        # split into list of matrices
  map(\(x) {col_paste(x, collapse = " ")}) |> # paste into character vector
  row_combine() |>                            # combine vectors into one matrix
  as_tibble(.name_repair = "minimal") |>      # convert to tibble
  row_to_names(row_number = 1) |>             # use first row as names
  clean_names() |>                            # clean the names
  rename(                                     # shorten some names
    "black" = "black_or_african_american",
    "hispanic" = "hispanic_or_latino_a",
    "trans" = "identifies_as_transgender",
    "not_trans" = "does_not_identify_as_transgender"
  ) |>
  mutate(across(!response, \(x) {as.numeric(gsub(",", "", x))})) # numeric data
dat
# A tibble: 5 × 15
response topline black hispanic white female male non_binary trans not_trans x18_24 x25_39 x40_54 x55_64 x65
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Yes, I have considered moving … 27 24 28 27 26 20 44 43 24 41 28 18 17 17
2 No, I have not considered movi… 61 57 59 64 60 75 37 40 65 43 60 74 80 70
3 I have already moved out of my… 5 6 4 3 3 3 13 8 4 8 4 2 1 9
4 Not sure 7 14 8 6 11 3 6 8 7 9 8 7 2 4
5 Weighted N 1036 93 217 632 426 368 135 166 870 249 425 186 93 83
Et voilà!
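From here the data is ready for whatever comes next. As one last sketch, if I wanted the table in long format for plotting, the dplyr and tidyr functions loaded earlier would get me there (dropping the “Weighted N” row first, since it isn’t a response category):

dat |>
  filter(response != "Weighted N") |>
  pivot_longer(-response, names_to = "group", values_to = "percent")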