Extracting tables from pdf files with tabulizer

Something nicer than the last post
Categories: R, Data Wrangling
Author: Danielle Navarro
Published: June 16, 2023

In the last post I talked about something very grim, and to be honest it’s not at all what I wanted to be writing about yesterday. My intention when I woke up yesterday was to write about tabulizer, an R package you can use to extract tables from a pdf document. This isn’t my favourite data wrangling task: pdf is not a very good format in which to store data, but it’s awfully common to find yourself in a situation where the data you want to work with exists only as a table in a pdf document. Because this is a thing that happens, it’s nice to have tools that make it a little easier.

To extract a table, we must first create the universe

The tabulizer package works by supplying bindings to tabula-java, a Java library for extracting tables from pdfs. So if you want tabulizer to work in R you need a working installation of Java, and you need the rJava package to provide the R-to-Java bindings.

My experience in the past has been that getting all of this set up can be a bit finicky. Happily for me, I’m on Ubuntu and Andrew Collier has a blog post that walks you through the process step by step. Following his guide, my first step was to install the Java runtime environment and the Java development kit:

sudo apt-get install -y default-jre default-jdk

This worked smoothly, so I moved onto the next step and ensured that R knows where to find Java:

sudo R CMD javareconf

Only now is it possible to install the rJava package:

install.packages("rJava")

Andrew’s post suggests that you need to restart RStudio after doing this, so I did that too. Having done so, I could finally install the tabulizer package itself:

remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))
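Before going further, it can be worth checking that the R-to-Java bridge actually works. Here is a quick sanity check (a sketch: it assumes rJava installed cleanly and that `R CMD javareconf` found your Java installation). It initialises the JVM and asks it to report its version:

```r
# Start the JVM and query the Java version through rJava.
# If either call errors, the Java installation or the
# javareconf step needs revisiting.
library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.version")
```

If this returns a version string without complaint, tabulizer should have everything it needs.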

Let there be tables

Now that I have the tabulizer package installed, I’ll load it along with the other packages I’ll be using in this post:

library(tabulizer)
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(janitor)

To check that it works, I’ll need a pdf file to work with. As a convenience, the tabulizer package comes with a bundled “data.pdf” file that we can use for this purpose:

pdf_data <- system.file("examples", "data.pdf", package = "tabulizer")

I’ve embedded a copy of the “data.pdf” file in this post, and as you can see it’s a very simple test case (by design). The file contains four tables, and only those four tables:


We can use this data file as a way to check that the package works and does what we expect. The workhorse function in the package is extract_tables(). We pass it the path to the pdf file as the first argument, and use the various other arguments to provide details about how the file should be processed. In this case, the only other argument I’ll specify is output = "data.frame", which tells the extract_tables() function to return a list of data frames rather than a list of matrices (the default behaviour). Let’s see if it’s working:

pdf_tables <- pdf_data |> 
  extract_tables(output = "data.frame") |>
  map(as_tibble)

pdf_tables
[[1]]
# A tibble: 31 × 10
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
   <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4
# ℹ 21 more rows

[[2]]
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 

[[3]]
# A tibble: 6 × 6
      X Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
  <int>        <dbl>       <dbl>        <dbl>       <dbl> <chr>    
1   145          6.7         3.3          5.7         2.5 virginica
2   146          6.7         3            5.2         2.3 virginica
3   147          6.3         2.5          5           1.9 virginica
4   148          6.5         3            5.2         2   virginica
5   149          6.2         3.4          5.4         2.3 virginica
6   150          5.9         3            5.1         1.8 virginica

[[4]]
# A tibble: 14 × 1
   supp 
   <chr>
 1 VC   
 2 VC   
 3 VC   
 4 VC   
 5 VC   
 6 VC   
 7 VC   
 8 VC   
 9 VC   
10 VC   
11 VC   
12 VC   
13 VC   
14 VC   

That looks nice. With very little effort we’ve extracted all four tables from the pdf file, and returned a list of tibbles containing the data. Yay! 🎉

Wild caught pdf files are trickier to work with

Okay, let’s try a harder example. One of the two reports I referred to in yesterday’s blog post is a survey of LGBTQ people conducted by Data For Progress. Unlike the test file, it contains additional text that is not part of any table, and the tables in the report use fancier formatting than anything in the test file. I’ve cached a local copy of the pdf file as “dfp_lgbtq_survey.pdf”, and you can take a look yourself to see what we’re working with this time:


The data I used in that post comes from question 4, so I’ll try to extract the data for that table from the pdf file. This turns out to be a little harder to do. My first attempt tried to automatically pull all the tables from the second page by setting pages = 2, and this is what happened:

pdf_file <- "dfp_lgbtq_survey.pdf"
extract_tables(pdf_file, pages = 2)
[[1]]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] "79" "85" "82" "70" "87" "89" "77" "79" "83" "69"  "80" 
[2,] "17" "12" "13" "29" "7"  "6"  "19" "14" "15" "27"  "17" 

[[2]]
     [,1]       [,2]      [,3]       [,4]       [,5]    [,6]     [,7]  
[1,] "Response" "Topline" "African"  "or"       "White" "Female" "Male"
[2,] ""         ""        ""         ""         ""      ""       ""    
[3,] ""         ""        "American" "Latino/a​" ""      ""       ""    
     [,8]     [,9]          [,10]         [,11] [,12] [,13] [,14] [,15]
[1,] ""       ""            "identify as" ""    ""    ""    ""    "65+"
[2,] "binary" "transgender" ""            "24"  "39"  "54"  "64"  ""   
[3,] ""       ""            "transgender" ""    ""    ""    ""    ""   

Okay, that’s definitely not the data we want. To make this work we’re going to have to give extract_tables() a little more information. One way to do this is to explicitly specify the area of the pdf file that contains the table to be extracted. To that end, it’s helpful to first call the get_page_dims() function, which gives us the dimensions of each page in the pdf document:

get_page_dims(pdf_file)
[[1]]
[1] 595 842

[[2]]
[1] 595 842

[[3]]
[1] 595 842

Now that we have the dimensions for each page we can specify a rectangular region as a vector containing the top, left, bottom and right coordinates of the rectangle:

region <- c(250, 0, 450, 595)

The command we want looks like this:

mat <- extract_tables(
  file = pdf_file, 
  pages = 2, 
  guess = FALSE,
  area = list(region)
)[[1]]

This time around, in addition to setting pages = 2, we’ve set guess = FALSE to stop extract_tables() from trying to automatically detect regions containing tabular data. We’ve also passed a list of regions (in this case just the one region) as the area argument, telling extract_tables() to look only in that specific part of the document.

Let’s take a look at the result:

mat
      [,1]                                [,2]      [,3]       [,4]       [,5]    [,6]     [,7]   [,8]     [,9]            [,10]         [,11] [,12] [,13] [,14] [,15]
 [1,] ""                                  ""        "Black or" "Hispanic" ""      ""       ""     ""       ""              "Does not"    ""    ""    ""    ""    ""   
 [2,] ""                                  ""        ""         ""         ""      ""       ""     "Non-"   "Identifies as" ""            "18-" "25-" "40-" "55-" ""   
 [3,] "Response"                          "Topline" "African"  "or"       "White" "Female" "Male" ""       ""              "identify as" ""    ""    ""    ""    "65+"
 [4,] ""                                  ""        ""         ""         ""      ""       ""     "binary" "transgender"   ""            "24"  "39"  "54"  "64"  ""   
 [5,] ""                                  ""        "American" "Latino/​a" ""      ""       ""     ""       ""              "transgender" ""    ""    ""    ""    ""   
 [6,] "Yes, I have considered moving"     ""        ""         ""         ""      ""       ""     ""       ""              ""            ""    ""    ""    ""    ""   
 [7,] ""                                  "27"      "24"       "28"       "27"    "26"     "20"   "44"     "43"            "24"          "41"  "28"  "18"  "17"  "17" 
 [8,] "out of my community or state"      ""        ""         ""         ""      ""       ""     ""       ""              ""            ""    ""    ""    ""    ""   
 [9,] "No, I have not considered"         ""        ""         ""         ""      ""       ""     ""       ""              ""            ""    ""    ""    ""    ""   
[10,] "moving out of my community or"     "61"      "57"       "59"       "64"    "60"     "75"   "37"     "40"            "65"          "43"  "60"  "74"  "80"  "70" 
[11,] "state"                             ""        ""         ""         ""      ""       ""     ""       ""              ""            ""    ""    ""    ""    ""   
[12,] "I have already moved out of my"    ""        ""         ""         ""      ""       ""     ""       ""              ""            ""    ""    ""    ""    ""   
[13,] "community or state as a result of" "5"       "6"        "4"        "3"     "3"      "3"    "13"     "8"             "4"           "8"   "4"   "2"   "1"   "9"  
[14,] "anti-LGBTQ+ legislation"           ""        ""         ""         ""      ""       ""     ""       ""              ""            ""    ""    ""    ""    ""   
[15,] "Not sure"                          "7"       "14"       "8"        "6"     "11"     "3"    "6"      "8"             "7"           "9"   "8"   "7"   "2"   "4"  
[16,] "Weighted N"                        "1,036"   "93"       "217"      "632"   "426"    "368"  "135"    "166"           "870"         "249" "425" "186" "93"  "83" 

It’s not quite organised the way we want, but it’s definitely the right data.
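As an aside, the same trick extends to multiple regions: the area argument takes one region per page listed in pages. A sketch of what that would look like (the page-3 region here is purely hypothetical; this report only has the one table we care about):

```r
# One region per requested page: area[[i]] is applied to pages[i]
tables <- extract_tables(
  file = pdf_file,
  pages = c(2, 3),
  guess = FALSE,
  area = list(
    c(250, 0, 450, 595),  # table region on page 2
    c(250, 0, 450, 595)   # hypothetical table region on page 3
  )
)
```
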

Even better, you don’t actually have to do all this messing about trying to figure out the precise region containing the table. If you have the shiny and miniUI packages installed, you can work interactively using a command like this:

extract_areas("dfp_lgbtq_survey.pdf", pages = 2)

After clicking and dragging to select the region of the page containing the table, R returns the same data contained in the mat matrix shown earlier.

Cleaning the table

Once we have the data in this matrix form, it’s slightly tedious to wrangle it into the format we want, but it’s not conceptually difficult once we have a few helper functions to make our lives easier. The first step is to split the matrix into a list of matrices, each of which contains the data that should belong in a single row of the final data set. The row_split() function below takes a matrix as input and splits it into a list of matrices, as specified by the list argument rows: each element of rows is a vector containing the indices of the rows that should be included in the corresponding element of the output:

row_split <- function(x, rows) {
  lapply(rows, \(r) {
    if(length(r) == 1) return(matrix(x[r, ], nrow = 1))
    x[r, ]
  })
}
groups <- list(1:5, 6:8, 9:11, 12:14, 15, 16)

mat_split <- row_split(mat, rows = groups)
mat_split
[[1]]
     [,1]       [,2]      [,3]       [,4]       [,5]    [,6]     [,7]   [,8]     [,9]            [,10]         [,11] [,12] [,13] [,14] [,15]
[1,] ""         ""        "Black or" "Hispanic" ""      ""       ""     ""       ""              "Does not"    ""    ""    ""    ""    ""   
[2,] ""         ""        ""         ""         ""      ""       ""     "Non-"   "Identifies as" ""            "18-" "25-" "40-" "55-" ""   
[3,] "Response" "Topline" "African"  "or"       "White" "Female" "Male" ""       ""              "identify as" ""    ""    ""    ""    "65+"
[4,] ""         ""        ""         ""         ""      ""       ""     "binary" "transgender"   ""            "24"  "39"  "54"  "64"  ""   
[5,] ""         ""        "American" "Latino/​a" ""      ""       ""     ""       ""              "transgender" ""    ""    ""    ""    ""   

[[2]]
     [,1]                            [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "Yes, I have considered moving" ""   ""   ""   ""   ""   ""   ""   ""   ""    ""    ""    ""    ""    ""   
[2,] ""                              "27" "24" "28" "27" "26" "20" "44" "43" "24"  "41"  "28"  "18"  "17"  "17" 
[3,] "out of my community or state"  ""   ""   ""   ""   ""   ""   ""   ""   ""    ""    ""    ""    ""    ""   

[[3]]
     [,1]                            [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "No, I have not considered"     ""   ""   ""   ""   ""   ""   ""   ""   ""    ""    ""    ""    ""    ""   
[2,] "moving out of my community or" "61" "57" "59" "64" "60" "75" "37" "40" "65"  "43"  "60"  "74"  "80"  "70" 
[3,] "state"                         ""   ""   ""   ""   ""   ""   ""   ""   ""    ""    ""    ""    ""    ""   

[[4]]
     [,1]                                [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "I have already moved out of my"    ""   ""   ""   ""   ""   ""   ""   ""   ""    ""    ""    ""    ""    ""   
[2,] "community or state as a result of" "5"  "6"  "4"  "3"  "3"  "3"  "13" "8"  "4"   "8"   "4"   "2"   "1"   "9"  
[3,] "anti-LGBTQ+ legislation"           ""   ""   ""   ""   ""   ""   ""   ""   ""    ""    ""    ""    ""    ""   

[[5]]
     [,1]       [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "Not sure" "7"  "14" "8"  "6"  "11" "3"  "6"  "8"  "7"   "9"   "8"   "7"   "2"   "4"  

[[6]]
     [,1]         [,2]    [,3] [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "Weighted N" "1,036" "93" "217" "632" "426" "368" "135" "166" "870" "249" "425" "186" "93"  "83" 

The second helper function is col_paste(), which takes a matrix with one or more rows as input and collapses it to a vector by pasting the contents of all cells in the same column together:

col_paste <- function(x, ...) {
  apply(x, 2, \(y) {as.vector(paste(y, ...))})
}

To illustrate the idea, let’s take mat_split[[1]], a five-row matrix that contains the data that should eventually become our column names, and convert it to a character vector using col_paste():

col_paste(mat_split[[1]], collapse = " ")
 [1] "  Response  "                      
 [2] "  Topline  "                       
 [3] "Black or  African  American"       
 [4] "Hispanic  or  Latino/​a"            
 [5] "  White  "                         
 [6] "  Female  "                        
 [7] "  Male  "                          
 [8] " Non-  binary "                    
 [9] " Identifies as  transgender "      
[10] "Does not  identify as  transgender"
[11] " 18-  24 "                         
[12] " 25-  39 "                         
[13] " 40-  54 "                         
[14] " 55-  64 "                         
[15] "  65+  "                           

Finally, we can use the row_combine() function below, which takes a list of vectors and combines them into a matrix, one row per vector:

row_combine <- function(x, ...) {
  matrix(unlist(x), nrow = length(x), byrow = TRUE)
}
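To see it in action on a toy input (a quick check, using the function exactly as defined above):

```r
row_combine <- function(x, ...) {
  matrix(unlist(x), nrow = length(x), byrow = TRUE)
}

# Two vectors become the two rows of a 2 x 2 character matrix
row_combine(list(c("a", "b"), c("c", "d")))
#      [,1] [,2]
# [1,] "a"  "b"
# [2,] "c"  "d"
```
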

Equipped with these helpers, the following pipeline takes the raw output mat and converts it into a tibble dat containing the data in the format we want it to be:

dat <- mat |> 
  row_split(groups) |>                         # split into list of matrices
  map(\(x) {col_paste(x, collapse = " ")}) |>  # paste into character vector
  row_combine() |>                             # combine vectors into one matrix
  as_tibble(.name_repair = "minimal") |>       # convert to tibble
  row_to_names(row_number = 1) |>              # use first row as names
  clean_names() |>                             # clean the names
  rename(                                      # shorten some names
    "black" = "black_or_african_american",
    "hispanic" = "hispanic_or_latino_a", 
    "trans" = "identifies_as_transgender",
    "not_trans" = "does_not_identify_as_transgender"
  ) |>
  mutate(across(!response, \(x) {as.numeric(gsub(",", "", x))})) # numeric data

dat
# A tibble: 5 × 15
  response                        topline black hispanic white female  male non_binary trans not_trans x18_24 x25_39 x40_54 x55_64   x65
  <chr>                             <dbl> <dbl>    <dbl> <dbl>  <dbl> <dbl>      <dbl> <dbl>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
1 Yes, I have considered moving …      27    24       28    27     26    20         44    43        24     41     28     18     17    17
2 No, I have not considered movi…      61    57       59    64     60    75         37    40        65     43     60     74     80    70
3 I have already moved out of my…       5     6        4     3      3     3         13     8         4      8      4      2      1     9
4 Not sure                              7    14        8     6     11     3          6     8         7      9      8      7      2     4
5 Weighted N                         1036    93      217   632    426   368        135   166       870    249    425    186     93    83
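This wasn’t needed for the original analysis, but since the table is in wide format, a natural next step would be to pivot the demographic columns into long form with tidyr (a sketch, assuming we first drop the topline column and the weighted-N row, and that dat exists as constructed above):

```r
dat_long <- dat |>
  filter(response != "Weighted N") |>  # drop the sample-size row
  select(-topline) |>                  # keep only the subgroup columns
  pivot_longer(
    cols = !response,
    names_to = "group",
    values_to = "percent"
  )
```
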

Et voilà!

Citation

BibTeX citation:
@online{navarro2023,
  author = {Navarro, Danielle},
  title = {Extracting Tables from Pdf Files with Tabulizer},
  date = {2023-06-16},
  url = {https://blog.djnavarro.net/posts/2023-06-16_tabulizer},
  langid = {en}
}
For attribution, please cite this work as:
Navarro, Danielle. 2023. “Extracting Tables from Pdf Files with Tabulizer.” June 16, 2023. https://blog.djnavarro.net/posts/2023-06-16_tabulizer.