Tabula
Tabula is a tool for liberating data tables locked inside PDF files.
How Can Tabula Help Me?
If you've ever tried to do anything with data provided to you in PDFs, you know how painful it is: there's no easy way to copy and paste rows of data out of PDF files. Tabula allows you to extract that data into a CSV file or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.
Who Uses Tabula?
Tabula is used to power investigative reporting at news organizations of all sizes, including ProPublica, The Times of London, Foreign Policy, La Nacion (Argentina), The New York Times and the St. Paul (MN) Pioneer Press.
Grassroots organizations like SchoolCuts.org rely on Tabula to turn clunky documents into human-friendly public resources.
And researchers of all kinds use Tabula to turn PDF reports into Excel spreadsheets, CSVs, and JSON files for use in analysis and database applications.
How to Use Tabula
Start the local application
cd ~/bin/tabula
java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar tabula.jar
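In the command above, `-Dfile.encoding=utf-8` sets the JVM's default text encoding, while `-Xms256M` and `-Xmx1024M` set the initial and maximum heap sizes. The two steps could be wrapped in a small launcher function, sketched below. The names `TABULA_DIR`, `TABULA_XMS`, `TABULA_XMX`, `DRY_RUN` and `start_tabula` are assumptions made for this illustration and are not part of Tabula itself.

```shell
#!/bin/sh
# Hypothetical settings; override them in the environment as needed.
TABULA_DIR="${TABULA_DIR:-$HOME/bin/tabula}"  # install location used in the steps above
TABULA_XMS="${TABULA_XMS:-256M}"              # initial JVM heap size (-Xms)
TABULA_XMX="${TABULA_XMX:-1024M}"             # maximum JVM heap size (-Xmx)

start_tabula() {
    # Build the same java command line shown above, with configurable heap sizes.
    set -- java -Dfile.encoding=utf-8 "-Xms${TABULA_XMS}" "-Xmx${TABULA_XMX}" -jar tabula.jar
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command so it can be checked before running it.
        printf '%s\n' "$*"
        return 0
    fi
    # Run from the install directory, in the foreground.
    cd "$TABULA_DIR" && "$@"
}
```

Calling `start_tabula` starts the server. Setting `DRY_RUN=1` first prints the command instead, which is handy when a large PDF needs a bigger value for `TABULA_XMX`.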
Launch the local website
- Point your browser to http://localhost:8080/
Once you do this, you're able to interact with the graphical browser-based version of Tabula. You can also use the command-line version, tabula-java [1].
- Upload a PDF file containing a data table.
- Browse to the page you want, then select the table by clicking and dragging to draw a box around it. (Note that Tabula auto-scans the document, so selecting the table by hand may not even be necessary.)
- Click "Preview & Export Extracted Data". ...
- Click the "Export" button.
Now you can work with your data as a text file or a spreadsheet rather than a PDF!
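For repeated jobs, the same upload, select and export steps can be approximated with the command-line tabula-java mentioned earlier. The sketch below relies on its `--pages`, `--format` and `--outfile` options. The jar location `TABULA_JAVA_JAR`, the `DRY_RUN` switch and the function name `pdf_to_csv` are assumptions for illustration, so adjust the path to wherever the tabula-java jar was actually downloaded.

```shell
#!/bin/sh
# Hypothetical jar location; tabula-java release jars usually carry a version in the name.
TABULA_JAVA_JAR="${TABULA_JAVA_JAR:-$HOME/bin/tabula/tabula-java.jar}"

pdf_to_csv() {
    # $1 = input PDF, $2 = output CSV (defaults to the PDF name with a .csv extension)
    pdf=$1
    csv=${2:-${pdf%.pdf}.csv}
    # Extract tables from every page and write them out as CSV.
    set -- java -jar "$TABULA_JAVA_JAR" --pages all --format CSV --outfile "$csv" "$pdf"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command so it can be checked before running it.
        printf '%s\n' "$*"
        return 0
    fi
    "$@"
}
```

For example, `pdf_to_csv report.pdf` would write `report.csv` next to the input file. Unlike the browser interface, this version does not let you draw a selection box, so the output may need more cleanup.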