Tabula: Scraping Table Data From PDF Files


In this blog, you will learn how to scrape tables from PDF files into a pandas DataFrame.

Fetching tables from PDF files is no longer a difficult task: you can do it with a single line of Python.

What you will learn

  1. Installing the tabula-py library.
  2. Importing the library.
  3. Reading a PDF file.
  4. Reading a table on a particular page of a PDF file.
  5. Reading multiple tables on the same page of a PDF file.
  6. Converting a PDF file directly to a CSV file.

Tabula

Tabula is a useful package that not only lets you scrape tables from PDF files but can also convert a PDF file directly into a CSV file.

So let’s get started…

1. Installing the tabula-py library

pip install tabula-py
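
tabula-py is a Python wrapper around tabula-java, so a Java runtime has to be available on the machine as well. The check below is a small sketch, not part of the original walkthrough, that looks for a java executable on the PATH before any PDF is read. The helper name java_available is made up for this example.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    print("Java found: tabula-py should be able to start tabula-java.")
else:
    print("Java not found: install a Java runtime before using tabula-py.")
```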

2. Importing the tabula library

import tabula

3. Reading a PDF file

Let's scrape this PDF into a pandas DataFrame.

Code to read this file:
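
A minimal sketch of such a read, assuming the file is named FoodCaloriesList.pdf as in the later steps. The helper read_tables and its path check are illustrative additions; only tabula.read_pdf comes from the library.

```python
from pathlib import Path


def read_tables(pdf_path, pages=1):
    """Read tables from a PDF with tabula-py and return whatever read_pdf returns.

    The path is checked first so that a typo in the file name produces a
    clear FileNotFoundError instead of an error from the Java process.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")

    # Imported here so the path check above works even without tabula-py.
    import tabula

    return tabula.read_pdf(str(path), pages=pages)
```

Calling read_tables("FoodCaloriesList.pdf") would then read page 1, which is the default for the pages argument. In recent tabula-py releases read_pdf returns a list of DataFrames, one per detected table, while older releases returned a single DataFrame by default, so it is worth checking type(result) on your installed version.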

4. Reading a table on a particular page of a PDF file

Let's say we need to scrape this PDF file, which contains multiple pages.

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='1')
df

The output contains only the content of the first page. To read the second page instead, change the pages argument:

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2')
df
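
When several pages are needed separately, the same call can be repeated per page. The sketch below collects the results in a dictionary keyed by page number. The helper read_tables_by_page and its reader parameter are assumptions made for this example; reader defaults to tabula.read_pdf and exists only so the function can be tried out without a PDF at hand.

```python
def read_tables_by_page(pdf_path, page_numbers, reader=None):
    """Return a dict mapping each page number to the tables read from it."""
    if reader is None:
        # Imported lazily so a stand-in reader can be used without tabula-py.
        import tabula

        reader = tabula.read_pdf

    # One read_pdf call per page, so each page's result stays separate.
    return {page: reader(pdf_path, pages=page) for page in page_numbers}
```

For example, read_tables_by_page("FoodCaloriesList.pdf", [1, 2]) would give one entry for page 1 and one for page 2. The pages argument also accepts a string such as '1-2,3' or 'all', or a list of page numbers, if a single combined result is enough.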

5. What if there are multiple tables on the same page of a PDF file?

Let’s say we need to scrape these two tables, which are on the same page of a PDF file.

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=True)
df

To read multiple tables, we need to pass an extra parameter, multiple_tables:

multiple_tables=True -> read multiple tables as independent tables
multiple_tables=False -> read multiple tables as a single table

5.1. Reading multiple tables as independent tables

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=True)
df[0]

Here df is a list of DataFrames, so df[0] is the first table on the page and df[1] is the second.
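
To get a quick overview of everything that was extracted, you can loop over the returned list. The following is a small sketch; describe_tables is a hypothetical helper that only relies on each table having a .shape attribute, as pandas DataFrames do.

```python
def describe_tables(tables):
    """Return one summary line per table, based on each table's .shape."""
    lines = []
    for index, table in enumerate(tables):
        rows, columns = table.shape
        lines.append(f"table {index}: {rows} rows x {columns} columns")
    return lines
```

With the list from the call above, print("\n".join(describe_tables(df))) would print one line per table found on page 2, which makes it easy to spot a table that was split or merged unexpectedly.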

5.2. Reading multiple tables as a single table

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=False)
df

6. Convert a PDF file directly to a CSV file

We can convert a PDF file containing tabular data directly to a CSV file using the convert_into() function in the tabula library.

1. Converting the tables on one page to CSV
tabula.convert_into("FoodCaloriesList.pdf", "output.csv", output_format="csv", pages='1')

Output: a file named output.csv is created.

2. Converting all tables to CSV
tabula.convert_into("FoodCaloriesList.pdf", "output.csv", output_format="csv", pages='all')
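
After a conversion it helps to look at the first few rows of the result. The sketch below derives the CSV name from the PDF name, runs the conversion, and previews the file with Python's csv module. The helpers csv_path_for, convert_pdf_to_csv and preview_csv are made up for this example, and reading the file as UTF-8 is an assumption about the output encoding.

```python
import csv
from pathlib import Path


def csv_path_for(pdf_path):
    """Return the PDF's path with its extension replaced by .csv."""
    return str(Path(pdf_path).with_suffix(".csv"))


def convert_pdf_to_csv(pdf_path, pages="all"):
    """Convert the tables on the given pages to a CSV next to the PDF."""
    # Imported lazily so the other helpers work without tabula-py installed.
    import tabula

    output_path = csv_path_for(pdf_path)
    tabula.convert_into(pdf_path, output_path, output_format="csv", pages=pages)
    return output_path


def preview_csv(csv_path, max_rows=5):
    """Return up to max_rows rows of a CSV file as lists of strings."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # zip stops as soon as range(max_rows) is exhausted.
        return [row for _, row in zip(range(max_rows), reader)]
```

For example, preview_csv(convert_pdf_to_csv("FoodCaloriesList.pdf")) would write FoodCaloriesList.csv and return its first five rows. Besides "csv", convert_into also accepts "json" and "tsv" as output_format.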

Conclusion

I hope you learned a handy way to scrape tables from PDF files using a single line of Python.

Check out my related articles on Python.

Written by 

Lokesh Kumar is an intern in the AI/ML studio at Knoldus. He is passionate about artificial intelligence and machine learning, with knowledge of C, C++, Python, data analytics, and more. He is recognised as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner and is curious about new technologies.
