In this blog post, you will learn a simple way to scrape tables from PDF files into pandas DataFrames.
Fetching tables from PDF files is no longer a difficult task; you can do it with a single line of Python.
What you will learn
- Installing the tabula-py library.
- Importing the library.
- Reading a PDF file.
- Reading a table on a particular page of a PDF file.
- Reading multiple tables on the same page of a PDF file.
- Converting PDF files directly to a CSV file.
Tabula
Tabula is a useful package that not only lets you scrape tables from PDF files but also converts a PDF file directly into a CSV file.
So let’s get started…
1. Installing the tabula-py library
pip install tabula-py
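tabula-py is a Python wrapper around tabula-java, so a Java runtime has to be installed in addition to the pip package. The sketch below is one possible way to check for it before calling tabula; the helper name `java_available` is hypothetical and not part of tabula-py.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if it is missing.
    return shutil.which("java") is not None


if java_available():
    print("Java found on PATH.")
else:
    print("Java not found on PATH; install a Java runtime before using tabula-py.")
```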
2. Importing the tabula library
import tabula
3. Reading a PDF file
Let’s scrape this PDF into a pandas DataFrame.

Code to read this file:
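A minimal sketch of reading a PDF with `tabula.read_pdf` could look like the following. The helper name `read_tables` and the file name `example.pdf` are assumptions made for illustration. Recent tabula-py releases return a list of DataFrames (one per detected table), while older releases could return a single DataFrame, so the sketch normalizes both cases to a list.

```python
def read_tables(pdf_path, pages=1):
    """Return the tables found in `pdf_path` as a list of pandas DataFrames."""
    # Imported here so the helper can be defined before tabula-py is installed.
    import tabula

    tables = tabula.read_pdf(pdf_path, pages=pages)
    # Wrap a single DataFrame in a list so callers always receive a list.
    if not isinstance(tables, list):
        tables = [tables]
    return tables


# Hypothetical usage (requires tabula-py, Java, and a real PDF file):
# tables = read_tables("example.pdf")
# print(tables[0].head())
```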



4. Reading a table on a particular page of a PDF file
Let’s say we need to scrape this PDF file, which contains multiple pages.
df = tabula.read_pdf("FoodCaloriesList.pdf", pages='1')
df
The output contains only the content of the first page:



df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2')
df
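When several specific pages are needed, the per-page calls above can be wrapped in a loop. The sketch below assumes a hypothetical helper `read_tables_by_page` that keeps the tables of each page separate. Note that each `read_pdf` call starts tabula-java again, so this is convenient rather than fast.

```python
def read_tables_by_page(pdf_path, page_numbers):
    """Map each requested page number to the list of tables found on that page."""
    # Imported here so the helper can be defined before tabula-py is installed.
    import tabula

    tables_by_page = {}
    for page in page_numbers:
        result = tabula.read_pdf(pdf_path, pages=page)
        # Normalize to a list in case a single DataFrame is returned.
        tables_by_page[page] = result if isinstance(result, list) else [result]
    return tables_by_page


# Hypothetical usage (requires tabula-py, Java, and the PDF file):
# tables = read_tables_by_page("FoodCaloriesList.pdf", [1, 2])
# print(tables[2][0])
```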



5. What if there are multiple tables on the same page of a PDF file?
Let’s say we need to scrape these two tables, which are on the same page of a PDF file.



df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=True)
df
output:



To read multiple tables, we need to pass an extra parameter:
multiple_tables=True -> read multiple tables as independent tables
multiple_tables=False -> read multiple tables as a single table
5.1. Reading multiple tables as independent tables
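With `multiple_tables=True`, the result is a list with one DataFrame per detected table. A quick way to see what was extracted is to list the shape of each entry; the helper `summarize_tables` below is a hypothetical name used only for this sketch.

```python
def summarize_tables(tables):
    """Return (index, row_count, column_count) for each DataFrame in `tables`."""
    # DataFrame.shape is a (rows, columns) tuple.
    return [(index, *table.shape) for index, table in enumerate(tables)]


# Hypothetical usage (requires tabula-py, Java, and the PDF file):
# import tabula
# tables = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=True)
# for index, rows, columns in summarize_tables(tables):
#     print(f"table {index}: {rows} rows x {columns} columns")
```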
df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=True)
df[0]
output:



5.2. Reading multiple tables as a single table
df = tabula.read_pdf("FoodCaloriesList.pdf", multiple_tables=False)
df



6. Converting a PDF file directly to a CSV file
We can convert a PDF file containing tabular data directly to a CSV file using the convert_into() function in the tabula library.
6.1. Converting the tables on one page to CSV
tabula.convert_into("FoodCaloriesList.pdf", "output.csv", output_format="csv", pages='1')
output: a file output.csv will be created as:



6.2. Converting all tables to CSV
tabula.convert_into("FoodCaloriesList.pdf", "output.csv", output_format="csv", pages='all')
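If the CSV file should simply sit next to the PDF under the same name, the output path can be derived instead of typed by hand. This is a small sketch under that assumption; `pdf_to_csv` is a hypothetical helper, not a tabula-py function.

```python
from pathlib import Path


def pdf_to_csv(pdf_path, pages="all"):
    """Convert the tables in `pdf_path` to a CSV file with the same base name."""
    # Imported here so the helper can be defined before tabula-py is installed.
    import tabula

    # FoodCaloriesList.pdf -> FoodCaloriesList.csv
    csv_path = Path(pdf_path).with_suffix(".csv")
    tabula.convert_into(str(pdf_path), str(csv_path), output_format="csv", pages=pages)
    return csv_path


# Hypothetical usage (requires tabula-py, Java, and the PDF file):
# print(pdf_to_csv("FoodCaloriesList.pdf"))
```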



Conclusion
I hope you learned a simple way to scrape tables from PDF files using a single line of Python.
Check out my related articles on Python.