Tabula: Scraping Table Data From PDF Files

In this blog, you will learn how to scrape tables from PDF files into a pandas DataFrame.

Fetching tables from PDF files is no longer a difficult task; you can do it with a single line of Python.

What you will learn

Installing the tabula-py library.
Importing the library.
Reading a PDF file.
Reading a table on a particular page of a PDF file.
Reading multiple tables on the same page of a PDF file.
Converting PDF files directly to a CSV file.

Tabula

Tabula is a useful package that not only lets you scrape tables from PDF files but also converts a PDF file directly into a CSV file.

So let’s get started…

1. Installing the tabula-py library

pip install tabula-py
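
tabula-py is a Python wrapper around the Java library tabula-java, so a Java runtime has to be installed and reachable on the PATH before the examples below will work. A minimal sketch for checking this up front; the helper name java_available is made up for illustration and is not part of tabula-py:

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    print("Java found; tabula-py should be able to start tabula-java.")
else:
    print("Java not found; install a Java runtime before using tabula-py.")
```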

2. Importing the tabula library

import tabula

3. Reading a PDF file

Let's scrape this PDF into a pandas DataFrame.

The file is read with tabula.read_pdf(), which takes the path to the PDF as its first argument.
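
As a rough sketch of this step, assuming the PDF is the FoodCaloriesList.pdf file used later in this post and that tabula-py 2.0 or later is installed (where read_pdf() returns a list of DataFrames, one per detected table). The helper name read_first_page_tables is hypothetical, and the import sits inside the function only so the snippet can be loaded on a machine where tabula-py is not installed yet:

```python
def read_first_page_tables(pdf_path):
    """Read the tables tabula detects on page 1 of `pdf_path`.

    Assumes tabula-py >= 2.0, where read_pdf() returns a list of
    pandas DataFrames rather than a single DataFrame.
    """
    import tabula  # lazy import: needs tabula-py and a Java runtime

    # pages=1 is the documented default; it is spelled out here for clarity.
    return tabula.read_pdf(pdf_path, pages=1)
```

Calling read_first_page_tables("FoodCaloriesList.pdf") would then give a list, and the first table is its element at index 0.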

4. Reading a Table on a particular page of a PDF File.

Let's say we need to scrape this PDF file, which contains multiple pages.

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='1')
df

The output contains only the content of the first page. To read the second page, change the pages argument:

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2')
df
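
The pages argument is not limited to a single page given as a string: tabula-py also accepts an int, a list of ints, a range string such as '1-2', or 'all'. When it matters which page each table came from, one option is to read the pages one at a time. A small sketch of that idea follows; the function name tables_by_page is an assumption for illustration, and each call starts tabula-java separately, so this is slower than a single call with pages='all':

```python
def tables_by_page(pdf_path, page_numbers):
    """Map each page number to the list of tables read from that page."""
    import tabula  # lazy import: needs tabula-py and a Java runtime

    result = {}
    for page in page_numbers:
        # One read_pdf call per page keeps the page/table association explicit.
        result[page] = tabula.read_pdf(pdf_path, pages=page)
    return result
```

For example, tables_by_page("FoodCaloriesList.pdf", [1, 2]) would return a dictionary with the keys 1 and 2.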

5. What if there are multiple tables on the same page of a PDF file?

Let’s say we need to scrape these two tables, which are on the same page of a PDF file.

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=True)
df

To read multiple tables, we pass an extra parameter, multiple_tables:

multiple_tables = True -> Read multiple tables as independent tables
multiple_tables = False -> Read multiple tables as single table

5.1. Reading Multiple Tables as Independent Tables

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=True)
df[0]

Here df is a list of tables: df[0] is the first table on the page and df[1] is the second.
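
Since the result in this mode is a list of tables, a quick way to see what was extracted is to loop over it and look at each table's shape. A minimal sketch, assuming every item is a pandas DataFrame (or anything else with a .shape attribute); describe_tables is a made-up helper name:

```python
def describe_tables(tables):
    """Return one summary line per table: its index, row count, and column count."""
    lines = []
    for index, table in enumerate(tables):
        rows, cols = table.shape  # DataFrame.shape is (rows, columns)
        lines.append(f"table {index}: {rows} rows x {cols} columns")
    return lines
```

Printing each line of describe_tables(df) makes it easy to spot a table that was split or merged unexpectedly.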

5.2. Reading Multiple Tables as a Single Table

df = tabula.read_pdf("FoodCaloriesList.pdf", pages='2', multiple_tables=False)
df

6. Convert a PDF file directly to a CSV file

We can convert a PDF file containing tabular data directly to a CSV file using the convert_into() function in the tabula library.

6.1. Converting the tables on one page to CSV

tabula.convert_into("FoodCaloriesList.pdf", "output.csv", output_format="csv", pages='1')

Output: a file named output.csv is created.

6.2. Converting all tables to CSV

tabula.convert_into("FoodCaloriesList.pdf", "output.csv", output_format="csv", pages='all')
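
To avoid typing the output name by hand, the CSV path can be derived from the PDF path. A short sketch under that assumption; pdf_to_csv is a hypothetical helper, and only tabula.convert_into() itself comes from the library:

```python
from pathlib import Path


def pdf_to_csv(pdf_path, pages="all"):
    """Convert the tables in `pdf_path` to a CSV file next to it and return its path."""
    import tabula  # lazy import: needs tabula-py and a Java runtime

    # Same file name as the PDF, with the extension swapped to .csv.
    output_path = Path(pdf_path).with_suffix(".csv")
    tabula.convert_into(str(pdf_path), str(output_path),
                        output_format="csv", pages=pages)
    return output_path
```

With this helper, pdf_to_csv("FoodCaloriesList.pdf") would write FoodCaloriesList.csv in the same directory.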

Conclusion

I hope you learned a great way to scrape tables from PDF files using a single line of Python.

Check out my related articles on Python.

Written by Lokesh Kumar

© 2023 Knoldus, Inc. All Rights Reserved.
