Extract Text from PDF file using Selenium Webdriver in Scala


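If you want to verify the content of a PDF file while testing in Scala, you need a separate library, because Selenium WebDriver does not provide a direct method for extracting text from a PDF. The approach here is to let the browser open the PDF, read the file again from the browser's current URL, and parse it with Apache PDFBox.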

First, add the PDFBox dependency to build.sbt (the code below uses the PDFBox 1.8.x API):

libraryDependencies ++= Seq(
  jdbc,
  ws,
  cache,
  "org.apache.pdfbox" % "pdfbox" % "1.8.2"
)

Code to extract text from a PDF file:


package SeleniumTest

import java.io.BufferedInputStream
import java.net.URL
import java.util.concurrent.TimeUnit
import org.apache.pdfbox.pdfparser.PDFParser
import org.apache.pdfbox.util.PDFTextStripper
import org.openqa.selenium.firefox.FirefoxDriver
import org.scalatest.FlatSpec
import play.api.test.FakeApplication
import play.api.test.Helpers.HTMLUNIT
import play.api.test.Helpers.inMemoryDatabase
import play.api.test.Helpers.running
import play.api.test.TestServer
import setup.Testsetup

class Pdfformat extends FlatSpec with Testsetup {

  "Application" should "extract text from a PDF file" in {
    // Start the test server inside the test so it is up while the test body runs.
    running(TestServer(port, FakeApplication(additionalConfiguration = inMemoryDatabase())), HTMLUNIT) { browser =>
      val driver = new FirefoxDriver()
      try {
        driver.manage().window().maximize()
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS)
        driver.get("http://kmmc.in/wp-content/uploads/2014/01/lesson2.pdf")

        // Selenium cannot read the PDF itself, so open a stream on the current URL.
        val url = new URL(driver.getCurrentUrl())
        val fileToParse = new BufferedInputStream(url.openStream())
        try {
          val parser = new PDFParser(fileToParse)
          parser.parse()
          // Keep a single PDDocument reference so the same object is read and closed.
          val document = parser.getPDDocument()
          try {
            val output = new PDFTextStripper().getText(document)
            println(output)
          } finally document.close()
        } finally fileToParse.close()
      } finally driver.quit() // always close the browser, even if parsing fails
    }
  }
}
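
Printing the text is only a first step. To actually verify the PDF content, you will usually want to assert on the extracted string. PDFBox tends to return text with line breaks and spacing that follow the page layout, so an exact string comparison can be brittle. The following is a minimal sketch of a whitespace-tolerant check. The object name PdfTextCheck and its two methods are hypothetical helpers made up for this illustration; they are not part of PDFBox, Selenium, or ScalaTest.

```scala
// Hypothetical helper (not part of any library): whitespace-tolerant text checks.
object PdfTextCheck {

  // Collapse every run of whitespace (spaces, tabs, line breaks) into one space
  // and drop leading and trailing whitespace.
  def normalize(text: String): String =
    text.trim.replaceAll("\\s+", " ")

  // True when `expected` occurs in `extracted`, ignoring differences in
  // whitespace layout between the two strings.
  def containsText(extracted: String, expected: String): Boolean =
    normalize(extracted).contains(normalize(expected))
}
```

Inside the test you could then replace println(output) with something like assert(PdfTextCheck.containsText(output, "some expected phrase")), where the phrase is a placeholder for text you know appears in your own PDF.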
