Scraping a website with Akka and JSoup

2018-01-10T00:29:14+05:30

Always interested in web-scraping projects — I’ve been doing these on and off for twenty years now. Just wanted to comment that (usually) the double consonant gets a short sound — scrapping a web site would mean getting rid of it (e.g. turning it into scrap or junk) — and that you should use the single consonant “scraping” or “scraper” to get the long A sound. Despite being a native speaker, I’m often tripped by these rules and their exceptions, but this one is pretty solid. English is a ridiculous language, to be fair, but this is the difference between “hoping” and “hopping”…
As for the scraping approach itself, you appear to be looking at a breadth-first search by retrieving all of the seed URLs and then adding the child URLs for each. You may prefer a depth-first approach, so that if the job is interrupted, you at least have complete data on the seed URLs or their children that have been completely scraped. Depth-first is also easier to write in a way that won’t exceed memory/connection limits (in my opinion).

2018-01-10T01:52:53+05:30

Thank you so much Ken. I’ll update the keyword from scrappy to scrapy (I don’t have the access to make changes after publish, need to ask admin 😛 ). Thanks for pointing that out.
For your suggestion about the use of depth first search, my approach isn’t completely breadth first search. Once I’ve the seed links, I start the moving towards the end link (where is my actual information) and once I reach there I process that information right then and there and then that thread(which actor is running on) leaves the CPU resources.
This crawling process can be envisioned as- one branch gets created from seed link to the each information page (which can be labeled as leaf page). This branch has many actors in between and so many points of processing. Once the information has been acquired from leaf page the link from its parent disappears and the parent itself vanishes when all the children of that actor process the leaf pages they are intended to. After the parent actor (one level above the leaf page) is done with its responsibility it’s also stopped. And in this way all branches goes poof from leaf to seed page links.
If we run this same logic on single thread it’d be depth first search. But I should, indeed, make it complete depth first search scraping that way I’d be sure with the categories which I’ve processed already. Hmmm… I’ll make some changes in my code.
Thank you so much for guiding me 🙂

2018-01-10T20:25:52+05:30

Reblogged this on Coding, Unix & Other Hackeresque Things.

	package com.jobllers.service.jobsscrapper.impl.scrapper.actors

	import akka.actor.{Actor, PoisonPill, Props}
	import com.jobllers.service.jobsscrapper.impl.scrapper.actors.JobLinkExtractorActor.JobPageCrawlInfo
	import com.jobllers.service.jobsscrapper.impl.scrapper.actors.SeedLinkExtractorActor.{FileName, SeedPage, SeedPageCSVDirectory, SeedPageOptional}
	import com.jobllers.service.jobsscrapper.impl.scrapper.actors.SeedLinkExtractorActorChild.{GetSeedLinksFromPage, SendSeedLinks}
	import com.jobllers.service.jobsscrapper.impl.scrapper.naukri.Headers
	import com.jobllers.service.jobsscrapper.impl.scrapper.util.RemoteContentPuller
	import org.joda.time.DateTime
	import org.jsoup.Jsoup

	class SeedLinkExtractorActor extends Actor with RemoteContentPuller {

	var numberOfSeedLinks = 0
	var startingMillisec: Long = 0L
	var jobLinkExtractorActorCounter = 0

	override def receive: Receive = {

	case SeedPageCSVDirectory(csvFile: FileName) ⇒ self ! readSeedPagesFromCSVDiectory(csvFile)

	case seedPages: List[SeedPageOptional] ⇒
	var counter = 0
	startingMillisec = DateTime.now().getMillis

	seedPages.foreach { seedPageOpt ⇒
	counter += 1
	context.actorOf(Props(classOf[SeedLinkExtractorActorChild]), s"childOfSeedLinkExtractorActor$counter")
	.tell(GetSeedLinksFromPage(seedPageOpt.optSeedPage.get), self)
	}

	case sentSeedLinks: SendSeedLinks ⇒
	val seedLinks = sentSeedLinks.linkList
	jobLinkExtractorActorCounter += 1
	context.actorOf(JobLinkExtractorActor.props, s"jobLinkExtractorActor$jobLinkExtractorActorCounter")
	.tell(JobPageCrawlInfo(seedLinks, "div.srp_container.fl div.row a.content"), self)
	numberOfSeedLinks += seedLinks.size
	println(s"$numberOfSeedLinks links extracted in ${DateTime.now().getMillis – startingMillisec} milliseconds")

	}

	private def readSeedPagesFromCSVDiectory(fileName: FileName): List[SeedPageOptional] = {
	import scala.io.Source
	val csvDirectorySource = Source.fromFile(fileName.name)
	val linesFromDirectory = csvDirectorySource.getLines().toList
	csvDirectorySource.close()

	linesFromDirectory.map { line =>
	val linkSelector = line.split(",")
	if (linkSelector.size == 2) {
	SeedPageOptional(Some(SeedPage(linkSelector.head, linkSelector(1))))
	} else {
	SeedPageOptional(None)
	}
	}
	}
	}

	object SeedLinkExtractorActor {

	val props = Props(classOf[SeedLinkExtractorActor])

	case class FileName(name: String)

	case class SeedPage(uri: String, selector: String)

	case class SeedPageOptional(optSeedPage: Option[SeedPage])

	case class SeedPageList(seedPages: List[Option[SeedPage]])

	case class SeedPageCSVDirectory(file: FileName)

	}

	class SeedLinkExtractorActorChild extends Actor with RemoteContentPuller {

	import scala.collection.JavaConversions._

	var numberOfSeedLinks = 0

	override def receive: Receive = {
	case GetSeedLinksFromPage(seedPage: SeedPage) ⇒
	val seedLinks = Jsoup.connect(seedPage.uri).headers(Headers.naukariHeaders).timeout(60000).get()
	.select(seedPage.selector)
	.foldLeft(List.empty[String])((listOfLinks, link) ⇒ addVerifiedLinkToList(listOfLinks, link))

	sender() ! SendSeedLinks(seedLinks)
	self ! PoisonPill //Use –> context stop self <– instead
	}
	}

	object SeedLinkExtractorActorChild {

	case class SendSeedLinks(linkList: List[String])

	case class GetSeedLinksFromPage(seedPage: SeedPage)

	}

	package com.jobllers.service.jobsscrapper.impl.scrapper.actors

	import java.io.IOException
	import java.util.concurrent.TimeoutException

	import akka.actor.{Actor, Props}
	import com.jobllers.service.jobsscrapper.impl.scrapper.actors.JobLinkExtractorActor.JobPageCrawlInfo
	import com.jobllers.service.jobsscrapper.impl.scrapper.actors.JobLinkExtractorChildActor.CrawlJobPage
	import com.jobllers.service.jobsscrapper.impl.scrapper.naukri.Headers
	import com.jobllers.service.jobsscrapper.impl.scrapper.util.RemoteContentPuller
	import org.jsoup.Jsoup
	import org.jsoup.nodes.Element

	import scala.util.Try
	import scala.util.control.NonFatal

	class JobLinkExtractorActor extends Actor with RemoteContentPuller {

	var counter = 0

	override def receive: Receive = {

	case jobPagesCrawlInfo: JobPageCrawlInfo ⇒
	jobPagesCrawlInfo.uriList.foreach { jobPageURI ⇒
	counter = counter + 1
	println(s"[~~~~~~~~~~~~~~~~~~~~~~~~~~~] -> Listing will extracted from listing page.")
	context.actorOf(JobLinkExtractorChildActor.props, s"JobLinkExtractorActorChild$counter")
	.tell(CrawlJobPage(jobPageURI, jobPagesCrawlInfo.selector), self)
	}

	}
	}

	object JobLinkExtractorActor {
	val props = Props(classOf[JobLinkExtractorActor])

	case class JobPageCrawlInfo(uriList: List[String], selector: String)

	}


	class JobLinkExtractorChildActor extends Actor with RemoteContentPuller {

	var retryCount = 0

	override def receive: Receive = {
	case CrawlJobPage(uri: String, selector: String) ⇒ getJobLinksFromPage(uri, selector)
	case x: Any ⇒ println(s"Something weird received. $x")
	}

	private def getJobLinksFromPage(uri: String, selector: String): Unit = Try {
	import scala.collection.JavaConversions._
	val page = Jsoup.connect(uri).headers(Headers.naukariHeaders).timeout(60000).get()
	val links = page.select(selector)
	.foldLeft(List.empty[String])((listOfLinks, link) ⇒ addVerifiedLinkToList(listOfLinks, link))

	val nextPageLinkElement: List[Element] = page.select("div.srp_container.fl div.pagination a:last-child").toList

	context.actorOf(JobLinkHandlerActor.props) ! links

	//Start crawling the "NEXT" page if there are more links else kill itself.
	if (nextPageLinkElement.nonEmpty) {
	println(s"-> -> -> -> Next on link found.")
	self ! CrawlJobPage(nextPageLinkElement.last.attr("href"), selector)
	} else {
	println(s". . . . Stopping actor.")
	context stop self
	}
	} recover {
	case NonFatal(_: TimeoutException) ⇒
	if (retryCount < 5) {
	println(s"Couldn't connect to the $uri while trying for 60 seconds. Retrying.")
	self ! CrawlJobPage(uri, selector)
	} else {
	context stop self
	}
	retryCount += 1
	case NonFatal(_: IOException) ⇒
	println(s"Got IOException while extracting job links from $uri")
	context stop self
	}
	}

	object JobLinkExtractorChildActor {
	val props = Props(classOf[JobLinkExtractorChildActor])

	case class CrawlJobPage(uri: String, selector: String)

	}

	class JobLinkHandlerActor extends Actor {
	override def receive = {
	case links: List[String] ⇒
	//Save these links into the database / file
	println(s"*********************** ${links.size} links received successfully ***********************")
	}
	}

	object JobLinkHandlerActor {
	val props = Props(classOf[JobLinkHandlerActor])
	}

High performance systems

Data Engineering, Strategy and Analytics

Intelligence Driven Decisioning - AI/ML

Cloud Engineering

Architecture Strategy, Audit & Academy

Platforms

KDP

KDSP

Products

Premon

Studio9

Tech Hub

Akka

Scala

Rust

Spark

Functional Java

Kafka

Flink

ML/AI

DevOps

Data Warehouse

Travel

Retail

Finance

Healthcare

Media and Publishing

Consumer Internet

Hi-tech & IoT

Case Studies

Blogs

Books

Community

Resources

OS contributions

Webinars

Knolx

Check out our open positions

Services

Go to Overview

Accelerators

Go to Overview

Platforms

Products

TechHub

Industries

Go to Overview

Travel

Insights

Go to Overview

Scraping a website with Akka and JSoup

Share the Knol:

Related

Written by JustinB

3 thoughts on “Scraping a website with Akka and JSoup5 min read”

COMPANY

Sign up to our newsletter

Certificates

Partners

© 2023 Knoldus, Inc. All Rights Reserved.

Part of NashTech

Privacy Policy | Sitemap

Discover more from Knoldus Blogs

Check out our
open positions