Scraping a website with Akka and JSoup

Do you want to catch up with your competitors by looking at their product listings, generate your own content from someone else’s website, run some analytics, or just scrape a website for the sake of it? Scraping a website is not a rare scenario. The question is how to do it. When we Google “How to crawl a website?”, we find plenty of libraries for various programming languages. These libraries are useful and fun to use, and they do the work for you, but they can be rigid. Don’t you sometimes feel that you could have more control over things? A scraping library might force you to download all the web pages to a database and only then process them. That increases the workload, and those damn web pages sit on your expensive hard disk.

Lately, I have been looking for a faster and less resource-hungry way to scrape and process data from a website. And I found one! My target was a job listing website. I used Akka with JSoup and processed web pages that add up to around 0.5 GB in half an hour at home (with a top internet speed of ~400KBps).

I wanted to scrape all the jobs listed on that job site. To start scraping I needed a starting point, so I gathered my seed links. The homepage could be the topmost link and work as the seed link, but unlike a web crawler I knew what I was going to scrape, so I scraped the topmost pages of the website that divide the jobs into different categories. An example of such a page is the category page on an e-commerce website, like Amazon’s Shop by Category. All the category links on Amazon can be selected with the CSS selector ul.nav_cat li a.nav_a. For my target job site, I used JSoup to scrape the root category web page, without any actors though, to get the seed links I needed.

Once I had my seed links, I saved them in a .csv file, each along with the CSS selector to apply on that seed page to select the next set of HTML elements. For example, if I select Kid’s Watches from Amazon’s category page, then the links of the watches in the listing can be selected with the CSS selector div.a-row div.a-column a.a-link-normal.a-text-normal. Once I get these links, I have access to the actual watch details. The details of each product are stored in its web page in a structured way. Voila! Finally, we have the information we needed.
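A minimal sketch of this selection step in Scala with JSoup might look like the following. The object name LinkSelection, the helper extractLinks and the sample markup are made up for illustration; only the selector ul.nav_cat li a.nav_a comes from the text above.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object LinkSelection {
  // Hypothetical helper: returns the absolute URLs of all links matched by `selector`.
  def extractLinks(html: String, baseUrl: String, selector: String): List[String] = {
    // baseUrl lets JSoup resolve relative hrefs into absolute URLs
    val doc = Jsoup.parse(html, baseUrl)
    doc.select(selector).asScala.toList
      .map(_.absUrl("href"))
      .filter(_.nonEmpty) // absUrl returns "" when the href is missing or cannot be resolved
  }

  // Made-up markup shaped like a category menu
  val sampleHtml: String =
    """<ul class="nav_cat">
      |  <li><a class="nav_a" href="/watches">Watches</a></li>
      |  <li><a class="nav_a" href="/shoes">Shoes</a></li>
      |</ul>""".stripMargin

  val seedLinks: List[String] =
    extractLinks(sampleHtml, "https://www.example.com/", "ul.nav_cat li a.nav_a")
}
```

For a live page, Jsoup.connect(url).get() returns a Document on which the same select call can be used.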

Screenshot: Shop by Category (the categories are the seed links)
Screenshot: The listing of watches (the links in this listing lead to the information for each watch)
Screenshot: Each watch has its information on the final link

Now comes the code. Haven’t we been waiting for that? (Aah! I knew that :D)

The seed links in the CSV file look somewhat like this:
https://www.example.com/x-y--z--a,div.multiColumn.colCount_four a
https://www.example.com/x-y--z--b,div.multiColumn.colCount_four a
https://www.example.com/x-y--z--c,div.multiColumn.colCount_four a
https://www.example.com/x-y--z--d,div.multiColumn.colCount_four a

I have replaced the name of the job website with example, and other obvious words with letters such as x, y, and z. I am just being cautious so that no one puts load on the website with unnecessary scraping. My whole list can be seen on Pastebin.
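Each line of the CSV pairs a seed URL with the CSS selector to apply on that page. A small sketch of how such lines could be parsed is shown here. The names SeedPage and SeedCsv are hypothetical, and the sketch assumes the URL itself contains no comma, so it splits only at the first comma (a CSS selector may legitimately contain commas).

```scala
// Hypothetical container for one CSV row: seed URL plus the selector to apply on it.
final case class SeedPage(url: String, selector: String)

object SeedCsv {
  // Splits at the first comma only, so selectors such as "div.a, div.b" stay intact.
  def parseLine(line: String): Option[SeedPage] =
    line.split(",", 2) match {
      case Array(url, selector) if url.trim.nonEmpty && selector.trim.nonEmpty =>
        Some(SeedPage(url.trim, selector.trim))
      case _ =>
        None // blank or malformed line: skip it
    }

  def parse(lines: Iterable[String]): List[SeedPage] =
    lines.flatMap(parseLine).toList
}
```

Lines that do not have both parts are skipped rather than treated as errors, which is a design choice for this sketch and not necessarily what a production scraper should do.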

Below are the first two actors, which first extract the seed links from the CSV file and then the further hyperlinks available on the seed pages. SeedLinkExtractorActor gets SeedPageCSVDirectory as its very first message. On receiving it, the actor reads the list of seed links and CSS selectors from the CSV file and puts that list into its own queue as another message; it does so by telling itself with self ! readSeedPagesFromCSVDiectory(csvFile). Once the actor gets the list of seed links, it starts getting the links out of the seed pages with the help of child actors. The child actors run on different threads and keep pushing the extracted links to the parent actor. That is, SeedLinkExtractorActor delegates the job of extracting the links from the seed pages to child actors (made with SeedLinkExtractorActorChild.props). Once a child actor has sent the extracted links to the parent actor, it has completed its work and frees its thread to be used by some other actor.
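The parent/child arrangement described above can be sketched roughly as follows with Akka classic actors. This is an illustration under assumptions, not the original code: the actor and method names follow the prose (including its spelling of readSeedPagesFromCSVDiectory), while the message shapes (SeedPage, SeedPages, ExtractedLinks) and the injected fetchLinks and onLinks functions are hypothetical.

```scala
import akka.actor.{Actor, Props}
import scala.io.Source
import scala.util.Using

// Hypothetical message protocol; the names follow the prose, the shapes are assumptions.
final case class SeedPageCSVDirectory(csvFile: String)
final case class SeedPage(url: String, selector: String)
final case class SeedPages(pages: List[SeedPage])
final case class ExtractedLinks(seedUrl: String, links: List[String])

// Child: handles exactly one seed page, reports to its parent, then stops itself.
class SeedLinkExtractorActorChild(fetchLinks: SeedPage => List[String]) extends Actor {
  def receive: Receive = {
    case page: SeedPage =>
      context.parent ! ExtractedLinks(page.url, fetchLinks(page))
      context.stop(self)
  }
}

object SeedLinkExtractorActorChild {
  def props(fetchLinks: SeedPage => List[String]): Props =
    Props(new SeedLinkExtractorActorChild(fetchLinks))
}

// Parent: reads the CSV, then spawns one child per seed page.
class SeedLinkExtractorActor(
    fetchLinks: SeedPage => List[String],
    onLinks: ExtractedLinks => Unit
) extends Actor {

  // Reads "url,selector" lines; splits at the first comma only.
  private def readSeedPagesFromCSVDiectory(csvFile: String): SeedPages =
    Using.resource(Source.fromFile(csvFile)) { src =>
      SeedPages(src.getLines().map(_.split(",", 2)).collect {
        case Array(url, selector) => SeedPage(url.trim, selector.trim)
      }.toList)
    }

  def receive: Receive = {
    case SeedPageCSVDirectory(csvFile) =>
      // Queue the parsed list as a new message to this same actor
      self ! readSeedPagesFromCSVDiectory(csvFile)
    case SeedPages(pages) =>
      pages.foreach { page =>
        context.actorOf(SeedLinkExtractorActorChild.props(fetchLinks)) ! page
      }
    case links: ExtractedLinks =>
      onLinks(links)
  }
}
```

In a real run, fetchLinks could wrap Jsoup.connect(page.url).get().select(page.selector) and collect the href attributes. Passing it in as a function keeps the actors testable without network access. Note that if fetchLinks throws, the default supervision restarts the child and that page's result is simply lost in this sketch.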

There are three more actors, which extract the actual job description information: first they extract the links from the job listings, and then they extract the information from those individual pages.

The second class behaves in much the same manner. It delegates the work to child actors, and when the child actors complete their work, they stop themselves after giving the results back to the parent actor and/or saving them to the database.

P.S.: The code above was written out of curiosity. The emphasis was on putting the actors to work and getting the juice out of my machine. There are many unused variables and lines of code, many long println()’s, and some CSS selectors used directly in the code rather than kept in a configuration file; no specific attention was paid to keeping the code clean. Next time, I’ll keep that in mind.

Where should we go from here? Well, it was quite fascinating to put things to work and watch the terminal fill with those println()’s, but it’s still just the beginning. We have many things left to do. In production we need to watch our actors. We need a feedback mechanism so that the parent actor doesn’t push too much work onto the child actors (or we could hit an out-of-memory error). We also need a fail-safe mechanism so that if the system crashes, the network breaks, or the job site blocks our IP for some time (due to too much traffic from one IP; this happened to me 😛), we can restart from where we left off. This can be done with the help of Akka Persistence.
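One possible shape for the feedback mechanism is a parent that keeps pending URLs in a queue and only starts a new child when an earlier one reports back. The sketch below is one way to do it under assumptions, not a tested production design: the names ThrottledParent, Worker, Enqueue and Done, and the maxInFlight parameter, are all hypothetical.

```scala
import akka.actor.{Actor, Props}
import scala.collection.immutable.Queue

// Hypothetical messages for this sketch
final case class Enqueue(urls: List[String])
final case class Done(url: String)

// Worker: processes one URL, tells the parent it is done, then stops itself.
class Worker(process: String => Unit) extends Actor {
  def receive: Receive = {
    case url: String =>
      process(url)
      context.parent ! Done(url)
      context.stop(self)
  }
}

// Parent: never has more than maxInFlight workers running at once.
class ThrottledParent(maxInFlight: Int, process: String => Unit) extends Actor {
  private var pending = Queue.empty[String]
  private var inFlight = 0

  // Start workers until the limit is reached or the queue is empty
  private def dispatch(): Unit =
    while (inFlight < maxInFlight && pending.nonEmpty) {
      val (url, rest) = pending.dequeue
      pending = rest
      inFlight += 1
      context.actorOf(Props(new Worker(process))) ! url
    }

  def receive: Receive = {
    case Enqueue(urls) =>
      pending = pending ++ urls
      dispatch()
    case Done(_) =>
      // The feedback signal: a finished worker frees one slot
      inFlight -= 1
      dispatch()
  }
}
```

This limits how many pages are being fetched and held in memory at the same time; the queue of pending URLs itself is still unbounded. A worker that throws never sends Done in this sketch, so a real version would also need a supervision strategy or a watch on the children to free the slot. It does not cover resuming after a crash, which is where Akka Persistence would come in.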

Please leave a comment with any feedback, or with anything you would like me to do with actors in the future.


Written by 

Principal Architect at Knoldus Inc

3 thoughts on “Scraping a website with Akka and JSoup”

  1. Always interested in web-scraping projects — I’ve been doing these on and off for twenty years now. Just wanted to comment that (usually) the double consonant gets a short sound — scrapping a web site would mean getting rid of it (e.g. turning it into scrap or junk) — and that you should use the single consonant “scraping” or “scraper” to get the long A sound. Despite being a native speaker, I’m often tripped by these rules and their exceptions, but this one is pretty solid. English is a ridiculous language, to be fair, but this is the difference between “hoping” and “hopping”…
    As for the scraping approach itself, you appear to be looking at a breadth-first search by retrieving all of the seed URLs and then adding the child URLs for each. You may prefer a depth-first approach, so that if the job is interrupted, you at least have complete data on the seed URLs or their children that have been completely scraped. Depth-first is also easier to write in a way that won’t exceed memory/connection limits (in my opinion).

    1. Thank you so much Ken. I’ll update the keyword from scrappy to scrapy (I don’t have the access to make changes after publish, need to ask admin 😛 ). Thanks for pointing that out.
      As for your suggestion about depth-first search, my approach isn’t completely breadth-first. Once I have the seed links, I start moving towards the end link (where my actual information is), and once I reach it I process that information right then and there, and then the thread (which the actor is running on) releases its CPU resources.
      This crawling process can be envisioned as follows: one branch gets created from a seed link to each information page (which can be labeled a leaf page). This branch has many actors in between and so many points of processing. Once the information has been acquired from a leaf page, the link from its parent disappears, and the parent itself vanishes when all the children of that actor have processed the leaf pages they were meant to. After the parent actor (one level above the leaf page) is done with its responsibility, it is also stopped. And in this way all branches go poof from the leaf pages to the seed page links.
      If we ran this same logic on a single thread, it’d be depth-first search. But I should, indeed, make it a complete depth-first scrape; that way I’d be sure which categories I’ve already processed. Hmmm… I’ll make some changes in my code.
      Thank you so much for guiding me 🙂
