Do you want to keep up with your competitors by looking at their product listings, generate your own content from someone else’s website, run some analytics, or just scrape a website for the sake of it? Website scraping is not a rare scenario. The question is how to scrape a website. When we Google “How to crawl a website?”, we find plenty of libraries for various programming languages. These libraries are useful, fun to use, and do the work for you, but they can be rigid. Don’t you sometimes feel that you could have more control over things? A scraping library might force you to download all the web pages to a database before processing them. That increases the workload, and those damn web pages sit on your expensive hard disk.
Lately, I have been looking for a faster and less resource-hungry way to scrape and process data from a website. And I found one! My target was a job listing website. I used Akka with JSoup and processed web pages adding up to around 0.5 GB in half an hour at home (with a top internet speed of ~400 KB/s). I wanted to scrape all the jobs listed on that job site. To start scraping I needed a starting point, so I gathered my seed links. The homepage could be the topmost link and work as the seed link, but unlike a general web crawler, I knew what I was going to scrape, so I started from the top-level pages of the website that divide the jobs into categories. An example of such a page is the category page on an e-commerce website, like Amazon’s Shop by Category. All of Amazon’s category links can be selected with the CSS selector
ul.nav_cat li a.nav_a
For my target job site, I used JSoup (without any actors at this stage) to scrape the root category page and give me the seed links I needed. Once I had my seed links, I saved them in a .csv file, each along with the CSS selector to apply on that seed page to select the next set of HTML elements. For example, if I pick Kid’s Watches from Amazon’s category page, the links to the watches in the listing can be selected with the CSS selector
div.a-row div.a-column a.a-link-normal.a-text-normal
Once I have these links, I have access to the actual watch details. The product details are themselves stored in web pages in a structured way. Voila! Finally, we have the information we needed.
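To make the selector step concrete, here is a minimal sketch of pulling links out of a page with JSoup from Scala. It is not the code from this post: the LinkExtractor object, the inline HTML and the example.com base URL are made up for illustration, and it assumes Scala 2.13 with JSoup on the classpath.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical helper, not taken from the original code.
object LinkExtractor {
  /** Returns the absolute URLs of all anchors matched by `cssSelector`. */
  def extractLinks(html: String, baseUri: String, cssSelector: String): List[String] =
    Jsoup.parse(html, baseUri)
      .select(cssSelector)          // all elements matching the CSS selector
      .asScala
      .map(_.absUrl("href"))        // resolve relative hrefs against baseUri
      .filter(_.nonEmpty)           // absUrl returns "" when no URL can be built
      .distinct
      .toList
}

object LinkExtractorDemo extends App {
  // Made-up markup that only mimics the shape of a category list.
  val html =
    """<ul class="nav_cat">
      |  <li><a class="nav_a" href="/watches">Watches</a></li>
      |  <li><a class="nav_a" href="/toys">Toys</a></li>
      |  <li><a class="other" href="/skip">Skip</a></li>
      |</ul>""".stripMargin

  LinkExtractor
    .extractLinks(html, "https://www.example.com", "ul.nav_cat li a.nav_a")
    .foreach(println)
}
```

For a live page, the document would come from Jsoup.connect(url).get() instead of Jsoup.parse; the selection step stays the same.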
Now comes the code. Haven’t we been waiting for that? (Aah! I knew that :D)
The seed links in the CSV file look somewhat like this:
I have replaced the name of the job website with example, and other obvious words with letters such as x, y, and z. I am just being cautious so that nobody puts load on the website with unnecessary scraping. My whole list can be seen on Pastebin.
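Reading such a file back needs very little code. The sketch below assumes a layout of one seed page per line, with the URL first, then a comma, then the CSS selector. That layout, the SeedPage case class and the SeedCsv object are my assumptions for illustration, not necessarily what the actual file or code looked like.

```scala
// Hypothetical model of one CSV row: a seed URL plus the selector to apply on it.
final case class SeedPage(url: String, linkSelector: String)

object SeedCsv {
  /** Parses "url,selector". Splits on the first comma only, so a selector
    * that itself contains commas stays intact. Returns None for bad rows. */
  def parseLine(line: String): Option[SeedPage] =
    line.split(",", 2).map(_.trim) match {
      case Array(url, selector) if url.nonEmpty && selector.nonEmpty =>
        Some(SeedPage(url, selector))
      case _ => None
    }

  /** Parses all lines, silently skipping blank or malformed ones. */
  def parse(lines: Iterator[String]): List[SeedPage] =
    lines.flatMap(parseLine).toList
}

object SeedCsvDemo extends App {
  // Made-up rows in the assumed layout.
  val rows = Iterator(
    "https://www.example.com/x, div.y a.z",
    "",
    "https://www.example.com/y, ul.z li a, ul.x li a"
  )
  SeedCsv.parse(rows).foreach(println)
}
```

With a real file, the iterator could come from scala.io.Source.fromFile(csvFile).getLines().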
Below are the first two actors, which first extract the seed links from the CSV file and then the further hyperlinks available on the seed pages. SeedLinkExtractorActor receives SeedPageCSVDirectory as its first message. In response, the actor puts the list of seed links and CSS selectors from the CSV file into its own queue as another message; it does so by telling itself with self ! readSeedPagesFromCSVDiectory(csvFile). Once the actor has the list of seed links, it starts getting the links out of the seed pages with the help of child actors. The child actors run on different threads and keep pushing the extracted links to the parent actor. That is, SeedLinkExtractorActor delegates the job of extracting links from the seed pages to child actors (made with SeedLinkExtractorActorChild.props). Once a child actor has sent its extracted links to the parent actor, it has completed its work and releases its thread for some other actor to use.
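The delegation pattern described above can be sketched with classic Akka actors roughly as follows. This is a simplified stand-in, not the actual SeedLinkExtractorActor code: LinkParent, LinkChild, the message types and the fetchLinks function are hypothetical names, and the page fetching is passed in as a plain function so the sketch stays independent of JSoup.

```scala
import akka.actor.{Actor, ActorRef, Props}

// Hypothetical message types for this sketch (not the post's actual classes).
final case class SeedPage(url: String, linkSelector: String)
final case class SeedPages(pages: List[SeedPage])
final case class ExtractedLinks(seedUrl: String, links: List[String])

// Child: handles exactly one seed page, reports to its parent, then stops,
// which frees its thread for other actors.
class LinkChild(fetchLinks: SeedPage => List[String]) extends Actor {
  def receive: Receive = {
    case page: SeedPage =>
      context.parent ! ExtractedLinks(page.url, fetchLinks(page))
      context.stop(self)
  }
}

object LinkChild {
  def props(fetchLinks: SeedPage => List[String]): Props =
    Props(new LinkChild(fetchLinks))
}

// Parent: spawns one child per seed page and forwards each result
// to `collector` (for example, the actor handling the next stage).
class LinkParent(fetchLinks: SeedPage => List[String], collector: ActorRef) extends Actor {
  def receive: Receive = {
    case SeedPages(pages) =>
      pages.foreach { page =>
        context.actorOf(LinkChild.props(fetchLinks)) ! page
      }
    case result: ExtractedLinks =>
      collector ! result
  }
}

object LinkParent {
  def props(fetchLinks: SeedPage => List[String], collector: ActorRef): Props =
    Props(new LinkParent(fetchLinks, collector))
}
```

In a real run, fetchLinks would be the JSoup call that downloads page.url and applies page.linkSelector. Note that this sketch starts a child for every seed page at once, which is fine for a short seed list but not for thousands of pages.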
There are three more actors which extract the actual job description information: first they extract the links from the job listings, and then they extract the information from those individual pages.
The second class behaves in much the same manner. It delegates the work to child actors, and when the child actors complete their work, they stop themselves after giving the results back to the parent actor and/or saving them to the database.
P.S.: The code above was written out of curiosity. The emphasis was on putting the actors to work and getting the most juice out of my machine. There are many unused variables and lines of code, many long println()’s, and some CSS selectors used directly in the code rather than kept in a configuration file. No specific attention was paid to keeping the code clean. Next time, I’ll keep that in mind.
Where should we go from here? Well, it was quite fascinating to put things to work and watch the terminal fill with those println()’s, but it’s still just the beginning. There are many things left to do. In production we need to watch our actors. We need a feedback mechanism so that the parent actor doesn’t push too much work onto the child actors (or we could run out of memory). We also need a fail-safe mechanism so that if the system crashes, the network breaks, or the job site blocks our IP for some time (due to too much traffic from one IP, this happened to me 😛), we can resume from where we left off. This can be done with the help of Akka Persistence.
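One simple form such a feedback mechanism could take is a cap on the number of children working at the same time, with the remaining URLs waiting in a queue inside the parent. The sketch below is only an idea of how that might look with classic Akka, not something I ran against the job site; ThrottledParent, PageWorker, Enqueue, PageDone and the process function are all hypothetical names.

```scala
import akka.actor.{Actor, Props}
import scala.collection.immutable.Queue

// Hypothetical messages for this sketch.
final case class Enqueue(urls: List[String])
final case class PageDone(url: String)

// Worker: processes one URL, tells the parent it is done, then stops.
class PageWorker(process: String => Unit) extends Actor {
  def receive: Receive = {
    case url: String =>
      process(url)
      context.parent ! PageDone(url)
      context.stop(self)
  }
}

// Parent: never has more than `maxInFlight` workers alive at once.
class ThrottledParent(maxInFlight: Int, process: String => Unit) extends Actor {
  private var pending = Queue.empty[String] // URLs waiting for a free slot
  private var inFlight = 0                  // workers currently running

  def receive: Receive = {
    case Enqueue(urls) =>
      pending = pending ++ urls
      dispatch()
    case PageDone(_) =>
      inFlight -= 1   // a slot became free
      dispatch()
  }

  // Start workers until the cap is reached or the queue is empty.
  private def dispatch(): Unit =
    while (inFlight < maxInFlight && pending.nonEmpty) {
      val (url, rest) = pending.dequeue
      pending = rest
      inFlight += 1
      context.actorOf(Props(new PageWorker(process))) ! url
    }
}
```

The PageDone message is the feedback signal: new work starts only when earlier work has finished. A worker that throws never sends PageDone, so its slot is never freed; a production version would need supervision or death watch to handle that case.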
Please leave a comment if you have any feedback, or if there is anything you would like me to do with actors in the future.