We needed to handle large data files, reaching gigabytes in size, in a Scala-based Akka application of ours. We are interested in reading data from the file and then operating on it. In our application a single line in the file forms the unit of data to be worked upon, which means we operate on the big data file line by line.
These are the requirements.
- We cannot read an entire file in memory
- A line in a file is a unit of data to be processed
- It is OK to lose a bit of data in each line during big data processing
- We need to implement it in such a way that it is optimal
An example big data file is simply a text file with one record per line. It can potentially reach gigabytes of data and more, and we need to process it.
We decided to use the RandomAccessFile API from Java. It provides the capability to seek to a particular location in a file and read bytes from that location onwards. Two API calls in particular are important: seek(long pos), which moves the file pointer to the given byte offset, and read(byte[] b), which reads bytes from that offset into a buffer.
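The original listing isn't preserved here; below is a minimal sketch of how these two calls are used. The names readChunk, seekIndex, and byteArray are illustrative, not from the original code.

```scala
import java.io.RandomAccessFile

object RandomAccessRead {
  // Seek to seekIndex in the file and read up to chunkSize bytes from there
  def readChunk(filePath: String, seekIndex: Long, chunkSize: Int): Array[Byte] = {
    val randomAccessFile = new RandomAccessFile(filePath, "r")
    try {
      // Move the file pointer to the desired byte offset
      randomAccessFile.seek(seekIndex)
      val byteArray = new Array[Byte](chunkSize)
      // read returns the number of bytes actually read, or -1 at end of file
      val bytesRead = randomAccessFile.read(byteArray)
      if (bytesRead <= 0) Array.emptyByteArray
      else byteArray.take(bytesRead)
    } finally randomAccessFile.close()
  }
}
```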
We still need to figure out where to seek, i.e. the seekIndex, and the size of the byte array to read. Most importantly, we need to figure out how to split the bytes read into lines.
The way to determine the seekIndex is to take the number of bytes in the big data file and divide it into chunks of the size we want to read in one go. There is a defaultBlockSize constant, and we divide the entire file into chunks of this defaultBlockSize. The seekIndex is then a multiple of defaultBlockSize, and the number of seeks equals the number of chunks.
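The original listing for this arithmetic isn't preserved; here is a sketch, assuming a 1 MB defaultBlockSize (the helper names are illustrative):

```scala
import java.io.File

object ChunkCalculator {
  // Assumed block size of 1 MB; the constant name defaultBlockSize is from the post
  val defaultBlockSize: Long = 1024 * 1024

  // Number of chunks, rounded up so a trailing partial block gets its own chunk
  def numberOfChunks(filePath: String): Long = {
    val fileSize = new File(filePath).length()
    (fileSize + defaultBlockSize - 1) / defaultBlockSize
  }

  // The seek offset for a given chunk is a multiple of defaultBlockSize
  def seekIndex(chunkIndex: Long): Long = chunkIndex * defaultBlockSize
}
```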
For a 40 MB file and a defaultBlockSize of 1 MB, numberOfChunks will be 40. This means we will read a 1 MB chunk of bytes forty times and operate on each one. To operate on each line we have to extract the lines from the byte buffer we read. In this case we would like to do a simple operation on each line extracted; for simplicity's sake, we do a word count on each line extracted from the chunk of data read from the file.
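The line extraction and per-line word count can be sketched as follows (the helper names are mine, not from the original listing). A line cut at a chunk boundary loses a few characters, which the requirements explicitly allow:

```scala
object LineOps {
  // Split a chunk of bytes into lines; lines cut at chunk boundaries
  // lose a few characters, which is acceptable per the requirements
  def extractLines(bytes: Array[Byte]): Seq[String] =
    new String(bytes, "UTF-8").split("\n").toSeq

  // The simple per-line operation: a word count
  def wordCount(line: String): Int =
    line.split("\\s+").count(_.nonEmpty)
}
```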
Now putting it all together, this is the complete code listing:
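The original complete listing isn't reproduced here; below is a hedged reconstruction of the standalone version under the assumptions above (1 MB blocks, newline-delimited records). The class name LameBigDataProcessor is taken from later in the post; everything else is illustrative:

```scala
import java.io.RandomAccessFile

// Hedged reconstruction of the standalone (non-Akka) version
object LameBigDataProcessor {
  val defaultBlockSize: Int = 1024 * 1024 // assumed 1 MB block size

  def wordCount(line: String): Int = line.split("\\s+").count(_.nonEmpty)

  // Read one chunk, extract its lines, and return the total word count
  def processChunk(filePath: String, chunkIndex: Long): Long = {
    val raf = new RandomAccessFile(filePath, "r")
    try {
      raf.seek(chunkIndex * defaultBlockSize)
      val byteArray = new Array[Byte](defaultBlockSize)
      val bytesRead = raf.read(byteArray)
      if (bytesRead <= 0) 0L
      else new String(byteArray, 0, bytesRead, "UTF-8")
             .split("\n").map(wordCount(_).toLong).sum
    } finally raf.close()
  }

  def main(args: Array[String]): Unit = {
    val filePath = args(0)
    val fileSize = new java.io.File(filePath).length()
    val totalChunks = (fileSize + defaultBlockSize - 1) / defaultBlockSize
    val start = System.currentTimeMillis()
    val totalWords = (0L until totalChunks).map(processChunk(filePath, _)).sum
    println(s"Words: $totalWords, elapsed: ${System.currentTimeMillis() - start} ms")
  }
}
```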
Above is an implementation that operates on a large file by extracting lines from each chunk of data read and performing a simple word count on each line. It meets the first three requirements outlined at the beginning, but not the last one of making it optimal. Now, let's use some Akka code to speed up our application.
Akka uses actors, and with actors we can do parallel processing to speed up performance. Let's use a FileWorker actor to read data from the file and operate on it. We need to pass the FileWorker the big data file path, the chunk index, and the total number of chunks.
Let’s look at the code listing of the FileWorker Actor and the message scala case class.
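The original listing isn't preserved; here is a hedged sketch of the FileWorker actor and the BigDataMessage case class using classic Akka actors (Akka is an external dependency, so add it to the build before compiling). The ChunkProcessed completion message is my own addition so that processing time can be reported back:

```scala
import java.io.RandomAccessFile
import akka.actor.Actor

// The message scala case class: file path, chunk index, and total number of chunks
case class BigDataMessage(bigdataFilePath: String, chunkIndex: Int, totalChunks: Int)

// Hypothetical completion message (not from the post) so a diagnostics
// actor can learn when each chunk has been processed
case class ChunkProcessed(chunkIndex: Int, totalChunks: Int)

class FileWorker extends Actor {
  val defaultBlockSize: Int = 1024 * 1024 // assumed 1 MB block size

  def wordCount(line: String): Int = line.split("\\s+").count(_.nonEmpty)

  // Same chunk-reading and line-extraction logic as the standalone version
  def processChunk(path: String, chunkIndex: Int): Unit = {
    val raf = new RandomAccessFile(path, "r")
    try {
      raf.seek(chunkIndex.toLong * defaultBlockSize)
      val byteArray = new Array[Byte](defaultBlockSize)
      val bytesRead = raf.read(byteArray)
      if (bytesRead > 0)
        new String(byteArray, 0, bytesRead, "UTF-8").split("\n").foreach(wordCount)
    } finally raf.close()
  }

  def receive = {
    case BigDataMessage(path, chunkIndex, totalChunks) =>
      processChunk(path, chunkIndex)
      // Reply so the sender (e.g. a diagnostics actor) can track completion
      sender() ! ChunkProcessed(chunkIndex, totalChunks)
  }
}
```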
When the FileWorker actor receives a BigDataMessage, it reads the lines and then does a word count on each line extracted. The code for reading lines and doing a word count is the same as in the previous version.
To measure performance, i.e. the time elapsed, we also need a Diagnostics actor. It will calculate the total time elapsed for processing the big data file.
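The original Diagnostics listing isn't preserved; here is a hedged sketch. It assumes the workers report back with a ChunkProcessed message, and the Start message is likewise an illustrative name:

```scala
import akka.actor.Actor

// Illustrative messages: Start records the start time, ChunkProcessed marks progress
// (ChunkProcessed is repeated here so this listing is self-contained)
case class Start(totalChunks: Int)
case class ChunkProcessed(chunkIndex: Int, totalChunks: Int)

class Diagnostics extends Actor {
  var startMillis: Long = 0L
  var remaining: Int = 0

  def receive = {
    case Start(totalChunks) =>
      // Record the wall-clock start of the whole run
      startMillis = System.currentTimeMillis()
      remaining = totalChunks
    case ChunkProcessed(_, _) =>
      remaining -= 1
      if (remaining == 0) {
        // All chunks done: report total elapsed time in milliseconds
        println(s"Big data file processed in ${System.currentTimeMillis() - startMillis} ms")
        context.system.terminate()
      }
  }
}
```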
For executing the big data processing we now need a BigDataProcessor. It is responsible for starting the FileWorker actors, a Diagnostics actor, and a round-robin router, and for sending messages of type BigDataMessage.
Below is the code listing for the BigDataProcessor.
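As the original listing isn't preserved, here is a hedged sketch of the BigDataProcessor, assuming the FileWorker and Diagnostics actors and the BigDataMessage case class described above; a RoundRobinPool plays the role of the cyclic router:

```scala
import akka.actor.{ActorSystem, Props}
import akka.routing.RoundRobinPool

object BigDataProcessor {
  val defaultBlockSize: Long = 1024 * 1024 // assumed 1 MB, as in the earlier example

  def main(args: Array[String]): Unit = {
    val bigdataFilePath = args(0)
    val fileSize = new java.io.File(bigdataFilePath).length()
    val totalChunks = ((fileSize + defaultBlockSize - 1) / defaultBlockSize).toInt

    val system = ActorSystem("bigdata")
    val diagnostics = system.actorOf(Props[Diagnostics](), "diagnostics")
    // Round-robin ("cyclic") pool of FileWorker actors, sized to the available cores
    val router = system.actorOf(
      RoundRobinPool(Runtime.getRuntime.availableProcessors()).props(Props[FileWorker]()),
      "fileWorkers")

    diagnostics ! Start(totalChunks)
    for (i <- 0 until totalChunks)
      // Send with diagnostics as the sender so workers reply to it per finished chunk
      router.tell(BigDataMessage(bigdataFilePath, i, totalChunks), diagnostics)
  }
}
```

Sending the messages with the Diagnostics actor as the sender is one way to work around the fact that `!` returns immediately: the elapsed time is measured from the completion replies, not from the send loop.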
Combining the pieces here is a complete code listing for this Application.
To execute the application, run both versions as Scala applications, either from Eclipse or from the terminal. The codebase is also on my GitHub; it is an sbt-based project. After you have downloaded the code, execute the following command in the terminal.
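The exact command from the repository isn't preserved in this post; for a standard sbt project it would look something like the following (the file-path argument is an assumption, not from the post):

```shell
# from the project root: compile and run the Akka-based version
sbt "runMain BigDataProcessor /path/to/bigdatafile.txt"
```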
When you run both versions you will notice that as the big data file size increases, the performance of the Akka-based version starts becoming better than that of the standalone version. There is a LameBigDataProcessor as well; you can run it with the following command in the terminal.
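Again hedged, since the original command isn't preserved; LameBigDataProcessor is the class named in the post, while the argument is an assumption:

```shell
# run the standalone (non-Akka) version for comparison
sbt "runMain LameBigDataProcessor /path/to/bigdatafile.txt"
```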
Here is my run of both of these versions. Compare their elapsed timings; they are in milliseconds.
Akka's massively scalable actor-based architecture enables parallelism, scaling up the system through optimal use of the cores on a single box.
7 thoughts on “Handling Large Data File Using Scala and Akka”
Isn’t doing blocking I/O inside actors considered a bad practice (see for example http://stackoverflow.com/questions/9485039/blocking-io-in-akka)? How about scanning over the file sequentially, while sending a message for each chunk read (or even line read) to the Actor pool to be processed?
I think you could end up with the best of both worlds: a single thread doing sequential I/O on a large file, while the data read is processed asynchronously and concurrently.
Hi Sander Mak,
Thanks for your suggestion of doing sequential reads and passing the data around; in fact, that was our first approach as well. In our situation the file resides on a shared disk, so we would have to move large chunks of data over the network. We noticed massive degradation of performance as the message payload size increased, and hence stayed away from it.
Thank you for sharing this great topic. Sir, I have a question which might not sound okay to ask, but I am really wondering how to get this code written in Java. I have never used Scala or Akka before; I am a fresh student learning Java, and I am currently facing this kind of problem: I want to load a huge file to be parsed and inserted into an SQL database in a shared-disk environment. Please try to help if you can. Thank you.
router ! BigDataMessage(bigdataFilePath, i, totalChunks)
The ! operation is non-blocking; it just ensures that the message is sent to the actor, and doesn't capture the time taken by the actor to process the message.