We needed to handle large data files, reaching gigabytes in size, in our Scala-based Akka application. We are interested in reading data from a file and then operating on it. In our application a single line in a file forms a unit of data to be worked on, which means we can only operate on the big data file line by line.
These are the requirements:
- We cannot read the entire file into memory
- A line in the file is the unit of data to be processed
- It is OK to lose a bit of data in each line for big data processing
- We need to implement it in such a way that it is optimal
An example snippet of the big data file is of the form shown below. Such a file can reach gigabytes in size or more, and we need to process all of it.
We decided to use the RandomAccessFile API in Java. It provides the ability to seek to a particular location in a file and read bytes from that location onwards. Two API calls are particularly important: seek, which moves the file pointer to a given byte offset, and read, which fills a byte array starting from that offset.
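As a minimal sketch of these two calls (not the original listing), the snippet below seeks to a byte offset and reads one block into a byte array. The object name `SeekAndRead` and the helper `readBlock` are hypothetical names introduced for illustration; `seekIndex` and `byteArray` follow the names used in the text.

```scala
import java.io.RandomAccessFile

object SeekAndRead {
  /** Reads up to `size` bytes starting at byte offset `seekIndex`.
    * Hypothetical helper, for illustration only.
    */
  def readBlock(path: String, seekIndex: Long, size: Int): Array[Byte] = {
    val file = new RandomAccessFile(path, "r") // "r" = read-only mode
    try {
      file.seek(seekIndex)                     // move the file pointer
      val byteArray = new Array[Byte](size)
      // read returns the number of bytes actually read, or -1 at end of file;
      // a single call may return fewer bytes than requested.
      val bytesRead = file.read(byteArray)
      if (bytesRead <= 0) Array.emptyByteArray else byteArray.take(bytesRead)
    } finally file.close()
  }
}
```

Only `size` bytes are ever held in memory per call, which is what lets this approach work on files far larger than the heap.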
But we still need to figure out where to seek, i.e. the seekIndex in the code listing above, and how large byteArray should be. Most importantly, we need to work out how to split the bytes we read into lines.
The way to handle seekIndex is to find the number of bytes in the big data file and then divide it into chunks of the size we want to read in one go. There is a defaultBlockSize constant, and we divide the entire file into chunks of this size. Each seekIndex is then a multiple of defaultBlockSize. The code listing for the number of seeks is shown below.
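The chunk arithmetic can be sketched roughly as follows. This is an illustration rather than the original listing: the object name `ChunkMath` is hypothetical, the 1 MB value of `defaultBlockSize` is an assumption, and rounding up so that a trailing partial block is still counted is a choice made here, not something the text specifies.

```scala
object ChunkMath {
  // Assumed block size of 1 MB; the text does not give the actual constant here.
  val defaultBlockSize: Int = 1024 * 1024

  /** Number of chunks needed to cover the file, rounding up (an assumption)
    * so that a final partial block is not skipped.
    */
  def numberOfChunks(fileLength: Long, blockSize: Int = defaultBlockSize): Int =
    ((fileLength + blockSize - 1) / blockSize).toInt

  /** Byte offset at which chunk number `chunkIndex` (0-based) starts. */
  def seekIndex(chunkIndex: Int, blockSize: Int = defaultBlockSize): Long =
    chunkIndex.toLong * blockSize // widen to Long before multiplying to avoid Int overflow
}
```

For a real file, `fileLength` would come from `new java.io.File(path).length()`.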
If the file is 40 MB, numberOfChunks will be 40. This means we read a 1 MB chunk forty times and operate on each one. To operate on each line we have to extract the lines from the byte buffer we read. In this case we want to do a simple operation on each extracted line. For simplicity's sake, we do a word count on each line extracted from a chunk of data read from the file.
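One way to extract lines from a chunk and count words is sketched below. It is a simplified illustration, not the original code: the names `LineWordCount`, `extractLines` and `wordCount` are hypothetical, and UTF-8 text with newline-separated lines is assumed. Because chunk boundaries fall at arbitrary byte offsets, the first and last piece of a chunk may be only part of a line, which is the small data loss the requirements allow.

```scala
import java.nio.charset.StandardCharsets

object LineWordCount {
  /** Splits a chunk of bytes into non-empty lines.
    * Assumes UTF-8 text; a multi-byte character cut at a chunk boundary
    * decodes to a replacement character.
    */
  def extractLines(chunk: Array[Byte]): Seq[String] =
    new String(chunk, StandardCharsets.UTF_8)
      .split("\n")
      .toSeq
      .map(_.stripSuffix("\r")) // tolerate Windows line endings
      .filter(_.nonEmpty)

  /** Counts whitespace-separated words in a single line. */
  def wordCount(line: String): Int =
    line.trim.split("\\s+").count(_.nonEmpty)
}
```

A chunk would then be processed with something like `LineWordCount.extractLines(bytes).map(LineWordCount.wordCount)`.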
Now putting it all together, this is the complete code listing:
Above is an implementation that operates on a large file, extracts lines from each chunk of data read from the file, and does a simple word count on each line. It meets the first three requirements outlined at the beginning, but not the last one of making it optimal. Now, let's use some Akka code to speed up our application.
Akka is built around actors, and with actors we can process chunks in parallel to improve performance. Let’s use a FileWorker actor to read data from the file and operate on it. We need to pass the FileWorker the big data file path, the chunk index and the total number of chunks.
Let’s look at the code listing of the FileWorker actor and the Scala case class used as its message.
When the FileWorker actor receives a BigDataMessage, it reads lines and then does a word count on each line extracted. The code for reading lines and doing the word count is the same as in the previous version.
To measure performance, i.e. the time elapsed, we also need a Diagnostics actor. It calculates the total time elapsed for processing the big data file. Here is the code listing for the Diagnostics actor.
To run the big data processing we now need a BigDataProcessor. It is responsible for starting the FileWorker actors, a Diagnostics actor and a cyclic router, and for sending messages of type BigDataMessage (the Scala case class).
Below is the code listing for the BigDataProcessor.
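The following is a rough sketch of how these pieces could fit together, not the original listing. It assumes the classic Akka actor API (`akka-actor` 2.6.x on the classpath) and uses `RoundRobinPool` as a stand-in for the cyclic router mentioned above. The `ChunkDone` reply, the `Chunks` helper, the 1 MB block size and the `process` method are hypothetical names and values introduced for illustration.

```scala
import java.io.{File, RandomAccessFile}
import java.nio.charset.StandardCharsets
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Fields follow the text: file path, chunk index, total number of chunks.
final case class BigDataMessage(filePath: String, chunkIndex: Int, totalChunks: Int)
// Hypothetical reply carrying the word total for one chunk.
final case class ChunkDone(chunkIndex: Int, words: Long)

object Chunks {
  val blockSize: Int = 1024 * 1024 // assumed 1 MB block size

  /** Reads one chunk and sums the per-line word counts (UTF-8 assumed). */
  def countWords(path: String, chunkIndex: Int): Long = {
    val file = new RandomAccessFile(path, "r")
    try {
      file.seek(chunkIndex.toLong * blockSize)
      val buffer = new Array[Byte](blockSize)
      val n = file.read(buffer)
      if (n <= 0) 0L
      else new String(buffer, 0, n, StandardCharsets.UTF_8)
        .split("\n")
        .map(_.trim.split("\\s+").count(_.nonEmpty).toLong)
        .sum
    } finally file.close()
  }
}

class FileWorker(diagnostics: ActorRef) extends Actor {
  def receive: Receive = {
    case BigDataMessage(path, index, _) =>
      diagnostics ! ChunkDone(index, Chunks.countWords(path, index))
  }
}

class Diagnostics(totalChunks: Int) extends Actor {
  private val startedAt = System.currentTimeMillis()
  private var finished = 0
  def receive: Receive = {
    case ChunkDone(_, _) =>
      finished += 1
      if (finished == totalChunks) { // all chunks reported: print timing and stop
        println(s"Elapsed: ${System.currentTimeMillis() - startedAt} ms")
        context.system.terminate()
      }
  }
}

object BigDataProcessor {
  /** Starts the actors, sends one message per chunk, returns the system. */
  def process(path: String, workers: Int): ActorSystem = {
    val length = new File(path).length()
    val totalChunks = ((length + Chunks.blockSize - 1) / Chunks.blockSize).toInt
    val system = ActorSystem("BigDataProcessor")
    val diagnostics = system.actorOf(Props(new Diagnostics(totalChunks)), "diagnostics")
    val router = system.actorOf(
      RoundRobinPool(workers).props(Props(new FileWorker(diagnostics))), "router")
    (0 until totalChunks).foreach(i => router ! BigDataMessage(path, i, totalChunks))
    if (totalChunks == 0) system.terminate() // nothing to process
    system
  }
}
```

The router hands chunks to the workers in turn, so each worker reads a different region of the file at the same time. A worker count close to the number of cores is a reasonable starting point.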
Combining the pieces, here is the complete code listing for this application.
To execute the application, run both versions as Scala applications, either from Eclipse or from the terminal. The codebase is also on my GitHub. It is an sbt-based project; after you have downloaded the code, execute the following command in the terminal
When you run both versions you will notice that, as the big data file grows, the Akka-based version starts to outperform the standalone version. There is a LameBigDataProcessor as well; you can run it with the following command in the terminal
Here is my run of both versions. Compare their elapsed times, which are in milliseconds.
Akka's massively scalable, actor-based architecture achieves parallelism and lets us scale up the system by making optimal use of the cores on a single box.