Boost Factorial Calculation with Spark


We all know that, Apache Spark is a fast and a general engine for large-scale data processing. It can process data up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

But, is that the only task (i.e., MapReduce) for which Spark can be used ? The answer is: No. Spark is not only a Big Data processing engine. It is a framework which provides a distributed environment to process data. This means we can perform any type of task using Spark.

For example, lets take Factorial. We all know that calculating Factorial for Large numbers is cumbersome in any programming language and on top of that, CPU takes a lot of time to complete the calculations. So, what can be the solution ?

Well, Spark can be the solution to this problem. Lets see that in form of code.

First, we will try to implement Factorial using only Scala in a Tail Recursive way.

def factorial(number: Int): BigInt = {
  def recursiveFactorial(number: Int, accumulator: BigInt): BigInt = {
    if(number == 0)  accumulator
    else  recursiveFactorial((number - 1), accumulator * number)
  }
  recursiveFactorial(number, 1)
}

The time taken by above code to find the Factorial of 200000 on my machine (Quad Core Intel i5) was about 20.21s.

Now, lets implement the same function using Spark.

def factorial(number: Int): BigInt = {
  val list = if(number == 0 ) List(BigInt(1)) else (BigInt(1) to number).toList
  val rdd = sparkContext.parallelize(list)
  rdd.reduce(_ * _)
}

The time taken by Spark to find the factorial of 200000 on the same machine was only 5.41s, which is almost 4x faster than using Scala alone.

Of course, the calculation time can vary depending on the H/W we are using. But, still we have to admit that Spark not only reduced the calculation time, but also gave a much cleaner way to code it.

This entry was posted in Scala, Spark and tagged , . Bookmark the permalink.

5 Responses to Boost Factorial Calculation with Spark

  1. Nirmalya Sengupta (@baatchitweet) says:

    I like the blogs that you guys at Knoldus, write: I have learnt about many things from them and will keep doing so in the future.
    I have some observations about this particular one:
    1) ‘.. Spark gave a much cleaner way to do it’: some may argue that the Recursive version leans more towards how a Factorial is understood mathematically! If the code is meant for the next guy who comes to read it (and not the machine or framework), it is debatable which is cleaner. 🙂
    2) You have not mentioned if you gave enough time to JVM to warm itself up before you measured the performance. That is necessary for all the inlining and JITs to be happy! Please make your readers wiser.
    3) Factorial is not really a good example to establish Spark’s ability to be a ‘generalized’ distributed environment to process data (as you claim). It is an associative computation, and inherently parallelizable. If you use Scala Library’s parallel operations on Collection for your regular Factorial calculation and then compare the results, you are being fair! 🙂
    4) Did you measure the performance of the Spark version, including the _collect()_ call? Just inquisitive.
    5) Finally, in the _recursiveFactorial_ function, why are we not checking for (number == 1), instead of (number == 0)? Am I missing something obvious?

    These are just observations. You may want to elaborate on your proposition of Spark being a Distributed Computing framework, using another example. That will be very interesting.

    My 2 cents.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s