With the rapid growth of big data technologies today, it is becoming very important to use the right tool for each process, whether that is data ingestion, data processing, data retrieval, data storage, etc. Today we are going to focus on one of the most popular big data technologies: Apache Spark.
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
You can learn more about Spark internals through this blog: https://blog.knoldus.com/spark-u/
Working on big data isn’t cheap, as it requires expensive resources, and Spark takes this to a new level with its extensive use of RAM. In such scenarios, a good way to cut storage costs is to use a file system instead of a database for storing big data. Spark is well suited to file systems, and with specific APIs it works much like a database (e.g. Spark SQL). Hence many big data developers prefer using file systems like HDFS as storage (input/output) while using Spark. From this point onwards, we’ll assume Spark is being used with some file system.
There are use cases where Spark is the inevitable choice. Spark is considered an excellent tool for use cases like ETL over large datasets, analyzing large sets of data files, machine learning and data science on large datasets, connecting BI/visualization tools, etc.
But it’s no panacea, right?
Let’s consider the cases where using Spark would be no less than a nightmare.
The following use cases are based on an architecture where Spark talks directly to a file system, e.g. HDFS:
1. Random Access:
What do we mean by random access? Imagine there are billions of records and the user is interested in just one specific record.
Consider the example of a movie ticket booking app where a user wants to view the ticket they booked for one fine evening show.
Will Spark work here? It definitely will. The request for some record ‘x’ will be submitted to the Spark cluster, which will first load all the records from the file system and then search for that particular record.
If you are doing ad-hoc analysis and searching for something once, this is still fine. But if you need many queries per second to fetch data for end users on your website, you don’t want to use Spark for that.
Why? Because for every single request, Spark will load the file data and then search for that one record in it. We can imagine how much time that is going to take.
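A toy sketch of the difference (the ticket records and IDs below are invented for illustration): answering each request by scanning every record, the way Spark re-reads files per query, versus a keyed lookup like a database index provides.

```python
# Hypothetical ticket records; layout and IDs are made up for illustration.
records = [{"ticket_id": i, "seat": f"A{i % 40}"} for i in range(100_000)]

def lookup_by_scan(ticket_id):
    # Every request walks all records, like re-reading files from HDFS.
    for r in records:
        if r["ticket_id"] == ticket_id:
            return r
    return None

# A keyed store answers the same question in a single step.
index = {r["ticket_id"]: r for r in records}

def lookup_by_key(ticket_id):
    return index.get(ticket_id)

assert lookup_by_scan(73_512) == lookup_by_key(73_512)
```

Both return the same record; the scan just pays the full cost of the dataset on every request.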
So, what is the solution?
Use a Database!!
Let Spark talk to the database and the database talk to the data. Accessing a record from a database is faster than from a file system. So whenever a request reaches Spark, it can pass it to the underlying database, which will search for the record in an optimized manner.
– SQL databases:
Create an index on your key column.
– Key-value NoSQL databases:
Retrieve the value for a key efficiently out of the box.
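A minimal sketch of the SQL route, using SQLite as a stand-in for any SQL database (table and column names are assumptions): an index on the key column turns the point lookup into an index seek instead of a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (ticket_id INTEGER, show_time TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(i, "evening" if i % 2 else "matinee") for i in range(10_000)],
)
# The index on the key column is what makes random access cheap.
conn.execute("CREATE INDEX idx_ticket_id ON tickets (ticket_id)")

# SQLite confirms the lookup uses the index rather than scanning the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tickets WHERE ticket_id = 4242"
).fetchone()
row = conn.execute(
    "SELECT show_time FROM tickets WHERE ticket_id = 4242"
).fetchone()
```

The query plan reports an index search, which is exactly the optimized access path a file-system scan can never give you.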
2. Frequent inserts:
Another thing to think about when you use Spark is storing data in file systems. There is certainly no issue in storing a bulk of data in one go using Spark. But what if you insert one record at a time? Here’s a spoiler: Spark creates a new file for every insertion. So every single time you insert into your table, Spark creates a new file just to write that new data into the folder where your Spark SQL table keeps its data.
These inserts themselves are incredibly fast; it takes no time to write one small file. But if you are familiar with Spark query speed and how to optimize it, you’ll know that once you have, say, 50k files of a few kilobytes each, Spark is not going to query fast at all.
Each time an insert query runs, it creates a new file for that record in the same data folder. And when you query across all these records, Spark has to traverse each of the files to find the particular record you asked for.
The bottleneck for your Spark job then becomes opening and reading all those files, rather than the processing you actually wanted to do.
So, what’s the solution?
- Use a database to support the inserts.
- Routinely compact your Spark SQL table files.
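The compaction idea can be sketched as follows, assuming the one-record-per-file pattern described above (file names and layout are invented): many tiny files are merged into a single larger one, so readers open one file instead of thousands.

```python
import os
import tempfile

data_dir = tempfile.mkdtemp()

# Simulate 1,000 single-record inserts, each creating its own small file.
for i in range(1_000):
    with open(os.path.join(data_dir, f"part-{i:05d}.csv"), "w") as f:
        f.write(f"{i},record-{i}\n")

def compact(directory, out_name="compacted.csv"):
    """Merge all part files into one file and delete the originals."""
    parts = sorted(p for p in os.listdir(directory) if p.startswith("part-"))
    with open(os.path.join(directory, out_name), "w") as out:
        for p in parts:
            path = os.path.join(directory, p)
            with open(path) as f:
                out.write(f.read())
            os.remove(path)  # drop the small file once it is merged
    return len(parts)

merged = compact(data_dir)
```

After compaction, the same data lives in one file, and a reader pays one open/read instead of a thousand.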
3. Serving Reports under High Load:
Another nightmare is trying to serve external reporting under high load with Spark. Say you have a whole bunch of ‘n’ users, with queries per second coming in to the website, all wanting to query a certain dataset. If you point them directly at Spark and have it read files from HDFS, that’s not going to work so well.
Because of the way Spark works: the first request for a query gets scheduled as a job on the Spark cluster. Then more requests keep arriving before the previous job is done, so eventually your jobs are going to queue up and slow down the whole system.
Too many concurrent requests will overload the Spark!
Let’s consider again the example of the Online Ticketing System. It involves:
– Queries per second
– Frequent ticket booking
– Viewing the ticket
So, what’s the solution?
Write out to a DB to handle the load.
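Some back-of-the-envelope arithmetic shows why the queue only grows; the arrival and completion rates below are invented for illustration.

```python
def backlog_after(seconds, arrivals_per_sec, completions_per_sec):
    """Net queue growth when requests arrive faster than jobs finish."""
    growth = arrivals_per_sec - completions_per_sec
    return max(0, growth * seconds)

# 10 queries/sec against a cluster that completes 2 jobs/sec:
# after one minute, 480 requests are still waiting in the queue.
print(backlog_after(60, 10, 2))  # 480
```

The only ways out are completing jobs faster than they arrive or, as suggested above, answering the load from a database instead of from Spark jobs.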
4. Searching Content:
Lastly, searching content. We are all familiar with the autocomplete feature in a search bar, where we type just a few letters and it shows us all the matching options. The query for that would be something like:
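A hypothetical version of such a query (the `movies` table and its titles are assumptions, not from the original post) is a prefix match, shown here with SQLite:

```python
# The autocomplete query is essentially:
#   SELECT title FROM movies WHERE title LIKE 'av%';
# fired on every keystroke.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("avengers",), ("avatar",), ("aviator",), ("batman",)],
)
matches = [t for (t,) in conn.execute(
    "SELECT title FROM movies WHERE title LIKE 'av%' ORDER BY title"
)]
```

Fired once, this is harmless; fired on every keystroke of every user, it becomes a flood of tiny queries.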
So each time there’s such a request, the Spark cluster spins up a job, and for each job it talks to the file system and loads its content. The rest we know.
Surely we don’t want to get trapped in that cycle, right? So, what’s the solution?
Use a search engine like Apache Solr or Elasticsearch.
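The core trick such engines use for autocomplete can be sketched in a few lines (a toy in-memory version, not how Solr or Elasticsearch are actually implemented): keep the terms in a sorted index and binary-search the prefix range, so each keystroke is a cheap lookup rather than a Spark job over files.

```python
import bisect

# A tiny pre-built term index; real engines build this at indexing time.
terms = sorted(["avatar", "avengers", "aviator", "batman", "spark"])

def autocomplete(prefix):
    lo = bisect.bisect_left(terms, prefix)
    # chr(0x10FFFF) is a sentinel sorting after any continuation of the prefix.
    hi = bisect.bisect_right(terms, prefix + chr(0x10FFFF))
    return terms[lo:hi]

print(autocomplete("av"))  # ['avatar', 'avengers', 'aviator']
```

Each lookup touches only the index in memory, which is why search engines handle per-keystroke traffic that would bury a Spark cluster.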
Remember to use the right tool for the right purpose. We might assume that because a technology is extremely popular it can be used in our case too, without thinking it through. If it doesn’t suit our use case, we’ll probably end up messing up our Big Data architecture.
The world is a Big Data Problem.
I hope this would have helped you. Feel free to share your thoughts in the comments box below. 🙂