Big Data has become a buzzword, and it seems everyone is either working on it or wants to work on it. However, most people associate Big Data with a handful of popular tools such as Hadoop and Spark, the SQL-on-Hadoop warehouse Hive, and NoSQL databases like Cassandra and HBase.
HDFS helped make Big Data popular because it let us distribute data over a multitude of machines, and Hadoop MapReduce then provided a way to process that data in a predictable time. However, the main strength of HDFS was that it presented a distributed file system as if it were a normal file system. All the complexity of the distributed system was hidden from users: to them it was ordinary file I/O, and they did not need to care how HDFS split files into blocks across different nodes and then reassembled them into a single file.
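To make that abstraction concrete, here is a minimal sketch in plain Python. It is a toy, not the HDFS API: the class `ToyDistributedFS`, the 8-byte block size and the round-robin placement are all assumptions made for illustration, and replication is left out entirely. The point is only that the caller writes and reads whole files while the blocks live on different "nodes".

```python
from typing import Dict, List

# Tiny on purpose; real HDFS blocks are far larger (128 MB by default in recent Hadoop versions).
BLOCK_SIZE = 8


class ToyDistributedFS:
    """Toy stand-in for a distributed file system (hypothetical, not the HDFS API)."""

    def __init__(self, node_count: int) -> None:
        # Each "node" is just a dict mapping (path, block index) -> block bytes.
        self.nodes: List[Dict[tuple, bytes]] = [{} for _ in range(node_count)]
        # The metadata service remembers which node holds each block of a file.
        self.block_map: Dict[str, List[int]] = {}

    def write(self, path: str, data: bytes) -> None:
        placements = []
        for index, start in enumerate(range(0, len(data), BLOCK_SIZE)):
            node_id = index % len(self.nodes)  # round-robin placement (assumption)
            self.nodes[node_id][(path, index)] = data[start:start + BLOCK_SIZE]
            placements.append(node_id)
        self.block_map[path] = placements

    def read(self, path: str) -> bytes:
        # Fetch every block from its node and stitch the file back together.
        return b"".join(
            self.nodes[node_id][(path, index)]
            for index, node_id in enumerate(self.block_map[path])
        )


fs = ToyDistributedFS(node_count=3)
fs.write("/logs/app.log", b"the caller never sees the blocks")
restored = fs.read("/logs/app.log")
```

The caller only ever touches `write` and `read`; where each block ended up is recorded in `block_map` and never leaks into the calling code.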
Over the past decade a host of Big Data technologies have evolved, and it has become difficult to categorise them. In this blog I will try to give some structure and classification to this technology set.
Broadly, we can categorise the tool set under the following headings:
Storage: Big Data is all about analytics, and no matter how fancy your algorithms are, they can’t perform without robust storage to support them. In the storage space, one size does not fit all; you have to choose your storage carefully on the basis of the use case you are trying to solve. There is a plethora of storage solutions available, each trying to solve a particular problem. The image below shows some of the popular databases in their respective categories.
Processing: Processing makes data ready for consumption by our analytics. Because of the scale of the data, it requires a high-performance computing environment that is easy to manage and can be tuned to scale close to linearly. The data from the source may need to be ingested into the data store, or it may need to be filtered, transformed or enriched for further use. In data warehouses we used to run ETL jobs (batch processing) to load the data into the DWH, but now there are many streaming solutions that make it possible to process the data in near real time. Processing can be classified as follows:
- ETL Jobs: These are tools that help you extract, transform and load data into the storage systems in batches. Some of the popular tools are Spark, Flink, Pig and Talend. You can define and schedule jobs in these tools to do the processing.
- Massive Parallelization: Since we are dealing with data on the scale of petabytes, we cannot process it in a reasonable time unless we use parallel processing across different nodes. Hadoop MapReduce was one of the earliest solutions that did this and was hugely popular. It distributes the computation over multiple nodes of a cluster. One of the popular massively parallel processing engines today is Spark, which does a lot of in-memory computation and optimisation to achieve great results.
- Data Logistics: Data logistics tools help us move data from one system, or one storage, to another. They are meant for bulk transfer of data; e.g. Apache NiFi was built to automate and manage the flow of data between systems.
- Stream Processing: Real-time data holds potentially high value for the business, but it is also perishable. If the value of this data is not realized within a certain window of time, it is lost, and the decision or action that was needed never occurs. Such data arrives continuously and quickly, which is why we call it streaming data. Streaming data requires special attention: a rapidly changing sensor reading, a blip in a log file or a sudden price change holds immense value, but only if we are alerted in time. Popular solutions for stream processing include Spark Streaming, Kafka Streams, Storm and Flink.
- Data Workflows: Every data platform needs to build pipelines that read from data sources, such as lakes and warehouses, and write to data stores, such as an application database. In a nutshell, these pipelines connect various data sources and stores, with processing steps along the flow of data. When building a pipeline, it’s useful to be able to schedule a task to run, ensure that any dependencies for the pipeline have already completed, and backfill historic data if needed. While it’s possible to perform these types of tasks manually, data workflow tools have been developed to improve the management of data science workflows. Some of them are Airflow, Apache Beam and Oozie.
The figure below depicts the classification of the processing tools.
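The map and reduce idea behind massive parallelization can be sketched with nothing but the Python standard library. This is a simplified illustration, not how Hadoop or Spark are implemented: threads stand in for cluster nodes, and the function names `map_chunk`, `merge_counts` and `word_count` are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def map_chunk(lines: List[str]) -> Counter:
    """Map step: count words in one chunk, independently of all other chunks."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines: List[str], workers: int = 3) -> Counter:
    # Split the input into one chunk per worker (threads stand in for nodes here).
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Merge the partial counts produced by each worker.
    return reduce(merge_counts, partials, Counter())


lines = [
    "big data needs storage",
    "big data needs processing",
    "processing needs parallel workers",
]
totals = word_count(lines)
```

Because each chunk is counted without looking at any other chunk, the map step can run wherever the data lives, and only the small partial counts need to travel to be merged.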
Analytics: Data is meaningless until it is turned into useful information and knowledge that can aid management in decision making. For this purpose, there are several leading Big Data software products available in the market. This software helps in analyzing, reporting and doing a lot more with data.
The analytics tools can be classified into the following categories:
- Algorithms: Big Data could have been a long-gone concept if algorithms had not been developed to support it. Intelligent algorithms not only decode the data but also analyze it and provide the right output at the right time. Algorithms are what help us make sense of data. They are the backbone of the analytics system, as they provide information that can help businesses grow or help their customers; whether it is predictive analysis, AI or information processing, it is all done using them. Some of the popular machine learning and data analysis libraries are Spark MLlib, Mahout, Pandas and TensorFlow.
- Analytics Front-ends: These tools help you see and understand the data. They are powerful and can connect to almost any database (at least the popular ones), provide drag-and-drop creation of visualizations, and let you share with a click. You can slice and dice the data and look at different aspects of it. They can be used for generating reports as well as creating dashboards. The most popular are Tableau, Pentaho, Superset and Redash.
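As a rough idea of what "slice and dice" means underneath a dashboard, the sketch below totals the same records along different dimensions. The `SALES` records, their field names and the helper `total_by` are hypothetical and chosen only for illustration; real front-ends push this kind of aggregation down to the database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sales records; the field names are invented for this example.
SALES = [
    {"region": "north", "product": "laptop", "amount": 1200},
    {"region": "north", "product": "phone", "amount": 600},
    {"region": "south", "product": "laptop", "amount": 900},
    {"region": "south", "product": "phone", "amount": 300},
    {"region": "south", "product": "phone", "amount": 450},
]


def total_by(records: Iterable[dict], *dimensions: str) -> Dict[Tuple[str, ...], int]:
    """Sum `amount` for every combination of the requested dimensions."""
    totals: Dict[Tuple[str, ...], int] = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += record["amount"]
    return dict(totals)


# "Dice": view the same data along one or two dimensions.
by_region = total_by(SALES, "region")
by_region_product = total_by(SALES, "region", "product")

# "Slice": fix one dimension to a single value, then aggregate the rest.
south_only = total_by((r for r in SALES if r["region"] == "south"), "product")
```

Each call answers a different question from the same records, which is what a drag-and-drop tool does when you move a field between rows, columns and filters.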
In this blog I have tried to define the Big Data landscape and have collated some of the tools under their respective headings. There are many options available and the space is growing bigger day by day, but the broad general classification will remain the same.