Challenges to Monitoring a Fast Data Application


Fast Data Application

In the present landscape, the buzzword is “Fast Data”, and it is nothing but data that is not at rest. And since the data is not at rest, the traditional techniques of working on data at rest are no longer efficient or relevant. The importance of streaming has grown because it provides a competitive advantage: it reduces the time gap between data arrival and information analysis. Business enterprises demand availability, scalability, and resilience as implicit characteristics of their applications, and microservice architectures cater to this. These microservices, in turn, need to deal with the real-time requirement of handling fast data.

The integration of fast data processing tools and microservices leads to a system that is a Fast Data Application. These Fast Data applications process and extract value from data in near real-time. Technologies such as Apache Spark, Apache Kafka, and Apache Cassandra have grown to process that data faster and more effectively. These applications unearth real-time insights that drive profitability, but they pose a big challenge in monitoring and managing the overall system. The traditional techniques fail because they were built around monolithic applications and are unable to effectively manage the new distributed, clustered, and tangled inter-connected systems.

What is the issue?
The main challenge is to ensure the continuous health, availability, and performance of these modern, distributed Fast Data applications.

Let us get into a little more detail about how these applications actually pose a challenge. The application is very likely to be streaming in data from more than a dozen sources, and these sources could be hundreds of individual, distributed microservices, data stores, and external endpoints. On top of these sources, we have technologies such as Apache Spark, Apache Mesos, Akka, Apache Cassandra, and Apache Kafka (a.k.a. the “SMACK” stack) in place to form a powerful data processing tool. Now comes the actual point of contention in the system. These technologies are varied, distributed, and complex, and pose the following bottlenecks:

Evolving System: A rapidly growing stack leads to scarcity of domain knowledge; as the number of components keeps growing, understanding the business value behind each of them is not a task of a few days.

Data Pipeline: An in-depth understanding of the data pipeline is vital. Each stage's output is the input of the next, and a failure at any stage can choke the whole system. This requires metrics computation at each stage of the pipeline, as in the sketch after this list.

Architecture: Manual monitoring, or manually setting up such a system, is a strict no: the entire architecture of the system is highly dynamic, not static, so you cannot configure things by hand.

Complexity in Interconnection: An error in one part of the system could actually be caused by choking in some other part. This is highly difficult to identify and debug, since it is hard to know the point of origin and where to start looking.

Distributed & Clustered: Each component in the system is itself a framework, deployed in a distributed way. Now imagine each framework deployed over multiple nodes, with multiple such frameworks working together and intertwined in the application; debugging with traditional logging systems could be a nightmare. Correlating issues to understand dependencies and analyzing root causes is difficult.

Overwhelming Amount of Information: With each framework in the system generating its own metrics, we have a flood of information, and gaining insight from it without structure could be futile.
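
To make the data-pipeline point concrete, here is a minimal, self-contained Scala sketch of per-stage instrumentation; the stage names and functions are illustrative, not tied to any particular framework. Each stage records how many records it processed and how long each took, so a choking stage shows up as falling throughput or rising latency:

```scala
import java.util.concurrent.atomic.LongAdder

// Per-stage counters: records processed and total processing time.
final class StageMetrics(val stageName: String) {
  private val processed  = new LongAdder
  private val totalNanos = new LongAdder

  // Wrap a stage function so every invocation is counted and timed.
  def instrument[A, B](stage: A => B): A => B = { input =>
    val start  = System.nanoTime()
    val output = stage(input)
    processed.increment()
    totalNanos.add(System.nanoTime() - start)
    output
  }

  def recordsProcessed: Long = processed.sum()
  def avgLatencyMicros: Double = {
    val n = processed.sum()
    if (n == 0) 0.0 else totalNanos.sum() / 1000.0 / n
  }
}

object PipelineExample extends App {
  val parseMetrics  = new StageMetrics("parse")
  val enrichMetrics = new StageMetrics("enrich")

  // Two hypothetical stages: the output of one is the input of the next.
  val parse  = parseMetrics.instrument((raw: String) => raw.split(",").toList)
  val enrich = enrichMetrics.instrument((fs: List[String]) => fs.map(_.trim.toUpperCase))

  Seq("a, b", "c, d", "e, f").foreach(raw => enrich(parse(raw)))

  Seq(parseMetrics, enrichMetrics).foreach { m =>
    println(f"${m.stageName}%-7s processed=${m.recordsProcessed} avgLatencyMicros=${m.avgLatencyMicros}%.1f")
  }
}
```

In a real pipeline, these counters would be shipped to a metrics backend rather than printed, but the principle is the same: every stage reports, so the choke point is visible.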

How is monitoring a Fast Data Application different from Traditional Monitoring?

In a monolithic application, when an error occurs, the resolution is based on the monolithic design comprising the database layer, application layer, and front-end web layer. We get a clear call stack from beginning to end, allowing us to find the error in time, because the flow is definitive and deterministic.

The challenge posed by a Fast Data Application is entirely different: here the system is often asynchronous and composed of components like microservices, data frameworks, telemetry, machine learning, the streaming platform, etc. There were a few Application Performance Monitoring (APM) tools available, but they are rendered useless now because of their inability to monitor asynchronous, streaming systems running on distributed clusters.

Various other tools, like infrastructure monitoring, log analysis, and network performance monitoring tools, have failed as well because they are highly purpose-built and cannot cope with the architectural interdependencies.

What is the Solution and how should it look?
To resolve the issue, what we need is an extremely insightful and powerful visualization layer that is able to understand and analyze the end-to-end health of the system, which includes the availability and performance of each app component. Now that we have analyzed the potential issue and we know the abstract solution, let us try to put some high-level implementations into the solution.

The first issue we could resolve is the Overwhelming Amount of Information. Organizing the information into a hierarchy of concerns could be really helpful. For every component in the Fast Data Application, we could organize the information broadly as follows (a small data-model sketch follows the list):

Data Health: Covers details like: is the throughput of the component matching expectations, is the processing meeting the timeframe requirements, is the incoming data stream posing problems, etc.

Dependency Health: Are the dependent components, like memory caches or endpoints, healthy, or are they crossing their thresholds?

Service Health: Is the component able to distribute and rebalance workloads effectively and efficiently?

Application Health: Are the operating parameters under the normal threshold, or are they exceeding values that can potentially affect the system adversely?

Topology Health: Are the resources in the distributed system optimally utilised? Are the performance parameters for the topology healthy?

Node System Health: Are the key parameters like load, CPU, memory, network I/O, disk I/O, and free disk space operating normally?
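
As a rough illustration of this hierarchy, here is a small Scala sketch; the model and the thresholds are assumptions for illustration, not a prescribed schema. It treats each concern as a type, rolls a component's statuses up into one overall status, and performs one real node-system check by comparing the JVM-reported load average against the available cores:

```scala
import java.lang.management.ManagementFactory

// The concerns mirror the hierarchy above.
sealed trait Concern
case object DataHealth        extends Concern
case object DependencyHealth  extends Concern
case object ServiceHealth     extends Concern
case object ApplicationHealth extends Concern
case object TopologyHealth    extends Concern
case object NodeSystemHealth  extends Concern

sealed trait Status
case object Healthy  extends Status
case object Degraded extends Status
case object Critical extends Status

final case class HealthReport(component: String, statuses: Map[Concern, Status]) {
  // The worst status of any concern dominates the component's overall health.
  def overall: Status =
    if (statuses.valuesIterator.contains(Critical)) Critical
    else if (statuses.valuesIterator.contains(Degraded)) Degraded
    else Healthy
}

object NodeChecks {
  // One concrete node-system check: 1-minute load average vs. core count,
  // both read from the standard OperatingSystemMXBean.
  def loadStatus: Status = {
    val os    = ManagementFactory.getOperatingSystemMXBean
    val load  = os.getSystemLoadAverage // -1.0 if unavailable on this platform
    val cores = os.getAvailableProcessors
    if (load < 0) Healthy               // metric not supported; assume OK
    else if (load > cores * 2) Critical // thresholds are illustrative
    else if (load > cores) Degraded
    else Healthy
  }
}

object HierarchyExample extends App {
  val report = HealthReport(
    component = "kafka-ingest", // hypothetical component name
    statuses  = Map(DataHealth -> Healthy, NodeSystemHealth -> NodeChecks.loadStatus)
  )
  println(s"${report.component}: overall=${report.overall} details=${report.statuses}")
}
```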

Other than the organization of information, there are other dimensions of monitoring required, which can briefly be segregated as follows (a small aggregation sketch follows the list):

Deep Visibility: Get the real-time status, i.e., live system insights.

Domain Specific: Identify the important metrics to monitor the components, and add custom metrics.

Automatic Monitoring: Components should be auto-identified and monitored.

Real Time Integrated View: View the health of the entire application, i.e., all components, in a single, real-time view.

Quick Troubleshooting: Minimize the downtime and repair time of the system; the system should be smart enough to learn the required fixes from previous failures.
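
To sketch how the automatic-monitoring and integrated-view dimensions might fit together, here is a minimal Scala example; the registry, component names, and probes are hypothetical. Components register a health probe once when they come up, and a scheduler prints one consolidated status line for the whole application every few seconds:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap
import scala.util.Random

object IntegratedView {
  // Thread-safe registry of component-name -> health probe.
  private val probes = TrieMap.empty[String, () => String]

  // Any component that comes up registers itself; no manual configuration.
  def register(component: String)(probe: () => String): Unit =
    probes.put(component, probe)

  // Render every registered component's status in one line, periodically.
  def start(): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(
      () => {
        val line = probes.toSeq.sortBy(_._1)
          .map { case (name, probe) => s"$name=${probe()}" }
          .mkString(" | ")
        println(s"[health] $line")
      },
      0, 5, TimeUnit.SECONDS
    )
  }
}

object ViewExample extends App {
  IntegratedView.register("kafka-ingest")(() => "OK")
  IntegratedView.register("spark-job")(() => if (Random.nextBoolean()) "OK" else "LAGGING")
  IntegratedView.start()
  Thread.sleep(12000) // watch a few refreshes, then exit
  sys.exit(0)
}
```

A production system would push these statuses to a dashboard instead of printing them, but the design choice stands: registration is automatic, and the view is a single, consolidated, real-time picture of the application.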

In the next blog, I shall discuss one of the solutions for End-To-End Monitoring of Fast Data Applications.





Written by Pallavi

Pallavi is a Software Consultant with more than 3 years of experience. She is very dedicated, hardworking, and adaptive. She is technology agnostic and knows languages like Scala and Java. Her areas of interest include microservices, Akka, Kafka, Play, Lagom, GraphQL, Couchbase, etc. Her hobbies include art & craft and photography.
