Google Cloud Operations Suite


Introduction

Google Cloud’s operations suite (formerly Stackdriver) is a set of tools to help you monitor, debug, and trace your applications and infrastructure running in Google Cloud Platform (GCP) to ensure good performance and availability.  

What is the operations suite?

Google Cloud’s operations suite is made up of products to monitor, troubleshoot, and operate your services at scale, enabling your DevOps, SRE, or ITOps teams to apply Google SRE best practices.
It offers integrated capabilities for monitoring and logging, along with advanced observability services such as Trace, Debugger, and Profiler.

Google Cloud’s operations suite combines metrics, logs, and metadata. Whether you’re running on Google Cloud, Amazon Web Services, on-premises infrastructure, or a hybrid cloud, you can quickly understand service behaviours and issues from a single comprehensive view of your environment and take action if needed.

Operations Suite Key Components

Operations Suite consists of a collection of managed services that are designed to make the observation of cloud applications easier. There are six such services, as shown in the diagram below. Let’s take a deeper look into these components.

Figure 1: Operations Suite Components

Cloud Monitoring

Cloud Monitoring collects metrics, events, and metadata from Google Cloud, Amazon Web Services (AWS), hosted uptime probes, and application instrumentation. It provides visibility into metrics such as CPU usage, disk I/O, memory, network traffic, and uptime, as well as custom metrics. On virtual machines, its legacy monitoring agent is based on collectd, an open source daemon that gathers system and application performance data. Users receive customizable alerts when Cloud Monitoring detects performance issues.

Figure 2: Chart representing CPU utilisation of a Compute Engine virtual machine

Features:

  • Collects metrics from multicloud and hybrid infrastructure in real time.
  • Lets you explore metrics, events, and metadata with a rich query language that helps identify issues and uncover significant patterns.
  • Reduces time spent navigating between systems by providing one integrated service for metrics, uptime monitoring, dashboards, and alerts.
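
Custom metrics reach Cloud Monitoring as time series. As a minimal sketch, the snippet below builds the kind of JSON body that the Cloud Monitoring API's projects.timeSeries.create method accepts, using only the Python standard library. The project ID my-project, the metric name custom.googleapis.com/shop/checkout_latency_ms, and the helper name build_time_series are illustrative assumptions, and nothing is sent over the network.

```python
import json
from datetime import datetime, timezone


def build_time_series(project_id, metric_type, value, labels=None, when=None):
    """Build a request body holding one data point of a custom gauge metric.

    All argument values are supplied by the caller; the names used in the
    demo below are hypothetical.
    """
    when = when or datetime.now(timezone.utc)
    # Timestamps are sent as RFC 3339 strings in UTC.
    end_time = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "timeSeries": [
            {
                "metric": {"type": metric_type, "labels": labels or {}},
                # "global" is the generic monitored resource; real workloads
                # usually pick a more specific type such as "gce_instance".
                "resource": {
                    "type": "global",
                    "labels": {"project_id": project_id},
                },
                "points": [
                    {
                        "interval": {"endTime": end_time},
                        "value": {"doubleValue": float(value)},
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    body = build_time_series(
        "my-project",  # hypothetical project ID
        "custom.googleapis.com/shop/checkout_latency_ms",  # hypothetical metric
        120,
        labels={"region": "eu"},
    )
    print(json.dumps(body, indent=2))
```

In practice you would more likely let an agent or a client library send this data for you; the point here is only to show what a single custom data point looks like.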

Cloud Logging

Cloud Logging is a fully managed scalable service that allows you to store, search, analyse, monitor, and alert on logging data and events from Google Cloud and Amazon Web Services. Cloud Logging includes a centralised error management interface that provides real-time visibility into cloud application production errors. It also has sorting and content filtering capabilities based on the number of errors, when an error was first and last seen, and the error’s code location.

Figure 3: Logs Preview Sample

Features:

  • Write any custom log, from any source, into Cloud Logging using the public write APIs.
  • You can search, sort, and query logs through query statements, along with rich histogram visualisations, simple field explorers, and the ability to save the queries.
  • Integrates with Cloud Monitoring to set alerts on the log events and logs-based metrics you have defined.
  • You can export data in real-time to BigQuery to perform advanced analytics and SQL-like query tasks.
  • Cloud Logging helps you spot problems in a mountain of data using Error Reporting, which automatically analyses your logs for exceptions and intelligently aggregates them into meaningful error groups.
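
To make the custom log bullet more concrete, here is a minimal sketch of emitting a structured log entry as one JSON object per line. On managed runtimes such as Cloud Run and GKE, lines written to standard output in this shape are typically parsed into structured entries, with severity and message treated as special fields. The helper names format_log_entry and log, and the extra field order_id, are illustrative assumptions.

```python
import json
import sys

# Severity names defined by Cloud Logging's LogSeverity enum.
SEVERITIES = {
    "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
}


def format_log_entry(message, severity="INFO", **fields):
    """Return one JSON line with a severity, a message, and extra fields."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    entry = {"severity": severity, "message": message}
    entry.update(fields)  # extra keys become searchable payload fields
    return json.dumps(entry, sort_keys=True)


def log(message, severity="INFO", stream=None, **fields):
    """Write the entry to standard output (or another stream) and flush."""
    stream = stream or sys.stdout
    print(format_log_entry(message, severity, **fields), file=stream, flush=True)


if __name__ == "__main__":
    log("payment failed", severity="ERROR", order_id="A-1001")
```

An entry like this could then be looked up in the Logs Explorer with a query along the lines of severity>=ERROR AND jsonPayload.order_id="A-1001".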

Error Reporting

Errors are essential to finding edge cases, overlooked use cases, and boundaries of running applications. Therefore, you need to collect exceptions and errors from all your applications and intelligently aggregate them. Error Reporting helps you group and visualise errors with their sources, stack traces, and occurrences.

Error Reporting aggregates and displays errors produced by running cloud services. It provides a centralised error management interface that helps you find an application’s top or new errors so that they can be fixed faster.
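
The idea of aggregating errors into groups can be illustrated with a small sketch. The code below groups exceptions by their type and by the function that raised them. This is a simplified stand-in chosen for illustration, not the grouping algorithm Error Reporting actually uses, and the functions parse_price, lookup, and capture are hypothetical.

```python
import traceback
from collections import defaultdict


def error_signature(exc):
    """Return (exception type, name of the function where it was raised)."""
    frames = traceback.extract_tb(exc.__traceback__)
    where = frames[-1].name if frames else "<unknown>"
    return (type(exc).__name__, where)


def group_errors(exceptions):
    """Group exception messages under their signature."""
    groups = defaultdict(list)
    for exc in exceptions:
        groups[error_signature(exc)].append(str(exc))
    return dict(groups)


def capture(func, *args):
    """Call func and return the exception it raises, or None."""
    try:
        func(*args)
    except Exception as exc:  # deliberately broad: we are collecting errors
        return exc
    return None


# Hypothetical application code that fails in different ways.
def parse_price(text):
    return float(text)


def lookup(table, key):
    return table[key]


if __name__ == "__main__":
    errors = [
        capture(parse_price, "abc"),
        capture(parse_price, "1,5"),
        capture(lookup, {}, "sku"),
    ]
    for signature, messages in group_errors(errors).items():
        print(signature, len(messages))
```

Two different bad inputs to parse_price land in the same group, which is the kind of de-duplication that keeps an error dashboard readable.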

Cloud Trace

Cloud Trace is a distributed tracing system for Google Cloud that collects latency data from applications and displays it in near real-time in the Google Cloud Console. All of the application’s traces are analysed automatically, and the results are displayed in the form of latency reports. Cloud Trace provides visualisation and analysis to help you understand request flow, service topology, and latency issues in your app.

Cloud Trace tracks how requests propagate through the application and provides detailed, near real-time performance insights. It automatically analyses all of the application’s traces to generate in-depth latency reports that surface performance degradations, and it can capture traces from VMs, containers, and App Engine applications.
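
As a rough illustration of what a latency report summarises, the sketch below times named spans and computes nearest-rank percentiles over their durations. It is a toy model under stated assumptions, not the Cloud Trace API: the Tracer class and the percentile and latency_report helpers are hypothetical names.

```python
import math
import time
from contextlib import contextmanager


class Tracer:
    """Collects (name, duration in milliseconds) for each finished span."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.spans.append((name, elapsed_ms))


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_report(durations_ms):
    """Summarise request latencies the way a simple report might."""
    return {
        "count": len(durations_ms),
        "p50": percentile(durations_ms, 50),
        "p95": percentile(durations_ms, 95),
        "max": max(durations_ms),
    }


if __name__ == "__main__":
    tracer = Tracer()
    for _ in range(5):
        with tracer.span("handle_request"):
            time.sleep(0.01)  # stand-in for real work
    print(latency_report([ms for _, ms in tracer.spans]))
```

Percentiles matter here because a single slow request barely moves the average but shows up clearly at the tail, which is where performance degradations tend to surface first.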

Cloud Trace creates traces and generates latency reports from them in order to find critical paths and performance degradations, as shown below:

Figure 4: Cloud Trace overview

Cloud Debugger

Cloud Debugger inspects the state of an application deployed on Google App Engine or Google Compute Engine (GCE), using production data and source code. While the application runs in production, snapshots of its state are taken and linked to a line location in the source code, without having to add logging statements. This inspection doesn’t affect the application’s performance.

With Cloud Debugger, you can set breakpoints, as well as see the call stack and variables, without deteriorating the instances’ performance in the cloud. Cloud Debugger supports Cloud Source Repositories, GitHub, Bitbucket, or GitLab as the source code repository. If the source code repository is not supported, the source files can be uploaded.

Figure 5: Cloud Debugger

For example, to determine why a loop executes too many times, set the snapshot to trigger on a condition such as left > right, then click the camera icon to prepare Debugger for the snapshot.
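
The idea behind a conditional snapshot can be sketched in a few lines of plain Python. The helper below records the caller's local variables only when a condition holds, here left > right. This only mimics the concept and is not how Cloud Debugger works internally; snapshot_if, SNAPSHOTS, and the deliberately buggy is_palindrome function are hypothetical names invented for this example.

```python
import inspect

SNAPSHOTS = []  # captured states, newest last


def snapshot_if(condition):
    """Record the caller's local variables when the condition is true."""
    if condition:
        caller = inspect.currentframe().f_back
        SNAPSHOTS.append(
            {
                "function": caller.f_code.co_name,
                "line": caller.f_lineno,
                "locals": dict(caller.f_locals),  # copy, not a live view
            }
        )


def is_palindrome(text):
    """Two-pointer check with a bug: the loop should stop at left >= right."""
    left, right = 0, len(text) - 1
    steps = 0
    # Bug: for even-length input the pointers cross without ever being equal.
    # The steps guard only keeps this demo finite.
    while left != right and steps < len(text):
        snapshot_if(left > right)
        if text[left] != text[right]:
            return False
        left += 1
        right -= 1
        steps += 1
    return True


if __name__ == "__main__":
    is_palindrome("abba")
    for snap in SNAPSHOTS:
        print(snap["function"], snap["locals"]["left"], snap["locals"]["right"])
```

The captured values show the pointers after they have crossed, which points straight at the faulty loop condition without adding any log statements to the function itself.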

Cloud Profiler

Cloud Profiler is a statistical, low-overhead profiler that continuously gathers CPU usage and memory-allocation information from your production applications. It attributes that information to the application’s source code, helping you identify the parts of the application that consume the most resources and otherwise illuminating the performance characteristics of the code. You can then focus optimisation work on those parts of the code.

Cloud Profiler continually analyses your code’s performance on each service so that you can improve its speed and reduce your costs. It is designed to run in production with effectively no performance impact.
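
To show what "statistical" means here, the sketch below samples the running function of a thread at a fixed interval and counts how often each function name is seen. Functions that consume more CPU time collect more samples. This is a toy illustration, not Cloud Profiler's implementation: sys._current_frames() is specific to CPython, and the names profile, sample_thread, and busy_work are hypothetical.

```python
import collections
import sys
import threading
import time


def sample_thread(thread_id, counts, stop, interval_s=0.001):
    """Periodically record which function the target thread is executing."""
    while not stop.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval_s)


def profile(func, *args):
    """Run func while a background thread samples the calling thread."""
    counts = collections.Counter()
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), counts, stop),
        daemon=True,
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop.set()
        sampler.join()
    return counts


def busy_work(duration_s):
    """Stand-in for CPU-heavy application code."""
    deadline = time.monotonic() + duration_s
    total = 0
    while time.monotonic() < deadline:
        total += 1
    return total


if __name__ == "__main__":
    for name, hits in profile(busy_work, 0.3).most_common():
        print(name, hits)
```

Because the profiler only looks at the program every so often instead of instrumenting every call, the overhead stays small, at the cost of results that are estimates and vary slightly from run to run.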

Figure 6: Cloud Profiler interface

Use cases

Monitor your infrastructure

Cloud Logging and Cloud Monitoring provide your IT Ops, SRE, and DevOps teams with the out-of-the-box observability needed to monitor your infrastructure and applications. Cloud Logging automatically ingests Google Cloud audit and platform logs so that you can get started right away. Cloud Monitoring provides a view of all Google Cloud metrics at zero cost and integrates with a variety of providers for non-Google Cloud monitoring.

Troubleshoot your applications

Reduce Mean Time to Recover (MTTR) and optimize your application’s performance with the full suite of cloud ops tools. Use dashboards to gain insights into your applications with both service and custom application metrics. Use Monitoring SLOs and alerting to help identify errors.
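
An SLO is easier to reason about with numbers. The following sketch computes how much of an error budget a service has consumed for a request-based availability target. The function name error_budget and the figures in the demo are illustrative assumptions, not values taken from Cloud Monitoring.

```python
def error_budget(total_requests, failed_requests, slo_target):
    """Summarise availability and error budget use for one time window.

    slo_target is a fraction, for example 0.999 for "99.9% of requests
    succeed".
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests out of range")

    # Number of failures the SLO tolerates in this window.
    allowed_failures = total_requests * (1 - slo_target)
    availability = 1 - failed_requests / total_requests
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": availability >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical month: one million requests, 250 failures, 99.9% target.
    print(error_budget(1_000_000, 250, 0.999))
```

With these made-up figures the service has used roughly a quarter of its budget, so an alert tied to budget consumption would stay quiet, while a sudden burst of failures would push the ratio up quickly.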

Conclusion

Google Cloud Operations Suite offers a broad set of options for monitoring and troubleshooting applications in the cloud. It works with applications and infrastructure hosted on cloud platforms, and its services run continuously to keep track of them. The logs and metrics it collects can then be used elsewhere: Google Cloud services such as Pub/Sub, BigQuery, and Cloud Storage can consume the data provided by Operations Suite to help improve performance and resolve errors.

Written by 

I am a DevOps engineer with experience working with DevOps tools and technologies such as Kubernetes, Docker, Ansible, AWS, Prometheus, and Grafana. I am flexible towards new technologies and always willing to update my skills and knowledge to increase productivity.