How Cruise Ships Started Crunching Data Locally with Apache Spark on Apache Mesos and Docker

Analytics on the edge – How Apache Mesos enabled ships to crunch data


Introduction & the Problem

One of our key customers, a large cruise line, operates ships that sail with a few thousand people on board. The company is going through a successful digital transformation, which includes managing the full life cycle of a guest on mobile, data-science-driven personalization, and more, and we are fortunate to be part of the whole journey. These ships generate many varieties of data: streaming IoT data from onboard sensors, business transaction data, streaming picture data, mobile app click data, and so on.

One of the key challenges in expanding AI usage on streaming ship data is satellite bandwidth and onboard processing power. Though the ships host a few thousand passengers, they can only carry a few servers, connected to the cloud over a satellite internet connection whose bandwidth varies from ship to ship. Given this bandwidth constraint, streaming raw data to shore (the cloud) is a non-starter. At the same time, it is expensive to buy and maintain enough hardware on each ship to process any meaningfully sized data locally. For example, implementing a recommendation engine with an evolving model on a particular ship is impractical given the number-crunching the algorithm requires on board.

The Solution

The solution could be one of the following:

  • Use a dedicated set of servers.
  • Ship data to shore over the satellite link.
  • Summarize the data locally and ship only the summaries.
  • Wait until the ship reaches port and use the local wired connection.
  • Share the existing onboard infrastructure.

While each of these is feasible under certain conditions, we chose to leverage the Apache Mesos infrastructure already built for the digital program, which currently powers all the enterprise APIs.

A word on Apache Mesos: Mesos is an open-source resource management framework, similar to Kubernetes and Hadoop YARN. There are key differences between the three, which could be a separate blog post.

Architecture

The architecture looks as follows.

  • Each ship has a datacenter (essentially a small room with a rack full of blade servers) which hosts all of the ship's applications.
  • Apache Mesos runs on these servers and provisions the resources (CPUs, RAM, disk, ports, etc.) required by the applications that run constantly (for example, the APIs powering the mobile apps).
  • Apache Zeppelin provides the notebooks loved by data scientists and a way to analyze data on the ship in real time against the production data stores (DataStax Cassandra, Couchbase, and Kafka). Zeppelin itself requires minimal resources (we launched Zeppelin with 0.1 CPU, 1 GB RAM, and 100 MB disk in production).
  • Zeppelin launches Spark clusters dynamically whenever a Spark interpreter is started. The clusters disappear when the interpreter is stopped (after the data scientists finish their work), thus consuming zero resources when not in use.
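As an illustrative sketch, a long-running Zeppelin instance with the small footprint above could be declared as a Marathon app on the same Mesos cluster (we assume Marathon for long-running services here; the app id, image name, and port are hypothetical):

```json
{
  "id": "/tools/zeppelin",
  "cpus": 0.1,
  "mem": 1024,
  "disk": 100,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "mycompany.registry.com/zeppelin:0.8.2",
      "network": "BRIDGE",
      "portMappings": [{ "containerPort": 8080, "hostPort": 0 }]
    }
  }
}
```

Marathon keeps the container running and restarts it on failure, while the heavyweight Spark clusters themselves remain on-demand.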

As you can see, the key benefit is that the Apache Spark clusters consume no resources except while a job is running. Jobs can also be launched with exactly the desired resources, allowing dynamic tuning of what is consumed. For example, well-parallelized Spark jobs are launched with fine-grained resources (for our data accuracy reporting, we used 1 core and 1 GB RAM with 4 executors, which lets Mesos allocate executors easily), whereas bulk loads such as backups and migrations, submitted as batch jobs from the Metronome scheduler, are allocated 4 cores and 8 GB RAM.
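For illustration, a Metronome job definition for one of those bulk batch loads might look like the following (the job id, image, command, and paths are hypothetical; the 4-CPU/8 GB sizing is from the text, with memory expressed in MB):

```json
{
  "id": "cassandra-backup",
  "run": {
    "cpus": 4,
    "mem": 8192,
    "disk": 1024,
    "docker": { "image": "mycompany.registry.com/sparkexecutor" },
    "cmd": "/opt/spark/bin/spark-submit --master mesos://leader.mesos:5050 --class com.example.Backup /opt/jobs/backup.jar"
  }
}
```

A schedule block can then be attached to the job so the backup runs nightly without holding any resources in between.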

Deploying Applications and DevOps Challenge

Most organizations deploy batch jobs from the command line. However, that does not answer all the questions below. A complete DevOps pipeline for Spark is fairly tricky and complex to design, especially when it involves 50+ ships (datacenters):

  • Source control of the Spark application
  • Building the application with all tests performed
  • Making the JAR file available to the Docker containers in ship environments
  • Targeting deployment to multiple environments
  • Versioning the code so that the application can be backed out
  • Injecting secrets and environment variables
  • Providing the desired dependencies

There are more issues not listed here. The Knoldus DevOps practice lead, Mayank Bairagi, designed an elegant mechanism that addresses all of the above. Please contact us for details. (Hey Mayank, you should write a blog about it!)

Deploying Notebooks in Zeppelin: We are currently in the process of implementing notebook deployment; please watch this section for updates. Zeppelin provides a wonderful REST API for managing notebooks.
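For instance, Zeppelin's notebook REST API can be driven from a deployment script; a sketch, assuming a hypothetical host name and notebook id:

```shell
# Import a notebook that was exported as JSON into the on-ship Zeppelin
curl -X POST "http://zeppelin.ship.local:8080/api/notebook/import" \
     -H "Content-Type: application/json" \
     -d @analysis-notebook.json

# Run all paragraphs of a notebook by its id
curl -X POST "http://zeppelin.ship.local:8080/api/notebook/job/2F1AB2H3C"
```

The import endpoint returns the new notebook's id, which the pipeline can record for later updates or rollbacks.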

Storage Challenge

Most big data applications use NFS, HDFS, or S3 as the storage layer; for Spark jobs to run and achieve parallelization, distributed storage is necessary. In our case, however, none of these is practical on resource-constrained ship environments.

MinIO provides a good alternative storage solution: it exposes an S3-compatible API and integrates easily with AWS S3. The nice part of this solution is that MinIO offers sync functionality, with which we implemented a file-synchronization pattern, allowing us to compare and contrast data across ships. MinIO's footprint can be quite minimal, and it runs as a Docker container.
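The sync pattern can be sketched with MinIO's mc client (a recent mc is assumed; alias names, host names, and bucket paths below are hypothetical):

```shell
# Register the on-ship MinIO server and the shore-side AWS S3 account
mc alias set ship http://minio.ship.local:9000 "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
mc alias set shore https://s3.amazonaws.com "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY"

# Mirror a bucket of summarized data from the ship to shore;
# --overwrite keeps the target in step with the source
mc mirror --overwrite ship/summaries shore/fleet-summaries
```

Because only the summarized bucket is mirrored, the satellite link carries a fraction of the raw data volume.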

We used MinIO extensively: to store temporary data from Spark pipelines, to synchronize, integrate, and back up data with AWS S3, to store application JARs, and more.
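To point Spark's S3 support at the on-ship MinIO rather than AWS, the usual approach is to override the s3a connector's endpoint through Hadoop properties. A hedged sketch (the endpoint URL, credential variables, class, and JAR path are placeholders, not our production values):

```shell
bin/spark-submit \
  --master mesos://mycompany.sparkdispatcher.com \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.ship.local:9000 \
  --conf spark.hadoop.fs.s3a.access.key="$MINIO_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.secret.key="$MINIO_SECRET_KEY" \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --class com.example.SummaryJob \
  /opt/jobs/summary-job.jar
```

The `spark.hadoop.` prefix propagates each property into the Hadoop configuration, and path-style access is needed because MinIO does not serve virtual-hosted bucket URLs by default.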

Design Details

Deploying the solution required the following components.

Spark Mesos Dispatcher

This is a program provided by Spark that accepts Spark jobs via spark-submit.

spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher --port $PORT_DRIVER --webui-port $PORT_UI --master $MESOSURL --name spark --properties-file /etc/spark.conf

The above runs as a Mesos container, so it is constantly up and listening on the designated web UI port. Spark applications are submitted to it as follows.

bin/spark-submit  \
  --master mesos://mycompany.sparkdispatcher.com/drive \
  --deploy-mode cluster \
  --verbose \
  --conf spark.master.rest.enabled=true \
  --conf spark.mesos.executor.docker.image=mycompany.registry.com/sparkexecutor \
  --conf spark.cores.max=8 \
  --conf spark.driver.memory=4G \
  --conf spark.driver.cores=2 \
  --conf spark.mesos.executor.home=/opt/spark \
  --conf spark.executorEnv.save=true \
  --conf spark.executor.heartbeatInterval=30s \
  --conf spark.driver.extraClassPath=/opt/spark/jars/* \
  --conf spark.executor.extraClassPath=/opt/spark/jars/* \
  --class org.apache.spark.examples.SparkPi \
  https://dev1.mesos.rccl.com/nexus/repository/linux-bin/spark-examples_2.11-2.3.1.jar \
  1000

Zeppelin

Using Zeppelin with Spark running on Mesos with the Docker executor poses several version-compatibility challenges. In our case, the following frameworks had to work with each other.

  • Spark (we went with 2.4.4, the latest at the time)
  • Zeppelin (0.8.2, which was the latest)
  • Hadoop embedded in the Spark executors (2.8.5). A note here: the HDFS libraries are required since Spark uses the Hadoop AWS libraries to interact with MinIO, and this is why we had to go with an older version of Hadoop.
  • Apache Mesos
  • Integrations
    • Cassandra
    • Couchbase
    • Kafka
    • Minio
    • AWS S3

As you build the Zeppelin Docker image, you will encounter version conflicts among the various dependencies. Use the Maven Shade plugin to build your application and relocate the conflicting dependencies carefully to ensure compatibility. This is by far the most challenging part of the implementation, as debugging is hard when version incompatibilities occur.
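For example, a relocation in the Shade plugin configuration can move a conflicting library out of the way; here we use Guava, a common offender between Spark, Hadoop, and the Cassandra driver (the chosen pattern and version are illustrative):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Move our copy of Guava out of the way of Spark's -->
            <pattern>com.google.common</pattern>
            <shadedPattern>shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The plugin rewrites both the bundled classes and the bytecode references to them, so the application's version of the library no longer clashes with the one on Spark's classpath.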

Sample zeppelin-env.sh configuration. Update your configuration as per your setup.

export MESOS_NATIVE_LIBRARY=/usr/lib/libmesos.so
export MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so
export HADOOP_CONF_DIR=/opt/hadoop/conf
export MASTER=mesos://zk://master.mesos:2181/mesos
export SPARK_SUBMIT_OPTIONS="--conf spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so --conf spark.driverEnv.MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so --conf spark.mesos.executor.home=/opt/spark --conf spark.driver.memory=${SPARK_DRIVER_MEMORY}"
export SPARK_PUBLIC_DNS=${SPARK_LOCAL_IP}

Spark Executor Docker Image

At the heart of running Spark clusters, we need to provide a Docker image that Mesos can pull to instantiate a new Spark framework/cluster. The image needs Spark and all of the 'provided' dependencies installed.

The executor image included the following components:

  • Hadoop
  • Spark
  • Minio
  • libmesos.so (Mesos binary)
  • Common dependencies that are used repeatedly (to reduce launch time). In our case:
    • Cassandra Driver
    • AWS Libraries to access S3 and Minio
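A minimal sketch of such an executor image (the base image, download URL, and paths are assumptions, not our production Dockerfile; the mesos package is assumed to come from a Mesosphere apt repository added to the image):

```dockerfile
FROM ubuntu:16.04

# JRE plus the Mesos native library (provides /usr/lib/libmesos.so);
# assumes the Mesosphere apt repository has already been configured
RUN apt-get update && apt-get install -y openjdk-8-jre-headless curl mesos

# Spark with its embedded Hadoop libraries, installed under /opt/spark
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz \
      | tar -xz -C /opt \
 && ln -s /opt/spark-2.4.4-bin-hadoop2.7 /opt/spark

# Pre-baked common dependencies: Cassandra driver, hadoop-aws for S3/MinIO
COPY jars/ /opt/spark/jars/

ENV SPARK_HOME=/opt/spark
ENV MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so
```

Baking the common JARs into the image is what keeps executor launch times short: nothing has to be fetched over the satellite link when a cluster spins up.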

Conclusion

The details and architecture above provide broad guidance; in reality, much more debugging and tuning was required to get everything working. We would like to open-source the code in the future, so watch this space.

What was achieved is pretty close to having a Databricks-style notebook environment on every ship with very limited infrastructure. Since resources can be fine-grained, we can deploy multiple jobs at the same time.

Some of the benefits and use cases achieved are

  • Ability to combine information from various sources, including Kafka, Cassandra, and Couchbase. This lets teams dig through data and resolve issues much faster (several hours shrunk to a few minutes), since developers can now write a simple Scala script and run it in a notebook.
  • Ability to fix production issues fast. When unforeseen events happen, we can now write a quick script to analyze and fix the issue on the fly.
  • Ability to watch issues in real time: we can now pipe Kafka streams into Spark, aggregate information, and generate alerts.
  • In the future, we intend to use the same infrastructure to analyze IoT, business events and binary content right on the ship.
  • In the future, we intend to use the infrastructure for summarizing and moving only critical information back to shore, thus saving bandwidth.

I would like to acknowledge the contributions of the various Knoldus and customer engineers who enabled this. The following were part of the effort.

  • Mayank Bairagi
  • Jyothi Kunaparaju
  • Neha Bharadwaj
  • Suresh Mulukaladu
  • Charan Datla

As an Engineer, I help customers in architecting platforms using Spark, Mesos, Cassandra, Kafka (And their commercial versions). As a partner, I guide customers in setting up the organization, processes and build top-notch teams that solve complex problems or deliver digital transformation.

My interests and expertise are in Mathematics, Machine learning, Microservices, Linked data, distributed cloud infrastructure, Real-time enterprise data integration.
