How to audit DC/OS Services?

DC/OS is a distributed systems kernel that lets you manage multiple machines as if they were a single computer. Its built-in web interface makes it easy for users to interact with its services. This is also where things get messy: it becomes hard to keep track of which users are interacting with the services running in the cluster. It is the cluster admin's responsibility to keep track of and audit the DC/OS services.

How to make auditing of DC/OS services possible?

Currently, there is no way to track service changes via the DC/OS web interface, so we will use the old-school approach of digging into logs.

The approach involves parsing the DC/OS logs on the master nodes and extracting the information that will help us audit service changes.
Our main focus will be on the logs generated by dcos-adminrouter.service, one of the components of the DC/OS cluster.

Why dcos-adminrouter.service?

This component proxies HTTP requests from outside the DC/OS cluster to the individual DC/OS services inside the cluster. This means we can filter its logs for service change requests, i.e. the PUT requests.

// Adjust the time window (--since) as per your requirement

$ journalctl -u "dcos-adminrouter.service" -r --since "1 hour ago" | grep type=audit | grep PUT

The above command will give you logs similar to the one below.

// Formatted and edited for better understanding

Dec 27 07:53:40 [Master-Node-FQDN][91289]: 2020/12/27 07:53:40 [notice] 21300#0: *113742601 [lua] ee.lua:57: auditlog(): 
    reason="IAM PQ response" 
    request_uri=/service/marathon/v2/apps//test-service-audit?force=true&partialUpdate=false while sending to client, 
server: master.mesos, 
request: "PUT /service/marathon/v2/apps//test-service-audit?force=true&partialUpdate=false HTTP/1.1", 
host: "", 
referrer: ""
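To see which service a log line refers to, the `request:` field can be picked apart with plain bash parameter expansion. The function below is a minimal sketch, not part of DC/OS; `parse_request` is a hypothetical helper name, and it assumes the raw journal line holds the whole entry on one line with a field shaped like `request: "PUT /service/marathon/v2/apps/<app-id>?<query> HTTP/1.1"`, as in the (reformatted) sample above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the HTTP method and Marathon app ID from one
# raw Admin Router audit line. Assumes a field of the form
#   request: "PUT /service/marathon/v2/apps/<app-id>?<query> HTTP/1.1"
parse_request() {
    local line=$1 request method uri app
    # Bail out if the line has no request field.
    [[ $line == *'request: "'* ]] || return 1
    request=${line#*'request: "'}   # drop everything up to 'request: "'
    request=${request%%'"'*}        # drop the closing quote and the rest
    method=${request%% *}           # first word, e.g. PUT
    uri=${request#* }               # drop the method
    uri=${uri%% *}                  # drop the trailing HTTP version
    uri=${uri%%'?'*}                # drop the query string
    app=${uri#*/v2/apps/}           # Marathon app ID (keeps any leading '/')
    printf '%s %s\n' "$method" "$app"
}
```

For the sample entry above, `parse_request "$log"` should print `PUT /test-service-audit`; the leading slash comes from the double slash in `apps//test-service-audit`. Only Marathon app endpoints (`/v2/apps/`) are handled, so other URIs are passed through unchanged.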

Master nodes work in a quorum to keep cluster coordination consistent, and each request is logged by the Admin Router of the master that served it, so you may not find any logs on a specific master. In that case, check the other master nodes for the service logs.
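Rather than logging into each master by hand, the same query can be run over SSH on every master and each line tagged with its source. This is a minimal sketch under stated assumptions: `MASTERS` is a hypothetical placeholder list you must replace with your own master hostnames or IPs, `audit_all_masters` is a hypothetical helper name, and it assumes key-based SSH plus password-less `sudo` for `journalctl` on the masters.

```bash
#!/usr/bin/env bash
# Hypothetical master list: replace with your own master hostnames/IPs.
MASTERS=${MASTERS:-"master-1 master-2 master-3"}
SINCE=${SINCE:-"1 hour ago"}   # audit window, same meaning as journalctl --since

# Run the Admin Router audit query on each master and prefix every
# resulting line with "[<master>] " so the source stays visible.
audit_all_masters() {
    local master
    for master in $MASTERS; do
        # -n keeps ssh from reading stdin; assumes password-less sudo remotely.
        ssh -n "$master" \
            "sudo journalctl -u dcos-adminrouter.service --since '$SINCE' | grep type=audit | grep PUT" |
            awk -v m="$master" '{ print "[" m "] " $0 }'
    done
}
```

Source the file and call `audit_all_masters`; using `awk -v` for the prefix avoids breaking on hostnames that contain characters special to `sed`.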

Let’s track … !!!

Now that we have the required info, it's time to present it in a meaningful format. I have chosen 3 ways in which a service audit can be carried out to keep track of users' interactions with services at the cluster level in DC/OS.

  • Local Audit
  • Auditing via Prometheus
  • Auditing via Filebeat

The first one is an on-screen audit and can only be done by the cluster admin or anyone with access to the master nodes.

In the next 2 approaches, we will ship the extracted info to Prometheus and Filebeat.
These are all-time favourite tools for gathering metrics and logs and making them available for visualization in Grafana or Kibana.

To keep things grounded, only local auditing is covered in this blog (part 1). Service auditing via Prometheus and Grafana will be covered in the next part of this blog.

How to do a local audit of DC/OS services?

  • SSH into one of the master nodes of your DC/OS cluster.
  • Run the following script with sudo privileges.

#!/usr/bin/env bash
# Audit DC/OS service changes (PUT requests) recorded by Admin Router on this master.
SINCE="1 day ago"                                   # audit window; change as needed
AUDIT_LOG_FILE=$(mktemp /tmp/dcos-service-audit.XXXXXX)
fmt="%-20s %-25s %-40s\n"                           # column layout: TIMESTAMP USER SERVICE
LINE="--------------------------------------------------------------------------------------------"

journalctl -u "dcos-adminrouter.service" -r --since "$SINCE" | grep type=audit | grep PUT > "$AUDIT_LOG_FILE"
echo "$LINE"
if [ -s "$AUDIT_LOG_FILE" ]; then
    # Cluster name: first label of the "host" field in the first matching line.
    CLUSTER=$(grep -m 1 "host" "$AUDIT_LOG_FILE" | awk -F'host: ' '{print $2}' | cut -d'.' -f1 | cut -d'"' -f2)
    echo -e "\t\t\tService Audit Result for ${CLUSTER^^}"
    echo "$LINE"
    printf "$fmt" TIMESTAMP USER SERVICE
    echo "$LINE"
    while IFS= read -r log; do
        AUDIT_USER=$(echo "$log" | awk -F'uid=' '{print $2}' | cut -d"@" -f1 | cut -d" " -f1)
        SVC=$(echo "$log" | awk -F'apps/' '{print $2}' | cut -d" " -f1 | cut -d"?" -f1)
        TIMESTAMP=$(echo "$log" | cut -c 1-15 | tr ' ' '_')
        printf "$fmt" "$TIMESTAMP" "$AUDIT_USER" "$SVC"
    done < "$AUDIT_LOG_FILE"
else
    echo -e "Unable to find the logs related to service changes.\nThings you can try:"
    echo -e "\t1. Change the time window (SINCE) and re-run the script."
    echo -e "\t2. Run this script on the other masters, as the logs are distributed across them."
fi
echo "$LINE"
rm -f "$AUDIT_LOG_FILE"
  • By default, the script audits a span of 1 day; change the time window as per your need.
  • The audit results are displayed on your terminal within a couple of seconds.
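If you also want a quick per-user tally of service changes, the audit lines can be aggregated with `awk`. This is a minimal sketch: `summarize_by_user` is a hypothetical helper name, and it assumes, like the script's own parsing, that each line carries the user as `uid=<name>` optionally followed by `@domain`. It reads any file produced by the `journalctl ... | grep type=audit | grep PUT` pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count service-change lines per user in a file of
# Admin Router audit lines. Assumes the user appears as uid=<name>[@domain].
summarize_by_user() {
    local file=$1
    awk '
        # Capture the user name between "uid=" and the next "@" or space.
        match($0, /uid=[^ @]+/) {
            user = substr($0, RSTART + 4, RLENGTH - 4)
            count[user]++
        }
        END { for (u in count) printf "%s %d\n", u, count[u] }
    ' "$file" | sort -k2,2nr -k1,1   # most active users first, then by name
}
```

Because `awk` iterates its arrays in no defined order, the final `sort` makes the output stable: highest count first, ties broken alphabetically.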

Audit Results

  • If the audit succeeds on a particular master node, you will get a result similar to this:
    [Screenshot: DC/OS Service Audit Pass]
  • If no service-change logs are found, you will get the following output:
    [Screenshot: DC/OS Service Audit Fail]