Introducing OmniSight Telemetry (OST): Advanced, Decoupled Monitoring for the Mule Runtime Engine


Decoupling monitoring infrastructure from an application’s behavior is a foundational technique for scaling a distributed system. Unfortunately, there is no escape from the “Observer Effect”, even in application development. “Observing” an application generally involves either adding code, with its associated complexity costs, or using agent-based techniques. Either approach, when applied incorrectly, can introduce unnecessary performance overhead on both the application and the upstream monitoring systems it feeds data into.

A particular challenge we’ve seen with our clients over the last year or so has been central monitoring aggregation services that are directly coupled to Mule applications and degrade their runtimes. We’ve also seen API traffic patterns have a non-constant (and substantially non-linear) impact on the volume of logs pushed into these monitoring systems, creating performance and scaling challenges for both the Mule runtimes and the central monitoring services.

It therefore generally makes sense to immediately and efficiently offload monitoring metrics to a different system for analysis. By not directly coupling the application being monitored to the monitoring infrastructure, we can scale both independently and minimize the impact of the “Observer Effect”. This approach also makes it possible to distribute and surface the monitoring data in different ways, ranging from real-time alerting to predictive analytics using machine learning techniques.

Given these experiences, we have decided to expand our OmniSuite Application Overlay (OmniSuite AO) with OmniSight Telemetry (OST), which extends the Mule Runtime Engine (MRE) to support advanced, performant real-time monitoring use cases. While the default Mule Agent and Anypoint Runtime Manager support monitoring, they present challenges when deployed in non-trivial, high-traffic microservice and real-time transactional environments. We specifically developed OST to address the following:

  • Memory Pressure & Garbage Collection - Monitoring notifications buffered in the Mule JVM’s heap increase memory pressure and degrade garbage collection in ways that can impact normal Mule application behavior;
  • Thread & CPU Contention - The CPU overhead of the Mule agent sending data to external operational systems “in band” with the application’s normal activity creates thread and CPU contention with deployed Mule applications;
  • Ease of Integration - Integration with the monitoring and distributed tracing systems common in microservice and Platform-as-a-Service architectures;
  • Sampling - Deterministic monitoring sampling to reduce or eliminate load on downstream monitoring traffic aggregators (e.g. Splunk);
  • Extended Latency Monitoring & Granular Reporting - The ability to monitor and report on the latency of backend services, as well as granular reporting on the amount of overhead Mule introduces when mediating between systems.

Architecture

[Figure: Hexagonal API-led architecture]

OST introduces a custom Mule agent that immediately offloads all monitoring and API notifications, logging, and real-time JMX data to either a low-latency, high-throughput disk-based queue or an external messaging technology such as Kafka or JMS.
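
As a rough sketch of this offload pattern (illustrative only, not OST’s actual internals), the following Java shows an agent-side callback handing each serialized notification to a Kafka producer asynchronously, so the Mule processing thread never blocks on monitoring I/O. The onNotification hook and the “ost-notifications” topic name are assumptions made for the example.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

/**
 * Illustrative sketch only: offloads serialized Mule notifications to Kafka
 * asynchronously so the application thread is never blocked by monitoring I/O.
 * The onNotification hook and the topic name are hypothetical.
 */
public class OffloadingNotificationListener {

    private final KafkaProducer<String, String> producer;

    public OffloadingNotificationListener(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Favor low blocking over delivery guarantees: monitoring data is
        // sacrificed before application throughput.
        props.put("acks", "0");
        props.put("linger.ms", "5");
        this.producer = new KafkaProducer<>(props);
    }

    /** Called by the agent for each notification; returns immediately. */
    public void onNotification(String runtimeId, String jsonPayload) {
        // Fire-and-forget: batching and network I/O happen on the producer's
        // background sender thread, not on the Mule processing thread.
        producer.send(new ProducerRecord<>("ost-notifications", runtimeId, jsonPayload));
    }

    public void close() {
        producer.close();
    }
}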

A separate JVM process, the OST Collector, subscribes to these notifications and lets you define temporal SQL queries that aggregate and route them to arbitrary downstream systems. This process can also run in a completely separate VM or container using the sidecar pattern.
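
To make the collector side concrete, here is a minimal sketch assuming the Kafka transport from the previous example and a hypothetical WindowedQueryEngine interface standing in for whatever evaluates the configured temporal SQL and routes results to the configured sinks.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

/**
 * Illustrative sketch of a sidecar collector loop. WindowedQueryEngine is a
 * hypothetical stand-in for the component that evaluates the configured
 * temporal SQL and forwards results to sinks.
 */
public class CollectorLoop {

    /** Hypothetical: evaluates windowed queries and forwards results to sinks. */
    interface WindowedQueryEngine {
        void offer(String jsonEvent);
    }

    public static void run(String bootstrapServers, WindowedQueryEngine engine) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", "ost-collector");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ost-notifications"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Aggregation happens here, in the collector JVM's heap,
                    // never in the Mule runtime's heap.
                    engine.offer(record.value());
                }
            }
        }
    }
}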

Example Configuration

OST uses a JSON-like configuration file to specify “sinks” to which aggregated monitoring data is sent. The data sent to each sink is specified via “queries” that operate in real time over the event stream coming from the Mule agent.

The example configuration below sets up the following sinks:

  • InfluxDB: InfluxDB is a time series database that can be used to store monitoring data in large microservice deployments. In this example we have two InfluxDB measurements that we’ll be sending data to. The first sink, “influxdb-tps”, will store rolled-up transaction-per-second metrics for every Mule Runtime Engine. The second, “influxdb-heap”, will store rolled-up memory metrics sampled from the Mule Runtime Engine.
  • Splunk HEC: Splunk is the ubiquitous machine data storage and reporting solution for the enterprise. Splunk exposes the HTTP Event Collector (HEC), which can be used to programmatically receive monitoring data via its REST API (see the sketch after this list). The “splunk-hec” sink will be used to send rolled-up transaction-per-second metrics.
  • PagerDuty: PagerDuty is a SaaS-based incident response platform. The “pager-duty-sink” sink will let us send alerts to PagerDuty in real time via its API.
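
As an aside, the sketch below shows roughly what a single delivery from the “splunk-hec” sink to the HTTP Event Collector could look like, using Java’s built-in HTTP client. The event field names mirror the query aliases in the example configuration below and are assumptions for illustration, not OST’s exact wire format.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Illustrative only: sends a single rolled-up TPS data point to the Splunk
 * HTTP Event Collector. The event field names mirror the query aliases in the
 * example configuration; OST's actual payload format may differ.
 */
public class SplunkHecExample {
    public static void main(String[] args) throws Exception {
        String url = "https://foo.cloud.splunk.com:8088/services/collector/event";
        String token = "123456-1234-1234-abcd-123456";

        // Hypothetical sourcetype and sample values for one 1-second window.
        String body = "{\"sourcetype\": \"ost:tps\", "
                + "\"event\": {\"ts\": \"2016-01-01T00:00:05Z\", \"tps\": 1250}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Splunk " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}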

To control what data is sent to the sinks, we define SQL queries that operate over the real-time event stream coming out of the Mule Runtime Engine. This configuration defines the following queries:

  • "Average Memory Roll Up" uses the OmniSight Mule Agent’s sampling of heap data to send 5 seconds of average memory utilization to InfluxDB.
  • "API Transactions per Second" rolls up the count of API transactions every second and sends them to InfluxDB and Splunk
  • "High Memory Alert" will generate a PagerDuty incident when memory utilization goes past the 80% threshold for a Mule Runtime Engine.

collector {

  queue-directory: "/data/omnisight",

  sinks: {
    "influxdb-tps": {
      type: "influxdb-tps",
      url: "https://influx.omni3tech.com:8086",
      username: "root",
      password: "root",
      database: "mule",
      measurement: "mule_avg_mem"
    },
    "influxdb-heap": {
      type: "influxdb-heap",
      url: "https://influx.omni3tech.com:8086",
      username: "root",
      password: "root",
      database: "mule",
 measurement: "mule_tps"
    },
    "splunk-hec": {
      type: "splunk-hec",
      url: "https://foo.cloud.splunk.com:8088/services/collector/event",
      token: "123456-1234-1234-abcd-123456"
    },
    "pager-duty-sink": {
      sid: "foo",
      token: "foo"
    }
  },
  queries: [
    {
      name: "Average Memory Roll Up",
      query: "SELECT TUMBLE_END(rowtime, INTERVAL '5' SECOND) AS ts,AVG(mem_heap_percent_used) AS avgMem FROM jmx GROUP BY TUMBLE(rowtime, INTERVAL '5' SECOND)",
      fields: "ts,avg_mem",
      sinks: [
        "influxdb-heap"
      ]
    },
    {
      name: "API Transactions per Second",
      query: "SELECT TUMBLE_END(rowtime, INTERVAL '1' SECOND) AS ts,COUNT(*) AS tps FROM agent GROUP BY TUMBLE(rowtime, INTERVAL '1' SECOND)",
      fields: "ts,tps",
      sinks: [
        "influxdb-tps",
        "splunk-hec"
      ]
    },
    {
      name: "High Memory Alert",
      query: "SELECT message.id AS MULE_MESSAGE_ID FROM jmx WHERE mem_heap_percent_used > 80",
      fields: "ts,avg_mem",
      sinks: [
        "pager-duty-sink"
      ]
    }
  ]
}

Queries using the “GROUP BY” clause ensure that a deterministic amount of data is sent to the target sink. In the case of "API Transactions per Second", for instance, whether 1 or 1,000,000 Mule transactions occur in a given second, only a single API call per window is made to the Splunk HTTP Event Collector. The aggregation only impacts the heap of the separate OmniSight Collector instance, with no impact on the Mule Runtime Engine’s performance.
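
The snippet below is a conceptual illustration of why a tumbling-window aggregation caps downstream volume (it is not OST’s query engine): however many events fall into a one-second window, exactly one aggregate per window is emitted, and therefore at most one downstream call per window is made.

import java.util.Map;
import java.util.TreeMap;

/**
 * Conceptual illustration of a 1-second tumbling-window count: no matter how
 * many input events arrive, at most one aggregate per window (and therefore
 * at most one downstream API call per window) is produced.
 */
public class TumblingWindowCount {
    public static void main(String[] args) {
        long[] eventTimestampsMillis = {1000, 1010, 1250, 1999, 2000, 2001, 3500};

        // windowStartMillis -> event count
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : eventTimestampsMillis) {
            long windowStart = (ts / 1000) * 1000;  // floor to the 1s window
            counts.merge(windowStart, 1L, Long::sum);
        }

        // One emission per window, regardless of how many events it contained.
        counts.forEach((windowStart, count) ->
                System.out.printf("window [%d, %d): tps=%d%n",
                        windowStart, windowStart + 1000, count));
    }
}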

Our load testing has shown that, under stress, the OST Collector uses on average between 50 and 80 MB of heap with a configuration similar to the one above, with minimal CPU impact. This makes it practical to run on the same VM or container as the Mule Runtime Engine.

OST was designed to make monitoring microservices, API gateways, and hybrid integration platforms easier when they are deployed on highly distributed and transactional infrastructure. As OST evolves, we plan to add support for other frameworks besides Mule, add support for more upstream monitoring systems, and layer in machine learning capabilities for predictive alerting.

Trying it Out

Please contact ost-trial@omni3tech.com to discuss trial options.