SRE, APIMs, and Microservices - Countering Microservice and API Sprawl to Realize Your API-Led Vision


With the proliferation of service meshes composed of microservices, Application Programming Interfaces (APIs), and the API Management (APIM) solutions used to develop, deploy, manage, and govern them, the need for Site Reliability Engineering (SRE) best practices to converge with APIM-produced machine data (application and infrastructure logging) has never been greater. Without the ability to intelligently collect, infer, and extract insights that improve the performance of the enterprise ecosystem of applications and services that make up your service mesh, API and microservice "sprawl" will inexorably overwhelm your ability to manage, much less govern, their evolution, eventually eclipsing even "Shadow IT" as a source of risk.

This blog post focuses on Omni3's established practices and patterns for integrating SRE with MuleSoft APIM. I also spend time highlighting OmniSuite products designed to be added as an application overlay to your service mesh, giving Ops teams a single pane of glass for managing their microservice and API-led enterprise ecosystems.

We will review a number of foundational concepts and offer contrasting (but not mutually exclusive) approaches for achieving the operational excellence all enterprises strive for: providing their Lines of Business (LoBs) and end users with the most highly available, scalable, and performant service mesh of APIs and microservices possible.

White vs Black Box Monitoring

White box monitoring is based on metrics exposed by the internals of the system: logs, interfaces such as the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics. Black box monitoring, in contrast, tests externally visible behavior from a user's perspective. A successful monitoring strategy combines both approaches.
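
As a concrete illustration of the white box side, the following is a minimal sketch of an HTTP handler that emits internal statistics. The counter names, port, and endpoint path are illustrative assumptions rather than part of any particular product.

```python
# Hypothetical sketch: a minimal white-box metrics endpoint using only the
# Python standard library. Counter names, port, and path are assumptions.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Internal counters the application updates as it processes requests.
STATS = {"requests_total": 0, "errors_total": 0, "inflight": 0}
STATS_LOCK = threading.Lock()

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/internal/stats":
            self.send_response(404)
            self.end_headers()
            return
        with STATS_LOCK:
            body = json.dumps(STATS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Expose internal statistics on a side port for the monitoring system to scrape.
    HTTPServer(("0.0.0.0", 9102), StatsHandler).serve_forever()
```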

White Box Monitoring

White box monitoring, in the form of BMC Patrol, Splunk, Tcl scripts, and the like, represents the most familiar foundation for machine data collection and analysis in most enterprise infrastructures. While these tools cover a great deal, we generally recommend reviewing the following checklist against your white box monitoring applications to holistically improve coverage (a collection sketch follows the checklist):

  • Custom Scripts (Python, Tcl, etc.)
    • Monitoring intervals should be aligned with Service Level Agreements (SLAs) and, as needed, Service Level Objectives / Indicators (SLOs / SLIs) to avoid gaps in expectations and alerting
      • Misalignment has a significant impact on committed uptime SLAs
    • Scripts that report only binary (up/down) health don’t expose granular data about the performance / behavior of critical subsystems
  • Runtime Health Check Application
  • The following additional monitors/SLIs should be introduced
    • JVM Memory
      • Max
      • Min
      • Restarts
    • Host
      • Uptime
      • Disk I/O
      • Network IO
      • Syscall Count
      • Egress network connections / sessions
      • Size of /etc/group, /etc/passwd
      • Directory count in /home, file/directory count in $MULE_HOME, /root
    • Transactions
      • Historical Transaction Counts per API instance and aggregate
        • 15, 60, 300, and 600 seconds
      • Historical count per error type
        • 15, 60, 300, and 600 seconds
      • Max transaction time
        • 15, 60, 300, and 600 seconds
      • Total
      • TPS
    • API(s) / Microservice(s) Introduced Latency
      • 15, 60, 300, and 600 seconds
      • Per Operation
      • Aggregate
    • Backend Service
      • Latency
    • Requests
      • min, mean, max bytes
      • Per Operation
      • Aggregate
    • Responses
      • min, mean, max bytes
      • Per Operation
      • Aggregate
    • Connection
      • Duration
    • Errors
      • RAML validation failure on proxies
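
To make the checklist concrete, here is a minimal collection sketch for a few of the host-level SLIs above, emitting one JSON event per interval for Splunk (or any log forwarder) to ingest. The paths, field names, and 60-second interval are assumptions about the target environment.

```python
# Hypothetical host-SLI collector sketch. Emits one JSON event per line for a
# log forwarder to pick up; field names and interval are assumptions.
import json
import os
import subprocess
import time

def read_uptime_seconds():
    # /proc/uptime holds "<uptime> <idle>" on Linux hosts.
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

def collect():
    # Established TCP sessions via iproute2's `ss`; first output line is a header.
    sessions = subprocess.run(["ss", "-t", "state", "established"],
                              capture_output=True, text=True).stdout.splitlines()
    return {
        "timestamp": time.time(),
        "uptime_seconds": read_uptime_seconds(),
        "etc_passwd_lines": count_lines("/etc/passwd"),
        "etc_group_lines": count_lines("/etc/group"),
        "home_dir_count": len(os.listdir("/home")),
        "egress_sessions": max(0, len(sessions) - 1),
    }

if __name__ == "__main__":
    while True:
        print(json.dumps(collect()), flush=True)  # one event per line for the forwarder
        time.sleep(60)
```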

Black Box Monitoring

Black box monitoring is currently in place in the form of LTM health checks at the server level and a Runtime Health Check app that exposes the binary health of a runtime. As mentioned above, the Runtime Health Check could be extended into an “actuator” that exposes standardized monitoring data for every API.

The following improvements could be made to holistically improve black box monitoring coverage:

  • Anypoint Monitoring
    • Leverage Anypoint Functional Monitoring to build API tests that assert each API’s client-facing behavior and whether it actually conforms to its contract (see the probe sketch after this list)
  • “Actuator” based Runtime Health Checks
    • Expand the runtime health check into a common library deployable across the API proxies as well as the backend services, to facilitate multiple levels of black box monitoring on a per-API basis
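
As a sketch of what such a black box probe might look like, the following exercises an API purely from the client's perspective and asserts that it honours its contract. The URL, expected fields, and latency budget are illustrative assumptions, not values taken from Anypoint Functional Monitoring.

```python
# Hypothetical black-box probe: checks only externally visible behavior.
import sys
import time
import requests

PROBE_URL = "https://api.example.com/quotes/v1/health"  # hypothetical endpoint
LATENCY_BUDGET_SECONDS = 1.0

def probe():
    start = time.monotonic()
    resp = requests.get(PROBE_URL, timeout=5)
    elapsed = time.monotonic() - start

    assert resp.status_code == 200, f"unexpected status {resp.status_code}"
    assert elapsed <= LATENCY_BUDGET_SECONDS, f"latency {elapsed:.2f}s over budget"
    body = resp.json()
    # Contract assertion: the response must advertise the fields consumers rely on.
    assert body.get("status") == "UP", f"unexpected body {body}"
    return elapsed

if __name__ == "__main__":
    try:
        print(f"probe ok in {probe():.3f}s")
    except Exception as exc:  # any failure is a black-box signal, not a diagnosis
        print(f"probe failed: {exc}")
        sys.exit(1)
```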

Dashboards

The current Splunk dashboarding should be expanded to display the aggregate “4 Golden Signals” as SLIs (a computation sketch follows the lists below):

  1. Latency
  2. Traffic
  3. Errors
  4. Saturation

These can be reported across the following dimensions:

  • Enterprise
    • Mule Runtime Engines
    • All hosted API proxies
  • Line of Business
    • All APIs per LoB
    • Individually for each API
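
The sketch below shows one way the four golden signals could be derived for a single reporting window from a batch of request records. The record fields and the saturation proxy (in-flight work versus worker capacity) are assumptions.

```python
# Hypothetical golden-signals calculation for one reporting window.
from dataclasses import dataclass
from typing import List

@dataclass
class RequestRecord:
    latency_ms: float
    is_error: bool

def golden_signals(records: List[RequestRecord], window_seconds: int,
                   inflight: int, worker_capacity: int) -> dict:
    count = len(records)
    return {
        # Latency: in practice, report error and success latency separately.
        "latency_ms_avg": sum(r.latency_ms for r in records) / count if count else 0.0,
        # Traffic: requests per second over the window.
        "traffic_rps": count / window_seconds,
        # Errors: fraction of failed requests.
        "error_rate": sum(1 for r in records if r.is_error) / count if count else 0.0,
        # Saturation: how "full" the service is, approximated here by in-flight work.
        "saturation": inflight / worker_capacity if worker_capacity else 0.0,
    }

if __name__ == "__main__":
    sample = [RequestRecord(120.0, False), RequestRecord(950.0, True), RequestRecord(80.0, False)]
    print(golden_signals(sample, window_seconds=60, inflight=4, worker_capacity=16))
```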

Depending on whether a time-series database is adopted and on the level of dashboard sophistication desired, a more dedicated dashboarding / visualization tool could be used. These solutions aren’t mutually exclusive, and some lend themselves to more sophisticated graphing and analysis, as well as more comprehensive and refined graphics.

Distributed Tracing - Addressing the Downside of API-Led and Microservices Architectures

While microservice and API-led architectural approaches provide enormous flexibility and scalability, the trade-off is that the business agility they deliver comes at the cost of operational complexity. These architectures also inherit much of the complexity common to distributed systems, the most vexing of which is tracing errors and requests across a multitude of service invocations.

As such, a variety of frameworks have evolved to trace API requests across a distributed API/microservice ecosystem. These frameworks “tag” requests with metadata and propagate that metadata across a chain of service invocations. In the case of HTTP, the metadata is propagated using HTTP headers. The OpenTracing standard (https://opentracing.io/) has emerged to provide a common, vendor-neutral API for this kind of instrumentation.
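
A minimal sketch of this "tagging" pattern follows: it generates trace and span identifiers and propagates them downstream as HTTP headers. The header names follow the B3 convention commonly used with OpenTracing-compatible tracers, but treat this as a simplification; a real deployment should rely on the tracer library's own inject/extract mechanisms.

```python
# Hypothetical trace-propagation sketch (one-span-per-hop simplification).
import uuid
import requests

def incoming_trace(headers: dict) -> dict:
    # Reuse the caller's trace id if present, otherwise start a new trace.
    return {
        "trace_id": headers.get("X-B3-TraceId", uuid.uuid4().hex),
        "parent_span_id": headers.get("X-B3-SpanId"),
    }

def outgoing_headers(trace: dict) -> dict:
    # Each hop gets its own span id but keeps the trace id, so a log search on
    # the trace id reconstructs the full chain of service invocations.
    headers = {
        "X-B3-TraceId": trace["trace_id"],
        "X-B3-SpanId": uuid.uuid4().hex,
    }
    if trace.get("parent_span_id"):
        headers["X-B3-ParentSpanId"] = trace["parent_span_id"]
    return headers

def call_downstream(trace: dict, url: str):
    # Propagate the trace context on the outbound request.
    return requests.get(url, headers=outgoing_headers(trace))
```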

In general, we leverage the following in order of preference:

Decoupled Monitoring

As enterprises deploy microservices and APIs across their ecosystems, we have encountered a tendency among clients to tightly couple their monitoring architecture to their preferred log analysis and monitoring solution (e.g. Splunk, AppDynamics, etc.). However, in order to facilitate deeper analytics, intelligence, and predictive capabilities, the monitoring data should ideally be stored in a time-series database independent of any point monitoring system.

This can be accomplished orthogonally, without disrupting the current monitoring infrastructure: install OmniSuite Telemetry agents to normalize and ingest the data into your existing monitoring systems, then leverage those systems' APIs to load the data into a time-series database such as InfluxDB. Machine learning, predictive analytics, “smart alerting”, and so on can then be implemented off this database as required, augmented by real-time streaming analytics from our OmniExhaust streaming product. The following depicts such an architecture:

[Architecture diagram: decoupled monitoring with OmniSuite Telemetry agents feeding a time-series database]

This architecture leaves the current monitoring infrastructure untouched and instead introduces a Mule application that either polls the various monitoring systems-of-record or has data pushed to it by them. This application, after normalizing the monitoring data to a canonical format, can then do any of the following (a minimal ingestion sketch appears after the list):

  • Store the stream of events in real time to a time series database
  • Emit a stream of events over a messaging channel (Anypoint MQ, JMS, Kafka, etc.) that other systems can operate on in real time. An example might be an Apache Spark application that analyzes the event stream and generates an alert if it notices an anomaly or variance in the monitoring data.
    • These “composite” events can be sent to a time series database for storage and offline analysis
  • Offline model training and subsequent predictive analysis and reporting using the time series database
  • Advanced visualization and data mining using tools like Tableau and R
  • Traditional report generation
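
A minimal ingestion sketch for this pattern might look like the following: poll a monitoring system's API, normalize the events to a canonical shape, and write them to InfluxDB over its 1.x HTTP line-protocol endpoint. The source URL, field names, and database name are assumptions, and the OmniSuite Telemetry agents themselves are not modelled here.

```python
# Hypothetical poll -> normalize -> store pipeline for decoupled monitoring.
import time
import requests

MONITORING_API = "https://monitoring.example.com/api/events"  # hypothetical system of record
INFLUX_WRITE_URL = "http://influxdb.example.com:8086/write?db=telemetry&precision=s"

def normalize(raw: dict) -> dict:
    # Canonical event format shared by all downstream consumers.
    return {
        "api": raw.get("apiName", "unknown"),
        "latency_ms": float(raw.get("responseTime", 0)),
        "status": int(raw.get("httpStatus", 0)),
        "timestamp": int(raw.get("epochSeconds", time.time())),
    }

def to_line_protocol(event: dict) -> str:
    # InfluxDB 1.x line protocol: measurement,tag=value field=value timestamp.
    # Assumes API names contain no spaces (tags would otherwise need escaping).
    return (f"api_metrics,api={event['api']} "
            f"latency_ms={event['latency_ms']},status={event['status']}i "
            f"{event['timestamp']}")

def poll_and_store():
    raw_events = requests.get(MONITORING_API, timeout=10).json()
    lines = "\n".join(to_line_protocol(normalize(e)) for e in raw_events)
    if lines:
        requests.post(INFLUX_WRITE_URL, data=lines.encode("utf-8"), timeout=10)

if __name__ == "__main__":
    while True:
        poll_and_store()
        time.sleep(60)
```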

Smart Alerting

A decoupled monitoring approach also presents opportunities to perform data mining on historical monitoring data. An obvious by-product is automated report generation, but the data can also be used to train machine learning models. This provides the foundation for predictive analysis of monitoring event data, warning of possible outage conditions before they occur.
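
As a simple illustration of the idea, the sketch below flags anomalies in a stream of metric samples using a rolling z-score. The window size and threshold are illustrative, and a production implementation would more likely rely on trained models or the analytics features of the tools mentioned below.

```python
# Hypothetical "smart alerting" sketch: rolling z-score anomaly detection.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if the new value is anomalous relative to recent history."""
        anomalous = False
        if len(self.samples) >= 10:  # require some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.samples.append(value)
        return anomalous

if __name__ == "__main__":
    detector = RollingAnomalyDetector()
    stream = [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 101, 450]  # latency in ms
    for sample in stream:
        if detector.observe(sample):
            print(f"alert: latency sample {sample}ms deviates from recent baseline")
```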

Additionally, monitoring and incident-response software with predictive analytics features can be used in this capacity; examples include TrueSight and PagerDuty:

https://www.bmc.com/it-solutions/truesight.html
https://www.pagerduty.com

Incident Resolution

In order to improve MTTR (mean time to resolution), operations teams should distinguish between “symptoms” and “causes.” This applies both to root cause analysis and to the remediation plan produced as the output of a post-mortem.

It’s also important that the interpretation of monitoring data during incident resolution focuses on the “tail” of the metrics. Mean values aren’t generally helpful during outages because they mask the outliers that usually point to where the issue is. A single database query, for instance, might be incredibly slow yet be lost in the noise of the average across all queries.
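
The toy example below illustrates the point: the same latency samples summarized as a mean versus tail percentiles. The sample data is fabricated for illustration.

```python
# Hypothetical illustration of mean vs. tail percentiles during triage.
import math

def percentile(sorted_values, pct):
    # Nearest-rank percentile over an already-sorted list.
    k = max(0, math.ceil(pct / 100.0 * len(sorted_values)) - 1)
    return sorted_values[k]

if __name__ == "__main__":
    # 97 healthy queries and 3 pathological ones.
    latencies_ms = sorted([40.0] * 97 + [5_000.0] * 3)
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    print(f"mean: {mean_ms:.1f} ms")                       # ~188.8 ms, hides the outliers
    print(f"p95 : {percentile(latencies_ms, 95):.1f} ms")  # 40.0 ms
    print(f"p99 : {percentile(latencies_ms, 99):.1f} ms")  # 5000.0 ms, where the issue lives
```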

A script should be introduced to simplify an “autopsy” of the Mule Runtime Engines and other nodes in the event of an incident. This script should be parameterized and quickly allow operations staff to obtain the following (a sketch appears after the list):

  • JVM Thread Dump
  • JVM Heap Dump
  • CPU Load
  • Memory Pressure
  • IO Pressure
  • Mule Runtime Engine
    • Apps running
    • WARN and ERROR level logs
    • Current TPS

An incident resolution workflow should be defined and tracked. Software like PagerDuty (https://www.pagerduty.com/) can be used to manage the lifecycle of incident response, including the alert flows. It can additionally verify that the procedure laid out in the alert flows is being followed, monitor SLAs around incident response, and so on.

Documentation / runbooks should also be maintained so operational staff can follow the appropriate procedures in the event of an incident. These will likely exist as Confluence pages.

This will include a Smart Diagram that allows operational staff to drill into the infrastructure. A Confluence plugin like the following could be used to simplify the construction of these diagrams:

https://marketplace.atlassian.com/apps/254/gliffy-diagram-for-confluence?hosting=cloud&tab=overview

Post Mortems

A post-mortem is a meeting that occurs after a production incident to discuss what went wrong and how to prevent the issue from happening again. For post-mortems to be effective, they must be “blameless,” ensuring open and honest communication about what went wrong and how to fix it. Root cause analysis should be the focus (the subjects to cover are listed below).

  • Post-mortems need to be blameless to be truly effective.
  • Root cause analysis should be the goal.
  • Issues and conclusions drawn from the post-mortem should be linked back to SLI → SLO → SLA
  • Post-mortems should ALWAYS be completed and distributed / published to Confluence; otherwise the exercise is of little value
Post-Mortem Subjects to Cover:

  • Status: Complete / In Progress / etc.
  • Summary: What happened (e.g. service X went down, or service Y was non-performant with an average 1.2 second response time), over what period (e.g. for 80 minutes), and as a result of what general event (e.g. a new application coming online, or an upgrade to an API that broke things).
  • Root Cause: An explanation of the circumstances and contributing factors under which the incident occurred. Look at physical, human, and organizational causes, and for each aspect: a) describe what the problem looks like from that aspect (e.g. physical vs. human), b) list the data collected that supports those conclusions, and c) identify possible causal factors using analysis tools such as i) Appreciation (asking “So what?” of each relevant data point), ii) the 5 Whys (continuing to ask “Why?” until no further answers are possible), iii) Drill Down (decomposing the problem into smaller and smaller components until it can be decomposed no further), and iv) Cause and Effect diagramming.
  • Impact: The overall effect on users (e.g. over 1 million requests for insurance quotations were delayed or failed) and on revenue (e.g. a potential loss of xxx dollars related to failed insurance quotations).
  • Trigger: The event that precipitated the problem (e.g. the outage of a System API caused a batch job that synchronizes balance data to fail).
  • Resolution: The steps taken to resolve the problem (e.g. rollback to a prior version, redirecting traffic, etc.).
  • Detection: How the error/problem was detected (e.g. long delays in insurance quotations, or failed quotes reported by xxx users).
  • Action Items: The tasks and owners assigned to rectify the problem.
  • Lessons Learned: Cover all three areas: 1) what went well, 2) what went wrong, and 3) where we got lucky.
  • Timeline: Essentially a “screenplay” of the incident, with clearly denoted dates and normalized times (i.e. UTC).