Monitoring and Performance. Prometheus, Grafana, APMs and more

  1. Monitoring and Observability
    1. Key Performance Indicator (KPI)
  2. OpenShift Cluster Monitoring Built-in solutions
    1. OpenShift 3.11 Metrics and Logging
      1. Prometheus and Grafana
      2. Custom Grafana Dashboard for OpenShift 3.11
      3. Capacity Management Grafana Dashboard
      4. Software Delivery Metrics Grafana Dashboard
      5. Prometheus for OpenShift 3.11
    2. OpenShift 4
  3. Monitoring micro-front ends on kubernetes with NGINX
  4. Prometheus vs OpenTelemetry
  5. Prometheus
  6. Grafana
  7. Kibana
  8. Prometheus and Grafana Interactive Learning
  9. Logging & Centralized Log Management
    1. ElasticSearch
      1. Elastic Cloud on Kubernetes (ECK)
    2. OpenSearch
    3. EFK ElasticSearch Fluentd Kibana
    4. Logstash Grok for Log Parsing
  10. Internet Performance Monitoring (IPM)
  11. Performance
  12. List of Performance Analysis Tools
    1. Thread Dumps. Debugging Java Applications
  13. Debugging Java Applications on OpenShift and Kubernetes
  14. Distributed Tracing. OpenTelemetry and Jaeger
    1. Microservice Observability with Distributed Tracing. OpenTelemetry.io
      1. OpenTelemetry Operator
    2. Jaeger VS OpenTelemetry. How Jaeger works with OpenTelemetry
    3. Jaeger vs Zipkin
    4. Grafana Tempo distributed tracing system
  15. Application Performance Management (APM)
    1. Elastic APM
    2. Dynatrace APM
  16. Message Queue Monitoring
    1. Red Hat AMQ 7 Broker Monitoring solutions based on Prometheus and Grafana
  17. Serverless Monitoring
  18. Distributed Tracing in Apache Beam
  19. Krossboard Converged Kubernetes usage analytics
  20. Instana APM
  21. Monitoring Etcd
  22. Zabbix
  23. Other Tools
  24. Other Awesome Lists
  25. Slides
  26. Tweets

Monitoring and Observability

Key Performance Indicator (KPI)

OpenShift Cluster Monitoring Built-in solutions

OpenShift 3.11 Metrics and Logging

OpenShift Container Platform Monitoring ships with a Prometheus instance for cluster monitoring and a central Alertmanager cluster. In addition to Prometheus and Alertmanager, OpenShift Container Platform Monitoring also includes a Grafana instance as well as pre-built dashboards for cluster monitoring troubleshooting. The Grafana instance that is provided with the monitoring stack, along with its dashboards, is read-only.

Monitoring Component | Release | URL
ElasticSearch | 5 | OpenShift 3.11 Metrics & Logging
Fluentd | 0.12 | OpenShift 3.11 Metrics & Logging
Kibana | 5.6.13 | kibana 5.6.13
Prometheus | 2.3.2 | OpenShift 3.11 Prometheus Cluster Monitoring
Prometheus Operator | | Prometheus Operator technical preview
Prometheus Alert Manager | 0.15.1 | OpenShift 3.11 Configuring Prometheus Alert Manager
Grafana | 5.2.3 | OpenShift 3.11 Prometheus Cluster Monitoring

Prometheus and Grafana

Custom Grafana Dashboard for OpenShift 3.11

By default, the OpenShift 3.11 Grafana is a read-only instance, but many organizations want to add their own custom dashboards. This custom Grafana works with the existing Prometheus and provides all the out-of-the-box dashboards plus a few more that may be needed for day-to-day operations. The custom Grafana pod uses OpenShift OAuth to authenticate users and assigns the “Admin” role to all of them so that they can create their own dashboards for additional monitoring.

Getting Started with Custom Dashboarding on OpenShift using Grafana. This repository contains scaffolding and automation for developing a custom dashboarding strategy on OpenShift using the OpenShift Monitoring stack.

Capacity Management Grafana Dashboard

This repo adds a capacity management Grafana dashboard. The intent of this dashboard is to answer a single question: Do I need a new node? We believe this is the most important question when setting up a capacity management process. We are aware that this is not the only question a capacity management process may need to answer. Thus, this dashboard should be considered a starting point for organizations to build their capacity management process.
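
As a rough illustration of the "Do I need a new node?" question, the sketch below compares requested resources against allocatable capacity. It is not the logic used by the dashboard in this repo: the function name needs_new_node, its three arguments, and the threshold are all hypothetical, and the numbers would have to come from your own monitoring data.

```shell
#!/bin/sh
# needs_new_node REQUESTED ALLOCATABLE THRESHOLD_PCT
# Hypothetical helper, not taken from the dashboard repo.
# Prints "yes" when REQUESTED exceeds THRESHOLD_PCT percent of ALLOCATABLE,
# otherwise "no". REQUESTED and ALLOCATABLE must use the same unit
# (for example CPU cores or GiB of memory).
needs_new_node() {
    awk -v req="$1" -v alloc="$2" -v pct="$3" 'BEGIN {
        # No allocatable capacity at all: any workload needs a new node.
        if (alloc <= 0) { print "yes"; exit }
        if ((req / alloc) * 100 > pct) print "yes"; else print "no"
    }'
}
```

For example, needs_new_node 14 16 80 asks whether 14 requested cores out of 16 allocatable cores exceed an 80% threshold. The threshold itself is a policy decision that each organization has to make.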

Software Delivery Metrics Grafana Dashboard

This repo contains tooling to help organizations measure Software Delivery and Value Stream metrics.

Prometheus for OpenShift 3.11

This repo contains example components for running either an operational Prometheus setup for your OpenShift cluster, or a standalone secured Prometheus instance that you configure yourself.

OpenShift 4

OpenShift Container Platform includes a pre-configured, pre-installed, and self-updating monitoring stack that is based on the Prometheus open source project and its wider ecosystem. It provides monitoring of cluster components and includes a set of alerts that immediately notify the cluster administrator about any problems that occur, as well as a set of Grafana dashboards. The cluster monitoring stack is only supported for monitoring OpenShift Container Platform clusters.

OpenShift Cluster Monitoring components cannot be extended since they are read only.

Monitor your own services (technology preview): The existing monitoring stack can be extended so you can configure monitoring for your own Services.

Monitoring Component Deployed By Default OCP 4.1 OCP 4.2 OCP 4.3 OCP 4.4
ElasticSearch No 5.6.13.6
Fluentd No 0.12.43
Kibana No 5.6.13
Prometheus Yes 2.7.2 2.14.0 2.15.2
Prometheus Operator Yes 0.34.0 0.35.1
Prometheus Alert Manager Yes 0.16.2 0.19.0 0.20.0
kube-state-metrics Yes 1.8.0 1.9.5
Grafana Yes 5.4.3 6.2.4 6.4.3 6.5.3

Monitoring micro-front ends on kubernetes with NGINX

Prometheus vs OpenTelemetry

Prometheus

Grafana

Kibana

Prometheus and Grafana Interactive Learning

Logging & Centralized Log Management

ElasticSearch

Elastic Cloud on Kubernetes (ECK)

OpenSearch

EFK ElasticSearch Fluentd Kibana

Logstash Grok for Log Parsing

Internet Performance Monitoring (IPM)

  • devops.com: The Fallacy of Continuous Integration, Delivery and Testing Whether your organization embraces CI/CD/CT already or is rethinking its approach to DevOps, this article should give you pause. Your job–perhaps as part of a larger team–is to catch performance issues and potential disruptions with your application before client impact is realized. Without IPM, only part of that job is being done.

Performance

List of Performance Analysis Tools

Thread Dumps. Debugging Java Applications

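#!/bin/sh
# Generate N thread dumps of the process PID with an INTERVAL between each dump.
if [ $# -ne 3 ]; then
   echo "Generates Java thread dumps using the jstack command."
   echo
   echo "usage: $0 process_id repetitions interval"
   exit 1
fi
PID=$1
N=$2
INTERVAL=$3
# POSIX while loop, so the script also runs where /bin/sh is not bash.
i=1
while [ "$i" -le "$N" ]
do
   d=$(date +%Y%m%d-%H%M%S)
   dump="threaddump-$PID-$d.txt"
   echo "$i of $N: $dump"
   # Write the dump, including lock information (-l), to a timestamped file.
   jstack -l "$PID" > "$dump"
   # Upload the dump to fastThread; replace <APIKEY> with your own API key.
   # The URL is quoted so the shell does not treat < and > as redirections.
   curl -X POST --data-binary "@./$dump" "https://fastthread.io/fastthread-api?apiKey=<APIKEY>" --header "Content-Type:text"
   sleep "$INTERVAL"
   i=$((i + 1))
done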
  • How to run this script from within the POD: ./script_thread_dump.sh 1 15 3, where:
    • “1”: PID of java process (“1” in containers running a single process, check with “ps ux” command).
    • “15”: 15 repetitions or thread dumps
    • “3”: interval of 3 seconds between each thread dump.
  • According to some references, only 3 thread dumps captured within a timeframe of 10 seconds are necessary (when troubleshooting a Java issue during a service degradation).
  • Sample thread dump analysis reports generated by fastThread:
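
Before uploading a dump, a quick local summary can already hint at a problem, for example many BLOCKED threads. The sketch below is one possible way to count threads per state. It relies on the `java.lang.Thread.State:` lines that jstack prints for each thread; the function name count_thread_states is hypothetical and not part of the script above.

```shell
#!/bin/sh
# count_thread_states: read a jstack thread dump on stdin and print one
# "STATE COUNT" line per thread state, sorted by state name.
# jstack prints a line such as
#    java.lang.Thread.State: BLOCKED (on object monitor)
# for each Java thread; the second field is the state name.
count_thread_states() {
    awk '/java\.lang\.Thread\.State:/ { states[$2]++ }
         END { for (s in states) printf "%s %d\n", s, states[s] }' | sort
}
```

Usage, assuming a dump file produced by the script above (the file name here is only an example): count_thread_states < threaddump-1-20240101-120000.txt. Comparing the counts across consecutive dumps shows whether threads stay stuck in the same state.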

Debugging Java Applications on OpenShift and Kubernetes

Distributed Tracing. OpenTelemetry and Jaeger

Microservice Observability with Distributed Tracing. OpenTelemetry.io

OpenTelemetry Operator

Jaeger UI

Zipkin UI

Jaeger VS OpenTelemetry. How Jaeger works with OpenTelemetry

Jaeger vs Zipkin

Grafana Tempo distributed tracing system

Application Performance Management (APM)

Elastic APM

Dynatrace APM

Message Queue Monitoring

Messaging Solution | Monitoring Solution | URL
ActiveMQ 5.8.0+ | Dynatrace | ref
ActiveMQ Artemis | Micrometer Collector + Prometheus | ref1, ref2
IBM MQ | IBM MQ Exporter for Prometheus | ref
Kafka | Dynatrace | ref1, ref2, ref3
Kafka | Prometheus JMX Exporter | ref1, ref2, ref3, ref4, ref5, ref6, ref7
Kafka | Kafka Exporter. Use JMX Exporter to export other Kafka’s metrics | ref
Kafka | Kafdrop – Kafka Web UI | ref
Kafka | ZooNavigator: Web-based ZooKeeper UI | ref
Kafka | CMAK (Cluster Manager for Apache Kafka, previously known as Kafka Manager) | ref
Kafka | Xinfra Monitor (renamed from Kafka Monitor, created by Linkedin) | ref
Kafka | Telegraf + InfluxDB | ref
Red Hat AMQ Broker (ActiveMQ Artemis) | Prometheus plugin for AMQ Broker | ref1, ref2, ref3
    To monitor the health and performance of your broker instances, you can use the Prometheus plugin for AMQ Broker to monitor and store broker runtime metrics. Prometheus is software built for monitoring large, scalable systems and storing historical runtime data over an extended time period. The AMQ Broker Prometheus plugin exports the broker runtime metrics to Prometheus format, enabling you to use Prometheus itself to visualize and run queries on the data.
    You can also use a graphical tool, such as Grafana, to configure more advanced visualizations and dashboards for the metrics that the Prometheus plugin collects.
    The metrics that the plugin exports to Prometheus format are listed below. A description of each metric is exported along with the metric itself.
Red Hat AMQ Streams (Kafka) | JMX, OpenTracing+Jaeger | ref1, ref2
    ZooKeeper, the Kafka broker, Kafka Connect, and the Kafka clients all expose management information using Java Management Extensions (JMX). Most management information is in the form of metrics that are useful for monitoring the condition and performance of your Kafka cluster. Like other Java applications, Kafka provides this management information through managed beans or MBeans.
    JMX works at the level of the JVM (Java Virtual Machine). To obtain management information, external tools can connect to the JVM that is running ZooKeeper, the Kafka broker, and so on. By default, only tools on the same machine and running as the same user as the JVM are able to connect.
    Distributed Tracing with Jaeger:
    - Kafka Producers, Kafka Consumers, and Kafka Streams applications (referred to as Kafka clients)
    - MirrorMaker and Kafka Connect
    - Kafka Bridge
Red Hat AMQ Streams Operator | AMQ Streams Operator (Prometheus & Jaeger), strimzi, jmxtrans | ref1, ref2, ref3 strimzi, ref4: jmxtrans, ref5: banzai operator
    How to monitor AMQ Streams Kafka, Zookeeper and Kafka Connect clusters using Prometheus to provide monitoring data for example Grafana dashboards.
    Support for distributed tracing in AMQ Streams, using Jaeger:
    - You instrument Kafka Producer, Consumer, and Streams API applications for distributed tracing using an OpenTracing client library. This involves adding instrumentation code to these clients, which monitors the execution of individual transactions in order to generate trace data.
    - Distributed tracing support is built in to the Kafka Connect, MirrorMaker, and Kafka Bridge components of AMQ Streams. To configure these components for distributed tracing, you configure and update the relevant custom resources.
Red Hat AMQ Broker Operator | Prometheus (recommended) or Jolokia REST to JMX | ref1, ref2, ref3, ref4, ref5
    To monitor runtime data for brokers in your deployment, use one of these approaches:
    - Section 9.1, “Monitoring broker runtime data using Prometheus”
    - Section 9.2, “Monitoring broker runtime data using JMX”
    In general, using Prometheus is the recommended approach. However, you might choose to use the Jolokia REST interface to JMX if a metric that you need to monitor is not exported by the Prometheus plugin. For more information about the broker runtime metrics that the Prometheus plugin exports, see Section 9.1.1, “Overview of Prometheus metrics”

Red Hat AMQ 7 Broker Monitoring solutions based on Prometheus and Grafana

This is a selection of monitoring solutions suitable for RH AMQ 7 Broker based on Prometheus and Grafana:

Environment | Collector/Exporter | Details/URL
RHEL | Prometheus Plugin for AMQ Broker | ref
RHEL | Prometheus JMX Exporter | Same solution applied to ActiveMQ Artemis
OpenShift 3 | Prometheus Plugin for AMQ Broker | Grafana Dashboard not available, ref1, ref2
OpenShift 4 | Prometheus Plugin for AMQ Broker | Check if Grafana Dashboard is automatically set up by Red Hat AMQ Operator
OpenShift 3 | Prometheus JMX Exporter | Grafana Dashboard not available, ref1, ref2

Serverless Monitoring

Distributed Tracing in Apache Beam

Krossboard Converged Kubernetes usage analytics

Instana APM

Monitoring Etcd

Zabbix

Other Tools

  • Netdata Netdata’s distributed, real-time monitoring Agent collects thousands of metrics from systems, hardware, containers, and applications with zero configuration.
  • PM2 is a production process manager for Node.js applications with a built-in load balancer. It allows you to keep applications alive forever, to reload them without downtime and to facilitate common system admin tasks.
  • Huginn Create agents that monitor and act on your behalf. Your agents are standing by!
  • OS Query SQL powered operating system instrumentation, monitoring, and analytics.
  • Glances Glances an Eye on your system. A top/htop alternative for GNU/Linux, BSD, Mac OS and Windows operating systems. It is written in Python and uses libraries to grab information from your system. It is based on an open architecture where developers can add new plugins or exports modules.
  • TDengine is an open-sourced big data platform under GNU AGPL v3.0, designed and optimized for the Internet of Things (IoT), Connected Cars, Industrial IoT, and IT Infrastructure and Application Monitoring.
  • stackpulse.com: Automated Kubernetes Pod Restarting Analysis with StackPulse
  • Checkly is the API & E2E monitoring platform for the modern stack: programmable, flexible and loving JavaScript.
  • network-king.net: IoT use in healthcare grows but has some pitfalls
  • Zebrium Monitoring detects problems, Zebrium finds root cause Resolve your software incidents 10x faster
  • louislam/uptime-kuma A fancy self-hosted monitoring tool. Uptime Kuma is an open source monitoring tool that can be used to monitor service uptime along with a few other stats such as ping status, average response time, and uptime.

Other Awesome Lists

Slides

Tweets
