Monitoring infrastructure with Sensu Go

In my last post I mentioned I wanted to add some better monitoring into my lab. I’ve been using monit for a while but was getting annoyed by constant alerts. I remember hearing of “fatigue filters” in Sensu so I wanted to give it a go.

Sensu allows you to define everything in YAML or JSON, which is ideal given I want to put everything in version control and deploy it using CI/CD.

The setup

Sensu has two components, the backend and the agent. We’ll need a backend for the agents to send information to. I chose to run it on my Kubernetes cluster. I won’t share the whole YAML, but the key points are:

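The command to start the container is sensu-backend start --state-dir /var/lib/sensu --log-level debug.
Make /var/lib/sensu persistent somehow, otherwise the backend loses its state whenever the pod restarts.
The important ports are 3000 (web UI), 8080 (API) and 8081 (agent WebSocket connections).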
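
A minimal sketch of how those three points might look in a manifest. The resource names, the sensu/sensu:latest image tag and the 1Gi volume size are assumptions for illustration, not the actual manifest behind this setup, and a Service to expose the ports is left out:

```yaml
# Sketch only: names, image tag and storage size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sensu-state
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sensu-backend
spec:
  replicas: 1
  strategy:
    type: Recreate          # avoid two pods sharing the state volume during a rollout
  selector:
    matchLabels:
      app: sensu-backend
  template:
    metadata:
      labels:
        app: sensu-backend
    spec:
      containers:
        - name: sensu-backend
          image: sensu/sensu:latest
          # Passed as args, mirroring how the command is given to docker run
          args:
            - sensu-backend
            - start
            - --state-dir
            - /var/lib/sensu
            - --log-level
            - debug
          ports:
            - name: web-ui
              containerPort: 3000
            - name: api
              containerPort: 8080
            - name: agent-ws
              containerPort: 8081
          volumeMounts:
            - name: state
              mountPath: /var/lib/sensu   # must match --state-dir
      volumes:
        - name: state
          persistentVolumeClaim:
            claimName: sensu-state
```

Agents only need to reach port 8081; sensuctl talks to 8080 and the dashboard lives on 3000.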

Once the backend is running (very easy), you can point an agent at it. I installed the agent on a couple of Ubuntu VMs. After adding the Sensu repo, a simple apt install sensu-go-agent does the job. Setting up the agent is just as easy as the backend. My config is really simple; this is what’s left after removing all the comments:

subscriptions:
  - linux
  - ubuntu
backend-url:
  - "ws://10.0.0.1:8081"

Subscriptions are really groups: they tell the backend which checks should be assigned to the agent. In this case the agent will pick up all linux and ubuntu checks.

Checks

Checks are what the agent runs and reports back to the backend. I have configured some basic checks for CPU and memory. Here are the important parts of the CPU check YAML:

type: Check
api_version: core/v2
metadata:
  created_by: admin
  name: check-cpu
  namespace: default
spec:
  check_hooks: 
  - critical:
    - process_info
  command: check-cpu-usage -w 80 -c 95
  env_vars: null
  executed: 0
  handlers:
  - telegram-inc-cpu
  runtime_assets:
  - sensu/check-cpu-usage
  subscriptions:
  - linux

There are a couple of interesting points here. Firstly, we can see the check is just a command that is executed by the agent. Most checks are Nagios-style checks, where the exit code signals the state (0 is OK, 1 is warning, 2 is critical). -w 80 means a warning incident will be raised once CPU utilisation reaches 80%. -c 95 means it will be raised as critical at 95%.

The subscriptions section means that this check will run on any system where the agent is subscribed to the linux group.

Check hooks are additional commands that run when the check fails. In this case the process_info hook runs on a critical result and gives me a list of the top processes ordered by CPU usage.
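
The hook definition itself isn’t shown here, so the following is only a sketch of what a process_info hook could look like. The ps command and the 10 second timeout are assumptions chosen to match the description (top processes by CPU), not the real hook:

```yaml
type: HookConfig
api_version: core/v2
metadata:
  name: process_info        # must match the name listed under check_hooks
  namespace: default
spec:
  # Assumed command: ten busiest processes, sorted by CPU (Linux procps syntax)
  command: 'ps aux --sort=-%cpu | head -n 10'
  timeout: 10
  stdin: false
```

The hook output is attached to the event, so it can show up alongside the check result in the dashboard or in a handler’s message.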

Handlers do things with the data. In this case we point the check at a handler that sends me an alert on Telegram (if some parameters are met).

Finally, runtime assets are important: they include all the packages the check needs to execute. There are heaps of assets on the Sensu website.
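
The memory check isn’t shown, but it would follow the same shape. This is a hedged sketch, assuming the sensu/check-memory-usage asset, a 60 second interval and 80/90% thresholds, none of which come from the original setup. It uses type: CheckConfig, which is the type name the Sensu documentation uses for check definitions:

```yaml
type: CheckConfig
api_version: core/v2
metadata:
  name: check-memory
  namespace: default
spec:
  # Assumed thresholds: warn at 80% memory used, critical at 90%
  command: check-memory-usage -w 80 -c 90
  interval: 60              # assumed: run once a minute
  publish: true
  handlers:
  - telegram-inc
  runtime_assets:
  - sensu/check-memory-usage
  subscriptions:
  - linux
```

The asset named under runtime_assets has to be registered with the backend before the agent can download it.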

Handlers

There’s no point in collecting data if we don’t do anything with it. Handlers do something with the data! This is the handler I use for most incidents:

type: Handler
api_version: core/v2
metadata:
  created_by: admin
  name: telegram-inc
  namespace: default
spec:
  command: sensu-telegram-handler --api-token API_TOKEN
    --chatid CHAT_ID
  env_vars: null
  filters: ["is_incident", "not_silenced", "fatigue_checker"] 
  handlers: null
  runtime_assets:
  - Thor77/sensu-telegram-handler
  secrets: null
  timeout: 0
  type: pipe

Again, a handler command is just an executable. In this case it’s an alert to Telegram. Filters ensure that I only get a message for things I care about.

is_incident ensures I only get an alert when something fails or recovers.
not_silenced means I won’t get alerts for things I’ve silenced in the dashboard (false alerts maybe?).
fatigue_checker is a custom filter that ensures I don’t get hammered by a constantly failing check.

Fatigue checker

My fatigue checker filter means I only get issues I care about:

type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_checker
  namespace: default
spec:
  action: allow
  expressions:
  - (event.check.status == 1 && event.check.occurrences % 120 == 0) || (event.check.status == 2 && event.check.occurrences % 60 == 0) || (event.is_resolution) || (event.check.occurrences == 1)

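Breaking this down, an event will only make it through the filter if at least one of the following is true:

It’s a warning and the occurrence count is a multiple of 120, so a persistent warning alerts again on every 120th consecutive run.
It’s a critical issue and the occurrence count is a multiple of 60, so a persistent critical alerts twice as often as a warning.
Something is fixed.
It’s the very first event.

How long those gaps are in wall-clock time depends on the check’s interval: the gap is the occurrence count multiplied by the interval.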
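
The expression counts check runs rather than seconds, so the real time between repeat alerts changes whenever a check’s interval changes. A possible variant, sketched here under assumptions, divides a target period by event.check.interval so the repeat period is stated in seconds. The filter name and the 240 and 60 second periods are illustrative values, not part of the original config:

```yaml
type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_checker_timed   # hypothetical name
  namespace: default
spec:
  action: allow
  expressions:
  # Warning: repeat roughly every 240 s. Critical: repeat roughly every 60 s.
  # Resolutions and first occurrences always pass.
  - (event.check.status == 1 && event.check.occurrences % (240 / event.check.interval) == 0) || (event.check.status == 2 && event.check.occurrences % (60 / event.check.interval) == 0) || event.is_resolution || event.check.occurrences == 1
```

This only behaves as intended when the interval divides the period evenly and is no longer than it. With a 120 second interval, for example, 60 / 120 is 0.5 and every occurrence would pass the critical branch.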

Sensuctl

sensuctl is how we control Sensu and update config. It’s very similar to kubectl which is nice. Configure it with sensuctl configure to point it to the backend. Once configured we can create config easily using sensuctl create -f checks/check-cpu.yaml. They can also be updated in place using the same command.

Final thoughts

I’m liking Sensu much better than monit. In fact I’ve removed monit from all my systems (except my mail server). All the configuration is versioned in Git, and I’m hoping to automate the deployment with CI/CD soon.

2022-05-01

https://kainem.com/posts/monitoring-infrastructure-with-sensu-go/ Kaine M