Monitoring infrastructure with Sensu Go
In my last post I mentioned I wanted to add some better monitoring into my lab. I’ve been using monit for a while but was getting annoyed by constant alerts. I remember hearing of “fatigue filters” in Sensu so I wanted to give it a go.
Sensu lets you define everything in YAML or JSON, which is ideal because I want to keep everything in version control and deploy it using CI/CD.
The setup
Sensu has two components, the backend and the agent. We’ll need a backend for the agents to send information to. I chose to run it on my Kubernetes cluster. I won’t share the whole yaml but the key points are:
- The command to start the container is sensu-backend start --state-dir /var/lib/sensu --log-level debug
- Make /var/lib/sensu persistent somehow.
- The important ports are 8080, 3000 and 8081.
Once the backend is running (very easy), you can point an agent at it. I installed the agent on a couple of Ubuntu VMs. After adding the Sensu repo, a simple apt install sensu-go-agent is all it takes. Setting up the agent is just as easy as the backend. My config is really simple; this is it after removing all the comments:
subscriptions:
- linux
- ubuntu
backend-url:
- "ws://10.0.0.1:8081"
Subscriptions are really groups: they tell the backend which checks should be assigned to the agent. In this case the agent will pick up all linux and ubuntu checks.
Checks
Checks are what the agent runs and reports back to the backend. I have configured some basic checks for the CPU and memory. Here are the important parts of the CPU check yaml:
type: Check
api_version: core/v2
metadata:
  created_by: admin
  name: check-cpu
  namespace: default
spec:
  check_hooks:
  - critical:
    - process_info
  command: check-cpu-usage -w 80 -c 95
  env_vars: null
  executed: 0
  handlers:
  - telegram-inc-cpu
  runtime_assets:
  - sensu/check-cpu-usage
  subscriptions:
  - linux
There are a couple of interesting points here. Firstly, we can see the check is just a command that is executed by the agent. Most checks are Nagios-style checks. -w 80 means a warning incident will be raised if CPU is at 80% utilisation, and -c 95 means it will be raised as critical at 95%.
The subscription section means that this check will run on any system where the agent is subscribed to the linux group.
Check hooks are additional commands that run when the check fails. In this case I get a list of the top processes ordered by CPU usage.
Handlers do things with the data. In this case we point the check at a handler that sends me an alert on Telegram (if some parameters are met).
Finally, runtime assets are important: they include all the packages the check needs to execute. There are heaps of assets on the Sensu website.
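To make the Nagios convention concrete: a check reports its result through its exit code, where 0 is OK, 1 is warning and 2 is critical. The sketch below is not how check-cpu-usage is implemented. cpu_status is a hypothetical helper, it only handles whole-number percentages, and treating the thresholds as inclusive is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: map a CPU usage percentage to a Nagios-style exit code.
# Args: usage, warning threshold, critical threshold (all integers).
# Returns 0 = OK, 1 = WARNING, 2 = CRITICAL.
cpu_status() {
  usage="$1"
  warn="$2"
  crit="$3"
  # Check the critical threshold first so it wins over warning.
  if [ "$usage" -ge "$crit" ]; then
    echo "CRITICAL: CPU at ${usage}%"
    return 2
  elif [ "$usage" -ge "$warn" ]; then
    echo "WARNING: CPU at ${usage}%"
    return 1
  fi
  echo "OK: CPU at ${usage}%"
  return 0
}
```

Calling cpu_status 85 80 95 prints a warning line and returns 1, which is the status the agent would report back to the backend for a real check.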
Handlers
There’s no point in collecting data if we don’t do anything with it. Handlers do something with the data! This is the handler I use for most incidents:
type: Handler
api_version: core/v2
metadata:
  created_by: admin
  name: telegram-inc
  namespace: default
spec:
  command: sensu-telegram-handler --api-token API_TOKEN
    --chatid CHAT_ID
  env_vars: null
  filters: ["is_incident", "not_silenced", "fatigue_checker"]
  handlers: null
  runtime_assets:
  - Thor77/sensu-telegram-handler
  secrets: null
  timeout: 0
  type: pipe
Again, a handler command is just an executable. In this case it’s an alert to Telegram. Filters ensure that I only get a message for things I care about.
- is_incident ensures that I only get an alert when something fails.
- not_silenced means I won’t get alerts for things I suppress in the dashboard (false alerts maybe?).
- fatigue_checker is a custom filter that ensures I don’t get hammered by a constantly failing check.
Fatigue checker
My fatigue checker filter means I only get issues I care about:
type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_checker
  namespace: default
spec:
  action: allow
  expressions:
  - (event.check.status == 1 && event.check.occurrences % 120 == 0) || (event.check.status == 2 && event.check.occurrences % 60 == 0) || (event.is_resolution) || (event.check.occurrences == 1)
Breaking this down, an event will only make it through the filter if one of the following is true:
- It’s a warning and the occurrence count is a multiple of 120.
- It’s a critical issue and the occurrence count is a multiple of 60.
- Something is fixed.
- It’s the very first event.
How long 60 or 120 occurrences take in minutes depends on the check’s interval.
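To see how much noise the filter removes, the expression can be replayed outside Sensu. This is a rough sketch in plain shell, not something Sensu runs. passes_filter is a hypothetical function that mirrors the four clauses of the expression, and the run of 300 consecutive critical events is made up for illustration.

```shell
#!/bin/sh
# Hypothetical re-implementation of the fatigue_checker expression.
# Args: status (0 ok, 1 warning, 2 critical), occurrences, is_resolution (0 or 1).
# Returns 0 when the event would be allowed through to the handler.
passes_filter() {
  status="$1"
  occ="$2"
  resolved="$3"
  # Warning: every 120th occurrence.
  if [ "$status" -eq 1 ] && [ $((occ % 120)) -eq 0 ]; then return 0; fi
  # Critical: every 60th occurrence.
  if [ "$status" -eq 2 ] && [ $((occ % 60)) -eq 0 ]; then return 0; fi
  # Resolution events always pass.
  if [ "$resolved" -eq 1 ]; then return 0; fi
  # The first occurrence always passes.
  if [ "$occ" -eq 1 ]; then return 0; fi
  return 1
}

# Replay 300 consecutive critical events and count the alerts.
alerts=0
occ=1
while [ "$occ" -le 300 ]; do
  if passes_filter 2 "$occ" 0; then
    alerts=$((alerts + 1))
  fi
  occ=$((occ + 1))
done
echo "critical alerts sent: $alerts"
```

Out of 300 failing events, only occurrences 1, 60, 120, 180, 240 and 300 get through, so the script reports 6 alerts.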
Sensuctl
sensuctl is how we control Sensu and update config. It’s very similar to kubectl, which is nice. Configure it with sensuctl configure to point it to the backend. Once configured, we can create config easily using sensuctl create -f checks/check-cpu.yaml. They can also be updated in place using the same command.
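Because sensuctl create -f both creates and updates, applying a whole directory of definitions is just a loop. The sketch below is one possible way to do that, assuming sensuctl is already configured. apply_dir is a hypothetical helper, and the directory names in the usage comment are assumptions about the repo layout.

```shell
#!/bin/sh
# Hypothetical helper: apply every YAML file in a directory with sensuctl.
# Assumes sensuctl has already been configured against the backend.
apply_dir() {
  dir="$1"
  for file in "$dir"/*.yaml; do
    # Skip the unexpanded pattern when the directory has no YAML files.
    [ -e "$file" ] || continue
    echo "applying $file"
    # Stop at the first failure so a broken definition is noticed.
    sensuctl create -f "$file" || return 1
  done
}

# Example usage (directory names are assumptions):
# apply_dir checks && apply_dir handlers && apply_dir filters
```

Stopping at the first failed file keeps a bad definition from being silently skipped, which matters once this runs unattended.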
Final thoughts
I’m liking Sensu much better than monit. In fact I’ve removed monit from all my systems (except my mail server). All the configuration is versioned in git, and I’m hoping to automate the deployment with CI/CD soon.