I'm curious to know how you all measure and report SLA metrics to customers. We run an SMS gateway for critical services (2FA, OTP and the like) and we measure our SLA metrics with a bunch of scripts in Python and bash. Reporting to customers is just sending them an email with an Excel attachment containing the computed SLA for a given period (usually monthly).
Are there any frameworks or tools you use to measure and report SLA performance to customers? Some kind of SLA dashboard.
Note that we use Site24x7 for monitoring our services. They do have some SLA reports and dashboards, but I am thinking of dedicated tools just to ingest metrics and compute SLAs.
Any insights you can provide will be greatly appreciated!
I would consider 1 of 3 approaches (well, 3 of 3 approaches):
1. Use an external service that measures your website's uptime. You should have this anyway, as a backup alerting system in case your own monitoring infrastructure fails.
2. Use a time series database (like Prometheus or InfluxDB), preferably fed with data from as close to your edge as possible, and use Grafana or something like it to make graphs. (Vulnerable to reporting/collection failures.)
3. Ingest events into a data warehouse via a Kafka/SQS/durable-queue-like system and then write queries that output reports. (Most accurate, most effort.)
For a normal website without anything fancy, taking the logs from nginx (or whatever your load balancer/SSL termination is), ingesting them into a database, and then cronning a Python script that runs an SLA query and makes it all pretty for an e-mail seems like a fine way to calculate an SLA.
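To make that concrete, here is a minimal sketch of the "SLA query" step. It assumes the logs have already been loaded into a table; the `requests` table, its `ts` and `status` columns, and the rule that only 5xx responses count against the SLA are all assumptions for illustration, not anything your contract necessarily says. SQLite is used only so the example runs on its own.

```python
import sqlite3


def compute_availability(conn, start, end):
    """Fraction of requests in [start, end) that did not end in a 5xx.

    Assumes a hypothetical `requests` table with ISO 8601 `ts` strings
    and integer HTTP `status` codes.
    """
    total, failed = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END), 0)
        FROM requests
        WHERE ts >= ? AND ts < ?
        """,
        (start, end),
    ).fetchone()
    if total == 0:
        return None  # no traffic in the period, so availability is undefined
    return (total - failed) / total


# Made-up sample data standing in for ingested load balancer logs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (ts TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        ("2024-01-01T00:00:00", 200),
        ("2024-01-10T12:00:00", 502),  # server error, counts against the SLA
        ("2024-01-15T08:30:00", 200),
        ("2024-01-20T23:59:59", 404),  # client error, assumed not to count
        ("2024-02-01T00:00:00", 500),  # outside the reporting period
    ],
)

availability = compute_availability(
    conn, "2024-01-01T00:00:00", "2024-02-01T00:00:00"
)
print(f"January availability: {availability:.2%}")
```

The cron job would then just format that number into the e-mail or spreadsheet. What counts as a failed request is the part worth agreeing on with customers up front.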
Option 2 is standard industry practice. It is good enough for paging on, but probably not good enough for a legal relationship. Option 3 is probably as good as you can get for a legal agreement.
Of course, what's important to consider is that a customer might be able to measure the SLA themselves, and you should probably have a good answer for any differences.
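One common source of such differences is that the two sides count different things. As a rough sketch with entirely made-up numbers: suppose an external probe checks once an hour, and the single failing hour happens to be the busiest one, with clients retrying. A time-based figure and a request-based figure then diverge a lot for the same incident.

```python
# Hypothetical day, one tuple per hour:
# (probe_passed, requests_ok, requests_failed). All values are invented.
hours = [(True, 100, 0)] * 23 + [(False, 0, 1000)]

# Time-based view: share of hourly probes that passed.
time_based = sum(1 for passed, _, _ in hours if passed) / len(hours)

# Request-based view: share of individual requests that succeeded.
ok = sum(h[1] for h in hours)
failed = sum(h[2] for h in hours)
request_based = ok / (ok + failed)

print(f"time-based availability:    {time_based:.2%}")
print(f"request-based availability: {request_based:.2%}")
```

Neither number is wrong; they answer different questions, which is why it helps to state in the agreement which one the SLA refers to.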