
Integrating infrastructure monitoring alerts with Slack

Facundo Garbino
May 28, 2024

Introduction

Deployments to production environments require real-time monitoring to ensure that they run smoothly and do not affect the general stability of the systems and platforms. To that end, having an alerting layer at different levels is crucial so teams can react quickly when anomalous behavior is detected. The more complete the alerting system, the better.

Application-related alerts, for instance, measure whether the system meets basic functional requirements, ensuring that core parts work properly - so-called "health checks". Infrastructure alerts, on the other hand, can be configured to trigger when certain metrics reach specific thresholds (for instance, high CPU usage or low available storage space) or when a specific event occurs (for instance, a build process fails).
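As an illustration of the threshold-based kind, the sketch below shows how a CPU alarm for an ECS service could be defined with boto3. The cluster, service, and SNS topic names are placeholders for illustration, not values from the project described here.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the service's average CPU utilization stays above 80% for 10 minutes.
# Cluster, service, and SNS topic ARN below are illustrative placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="my-service-high-cpu",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},
        {"Name": "ServiceName", "Value": "my-service"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```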

In general, all cloud environments provide built-in metrics and mechanisms to configure and trigger these alerts. But what happens when we want to define alerts for specific situations that are not detected out of the box by the chosen cloud solution? In this article we explore how we can still make use of cloud infrastructure utilities and general events to implement specific alerts that are not provided directly. We focus on a real example we implemented for one of our customers, which uses AWS ECS.

Problem & Solution

Alerts are an excellent way for support teams to stay aware of important events in the system. However, we have to make sure that those alerts are shared through a communication channel that ensures all engineers on the team pay immediate attention to them. In practice, we've found that Slack channels are an excellent fit for this.

Having agreed on the strategy to share the alerts across the team, we are going to show a real alert example we've implemented in one of our projects. In this case, the team wanted to be notified when an ECS container (a Task) in a cluster stops running for any reason. Amazon CloudWatch provides an alarm system based on CPU and memory usage for ECS tasks, but it doesn't provide a clear way of notifying when a task stops. That event was of particular interest to us, since a container that stops unexpectedly means that something went wrong - for instance, an initialization or out-of-memory error. To capture this event, we've used Amazon EventBridge, a serverless event bus that ingests data from the apps and routes it to targets.

First, we created a rule with the following pattern, which matches when an ECS task stops:
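The original rule was configured through the AWS console; as a rough code equivalent, the boto3 sketch below creates a rule whose event pattern matches ECS tasks transitioning to the STOPPED state. The rule name and cluster ARN are illustrative placeholders.

```python
import json
import boto3

events = boto3.client("events")

# Event pattern matching ECS tasks that transition to the STOPPED state.
# Restricting by clusterArn is optional; the ARN here is a placeholder.
event_pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "lastStatus": ["STOPPED"],
        "clusterArn": ["arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster"],
    },
}

events.put_rule(
    Name="ecs-task-stopped",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)
```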

Then, we defined an SNS topic with an HTTPS subscription as a target.
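In code form, and keeping the same placeholder names as above, the target wiring could look roughly like this:

```python
import boto3

sns = boto3.client("sns")
events = boto3.client("events")

# Topic that EventBridge will publish stopped-task events to.
topic_arn = sns.create_topic(Name="ecs-task-stopped-alerts")["TopicArn"]

# HTTPS subscription pointing at the event-processing backend (placeholder URL).
# SNS sends a SubscriptionConfirmation request that the endpoint must confirm.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="https",
    Endpoint="https://alerts.example.com/sns/ecs-task-stopped",
)

# Attach the topic as a target of the rule created earlier.
# Note: the topic's access policy must also allow events.amazonaws.com to publish to it.
events.put_targets(
    Rule="ecs-task-stopped",
    Targets=[{"Id": "ecs-task-stopped-sns", "Arn": topic_arn}],
)
```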

When an ECS task stops, an EventBridge notification is sent via HTTPS to a dedicated backend application devoted to information and event processing, following the configuration outlined above. Depending on the reason the task stopped, a decision is made about whether to forward the notification to a Slack channel. When the task stops for something other than an expected reason (for instance, a deployment taking down old tasks as part of the process), a detailed notification is sent, explicitly stating the reason and how long the task had been active. In addition, a button is attached with a link to the ECS console page for the specific task, allowing a more in-depth review.
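The backend itself is specific to the customer's stack, but a minimal sketch of the idea, assuming a Flask endpoint and a Slack incoming webhook, could look like the following. The webhook URL, the stop-reason filtering, and the console link format are assumptions for illustration, not the exact logic used in the project.

```python
import json
from datetime import datetime, timezone

import requests
from flask import Flask, request

app = Flask(__name__)

# Placeholder: incoming webhook for the team's Slack alerts channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


@app.route("/sns/ecs-task-stopped", methods=["POST"])
def handle_sns():
    message = json.loads(request.data)

    # SNS first sends a SubscriptionConfirmation request that must be confirmed.
    if message.get("Type") == "SubscriptionConfirmation":
        requests.get(message["SubscribeURL"])
        return "", 200

    event = json.loads(message["Message"])  # the EventBridge event
    detail = event["detail"]                # ECS "Task State Change" payload

    # Ignore expected stops, e.g. old tasks drained by a deployment.
    # The exact filtering criteria are project-specific; this check is an assumption.
    if "deployment" in (detail.get("stoppedReason") or "").lower():
        return "", 200

    uptime = "unknown"
    started_at = detail.get("startedAt")  # may be absent if the task never started
    if started_at:
        started = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
        uptime = str(datetime.now(timezone.utc) - started)

    task_id = detail["taskArn"].split("/")[-1]
    cluster = detail["clusterArn"].split("/")[-1]
    # Console URL format is approximate and may vary by region/console version.
    console_url = f"https://console.aws.amazon.com/ecs/v2/clusters/{cluster}/tasks/{task_id}"

    requests.post(SLACK_WEBHOOK_URL, json={
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn",
             "text": f":rotating_light: ECS task *{task_id}* stopped.\n"
                     f"Reason: {detail.get('stoppedReason', 'unknown')}\n"
                     f"Uptime: {uptime}"}},
            {"type": "actions", "elements": [
                {"type": "button",
                 "text": {"type": "plain_text", "text": "View task in ECS"},
                 "url": console_url},
            ]},
        ]
    })
    return "", 200
```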

Impact on the processes

The impact of implementing this alerting system was very positive for the engineering team, especially for the support team members. Even with an auto-scaling system that ensures a new task is launched when another fails, it is helpful for the support team to spot patterns and fix the root cause so the failure does not happen again.

While engineering teams at Ensolvers follow strict processes written in runbooks for high-priority tasks, especially those that might compromise system stability, human errors can still occur. For instance, before a release the team might forget to set up a newly required environment configuration variable, which could prevent the new instance from starting successfully. Or some special context in the production environment might make a container's initialization fail. With this alert system in place, we get notified and can provide a fix as soon as possible, even though we also count on a monitoring team in charge of checking this. Furthermore, this setup proved very helpful for the QA environment: when the team is focused on production-related tasks, this alert provides visibility into the health of the QA infrastructure.

Conclusion

Automatic alerts are an essential part of monitoring production environments, where there are many things to keep track of. In this article, we showed how to set up alerts at the infrastructure level and how they can easily be integrated into Slack, complementing the out-of-the-box metrics provided by Amazon CloudWatch.
