
Alerting

Alerting means that a system automatically raises an alarm as soon as a key metric gets out of hand. Instead of someone staring at a dashboard every hour, you define thresholds and rules in advance - and the monitoring system reports automatically if anything deviates from them. By email, Slack, SMS, push notification or a call in the middle of the night, depending on how critical it is.

The core is a reversal of the way we work: the person does not go looking for the error, the error finds the person. This is vital for the survival of an online shop. A checkout that has been throwing errors for two hours costs money - and in two hours, depending on traffic, you can lose a four-figure sum in revenue without realising it. Alerting shortens the time between "something is broken" and "someone knows about it" from hours to seconds.

Monitoring, alerting, observability - what's the difference?

The three terms are often lumped together, but they mean different things. Monitoring is the continuous collection and display of measured values - server load, response times, error rates. It shows you the status. Alerting is the layer on top that derives an action from these measured values: If value X exceeds threshold Y, notify Z. Observability goes further and asks whether you can even deduce why something happened from the collected data.

In short: Monitoring sees, alerting calls, observability explains. Without alerting, monitoring is a dashboard that nobody looks at when it matters.

What triggers an alert

Alerts arise from conditions defined on metrics or logs. The typical triggers in the e-commerce environment:

  • Threshold alerts: A metric exceeds or falls below a fixed value. Example: CPU utilisation above 90 per cent, or conversion rate below half of the daily average.
  • Error rate alerts: The proportion of failed requests (HTTP 5xx) rises above a defined limit.
  • Availability alerts: A health check or a synthetic probe no longer reaches the shop - the classic "site is down" case.
  • Anomaly alerts: Instead of fixed thresholds, the system recognises a deviation from the learned normal behaviour. Helpful for metrics with a strong daily rhythm, where a fixed value does not fit.
  • Heartbeat/dead man's switch alerts: The alarm sounds if an expected signal fails to arrive - such as a nightly cron job that no longer reports that it has run.
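Three of the trigger types above can be sketched in a few lines. This is a minimal illustration, not the logic of any particular monitoring tool; the function names and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def threshold_breached(value: float, limit: float, above: bool = True) -> bool:
    """Threshold alert: the value crosses a fixed limit (above or below it)."""
    return value > limit if above else value < limit


def error_rate(status_codes: list[int]) -> float:
    """Error rate alert input: share of requests that ended in HTTP 5xx."""
    if not status_codes:
        return 0.0
    failed = sum(1 for code in status_codes if 500 <= code <= 599)
    return failed / len(status_codes)


def heartbeat_missed(last_seen: datetime, now: datetime, max_silence: timedelta) -> bool:
    """Dead man's switch: fire when the expected signal is overdue."""
    return now - last_seen > max_silence
```

A nightly job that last reported 26 hours ago would trip `heartbeat_missed` with a `max_silence` of 25 hours, while a CPU reading of 93 would trip `threshold_breached` against a limit of 90.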

How a good alert is structured

An alert is more than just "something is red". It only becomes useful through context. A well-designed alert answers three questions immediately: What has happened, how bad is it, and what should the recipient do now? A notification with the text "DiskUsageWarning on prod-db-01" with no severity level and no instructions for action only generates stress, not a reaction.
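One way to enforce the three questions is to refuse to send an alert that leaves one of them open. The following is a sketch under that assumption; the `Alert` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    what: str      # what has happened
    severity: str  # how bad it is, e.g. "P1" to "P4"
    action: str    # what the recipient should do now


def format_alert(alert: Alert) -> str:
    """Render the alert text; reject alerts that leave a question unanswered."""
    for field_name in ("what", "severity", "action"):
        if not getattr(alert, field_name).strip():
            raise ValueError(f"alert is missing '{field_name}'")
    return f"[{alert.severity}] {alert.what} | Next step: {alert.action}"
```

With this check, the bare "DiskUsageWarning on prod-db-01" message from above would be rejected until someone adds a severity and a next step.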

This is why severity levels have been established. A pragmatic grading:

Severity | Meaning | Reaction | Channel
Critical (P1) | Shop down, checkout broken, data loss imminent | Immediately, even at night | Call / PagerDuty
High (P2) | Partial failure, performance severely degraded | Within working hours, promptly | SMS / Slack mention
Warning (P3) | Trend is going in the wrong direction | Check in day-to-day business | Slack channel
Info (P4) | Pure logging, no need for action | None | Dashboard / ticket

The most common sin: Declaring everything as critical. Then the phone rings at night because a log file directory is 80 per cent full. The result is alert fatigue - and that is more dangerous than no alerting at all.
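The grading can be written down as a small routing table so that the channel follows from the severity instead of from whoever wrote the rule. This is a sketch; the channel identifiers are placeholders, not names from any real tool.

```python
# Hypothetical routing table derived from the severity grading above.
ROUTING = {
    "P1": {"channel": "phone_call", "may_wake_someone": True},
    "P2": {"channel": "sms_or_slack_mention", "may_wake_someone": False},
    "P3": {"channel": "slack_channel", "may_wake_someone": False},
    "P4": {"channel": "dashboard_or_ticket", "may_wake_someone": False},
}


def route(severity: str) -> str:
    """Return the notification channel for a severity level."""
    return ROUTING[severity]["channel"]


def may_wake_someone(severity: str) -> bool:
    """Only the highest level is allowed to ring a phone at night."""
    return ROUTING[severity]["may_wake_someone"]
```

A log directory at 80 per cent would be filed as P3 here and land in a Slack channel rather than on someone's phone at night.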

Alert fatigue: When too many alerts become a real risk

If a team receives dozens of alerts every day, most of which are harmless or disappear on their own, they become numb. At some point, notifications are reflexively clicked away - and then the one alert that really mattered is lost. In practice, this phenomenon is the main reason why alerting fails, not a lack of technology.

Countermeasure: Calibrate thresholds honestly instead of setting them low out of fear. Group related alerts together so that one failure does not generate fifty individual alerts. And prune consistently: every alert that has not led to an action in recent months should be scrutinised. In its much-quoted site reliability engineering book, Google formulates the guideline that every alert that wakes a person up must be actionable. Freely available to read: Google SRE Book - Monitoring Distributed Systems.
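Grouping can be as simple as collecting alerts by a shared label before notifying. The sketch below assumes each alert is a dictionary with a `service` key; that structure is an assumption made for the example.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], key: str = "service") -> dict[str, list[dict]]:
    """Collect alerts that share the same label value into one group."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for alert in alerts:
        groups[alert[key]].append(alert)
    return dict(groups)


def notifications(alerts: list[dict], key: str = "service") -> list[str]:
    """Emit one summary line per group instead of one message per alert."""
    return [
        f"{name}: {len(items)} alert(s)"
        for name, items in group_alerts(alerts, key).items()
    ]
```

Fifty alerts from one broken checkout then arrive as a single line, "checkout: 50 alert(s)", rather than as fifty separate messages.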

A concrete practical example

A Shopware shop processes an average of around 500 orders per day, with a clear daily pattern: little in the morning, peak in the evening. The operator sets up the following alerting rule: "If the number of successfully completed orders over a period of 30 minutes is more than 70 per cent below the expected value for this time of day, trigger a high alert in the Operations Slack channel."

On a Tuesday evening at 8:15 pm - actually peak time - the order rate suddenly drops to almost zero. The anomaly alert fires after eight minutes. The message lands in the channel with context: expected orders 45, actual 3, link to dashboard, last deploy 22 minutes ago. The team checks immediately and finds the cause: a deployment has corrupted a payment method configuration, and the checkout aborts at the payment step.

Without alerting, someone might not have noticed this until the next morning - more than half of the evening's revenue would have been lost. With alerting, the error was rectified after 25 minutes. The calculation is simple: the few hours spent setting up the rule paid off on just this one evening.
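The rule from this example boils down to one comparison. The sketch below assumes the expected value for the time of day is already known (for instance from historical data); how it is computed is not covered here, and the function name is hypothetical.

```python
def order_drop_alert(expected: float, actual: int, max_drop: float = 0.70) -> bool:
    """True if actual orders are more than max_drop below the expected value."""
    if expected <= 0:
        return False  # nothing expected, so nothing can be missing
    drop = (expected - actual) / expected
    return drop > max_drop
```

With the figures from the incident, 45 expected and 3 actual orders, the drop is 42 / 45, roughly 93 per cent, so the rule fires. At 20 actual orders the drop would be about 56 per cent and the rule would stay quiet.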

Setting up alerting properly - the basic rules

There is no gold standard that fits every shop, but there are principles that have proven themselves:

  1. Alert on symptoms, not causes. "Customers can't pay" is a useful alert. "CPU at 85 per cent" is often not - as long as the shop is running normally, a high CPU load is just a number for now.
  2. Every alert needs an addressee and an action. If nobody is responsible or nobody can do anything, it is not an alert, but noise.
  3. Define escalation chains. If the first recipient does not respond within X minutes, the alert is sent to the next one. That way nothing falls through the cracks because someone is sitting in the cinema.
  4. Version and review alerts. Treat alert rules like code. Ask after every incident: Would an alert have caught this sooner? Did an alert fire unnecessarily?
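The escalation chain from rule 3 can be sketched as a lookup by elapsed time. The recipient names and the fixed step length are assumptions for illustration; real tools usually allow a different delay per level.

```python
def current_recipient(chain: list[str], minutes_unacknowledged: float, step_minutes: float) -> str:
    """Return who should be notified once the alert has gone unacknowledged this long."""
    if not chain:
        raise ValueError("escalation chain must not be empty")
    level = int(max(minutes_unacknowledged, 0) // step_minutes)
    # The last entry in the chain stays responsible once the chain is exhausted.
    return chain[min(level, len(chain) - 1)]
```

With a chain of `["on-call", "backup", "team-lead"]` and ten-minute steps, an alert that has been open for twelve minutes goes to the backup, and after 45 minutes it stays with the team lead.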

Tools at a glance

The tool landscape is broad. Prometheus with Alertmanager is the de facto standard for metrics-based alerting in the open source camp. Grafana provides the visualisation and can issue alerts itself. PagerDuty or Opsgenie have established themselves for the notification and escalation side. In the cloud environment, AWS CloudWatch, Google Cloud Monitoring and Azure Monitor provide their own alerting engines. For pure availability checks, lean uptime services that ping your shop from outside are often sufficient.

Which tool you choose is of secondary importance. What matters is that someone maintains the rules and that the alerts are actually received by an awake person.

Setting thresholds sensibly

The most difficult question in alerting is not "whether", but "at what point". If you set the threshold too low, the alert fires constantly and turns into noise. If you set it too high, you won't notice the problem until it hurts. There is no universally correct value - it results from the normal behaviour of your system and from the question of when a person really needs to intervene.

A tried and tested approach: First observe over several weeks without raising the alarm. Learn how a metric behaves in normal operation - with its daily, weekly and seasonal fluctuations. Only then do you set a threshold that does not hit the normal ups and downs, but catches real outliers. Static thresholds work well for metrics with a clear upper limit (storage space, error rate). For metrics with a strong rhythm, such as the order rate, dynamic or anomaly-based thresholds work better.
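One common way to turn an observation period into a static threshold is the mean plus a multiple of the standard deviation. This is a sketch of that one heuristic, chosen here as an assumption; it is not the only option and fits poorly for metrics with a strong daily rhythm.

```python
import statistics


def baseline_threshold(observations: list[float], k: float = 3.0) -> float:
    """Derive an upper threshold from values observed during normal operation.

    Uses mean + k * sample standard deviation (an assumed heuristic).
    """
    if len(observations) < 2:
        raise ValueError("need at least two observations")
    return statistics.mean(observations) + k * statistics.stdev(observations)
```

For response times of 100, 110, 90, 105 and 95 milliseconds this yields a threshold of roughly 124 milliseconds, which sits above the normal ups and downs but would catch a jump to 300.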

A second lever against false alarms is duration. Instead of alerting immediately the first time the threshold is exceeded, wait to see if the condition persists over a period of time. A single peak of two seconds is usually harmless; an increased error rate that lasts for five minutes is not. This "for" condition filters out the majority of annoying false alarms.
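The duration condition can be sketched as "the last N samples all breach the limit". The function below assumes evenly spaced samples, so that N samples correspond to a fixed time span; the name is hypothetical.

```python
def sustained_breach(samples: list[float], limit: float, min_consecutive: int) -> bool:
    """True only if the most recent min_consecutive samples all exceed the limit."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > limit for value in samples[-min_consecutive:])
```

With one sample per minute and `min_consecutive=5`, a single spike is ignored, while an error rate that stays above the limit for five minutes triggers the alert.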

Alert on business metrics, not just technology

Most alerting setups focus on infrastructure: servers, databases, storage space, network. This is necessary, but not sufficient. After all, a shop can run flawlessly from a technical perspective - all servers green, all databases responding - and still not make any money because there is a bug in the order process that no server metric is showing.

That's why business metrics belong in alerting. The most effective signals for an online shop are often the most obvious:

  • Orders per period: A sudden drop is the most reliable early warning signal for a broken checkout.
  • Conversion rate: If it drops without a drop in traffic, something is wrong in the funnel.
  • Payment error rate: If the proportion of failed payments increases, a payment provider or faulty configuration is often behind it.
  • Cart cancellations in the last step: A sharp increase indicates an error exactly where it is most expensive.

The charm of these metrics: they measure the effect, not the cause. You don't need to know which of the hundred technical things can break - you can tell that something is broken from the fact that customers stop buying. This is robust alerting because it also catches errors that nobody has thought of.
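The conversion rate signal from the list above, a drop without a matching drop in traffic, can be sketched as follows. The tolerance and drop values are illustrative assumptions, not recommendations.

```python
def funnel_suspect(
    sessions_now: int,
    sessions_baseline: int,
    orders_now: int,
    orders_baseline: int,
    traffic_tolerance: float = 0.2,
    min_conversion_drop: float = 0.5,
) -> bool:
    """True if conversion fell sharply while traffic stayed roughly level."""
    if sessions_now <= 0 or sessions_baseline <= 0:
        return False  # no traffic data, leave this case to availability alerts
    traffic_stable = sessions_now >= sessions_baseline * (1 - traffic_tolerance)
    conversion_now = orders_now / sessions_now
    conversion_baseline = orders_baseline / sessions_baseline
    conversion_dropped = conversion_now < conversion_baseline * (1 - min_conversion_drop)
    return traffic_stable and conversion_dropped
```

If 1,000 sessions normally produce 30 orders and now produce 5, traffic is unchanged but conversion has fallen from 3 per cent to 0.5 per cent, so the check points at the funnel rather than at marketing.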

Runbooks: What to do when the alert fires

An alert without a plan is half the job. Mature teams attach a runbook to every important alert - a concise guide on what to check when the alert fires and which initial steps will help. Ideally, the alert links directly to this runbook. Then the person who is woken up at three o'clock in the morning doesn't have to think about where to start: first check this dashboard, then restart that service, escalate to this person if in doubt.

Runbooks have a pleasant side effect: they make knowledge independent of individual heads. If the one colleague who "always knows what's wrong" is unavailable, the team would otherwise be left in the dark. A well-maintained runbook distributes this knowledge.

Typical errors

We see three patterns again and again. Firstly, alerting is set up and then never touched again - until half the workforce has muted the notifications. Secondly, alerts are only sent for technical metrics (server, database), never for business metrics (orders, sales). You will notice a broken payment button much more reliably by the drop in orders than by the CPU load. Thirdly, no silent phases (maintenance windows) are defined for planned work, so that every deployment triggers an alert storm and the team is conditioned to ignore alerts.
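The third pattern has a simple remedy: check for an active maintenance window before notifying anyone. This is a minimal sketch assuming windows are stored as start and end timestamps; the function name is hypothetical.

```python
from datetime import datetime


def is_silenced(now: datetime, windows: list[tuple[datetime, datetime]]) -> bool:
    """True if the current time falls inside any planned maintenance window."""
    return any(start <= now < end for start, end in windows)
```

An alert raised at 02:10 during a window from 02:00 to 02:30 is suppressed; the same alert at 02:30 goes out again because the end of the window is exclusive.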

Good alerting is not a product that you buy and tick off. It is a discipline: few, sharp, action-guiding signals instead of many nervous ones. When an alert fires, the normal reaction should be "oh, I have to look at that" - not "there goes that thing again, I'll ignore it".

    Further reading