# Alert Best Practices

> Reduce alert fatigue and improve response times with effective alerting strategies

Effective alerting is about balance: too few alerts and you miss issues; too many and you ignore them all. This guide helps you build an alerting strategy that keeps you informed without overwhelming your team.

## The Alert Fatigue Problem

Alert fatigue happens when teams receive too many notifications:

<img src="https://mintcdn.com/anomalyarmor/pPIiSU0b3Ixsp9az/images/diagrams/alert-fatigue-spiral-light.svg?fit=max&auto=format&n=pPIiSU0b3Ixsp9az&q=85&s=6813faf687f8a533f62881cfdfda6588" alt="Alert fatigue spiral showing the problem of too many alerts" className="block dark:hidden" width="700" height="500" data-path="images/diagrams/alert-fatigue-spiral-light.svg" />

<img src="https://mintcdn.com/anomalyarmor/pPIiSU0b3Ixsp9az/images/diagrams/alert-fatigue-spiral-dark.svg?fit=max&auto=format&n=pPIiSU0b3Ixsp9az&q=85&s=18fb921cc8178f950a68c37e29e0ec74" alt="Alert fatigue spiral showing the problem of too many alerts" className="hidden dark:block" width="700" height="500" data-path="images/diagrams/alert-fatigue-spiral-dark.svg" />

**The goal**: Every alert should be actionable and worth investigating.

## Core Principles

### 1. Start Narrow, Expand Carefully

Don't monitor everything at once:

1. **Week 1**: Monitor 5 critical production tables
2. **Week 2**: Add freshness monitoring to those tables
3. **Week 3**: Expand to 10 more important tables
4. **Week 4**: Review alert history, tune thresholds
5. Continue expanding gradually

### 2. Every Alert Should Be Actionable

Before creating an alert, ask:

* What action should someone take when this fires?
* Is immediate action required, or can it wait?
* Who is the right person to respond?

If you can't answer these questions, the alert may not be useful.

### 3. Match Urgency to Destination

| Urgency        | Destination | When to Use                  |
| -------------- | ----------- | ---------------------------- |
| **Immediate**  | PagerDuty   | On-call response needed now  |
| **Soon**       | Slack       | Team should see within hours |
| **Eventually** | Email       | Can be reviewed daily/weekly |

## Event-Based Routing

Route different event types based on impact severity:

<img src="https://mintcdn.com/anomalyarmor/CZXBGa_D1aE9spAI/images/diagrams/event-based-routing-light.svg?fit=max&auto=format&n=CZXBGa_D1aE9spAI&q=85&s=54de0d2e3538c599c67903f5771caead" alt="Event-based routing showing different alert types going to appropriate channels" className="block dark:hidden" width="800" height="350" data-path="images/diagrams/event-based-routing-light.svg" />

<img src="https://mintcdn.com/anomalyarmor/CZXBGa_D1aE9spAI/images/diagrams/event-based-routing-dark.svg?fit=max&auto=format&n=CZXBGa_D1aE9spAI&q=85&s=ddd22a76066b93dc1d9ee3cc68e37eae" alt="Event-based routing showing different alert types going to appropriate channels" className="hidden dark:block" width="800" height="350" data-path="images/diagrams/event-based-routing-dark.svg" />

### Recommended Setup

| Alert Type                  | Event               | Trigger scope                                                 | Destination         |
| --------------------------- | ------------------- | ------------------------------------------------------------- | ------------------- |
| Production breaking changes | Schema Change       | Breaking only                                                 | PagerDuty + Slack   |
| Production additive changes | Schema Change       | Non-breaking only                                             | Slack (low urgency) |
| Gold-table change freeze    | Schema Change       | Specific types (`COLUMN_REMOVED`, `PRIMARY_KEY_REMOVED`, ...) | PagerDuty           |
| Freshness violations        | Freshness Violation | SLA breached                                                  | Slack               |
| Discovery failures          | Discovery Failed    | Any failure                                                   | Slack + Email       |
| Dev/staging changes         | Schema Change       | Breaking only                                                 | Email               |
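
The routing table above can be read as a small decision function. The following is a minimal sketch, not AnomalyArmor's API: the `Event` fields, the `route` function, and the lowercase destination names are hypothetical and exist only to make the logic concrete. The gold-table change freeze row is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical event record; field names are illustrative only.
@dataclass(frozen=True)
class Event:
    kind: str          # "schema_change", "freshness_violation", "discovery_failed"
    environment: str   # "production", "staging", "development"
    breaking: bool = False


def route(event: Event) -> list[str]:
    """Return destination names for an event, following the table above."""
    if event.kind == "schema_change":
        if event.environment == "production":
            # Breaking changes page on-call; additive changes stay in Slack.
            return ["pagerduty", "slack"] if event.breaking else ["slack"]
        # Dev/staging: only breaking changes are worth an email.
        return ["email"] if event.breaking else []
    if event.kind == "freshness_violation":
        return ["slack"]
    if event.kind == "discovery_failed":
        return ["slack", "email"]
    return []


print(route(Event("schema_change", "production", breaking=True)))
```

Writing the rules down this way makes gaps easy to spot: any event that falls through to the final `return []` is one that nobody will be notified about.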

## Environment Separation

Monitor different environments differently:

### Production

**Rules:**

* All schema changes → Slack; breaking changes also → PagerDuty
* All freshness violations → Slack
* Discovery failures → Slack + Email

**Schedule:** Hourly discovery | **Threshold:** Strict SLAs

### Staging

**Rules:**

* Breaking changes only → Slack
* Freshness (critical tables only) → Slack

**Schedule:** Every 6 hours | **Threshold:** Lenient SLAs

### Development

**Rules:**

* None or weekly digest only

**Schedule:** Daily | **Threshold:** Very lenient or disabled

## Threshold Tuning

### Start Lenient

If your ETL runs hourly, don't set a 30-minute SLA:

| Pattern        | Starting SLA | After Tuning |
| -------------- | ------------ | ------------ |
| 15 min updates | 45 min       | 30 min       |
| Hourly updates | 3 hours      | 2 hours      |
| Daily updates  | 36 hours     | 24 hours     |

### Use Warning Thresholds

Two-stage alerts reduce surprise violations:

**orders table freshness:**

* **Expected**: Updated hourly
* **Warning**: After 90 minutes (alert to Slack)
* **Violation**: After 2 hours (alert to PagerDuty)

Warnings give you time to investigate before escalation.
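
As an illustration of the two-stage idea, the sketch below classifies a table's freshness from the time of its last update, using the 90-minute and 2-hour values from the `orders` example. The function and constant names are assumptions made for this example, not product settings.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the orders example above.
WARNING_AFTER = timedelta(minutes=90)
VIOLATION_AFTER = timedelta(hours=2)


def freshness_status(last_updated: datetime, now: datetime) -> str:
    """Classify data age as "ok", "warning", or "violation"."""
    age = now - last_updated
    if age > VIOLATION_AFTER:
        return "violation"  # escalate to PagerDuty
    if age > WARNING_AFTER:
        return "warning"    # notify Slack
    return "ok"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_status(now - timedelta(minutes=100), now))  # warning
```

The check for the violation threshold comes first so that a table that is already past 2 hours is never reported as a mere warning.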

### Review and Tighten

After 2-4 weeks:

1. Check alert history
2. Identify alerts that fired but weren't actionable
3. Tighten thresholds that never trigger
4. Loosen thresholds that trigger too often
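
One way to make this review repeatable is to derive a suggestion per rule from its alert counts. This is a rough sketch: the `suggest` function and the `noisy_above` cutoff of 10 alerts per review period are assumptions, so pick a cutoff that matches your own volume.

```python
def suggest(fired: int, actionable: int, noisy_above: int = 10) -> str:
    """Suggest a tuning action for one rule over the review period."""
    if fired == 0:
        return "tighten"  # threshold never triggers
    if fired > noisy_above:
        return "loosen"   # threshold triggers too often
    if actionable < fired:
        return "review"   # some alerts fired but were not actionable
    return "keep"


# Hypothetical per-rule counts: (alerts fired, alerts acted on).
history = {"orders_freshness": (0, 0), "payments_schema": (25, 3), "users_freshness": (4, 4)}
for rule, (fired, actionable) in history.items():
    print(rule, "->", suggest(fired, actionable))
```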

## Scope Filtering

### Include Only What Matters

Filter rules to relevant assets:

**Rule: Production Revenue Freshness**

* **Data source**: production-postgres
* **Schema**: public
* **Assets**: `orders`, `payments`, `revenue_*`, `transaction_*`

### Exclude Noise

Remove assets that don't need monitoring:

**Exclusions:**

* `*_temp` (temporary tables)
* `*_backup` (backup copies)
* `*_old` (deprecated tables)
* `pg_temp_*` (PostgreSQL temp)
* `test_*` (test tables)
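
To preview which tables a set of patterns would cover, you can test them locally. The sketch below uses Python's standard `fnmatch` module with the include and exclude patterns from this section. It assumes shell-style wildcard matching and that exclusions take precedence over inclusions; how AnomalyArmor evaluates patterns may differ.

```python
from fnmatch import fnmatchcase

INCLUDE = ["orders", "payments", "revenue_*", "transaction_*"]
EXCLUDE = ["*_temp", "*_backup", "*_old", "pg_temp_*", "test_*"]


def in_scope(table: str) -> bool:
    """True if the table matches an include pattern and no exclude pattern."""
    if any(fnmatchcase(table, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatchcase(table, pattern) for pattern in INCLUDE)


tables = ["orders", "revenue_daily", "revenue_daily_backup", "test_orders", "users"]
print([t for t in tables if in_scope(t)])  # ['orders', 'revenue_daily']
```

`fnmatchcase` is used rather than `fnmatch` so that matching is case-sensitive on every operating system.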

## Alert Aggregation

Avoid alert storms by grouping related alerts:

### Same Asset, Multiple Changes

**Instead of:**

* Column added: `new_field_1`
* Column added: `new_field_2`
* Column added: `new_field_3`
* Column type changed: `status`

**AnomalyArmor groups:**

* **Schema Change: 4 changes detected**
  * 3 columns added
  * 1 column type changed
  * View details →
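
The grouping above amounts to counting change types per asset. Here is a minimal sketch of that idea, assuming changes arrive as `(asset, change_type, column)` tuples. The tuple shape, the `orders` asset name, and the `summarize` function are hypothetical and do not describe AnomalyArmor's internals.

```python
from collections import Counter

# Hypothetical raw change events: (asset, change_type, column).
changes = [
    ("orders", "column_added", "new_field_1"),
    ("orders", "column_added", "new_field_2"),
    ("orders", "column_added", "new_field_3"),
    ("orders", "column_type_changed", "status"),
]


def summarize(changes):
    """Group changes per asset and count each change type."""
    by_asset = {}
    for asset, change_type, _column in changes:
        by_asset.setdefault(asset, Counter())[change_type] += 1
    return by_asset


for asset, counts in summarize(changes).items():
    print(f"{asset}: {sum(counts.values())} changes detected")
    for change_type, n in counts.items():
        print(f"  {n} x {change_type}")
```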

### Deduplication

The same change won't re-alert until it is resolved or a cooldown period passes.
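
A cooldown-based deduplicator can be sketched in a few lines. The `Deduplicator` class below is an illustration under stated assumptions, not the product's implementation: it keys alerts by an arbitrary string, suppresses repeats inside the cooldown window, and treats resolution as clearing the key so that a recurrence alerts again.

```python
from __future__ import annotations

from datetime import datetime, timedelta


class Deduplicator:
    """Suppress repeat alerts for the same key within a cooldown window."""

    def __init__(self, cooldown: timedelta) -> None:
        self.cooldown = cooldown
        self._last_sent: dict[str, datetime] = {}

    def should_alert(self, key: str, now: datetime) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window
        self._last_sent[key] = now
        return True

    def resolve(self, key: str) -> None:
        # Once resolved, a recurrence of the same change should alert again.
        self._last_sent.pop(key, None)


dedup = Deduplicator(cooldown=timedelta(hours=1))
t0 = datetime(2024, 1, 1, 12, 0)
print(dedup.should_alert("orders:column_removed:status", t0))                          # True
print(dedup.should_alert("orders:column_removed:status", t0 + timedelta(minutes=30)))  # False
```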

## Common Mistakes

<AccordionGroup>
  <Accordion title="Alerting on everything">
    **Problem**: Every table, every change, every environment → hundreds of alerts

    **Solution**: Start with 5-10 critical tables. Expand only after you've proven the value.
  </Accordion>

  <Accordion title="Same destination for everything">
    **Problem**: All alerts go to Slack → important ones get buried

    **Solution**: Use event-based routing. PagerDuty for breaking changes, Slack for non-breaking schema changes, Email for informational events.
  </Accordion>

  <Accordion title="Too-tight SLAs">
    **Problem**: Freshness SLA is 1 hour, but ETL sometimes takes 70 minutes → constant false positives

    **Solution**: Set SLA at 2x expected, tune down over time.
  </Accordion>

  <Accordion title="Monitoring dev environments">
    **Problem**: Dev databases change constantly → alert storm

    **Solution**: Don't monitor dev at all, or use weekly email digests only.
  </Accordion>

  <Accordion title="No one owns the alerts">
    **Problem**: Alerts fire but no one responds

    **Solution**: Define ownership for each alert type. Use PagerDuty with on-call rotations for critical alerts.
  </Accordion>
</AccordionGroup>

## Weekly Review Process

Schedule 15-30 minutes weekly to review alerts:

### Questions to Ask

1. **How many alerts fired this week?**
   * If more than 50: Too many. Add filters or raise thresholds.
   * If fewer than 5: Are you monitoring enough?

2. **What percentage were actionable?**
   * Target: >80%
   * If lower: Identify patterns and add filters

3. **Were any issues missed?**
   * If yes: Add coverage for those scenarios

4. **Which alerts took longest to resolve?**
   * These may need better routing or documentation
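
The first two questions are easy to answer from exported alert history. The sketch below assumes you can label each alert as actionable or not; the `AlertRecord` type and `weekly_review` function are hypothetical names for this example. The cutoffs (more than 50, fewer than 5, above 80 percent) come from the list above.

```python
from dataclasses import dataclass


@dataclass
class AlertRecord:
    rule: str
    actionable: bool  # did someone take action on this alert?


def weekly_review(alerts):
    """Return (total, actionable_pct, findings) for one week of alerts."""
    total = len(alerts)
    actionable_pct = 100.0 * sum(a.actionable for a in alerts) / total if total else 0.0
    findings = []
    if total > 50:
        findings.append("too many alerts: add filters or raise thresholds")
    elif total < 5:
        findings.append("few alerts: check whether coverage is sufficient")
    if total and actionable_pct <= 80.0:
        findings.append("actionable rate at or below 80%: look for noisy patterns")
    return total, actionable_pct, findings


week = [AlertRecord("orders_freshness", True)] * 6 + [AlertRecord("dev_schema", False)] * 4
print(weekly_review(week))
```

Questions 3 and 4 need incident and resolution data that alert history alone does not capture, so they are left to the review meeting.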

### Tuning Actions

| Finding                              | Action                                     |
| ------------------------------------ | ------------------------------------------ |
| Alert fires often but isn't actioned | Disable or change to email digest          |
| Same asset alerts repeatedly         | Investigate root cause, not just the alert |
| Critical issue wasn't alerted        | Add coverage                               |
| Team ignores channel                 | Reduce volume or change channel            |

## Sample Alert Configuration

Here's a recommended starting configuration:

| Rule                            | Event               | Scope                              | Trigger scope         | Destinations                                                        |
| ------------------------------- | ------------------- | ---------------------------------- | --------------------- | ------------------------------------------------------------------- |
| **Production Breaking Changes** | Schema Change       | Production database, all schemas   | Breaking only         | PagerDuty, Slack #incidents                                         |
| **Production Additive Changes** | Schema Change       | Production database, all schemas   | Non-breaking only     | Slack #data-alerts                                                  |
| **Critical Table Freshness**    | Freshness Violation | orders, payments, users, products  | SLA from asset config | Slack #data-alerts, PagerDuty (if >4h stale)                        |
| **Analytics Freshness**         | Freshness Violation | `daily_*`, `weekly_*`, `analytics_*` | SLA from asset config | Slack #analytics-team                                               |
| **Discovery Failures**          | Discovery Failed    | All                                | All failures          | Slack #data-alerts, Email [ops@company.com](mailto:ops@company.com) |
| **Staging Changes (Breaking)**  | Schema Change       | Staging database                   | Breaking only         | Email (daily digest)                                                |

## Checklist

Before going live with alerts:

* [ ] Defined critical tables (start with 5-10)
* [ ] Set up event-based routing (breaking → PagerDuty, others → Slack)
* [ ] Excluded dev/test environments
* [ ] SLAs set with buffer (2x expected)
* [ ] Warning thresholds configured
* [ ] Assigned ownership for each alert type
* [ ] Scheduled weekly review meeting
* [ ] Documented escalation process

## Use Schedules and Blackouts

Reduce noise by controlling when alerts fire:

### Operating Schedules

Assign [operating schedules](/alerts/schedules) to rules that only matter during business hours:

* **Freshness rules**: If your pipelines run overnight, set schedules to only alert during business hours when the team can respond
* **Non-critical schema changes**: Alert during work hours, suppress overnight
* **Development environments**: Restrict to CI/CD windows

### Blackout Windows

Use [blackout windows](/alerts/blackouts) for planned quiet periods:

* **Deployment windows**: Suppress alerts during known release times
* **Holiday freezes**: Create yearly recurring blackouts for company holidays
* **Maintenance periods**: Silence alerts during planned infrastructure work

<Tip>
  Combine schedules and blackouts: schedules handle recurring weekly patterns, blackouts handle specific date ranges. Both keep your team focused on alerts they can act on.
</Tip>
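
To see how the two mechanisms combine, consider this sketch of a notification gate. It assumes a Monday to Friday, 09:00 to 17:00 operating schedule and a single date-range blackout. All names and values are illustrative, and in the product these are configured through schedules and blackout windows rather than code.

```python
from datetime import date, datetime, time

# Recurring weekly pattern (operating schedule): Monday=0 .. Friday=4.
BUSINESS_DAYS = {0, 1, 2, 3, 4}
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)

# Specific date ranges (blackout windows), inclusive on both ends.
BLACKOUTS = [(date(2024, 12, 24), date(2024, 12, 26))]


def should_notify(now: datetime) -> bool:
    """True if an alert raised at `now` should be delivered."""
    if any(start <= now.date() <= end for start, end in BLACKOUTS):
        return False  # blackout overrides the schedule
    in_hours = BUSINESS_START <= now.time() < BUSINESS_END
    return now.weekday() in BUSINESS_DAYS and in_hours


print(should_notify(datetime(2024, 12, 18, 10, 0)))  # True: Wednesday, in hours
print(should_notify(datetime(2024, 12, 25, 10, 0)))  # False: inside blackout
```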

## Common Questions

### How do I stop getting too many data alerts?

Alert fatigue usually comes from monitoring too broadly. Start with 5-10 critical production tables, route only breaking changes to PagerDuty, send additive changes to Slack, and exclude dev and staging from noisy rules. Set freshness SLAs at roughly 2x expected update time, then tighten once you see real patterns.

### Should I alert on dev and staging databases?

Usually no. Dev databases change constantly and produce noise without actionable signal. If you must monitor non-prod, restrict it to breaking changes only and route to a weekly email digest, not to a real-time channel.

### What's a good starting freshness SLA?

Start at two to three times your expected update interval, then tighten over time. For hourly pipelines try a 3-hour SLA and tune down to 2 hours. For 15-minute pipelines try 45 minutes and tune down to 30. SLAs set too tight at launch produce constant false positives from normal pipeline variance.

### Which alerts belong on PagerDuty versus Slack versus email?

Match the destination to urgency. PagerDuty is for breaking production changes and critical SLA violations that need on-call response now. Slack is for schema changes and freshness issues the team should see within hours. Email fits informational events, digests, and low-urgency records.

### How often should I review my alert rules?

Block 15-30 minutes weekly. Count alerts fired, estimate what percentage were actionable (target above 80 percent), and check whether any real issue was missed. Disable rules that never produce action and tighten ones that never fire. This is the fastest path out of chronic alert fatigue.

## Related Topics

<CardGroup cols={2}>
  <Card title="Alert Rules" icon="bell" href="/alerts/alert-rules">
    Configure alert rules
  </Card>

  <Card title="Freshness Monitoring" icon="clock" href="/data-quality/freshness-monitoring">
    Set up freshness SLAs
  </Card>

  <Card title="Slack Integration" icon="slack" href="/alerts/destinations/slack">
    Configure Slack alerts
  </Card>

  <Card title="Alerts Overview" icon="bullhorn" href="/alerts/overview">
    Alert system architecture
  </Card>

  <Card title="Operating Schedules" icon="calendar" href="/alerts/schedules">
    Control when rules are active
  </Card>

  <Card title="Blackout Windows" icon="ban" href="/alerts/blackouts">
    Suppress alerts during maintenance
  </Card>
</CardGroup>
