# On-Call Data Alerting

> Set up production monitoring with proper escalation and on-call routing

**Audience**: Platform Teams, Data Platform, SRE

Data incidents need the same rigor as application incidents. This guide helps you set up 24/7 monitoring with proper escalation, on-call routing, and incident response.

## The Goal

<img src="https://mintcdn.com/anomalyarmor/qiFTglXM5puNhBYZ/images/diagrams/alert-lifecycle-light.svg?fit=max&auto=format&n=qiFTglXM5puNhBYZ&q=85&s=30e3e92cecc8171e3386b46fad7bea7f" alt="On-call data alerting flow from detection to resolution" className="block dark:hidden" width="900" height="260" data-path="images/diagrams/alert-lifecycle-light.svg" />

<img src="https://mintcdn.com/anomalyarmor/pPIiSU0b3Ixsp9az/images/diagrams/alert-lifecycle-dark.svg?fit=max&auto=format&n=pPIiSU0b3Ixsp9az&q=85&s=b4566eb24fc944ad933960b11ebfc7b3" alt="On-call data alerting flow from detection to resolution" className="hidden dark:block" width="900" height="260" data-path="images/diagrams/alert-lifecycle-dark.svg" />

## Architecture Overview

<img src="https://mintcdn.com/anomalyarmor/CZXBGa_D1aE9spAI/images/diagrams/event-based-routing-light.svg?fit=max&auto=format&n=CZXBGa_D1aE9spAI&q=85&s=54de0d2e3538c599c67903f5771caead" alt="Alerting architecture showing event routing to different destinations" className="block dark:hidden" width="800" height="350" data-path="images/diagrams/event-based-routing-light.svg" />

<img src="https://mintcdn.com/anomalyarmor/CZXBGa_D1aE9spAI/images/diagrams/event-based-routing-dark.svg?fit=max&auto=format&n=CZXBGa_D1aE9spAI&q=85&s=ddd22a76066b93dc1d9ee3cc68e37eae" alt="Alerting architecture showing event routing to different destinations" className="hidden dark:block" width="800" height="350" data-path="images/diagrams/event-based-routing-dark.svg" />

## Setting Up PagerDuty Integration

### Step 1: Create PagerDuty Service

In PagerDuty:

1. Go to **Services → New Service**
2. Name: `Data Observability - AnomalyArmor`
3. Integration: Select **Events API V2**
4. Copy the **Integration Key**

### Step 2: Add PagerDuty Destination in AnomalyArmor

1. Go to **Alerts → Destinations**
2. Click **Add Destination**
3. Select **PagerDuty**
4. Enter the Integration Key
5. Name: `PagerDuty - Data On-Call`
6. **Test** and **Save**
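To check the Integration Key independently of the **Test** button, you can send a trigger event straight to the PagerDuty Events API v2. The following is a minimal sketch, not an AnomalyArmor tool: the function names, the summary text, and the `anomalyarmor-test` source label are illustrative assumptions. Only the endpoint and the `routing_key` / `event_action` / `payload` fields come from PagerDuty's Events API v2.

```python
import json
import urllib.request

# Public Events API v2 endpoint for trigger/acknowledge/resolve events.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_test_event(routing_key: str) -> dict:
    """Build an Events API v2 trigger event for a routing smoke test."""
    return {
        "routing_key": routing_key,  # the Integration Key copied in Step 1
        "event_action": "trigger",
        "payload": {
            "summary": "TEST - data on-call routing check",  # illustrative text
            "source": "anomalyarmor-test",  # hypothetical source label
            "severity": "critical",  # one of: critical, error, warning, info
        },
    }


def send_event(event: dict) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_event(build_test_event("<your integration key>"))` pages whoever is currently on call for that service, so warn them first and resolve the test incident afterwards.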

### Step 3: Configure Escalation Policy

In PagerDuty, set up escalation:

<img src="https://mintcdn.com/anomalyarmor/un2W3qlHEQ29uwyl/images/diagrams/escalation-policy-light.svg?fit=max&auto=format&n=un2W3qlHEQ29uwyl&q=85&s=cf191e8dddc9e6bf3609c0b1f34c58d6" alt="Escalation policy levels" className="block dark:hidden" width="700" height="280" data-path="images/diagrams/escalation-policy-light.svg" />

<img src="https://mintcdn.com/anomalyarmor/CZXBGa_D1aE9spAI/images/diagrams/escalation-policy-dark.svg?fit=max&auto=format&n=CZXBGa_D1aE9spAI&q=85&s=da2d5de67e7390e019d88ef609d43158" alt="Escalation policy levels" className="hidden dark:block" width="700" height="280" data-path="images/diagrams/escalation-policy-dark.svg" />

## Alert Urgency Framework

Define how quickly each class of data incident needs a response:

### Critical (Page Immediately)

**Criteria:**

* Production data pipeline completely down
* Breaking schema changes in production (column or table removed)
* Core revenue tables missing or stale >4 hours
* Discovery failures for >24 hours

**Examples:**

* Column removed from `orders` table
* `payments` table data >4 hours stale
* Can't connect to production database

**Destination:** PagerDuty → On-Call

### High (Respond Within 4 Hours)

**Criteria:**

* Important tables stale (1-4 hours)
* Schema changes in production
* Non-critical discovery failures

**Examples:**

* Column type changed in production
* Analytics tables 2 hours stale
* Staging discovery failed

**Destination:** Slack #data-incidents

### Medium (Respond Within 24 Hours)

**Criteria:**

* Non-production schema changes
* Warning thresholds reached
* New assets discovered

**Examples:**

* Staging schema changed
* Freshness approaching SLA (warning)
* New table discovered in production

**Destination:** Slack #data-alerts

### Low (Informational)

**Criteria:**

* Development changes
* Expected changes
* Routine discoveries

**Destination:** Email digest (daily)
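For freshness incidents, the tiers above can be read as a small decision function. This is a sketch of one possible reading, not AnomalyArmor's routing logic: the `FreshnessIncident` type and `classify` function are hypothetical, and two choices are assumptions because the framework leaves them open: any production table at least 1 hour stale is treated as High, and staging freshness issues are treated as Medium.

```python
from dataclasses import dataclass

# Destination per tier, as listed in the framework above.
DESTINATIONS = {
    "critical": "PagerDuty (Data On-Call)",
    "high": "Slack #data-incidents",
    "medium": "Slack #data-alerts",
    "low": "Email digest (daily)",
}


@dataclass
class FreshnessIncident:
    environment: str       # "production", "staging", or "development"
    is_core_revenue: bool  # e.g. orders, payments
    hours_stale: float


def classify(incident: FreshnessIncident) -> str:
    """Map a freshness incident to an urgency tier."""
    if incident.environment != "production":
        # Assumption: staging is Medium, everything else non-production is Low.
        return "medium" if incident.environment == "staging" else "low"
    if incident.is_core_revenue and incident.hours_stale > 4:
        return "critical"
    if incident.hours_stale >= 1:
        # Assumption: any production table at least 1 hour stale is High.
        return "high"
    # Below 1 hour is treated as a warning-level signal.
    return "medium"
```

For example, `DESTINATIONS[classify(FreshnessIncident("production", True, 5))]` selects the PagerDuty destination, while the same staleness on a non-revenue table stays in Slack.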

## Alert Rule Configuration

### Rule 1: Critical - Production Breaking Changes

| Field            | Value                                             |
| ---------------- | ------------------------------------------------- |
| **Name**         | CRITICAL - Production Breaking Changes            |
| **Event**        | Schema Change Detected                            |
| **Data source**  | `production-*`                                    |
| **Schema**       | `public`, `analytics`                             |
| **Change type**  | Column Removed, Table Removed                     |
| **Destinations** | PagerDuty (Data On-Call), Slack `#data-incidents` |

### Rule 2: Critical - Revenue Table Freshness

| Field               | Value                                             |
| ------------------- | ------------------------------------------------- |
| **Name**            | CRITICAL - Revenue Data Stale                     |
| **Event**           | Freshness Violation                               |
| **Assets**          | `orders`, `payments`, `revenue_*`                 |
| **SLA exceeded by** | >4 hours                                          |
| **Destinations**    | PagerDuty (Data On-Call), Slack `#data-incidents` |

### Rule 3: High - Production Schema Changes

| Field            | Value                            |
| ---------------- | -------------------------------- |
| **Name**         | HIGH - Production Schema Changes |
| **Event**        | Schema Change Detected           |
| **Data source**  | `production-*`                   |
| **Change type**  | All                              |
| **Destinations** | Slack `#data-incidents`          |

### Rule 4: High - Data Freshness Violations

| Field            | Value                            |
| ---------------- | -------------------------------- |
| **Name**         | HIGH - Data Freshness Violations |
| **Event**        | Freshness Violation              |
| **Data source**  | `production-*`                   |
| **Condition**    | SLA exceeded                     |
| **Destinations** | Slack `#data-incidents`          |

### Rule 5: High - Discovery Failures

| Field            | Value                                                      |
| ---------------- | ---------------------------------------------------------- |
| **Name**         | HIGH - Discovery Failures                                  |
| **Event**        | Discovery Failed                                           |
| **Data source**  | `production-*`                                             |
| **Destinations** | Slack `#data-incidents`, Email `data-platform@company.com` |
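Rules 1 and 3 overlap: a column removed in production matches both. The sketch below models that overlap so you can reason about who gets notified. It assumes `production-*` behaves like a shell-style glob, that every matching rule fires, and that duplicate destinations are collapsed; none of these behaviors is confirmed on this page. The `RULES` structure, the `destinations_for` helper, and the `production-warehouse` source name are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory copy of Rules 1 and 3 from the tables above.
RULES = [
    {
        "name": "CRITICAL - Production Breaking Changes",
        "event": "Schema Change Detected",
        "data_source": "production-*",
        "change_types": {"Column Removed", "Table Removed"},
        "destinations": ["PagerDuty (Data On-Call)", "Slack #data-incidents"],
    },
    {
        "name": "Production Schema Changes",
        "event": "Schema Change Detected",
        "data_source": "production-*",
        "change_types": None,  # None stands for "All"
        "destinations": ["Slack #data-incidents"],
    },
]


def destinations_for(event: str, data_source: str, change_type: str) -> list:
    """Collect destinations from every matching rule, without duplicates."""
    matched = []
    for rule in RULES:
        if rule["event"] != event:
            continue
        if not fnmatchcase(data_source, rule["data_source"]):
            continue
        allowed = rule["change_types"]
        if allowed is not None and change_type not in allowed:
            continue
        for destination in rule["destinations"]:
            if destination not in matched:
                matched.append(destination)
    return matched
```

Under these assumptions, a `Column Removed` event on `production-warehouse` reaches PagerDuty and `#data-incidents`, while a type change on the same source reaches only `#data-incidents`.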

## On-Call Runbook

### When Paged for Schema Change

<img src="https://mintcdn.com/anomalyarmor/un2W3qlHEQ29uwyl/images/diagrams/oncall-runbook-light.svg?fit=max&auto=format&n=un2W3qlHEQ29uwyl&q=85&s=9446a25479212d678e0e8c51a5860fa1" alt="On-call runbook for schema changes" className="block dark:hidden" width="700" height="380" data-path="images/diagrams/oncall-runbook-light.svg" />

<img src="https://mintcdn.com/anomalyarmor/CZXBGa_D1aE9spAI/images/diagrams/oncall-runbook-dark.svg?fit=max&auto=format&n=CZXBGa_D1aE9spAI&q=85&s=3e66c9ee35f4ac17169db09697d400b3" alt="On-call runbook for schema changes" className="hidden dark:block" width="700" height="380" data-path="images/diagrams/oncall-runbook-dark.svg" />

### When Paged for Freshness Violation

1. **ACKNOWLEDGE** the alert

2. **CHECK ETL STATUS**
   * Is the ETL job running? Failed? Stuck?
   * Check Airflow/Dagster/orchestrator

3. **CHECK SOURCE SYSTEM**
   * Is the source database accessible?
   * Is source data actually updating?

4. **IDENTIFY ROOT CAUSE**
   * ETL failure → Fix and restart
   * Source delay → Communicate delay
   * Connection issue → Troubleshoot connection

5. **MITIGATE**
   * Restart failed jobs
   * Notify stakeholders of delay

6. **RESOLVE** and document
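Steps 2 to 4 amount to a short decision tree. The sketch below encodes it in the same order as the runbook, checking ETL status first and then the source. The `RootCause` enum and `identify_root_cause` function are illustrative names, and the `UNKNOWN` fallback is an assumption added for the case where all three checks pass.

```python
from enum import Enum


class RootCause(Enum):
    # Each value is the action from step 4 of the runbook.
    ETL_FAILURE = "Fix and restart"
    CONNECTION_ISSUE = "Troubleshoot connection"
    SOURCE_DELAY = "Communicate delay"
    UNKNOWN = "Keep investigating"  # assumption: not part of the runbook


def identify_root_cause(
    etl_healthy: bool, source_reachable: bool, source_updating: bool
) -> RootCause:
    """Walk the runbook checks in order and return the first failure."""
    if not etl_healthy:
        return RootCause.ETL_FAILURE
    if not source_reachable:
        return RootCause.CONNECTION_ISSUE
    if not source_updating:
        return RootCause.SOURCE_DELAY
    return RootCause.UNKNOWN
```

The inputs are answers the on-call engineer gathers by hand from the orchestrator and the source database; the function only records the order in which to act on them.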

## Slack Integration Best Practices

### Channel Setup

**Slack Channels:**

* `#data-incidents` - Breaking changes (notifications on)
* `#data-alerts` - All schema changes (lower priority)
* `#data-digest` - Daily/weekly summaries

### Alert Message Format

AnomalyArmor alerts include:

```
🔴 CRITICAL: Schema Change Detected

Asset: production.public.orders
Change: Column removed - shipping_status (varchar)

Detected: Today at 3:15 PM UTC
Discovery Run: #12345

Impact: High - This table is used by 5 downstream models

Actions:
• [View in AnomalyArmor]
• [View Asset Details]
• [View Downstream Dependencies]

On-Call: @data-oncall
```

## Maintenance Windows

### Scheduled Maintenance

Before planned changes:

1. Go to **Alerts → Rules**
2. Toggle OFF relevant rules
3. Set a reminder to re-enable (e.g., calendar event)
4. Proceed with maintenance
5. Verify that the changes were detected correctly
6. Toggle rules back ON

### Quick Disable

For unexpected but known issues, quickly disable a rule:

1. Go to **Alerts → Rules**
2. Find the rule
3. Toggle it **OFF**
4. Re-enable the rule once the issue is resolved

## Metrics to Track

| Metric                      | Target     | How to Measure                |
| --------------------------- | ---------- | ----------------------------- |
| MTTD (Mean Time to Detect)  | \< 1 hour  | Discovery frequency           |
| MTTN (Mean Time to Notify)  | \< 5 min   | Alert → PagerDuty time        |
| MTTR (Mean Time to Resolve) | \< 4 hours | Alert → Resolution time       |
| False Positive Rate         | \< 20%     | Alerts ignored / Total alerts |
| Pager Load                  | \< 5/week  | Critical alerts per week      |

Review these weekly in your on-call handoff.
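If you export a week of alerts from your incident tooling, MTTR, false positive rate, and pager load can be computed in a few lines. This is a minimal sketch: the `AlertRecord` shape and the sample data are made up for illustration and do not reflect an AnomalyArmor or PagerDuty export format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class AlertRecord:
    severity: str                          # "critical", "high", "medium", "low"
    fired_at: datetime
    resolved_at: Optional[datetime] = None
    ignored: bool = False                  # nobody acted on it


def mttr_hours(alerts: List[AlertRecord]) -> Optional[float]:
    """Mean alert-to-resolution time in hours, over resolved alerts only."""
    durations = [
        (a.resolved_at - a.fired_at).total_seconds() / 3600
        for a in alerts
        if a.resolved_at is not None
    ]
    return mean(durations) if durations else None


def false_positive_rate(alerts: List[AlertRecord]) -> float:
    """Alerts ignored divided by total alerts."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if a.ignored) / len(alerts)


def pager_load(alerts: List[AlertRecord]) -> int:
    """Number of critical alerts in the period."""
    return sum(1 for a in alerts if a.severity == "critical")


# Made-up sample week: two resolved pages, one ignored alert, one open alert.
start = datetime(2024, 1, 1, 9, 0)
week = [
    AlertRecord("critical", start, start + timedelta(hours=2)),
    AlertRecord("critical", start, start + timedelta(hours=4)),
    AlertRecord("high", start, ignored=True),
    AlertRecord("medium", start),
]
```

For this sample, `mttr_hours(week)` is 3.0, `false_positive_rate(week)` is 0.25, and `pager_load(week)` is 2, so MTTR and pager load meet their targets while the false positive rate does not.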

## Checklist

Before going live with on-call alerting:

* [ ] PagerDuty integration configured
* [ ] Escalation policy set up
* [ ] Critical/High/Medium/Low rules defined
* [ ] Slack channels created and configured
* [ ] On-call runbook documented
* [ ] Team trained on response procedures
* [ ] Test alert sent and verified

## Common Questions

### How do I page my on-call engineer when data breaks?

Create a PagerDuty service with an **Events API V2** integration, copy the integration key, and add a PagerDuty destination in **Alerts → Destinations**. Then route only your Critical rules (breaking schema changes, revenue-table freshness >4h) to that destination. See [Setting Up PagerDuty Integration](#setting-up-pagerduty-integration).

### Which data incidents should actually page someone?

Page when a production pipeline is completely down, a breaking schema change (column or table removed) hits production, core revenue tables are stale for more than 4 hours, or discovery has been failing for over 24 hours. Everything else should go to Slack, not PagerDuty, to protect on-call engineers from alert fatigue. See the [Alert Urgency Framework](#alert-urgency-framework).

### How do I suppress alerts during planned maintenance?

Go to **Alerts → Rules** and toggle off the relevant rules before the maintenance window, then re-enable them afterward. Set a calendar reminder so rules don't stay off indefinitely. For recurring windows, use operating schedules and blackouts in the contract config instead.

### What metrics should I track for data on-call health?

MTTD (under 1 hour, driven by discovery frequency), MTTN (under 5 minutes from alert to page), MTTR (under 4 hours), false-positive rate (under 20%), and pager load (under 5 critical alerts per week). Review these weekly in your on-call handoff. See [Metrics to Track](#metrics-to-track).

### Can I send different alerts to different Slack channels?

Yes. Create separate destinations for `#data-incidents` (breaking changes), `#data-alerts` (all schema changes), and `#data-digest` (daily summaries), then route each alert rule by severity. That keeps high-signal alerts out of the noisy firehose and stops people from muting the wrong channel.

## Related Resources

<CardGroup cols={2}>
  <Card title="PagerDuty Setup" icon="bell" href="/alerts/destinations/pagerduty">
    Detailed PagerDuty integration guide
  </Card>

  <Card title="Alert Best Practices" icon="lightbulb" href="/alerts/best-practices">
    Reduce alert fatigue
  </Card>
</CardGroup>
