# On-Call Data Alerting

> Set up production monitoring with proper escalation and on-call routing

**Audience**: Platform Teams, Data Platform, SRE

Data incidents need the same rigor as application incidents. This guide helps you set up 24/7 monitoring with proper escalation, on-call routing, and incident response.

## The Goal

[Diagram: On-call data alerting flow from detection to resolution]


## Architecture Overview

[Diagram: Alerting architecture showing event routing to different destinations]

## Setting Up PagerDuty Integration

### Step 1: Create PagerDuty Service

In PagerDuty:

1. Go to **Services → New Service**
2. Name: `Data Observability - AnomalyArmor`
3. Integration: Select **Events API V2**
4. Copy the **Integration Key**

### Step 2: Add PagerDuty Destination in AnomalyArmor

1. Go to **Alerts → Destinations**
2. Click **Add Destination**
3. Select **PagerDuty**
4. Enter the Integration Key
5. Name: `PagerDuty - Data On-Call`
6. **Test** and **Save**

### Step 3: Configure Escalation Policy

In PagerDuty, set up an escalation policy:

[Diagram: Escalation policy levels]
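Before going live, it can help to confirm that the integration key from Step 1 actually pages someone end to end. The following is a minimal sketch, assuming Python 3 and only the standard library, that sends a trigger event directly to PagerDuty's Events API V2 endpoint. The function names and the `anomalyarmor-manual-test` source label are hypothetical, and the script bypasses AnomalyArmor entirely, so it complements the **Test** button in Step 2 rather than replacing it.

```python
import json
import urllib.request

# Public PagerDuty Events API V2 endpoint.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by Events API V2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_test_event(routing_key, summary,
                     source="anomalyarmor-manual-test",  # hypothetical label
                     severity="critical"):
    """Build an Events API V2 trigger payload (no network access)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,   # the Integration Key from Step 1
        "event_action": "trigger",    # opens a new incident
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event, timeout=10.0):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_event(build_test_event(key, "Test page from data on-call setup"))` with a real key should open an incident on the service and walk the escalation policy from Step 3. Resolve the test incident afterwards so it does not count against pager load.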

## Alert Urgency Framework

Define how urgently different data incidents need response:

### Critical (Page Immediately)

**Criteria:**

* Production data pipeline completely down
* Core revenue tables missing or stale >4 hours
* Discovery failures for >24 hours

**Examples:**

* Column removed from `orders` table
* `payments` table data >4 hours stale
* Can't connect to production database

**Destination:** PagerDuty → On-Call

### High (Respond Within 4 Hours)

**Criteria:**

* Important tables stale (1-4 hours)
* Schema changes in production
* Non-critical discovery failures

**Examples:**

* Column type changed in production
* Analytics tables 2 hours stale
* Staging discovery failed

**Destination:** Slack #data-incidents

### Medium (Respond Within 24 Hours)

**Criteria:**

* Non-production schema changes
* Warning thresholds reached
* New assets discovered

**Examples:**

* Staging schema changed
* Freshness approaching SLA (warning)
* New table discovered in production

**Destination:** Slack #data-alerts

### Low (Informational)

**Criteria:**

* Development changes
* Expected changes
* Routine discoveries

**Destination:** Email digest (daily)

## Alert Rule Configuration

### Rule 1: Critical - Production Breaking Changes

| Field            | Value                                             |
| ---------------- | ------------------------------------------------- |
| **Name**         | CRITICAL - Production Breaking Changes            |
| **Event**        | Schema Change Detected                            |
| **Data source**  | `production-*`                                    |
| **Schema**       | `public`, `analytics`                             |
| **Change type**  | Column Removed, Table Removed                     |
| **Destinations** | PagerDuty (Data On-Call), Slack `#data-incidents` |

### Rule 2: Critical - Revenue Table Freshness

| Field               | Value                                             |
| ------------------- | ------------------------------------------------- |
| **Name**            | CRITICAL - Revenue Data Stale                     |
| **Event**           | Freshness Violation                               |
| **Assets**          | `orders`, `payments`, `revenue_*`                 |
| **SLA exceeded by** | >4 hours                                          |
| **Destinations**    | PagerDuty (Data On-Call), Slack `#data-incidents` |

### Rule 3: High - Production Schema Changes

| Field            | Value                     |
| ---------------- | ------------------------- |
| **Name**         | Production Schema Changes |
| **Event**        | Schema Change Detected    |
| **Data source**  | `production-*`            |
| **Change type**  | All                       |
| **Destinations** | Slack `#data-incidents`   |

### Rule 4: High - Data Freshness Violations

| Field            | Value                            |
| ---------------- | -------------------------------- |
| **Name**         | HIGH - Data Freshness Violations |
| **Event**        | Freshness Violation              |
| **Data source**  | `production-*`                   |
| **Condition**    | SLA exceeded                     |
| **Destinations** | Slack `#data-incidents`          |

### Rule 5: High - Discovery Failures

| Field            | Value                                                      |
| ---------------- | ---------------------------------------------------------- |
| **Name**         | HIGH - Discovery Failures                                  |
| **Event**        | Discovery Failed                                           |
| **Data source**  | `production-*`                                             |
| **Destinations** | Slack `#data-incidents`, Email `data-platform@company.com` |

## On-Call Runbook

### When Paged for Schema Change

[Diagram: On-call runbook for schema changes]
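A useful first triage question when paged is why this particular change paged at all. Under Rules 1 and 3 above, any schema change on a `production-*` data source goes to Slack `#data-incidents`, and only a removed column or table in the `public` or `analytics` schema additionally reaches PagerDuty. The sketch below restates that logic in Python for reasoning and for unit-testing your own expectations. It is not AnomalyArmor's rule engine: the function name, the change-type identifiers, and the destination strings are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical identifiers for the "Column Removed" / "Table Removed"
# change types named in Rule 1.
BREAKING_CHANGES = {"column_removed", "table_removed"}

# Schemas listed in Rule 1.
PAGED_SCHEMAS = {"public", "analytics"}


def route_schema_change(data_source, schema, change_type):
    """Return the destinations a schema change reaches under Rules 1 and 3.

    Destination strings are illustrative labels, not real identifiers.
    """
    # Both rules only match data sources named production-*.
    if not fnmatch(data_source, "production-*"):
        return []  # not covered by Rules 1 and 3

    # Rule 3: every production schema change goes to Slack.
    destinations = ["slack:#data-incidents"]

    # Rule 1: breaking changes in the listed schemas also page on-call.
    if schema in PAGED_SCHEMAS and change_type in BREAKING_CHANGES:
        destinations.insert(0, "pagerduty:data-on-call")

    return destinations
```

For example, `route_schema_change("production-warehouse", "public", "column_removed")` includes the PagerDuty destination, while a column type change on the same table reaches Slack only. If a page arrives for a change that this logic would not route to PagerDuty, the rule configuration is worth reviewing before the next handoff.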

### When Paged for Freshness Violation

1. **ACKNOWLEDGE** the alert
2. **CHECK ETL STATUS**
   * Is the ETL job running? Failed? Stuck?
   * Check Airflow/Dagster/orchestrator
3. **CHECK SOURCE SYSTEM**
   * Is the source database accessible?
   * Is source data actually updating?
4. **IDENTIFY ROOT CAUSE**
   * ETL failure → Fix and restart
   * Source delay → Communicate delay
   * Connection issue → Troubleshoot connection
5. **MITIGATE**
   * Restart failed jobs
   * Notify stakeholders of delay
6. **RESOLVE** and document

## Slack Integration Best Practices

### Channel Setup

**Slack Channels:**

* `#data-incidents` - Breaking changes (notifications on)
* `#data-alerts` - All schema changes (lower priority)
* `#data-digest` - Daily/weekly summaries

### Alert Message Format

AnomalyArmor alerts include:

```
🔴 CRITICAL: Schema Change Detected

Asset: production.public.orders
Change: Column removed - shipping_status (varchar)
Detected: Today at 3:15 PM UTC
Discovery Run: #12345

Impact: High - This table is used by 5 downstream models

Actions:
• [View in AnomalyArmor]
• [View Asset Details]
• [View Downstream Dependencies]

On-Call: @data-oncall
```

## Maintenance Windows

### Scheduled Maintenance

Before planned changes:

1. Go to **Alerts → Rules**
2. Toggle OFF relevant rules
3. Set a reminder to re-enable (e.g., calendar event)
4. Proceed with maintenance
5. Verify changes detected correctly
6. Toggle rules back ON

### Quick Disable

For unexpected but known issues, quickly disable a rule:

1. Go to **Alerts → Rules**
2. Find the rule
3. Toggle it **OFF**
4. Remember to re-enable when the issue is resolved

## Metrics to Track

| Metric                 | Target     | How to Measure                |
| ---------------------- | ---------- | ----------------------------- |
| MTTD (Time to Detect)  | \< 1 hour  | Discovery frequency           |
| MTTN (Time to Notify)  | \< 5 min   | Alert → PagerDuty time        |
| MTTR (Time to Resolve) | \< 4 hours | Alert → Resolution time       |
| False Positive Rate    | \< 20%     | Alerts ignored / Total alerts |
| Pager Load             | \< 5/week  | Critical alerts per week      |

Review these weekly in your on-call handoff.

## Checklist

Before going live with on-call alerting:

* [ ] PagerDuty integration configured
* [ ] Escalation policy set up
* [ ] Critical/High/Medium/Low rules defined
* [ ] Slack channels created and configured
* [ ] On-call runbook documented
* [ ] Team trained on response procedures
* [ ] Test alert sent and verified

## Common Questions

### How do I page my on-call engineer when data breaks?

Create a PagerDuty service with an **Events API V2** integration, copy the integration key, and add a PagerDuty destination in **Alerts → Destinations**. Then route only your Critical rules (breaking schema changes, revenue-table freshness >4h) to that destination. See [Setting Up PagerDuty Integration](#setting-up-pagerduty-integration).

### Which data incidents should actually page someone?

Page on production pipelines being completely down, core revenue tables stale for more than 4 hours, or discovery failures lasting over 24 hours. Everything else should go to Slack, not PagerDuty, to protect on-call from alert fatigue. See the [Alert Urgency Framework](#alert-urgency-framework).

### How do I suppress alerts during planned maintenance?

Go to **Alerts → Rules** and toggle off the relevant rules before the maintenance window, then re-enable after. Set a calendar reminder so rules don't stay off indefinitely. For recurring windows, use operating schedules and blackouts in the contract config instead.
### What metrics should I track for data on-call health?

MTTD (under 1 hour, driven by discovery frequency), MTTN (under 5 minutes from alert to page), MTTR (under 4 hours), false-positive rate (under 20%), and pager load (under 5 critical alerts per week). Review these weekly in your on-call handoff. See [Metrics to Track](#metrics-to-track).

### Can I send different alerts to different Slack channels?

Yes. Create separate destinations for `#data-incidents` (breaking changes), `#data-alerts` (all schema changes), and `#data-digest` (daily summaries), then route each alert rule by severity. That keeps high-signal alerts out of the noisy firehose and stops people from muting the wrong channel.

## Related Resources

* Detailed PagerDuty integration guide
* Reduce alert fatigue