Alerting

Azure Monitor alert rules are configured for each Microtec environment to detect and notify on infrastructure anomalies, service degradation, and business-critical failures. All alerts route through Action Groups, which fan out to email, Teams, PagerDuty, and SMS depending on severity and environment.

Alert Severity Levels

Azure Monitor uses numeric severity levels (0–4). Microtec maps these to operational severity tiers:

Azure Severity	Microtec Tier	Description	Response SLA
Sev 0	P1 — Critical	Service completely unavailable, data loss risk	15 minutes
Sev 1	P2 — High	Major feature unavailable, significant user impact	1 hour
Sev 2	P3 — Medium	Performance degraded, partial feature loss	4 hours
Sev 3	P4 — Low	Advisory, non-urgent anomaly	Next business day
Sev 4	P5 — Info	Trend notification, no action required	—
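
The mapping above can be written as a small lookup, which may be useful in scripts that post-process alert payloads. This is only a sketch: the function name `tier_for_severity` is hypothetical and not part of the Microtec tooling.

```bash
# Hypothetical helper: map an Azure Monitor severity (0-4) to the
# Microtec tier and response SLA listed in the table above.
tier_for_severity() {
  case "$1" in
    0) echo "P1 Critical (15 minutes)" ;;
    1) echo "P2 High (1 hour)" ;;
    2) echo "P3 Medium (4 hours)" ;;
    3) echo "P4 Low (next business day)" ;;
    4) echo "P5 Info (no response SLA)" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

tier_for_severity 2   # prints: P3 Medium (4 hours)
```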

Action Groups

Primary Action Group (`mic-erp-prod-critical-ag`)

Used for P1 and P2 alerts in production and UAT:

Channel	Target	Trigger
Email	`devops@microtec.com.sa`, `oncall@microtec.com.sa`	All severities
Teams Webhook	`#ops-alerts` channel	Sev 0–2
PagerDuty	On-call rotation	Sev 0–1 only
SMS	On-call mobile	Sev 0 only
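
To make the trigger column concrete, the sketch below lists which channels of this action group would be notified for a given Azure severity. The function `channels_for_severity` is a hypothetical illustration; the real routing is configured in Azure Monitor, not in a script.

```bash
# Hypothetical helper: list the channels of mic-erp-prod-critical-ag that
# are notified for a given Azure Monitor severity (0-4), per the table above.
channels_for_severity() {
  local sev="$1" channels="email"            # email: all severities
  case "$sev" in
    0|1|2|3|4) ;;
    *) echo "unknown severity: $sev" >&2; return 1 ;;
  esac
  if [ "$sev" -le 2 ]; then channels="$channels teams"; fi      # Sev 0-2
  if [ "$sev" -le 1 ]; then channels="$channels pagerduty"; fi  # Sev 0-1 only
  if [ "$sev" -eq 0 ]; then channels="$channels sms"; fi        # Sev 0 only
  echo "$channels"
}

channels_for_severity 0   # prints: email teams pagerduty sms
channels_for_severity 3   # prints: email
```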

Secondary Action Group (`mic-erp-nonprod-info-ag`)

Used for dev, stage, and preprod alerts:

Channel	Target
Email	`devops@microtec.com.sa`
Teams Webhook	`#dev-alerts` channel

Infrastructure Alert Rules

CPU Utilisation

Property	Value
Metric	Container App — CPU usage (%)
Condition	Average CPU > 80% for 5 minutes
Severity	P3 (warning)
Aggregation window	5 minutes

Threshold	Severity
CPU avg > 80% for 5 min	P3
CPU avg > 95% for 2 min	P2

Memory Utilisation

Condition	Severity
Memory usage > 85% for 5 min	P3
Memory usage > 95% for 2 min	P2
OOM container restarts > 2 in 10 min	P1

Container Restart Count

Frequent container restarts indicate a crash loop:

Condition: Container restart count > 3 in 15 minutes
Severity: P2
Action: Page on-call + Teams notification

Application Alert Rules

HTTP Error Rate

Condition	Severity	Notes
HTTP 5xx rate > 1% for 5 min	P3	Normal baseline is <0.1%
HTTP 5xx rate > 5% for 5 min	P2	Significant degradation
HTTP 5xx rate > 20% for 2 min	P1	Service effectively down
HTTP 4xx rate > 10% for 5 min	P4	May indicate auth issues or crawler activity
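
As a worked example of the 5xx thresholds, the sketch below classifies an error rate from raw request counts. The function name is hypothetical, and the differing evaluation windows (5 minutes versus 2 minutes) are deliberately ignored; it only illustrates where the tier boundaries sit.

```bash
# Hypothetical helper: classify a 5xx error rate against the thresholds in
# the table above. Arguments: number of 5xx responses, total requests.
# Compares errors*100 with total*pct so integer arithmetic stays exact.
classify_5xx_rate() {
  local errors="$1" total="$2"
  if [ "$total" -le 0 ]; then
    echo "none"                      # no traffic, nothing to evaluate
    return 0
  fi
  if   [ $(( errors * 100 )) -gt $(( total * 20 )) ]; then echo "P1"  # > 20%
  elif [ $(( errors * 100 )) -gt $(( total * 5 )) ];  then echo "P2"  # > 5%
  elif [ $(( errors * 100 )) -gt "$total" ];          then echo "P3"  # > 1%
  else echo "none"
  fi
}

classify_5xx_rate 60 1000    # 6% of requests failed, prints: P2
```

Note that a rate of exactly 1% does not trigger P3, since the conditions in the table are strict inequalities.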

Response Latency

P95 (95th percentile) latency thresholds:

Service	P3 Threshold	P2 Threshold
API Gateway	1,000 ms	3,000 ms
Accounting service	2,000 ms	5,000 ms
Reporting service	5,000 ms	15,000 ms
Keycloak	500 ms	2,000 ms

High-latency alerts fire after 5 minutes of sustained threshold breach to avoid noise from transient spikes.
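
The sustained-breach idea can be sketched as follows. Azure Monitor performs this evaluation itself through the rule's window settings, so the code is only an illustration; the function name and the assumption of one P95 sample per minute are hypothetical.

```bash
# Hypothetical helper: decide whether a latency alert should fire.
# Usage: sustained_breach THRESHOLD_MS SAMPLE...
# Each SAMPLE is assumed to be a one-minute P95 latency in ms, oldest first.
# Succeeds only if the 5 most recent samples all exceed the threshold.
sustained_breach() {
  local threshold="$1" window=5 sample
  shift
  [ "$#" -ge "$window" ] || return 1     # not enough data yet
  shift $(( $# - window ))               # keep only the last 5 samples
  for sample in "$@"; do
    [ "$sample" -gt "$threshold" ] || return 1
  done
  return 0
}

# API Gateway P3 threshold is 1,000 ms: a single spike does not fire...
sustained_breach 1000 400 2500 450 420 410 || echo "no alert"
# ...but five consecutive breaches do.
sustained_breach 1000 1200 1100 1500 1300 1250 && echo "fire P3"
```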

Availability (AFD Health Probes)

Condition: Azure Front Door (AFD) origin availability < 100% for any 5-minute window
Severity: P1 (production) / P3 (non-production)

AFD availability is measured by the health probe success rate, so a value below 100% means at least one probe to an origin failed during the window.

Database Alert Rules

SQL Connection Pool Exhaustion

Condition: SQL connection errors > 10 in 5 minutes
Source: Application Insights dependency failures to SQL
Severity: P1

SQL Query Duration

Condition: SQL dependency P95 duration > 5,000 ms for 10 minutes
Severity: P2
Note: Triggers DB performance investigation workflow

Redis Connection Failures

Condition: Redis dependency failures > 5 in 5 minutes
Severity: P1
Note: Redis failure impacts session management and DataProtection keys

Business Metric Alert Rules

These alerts are driven by business events that the services emit as Application Insights custom metrics:

Alert	Condition	Severity
ZATCA submission failures	Failure rate > 5% for 15 min	P2
Login failure spike	Auth failures > 100/min	P2 (may indicate brute force)
Zero invoice creation	No invoices created for 2 hours during business hours	P3
Tenant DB connection failure	Any tenant reports connection errors	P1

Alert Rule Bicep Definition

Alert rules are defined in `Devops/azure/infrastructure/modules/monitoring.bicep`:

bicep

// CPU warning rule (Microtec P3 = Azure severity 2).
resource cpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${resourcePrefix}-cpu-high-alert'
  location: 'global'
  properties: {
    description: 'CPU usage exceeded 80% threshold'
    severity: 2
    enabled: true
    scopes: [
      containerAppId
    ]
    evaluationFrequency: 'PT1M' // evaluate every minute
    windowSize: 'PT5M' // over a 5-minute aggregation window
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'HighCPU'
          criterionType: 'StaticThresholdCriterion'
          metricName: 'UsageNanoCores'
          operator: 'GreaterThan'
          threshold: 800000000 // 80% of 1 core = 800,000,000 nanocores
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroup.id
      }
    ]
  }
}
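
The `UsageNanoCores` threshold is an absolute value, so 800,000,000 only corresponds to 80% when the container is allocated exactly one core. The sketch below shows the arithmetic for other allocations; the function name and the millicore input are assumptions for illustration, and the actual CPU allocation of each Container App is not stated in this document.

```bash
# Hypothetical helper: compute a UsageNanoCores threshold.
# Arguments: CPU percentage, CPU allocation in millicores (1 core = 1000m).
# 1 core = 1,000,000,000 nanocores, so
#   threshold = pct/100 * millicores/1000 * 1e9 = pct * millicores * 10000
nanocore_threshold() {
  local pct="$1" millicores="$2"
  echo $(( pct * millicores * 10000 ))
}

nanocore_threshold 80 1000   # prints 800000000, the value used in cpuAlert
nanocore_threshold 80 500    # prints 400000000 for a 0.5-core container
```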

Alert Suppression and Maintenance Windows

Maintenance Windows

During planned maintenance (deployments, infrastructure changes), suppress alerts to avoid noise:

bash

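# Create an alert suppression rule for a 2-hour deployment window.
# Note: `az monitor action-rule` is provided by the alertsmanagement CLI extension;
# confirm the flag names with `az monitor action-rule create --help` for your version.
# `date -d` is GNU date syntax (Linux) and is not available in BSD/macOS date.
start="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
end="$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)"

az monitor action-rule create \
  --resource-group mic-erp-be-prod-monitoring-rg \
  --name "deployment-suppression-$(date +%Y%m%d)" \
  --status Enabled \
  --type Suppression \
  --suppression-recurrence-type Once \
  --suppression-start-date "$start" \
  --suppression-end-date "$end"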
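
A deployment script may also want to know whether it is still inside the suppression window, for example before running checks that rely on alerts being live. The following is a minimal sketch under that assumption; `in_maintenance_window` is a hypothetical helper, and it works on Unix epoch seconds rather than querying Azure.

```bash
# Hypothetical helper: succeed if NOW falls inside [START, END).
# All three arguments are Unix epoch seconds.
in_maintenance_window() {
  local start="$1" end="$2" now="$3"
  [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]
}

window_start=$(date +%s)
window_end=$(( window_start + 2 * 3600 ))   # 2-hour deployment window
if in_maintenance_window "$window_start" "$window_end" "$(date +%s)"; then
  echo "inside maintenance window: alerts are suppressed"
fi
```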

Alert Fatigue Prevention

Tuning Alert Thresholds

Alert thresholds are reviewed quarterly. Use Application Insights Workbooks → Alert Effectiveness to identify:

Alerts that fire too frequently (threshold too low)
Alerts that never fire (threshold too high or condition never triggers)
Alerts that correlate with deployments (add maintenance window suppression)

The goal is fewer than 5 actionable alerts per week per environment. Alert fatigue leads on-call engineers to ignore genuine incidents.
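
For a quick check outside the workbook, the sketch below counts how often each alert fired and prints the ones above a limit. It assumes a hypothetical export with one fired-alert rule name per line and no whitespace in the names; neither the export format nor the function name comes from the Microtec tooling.

```bash
# Hypothetical helper: read one fired-alert rule name per line on stdin and
# print the alerts that fired more than LIMIT times, most frequent first.
# Rule names are assumed to contain no whitespace.
noisy_alerts() {
  local limit="$1"
  sort | uniq -c | sort -rn |
    awk -v limit="$limit" '$1 > limit { print $2 ": " $1 }'
}

printf '%s\n' cpu-high cpu-high cpu-high latency-p95 cpu-high | noisy_alerts 3
# prints: cpu-high: 4
```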
