Alerting
Azure Monitor alert rules are configured for each Microtec environment to detect and notify on infrastructure anomalies, service degradation, and business-critical failures. All alerts route through Action Groups that combine email, Teams, and PagerDuty notifications.
Alert Severity Levels
Azure Monitor uses numeric severity levels (0–4). Microtec maps these to operational severity tiers:
| Azure Severity | Microtec Tier | Description | Response SLA |
|---|---|---|---|
| Sev 0 | P1 — Critical | Service completely unavailable, data loss risk | 15 minutes |
| Sev 1 | P2 — High | Major feature unavailable, significant user impact | 1 hour |
| Sev 2 | P3 — Medium | Performance degraded, partial feature loss | 4 hours |
| Sev 3 | P4 — Low | Advisory, non-urgent anomaly | Next business day |
| Sev 4 | P5 — Info | Trend notification, no action required | — |
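Because the tier number and the Azure severity number are offset by one, it can help to keep the mapping in one place in the Bicep module. A minimal sketch, assuming a hypothetical variable name `severityByTier` that does not appear in the source:

```bicep
// Hypothetical lookup (the name is an assumption): Microtec tier -> Azure Monitor severity.
var severityByTier = {
  P1: 0 // Critical
  P2: 1 // High
  P3: 2 // Medium
  P4: 3 // Low
  P5: 4 // Info
}

// An alert rule can then set `severity: severityByTier.P3` instead of a bare number.
output p3Severity int = severityByTier.P3
```

Referring to the tier by name avoids writing `severity: 3` for a P3 alert by mistake.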
Action Groups
Primary Action Group (mic-erp-prod-critical-ag)
Used for P1 and P2 alerts in production and UAT:
| Channel | Target | Trigger |
|---|---|---|
| Email | devops@microtec.com.sa, oncall@microtec.com.sa | All severities |
| Teams Webhook | #ops-alerts channel | Sev 0–2 |
| PagerDuty | On-call rotation | Sev 0–1 only |
| SMS | On-call mobile | Sev 0 only |
Secondary Action Group (mic-erp-nonprod-info-ag)
Used for dev, stage, and preprod alerts:
| Channel | Target |
|---|---|
| Email | devops@microtec.com.sa |
| Teams Webhook | #dev-alerts channel |
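As an illustration, this group might be declared in Bicep roughly as follows. This is a sketch under stated assumptions: the API version, the receiver names, the short name, and the `teamsWebhookUri` parameter are all hypothetical, and the source does not say how the webhook is bridged to the Teams channel.

```bicep
// Assumption: the webhook endpoint that posts to the #dev-alerts channel.
@secure()
param teamsWebhookUri string

resource nonprodActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'mic-erp-nonprod-info-ag'
  location: 'Global'
  properties: {
    groupShortName: 'nonprodinfo' // hypothetical; short names are limited to 12 characters
    enabled: true
    emailReceivers: [
      {
        name: 'devops-email' // hypothetical receiver name
        emailAddress: 'devops@microtec.com.sa'
        useCommonAlertSchema: true
      }
    ]
    webhookReceivers: [
      {
        name: 'teams-dev-alerts' // hypothetical receiver name
        serviceUri: teamsWebhookUri
        useCommonAlertSchema: true
      }
    ]
  }
}
```

An action group notifies every receiver it contains, so any per-severity routing has to come from which group a given alert rule references.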
Infrastructure Alert Rules
CPU Utilisation
| Property | Value |
|---|---|
| Metric | Container App — CPU usage (%) |
| Condition | Average CPU > 80% for 5 minutes |
| Severity | P3 (warning) |
| Aggregation window | 5 minutes |

Escalation thresholds:
| Threshold | Severity |
|---|---|
| CPU avg > 80% for 5 min | P3 |
| CPU avg > 95% for 2 min | P2 |
Memory Utilisation
| Condition | Severity |
|---|---|
| Memory usage > 85% for 5 min | P3 |
| Memory usage > 95% for 2 min | P2 |
| OOM container restarts > 2 in 10 min | P1 |
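Container Apps exposes memory as the `WorkingSetBytes` metric, an absolute byte count, so a percentage condition has to be converted using the container's memory allocation. The sketch below shows the 85% rule under that assumption. The parameter names, the 2 GiB default, and the rule name are hypothetical and not taken from the source.

```bicep
param resourcePrefix string
param containerAppId string
param actionGroupId string

// Assumption: memory allocated to the container, in bytes (2 GiB here).
param containerMemoryBytes int = 2147483648

resource memoryAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${resourcePrefix}-memory-high-alert'
  location: 'global'
  properties: {
    description: 'Memory usage exceeded 85% of the allocated memory'
    severity: 2 // Sev 2 = P3
    enabled: true
    scopes: [
      containerAppId
    ]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighMemory'
          metricName: 'WorkingSetBytes'
          operator: 'GreaterThan'
          threshold: containerMemoryBytes * 85 / 100 // 85% of the allocation, in bytes
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}
```

If the allocation changes, only `containerMemoryBytes` needs updating; the threshold follows from it.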
Container Restart Count
Frequent container restarts indicate a crash loop:
Condition: Container restart count > 3 in 15 minutes
Severity: P2
Action: Page on-call + Teams notification
Application Alert Rules
HTTP Error Rate
| Condition | Severity | Notes |
|---|---|---|
| HTTP 5xx rate > 1% for 5 min | P3 | Normal baseline is <0.1% |
| HTTP 5xx rate > 5% for 5 min | P2 | Significant degradation |
| HTTP 5xx rate > 20% for 2 min | P1 | Service effectively down |
| HTTP 4xx rate > 10% for 5 min | P4 | May indicate auth issues or crawler activity |
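An error rate is a ratio rather than a single platform metric, so one way to express these conditions is a log alert (`Microsoft.Insights/scheduledQueryRules`) over Application Insights request telemetry. The following is a sketch of the P3 row only, not the project's actual definition. It assumes the rule is scoped to an Application Insights component, and the parameter and rule names are hypothetical.

```bicep
param resourcePrefix string
param location string = resourceGroup().location
param appInsightsId string // assumption: resource ID of the Application Insights component
param actionGroupId string

// Returns one row only when the 5xx share of requests in the window exceeds 1%.
var errorRateQuery = '''
requests
| summarize total = count(), failed = countif(toint(resultCode) >= 500)
| extend errorRatePct = 100.0 * failed / total
| where errorRatePct > 1
'''

resource http5xxAlert 'Microsoft.Insights/scheduledQueryRules@2021-08-01' = {
  name: '${resourcePrefix}-http-5xx-p3-alert'
  location: location
  kind: 'LogAlert'
  properties: {
    displayName: '${resourcePrefix}-http-5xx-p3-alert'
    description: 'HTTP 5xx rate above 1% for 5 minutes'
    severity: 2 // Sev 2 = P3
    enabled: true
    scopes: [
      appInsightsId
    ]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      allOf: [
        {
          query: errorRateQuery
          timeAggregation: 'Count' // count the rows the query returns
          operator: 'GreaterThan'
          threshold: 0 // any returned row means the rate was breached
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [
        actionGroupId
      ]
    }
  }
}
```

The P2 and P1 rows would follow the same shape with a different percentage in the query and a different `severity`.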
Response Latency
P95 (95th percentile) latency thresholds:
| Service | P3 Threshold | P2 Threshold |
|---|---|---|
| API Gateway | 1,000 ms | 3,000 ms |
| Accounting service | 2,000 ms | 5,000 ms |
| Reporting service | 5,000 ms | 15,000 ms |
| Keycloak | 500 ms | 2,000 ms |
High-latency alerts fire after 5 minutes of sustained threshold breach to avoid noise from transient spikes.
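The per-service thresholds lend themselves to a Bicep resource loop over a small table. The sketch below creates one P3 log alert per service. It assumes every service reports requests to Application Insights and can be filtered by `cloud_RoleName`; the role names, parameter names, and rule names are hypothetical.

```bicep
param resourcePrefix string
param location string = resourceGroup().location
param appInsightsId string // assumption: resource ID of the Application Insights component
param actionGroupId string

// P3 thresholds from the table above; the role names are assumptions.
var latencyThresholds = [
  { role: 'api-gateway', p3Ms: 1000 }
  { role: 'accounting', p3Ms: 2000 }
  { role: 'reporting', p3Ms: 5000 }
  { role: 'keycloak', p3Ms: 500 }
]

resource latencyAlerts 'Microsoft.Insights/scheduledQueryRules@2021-08-01' = [for svc in latencyThresholds: {
  name: '${resourcePrefix}-${svc.role}-latency-p3-alert'
  location: location
  kind: 'LogAlert'
  properties: {
    displayName: '${resourcePrefix}-${svc.role}-latency-p3-alert'
    severity: 2 // Sev 2 = P3
    enabled: true
    scopes: [
      appInsightsId
    ]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M' // P95 is computed over the 5-minute window
    criteria: {
      allOf: [
        {
          // Returns one row only when the window's P95 duration (ms) exceeds the threshold.
          query: 'requests | where cloud_RoleName == \'${svc.role}\' | summarize p95 = percentile(duration, 95) | where p95 > ${svc.p3Ms}'
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [
        actionGroupId
      ]
    }
  }
}]
```

Computing the percentile over the whole window, rather than per request, is what keeps a single slow request from firing the alert.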
Availability (AFD Health Probes)
Condition: AFD origin availability < 100% for any 5-minute window
Severity: P1 (production) / P3 (non-production)
AFD availability is measured by the health probe success rate. Below 100% means at least one origin is unhealthy.
Database Alert Rules
SQL Connection Pool Exhaustion
Condition: SQL connection errors > 10 in 5 minutes
Source: Application Insights dependency failures to SQL
Severity: P1
SQL Query Duration
Condition: SQL dependency P95 duration > 5,000ms for 10 minutes
Severity: P2
Note: Triggers DB performance investigation workflow
Redis Connection Failures
Condition: Redis dependency failures > 5 in 5 minutes
Severity: P1
Note: Redis failure impacts session management and DataProtection keys
Business Metric Alert Rules
Custom metric alerts on business events tracked via Application Insights custom metrics:
| Alert | Condition | Severity |
|---|---|---|
| Zatca submission failures | Failure rate > 5% for 15 min | P2 |
| Login failure spike | Auth failures > 100/min | P2 (may indicate brute force) |
| Zero invoice creation | No invoices created for 2 hours during business hours | P3 |
| Tenant DB connection failure | Any tenant reports connection errors | P1 |
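A metric alert can target an Application Insights custom metric by naming its namespace and skipping metric validation, so the rule can deploy before the metric has emitted any data. The sketch below covers the login failure row. The metric name `LoginFailures` is hypothetical, and the `Total` aggregation assumes the application records a value of 1 per failed login. Neither detail is stated in the source.

```bicep
param resourcePrefix string
param appInsightsId string // assumption: resource ID of the Application Insights component
param actionGroupId string

resource loginFailureAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${resourcePrefix}-login-failure-spike-alert'
  location: 'global'
  properties: {
    description: 'Auth failures exceeded 100 per minute (possible brute force)'
    severity: 1 // Sev 1 = P2
    enabled: true
    scopes: [
      appInsightsId
    ]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT1M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'LoginFailureSpike'
          metricNamespace: 'Azure.ApplicationInsights'
          metricName: 'LoginFailures' // hypothetical custom metric name
          operator: 'GreaterThan'
          threshold: 100
          timeAggregation: 'Total' // assumes one unit is recorded per failed login
          skipMetricValidation: true // allow deployment before the metric has data
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}
```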
Alert Rule Bicep Definition
Alert rules are defined in Devops/azure/infrastructure/modules/monitoring.bicep:
```bicep
resource cpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${resourcePrefix}-cpu-high-alert'
  location: 'global'
  properties: {
    description: 'CPU usage exceeded 80% threshold'
    severity: 2 // Sev 2 = P3
    enabled: true
    scopes: [
      containerAppId
    ]
    evaluationFrequency: 'PT1M' // evaluate every minute
    windowSize: 'PT5M' // over a 5-minute window
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion' // required by the metricAlerts schema
          name: 'HighCPU'
          metricName: 'UsageNanoCores'
          operator: 'GreaterThan'
          threshold: 800000000 // 80% of 1 core = 800,000,000 nanocores
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroup.id
      }
    ]
  }
}
```
Alert Suppression and Maintenance Windows
Maintenance Windows
During planned maintenance (deployments, infrastructure changes), suppress alerts to avoid noise:
```bash
# Create alert suppression rule for 2-hour deployment window
az monitor action-rule create \
  --resource-group mic-erp-be-prod-monitoring-rg \
  --name "deployment-suppression-$(date +%Y%m%d)" \
  --status Enabled \
  --type Suppression \
  --suppression-recurrence-type Once \
  --suppression-start-date "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --suppression-end-date "$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)"
```
Alert Fatigue Prevention
Tuning Alert Thresholds
Alert thresholds are reviewed quarterly. Use Application Insights Workbooks → Alert Effectiveness to identify:
- Alerts that fire too frequently (threshold too low)
- Alerts that never fire (threshold too high or condition never triggers)
- Alerts that correlate with deployments (add maintenance window suppression)
The goal is <5 actionable alerts per week per environment. Alert fatigue leads to on-call engineers ignoring genuine incidents.