Alerting

Azure Monitor alert rules are configured for each Microtec environment to detect and notify on infrastructure anomalies, service degradation, and business-critical failures. All alerts route through Action Groups that combine email, Teams, PagerDuty, and SMS notifications, depending on severity and environment.


Alert Severity Levels

Azure Monitor uses numeric severity levels (0–4). Microtec maps these to operational severity tiers:

| Azure Severity | Microtec Tier | Description | Response SLA |
|---|---|---|---|
| Sev 0 | P1 (Critical) | Service completely unavailable, data loss risk | 15 minutes |
| Sev 1 | P2 (High) | Major feature unavailable, significant user impact | 1 hour |
| Sev 2 | P3 (Medium) | Performance degraded, partial feature loss | 4 hours |
| Sev 3 | P4 (Low) | Advisory, non-urgent anomaly | Next business day |
| Sev 4 | P5 (Info) | Trend notification, no action required | None |
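
The mapping above can be sketched as a small lookup for scripts that post-process alert payloads. This is a minimal illustration only: the function names `tier_for_severity` and `response_sla_for_tier` are hypothetical and not part of any Microtec tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an Azure Monitor severity (0-4) to the Microtec tier.
tier_for_severity() {
  case "$1" in
    0) echo "P1" ;;
    1) echo "P2" ;;
    2) echo "P3" ;;
    3) echo "P4" ;;
    4) echo "P5" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: response SLA for a Microtec tier, as listed in the table.
response_sla_for_tier() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "1 hour" ;;
    P3) echo "4 hours" ;;
    P4) echo "next business day" ;;
    P5) echo "none" ;;   # informational, no action required
    *)  echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

response_sla_for_tier "$(tier_for_severity 1)"   # prints: 1 hour
```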

Action Groups

Primary Action Group (mic-erp-prod-critical-ag)

Used for P1 and P2 alerts in production and UAT:

| Channel | Target | Trigger |
|---|---|---|
| Email | devops@microtec.com.sa, oncall@microtec.com.sa | All severities |
| Teams Webhook | #ops-alerts channel | Sev 0–2 |
| PagerDuty | On-call rotation | Sev 0–1 only |
| SMS | On-call mobile | Sev 0 only |
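
To make the trigger column concrete, the sketch below lists which channels of the primary Action Group would notify for a given Azure severity. It only restates the table; the function name and the channel labels (`email`, `teams`, `pagerduty`, `sms`) are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Microtec tooling): list the channels of
# mic-erp-prod-critical-ag that the trigger table enables for a given
# Azure Monitor severity (0-4).
primary_channels_for_severity() {
  local sev="$1"
  case "$sev" in
    0|1|2|3|4) ;;
    *) echo "unknown severity: $sev" >&2; return 1 ;;
  esac

  local channels="email"                                       # Email: all severities
  if (( sev <= 2 )); then channels="$channels teams"; fi       # Teams: Sev 0-2
  if (( sev <= 1 )); then channels="$channels pagerduty"; fi   # PagerDuty: Sev 0-1
  if (( sev == 0 )); then channels="$channels sms"; fi         # SMS: Sev 0 only
  echo "$channels"
}

primary_channels_for_severity 1   # prints: email teams pagerduty
```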

Secondary Action Group (mic-erp-nonprod-info-ag)

Used for dev, stage, and preprod alerts:

| Channel | Target |
|---|---|
| Email | devops@microtec.com.sa |
| Teams Webhook | #dev-alerts channel |

Infrastructure Alert Rules

CPU Utilisation

| Property | Value |
|---|---|
| Metric | Container App: CPU usage (%) |
| Condition | Average CPU > 80% for 5 minutes |
| Severity | P3 (warning) |
| Aggregation window | 5 minutes |

| Threshold | Severity |
|---|---|
| CPU avg > 80% for 5 min | P3 |
| CPU avg > 95% for 2 min | P2 |

Memory Utilisation

| Condition | Severity |
|---|---|
| Memory usage > 85% for 5 min | P3 |
| Memory usage > 95% for 2 min | P2 |
| OOM container restarts > 2 in 10 min | P1 |

Container Restart Count

Frequent container restarts indicate a crash loop:

Condition: Container restart count > 3 in 15 minutes
Severity: P2
Action: Page on-call + Teams notification
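
The crash-loop condition is a count over a trailing window. A minimal sketch of that check follows, assuming restart events are available as epoch-second timestamps; the function names and the input format are hypothetical, and in practice Azure Monitor evaluates the rule itself.

```bash
#!/usr/bin/env bash
# Count restart timestamps (epoch seconds) that fall inside the trailing window
# (now - window_s, now].
restarts_in_window() {
  local now="$1" window_s="$2"
  shift 2
  local count=0 ts
  for ts in "$@"; do
    if (( ts > now - window_s && ts <= now )); then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Succeeds (exit 0) when more than 3 restarts occurred in the last 15 minutes.
is_crash_loop() {
  local now="$1"
  shift
  (( $(restarts_in_window "$now" 900 "$@") > 3 ))
}

# Four restarts within the 900-second window ending at t=1000: crash loop.
if is_crash_loop 1000 200 300 400 950; then
  echo "crash loop detected"
fi
```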

Application Alert Rules

HTTP Error Rate

| Condition | Severity | Notes |
|---|---|---|
| HTTP 5xx rate > 1% for 5 min | P3 | Normal baseline is <0.1% |
| HTTP 5xx rate > 5% for 5 min | P2 | Significant degradation |
| HTTP 5xx rate > 20% for 2 min | P1 | Service effectively down |
| HTTP 4xx rate > 10% for 5 min | P4 | May indicate auth issues or crawler activity |
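
As an illustration of how the 5xx tiers nest, the sketch below classifies a single evaluation window from raw request counts. It is a simplification under stated assumptions: the function name is hypothetical, the differing window lengths (2 vs 5 minutes) are not modelled, and the 4xx rule is left out. Cross-multiplication keeps the comparison exact with integer arithmetic.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the HTTP 5xx rate of one evaluation window.
# Arguments: <5xx response count> <total request count>
classify_5xx_rate() {
  local errors="$1" total="$2"
  if (( total == 0 )); then
    echo "none"            # no traffic, nothing to classify
    return 0
  fi
  # errors/total > N%  is equivalent to  errors * 100 > N * total
  if   (( errors * 100 > 20 * total )); then echo "P1"
  elif (( errors * 100 > 5 * total ));  then echo "P2"
  elif (( errors * 100 > total ));      then echo "P3"
  else echo "none"
  fi
}

classify_5xx_rate 60 1000   # 6% of requests failed, prints: P2
```

The comparisons are strict, so a rate of exactly 5% still classifies as P3.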

Response Latency

P95 (95th percentile) latency thresholds:

| Service | P3 Threshold | P2 Threshold |
|---|---|---|
| API Gateway | 1,000 ms | 3,000 ms |
| Accounting service | 2,000 ms | 5,000 ms |
| Reporting service | 5,000 ms | 15,000 ms |
| Keycloak | 500 ms | 2,000 ms |

High-latency alerts fire after 5 minutes of sustained threshold breach to avoid noise from transient spikes.
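
The per-service thresholds can be expressed as a lookup, sketched below. The service keys (`api-gateway`, `accounting`, `reporting`, `keycloak`) and the function name are assumptions for illustration, a breach is assumed to mean strictly greater than the threshold, and the 5-minute sustain requirement is not modelled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: severity for a sustained P95 latency, per service.
# Arguments: <service key> <P95 latency in ms>
latency_severity() {
  local service="$1" p95_ms="$2" p3 p2
  case "$service" in
    api-gateway) p3=1000; p2=3000 ;;
    accounting)  p3=2000; p2=5000 ;;
    reporting)   p3=5000; p2=15000 ;;
    keycloak)    p3=500;  p2=2000 ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac

  if   (( p95_ms > p2 )); then echo "P2"
  elif (( p95_ms > p3 )); then echo "P3"
  else echo "none"
  fi
}

latency_severity keycloak 600   # prints: P3
```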

Availability (AFD Health Probes)

Condition: AFD origin availability < 100% for any 5-minute window
Severity: P1 (production) / P3 (non-production)

Azure Front Door (AFD) availability is measured by the health probe success rate. A value below 100% means at least one origin failed a health probe during the window.


Database Alert Rules

SQL Connection Pool Exhaustion

Condition: SQL connection errors > 10 in 5 minutes
Source: Application Insights dependency failures to SQL
Severity: P1

SQL Query Duration

Condition: SQL dependency P95 duration > 5,000 ms for 10 minutes
Severity: P2
Note: Triggers DB performance investigation workflow

Redis Connection Failures

Condition: Redis dependency failures > 5 in 5 minutes
Severity: P1
Note: Redis failure impacts session management and DataProtection keys

Business Metric Alert Rules

These alerts fire on business events tracked as Application Insights custom metrics:

| Alert | Condition | Severity |
|---|---|---|
| ZATCA submission failures | Failure rate > 5% for 15 min | P2 |
| Login failure spike | Auth failures > 100/min | P2 (may indicate brute force) |
| Zero invoice creation | No invoices created for 2 hours during business hours | P3 |
| Tenant DB connection failure | Any tenant reports connection errors | P1 |

Alert Rule Bicep Definition

Alert rules are defined in Devops/azure/infrastructure/modules/monitoring.bicep:

```bicep
resource cpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${resourcePrefix}-cpu-high-alert'
  location: 'global'
  properties: {
    description: 'CPU usage exceeded 80% threshold'
    severity: 2 // Azure Sev 2 = Microtec P3
    enabled: true
    scopes: [
      containerAppId
    ]
    evaluationFrequency: 'PT1M' // evaluate every minute
    windowSize: 'PT5M' // over a 5-minute aggregation window
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCPU'
          metricName: 'UsageNanoCores'
          operator: 'GreaterThan'
          threshold: 800000000 // 80% of 1 core = 800,000,000 nanocores
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroup.id
      }
    ]
  }
}
```
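
The `UsageNanoCores` threshold is an absolute value, so it depends on how much CPU the container is allocated. The sketch below shows the arithmetic behind the 800,000,000 figure; the function name is hypothetical, and taking the allocation in millicores is an assumption made here to keep the calculation in integers.

```bash
#!/usr/bin/env bash
# Hypothetical helper: metric alert threshold in nanocores for a CPU percentage
# of a container's allocation, given in millicores (1 core = 1000 millicores).
cpu_threshold_nanocores() {
  local percent="$1" millicores="$2"
  # 1 millicore = 1,000,000 nanocores, so percent/100 * millicores * 1e6
  # simplifies to percent * millicores * 10,000 (integer arithmetic only).
  echo $(( percent * millicores * 10000 ))
}

cpu_threshold_nanocores 80 1000   # 80% of 1 core, prints: 800000000
cpu_threshold_nanocores 95 500    # 95% of 0.5 core, prints: 475000000
```

If a container app is allocated something other than one core, the threshold in the Bicep definition would need to be scaled in the same way.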

Alert Suppression and Maintenance Windows

Maintenance Windows

During planned maintenance (deployments, infrastructure changes), suppress alerts to avoid noise:

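```bash
# Create a one-time alert processing rule that removes action groups
# (suppresses notifications) for a 2-hour deployment window.
# Requires the alertsmanagement CLI extension; older extension versions exposed
# this as `az monitor action-rule`.
# SCOPE is a placeholder: set it to the subscription or resource group whose
# alerts should be suppressed. Times are UTC; `date -d` requires GNU date.
SCOPE="/subscriptions/$(az account show --query id -o tsv)"

az monitor alert-processing-rule create \
  --resource-group mic-erp-be-prod-monitoring-rg \
  --name "deployment-suppression-$(date +%Y%m%d)" \
  --rule-type RemoveAllActionGroups \
  --scopes "$SCOPE" \
  --schedule-start-datetime "$(date -u '+%Y-%m-%d %H:%M:%S')" \
  --schedule-end-datetime "$(date -u -d '+2 hours' '+%Y-%m-%d %H:%M:%S')" \
  --description "Planned deployment window"
```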

Alert Fatigue Prevention

Tuning Alert Thresholds

Alert thresholds are reviewed quarterly. Use Application Insights Workbooks → Alert Effectiveness to identify:

  • Alerts that fire too frequently (threshold too low)
  • Alerts that never fire (threshold too high or condition never triggers)
  • Alerts that correlate with deployments (add maintenance window suppression)

The goal is fewer than 5 actionable alerts per week per environment. Alert fatigue leads on-call engineers to ignore genuine incidents.
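
A quarterly review could check the weekly budget with something like the sketch below. It assumes the per-environment counts of actionable alerts have already been exported as `environment count` pairs, one per line; that input format and the function name are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "environment count" pairs from stdin and print the
# environments that reached or exceeded the weekly budget (default: 5).
over_alert_budget() {
  local budget="${1:-5}" env count
  while read -r env count; do
    if [ -z "$env" ]; then
      continue               # skip blank lines
    fi
    if (( count >= budget )); then
      echo "$env"
    fi
  done
}

# Example with made-up counts; prints "uat" and "dev" on separate lines.
printf 'prod 3\nuat 7\ndev 5\n' | over_alert_budget
```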

Internal Documentation — Microtec Platform Team