Runbook: Incident Response

P0/P1 incident playbook for the Microtec ERP platform.

Audience: On-call engineers, platform team leads
Last reviewed: 2026-05-30

Severity Definitions

Severity	Definition	Response Time	Examples
P0	Production completely down, all users affected	Immediate — page on-call	Gateway unreachable, Keycloak down, SQL Server down
P1	Production degraded, subset of users or features affected	< 30 minutes	Single service down, slow responses, partial auth failures
P2	Non-critical feature broken, workaround available	< 4 hours	Notification failures, report errors, non-blocking UI
P3	Minor issue, cosmetic or low-impact	Next business day	UI glitch, log noise, non-critical warning

On-Call Contact Chain

Role	Responsibility
On-call DevOps	First responder — triage and initial fix
Platform Lead	Escalation for P0, architecture decisions
Database Admin	SQL Server and data-related issues
Keycloak Owner	Auth and SSO issues

P0 Response Protocol

Minute 0–5: Assess

bash

# [ACTION] Check that the Azure Front Door endpoint is enabled (configuration state only, not origin health)
az afd endpoint show \
  --resource-group "mic-erp-fr-prod-network-rg" \
  --profile-name "mic-erp-prod-afd" \
  --endpoint-name "mic-erp-prod-endpoint" \
  --query "enabledState" -o tsv

# [ACTION] Check Gateway CAE (public-facing)
az containerapp show \
  --name "mic-erp-be-prod-gateway" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "{status:properties.runningStatus, minReplicas:properties.template.scale.minReplicas}" \
  -o json

# [ACTION] Check Keycloak
az containerapp show \
  --name "mic-erp-be-prod-keycloak" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "properties.runningStatus" -o tsv
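The three checks above can be folded into a single pass/fail summary so the first responder gets one line per component. This is a minimal sketch, not part of the platform tooling: `report_status` and `assess_p0` are hypothetical helper names, and the healthy values `Enabled` and `Running` are assumptions about what a healthy environment returns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a component's reported state with the state
# expected when it is healthy, and print one line per component.
report_status() {
  local component="$1" expected="$2" actual="$3"
  if [ "${actual}" = "${expected}" ]; then
    echo "OK   ${component}: ${actual}"
  else
    echo "FAIL ${component}: expected ${expected}, got ${actual:-<no output>}"
    return 1
  fi
}

# Runs the three Minute 0-5 checks. Resource names are copied from the
# commands above; the expected values are assumptions.
assess_p0() {
  local rg="mic-erp-be-prod-containers-rg" failed=0 state app
  state=$(az afd endpoint show --resource-group "mic-erp-fr-prod-network-rg" \
    --profile-name "mic-erp-prod-afd" --endpoint-name "mic-erp-prod-endpoint" \
    --query "enabledState" -o tsv 2>/dev/null)
  report_status "front-door" "Enabled" "${state}" || failed=1
  for app in gateway keycloak; do
    state=$(az containerapp show --name "mic-erp-be-prod-${app}" \
      --resource-group "${rg}" --query "properties.runningStatus" -o tsv 2>/dev/null)
    report_status "${app}" "Running" "${state}" || failed=1
  done
  return "${failed}"
}
```

Calling `assess_p0` returns a non-zero status if any component is not in its expected state, so it can gate the decision to escalate.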

Minute 5–10: Check App Insights

Open App Insights dashboard in Azure Portal:

Live Metrics: Confirm whether requests are reaching the service
Failures blade: Look for 5xx spike correlated with the incident start time
Performance blade: Check response time percentiles

kusto

// KQL: most frequent exceptions in the last 30 minutes, with the first occurrence of each
exceptions
| where timestamp > ago(30m)
| summarize count(), firstSeen = min(timestamp) by outerMessage, problemId
| order by count_ desc
| take 20

Minute 10–15: Check Recent Deployments

bash

# [ACTION] List the 10 most recently finished pipeline runs
az pipelines runs list \
  --org https://dev.azure.com/microtec \
  --project ERP \
  --status completed \
  --query-order FinishTimeDesc \
  --top 10 \
  --query "[].{name:definition.name, result:result, finishTime:finishTime}" \
  -o table

If a recent deployment finished shortly before the incident started, proceed to Emergency Rollback.
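Whether a deployment "correlates" with the incident can be made explicit by comparing timestamps. The sketch below is an illustration only: `deployed_before_incident` is a hypothetical helper, the 60-minute default window is an assumption rather than a platform policy, and it relies on GNU `date` to parse ISO 8601 timestamps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a deployment finished within WINDOW_MIN
# minutes before the incident started. Both timestamps are ISO 8601 strings.
# Requires GNU date. The 60-minute default is an assumption.
deployed_before_incident() {
  local finish_time="$1" incident_start="$2" window_min="${3:-60}"
  local finish_s start_s delta
  finish_s=$(date -u -d "${finish_time}" +%s) || return 2
  start_s=$(date -u -d "${incident_start}" +%s) || return 2
  delta=$(( start_s - finish_s ))
  # Deployment must precede the incident and fall inside the window.
  [ "${delta}" -ge 0 ] && [ "${delta}" -le $(( window_min * 60 )) ]
}
```

For example, `deployed_before_incident "2026-05-30T10:15:00Z" "2026-05-30T10:40:00Z"` succeeds, which would point toward a rollback; a deployment that finished hours earlier, or after the incident began, does not.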

Minute 15–20: Check CAE Health

bash

export ENV="prod"
export RG="mic-erp-be-${ENV}-containers-rg"

# [ACTION] Get status of all container apps in the private CAE
az containerapp list \
  --resource-group "${RG}" \
  --query "[].{name:name, status:properties.runningStatus, minReplicas:properties.template.scale.minReplicas}" \
  -o table

# [ACTION] Check failing container logs
az containerapp logs show \
  --name "mic-erp-be-prod-apps-portal" \
  --resource-group "${RG}" \
  --follow --tail 100

Minute 20–25: Check Key Vault Accessibility

bash

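# [ACTION] Verify the vault is reachable and the secret is readable.
# Note: this runs under your own az login identity, so it does not prove
# that a service's managed identity still has access.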
az keyvault secret show \
  --vault-name "mic-erp-prod-kv" \
  --name "ConnectionStrings--DefaultConnection" \
  --query "value" -o tsv > /dev/null && echo "KV OK" || echo "KV ERROR"

Minute 25–30: Check Service Bus Dead-Letter Queues

bash

# [ACTION] Check ASB dead-letter queue depth
az servicebus queue show \
  --resource-group "mic-erp-be-prod-messaging-rg" \
  --namespace-name "mic-erp-be-prod-asb" \
  --name "erp-events" \
  --query "countDetails.deadLetterMessageCount" -o tsv

A growing dead-letter count indicates a consumer service is down or rejecting messages.
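A single reading cannot show growth, so the count has to be sampled at least twice. A minimal sketch of that comparison follows; `dlq_count`, `dlq_trend`, and `sample_dlq` are hypothetical helper names, and the 60-second default interval is an assumption.

```shell
#!/usr/bin/env bash
# Reads the current dead-letter count using the same command as above.
dlq_count() {
  az servicebus queue show \
    --resource-group "mic-erp-be-prod-messaging-rg" \
    --namespace-name "mic-erp-be-prod-asb" \
    --name "erp-events" \
    --query "countDetails.deadLetterMessageCount" -o tsv
}

# Hypothetical helper: classify two consecutive dead-letter counts.
dlq_trend() {
  local previous="$1" current="$2"
  if [ "${current}" -gt "${previous}" ]; then
    echo "growing"
  elif [ "${current}" -lt "${previous}" ]; then
    echo "draining"
  else
    echo "stable"
  fi
}

# Sample the queue twice, INTERVAL seconds apart (default 60, an assumption),
# and print both counts with the resulting trend.
sample_dlq() {
  local interval="${1:-60}" first second
  first=$(dlq_count) || return 1
  sleep "${interval}"
  second=$(dlq_count) || return 1
  echo "dead-letter count ${first} -> ${second}: $(dlq_trend "${first}" "${second}")"
}
```

A `growing` result over several samples is the signal to check the consumer service; a single `stable` sample with a non-zero count may just be leftover messages from an earlier failure.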

Emergency Rollback

[ACTION] Use when a deployment caused the incident. Rolls back to the previous image tag.

bash

export ENV="prod"
export SVC="apps-portal"                    # Service to roll back
export RG="mic-erp-be-${ENV}-containers-rg"
export APP="mic-erp-be-${ENV}-${SVC}"
export ACR="micerpbe${ENV}acr"

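# [ACTION] Find the previous image tag (--all is required to list inactive revisions)
PREV_TAG=$(az containerapp revision list \
  --name "${APP}" --resource-group "${RG}" --all \
  --query "sort_by([?properties.active==\`false\`], &properties.createdTime)[-1].properties.template.containers[0].image" \
  -o tsv | sed 's/.*://')

# An empty tag would produce an invalid image reference in the update below
if [ -n "${PREV_TAG}" ]; then
  echo "Rolling back to tag: ${PREV_TAG}"
else
  echo "ERROR: no inactive revision found for ${APP}; do not run the update below" >&2
fi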

# [ACTION] Update container app to previous image
az containerapp update \
  --name "${APP}" \
  --resource-group "${RG}" \
  --image "${ACR}.azurecr.io/${SVC}-apis:${PREV_TAG}"

# [VERIFY] New revision is active
az containerapp revision list \
  --name "${APP}" --resource-group "${RG}" \
  --query "[?properties.active==\`true\`].{name:name, image:properties.template.containers[0].image}" \
  -o table
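The rollback depends on extracting the tag from an image reference. The `sed 's/.*://'` expression works for references of the form `registry/repo:tag`, but it would return the wrong value for an untagged reference on a registry with a port. A stricter variant is sketched below; `image_tag` is a hypothetical helper, and the tag values in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the tag portion of an image reference such as
# "registry.azurecr.io/repo:tag". Fails if the reference has no tag or is
# pinned by digest ("repo@sha256:...").
image_tag() {
  local image="$1"
  local last_segment="${image##*/}"   # strip registry host and repository path
  case "${last_segment}" in
    *@*)
      echo "digest-pinned reference, no tag: ${image}" >&2
      return 1
      ;;
    *:*)
      echo "${last_segment##*:}"
      ;;
    *)
      echo "no tag in image reference: ${image}" >&2
      return 1
      ;;
  esac
}
```

For instance, `image_tag "micerpbeprodacr.azurecr.io/apps-portal-apis:1.4.2"` prints `1.4.2`, while `image_tag "localhost:5000/apps-portal-apis"` fails instead of returning part of the repository path.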

Rollback multiple services simultaneously

bash

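# Reuses ENV, RG, and ACR from the single-service rollback above; services are updated one after another
SERVICES=("apps-portal" "inventory" "business-owners" "hr")
for SVC in "${SERVICES[@]}"; do
  APP="mic-erp-be-${ENV}-${SVC}"
  PREV_TAG=$(az containerapp revision list --name "${APP}" --resource-group "${RG}" --all \
    --query "sort_by([?properties.active==\`false\`], &properties.createdTime)[-1].properties.template.containers[0].image" \
    -o tsv | sed 's/.*://')
  [ -n "${PREV_TAG}" ] || { echo "Skipping ${APP}: no previous revision" >&2; continue; }
  az containerapp update --name "${APP}" --resource-group "${RG}" --image "${ACR}.azurecr.io/${SVC}-apis:${PREV_TAG}"
done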

P1 Response Protocol

P1 follows the same steps as P0, at lower urgency. Do not page on-call unless the situation escalates to P0.

1. Assess impact scope (how many users, which features)
2. Check App Insights for the affected service only
3. Check logs for the specific service
4. Apply targeted fix or rollback
5. Monitor for 15 minutes post-fix
6. File P1 report within 24 hours
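The 15-minute monitoring step can be sketched as a polling loop that repeats a health check and stops at the first failure. `monitor_after_fix` is a hypothetical helper, and the 30-second interval in the usage note is an assumption; the check command is whatever confirms health for the affected service.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a check command every INTERVAL seconds for
# DURATION seconds. Returns 0 only if every check succeeded.
# Usage: monitor_after_fix DURATION INTERVAL CHECK_CMD [ARGS...]
monitor_after_fix() {
  local duration="$1" interval="$2"
  shift 2
  # A non-positive interval would loop forever, so reject it.
  [ "${interval}" -gt 0 ] || { echo "interval must be > 0" >&2; return 2; }
  local elapsed=0
  while [ "${elapsed}" -lt "${duration}" ]; do
    if ! "$@"; then
      echo "check failed after ${elapsed}s: $*" >&2
      return 1
    fi
    sleep "${interval}"
    elapsed=$(( elapsed + interval ))
  done
  echo "stable for ${duration}s"
}
```

For a 15-minute watch with a 30-second interval, one possible call is `monitor_after_fix 900 30 curl -sf -o /dev/null https://prod.onlinemicrotec.com.sa/auth/health/ready`, using the Keycloak health URL from Scenario 2 as the check.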

Common Incident Scenarios

Scenario 1: Gateway 502

Symptoms: All API calls return 502.
Cause: YARP cannot reach a backend service.

bash

# Check Gateway logs
az containerapp logs show \
  --name "mic-erp-be-prod-gateway" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --tail 50

# Verify internal service is running
az containerapp show \
  --name "mic-erp-be-prod-apps-portal" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "properties.runningStatus" -o tsv

Fix: Restart the target service or roll back if a recent deployment broke it.

Scenario 2: Authentication Failures (401/403)

Symptoms: Users cannot log in or get 401 on authenticated endpoints.
Cause: Keycloak is down, client secret rotated without restart, or token validation misconfigured.

bash

# Check Keycloak health
curl -sf https://prod.onlinemicrotec.com.sa/auth/health/ready | jq .

# Check Keycloak logs
az containerapp logs show \
  --name "mic-erp-be-prod-keycloak" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --tail 100

Fix: Restart Keycloak. If a client secret was recently rotated, restart the services that use it.
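After a restart, Keycloak takes some time before its readiness endpoint responds, so a short retry loop is more useful than a single `curl`. The sketch below is one way to do that; `wait_until_ready` is a hypothetical helper, and the defaults of 10 attempts and 6 seconds between attempts are assumptions, not measured startup times.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll URL until it returns a success status.
# Defaults (10 attempts, 6 seconds apart) are assumptions.
# Usage: wait_until_ready URL [ATTEMPTS] [DELAY_SECONDS]
wait_until_ready() {
  local url="$1" attempts="${2:-10}" delay="${3:-6}" i
  for (( i = 1; i <= attempts; i++ )); do
    # -f makes curl fail on HTTP errors; the response body is discarded.
    if curl -sf -o /dev/null "${url}"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "${delay}"
  done
  echo "not ready after ${attempts} attempts" >&2
  return 1
}
```

With the health URL used above, the call would be `wait_until_ready https://prod.onlinemicrotec.com.sa/auth/health/ready`. A non-zero exit means the restart did not restore readiness and the Keycloak logs need another look.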

Scenario 3: Database Connection Failures

Symptoms: Services return 500 with "Cannot connect to SQL Server" in logs.

bash

# Test SQL connectivity from your workstation (via VPN)
sqlcmd -S 20.50.120.95 -U sqladmin -Q "SELECT 1" -P "${SQL_ADMIN_PASS}"

# Check SQL Server VM power state (provisioningState stays "Succeeded" even when the VM is stopped)
az vm show \
  --resource-group "mic-backend-shared-sql-rg" \
  --name "mic-shared-sql-vm" \
  --show-details \
  --query "powerState" -o tsv

Fix: If VM is stopped, start it via Azure Portal. If the connection string secret was rotated, restart affected services.
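If the Azure Portal is slow or unavailable during the incident, the same fix can be applied from the CLI. This is a sketch under assumptions: `start_sql_vm_if_stopped` is a hypothetical helper, and it treats only the `VM stopped` and `VM deallocated` power states as safe to start automatically. Any other state is reported and left for a human to decide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start the SQL Server VM only when it is stopped or
# deallocated. Resource names are copied from the commands above.
start_sql_vm_if_stopped() {
  local rg="mic-backend-shared-sql-rg" vm="mic-shared-sql-vm" state
  state=$(az vm show --show-details --resource-group "${rg}" --name "${vm}" \
    --query "powerState" -o tsv) || return 1
  case "${state}" in
    "VM running")
      echo "VM already running"
      ;;
    "VM stopped"|"VM deallocated")
      echo "VM is ${state#VM }; starting"
      az vm start --resource-group "${rg}" --name "${vm}"
      ;;
    *)
      # Transitional or unknown states are not acted on automatically.
      echo "unexpected power state: ${state}" >&2
      return 1
      ;;
  esac
}
```

Once the VM reports `VM running`, rerun the `sqlcmd` connectivity test before declaring the database healthy, since SQL Server itself may take longer to accept connections.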

Scenario 4: Service Bus Message Backlog

Symptoms: Background processing is slow; dead-letter queue is growing.

bash

# Check worker service
az containerapp show \
  --name "mic-erp-be-prod-platforms-worker" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "properties.runningStatus" -o tsv

# Restart worker
az containerapp revision restart \
  --name "mic-erp-be-prod-platforms-worker" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --revision "$(az containerapp revision list \
    --name mic-erp-be-prod-platforms-worker \
    --resource-group mic-erp-be-prod-containers-rg \
    --query '[?properties.active].name | [0]' -o tsv)"

Post-Incident Actions

[ACTION] Complete within 48 hours of incident resolution for P0, 72 hours for P1.

Incident Report Template

markdown

## Incident Report — <YYYY-MM-DD> P0/P1

**Duration**: HH:MM – HH:MM (UTC+3)  
**Severity**: P0 / P1  
**Affected Services**: list services  
**User Impact**: describe impact  

### Timeline
- HH:MM — Issue first detected (alert / user report)
- HH:MM — On-call paged
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Service restored
- HH:MM — All-clear declared

### Root Cause
(describe root cause)

### Resolution
(describe fix applied)

### Action Items
- [ ] Item 1 (owner, due date)
- [ ] Item 2 (owner, due date)

### Lessons Learned
(what would prevent this in the future)

Communication Templates

Initial stakeholder notification (P0)

SUBJECT: [P0 INCIDENT] Microtec ERP Production — <service> unavailable

We are investigating a production issue affecting <service/all users>.
Impact: <describe impact>.
Status: Investigating.
Next update in 15 minutes.

Resolution notification

SUBJECT: [RESOLVED] Microtec ERP Production — <service> restored

The production incident has been resolved.
Duration: HH:MM – HH:MM
Root cause: <summary>
Full post-mortem will be shared within 48 hours.

Related Runbooks

Key Rotation: if a secret compromise triggered the incident
Scale a Service: if the incident was caused by capacity exhaustion
Keycloak Realm Recovery: for auth-related incidents
