Runbook: Incident Response
P0/P1 incident playbook for the Microtec ERP platform.
Audience: On-call engineers, platform team leads
Last reviewed: 2026-05-30
Severity Definitions
| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| P0 | Production completely down, all users affected | Immediate — page on-call | Gateway unreachable, Keycloak down, SQL Server down |
| P1 | Production degraded, subset of users or features affected | < 30 minutes | Single service down, slow responses, partial auth failures |
| P2 | Non-critical feature broken, workaround available | < 4 hours | Notification failures, report errors, non-blocking UI |
| P3 | Minor issue, cosmetic or low-impact | Next business day | UI glitch, log noise, non-critical warning |
On-Call Contact Chain
| Role | Responsibility |
|---|---|
| On-call DevOps | First responder — triage and initial fix |
| Platform Lead | Escalation for P0, architecture decisions |
| Database Admin | SQL Server and data-related issues |
| Keycloak Owner | Auth and SSO issues |
P0 Response Protocol
Minute 0–5: Assess
bash
# [ACTION] Check that the Azure Front Door endpoint is enabled
az afd endpoint show \
--resource-group "mic-erp-fr-prod-network-rg" \
--profile-name "mic-erp-prod-afd" \
--endpoint-name "mic-erp-prod-endpoint" \
--query "enabledState" -o tsv
# [ACTION] Check Gateway CAE (public-facing)
az containerapp show \
--name "mic-erp-be-prod-gateway" \
--resource-group "mic-erp-be-prod-containers-rg" \
--query "{status:properties.runningStatus, minReplicas:properties.template.scale.minReplicas}" \
-o json
# [ACTION] Check Keycloak
az containerapp show \
--name "mic-erp-be-prod-keycloak" \
--resource-group "mic-erp-be-prod-containers-rg" \
--query "properties.runningStatus" -o tsv
Minute 5–10: Check App Insights
Open App Insights dashboard in Azure Portal:
- Live Metrics: Confirm whether requests are reaching the service
- Failures blade: Look for a 5xx spike correlated with the incident start time
- Performance blade: Check response time percentiles
kusto
// KQL: most frequent exceptions in the last 30 minutes
exceptions
| where timestamp > ago(30m)
| summarize count() by outerMessage, problemId
| order by count_ desc
| take 20
Minute 10–15: Check Recent Deployments
bash
# [ACTION] List recent pipeline runs
az pipelines run list \
--org https://dev.azure.com/microtec \
--project ERP \
--status completed \
--top 10 \
--query "[].{name:definition.name, result:result, finishTime:finishTime}" \
-o table
If a recent deployment correlates with the incident start time, proceed to Emergency Rollback.
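To make that correlation less error-prone, the run list can be filtered to a window just before the incident. This is a minimal sketch, not an established team script: `runs_in_window` is a hypothetical helper, and it assumes the finish times are ISO 8601 timestamps in one timezone (UTC), so that plain string comparison orders them correctly.

```shell
# Hypothetical helper: keep only runs that finished inside a time window.
# Reads "name<TAB>finishTime" lines on stdin.
# $1 = window start, $2 = incident start (both ISO 8601, same timezone as input).
runs_in_window() {
  LC_ALL=C awk -F '\t' -v lo="$1" -v hi="$2" \
    '$2 > lo && $2 < hi { print $1 "\t" $2 }'
}

# Example (assumes the CLI reports finishTime in UTC):
# az pipelines run list --org https://dev.azure.com/microtec --project ERP \
#   --status completed --top 10 \
#   --query "[].[definition.name, finishTime]" -o tsv |
#   runs_in_window "2026-05-30T09:30:00" "2026-05-30T10:00:00"
```

Any run the filter prints finished between the two timestamps and is a rollback candidate. An empty result suggests the cause lies elsewhere.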
Minute 15–20: Check CAE Health
bash
export ENV="prod"
export RG="mic-erp-be-${ENV}-containers-rg"
# [ACTION] Get status of all container apps in the private CAE
az containerapp list \
--resource-group "${RG}" \
--query "[].{name:name, status:properties.runningStatus, minReplicas:properties.template.scale.minReplicas}" \
-o table
# [ACTION] Check failing container logs
az containerapp logs show \
--name "mic-erp-be-prod-apps-portal" \
--resource-group "${RG}" \
--follow --tail 100
Minute 20–25: Check Key Vault Accessibility
bash
# [ACTION] Verify the KV secret is readable (this checks your signed-in CLI identity, not the app's managed identity)
az keyvault secret show \
--vault-name "mic-erp-prod-kv" \
--name "ConnectionStrings--DefaultConnection" \
--query "value" -o tsv > /dev/null && echo "KV OK" || echo "KV ERROR"
Minute 25–30: Check Service Bus Dead-Letter Queues
bash
# [ACTION] Check ASB dead-letter queue depth
az servicebus queue show \
--resource-group "mic-erp-be-prod-messaging-rg" \
--namespace-name "mic-erp-be-prod-asb" \
--name "erp-events" \
--query "countDetails.deadLetterMessageCount" -o tsv
A growing dead-letter count indicates a consumer service is down or rejecting messages.
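A single reading cannot show growth, so the check needs two samples taken some time apart. The sketch below compares them. `dlq_trend` is a hypothetical helper name, and the interval between samples is left to the operator.

```shell
# Hypothetical helper: classify the change between two dead-letter counts.
# $1 = earlier sample, $2 = later sample (both non-negative integers).
dlq_trend() {
  local prev="$1"
  local curr="$2"
  if [ "$curr" -gt "$prev" ]; then
    echo "growing"
  elif [ "$curr" -lt "$prev" ]; then
    echo "draining"
  else
    echo "stable"
  fi
}
```

For example, run the queue query above twice, about a minute apart, and pass the two numbers to `dlq_trend`. A result of "growing" points at the consumer service.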
Emergency Rollback
[ACTION] Use when a deployment caused the incident. Rolls back to the previous image tag.
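The rollback commands below derive the tag by stripping everything up to the last colon of the image reference, which gives a misleading result when the reference carries no tag. Here is a minimal sketch of a stricter extraction. `image_tag` is a hypothetical helper, not part of the az CLI.

```shell
# Hypothetical helper: print the tag of an image reference, or fail if it has none.
image_tag() {
  local ref="$1"
  local name="${ref##*/}"   # drop registry host (and port) plus repository path
  case "$name" in
    *@*)  echo "digest reference, no tag: ${ref}" >&2; return 1 ;;
    *:?*) printf '%s\n' "${name##*:}" ;;
    *)    echo "no tag in image reference: ${ref}" >&2; return 1 ;;
  esac
}
```

A non-zero exit status means there is nothing safe to roll back to, so check it before running `az containerapp update`.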
bash
export ENV="prod"
export SVC="apps-portal" # Service to roll back
export RG="mic-erp-be-${ENV}-containers-rg"
export APP="mic-erp-be-${ENV}-${SVC}"
export ACR="micerpbe${ENV}acr"
# [ACTION] Find the previous image tag
PREV_TAG=$(az containerapp revision list \
--name "${APP}" --resource-group "${RG}" \
--query "sort_by([?properties.active==\`false\`], &properties.createdTime)[-1].properties.template.containers[0].image" \
-o tsv | sed 's/.*://')
# Aborts with an error if no inactive revision was found (empty tag)
echo "Rolling back to tag: ${PREV_TAG:?no previous revision found}"
# [ACTION] Update container app to previous image
az containerapp update \
--name "${APP}" \
--resource-group "${RG}" \
--image "${ACR}.azurecr.io/${SVC}-apis:${PREV_TAG}"
# [VERIFY] New revision is active
az containerapp revision list \
--name "${APP}" --resource-group "${RG}" \
--query "[?properties.active==\`true\`].{name:name, image:properties.template.containers[0].image}" \
-o table
Roll back multiple services
bash
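SERVICES=("apps-portal" "inventory" "business-owners" "hr")
# Runs sequentially; reuses ENV, RG, and ACR from above and assumes every image is named ${SVC}-apis
for SVC in "${SERVICES[@]}"; do
  APP="mic-erp-be-${ENV}-${SVC}"
  PREV_TAG=$(az containerapp revision list --name "${APP}" --resource-group "${RG}" \
    --query "sort_by([?properties.active==\`false\`], &properties.createdTime)[-1].properties.template.containers[0].image" \
    -o tsv | sed 's/.*://')
  az containerapp update --name "${APP}" --resource-group "${RG}" \
    --image "${ACR}.azurecr.io/${SVC}-apis:${PREV_TAG}"
done
P1 Response Protocol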
P1 follows the same steps as P0, but with lower urgency. Do not page on-call unless the situation escalates to P0.
- Assess impact scope (how many users, which features)
- Check App Insights for the affected service only
- Check logs for the specific service
- Apply targeted fix or rollback
- Monitor for 15 minutes post-fix
- File P1 report within 24 hours
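The 15-minute monitoring step can be scripted as a fixed number of repeated checks. This is a sketch only: `monitor_for` is a hypothetical helper, and `HEALTH_URL` stands in for whatever check fits the affected service.

```shell
# Hypothetical helper: run a check command a fixed number of times and
# report how many runs failed. Returns non-zero if any run failed.
# $1 = number of checks, $2 = seconds between checks, rest = command to run.
monitor_for() {
  local checks="$1"
  local interval="$2"
  local i=1
  local failures=0
  shift 2
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      failures=$((failures + 1))
      echo "check ${i}/${checks} failed" >&2
    fi
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "${failures} of ${checks} checks failed"
  [ "$failures" -eq 0 ]
}

# Example: 15 checks, one per minute (HEALTH_URL is a placeholder):
# monitor_for 15 60 curl -sf -o /dev/null "$HEALTH_URL"
```

Unlike a wait loop, this keeps checking after the first success, so a service that recovers briefly and then fails again is still reported.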
Common Incident Scenarios
Scenario 1: Gateway 502
Symptoms: All API calls return 502.
Cause: YARP cannot reach a backend service.
bash
# Check Gateway logs
az containerapp logs show \
--name "mic-erp-be-prod-gateway" \
--resource-group "mic-erp-be-prod-containers-rg" \
--tail 50
# Verify internal service is running
az containerapp show \
--name "mic-erp-be-prod-apps-portal" \
--resource-group "mic-erp-be-prod-containers-rg" \
--query "properties.runningStatus" -o tsv
Fix: Restart the target service or roll back if a recent deployment broke it.
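The restart itself can reuse the `az containerapp revision restart` command that this runbook applies to the worker service. Below is a minimal sketch that restarts every active revision of one app. `restart_app` is a hypothetical wrapper, not an az command.

```shell
# Hypothetical wrapper: restart every active revision of one container app.
# $1 = container app name, $2 = resource group.
restart_app() {
  local app="$1"
  local rg="$2"
  local rev
  az containerapp revision list \
    --name "$app" --resource-group "$rg" \
    --query "[?properties.active].name" -o tsv |
  while read -r rev; do
    [ -n "$rev" ] || continue
    echo "Restarting revision ${rev}"
    # </dev/null keeps the restart call from consuming the revision list on stdin
    az containerapp revision restart \
      --name "$app" --resource-group "$rg" --revision "$rev" </dev/null
  done
}

# Example:
# restart_app "mic-erp-be-prod-apps-portal" "mic-erp-be-prod-containers-rg"
```

Looping over the list avoids passing several revision names as one argument when the app runs more than one active revision.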
Scenario 2: Authentication Failures (401/403)
Symptoms: Users cannot log in or get 401 on authenticated endpoints.
Cause: Keycloak is down, a client secret was rotated without a service restart, or token validation is misconfigured.
bash
# Check Keycloak health
curl -sf https://prod.onlinemicrotec.com.sa/auth/health/ready | jq .
# Check Keycloak logs
az containerapp logs show \
--name "mic-erp-be-prod-keycloak" \
--resource-group "mic-erp-be-prod-containers-rg" \
--tail 100
Fix: Restart Keycloak. If a client secret was recently rotated, restart the services that use it.
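After a restart, Keycloak can take a while to report ready, so it helps to poll the readiness endpoint rather than check it once. Here is a sketch of a generic retry loop. `wait_until` is a hypothetical helper, and the attempt count and delay in the example are illustrative values, not tuned ones.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
wait_until() {
  local attempts="$1"
  local delay="$2"
  local i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}

# Example: poll the readiness endpoint every 15 seconds, up to 20 times:
# wait_until 20 15 curl -sf -o /dev/null https://prod.onlinemicrotec.com.sa/auth/health/ready
```

If the loop gives up, return to the Keycloak logs instead of restarting again.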
Scenario 3: Database Connection Failures
Symptoms: Services return 500 with "Cannot connect to SQL Server" in logs.
bash
# Test SQL connectivity from your workstation (via VPN)
sqlcmd -S 20.50.120.95 -U sqladmin -Q "SELECT 1" -P "${SQL_ADMIN_PASS}"
# Check SQL Server VM power state (provisioningState does not show whether the VM is running)
az vm show \
--resource-group "mic-backend-shared-sql-rg" \
--name "mic-shared-sql-vm" \
--show-details \
--query "powerState" -o tsv
Fix: If the VM is stopped or deallocated, start it via Azure Portal. If the connection string secret was rotated, restart affected services.
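As an alternative to the Portal, the CLI can start the VM. This is a minimal sketch and assumes the on-call identity has permission to start the VM. `ensure_vm_running` is a hypothetical helper, and the state strings are the `powerState` values that `az vm show --show-details` reports.

```shell
# Hypothetical helper: start a VM only if it is stopped or deallocated.
# $1 = resource group, $2 = VM name.
ensure_vm_running() {
  local rg="$1"
  local vm="$2"
  local state
  state=$(az vm show --resource-group "$rg" --name "$vm" \
    --show-details --query "powerState" -o tsv)
  case "$state" in
    "VM running")
      echo "${vm} is already running"
      ;;
    "VM stopped"|"VM deallocated")
      echo "${vm} is ${state#VM }; starting"
      az vm start --resource-group "$rg" --name "$vm"
      ;;
    *)
      # Transitional or unknown states are left for a human to judge.
      echo "${vm} is in state '${state}'; not acting" >&2
      return 1
      ;;
  esac
}

# Example:
# ensure_vm_running "mic-backend-shared-sql-rg" "mic-shared-sql-vm"
```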
Scenario 4: Service Bus Message Backlog
Symptoms: Background processing is slow; dead-letter queue is growing.
bash
# Check worker service
az containerapp show \
--name "mic-erp-be-prod-platforms-worker" \
--resource-group "mic-erp-be-prod-containers-rg" \
--query "properties.runningStatus" -o tsv
# Restart worker
az containerapp revision restart \
--name "mic-erp-be-prod-platforms-worker" \
--resource-group "mic-erp-be-prod-containers-rg" \
--revision "$(az containerapp revision list \
--name mic-erp-be-prod-platforms-worker \
--resource-group mic-erp-be-prod-containers-rg \
--query '[?properties.active].name' -o tsv)"
Post-Incident Actions
[ACTION] Complete within 48 hours of incident resolution for P0, 72 hours for P1.
Incident Report Template
markdown
## Incident Report — <YYYY-MM-DD> P0/P1
**Duration**: HH:MM – HH:MM (UTC+3)
**Severity**: P0 / P1
**Affected Services**: list services
**User Impact**: describe impact
### Timeline
- HH:MM — Issue first detected (alert / user report)
- HH:MM — On-call paged
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Service restored
- HH:MM — All-clear declared
### Root Cause
(describe root cause)
### Resolution
(describe fix applied)
### Action Items
- [ ] Item 1 (owner, due date)
- [ ] Item 2 (owner, due date)
### Lessons Learned
(what would prevent this in the future)
Communication Templates
Initial stakeholder notification (P0)
SUBJECT: [P0 INCIDENT] Microtec ERP Production — <service> unavailable
We are investigating a production issue affecting <service/all users>.
Impact: <describe impact>.
Status: Investigating.
Next update in 15 minutes.
Resolution notification
SUBJECT: [RESOLVED] Microtec ERP Production — <service> restored
The production incident has been resolved.
Duration: HH:MM – HH:MM
Root cause: <summary>
Full post-mortem will be shared within 48 hours.
Related Runbooks
- Key Rotation — if a secret compromise triggered the incident
- Scale a Service — if the incident was caused by capacity exhaustion
- Keycloak Realm Recovery — for auth-related incidents