Runbook: Incident Response

P0/P1 incident playbook for the Microtec ERP platform.

Audience: On-call engineers, platform team leads
Last reviewed: 2026-05-30


Severity Definitions

| Severity | Definition | Response Time | Examples |
|----------|------------|---------------|----------|
| P0 | Production completely down, all users affected | Immediate (page on-call) | Gateway unreachable, Keycloak down, SQL Server down |
| P1 | Production degraded, subset of users or features affected | < 30 minutes | Single service down, slow responses, partial auth failures |
| P2 | Non-critical feature broken, workaround available | < 4 hours | Notification failures, report errors, non-blocking UI |
| P3 | Minor issue, cosmetic or low-impact | Next business day | UI glitch, log noise, non-critical warning |

On-Call Contact Chain

| Role | Responsibility |
|------|----------------|
| On-call DevOps | First responder: triage and initial fix |
| Platform Lead | Escalation for P0, architecture decisions |
| Database Admin | SQL Server and data-related issues |
| Keycloak Owner | Auth and SSO issues |

P0 Response Protocol

Minute 0–5: Assess

bash
# [ACTION] Check Azure Front Door health
az afd endpoint show \
  --resource-group "mic-erp-fr-prod-network-rg" \
  --profile-name "mic-erp-prod-afd" \
  --endpoint-name "mic-erp-prod-endpoint" \
  --query "enabledState" -o tsv

# [ACTION] Check Gateway CAE (public-facing)
az containerapp show \
  --name "mic-erp-be-prod-gateway" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "{status:properties.runningStatus, minReplicas:properties.template.scale.minReplicas}" \
  -o json

# [ACTION] Check Keycloak
az containerapp show \
  --name "mic-erp-be-prod-keycloak" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "properties.runningStatus" -o tsv

Minute 5–10: Check App Insights

Open the App Insights dashboard in the Azure Portal:

  • Live Metrics: Confirm whether requests are reaching the service
  • Failures blade: Look for 5xx spike correlated with the incident start time
  • Performance blade: Check response time percentiles
kusto
// KQL: top 20 exception groups by count in the last 30 minutes
exceptions
| where timestamp > ago(30m)
| summarize count() by outerMessage, problemId
| order by count_ desc
| take 20
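
The same query can also be run from a terminal when the portal is slow to load. The following is a minimal sketch, assuming the `application-insights` extension for the Azure CLI is installed. The function name `top_exceptions` and both of its arguments (the App Insights resource name and its resource group) are placeholders supplied by the caller, not names taken from this runbook.

```shell
# Sketch: run the exception summary through the CLI instead of the portal.
# Assumes the `application-insights` az extension is installed.
top_exceptions() {
  app_insights="$1"   # App Insights resource name or application ID (placeholder)
  rg="$2"             # resource group of that resource (placeholder)

  az monitor app-insights query \
    --app "${app_insights}" \
    --resource-group "${rg}" \
    --analytics-query "exceptions | where timestamp > ago(30m) | summarize count() by outerMessage, problemId | order by count_ desc | take 20"
}

# Usage (both values are hypothetical):
#   top_exceptions "<app-insights-name>" "<app-insights-resource-group>"
```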

Minute 10–15: Check Recent Deployments

bash
# [ACTION] List recent pipeline runs
az pipelines run list \
  --org https://dev.azure.com/microtec \
  --project ERP \
  --status completed \
  --top 10 \
  --query "[].{name:definition.name, result:result, finishTime:finishTime}" \
  -o table

If a recent deployment correlates with the incident start time, proceed to Emergency Rollback.
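
Checking the correlation by eye is error-prone under pressure, so a small helper can compare a run's `finishTime` with the incident start. This is a sketch only: the function name `finished_before_incident` and the default 60-minute window are assumptions, and it relies on GNU `date` (Linux); BSD/macOS `date` takes different flags.

```shell
# Sketch: succeed if a pipeline run finished shortly before the incident began.
# Assumes GNU date; both timestamps are ISO 8601 strings.
finished_before_incident() {
  finish_time="$1"       # finishTime value from `az pipelines run list`
  incident_start="$2"    # when the incident was first detected
  window_min="${3:-60}"  # how far back a deployment still counts as suspect (assumed default)

  finish_epoch=$(date -u -d "${finish_time}" +%s) || return 2
  incident_epoch=$(date -u -d "${incident_start}" +%s) || return 2
  delta=$(( incident_epoch - finish_epoch ))

  # Suspect only if the run finished before the incident and inside the window.
  [ "${delta}" -ge 0 ] && [ "${delta}" -le $(( window_min * 60 )) ]
}

# Usage:
#   finished_before_incident "2026-05-30T10:00:00Z" "2026-05-30T10:20:00Z" 30 \
#     && echo "Deployment is a rollback candidate"
```

A deployment that was still in progress when the incident began falls outside this check and needs to be reviewed manually.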

Minute 15–20: Check CAE Health

bash
export ENV="prod"
export RG="mic-erp-be-${ENV}-containers-rg"

# [ACTION] Get status of all container apps in the private CAE
az containerapp list \
  --resource-group "${RG}" \
  --query "[].{name:name, status:properties.runningStatus, minReplicas:properties.template.scale.minReplicas}" \
  -o table

# [ACTION] Stream the last 100 log lines of the failing container (Ctrl+C to stop following)
az containerapp logs show \
  --name "mic-erp-be-prod-apps-portal" \
  --resource-group "${RG}" \
  --follow --tail 100
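
With many container apps, the table output is easier to act on when only the unhealthy entries are shown. The following sketch filters tab-separated `name<TAB>status` lines. The function name `filter_not_running` is hypothetical, and the sketch assumes a healthy app reports `Running` as its `runningStatus`.

```shell
# Sketch: print container apps whose runningStatus is anything other than "Running".
# Reads tab-separated "name<TAB>status" lines on stdin.
filter_not_running() {
  awk -F '\t' '$2 != "Running" { print $1 " (" $2 ")" }'
}

# Usage (RG as exported in the block above):
#   az containerapp list --resource-group "${RG}" \
#     --query "[].[name, properties.runningStatus]" -o tsv | filter_not_running
```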

Minute 20–25: Check Key Vault Accessibility

bash
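# [ACTION] Verify the KV secret is readable with your current az login
# (this checks vault reachability and your own access, not the container apps' managed identity)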
az keyvault secret show \
  --vault-name "mic-erp-prod-kv" \
  --name "ConnectionStrings--DefaultConnection" \
  --query "value" -o tsv > /dev/null && echo "KV OK" || echo "KV ERROR"

Minute 25–30: Check Service Bus Dead-Letter Queues

bash
# [ACTION] Check ASB dead-letter queue depth
az servicebus queue show \
  --resource-group "mic-erp-be-prod-messaging-rg" \
  --namespace-name "mic-erp-be-prod-asb" \
  --name "erp-events" \
  --query "countDetails.deadLetterMessageCount" -o tsv

A growing dead-letter count indicates a consumer service is down or rejecting messages.
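
A single reading does not show whether the count is growing, so two samples are needed. The sketch below takes them a fixed interval apart, reusing the queue, namespace, and resource group from the command above. The function names and the 60-second default interval are assumptions.

```shell
# Sketch: sample the dead-letter count twice and report whether it grew.
dlq_count() {
  az servicebus queue show \
    --resource-group "mic-erp-be-prod-messaging-rg" \
    --namespace-name "mic-erp-be-prod-asb" \
    --name "erp-events" \
    --query "countDetails.deadLetterMessageCount" -o tsv
}

dlq_is_growing() {
  interval="${1:-60}"   # seconds between the two samples (assumed default)
  first=$(dlq_count) || return 2
  sleep "${interval}"
  second=$(dlq_count) || return 2
  echo "dead-letter count: ${first} -> ${second}"
  [ "${second}" -gt "${first}" ]
}

# Usage:
#   dlq_is_growing 120 && echo "Check the consumer service"
```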


Emergency Rollback

[ACTION] Use when a deployment caused the incident. Rolls back to the previous image tag.

bash
export ENV="prod"
export SVC="apps-portal"                    # Service to roll back
export RG="mic-erp-be-${ENV}-containers-rg"
export APP="mic-erp-be-${ENV}-${SVC}"
export ACR="micerpbe${ENV}acr"

# [ACTION] Find the previous image tag
PREV_TAG=$(az containerapp revision list \
  --name "${APP}" --resource-group "${RG}" \
  --query "sort_by([?properties.active==\`false\`], &properties.createdTime)[-1].properties.template.containers[0].image" \
  -o tsv | sed 's/.*://')

echo "Rolling back to tag: ${PREV_TAG}"

# [ACTION] Update container app to previous image
az containerapp update \
  --name "${APP}" \
  --resource-group "${RG}" \
  --image "${ACR}.azurecr.io/${SVC}-apis:${PREV_TAG}"

# [VERIFY] New revision is active
az containerapp revision list \
  --name "${APP}" --resource-group "${RG}" \
  --query "[?properties.active==\`true\`].{name:name, image:properties.template.containers[0].image}" \
  -o table
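
If no inactive revision exists, `PREV_TAG` comes back empty and the update would be attempted with an invalid image reference. The following sketch shows one way to extract the tag and stop early in that case. The function name `image_tag` is hypothetical, and the sketch treats digest-pinned references as having no reusable tag.

```shell
# Sketch: print the tag of an image reference, or fail if there is none.
image_tag() {
  case "$1" in
    *@*) return 1 ;;            # digest-pinned reference, no tag to reuse
  esac
  tag="${1##*:}"                # text after the last colon
  case "${tag}" in
    "$1"|""|*/*) return 1 ;;    # no colon, empty tag, or the colon was a registry port
  esac
  printf '%s\n' "${tag}"
}

# Usage with the variables exported above, where PREV_IMAGE holds the image
# string returned by the revision list query:
#   PREV_TAG=$(image_tag "${PREV_IMAGE}") \
#     || echo "No previous tag found; do not run the update" >&2
```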

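Roll back multiple services in sequence

bash
SERVICES=("apps-portal" "inventory" "business-owners" "hr")
for SVC in "${SERVICES[@]}"; do
  APP="mic-erp-be-${ENV}-${SVC}"
  PREV_TAG=$(az containerapp revision list --name "${APP}" --resource-group "${RG}" \
    --query "sort_by([?properties.active==\`false\`], &properties.createdTime)[-1].properties.template.containers[0].image" \
    -o tsv | sed 's/.*://')
  [ -n "${PREV_TAG}" ] || { echo "No previous revision for ${APP}; skipping" >&2; continue; }
  az containerapp update --name "${APP}" --resource-group "${RG}" \
    --image "${ACR}.azurecr.io/${SVC}-apis:${PREV_TAG}"
done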

P1 Response Protocol

P1 follows the same steps as P0 but with lower urgency. Do not page on-call unless the situation escalates to P0.

  1. Assess impact scope (how many users, which features)
  2. Check App Insights for the affected service only
  3. Check logs for the specific service
  4. Apply targeted fix or rollback
  5. Monitor for 15 minutes post-fix
  6. File P1 report within 24 hours
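
The 15-minute monitoring window in step 5 can be sketched as a polling loop. The function name `monitor_app` is hypothetical, and the defaults of 15 checks spaced 60 seconds apart are assumptions. The loop only watches `runningStatus`, so error rates in App Insights still need a manual look.

```shell
# Sketch: poll a container app's runningStatus for a fixed number of checks.
# Defaults give 15 checks, 60 seconds apart.
monitor_app() {
  app="$1"; rg="$2"
  checks="${3:-15}"; interval="${4:-60}"
  i=1
  while [ "${i}" -le "${checks}" ]; do
    status=$(az containerapp show --name "${app}" --resource-group "${rg}" \
      --query "properties.runningStatus" -o tsv)
    echo "check ${i}/${checks}: ${app} is ${status:-unknown}"
    [ "${status}" = "Running" ] || return 1   # stop at the first unhealthy check
    i=$(( i + 1 ))
    [ "${i}" -le "${checks}" ] && sleep "${interval}"
  done
  return 0
}

# Usage:
#   monitor_app "mic-erp-be-prod-apps-portal" "mic-erp-be-prod-containers-rg" \
#     || echo "Service regressed during the monitoring window"
```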

Common Incident Scenarios

Scenario 1: Gateway 502

Symptoms: All API calls return 502.
Cause: YARP cannot reach a backend service.

bash
# Check Gateway logs
az containerapp logs show \
  --name "mic-erp-be-prod-gateway" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --tail 50

# Verify internal service is running
az containerapp show \
  --name "mic-erp-be-prod-apps-portal" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "properties.runningStatus" -o tsv

Fix: Restart the target service or roll back if a recent deployment broke it.
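
A restart can be done from the CLI by restarting the app's active revision. The following is a sketch under the assumption that the app runs a single active revision; the function name `restart_active_revision` is hypothetical. With several active revisions it restarts only the first one listed.

```shell
# Sketch: restart the first active revision of a container app.
restart_active_revision() {
  app="$1"; rg="$2"
  revision=$(az containerapp revision list --name "${app}" --resource-group "${rg}" \
    --query "[?properties.active].name" -o tsv | head -n 1)
  [ -n "${revision}" ] || { echo "No active revision found for ${app}" >&2; return 1; }
  az containerapp revision restart --name "${app}" --resource-group "${rg}" \
    --revision "${revision}"
}

# Usage:
#   restart_active_revision "mic-erp-be-prod-apps-portal" "mic-erp-be-prod-containers-rg"
```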


Scenario 2: Authentication Failures (401/403)

Symptoms: Users cannot log in or get 401 on authenticated endpoints.
Cause: Keycloak is down, client secret rotated without restart, or token validation misconfigured.

bash
# Check Keycloak health
curl -sSf https://prod.onlinemicrotec.com.sa/auth/health/ready | jq .

# Check Keycloak logs
az containerapp logs show \
  --name "mic-erp-be-prod-keycloak" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --tail 100

Fix: Restart Keycloak. If a client secret was recently rotated, restart the services that use it.


Scenario 3: Database Connection Failures

Symptoms: Services return 500 with "Cannot connect to SQL Server" in logs.

bash
# Test SQL connectivity from your workstation (via VPN)
# SQLCMDPASSWORD keeps the password out of the process argument list
SQLCMDPASSWORD="${SQL_ADMIN_PASS}" sqlcmd -S 20.50.120.95 -U sqladmin -Q "SELECT 1"

# Check SQL Server VM power state (provisioningState alone does not show whether the VM is running)
az vm show -d \
  --resource-group "mic-backend-shared-sql-rg" \
  --name "mic-shared-sql-vm" \
  --query "powerState" -o tsv

Fix: If the VM is stopped or deallocated, start it via the Azure Portal. If the connection string secret was rotated, restart the affected services.
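
Starting the VM can also be done from the CLI when the portal is unavailable. The sketch below reads the power state with `az vm show -d` and starts the VM only when it reports as stopped or deallocated. The function name `start_vm_if_stopped` is hypothetical, and the sketch assumes your account has permission to start the VM.

```shell
# Sketch: start a VM only if it is currently stopped or deallocated.
start_vm_if_stopped() {
  rg="$1"; vm="$2"
  state=$(az vm show -d --resource-group "${rg}" --name "${vm}" \
    --query "powerState" -o tsv)
  echo "${vm}: ${state:-unknown}"
  case "${state}" in
    "VM stopped"|"VM deallocated")
      az vm start --resource-group "${rg}" --name "${vm}" ;;
    *)
      return 0 ;;   # running or in transition: leave it alone
  esac
}

# Usage:
#   start_vm_if_stopped "mic-backend-shared-sql-rg" "mic-shared-sql-vm"
```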


Scenario 4: Service Bus Message Backlog

Symptoms: Background processing is slow; dead-letter queue is growing.

bash
# Check worker service
az containerapp show \
  --name "mic-erp-be-prod-platforms-worker" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --query "properties.runningStatus" -o tsv

# Restart worker
az containerapp revision restart \
  --name "mic-erp-be-prod-platforms-worker" \
  --resource-group "mic-erp-be-prod-containers-rg" \
  --revision "$(az containerapp revision list \
    --name mic-erp-be-prod-platforms-worker \
    --resource-group mic-erp-be-prod-containers-rg \
    --query '[?properties.active].name | [0]' -o tsv)"

Post-Incident Actions

[ACTION] Complete within 48 hours of incident resolution for P0, 72 hours for P1.

Incident Report Template

markdown
## Incident Report — <YYYY-MM-DD> P0/P1

**Duration**: HH:MM – HH:MM (UTC+3)  
**Severity**: P0 / P1  
**Affected Services**: list services  
**User Impact**: describe impact  

### Timeline
- HH:MM — Issue first detected (alert / user report)
- HH:MM — On-call paged
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Service restored
- HH:MM — All-clear declared

### Root Cause
(describe root cause)

### Resolution
(describe fix applied)

### Action Items
- [ ] Item 1 (owner, due date)
- [ ] Item 2 (owner, due date)

### Lessons Learned
(what would prevent this in the future)

Communication Templates

Initial stakeholder notification (P0)

SUBJECT: [P0 INCIDENT] Microtec ERP Production — <service> unavailable

We are investigating a production issue affecting <service/all users>.
Impact: <describe impact>.
Status: Investigating.
Next update in 15 minutes.

Resolution notification

SUBJECT: [RESOLVED] Microtec ERP Production — <service> restored

The production incident has been resolved.
Duration: HH:MM – HH:MM
Root cause: <summary>
Full post-mortem will be shared within 48 hours.

Internal Documentation — Microtec Platform Team