Fooj Shared Egress Migration
Section: 17 — Fooj
Last Updated: 2026-05-30 (migration completed 2026-04-05)
Scope: Stage + Production CAE consolidation, shared NAT Gateway, init container gotcha
Summary
On 2026-04-05, the Fooj stage and production Container App Environments were consolidated into a shared VNet with a single NAT Gateway. This reduced the number of public egress IPs from 2 (one per env) to 1 (20.26.0.39).
Before the Migration
fooj-stage environment:
  VNet: 10.20.1.0/24
  NAT Gateway: fooj-stage-nat (separate public IP)
  Public IP: 20.26.X.X (stage-specific)
fooj-prod environment:
  VNet: 10.20.2.0/24
  NAT Gateway: fooj-prod-nat (separate public IP)
  Public IP: 20.26.Y.Y (prod-specific)
Each environment had its own NAT Gateway, resulting in two public IPs that needed to be whitelisted by external services (payment gateways, third-party APIs).
After the Migration
Shared VNet: 10.20.0.0/16
  Subnet stage: 10.20.1.0/24
  Subnet prod: 10.20.2.0/24
Shared NAT Gateway: fooj-shared-nat
  Public IP: 20.26.0.39 (SINGLE IP for ALL Fooj environments)
Why Consolidate?
Benefits
| Benefit | Detail |
|---|---|
| Single IP to whitelist | External services only need 20.26.0.39 — not two separate IPs |
| Cost reduction | One NAT Gateway (~$40/mo) instead of two (~$80/mo) |
| Simplified IP allowlisting | Payment gateways, SMS providers, and other third parties each maintain a single allowlist entry |
| Unified egress policy | One place to manage firewall/DNAT rules |
Trade-offs
| Trade-off | Impact |
|---|---|
| Stage traffic on same IP as prod | Staging requests appear to come from the same IP as production — external rate limits apply jointly |
| Single point of egress | If the NAT Gateway has issues, both stage and prod lose outbound connectivity |
| Environment blending for whitelists | Third parties cannot distinguish stage vs prod traffic by source IP |
NOTE
The cost trade-off was accepted: the ~$40/mo savings and reduced IP management overhead outweigh the risk for Fooj's current scale. This decision can be revisited if Fooj's stage traffic starts consuming significant NAT bandwidth.
Init Container Gotcha
CAUTION
During the migration, a critical deployment issue was discovered with Azure Container Apps init containers and VNet-dependent resources.
The Problem
When re-deploying ACA environments into the new shared VNet, init containers that performed network-dependent operations (database migrations, Keycloak realm seeding) failed with connectivity errors:
Init container 'db-migrate' failed with exit code 1:
Error: Connection refused: fooj-stage-sql.database.windows.net:1433
The root cause: init containers are scheduled before the ACA environment's VNet integration is fully established. The first pod scheduled after a VNet change or environment re-creation may not have network access until the underlying infrastructure fully propagates.
Symptoms
- Init container fails on first deployment after VNet migration
- Same init container works on re-deploy (because VNet is now stable)
- No code changes between failing and succeeding deployment
- Error is always connectivity-related (TCP connection refused, DNS not resolving)
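Because the failures are purely connectivity-related, a pre-flight reachability check can separate "network not ready yet" from a genuine migration error. The following is a minimal sketch, not the Fooj implementation: `wait_for_tcp` and its parameters are hypothetical names, and the doubling delay schedule is an assumption chosen for illustration.

```python
import socket
import time
from typing import Callable


def wait_for_tcp(host: str, port: int, attempts: int = 5, base_delay: float = 5.0,
                 timeout: float = 3.0,
                 connect: Callable = socket.create_connection,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once a TCP connection succeeds, False after all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
            return True
        except OSError as exc:  # refused connections, timeouts and DNS failures
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return False
```

An init container could call `wait_for_tcp("fooj-stage-sql.database.windows.net", 1433)` and start the migration only when it returns `True`. The `connect` and `sleep` parameters exist only so the function can be exercised without a real network.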
Workaround
Option 1: Retry with backoff in init container
dockerfile
# Dockerfile for init container
# Retries the migration up to 5 times, waiting 5s, 10s, 15s, 20s between attempts.
ENTRYPOINT ["sh", "-c", "\
  attempt=1; \
  until dotnet Fooj.Migrations.dll; do \
    if [ $attempt -ge 5 ]; then echo \"Migration failed after $attempt attempts.\"; exit 1; fi; \
    echo \"Attempt $attempt failed. Retrying in $((attempt * 5))s...\"; \
    sleep $((attempt * 5)); \
    attempt=$((attempt + 1)); \
  done"]
Option 2: Separate migration job (recommended)
Move database migrations out of init containers entirely. Run them as a separate Azure Container App Job triggered before the main deployment:
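yaml
# azure-pipelines.yml
- task: AzureCLI@2
  displayName: Run DB Migrations
  inputs:
    azureSubscription: '<service-connection>'  # placeholder: service connection name not stated in this document
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az containerapp job start \
        --name fooj-db-migrate-job \
        --resource-group fooj-stage-rg \
        --wait-for-completion
This approach: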
- Decouples migration from app startup
- Makes migration failures visible in the pipeline (not silently killing pods)
- Allows separate retry/rollback of migrations
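To make a migration failure fail the pipeline step, the step has to block until the job execution reaches a terminal state. One way to do that is a small polling loop, sketched below under stated assumptions: `fetch_status` is a hypothetical callable that returns the current execution status as a string (for example by wrapping an `az` CLI query), and the status names in the two sets are assumptions that should be checked against what the job actually reports.

```python
import time
from typing import Callable

# Assumed status names; adjust to what the job execution actually reports.
TERMINAL_OK = {"Succeeded"}
TERMINAL_FAILED = {"Failed", "Stopped"}


def wait_for_job(fetch_status: Callable[[], str], poll_seconds: float = 10.0,
                 max_polls: int = 60,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the job succeeds (True) or fails (False); raise on timeout."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_OK:
            return True
        if status in TERMINAL_FAILED:
            return False
        sleep(poll_seconds)  # still running: wait before the next poll
    raise TimeoutError(f"job did not finish within {max_polls} polls")
```

A pipeline script would exit non-zero when `wait_for_job` returns `False` or raises, which stops the main deployment from proceeding on a failed migration.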
Option 3: Health check with grace period
Add a startup probe with a long initialDelaySeconds to give the VNet time to stabilize:
yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60 # Give VNet time to propagate
  periodSeconds: 10
  failureThreshold: 12
Current State
The Fooj deployment pipeline uses Option 2 (separate migration job) as the permanent fix.
NAT Gateway Configuration
Resource Details
| Property | Value |
|---|---|
| Resource name | fooj-shared-nat |
| Resource group | fooj-shared-network-rg |
| Public IP | 20.26.0.39 (static) |
| Idle timeout | 4 minutes |
| SKU | Standard |
| Zones | Zone-redundant |
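Before subnets are associated with the NAT Gateway, it can help to confirm that the address plan is internally consistent: each subnet must sit inside the shared VNet range and no two subnets may overlap. A minimal sketch using Python's standard `ipaddress` module follows; the CIDRs are the ones listed on this page, while `validate_plan` is a hypothetical helper name.

```python
import ipaddress
from typing import Dict, List

VNET = ipaddress.ip_network("10.20.0.0/16")
SUBNETS = {
    "fooj-stage-subnet": ipaddress.ip_network("10.20.1.0/24"),
    "fooj-prod-subnet": ipaddress.ip_network("10.20.2.0/24"),
}


def validate_plan(vnet: ipaddress.IPv4Network,
                  subnets: Dict[str, ipaddress.IPv4Network]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    for name, net in subnets.items():
        if not net.subnet_of(vnet):
            problems.append(f"{name} ({net}) is outside {vnet}")
    names = list(subnets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if subnets[a].overlaps(subnets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems
```

Calling `validate_plan(VNET, SUBNETS)` on the current plan returns an empty list.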
Subnet Association
fooj-shared-nat associated with:
- fooj-stage-subnet (10.20.1.0/24) ← Stage egress
- fooj-prod-subnet (10.20.2.0/24) ← Prod egress
IP Whitelisting Requirements
Third-party services that need to allowlist Fooj's egress IP:
| Service | Purpose | IP to whitelist |
|---|---|---|
| Payment gateway | Transaction processing | 20.26.0.39 |
| SMS provider | OTP / notifications | 20.26.0.39 |
| External data APIs | Business data feeds | 20.26.0.39 |
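Since every allowlist entry above depends on traffic actually leaving through 20.26.0.39, it is worth verifying the observed egress IP from inside a container after any network change. The sketch below is one possible check, not an existing Fooj tool: the function names are hypothetical, and `https://api.ipify.org` is used only as an example of a public service that echoes the caller's IP as plain text; any equivalent endpoint would do.

```python
import ipaddress
import urllib.request
from typing import Callable

EXPECTED_EGRESS_IP = "20.26.0.39"


def fetch_public_ip(url: str = "https://api.ipify.org", timeout: float = 5.0) -> str:
    """Ask an external echo service which source IP it sees (assumed plain-text response)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def egress_ip_matches(expected: str = EXPECTED_EGRESS_IP,
                      fetch: Callable[[], str] = fetch_public_ip) -> bool:
    """Return True when the observed egress IP equals the expected one."""
    observed = ipaddress.ip_address(fetch().strip())
    return observed == ipaddress.ip_address(expected)
```

Running `egress_ip_matches()` from both the stage and prod environments should return `True` if both subnets egress through the shared NAT Gateway.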
Monitoring
NAT Gateway Metrics (Azure Monitor)
Key metrics to watch on fooj-shared-nat:
| Metric | Alert Threshold | Meaning |
|---|---|---|
| SNATConnectionCount | > 10,000 | High SNAT port usage |
| SNATPortUtilization | > 75% | SNAT port exhaustion risk |
| PacketsDropped | > 100/min | Connectivity issues |
| ByteCount | Baseline + 2σ | Unexpected traffic spike |
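The "Baseline + 2σ" threshold for ByteCount can be derived from a window of historical samples as the mean plus two standard deviations. The following sketch shows the arithmetic; `spike_threshold` and `is_spike` are hypothetical helper names, and the use of the population standard deviation over a fixed sample window is an assumption, not a documented Fooj convention.

```python
import statistics
from typing import Sequence


def spike_threshold(samples: Sequence[float], sigmas: float = 2.0) -> float:
    """Baseline (mean) plus `sigmas` population standard deviations."""
    return statistics.fmean(samples) + sigmas * statistics.pstdev(samples)


def is_spike(value: float, samples: Sequence[float], sigmas: float = 2.0) -> bool:
    """True when `value` exceeds the baseline-derived threshold."""
    return value > spike_threshold(samples, sigmas)
```

For example, with hourly byte counts collected over a typical week as `samples`, `is_spike(current, samples)` flags readings above the 2σ band.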
WARNING
SNAT port exhaustion (SNATPortUtilization reaching 100%) means new outbound connections from Fooj will start failing. At current scale this is unlikely, but monitor as traffic grows.
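To put the 75% alert threshold in concrete terms: Azure NAT Gateway provides 64,512 SNAT ports per public IP address, and Fooj has one. The sketch below computes a worst-case estimate in which every concurrent flow targets the same destination IP and port; real utilization is usually lower because ports can be reused across different destinations. The function names are hypothetical.

```python
SNAT_PORTS_PER_PUBLIC_IP = 64_512  # Azure NAT Gateway allocation per public IP


def snat_utilization(concurrent_flows: int, public_ips: int = 1) -> float:
    """Worst-case SNAT port utilization in percent (all flows to one destination)."""
    return 100.0 * concurrent_flows / (SNAT_PORTS_PER_PUBLIC_IP * public_ips)


def flows_at_threshold(threshold_pct: float = 75.0, public_ips: int = 1) -> int:
    """Number of concurrent flows at which the given utilization threshold is reached."""
    return int(SNAT_PORTS_PER_PUBLIC_IP * public_ips * threshold_pct / 100)
```

Under this worst-case assumption, the 75% alert corresponds to roughly 48,000 concurrent flows to a single destination through the shared gateway.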
Alert Setup
bash
# Create SNAT port utilization alert
az monitor metrics alert create \
  --name "fooj-nat-snat-high" \
  --resource-group fooj-shared-network-rg \
  --scopes "/subscriptions/f2340b90-2a00-4551-aabc-6e1776e82077/resourceGroups/fooj-shared-network-rg/providers/Microsoft.Network/natGateways/fooj-shared-nat" \
  --condition "avg SNATPortUtilization > 75" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/.../actionGroups/fooj-ops-alerts"