Fooj Shared Egress Migration

Section: 17 — Fooj
Last Updated: 2026-05-30 (migration completed 2026-04-05)
Scope: Stage + Production CAE consolidation, shared NAT Gateway, init container gotcha

Summary

On 2026-04-05, the Fooj stage and production Container App Environments (CAE) were consolidated into a shared VNet with a single NAT Gateway. This reduced the number of public egress IPs from two (one per environment) to one (20.26.0.39).

Before the Migration

fooj-stage environment:
  VNet: 10.20.1.0/24
  NAT Gateway: fooj-stage-nat (separate public IP)
  Public IP: 20.26.X.X (stage-specific)

fooj-prod environment:
  VNet: 10.20.2.0/24
  NAT Gateway: fooj-prod-nat (separate public IP)
  Public IP: 20.26.Y.Y (prod-specific)

Each environment had its own NAT Gateway, resulting in two public IPs that needed to be whitelisted by external services (payment gateways, third-party APIs).

After the Migration

Shared VNet: 10.20.0.0/16
  Subnet stage:   10.20.1.0/24
  Subnet prod:    10.20.2.0/24

Shared NAT Gateway: fooj-shared-nat
  Public IP: 20.26.0.39 (SINGLE IP for ALL Fooj environments)
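
The address plan above only works if both subnets fall inside the shared VNet range. The following is a minimal sketch of that check in plain shell arithmetic; the function names `ip_to_int` and `cidr_contains` are illustrative and not part of any Fooj tooling.

```bash
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if CIDR block $2 (e.g. 10.20.1.0/24) lies entirely inside CIDR block $1.
cidr_contains() (
  outer_ip=${1%/*}; outer_len=${1#*/}
  inner_ip=${2%/*}; inner_len=${2#*/}
  # A block can only contain prefixes that are at least as long as its own.
  [ "$inner_len" -ge "$outer_len" ] || exit 1
  mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$outer_ip") & mask )) -eq $(( $(ip_to_int "$inner_ip") & mask )) ]
)

for subnet in 10.20.1.0/24 10.20.2.0/24; do
  if cidr_contains 10.20.0.0/16 "$subnet"; then
    echo "$subnet is inside 10.20.0.0/16"
  else
    echo "$subnet is OUTSIDE 10.20.0.0/16"
  fi
done
```

Both subnets share the first 16 bits (10.20) with the VNet, so both lines report "inside".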

Why Consolidate?

Benefits

Benefit	Detail
Single IP to whitelist	External services only need `20.26.0.39` — not two separate IPs
Cost reduction	One NAT Gateway (~$40/mo) instead of two (~$80/mo)
Simplified IP allowlisting	Payment gateways, SMS providers, and other third parties keep one allowlist entry for all Fooj environments
Unified egress policy	One place to manage firewall/DNAT rules

Trade-offs

Trade-off	Impact
Stage traffic on same IP as prod	Staging requests appear to come from the same IP as production — external rate limits apply jointly
Single point of egress	If the NAT Gateway has issues, both stage and prod lose outbound connectivity
Environment blending for whitelists	Third parties cannot distinguish stage vs prod traffic by source IP

NOTE

These trade-offs were accepted: the ~$40/mo savings and reduced IP management overhead outweigh the risks at Fooj's current scale. This decision can be revisited if Fooj's stage traffic starts consuming significant NAT bandwidth.

Init Container Gotcha

CAUTION

During the migration, a critical deployment issue was discovered with Azure Container Apps init containers and VNet-dependent resources.

The Problem

When re-deploying ACA environments into the new shared VNet, init containers that performed network-dependent operations (database migrations, Keycloak realm seeding) failed with connectivity errors:

Init container 'db-migrate' failed with exit code 1:
  Error: Connection refused: fooj-stage-sql.database.windows.net:1433

The root cause: init containers are scheduled before the ACA environment's VNet integration is fully established. The first pod scheduled after a VNet change or environment re-creation may not have network access until the underlying infrastructure fully propagates.

Symptoms

Init container fails on first deployment after VNet migration
Same init container works on re-deploy (because VNet is now stable)
No code changes between failing and succeeding deployment
Error is always connectivity-related (TCP connection refused, DNS not resolving)

Workaround

Option 1: Retry with exponential backoff in init container

dockerfile

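# Dockerfile for init container
# Retries the migration up to 5 times, doubling the delay after each failure (5s, 10s, 20s, 40s).
ENTRYPOINT ["sh", "-c", "\
  attempt=1; delay=5; \
  until dotnet Fooj.Migrations.dll; do \
    if [ $attempt -ge 5 ]; then echo \"Giving up after $attempt attempts.\"; exit 1; fi; \
    echo \"Attempt $attempt failed. Retrying in ${delay}s...\"; \
    sleep $delay; \
    attempt=$((attempt + 1)); \
    delay=$((delay * 2)); \
  done"]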
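
The same retry loop is easier to test and reuse when it lives in a script file rather than inline in the ENTRYPOINT. The following is a minimal sketch under that assumption; `retry_with_backoff` is an illustrative name, and the attempt count and base delay are parameters rather than values taken from the Fooj pipeline.

```bash
#!/bin/sh
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure.
retry_with_backoff() (
  max_attempts=$1; delay=$2; shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts." >&2
      exit 1
    fi
    echo "Attempt $attempt failed. Retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
)

# Example (commented out so the file can be sourced safely):
# retry_with_backoff 5 5 dotnet Fooj.Migrations.dll
```

With 5 attempts and a 5-second base delay, the waits are 5, 10, 20, and 40 seconds, so the init container gives the network roughly 75 seconds to become reachable before it exits with a failure.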

Option 2: Separate migration job (recommended)

Move database migrations out of init containers entirely. Run them as a separate Azure Container App Job triggered before the main deployment:

yaml

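# azure-pipelines.yml
- task: AzureCLI@2
  displayName: Run DB Migrations
  inputs:
    azureSubscription: '<service-connection-name>'  # placeholder: your Azure service connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az containerapp job start \
        --name fooj-db-migrate-job \
        --resource-group fooj-stage-rg \
        --wait-for-completion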

This approach:

Decouples migration from app startup
Makes migration failures visible in the pipeline (not silently killing pods)
Allows separate retry/rollback of migrations
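
For the pipeline to surface a failed migration, the step has to block until the job execution reaches a final state. If the installed CLI version returns as soon as the execution is created, one way to bridge the gap is to poll the execution status. The following is a sketch only: `wait_for_terminal_status` and `job_status` are hypothetical helpers, the `properties.status` query path is an assumption about the CLI's JSON output, and every status other than `Running` is treated as final.

```bash
#!/bin/sh
# wait_for_terminal_status INTERVAL_SECONDS MAX_POLLS STATUS_COMMAND [ARGS...]
# Calls STATUS_COMMAND repeatedly; it must print the execution status.
# Returns 0 on "Succeeded", 1 on any other non-"Running" status or on timeout.
wait_for_terminal_status() (
  interval=$1; max_polls=$2; shift 2
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    status=$("$@")
    case "$status" in
      Succeeded) echo "Migration job succeeded."; exit 0 ;;
      Running)   sleep "$interval" ;;
      *)         echo "Migration job ended with status: $status" >&2; exit 1 ;;
    esac
    polls=$((polls + 1))
  done
  echo "Timed out waiting for the migration job." >&2
  exit 1
)

# Hypothetical status command; the execution name and the query path are assumptions.
job_status() {
  az containerapp job execution show \
    --name fooj-db-migrate-job \
    --resource-group fooj-stage-rg \
    --job-execution-name "$1" \
    --query properties.status --output tsv
}
# Usage: wait_for_terminal_status 10 60 job_status "<execution-name>"
```

Passing the status command as an argument keeps the polling logic independent of the Azure CLI, so it can be exercised locally with a stub before it is wired into the pipeline.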

Option 3: Health check with grace period

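Add a startup probe with a long initialDelaySeconds to give the VNet time to stabilize. With the values below, the container has roughly 60 + 12 × 10 = 180 seconds to pass /health before it is restarted. Probes apply to the app container rather than to init containers, so this option only protects network-dependent startup work that runs in the app container itself: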

yaml

startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60    # Give VNet time to propagate
  periodSeconds: 10
  failureThreshold: 12

Current State

The Fooj deployment pipeline uses Option 2 (separate migration job) as the permanent fix.

NAT Gateway Configuration

Resource Details

Property	Value
Resource name	`fooj-shared-nat`
Resource group	`fooj-shared-network-rg`
Public IP	`20.26.0.39` (static)
Idle timeout	4 minutes
SKU	Standard
Zones	Zone-redundant

Subnet Association

fooj-shared-nat associated with:
  - fooj-stage-subnet (10.20.1.0/24) ← Stage egress
  - fooj-prod-subnet  (10.20.2.0/24) ← Prod egress
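
The association can be confirmed from the CLI by reading each subnet's attached NAT Gateway ID. The following is a minimal sketch; `subnet_uses_nat` is an illustrative helper, and the VNet name and the VNet's resource group in the usage comment are placeholders, since neither is stated on this page.

```bash
#!/bin/sh
# subnet_uses_nat RESOURCE_GROUP VNET_NAME SUBNET_NAME NAT_NAME
# Succeeds if the subnet's attached NAT Gateway resource ID ends in /natGateways/NAT_NAME.
subnet_uses_nat() (
  nat_id=$(az network vnet subnet show \
    --resource-group "$1" --vnet-name "$2" --name "$3" \
    --query natGateway.id --output tsv)
  case "$nat_id" in
    */natGateways/"$4") exit 0 ;;
    *) echo "$3: expected $4, found '${nat_id:-none}'" >&2; exit 1 ;;
  esac
)

# Usage ("fooj-shared-vnet" is a placeholder VNet name, not taken from this page):
# for subnet in fooj-stage-subnet fooj-prod-subnet; do
#   subnet_uses_nat fooj-shared-network-rg fooj-shared-vnet "$subnet" fooj-shared-nat
# done
```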

IP Whitelisting Requirements

Third-party services that need to allowlist Fooj's egress IP:

Service	Purpose	IP to whitelist
Payment gateway	Transaction processing	`20.26.0.39`
SMS provider	OTP / notifications	`20.26.0.39`
External data APIs	Business data feeds	`20.26.0.39`
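
Before asking a third party to update its allowlist, it helps to confirm which IP outbound traffic actually uses. The following is a minimal sketch meant to be run from a shell inside a stage or production container; it assumes `curl` is available there and uses api.ipify.org as an example IP echo service, which is an assumption and not something Fooj is known to depend on.

```bash
#!/bin/sh
# Print the public IP that outbound requests appear to come from.
# api.ipify.org is an assumed third-party echo service; swap in any equivalent.
fetch_public_ip() {
  curl --silent --fail --max-time 10 https://api.ipify.org
}

# check_egress_ip EXPECTED_IP
# Succeeds only if the observed egress IP equals EXPECTED_IP.
check_egress_ip() (
  observed=$(fetch_public_ip) || { echo "Could not determine egress IP." >&2; exit 2; }
  if [ "$observed" = "$1" ]; then
    echo "Egress IP is $observed, as expected."
  else
    echo "Egress IP is $observed, expected $1." >&2
    exit 1
  fi
)

# Usage:
# check_egress_ip 20.26.0.39
```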

Monitoring

NAT Gateway Metrics (Azure Monitor)

Key metrics to watch on fooj-shared-nat:

Metric	Alert Threshold	Meaning
`SNATConnectionCount`	> 10,000	High SNAT port usage
`SNATPortUtilization`	> 75%	SNAT port exhaustion risk
`PacketDropCount`	> 100/min	Connectivity issues
`ByteCount`	Baseline + 2σ	Unexpected traffic spike
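
The "Baseline + 2σ" threshold is the mean of recent samples plus two standard deviations. The following is a minimal sketch of that calculation; `spike_threshold` is an illustrative name, the sample values are made up, and it uses the population standard deviation, which is one reasonable reading of σ here.

```bash
#!/bin/sh
# spike_threshold SAMPLE [SAMPLE...]
# Prints mean + 2 * population standard deviation of the samples, to two decimals.
spike_threshold() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      variance = sumsq / n - mean * mean
      if (variance < 0) variance = 0   # guard against floating-point rounding
      printf "%.2f\n", mean + 2 * sqrt(variance)
    }'
}

# Example with made-up hourly ByteCount samples (in MB):
spike_threshold 2 4 4 4 5 5 7 9   # mean 5, standard deviation 2, prints 9.00
```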

WARNING

SNAT port exhaustion (SNATPortUtilization reaching 100%) means new outbound connections from Fooj will start failing. At current scale this is unlikely, but monitor as traffic grows.
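
For a rough sense of headroom, connection counts can be compared with the SNAT port inventory: Azure NAT Gateway provides 64,512 SNAT ports per attached public IP. The following sketch assumes one port per connection, which overstates real usage because ports can be reused across different destinations; `snat_utilization_pct` is an illustrative helper.

```bash
#!/bin/sh
# Azure NAT Gateway provides 64,512 SNAT ports per attached public IP address.
SNAT_PORTS_PER_IP=64512

# snat_utilization_pct CONNECTIONS [PUBLIC_IP_COUNT]
# Prints a rough, integer percentage, assuming one SNAT port per connection.
snat_utilization_pct() {
  echo $(( $1 * 100 / (SNAT_PORTS_PER_IP * ${2:-1}) ))
}

snat_utilization_pct 10000      # the SNATConnectionCount alert threshold: prints 15
snat_utilization_pct 48384      # 75% of 64,512 ports: prints 75
```

Under this assumption, the 10,000-connection alert fires at roughly 15% of the single IP's port inventory, well before the 75% utilization threshold.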

Alert Setup

bash

# Create SNAT port utilization alert
az monitor metrics alert create \
  --name "fooj-nat-snat-high" \
  --resource-group "fooj-shared-network-rg" \
  --scopes "/subscriptions/f2340b90-2a00-4551-aabc-6e1776e82077/resourceGroups/fooj-shared-network-rg/providers/Microsoft.Network/natGateways/fooj-shared-nat" \
  --condition "avg SNATPortUtilization > 75" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/.../actionGroups/fooj-ops-alerts"
