
Fooj Shared Egress Migration

Section: 17 — Fooj
Last Updated: 2026-05-30 (migration completed 2026-04-05)
Scope: Stage + Production CAE consolidation, shared NAT Gateway, init container gotcha


Summary

On 2026-04-05, the Fooj stage and production Container App Environments were consolidated into a shared VNet with a single NAT Gateway. This reduced the number of public egress IPs from 2 (one per env) to 1 (20.26.0.39).


Before the Migration

fooj-stage environment:
  VNet: 10.20.1.0/24
  NAT Gateway: fooj-stage-nat (separate public IP)
  Public IP: 20.26.X.X (stage-specific)

fooj-prod environment:
  VNet: 10.20.2.0/24
  NAT Gateway: fooj-prod-nat (separate public IP)
  Public IP: 20.26.Y.Y (prod-specific)

Each environment had its own NAT Gateway, resulting in two public IPs that needed to be whitelisted by external services (payment gateways, third-party APIs).


After the Migration

Shared VNet: 10.20.0.0/16
  Subnet stage:   10.20.1.0/24
  Subnet prod:    10.20.2.0/24

Shared NAT Gateway: fooj-shared-nat
  Public IP: 20.26.0.39 (SINGLE IP for ALL Fooj environments)
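The address layout above can be sanity-checked with a few lines of bash arithmetic. This is a minimal sketch, not part of the Fooj tooling: the helper names ip_to_int and cidr_contains are illustrative, and it assumes bash (not plain sh) with 64-bit arithmetic.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if CIDR block $2 lies entirely inside CIDR block $1.
cidr_contains() {
  local outer_ip=${1%/*} outer_len=${1#*/}
  local inner_ip=${2%/*} inner_len=${2#*/}
  # A block with a shorter prefix cannot fit inside one with a longer prefix.
  (( inner_len >= outer_len )) || return 1
  local mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$outer_ip") & mask ) == ( $(ip_to_int "$inner_ip") & mask ) ))
}

# Check both Fooj subnets against the shared VNet range.
for subnet in 10.20.1.0/24 10.20.2.0/24; do
  if cidr_contains 10.20.0.0/16 "$subnet"; then
    echo "$subnet is inside 10.20.0.0/16"
  else
    echo "$subnet is OUTSIDE 10.20.0.0/16"
  fi
done
```

Both subnets share the 10.20 prefix, so each one is reported as inside the shared VNet.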

Why Consolidate?

Benefits

| Benefit | Detail |
| --- | --- |
| Single IP to whitelist | External services only need 20.26.0.39, not two separate IPs |
| Cost reduction | One NAT Gateway (~$40/mo) instead of two (~$80/mo) |
| Simplified IP allowlisting | Payment gateways, SMS providers, etc. |
| Unified egress policy | One place to manage firewall/DNAT rules |

Trade-offs

| Trade-off | Impact |
| --- | --- |
| Stage traffic on same IP as prod | Staging requests appear to come from the same IP as production; external rate limits apply jointly |
| Single point of egress | If the NAT Gateway has issues, both stage and prod lose outbound connectivity |
| Environment blending for whitelists | Third parties cannot distinguish stage vs prod traffic by source IP |

NOTE

The cost trade-off was accepted: the ~$40/mo savings and reduced IP management overhead outweigh the risk for Fooj's current scale. This decision can be revisited if Fooj's stage traffic starts consuming significant NAT bandwidth.


Init Container Gotcha

CAUTION

During the migration, a critical deployment issue was discovered with Azure Container Apps init containers and VNet-dependent resources.

The Problem

When re-deploying ACA environments into the new shared VNet, init containers that performed network-dependent operations (database migrations, Keycloak realm seeding) failed with connectivity errors:

Init container 'db-migrate' failed with exit code 1:
  Error: Connection refused: fooj-stage-sql.database.windows.net:1433

The root cause: init containers are scheduled before the ACA environment's VNet integration is fully established. The first pod scheduled after a VNet change or environment re-creation may not have network access until the underlying infrastructure fully propagates.

Symptoms

  • Init container fails on first deployment after VNet migration
  • Same init container works on re-deploy (because VNet is now stable)
  • No code changes between failing and succeeding deployment
  • Error is always connectivity-related (TCP connection refused, DNS not resolving)
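
To tell the two connectivity symptoms apart (DNS not resolving versus TCP refused), a small probe can be run from the failing container. This is a sketch under assumptions: the function name probe_endpoint is illustrative, and it needs bash, getent, and timeout to be present in the image, which slim base images may not include.

```shell
#!/usr/bin/env bash
# probe_endpoint HOST PORT
# Prints "dns-failure", "tcp-failure", or "ok" and returns 1, 2, or 0 respectively.
probe_endpoint() {
  local host=$1 port=$2
  # Step 1: can the name be resolved at all?
  if ! getent hosts "$host" >/dev/null 2>&1; then
    echo "dns-failure"
    return 1
  fi
  # Step 2: can a TCP connection be opened within 5 seconds?
  if ! timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "tcp-failure"
    return 2
  fi
  echo "ok"
}

# Example, using the SQL endpoint from the error above:
#   probe_endpoint fooj-stage-sql.database.windows.net 1433
```

A "dns-failure" on the first deployment that turns into "ok" on re-deploy, with no code change in between, matches the symptom pattern listed above.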

Workaround

Option 1: Retry with exponential backoff in init container

dockerfile
# Dockerfile for init container
# Up to 5 attempts; the delay doubles after each failure (5s, 10s, 20s, 40s).
ENTRYPOINT ["sh", "-c", "\
  attempt=1; delay=5; \
  until dotnet Fooj.Migrations.dll; do \
    if [ $attempt -ge 5 ]; then echo \"Giving up after $attempt attempts.\"; exit 1; fi; \
    echo \"Attempt $attempt failed. Retrying in ${delay}s...\"; \
    sleep $delay; \
    attempt=$((attempt + 1)); delay=$((delay * 2)); \
  done"]

Option 2: Separate migration job (recommended)

Move database migrations out of init containers entirely. Run them as a separate Azure Container App Job triggered before the main deployment:

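yaml
# azure-pipelines.yml
- task: AzureCLI@2
  displayName: Run DB Migrations
  inputs:
    azureSubscription: fooj-stage-connection   # placeholder service connection name
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      # Start the job, then poll its execution until it stops running.
      execution=$(az containerapp job start --name fooj-db-migrate-job \
        --resource-group fooj-stage-rg --query name --output tsv)
      status=Running
      while [ "$status" = "Running" ]; do
        sleep 10
        status=$(az containerapp job execution show --name fooj-db-migrate-job \
          --resource-group fooj-stage-rg --job-execution-name "$execution" \
          --query properties.status --output tsv)
      done
      [ "$status" = "Succeeded" ]   # non-zero exit fails the step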

This approach:

  • Decouples migration from app startup
  • Makes migration failures visible in the pipeline (not silently killing pods)
  • Allows separate retry/rollback of migrations

Option 3: Health check with grace period

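Add a startup probe with a long initialDelaySeconds so the platform does not restart the app while the VNet stabilizes. Probes apply only to the main application containers, not to init containers, so this option helps only when the app itself makes the first network-dependent call: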

yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60    # Give VNet time to propagate
  periodSeconds: 10
  failureThreshold: 12

Current State

The Fooj deployment pipeline uses Option 2 (separate migration job) as the permanent fix.


NAT Gateway Configuration

Resource Details

| Property | Value |
| --- | --- |
| Resource name | fooj-shared-nat |
| Resource group | fooj-shared-network-rg |
| Public IP | 20.26.0.39 (static) |
| Idle timeout | 4 minutes |
| SKU | Standard |
| Zones | Zone-redundant |

Subnet Association

fooj-shared-nat associated with:
  - fooj-stage-subnet (10.20.1.0/24) ← Stage egress
  - fooj-prod-subnet  (10.20.2.0/24) ← Prod egress
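
The association above could be applied with the Azure CLI roughly as follows. This is a sketch, not the command the team actually ran: the VNet name fooj-shared-vnet and the assumption that the VNet lives in fooj-shared-network-rg are hypothetical, and the wrapper function attach_nat_gateway is illustrative.

```shell
#!/usr/bin/env bash
# Attach a NAT gateway to one subnet.
# Arguments: VNet resource group, VNet name, subnet name, NAT gateway name or resource ID.
attach_nat_gateway() {
  local vnet_rg=$1 vnet_name=$2 subnet_name=$3 nat_gateway=$4
  az network vnet subnet update \
    --resource-group "$vnet_rg" \
    --vnet-name "$vnet_name" \
    --name "$subnet_name" \
    --nat-gateway "$nat_gateway"
}

# Usage (hypothetical VNet name; a bare gateway name works when it is in the same resource group):
#   attach_nat_gateway fooj-shared-network-rg fooj-shared-vnet fooj-stage-subnet fooj-shared-nat
#   attach_nat_gateway fooj-shared-network-rg fooj-shared-vnet fooj-prod-subnet  fooj-shared-nat
```

Running the update once per subnet keeps the stage and prod associations independent, so either can be detached later without touching the other.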

IP Whitelisting Requirements

Third-party services that need to allowlist Fooj's egress IP:

| Service | Purpose | IP to whitelist |
| --- | --- | --- |
| Payment gateway | Transaction processing | 20.26.0.39 |
| SMS provider | OTP / notifications | 20.26.0.39 |
| External data APIs | Business data feeds | 20.26.0.39 |
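
Before asking a third party to update its allowlist, it can help to confirm that a container really egresses through the shared IP. The sketch below compares an observed public IP with the expected one. The function name check_egress_ip is illustrative, and the usage comment assumes curl plus an external IP echo service (api.ipify.org is used only as an example) are reachable from the container.

```shell
#!/usr/bin/env bash
EXPECTED_EGRESS_IP="20.26.0.39"

# Compare an observed public IP with the expected shared egress IP.
check_egress_ip() {
  local observed=$1
  if [ "$observed" = "$EXPECTED_EGRESS_IP" ]; then
    echo "OK: egress via $observed"
  else
    echo "MISMATCH: expected $EXPECTED_EGRESS_IP, observed ${observed:-<none>}"
    return 1
  fi
}

# Usage from inside a stage or prod container (needs outbound internet access):
#   check_egress_ip "$(curl -fsS --max-time 10 https://api.ipify.org)"
```

Running the check in both environments should print the same OK line, since stage and prod share one NAT Gateway.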

Monitoring

NAT Gateway Metrics (Azure Monitor)

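Key metrics to watch on fooj-shared-nat. Metric names differ between Azure networking services, so confirm the exact names this resource exposes with az monitor metrics list-definitions before creating alerts: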

| Metric | Alert Threshold | Meaning |
| --- | --- | --- |
| SNATConnectionCount | > 10,000 | High SNAT port usage |
| SNATPortUtilization | > 75% | SNAT port exhaustion risk |
| PacketsDropped | > 100/min | Connectivity issues |
| ByteCount | Baseline + 2σ | Unexpected traffic spike |

WARNING

SNAT port exhaustion (SNATPortUtilization reaching 100%) means new outbound connections from Fooj will start failing. At current scale this is unlikely, but monitor as traffic grows.
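
For a rough sense of headroom, the thresholds can be related to the SNAT port inventory. Azure NAT Gateway provides 64,512 SNAT ports per public IP, and fooj-shared-nat has one. The sketch below is a conservative estimate, not how Azure computes its metrics: it treats each connection as holding one port, although NAT Gateway can reuse a port across different destinations. The function name snat_utilization_pct is illustrative.

```shell
#!/usr/bin/env bash
PORTS_PER_PUBLIC_IP=64512   # SNAT ports Azure NAT Gateway provides per public IP

# Integer percentage of the SNAT port inventory used by a given number of ports.
# Arguments: ports in use, number of public IPs on the gateway (default 1).
snat_utilization_pct() {
  local ports_in_use=$1 public_ips=${2:-1}
  echo $(( ports_in_use * 100 / (PORTS_PER_PUBLIC_IP * public_ips) ))
}

snat_utilization_pct 10000   # prints 15
snat_utilization_pct 48384   # prints 75
```

Under this estimate the 10,000-connection threshold corresponds to about 15% of a single IP's ports, and the 75% threshold to roughly 48,000 ports in use.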

Alert Setup

bash
# Create SNAT port utilization alert
az monitor metrics alert create \
  --name "fooj-nat-snat-high" \
  --resource-group fooj-shared-network-rg \
  --scopes "/subscriptions/f2340b90-2a00-4551-aabc-6e1776e82077/resourceGroups/fooj-shared-network-rg/providers/Microsoft.Network/natGateways/fooj-shared-nat" \
  --condition "avg SNATPortUtilization > 75" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/.../actionGroups/fooj-ops-alerts"

Internal Documentation — Microtec Platform Team