Production Monitoring & Deployment Automation

This guide covers practical patterns for integrating api-shield into the scripts and pipelines that keep your production systems healthy.

Monitoring scripts

Poll route health via the REST API

The ShieldAdmin REST API is JSON over HTTP, so any monitoring tool that can make an HTTP request can query it. The monitoring host does not need the shield CLI installed.

#!/usr/bin/env bash
# check-routes.sh — exit 1 if any route is unexpectedly disabled

SHIELD_URL="${SHIELD_SERVER_URL:-http://localhost:8000/shield}"
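TOKEN="${SHIELD_TOKEN:?SHIELD_TOKEN must be set}"

# -f fails on HTTP errors; --max-time keeps a hung server from stalling the check
if ! routes=$(curl -sf --max-time 10 \
  -H "X-Shield-Token: $TOKEN" \
  "$SHIELD_URL/api/routes"); then
  echo "ERROR: Could not reach ShieldAdmin at $SHIELD_URL" >&2
  exit 1
fi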

# Alert on any DISABLED route (adapt jq filter to your alert threshold)
disabled=$(echo "$routes" | jq -r '.[] | select(.status == "disabled") | .path')

if [ -n "$disabled" ]; then
  echo "ALERT: The following routes are disabled:"
  echo "$disabled"
  exit 1
fi

echo "OK: all routes nominal"

Run this from cron, Datadog, or any scheduler:

*/5 * * * * /opt/scripts/check-routes.sh >> /var/log/shield-monitor.log 2>&1

Python monitoring script

#!/usr/bin/env python3
"""monitor_routes.py — check api-shield route states and alert on anomalies."""

import os
import sys
import httpx

SHIELD_URL = os.environ.get("SHIELD_SERVER_URL", "http://localhost:8000/shield")
TOKEN = os.environ["SHIELD_TOKEN"]

ALERT_ON = {"disabled", "maintenance"}   # statuses that warrant an alert


def fetch_routes() -> list[dict]:
    resp = httpx.get(
        f"{SHIELD_URL}/api/routes",
        headers={"X-Shield-Token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def main() -> int:
    try:
        routes = fetch_routes()
    except httpx.HTTPError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return 2   # could not check; distinct from 1 so the scheduler can tell "unknown" from "alert"

    alerts = [r for r in routes if r["status"] in ALERT_ON]

    if alerts:
        for r in alerts:
            print(f"ALERT  {r['status'].upper():<12} {r['path']}  reason={r.get('reason', '')!r}")
        return 1

    print(f"OK  {len(routes)} route(s) nominal")
    return 0


if __name__ == "__main__":
    sys.exit(main())
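
A single failed poll can be a transient network blip rather than a real outage, so it may be worth retrying before reporting an error. The helper below is a minimal sketch of that idea; `with_retries` and its parameters are hypothetical names, not part of api-shield.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or attempts run out, then re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay * attempt)  # linear backoff: delay, 2*delay, ...
    raise AssertionError("unreachable")
```

In the script above, `routes = with_retries(fetch_routes)` would replace the direct call inside `main()`. In practice you would probably narrow the `except` clause to `httpx.HTTPError` so that programming errors are not retried.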

Webhook alerting (Slack / PagerDuty)

api-shield fires a webhook on every state change: enable, disable, and maintenance on or off. Webhooks are always delivered by the process that owns the engine where the state mutation happens, so where you register them depends on your deployment mode.

Embedded mode (single service)

Register directly on the engine before mounting ShieldAdmin:

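import os

from fastapi import FastAPI   # assumes the host application is a FastAPI app
from shield.core.engine import ShieldEngine
from shield.core.webhooks import SlackWebhookFormatter
from shield.fastapi import ShieldAdmin

app = FastAPI()

engine = ShieldEngine()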
engine.add_webhook(
    url=os.environ["SLACK_WEBHOOK_URL"],
    formatter=SlackWebhookFormatter(),
)
engine.add_webhook(url=os.environ["PAGERDUTY_WEBHOOK_URL"])

admin = ShieldAdmin(engine=engine, auth=("admin", os.environ["SHIELD_PASS"]))
app.mount("/shield", admin)

Shield Server mode (multi-service)

State mutations happen on the Shield Server, not on SDK clients. Build the engine explicitly so you can call add_webhook() on it before passing it to ShieldAdmin:

# shield_server.py
import os
from shield.core.engine import ShieldEngine
from shield.core.backends.redis import RedisBackend
from shield.core.webhooks import SlackWebhookFormatter
from shield.admin.app import ShieldAdmin

engine = ShieldEngine(backend=RedisBackend(os.environ["REDIS_URL"]))
engine.add_webhook(
    url=os.environ["SLACK_WEBHOOK_URL"],
    formatter=SlackWebhookFormatter(),
)
engine.add_webhook(url=os.environ["PAGERDUTY_WEBHOOK_URL"])

shield_app = ShieldAdmin(
    engine=engine,
    auth=("admin", os.environ["SHIELD_PASS"]),
    secret_key=os.environ["SHIELD_SECRET_KEY"],
)

Note

SDK service apps (ShieldSDK) never fire webhooks. They only enforce state locally — all mutations and therefore all webhook triggers originate on the Shield Server.

Webhook payload sent on every state change:

{
  "event": "maintenance_on",
  "path": "GET:/payments",
  "reason": "DB migration",
  "timestamp": "2025-06-01T02:00:00Z",
  "state": { "path": "GET:/payments", "status": "maintenance", ... }
}
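
If you route webhooks to your own receiver instead of Slack or PagerDuty, the payload above can be turned into a one-line alert. This is a minimal sketch that relies only on the fields shown in the sample payload; `summarize_event` is a hypothetical helper, and missing fields fall back to placeholders rather than raising.

```python
import json


def summarize_event(raw: str) -> str:
    """Turn a webhook payload (JSON text) into a single log/alert line."""
    payload = json.loads(raw)
    event = payload.get("event", "unknown")
    path = payload.get("path", "?")
    reason = payload.get("reason") or "no reason given"
    timestamp = payload.get("timestamp", "")
    # "state" carries the full route state; only its status is used here
    status = payload.get("state", {}).get("status", "")
    return f"[{timestamp}] {event} {path} (status={status}): {reason}"
```

The function is deliberately free of any HTTP framework so it can be dropped into whatever receives the POST request.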

Webhook failures are non-blocking; they are logged and never affect the request path. On multi-node Shield Server deployments (RedisBackend), Redis SET NX deduplication ensures only one node fires per event.
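
To illustrate the idea behind SET NX deduplication: every node tries to claim the same per-event key, and only the node whose claim succeeds delivers the webhook. The sketch below uses an in-memory stand-in for Redis so it runs anywhere; the key layout and the names `FakeRedis`, `event_key`, and `should_fire` are assumptions for illustration, not api-shield internals.

```python
import hashlib


class FakeRedis:
    """In-memory stand-in that mimics SET key value NX (set only if absent)."""

    def __init__(self) -> None:
        self._data = {}

    def set_nx(self, key: str, value: str) -> bool:
        if key in self._data:
            return False  # another node already claimed this event
        self._data[key] = value
        return True


def event_key(event: str, path: str, timestamp: str) -> str:
    """Derive one key per state change (hypothetical key layout)."""
    digest = hashlib.sha256(f"{event}|{path}|{timestamp}".encode()).hexdigest()
    return f"shield:webhook:{digest[:16]}"


def should_fire(store: FakeRedis, node: str, event: str, path: str, timestamp: str) -> bool:
    """Return True only for the first node to claim the event."""
    return store.set_nx(event_key(event, path, timestamp), node)
```

With the real redis-py client the equivalent claim would be `client.set(key, node, nx=True, ex=60)`, where the expiry keeps stale keys from accumulating.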

Deployment automation

Pre/post deploy maintenance pattern

The safest deployment pattern: enable global maintenance before the deploy, run migrations, roll out the new version, wait for it to pass its health check, then lift maintenance.

#!/usr/bin/env bash
# deploy.sh
set -euo pipefail

SHIELD_URL="${SHIELD_SERVER_URL:-http://localhost:8000/shield}"

shield_cmd() {
  shield --server-url "$SHIELD_URL" "$@"
}

echo "==> Enabling global maintenance..."
shield_cmd global enable \
  --reason "Deploying v$(cat VERSION) — back in ~5 minutes" \
  --exempt /health \
  --exempt GET:/readiness

echo "==> Running migrations..."
uv run alembic upgrade head

echo "==> Deploying new container..."
docker compose up -d --no-deps --build api

echo "==> Waiting for health check..."
until curl -sf http://localhost:8000/health; do sleep 2; done

echo "==> Disabling global maintenance..."
shield_cmd global disable

echo "==> Deploy complete."
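
If your deploy tooling is written in Python, the same enable/deploy/disable sequence can be wrapped in a context manager so that maintenance is lifted even when a step fails. This is a minimal sketch under stated assumptions: it shells out to the `shield global enable` and `shield global disable` commands used above, and `global_maintenance` and `run_cli` are hypothetical helper names.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

Runner = Callable[[Sequence[str]], None]


def run_cli(args: Sequence[str]) -> None:
    """Run the shield CLI and raise if it exits non-zero."""
    subprocess.run(["shield", *args], check=True)


@contextmanager
def global_maintenance(
    reason: str,
    exempt: Sequence[str] = (),
    run: Runner = run_cli,
) -> Iterator[None]:
    """Enable global maintenance for the duration of the with-block."""
    args = ["global", "enable", "--reason", reason]
    for route in exempt:
        args += ["--exempt", route]
    run(args)
    try:
        yield
    finally:
        # Runs on success and on failure, mirroring `if: always()` in CI
        run(["global", "disable"])
```

Usage would look like `with global_maintenance("Deploying v1.2", exempt=["/health"]): deploy()`. Whether maintenance should be lifted after a failed deploy is a judgment call; if a half-applied migration makes the API unsafe to serve, you may prefer to leave it on and alert instead.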

Route-level rolling deploy

For zero-downtime deploys where only specific routes need to go offline:

#!/usr/bin/env bash
# rolling-deploy.sh
set -euo pipefail

# Note: `date -d '+10 minutes'` is GNU date syntax (Linux); BSD/macOS date uses `-v+10M` instead
shield maintenance "POST:/orders" \
  --reason "Order service upgrade — ETA 10 minutes" \
  --start "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --end "$(date -u -d '+10 minutes' +%Y-%m-%dT%H:%M:%SZ)"

# ... deploy only the orders service ...
docker compose up -d --no-deps --build orders

# Wait for readiness
until curl -sf http://localhost:8001/health; do sleep 2; done

shield enable "POST:/orders"
echo "Orders service back online."
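
The `until curl ...` loops in these scripts wait forever if the new container never becomes healthy. A bounded wait is safer in automation. The sketch below shows one way to add a deadline; `http_ok` and `wait_until_healthy` are hypothetical helpers, and the clock and sleep functions are injectable so the logic can be tested without real delays.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # refused, timed out, or non-2xx response


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it passes or the deadline expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A deploy script could call `wait_until_healthy(lambda: http_ok("http://localhost:8001/health"))` and abort, leaving the route in maintenance, when it returns `False`.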

GitHub Actions — deploy workflow

# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      SHIELD_SERVER_URL: ${{ secrets.SHIELD_SERVER_URL }}

    steps:
      - uses: actions/checkout@v4

      - name: Install shield CLI
        run: pip install "api-shield[cli]"

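      - name: Authenticate with ShieldAdmin
        # Pass secrets through env and quote them so special characters cannot break or inject into the shell command
        env:
          SHIELD_USER: ${{ secrets.SHIELD_USER }}
          SHIELD_PASS: ${{ secrets.SHIELD_PASS }}
        run: shield login "$SHIELD_USER" --password "$SHIELD_PASS"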

      - name: Enable global maintenance
        run: |
          shield global enable \
            --reason "GitHub Actions deploy — commit ${{ github.sha }}" \
            --exempt /health

      - name: Run database migrations
        run: uv run alembic upgrade head

      - name: Deploy application
        run: |
          # your deploy command here
          kubectl set image deployment/api api=${{ env.IMAGE_TAG }}
          kubectl rollout status deployment/api --timeout=120s

      - name: Disable global maintenance
        if: always()   # run even if a previous step failed
        run: shield global disable

      - name: Verify routes
        run: |
          shield status
          # fail the workflow if any route is still disabled
          # (grep -qv would succeed as soon as one line lacks DISABLED, so it never fails)
          if shield status | grep -q DISABLED; then exit 1; fi

Always disable on failure

Use if: always() on the disable step so maintenance mode is lifted even when the deploy fails. Pair it with a Slack webhook so the team is notified immediately.

Kubernetes — pre/post deploy hooks

Use Kubernetes lifecycle hooks to tie maintenance mode to pod lifecycle:

# k8s/deployment.yaml
spec:
  template:
    spec:
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                command:
                  - sh
                  - -c
                  - |
                    shield --server-url "$SHIELD_SERVER_URL" \
                      maintenance GET:/payments \
                      --reason "Pod shutting down (rolling update)"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5

And a post-deploy Job to re-enable:

# k8s/post-deploy-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: shield-enable-routes
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: shield-cli
          image: python:3.13-slim
          command:
            - sh
            - -c
            - |
              pip install -q "api-shield[cli]"
              shield login "$SHIELD_USER" --password "$SHIELD_PASS"
              shield enable GET:/payments
              shield global disable
          env:
            - name: SHIELD_SERVER_URL
              value: "http://api-svc/shield"
            - name: SHIELD_USER
              valueFrom:
                secretKeyRef: { name: shield-creds, key: username }
            - name: SHIELD_PASS
              valueFrom:
                secretKeyRef: { name: shield-creds, key: password }

Scheduled maintenance via cron + CLI

For recurring maintenance windows (nightly jobs, weekly DB vacuums):

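# crontab: runs every Sunday at 02:00 (host clock assumed to be UTC) and schedules a 02:00 to 04:00 UTC window
# A crontab entry must fit on one line (cron has no backslash continuation), and % must be escaped as \%
0 2 * * 0 shield schedule GET:/reports --start "$(date -u +\%Y-\%m-\%dT02:00:00Z)" --end "$(date -u +\%Y-\%m-\%dT04:00:00Z)" --reason "Weekly report rebuild"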

Or schedule programmatically from Python:

import asyncio
from datetime import datetime, UTC, timedelta
from shield.core.engine import ShieldEngine
from shield.core.models import MaintenanceWindow

async def schedule_nightly(engine: ShieldEngine) -> None:
    now = datetime.now(UTC)
    tonight = now.replace(hour=2, minute=0, second=0, microsecond=0)
    if tonight < now:
        tonight += timedelta(days=1)

    window = MaintenanceWindow(
        start=tonight,
        end=tonight + timedelta(hours=2),
        reason="Nightly data pipeline",
    )
    await engine.schedule_maintenance("GET:/reports", window=window)
    print(f"Scheduled maintenance: {window.start} → {window.end}")
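
For the weekly case, the window boundaries can be computed ahead of time rather than at the moment the window opens. The function below is a minimal sketch of that calculation; `next_weekly_window` and `to_cli` are hypothetical helpers, and the timestamp format matches the one used with `--start` and `--end` above.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def next_weekly_window(
    now: datetime,
    weekday: int = 6,  # Monday=0 ... Sunday=6
    hour: int = 2,
    duration: timedelta = timedelta(hours=2),
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the next window that begins strictly after now."""
    start = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    start += timedelta(days=(weekday - now.weekday()) % 7)
    if start <= now:
        start += timedelta(days=7)  # this week's slot already started
    return start, start + duration


def to_cli(ts: datetime) -> str:
    """Format a UTC datetime the way the CLI examples do."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

The resulting pair can be passed to `MaintenanceWindow(start=..., end=...)` as in the example above, or formatted with `to_cli` for the `shield schedule` command.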

Audit log in monitoring pipelines

Pull the audit log to detect unexpected state changes (e.g. a route disabled by an unknown actor):

#!/usr/bin/env python3
"""audit-sentinel.py — alert on unexpected route state changes."""

import os
import sys
from datetime import datetime, UTC, timedelta

import httpx

SHIELD_URL = os.environ.get("SHIELD_SERVER_URL", "http://localhost:8000/shield")
TOKEN = os.environ["SHIELD_TOKEN"]
LOOKBACK = timedelta(minutes=15)

resp = httpx.get(
    f"{SHIELD_URL}/api/audit?limit=50",
    headers={"X-Shield-Token": TOKEN},   # same header as the route-polling scripts
    timeout=10,
)
resp.raise_for_status()

cutoff = datetime.now(UTC) - LOOKBACK
unexpected = [
    e for e in resp.json()
    if datetime.fromisoformat(e["timestamp"]) > cutoff
    and e["actor"] not in {"system", "deploy-bot", "alice", "bob"}
]

if unexpected:
    for e in unexpected:
        print(f"UNKNOWN ACTOR  {e['actor']}  {e['action']}  {e['path']}  {e['timestamp']}")
    sys.exit(1)

print("OK")
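
The filtering step is easier to test when it is separated from the HTTP call. The sketch below extracts it into a pure function and also tolerates a trailing `Z` and timestamps without a timezone; `parse_ts` and `find_unexpected` are hypothetical names, and treating naive timestamps as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp into an aware datetime."""
    # Python < 3.11 does not accept a trailing "Z" in fromisoformat
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return ts


def find_unexpected(
    entries: Iterable[dict],
    now: datetime,
    lookback: timedelta,
    known_actors: Set[str],
) -> List[dict]:
    """Return recent audit entries whose actor is not on the allow-list."""
    cutoff = now - lookback
    return [
        e for e in entries
        if parse_ts(e["timestamp"]) > cutoff
        and e.get("actor") not in known_actors
    ]
```

The script above could then call `find_unexpected(resp.json(), datetime.now(timezone.utc), LOOKBACK, {"system", "deploy-bot"})` and keep its existing reporting loop.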

Environment variable reference

| Variable | Used by | Description |
| --- | --- | --- |
| `SHIELD_SERVER_URL` | CLI, monitoring scripts | Base URL of the `ShieldAdmin` mount point |
| `SHIELD_TOKEN` | Monitoring scripts (direct API calls) | Bearer token from `shield login` |
| `SHIELD_BACKEND` | App server | Backend type: `memory`, `file`, `redis` |
| `SHIELD_ENV` | App server | Current environment name (`dev`, `staging`, `production`) |
| `SHIELD_REDIS_URL` | App server | Redis connection URL for `RedisBackend` |
| `SHIELD_FILE_PATH` | App server | JSON file path for `FileBackend` |
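
A pre-flight check in a deploy script can catch a misconfigured environment before the app server starts. The sketch below reads the app-server variables from the table and validates them; the defaults (`memory`, `dev`) and the rule that `redis` and `file` backends need their companion variable are assumptions made for illustration, not documented api-shield behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_BACKENDS = {"memory", "file", "redis"}


@dataclass(frozen=True)
class ShieldSettings:
    backend: str
    env: str
    redis_url: Optional[str]
    file_path: Optional[str]


def load_settings(environ: Mapping[str, str] = os.environ) -> ShieldSettings:
    """Read and sanity-check the SHIELD_* app-server variables."""
    backend = environ.get("SHIELD_BACKEND", "memory")  # assumed default
    if backend not in VALID_BACKENDS:
        raise ValueError(f"SHIELD_BACKEND must be one of {sorted(VALID_BACKENDS)}")
    redis_url = environ.get("SHIELD_REDIS_URL")
    file_path = environ.get("SHIELD_FILE_PATH")
    if backend == "redis" and not redis_url:
        raise ValueError("SHIELD_REDIS_URL is required when SHIELD_BACKEND=redis")
    if backend == "file" and not file_path:
        raise ValueError("SHIELD_FILE_PATH is required when SHIELD_BACKEND=file")
    return ShieldSettings(
        backend=backend,
        env=environ.get("SHIELD_ENV", "dev"),  # assumed default
        redis_url=redis_url,
        file_path=file_path,
    )
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.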