
Production Monitoring & Deployment Automation

This guide covers practical patterns for integrating api-shield into the scripts and pipelines that keep your production systems healthy.


Monitoring scripts

Poll route health via the REST API

The ShieldAdmin REST API is JSON over HTTP, so any monitoring tool that can make an HTTP request can query it. The monitoring host does not need the shield CLI installed.

#!/usr/bin/env bash
# check-routes.sh — exit 1 if any route is unexpectedly disabled

SHIELD_URL="${SHIELD_SERVER_URL:-http://localhost:8000/shield}"
TOKEN="${SHIELD_TOKEN}"

# Fetch all route states; -f makes curl exit non-zero on HTTP errors
if ! routes=$(curl -sf \
  -H "X-Shield-Token: $TOKEN" \
  "$SHIELD_URL/api/routes"); then
  echo "ERROR: Could not reach ShieldAdmin at $SHIELD_URL" >&2
  exit 1
fi

# Alert on any DISABLED route (adapt jq filter to your alert threshold)
disabled=$(echo "$routes" | jq -r '.[] | select(.status == "disabled") | .path')

if [ -n "$disabled" ]; then
  echo "ALERT: The following routes are disabled:"
  echo "$disabled"
  exit 1
fi

echo "OK: all routes nominal"

Run this from cron, Datadog, or any scheduler:

*/5 * * * * /opt/scripts/check-routes.sh >> /var/log/shield-monitor.log 2>&1

Python monitoring script

#!/usr/bin/env python3
"""monitor_routes.py — check api-shield route states and alert on anomalies."""

import os
import sys
import httpx

SHIELD_URL = os.environ.get("SHIELD_SERVER_URL", "http://localhost:8000/shield")
TOKEN = os.environ["SHIELD_TOKEN"]

ALERT_ON = {"disabled", "maintenance"}   # statuses that warrant an alert


def fetch_routes() -> list[dict]:
    resp = httpx.get(
        f"{SHIELD_URL}/api/routes",
        headers={"X-Shield-Token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def main() -> int:
    try:
        routes = fetch_routes()
    except httpx.HTTPError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return 2   # state could not be fetched; Nagios-style checkers read 2 as CRITICAL (3 is UNKNOWN), so adjust to your monitoring system

    alerts = [r for r in routes if r["status"] in ALERT_ON]

    if alerts:
        for r in alerts:
            print(f"ALERT  {r['status'].upper():<12} {r['path']}  reason={r.get('reason', '')!r}")
        return 1

    print(f"OK  {len(routes)} route(s) nominal")
    return 0


if __name__ == "__main__":
    sys.exit(main())
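
The alerting decision in the script above can be separated from the HTTP call so it can be unit-tested without a running server. The following is a minimal sketch, not part of api-shield: the function name `evaluate_routes` and the `"active"` status in the sample data are assumptions made for illustration. It only relies on the `path`, `status`, and `reason` keys the script already uses.

```python
from typing import Iterable

# Exit codes mirroring the script above: 0 = OK, 1 = alert.
OK, ALERT = 0, 1


def evaluate_routes(routes: Iterable[dict], alert_on: set[str]) -> tuple[int, list[str]]:
    """Return (exit_code, report_lines) for a list of route-state dicts.

    Hypothetical helper: assumes each dict carries "path" and "status",
    and optionally "reason", as in the monitoring script above.
    """
    routes = list(routes)
    lines = [
        f"ALERT  {r['status'].upper():<12} {r['path']}  reason={r.get('reason', '')!r}"
        for r in routes
        if r["status"] in alert_on
    ]
    if lines:
        return ALERT, lines
    return OK, [f"OK  {len(routes)} route(s) nominal"]


if __name__ == "__main__":
    # Sample data only; "active" is an assumed name for a healthy status.
    sample = [
        {"path": "GET:/payments", "status": "maintenance", "reason": "DB migration"},
        {"path": "GET:/health", "status": "active"},
    ]
    code, report = evaluate_routes(sample, {"disabled", "maintenance"})
    print("\n".join(report))
    print("exit code:", code)
```

With this split, `main()` would only fetch the routes, call the helper, print the lines, and return the code.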

Webhook alerting (Slack / PagerDuty)

api-shield fires a webhook on every state change: enable, disable, and maintenance on/off. Webhooks are always delivered by the process that owns the engine, because that is where state mutations happen. Where you register them therefore depends on your deployment mode.

Embedded mode (single service)

Register directly on the engine before mounting ShieldAdmin:

import os

from fastapi import FastAPI
from shield.core.engine import ShieldEngine
from shield.core.webhooks import SlackWebhookFormatter
from shield.fastapi import ShieldAdmin

app = FastAPI()  # or your existing application object

engine = ShieldEngine()
# Slack receives a formatted message; PagerDuty receives the default JSON payload
engine.add_webhook(
    url=os.environ["SLACK_WEBHOOK_URL"],
    formatter=SlackWebhookFormatter(),
)
engine.add_webhook(url=os.environ["PAGERDUTY_WEBHOOK_URL"])

admin = ShieldAdmin(engine=engine, auth=("admin", os.environ["SHIELD_PASS"]))
app.mount("/shield", admin)

Shield Server mode (multi-service)

State mutations happen on the Shield Server, not on SDK clients. Build the engine explicitly so you can call add_webhook() on it before passing it to ShieldAdmin:

# shield_server.py
import os
from shield.core.engine import ShieldEngine
from shield.core.backends.redis import RedisBackend
from shield.core.webhooks import SlackWebhookFormatter
from shield.admin.app import ShieldAdmin

engine = ShieldEngine(backend=RedisBackend(os.environ["REDIS_URL"]))
engine.add_webhook(
    url=os.environ["SLACK_WEBHOOK_URL"],
    formatter=SlackWebhookFormatter(),
)
engine.add_webhook(url=os.environ["PAGERDUTY_WEBHOOK_URL"])

shield_app = ShieldAdmin(
    engine=engine,
    auth=("admin", os.environ["SHIELD_PASS"]),
    secret_key=os.environ["SHIELD_SECRET_KEY"],
)

Note

SDK service apps (ShieldSDK) never fire webhooks. They only enforce state locally — all mutations and therefore all webhook triggers originate on the Shield Server.

Webhook payload sent on every state change:

{
  "event": "maintenance_on",
  "path": "GET:/payments",
  "reason": "DB migration",
  "timestamp": "2025-06-01T02:00:00Z",
  "state": { "path": "GET:/payments", "status": "maintenance", ... }
}

Webhook failures are non-blocking; they are logged and never affect the request path. On multi-node Shield Server deployments (RedisBackend), Redis SET NX deduplication ensures only one node fires per event.
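
On the receiving side, a consumer only needs to parse the JSON payload shown above. The sketch below turns a payload into a one-line summary suitable for a log or chat message. The function name `summarize_event` is hypothetical, and the code assumes only the fields shown in the example payload (`event`, `path`, `reason`, `timestamp`, `state.status`); any other fields the real payload may carry are ignored.

```python
import json


def summarize_event(payload: dict) -> str:
    """Build a one-line summary from a webhook payload.

    Hypothetical helper: every field is read defensively because this
    sketch does not know which fields are guaranteed to be present.
    """
    event = payload.get("event", "unknown")
    path = payload.get("path", "?")
    reason = payload.get("reason") or "no reason given"
    timestamp = payload.get("timestamp", "")
    status = (payload.get("state") or {}).get("status", "unknown")
    return f"[{timestamp}] {event} on {path} (status={status}): {reason}"


if __name__ == "__main__":
    raw = """
    {
      "event": "maintenance_on",
      "path": "GET:/payments",
      "reason": "DB migration",
      "timestamp": "2025-06-01T02:00:00Z",
      "state": {"path": "GET:/payments", "status": "maintenance"}
    }
    """
    print(summarize_event(json.loads(raw)))
```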


Deployment automation

Pre/post deploy maintenance pattern

The safest deployment pattern is to enable global maintenance before the deploy, run migrations, roll out the new build, wait for it to pass its health check, and only then lift maintenance.

#!/usr/bin/env bash
# deploy.sh
set -euo pipefail

SHIELD_URL="${SHIELD_SERVER_URL:-http://localhost:8000/shield}"

shield_cmd() {
  shield --server-url "$SHIELD_URL" "$@"
}

echo "==> Enabling global maintenance..."
shield_cmd global enable \
  --reason "Deploying v$(cat VERSION) — back in ~5 minutes" \
  --exempt /health \
  --exempt GET:/readiness

echo "==> Running migrations..."
uv run alembic upgrade head

echo "==> Deploying new container..."
docker compose up -d --no-deps --build api

echo "==> Waiting for health check..."
until curl -sf http://localhost:8000/health; do sleep 2; done

echo "==> Disabling global maintenance..."
shield_cmd global disable

echo "==> Deploy complete."
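
Because the script above runs under `set -e`, a failed migration or build stops it before `global disable` is reached, leaving maintenance on. If you would rather lift maintenance on every exit path, the same sequence can be wrapped in a Python context manager. This is a sketch under stated assumptions: `global_maintenance` and `run_cli` are hypothetical names, and the only CLI arguments used are the ones that appear in the script above (`--server-url`, `global enable`, `--reason`, `--exempt`, `global disable`). Whether maintenance should be lifted after a failed deploy is a judgment call for your team.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

Runner = Callable[[Sequence[str]], None]


def run_cli(args: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(args), check=True)


@contextmanager
def global_maintenance(
    server_url: str,
    reason: str,
    exempt: Sequence[str] = (),
    run: Runner = run_cli,
) -> Iterator[None]:
    """Enable global maintenance for the duration of the with-block."""
    base = ["shield", "--server-url", server_url]
    enable = [*base, "global", "enable", "--reason", reason]
    for path in exempt:
        enable += ["--exempt", path]
    run(enable)
    try:
        yield
    finally:
        # Runs whether the body succeeded or raised.
        run([*base, "global", "disable"])


if __name__ == "__main__":
    # Demo with a recording runner, so no real commands are executed.
    calls: list = []
    try:
        with global_maintenance(
            "http://localhost:8000/shield", "demo deploy", ["/health"], run=calls.append
        ):
            raise RuntimeError("simulated migration failure")
    except RuntimeError:
        pass
    for call in calls:
        print(" ".join(call))
```

In a real deploy the body of the `with` block would call `run_cli` for the migration and rollout steps, and the default `run` argument would be left in place.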

Route-level rolling deploy

For zero-downtime deploys where only specific routes need to go offline:

#!/usr/bin/env bash
# rolling-deploy.sh
set -euo pipefail

# Note: `date -d '+10 minutes'` is GNU date syntax; on macOS/BSD use `date -u -v+10M` instead
shield maintenance "POST:/orders" \
  --reason "Order service upgrade — ETA 10 minutes" \
  --start "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --end "$(date -u -d '+10 minutes' +%Y-%m-%dT%H:%M:%SZ)"

# ... deploy only the orders service ...
docker compose up -d --no-deps --build orders

# Wait for readiness
until curl -sf http://localhost:8001/health; do sleep 2; done

shield enable "POST:/orders"
echo "Orders service back online."
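
The `until curl ...; do sleep 2; done` loops in these scripts wait forever if the service never becomes healthy. A bounded wait is a safer default in a pipeline. The sketch below is an illustration, not part of api-shield: `wait_until_healthy` and `http_ok` are hypothetical helpers built on the standard library, and the 120-second timeout is an arbitrary assumption.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on url succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError (raised for 4xx/5xx) are OSError subclasses.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Demo with a fake check that succeeds on the third attempt.
    attempts = iter([False, False, True])
    print(wait_until_healthy(lambda: next(attempts), timeout=5, interval=0))
```

A deploy script would call `wait_until_healthy(lambda: http_ok("http://localhost:8001/health"))` and abort, leaving the route in maintenance, if it returns `False`.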

GitHub Actions — deploy workflow

# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      SHIELD_SERVER_URL: ${{ secrets.SHIELD_SERVER_URL }}

    steps:
      - uses: actions/checkout@v4

      - name: Install shield CLI
        run: pip install "api-shield[cli]"

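      - name: Authenticate with ShieldAdmin
        # Pass secrets through the environment and quote them so special characters survive the shell
        env:
          SHIELD_USER: ${{ secrets.SHIELD_USER }}
          SHIELD_PASS: ${{ secrets.SHIELD_PASS }}
        run: shield login "$SHIELD_USER" --password "$SHIELD_PASS"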

      - name: Enable global maintenance
        run: |
          shield global enable \
            --reason "GitHub Actions deploy — commit ${{ github.sha }}" \
            --exempt /health

      - name: Run database migrations
        run: uv run alembic upgrade head

      - name: Deploy application
        run: |
          # your deploy command here
          kubectl set image deployment/api api=${{ env.IMAGE_TAG }}
          kubectl rollout status deployment/api --timeout=120s

      - name: Disable global maintenance
        if: always()   # run even if a previous step failed
        run: shield global disable

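      - name: Verify routes
        run: |
          # Print the status output, then fail the workflow if any route is still disabled
          status_output=$(shield status)
          echo "$status_output"
          if echo "$status_output" | grep -q DISABLED; then
            echo "One or more routes are still disabled" >&2
            exit 1
          fi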

Always disable on failure

Use if: always() on the disable step so maintenance mode is lifted even when the deploy fails. Pair it with a Slack webhook so the team is notified immediately.


Kubernetes — pre/post deploy hooks

Use Kubernetes lifecycle hooks to tie maintenance mode to pod lifecycle:

# k8s/deployment.yaml
spec:
  template:
    spec:
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                command:
                  - sh
                  - -c
                  - |
                    shield --server-url $SHIELD_SERVER_URL \
                      maintenance GET:/payments \
                      --reason "Pod shutting down (rolling update)"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5

And a post-deploy Job to re-enable:

# k8s/post-deploy-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: shield-enable-routes
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: shield-cli
          image: python:3.13-slim
          command:
            - sh
            - -c
            - |
              pip install -q "api-shield[cli]"
              shield login "$SHIELD_USER" --password "$SHIELD_PASS"
              shield enable GET:/payments
              shield global disable
          env:
            - name: SHIELD_SERVER_URL
              value: "http://api-svc/shield"
            - name: SHIELD_USER
              valueFrom:
                secretKeyRef: { name: shield-creds, key: username }
            - name: SHIELD_PASS
              valueFrom:
                secretKeyRef: { name: shield-creds, key: password }

Scheduled maintenance via cron + CLI

For recurring maintenance windows (nightly jobs, weekly DB vacuums):

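# crontab: every Sunday 02:00–04:00 UTC (assumes the host clock is set to UTC)
# A crontab entry must fit on a single line (cron does not support backslash continuation), and % must be escaped as \%
0 2 * * 0 shield schedule GET:/reports --start "$(date -u +\%Y-\%m-\%dT02:00:00Z)" --end "$(date -u +\%Y-\%m-\%dT04:00:00Z)" --reason "Weekly report rebuild"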

Or schedule programmatically from Python:

import asyncio
from datetime import datetime, UTC, timedelta
from shield.core.engine import ShieldEngine
from shield.core.models import MaintenanceWindow

async def schedule_nightly(engine: ShieldEngine) -> None:
    now = datetime.now(UTC)
    tonight = now.replace(hour=2, minute=0, second=0, microsecond=0)
    if tonight < now:
        tonight += timedelta(days=1)

    window = MaintenanceWindow(
        start=tonight,
        end=tonight + timedelta(hours=2),
        reason="Nightly data pipeline",
    )
    await engine.schedule_maintenance("GET:/reports", window=window)
    print(f"Scheduled maintenance: {window.start} → {window.end}")
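
For a weekly window such as the Sunday 02:00–04:00 UTC example, the start time has to roll forward to the next matching weekday rather than just the next day. The helper below is a standalone sketch using only the standard library; `next_weekly_window` is a hypothetical name, and the returned pair would be passed to `MaintenanceWindow(start=..., end=...)` as in the function above.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_window(
    now: datetime,
    weekday: int = 6,  # Monday=0 ... Sunday=6, as in datetime.weekday()
    hour: int = 2,
    duration: timedelta = timedelta(hours=2),
) -> tuple[datetime, datetime]:
    """Return (start, end) of the next window strictly after `now`."""
    start = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Move forward to the requested weekday within the current week.
    start += timedelta(days=(weekday - now.weekday()) % 7)
    if start <= now:
        # This week's window has already started; use next week's.
        start += timedelta(days=7)
    return start, start + duration


if __name__ == "__main__":
    start, end = next_weekly_window(datetime.now(timezone.utc))
    print(f"Next window: {start.isoformat()} to {end.isoformat()}")
```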

Audit log in monitoring pipelines

Pull the audit log to detect unexpected state changes (e.g. a route disabled by an unknown actor):

#!/usr/bin/env python3
"""audit-sentinel.py — alert on unexpected route state changes."""

import httpx, os, sys
from datetime import datetime, UTC, timedelta

SHIELD_URL = os.environ.get("SHIELD_SERVER_URL", "http://localhost:8000/shield")
TOKEN = os.environ["SHIELD_TOKEN"]
LOOKBACK = timedelta(minutes=15)

resp = httpx.get(
    f"{SHIELD_URL}/api/audit?limit=50",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()

cutoff = datetime.now(UTC) - LOOKBACK
unexpected = [
    e for e in resp.json()
    if datetime.fromisoformat(e["timestamp"]) > cutoff
    and e["actor"] not in {"system", "deploy-bot", "alice", "bob"}
]

if unexpected:
    for e in unexpected:
        print(f"UNKNOWN ACTOR  {e['actor']}  {e['action']}  {e['path']}  {e['timestamp']}")
    sys.exit(1)

print("OK")
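
The filter in the script above can be pulled into a pure function so that the allow-list and cutoff logic are testable without a server. This is a sketch under assumptions: `parse_timestamp` and `unexpected_entries` are hypothetical helpers, entries are assumed to carry `timestamp` and `actor` keys as in the script, and timestamps without an offset are treated as UTC. The explicit handling of a trailing `Z` matters because `datetime.fromisoformat` only accepts it from Python 3.11 onward.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware datetime."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"  # accepted by fromisoformat on older Pythons too
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def unexpected_entries(
    entries: Iterable[dict], known_actors: set[str], cutoff: datetime
) -> list[dict]:
    """Return entries newer than cutoff whose actor is not in known_actors."""
    return [
        e
        for e in entries
        if parse_timestamp(e["timestamp"]) > cutoff
        and e["actor"] not in known_actors
    ]


if __name__ == "__main__":
    now = datetime(2025, 6, 1, 2, 10, tzinfo=timezone.utc)
    sample = [
        {"actor": "deploy-bot", "action": "disable", "path": "GET:/a", "timestamp": "2025-06-01T02:05:00Z"},
        {"actor": "mallory", "action": "disable", "path": "GET:/b", "timestamp": "2025-06-01T02:06:00Z"},
        {"actor": "mallory", "action": "disable", "path": "GET:/c", "timestamp": "2025-06-01T01:00:00Z"},
    ]
    for e in unexpected_entries(sample, {"system", "deploy-bot"}, now - timedelta(minutes=15)):
        print(e["actor"], e["path"])
```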

Environment variable reference

| Variable | Used by | Description |
|---|---|---|
| SHIELD_SERVER_URL | CLI, monitoring scripts | Base URL of the ShieldAdmin mount point |
| SHIELD_TOKEN | Monitoring scripts (direct API calls) | Bearer token from shield login |
| SHIELD_BACKEND | App server | Backend type: memory, file, redis |
| SHIELD_ENV | App server | Current environment name (dev, staging, production) |
| SHIELD_REDIS_URL | App server | Redis connection URL for RedisBackend |
| SHIELD_FILE_PATH | App server | JSON file path for FileBackend |
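
Monitoring scripts can read the first two variables in one place and fail early with a clear message when the token is missing. The loader below is a sketch: `MonitorSettings` and `load_settings` are hypothetical names, and the default URL is the one used by the scripts in this guide.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_URL = "http://localhost:8000/shield"  # same default as the scripts in this guide


@dataclass(frozen=True)
class MonitorSettings:
    server_url: str
    token: str


def load_settings(environ: Mapping[str, str] = os.environ) -> MonitorSettings:
    """Read SHIELD_SERVER_URL and SHIELD_TOKEN from an environment mapping."""
    token = environ.get("SHIELD_TOKEN")
    if not token:
        raise RuntimeError("SHIELD_TOKEN is not set")
    url = environ.get("SHIELD_SERVER_URL", DEFAULT_URL)
    # Strip a trailing slash so f"{server_url}/api/routes" has no double slash.
    return MonitorSettings(server_url=url.rstrip("/"), token=token)


if __name__ == "__main__":
    settings = load_settings({"SHIELD_TOKEN": "example-token"})
    print(settings.server_url)
```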