🎓 What you will learn ▼

Context
Reflex in case of outage
Diagnostic table
The site/service is not responding
The server is inaccessible via SSH
The database is not responding
The disk is full
A container keeps restarting in a loop
The agent is no longer working
After the outage
Post-mortem template
Common mistakes
Checklist

status: complete audience: both chapter: 05 last_updated: 2026-04 contributors: [alexwill87, claude-cockpit] lang: en

5.11 -- In case of outage

Context

An outage happens. The reflex: don't panic, diagnose, fix, prevent. This guide gives you a diagnostic table by symptom for the most common outages.

Reflex in case of outage

Don't touch everything at once. One diagnosis at a time.
Note the time. When it started. To correlate logs.
Check the simplest thing first. Is the server up? Is the service running? Is disk space full?

Diagnostic table

The site/service is not responding

Diagnostic	Command	Fix
The container is stopped	`docker ps -a`	`docker compose up -d`
The port is not exposed	`ss -tlnp \\| grep PORT`	Check docker-compose.yml ports
Nginx is not proxying	`nginx -t && systemctl status nginx`	Fix the config, `systemctl restart nginx`
The service crashes on startup	`docker logs CONTAINER --tail 50`	Read the error, fix, rebuild
DNS does not point	`dig your-domain.com`	Check DNS config

Prevention: automated health check (section 5.1) + monitoring (section 5.10).

The server is inaccessible via SSH

Diagnostic	Command	Fix
The server is down	Hetzner Cloud console	Reboot from console
SSH is blocked by firewall	Hetzner console > rescue mode	`ufw allow 22`
Disk full (SSH cannot create a session)	Hetzner console	Rescue mode, mount disk, free space
Wrong SSH key	From your machine: `ssh -vvv user@ip`	Check authorized_keys

Prevention: never modify ufw/iptables without confirmed SSH rule. Snapshot before network changes.

The database is not responding

Diagnostic	Command	Fix
PostgreSQL is stopped	`systemctl status postgresql`	`systemctl start postgresql`
Too many connections	`psql -c "SELECT count(*) FROM pg_stat_activity;"`	Identify ghost connections, increase max_connections
Disk full	`df -h /var/lib/postgresql`	Free space (logs, tmp)
Corruption	PostgreSQL logs	Restore from last backup (section 5.3)
Wrong password (after rotation)	`psql -U oa_admin -d cockpit`	Check secret in Vault, fix

Prevention: PostgreSQL monitoring + daily tested backup.

The disk is full

Diagnostic	Command	Fix
Identify what takes space	`du -sh /* \\| sort -rh \\| head -10`	See below
Docker logs	`du -sh /var/lib/docker/containers/`	Configure rotation (section 5.2)
System logs	`journalctl --disk-usage`	`journalctl --vacuum-size=200M`
Old backups	`du -sh /var/backups/`	Delete oldest, keep 7 days
Unused Docker images	`docker system df`	`docker system prune -a --filter "until=720h"`
Temporary files	`du -sh /tmp/`	`rm -rf /tmp/old-*` (with discretion)

Prevention: alert at 80% disk usage in health check. Log rotation configured.

A container keeps restarting in a loop

Diagnostic	Command	Fix
Startup error	`docker logs CONTAINER --tail 100`	Read the error, fix
Missing dependency	`docker logs CONTAINER 2>&1 \\| grep -i "error\\|fatal"`	npm install, pip install, etc.
Missing environment variable	`docker inspect CONTAINER \\| jq '.[0].Config.Env'`	Add to .env or docker-compose.yml
Port already in use	`ss -tlnp \\| grep PORT`	Stop the other service or change port
OOM Kill	`dmesg \\| grep -i oom`	Increase memory or optimize app

Prevention: restart: unless-stopped in docker-compose.yml + health check + logs.

The agent is no longer working

Diagnostic	Command	Fix
Invalid/expired API key	Test a direct API call	Renew the key
Rate limit reached	Check response headers	Wait or change plan
API service down	Check status.anthropic.com (or equivalent)	Wait
Broken configuration	Read .claude/ or equivalent	Restore from backup
Deprecated model	Read error logs	Change model in config

Prevention: fallback on another configured model (section 5.13). Budget and rate limits monitored.

After the outage

Document. What, when, cause, fix, duration.
Prevent. Add a check to the health check, a rule to the boundary prompt, a workflow to WORKFLOWS.md.
Test the prevention. Simulate the outage again and verify that monitoring detects it.

Post-mortem template

## Incident — YYYY-MM-DD

**Symptom:** [what was observed]
**Start:** HH:MM — **End:** HH:MM — **Duration:** X min
**Cause:** [root cause]
**Fix:** [what was done]
**Impact:** [who/what was affected]
**Prevention:** [measure added to prevent recurrence]

Common mistakes

Panicking and touching everything. You change 3 configs at once. Now you have 4 problems instead of one.

No recent backup. Recovery is impossible or too old. See section 5.3.

Not documenting. The same outage comes back 3 months later and you forgot the fix.

Fixing without understanding. "I rebooted and it works." Until next time. Identify the root cause.

Checklist

[ ] You can diagnose the 6 scenarios above.
[ ] The diagnostic commands are tested and work on your setup.
[ ] A post-mortem template exists.
[ ] Past outages are documented.
[ ] Each outage generated a prevention measure.

Well done, you completed this section!

You covered: Context, Reflex in case of outage, Diagnostic table, After the outage and 2 more. Continue →

Proposer une modification sur GitHub

Commentaires et discussions

← Monitoring and Alerts When to add a second agent →