# Mist: Operational Runbook

A short, dense reference for "what do I do when X happens." Fill in as we hit real situations.

## Backend VM access

SSH: `ssh mist@` (or ``).

Compose lives at: `/opt/mist/` (TODO: confirm during deploy). All `docker compose` commands run from that directory.

## Normal operations

### Deploy a new image

CI pushes images to GHCR on merge to `main`. To pull and restart:

```sh
cd /opt/mist
docker compose pull
docker compose up -d
docker compose ps
```

### View logs

```sh
docker compose logs -f api
docker compose logs -f worker
docker compose logs --tail=200 api worker
```

### Restart the stack

```sh
cd /opt/mist
docker compose down
docker compose up -d
```

### Restart a single service

```sh
docker compose restart worker
```

### Run a Celery task manually (debugging)

```sh
docker compose exec api python -c "from mist.worker.tasks import generate_direct_update; generate_direct_update.delay('Satisfactory', '1.0.0.0', '1.0.0.1')"
```

## Failure scenarios

### NAS is unreachable

**Symptoms:** worker tasks fail with `FileNotFoundError` for `/mnt/nas/...`, API `/downloads/*` returns 404 for non-cached files.

**Action:**

1. Verify NAS reachability from the VM: `ls /mnt/nas/mist/games/`
2. If the listing is empty or errors out, the NFS mount is broken. Check the mount: `mount | grep nas`
3. Remount: `sudo mount -a` (assuming `/etc/fstab` has the entry)
4. If still broken, log into the NAS and verify it is serving NFS
5. The stack recovers automatically once the NAS is back; in-flight jobs retry per the Celery config

### Postgres won't start

**Symptoms:** `api` container restarts in a loop, logs show `connection refused` to `postgres`.

**Action:**

1. `docker compose logs postgres`: look for the actual error
2. Common cause: out of disk space. Run `df -h` on the VM.
3. If the volume is corrupted: stop the stack, restore from the last `pg_dump` (see "Restore from a Postgres backup")

### Worker queue is backed up

**Symptoms:** Builds take forever, RabbitMQ UI (`http://:15672/`) shows growing queue depth.

**Action:**

1. Check worker logs for stuck tasks
2. Scale workers: edit `docker-compose.yml`, set `worker.deploy.replicas: 2`, then run `docker compose up -d`
3. If the queue is clogged with tasks that will never succeed, purge it: `docker compose exec worker celery -A mist.worker purge`. Note that `purge` discards every waiting message in the queue, not one specific task, and it does not stop a task that is already executing; for a task hung mid-run, restart the worker instead (see "Restart a single service").

### Cache disk is full

**Symptoms:** Build jobs fail with `OSError: [Errno 28] No space left on device`.

**Action:**

1. `df -h` to confirm
2. `docker compose exec api python -m mist.core.paths --clear-cache` (TODO: implement this maintenance task)
3. Or manually: stop the stack, `rm -rf /var/lib/docker/volumes/mist_cache-vol/_data/*`, restart

### Stack won't come back up after VM reboot

**Symptoms:** SSH in after reboot, `docker compose ps` shows nothing or services are Exited.

**Action:**

1. Verify the Docker daemon: `systemctl status docker`
2. `cd /opt/mist && docker compose up -d`
3. If still failing, check that `restart: unless-stopped` is set on all services in `docker-compose.yml`

## Backups

### What we back up

- Postgres (full dump): daily
- `Mist/.env` (passwords, secrets): versioned outside this repo
- `docker-compose.yml` and any host-level config: in git

### What we DON'T back up here

- Game files on NAS: the NAS has its own backup story (assumed RAID + remote replication)
- Hot cache: regenerable from NAS

### Take a Postgres backup

```sh
docker compose exec -T postgres pg_dump -U mist mist | zstd > /mnt/nas/mist/backups/pg-$(date +%F).sql.zst
```

### Restore from a Postgres backup

```sh
docker compose stop api worker
zstd -d < /mnt/nas/mist/backups/pg-YYYY-MM-DD.sql.zst | docker compose exec -T postgres psql -U mist mist
docker compose start api worker
```

## Provisioning a new friend account

(Until the admin portal supports this end-to-end.)

```sh
docker compose exec api python -m mist.scripts.create_user [--admin]
```

(TODO: implement that script.)

## Resetting your admin password

```sh
docker compose exec api python -m mist.scripts.reset_password
```

(TODO: implement that script.)
## Health checks (manual)

```sh
# Prints the response body; expect {"ok": true}
curl -s https://api.mist.example/healthz
# Prints only the HTTP status; expect 200 if DB/Redis/RabbitMQ are all reachable
curl -s -o /dev/null -w '%{http_code}\n' https://api.mist.example/readyz
```
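If the manual checks get run often, they could be wrapped in a small helper. The following is only a sketch: the function names (`classify_status`, `probe`, `check_all`) and the `BASE_URL` variable are made up for illustration, and it looks at HTTP status codes only, so it does not verify the `{"ok": true}` body from `/healthz`.

```sh
#!/bin/sh
# Sketch of a health-check helper. All names here are hypothetical.
# BASE_URL defaults to the placeholder host used in this runbook.
BASE_URL="${BASE_URL:-https://api.mist.example}"

# Map an HTTP status code to a short verdict.
classify_status() {
    case "$1" in
        200) echo "ok" ;;
        000) echo "unreachable" ;;  # curl reports 000 when no response arrived
        *)   echo "failing" ;;
    esac
}

# Print only the HTTP status code for a URL, giving up after 5 seconds.
probe() {
    curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Probe both endpoints and print one summary line per endpoint.
check_all() {
    for endpoint in /healthz /readyz; do
        code=$(probe "$BASE_URL$endpoint")
        echo "$endpoint: $code ($(classify_status "$code"))"
    done
}
```

To use it, source the file and run `check_all`, optionally with `BASE_URL` set to the real API host. Keeping `classify_status` separate from the `curl` call means the verdict logic can be adjusted (for example, treating redirects differently) without touching the network code.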