Initial Mist scaffold

Successor to the Josh Steam prototypes. Single-VM Docker Compose stack with
the load-bearing core/ logic ported from JoshSteam CDN with bug fixes.

Contents:
- backend/  FastAPI + Celery (same image, two entrypoints)
            core/  hdiff, librsync, chain_replay, manifest, compression,
                   discord, steam, unrealpak, paths
            api/   auth, catalog, admin, builds (skeletons) + downloads (real)
            worker/  Celery factory replacing the missing prototype Tasks/__init__.py
            db/    SQLAlchemy models + Alembic initial migration
- admin-web/  SvelteKit + Tailwind skeleton
- client/    Tauri 2 + Svelte skeleton (Mist placeholder UI)
- mistpipe/  click-based admin CLI with subcommand stubs
- docs/      ARCHITECTURE, DECISIONS (9 ADRs), RUNBOOK
- docker-compose.yml + dev overlay + .github/workflows

Bugs fixed during port:
- Routes/download.py:2 stray backslash on import line
- Utils/celery.py inspect.reserved() missing parens + double active() typo
- Hardcoded OneDrive/Desktop paths replaced with pydantic-settings config
- Discord webhook URL + RabbitMQ password moved to env vars
- Missing Tasks/__init__.py reconstructed as worker/__init__.py

Out of scope for this commit: route bodies, UI screens, mistpipe subcommand
bodies, real image builds.
2026-06-07 19:39:25 -04:00
commit bfd6771a9a
76 changed files with 3890 additions and 0 deletions
+140
@@ -0,0 +1,140 @@
# Mist — Architecture
## Purpose
A private Steam clone for ~5–10 friends. Distributes games and updates with bandwidth-efficient delta patching. Sized to the actual problem: the complexity is in the delta-patching system, not in the deployment.
## Topology
Friends use a regular web/desktop client over the open internet. Public DNS resolves to a border server. Nginx Proxy Manager terminates TLS (Let's Encrypt) and reverse-proxies to the backend VM over Tailscale. The backend VM itself has no public exposure.
```
            ┌──────────────────────────────┐
Friend      │ Tauri client (Svelte UI      │
(open       │ inside Rust shell)           │
internet)   └──────────────┬───────────────┘
                           │ HTTPS, public domain
                           │   store.mist.example
                           │   admin.mist.example
                           │   dl.mist.example
          ┌─────────────────────────────────┐
          │ Border server (public IP)       │
          │ Nginx Proxy Manager + Let's     │
          │ Encrypt TLS termination         │
          └────────────────┬────────────────┘
                           │ Tailscale (private backhaul)
┌─────────────────────────────────────────────┐
│ Proxmox VM "mist"                           │
│ Docker Compose stack:                       │
│                                             │
│  ┌──────────────┐  ┌──────────────────┐     │
│  │ api          │  │ admin-web        │     │
│  │ FastAPI:     │  │ static Svelte/   │     │
│  │   /auth      │  │ TS served by     │     │
│  │   /catalog   │  │ tiny nginx       │     │
│  │   /admin     │  │ (calls api with  │     │
│  │   /downloads │  │ admin JWT)       │     │
│  │   /builds    │  └──────────────────┘     │
│  └──────┬───────┘                           │
│         │                                   │
│         ▼                                   │
│  ┌──────────────┐                           │
│  │ worker       │                           │
│  │ Celery       │  same image,              │
│  │ delta-gen,   │  different entrypoint     │
│  │ archive prep,│                           │
│  │ notifications│                           │
│  └──────┬───────┘                           │
│         │                                   │
│         ▼                                   │
│  ┌──────────┐ ┌────────┐ ┌─────────────┐    │
│  │ postgres │ │ redis  │ │ rabbitmq    │    │
│  └──────────┘ └────────┘ └─────────────┘    │
│                                             │
│  Volumes:                                   │
│    - hot cache (.tar.zst archives)          │
│    - postgres data                          │
│    - redis data                             │
│    - rabbitmq data                          │
│    - /mnt/nas → NFS to NAS (games)          │
└─────────────────────────────────────────────┘
```
## Containers
| Container | Stack | Owns |
|---|---|---|
| **api** | FastAPI + SQLAlchemy + Postgres + Redis | Single web app with internally-modular code: `auth/`, `catalog/`, `admin/`, `downloads/`, `builds/`. Issues JWTs. Serves resumable downloads. Receives `mistpipe` uploads. Queues background work into RabbitMQ. |
| **worker** | Celery (same image as `api`, different entrypoint) | Consumes RabbitMQ. Runs the heavy stuff: hdiff delta generation, librsync indirect-delta generation, `chain_replay` cold reconstruction, `.tar.zst` archive packing, Discord notifications. |
| **admin-web** | SvelteKit, built static + tiny nginx | Admin UI. Calls `api/admin/*` with admin JWT. |
| **postgres** | postgres:16 | Catalog, users, build job state. |
| **redis** | redis:7 | Celery result backend, cache, ephemeral session data. |
| **rabbitmq** | rabbitmq:3.13-management | Celery broker, event bus for `notification.*` events. |
## Non-container artifacts
| Artifact | Stack | Notes |
|---|---|---|
| **client** | Tauri 2 (Rust shell + Svelte UI) | Friend-facing app. Distributed as a per-platform installer. Embeds (or spawns) the patch-application logic. |
| **mistpipe** | Python + click | Admin CLI. `login`, `new-game`, `push`, `ls`, `rm`, `resync-steam`. JWT stored in OS keychain. |
## Storage
**NAS** (mounted at `/mnt/nas` inside the VM via NFS) is the **source of truth** for game files:
```
/mnt/nas/mist/games/<Title>/
base_version.7z ← immutable original
depot/ ← current latest version's files
manifests/
<Title>.json ← ordered linear version list (legacy; will migrate to Postgres)
<version>.json ← per-version SHA-256 manifest
deltas/<version>/
delta_manifest.json
new_files/...
*.patch ← hdiff direct patches
```
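A minimal sketch of how a per-version SHA-256 manifest could be produced from `depot/`. The JSON shape (relative path mapped to hex digest) and the helper names `manifest_path`, `sha256_file`, and `build_manifest` are assumptions for illustration; the real format lives in `core/manifest` and may differ.

```python
import hashlib
from pathlib import Path

NAS_ROOT = Path("/mnt/nas/mist/games")  # root taken from the layout above


def manifest_path(title: str, version: str, root: Path = NAS_ROOT) -> Path:
    """Where the per-version manifest for a title lives (hypothetical helper)."""
    return root / title / "manifests" / f"{version}.json"


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large game assets never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(depot: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to its SHA-256 hex digest.

    The mapping shape is an assumption, not the confirmed manifest schema.
    """
    return {
        p.relative_to(depot).as_posix(): sha256_file(p)
        for p in sorted(depot.rglob("*"))
        if p.is_file()
    }
```

Comparing two such manifests key by key is enough to tell which files were added, removed, or changed between versions.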
**Docker volumes** on the VM hold the hot path:
- `cache-vol`: `.tar.zst` archives ready to serve, reconstructed historical versions
- `tmp-vol`: in-flight delta-gen working dirs
- `postgres-vol`, `redis-vol`, `rabbitmq-vol`: service data
## Update modes
Two delta strategies, decided at request time:
- **Direct update** = consecutive versions (`1.0.0.2` → `1.0.0.3`). Deltas were pre-generated by `hdiff` at push time. Serve from cache or zip on demand.
- **Indirect update** = arbitrary version jumps (`1.0.0.0` → `1.0.0.3`). Server tells client which files changed. Client generates `librsync` signatures of its local files, POSTs them. Worker generates `rdiff` deltas against the server's copy and packs them into a `.tar.zst`. Client applies with `rdiff patch`.
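The request-time decision between the two modes can be sketched from positions in the ordered version list. `UpdateMode` and `choose_update_mode` are hypothetical names, and treating a downgrade as an indirect update is an assumption this document does not confirm.

```python
from enum import Enum


class UpdateMode(Enum):
    NONE = "none"          # client is already on the target version
    DIRECT = "direct"      # next consecutive version: pre-generated hdiff patches
    INDIRECT = "indirect"  # any other jump: librsync signature exchange


def choose_update_mode(versions: list[str], current: str, target: str) -> UpdateMode:
    """Pick a delta strategy from the ordered linear version list."""
    have = versions.index(current)  # raises ValueError for an unknown version
    want = versions.index(target)
    if want == have:
        return UpdateMode.NONE
    if want - have == 1:
        return UpdateMode.DIRECT
    # Assumption: multi-step jumps and downgrades both go through the indirect path.
    return UpdateMode.INDIRECT
```

With `["1.0.0.0", "1.0.0.1", "1.0.0.2", "1.0.0.3"]`, a move from `1.0.0.2` to `1.0.0.3` is direct and a move from `1.0.0.0` to `1.0.0.3` is indirect, matching the two examples in the list.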
For arbitrary historical reconstructions the worker runs `chain_replay` — starts from the base or closest cached version and walks forward applying hdiff patches at each step, caching results for next time.
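The planning half of that walk can be sketched as follows. This only decides where to start and which consecutive patches to apply; it does not apply them. `plan_chain_replay` is a hypothetical helper, and it assumes `versions[0]` is the base version and is always available.

```python
def plan_chain_replay(
    versions: list[str], cached: set[str], target: str
) -> tuple[str, list[tuple[str, str]]]:
    """Return (start_version, steps) for reconstructing `target`.

    `steps` is the ordered list of (from, to) hdiff patches to apply.
    Assumes versions[0] is the immutable base and needs no cache entry.
    """
    goal = versions.index(target)  # raises ValueError for an unknown version
    start = 0
    # Scan backwards from the target for the closest cached version at or before it.
    for i in range(goal, 0, -1):
        if versions[i] in cached:
            start = i
            break
    steps = [(versions[i], versions[i + 1]) for i in range(start, goal)]
    return versions[start], steps
```

Each intermediate result a worker produces while executing `steps` can be added to the cache, which shortens the plan for the next request.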
## Auth
- Username + password; the admin provisions accounts via the admin portal
- Passwords hashed with argon2id
- JWT issued on login, scope claim distinguishes `user` from `admin`
- Admin scope required for `/admin/*` and `/builds/*`
- Per-game `is_private` boolean flag; non-admin users only see public games + games they've been explicitly granted (future)
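The scope rule above can be sketched as a pure function over already-verified JWT claims. The claim name `scope` and the choice to let admins reach user routes are assumptions; signature and expiry verification are out of scope for this sketch.

```python
ADMIN_PREFIXES = ("/admin/", "/builds/")  # routes listed above as admin-only


def required_scope(path: str) -> str:
    """Scope a route demands: 'admin' for /admin/* and /builds/*, else 'user'."""
    return "admin" if path.startswith(ADMIN_PREFIXES) else "user"


def is_authorized(claims: dict, path: str) -> bool:
    """Check decoded claims against the route.

    Assumes the token was already verified and carries a 'scope' claim
    (hypothetical claim name) set to 'user' or 'admin'.
    """
    scope = claims.get("scope")
    if scope == "admin":
        return True  # assumption: admin scope is a superset of user scope
    return scope == "user" and required_scope(path) == "user"
```

In a FastAPI app this check would typically sit in a dependency so that every router under `/admin` and `/builds` gets it without repeating the logic per route.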
## Catalog metadata
Steam appdetails pull-through: when an admin adds a game with an `app_id`, the catalog service fetches `https://store.steampowered.com/api/appdetails` and stores `short_description` + `header_image`. Admin can override either via hand-edited fields.
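A sketch of the pull-through's parsing step, with the HTTP call left out. It assumes the commonly observed shape of the appdetails response (a JSON object keyed by the app ID string, with `success` and `data` fields) and the `appids` query parameter; the endpoint is not formally documented, so both should be verified. `extract_metadata` and the `overrides` argument are hypothetical names.

```python
from typing import Optional
from urllib.parse import urlencode

APPDETAILS_URL = "https://store.steampowered.com/api/appdetails"


def appdetails_url(app_id: int) -> str:
    """Build the request URL for one app (query parameter name is an assumption)."""
    return f"{APPDETAILS_URL}?{urlencode({'appids': app_id})}"


def extract_metadata(payload: dict, app_id: int, overrides: Optional[dict] = None) -> dict:
    """Pull the two stored fields out of a decoded appdetails response.

    Missing or unsuccessful entries yield None values rather than raising.
    """
    entry = payload.get(str(app_id), {})
    data = entry.get("data", {}) if entry.get("success") else {}
    meta = {
        "short_description": data.get("short_description"),
        "header_image": data.get("header_image"),
    }
    # Hand-edited admin fields win over whatever Steam returned.
    for key, value in (overrides or {}).items():
        if key in meta and value:
            meta[key] = value
    return meta
```

Keeping the parsing separate from the fetch makes it testable with a canned payload and keeps a Steam outage from blocking `new-game`.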
## Out of scope (for MVP)
- Branches (stable/beta/internal)
- Save sync, achievements, friend lists, in-app chat
- Payments
- Multi-tenancy (one store)
- Public self-serve signup
- Per-user entitlements beyond `is_private`
- Client-side delta-gen in `mistpipe` (server does it for MVP)
- High availability — single VM is fine at this scale; backup + restore covers failure
## Why this shape
The original "Josh Steam" prototypes proved out the hard parts: two-mode delta-patching, chain-replay for cold reconstruction, resumable downloads. The rebuild, two years later, focuses on **finishing the product**, not re-exploring the design space. Microservices and k8s were considered and rejected for the actual scale; see `DECISIONS.md`.
+109
@@ -0,0 +1,109 @@
# Mist — Decisions Log
This file is an append-only log of significant decisions, in lightweight ADR (Architecture Decision Record) format. The goal is that future-you (or a contributor) can reconstruct *why* a choice was made, not just *what* was chosen.
## Format
Each entry:
```
## NNNN — Short title (YYYY-MM-DD)
**Status:** Accepted | Superseded | Deprecated
**Context:** What problem were we solving? What forces were at play?
**Decision:** What did we decide?
**Consequences:** What does this make easier? Harder?
**Alternatives considered:** What else we looked at and why we passed.
```
---
## 0001 — Project named "Mist" (supersedes "Josh Steam") (2026-06-07)
**Status:** Accepted
**Context:** The original prototype was named "Josh Steam" because it was a personal project for the author and his friends. The rebuild is a real product (private but real) and benefits from a name that travels.
**Decision:** Project name is **Mist**. CLI is `mistpipe` (homage to Steam's SteamPipe). Docker images namespaced `mist-*`. Domain pattern `*.mist.example` in docs.
**Consequences:** All references to "Josh Steam" or "joshsteamctl" in any new code/docs must use the new name. Existing prototypes at `Josh Steam/` on disk stay untouched as historical reference.
**Alternatives considered:** Keep "Josh Steam". Rejected — uncomfortable to share, and the name doesn't say what it is.
---
## 0002 — Single-VM Docker Compose instead of Kubernetes (2026-06-07)
**Status:** Accepted
**Context:** Original draft of the architecture proposed 7 microservices on a multi-node k3s/rke2 cluster with ArgoCD GitOps, Longhorn storage, MetalLB load balancer, and the Tailscale operator. The framing was "use this as an excuse to learn k8s and microservices."
**Decision:** Run the backend as Docker Compose on a single Proxmox VM. Six containers total: `api`, `worker`, `admin-web`, `postgres`, `redis`, `rabbitmq`. Stateful services share the same compose stack with named volumes. NAS mounted via NFS.
**Consequences:** Massively less operational complexity. Deploy is `docker compose pull && docker compose up -d` over SSH. No service mesh, no ingress controller, no GitOps tooling to learn before the product runs. The project itself (delta-patching, content distribution) is already complex enough; deployment shouldn't compound it. Trade-off: less k8s/microservices résumé padding.
**Alternatives considered:**
- Multi-node k8s + GitOps + 7 microservices. Rejected — adds learning surface unrelated to the actual problem and is wildly oversized for ~10 users.
- Modular monolith on bare metal (no containers). Rejected — losing the reproducibility / portability of containers isn't worth the marginal simplicity.
---
## 0003 — Monorepo across services (2026-06-07)
**Status:** Accepted
**Context:** Backend, worker, admin-web, client, and CLI all evolve together. Sharing types/contracts is easier when they share a repo.
**Decision:** Single git repo with one top-level folder per deployable. Backend and worker share a Python package (`backend/src/mist/`) and run as different entrypoints of the same Docker image.
**Consequences:** One CI workflow per artifact, but a single source of truth for the system. Refactors that cross boundaries are atomic.
**Alternatives considered:** Polyrepo. Rejected — friend-scale doesn't justify the coordination overhead.
---
## 0004 — Modular monolith for backend (api + worker, same code) (2026-06-07)
**Status:** Accepted
**Context:** Original plan split the backend into multiple services (identity, catalog, builds, downloads, client-bff, notifications). At ~10 users this is overkill.
**Decision:** Single FastAPI app with internal modules per domain (`api/auth.py`, `api/catalog.py`, etc.). Celery worker shares the same Python package and Docker image; only the entrypoint differs.
**Consequences:** Refactoring boundaries is a code-level concern, not an ops concern. If a domain genuinely outgrows the monolith later, extract it then.
**Alternatives considered:** True microservices. Rejected per ADR 0002.
---
## 0005 — Linear versions only, no branches (2026-06-07)
**Status:** Accepted
**Context:** Steam supports branches (stable / beta / internal). Useful for a real game publisher; overkill here.
**Decision:** Versions form a linear ordered list per game. No branches in MVP.
**Consequences:** Catalog data model is simpler (just `ordinal` on `Version`). Direct/indirect update routing logic is unchanged from prototype. If we ever want betas, we add a `branch` column and migrate.
**Alternatives considered:** Steam-style branches. Rejected for MVP — no current need.
---
## 0006 — Public-by-default catalog with an `is_private` flag (2026-06-07)
**Status:** Accepted
**Context:** Real entitlements ("Tim owns Game X, Tom doesn't") add an entitlements service. Friend-scale doesn't justify it.
**Decision:** Single boolean `is_private` on `Game`. Public games are visible to anyone logged in. Private games are admin-only (future: explicit grants).
**Consequences:** No entitlements service. If we want per-user grants later, add a `game_user_grants` table without breaking anything.
**Alternatives considered:** Full Steam-style ownership. Rejected as premature.
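As a rough illustration of ADR 0006, the whole visibility rule fits in one filter. The `Game` dataclass here is a stand-in for the SQLAlchemy model, and `visible_games` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    """Stand-in for the real model; only the fields the rule needs."""
    title: str
    is_private: bool = False


def visible_games(games: list[Game], scope: str) -> list[Game]:
    """Admins see everything; any other logged-in user sees public games only."""
    if scope == "admin":
        return list(games)
    return [g for g in games if not g.is_private]
```

A later `game_user_grants` table would extend the non-admin branch with an extra membership check rather than change its shape.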
---
## 0007 — Tauri client (Rust shell + Svelte UI) (2026-06-07)
**Status:** Accepted
**Context:** Original prototype client was PyQt5. Tauri is smaller, modern, builds tiny installers, and lets the UI be written in web tech.
**Decision:** Client is a Tauri 2 app with Svelte UI inside.
**Consequences:** Need to either port patch-application logic to Rust or ship a Python sidecar the Tauri shell shells out to. UI is web tech (good ecosystem). Installer is small.
**Alternatives considered:** Keep PyQt5 (familiar but dated), Electron (huge install), web-only (loses native install/launch).
---
## 0008 — Username + password auth, admin-provisioned (2026-06-07)
**Status:** Accepted
**Context:** Options were Discord OAuth (natural fit since friends use Discord), self-hosted SSO (Authentik/Keycloak), or username/password.
**Decision:** Username + password, argon2id hashing, admin provisions accounts manually via admin portal.
**Consequences:** Simplest. No OAuth integration. No self-serve signup. Adding Discord OAuth later is straightforward.
**Alternatives considered:** Discord OAuth (declined — adds dependency on Discord availability), magic-link email (needs SMTP).
---
## 0009 — Border-server reverse proxy + Tailscale backhaul, not cluster-side ingress (2026-06-07)
**Status:** Accepted
**Context:** Friends shouldn't need to install Tailscale to use the service. Backend VM shouldn't be on the public internet.
**Decision:** Public DNS → border server (public IP) → Nginx Proxy Manager terminates TLS via Let's Encrypt → Tailscale backhaul → VM. VM has no public exposure.
**Consequences:** Friends use a regular browser/client over HTTPS. TLS lives at the border, not in the cluster. Cluster-side cert-manager not needed.
**Alternatives considered:** Tailscale on every friend's machine (rejected — bad UX), public IP on the VM (rejected — security surface).
+151
@@ -0,0 +1,151 @@
# Mist — Operational Runbook
A short, dense reference for "what do I do when X happens." Fill in as we hit real situations.
## Backend VM access
SSH: `ssh mist@<vm-tailnet-name>` (or `<vm-tailnet-ip>`).
Compose lives at: `/opt/mist/` (TODO: confirm during deploy).
All `docker compose` commands run from that directory.
## Normal operations
### Deploy a new image
CI pushes images to GHCR on merge to `main`. To pull and restart:
```sh
cd /opt/mist
docker compose pull
docker compose up -d
docker compose ps
```
### View logs
```sh
docker compose logs -f api
docker compose logs -f worker
docker compose logs --tail=200 api worker
```
### Restart the stack
```sh
cd /opt/mist
docker compose down
docker compose up -d
```
### Restart a single service
```sh
docker compose restart worker
```
### Run a Celery task manually (debugging)
```sh
docker compose exec api python -c "from mist.worker.tasks import generate_direct_update; generate_direct_update.delay('Satisfactory', '1.0.0.0', '1.0.0.1')"
```
## Failure scenarios
### NAS is unreachable
**Symptoms:** worker tasks fail with `FileNotFoundError` for `/mnt/nas/...`, API `/downloads/*` returns 404 for non-cached files.
**Action:**
1. Verify NAS reachability from the VM: `ls /mnt/nas/mist/games/`
2. If empty/error, NFS mount is broken. Check mount: `mount | grep nas`
3. Remount: `sudo mount -a` (assuming `/etc/fstab` has the entry)
4. If still broken, log into NAS, verify it's serving NFS
5. Stack will recover automatically once NAS is back; in-flight jobs will retry per Celery config
### Postgres won't start
**Symptoms:** `api` container restarts in a loop, logs show `connection refused` to `postgres`.
**Action:**
1. `docker compose logs postgres` — look for the actual error
2. Common cause: out of disk space. `df -h` on the VM.
3. If corrupted volume: stop stack, restore from last `pg_dump` (see "Restore from a Postgres backup")
### Worker queue is backed up
**Symptoms:** Builds take forever, RabbitMQ UI (`http://<vm>:15672/`) shows growing queue depth.
**Action:**
1. Check worker logs for stuck tasks
2. Scale workers: edit `docker-compose.yml`, set `worker.deploy.replicas: 2`, `docker compose up -d`
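3. If the backlog itself is the problem, drop it: `docker compose exec worker celery -A mist.worker purge`. Note that `purge` discards every waiting message in the queue, not a single task, and does not stop a task that is already running; for a hung task, `docker compose restart worker`.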
### Cache disk is full
**Symptoms:** Build jobs fail with `OSError: no space left on device`.
**Action:**
1. `df -h` to confirm
2. `docker compose exec api python -m mist.core.paths --clear-cache` (TODO: implement this maintenance task)
3. Or manually: stop stack, `rm -rf /var/lib/docker/volumes/mist_cache-vol/_data/*`, restart
### Stack won't come back up after VM reboot
**Symptoms:** SSH in after reboot, `docker compose ps` shows nothing or services are Exited.
**Action:**
1. Verify Docker daemon: `systemctl status docker`
2. `cd /opt/mist && docker compose up -d`
3. If still failing, check `restart: unless-stopped` is set on all services in `docker-compose.yml`
## Backups
### What we back up
- Postgres (full dump) — daily
- `Mist/.env` (passwords, secrets) — versioned outside this repo
- `docker-compose.yml` and any host-level config — in git
### What we DON'T back up here
- Game files on NAS — NAS has its own backup story (assumed RAID + remote replication)
- Hot cache — regenerable from NAS
### Take a Postgres backup
```sh
docker compose exec -T postgres pg_dump -U mist mist | zstd > /mnt/nas/mist/backups/pg-$(date +%F).sql.zst
```
### Restore from a Postgres backup
```sh
docker compose stop api worker
zstd -d < /mnt/nas/mist/backups/pg-YYYY-MM-DD.sql.zst | docker compose exec -T postgres psql -U mist mist
docker compose start api worker
```
## Provisioning a new friend account
(Until the admin portal supports this end-to-end.)
```sh
docker compose exec api python -m mist.scripts.create_user <username> <password> [--admin]
```
(TODO: implement that script.)
## Resetting your admin password
```sh
docker compose exec api python -m mist.scripts.reset_password <username> <new-password>
```
(TODO: implement that script.)
## Health checks (manual)
```sh
curl -s https://api.mist.example/healthz # expect {"ok": true}
curl -s https://api.mist.example/readyz # expect 200 if DB/Redis/RabbitMQ all reachable
```
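The same two checks can be scripted for a cron job or a post-deploy step. This is a stdlib-only sketch; `probe`, `healthz_ok`, and `readyz_ok` are hypothetical names, and the expected `{"ok": true}` body is taken from the comment above.

```python
import json
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> tuple[int, bytes]:
    """Return (status, body); status 0 means no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # 4xx/5xx still carry a status code and body worth reporting.
        return exc.code, exc.read()
    except OSError:
        # DNS failure, refused connection, timeout, TLS error, and similar.
        return 0, b""


def healthz_ok(status: int, body: bytes) -> bool:
    """True only for a 200 response whose JSON body has ok == true."""
    if status != 200:
        return False
    try:
        return json.loads(body).get("ok") is True
    except (ValueError, AttributeError):
        return False


def readyz_ok(status: int) -> bool:
    """Readiness is judged by status code alone."""
    return status == 200
```

Typical use would be `healthz_ok(*probe("https://api.mist.example/healthz"))`, exiting non-zero when it returns `False`.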