docs: sync README and docs/ with current codebase

Surfaces features that landed after the last big docs pass: per-ride history pages, Fast Lane wait times, outage shading on the today chart, Tier-5 wait-time sampler, production-hardening pieces (rate limiter, structured logger, env validation, graceful shutdown), and the new rides + ride_wait_samples tables. Also corrects the weather-delay rule to match the "open" vs "closing" gate now in rides.ts.
2026-06-02 15:31:50 -04:00
parent 2e9cec0b56
commit f87462385c
5 changed files with 397 additions and 72 deletions
@@ -116,9 +116,12 @@ volumes:
 |----------|---------|-------------|
 | `TZ` | `UTC` | Process timezone. Controls when cron jobs fire. Set to `America/New_York` in production so schedules align with US Eastern parks. |
 | `PARK_HOURS_STALENESS_HOURS` | `72` | Hours before park schedule data is considered stale and re-fetched. Lower values increase API load; higher values increase data lag. |
+| `RATE_LIMIT_PER_MIN` | `60` | Per-IP request limit for the public API. Over-limit requests return `429 Too Many Requests` with a `Retry-After` header. Enforced by `backend/src/middleware/rate-limit.ts`. Behind a proxy, make sure the proxy sets `x-forwarded-for`; otherwise every client appears to come from the proxy's IP and shares a single limit. |
 | `NODE_ENV` | -- | Set to `production` in Docker. |
 | `PORT` | `3001` | Server listen port. |

+`backend/src/config.ts` parses and validates these at startup. A bad value (e.g. `PORT=foo`) fails fast with a thrown `Error` rather than surfacing in a request handler later.
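The fail-fast validation described above could look roughly like the sketch below. It is an illustration only: the `AppConfig` shape, `readPositiveInt`, and `loadConfig` are hypothetical names, and the real `backend/src/config.ts` may validate differently. The defaults are taken from the environment variable table.

```typescript
// Hypothetical config shape; the real backend/src/config.ts may differ.
interface AppConfig {
  port: number;
  parkHoursStalenessHours: number;
  rateLimitPerMin: number;
}

type Env = Record<string, string | undefined>;

// Read a positive integer, falling back to a default when the variable is unset.
// Throws immediately on a malformed value so the process fails at startup.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`Invalid ${name}: expected a positive integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): AppConfig {
  return {
    port: readPositiveInt(env, "PORT", 3001),
    parkHoursStalenessHours: readPositiveInt(env, "PARK_HOURS_STALENESS_HOURS", 72),
    rateLimitPerMin: readPositiveInt(env, "RATE_LIMIT_PER_MIN", 60),
  };
}
```

Passing the environment in as an argument (for example `loadConfig(process.env)`) keeps the parser easy to test without mutating global state.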
+
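To make the `RATE_LIMIT_PER_MIN` behavior from the table concrete, here is a minimal per-IP limiter sketch. It assumes a fixed 60-second window, which the docs do not confirm; `FixedWindowLimiter` and `clientIp` are hypothetical names, not the contents of `backend/src/middleware/rate-limit.ts`.

```typescript
interface RateLimitDecision {
  allowed: boolean;
  retryAfterSec: number; // 0 when the request is allowed
}

// Assumption: a fixed-window counter per IP. The real middleware may use a
// sliding window or token bucket instead.
class FixedWindowLimiter {
  private readonly limitPerMin: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limitPerMin: number, windowMs: number = 60000) {
    this.limitPerMin = limitPerMin;
    this.windowMs = windowMs;
  }

  check(ip: string, nowMs: number): RateLimitDecision {
    const w = this.windows.get(ip);
    // Start a fresh window on first sight or once the old window has expired.
    if (!w || nowMs - w.start >= this.windowMs) {
      this.windows.set(ip, { start: nowMs, count: 1 });
      return { allowed: true, retryAfterSec: 0 };
    }
    w.count += 1;
    if (w.count <= this.limitPerMin) return { allowed: true, retryAfterSec: 0 };
    // Over the limit: report how long until the current window ends.
    const retryAfterSec = Math.ceil((w.start + this.windowMs - nowMs) / 1000);
    return { allowed: false, retryAfterSec };
  }
}

// Use the first x-forwarded-for entry when present, else the socket address.
function clientIp(forwardedFor: string | undefined, socketAddr: string): string {
  const first = forwardedFor?.split(",")[0]?.trim();
  return first ? first : socketAddr;
}
```

A blocked decision would map to a `429 Too Many Requests` response with `Retry-After` set to `retryAfterSec`. Note that `x-forwarded-for` is client-controlled unless a trusted proxy overwrites it.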
 ---

 ## CI/CD Pipeline
@@ -167,9 +170,10 @@ These are configured in the Gitea repository settings under **Settings > Actions
 3. **Verify the backend started:**
   ```bash
   docker compose logs backend
-   # Look for: [backend] database initialized
-   #           [scheduler] cron jobs registered
-   #           [backend] listening on http://localhost:3001
+   # Look for (structured log lines, see the Log Reference section):
+   #   [INFO] [startup] database initialized
+   #   [INFO] [scheduler] cron jobs registered ...
+   #   [INFO] [startup] listening url=http://localhost:3001
   ```

 4. **Check database status (will be empty on first run):**
@@ -251,7 +255,7 @@ Backups are recommended for continuity (avoiding the 5-10 minute re-scrape windo

 ### Tiered Cron Schedule

-The backend runs four scraping tiers via `node-cron`:
+The backend runs five scraping tiers via `node-cron`:

 | Tier | Cron Expression | Schedule | Scope | Delay |
 |------|-----------------|----------|-------|-------|
@@ -259,10 +263,24 @@ The backend runs four scraping tiers via `node-cron`:
 | 2 | `0 */6 * * *` | Every 6 hours | Current month for all parks | 1000ms |
 | 3 | `0 3,15 * * *` | 3 AM and 3 PM | Current + next month | 1000ms |
 | 4 | `0 3 * * *` | Daily at 3 AM | Full year (all 12 months) | 1000ms |
+| 5 | `*/5 * * * *` | Every 5 minutes | Wait-time samples for currently open parks, written to `ride_wait_samples` | parallel chunks of 6 |

-**Staleness:** Tiers 2-4 skip any park-month that was scraped within `PARK_HOURS_STALENESS_HOURS` (default 72h). Tier 1 always fetches (uses diff-before-write instead).
+**Staleness:** Tiers 2-4 skip any park-month that was scraped within `PARK_HOURS_STALENESS_HOURS` (default 72h). Tier 1 always fetches (uses diff-before-write instead). Tier 5 only samples parks whose `park_days` row marks them open today *and* whose current local time is inside the operating window (with a 1-hour closing buffer).

-**Off-season:** Tier 1 only runs from March through December. The month constraint `3-12` in the cron expression skips January and February when most parks are closed.
+**Off-season:** Tier 1 only runs from March through December. The month constraint `3-12` in the cron expression skips January and February when most parks are closed. Tier 5 runs year-round but is effectively a no-op when no parks are open.
+
+**Concurrency latches:** Every tier is wrapped in `withLatch()` (see `backend/src/services/scheduler.ts`). If a tick is still running when the next would fire, the new tick is *skipped* and logged with a `previous run still in progress` warning rather than stacking. Each tier has its own latch so a slow Tier-4 doesn't block Tier-5's 5-minute cadence.
+
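The skip-instead-of-stack behavior of a concurrency latch can be sketched as follows. The signature here is an assumption for illustration (the `onSkip` callback in particular is hypothetical); the real `withLatch()` in `backend/src/services/scheduler.ts` may look different.

```typescript
type Task = () => Promise<void>;

// Wrap a task so that a tick arriving while the previous run is still in
// flight is skipped rather than queued. Each wrapped task owns its own flag,
// so one slow tier cannot block another.
function withLatch(name: string, task: Task, onSkip: (name: string) => void): Task {
  let running = false;
  return async () => {
    if (running) {
      onSkip(name); // e.g. log a "previous run still in progress" warning
      return;
    }
    running = true;
    try {
      await task();
    } finally {
      running = false; // release the latch even if the task throws
    }
  };
}
```

Because Node runs this on a single thread, the check and set of `running` cannot interleave, so a plain boolean is enough and no lock is needed.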
+**Weather-delayed parks are excluded from sampling:** Tier 5 detects the case where a park has rides but all of them are closed during scheduled hours, and skips writes for that park, so a storm doesn't poison the uptime statistics with hours of `is_open=0` samples.
+
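One way to express the weather-delay check is sketched below. The rule encoded here (inside scheduled hours, at least one ride known, none open) is read directly from the paragraph above; the `RideStatus` type and both function names are hypothetical, and the actual sampler may apply extra conditions.

```typescript
interface RideStatus {
  name: string;
  isOpen: boolean;
}

// Assumed rule: the park should be operating, rides exist, and every ride is closed.
function looksWeatherDelayed(withinScheduledHours: boolean, rides: RideStatus[]): boolean {
  return withinScheduledHours && rides.length > 0 && rides.every((r) => !r.isOpen);
}

// Return the samples that should be written for this park on this tick.
// A weather-delayed park contributes nothing, so uptime stats are not skewed.
function samplesToWrite(withinScheduledHours: boolean, rides: RideStatus[]): RideStatus[] {
  return looksWeatherDelayed(withinScheduledHours, rides) ? [] : rides;
}
```

The `rides.length > 0` guard matters: `every()` returns `true` for an empty array, which would otherwise flag a park with no ride data as weather-delayed.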
+### Startup Behavior
+
+On boot, the scheduler checks `getParkDayCount()` against a threshold of 50 rows:
+
+- **Empty / nearly-empty database** (< 50 rows): runs `scrapeToday()` followed by `scrapeFullYear()` in sequence. Logs `[scheduler.startup]` lines for each phase.
+- **Populated database** (≥ 50 rows): skips the startup scrape and relies on cron tiers. Logs `skipping startup scrape — relying on cron`.
+
+This replaces the earlier behavior of full-scraping on every container start, which doubled outbound API load and delayed readiness on every deploy.
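
The startup decision above can be sketched as a small function. The threshold of 50 and the names `getParkDayCount()`, `scrapeToday()`, and `scrapeFullYear()` come from the text; the `StartupDeps` interface, the `runStartupScrape` name, and the return value are assumptions made so the sketch is testable.

```typescript
const STARTUP_ROW_THRESHOLD = 50;

// Dependencies are injected so the decision logic can be exercised without a
// database or network access. Signatures are assumed, not taken from the code.
interface StartupDeps {
  getParkDayCount: () => number;
  scrapeToday: () => Promise<void>;
  scrapeFullYear: () => Promise<void>;
}

async function runStartupScrape(deps: StartupDeps): Promise<"scraped" | "skipped"> {
  // Populated database: rely on the cron tiers instead of scraping at boot.
  if (deps.getParkDayCount() >= STARTUP_ROW_THRESHOLD) return "skipped";
  // Empty or nearly empty database: today first, then the full year, in sequence.
  await deps.scrapeToday();
  await deps.scrapeFullYear();
  return "scraped";
}
```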

 ### Timezone Sensitivity

@@ -374,7 +392,7 @@ curl http://localhost:3001/api/status
   ```bash
   docker compose logs backend --tail 50
   ```
-   Look for `[backend] listening on http://localhost:3001`.
+   Look for an `[INFO] [startup] listening url=http://localhost:3001` line.

 2. **Check if the database has data:**
   ```bash
@@ -452,28 +470,39 @@ If the database becomes corrupted (unlikely with SQLite WAL mode, but possible a

 ## Log Reference

-| Prefix | Source | Meaning |
-|--------|--------|---------|
-| `[backend]` | `index.ts` | Startup messages: DB initialized, server listening |
-| `[scheduler]` | `scheduler.ts` | Cron job triggers with tier number |
-| `[today]` | `scraper.ts` | Per-park results for the today tier (updated/skipped/error) |
-| `[month]` | `scraper.ts` | Per-park-month results (open days count, rate limited, errors) |
-| `[rate-limited]` | `sixflags.ts` | HTTP 429/503 with backoff timing and retry attempt count |
+The backend uses a small structured logger (`backend/src/log.ts`). Every line has the format:
+
+```
+<ISO timestamp> [<LEVEL>] [<tag>] <message> key1=value1 key2=value2 …
+```
+
+Levels are `INFO`, `WARN`, and `ERROR`. `ERROR` writes to stderr; the others write to stdout. The format is grep-friendly: filter by tag (`grep '\[scheduler\.tier1\]'`) or by key (`grep 'park=cedarpoint'`).
+
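A formatter producing the line format above might look like this sketch. `formatLine` and its parameter list are hypothetical and are not the API of `backend/src/log.ts`; quoting values that contain whitespace is an assumption inferred from the `tiers="…"` field in the example output.

```typescript
type Level = "INFO" | "WARN" | "ERROR";
type Fields = Record<string, string | number | boolean>;

// Build "<ISO timestamp> [<LEVEL>] [<tag>] <message> key=value ...".
function formatLine(
  ts: Date,
  level: Level,
  tag: string,
  message: string,
  fields: Fields = {}
): string {
  const pairs = Object.keys(fields).map((key) => {
    const value = String(fields[key]);
    // Quote values containing whitespace so each pair stays one greppable token.
    return /\s/.test(value) ? `${key}="${value}"` : `${key}=${value}`;
  });
  return [`${ts.toISOString()} [${level}] [${tag}] ${message}`, ...pairs].join(" ");
}
```

A real logger would then route the formatted string to stderr for `ERROR` and to stdout for the other levels.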
+| Tag | Source | Meaning |
+|-----|--------|---------|
+| `startup` | `index.ts` | Config loaded, DB initialized, server listening |
+| `shutdown` | `index.ts` | `SIGTERM`/`SIGINT` received; graceful shutdown progress |
+| `http` | `index.ts` | One line per request: `method`, `path`, `status`, `ms` |
+| `scheduler` | `scheduler.ts` | Cron job registration summary on boot |
+| `scheduler.tier1` … `scheduler.tier5` | `scheduler.ts` | Each tier's tick; includes skip-due-to-latch warnings |
+| `scheduler.startup` | `scheduler.ts` | Result of the "database empty" startup scrape |
+| `today` / `month` | `scraper.ts` | Per-park / per-month scrape results |
+| `wait-sampler` | `wait-sampler.ts` | Tier-5 per-park sample writes, errors, weather-delay skips |
+| `rate-limit` | `middleware/rate-limit.ts` | `blocked` event with `ip`, `count`, `retryAfter` |
+| `rides` | `routes/rides.ts` | Per-request warnings when upstream calls fail |
+| `rate-limited` | `lib/scrapers/sixflags.ts` | HTTP 429/503 from Six Flags with backoff timing |

 **Example log output:**

 ```
-[backend] database initialized
-[scheduler] cron jobs registered
-  tier-1: today        — hourly (Mar-Dec)
-  tier-2: current month — every 6h
-  tier-3: upcoming     — 3 AM + 3 PM
-  tier-4: full year    — 3 AM daily
-[backend] listening on http://localhost:3001
-[scheduler] tier-1: scraping today @ 2026-04-23T14:00:00.000Z
-[today] Great Adventure: updated (open 10am - 6pm)
-[today] Cedar Point: updated (open 10am - 8pm)
-[today] done: 24 fetched, 3 updated, 0 skipped, 0 errors
+2026-04-23T14:00:00.012Z [INFO] [startup] config loaded port=3001 nodeEnv=production parkHoursStalenessHours=72 rateLimitPerMin=60
+2026-04-23T14:00:00.034Z [INFO] [startup] database initialized
+2026-04-23T14:00:00.041Z [INFO] [scheduler] cron jobs registered tiers="tier1=hourly(Mar-Dec) tier2=6h tier3=3am+3pm tier4=3am-daily tier5=5min"
+2026-04-23T14:00:00.042Z [INFO] [scheduler] skipping startup scrape — relying on cron existingRows=8742
+2026-04-23T14:00:00.045Z [INFO] [startup] listening url=http://localhost:3001
+2026-04-23T14:00:00.123Z [INFO] [http] GET /api/calendar/week status=200 ms=18
+2026-04-23T14:00:10.001Z [INFO] [scheduler.tier1] scraping today
+2026-04-23T14:05:00.001Z [INFO] [scheduler.tier5] sample run complete parksSampled=14 parksSkipped=10 samplesWritten=612 weatherDelayed=0 errors=0
 ```
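
For tooling that needs more than `grep`, lines like the ones above can be split back into their parts. This parser is a sketch written against the example output only; `parseLine` and `ParsedLine` are hypothetical, and it assumes keys are word characters and that quoted values contain no embedded double quotes.

```typescript
interface ParsedLine {
  timestamp: string;
  level: string;
  tag: string;
  message: string;
  fields: Record<string, string>;
}

function parseLine(line: string): ParsedLine | null {
  // Header: "<timestamp> [<LEVEL>] [<tag>] <rest>"
  const head = /^(\S+) \[([A-Z]+)\] \[([^\]]+)\] (.*)$/.exec(line);
  if (!head) return null;
  const [, timestamp, level, tag, rest] = head;

  // key=value pairs, where the value is either "quoted text" or a bare token.
  const pair = /(?:^|\s)(\w+)=(?:"([^"]*)"|(\S+))/g;
  const fields: Record<string, string> = {};
  let firstPairIndex = rest.length;
  let m: RegExpExecArray | null;
  while ((m = pair.exec(rest)) !== null) {
    if (m.index < firstPairIndex) firstPairIndex = m.index;
    fields[m[1]] = m[2] !== undefined ? m[2] : m[3];
  }

  // Everything before the first pair is the free-text message.
  return { timestamp, level, tag, message: rest.slice(0, firstPairIndex).trim(), fields };
}
```

For the `[http]` line in the example, this yields the message `GET /api/calendar/week` with fields `status` and `ms` as strings; numeric conversion is left to the caller.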

 ---