"body":"Flagged by AI reviewer in PR #909.\n\n## Problem\n\n`tools/vault-import.sh` still uses hardcoded `secret/data/${path}` for its curl-based KV write (lines 149, 151, 162, 166, 170). The rest of the codebase was migrated to the configurable `VAULT_KV_MOUNT` variable (defaulting to `kv`) via PR #909. Any deployment with `kv/` as its KV mount will see 403/404 failures when `vault-import.sh` runs.\n\n## Fix\n\nEither:\n1. Refactor the write in `vault-import.sh` to call `hvault_kv_put` (which now respects `VAULT_KV_MOUNT`), or\n2. Replace the hardcoded `secret/data` reference with `${VAULT_KV_MOUNT:-kv}/data` matching the convention in `lib/hvault.sh`.\n\n---\n*Auto-created from AI review*\n\n## Affected files\n\n- `tools/vault-import.sh` (lines 149, 151, 162, 166, 170 — hardcoded `secret/data` references)\n- `lib/hvault.sh` (reference implementation using `VAULT_KV_MOUNT`)\n\n## Acceptance criteria\n\n- [ ] `tools/vault-import.sh` uses `${VAULT_KV_MOUNT:-kv}/data` (or calls `hvault_kv_put`) instead of hardcoded `secret/data`\n- [ ] No hardcoded `secret/data` path references remain in `tools/vault-import.sh`\n- [ ] Vault KV writes succeed when `VAULT_KV_MOUNT=kv` is set (matching the standard deployment config)\n- [ ] `shellcheck` clean\n"
"body":"Flagged by AI reviewer in PR #911.\n\n## Problem\n\n`lib/generators.sh` fixes the `agents` service missing `pull_policy: build` in `--build` mode (PR #893), but the `edge` service has the same root cause: the sed replacement at line 664 produces `build: ./docker/edge` with no `pull_policy: build`. Without it, `docker compose up -d --force-recreate` reuses the cached edge image and silently keeps running stale code even after source changes.\n\n## Fix\n\nAdd `\\n pull_policy: build` to the edge sed replacement, matching the pattern applied to agents in PR #893.\n\n---\n*Auto-created from AI review*\n\n## Affected files\n\n- `lib/generators.sh` (line 664 — edge service sed replacement missing `pull_policy: build`)\n\n## Acceptance criteria\n\n- [ ] `lib/generators.sh` edge service block emits `pull_policy: build` when `--build` mode is active (matching the pattern from PR #893 for the agents service)\n- [ ] `docker compose up -d --force-recreate` after source changes rebuilds the edge image rather than using the cached layer\n- [ ] Generated `docker-compose.yml` edge service stanza contains `pull_policy: build`\n- [ ] `shellcheck` clean\n"
"body":"## Incident\n\n**2026-04-16 ~10:55–11:52 UTC.** Woodpecker CI agent (`disinto-woodpecker-agent`) entered a repeated gRPC-error crashloop (Codeberg #813 class — gRPC-in-nested-docker). Every workflow it accepted exited 1 within seconds, never actually running pipeline steps.\n\n**Blast radius:** dev-qwen took issue #842 at 10:55, opened PR #859, and burned its full 3-attempt `pr-lifecycle` CI-fix budget between 10:55 and 11:08 reacting to these infra-flake \"CI failures.\" Each failure arrived in ~30–60 seconds, too fast to be a real test run. After exhausting the budget, dev-qwen marked #842 as `blocked: ci_exhausted` and moved on. No real bug was being detected; the real failure surfaced later only after an operator restarted the WP agent and manually retriggered pipeline #966 — which then returned a legitimate `bats-init-nomad` failure in test #6 (different issue).\n\n**Root cause of the infra-flake:** gRPC-in-nested-docker bug, Woodpecker server ↔ agent comms inside nested containers. Known-flaky; restart of `disinto-woodpecker-agent` clears it.\n\n**Recovery:** operator `docker restart disinto-woodpecker-agent` + retrigger pipelines via WP API POST `/api/repos/2/pipelines/<N>`. Fresh run reached real stage signal.\n\n## Why this burned dev-qwen's budget\n\n`pr-lifecycle`'s CI-fix budget treats every failed commit-status as a signal to invoke the agent. It has no notion of \"infra flake\" vs. \"real test failure\" and no heuristic to distinguish them. Four infra-flake failures in 13 minutes looked identical to four real code-bug failures.\n\n## Suggestions — what supervisor can check every 20min\n\nSupervisor runs every `1200s` already. Add these probes:\n\n**1. WP agent container health.**\n```\ndocker inspect disinto-woodpecker-agent --format '{{.State.Health.Status}}'\n```\nIf `unhealthy` for the second consecutive supervisor tick → **restart it automatically + post a comment on any currently-running dev-bot/dev-qwen issues warning \"CI agent was restarted; subsequent failures before this marker may be infra-flake.\"**\n\n**2. Fast-failure heuristic on WP pipelines.**\nQuery WP API `GET /api/repos/2/pipelines?page=1`. For each pipeline in state `failure`, compute `finished - started`. If duration < 60s, flag as probable infra-flake. Three flagged flakes within a 15-min window → trigger agent restart as in (1) and a bulk-retrigger via POST `/api/repos/2/pipelines/<N>` for each.\n\n**3. grpc error pattern in agent log.**\n`docker logs --since 20m disinto-woodpecker-agent 2>&1 | grep -c 'grpc error'` — if ≥3 matches, agent is probably wedged. Trigger restart as in (1).\n\n**4. Issue-level guard.**\nWhen supervisor detects an agent restart, scan for issues updated in the preceding 30min with label `blocked: ci_exhausted` and for each one:\n- unassign + remove `blocked` label (return to pool)\n- comment on the issue: *\"CI agent was unhealthy between HH:MM and HH:MM — prior 3/3 retry budget may have been spent on infra flake, not real failures. Re-queueing for a fresh attempt.\"*\n-retriggerthePR'slatestWPpipeline\n\nThislaststepisthekeycorrection:**`ci_exhausted`precededbyWP-agent-unhealth=falsepositive;returntopoolwithcontext.**\n\n##Whythismattersforthemigration\n\nBetweennowandcutovereveryWPCIflakethatsilentlyexhaustsanagent'sbudgetstealshoursofclocktime.Withoutanautomaticrecoverypath,thepaceofthestep-Nbacklogsfallsoffacliffthemomenttheagentnextgoesunhealthy—andit*will*gounhealthyagain(Codeberg#813isnotfixedupstreamyet).\n\n##Fixforthisspecificincident(alreadyappliedmanually)\n\n-Restarted`disinto-woodpecker-agent`.\n-ClosedPR#859(keptbranch`fix/issue-842`at`64080232`).\n-Unassigneddev-qwenfrom#842,removed`blocked`label,appendedprior-artsection+pipeline#966test-#6failuredetailstoissuebodysothenextclaimantstartswithfullcontext.\n\n##Non-goals\n\n-NottryingtofixCodeberg#813itself(upstreamgRPC-in-nested-dockerissue).\n-Nottryingtofix`pr-li