docs: document Claude Code OAuth concurrency model and external flock rationale

The external flock on ${HOME}/.claude/session.lock in lib/agent-sdk.sh
is load-bearing, not belt-and-suspenders. Claude Code internally locks
OAuth refresh via proper-lockfile.lock(claudeDir) but uses the default
sibling lockfile path (<target>.lock), which lands in /home/agent/,
the container-local overlay, not the bind-mounted .claude/ directory.
So the internal lock is a no-op across our containers.

Documents the rationale, the empirical verification, and a decision
matrix for future containers that need to run Claude Code (notably
the chat container in #623).

Refs: #623
parent 507fd952ea
commit 586b142154
2 changed files with 136 additions and 1 deletions
135
docs/CLAUDE-AUTH-CONCURRENCY.md
Normal file
@@ -0,0 +1,135 @@
# Claude Code OAuth Concurrency Model

## TL;DR

The factory runs N+1 concurrent Claude Code processes across containers
that all share `~/.claude` via bind mount. To avoid OAuth refresh races,
they MUST be serialized by the external `flock` on
`${HOME}/.claude/session.lock` in `lib/agent-sdk.sh`. Claude Code's
internal OAuth refresh lock does **not** work across containers in our
mount layout. Do not remove the external flock without also fixing the
lockfile placement upstream.

## What we run

| Container | Claude Code processes | Mount of `~/.claude` |
|---|---|---|
| `disinto-agents` (persistent) | polling-loop agents via `lib/agent-sdk.sh::agent_run` | `/home/johba/.claude` → `/home/agent/.claude` (rw) |
| `disinto-edge` (persistent) | none directly; spawns transient containers via `docker/edge/dispatcher.sh` | n/a |
| transient containers spawned by `dispatcher.sh` | one-shot `claude` per invocation | same mount, same path |

All N+1 processes can hit the OAuth refresh window concurrently when
the access token nears expiry.

## The race

OAuth access tokens are short-lived; refresh tokens rotate on each
refresh. If two processes both POST the same refresh token to
Anthropic's token endpoint simultaneously, only one wins; the other
gets `invalid_grant` and the operator is forced to re-login.

Historically this manifested as "agents losing auth, frequent re-logins",
which is the original reason `lib/agent-sdk.sh` introduced the external
flock. The current shape (post-#606 watchdog work) is at
`lib/agent-sdk.sh:139,144`:

```bash
local lock_file="${HOME}/.claude/session.lock"
...
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
    claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
```

This serializes every `claude` invocation across every process that
shares `${HOME}/.claude/`.

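
The fd-9 subshell pattern above can be tried in isolation. The following is a minimal sketch, not code from `lib/agent-sdk.sh`: the temp files, the `critical_section` helper, and the short timeout are all stand-ins chosen for illustration, and it assumes the util-linux `flock` binary is installed.

```bash
#!/usr/bin/env bash
# Sketch: two background jobs contend for one lock file using the same
# fd-9 subshell pattern as the excerpt above. Paths are throwaway temp files.
set -u

lock_file="$(mktemp)"    # stand-in for ${HOME}/.claude/session.lock
trace_file="$(mktemp)"   # records enter/leave events in arrival order

# Hypothetical helper: hold the lock while doing some "work".
critical_section() {
  local name="$1"
  (
    # Wait up to 10s for an exclusive lock on fd 9; give up on timeout.
    flock -w 10 9 || exit 1
    echo "enter $name" >>"$trace_file"
    sleep 0.2              # stand-in for the real `claude` call
    echo "leave $name" >>"$trace_file"
  ) 9>"$lock_file"
}

critical_section a &
critical_section b &
wait

cat "$trace_file"
```

Whichever job wins the lock first, its `enter` and `leave` lines should appear back to back; the two critical sections should never interleave.
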
## Why Claude Code's internal lock does not save us

`src/utils/auth.ts:1491` (read from a leaked TS source, current as of
April 2026) calls:

```typescript
release = await lockfile.lock(claudeDir)
```

with no `lockfilePath` option. `proper-lockfile` defaults to creating
the lock at `<target>.lock` as a **sibling**, so for
`claudeDir = /home/agent/.claude`, the lockfile is created at
`/home/agent/.claude.lock`.

`/home/agent/.claude` is bind-mounted from the host, but `/home/agent/`
itself is part of each container's local overlay filesystem. So each
container creates its own private `/home/agent/.claude.lock`, and the
containers never see each other's locks. The internal cross-process
lock is a no-op across our containers.

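
The path arithmetic behind this can be sketched in a few lines of shell. This is an illustration only: `default_lock_path` and `is_inside` are hypothetical helpers, the check is purely string-based, and it assumes (as described above) that `.claude` itself is the only bind-mounted directory.

```bash
#!/usr/bin/env bash
# Sketch: where does a lockfile land relative to the shared directory?
set -u

# Mirrors the "<target>.lock" sibling default described above.
default_lock_path() {
  printf '%s.lock\n' "${1%/}"
}

# Succeeds (exit 0) when path $1 is located inside directory $2.
is_inside() {
  case "$1" in
    "${2%/}"/*) return 0 ;;
    *)          return 1 ;;
  esac
}

claude_dir="/home/agent/.claude"               # bind-mounted from the host
sibling="$(default_lock_path "$claude_dir")"   # what the internal lock uses
inside="${claude_dir}/session.lock"            # what the external flock uses

for p in "$sibling" "$inside"; do
  if is_inside "$p" "$claude_dir"; then
    echo "$p: inside the bind mount, shared across containers"
  else
    echo "$p: outside the bind mount, private to each container"
  fi
done
```

The sibling path falls outside `claude_dir`, so it is the one that ends up on each container's private overlay.
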
Verified empirically:

```
$ docker exec disinto-agents findmnt /home/agent/.claude
TARGET              SOURCE                                     FSTYPE
/home/agent/.claude /dev/loop15[/...rootfs/home/johba/.claude] btrfs

$ docker exec disinto-agents findmnt /home/agent
(blank: not a mount, container-local overlay)

$ docker exec disinto-agents touch /home/agent/test-marker
$ docker exec disinto-edge ls /home/agent/test-marker
ls: cannot access '/home/agent/test-marker': No such file or directory
```

(Compare with `src/services/mcp/auth.ts:2097`, which does it correctly
by passing `lockfilePath: join(claudeDir, "mcp-refresh-X.lock")`; that
lockfile lives inside the bind-mounted directory and IS shared. The
OAuth refresh path is an upstream oversight worth filing once we have
bandwidth.)

## How the external flock fixes it

The lock file path `${HOME}/.claude/session.lock` is **inside**
`~/.claude/`, which IS shared via the bind mount. All containers see
the same inode and serialize correctly via `flock`. This is a
sledgehammer (it serializes the entire `claude -p` call, not just the
refresh window) but it works.

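
The "same inode" point can be demonstrated without containers. The sketch below uses a hard link as a stand-in for the second container's view of the bind mount, which is an assumption made for illustration: both give two paths that resolve to one inode. File names and sleep durations are arbitrary, and util-linux `flock` is assumed.

```bash
#!/usr/bin/env bash
# Sketch: flock contention follows the inode, not the path used to open it.
set -u

work="$(mktemp -d)"
touch "$work/session.lock"
ln "$work/session.lock" "$work/other-view.lock"   # second path, same inode

# Holder: take the lock through the first path and keep it briefly.
( flock 9; sleep 1 ) 9>"$work/session.lock" &
holder=$!
sleep 0.3   # give the holder time to acquire the lock

# Contender: a non-blocking attempt through the second path.
if ( flock -n 9 ) 9>"$work/other-view.lock"; then
  while_held="acquired"
else
  while_held="busy"
fi

wait "$holder"

# Once the holder has exited, the same attempt goes through.
if ( flock -n 9 ) 9>"$work/other-view.lock"; then
  after_release="acquired"
else
  after_release="busy"
fi

echo "while held: $while_held, after release: $after_release"
```

A lock file on the container-local overlay would behave like two unrelated files instead, and both attempts would succeed at once.
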
## Decision matrix for new claude-using containers

When adding a new container that runs Claude Code:

1. **If the container is a batch / agent context** (long-running calls,
   tolerant of serialization): mount the same `~/.claude` and route
   all `claude` calls through `lib/agent-sdk.sh::agent_run` so they
   take the external flock.

2. **If the container is interactive** (chat, REPL, anything where the
   operator is waiting on a response): do NOT join the external flock.
   Interactive calls would starve behind the agent loop: each chat
   message would block until the current agent's `claude -p` call
   finishes, which can take minutes, and the 10-minute `flock -w 600`
   timeout would frequently expire under a busy loop. Instead, pick
   one of:
   - **Separate OAuth identity**: new `~/.claude-chat/` on the host with
     its own `claude auth login`, mounted to the new container's
     `/home/agent/.claude`. Independent refresh state.
   - **`ANTHROPIC_API_KEY` fallback**: the codebase already supports it
     in `docker/agents/entrypoint.sh:119-125`. Different billing track
     but trivial config and zero coupling to the agents' OAuth.

3. **Never** mount the parent directory `/home/agent/` instead of just
   `.claude/` to "fix" the lockfile placement; that exposes too much
   host state to the container.

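
The first two rows of the matrix could be expressed as a single wrapper. The sketch below is hypothetical and not part of this repo: `run_claude`, its mode names, and the `CLAUDE_LOCK_FILE` override are invented for illustration. In the real factory the batch branch is `agent_run`, and the interactive branch is only safe once the container has its own OAuth identity or an `ANTHROPIC_API_KEY`.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper illustrating the decision matrix; names are invented.
set -u

# run_claude MODE CMD [ARGS...]
#   batch        take the shared external flock before running CMD
#   interactive  run CMD immediately; assumes no refresh state is shared
#                with the agents (separate identity or API key)
run_claude() {
  local mode="$1"; shift
  local lock_file="${CLAUDE_LOCK_FILE:-${HOME}/.claude/session.lock}"
  case "$mode" in
    batch)
      ( flock -w 600 9 || exit 1; "$@" ) 9>"$lock_file"
      ;;
    interactive)
      "$@"
      ;;
    *)
      echo "run_claude: unknown mode '$mode'" >&2
      return 2
      ;;
  esac
}

# Usage with a stand-in command instead of the real `claude` binary.
CLAUDE_LOCK_FILE="$(mktemp)"
batch_out="$(run_claude batch echo "batch call ran under the lock")"
chat_out="$(run_claude interactive echo "chat call ran without the lock")"
echo "$batch_out"
echo "$chat_out"
```

Keeping the mode explicit at the call site makes it harder for an interactive container to join the agents' lock by accident.
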
## Future fix

The right long-term fix is upstream: file an issue against Anthropic's
claude-code repo asking that `src/utils/auth.ts:1491` be changed to
follow the pattern at `src/services/mcp/auth.ts:2097` and pass an
explicit `lockfilePath` inside `claudeDir`. Once that lands and we
upgrade, the external flock can become a fast-path no-op or be removed
entirely.

## See also

- `lib/agent-sdk.sh:139,144`: the external flock
- `docker/agents/entrypoint.sh:119-125`: the `ANTHROPIC_API_KEY` fallback
- Issue #623: chat container, auth strategy (informed by this doc)

@@ -24,7 +24,7 @@ sourced as needed.
| `lib/issue-lifecycle.sh` | Reusable issue lifecycle library: `issue_claim()` (add in-progress, remove backlog), `issue_release()` (remove in-progress, add backlog), `issue_block()` (post diagnostic comment with secret redaction, add blocked label), `issue_close()`, `issue_check_deps()` (parse deps, check transitive closure; sets `_ISSUE_BLOCKED_BY`, `_ISSUE_SUGGESTION`), `issue_suggest_next()` (find next unblocked backlog issue; sets `_ISSUE_NEXT`), `issue_post_refusal()` (structured refusal comment with dedup). Label IDs cached in globals on first lookup. Sources `lib/secret-scan.sh`. | dev-agent.sh (future) |
| `lib/vault.sh` | **Vault PR helper** — create vault action PRs on ops repo via Forgejo API (works from containers without SSH). `vault_request <action_id> <toml_content>` validates TOML (using `validate_vault_action` from `vault/vault-env.sh`), creates branch `vault/<action-id>`, writes `vault/actions/<action-id>.toml`, creates PR targeting `main` with title `vault: <action-id>` and body from context field, returns PR number. Idempotent: if PR exists, returns existing number. **Low-tier bypass**: if the action's `blast_radius` classifies as `low` (via `vault/classify.sh`), `vault_request` calls `_vault_commit_direct()` which commits directly to ops `main` using `FORGE_ADMIN_TOKEN` — no PR, no approval wait. Returns `0` (not a PR number) for direct commits. Requires `FORGE_TOKEN`, `FORGE_ADMIN_TOKEN` (low-tier only), `FORGE_URL`, `FORGE_REPO`, `FORGE_OPS_REPO`. Uses the calling agent's own token (saves/restores `FORGE_TOKEN` around sourcing `vault-env.sh`), so approval workflow respects individual agent identities. | dev-agent (vault actions), future vault dispatcher |
| `lib/branch-protection.sh` | Branch protection helpers for Forgejo repos. `setup_vault_branch_protection()` — configures admin-only merge protection on main (require 1 approval, restrict merge to admin role, block direct pushes). `setup_profile_branch_protection()` — same protection for `.profile` repos. `verify_branch_protection()` — checks protection is correctly configured. `remove_branch_protection()` — removes protection (cleanup/testing). Handles race condition after initial push: retries with backoff if Forgejo hasn't processed the branch yet. Requires `FORGE_TOKEN`, `FORGE_URL`, `FORGE_OPS_REPO`. | bin/disinto (hire-an-agent) |
| `lib/agent-sdk.sh` | `agent_run([--resume SESSION_ID] [--worktree DIR] PROMPT)` — one-shot `claude -p` invocation with session persistence. Saves session ID to `SID_FILE`, reads it back on resume. `agent_recover_session()` — restore previous session ID from `SID_FILE` on startup. **Nudge guard**: skips nudge injection if the worktree is clean and no push is expected, preventing spurious re-invocations. Callers must define `SID_FILE`, `LOGFILE`, and `log()` before sourcing. | formula-driven agents (dev-agent, planner-run, predictor-run, gardener-run) |
| `lib/agent-sdk.sh` | `agent_run([--resume SESSION_ID] [--worktree DIR] PROMPT)` — one-shot `claude -p` invocation with session persistence. Saves session ID to `SID_FILE`, reads it back on resume. `agent_recover_session()` — restore previous session ID from `SID_FILE` on startup. **Nudge guard**: skips nudge injection if the worktree is clean and no push is expected, preventing spurious re-invocations. Callers must define `SID_FILE`, `LOGFILE`, and `log()` before sourcing. **Concurrency**: every `claude` invocation is wrapped in `flock -w 600` on `${HOME}/.claude/session.lock` to serialize OAuth refresh across containers — see [`docs/CLAUDE-AUTH-CONCURRENCY.md`](../docs/CLAUDE-AUTH-CONCURRENCY.md) for why this is load-bearing and when a new container should bypass it. | formula-driven agents (dev-agent, planner-run, predictor-run, gardener-run) |
| `lib/forge-setup.sh` | `setup_forge()` — Forgejo instance provisioning: creates admin user, bot accounts, org, repos (code + ops), configures webhooks, sets repo topics. Extracted from `bin/disinto`. Requires `FORGE_URL`, `FORGE_TOKEN`, `FACTORY_ROOT`. **Password storage (#361)**: after creating each bot account, stores its password in `.env` as `FORGE_<BOT>_PASS` (e.g. `FORGE_PASS`, `FORGE_REVIEW_PASS`, etc.) for use by `forge-push.sh`. | bin/disinto (init) |
| `lib/forge-push.sh` | `push_to_forge()` — pushes a local clone to the Forgejo remote and verifies the push. `_assert_forge_push_globals()` validates required env vars before use. Requires `FORGE_URL`, `FORGE_PASS`, `FACTORY_ROOT`, `PRIMARY_BRANCH`. **Auth**: uses `FORGE_PASS` (bot password) for git HTTP push — Forgejo 11.x rejects API tokens for `git push` (#361). | bin/disinto (init) |
| `lib/ops-setup.sh` | `setup_ops_repo()` — creates ops repo on Forgejo if it doesn't exist, configures bot collaborators, clones/initializes ops repo locally, seeds directory structure (vault, knowledge, evidence, sprints). Evidence subdirectories seeded: engagement/, red-team/, holdout/, evolution/, user-test/. Also seeds sprints/ for architect output. Exports `_ACTUAL_OPS_SLUG`. `migrate_ops_repo(ops_root, [primary_branch])` — idempotent migration helper that seeds missing directories and .gitkeep files on existing ops repos (pre-#407 deployments). | bin/disinto (init) |