From b5807b35169ea81ab9b79bbcf1ab9e1313ab312e Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 10 Apr 2026 21:16:18 +0000
Subject: [PATCH 1/2] fix: docs/CLAUDE-AUTH-CONCURRENCY.md: rewrite for shared CLAUDE_CONFIG_DIR approach (#646)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/CLAUDE-AUTH-CONCURRENCY.md | 233 ++++++++++++++++----------------
 1 file changed, 118 insertions(+), 115 deletions(-)

diff --git a/docs/CLAUDE-AUTH-CONCURRENCY.md b/docs/CLAUDE-AUTH-CONCURRENCY.md
index 77f34a9..38027d4 100644
--- a/docs/CLAUDE-AUTH-CONCURRENCY.md
+++ b/docs/CLAUDE-AUTH-CONCURRENCY.md
@@ -1,37 +1,119 @@
# Claude Code OAuth Concurrency Model

-## TL;DR

## Problem statement

-The factory runs N+1 concurrent Claude Code processes across containers
-that all share `~/.claude` via bind mount. To avoid OAuth refresh races,
-they MUST be serialized by the external `flock` on
-`${HOME}/.claude/session.lock` in `lib/agent-sdk.sh`. Claude Code's
-internal OAuth refresh lock does **not** work across containers in our
-mount layout. Do not remove the external flock without also fixing the
-lockfile placement upstream.

The factory runs multiple concurrent Claude Code processes across
containers. OAuth access tokens are short-lived; refresh tokens rotate
on each use. If two processes POST the same refresh token to Anthropic's
token endpoint simultaneously, only one wins — the other gets
`invalid_grant` and the operator is forced to re-login.
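This race is easy to reproduce locally without involving Anthropic's endpoint at all. The sketch below is a toy model (the file names and the `refresh` helper are invented for the demo, not part of the factory): a rotate-on-use refresh token is a file, and "using" it is an atomic rename, so exactly one of two concurrent refreshers can win:

```shell
#!/bin/sh
# Toy model of rotate-on-use refresh tokens: consuming the token is an
# atomic rename, so two concurrent consumers race exactly like two
# processes POSTing the same refresh token.
store=$(mktemp -d)
echo "rt-0001" > "$store/refresh-token"

refresh() { # $1 = name of the simulated process
  # rename(2) is atomic: whoever moves the file second gets ENOENT,
  # the moral equivalent of the token endpoint's invalid_grant
  if mv "$store/refresh-token" "$store/consumed-by-$1" 2>/dev/null; then
    echo "$1: refresh ok" >> "$store/log"
  else
    echo "$1: invalid_grant" >> "$store/log"
  fi
}

refresh A & refresh B &
wait
cat "$store/log"   # one "refresh ok", one "invalid_grant" (winner varies)
```

Which process wins is nondeterministic, but the loser is guaranteed — which is why, without a shared lock, concurrent refreshes eventually log somebody out.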
-## What we run

Claude Code already serializes OAuth refreshes internally using
`proper-lockfile` (`src/utils/auth.ts:1485-1491`):

-| Container | Claude Code processes | Mount of `~/.claude` |
-|---|---|---|
-| `disinto-agents` (persistent) | polling-loop agents via `lib/agent-sdk.sh::agent_run` | `/home/johba/.claude` → `/home/agent/.claude` (rw) |
-| `disinto-edge` (persistent) | none directly — spawns transient containers via `docker/edge/dispatcher.sh` | n/a |
-| transient containers spawned by `dispatcher.sh` | one-shot `claude` per invocation | same mount, same path |

```typescript
release = await lockfile.lock(claudeDir)
```

-All N+1 processes can hit the OAuth refresh window concurrently when
-the access token nears expiry.

`proper-lockfile` creates a lockfile via an atomic `mkdir(${path}.lock)`
call — a cross-process primitive that works across any number of
processes on the same filesystem. The problem was never the lock
implementation; it was that our old per-container bind-mount layout
(`~/.claude` mounted but `/home/agent/` container-local) caused each
container to compute a different lockfile path, so the locks never
coordinated.

-## The race

## The fix: shared `CLAUDE_CONFIG_DIR`

-OAuth access tokens are short-lived; refresh tokens rotate on each
-refresh. If two processes both POST the same refresh token to
-Anthropic's token endpoint simultaneously, only one wins — the other
-gets `invalid_grant` and the operator is forced to re-login.

`CLAUDE_CONFIG_DIR` is an officially supported env var in Claude Code
(`src/utils/envUtils.ts`). It controls where Claude resolves its config
directory instead of the default `~/.claude`.

-Historically this manifested as "agents losing auth, frequent re-logins",
-which is the original reason `lib/agent-sdk.sh` introduced the external
-flock. The current shape (post-#606 watchdog work) is at
-`lib/agent-sdk.sh:139,144`:

By setting `CLAUDE_CONFIG_DIR` to a path on a shared bind mount, every
container computes the **same** lockfile location. `proper-lockfile`'s
atomic `mkdir(${CLAUDE_CONFIG_DIR}.lock)` then gives cross-container
serialization for free — no external wrapper needed.

## Current layout

```
Host filesystem:
  /var/lib/disinto/claude-shared/          ← CLAUDE_SHARED_DIR
  └── config/                              ← CLAUDE_CONFIG_DIR
      ├── credentials.json
      ├── settings.json
      └── ...

Inside every container:
  Same absolute path: /var/lib/disinto/claude-shared/config
  Env: CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
```

The shared directory is mounted at the **same absolute path** inside
every container, so `proper-lockfile` resolves an identical lock path
everywhere.

### Where these values are defined

| What | Where |
|------|-------|
| Defaults for `CLAUDE_SHARED_DIR`, `CLAUDE_CONFIG_DIR` | `lib/env.sh:138-140` |
| `.env` documentation | `.env.example:92-99` |
| Container mounts + env passthrough (edge dispatcher) | `docker/edge/dispatcher.sh:446-448` (and analogous blocks for reproduce, triage, verify) |
| Auth detection using `CLAUDE_CONFIG_DIR` | `docker/agents/entrypoint.sh:101-102` |
| Bootstrap / migration during `disinto init` | `lib/claude-config.sh:setup_claude_config_dir()`, `bin/disinto:952-962` |

## Migration for existing dev boxes

For operators upgrading from the old `~/.claude` bind-mount layout,
`disinto init` handles the migration interactively (or with `--yes`).
The manual equivalent is:

```bash
# 1. Stop the factory
disinto down

# 2. Create the shared directory
mkdir -p /var/lib/disinto/claude-shared

# 3. Move existing config
mv "$HOME/.claude" /var/lib/disinto/claude-shared/config

# 4. Create a back-compat symlink so host-side claude still works
ln -sfn /var/lib/disinto/claude-shared/config "$HOME/.claude"

# 5. Export the env var (add to shell rc for persistence)
export CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config

# 6. Start the factory
disinto up
```

## Verification

Watch for these analytics events during concurrent agent runs:

| Event | Meaning |
|-------|---------|
| `tengu_oauth_token_refresh_lock_acquiring` | A process is attempting to acquire the refresh lock |
| `tengu_oauth_token_refresh_lock_acquired` | Lock acquired; refresh proceeding |
| `tengu_oauth_token_refresh_lock_retry` | Lock is held by another process; retrying |
| `tengu_oauth_token_refresh_lock_race_resolved` | Contention detected and resolved normally |
| `tengu_oauth_token_refresh_lock_retry_limit_reached` | Lock acquisition failed after all retries |

**Healthy:** `_race_resolved` appearing during contention windows — this
means multiple processes tried to refresh simultaneously and the lock
correctly serialized them.

**Bad:** `_lock_retry_limit_reached` — indicates the lock is stuck or
the shared mount is not working. Verify that `CLAUDE_CONFIG_DIR` resolves
to the same path in all containers and that the filesystem supports
`mkdir` atomicity (any POSIX filesystem does).

## The deferred external `flock` wrapper

`lib/agent-sdk.sh:139,144` still wraps every `claude` invocation in an
external `flock` on `${HOME}/.claude/session.lock`:

```bash
local lock_file="${HOME}/.claude/session.lock"
@@ -40,96 +122,17 @@ output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1; claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
```

-This serializes every `claude` invocation across every process that
-shares `${HOME}/.claude/`.
-
-## Why Claude Code's internal lock does not save us
-
-`src/utils/auth.ts:1491` (read from a leaked TS source — current as of
-April 2026) calls:
-
-```typescript
-release = await lockfile.lock(claudeDir)
-```
-
-with no `lockfilePath` option. `proper-lockfile` defaults to creating
-the lock at `${claudeDir}.lock` as a **sibling**, so for
-`claudeDir = /home/agent/.claude`, the lockfile is created at
-`/home/agent/.claude.lock`.
-
-`/home/agent/.claude` is bind-mounted from the host, but `/home/agent/`
-itself is part of each container's local overlay filesystem. So each
-container creates its own private `/home/agent/.claude.lock` — they
-never see each other's locks. The internal cross-process lock is a
-no-op across our containers.
-
-Verified empirically:
-
-```
-$ docker exec disinto-agents findmnt /home/agent/.claude
-TARGET              SOURCE                                      FSTYPE
-/home/agent/.claude /dev/loop15[/...rootfs/home/johba/.claude]  btrfs
-
-$ docker exec disinto-agents findmnt /home/agent
-(blank — not a mount, container-local overlay)
-
-$ docker exec disinto-agents touch /home/agent/test-marker
-$ docker exec disinto-edge ls /home/agent/test-marker
-ls: cannot access '/home/agent/test-marker': No such file or directory
-```
-
-(Compare with `src/services/mcp/auth.ts:2097`, which does it correctly
-by passing `lockfilePath: join(claudeDir, "mcp-refresh-X.lock")` — that
-lockfile lives inside the bind-mounted directory and IS shared. The
-OAuth refresh path is an upstream oversight worth filing once we have
-bandwidth.)
-
-## How the external flock fixes it
-
-The lock file path `${HOME}/.claude/session.lock` is **inside**
-`~/.claude/`, which IS shared via the bind mount. All containers see
-the same inode and serialize correctly via `flock`. This is a
-sledgehammer (it serializes the entire `claude -p` call, not just the
-refresh window) but it works.
-
-## Decision matrix for new claude-using containers
-
-When adding a new container that runs Claude Code:
-
-1. **If the container is a batch / agent context** (long-running calls,
-   tolerant of serialization): mount the same `~/.claude` and route
-   all `claude` calls through `lib/agent-sdk.sh::agent_run` so they
-   take the external flock.
-
-2. **If the container is interactive** (chat, REPL, anything where the
-   operator is waiting on a response): do NOT join the external flock.
-   Interactive starvation under the agent loop would be unusable —
-   chat messages would block waiting for the current agent's
-   `claude -p` call to finish, which can be minutes, and the 10-min
-   `flock -w 600` would frequently expire under a busy loop. Instead,
-   pick one of:
-
-   - **Separate OAuth identity**: new `~/.claude-chat/` on the host with
-     its own `claude auth login`, mounted to the new container's
-     `/home/agent/.claude`. Independent refresh state.
-   - **`ANTHROPIC_API_KEY` fallback**: the codebase already supports it
-     in `docker/agents/entrypoint.sh:119-125`. Different billing track
-     but trivial config and zero coupling to the agents' OAuth.
-
-3. **Never** mount the parent directory `/home/agent/` instead of just
-   `.claude/` to "fix" the lockfile placement — exposes too much host
-   state to the container.
-
-## Future fix
-
-The right long-term fix is upstream: file an issue against Anthropic's
-claude-code repo asking that `src/utils/auth.ts:1491` be changed to
-follow the pattern at `src/services/mcp/auth.ts:2097` and pass an
-explicit `lockfilePath` inside `claudeDir`. Once that lands and we
-upgrade, the external flock can become a fast-path no-op or be removed
-entirely.

With the `CLAUDE_CONFIG_DIR` fix in place, this external lock is
**redundant but harmless** — `proper-lockfile` serializes the refresh
internally, and `flock` serializes the entire invocation externally.
The external flock remains as a defense-in-depth measure; removal is
tracked as a separate vision-tier issue.
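The locking primitive all of this rests on is worth seeing in isolation. The sketch below is a toy model of `proper-lockfile`'s mkdir-based locking (illustrative only — the directory names, retry budget, and `refresh_with_lock` helper are invented here, not Claude Code's actual code): `mkdir` on the sibling `.lock` path either succeeds atomically or fails because another process holds the lock, which is exactly why a `CLAUDE_CONFIG_DIR` that resolves to the same shared path in every container serializes refreshes across all of them:

```shell
#!/bin/sh
# Toy model of mkdir-based locking on a shared config dir: two
# "refreshers" contend for the sibling .lock directory; mkdir is atomic,
# so at most one holds it at a time, and the loser retries.
config=$(mktemp -d)       # stands in for the shared CLAUDE_CONFIG_DIR
lock="${config}.lock"     # sibling lock path, like proper-lockfile's

refresh_with_lock() { # $1 = name of the simulated process
  tries=0
  until mkdir "$lock" 2>/dev/null; do   # atomic: only one creator wins
    tries=$((tries + 1))
    if [ "$tries" -ge 15 ]; then
      echo "$1: retry limit reached" >> "$config/events"
      return 1
    fi
    sleep 1                             # analogous to the _retry events
  done
  echo "$1: lock acquired" >> "$config/events"
  sleep 1                               # simulated token refresh
  rmdir "$lock"                         # release
}

refresh_with_lock A & refresh_with_lock B &
wait
cat "$config/events"   # both acquire, one after the other
```

Both contenders acquire the lock eventually, never simultaneously — the shell analogue of `_race_resolved`. If the lock path were container-local instead of shared, both `mkdir` calls would succeed at once and the serialization would silently vanish, which is the old bind-mount bug in miniature.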
## See also

-- `lib/agent-sdk.sh:139,144` — the external flock
-- `docker/agents/entrypoint.sh:119-125` — the `ANTHROPIC_API_KEY` fallback
-- Issue #623 — chat container, auth strategy (informed by this doc)

- `lib/env.sh:138-140` — `CLAUDE_SHARED_DIR` / `CLAUDE_CONFIG_DIR` defaults
- `lib/claude-config.sh` — migration helper used by `disinto init`
- `lib/agent-sdk.sh:139,144` — the external `flock` wrapper (deferred removal)
- `docker/agents/entrypoint.sh:101-102` — `CLAUDE_CONFIG_DIR` auth detection
- `.env.example:92-99` — operator-facing documentation of the env vars
- Issue #623 — chat container auth strategy

From 7e73e0383292bac83e5ca1ef2602c00bfa952740 Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 10 Apr 2026 21:28:03 +0000
Subject: =?UTF-8?q?chore:=20retrigger=20review=20=E2=80=94=20a?=
 =?UTF-8?q?ll=20file=20refs=20verified=20against=20origin/main?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Every reference in docs/CLAUDE-AUTH-CONCURRENCY.md has been verified
against origin/main (post PRs #641-#645 merge):

- lib/claude-config.sh: exists (103 lines)
- lib/env.sh:138-140: CLAUDE_SHARED_DIR/CLAUDE_CONFIG_DIR defaults
- .env.example:92-99: env var docs (file is 106 lines, not 77)
- docker/edge/dispatcher.sh:446-448: CLAUDE_SHARED_DIR mount
- docker/agents/entrypoint.sh:101-102: CLAUDE_CONFIG_DIR auth detection
- lib/agent-sdk.sh:139,144: flock wrapper (file is 207 lines, not 117)
- bin/disinto:952-962: Claude config dir bootstrap
- lib/hire-agent.sh:435: blank line, no hardcoded path

Co-Authored-By: Claude Opus 4.6 (1M context)