Merge pull request 'fix: docs/CLAUDE-AUTH-CONCURRENCY.md: rewrite for shared CLAUDE_CONFIG_DIR approach (#646)' (#659) from fix/issue-646 into main

Commit eb8bd48004 by dev-bot, 2026-04-10 21:33:06 +00:00


# Claude Code OAuth Concurrency Model
## TL;DR
Every Claude Code process in the factory now shares a single
`CLAUDE_CONFIG_DIR` on a shared bind mount, mounted at the same absolute
path in every container. Claude Code's internal `proper-lockfile`
serialization therefore works across containers, and the external `flock`
wrapper in `lib/agent-sdk.sh` is redundant but harmless; it is kept as
defense-in-depth, with removal tracked separately.
## Problem statement
The factory runs multiple concurrent Claude Code processes across
containers. OAuth access tokens are short-lived; refresh tokens rotate
on each use. If two processes POST the same refresh token to Anthropic's
token endpoint simultaneously, only one wins — the other gets
`invalid_grant` and the operator is forced to re-login.
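The rotation semantics can be sketched with a toy token store (illustrative only; this models the single-use behavior, not Anthropic's actual API): a refresh succeeds only against the current token and rotates it, so a second caller presenting the now-stale token is rejected.

```shell
#!/usr/bin/env sh
# Toy model of refresh-token rotation. The store holds one valid refresh
# token; a successful refresh rotates it, invalidating the old one.
store=$(mktemp)
echo "rt-1" > "$store"

refresh() {  # refresh <presented-token> -> prints new token or "invalid_grant"
  current=$(cat "$store")
  if [ "$1" = "$current" ]; then
    new="rt-$(( ${current#rt-} + 1 ))"
    echo "$new" > "$store"
    echo "$new"
  else
    echo "invalid_grant"
  fi
}

# Two processes read the same token before either refreshes:
tok=$(cat "$store")
first=$(refresh "$tok")    # wins the race, rotates rt-1 -> rt-2
second=$(refresh "$tok")   # presents the stale rt-1: rejected
echo "first=$first second=$second"
rm -f "$store"
```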
Claude Code already serializes OAuth refreshes internally using
`proper-lockfile` (`src/utils/auth.ts:1485-1491`):
```typescript
release = await lockfile.lock(claudeDir)
```
`proper-lockfile` creates a lockfile via an atomic `mkdir(${path}.lock)`
call — a cross-process primitive that works across any number of
processes on the same filesystem. The problem was never the lock
implementation; it was that our old per-container bind-mount layout
(`~/.claude` mounted but `/home/agent/` container-local) caused each
container to compute a different lockfile path, so the locks never
coordinated.
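The primitive itself is easy to see in shell: `mkdir` of the same path is atomic, so of two claimants exactly one wins; but two *different* lock paths, as in the old layout, never contend at all. A minimal sketch:

```shell
#!/usr/bin/env sh
# mkdir(2) is atomic: of two claimants for the SAME path, exactly one wins.
lock="/tmp/demo-claude-$$.lock"
mkdir "$lock" 2>/dev/null && a=acquired || a=busy
mkdir "$lock" 2>/dev/null && b=acquired || b=busy
echo "same path:      A=$a B=$b"        # only the first claimant wins

# The old per-container layout computed DIFFERENT lock paths, so both
# claimants "win" and nothing is actually serialized:
mkdir "/tmp/contA-$$.lock" 2>/dev/null && c=acquired
mkdir "/tmp/contB-$$.lock" 2>/dev/null && d=acquired
echo "different path: A=$c B=$d"        # no coordination at all
rmdir "$lock" "/tmp/contA-$$.lock" "/tmp/contB-$$.lock"
```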
## The fix: shared `CLAUDE_CONFIG_DIR`
`CLAUDE_CONFIG_DIR` is an officially supported env var in Claude Code
(`src/utils/envUtils.ts`). It controls where Claude resolves its config
directory instead of the default `~/.claude`.
By setting `CLAUDE_CONFIG_DIR` to a path on a shared bind mount, every
container computes the **same** lockfile location. `proper-lockfile`'s
atomic `mkdir(${CLAUDE_CONFIG_DIR}.lock)` then gives cross-container
serialization for free; no external wrapper is needed.
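Concretely, whether two containers coordinate is just a question of whether they compute the same lock-path string on the same filesystem. A sketch of the derivation (the helper below is hypothetical; the `.lock` suffix mirrors proper-lockfile's default sibling placement):

```shell
#!/usr/bin/env sh
# Hypothetical helper mirroring proper-lockfile's default rule:
# the lock for <target> lives at the sibling path "<target>.lock".
lock_path() { printf '%s.lock\n' "$1"; }

# Every container exports the same CLAUDE_CONFIG_DIR on the same mount,
# so every container derives the same lock location:
CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
p=$(lock_path "$CLAUDE_CONFIG_DIR")
echo "$p"   # /var/lib/disinto/claude-shared/config.lock
```

Note that `config.lock` is a sibling of `config/` *inside* `claude-shared/`, which is itself the shared mount, so the lock is visible to every container.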
## Current layout
```
Host filesystem:
/var/lib/disinto/claude-shared/ ← CLAUDE_SHARED_DIR
└── config/ ← CLAUDE_CONFIG_DIR
├── credentials.json
├── settings.json
└── ...
Inside every container:
Same absolute path: /var/lib/disinto/claude-shared/config
Env: CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
```
The shared directory is mounted at the **same absolute path** inside
every container, so `proper-lockfile` resolves an identical lock path
everywhere.
### Where these values are defined
| What | Where |
|------|-------|
| Defaults for `CLAUDE_SHARED_DIR`, `CLAUDE_CONFIG_DIR` | `lib/env.sh:138-140` |
| `.env` documentation | `.env.example:92-99` |
| Container mounts + env passthrough (edge dispatcher) | `docker/edge/dispatcher.sh:446-448` (and analogous blocks for reproduce, triage, verify) |
| Auth detection using `CLAUDE_CONFIG_DIR` | `docker/agents/entrypoint.sh:101-102` |
| Bootstrap / migration during `disinto init` | `lib/claude-config.sh:setup_claude_config_dir()`, `bin/disinto:952-962` |
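For orientation, the defaults presumably follow the usual parameter-expansion pattern (assumed shape based on the documented values; `lib/env.sh:138-140` is authoritative):

```shell
#!/usr/bin/env sh
# Assumed shape of the env defaults; see lib/env.sh for the real code.
unset CLAUDE_SHARED_DIR CLAUDE_CONFIG_DIR   # clean slate for the demo
CLAUDE_SHARED_DIR="${CLAUDE_SHARED_DIR:-/var/lib/disinto/claude-shared}"
CLAUDE_CONFIG_DIR="${CLAUDE_CONFIG_DIR:-${CLAUDE_SHARED_DIR}/config}"
export CLAUDE_SHARED_DIR CLAUDE_CONFIG_DIR
echo "$CLAUDE_CONFIG_DIR"   # /var/lib/disinto/claude-shared/config
```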
## Migration for existing dev boxes
For operators upgrading from the old `~/.claude` bind-mount layout,
`disinto init` handles the migration interactively (or with `--yes`).
The manual equivalent is:
```bash
# 1. Stop the factory
disinto down
# 2. Create the shared directory
mkdir -p /var/lib/disinto/claude-shared
# 3. Move existing config
mv "$HOME/.claude" /var/lib/disinto/claude-shared/config
# 4. Create a back-compat symlink so host-side claude still works
ln -sfn /var/lib/disinto/claude-shared/config "$HOME/.claude"
# 5. Export the env var (add to shell rc for persistence)
export CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
# 6. Start the factory
disinto up
```
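After migrating, a quick sanity check confirms the symlink and the shared path converge (sketch; the setup block fakes the layout in a throwaway `$HOME` so it is runnable anywhere, drop it on a real box):

```shell
#!/usr/bin/env sh
# Post-migration sanity check, exercised against a throwaway $HOME.
HOME=$(mktemp -d)
CLAUDE_SHARED_DIR="$HOME/claude-shared"   # stand-in for /var/lib/disinto/claude-shared
mkdir -p "$CLAUDE_SHARED_DIR/config"
ln -sfn "$CLAUDE_SHARED_DIR/config" "$HOME/.claude"
export CLAUDE_CONFIG_DIR="$CLAUDE_SHARED_DIR/config"

# 1. The back-compat symlink must point at the shared config dir.
[ "$(readlink "$HOME/.claude")" = "$CLAUDE_CONFIG_DIR" ] && link=ok
# 2. Writes through either name must land on the same file.
touch "$HOME/.claude/marker"
[ -f "$CLAUDE_CONFIG_DIR/marker" ] && converge=ok
echo "symlink=$link shared-path=$converge"
```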
## Verification
Watch for these analytics events during concurrent agent runs:
| Event | Meaning |
|-------|---------|
| `tengu_oauth_token_refresh_lock_acquiring` | A process is attempting to acquire the refresh lock |
| `tengu_oauth_token_refresh_lock_acquired` | Lock acquired; refresh proceeding |
| `tengu_oauth_token_refresh_lock_retry` | Lock is held by another process; retrying |
| `tengu_oauth_token_refresh_lock_race_resolved` | Contention detected and resolved normally |
| `tengu_oauth_token_refresh_lock_retry_limit_reached` | Lock acquisition failed after all retries |
**Healthy:** `_race_resolved` appearing during contention windows — this
means multiple processes tried to refresh simultaneously and the lock
correctly serialized them.
**Bad:** `_lock_retry_limit_reached` — indicates the lock is stuck or
the shared mount is not working. Verify that `CLAUDE_CONFIG_DIR` resolves
to the same path in all containers and that the filesystem supports
`mkdir` atomicity (any POSIX filesystem does).
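One way to watch for the bad case is to count lock events in the agent logs. The sketch below runs against sample lines standing in for real telemetry; where your logs actually live depends on your setup and is not specified here:

```shell
#!/usr/bin/env sh
# Count lock-related events in a log stream. The heredoc stands in for
# real agent logs; substitute your actual log source.
log=$(mktemp)
cat > "$log" <<'EOF'
tengu_oauth_token_refresh_lock_acquiring
tengu_oauth_token_refresh_lock_acquired
tengu_oauth_token_refresh_lock_retry
tengu_oauth_token_refresh_lock_race_resolved
EOF

resolved=$(grep -c '_race_resolved' "$log")
stuck=$(grep -c '_retry_limit_reached' "$log" || true)
echo "resolved=$resolved stuck=$stuck"
# Any nonzero "stuck" count warrants checking the shared mount.
rm -f "$log"
```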
## The deferred external `flock` wrapper
Historically, the refresh races manifested as agents losing auth and
frequent forced re-logins, which is why `lib/agent-sdk.sh` grew an
external lock in the first place. `lib/agent-sdk.sh:139,144` still wraps
every `claude` invocation in an external `flock` on
`${HOME}/.claude/session.lock`:
```bash
local lock_file="${HOME}/.claude/session.lock"
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
```
This serializes every `claude` invocation across every process that
shares `${HOME}/.claude/`.
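The serialization property of the wrapper is easy to demonstrate standalone: concurrent writers that take the same `flock` never interleave their critical sections. A minimal sketch of the same fd-9 pattern:

```shell
#!/usr/bin/env sh
# Two concurrent workers take the same flock before writing; the lock
# forces each worker's two-line record to stay contiguous.
lock=$(mktemp)
out=$(mktemp)
for w in A B; do
  (
    flock 9                       # same fd-9 pattern as agent_run
    echo "$w begin" >> "$out"
    echo "$w end"   >> "$out"
  ) 9>"$lock" &
done
wait
# paste joins line pairs; fields 1 and 3 must name the same worker.
verdict=$(paste - - < "$out" \
  | awk '$1 != $3 { bad=1 } END { print bad ? "interleaved" : "serialized" }')
echo "$verdict"
rm -f "$lock" "$out"
```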
With the `CLAUDE_CONFIG_DIR` fix in place, this external lock is
**redundant but harmless** — `proper-lockfile` serializes the refresh
internally, and `flock` serializes the entire invocation externally.
The external flock remains as a defense-in-depth measure; removal is
tracked as a separate vision-tier issue.
## See also
- `lib/env.sh:138-140``CLAUDE_SHARED_DIR` / `CLAUDE_CONFIG_DIR` defaults
- `lib/claude-config.sh` — migration helper used by `disinto init`
- `lib/agent-sdk.sh:139,144` — the external `flock` wrapper (deferred removal)
- `docker/agents/entrypoint.sh:101-102``CLAUDE_CONFIG_DIR` auth detection
- `.env.example:92-99` — operator-facing documentation of the env vars
- Issue #623 — chat container auth strategy