Merge pull request 'fix: docs/CLAUDE-AUTH-CONCURRENCY.md: rewrite for shared CLAUDE_CONFIG_DIR approach (#646)' (#659) from fix/issue-646 into main
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
Commit eb8bd48004 (1 changed file with 118 additions and 115 deletions)
# Claude Code OAuth Concurrency Model
## Problem statement
The factory runs multiple concurrent Claude Code processes across
containers. OAuth access tokens are short-lived; refresh tokens rotate
on each use. If two processes POST the same refresh token to Anthropic's
token endpoint simultaneously, only one wins — the other gets
`invalid_grant` and the operator is forced to re-login.
Claude Code already serializes OAuth refreshes internally using
`proper-lockfile` (`src/utils/auth.ts:1485-1491`):

```typescript
release = await lockfile.lock(claudeDir)
```
`proper-lockfile` creates a lockfile via an atomic `mkdir(${path}.lock)`
call — a cross-process primitive that works across any number of
processes on the same filesystem. The problem was never the lock
implementation; it was that our old per-container bind-mount layout
(`~/.claude` mounted but `/home/agent/` container-local) caused each
container to compute a different lockfile path, so the locks never
coordinated.
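The `mkdir`-based primitive is easy to demonstrate in plain shell. The sketch below mimics what `proper-lockfile` does under the hood; the `acquire`/`release` helper names and scratch paths are ours, not the library's:

```shell
# Mimic proper-lockfile's primitive: the lock is an atomically created
# directory named "<target>.lock". Helper names are illustrative only.
CONFIG_DIR=$(mktemp -d)/config
mkdir -p "$CONFIG_DIR"

acquire() { mkdir "${CONFIG_DIR}.lock" 2>/dev/null; }  # succeeds for exactly one caller
release() { rmdir "${CONFIG_DIR}.lock"; }

acquire && echo "first acquire: ok"
acquire || echo "second acquire: blocked"   # lock directory already exists
release
acquire && echo "reacquire after release: ok"
release
```

Because `mkdir` either creates the directory or fails atomically at the filesystem level, the same sequence holds no matter how many processes (or containers sharing a mount) race on the path.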
## The fix: shared `CLAUDE_CONFIG_DIR`
`CLAUDE_CONFIG_DIR` is an officially supported env var in Claude Code
(`src/utils/envUtils.ts`). It controls where Claude resolves its config
directory instead of the default `~/.claude`.
By setting `CLAUDE_CONFIG_DIR` to a path on a shared bind mount, every
container computes the **same** lockfile location. `proper-lockfile`'s
atomic `mkdir(${CLAUDE_CONFIG_DIR}.lock)` then gives free cross-container
serialization — no external wrapper needed.
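The difference between the old and new layouts can be simulated without containers. In this sketch, two "containers" are just private scratch directories plus one shared directory; all paths are stand-ins for the real mounts:

```shell
# Simulate two containers: private overlays plus one shared bind mount.
work=$(mktemp -d)
mkdir -p "$work/container-a" "$work/container-b" "$work/shared/config"

# Old layout: the lock was a *sibling* of ~/.claude, so it landed in each
# container's private overlay. Container B never sees container A's lock.
mkdir "$work/container-a/.claude.lock"
[ -d "$work/container-b/.claude.lock" ] || echo "old layout: lock invisible to B"

# New layout: the lock path derives from CLAUDE_CONFIG_DIR, which sits on
# the shared mount at the same absolute path in every container.
mkdir "$work/shared/config.lock"
[ -d "$work/shared/config.lock" ] && echo "new layout: lock visible to all"
```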
## Current layout

```
Host filesystem:
  /var/lib/disinto/claude-shared/     ← CLAUDE_SHARED_DIR
  └── config/                         ← CLAUDE_CONFIG_DIR
      ├── credentials.json
      ├── settings.json
      └── ...

Inside every container:
  Same absolute path: /var/lib/disinto/claude-shared/config
  Env: CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
```
The shared directory is mounted at the **same absolute path** inside
every container, so `proper-lockfile` resolves an identical lock path
everywhere.
### Where these values are defined

| What | Where |
|------|-------|
| Defaults for `CLAUDE_SHARED_DIR`, `CLAUDE_CONFIG_DIR` | `lib/env.sh:138-140` |
| `.env` documentation | `.env.example:92-99` |
| Container mounts + env passthrough (edge dispatcher) | `docker/edge/dispatcher.sh:446-448` (and analogous blocks for reproduce, triage, verify) |
| Auth detection using `CLAUDE_CONFIG_DIR` | `docker/agents/entrypoint.sh:101-102` |
| Bootstrap / migration during `disinto init` | `lib/claude-config.sh:setup_claude_config_dir()`, `bin/disinto:952-962` |
## Migration for existing dev boxes

For operators upgrading from the old `~/.claude` bind-mount layout,
`disinto init` handles the migration interactively (or with `--yes`).
The manual equivalent is:

```bash
# 1. Stop the factory
disinto down

# 2. Create the shared directory
mkdir -p /var/lib/disinto/claude-shared

# 3. Move existing config
mv "$HOME/.claude" /var/lib/disinto/claude-shared/config

# 4. Create a back-compat symlink so host-side claude still works
ln -sfn /var/lib/disinto/claude-shared/config "$HOME/.claude"

# 5. Export the env var (add to shell rc for persistence)
export CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config

# 6. Start the factory
disinto up
```
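Before running the steps above on a real dev box, the move-plus-symlink sequence can be rehearsed against throwaway paths. Everything below is a stand-in for the real `$HOME` and `/var/lib/disinto`:

```shell
# Rehearse steps 3-5 of the migration in a scratch directory.
root=$(mktemp -d)
mkdir -p "$root/home/.claude" "$root/var/claude-shared"
echo '{}' > "$root/home/.claude/credentials.json"

mv "$root/home/.claude" "$root/var/claude-shared/config"       # step 3
ln -sfn "$root/var/claude-shared/config" "$root/home/.claude"  # step 4
CLAUDE_CONFIG_DIR="$root/var/claude-shared/config"             # step 5

# Credentials are reachable through both the new path and the symlink.
[ -f "$CLAUDE_CONFIG_DIR/credentials.json" ] && echo "config moved"
[ -f "$root/home/.claude/credentials.json" ] && echo "back-compat symlink works"
```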
## Verification

Watch for these analytics events during concurrent agent runs:

| Event | Meaning |
|-------|---------|
| `tengu_oauth_token_refresh_lock_acquiring` | A process is attempting to acquire the refresh lock |
| `tengu_oauth_token_refresh_lock_acquired` | Lock acquired; refresh proceeding |
| `tengu_oauth_token_refresh_lock_retry` | Lock is held by another process; retrying |
| `tengu_oauth_token_refresh_lock_race_resolved` | Contention detected and resolved normally |
| `tengu_oauth_token_refresh_lock_retry_limit_reached` | Lock acquisition failed after all retries |
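If the analytics sink ends up as a plain text or JSONL log, counting the two signals is a one-liner. The log path and line format below are invented for illustration; adjust to wherever your sink actually writes:

```shell
# Hypothetical sketch: count refresh-lock events in a text log.
log=$(mktemp)
cat > "$log" <<'EOF'
tengu_oauth_token_refresh_lock_acquiring
tengu_oauth_token_refresh_lock_acquired
tengu_oauth_token_refresh_lock_race_resolved
EOF

healthy=$(grep -c '_race_resolved' "$log")
bad=$(grep -c '_retry_limit_reached' "$log") || true   # grep exits 1 on zero matches
echo "race_resolved=$healthy retry_limit_reached=$bad"
```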
**Healthy:** `_race_resolved` appearing during contention windows — this
means multiple processes tried to refresh simultaneously and the lock
correctly serialized them.

**Bad:** `_lock_retry_limit_reached` — indicates the lock is stuck or
the shared mount is not working. Verify that `CLAUDE_CONFIG_DIR` resolves
to the same path in all containers and that the filesystem supports
`mkdir` atomicity (any POSIX filesystem does).
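A quick way to probe `mkdir` atomicity is to race several background processes at one lock path; exactly one should win. The sketch uses scratch paths — point it at the shared mount to test the real filesystem:

```shell
# Race eight background mkdir calls at a single lock path.
scratch=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
  ( mkdir "$scratch/probe.lock" 2>/dev/null && : > "$scratch/winner.$i" ) &
done
wait
ls "$scratch" | grep -c '^winner\.'   # a POSIX filesystem yields exactly 1
```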
## The deferred external `flock` wrapper

`lib/agent-sdk.sh:139,144` still wraps every `claude` invocation in an
external `flock` on `${HOME}/.claude/session.lock`:

```bash
local lock_file="${HOME}/.claude/session.lock"
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
```
With the `CLAUDE_CONFIG_DIR` fix in place, this external lock is
**redundant but harmless** — `proper-lockfile` serializes the refresh
internally, and `flock` serializes the entire invocation externally.
The external flock remains as a defense-in-depth measure; removal is
tracked as a separate vision-tier issue.
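The serialization this wrapper provides can be seen with a toy version of the same fd-9 pattern — scratch files only, no `claude` involved:

```shell
# Toy version of the agent-sdk pattern: each worker holds an exclusive
# flock on fd 9 for its whole critical section.
lock=$(mktemp); log=$(mktemp)
for i in 1 2 3; do
  (
    flock -w 600 9 || exit 1
    echo "start $i" >> "$log"
    echo "end $i"   >> "$log"
  ) 9>"$lock" &
done
wait
# Under the lock, start/end pairs never interleave across workers.
awk 'NR%2==1 { if ($1 != "start") exit 1; id = $2 }
     NR%2==0 { if ($1 != "end" || $2 != id) exit 1 }' "$log" && echo "serialized"
```

Drop the `flock` line and the pairs can interleave under load, which is exactly the failure mode the wrapper was guarding against before the shared lockfile existed.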
## Background: why the internal lock failed under the old layout

`src/utils/auth.ts:1491` (read from a leaked TS source — current as of
April 2026) calls:

```typescript
release = await lockfile.lock(claudeDir)
```

with no `lockfilePath` option. `proper-lockfile` defaults to creating
the lock at `<target>.lock` as a **sibling**, so for
`claudeDir = /home/agent/.claude` the lockfile was created at
`/home/agent/.claude.lock`.

`/home/agent/.claude` was bind-mounted from the host, but `/home/agent/`
itself was part of each container's local overlay filesystem. Each
container therefore created its own private `/home/agent/.claude.lock`
and never saw the others' locks; the internal cross-process lock was a
no-op across our containers.

Verified empirically:

```
$ docker exec disinto-agents findmnt /home/agent/.claude
TARGET              SOURCE                                      FSTYPE
/home/agent/.claude /dev/loop15[/...rootfs/home/johba/.claude]  btrfs

$ docker exec disinto-agents findmnt /home/agent
(blank — not a mount, container-local overlay)

$ docker exec disinto-agents touch /home/agent/test-marker
$ docker exec disinto-edge ls /home/agent/test-marker
ls: cannot access '/home/agent/test-marker': No such file or directory
```

(Compare with `src/services/mcp/auth.ts:2097`, which does it correctly
by passing `lockfilePath: join(claudeDir, "mcp-refresh-X.lock")` — that
lockfile lives inside the bind-mounted directory and IS shared. The
OAuth refresh path is an upstream oversight worth filing once we have
bandwidth.)
## See also

- `lib/env.sh:138-140` — `CLAUDE_SHARED_DIR` / `CLAUDE_CONFIG_DIR` defaults
- `lib/claude-config.sh` — migration helper used by `disinto init`
- `lib/agent-sdk.sh:139,144` — the external `flock` wrapper (deferred removal)
- `docker/agents/entrypoint.sh:101-102` — `CLAUDE_CONFIG_DIR` auth detection
- `.env.example:92-99` — operator-facing documentation of the env vars
- Issue #623 — chat container auth strategy