Claude Code OAuth Concurrency Model
Problem statement
The factory runs multiple concurrent Claude Code processes across
containers. OAuth access tokens are short-lived; refresh tokens rotate
on each use. If two processes POST the same refresh token to Anthropic's
token endpoint simultaneously, only one wins — the other gets
invalid_grant and the operator is forced to re-login.
Claude Code already serializes OAuth refreshes internally using
proper-lockfile (src/utils/auth.ts:1485-1491):
release = await lockfile.lock(claudeDir)
proper-lockfile creates a lockfile via an atomic mkdir(${path}.lock)
call — a cross-process primitive that works across any number of
processes on the same filesystem. The problem was never the lock
implementation; it was that our old per-container bind-mount layout
(~/.claude mounted but /home/agent/ container-local) caused each
container to compute a different lockfile path, so the locks never
coordinated.
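The mkdir-based primitive described above can be illustrated with a minimal shell sketch. This is not proper-lockfile's implementation (which also deals with stale locks and retries); the helper names try_lock and unlock are hypothetical. The only point is that mkdir on an existing directory fails, so exactly one caller gets the lock for a given path.

```shell
#!/usr/bin/env bash
# Illustrative only: mkdir either creates the directory or fails.
# try_lock / unlock are hypothetical helper names, not part of any tool.

try_lock() {
  # Exit status 0 only for the caller that actually creates "<path>.lock".
  mkdir "$1.lock" 2>/dev/null
}

unlock() {
  rmdir "$1.lock"
}

demo_dir=$(mktemp -d)
target="$demo_dir/config"

# First attempt creates the lock directory.
if try_lock "$target"; then first=acquired; else first=busy; fi
# Second attempt on the same path fails while the lock is held.
if try_lock "$target"; then second=acquired; else second=busy; fi
# After release, the lock can be taken again.
unlock "$target"
if try_lock "$target"; then third=acquired; else third=busy; fi
unlock "$target"
rm -rf "$demo_dir"

echo "first=$first second=$second third=$third"
```

Two callers that derive different values of target never contend with each other, which is the failure mode the old per-container layout produced.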
The fix: shared CLAUDE_CONFIG_DIR
CLAUDE_CONFIG_DIR is an officially supported env var in Claude Code
(src/utils/envUtils.ts). It overrides the default ~/.claude as the
directory where Claude Code resolves its config.
By setting CLAUDE_CONFIG_DIR to a path on a shared bind mount, every
container computes the same lockfile location. proper-lockfile's
atomic mkdir(${CLAUDE_CONFIG_DIR}.lock) then gives free cross-container
serialization — no external wrapper needed.
Current layout
Host filesystem:
/var/lib/disinto/claude-shared/ ← CLAUDE_SHARED_DIR
└── config/ ← CLAUDE_CONFIG_DIR
├── credentials.json
├── settings.json
└── ...
Inside every container:
Same absolute path: /var/lib/disinto/claude-shared/config
Env: CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
The shared directory is mounted at the same absolute path inside
every container, so proper-lockfile resolves an identical lock path
everywhere.
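As a rough sketch of the layout above, the container flags could be assembled as follows. The helper name claude_mount_args is hypothetical and this is not the dispatcher's actual code; the fallback values simply mirror the paths shown in the layout and may differ from what lib/env.sh sets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the docker run flags for the shared config mount,
# one argument per line. Fallback values are assumptions taken from the layout.

CLAUDE_SHARED_DIR="${CLAUDE_SHARED_DIR:-/var/lib/disinto/claude-shared}"
CLAUDE_CONFIG_DIR="${CLAUDE_CONFIG_DIR:-${CLAUDE_SHARED_DIR}/config}"

claude_mount_args() {
  # Bind-mount the shared directory at the SAME absolute path inside the
  # container, and pass CLAUDE_CONFIG_DIR through unchanged.
  printf '%s\n' \
    -v "${CLAUDE_SHARED_DIR}:${CLAUDE_SHARED_DIR}" \
    -e "CLAUDE_CONFIG_DIR=${CLAUDE_CONFIG_DIR}"
}

claude_mount_args
```

The mount covers the parent directory (CLAUDE_SHARED_DIR) rather than only config/, because the lock directory is created as a sibling of CLAUDE_CONFIG_DIR and must be visible to every container as well.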
Where these values are defined
| What | Where |
|---|---|
| Defaults for CLAUDE_SHARED_DIR, CLAUDE_CONFIG_DIR | lib/env.sh:138-140 |
| .env documentation | .env.example:92-99 |
| Container mounts + env passthrough (edge dispatcher) | docker/edge/dispatcher.sh:446-448 (and analogous blocks for reproduce, triage, verify) |
| Auth detection using CLAUDE_CONFIG_DIR | docker/agents/entrypoint.sh:101-102 |
| Bootstrap / migration during disinto init | lib/claude-config.sh:setup_claude_config_dir(), bin/disinto:952-962 |
Migration for existing dev boxes
For operators upgrading from the old ~/.claude bind-mount layout,
disinto init handles the migration interactively (or with --yes).
The manual equivalent is:
# 1. Stop the factory
disinto down
# 2. Create the shared directory
mkdir -p /var/lib/disinto/claude-shared
# 3. Move existing config
mv "$HOME/.claude" /var/lib/disinto/claude-shared/config
# 4. Create a back-compat symlink so host-side claude still works
ln -sfn /var/lib/disinto/claude-shared/config "$HOME/.claude"
# 5. Export the env var (add to shell rc for persistence)
export CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
# 6. Start the factory
disinto up
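After the manual steps, a quick sanity check along these lines can confirm the result. The function name check_claude_migration is hypothetical and is not part of disinto; it only verifies that the legacy path is a symlink pointing at the shared config directory.

```shell
#!/usr/bin/env bash
# Hypothetical post-migration check (not part of disinto).
# Usage: check_claude_migration "$HOME/.claude" /var/lib/disinto/claude-shared/config

check_claude_migration() {
  legacy="$1"   # old location, expected to be a symlink now
  shared="$2"   # new shared config directory

  [ -d "$shared" ] || { echo "missing: $shared"; return 1; }
  [ -L "$legacy" ] || { echo "not a symlink: $legacy"; return 1; }
  # readlink prints the symlink's stored target without resolving further.
  [ "$(readlink "$legacy")" = "$shared" ] || { echo "symlink target mismatch"; return 1; }
  echo "ok"
}
```

The function prints ok and returns 0 only when all three conditions hold; otherwise it prints which condition failed and returns 1.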
Verification
Watch for these analytics events during concurrent agent runs:
| Event | Meaning |
|---|---|
| tengu_oauth_token_refresh_lock_acquiring | A process is attempting to acquire the refresh lock |
| tengu_oauth_token_refresh_lock_acquired | Lock acquired; refresh proceeding |
| tengu_oauth_token_refresh_lock_retry | Lock is held by another process; retrying |
| tengu_oauth_token_refresh_lock_race_resolved | Contention detected and resolved normally |
| tengu_oauth_token_refresh_lock_retry_limit_reached | Lock acquisition failed after all retries |
Healthy: _race_resolved appearing during contention windows — this
means multiple processes tried to refresh simultaneously and the lock
correctly serialized them.
Bad: _lock_retry_limit_reached — indicates the lock is stuck or
the shared mount is not working. Verify that CLAUDE_CONFIG_DIR resolves
to the same path in all containers and that the filesystem supports
mkdir atomicity (any POSIX filesystem does).
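When _lock_retry_limit_reached shows up, a diagnostic like the one below could be run inside each container. It is a sketch under assumptions: check_shared_config_dir is a hypothetical name, and the probe uses its own sibling directory so it does not touch the real lock path.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: checks that the config dir exists and that a sibling
# directory can be created exclusively next to it (where the lock would live).

check_shared_config_dir() {
  dir="${1:-${CLAUDE_CONFIG_DIR:-}}"
  [ -n "$dir" ] || { echo "CLAUDE_CONFIG_DIR is unset"; return 1; }
  [ -d "$dir" ] || { echo "not a directory: $dir"; return 1; }

  probe="${dir}.probe.$$"   # sibling of the config dir, unique per process
  mkdir "$probe" 2>/dev/null || { echo "cannot create sibling of $dir"; return 1; }
  # A second mkdir on the same path must fail while the first still exists.
  if mkdir "$probe" 2>/dev/null; then
    rmdir "$probe"
    echo "mkdir is not exclusive"
    return 1
  fi
  rmdir "$probe"
  echo "ok: $dir"
}
```

Running it in every container and comparing the printed paths is a quick way to spot a container that resolves a different directory or cannot write next to it.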
The deferred external flock wrapper
lib/agent-sdk.sh:139,144 still wraps every claude invocation in an
external flock on ${HOME}/.claude/session.lock:
local lock_file="${HOME}/.claude/session.lock"
...
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
With the CLAUDE_CONFIG_DIR fix in place, this external lock is
redundant but harmless — proper-lockfile serializes the refresh
internally, and flock serializes the entire invocation externally.
The external flock remains as a defense-in-depth measure; removal is
tracked as a separate vision-tier issue.
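The effect of the external wrapper can be reproduced in isolation with the sketch below. It assumes flock(1) from util-linux is installed; run_serialized, the temporary files, and the sleep standing in for the claude invocation are all illustrative and not taken from lib/agent-sdk.sh.

```shell
#!/usr/bin/env bash
# Illustrative only: two background jobs contend for one flock-protected section.

lock_file=$(mktemp)
log_file=$(mktemp)

run_serialized() {
  (
    # fd 9 is opened on the lock file for this subshell; wait up to 10 s for it.
    flock -w 10 9 || exit 1
    echo "start $1" >>"$log_file"
    sleep 0.2                       # stand-in for the real work
    echo "end $1" >>"$log_file"
  ) 9>"$lock_file"
}

run_serialized a &
run_serialized b &
wait

cat "$log_file"
```

Whichever job wins the lock first, its start and end lines appear together before the other job's lines, which shows that the whole invocation is serialized rather than only the token refresh.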
See also
- lib/env.sh:138-140 - CLAUDE_SHARED_DIR / CLAUDE_CONFIG_DIR defaults
- lib/claude-config.sh - migration helper used by disinto init
- lib/agent-sdk.sh:139,144 - the external flock wrapper (deferred removal)
- docker/agents/entrypoint.sh:101-102 - CLAUDE_CONFIG_DIR auth detection
- .env.example:92-99 - operator-facing documentation of the env vars
- Issue #623 - chat container auth strategy