disinto/docs/CLAUDE-AUTH-CONCURRENCY.md
fix: docs/CLAUDE-AUTH-CONCURRENCY.md: rewrite for shared CLAUDE_CONFIG_DIR approach (#646)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:16:18 +00:00

Claude Code OAuth Concurrency Model

Problem statement

The factory runs multiple concurrent Claude Code processes across containers. OAuth access tokens are short-lived; refresh tokens rotate on each use. If two processes POST the same refresh token to Anthropic's token endpoint simultaneously, only one wins — the other gets invalid_grant and the operator is forced to re-login.

Claude Code already serializes OAuth refreshes internally using proper-lockfile (src/utils/auth.ts:1485-1491):

release = await lockfile.lock(claudeDir)

proper-lockfile creates a lockfile via an atomic mkdir(${path}.lock) call — a cross-process primitive that works across any number of processes on the same filesystem. The problem was never the lock implementation; it was that our old per-container bind-mount layout (~/.claude mounted but /home/agent/ container-local) caused each container to compute a different lockfile path, so the locks never coordinated.
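The mkdir primitive is easy to see in isolation. A minimal shell sketch of the same idea (illustrative path and helper name, not factory code; proper-lockfile implements this one level up in Node):

```shell
#!/usr/bin/env sh
# mkdir is atomic: it either creates the directory (lock acquired) or
# fails because it already exists (lock held by another process).
LOCK_DIR="${TMPDIR:-/tmp}/claude-refresh-demo.$$.lock"

acquire() { mkdir "$1" 2>/dev/null; }

if acquire "$LOCK_DIR"; then
  echo "lock acquired"
  # critical section: the token refresh would run here
  rmdir "$LOCK_DIR"   # release
else
  echo "lock held elsewhere"
fi
```

Because the kernel guarantees mkdir's create-or-fail is atomic, this works across any number of processes sharing the filesystem, which is exactly the property the shared-mount fix relies on.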

The fix: shared CLAUDE_CONFIG_DIR

CLAUDE_CONFIG_DIR is an officially supported env var in Claude Code (src/utils/envUtils.ts). It controls where Claude resolves its config directory instead of the default ~/.claude.

By setting CLAUDE_CONFIG_DIR to a path on a shared bind mount, every container computes the same lockfile location. proper-lockfile's atomic mkdir(${CLAUDE_CONFIG_DIR}.lock) then provides cross-container serialization for free — no external wrapper needed.

Current layout

Host filesystem:
  /var/lib/disinto/claude-shared/          ← CLAUDE_SHARED_DIR
  └── config/                              ← CLAUDE_CONFIG_DIR
      ├── credentials.json
      ├── settings.json
      └── ...

Inside every container:
  Same absolute path: /var/lib/disinto/claude-shared/config
  Env: CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config

The shared directory is mounted at the same absolute path inside every container, so proper-lockfile resolves an identical lock path everywhere.
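Since everything hinges on the lock path resolving identically, a preflight check can assert the layout before agents start. This is a hypothetical helper (function name and messages are illustrative, not part of the factory); note the lock directory is created *beside* the config dir, so the parent must be writable too:

```shell
#!/usr/bin/env sh
# Hypothetical preflight: verify the config dir exists and that
# proper-lockfile will be able to mkdir "<config dir>.lock" next to it.
preflight_config_dir() {
  cfg="${1:?usage: preflight_config_dir <config-dir>}"
  parent=$(dirname "$cfg")
  [ -d "$cfg" ]    || { echo "missing config dir: $cfg"; return 1; }
  [ -w "$parent" ] || { echo "lock parent not writable: $parent"; return 1; }
  echo "ok: $cfg (lock would be $cfg.lock)"
}
```

Running it with the documented path (`preflight_config_dir /var/lib/disinto/claude-shared/config`) inside each container would catch a missing or misaligned mount before the first refresh races.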

Where these values are defined

| What | Where |
| --- | --- |
| Defaults for CLAUDE_SHARED_DIR, CLAUDE_CONFIG_DIR | lib/env.sh:138-140 |
| .env documentation | .env.example:92-99 |
| Container mounts + env passthrough (edge dispatcher) | docker/edge/dispatcher.sh:446-448 (and analogous blocks for reproduce, triage, verify) |
| Auth detection using CLAUDE_CONFIG_DIR | docker/agents/entrypoint.sh:101-102 |
| Bootstrap / migration during disinto init | lib/claude-config.sh:setup_claude_config_dir(), bin/disinto:952-962 |

Migration for existing dev boxes

For operators upgrading from the old ~/.claude bind-mount layout, disinto init handles the migration interactively (or with --yes). The manual equivalent is:

# 1. Stop the factory
disinto down

# 2. Create the shared directory
mkdir -p /var/lib/disinto/claude-shared

# 3. Move existing config
mv "$HOME/.claude" /var/lib/disinto/claude-shared/config

# 4. Create a back-compat symlink so host-side claude still works
ln -sfn /var/lib/disinto/claude-shared/config "$HOME/.claude"

# 5. Export the env var (add to shell rc for persistence)
export CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config

# 6. Start the factory
disinto up
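The manual steps above can be condensed into a re-runnable function. This is an illustrative sketch (disinto init remains the supported path; the function name is not factory code), parameterized so it can be rehearsed against a scratch directory before touching the real host:

```shell
#!/usr/bin/env sh
# Sketch of the manual migration as an idempotent function.
# $1 = home dir (default $HOME), $2 = shared dir (default as documented).
migrate_claude_config() {
  home="${1:-$HOME}"
  shared="${2:-/var/lib/disinto/claude-shared}"
  config="$shared/config"

  mkdir -p "$shared"

  # Move the legacy ~/.claude only on first run: it must be a real
  # directory (not already the back-compat symlink) and nothing may
  # have been migrated into the shared location yet.
  if [ -d "$home/.claude" ] && [ ! -L "$home/.claude" ] && [ ! -e "$config" ]; then
    mv "$home/.claude" "$config"
  fi

  mkdir -p "$config"
  # Back-compat symlink so host-side claude keeps working (safe to re-run).
  ln -sfn "$config" "$home/.claude"

  echo "$config"
}
```

With the defaults, running it requires write access to /var/lib; afterwards `export CLAUDE_CONFIG_DIR="$(migrate_claude_config)"` covers steps 2 through 5 in one shot.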

Verification

Watch for these analytics events during concurrent agent runs:

| Event | Meaning |
| --- | --- |
| tengu_oauth_token_refresh_lock_acquiring | A process is attempting to acquire the refresh lock |
| tengu_oauth_token_refresh_lock_acquired | Lock acquired; refresh proceeding |
| tengu_oauth_token_refresh_lock_retry | Lock is held by another process; retrying |
| tengu_oauth_token_refresh_lock_race_resolved | Contention detected and resolved normally |
| tengu_oauth_token_refresh_lock_retry_limit_reached | Lock acquisition failed after all retries |

Healthy: _race_resolved appearing during contention windows — this means multiple processes tried to refresh simultaneously and the lock correctly serialized them.

Bad: _lock_retry_limit_reached — indicates the lock is stuck or the shared mount is not working. Verify that CLAUDE_CONFIG_DIR resolves to the same path in all containers and that the filesystem supports mkdir atomicity (any POSIX filesystem does).
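When _lock_retry_limit_reached shows up, the lock can also be inspected by hand: per the layout above it is a directory named after the config dir plus a .lock suffix, and proper-lockfile periodically touches its mtime while held, so an old mtime suggests the holder died without releasing. A hypothetical diagnostic (the function name and the 10-minute threshold are illustrative choices, not Claude Code constants):

```shell
#!/usr/bin/env sh
# Hypothetical check for a stuck refresh lock beside the config dir.
check_refresh_lock() {
  cfg="${1:-${CLAUDE_CONFIG_DIR:-$HOME/.claude}}"
  lock="$cfg.lock"
  if [ ! -d "$lock" ]; then
    echo "no lock held"
  elif [ -n "$(find "$lock" -maxdepth 0 -mmin +10)" ]; then
    # mtime not refreshed for >10 min: holder likely died mid-refresh
    echo "stale lock: $lock (untouched for >10 min)"
  else
    echo "lock currently held: $lock"
  fi
}
```

A stale lock directory can be removed with rmdir once you have confirmed no refresh is actually in flight.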

The deferred external flock wrapper

lib/agent-sdk.sh:139,144 still wraps every claude invocation in an external flock on ${HOME}/.claude/session.lock:

local lock_file="${HOME}/.claude/session.lock"
...
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
  claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)

With the CLAUDE_CONFIG_DIR fix in place, this external lock is redundant but harmless: proper-lockfile serializes the refresh internally, and flock serializes the entire invocation externally. The external flock remains as a defense-in-depth measure; removal is tracked as a separate vision-tier issue.
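The fd-9 flock idiom quoted above is easy to exercise standalone. A minimal sketch (illustrative function, not factory code; assumes util-linux flock): four concurrent workers do a read-modify-write that would race without the lock, and the exclusive flock serializes them so the counter reaches exactly 4:

```shell
#!/usr/bin/env sh
# Demonstrate the agent-sdk.sh locking idiom: open the lock file on
# fd 9 for the subshell, then take an exclusive flock on that fd.
demo_flock() {
  dir=$(mktemp -d)
  lock_file="$dir/session.lock"
  log="$dir/counter"
  for _ in 1 2 3 4; do
    (
      flock -w 10 9 || exit 1
      # critical section: read-modify-write is safe under the lock
      n=$(cat "$log" 2>/dev/null || echo 0)
      echo $((n + 1)) > "$log"
    ) 9>"$lock_file" &
  done
  wait
  cat "$log"
  rm -rf "$dir"
}
```

Dropping the flock line turns the counter into a lost-update race, which is the same failure mode the refresh lock prevents for OAuth tokens.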

See also

  • lib/env.sh:138-140 — CLAUDE_SHARED_DIR / CLAUDE_CONFIG_DIR defaults
  • lib/claude-config.sh — migration helper used by disinto init
  • lib/agent-sdk.sh:139,144 — the external flock wrapper (deferred removal)
  • docker/agents/entrypoint.sh:101-102 — CLAUDE_CONFIG_DIR auth detection
  • .env.example:92-99 — operator-facing documentation of the env vars
  • Issue #623 — chat container auth strategy