Files
AX-Copilot-Codex/docs/CODE_CONTEXT_RELIABILITY_PLAN.md
lacvet eb884e9263 지침과 문서에 코드 컨텍스트 안정화 계획을 반영한다
- AGENTS.md에 코드 파일 주석 영문화와 인코딩 손상 문자열 정리 규칙을 추가한다.

- 최근 Code 탭 실행 로그를 재분석해 메시지 수 증가 대비 컨텍스트 충실도 저하 원인을 정리한다.

- Code working set, task-aware pruning, tool trace invariant, bootstrap observability를 포함한 장기 수정 계획 문서를 추가한다.

- README와 DEVELOPMENT 문서에 2026-04-16 01:28 KST 기준 분석 결과와 후속 계획을 기록한다.

- 검증: dotnet build src\\AxCopilot\\AxCopilot.csproj -c Release -v minimal -p:OutputPath=bin\\verify_context_plan_docs\\ -p:IntermediateOutputPath=obj\\verify_context_plan_docs\\ (경고 0 / 오류 0)
2026-04-16 01:20:14 +09:00

10 KiB

Code Context Reliability Plan

Update: 2026-04-16 01:37 (KST)

Background

Recent Code tab runs show that the LLM request payload is still growing over time. In the 2026-04-16 00:46:26 to 00:50:52 run, the request size grew from messages=7 to messages=125. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.

The same log window repeatedly shows:

  • tool_calls/tool mismatch detected - flattening assistant message
  • orphan tool message detected - converting to user
  • repeated rereads of nearby files after build failures
  • shifting build failures such as MC3089 followed by CS0017 without a stable working set that preserves what was already changed and what remains broken

In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.

Current Findings

1. Workspace context bootstrap is weak on first load

  • AX targets:
    • src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
    • src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
  • Finding:
    • When .ax-context.md is missing, the first Code request can return before background workspace-context generation becomes useful.
  • Impact:
    • Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.

2. Build and file evidence is compacted too aggressively

  • AX targets:
    • src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
    • src/AxCopilot/Services/Agent/ContextCondenser.cs
  • Current values:
    • DefaultSoftCharLimit = 900
    • DefaultAggregateBudgetChars = 7_500
    • RecentKeepCount = 6
  • Impact:
    • Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.

3. Session learning is not a durable code working set

  • AX targets:
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
    • src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
  • Finding:
    • Session learnings are injected every loop, but they are not structured strongly enough to lock in:
      • current goal
      • current architecture
      • changed files
      • latest build or test failure
      • next repair target
  • Impact:
    • The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.

4. Tool-trace invariant repairs are too common

  • AX targets:
    • src/AxCopilot/Services/LlmService.ToolUse.cs
    • src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
  • Finding:
    • The recent logs show repeated mismatch and orphan corrections.
  • Impact:
    • Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.

5. There is no Code-specific working-set layer

  • AX targets:
    • new service required
    • injection path should go through:
      • AgentLoopLlmRequestPreparationService
      • AgentQueryContextBuilder
      • AgentLoopService
  • Finding:
    • The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
  • Impact:
    • Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory.

External Research Notes

Anthropic Claude Code memory docs

  • Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via /memory.
  • Planning implication:
    • AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
  • Source:

OpenAI practical guide to building agents

  • The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
  • Planning implication:
    • AX should log the exact context sections that enter each Code request, including what was compacted and why.
  • Source:

SWE-Pruner

  • The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
  • Planning implication:
    • AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
  • Source:

claude-code Reference Points

Reference targets:

  • claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts
  • claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts
  • claw-code/en/concepts/memory-context.md

Observed direction:

  • claude-code builds a dedicated messagesForQuery window.
  • It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
  • It treats memory and post-compaction query windows as first-class parts of the request path.

AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.

Remediation Plan

Phase 1. Context observability and bootstrap repair

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/en/concepts/memory-context.md
  • AX targets:
    • src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
    • src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
    • src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs
  • Work items:
    • guarantee workspace-context generation starts even on first miss
    • log the exact context sections injected into each request
    • add diagnostics for omitted sections
  • Done criteria:
    • empty-workspace runs show workspace context generation by loop 2
    • logs show section names, sizes, and compaction status
  • Quality scenario:
    • a fresh E:\code WPF scaffolding run should show folder and project context in the first two request cycles

Phase 2. Code working-set memory layer

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/.../src/history.ts
    • Anthropic memory docs
  • AX targets:
    • new CodeTaskWorkingSetService
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
  • Work items:
    • maintain a stable structured ledger with:
      • current goal
      • selected architecture
      • changed files
      • latest successful writes
      • open diagnostics
      • next repair target
    • inject it only when changed
    • replace superseded failures with the latest active issue
  • Done criteria:
    • long Code runs keep a single coherent working-set block without noisy duplication
    • build and test failures are preserved as part of the working set
  • Quality scenario:
    • after fixing MC3089, the run should still remember the earlier structure change while focusing on the new CS0017 entry-point failure

Phase 3. Task-aware pruning and protected evidence

  • Reference targets:
    • claw-code/.../src/query.ts
    • SWE-Pruner
  • AX targets:
    • src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
    • src/AxCopilot/Services/Agent/ContextCondenser.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
  • Work items:
    • protect:
      • latest build error block
      • latest test failure block
      • current plan or working set
      • latest folder tree snapshot
      • last N write diffs
    • move from pure char-based truncation toward semantic snapshots
    • tune compaction rules specifically for Code tasks
  • Done criteria:
    • active repair evidence survives across loops until superseded
    • older noise shrinks without losing the current failure context
  • Quality scenario:
    • a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload

Phase 4. Tool-trace invariant hardening

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/.../src/history.ts
  • AX targets:
    • src/AxCopilot/Services/LlmService.ToolUse.cs
    • src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
  • Work items:
    • shift from after-the-fact flattening to pre-request validation and normalization
    • classify mismatch and orphan causes and lock them with regression tests
    • add a final integrity pass before query submission
  • Done criteria:
    • standard Code runs approach zero mismatch or orphan repair logs
    • assistant, tool, and tool_result chains remain intact end to end
  • Quality scenario:
    • a 50-loop Code run should complete without repeated tool-trace repair events

Phase 5. Encoding hygiene and prompt cleanup

  • Reference targets:
    • Anthropic memory docs
    • OpenAI practical guide eval and observability recommendations
  • AX targets:
    • src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • active status, prompt, and catalog files
    • AGENTS.md
  • Work items:
    • enforce English-only comments in code files
    • rewrite mojibake strings in active prompt paths into English
    • add long-run Code evals to catch prompt and status encoding regressions
  • Done criteria:
    • no broken strings remain in active prompt or status paths
    • touched code files keep English comments only
  • Quality scenario:
    • Windows Korean environments should show readable build, test, and status output without mojibake feedback loops

Priority

  1. Phase 1: bootstrap and observability
  2. Phase 2: working-set memory
  3. Phase 3: task-aware pruning
  4. Phase 4: tool-trace invariants
  5. Phase 5: encoding and prompt cleanup

Expected Outcome

  • fewer repeated build-failure loops
  • better structural consistency for project generation and large edits
  • less drift in long-running Code tasks
  • fewer quality losses caused by broken strings and low-signal context replacements