Files
AX-Copilot-Codex/docs/CODE_CONTEXT_RELIABILITY_PLAN.md
lacvet 0f64bf3f84 Code 탭 컨텍스트 누적 신뢰성과 작업 연속성을 전면 보강한다
이번 커밋은 Code 탭 장기 실행에서 build/file 근거가 너무 빨리 축약되고, 이전 수정 맥락이 다음 LLM 요청에 안정적으로 누적되지 않던 문제를 해결하기 위한 전면 보강을 담는다.

핵심 수정사항:
- CodeTaskWorkingSetService를 추가해 최근 생성 디렉터리, 최근 읽기/쓰기 파일, 최신 build/test 진단, 다음 복구 초점을 구조화된 working set으로 유지하고 각 반복 요청에 보조 system context로 주입한다.
- AgentQueryContextBuilder와 AgentToolResultBudget에 code profile을 도입해 protected recent window와 tool_result budget을 확장하고 build_run, test_loop, file_read, multi_read, lsp_code_intel, git_tool 같은 고가치 evidence가 기본 탭보다 덜 잘리도록 조정한다.
- AgentLoopIterationPreparationService와 AgentLoopLlmRequestPreparationService를 확장해 query-context options와 supplemental messages를 함께 전달하고, AgentLoopService에서는 Code 탭에서 generic session learnings 대신 working set 중심으로 요청을 구성하도록 변경한다.
- ChatWindow.UtilityPresentation에서 workspace context 첫 부트스트랩을 강화해 .ax-context.md가 아직 없더라도 첫 요청 시점부터 background generation과 language workflow bootstrap hints가 반영되도록 수정한다.
- LlmService.ToolUse에서 historical tool trace sanitization 결과를 assistant flatten/orphan conversion 건수로 요약 로그에 남겨 tool-trace 불변식 문제를 추적 가능하게 만든다.
- 관련 테스트를 추가·갱신해 working set 누적, code profile budget, supplemental message 주입, query-context option 전달을 회귀 고정한다.

검증 결과:
- dotnet build src/AxCopilot/AxCopilot.csproj -c Release -v minimal -p:OutputPath=bin\\verify_context_reliability_full\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_full\\ : 경고 0 / 오류 0
- dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AgentQueryContextBuilderTests|AgentToolResultBudgetTests|AgentLoopIterationPreparationServiceTests|AgentLoopLlmRequestPreparationServiceTests|CodeTaskWorkingSetServiceTests|AgentLoopCodeQualityTests" -p:OutputPath=bin\\verify_context_reliability_full_tests\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_full_tests\\ : 통과 150
- dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AgentLoopE2ETests|AgentMessageInvariantHelperTests" -p:OutputPath=bin\\verify_context_reliability_e2e\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_e2e\\ : 통과 21
2026-04-16 01:45:28 +09:00

12 KiB

Code Context Reliability Plan

Update: 2026-04-16 01:37 (KST)

Background

Recent Code tab runs show that the LLM request payload is still growing over time. In the 2026-04-16 00:46:26 to 00:50:52 run, the request size grew from messages=7 to messages=125. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.

The same log window repeatedly shows:

  • tool_calls/tool mismatch detected - flattening assistant message
  • orphan tool message detected - converting to user
  • repeated rereads of nearby files after build failures
  • shifting build failures such as MC3089 followed by CS0017 without a stable working set that preserves what was already changed and what remains broken

In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.

Current Findings

1. Workspace context bootstrap is weak on first load

  • AX targets:
    • src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
    • src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
  • Finding:
    • When .ax-context.md is missing, the first Code request can return before background workspace-context generation becomes useful.
  • Impact:
    • Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.

2. Build and file evidence is compacted too aggressively

  • AX targets:
    • src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
    • src/AxCopilot/Services/Agent/ContextCondenser.cs
  • Current values:
    • DefaultSoftCharLimit = 900
    • DefaultAggregateBudgetChars = 7_500
    • RecentKeepCount = 6
  • Impact:
    • Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.

3. Session learning is not a durable code working set

  • AX targets:
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
    • src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
  • Finding:
    • Session learnings are injected every loop, but they are not structured strongly enough to lock in:
      • current goal
      • current architecture
      • changed files
      • latest build or test failure
      • next repair target
  • Impact:
    • The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.

4. Tool-trace invariant repairs are too common

  • AX targets:
    • src/AxCopilot/Services/LlmService.ToolUse.cs
    • src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
  • Finding:
    • The recent logs show repeated mismatch and orphan corrections.
  • Impact:
    • Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.

5. There is no Code-specific working-set layer

  • AX targets:
    • new service required
    • injection path should go through:
      • AgentLoopLlmRequestPreparationService
      • AgentQueryContextBuilder
      • AgentLoopService
  • Finding:
    • The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
  • Impact:
    • Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory.

External Research Notes

Anthropic Claude Code memory docs

  • Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via /memory.
  • Planning implication:
    • AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
  • Source:

OpenAI practical guide to building agents

  • The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
  • Planning implication:
    • AX should log the exact context sections that enter each Code request, including what was compacted and why.
  • Source:

SWE-Pruner

  • The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
  • Planning implication:
    • AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
  • Source:

claude-code Reference Points

Reference targets:

  • claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts
  • claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts
  • claw-code/en/concepts/memory-context.md

Observed direction:

  • claude-code builds a dedicated messagesForQuery window.
  • It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
  • It treats memory and post-compaction query windows as first-class parts of the request path.

AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.

Remediation Plan

Phase 1. Context observability and bootstrap repair

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/en/concepts/memory-context.md
  • AX targets:
    • src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
    • src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
    • src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs
  • Work items:
    • guarantee workspace-context generation starts even on first miss
    • log the exact context sections injected into each request
    • add diagnostics for omitted sections
  • Done criteria:
    • empty-workspace runs show workspace context generation by loop 2
    • logs show section names, sizes, and compaction status
  • Quality scenario:
    • a fresh E:\code WPF scaffolding run should show folder and project context in the first two request cycles

Phase 2. Code working-set memory layer

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/.../src/history.ts
    • Anthropic memory docs
  • AX targets:
    • new CodeTaskWorkingSetService
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
  • Work items:
    • maintain a stable structured ledger with:
      • current goal
      • selected architecture
      • changed files
      • latest successful writes
      • open diagnostics
      • next repair target
    • inject it only when changed
    • replace superseded failures with the latest active issue
  • Done criteria:
    • long Code runs keep a single coherent working-set block without noisy duplication
    • build and test failures are preserved as part of the working set
  • Quality scenario:
    • after fixing MC3089, the run should still remember the earlier structure change while focusing on the new CS0017 entry-point failure

Phase 3. Task-aware pruning and protected evidence

  • Reference targets:
    • claw-code/.../src/query.ts
    • SWE-Pruner
  • AX targets:
    • src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
    • src/AxCopilot/Services/Agent/ContextCondenser.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
  • Work items:
    • protect:
      • latest build error block
      • latest test failure block
      • current plan or working set
      • latest folder tree snapshot
      • last N write diffs
    • move from pure char-based truncation toward semantic snapshots
    • tune compaction rules specifically for Code tasks
  • Done criteria:
    • active repair evidence survives across loops until superseded
    • older noise shrinks without losing the current failure context
  • Quality scenario:
    • a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload

Phase 4. Tool-trace invariant hardening

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/.../src/history.ts
  • AX targets:
    • src/AxCopilot/Services/LlmService.ToolUse.cs
    • src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
  • Work items:
    • shift from after-the-fact flattening to pre-request validation and normalization
    • classify mismatch and orphan causes and lock them with regression tests
    • add a final integrity pass before query submission
  • Done criteria:
    • standard Code runs approach zero mismatch or orphan repair logs
    • assistant, tool, and tool_result chains remain intact end to end
  • Quality scenario:
    • a 50-loop Code run should complete without repeated tool-trace repair events

Phase 5. Encoding hygiene and prompt cleanup

  • Reference targets:
    • Anthropic memory docs
    • OpenAI practical guide eval and observability recommendations
  • AX targets:
    • src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • active status, prompt, and catalog files
    • AGENTS.md
  • Work items:
    • enforce English-only comments in code files
    • rewrite mojibake strings in active prompt paths into English
    • add long-run Code evals to catch prompt and status encoding regressions
  • Done criteria:
    • no broken strings remain in active prompt or status paths
    • touched code files keep English comments only
  • Quality scenario:
    • Windows Korean environments should show readable build, test, and status output without mojibake feedback loops

Priority

  1. Phase 1: bootstrap and observability
  2. Phase 2: working-set memory
  3. Phase 3: task-aware pruning
  4. Phase 4: tool-trace invariants
  5. Phase 5: encoding and prompt cleanup

Expected Outcome

  • fewer repeated build-failure loops
  • better structural consistency for project generation and large edits
  • less drift in long-running Code tasks
  • fewer quality losses caused by broken strings and low-signal context replacements

Latest Delivery

Updated: 2026-04-16 01:41 (KST)

  • Delivered in this pass:
    • Phase 1 foundation:
      • ChatWindow.UtilityPresentation.cs now bootstraps workspace context generation on first access and returns language-workflow fallback hints while .ax-context.md is still being generated.
      • AgentLoopService.cs now records query_context workflow transitions with query-window, budget, supplemental-context, and working-set summaries.
    • Phase 2 foundation:
      • CodeTaskWorkingSetService.cs adds a Code-only structured ledger for:
        • goal
        • selected scaffold/profile
        • created directories
        • recent reads/writes
        • latest diagnostics
        • next repair focus
      • the working set is injected into each Code request as a supplemental code_working_set system message.
    • Phase 3 foundation:
      • AgentToolResultBudget.cs and AgentQueryContextBuilder.cs now expose a code query profile with a larger protected-recent window and larger retained budgets for build_run, test_loop, process, file_read, multi_read, lsp_code_intel, and git_tool.
    • Phase 4 observability step:
      • LlmService.ToolUse.cs now logs sanitization counts for flattened assistant tool traces and converted orphan tool messages, so tool-trace repair frequency can be measured per run.
  • Remaining follow-up:
    • extend pre-request tool-trace validation so the flattening/orphan repair count trends toward zero rather than being logged after repair
    • replace more mojibake prompt/status strings in active Code execution paths with English equivalents