This commit is a comprehensive hardening pass for a problem in long-running Code tab runs: build/file evidence was compacted too quickly, and context from earlier fixes did not accumulate reliably into the next LLM request.
Key changes:
- Add CodeTaskWorkingSetService, which keeps recently created directories, recently read/written files, the latest build/test diagnostics, and the next repair focus as a structured working set, and injects it into each iteration's request as supplemental system context.
- Introduce a code profile in AgentQueryContextBuilder and AgentToolResultBudget that widens the protected recent window and the tool_result budget, so high-value evidence such as build_run, test_loop, file_read, multi_read, lsp_code_intel, and git_tool is truncated less than in the default tab.
- Extend AgentLoopIterationPreparationService and AgentLoopLlmRequestPreparationService to pass query-context options and supplemental messages together; in AgentLoopService, Code tab requests are now built around the working set instead of generic session learnings.
- Strengthen the first workspace-context bootstrap in ChatWindow.UtilityPresentation so that background generation and language workflow bootstrap hints take effect from the first request, even when .ax-context.md does not exist yet.
- In LlmService.ToolUse, write a summary log of historical tool trace sanitization results as assistant flatten / orphan conversion counts, so tool-trace invariant problems become traceable.
- Add and update tests that pin working set accumulation, the code profile budget, supplemental message injection, and query-context option passing against regressions.
Verification results:
- dotnet build src/AxCopilot/AxCopilot.csproj -c Release -v minimal -p:OutputPath=bin\\verify_context_reliability_full\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_full\\ : 0 warnings / 0 errors
- dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AgentQueryContextBuilderTests|AgentToolResultBudgetTests|AgentLoopIterationPreparationServiceTests|AgentLoopLlmRequestPreparationServiceTests|CodeTaskWorkingSetServiceTests|AgentLoopCodeQualityTests" -p:OutputPath=bin\\verify_context_reliability_full_tests\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_full_tests\\ : 150 passed
- dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AgentLoopE2ETests|AgentMessageInvariantHelperTests" -p:OutputPath=bin\\verify_context_reliability_e2e\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_e2e\\ : 21 passed
Code Context Reliability Plan
Update: 2026-04-16 01:37 (KST)
Background
Recent Code tab runs show that the LLM request payload is still growing over time. In the 2026-04-16 00:46:26 to 00:50:52 run, the request size grew from messages=7 to messages=125. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.
The same log window repeatedly shows:
- `tool_calls/tool mismatch detected - flattening assistant message`
- `orphan tool message detected - converting to user`
- repeated rereads of nearby files after build failures
- shifting build failures such as `MC3089` followed by `CS0017` without a stable working set that preserves what was already changed and what remains broken
In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.
Current Findings
1. Workspace context bootstrap is weak on first load
- AX targets:
  - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs`
  - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs`
- Finding:
  - When `.ax-context.md` is missing, the first Code request can return before background workspace-context generation becomes useful.
- Impact:
  - Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.
2. Build and file evidence is compacted too aggressively
- AX targets:
  - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs`
  - `src/AxCopilot/Services/Agent/ContextCondenser.cs`
- Current values:
  - `DefaultSoftCharLimit = 900`
  - `DefaultAggregateBudgetChars = 7_500`
  - `RecentKeepCount = 6`
- Impact:
  - Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.
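To make the impact concrete, here is a minimal sketch of how limits of this size can squeeze out evidence. It is written in Python for brevity and is not the logic in `AgentToolResultBudget.cs` or `ContextCondenser.cs`: only the three constant values come from the list above, while the `ToolResult` type, the `preview` and `apply_budget` helpers, and the way the three values interact are assumptions made for illustration.

```python
from dataclasses import dataclass

# Values quoted in the finding above. Everything else is hypothetical.
DEFAULT_SOFT_CHAR_LIMIT = 900
DEFAULT_AGGREGATE_BUDGET_CHARS = 7_500
RECENT_KEEP_COUNT = 6

MARKER = " ...[truncated]"


@dataclass
class ToolResult:
    tool: str
    text: str


def preview(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters, marking the cut when there is room."""
    if len(text) <= limit:
        return text
    if limit <= len(MARKER):
        return text[: max(0, limit)]
    return text[: limit - len(MARKER)] + MARKER


def apply_budget(results: list[ToolResult]) -> list[ToolResult]:
    """Spend the aggregate budget newest-first (assumed policy, for illustration)."""
    out: list[ToolResult] = []
    remaining = DEFAULT_AGGREGATE_BUDGET_CHARS
    for age, result in enumerate(reversed(results)):
        recent = age < RECENT_KEEP_COUNT
        # Recent results may use all remaining budget; older ones get the soft limit.
        limit = remaining if recent else min(DEFAULT_SOFT_CHAR_LIMIT, remaining)
        text = preview(result.text, limit)
        remaining -= len(text)
        out.append(ToolResult(result.tool, text))
    out.reverse()
    return out
```

Under these assumptions, a single 8,000-character build log consumes the whole aggregate budget, and the file read that preceded it is reduced to nothing. That is the kind of loss the finding describes.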
3. Session learning is not a durable code working set
- AX targets:
  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
  - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs`
- Finding:
  - Session learnings are injected every loop, but they are not structured strongly enough to lock in:
    - current goal
    - current architecture
    - changed files
    - latest build or test failure
    - next repair target
- Impact:
  - The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.
4. Tool-trace invariant repairs are too common
- AX targets:
  - `src/AxCopilot/Services/LlmService.ToolUse.cs`
  - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
- Finding:
  - The recent logs show repeated mismatch and orphan corrections.
- Impact:
  - Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.
5. There is no Code-specific working-set layer
- AX targets:
  - new service required
  - injection path should go through:
    - `AgentLoopLlmRequestPreparationService`
    - `AgentQueryContextBuilder`
    - `AgentLoopService`
- Finding:
  - The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
- Impact:
  - Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory.
External Research Notes
Anthropic Claude Code memory docs
- Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via `/memory`.
- Planning implication:
  - AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
- Source:
OpenAI practical guide to building agents
- The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
- Planning implication:
  - AX should log the exact context sections that enter each Code request, including what was compacted and why.
- Source:
SWE-Pruner
- The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
- Planning implication:
  - AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
- Source:
claude-code Reference Points
Reference targets:
- `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts`
- `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts`
- `claw-code/en/concepts/memory-context.md`
Observed direction:
- `claude-code` builds a dedicated `messagesForQuery` window.
- It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
- It treats memory and post-compaction query windows as first-class parts of the request path.
AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.
Remediation Plan
Phase 1. Context observability and bootstrap repair
- Reference targets:
  - `claw-code/.../src/query.ts`
  - `claw-code/en/concepts/memory-context.md`
- AX targets:
  - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs`
  - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs`
  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs`
- Work items:
  - guarantee workspace-context generation starts even on first miss
  - log the exact context sections injected into each request
  - add diagnostics for omitted sections
- Done criteria:
  - empty-workspace runs show workspace context generation by loop 2
  - logs show section names, sizes, and compaction status
- Quality scenario:
  - a fresh `E:\code` WPF scaffolding run should show folder and project context in the first two request cycles
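The Phase 1 logging criterion (section names, sizes, and compaction status) could look roughly like the following. This is an illustrative Python sketch, not the project's C# code: the `ContextSection` type, the `describe_sections` helper, the section names, and the log line format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextSection:
    name: str                          # hypothetical section label
    text: str
    compacted: bool = False
    omitted_reason: Optional[str] = None


def describe_sections(loop: int, sections: list[ContextSection]) -> str:
    """Build one log line listing every section, including omitted ones."""
    parts = []
    for section in sections:
        if section.omitted_reason is not None:
            parts.append(f"{section.name}=omitted({section.omitted_reason})")
        else:
            status = "compacted" if section.compacted else "full"
            parts.append(f"{section.name}={len(section.text)}ch/{status}")
    return f"loop={loop} " + " ".join(parts)
```

Logging omitted sections together with a reason is the part that would make a first-load miss visible: a missing workspace context then shows up as an explicit entry instead of silently not appearing.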
Phase 2. Code working-set memory layer
- Reference targets:
  - `claw-code/.../src/query.ts`
  - `claw-code/.../src/history.ts`
  - Anthropic memory docs
- AX targets:
  - new `CodeTaskWorkingSetService`
  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
- Work items:
  - maintain a stable structured ledger with:
    - current goal
    - selected architecture
    - changed files
    - latest successful writes
    - open diagnostics
    - next repair target
  - inject it only when changed
  - replace superseded failures with the latest active issue
- Done criteria:
  - long Code runs keep a single coherent working-set block without noisy duplication
  - build and test failures are preserved as part of the working set
- Quality scenario:
  - after fixing `MC3089`, the run should still remember the earlier structure change while focusing on the new `CS0017` entry-point failure
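As a rough illustration of the Phase 2 ledger, the sketch below keeps the fields listed above, replaces superseded diagnostics with the latest build's, and renders a block only when it changed. It is Python written for this document, not the actual `CodeTaskWorkingSetService`: the class name `CodeWorkingSet`, its methods, and the rendered text format are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodeWorkingSet:
    goal: str = ""
    architecture: str = ""
    changed_files: list[str] = field(default_factory=list)
    open_diagnostics: dict[str, str] = field(default_factory=dict)  # code -> message
    next_repair_target: str = ""
    _last_rendered: str = field(default="", repr=False)

    def record_write(self, path: str) -> None:
        """Move the path to the end so the newest write is listed last."""
        if path in self.changed_files:
            self.changed_files.remove(path)
        self.changed_files.append(path)

    def record_build(self, diagnostics: dict[str, str]) -> None:
        """Replace superseded failures with the latest build's diagnostics."""
        self.open_diagnostics = dict(diagnostics)
        self.next_repair_target = next(iter(diagnostics), "")

    def render(self) -> str:
        lines = [f"goal: {self.goal}", f"architecture: {self.architecture}"]
        lines.append("changed files: " + ", ".join(self.changed_files))
        for code, message in self.open_diagnostics.items():
            lines.append(f"open diagnostic: {code} {message}")
        lines.append(f"next repair target: {self.next_repair_target}")
        return "\n".join(lines)

    def render_if_changed(self) -> Optional[str]:
        """Return the block only when it differs from the last one returned."""
        text = self.render()
        if text == self._last_rendered:
            return None
        self._last_rendered = text
        return text
```

In the quality scenario above, a build that reports only `CS0017` after the `MC3089` fix would drop `MC3089` from the open diagnostics while the changed-file list still records the earlier structural edit.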
Phase 3. Task-aware pruning and protected evidence
- Reference targets:
  - `claw-code/.../src/query.ts`
  - SWE-Pruner
- AX targets:
  - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs`
  - `src/AxCopilot/Services/Agent/ContextCondenser.cs`
  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
- Work items:
  - protect:
    - latest build error block
    - latest test failure block
    - current plan or working set
    - latest folder tree snapshot
    - last N write diffs
  - move from pure char-based truncation toward semantic snapshots
  - tune compaction rules specifically for Code tasks
- Done criteria:
  - active repair evidence survives across loops until superseded
  - older noise shrinks without losing the current failure context
- Quality scenario:
  - a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload
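One way to read "protected evidence" is a two-pass selection: first pin the newest item of each protected kind, then spend the remaining budget on the newest unprotected items. The Python sketch below shows that idea only. The kind labels, the `Evidence` type, and the `prune` function are hypothetical, and it does not reproduce SWE-Pruner or the AX implementation.

```python
from dataclasses import dataclass

# Hypothetical kind labels standing in for the protected items listed above.
PROTECTED_KINDS = {"build_error", "test_failure", "working_set", "folder_tree"}


@dataclass
class Evidence:
    kind: str
    text: str


def prune(items: list[Evidence], budget_chars: int) -> list[Evidence]:
    """Keep the newest item of each protected kind, then newest others that fit."""
    keep: set[int] = set()
    seen: set[str] = set()
    # Pass 1: the newest item per protected kind is kept regardless of size.
    for i in range(len(items) - 1, -1, -1):
        kind = items[i].kind
        if kind in PROTECTED_KINDS and kind not in seen:
            seen.add(kind)
            keep.add(i)
    used = sum(len(items[i].text) for i in keep)
    # Pass 2: everything else, including superseded protected items,
    # competes for the remaining budget from newest to oldest.
    for i in range(len(items) - 1, -1, -1):
        if i in keep:
            continue
        size = len(items[i].text)
        if used + size <= budget_chars:
            keep.add(i)
            used += size
    return [items[i] for i in sorted(keep)]
```

A superseded build error is treated like any other old item here, so it drops out once the budget is tight, while the current failure always survives.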
Phase 4. Tool-trace invariant hardening
- Reference targets:
  - `claw-code/.../src/query.ts`
  - `claw-code/.../src/history.ts`
- AX targets:
  - `src/AxCopilot/Services/LlmService.ToolUse.cs`
  - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
- Work items:
  - shift from after-the-fact flattening to pre-request validation and normalization
  - classify mismatch and orphan causes and lock them with regression tests
  - add a final integrity pass before query submission
- Done criteria:
  - standard Code runs approach zero mismatch or orphan repair logs
  - assistant, tool, and tool_result chains remain intact end to end
- Quality scenario:
  - a 50-loop Code run should complete without repeated tool-trace repair events
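A pre-request integrity pass could be as small as the check sketched below, which reports violations instead of repairing them. It is Python for illustration only and assumes an OpenAI-style message shape (an assistant message carrying `tool_calls` with `id` fields, answered by `tool` messages carrying `tool_call_id`). Both that shape and the `find_tool_trace_violations` name are assumptions, not something confirmed for `AgentMessageInvariantHelper.cs`.

```python
def find_tool_trace_violations(messages: list[dict]) -> list[str]:
    """Report orphan tool messages and assistant tool calls left unanswered."""
    problems: list[str] = []
    pending: set[str] = set()  # call ids still waiting for a tool result
    for index, msg in enumerate(messages):
        role = msg.get("role")
        if role == "tool":
            call_id = msg.get("tool_call_id")
            if call_id in pending:
                pending.discard(call_id)
            else:
                problems.append(f"orphan tool message at {index}: {call_id}")
            continue
        # Any non-tool message closes the window for outstanding calls.
        if pending:
            problems.append(f"unanswered tool_calls before {index}: {sorted(pending)}")
            pending = set()
        if role == "assistant":
            pending = {call["id"] for call in msg.get("tool_calls") or []}
    if pending:
        problems.append(f"unanswered tool_calls at end: {sorted(pending)}")
    return problems
```

Running a check like this before submission, and failing a regression test when it returns anything, is one way to move the mismatch and orphan counts toward zero instead of only logging them after repair.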
Phase 5. Encoding hygiene and prompt cleanup
- Reference targets:
  - Anthropic memory docs
  - OpenAI practical guide eval and observability recommendations
- AX targets:
  - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs`
  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
  - active status, prompt, and catalog files
  - `AGENTS.md`
- Work items:
  - enforce English-only comments in code files
  - rewrite mojibake strings in active prompt paths into English
  - add long-run Code evals to catch prompt and status encoding regressions
- Done criteria:
  - no broken strings remain in active prompt or status paths
  - touched code files keep English comments only
- Quality scenario:
  - Windows Korean environments should show readable build, test, and status output without mojibake feedback loops
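The "English comments only" criterion can be approximated with a very small check. The Python sketch below flags `//` comments that contain non-ASCII characters. It is a proxy rather than a real language detector, the `non_ascii_comment_lines` name is hypothetical, and it deliberately ignores block comments and `//` sequences inside string literals such as URLs.

```python
def non_ascii_comment_lines(source: str) -> list[int]:
    """Return 1-based line numbers whose `//` comment contains non-ASCII text."""
    hits: list[int] = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Naive split: the first `//` on the line is treated as the comment start.
        _, separator, comment = line.partition("//")
        if separator and not comment.isascii():
            hits.append(number)
    return hits
```

Non-ASCII text in string literals is left alone on purpose, since user-facing Korean strings are legitimate and only comments fall under this work item.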
Priority
- Phase 1: bootstrap and observability
- Phase 2: working-set memory
- Phase 3: task-aware pruning
- Phase 4: tool-trace invariants
- Phase 5: encoding and prompt cleanup
Expected Outcome
- fewer repeated build-failure loops
- better structural consistency for project generation and large edits
- less drift in long-running Code tasks
- fewer quality losses caused by broken strings and low-signal context replacements
Latest Delivery
Updated: 2026-04-16 01:41 (KST)
- Delivered in this pass:
  - Phase 1 foundation:
    - `ChatWindow.UtilityPresentation.cs` now bootstraps workspace context generation on first access and returns language-workflow fallback hints while `.ax-context.md` is still being generated.
    - `AgentLoopService.cs` now records `query_context` workflow transitions with query-window, budget, supplemental-context, and working-set summaries.
  - Phase 2 foundation:
    - `CodeTaskWorkingSetService.cs` adds a Code-only structured ledger for:
      - goal
      - selected scaffold/profile
      - created directories
      - recent reads/writes
      - latest diagnostics
      - next repair focus
    - the working set is injected into each Code request as a supplemental `code_working_set` system message.
  - Phase 3 foundation:
    - `AgentToolResultBudget.cs` and `AgentQueryContextBuilder.cs` now expose a `code` query profile with a larger protected-recent window and larger retained budgets for `build_run`, `test_loop`, `process`, `file_read`, `multi_read`, `lsp_code_intel`, and `git_tool`.
  - Phase 4 observability step:
    - `LlmService.ToolUse.cs` now logs sanitization counts for flattened assistant tool traces and converted orphan tool messages, so tool-trace repair frequency can be measured per run.
- Remaining follow-up:
  - extend pre-request tool-trace validation so the flattening/orphan repair count trends toward zero rather than being logged after repair
  - replace more mojibake prompt/status strings in active Code execution paths with English equivalents