- AxAgentExecutionEngine에서 시스템 프롬프트 중복을 제거하고 structured tool_use/tool_result 전사본을 conversation.Messages로 동기화해 다음 턴과 저장 이력에서도 코드 작업 컨텍스트가 유지되도록 수정 - AgentQueryContextBuilder와 ContextCondenser에 post-compact tool snippet 복원, recent window 확대, tool result 보존 강화 로직을 추가해 장기 코드 실행 중 빌드/파일 근거 손실을 줄임 - MaxContextTokens=0 Auto 모드를 AppSettings, SettingsService 마이그레이션, 설정 UI, 오버레이 UI, 컨텍스트 사용량 표시, LLM 요청 본문에 연결하고 Auto 모드에서는 provider output cap 강제 주입을 제거 - 관련 회귀 테스트와 문서 README/DEVELOPMENT/CODE_CONTEXT_RELIABILITY_PLAN을 갱신하고 깨진 진단 문자열 기대값을 영어 기준으로 정리 검증: - dotnet build src/AxCopilot/AxCopilot.csproj -c Release -v minimal -p:OutputPath=bin\\verify_context_reliability_followup\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_followup\\ - dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AxAgentExecutionEngineTests|AgentQueryContextBuilderTests|ContextCondenserTests|SettingsServiceTests|AgentLoopDiagnosticsFormatterTests" -p:OutputPath=bin\\verify_context_reliability_followup_tests\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_followup_tests\\ - dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AgentLoopQueryAssemblyServiceTests|AgentLoopPreLlmStageServiceTests|AgentLoopLlmRequestPreparationServiceTests|AgentMessageInvariantHelperTests|CodeTaskWorkingSetServiceTests|AgentLoopE2ETests" -p:OutputPath=bin\\verify_context_reliability_followup_tests2\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_followup_tests2\\ - dotnet build src/AxCopilot/AxCopilot.csproj -c Release -v minimal -p:OutputPath=bin\\verify_context_reliability_final\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_final\\
17 KiB
Code Context Reliability Plan
Update: 2026-04-16 07:40 (KST)
- Closed the main gaps that were still open versus the comparison checklist:
- durable structured tool transcript persistence back into
conversation.Messages MaxContextTokens = 0Auto mode with model-aware context budget resolution- snippet-based post-compact tool trace restoration
- repeated system prompt deduplication before each prepared turn
- durable structured tool transcript persistence back into
- The active implementation now spans:
src/AxCopilot/Services/Agent/AxAgentExecutionEngine.cssrc/AxCopilot/Services/Agent/AgentQueryContextBuilder.cssrc/AxCopilot/Services/Agent/ContextCondenser.cssrc/AxCopilot/Models/AppSettings.cssrc/AxCopilot/Services/SettingsService.cssrc/AxCopilot/ViewModels/SettingsViewModel.cssrc/AxCopilot/Views/SettingsWindow.xamlsrc/AxCopilot/Views/SettingsWindow.xaml.cssrc/AxCopilot/Views/AgentSettingsWindow.xaml.cssrc/AxCopilot/Views/ChatWindow.xamlsrc/AxCopilot/Views/ChatWindow.xaml.cssrc/AxCopilot/Views/ChatWindow.OverlaySettingsPresentation.cssrc/AxCopilot/Views/ChatWindow.ContextUsagePresentation.cssrc/AxCopilot/Services/LlmService.cssrc/AxCopilot/Services/LlmService.ToolUse.cs
- Remaining follow-up is now narrower:
- consider splitting context budget and output budget into separate user-facing controls if future provider tuning needs it
- continue cleaning legacy mojibake strings in low-traffic diagnostic paths as they are touched
Update: 2026-04-16 06:41 (KST)
- Added
src/AxCopilot/Services/Agent/AgentLoopLlmDispatchStageService.csso the LLM dispatch path is now split into:- history/query assembly
- pre-LLM stage planning
- dispatch/stream stage execution
- tool execution and recovery
AgentLoopServiceno longer owns the inline stream preview callback or the directSendWithToolsWithRecoveryAsync(...)setup for the primary loop.StreamingToolExecutionCoordinator.cswas also normalized to English-only active-path status strings so the staged dispatch path no longer reintroduces mojibake text during wait/retry handling.- Remaining structural gap versus the target
claw-codeshape:- the
NotSupportedException/ToolCallNotSupportedExceptionfallback branch still lives inAgentLoopService - the next extraction target should be a narrower fallback policy stage so the main loop keeps shrinking toward a pure orchestrator
- the
Update: 2026-04-16 01:37 (KST)
Background
Recent Code tab runs show that the LLM request payload is still growing over time. In the 2026-04-16 00:46:26 to 00:50:52 run, the request size grew from messages=7 to messages=125. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.
The same log window repeatedly shows:
tool_calls/tool mismatch detected - flattening assistant messageorphan tool message detected - converting to user- repeated rereads of nearby files after build failures
- shifting build failures such as
MC3089followed byCS0017without a stable working set that preserves what was already changed and what remains broken
In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.
Current Findings
1. Workspace context bootstrap is weak on first load
- AX targets:
src/AxCopilot/Views/ChatWindow.UtilityPresentation.cssrc/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
- Finding:
- When
.ax-context.mdis missing, the first Code request can return before background workspace-context generation becomes useful.
- When
- Impact:
- Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.
2. Build and file evidence is compacted too aggressively
- AX targets:
src/AxCopilot/Services/Agent/AgentToolResultBudget.cssrc/AxCopilot/Services/Agent/ContextCondenser.cs
- Current values:
DefaultSoftCharLimit = 900DefaultAggregateBudgetChars = 7_500RecentKeepCount = 6
- Impact:
- Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.
3. Session learning is not a durable code working set
- AX targets:
src/AxCopilot/Services/Agent/SessionLearningCollector.cssrc/AxCopilot/Services/Agent/AgentLoopService.cssrc/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
- Finding:
- Session learnings are injected every loop, but they are not structured strongly enough to lock in:
- current goal
- current architecture
- changed files
- latest build or test failure
- next repair target
- Session learnings are injected every loop, but they are not structured strongly enough to lock in:
- Impact:
- The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.
4. Tool-trace invariant repairs are too common
- AX targets:
src/AxCopilot/Services/LlmService.ToolUse.cssrc/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cssrc/AxCopilot/Services/Agent/AgentLoopService.cs
- Finding:
- The recent logs show repeated mismatch and orphan corrections.
- Impact:
- Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.
5. There is no Code-specific working-set layer
- AX targets:
- new service required
- injection path should go through:
AgentLoopLlmRequestPreparationServiceAgentQueryContextBuilderAgentLoopService
- Finding:
- The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
- Impact:
- Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory.
External Research Notes
Anthropic Claude Code memory docs
- Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via
/memory. - Planning implication:
- AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
- Source:
OpenAI practical guide to building agents
- The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
- Planning implication:
- AX should log the exact context sections that enter each Code request, including what was compacted and why.
- Source:
SWE-Pruner
- The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
- Planning implication:
- AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
- Source:
claude-code Reference Points
Reference targets:
claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.tsclaw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.tsclaw-code/en/concepts/memory-context.md
Observed direction:
claude-codebuilds a dedicatedmessagesForQuerywindow.- It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
- It treats memory and post-compaction query windows as first-class parts of the request path.
AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.
Remediation Plan
Phase 1. Context observability and bootstrap repair
- Reference targets:
claw-code/.../src/query.tsclaw-code/en/concepts/memory-context.md
- AX targets:
src/AxCopilot/Views/ChatWindow.UtilityPresentation.cssrc/AxCopilot/Services/Agent/WorkspaceContextGenerator.cssrc/AxCopilot/Services/Agent/AgentQueryContextBuilder.cssrc/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs
- Work items:
- guarantee workspace-context generation starts even on first miss
- log the exact context sections injected into each request
- add diagnostics for omitted sections
- Done criteria:
- empty-workspace runs show workspace context generation by loop 2
- logs show section names, sizes, and compaction status
- Quality scenario:
- a fresh
E:\codeWPF scaffolding run should show folder and project context in the first two request cycles
- a fresh
Phase 2. Code working-set memory layer
- Reference targets:
claw-code/.../src/query.tsclaw-code/.../src/history.ts- Anthropic memory docs
- AX targets:
- new
CodeTaskWorkingSetService src/AxCopilot/Services/Agent/SessionLearningCollector.cssrc/AxCopilot/Services/Agent/AgentLoopService.cssrc/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
- new
- Work items:
- maintain a stable structured ledger with:
- current goal
- selected architecture
- changed files
- latest successful writes
- open diagnostics
- next repair target
- inject it only when changed
- replace superseded failures with the latest active issue
- maintain a stable structured ledger with:
- Done criteria:
- long Code runs keep a single coherent working-set block without noisy duplication
- build and test failures are preserved as part of the working set
- Quality scenario:
- after fixing
MC3089, the run should still remember the earlier structure change while focusing on the newCS0017entry-point failure
- after fixing
Phase 3. Task-aware pruning and protected evidence
- Reference targets:
claw-code/.../src/query.ts- SWE-Pruner
- AX targets:
src/AxCopilot/Services/Agent/AgentToolResultBudget.cssrc/AxCopilot/Services/Agent/ContextCondenser.cssrc/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
- Work items:
- protect:
- latest build error block
- latest test failure block
- current plan or working set
- latest folder tree snapshot
- last N write diffs
- move from pure char-based truncation toward semantic snapshots
- tune compaction rules specifically for Code tasks
- protect:
- Done criteria:
- active repair evidence survives across loops until superseded
- older noise shrinks without losing the current failure context
- Quality scenario:
- a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload
Phase 4. Tool-trace invariant hardening
- Reference targets:
claw-code/.../src/query.tsclaw-code/.../src/history.ts
- AX targets:
src/AxCopilot/Services/LlmService.ToolUse.cssrc/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cssrc/AxCopilot/Services/Agent/AgentLoopService.cs
- Work items:
- shift from after-the-fact flattening to pre-request validation and normalization
- classify mismatch and orphan causes and lock them with regression tests
- add a final integrity pass before query submission
- Done criteria:
- standard Code runs approach zero mismatch or orphan repair logs
- assistant, tool, and tool_result chains remain intact end to end
- Quality scenario:
- a 50-loop Code run should complete without repeated tool-trace repair events
Phase 5. Encoding hygiene and prompt cleanup
- Reference targets:
- Anthropic memory docs
- OpenAI practical guide eval and observability recommendations
- AX targets:
src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cssrc/AxCopilot/Services/Agent/SessionLearningCollector.cs- active status, prompt, and catalog files
AGENTS.md
- Work items:
- enforce English-only comments in code files
- rewrite mojibake strings in active prompt paths into English
- add long-run Code evals to catch prompt and status encoding regressions
- Done criteria:
- no broken strings remain in active prompt or status paths
- touched code files keep English comments only
- Quality scenario:
- Windows Korean environments should show readable build, test, and status output without mojibake feedback loops
Priority
- Phase 1: bootstrap and observability
- Phase 2: working-set memory
- Phase 3: task-aware pruning
- Phase 4: tool-trace invariants
- Phase 5: encoding and prompt cleanup
Expected Outcome
- fewer repeated build-failure loops
- better structural consistency for project generation and large edits
- less drift in long-running Code tasks
- fewer quality losses caused by broken strings and low-signal context replacements
Latest Delivery
Updated: 2026-04-16 01:41 (KST)
- Delivered in this pass:
- Phase 1 foundation:
ChatWindow.UtilityPresentation.csnow bootstraps workspace context generation on first access and returns language-workflow fallback hints while.ax-context.mdis still being generated.AgentLoopService.csnow recordsquery_contextworkflow transitions with query-window, budget, supplemental-context, and working-set summaries.
- Phase 2 foundation:
CodeTaskWorkingSetService.csadds a Code-only structured ledger for:- goal
- selected scaffold/profile
- created directories
- recent reads/writes
- latest diagnostics
- next repair focus
- the working set is injected into each Code request as a supplemental
code_working_setsystem message.
- Phase 3 foundation:
AgentToolResultBudget.csandAgentQueryContextBuilder.csnow expose acodequery profile with a larger protected-recent window and larger retained budgets forbuild_run,test_loop,process,file_read,multi_read,lsp_code_intel, andgit_tool.
- Phase 4 observability step:
LlmService.ToolUse.csnow logs sanitization counts for flattened assistant tool traces and converted orphan tool messages, so tool-trace repair frequency can be measured per run.
- Phase 1 foundation:
- Remaining follow-up:
- extend pre-request tool-trace validation so the flattening/orphan repair count trends toward zero rather than being logged after repair
- replace more mojibake prompt/status strings in active Code execution paths with English equivalents
Updated: 2026-04-16 01:57 (KST)
- Delivered in this pass:
- Phase 4 partial delivery:
AgentMessageInvariantHelper.csnow normalizes historical tool traces before the request leaves the agent loop.- structured assistant tool-call messages without matching
tool_resultnow flatten into plain assistant transcript text. - orphan
tool_resultmessages now flatten into plain user transcript text instead of relying only on late OpenAI payload repair. AgentLoopLlmRequestPreparationService.csclones the query window first, then applies normalization, so request cleanup does not mutate stored conversation history.AgentLoopContextReliability.csnow logs tool-trace repair counts inside thequery_contexttransition for run-by-run observability.
- Phase 5 partial delivery:
SessionLearningCollector.cswas rewritten with English-only comments and English injection text.AgentLoopDiagnosticsFormatter.csno longer emits mojibake-prone compaction status text in active Code paths.
- Phase 4 partial delivery:
- Remaining follow-up:
- measure whether
tool_trace_repaircounts keep trending down in long Code runs after this preflight normalization - continue replacing older mojibake strings outside the active Code execution path
- measure whether
Updated: 2026-04-16 02:05 (KST)
- Delivered in this pass:
- structural alignment step:
AgentLoopQueryAssemblyService.csnow owns the staged query/history assembly path.PrepareHistory(...)handles session-learning refresh plus queued-command/query-window preparation.PrepareRequest(...)handles Code working-set supplemental context and request-message assembly before dispatch.AgentLoopService.csnow delegates those responsibilities instead of manually stitching them together inline.
- test and encoding hygiene step:
AgentLoopQueryAssemblyServiceTests.cslocks the new staged assembly behavior.SessionLearningCollectorTests.cswas rewritten to English-only comments and assertions to match the new repository rule.
- structural alignment step:
- Remaining follow-up:
- keep extracting more inline AgentLoop responsibilities into smaller staged services where it improves observability or retry correctness
- continue measuring long Code runs against claw-code-style continuity scenarios
Updated: 2026-04-16 02:13 (KST)
- Delivered in this pass:
- structural alignment step:
AgentLoopPreLlmStageService.csnow owns the iteration decisions immediately before the LLM call.- the service centralizes:
- thinking-summary selection
- Gemini free-tier delay planning
- user-prompt submit hook fingerprint/payload planning
- missing-tool guard shaping
- request assembly handoff
AgentLoopService.csnow consumes that stage result instead of computing those branches inline.
- test coverage step:
AgentLoopPreLlmStageServiceTests.csnow locks the new pre-LLM decision layer.
- structural alignment step:
- Remaining follow-up:
- continue extracting the actual LLM dispatch / streaming callback branch into a narrower execution service
- compare long-running Code traces against claw-code-style staged transitions and keep reducing inline loop logic