Files

lacvet 0f64bf3f84 Code 탭 컨텍스트 누적 신뢰성과 작업 연속성을 전면 보강한다

이번 커밋은 Code 탭 장기 실행에서 build/file 근거가 너무 빨리 축약되고, 이전 수정 맥락이 다음 LLM 요청에 안정적으로 누적되지 않던 문제를 해결하기 위한 전면 보강을 담는다.

핵심 수정사항:
- CodeTaskWorkingSetService를 추가해 최근 생성 디렉터리, 최근 읽기/쓰기 파일, 최신 build/test 진단, 다음 복구 초점을 구조화된 working set으로 유지하고 각 반복 요청에 보조 system context로 주입한다.
- AgentQueryContextBuilder와 AgentToolResultBudget에 code profile을 도입해 protected recent window와 tool_result budget을 확장하고 build_run, test_loop, file_read, multi_read, lsp_code_intel, git_tool 같은 고가치 evidence가 기본 탭보다 덜 잘리도록 조정한다.
- AgentLoopIterationPreparationService와 AgentLoopLlmRequestPreparationService를 확장해 query-context options와 supplemental messages를 함께 전달하고, AgentLoopService에서는 Code 탭에서 generic session learnings 대신 working set 중심으로 요청을 구성하도록 변경한다.
- ChatWindow.UtilityPresentation에서 workspace context 첫 부트스트랩을 강화해 .ax-context.md가 아직 없더라도 첫 요청 시점부터 background generation과 language workflow bootstrap hints가 반영되도록 수정한다.
- LlmService.ToolUse에서 historical tool trace sanitization 결과를 assistant flatten/orphan conversion 건수로 요약 로그에 남겨 tool-trace 불변식 문제를 추적 가능하게 만든다.
- 관련 테스트를 추가·갱신해 working set 누적, code profile budget, supplemental message 주입, query-context option 전달을 회귀 고정한다.

검증 결과:
- dotnet build src/AxCopilot/AxCopilot.csproj -c Release -v minimal -p:OutputPath=bin\\verify_context_reliability_full\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_full\\ : 경고 0 / 오류 0
- dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AgentQueryContextBuilderTests|AgentToolResultBudgetTests|AgentLoopIterationPreparationServiceTests|AgentLoopLlmRequestPreparationServiceTests|CodeTaskWorkingSetServiceTests|AgentLoopCodeQualityTests" -p:OutputPath=bin\\verify_context_reliability_full_tests\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_full_tests\\ : 통과 150
- dotnet test src/AxCopilot.Tests/AxCopilot.Tests.csproj -c Release -v minimal --filter "AgentLoopE2ETests|AgentMessageInvariantHelperTests" -p:OutputPath=bin\\verify_context_reliability_e2e\\ -p:IntermediateOutputPath=obj\\verify_context_reliability_e2e\\ : 통과 21

2026-04-16 01:45:28 +09:00

12 KiB

Raw Blame History

Code Context Reliability Plan

Update: 2026-04-16 01:37 (KST)

Background

Recent Code tab runs show that the LLM request payload is still growing over time. In the 2026-04-16 00:46:26 to 00:50:52 run, the request size grew from messages=7 to messages=125. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.

The same log window repeatedly shows:

tool_calls/tool mismatch detected - flattening assistant message
orphan tool message detected - converting to user
repeated rereads of nearby files after build failures
shifting build failures such as MC3089 followed by CS0017 without a stable working set that preserves what was already changed and what remains broken

In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.

Current Findings

1. Workspace context bootstrap is weak on first load

AX targets:
- src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
- src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
Finding:
- When .ax-context.md is missing, the first Code request can return before background workspace-context generation becomes useful.
Impact:
- Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.

2. Build and file evidence is compacted too aggressively

AX targets:
- src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
- src/AxCopilot/Services/Agent/ContextCondenser.cs
Current values:
- DefaultSoftCharLimit = 900
- DefaultAggregateBudgetChars = 7_500
- RecentKeepCount = 6
Impact:
- Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.

3. Session learning is not a durable code working set

AX targets:
- src/AxCopilot/Services/Agent/SessionLearningCollector.cs
- src/AxCopilot/Services/Agent/AgentLoopService.cs
- src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
Finding:
- Session learnings are injected every loop, but they are not structured strongly enough to lock in:
  - current goal
  - current architecture
  - changed files
  - latest build or test failure
  - next repair target
Impact:
- The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.

4. Tool-trace invariant repairs are too common

AX targets:
- src/AxCopilot/Services/LlmService.ToolUse.cs
- src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
- src/AxCopilot/Services/Agent/AgentLoopService.cs
Finding:
- The recent logs show repeated mismatch and orphan corrections.
Impact:
- Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.

5. There is no Code-specific working-set layer

AX targets:
- new service required
- injection path should go through:
  - AgentLoopLlmRequestPreparationService
  - AgentQueryContextBuilder
  - AgentLoopService
Finding:
- The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
Impact:
- Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory.

External Research Notes

Anthropic Claude Code memory docs

Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via /memory.
Planning implication:
- AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
Source:
- Anthropic Claude Code memory docs

OpenAI practical guide to building agents

The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
Planning implication:
- AX should log the exact context sections that enter each Code request, including what was compacted and why.
Source:
- OpenAI practical guide to building agents

SWE-Pruner

The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
Planning implication:
- AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
Source:
- SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

`claude-code` Reference Points

Reference targets:

claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts
claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts
claw-code/en/concepts/memory-context.md

Observed direction:

claude-code builds a dedicated messagesForQuery window.
It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
It treats memory and post-compaction query windows as first-class parts of the request path.

AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.

Remediation Plan

Phase 1. Context observability and bootstrap repair

Reference targets:
- claw-code/.../src/query.ts
- claw-code/en/concepts/memory-context.md
AX targets:
- src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
- src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
- src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
- src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs
Work items:
- guarantee workspace-context generation starts even on first miss
- log the exact context sections injected into each request
- add diagnostics for omitted sections
Done criteria:
- empty-workspace runs show workspace context generation by loop 2
- logs show section names, sizes, and compaction status
Quality scenario:
- a fresh E:\code WPF scaffolding run should show folder and project context in the first two request cycles

Phase 2. Code working-set memory layer

Reference targets:
- claw-code/.../src/query.ts
- claw-code/.../src/history.ts
- Anthropic memory docs
AX targets:
- new CodeTaskWorkingSetService
- src/AxCopilot/Services/Agent/SessionLearningCollector.cs
- src/AxCopilot/Services/Agent/AgentLoopService.cs
- src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
Work items:
- maintain a stable structured ledger with:
  - current goal
  - selected architecture
  - changed files
  - latest successful writes
  - open diagnostics
  - next repair target
- inject it only when changed
- replace superseded failures with the latest active issue
Done criteria:
- long Code runs keep a single coherent working-set block without noisy duplication
- build and test failures are preserved as part of the working set
Quality scenario:
- after fixing MC3089, the run should still remember the earlier structure change while focusing on the new CS0017 entry-point failure

Phase 3. Task-aware pruning and protected evidence

Reference targets:
- claw-code/.../src/query.ts
- SWE-Pruner
AX targets:
- src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
- src/AxCopilot/Services/Agent/ContextCondenser.cs
- src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
Work items:
- protect:
  - latest build error block
  - latest test failure block
  - current plan or working set
  - latest folder tree snapshot
  - last N write diffs
- move from pure char-based truncation toward semantic snapshots
- tune compaction rules specifically for Code tasks
Done criteria:
- active repair evidence survives across loops until superseded
- older noise shrinks without losing the current failure context
Quality scenario:
- a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload

Phase 4. Tool-trace invariant hardening

Reference targets:
- claw-code/.../src/query.ts
- claw-code/.../src/history.ts
AX targets:
- src/AxCopilot/Services/LlmService.ToolUse.cs
- src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
- src/AxCopilot/Services/Agent/AgentLoopService.cs
Work items:
- shift from after-the-fact flattening to pre-request validation and normalization
- classify mismatch and orphan causes and lock them with regression tests
- add a final integrity pass before query submission
Done criteria:
- standard Code runs approach zero mismatch or orphan repair logs
- assistant, tool, and tool_result chains remain intact end to end
Quality scenario:
- a 50-loop Code run should complete without repeated tool-trace repair events

Phase 5. Encoding hygiene and prompt cleanup

Reference targets:
- Anthropic memory docs
- OpenAI practical guide eval and observability recommendations
AX targets:
- src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
- src/AxCopilot/Services/Agent/SessionLearningCollector.cs
- active status, prompt, and catalog files
- AGENTS.md
Work items:
- enforce English-only comments in code files
- rewrite mojibake strings in active prompt paths into English
- add long-run Code evals to catch prompt and status encoding regressions
Done criteria:
- no broken strings remain in active prompt or status paths
- touched code files keep English comments only
Quality scenario:
- Windows Korean environments should show readable build, test, and status output without mojibake feedback loops

Priority

Phase 1: bootstrap and observability
Phase 2: working-set memory
Phase 3: task-aware pruning
Phase 4: tool-trace invariants
Phase 5: encoding and prompt cleanup

Expected Outcome

fewer repeated build-failure loops
better structural consistency for project generation and large edits
less drift in long-running Code tasks
fewer quality losses caused by broken strings and low-signal context replacements

Latest Delivery

Updated: 2026-04-16 01:41 (KST)

Delivered in this pass:
- Phase 1 foundation:
  - ChatWindow.UtilityPresentation.cs now bootstraps workspace context generation on first access and returns language-workflow fallback hints while .ax-context.md is still being generated.
  - AgentLoopService.cs now records query_context workflow transitions with query-window, budget, supplemental-context, and working-set summaries.
- Phase 2 foundation:
  - CodeTaskWorkingSetService.cs adds a Code-only structured ledger for:
    - goal
    - selected scaffold/profile
    - created directories
    - recent reads/writes
    - latest diagnostics
    - next repair focus
  - the working set is injected into each Code request as a supplemental code_working_set system message.
- Phase 3 foundation:
  - AgentToolResultBudget.cs and AgentQueryContextBuilder.cs now expose a code query profile with a larger protected-recent window and larger retained budgets for build_run, test_loop, process, file_read, multi_read, lsp_code_intel, and git_tool.
- Phase 4 observability step:
  - LlmService.ToolUse.cs now logs sanitization counts for flattened assistant tool traces and converted orphan tool messages, so tool-trace repair frequency can be measured per run.
Remaining follow-up:
- extend pre-request tool-trace validation so the flattening/orphan repair count trends toward zero rather than being logged after repair
- replace more mojibake prompt/status strings in active Code execution paths with English equivalents

12 KiB Raw Blame History