fix mcp test generation stability and recovery#5
Conversation
Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
This PR improves the stability and recoverability of long-running MCP test generation by making answer file extensions configurable, adding checkpoint/resume + cancellation cleanup paths, and introducing semantic quality gates for generator type=3/type=4.
Changes:
- Add configurable
answer_extand propagate it through generation, verification, and Polygon packaging. - Harden
problem_generate_testswith checkpoints, resume support, hard timeouts, and cancellation-safe PID tracking. - Add semantic quality gates (type3/type4 overlap checks) and extend tests for recovery/cleanup edge cases.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/autocode_mcp/tools/problem.py |
Adds answer_ext, checkpoint/resume state handling, PID tracking, and a cleanup tool for generator processes. |
src/autocode_mcp/tools/test_verify.py |
Extends verification to support custom answer extensions and adds limit_semantics checks. |
src/autocode_mcp/tools/generator.py |
Adds a static semantic check for type=3/type=4 divergence (optional strict mode). |
src/autocode_mcp/utils/compiler.py |
Adds process_start_hook and ensures cancellation forces subprocess termination. |
src/autocode_mcp/utils/win_job.py |
Adds close() to release job handles without explicitly terminating first. |
src/autocode_mcp/server.py |
Returns a structured failure response when tool calls are cancelled. |
src/autocode_mcp/tools/... tests (tests/test_tools/test_problem.py, tests/test_tools/test_generator.py, tests/test_compiler.py) |
Adds coverage for custom answer_ext, resume fallback, cancellation cleanup, cleanup tool behavior, and strict semantic checks. |
Docs (README.md, skills/autocode-workflow/SKILL.md, agents/autocode-workflow.md, scripts/workflow_guard.py, src/autocode_mcp/prompts/__init__.py, CLAUDE.md) |
Updates workflow guidance and tool docs for answer_ext, resume/checkpoints, and semantic expectations. |
pyproject.toml, uv.lock |
Adds twine to dev dependency group and updates lockfile accordingly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Made-with: Cursor
Tighten answer_ext normalization and retry semantics, clear stale generated outputs before final writes, and make cleanup/semantic checks more resilient across platforms and generator branch styles. Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 18 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| remaining_pids = [pid for pid in pids if isinstance(pid, int) and pid not in killed] | ||
| self._write_cleanup_state(state_path, remaining_pids) | ||
| if not remaining_pids and os.path.exists(state_path): | ||
| os.remove(state_path) | ||
| removed_files.append(state_path) |
There was a problem hiding this comment.
ProblemCleanupProcessesTool writes the generation state file using _write_cleanup_state() which overwrites the entire JSON with only {active_pids: ...} and may even delete the file when no PIDs remain. This state file is also used by problem_generate_tests for checkpoint/resume (it stores next_seed, candidates, etc.), so overwriting/removing it can break resume=true. Consider loading the existing state JSON, updating only the active_pids field (and keeping other fields intact), and avoid deleting the checkpoint unless the user explicitly requests it.
| with open(state_path, "w", encoding="utf-8") as f: | ||
| json.dump({"active_pids": remaining_pids}, f, ensure_ascii=False, indent=2) |
There was a problem hiding this comment.
_write_cleanup_state() currently rewrites the state file to only include active_pids, which discards checkpoint metadata (phase, next_seed, candidates, errors, answer_ext, etc.) saved by problem_generate_tests. Update this helper to merge into the existing JSON state (or use a separate PID-tracking file) so cleanup doesn’t destroy resume checkpoints.
| with open(state_path, "w", encoding="utf-8") as f: | |
| json.dump({"active_pids": remaining_pids}, f, ensure_ascii=False, indent=2) | |
| state = self._load_cleanup_state(state_path) or {} | |
| state["active_pids"] = remaining_pids | |
| with open(state_path, "w", encoding="utf-8") as f: | |
| json.dump(state, f, ensure_ascii=False, indent=2) |
| @@ -873,7 +1243,7 @@ async def execute( | |||
| <memory-limit>{memory_limit_bytes}</memory-limit> | |||
| <test-count>{actual_test_count}</test-count> | |||
| <input-path-pattern>tests/%02d.in</input-path-pattern> | |||
| <answer-path-pattern>tests/%02d.ans</answer-path-pattern> | |||
| <answer-path-pattern>tests/%02d{answer_ext}</answer-path-pattern> | |||
There was a problem hiding this comment.
answer_ext is taken from the manifest and interpolated directly into problem.xml without normalization/escaping. If the manifest is edited/corrupted (e.g., contains path separators or </&), this can produce invalid XML or an unintended path pattern. Reuse the same answer_ext normalization rules used elsewhere and XML-escape the interpolated values before writing problem.xml.
| def close(self) -> None: | ||
| """关闭 Job Object 句柄(不主动终止进程)。""" | ||
| if self.job_handle is not None and self.job_handle != 0: | ||
| win32api.CloseHandle(self.job_handle) | ||
| self.job_handle = None |
There was a problem hiding this comment.
WinJobObject.close() claims it "does not actively terminate processes", but the Job is created with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, so closing the last handle will terminate all processes in the job anyway. Either (a) drop KILL_ON_JOB_CLOSE when you need a non-terminating close(), or (b) keep the flag but update the docstring/name/comments to reflect that close() can still kill processes via the job limit.
| elapsed_ms = int((time.time() - start_time) * 1000) | ||
|
|
||
| # 正常完成后清理 Job Object | ||
| # 正常完成后只关闭 Job Handle |
There was a problem hiding this comment.
The comment says "正常完成后只关闭 Job Handle", but with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE set on the job, job.close() may still terminate any remaining child processes in the job when the handle is closed. If the intent is to not kill processes on success, the job should be created without KILL_ON_JOB_CLOSE (or with a different strategy), otherwise please adjust the comment/behavior to match the actual semantics.
| # 正常完成后只关闭 Job Handle | |
| # 正常完成后关闭 Job Handle;若该 Job 配置了 KILL_ON_JOB_CLOSE, | |
| # 关闭时仍可能终止 Job 中尚未退出的子进程。 |
Preserve checkpoint metadata when cleaning tracked generator PIDs, add POSIX PID termination support, and sanitize/escape manifest-derived answer extensions in Polygon XML output. Made-with: Cursor
Summary
Test plan
uv run pytest tests/test_tools/test_problem.py tests/test_tools/test_generator.py tests/test_compiler.py -quv run ruff check src/autocode_mcp/tools/problem.py src/autocode_mcp/tools/test_verify.py src/autocode_mcp/tools/generator.py src/autocode_mcp/server.py src/autocode_mcp/utils/compiler.py src/autocode_mcp/utils/win_job.py scripts/workflow_guard.py tests/test_tools/test_problem.py tests/test_tools/test_generator.py tests/test_compiler.py