Evaluation code and paper assets for measuring no-thinking behavior, thinking inertia, and answer-space-dependent compression in LLMs.
Dianqiao Lei1, Kevin Qinghong Lin2, Pan Lu3, Philip Torr2, James Zou3
1 Tsinghua University ย ย 2 University of Oxford ย ย 3 Stanford University
News | Overview | Core Findings | Installation | Quick Start | Paper Map | Citation
- May 2026: The release repository has been cleaned for GitHub: large run outputs, logs, caches, and internal report-generation scripts are excluded.
- May 2026: The codebase now uses paper-facing script names that map directly to the experiment sections.
- Coming soon: Paper link and full release notes.
This repository accompanies the paper "LLMs Keep Thinking When Told Not To". We ask whether explicit no-thinking controls actually remove visible question-conditioned work from model responses, rather than merely hiding a special reasoning trace or shortening the output.
The evaluation decomposes each response into:
response y = [visible pre-answer text T ; final answer A]
and reports three complementary metrics:
| Metric | Meaning |
|---|---|
| Acc. | Task accuracy of the final answer A. |
| ETR | Empty-thinking ratio: how often the visible pre-answer text T is empty. |
| Sim | Semantic relevance between the question and visible pre-answer text T. |
No-Thinking Is Not a Switch. Under native think-off (M2), averaged across six models, answer-only compliance is strongest for Boolean verification, weakens for multiple choice, and collapses for open-ended generation.
Video: as the answer space opens, ETR drops toward zero while Sim rises, showing that visible question-conditioned payload remains in the answer channel.
Open-ended tasks keep question-conditioned work in the answer channel. The same no-thinking control can look effective when the task supplies a compact answer space, yet expose substantial pre-answer payload when the model must construct an open-ended response.
No-thinking is therefore evaluated as a response-level behavior: a model is closer to no-thinking when it preserves final-answer accuracy while exposing little or no question-conditioned payload before the answer.
- Think-on holds; think-off leaks. Native think-off controls compress Boolean verification most reliably, are inconsistent on multiple choice, and fail almost completely on open-ended tasks. This response-level residue is what we call thinking inertia.
- The strongest no-think instruction is not always safest. On open-ended tasks, stricter answer-only constraints can raise ETR by removing visible payload, but the same pressure can also remove computation needed for accuracy.
- Answer-space support governs compressibility. Rewriting the same math questions from MCQ into yes/no form raises ETR sharply with little accuracy loss; rewriting them into open-ended form drives ETR to zero and raises Sim, showing that no-thinking depends on the structure supplied by the question itself.
- Six-mode intervention spectrum: compare thinking-on, native no-think, reasoning re-elicitation, semantic suppression, strict answer-only output, and prefix forcing.
- Answer-space family coverage: boolean verification, multiple-choice answering, and open-ended generation.
- Paper-aligned analysis scripts: MMLU domain compressibility, surface perturbations, candidate-visible verification, and numeric answer-space rewrites.
- Posthoc scoring utilities: recompute full-response similarity and math scores without rerunning model inference.
- README result asset: a single animated core-results summary under
assets/readme/.
ThinkZero/
โโโ README.md
โโโ pyproject.toml
โโโ requirements.txt
โโโ assets/readme/ # README animated core-results summary
โโโ data/
โ โโโ processed/ # Normalized benchmark JSONL files
โ โโโ raw/ # Local cache placeholder; ignored except .gitkeep
โโโ docs/
โ โโโ index.html # GitHub Pages project page
โ โโโ assets/ # Project-page figures and animation
โ โโโ thinking_spectrum.md # Detailed six-mode workflow notes
โ โโโ nonreason_baseline.md # Earlier strict non-reasoning baseline notes
โโโ src/thinking_inertia/
โโโ download_datasets.py
โโโ eval_thinking_spectrum.py
โโโ eval_nonreason.py
โโโ score_full_response_similarity.py
โโโ rescore_math_records.py
โโโ mmlu_domain_compressibility.py
โโโ mmlu_surface_perturbations.py
โโโ mmlu_candidate_verification.py
โโโ mmlu_numeric_answer_space_rewrites.py
Generated artifacts should stay local:
data/runs/ # records.jsonl and summary.json from model runs
data/experiments/ # derived intervention datasets and analysis tables
data/raw/hf_cache/ # Hugging Face cache
logs/ # model server logs
cd ThinkZero
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .If you prefer a non-editable dependency install:
python -m pip install -r requirements.txtThe release copy includes normalized benchmark files under data/processed/.
To regenerate them from Hugging Face:
python -m thinking_inertia.download_datasets \
--root-dir data \
--overwriteSupported datasets:
| Answer-space family | Datasets |
|---|---|
| Boolean verification | boolq, strategyqa |
| Multiple choice | mmlu, mmlu_pro |
| Open-ended generation | gsm8k, math |
| Additional normalized set | commonsenseqa, gsm_symbolic |
For paper-style MATH subsets, keep the full data/processed/math.jsonl file and sample at evaluation time.
Start an OpenAI-compatible model server separately, then run the six-mode evaluation:
python -m thinking_inertia.eval_thinking_spectrum \
--root-dir data \
--datasets boolq strategyqa mmlu mmlu_pro gsm8k math \
--base-url http://127.0.0.1:8401/v1 \
--model qwen3-4b \
--api chat \
--modes 1 2 3 4 5 6 \
--include-reasoning \
--accept-display-math-wrapper \
--max-samples 200 \
--sample-strategy level_balanced \
--seed 20260502 \
--run-name qwen3_4b_spectrum_paper200Outputs are written to:
data/runs/qwen3_4b_spectrum_paper200/
โโโ records.jsonl
โโโ summary.json
eval_thinking_spectrum.py and score_full_response_similarity.py use MiniLM to compute semantic similarity.
By default, the code uses sentence-transformers/all-MiniLM-L6-v2.
For offline or pinned environments:
export MINILM_MODEL_PATH=/path/to/all-MiniLM-L6-v2Score visible responses against questions without rerunning model inference:
python -m thinking_inertia.score_full_response_similarity \
--input data/runs/qwen3_4b_spectrum_paper200If a run used fallback math normalization, rescore GSM8K/MATH records with math-verify:
python -m thinking_inertia.rescore_math_records \
--run-dir data/runs/qwen3_4b_spectrum_paper200 \
--accept-display-math-wrapper| Mode | Name | Paper role |
|---|---|---|
mode1 |
Thinking on | Thinking-enabled reference. |
mode2 |
Native no-think | Native think-off or minimal-thinking baseline. |
mode3 |
Step-by-step under no-think | Re-elicitation test. |
mode4 |
Short lead-in, no explanation | Soft semantic suppression. |
mode5 |
Strict answer only | Strong answer-only suppression. |
mode6 |
Prefix forcing | Structural "The answer is ..." branch. |
| Paper section | Script | Role |
|---|---|---|
| Settings / Benchmarks | download_datasets.py |
Download and normalize benchmark JSONL files. |
| Native Think-on vs. Think-off | eval_thinking_spectrum.py |
Run M1-M3 to test thinking inertia. |
| Performance across No-Think variants | eval_thinking_spectrum.py |
Run M2/M4/M5/M6 suppression and prefix-forcing variants. |
| Performance across disciplines | mmlu_domain_compressibility.py |
Analyze MMLU domain and subject-level compressibility. |
| Performance under answer-space rewrites | mmlu_numeric_answer_space_rewrites.py |
Build/analyze matched MCQ, yes/no, and open-ended numeric rewrites. |
| Appendix: Surface perturbations | mmlu_surface_perturbations.py |
Build/analyze choice-shuffle and numeric-label controls. |
| Appendix: Candidate visibility | mmlu_candidate_verification.py |
Build/analyze candidate-visible yes/no verification rewrites. |
| Script | Typical command | Description |
|---|---|---|
download_datasets.py |
thinking-download-datasets |
Regenerate normalized benchmark files under data/processed/. |
eval_thinking_spectrum.py |
thinking-eval-spectrum |
Main six-mode model evaluation. |
eval_nonreason.py |
thinking-eval-nonreason |
Earlier strict non-reasoning baseline and shared extraction utilities. |
score_full_response_similarity.py |
thinking-score-response |
Posthoc Q-to-response similarity scoring. |
rescore_math_records.py |
thinking-rescore-math |
Rebuild math scores and summaries with math-verify. |
mmlu_domain_compressibility.py |
python -m thinking_inertia.mmlu_domain_compressibility |
Domain/subject analysis for MMLU. |
mmlu_surface_perturbations.py |
python -m thinking_inertia.mmlu_surface_perturbations |
Choice-shuffle and numeric-label controls. |
mmlu_candidate_verification.py |
python -m thinking_inertia.mmlu_candidate_verification |
MCQ-to-yes/no candidate verification rewrites. |
mmlu_numeric_answer_space_rewrites.py |
python -m thinking_inertia.mmlu_numeric_answer_space_rewrites |
Matched numeric MCQ, boolean, and open-ended rewrites. |
Each evaluation run writes:
| File | Contents |
|---|---|
records.jsonl |
Per-example model response, extracted T, extracted answer A, accuracy, ETR fields, and similarity values. |
summary.json |
Aggregates by mode, dataset, and mode-dataset pair. |
Important record fields include:
mode, mode_name
content, reasoning, reasoning_content
extracted_t, t_source
t_word_count, t_char_count
answer_raw, prediction, gold_answer
thinking_rate, max_thinking_rate
nonreason_pass, nonreason_correct
is_correct
- docs/thinking_spectrum.md: detailed six-mode workflow.
- docs/nonreason_baseline.md: earlier baseline workflow.
- data/README.md: local data directory policy.
- Project page: static GitHub Pages site served from
docs/.
This repository intentionally excludes:
- raw model outputs from full experiments,
- local model logs,
- Hugging Face caches,
- recovered or internal scratch artifacts,
- HTML report generators used only for collaborator discussion.
The project page is included under docs/ and is served by GitHub Pages at https://leidq.github.io/ThinkZero/.
If you use this repository, please cite the paper:
@misc{lei2026llmskeepthinking,
title = {LLMs Keep Thinking When Told Not To},
author = {Dianqiao Lei and Kevin Qinghong Lin and Pan Lu and Philip Torr and James Zou},
year = {2026},
note = {Preprint}
}The final arXiv/project-page links will be added after release.
