🧠 LLMs Keep Thinking When Told Not To

Evaluation code and paper assets for measuring no-thinking behavior, thinking inertia, and answer-space-dependent compression in LLMs.

Dianqiao Lei¹, Kevin Qinghong Lin², Pan Lu³, Philip Torr², James Zou³

¹ Tsinghua University ² University of Oxford ³ Stanford University

🗞️ Latest News

May 2026: The release repository has been cleaned for GitHub: large run outputs, logs, caches, and internal report-generation scripts are excluded.
May 2026: The codebase now uses paper-facing script names that map directly to the experiment sections.
Coming soon: Paper link and full release notes.

🔍 Overview

This repository accompanies the paper "LLMs Keep Thinking When Told Not To". We ask whether explicit no-thinking controls actually remove visible question-conditioned work from model responses, rather than merely hiding a special reasoning trace or shortening the output.

The evaluation decomposes each response into:

response y = [visible pre-answer text T ; final answer A]

and reports three complementary metrics:

Metric	Meaning
Acc.	Task accuracy of the final answer `A`.
ETR	Empty-thinking ratio: how often the visible pre-answer text `T` is empty.
Sim	Semantic relevance between the question and visible pre-answer text `T`.

No-Thinking Is Not a Switch. Under the native think-off control (mode 2, abbreviated M2), averaged across six models, answer-only compliance is strongest for Boolean verification, weakens for multiple choice, and collapses for open-ended generation.

Video: as the answer space opens, ETR drops toward zero while Sim rises, showing that visible question-conditioned payload remains in the answer channel.

Open-ended tasks keep question-conditioned work in the answer channel. The same no-thinking control can look effective when the task supplies a compact answer space, yet expose substantial pre-answer payload when the model must construct an open-ended response.

No-thinking is therefore evaluated as a response-level behavior: a model is closer to no-thinking when it preserves final-answer accuracy while exposing little or no question-conditioned payload before the answer.
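The decomposition and the ETR definition above can be illustrated with a minimal sketch. It assumes a hypothetical answer marker (`The answer is`) and hypothetical helper names (`split_response`, `empty_thinking_ratio`); the extraction rules actually used in `eval_thinking_spectrum.py` may differ.

```python
from typing import List, Tuple

# Hypothetical answer marker, used only for this illustration.
ANSWER_MARKER = "The answer is"


def split_response(response: str, marker: str = ANSWER_MARKER) -> Tuple[str, str]:
    """Split a response y into (T, A) at the last occurrence of the marker."""
    idx = response.rfind(marker)
    if idx == -1:
        # Simplification: with no marker, treat the whole response as A and T as empty.
        return "", response.strip()
    pre_answer = response[:idx].strip()
    answer = response[idx + len(marker):].strip().rstrip(".").strip()
    return pre_answer, answer


def empty_thinking_ratio(pre_answer_texts: List[str]) -> float:
    """ETR: fraction of responses whose visible pre-answer text T is empty."""
    if not pre_answer_texts:
        return 0.0
    empty = sum(1 for t in pre_answer_texts if not t.strip())
    return empty / len(pre_answer_texts)


responses = [
    "The answer is yes.",
    "First, 12 * 3 = 36, then add 4. The answer is 40.",
]
pairs = [split_response(r) for r in responses]
etr = empty_thinking_ratio([t for t, _ in pairs])
print(pairs, etr)
```

In this toy example the first response has an empty `T` and the second does not, so the ETR over the two responses is 0.5.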

🏆 Core Findings

Think-on holds; think-off leaks. Native think-off controls compress Boolean verification most reliably, are inconsistent on multiple choice, and fail almost completely on open-ended tasks. This response-level residue is what we call thinking inertia.
The strongest no-think instruction is not always safest. On open-ended tasks, stricter answer-only constraints can raise ETR by removing visible payload, but the same pressure can also remove computation needed for accuracy.
Answer-space support governs compressibility. Rewriting the same math questions from MCQ into yes/no form raises ETR sharply with little accuracy loss; rewriting them into open-ended form drives ETR to zero and raises Sim, showing that no-thinking depends on the structure supplied by the question itself.
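To make the answer-space rewrite idea concrete, here is a small sketch of how one multiple-choice item could be rendered in three matched forms. The `MCQ` dataclass and the prompt templates are assumptions made for illustration; they are not the templates used by `mmlu_numeric_answer_space_rewrites.py`.

```python
import string
from dataclasses import dataclass
from typing import List


@dataclass
class MCQ:
    question: str
    choices: List[str]
    gold_index: int


def to_mcq_prompt(item: MCQ) -> str:
    """Closed answer space: the model picks one labeled option."""
    lines = [item.question]
    lines += [f"{string.ascii_uppercase[i]}. {c}" for i, c in enumerate(item.choices)]
    return "\n".join(lines)


def to_yes_no_prompt(item: MCQ, candidate_index: int) -> str:
    """Binary answer space: the model verifies one visible candidate."""
    candidate = item.choices[candidate_index]
    return (
        f"{item.question}\n"
        f"Proposed answer: {candidate}\n"
        "Is the proposed answer correct? Answer yes or no."
    )


def yes_no_gold(item: MCQ, candidate_index: int) -> str:
    """Gold label for the verification form."""
    return "yes" if candidate_index == item.gold_index else "no"


def to_open_prompt(item: MCQ) -> str:
    """Open answer space: no candidates are shown; gold is the correct choice text."""
    return item.question


item = MCQ("What is 7 * 8?", ["54", "56", "58", "64"], gold_index=1)
print(to_mcq_prompt(item))
print(to_yes_no_prompt(item, 1), "->", yes_no_gold(item, 1))
print(to_open_prompt(item), "->", item.choices[item.gold_index])
```

All three prompts share the same underlying question and gold answer, so differences in ETR and Sim can be attributed to the answer space rather than to the content.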

✨ What Is Included

Six-mode intervention spectrum: compare thinking-on, native no-think, reasoning re-elicitation, semantic suppression, strict answer-only output, and prefix forcing.
Answer-space family coverage: boolean verification, multiple-choice answering, and open-ended generation.
Paper-aligned analysis scripts: MMLU domain compressibility, surface perturbations, candidate-visible verification, and numeric answer-space rewrites.
Posthoc scoring utilities: recompute full-response similarity and math scores without rerunning model inference.
README result asset: a single animated core-results summary under assets/readme/.

🗂️ Repository Layout

ThinkZero/
├── README.md
├── pyproject.toml
├── requirements.txt
├── assets/readme/                 # README animated core-results summary
├── data/
│   ├── processed/                 # Normalized benchmark JSONL files
│   └── raw/                       # Local cache placeholder; ignored except .gitkeep
├── docs/
│   ├── index.html                 # GitHub Pages project page
│   ├── assets/                    # Project-page figures and animation
│   ├── thinking_spectrum.md       # Detailed six-mode workflow notes
│   └── nonreason_baseline.md      # Earlier strict non-reasoning baseline notes
└── src/thinking_inertia/
    ├── download_datasets.py
    ├── eval_thinking_spectrum.py
    ├── eval_nonreason.py
    ├── score_full_response_similarity.py
    ├── rescore_math_records.py
    ├── mmlu_domain_compressibility.py
    ├── mmlu_surface_perturbations.py
    ├── mmlu_candidate_verification.py
    └── mmlu_numeric_answer_space_rewrites.py

Generated artifacts should stay local:

data/runs/          # records.jsonl and summary.json from model runs
data/experiments/   # derived intervention datasets and analysis tables
data/raw/hf_cache/  # Hugging Face cache
logs/               # model server logs

⚙️ Installation

cd ThinkZero

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

If you prefer a non-editable dependency install:

python -m pip install -r requirements.txt

📊 Data

The release copy includes normalized benchmark files under data/processed/. To regenerate them from Hugging Face:

python -m thinking_inertia.download_datasets \
  --root-dir data \
  --overwrite

Supported datasets:

Answer-space family	Datasets
Boolean verification	`boolq`, `strategyqa`
Multiple choice	`mmlu`, `mmlu_pro`
Open-ended generation	`gsm8k`, `math`
Additional normalized set	`commonsenseqa`, `gsm_symbolic`

For paper-style MATH subsets, keep the full data/processed/math.jsonl file and sample at evaluation time.
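As a rough illustration of what sampling at evaluation time could look like, the sketch below draws a seeded, level-balanced subset from a list of records. The function name, the `level` key, and the round-robin policy are assumptions for this example; the behavior of the real `--sample-strategy level_balanced` option may differ.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def level_balanced_sample(
    records: List[Dict[str, Any]],
    max_samples: int,
    seed: int,
    level_key: str = "level",  # hypothetical field name
) -> List[Dict[str, Any]]:
    """Draw up to max_samples records, spreading them evenly across levels."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec.get(level_key)].append(rec)

    # Shuffle within each level in a fixed level order so the seed is reproducible.
    levels = sorted(by_level, key=str)
    for lv in levels:
        rng.shuffle(by_level[lv])

    # Round-robin over levels until the budget is met or the data runs out.
    sample: List[Dict[str, Any]] = []
    depth = 0
    while len(sample) < max_samples:
        added = False
        for lv in levels:
            if depth < len(by_level[lv]):
                sample.append(by_level[lv][depth])
                added = True
                if len(sample) == max_samples:
                    break
        if not added:
            break
        depth += 1
    return sample


toy = [{"id": i, "level": 1 + i % 3} for i in range(30)]
subset = level_balanced_sample(toy, max_samples=6, seed=20260502)
print(sorted(r["level"] for r in subset))
```

Keeping the full file and sampling with a fixed seed means the same subset can be rebuilt later without storing a separate subset file.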

🚀 Quick Start

Start an OpenAI-compatible model server separately, then run the six-mode evaluation:

python -m thinking_inertia.eval_thinking_spectrum \
  --root-dir data \
  --datasets boolq strategyqa mmlu mmlu_pro gsm8k math \
  --base-url http://127.0.0.1:8401/v1 \
  --model qwen3-4b \
  --api chat \
  --modes 1 2 3 4 5 6 \
  --include-reasoning \
  --accept-display-math-wrapper \
  --max-samples 200 \
  --sample-strategy level_balanced \
  --seed 20260502 \
  --run-name qwen3_4b_spectrum_paper200

Outputs are written to:

data/runs/qwen3_4b_spectrum_paper200/
├── records.jsonl
└── summary.json
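Because the evaluation expects a model server that is already running, it can help to confirm that the endpoint responds before launching a long run. The sketch below assumes the server exposes the `/models` listing that OpenAI-compatible servers commonly provide; the helper names are hypothetical and not part of this repository.

```python
import json
import urllib.request
from typing import List


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a base URL such as http://127.0.0.1:8401/v1."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, timeout: float = 5.0) -> List[str]:
    """Return the model ids reported by the server (assumes an OpenAI-style payload)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


# Example (requires a running server):
# print(list_served_models("http://127.0.0.1:8401/v1"))
```

If the returned list does not contain the name passed to `--model`, the run is likely to fail at the first request.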

🔎 MiniLM Similarity Model

eval_thinking_spectrum.py and score_full_response_similarity.py use MiniLM to compute semantic similarity. By default, the code uses sentence-transformers/all-MiniLM-L6-v2. For offline or pinned environments:

export MINILM_MODEL_PATH=/path/to/all-MiniLM-L6-v2
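For orientation, the following sketch shows one way a question-to-text similarity could be computed with this model. It assumes cosine similarity over sentence embeddings and an environment-variable fallback; the function names are hypothetical and the repository's scoring code may be implemented differently.

```python
import math
import os
from typing import Sequence

DEFAULT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def resolve_minilm_model() -> str:
    """Prefer a local path from MINILM_MODEL_PATH, else fall back to the default name."""
    return os.environ.get("MINILM_MODEL_PATH") or DEFAULT_MODEL


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def question_text_similarity(question: str, text: str) -> float:
    """Embed both strings with MiniLM and compare them (assumed cosine metric)."""
    # Imported lazily so the helpers above work without the optional dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(resolve_minilm_model())
    q_vec, t_vec = model.encode([question, text])
    return cosine_similarity(q_vec.tolist(), t_vec.tolist())
```

Calling `question_text_similarity` loads the model on every call, which is acceptable for a quick check but would be cached in any batch scorer.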

🧪 Posthoc Full-Response Similarity

Score visible responses against questions without rerunning model inference:

python -m thinking_inertia.score_full_response_similarity \
  --input data/runs/qwen3_4b_spectrum_paper200

🧮 Math Rescoring

If a run used fallback math normalization, rescore GSM8K/MATH records with math-verify:

python -m thinking_inertia.rescore_math_records \
  --run-dir data/runs/qwen3_4b_spectrum_paper200 \
  --accept-display-math-wrapper

🧭 Six Evaluation Modes

Mode	Name	Paper role
`mode1`	Thinking on	Thinking-enabled reference.
`mode2`	Native no-think	Native think-off or minimal-thinking baseline.
`mode3`	Step-by-step under no-think	Re-elicitation test.
`mode4`	Short lead-in, no explanation	Soft semantic suppression.
`mode5`	Strict answer only	Strong answer-only suppression.
`mode6`	Prefix forcing	Structural "The answer is ..." branch.

📝 Paper Section Map

Paper section	Script	Role
Settings / Benchmarks	`download_datasets.py`	Download and normalize benchmark JSONL files.
Native Think-on vs. Think-off	`eval_thinking_spectrum.py`	Run M1-M3 to test thinking inertia.
Performance across No-Think variants	`eval_thinking_spectrum.py`	Run M2/M4/M5/M6 suppression and prefix-forcing variants.
Performance across disciplines	`mmlu_domain_compressibility.py`	Analyze MMLU domain and subject-level compressibility.
Performance under answer-space rewrites	`mmlu_numeric_answer_space_rewrites.py`	Build/analyze matched MCQ, yes/no, and open-ended numeric rewrites.
Appendix: Surface perturbations	`mmlu_surface_perturbations.py`	Build/analyze choice-shuffle and numeric-label controls.
Appendix: Candidate visibility	`mmlu_candidate_verification.py`	Build/analyze candidate-visible yes/no verification rewrites.
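The surface-perturbation controls in the table can be pictured with a short sketch: shuffle the choice order while tracking the gold option, and optionally swap letter labels for numeric ones. The function names and label formats here are assumptions for illustration, not the implementation in `mmlu_surface_perturbations.py`.

```python
import random
from typing import List, Tuple


def shuffle_choices(choices: List[str], gold_index: int, seed: int) -> Tuple[List[str], int]:
    """Return the shuffled choices and the new position of the gold choice."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(gold_index)


def format_choices(choices: List[str], numeric_labels: bool = False) -> str:
    """Render choices with A/B/C... labels, or 1/2/3... when numeric_labels is set."""
    if numeric_labels:
        labels = [str(i + 1) for i in range(len(choices))]
    else:
        labels = [chr(ord("A") + i) for i in range(len(choices))]
    return "\n".join(f"{label}. {choice}" for label, choice in zip(labels, choices))


choices = ["54", "56", "58", "64"]
shuffled, new_gold = shuffle_choices(choices, gold_index=1, seed=0)
print(format_choices(shuffled, numeric_labels=True))
print("gold position:", new_gold)
```

Both perturbations leave the question content and the set of candidates unchanged, which is what makes them useful as controls.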

🛠️ Script Reference

Script	Typical command	Description
`download_datasets.py`	`thinking-download-datasets`	Regenerate normalized benchmark files under `data/processed/`.
`eval_thinking_spectrum.py`	`thinking-eval-spectrum`	Main six-mode model evaluation.
`eval_nonreason.py`	`thinking-eval-nonreason`	Earlier strict non-reasoning baseline and shared extraction utilities.
`score_full_response_similarity.py`	`thinking-score-response`	Posthoc Q-to-response similarity scoring.
`rescore_math_records.py`	`thinking-rescore-math`	Rebuild math scores and summaries with `math-verify`.
`mmlu_domain_compressibility.py`	`python -m thinking_inertia.mmlu_domain_compressibility`	Domain/subject analysis for MMLU.
`mmlu_surface_perturbations.py`	`python -m thinking_inertia.mmlu_surface_perturbations`	Choice-shuffle and numeric-label controls.
`mmlu_candidate_verification.py`	`python -m thinking_inertia.mmlu_candidate_verification`	MCQ-to-yes/no candidate verification rewrites.
`mmlu_numeric_answer_space_rewrites.py`	`python -m thinking_inertia.mmlu_numeric_answer_space_rewrites`	Matched numeric MCQ, boolean, and open-ended rewrites.

📁 Output Format

Each evaluation run writes:

File	Contents
`records.jsonl`	Per-example model response, extracted `T`, extracted answer `A`, accuracy, ETR fields, and similarity values.
`summary.json`	Aggregates by mode, dataset, and mode-dataset pair.

Important record fields include:

mode, mode_name
content, reasoning, reasoning_content
extracted_t, t_source
t_word_count, t_char_count
answer_raw, prediction, gold_answer
thinking_rate, max_thinking_rate
nonreason_pass, nonreason_correct
is_correct
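A quick way to inspect a run is to load `records.jsonl` and aggregate a few of the fields listed above. This is a minimal sketch that assumes `is_correct` is boolean-like and `t_word_count` is numeric; the helper names are hypothetical, and `summary.json` remains the authoritative aggregate.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List


def load_records(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_by_mode(records: List[Dict[str, Any]]) -> Dict[Any, Dict[str, float]]:
    """Per-mode count, accuracy, and mean word count of the extracted T."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["mode"]].append(rec)

    summary = {}
    for mode, recs in sorted(buckets.items(), key=lambda kv: str(kv[0])):
        n = len(recs)
        summary[mode] = {
            "n": n,
            "acc": sum(bool(r.get("is_correct")) for r in recs) / n,
            "mean_t_words": sum(r.get("t_word_count", 0) for r in recs) / n,
        }
    return summary


toy_records = [
    {"mode": "mode1", "is_correct": True, "t_word_count": 12},
    {"mode": "mode1", "is_correct": False, "t_word_count": 0},
    {"mode": "mode2", "is_correct": True, "t_word_count": 0},
]
print(summarize_by_mode(toy_records))

# With a real run:
# records = load_records("data/runs/qwen3_4b_spectrum_paper200/records.jsonl")
```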

📚 Documentation

docs/thinking_spectrum.md: detailed six-mode workflow.
docs/nonreason_baseline.md: earlier baseline workflow.
data/README.md: local data directory policy.
Project page: static GitHub Pages site served from docs/.

📣 Release Notes

This repository intentionally excludes:

raw model outputs from full experiments,
local model logs,
Hugging Face caches,
recovered or internal scratch artifacts,
HTML report generators used only for collaborator discussion.

The project page is included under docs/ and is served by GitHub Pages at https://leidq.github.io/ThinkZero/.

📌 Citation

If you use this repository, please cite the paper:

@misc{lei2026llmskeepthinking,
  title  = {LLMs Keep Thinking When Told Not To},
  author = {Dianqiao Lei and Kevin Qinghong Lin and Pan Lu and Philip Torr and James Zou},
  year   = {2026},
  note   = {Preprint}
}

The arXiv link will be added to this entry after release; the project page is already available at the URL given in the Release Notes.
