LeiDQ/ThinkZero


🧠 LLMs Keep Thinking When Told Not To

Evaluation code and paper assets for measuring no-thinking behavior, thinking inertia, and answer-space-dependent compression in LLMs.

Dianqiao Lei¹, Kevin Qinghong Lin², Pan Lu³, Philip Torr², James Zou³

¹ Tsinghua University   ² University of Oxford   ³ Stanford University

News | Overview | Core Findings | Installation | Quick Start | Paper Map | Citation

🗞️ Latest News

  • May 2026: The release repository has been cleaned for GitHub; large run outputs, logs, caches, and internal report-generation scripts are excluded.
  • May 2026: The codebase now uses paper-facing script names that map directly to the experiment sections.
  • Coming soon: Paper link and full release notes.

Overview

This repository accompanies the paper "LLMs Keep Thinking When Told Not To". We ask whether explicit no-thinking controls actually remove visible question-conditioned work from model responses, rather than merely hiding a special reasoning trace or shortening the output.

The evaluation decomposes each response into:

response y = [visible pre-answer text T ; final answer A]

and reports three complementary metrics:

| Metric | Meaning |
| --- | --- |
| Acc. | Task accuracy of the final answer A. |
| ETR | Empty-thinking ratio: how often the visible pre-answer text T is empty. |
| Sim | Semantic relevance between the question and the visible pre-answer text T. |
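
The decomposition and the first two metrics can be sketched as follows. This is a minimal illustration, not the repository's extraction logic: the `split_response` helper, the `ANSWER_MARKER` string, and the rule that T is empty when nothing precedes the marker are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical marker separating pre-answer text T from the final answer A.
ANSWER_MARKER = "The answer is"


@dataclass
class Decomposed:
    pre_answer: str  # visible pre-answer text T
    answer: str      # final answer A


def split_response(response: str) -> Decomposed:
    """Split y into [T ; A] at the last occurrence of the marker."""
    idx = response.rfind(ANSWER_MARKER)
    if idx == -1:
        # No marker: treat the whole response as the answer, with empty T.
        return Decomposed(pre_answer="", answer=response.strip())
    t = response[:idx].strip()
    a = response[idx + len(ANSWER_MARKER):].strip().rstrip(".")
    return Decomposed(pre_answer=t, answer=a)


def empty_thinking_ratio(responses: list[str]) -> float:
    """ETR: fraction of responses whose pre-answer text T is empty."""
    parts = [split_response(r) for r in responses]
    return sum(p.pre_answer == "" for p in parts) / len(parts)


def accuracy(responses: list[str], gold: list[str]) -> float:
    """Acc.: fraction of final answers A that match the gold answer."""
    parts = [split_response(r) for r in responses]
    hits = sum(p.answer.lower() == g.lower() for p, g in zip(parts, gold))
    return hits / len(gold)
```

Under these assumptions, a response such as "Let me check the passage first. The answer is no." can be correct and still count against ETR, which is why the two metrics are reported side by side.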

No-Thinking Is Not a Switch. Under native think-off (M2), averaged across six models, answer-only compliance is strongest for Boolean verification, weakens for multiple choice, and collapses for open-ended generation.

Video Overview

Video: as the answer space opens, ETR drops toward zero while Sim rises, showing that visible question-conditioned payload remains in the answer channel.

Open-ended tasks keep question-conditioned work in the answer channel. The same no-thinking control can look effective when the task supplies a compact answer space, yet expose substantial pre-answer payload when the model must construct an open-ended response.

No-thinking is therefore evaluated as a response-level behavior: a model is closer to no-thinking when it preserves final-answer accuracy while exposing little or no question-conditioned payload before the answer.

๐Ÿ† Core Findings

  • Think-on holds; think-off leaks. Native think-off controls compress Boolean verification most reliably, are inconsistent on multiple choice, and fail almost completely on open-ended tasks. This response-level residue is what we call thinking inertia.
  • The strongest no-think instruction is not always safest. On open-ended tasks, stricter answer-only constraints can raise ETR by removing visible payload, but the same pressure can also remove computation needed for accuracy.
  • Answer-space support governs compressibility. Rewriting the same math questions from MCQ into yes/no form raises ETR sharply with little accuracy loss; rewriting them into open-ended form drives ETR to zero and raises Sim, showing that no-thinking depends on the structure supplied by the question itself.
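
The answer-space rewrites in the last finding can be illustrated with a small sketch. The `MCQ` record type and the prompt templates below are hypothetical and are not taken from mmlu_numeric_answer_space_rewrites.py; they only show how one question can be posed against three answer spaces while the gold content stays matched.

```python
from dataclasses import dataclass


@dataclass
class MCQ:
    question: str
    choices: list[str]
    answer_index: int  # index of the correct choice


def to_mcq_prompt(item: MCQ) -> tuple[str, str]:
    """Multiple choice: the question supplies labeled candidates."""
    labels = "ABCDEFGHIJ"
    lines = [f"{labels[i]}. {c}" for i, c in enumerate(item.choices)]
    return item.question + "\n" + "\n".join(lines), labels[item.answer_index]


def to_yes_no(item: MCQ, candidate_index: int) -> tuple[str, str]:
    """Boolean verification: ask whether one candidate is correct."""
    candidate = item.choices[candidate_index]
    prompt = f"{item.question}\nIs the answer {candidate}? Answer yes or no."
    gold = "yes" if candidate_index == item.answer_index else "no"
    return prompt, gold


def to_open_ended(item: MCQ) -> tuple[str, str]:
    """Open-ended: drop the choices so the model must construct the answer."""
    return item.question, item.choices[item.answer_index]
```

For example, `to_yes_no(MCQ("What is 7 * 8?", ["54", "56", "58", "64"], 1), 1)` yields a prompt ending in "Is the answer 56? Answer yes or no." with gold label "yes".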

✨ What Is Included

  • Six-mode intervention spectrum: compare thinking-on, native no-think, reasoning re-elicitation, semantic suppression, strict answer-only output, and prefix forcing.
  • Answer-space family coverage: boolean verification, multiple-choice answering, and open-ended generation.
  • Paper-aligned analysis scripts: MMLU domain compressibility, surface perturbations, candidate-visible verification, and numeric answer-space rewrites.
  • Posthoc scoring utilities: recompute full-response similarity and math scores without rerunning model inference.
  • README result asset: a single animated core-results summary under assets/readme/.

🗂️ Repository Layout

ThinkZero/
├── README.md
├── pyproject.toml
├── requirements.txt
├── assets/readme/                 # README animated core-results summary
├── data/
│   ├── processed/                 # Normalized benchmark JSONL files
│   └── raw/                       # Local cache placeholder; ignored except .gitkeep
├── docs/
│   ├── index.html                 # GitHub Pages project page
│   ├── assets/                    # Project-page figures and animation
│   ├── thinking_spectrum.md       # Detailed six-mode workflow notes
│   └── nonreason_baseline.md      # Earlier strict non-reasoning baseline notes
└── src/thinking_inertia/
    ├── download_datasets.py
    ├── eval_thinking_spectrum.py
    ├── eval_nonreason.py
    ├── score_full_response_similarity.py
    ├── rescore_math_records.py
    ├── mmlu_domain_compressibility.py
    ├── mmlu_surface_perturbations.py
    ├── mmlu_candidate_verification.py
    └── mmlu_numeric_answer_space_rewrites.py

Generated artifacts should stay local:

data/runs/          # records.jsonl and summary.json from model runs
data/experiments/   # derived intervention datasets and analysis tables
data/raw/hf_cache/  # Hugging Face cache
logs/               # model server logs

⚙️ Installation

cd ThinkZero

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

If you prefer a non-editable dependency install:

python -m pip install -r requirements.txt

📊 Data

The release copy includes normalized benchmark files under data/processed/. To regenerate them from Hugging Face:

python -m thinking_inertia.download_datasets \
  --root-dir data \
  --overwrite

Supported datasets:

| Answer-space family | Datasets |
| --- | --- |
| Boolean verification | boolq, strategyqa |
| Multiple choice | mmlu, mmlu_pro |
| Open-ended generation | gsm8k, math |
| Additional normalized set | commonsenseqa, gsm_symbolic |

For paper-style MATH subsets, keep the full data/processed/math.jsonl file and sample at evaluation time.
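
One way to sample at evaluation time is to balance the subset across difficulty levels with a fixed seed. The sketch below is an assumption about what a level-balanced strategy could look like, not the behavior of `--sample-strategy level_balanced`: the `level` field name and the round-robin rule are hypothetical.

```python
import random
from collections import defaultdict


def level_balanced_sample(records, max_samples, seed, level_key="level"):
    """Draw up to max_samples records, spreading them evenly over levels."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec[level_key]].append(rec)  # level_key is an assumed field
    levels = sorted(by_level)
    # Shuffle within each level so the pick is random but reproducible.
    for level in levels:
        rng.shuffle(by_level[level])
    sample = []
    # Round-robin over levels until the quota is met or the data runs out.
    while len(sample) < max_samples and any(by_level[lv] for lv in levels):
        for level in levels:
            if by_level[level] and len(sample) < max_samples:
                sample.append(by_level[level].pop())
    return sample
```

Because the full file stays on disk and only the seed and sample size change, the same subset can be redrawn for every model and mode.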

🚀 Quick Start

Start an OpenAI-compatible model server separately, then run the six-mode evaluation:

python -m thinking_inertia.eval_thinking_spectrum \
  --root-dir data \
  --datasets boolq strategyqa mmlu mmlu_pro gsm8k math \
  --base-url http://127.0.0.1:8401/v1 \
  --model qwen3-4b \
  --api chat \
  --modes 1 2 3 4 5 6 \
  --include-reasoning \
  --accept-display-math-wrapper \
  --max-samples 200 \
  --sample-strategy level_balanced \
  --seed 20260502 \
  --run-name qwen3_4b_spectrum_paper200

Outputs are written to:

data/runs/qwen3_4b_spectrum_paper200/
├── records.jsonl
└── summary.json

🔎 MiniLM Similarity Model

eval_thinking_spectrum.py and score_full_response_similarity.py use MiniLM to compute semantic similarity. By default, the code uses sentence-transformers/all-MiniLM-L6-v2. For offline or pinned environments:

export MINILM_MODEL_PATH=/path/to/all-MiniLM-L6-v2
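
As a rough sketch of how a question-to-text similarity score could be computed with this model, the example below resolves the model location from `MINILM_MODEL_PATH` and takes the cosine similarity of two embeddings. The helper names are hypothetical, and both the choice of cosine similarity and the score of 0.0 for empty text are assumptions rather than the repository's confirmed scoring rule.

```python
import math
import os

DEFAULT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def resolve_minilm_model() -> str:
    """Prefer a local path from MINILM_MODEL_PATH, else the hub name."""
    return os.environ.get("MINILM_MODEL_PATH") or DEFAULT_MODEL


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def question_text_similarity(question: str, pre_answer: str, encode) -> float:
    """Sim between a question and visible pre-answer text T.

    `encode` maps a list of strings to a sequence of vectors.
    Assumption: empty T is scored 0.0.
    """
    if not pre_answer.strip():
        return 0.0
    q_vec, t_vec = encode([question, pre_answer])
    return cosine(q_vec, t_vec)


def load_encoder():
    """Return the MiniLM encode function (requires sentence-transformers)."""
    # Imported lazily so the helpers above work without the dependency.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(resolve_minilm_model()).encode
```

With the dependency installed, `question_text_similarity(question, extracted_t, load_encoder())` scores one record; in practice the encoder would be loaded once and reused.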

🧪 Posthoc Full-Response Similarity

Score visible responses against questions without rerunning model inference:

python -m thinking_inertia.score_full_response_similarity \
  --input data/runs/qwen3_4b_spectrum_paper200

🧮 Math Rescoring

If a run used fallback math normalization, rescore GSM8K/MATH records with math-verify:

python -m thinking_inertia.rescore_math_records \
  --run-dir data/runs/qwen3_4b_spectrum_paper200 \
  --accept-display-math-wrapper

🧭 Six Evaluation Modes

| Mode | Name | Paper role |
| --- | --- | --- |
| mode1 | Thinking on | Thinking-enabled reference. |
| mode2 | Native no-think | Native think-off or minimal-thinking baseline. |
| mode3 | Step-by-step under no-think | Re-elicitation test. |
| mode4 | Short lead-in, no explanation | Soft semantic suppression. |
| mode5 | Strict answer only | Strong answer-only suppression. |
| mode6 | Prefix forcing | Structural "The answer is ..." branch. |

Paper Section Map

| Paper section | Script | Role |
| --- | --- | --- |
| Settings / Benchmarks | `download_datasets.py` | Download and normalize benchmark JSONL files. |
| Native Think-on vs. Think-off | `eval_thinking_spectrum.py` | Run M1-M3 to test thinking inertia. |
| Performance across No-Think variants | `eval_thinking_spectrum.py` | Run M2/M4/M5/M6 suppression and prefix-forcing variants. |
| Performance across disciplines | `mmlu_domain_compressibility.py` | Analyze MMLU domain and subject-level compressibility. |
| Performance under answer-space rewrites | `mmlu_numeric_answer_space_rewrites.py` | Build/analyze matched MCQ, yes/no, and open-ended numeric rewrites. |
| Appendix: Surface perturbations | `mmlu_surface_perturbations.py` | Build/analyze choice-shuffle and numeric-label controls. |
| Appendix: Candidate visibility | `mmlu_candidate_verification.py` | Build/analyze candidate-visible yes/no verification rewrites. |

🛠️ Script Reference

| Script | Typical command | Description |
| --- | --- | --- |
| `download_datasets.py` | `thinking-download-datasets` | Regenerate normalized benchmark files under data/processed/. |
| `eval_thinking_spectrum.py` | `thinking-eval-spectrum` | Main six-mode model evaluation. |
| `eval_nonreason.py` | `thinking-eval-nonreason` | Earlier strict non-reasoning baseline and shared extraction utilities. |
| `score_full_response_similarity.py` | `thinking-score-response` | Posthoc Q-to-response similarity scoring. |
| `rescore_math_records.py` | `thinking-rescore-math` | Rebuild math scores and summaries with math-verify. |
| `mmlu_domain_compressibility.py` | `python -m thinking_inertia.mmlu_domain_compressibility` | Domain/subject analysis for MMLU. |
| `mmlu_surface_perturbations.py` | `python -m thinking_inertia.mmlu_surface_perturbations` | Choice-shuffle and numeric-label controls. |
| `mmlu_candidate_verification.py` | `python -m thinking_inertia.mmlu_candidate_verification` | MCQ-to-yes/no candidate verification rewrites. |
| `mmlu_numeric_answer_space_rewrites.py` | `python -m thinking_inertia.mmlu_numeric_answer_space_rewrites` | Matched numeric MCQ, boolean, and open-ended rewrites. |

Output Format

Each evaluation run writes:

| File | Contents |
| --- | --- |
| `records.jsonl` | Per-example model response, extracted T, extracted answer A, accuracy, ETR fields, and similarity values. |
| `summary.json` | Aggregates by mode, dataset, and mode-dataset pair. |

Important record fields include:

mode, mode_name
content, reasoning, reasoning_content
extracted_t, t_source
t_word_count, t_char_count
answer_raw, prediction, gold_answer
thinking_rate, max_thinking_rate
nonreason_pass, nonreason_correct
is_correct
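
A per-mode summary can be recomputed from these fields with a few lines of Python. This is a minimal sketch that uses only the `mode`, `is_correct`, and `t_word_count` fields listed above; treating `t_word_count == 0` as "empty T" is an assumption, and the helper name `summarize_by_mode` is hypothetical rather than part of the package.

```python
import json
from collections import defaultdict


def summarize_by_mode(lines):
    """Aggregate per-mode accuracy and ETR from records.jsonl lines."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "empty_t": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        rec = json.loads(line)
        bucket = totals[rec["mode"]]
        bucket["n"] += 1
        bucket["correct"] += bool(rec["is_correct"])
        # Assumption: T counts as empty when its word count is zero.
        bucket["empty_t"] += rec["t_word_count"] == 0
    return {
        mode: {
            "n": b["n"],
            "acc": b["correct"] / b["n"],
            "etr": b["empty_t"] / b["n"],
        }
        for mode, b in totals.items()
    }
```

Usage: open data/runs/qwen3_4b_spectrum_paper200/records.jsonl and pass the file object directly, for example `with open(path) as f: print(summarize_by_mode(f))`.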

📚 Documentation

📣 Release Notes

This repository intentionally excludes:

  • raw model outputs from full experiments,
  • local model logs,
  • Hugging Face caches,
  • recovered or internal scratch artifacts,
  • HTML report generators used only for collaborator discussion.

The project page is included under docs/ and is served by GitHub Pages at https://leidq.github.io/ThinkZero/.

📌 Citation

If you use this repository, please cite the paper:

@misc{lei2026llmskeepthinking,
  title  = {LLMs Keep Thinking When Told Not To},
  author = {Dianqiao Lei and Kevin Qinghong Lin and Pan Lu and Philip Torr and James Zou},
  year   = {2026},
  note   = {Preprint}
}

The final arXiv/project-page links will be added after release.
