The Token Governor: A Research & Engineering Pipeline for Timing-Conditioned Dynamic Output Budget Control in LLM Chatbots
Status: Pre-experimental. Formalizing hypothesis from chatbot-token-reduction (primary) and think2(1) (primary) with support from Han et al. (TALE, 2024), Aggarwal & Welleck (L1/LCPO, COLM 2025), Bhargava et al. (Control Theory of Prompting, 2024).
Thesis in one sentence: The temporal decomposition of how a user physically types a prompt — total elapsed time, idle pause time, and active keystroke time — is a measurable, low-latency signal that predicts required response complexity and can serve as a real-time token budget governor, reducing inference cost without proportional quality loss.
The raw idea from chatbot-token-reduction is:
Track the time in which a prompt is written. Based on this, the model decides how much info, what info, context, user's chat history, and some other factors the response of the chatbot will be sent to the user.
And from think2(1):
This is dynamically choosing/deciding capping of no. of tokens — Governor for Tokens.
The author identifies three behavioral scenarios:
- Scene 1 (copy-paste, t < 3s): User copies ~200 lines × 13 words/line ≈ 2,600 words from a browser source, pastes it, hits Enter. Time is near-zero relative to content volume. Hypothesis: user wants a fast, direct answer. Model should think less, generate fewer tokens, lower cost.
- Scene 2 (typed, quick): User wants quick answers, not long-form output. Response length preference is inferrable from typing velocity.
- Scene 3 (typed, letter-by-letter, t = 5–t sec): User composes carefully. What they will ask in this time frame can be partially predicted from history. Maximum token budget may be capped based on predicted complexity.
The extended formalization in think2(1) introduces three timers and a pipeline: Prompt → t₁ | t₂ | t₃ → Pruning → Choose Response → Response by chatbot. Supporting papers (TALE: Token-Budget-Aware LLM Reasoning; L1: Length Controlled Policy Optimization) report that token budgets injected into prompts materially compress reasoning length with acceptable quality degradation.
The author also correctly identifies the hard challenge: "How can I mathematicize all of these?"
That is what this document does.
Before formalizing, every claim must pass a falsifiability check. The following table audits each raw claim.
| Raw Claim | Falsifiable? | Repair If Not |
|---|---|---|
| "Track time in which prompt is written" | Yes — t is measurable | None needed |
| "Model decides how much info" | Undefined — "info" is not a variable | Replace with N_max (output token budget) |
| "t < 3s → model thinks less" | Partially — "think less" is not defined | Replace with: cap reasoning tokens R ≤ R_min |
| "Fewer tokens → lower cost" | Yes — API pricing is linear in tokens | Trivially true given the premise |
| "Fewer tokens → higher performance" | FALSE as stated — this is wrong | Restate: fewer tokens THAN NEEDED → quality drop; fewer tokens THAN WASTED → quality preserved |
| "Response cap based on timing" | Yes — if f(t) is defined | Define f explicitly |
Critical correction: The claim "fewer tokens = higher performance" is the hypothesis's weakest assertion. The correct claim is: for a subset of queries, the default token budget is unnecessarily large; capping at the predicted minimum sufficient budget preserves quality while reducing cost. This is the falsifiable version. Do not proceed with the original wording.
Let the following be the canonical variable set for this system.
Timing variables (from think2(1)):
t₀ : Reference zero — the moment the first keystroke is registered
OR the moment the input field is cleared
t_f : Finalization time — the moment Enter is pressed OR the
last keystroke before a 2s idle timeout
t₁ : Wall-clock elapsed time = t_f - t₀ [milliseconds]
t₂ : Total idle time (sum of all inter-keystroke gaps > δ_idle)
t₂ ∈ [0, t₁], δ_idle = 500ms (tunable threshold)
t₃ : Total active typing time = t₁ - t₂
t₃ ∈ [0, t₁]
Note the fundamental identity: t₁ = t₂ + t₃ (exact, by definition).
Derived timing features:
v_type : Mean typing velocity = L_char / t₃ [characters per second, with t₃ converted from ms to s]
where L_char = character count of the submitted prompt
ρ_paste : Paste indicator = 1 if t₃ < δ_paste AND L_char > θ_paste,
else 0. (δ_paste = 2000 ms, θ_paste = 200 chars are initial values)
r_idle : Idle ratio = t₂ / t₁ ∈ [0, 1]
Prompt structure variables:
L_p : Prompt token length (input tokens, counted by tokenizer)
L_char : Prompt character length
H_p : Shannon entropy of the prompt token distribution
H_p = -Σ p(token_i) log p(token_i)
C_h : Conversation history feature vector (e.g., mean response
length of last k turns, topic embedding centroid)
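The entropy feature H_p can be computed directly from a prompt's token list. The following is a minimal sketch, assuming p(token_i) is the within-prompt relative frequency and that the logarithm is base 2 (the definition above does not fix the base). The function name `compute_entropy` matches the helper referenced in the simulation code, but its body here is an assumption.

```python
import math
from collections import Counter

def compute_entropy(tokens: list) -> float:
    """Shannon entropy (in bits) of the empirical token distribution of one prompt."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum p_i * log2(p_i), with p_i the relative frequency of each distinct token
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform = compute_entropy(["a", "b", "c", "d"])   # four distinct, equally frequent tokens
repeated = compute_entropy(["a", "a", "a", "a"])  # a single token repeated
```

A prompt that repeats one token has zero entropy, while a prompt of all-distinct tokens reaches the maximum log2(number of tokens), so H_p grows with lexical variety rather than with length alone.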
Output variables:
N_max : Maximum output token budget (the "governor cap")
N_actual : Actual tokens generated by the model for a given response
N_baseline : Tokens generated WITHOUT any budget cap (oracle baseline)
Q : Response quality score ∈ [0, 1] (defined operationally below)
C_token : Cost per token ($USD, model-specific)
L_latency : Time-to-first-token + total generation latency [ms]
System parameters (tunable):
α : Quality-cost trade-off weight in the objective function
k : Conversation history look-back window
δ_idle : Idle threshold for t₂ accumulation [ms]
δ_paste : Paste detection time threshold [ms]
θ_paste : Paste detection character count threshold
β₀..βₙ : Regression coefficients for the budget prediction function
The token governor maps timing and prompt features to a predicted maximum output budget:
N̂_max = f(t₁, t₂, t₃, v_type, ρ_paste, L_p, H_p, C_h ; β)
The minimal parameterization sufficient for a first-pass model is:
N̂_max = exp(β₀ + β₁·log(t₃+1) + β₂·ρ_paste + β₃·log(L_p+1) + β₄·r_idle + β₅·H_p)
The exponential form is chosen because N̂_max must be strictly positive and the relationship between typing time and response length is likely log-linear, not linear. The log(t₃+1) term smoothly handles the paste case (t₃ ≈ 0).
For the paste-case specifically:
If ρ_paste = 1:
N̂_max = N̂_max_paste = g(L_p, H_p, C_h)
where g is a separate model trained on paste-type queries only
(rationale: paste queries are a distinct distributional regime)
Claim 1: Timing signal correlates with required response length.
∃ ρ > 0 such that:
Corr(t₃, N_baseline) > ρ across the query distribution D
Falsification: If Corr(t₃, N_baseline) ≤ 0.15 for |D| > 10,000
the hypothesis collapses here. Stop.
Claim 2: Budget cap preserves quality above threshold.
∀ queries q where N̂_max(q) < N_baseline(q):
Q(response | N_max = N̂_max(q)) ≥ (1 - ε) · Q(response | N_max = ∞)
Target: ε ≤ 0.05 (5% quality degradation tolerance)
Falsification: Mean Q drops > 10% on a balanced eval set
Claim 3: The system reduces total token consumption.
E[N_actual | governor_on] < E[N_actual | governor_off]
Formalized as: Δ_tokens = 1 - E[N_actual|on] / E[N_actual|off] > 0
Target: Δ_tokens ≥ 0.15 (15% token reduction)
Falsification: Δ_tokens ≤ 0.05 (governor has negligible effect)
Claim 4: Cost reduction is proportional to token reduction.
ΔC = C_token · E[(N_baseline - N̂_max)⁺]
where (x)⁺ = max(0, x) [we only save on queries where we cap below baseline]
This is a TRIVIAL CONSEQUENCE of pricing linearity. Not falsifiable — it is
an accounting identity. Mark it as a derivation, not an empirical claim.
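The accounting identity in Claim 4 can be checked numerically. This is a small sketch with made-up token counts; the $3 per million tokens price is the figure used in the cost estimate elsewhere in this document, and the function name is hypothetical.

```python
def expected_cost_saving(n_baseline: list, n_hat_max: list, cost_per_token: float) -> float:
    """C_token times the mean positive part of (N_baseline - N_hat_max) over a batch."""
    if len(n_baseline) != len(n_hat_max):
        raise ValueError("both lists must have one entry per query")
    if not n_baseline:
        return 0.0
    # Only queries capped below their baseline contribute; the rest save nothing.
    savings = [max(0, b - h) for b, h in zip(n_baseline, n_hat_max)]
    return cost_per_token * sum(savings) / len(savings)

n_baseline = [400, 120, 900]   # illustrative unconstrained lengths
n_hat_max = [250, 200, 600]    # illustrative governor caps
saving = expected_cost_saving(n_baseline, n_hat_max, cost_per_token=3e-6)
```

The second query has a cap above its baseline, so it contributes zero rather than a negative saving; that is the role of the positive part.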
The system can be cast as a constrained optimization problem. Given a query q with features (t₁, t₂, t₃, L_p, H_p, C_h), find N̂_max that:
minimize: C_token · N̂_max (cost)
subject to: E[Q(response | N_max = N̂_max)] ≥ Q_min (quality floor)
N̂_max ≥ N_lower(q) (minimum coherence budget)
N̂_max ≤ N_model_max (model hard limit)
where Q_min is operationally defined as the minimum acceptable quality score (initial value: 0.85 on a normalized scale), and N_lower(q) is a hard lower bound derived from prompt type (e.g., any query with L_p > 500 tokens gets N_lower ≥ 150).
Equivalently, for a target quality floor Q_min, a Lagrangian relaxation of the quality constraint with multiplier λ ≥ 0 yields the minimum budget as:
N̂_max* = argmin_{N} { C_token · N + λ · [Q_min - E[Q(·|N)]] }
This gives the theoretical minimum budget for each quality level. In practice, λ must be estimated from data. The production system approximates this via the regression function f(·) above.
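To make the role of λ concrete, the penalized objective can be evaluated over a grid of candidate budgets. This is a sketch under stated assumptions: the quality values are hypothetical placeholders for an estimated E[Q | N], and the function name is illustrative.

```python
def lagrangian_budget(budgets, expected_quality, cost_per_token, q_min, lam):
    """Return the grid budget minimizing C_token * N + lam * (Q_min - E[Q | N])."""
    def objective(i):
        return cost_per_token * budgets[i] + lam * (q_min - expected_quality[i])
    best = min(range(len(budgets)), key=objective)
    return budgets[best]

budgets = [50, 100, 200, 500, 1000]
expected_quality = [0.55, 0.72, 0.86, 0.91, 0.92]  # hypothetical E[Q | N] per budget

# With no quality penalty the cheapest budget wins; a positive lam trades cost for quality.
cheapest = lagrangian_budget(budgets, expected_quality, 3e-6, q_min=0.85, lam=0.0)
balanced = lagrangian_budget(budgets, expected_quality, 3e-6, q_min=0.85, lam=0.01)
```

In this toy curve, λ = 0.01 selects the smallest budget that clears the 0.85 floor, because the quality gain beyond that point no longer pays for the extra tokens.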
Following Bhargava et al. (2024), the LLM system can be formalized as a discrete stochastic dynamical system. The token governor acts as an external control input:
State: s_t = (token sequence at step t)
Control: u = N̂_max (injected as budget constraint in the prompt)
Dynamics: s_{t+1} ~ P_θ(s_{t+1} | s_t, u)
Output: y = s_{N̂_max} (the complete response)
Objective: Find u* = f(timing_features) such that Q(y) ≥ Q_min
The key insight from control theory: prompt-injected constraints measurably modify the output distribution of LLMs (Bhargava et al. report ≥ 97% reachability for short prompts). This is consistent with the TALE approach (Han et al., 2024), in which including a token budget in the prompt compresses reasoning length. The timing signal's role is to compute the right budget to inject, not to constrain the model mechanically.
The full causal chain from user behavior to inference cost reduction is:
User intent (latent)
│
▼
User interaction pattern
│ [manifests as]
▼
Timing signals: (t₁, t₂, t₃, v_type, ρ_paste)
│ [predicted by f(·)]
▼
Budget prediction: N̂_max
│ [injected as]
▼
System prompt modification: "Answer in at most N̂_max tokens."
│ [modifies]
▼
Token generation process: P_θ(y | x, budget_constraint)
│ [produces]
▼
Response y with N_actual ≤ N̂_max tokens
│ [measured as]
▼
Quality Q(y), Latency L(y), Cost C(y)
Confounds that must be controlled:
The primary confound is query complexity. Complex queries require more tokens regardless of how the user typed them. The timing signal must be conditioned on prompt features (L_p, H_p) to avoid the spurious correlation: long prompts take longer to type AND require longer responses, but the causal mechanism goes through complexity, not through time.
A user who types slowly because they are deliberating over wording (high t₂/t₁ ratio, low v_type) may be writing a simple question. Typing speed is an imperfect proxy for complexity. This is the main theoretical weakness of the hypothesis.
Proposed correction: Use typing velocity (prompt length divided by active typing time), not raw typing time:
v_type = L_char / t₃ [chars/sec]
A low v_type with a short prompt → deliberate, possibly complex query
A low v_type with a long prompt → likely transcription/composition, complex query
A high v_type with a long prompt → paste, OR fast typist with pre-planned query
The feature v_type alone cannot disambiguate a fast typist from a paste operation. That is precisely why ρ_paste (the paste indicator) must be a separate binary feature.
Step 1: Raw signal collection (client-side)
keystroke_log = [(char_i, timestamp_i) for each keypress]
t₀ = keystroke_log[0].timestamp (first keypress)
t_f = keystroke_log[-1].timestamp (last keypress before Enter)
t₁ = t_f - t₀ [ms]
# Compute inter-keystroke intervals
gaps = [keystroke_log[i+1].timestamp - keystroke_log[i].timestamp
for i in range(len(keystroke_log)-1)]
# Separate idle and active time
t₂ = sum(g for g in gaps if g > δ_idle)
t₃ = t₁ - t₂
L_char = len(submitted_prompt)
Step 2: Feature extraction
v_type = L_char / (max(t₃, 1) / 1000) # chars/sec (t₃ is in ms), guarded against div-by-zero
ρ_paste = int(t₃ < δ_paste and L_char > θ_paste)
r_idle = t₂ / max(t₁, 1)
Step 3: Budget prediction
log(N̂_max) = β₀ + β₁·log(t₃+1) + β₂·ρ_paste
+ β₃·log(L_p+1) + β₄·r_idle + β₅·H_p
N̂_max = exp(log(N̂_max))
N̂_max = clip(N̂_max, N_lower, N_model_max)
Step 4: Budget injection into prompt
Following Han et al. (TALE, 2024) and Aggarwal & Welleck (L1, COLM 2025), the budget is injected as a natural-language constraint in the system prompt:
SYSTEM_PROMPT_SUFFIX = f"""
The user's query was composed in approximately {t₁/1000:.1f} seconds.
Provide a response that is complete but concise.
Target response length: {N̂_max} tokens or fewer.
Do not pad the response. Stop when the answer is complete.
"""
The specific phrasing matters (Bhargava et al., 2024 show prompt sensitivity). The exact injection format is an experimental variable — it should be A/B tested.
Beyond the output token count, several internal signals can be monitored to understand whether the budget constraint is binding or being ignored:
Attention sparsity: Under a tight token budget, the model should attend more narrowly to task-relevant tokens. Proxy: the ratio of attention weight concentrated in the top-k% of tokens. Higher sparsity under tight budget suggests the model is efficiently compressing, not hallucinating.
Log-probability of the EOS token: If the model would naturally generate more tokens but is constrained by the budget, the log P(EOS) at the budget boundary should be lower than in unconstrained generation. A low log P(EOS) at truncation is a quality risk signal: the model was "mid-thought" when cut off. This is measurable in white-box settings (open-weight models) but often not via hosted APIs, which may expose no or only partial log-probabilities.
Perplexity of the generated response: Higher perplexity in constrained responses relative to unconstrained baselines indicates forced compression is introducing incoherence.
Proxy for white-box access: If only API access is available, use a secondary LLM judge (the TALE approach) to rate completeness. Incompleteness = the budget is binding in a harmful way.
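The EOS log-probability check described above can be sketched from raw next-token logits in a white-box setting. The threshold of -5 is the value used in the failure-mode discussion of this document; the function names and the toy logits are assumptions for illustration.

```python
import math

def eos_log_prob(logits: list, eos_id: int) -> float:
    """Log-probability of the EOS token under a softmax over next-token logits."""
    m = max(logits)  # subtract the max for numerical stability (log-sum-exp trick)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[eos_id] - log_z

def truncated_mid_thought(logits: list, eos_id: int, threshold: float = -5.0) -> bool:
    """Flag a truncation as risky when EOS was very unlikely at the budget boundary."""
    return eos_log_prob(logits, eos_id) < threshold

# Toy two-token vocabulary: index 1 plays the role of EOS.
risky = truncated_mid_thought([10.0, 0.0], eos_id=1)   # EOS strongly disfavoured
benign = truncated_mid_thought([0.0, 0.0], eos_id=1)   # EOS as likely as continuing
```

The logits passed in would be those at the final generated position, that is, the step at which the hard `max_tokens` cutoff took effect.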
The simulation must answer: does N̂_max = f(timing_features) correlate with the actual N_baseline needed for high-quality responses?
This requires a dataset where N_baseline is known. The simulation synthesizes timing features for queries with known ground-truth complexity.
Environment components:
Component 1: Query corpus with known complexity labels
- Source 1: ShareGPT / WildChat (real user conversations)
- Source 2: MMLU, HumanEval, GSM8K (structured tasks with known difficulty)
- Source 3: Synthetic prompts: paste_type (random Wikipedia paragraphs),
typed_simple (single-sentence factual questions),
typed_complex (multi-step reasoning queries)
Component 2: Synthetic timing generation model
- For real queries without timing data, simulate timing from:
Paste scenario: t₃ ~ Uniform(0.5, 2.5) seconds
L_char ~ query character count (full length)
Typed scenario: v_type ~ Normal(μ_v, σ_v) chars/sec
where μ_v = 3.3 chars/sec (about 40 wpm at 5 chars per word)
σ_v = 1.2 chars/sec
t₃ = L_char / v_type
t₂ = t₃ · r_idle / (1 - r_idle), r_idle ~ Beta(2, 5) (so that t₂ / t₁ = r_idle)
t₁ = t₂ + t₃
Component 3: Oracle N_baseline measurement
- Run each query through the target LLM with no budget constraint
- N_baseline = actual tokens generated
- Q_oracle = human rating or LLM judge score
Component 4: Budget prediction function f(·)
- Initialized with reasonable priors (β coefficients from regression)
- Updated via cross-validation on training split
| Step | Simulated? | Notes |
|---|---|---|
| User typing behavior | YES — synthetic timestamps | Calibrate against real keystroke data if available |
| Timing feature extraction | YES — deterministic from synthetic timestamps | |
| Budget prediction f(·) | YES — pure regression | No model call needed |
| Budget injection | YES — string formatting | |
| LLM response (N_baseline) | NO — requires real inference | One-time offline dataset construction |
| LLM response (N_actual) | NO — requires real inference | Needed to measure quality under budget |
| Quality scoring | PARTIALLY — LLM-as-judge can be simulated | Final eval needs human raters |
Cost estimate: If the eval set is 10,000 queries and each query requires 2 model calls (unconstrained + constrained), at ~$3/million tokens with average 500 output tokens per call, total eval cost ≈ $30. This is tractable.
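The arithmetic behind this estimate is easy to reproduce and to rerun under other assumptions. The sketch below counts output tokens only, as the estimate above does; input tokens and LLM-judge calls would add to the total. The function name is hypothetical.

```python
def eval_cost_usd(n_queries: int, calls_per_query: int,
                  avg_output_tokens: int, usd_per_million_tokens: float) -> float:
    """Output-token cost of an evaluation run, in USD."""
    total_tokens = n_queries * calls_per_query * avg_output_tokens
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 10,000 queries x 2 calls x 500 tokens = 10 million output tokens
cost = eval_cost_usd(10_000, 2, 500, 3.0)
```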
# === TOKEN GOVERNOR SIMULATION ===
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# --- Step 1: Build the oracle dataset ---
def build_oracle_dataset(queries: list[str], model_api) -> list[dict]:
    """
    For each query, run inference with no budget constraint.
    Returns N_baseline and Q_oracle for each query.
    """
    # model_api, llm_judge_quality, tokenize, and compute_entropy are assumed
    # to be provided elsewhere; they are not defined in this listing.
    dataset = []
    for q in queries:
        response = model_api.generate(q, max_tokens=4096)
        Q = llm_judge_quality(q, response)  # 0-1 score
        tokens = tokenize(q)
        dataset.append({
            "query": q,
            "N_baseline": len(response.tokens),
            "Q_oracle": Q,
            "L_p": len(tokens),
            "L_char": len(q),
            "H_p": compute_entropy(tokens),
        })
    return dataset
# --- Step 2: Synthesize timing features ---
def synthesize_timing(record: dict, scenario: str = "auto") -> dict:
    """
    Generate synthetic timing features for a query.
    In production, these come from real keystroke logs.
    """
    L_char = record["L_char"]
    if scenario == "paste" or (scenario == "auto" and L_char > 300 and np.random.rand() < 0.4):
        t3 = np.random.uniform(0.5, 2.5)  # seconds
        t2 = np.random.uniform(0, 0.5)
        rho_paste = 1
    else:
        v_type = np.random.normal(3.3, 1.2)  # chars/sec (about 40 wpm at 5 chars per word)
        v_type = max(v_type, 0.5)            # floor: 0.5 chars/sec
        t3 = L_char / v_type
        r_idle = np.random.beta(2, 5)        # skewed toward low idle
        t2 = t3 * r_idle / (1 - r_idle)      # chosen so that t2 / (t2 + t3) == r_idle
        rho_paste = 0
    t1 = t2 + t3
    record.update({
        "t1": t1 * 1000,  # convert to ms
        "t2": t2 * 1000,
        "t3": t3 * 1000,
        "v_type": L_char / max(t3, 0.001),
        "rho_paste": rho_paste,
        "r_idle": t2 / max(t1, 0.001),
    })
    return record
# --- Step 3: Train the budget predictor ---
def train_budget_predictor(dataset: list[dict]) -> Ridge:
    """
    Fit log-linear regression: log(N_baseline) ~ timing + prompt features.
    N_baseline is the learning target (we want to predict it accurately).
    """
    X = np.array([[
        np.log(r["t3"] + 1),
        r["rho_paste"],
        np.log(r["L_p"] + 1),
        r["r_idle"],
        r["H_p"],
    ] for r in dataset])
    # Target is log(N_baseline), floored at 1 token, so that exp() inverts it exactly.
    y = np.log(np.maximum([r["N_baseline"] for r in dataset], 1))
    model = Ridge(alpha=1.0)
    # 5-fold cross-validation to check predictive power
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"CV R² for log(N_baseline) prediction: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")
    # If R² < 0.2, the hypothesis fails at this stage. Log and stop.
    # Raised explicitly rather than asserted, so the check survives `python -O`.
    if cv_r2.mean() <= 0.2:
        raise RuntimeError("HYPOTHESIS FALSIFIED: timing features insufficient")
    model.fit(X, y)
    return model
# --- Step 4: Apply budget and measure quality ---
def evaluate_governor(dataset: list[dict], predictor: Ridge,
                      model_api, safety_margin: float = 1.2,
                      q_min: float = 0.85) -> dict:
    """
    For each query, compute N_hat_max, run constrained inference,
    and measure quality relative to oracle baseline.
    """
    results = []
    for r in dataset:
        X = np.array([[np.log(r["t3"] + 1), r["rho_paste"],
                       np.log(r["L_p"] + 1), r["r_idle"], r["H_p"]]])
        log_n_hat = predictor.predict(X)[0]
        N_hat = int(np.exp(log_n_hat) * safety_margin)
        N_hat = max(50, min(N_hat, 4096))  # hard bounds
        # Run constrained inference
        response_constrained = model_api.generate(
            r["query"],
            max_tokens=N_hat,
            system_suffix=f"Answer in at most {N_hat} tokens."
        )
        N_actual = len(response_constrained.tokens)
        Q_constrained = llm_judge_quality(r["query"], response_constrained)
        results.append({
            "N_baseline": r["N_baseline"],
            "N_hat": N_hat,
            "N_actual": N_actual,
            "Q_oracle": r["Q_oracle"],
            "Q_constrained": Q_constrained,
            # max(..., 1) guards against an empty baseline response
            "token_reduction": (r["N_baseline"] - N_actual) / max(r["N_baseline"], 1),
            "quality_delta": Q_constrained - r["Q_oracle"],
        })
    return {
        "mean_token_reduction": np.mean([r["token_reduction"] for r in results]),
        "mean_quality_delta": np.mean([r["quality_delta"] for r in results]),
        # q_min defaults to the Q_min = 0.85 quality floor defined earlier
        "p_quality_above_threshold": np.mean([r["Q_constrained"] >= q_min for r in results]),
        "raw": results,
    }
# --- Step 5: Ablation ---
# Run the above with timing features removed one at a time to measure marginal contribution.
# Ablation variables: [t3, rho_paste, L_p, r_idle, H_p]
# Expected: rho_paste has highest marginal contribution (paste vs. typed is the biggest split).

The ablation must answer: which features carry the most signal?
Run the budget predictor with each feature removed individually and measure the drop in cross-validation R² and in mean quality delta across the eval set.
Expected ordering of feature importance (hypothesis, to be confirmed empirically):
1. ρ_paste (binary split: paste vs. typed, largest effect size)
2. log(L_p) (prompt length is strong complexity proxy)
3. log(t₃) (typing time, correlated with deliberate composition)
4. H_p (prompt entropy, measures query specificity)
5. r_idle (weakest — idle ratio has high variance, low signal)
If ρ_paste turns out not to be the dominant feature, the copy-paste detection mechanism requires redesign. If log(L_p) alone explains most of the variance, the timing features are redundant and the hypothesis is partially falsified (prompt length alone is sufficient, timing is unnecessary added complexity).
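The leave-one-feature-out procedure can be sketched as follows. Two assumptions are made for brevity: ordinary least squares with a single hold-out split stands in for the ridge regression with 5-fold cross-validation used in the simulation code, and the data are synthetic numbers generated only to exercise the code, not results.

```python
import numpy as np

FEATURES = ["log_t3", "rho_paste", "log_L_p", "r_idle", "H_p"]

def holdout_r2(X_train, y_train, X_test, y_test) -> float:
    """Fit least squares with an intercept; return R^2 on the held-out split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    ss_tot = float(((y_test - y_test.mean()) ** 2).sum())
    return 1.0 - float(resid @ resid) / ss_tot

def ablation(X, y, split: float = 0.8):
    """Drop each feature in turn; report the fall in held-out R^2 vs. the full model."""
    n_train = int(len(X) * split)
    full = holdout_r2(X[:n_train], y[:n_train], X[n_train:], y[n_train:])
    drops = {}
    for j, name in enumerate(FEATURES):
        keep = [k for k in range(X.shape[1]) if k != j]
        r2 = holdout_r2(X[:n_train][:, keep], y[:n_train], X[n_train:][:, keep], y[n_train:])
        drops[name] = full - r2
    return full, drops

# Synthetic data: by construction y depends only on rho_paste and log_L_p.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 1] = rng.integers(0, 2, size=500)
y = -1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)
full_r2, drops = ablation(X, y)
```

On real data, a large drop for `log_L_p` combined with near-zero drops for the timing features would be the partial-falsification outcome described above.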
Primary metrics:
Δ_tokens : Token reduction ratio = 1 - E[N_actual|governor] / E[N_actual|baseline]
Target: ≥ 0.15 (15%)
ΔQ : Quality degradation = E[Q_oracle] - E[Q_governed]
Target: ≤ 0.05 (5 percentage points on 0-1 scale)
Δ_cost : Cost reduction = C_token · E[(N_baseline - N_actual)⁺]
Derived metric — follows from Δ_tokens, not independently testable
L_governor : Latency overhead of governor (feature extraction + prediction)
Target: < 5ms (must be negligible vs. model latency)
Secondary metrics:
ε_overtrim : Rate at which governor caps BELOW the quality-preserving budget
ε_overtrim = P(Q_governed < Q_min)
Target: ε_overtrim < 0.05 (5% of queries over-trimmed)
ε_undertrim : Rate at which governor sets budget ABOVE N_baseline
(no savings occur — the cap is non-binding)
ε_undertrim = P(N̂_max > N_baseline)
Target: ε_undertrim < 0.40 (at least 60% of queries should trigger savings)
TOA : Trade-Off Area = ∫[0,1] Q(N_max) dN_max (normalized)
Measures the quality-efficiency frontier
Higher TOA = better system
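The primary and secondary metrics can be computed from per-query result records. This is a minimal sketch: the record keys and sample values are hypothetical, and the TOA normalization (rescaling budgets to [0, 1] before trapezoidal integration) is one assumed reading of "normalized".

```python
def governor_metrics(results: list, q_min: float = 0.85) -> dict:
    """results: dicts with N_baseline, N_hat, N_actual, Q_oracle, Q_governed."""
    n = len(results)
    mean = lambda key: sum(r[key] for r in results) / n
    return {
        "delta_tokens": 1.0 - mean("N_actual") / mean("N_baseline"),
        "delta_q": mean("Q_oracle") - mean("Q_governed"),
        "eps_overtrim": sum(r["Q_governed"] < q_min for r in results) / n,
        "eps_undertrim": sum(r["N_hat"] > r["N_baseline"] for r in results) / n,
    }

def trade_off_area(budgets: list, quality: list) -> float:
    """Trapezoidal area under Q(N_max) with budgets rescaled to [0, 1]."""
    lo, hi = budgets[0], budgets[-1]
    xs = [(b - lo) / (hi - lo) for b in budgets]
    return sum((xs[i + 1] - xs[i]) * (quality[i] + quality[i + 1]) / 2
               for i in range(len(xs) - 1))

results = [  # illustrative records only
    {"N_baseline": 400, "N_hat": 300, "N_actual": 280, "Q_oracle": 0.90, "Q_governed": 0.88},
    {"N_baseline": 100, "N_hat": 150, "N_actual": 100, "Q_oracle": 0.95, "Q_governed": 0.95},
    {"N_baseline": 800, "N_hat": 500, "N_actual": 500, "Q_oracle": 0.90, "Q_governed": 0.80},
    {"N_baseline": 200, "N_hat": 150, "N_actual": 140, "Q_oracle": 0.85, "Q_governed": 0.87},
]
metrics = governor_metrics(results)
toa = trade_off_area([0, 50, 100], [0.0, 0.5, 1.0])
```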
Experiment 1: Correlation Study (Phase 1, unconstrained oracle inference only)
Question: Does timing signal correlate with oracle response length?
Design: Collect (or simulate) timing features for N=10,000 queries from
ShareGPT/WildChat. Run oracle inference to get N_baseline for each.
Compute Pearson r and Spearman ρ between each feature and N_baseline.
Baseline: Random prediction (null model)
Method: Log-linear regression as specified in Section 1.2
Key test: H₀: r(t₃, N_baseline) = 0 vs H₁: r > 0
Significance threshold: α = 0.01
Minimum effect size: r > 0.25 (between Cohen's small, 0.10, and medium, 0.30, benchmarks)
Experiment 2: Budget Accuracy (Phase 2)
Question: How well does N̂_max predict the minimum N needed for Q ≥ Q_min?
Design: For a subset of 1,000 queries, generate responses at multiple
budget levels: N_max ∈ {50, 100, 200, 500, 1000, 2000, uncapped}.
For each query, find N_min*(q) = smallest N_max giving Q ≥ Q_min.
Measure: E[|N̂_max - N_min*|] and P(N̂_max ≥ N_min*).
Baseline: N̂_max = mean(N_baseline) for all queries (constant predictor)
Key test: Does f(timing) outperform the constant predictor in accuracy?
Metric: Root Mean Squared Error of (N̂_max - N_min*)
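The Experiment 2 quantities can be derived from a per-query budget sweep. In this sketch the sweep values are invented, the uncapped level is left out of the numeric sweep, and queries that never reach Q_min at any capped level are counted separately rather than scored (both are assumptions, not part of the design above).

```python
import math

def min_sufficient_budget(sweep: dict, q_min: float = 0.85):
    """sweep maps N_max -> Q. Returns the smallest N_max with Q >= q_min, else None."""
    passing = [n for n, q in sweep.items() if q >= q_min]
    return min(passing) if passing else None

def budget_accuracy(n_hat: list, n_min_star: list) -> dict:
    """Error of predicted budgets against N_min*, over queries where N_min* exists."""
    errors = [h - m for h, m in zip(n_hat, n_min_star) if m is not None]
    return {
        "mae": sum(abs(e) for e in errors) / len(errors),
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
        "coverage": sum(e >= 0 for e in errors) / len(errors),  # P(N_hat >= N_min*)
        "n_unresolved": len(n_hat) - len(errors),
    }

sweeps = [  # hypothetical quality scores at each capped budget level
    {50: 0.60, 100: 0.80, 200: 0.90, 500: 0.92},
    {50: 0.90, 100: 0.93, 200: 0.93, 500: 0.94},
    {50: 0.30, 100: 0.50, 200: 0.70, 500: 0.80},
]
n_min_star = [min_sufficient_budget(s) for s in sweeps]
accuracy = budget_accuracy([250, 40, 300], n_min_star)
```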
Experiment 3: End-to-End Quality-Cost Trade-off (Phase 3)
Question: Does the full governor pipeline achieve the target Δ_tokens with acceptable ΔQ?
Design: A/B experiment on held-out test set (N=5,000 queries)
Arm A (control): No governor. N_max = model default (e.g., 4096)
Arm B (treatment): Governor active. N_max = N̂_max from f(timing)
Stratification by scenario:
Stratum 1: paste queries (ρ_paste = 1)
Stratum 2: short typed queries (t₃ < 5s, L_p < 100)
Stratum 3: long typed queries (t₃ ≥ 5s OR L_p ≥ 100)
Evaluation:
Primary: Δ_tokens, ΔQ per stratum
Secondary: ε_overtrim, ε_undertrim, TOA
Quality measure: GPT-4 or Claude judge on a 1-5 rubric,
normalized to [0, 1]
Human eval: 200 randomly sampled query pairs (governed vs. baseline)
shown to 3 annotators each (blind, counterbalanced order)
Inter-annotator agreement: Krippendorff's α > 0.6 required
Sample size calculation:
To detect Δ_tokens = 0.15 with power = 0.80, α = 0.05,
assuming σ(Δ_tokens) ≈ 0.25 (estimated from pilot):
n = (z_α/2 + z_β)² · σ² / Δ²
= (1.96 + 0.84)² · 0.0625 / 0.0225
≈ 21.8 → 22 paired queries; use 3,000 per arm (buffer for stratification and per-stratum tests)
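The sample-size formula can be evaluated with the standard library, which makes it easy to rerun once the pilot estimate of σ is available. The function name is hypothetical; the inputs are the values stated above.

```python
import math
from statistics import NormalDist

def paired_sample_size(delta: float, sigma: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """n = (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

n_required = paired_sample_size(delta=0.15, sigma=0.25)
```

Because the required n scales with (σ/Δ)², halving the detectable effect to 0.075 would quadruple the requirement.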
Significance tests:
Δ_tokens: Paired t-test (paired by query), one-sided, α = 0.01
ΔQ: Non-inferiority test. H₀: E[Q_oracle] - E[Q_governed] ≥ 0.10
Reject H₀ → governor is non-inferior in quality
ε_overtrim: Binomial proportion test, 95% CI for P(Q_governed < Q_min)
Confidence intervals:
Bootstrap 95% CI for all primary metrics (10,000 resamples)
Report both point estimates and CIs in all tables
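A percentile bootstrap for a per-query metric can be sketched as below. The percentile method is an assumption (the text does not say which bootstrap variant is intended), and the sample values are illustrative.

```python
import random

def bootstrap_ci(values: list, n_resamples: int = 10_000,
                 level: float = 0.95, seed: int = 0) -> tuple:
    """Percentile bootstrap confidence interval for the mean of per-query values."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(values[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = min(int((1 + level) / 2 * n_resamples), n_resamples) - 1
    return means[lo_idx], means[hi_idx]

token_reduction = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.20, 0.18]  # illustrative
ci_low, ci_high = bootstrap_ci(token_reduction)
```

Resampling whole queries, rather than individual tokens, keeps the pairing between governed and baseline responses intact.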
Multiple comparison correction:
Bonferroni correction across 3 strata × 2 primary metrics = 6 tests
Adjusted α = 0.05/6 ≈ 0.0083
Training set (for β estimation):
- 8,000 queries from WildChat (diverse real user conversations)
- Balanced across scenarios: 30% paste-type, 40% short typed, 30% long typed
- Include adversarial queries (section 6)
Validation set:
- 1,000 queries, same distribution, held out during training
- Used for hyperparameter tuning (δ_idle, δ_paste, θ_paste, safety_margin)
Test set:
- 5,000 queries, stratified by scenario and domain
- Domains: coding (HumanEval-Chat), reasoning (GSM8K-Chat),
factual (TriviaQA), summarization (CNN/DM user queries),
open-ended (WildChat subset)
- No overlap with training or validation sets
- Timing data: real if available (from a logging pilot), synthetic otherwise
Out-of-distribution test:
- 500 queries from a different distribution (e.g., medical QA, legal QA)
- Used to measure generalization; expected to degrade, so quantify the gap
The token governor is a lightweight middleware layer that sits between the user interface and the LLM API. It has three components:
[Client Layer] Keystroke Logger → Timing Extractor
│
▼ (timing JSON, ~50 bytes)
[Governor Service] Feature Extractor → Budget Predictor → Prompt Modifier
│
▼ (modified prompt + system message)
[LLM API] Model Inference
│
▼ (response)
[Quality Monitor] Token Counter + LLM Judge (async, for logging)
Latency budget: The governor service must complete in < 5ms (p99). Feature extraction is O(L_char) and the prediction is a single five-element dot product, so this budget should be easy to meet.
# === PRODUCTION TOKEN GOVERNOR ===
import time
import numpy as np
from dataclasses import dataclass
@dataclass
class KeystrokeLog:
    """Collected by the client via JavaScript keydown events."""
    chars: list[str]
    timestamps: list[float]  # milliseconds from performance.now(); only differences are used
    enter_timestamp: float

@dataclass
class TimingFeatures:
    t1: float       # ms
    t2: float       # ms
    t3: float       # ms
    v_type: float   # chars/sec
    rho_paste: int  # 0 or 1
    r_idle: float   # [0, 1]

@dataclass
class GovernorDecision:
    N_hat: int
    confidence: float  # [0, 1]; lower means less certain, apply larger safety margin
    scenario: str      # "paste", "typed_short", "typed_long"
    debug: dict
# --- Feature Extraction ---
def extract_timing_features(log: KeystrokeLog,
                            L_char: int,
                            delta_idle: float = 500.0,
                            delta_paste: float = 2000.0,
                            theta_paste: int = 200) -> TimingFeatures:
    """
    Extract timing features from raw keystroke log.
    Runs in the governor service on the log sent by the client.
    Time complexity: O(N_keystrokes)
    """
    if len(log.timestamps) == 0:
        # Edge case: no keystrokes (programmatic submission)
        return TimingFeatures(t1=0, t2=0, t3=0, v_type=float('inf'),
                              rho_paste=1 if L_char > theta_paste else 0,
                              r_idle=0)
    t0 = log.timestamps[0]
    tf = log.enter_timestamp
    t1 = tf - t0
    # Compute idle time (gaps > delta_idle), including the final gap before Enter
    # so that a pause between the last keystroke and submission counts as idle.
    stamps = list(log.timestamps) + [tf]
    gaps = [stamps[i + 1] - stamps[i] for i in range(len(stamps) - 1)]
    t2 = sum(g for g in gaps if g > delta_idle)
    t3 = max(t1 - t2, 1.0)  # guard against t3=0
    v_type = L_char / (t3 / 1000.0)  # chars/sec (t3 in ms → sec)
    rho_paste = int(t3 < delta_paste and L_char > theta_paste)
    r_idle = t2 / max(t1, 1.0)
    return TimingFeatures(t1=t1, t2=t2, t3=t3, v_type=v_type,
                          rho_paste=rho_paste, r_idle=r_idle)
# --- Budget Prediction ---
class TokenGovernor:
    """
    Lightweight ridge regression wrapper.
    Serializes to < 1KB (5 coefficients + intercept, plus 2 bounds).
    """
    def __init__(self, coefficients: np.ndarray,
                 intercept: float,
                 safety_margin: float = 1.2,
                 N_lower: int = 50,
                 N_upper: int = 4096):
        self.coef = coefficients  # shape (5,)
        self.intercept = intercept
        self.safety_margin = safety_margin
        self.N_lower = N_lower
        self.N_upper = N_upper

    def predict(self,
                timing: TimingFeatures,
                L_p: int,
                H_p: float,
                C_h: dict | None = None) -> GovernorDecision:
        """
        Predict maximum output token budget.
        Runs in < 1ms.
        """
        # Feature order must match the order used when fitting the coefficients.
        features = np.array([
            np.log(timing.t3 + 1),
            float(timing.rho_paste),
            np.log(L_p + 1),
            timing.r_idle,
            H_p,
        ])
        log_N_hat = self.intercept + self.coef @ features
        N_hat_raw = np.exp(log_N_hat)
        N_hat = int(np.clip(N_hat_raw * self.safety_margin,
                            self.N_lower, self.N_upper))
        # Scenario classification for logging and override rules
        if timing.rho_paste == 1:
            scenario = "paste"
        elif timing.t3 < 5000 and L_p < 100:
            scenario = "typed_short"
        else:
            scenario = "typed_long"
        # Confidence: lower for edge cases
        confidence = 1.0
        if timing.t1 < 500 and timing.rho_paste == 0:
            confidence *= 0.7  # very fast typed; ambiguous
        if L_p > 1000:
            confidence *= 0.8  # very long prompt; high uncertainty
        return GovernorDecision(
            N_hat=N_hat,
            confidence=confidence,
            scenario=scenario,
            debug={"N_hat_raw": float(N_hat_raw), "features": features.tolist()}
        )

    def build_system_prompt_injection(self, decision: GovernorDecision) -> str:
        """
        Constructs the budget constraint for injection into the system prompt.
        Budget-in-prompt phrasing in the spirit of TALE (Han et al., 2024);
        the exact wording is an experimental variable.
        """
        if decision.confidence < 0.6:
            # Low confidence → use a softer constraint
            return f"Please be concise. Aim for {decision.N_hat} tokens or fewer if possible."
        return (
            f"The user expects a response of approximately {decision.N_hat} tokens or fewer. "
            f"Provide a complete answer within this limit. "
            f"Do not pad with unnecessary explanation. Stop when done."
        )
# --- Interface to LLM API ---
def governed_generate(query: str,
                      keystroke_log: KeystrokeLog,
                      governor: TokenGovernor,
                      llm_api,
                      system_prompt: str = "") -> dict:
    """
    Full pipeline: timing → prediction → constrained inference.
    """
    start = time.perf_counter_ns()
    # Feature extraction (tokenize and compute_entropy are assumed helpers)
    L_char = len(query)
    tokens = tokenize(query)
    L_p = len(tokens)
    H_p = compute_entropy(tokens)
    timing = extract_timing_features(keystroke_log, L_char)
    # Budget prediction
    decision = governor.predict(timing, L_p, H_p)
    # Prompt modification
    injection = governor.build_system_prompt_injection(decision)
    modified_system = (system_prompt + "\n\n" + injection) if system_prompt else injection
    governor_latency_ms = (time.perf_counter_ns() - start) / 1e6
    # LLM inference
    response = llm_api.generate(
        query,
        system=modified_system,
        max_tokens=decision.N_hat
    )
    return {
        "response": response.text,
        "N_actual": response.token_count,
        "N_hat": decision.N_hat,
        "governor_latency_ms": governor_latency_ms,
        "scenario": decision.scenario,
        "timing": timing,
    }

// === KEYSTROKE LOGGER (browser, < 2KB) ===
class KeystrokeLogger {
  constructor(inputElement) {
    this.chars = [];
    this.timestamps = [];
    this.startTime = null;
    this.enterTimestamp = null; // set when Enter is pressed
    this.element = inputElement;
    this._attach();
  }

  _attach() {
    this.element.addEventListener('keydown', (e) => {
      const ts = performance.now(); // high-resolution timestamp
      if (e.key === 'Enter') {
        this.enterTimestamp = ts;
        return;
      }
      // Handle clear (Ctrl+A + Delete, or direct clear)
      if (this.chars.length === 0 || this._isCleared()) {
        this.startTime = ts;
        this.chars = [];
        this.timestamps = [];
      }
      this.chars.push(e.key);
      this.timestamps.push(ts);
    });
  }

  _isCleared() {
    return this.element.value.length === 0;
  }

  reset() {
    this.chars = [];
    this.timestamps = [];
    this.startTime = null;
    this.enterTimestamp = null;
  }

  serialize() {
    return {
      chars: this.chars,
      timestamps: this.timestamps, // relative to performance.now() epoch
      enter_timestamp: this.enterTimestamp,
      L_char: this.element.value.length,
    };
  }
}

// Usage:
// const logger = new KeystrokeLogger(document.getElementById('chat-input'));
// On submit: fetch('/api/governed-generate', { method: 'POST', body: JSON.stringify(logger.serialize()) })

Client → Governor Service:
POST /governor/predict
{
"keystroke_log": { "chars": [...], "timestamps": [...], "enter_timestamp": float },
"L_char": int,
"query_preview": str // first 50 chars for entropy estimation
}
Response: { "N_hat": int, "scenario": str, "confidence": float }
Governor Service → LLM API:
Standard API call, with system prompt modified to include budget constraint
max_tokens: N_hat (hard cutoff enforced by API)
LLM API → Quality Monitor (async, non-blocking):
{
"query_hash": str, // for privacy, never store raw query in production
"N_hat": int,
"N_actual": int,
"scenario": str,
"timing_features": { ... }
}
Quality monitor samples 1% of responses for LLM-judge scoring.
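
The `keystroke_log` payload above still has to be reduced to the three timers before the governor can predict a budget. The following is a minimal sketch of that reduction, not the author's implementation. It assumes t₁ is total elapsed time (first keystroke to Enter), t₂ is idle time (the sum of inter-key gaps longer than δ_idle), and t₃ = t₁ − t₂ is active typing time. `TimingFeatures` and `compute_timing_features` are hypothetical names, and counting the whole gap as idle (rather than only the excess over δ_idle) is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TimingFeatures:
    t1_ms: float   # total elapsed time, first keystroke to Enter
    t2_ms: float   # idle time: sum of inter-key gaps longer than delta_idle_ms
    t3_ms: float   # active typing time, t1 - t2
    r_idle: float  # idle ratio t2 / t1 (0.0 when t1 is zero)


def compute_timing_features(
    timestamps: Sequence[float],
    enter_timestamp: float,
    delta_idle_ms: float = 500.0,  # starting value of the idle threshold
) -> TimingFeatures:
    """Reduce a keystroke log to (t1, t2, t3). Hypothetical helper."""
    if not timestamps:
        # Nothing was typed key-by-key (e.g. a pure paste): all timers are zero.
        return TimingFeatures(0.0, 0.0, 0.0, 0.0)
    events = list(timestamps) + [enter_timestamp]
    t1 = max(0.0, events[-1] - events[0])
    gaps = [later - earlier for earlier, later in zip(events, events[1:])]
    # Assumption: a gap above the threshold counts as idle in full.
    t2 = sum(gap for gap in gaps if gap > delta_idle_ms)
    t3 = t1 - t2
    r_idle = t2 / t1 if t1 > 0 else 0.0
    return TimingFeatures(t1, t2, t3, r_idle)
```

With timestamps `[0, 100, 200, 1200]` and Enter at 1300 ms, the single 1000 ms gap counts as idle, giving t₁ = 1300, t₂ = 1000, and t₃ = 300.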
Failure Mode 1: The fast complex typist.
A user who types at 100+ WPM submits a genuinely complex multi-step reasoning query. t₃ is small, ρ_paste = 0, and the governor sets a tight budget. The response is truncated mid-reasoning.
Detection: P(EOS log-prob < -5 at truncation boundary) is high, i.e., the model was unlikely to have stopped on its own at the cutoff. Mitigation: Hard lower bound N_lower on complex query types (identified by H_p or explicit task keywords). Residual risk: The mismatch cannot be detected before generation starts, so a nonzero truncation rate is irreducible.
Failure Mode 2: The slow simple typist.
A user who types slowly and deliberately submits a short, simple question like "What is 2+2?". t₃ is large and r_idle is high, so the governor over-allocates a large budget. No savings occur.
Detection: ε_undertrim metric. Impact: No quality damage, only opportunity cost (savings foregone). This is the safe failure mode. Acceptable.
Failure Mode 3: Copy-paste of structured code or formulas.
A user pastes a 500-line code snippet with a 5-word question: "Fix the bug." ρ_paste = 1, so the governor assigns a small budget. The model needs 300+ tokens to explain the fix.
Detection: High L_p with low Q_governed. Mitigation: Special-case rule — if ρ_paste = 1 AND question_length < 20 chars (the pasted content is context, not the question), apply a different budget rule: N_hat = N_paste_context_default (e.g., 600 tokens). This requires a second pass to distinguish the actual question from the pasted context.
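
The special-case rule for Failure Mode 3 can be sketched as a small post-processing step on the predicted budget. This is an illustration under stated assumptions: `apply_paste_context_rule` is a hypothetical name, the typed question length is assumed to come from the second pass mentioned above (not implemented here), and taking the maximum of the existing budget and the default, so that an already generous budget is never lowered, is an added choice beyond the rule as written.

```python
def apply_paste_context_rule(
    n_hat: int,
    rho_paste: int,
    question_length_chars: int,
    n_paste_context_default: int = 600,  # example default from the text
    question_char_threshold: int = 20,
) -> int:
    """Override the budget when a paste is context for a short typed question."""
    if rho_paste == 1 and question_length_chars < question_char_threshold:
        # Assumption: never lower a budget that already exceeds the default.
        return max(n_hat, n_paste_context_default)
    return n_hat
```

For the example above, a pasted snippet followed by "Fix the bug." (12 characters) with a predicted budget of 80 tokens is raised to 600.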
Failure Mode 4: Multi-turn conversation context collapse.
The governor is calibrated on single-turn queries. In a long multi-turn conversation, the context accumulates (C_h) and required response length is no longer predictable from timing alone — the user may be asking a follow-up that requires referencing earlier turns.
Detection: Degradation in Q_governed as turn_number increases. Mitigation: Add turn number and rolling mean of N_baseline(last k turns) as features. For turns > 10, widen the safety margin from 1.2 to 1.5.
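
A minimal sketch of the Failure Mode 4 mitigation, assuming the per-turn N_baseline history is available to the governor. The function names, the feature keys, and the default window k = 5 are hypothetical; only the turn threshold (10) and the two margins (1.2 and 1.5) come from the text.

```python
from statistics import fmean
from typing import Dict, Sequence


def multi_turn_features(
    turn_number: int,
    n_baseline_history: Sequence[float],
    k: int = 5,  # assumed window size; the text only says "last k turns"
) -> Dict[str, float]:
    """Extra predictor features for multi-turn conversations."""
    if k <= 0:
        raise ValueError("k must be positive")
    recent = list(n_baseline_history)[-k:]
    rolling = fmean(recent) if recent else 0.0
    return {"turn_number": float(turn_number), "rolling_mean_n_baseline": rolling}


def safety_margin_for_turn(
    turn_number: int,
    base_margin: float = 1.2,
    long_conversation_margin: float = 1.5,
    turn_threshold: int = 10,
) -> float:
    """Widen the safety margin once the conversation exceeds the turn threshold."""
    return long_conversation_margin if turn_number > turn_threshold else base_margin
```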
Failure Mode 5: Adversarial timing manipulation.
A user or automated system sends crafted keystroke logs (e.g., simulating a slow typing pattern) to force a large token allocation and bypass budget constraints. From the cost-reduction side this is a denial-of-budget attack: faking slow typing increases cost. From the quality side it is harmless, since the attacker only receives more tokens.
Impact in production: Low when users pay for their own tokens, because the attacker only increases their own cost. It becomes a concern only when pricing is asymmetric (for example, flat-rate plans where the provider absorbs the extra tokens).
Stress Test 1: Adversarial Speed
Input: 1,000 queries with manually-crafted timing designed to maximize
the gap between predicted and required budget (both directions)
Metric: ε_overtrim rate on this subset
Pass threshold: ε_overtrim < 0.20 (higher than normal — adversarial is hard)
Stress Test 2: Long-Context Reasoning
Input: 500 queries requiring ≥ 500 output tokens (verified by oracle)
with ρ_paste = 1 (pasted context)
Metric: Q_governed vs Q_oracle, P(N_actual = N_hat)
Pass threshold: Mean ΔQ < 0.10 on this subset
Stress Test 3: Zero-Token Convergence
Input: Queries where the ideal response is a single word or number
(e.g., "What is the capital of France?")
Metric: Does the governor assign N_hat close to 5-20 (correct) rather
than the mean N_baseline (incorrect)
Pass threshold: Mean N_hat < 50 for this category
Stress Test 4: Distribution Shift
Input: Medical QA, legal QA, code (domains absent from training)
Metric: Δ_tokens and ΔQ vs. in-distribution performance
Pass threshold: ΔQ degradation < 0.05 additional vs. in-distribution
Stress Test 5: Latency Spike
Input: 10,000 concurrent requests to governor service
Metric: p99 governor latency
Pass threshold: < 10ms p99 (2× the target, to account for concurrency)
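
The pass check for Stress Test 5 can be sketched as follows. The nearest-rank percentile definition is an assumption (the text does not say which p99 estimator to use), and both function names are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Integer ceiling of len * pct / 100, avoiding float rounding at the boundary.
    rank = max(1, (len(ordered) * pct + 99) // 100)
    return ordered[rank - 1]


def passes_latency_stress(latencies_ms: Sequence[float], threshold_ms: float = 10.0) -> bool:
    """Stress Test 5 pass condition: p99 governor latency strictly below the threshold."""
    return percentile_nearest_rank(latencies_ms, 99) < threshold_ms
```

Under this definition, one slow request out of 100 does not fail the test, but two do.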
The governor is not a static model. It must improve over time as real usage data accumulates. The iteration loop has three timescales.
Hourly (automated): Monitor ε_overtrim and mean Δ_tokens on live traffic. If ε_overtrim rises above 0.08 within a 1-hour window, auto-increase safety_margin by 0.05 (for example, from 1.2 to 1.25) and alert the on-call engineer. This is a simple rule-based controller that limits quality degradation before a full retraining cycle.
Weekly (offline): Collect all served requests with quality scores (from the 1% sample). Retrain the β coefficients on the new data. Measure improvement vs. previous model. Deploy only if CV R² improves by > 0.01 and ε_overtrim does not increase. This is a standard MLOps offline training loop.
Monthly (architectural): Evaluate whether linear regression is still sufficient, or whether a small neural model (e.g., 2-layer MLP, < 1ms inference) should replace it. Evaluate new features (e.g., backspace count as a proxy for deliberation, punctuation rate as a proxy for sentence structure). Consider RL-based fine-tuning (following the LCPO approach of Aggarwal & Welleck, 2025) to train the model itself to respond optimally at the predicted budget.
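
The hourly and weekly rules above are simple enough to write down directly. The sketch below is illustrative only: the function names are hypothetical, and how ε_overtrim and CV R² are computed is assumed to be handled elsewhere in the monitoring stack.

```python
from typing import Tuple


def hourly_margin_update(
    eps_overtrim_1h: float,
    safety_margin: float,
    spike_threshold: float = 0.08,
    step: float = 0.05,
) -> Tuple[float, bool]:
    """Return (new_safety_margin, alert). The margin only ever increases here."""
    if eps_overtrim_1h > spike_threshold:
        # Round to keep repeated 0.05 steps free of float drift.
        return round(safety_margin + step, 4), True
    return safety_margin, False


def should_deploy_retrained(
    old_cv_r2: float,
    new_cv_r2: float,
    old_eps_overtrim: float,
    new_eps_overtrim: float,
    min_r2_gain: float = 0.01,
) -> bool:
    """Weekly gate: deploy only if CV R² gains more than 0.01 and over-trimming does not increase."""
    return (new_cv_r2 - old_cv_r2) > min_r2_gain and new_eps_overtrim <= old_eps_overtrim
```

Note that the hourly rule has no automatic decrease, so the margin ratchets upward until the weekly retraining resets it; whether that is the intended behavior is left open by the text.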
The current approach injects a token budget into the system prompt and relies on the model following it. This is fragile: models sometimes ignore soft constraints, and the quality-at-budget relationship varies by model version.
A stronger approach, following L1/LCPO (Aggarwal & Welleck, COLM 2025), is to fine-tune the model to:
- Receive a token budget as input
- Optimize jointly for response quality AND budget adherence
The reward function for RL fine-tuning is:
R(response, query, N_hat) = α · Q(response, query)
- (1-α) · max(0, N_actual - N_hat) / N_hat
- β_hard · 1[N_actual > N_hat · 1.1]
Where the first term rewards quality, the second penalizes going over budget proportionally, and the third is a hard penalty for significant overrun (>10% over budget). The α parameter controls the quality-efficiency trade-off and should be set by user segment (power users → α closer to 1; cost-conscious deployments → α closer to 0.5).
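
For concreteness, the reward above can be written as a plain function. This is a sketch of the formula as stated, not the reward used in the L1/LCPO paper; `budget_reward` is a hypothetical name, and the default values of `alpha` and `beta_hard` are placeholders, since the text does not fix them.

```python
def budget_reward(
    quality: float,       # Q(response, query), assumed to be in [0, 1]
    n_actual: int,
    n_hat: int,
    alpha: float = 0.7,      # placeholder quality/efficiency trade-off
    beta_hard: float = 1.0,  # placeholder hard-penalty weight
) -> float:
    """R = alpha*Q - (1-alpha)*max(0, N_actual - N_hat)/N_hat - beta_hard*1[N_actual > 1.1*N_hat]."""
    if n_hat <= 0:
        raise ValueError("n_hat must be positive")
    proportional_overrun = max(0, n_actual - n_hat) / n_hat
    # Integer comparison for the >10% overrun test avoids float rounding at the boundary.
    hard_penalty = beta_hard if 10 * n_actual > 11 * n_hat else 0.0
    return alpha * quality - (1 - alpha) * proportional_overrun - hard_penalty
```

With α = 0.5, Q = 0.8, and N_hat = 100: a 100-token response scores 0.4, a 105-token response scores 0.375 (proportional penalty only), and a 120-token response scores 0.4 − 0.1 − 1.0 = −0.7 once the hard penalty applies.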
This requires access to model weights (open-weight models: Llama, Qwen, Mistral) and is a longer-term research direction. The production MVP should use system-prompt budget injection only.
δ_idle (idle threshold):
Start: 500ms
Tune: Grid search over {200, 350, 500, 750, 1000}ms on validation set
Criterion: Maximize Corr(t₂/t₁, ΔQ); the idle ratio should predict the quality gap
δ_paste (paste time threshold):
Start: 2000ms
Tune: Validate against ground-truth paste/type labels from pilot study
Criterion: F1 score of ρ_paste binary classifier
safety_margin:
Start: 1.20
Tune: Minimize ε_overtrim while keeping Δ_tokens > 0.10
Criterion: ε_overtrim ≤ 0.05 with maximum achievable Δ_tokens
N_lower (minimum budget):
Start: 50 tokens (floor for any response to be coherent)
Tune: Task-specific. Code generation: 100. Factual QA: 20. Summary: 150.
Criterion: P(response is complete | N_hat = N_lower) > 0.90
safety_margin multiplier for low-confidence decisions:
Start: 1.5 (applied when confidence < 0.6)
Tune: Hold fixed initially, revisit after 3 months of data
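
The starting values above can be collected into a single configuration object, together with the step that turns a raw prediction into a final budget. This is a sketch under assumptions: `GovernorConfig` and `finalize_budget` are hypothetical names, the task keys are illustrative, and the low-confidence multiplier is assumed to replace the default safety margin rather than stack on top of it (the text does not say which).

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class GovernorConfig:
    delta_idle_ms: float = 500.0
    delta_paste_ms: float = 2000.0
    safety_margin: float = 1.20
    low_confidence_margin: float = 1.5
    low_confidence_threshold: float = 0.6
    n_lower_default: int = 50
    # Task-specific floors from the tuning notes; keys are illustrative.
    n_lower_by_task: Dict[str, int] = field(
        default_factory=lambda: {"code": 100, "factual_qa": 20, "summary": 150}
    )


def finalize_budget(
    raw_prediction: float,
    confidence: float,
    task: str,
    cfg: GovernorConfig = GovernorConfig(),
) -> int:
    """Apply the safety margin and the N_lower floor to a raw token prediction."""
    # Assumption: the low-confidence margin replaces the default margin.
    low_conf = confidence < cfg.low_confidence_threshold
    margin = cfg.low_confidence_margin if low_conf else cfg.safety_margin
    floor = cfg.n_lower_by_task.get(task, cfg.n_lower_default)
    # round() guards against float noise such as 119.99999999999999 before ceil.
    return max(floor, math.ceil(round(raw_prediction * margin, 6)))
```

A raw prediction of 100 tokens for a code task becomes 120 at normal confidence and 150 at confidence below 0.6; a raw prediction of 10 tokens for a summary is lifted to the 150-token floor.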
Yes, the hypothesis is measurable. Every component of it maps to a measurable quantity:
| Claim | Measurable Variable | How |
|---|---|---|
| "Track time prompt is written" | t₁, t₂, t₃ | Client-side keystroke logger |
| "Model decides response length" | N̂_max | Budget predictor f(timing) |
| "Fewer tokens generated" | Δ_tokens | Token count pre/post governor |
| "Lower cost" | ΔC | Δ_tokens × C_token × volume |
| "Quality preserved" | ΔQ | LLM judge + human eval |
| "Governor overhead is negligible" | L_governor | Profiling the prediction pipeline |
Yes, the hypothesis is falsifiable. It is falsified, fully or partially, if ANY of the following occur:
Falsification Condition 1 (signal absent):
Pearson r(t₃, N_baseline) ≤ 0.15 on N ≥ 10,000 queries
→ Timing features carry no signal about required response length
→ The entire premise collapses. The governor is guessing.
Falsification Condition 2 (budget useless):
Δ_tokens ≤ 0.05 under the governor
→ The predicted budgets are too conservative to have any effect
→ Re-examine safety_margin and N_lower settings; if still failing, abandon.
Falsification Condition 3 (quality unacceptable):
Mean ΔQ > 0.10 on a balanced eval set
→ The governor degrades quality too much to be deployable
→ The trade-off point does not exist at a useful operating point
Falsification Condition 4 (L_p alone is sufficient):
R²(N_baseline | L_p alone) ≥ R²(N_baseline | all timing features)
→ Prompt length already captures all predictive signal
→ Timing is unnecessary complexity; use L_p-only predictor instead
→ The hypothesis is not wrong, but timing is redundant.
→ This is a PARTIAL falsification — cost reduction still works, but
the innovation (timing signal) is not the mechanism.
The hypothesis is definitively proven wrong if, on a properly-powered study (N ≥ 5,000, balanced strata), both of the following hold simultaneously:
Corr(t₃, N_baseline) ≤ 0.15 [no signal]
AND
Δ_tokens under any parametrization ≤ 0.05 [no savings]
A single failure (signal absent but savings achievable through a different mechanism, OR savings negligible but signal present) represents a partial failure and warrants a redesigned mechanism.
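
The falsification conditions can be checked mechanically once the study metrics are available. The sketch below is an illustration, not part of the original protocol: all function names and the "FC1" to "FC4" labels are hypothetical, the thresholds are copied from the four conditions as written (including the signed, not absolute, correlation), and the signal test uses Condition 1's ≤ 0.15 threshold throughout.

```python
import math
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def triggered_falsifications(
    r_t3: float,
    delta_tokens: float,
    mean_dq: float,
    r2_lp_only: float,
    r2_all_timing: float,
) -> List[str]:
    """Return the labels of every falsification condition that fires."""
    triggered = []
    if r_t3 <= 0.15:
        triggered.append("FC1")  # signal absent
    if delta_tokens <= 0.05:
        triggered.append("FC2")  # budget useless
    if mean_dq > 0.10:
        triggered.append("FC3")  # quality unacceptable
    if r2_lp_only >= r2_all_timing:
        triggered.append("FC4")  # L_p alone is sufficient (partial falsification)
    return triggered


def is_definitively_falsified(triggered: Sequence[str]) -> bool:
    """Definitive failure requires both no signal and no savings."""
    return "FC1" in triggered and "FC2" in triggered
```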
The following table defines the deployment gate. All conditions in the "Deploy" column must be met on the test set.
| Metric | Reject | Revisit | Deploy |
|---|---|---|---|
| Δ_tokens | < 0.05 | 0.05–0.15 | ≥ 0.15 |
| Mean ΔQ | > 0.10 | 0.05–0.10 | ≤ 0.05 |
| ε_overtrim | > 0.10 | 0.05–0.10 | ≤ 0.05 |
| Governor latency (p99) | > 20ms | 5–20ms | ≤ 5ms |
| Human eval preference vs baseline | < 45% prefer governed | 45–50% | ≥ 50% (non-inferior) |
| Out-of-distribution ΔQ degradation | > 0.15 | 0.08–0.15 | ≤ 0.08 |
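
The gate in the table can be expressed as a small classifier. This is a sketch, not a prescribed implementation: the metric keys and function names are hypothetical, only the Reject and Deploy bounds are taken from the table, and any value between those two bounds is treated as Revisit.

```python
from typing import Dict, Tuple

# metric -> (better direction, reject bound, deploy bound); keys are hypothetical.
GATE: Dict[str, Tuple[str, float, float]] = {
    "delta_tokens": ("higher", 0.05, 0.15),
    "mean_dq": ("lower", 0.10, 0.05),
    "eps_overtrim": ("lower", 0.10, 0.05),
    "latency_p99_ms": ("lower", 20.0, 5.0),
    "human_pref": ("higher", 0.45, 0.50),
    "ood_dq_degradation": ("lower", 0.15, 0.08),
}


def classify_metric(name: str, value: float) -> str:
    """Map one metric value to 'reject', 'revisit', or 'deploy'."""
    direction, reject_bound, deploy_bound = GATE[name]
    if direction == "higher":
        if value < reject_bound:
            return "reject"
        return "deploy" if value >= deploy_bound else "revisit"
    if value > reject_bound:
        return "reject"
    return "deploy" if value <= deploy_bound else "revisit"


def gate_decision(metrics: Dict[str, float]) -> str:
    """Overall verdict: any reject rejects; deploy requires every metric to pass."""
    missing = set(GATE) - set(metrics)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    verdicts = [classify_metric(name, metrics[name]) for name in GATE]
    if "reject" in verdicts:
        return "reject"
    return "deploy" if all(v == "deploy" for v in verdicts) else "revisit"
```

Requiring every metric to be present makes a missing measurement an error rather than a silent pass, which matches the "all conditions must be met" wording of the gate.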
If all Deploy conditions are met on the test set, and if the governor service passes latency and fault-tolerance requirements, the system should be rolled out in three stages:
- A shadow deployment first (logs predictions but does not apply them; 2 weeks)
- A 5% traffic split A/B test (4 weeks, monitoring all metrics)
- Full rollout with real-time monitoring and the auto-increase safety_margin rule active
The author's note from chatbot-token-reduction is apt: "This could be a step towards custom GPTs." A successful token governor is the first layer of a user-adaptive LLM system — one that infers context from behavioral signals beyond the text of the prompt itself.
The author's notebook states: "Fewer tokens → lesser cost & higher performance." The two halves of this claim need to be examined separately.
Cost reduction: Correct and trivially true. Fewer output tokens = fewer compute FLOPs = lower API cost. This is an accounting identity, not an empirical claim.
"Higher performance": This is ambiguous and in its naive reading is FALSE. A shorter response is not inherently better. The correct interpretation, consistent with the experimental evidence from TALE (Han et al., 2024) and L1/LCPO (Aggarwal & Welleck, 2025), is:
Current LLM reasoning processes are unnecessarily lengthy for a substantial fraction of queries. Compressing them to the minimum sufficient budget does not degrade quality on those queries, and may slightly improve it (by reducing rambling, hallucination-prone padding, and circular reasoning). On queries where the baseline is already at the minimum sufficient length, compression degrades quality.
The claim is therefore conditional, not universal: for over-generated responses, quality is preserved or improved; for responses already at the minimum sufficient length, quality degrades. The governor's job is to identify and target only the former.
This is not a theoretical weakness of the hypothesis — it is an empirical question about the base rate of over-generation, which TALE estimates at 40–60% of reasoning traces on standard benchmarks.
| This work | Literature analog | Key difference |
|---|---|---|
| Timing-conditioned N_max | TALE token budget in prompt (Han et al., 2024) | TALE uses static per-query budget; this uses behavioral signal to set it dynamically |
| Token governor as control input | Control Theory of LLM Prompting (Bhargava et al., 2024) | Bhargava et al. study reachability; this work applies the control input pragmatically |
| RL fine-tuning extension | L1/LCPO (Aggarwal & Welleck, COLM 2025) | L1 uses user-specified length; this predicts length from timing signal |
| Budget predictor regression | Token-Budget-Aware LLM Reasoning (TALE; Han et al., 2024) | TALE uses complexity estimation; this uses interaction behavior |
| Paste vs. typed classification | Novel (no direct analog) | Closest: intention detection in query understanding literature |
The key novelty of the hypothesis, relative to the work cited above, is the use of client-side interaction timing as a behavioral proxy for query complexity. None of the cited prior work uses this signal. It is a low-cost, always-available signal that requires no model inference and no explicit user labeling. If it carries predictive power (that is, if Falsification Condition 1 is not triggered), it is a genuinely novel contribution.
End of pipeline document. Next step: implement the simulation (Section 3.3), run the correlation study (Experiment 1), and report Corr(t₃, N_baseline) before committing to full experimental infrastructure.