No transformer. No backpropagation. No gradients anywhere.
A working proof-of-concept for the thesis that learning is compression (MDL / Solomonoff): the shortest reusable description of the data is also the one that generalizes furthest beyond it. This is the single most relevant principle for a low-data regime like the BabyLM Challenge (10M–100M words), where the bottleneck is sample efficiency, not compute.
python3 solon.py # pure stdlib, runs in ~1s on a laptop
A standard LM stores knowledge in billions of weights, updated slowly by gradient descent, and cannot acquire a new word in one exposure. A child can. SOLON drops the neural net entirely and learns the way a compressor does — by counting and refactoring — yielding four behaviours, each a direct consequence of "shorter code = better model":
| Stage | Mechanism | First principle |
|---|---|---|
| 1. Predict | Witten-Bell back-off model → calibrated bits per word | grammaticality = fewer bits; bits/word = a reading-time proxy |
| 2. Chunk | RePair: replace the most frequent pair with a new symbol | constituents are whatever shrinks the corpus |
| 3. Abstract | merge words that share (IDF-weighted) contexts into categories | the "dream" refactor — generalization manufactured offline |
| 4. Generalize | a new word slots into a category from one context | productivity (the "wug" effect) with zero retraining |
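As a rough illustration of stage 1, the sketch below implements an interpolated Witten-Bell bigram model and scores a sentence in bits. It is not taken from `solon.py`: the class name `WittenBellBigram`, the `vocab_size` parameter, and the uniform base distribution are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class WittenBellBigram:
    """Interpolated Witten-Bell bigram model (illustrative sketch only)."""

    def __init__(self, sentences, vocab_size=1000):
        self.uni = Counter()            # word -> count
        self.bi = defaultdict(Counter)  # previous word -> next-word counts
        self.vocab_size = vocab_size    # assumed size of the uniform base vocabulary
        for sent in sentences:
            toks = ["<s>"] + sent.split()
            for prev, word in zip(toks, toks[1:]):
                self.uni[word] += 1
                self.bi[prev][word] += 1
        self.total = sum(self.uni.values())

    def p_unigram(self, word):
        # Reserve mass proportional to the number of seen types and spread
        # it uniformly over the assumed vocabulary.
        types = len(self.uni)
        return (self.uni[word] + types / self.vocab_size) / (self.total + types)

    def p(self, word, prev):
        followers = self.bi.get(prev)
        if not followers:               # unseen history: back off completely
            return self.p_unigram(word)
        n = sum(followers.values())     # tokens seen after prev
        t = len(followers)              # distinct types seen after prev
        return (followers[word] + t * self.p_unigram(word)) / (n + t)

    def bits(self, sentence):
        toks = ["<s>"] + sentence.split()
        return sum(-math.log2(self.p(w, prev)) for prev, w in zip(toks, toks[1:]))
```

A minimal pair is then judged by comparing `lm.bits(good)` with `lm.bits(bad)`; the sentence with fewer bits is preferred.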
On a toy world with word order and subject–verb agreement, SOLON:
- induces clean categories from raw text (`N.sg`, `N.pl`, `V.sg`, `V.pl`, `ADJ`, `DET`), separated even by number, with no labels;
- judges BLiMP-style minimal pairs at 100% (grammatical = fewer bits);
- learns the nonce word `blicket` from a single sentence (`the blicket runs`), files it under `N.sg`, and then correctly judges agreement and word order on sentences it has never seen:
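| Minimal pair (grammatical > ungrammatical) | Back-off LM margin | SOLON margin |
|---|---|---|
| `the blicket sleeps .` > `the blicket sleep .` | +0.2 b | +11.0 b |
| `the blicket runs .` > `blicket the runs .` | +0.0 b (LM blind) | +20.9 b |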
The back-off model is blind on a novel word (no n-gram contains it) and can
only guess from base rates (~0-bit margin). SOLON answers structurally, because
blicket inherited an entire category's grammar from one exposure.
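The one-shot step can be sketched as follows, assuming categories have already been induced. The helper names (`context_profiles`, `assign_one_shot`, `class_bigrams`) and the use of a single (previous word, next word) frame are assumptions for illustration; SOLON's actual assignment rule may differ.

```python
from collections import Counter, defaultdict


def context_profiles(sentences, word2cat):
    """Count (previous word, next word) frames for each known category."""
    profiles = defaultdict(Counter)
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for prev, word, nxt in zip(toks, toks[1:], toks[2:]):
            if word in word2cat:
                profiles[word2cat[word]][(prev, nxt)] += 1
    return profiles


def assign_one_shot(sentence, new_word, profiles):
    """File new_word under the category that most often fills its frame."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    i = toks.index(new_word)
    frame = (toks[i - 1], toks[i + 1])
    scores = {cat: prof[frame] / sum(prof.values()) for cat, prof in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: frame never seen


def class_bigrams(sentences, word2cat):
    """Count adjacent category pairs; uncategorized words stand for themselves."""
    seen = Counter()
    for sent in sentences:
        cats = [word2cat.get(w, w) for w in sent.split()]
        seen.update(zip(cats, cats[1:]))
    return seen
```

Once `blicket` is added to `word2cat`, any sentence containing it maps to a category sequence the model has already counted, so no retraining pass is needed.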
This script is rungs 1–3 of a larger, deliberately gradient-free design:
- Prediction by compression ✔ (here: a back-off model; scales to PPM/CTW)
- A growing construction library ✔ (here: RePair; scales to MDL grammar induction / Bayesian Model Merging with variable slots)
- Distributional abstraction / "dreaming" ✔ (here: IDF-weighted complete-link clustering; scales to ADIOS-style equivalence classes)
- One-shot, test-time acquisition ✔ — learning and inference are the same operation, so the model never stops learning, exactly like a child.
The induced grammar is fully inspectable — you can print the categories and constructions it discovered. For a language-acquisition venue that interpretability is worth as much as the score.
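The chunking rung can be illustrated with a naive RePair loop. This is a quadratic-time sketch rather than the linear-time algorithm, and the function names `repair` and `expand` plus the `<R0>`-style symbol names are assumptions, not SOLON's API.

```python
from collections import Counter


def repair(tokens, min_count=2):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    seq = list(tokens)
    rules = {}  # new symbol -> (left, right)
    while True:
        pairs = Counter(zip(seq, seq[1:]))  # naive count, overlaps included
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        new_sym = f"<R{len(rules)}>"
        rules[new_sym] = pair
        out, i = [], 0
        while i < len(seq):  # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into the original tokens."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Printing `rules` lists every chunk the compressor introduced, which is the kind of inspectability described above.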
- Toy corpus. A synthetic mini-English. The mechanisms are real; the scale is not. Real text needs sub-word units (RePair seq now exposed + light tuple support in classes/grammar), a chart/Earley parser over the induced grammar (CKY bits implemented), and variable-slot constructions (not just flat categories; next).
- Agreement is captured via adjacent class bigrams. Long-distance dependencies (across embedded clauses) need the hierarchical / slot-binding parser — that is the next rung.
- Clustering threshold is tuned for this world (0.46, complete-linkage).
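For reference, here is a minimal sketch of complete-link clustering over IDF-weighted context vectors, using the 0.46 threshold mentioned above. The feature set (one word to the left and one to the right), the `log(N/df)` weighting, and cosine similarity are assumptions; the text only states that clustering is IDF-weighted and complete-linkage.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def context_vectors(sentences):
    """Build IDF-weighted left/right neighbour vectors for every word."""
    raw = defaultdict(Counter)
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for prev, word, nxt in zip(toks, toks[1:], toks[2:]):
            raw[word][("L", prev)] += 1
            raw[word][("R", nxt)] += 1
    df = Counter(f for vec in raw.values() for f in vec)  # words per feature
    n = len(raw)
    return {w: {f: c * math.log(n / df[f]) for f, c in vec.items()}
            for w, vec in raw.items()}


def cosine(u, v):
    dot = sum(x * v.get(f, 0.0) for f, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def complete_link(vectors, threshold=0.46):
    """Merge the closest pair of clusters until the weakest link drops below threshold."""
    clusters = [[w] for w in sorted(vectors)]
    while len(clusters) > 1:
        best, best_pair = threshold, None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: a pair is only as similar as its least similar members.
            link = min(cosine(vectors[x], vectors[y])
                       for x in clusters[a] for y in clusters[b])
            if link >= best:
                best, best_pair = link, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters
```

Complete linkage is conservative: one dissimilar member blocks a merge, which keeps categories such as singular and plural nouns from collapsing into each other.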
Now supported: `mdl=True` uses a cheap MDL proxy (class model cost + fit) to decide merges (see the code and the `solon_tinystories` call). Full joint MDL is future work.
Word-level tokens are blind to morphology — wug and wugs are unrelated
symbols. A character-level PPM (same Witten-Bell engine, order 6) learns the
shape of the language and inflects words it has never seen:
python3 solon_morphology.py # ~2s on 1.2M chars of TinyStories
- Wug test: for the fully novel stem `wug`, the regular plural `-s` costs 5.2 bits vs `-z` 28.5 / `-q` 29.5. For `blicket`, `-s` = 3.1 bits. The rule generalized; it even respects phonotactics (noun-like stems prefer `-s`, verb-like stems prefer `-ed`/`-ing`).
- Why sub-word: the word-level model scores `wugs` and `wugz` identically (both OOV → blind); the char model prefers the real plural by 23 bits.
- Held-out compression: 1.70 bits/char on unseen text.
- Honest limitation: a left-to-right char model learns how to pluralize but not when. Given "two ___" it keeps the shorter bare form, because the distant number cue is lost after backing off on the novel stem. Deciding to inflect lives in the word-level categories (`solon.py` §3). Form (sub-word) and when (word-level) are complementary; the full system needs both.
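A character-level suffix comparison of this kind can be sketched with a small interpolated Witten-Bell model. The class name `CharWittenBell`, the default order of 3 (the text uses order 6), and the `alphabet_size` base are assumptions chosen to keep the example short; this is not the code in `solon_morphology.py`.

```python
import math
from collections import Counter, defaultdict


class CharWittenBell:
    """Character n-gram model with recursive Witten-Bell interpolation (sketch)."""

    def __init__(self, text, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed uniform base alphabet
        self.counts = defaultdict(Counter)  # context string -> next-char counts
        for i, ch in enumerate(text):
            for k in range(order + 1):
                if i - k >= 0:
                    self.counts[text[i - k:i]][ch] += 1

    def prob(self, ch, context):
        p = 1.0 / self.alphabet_size  # order -1: uniform over the alphabet
        # Interpolate from the empty context up to the longest seen context.
        for k in range(min(self.order, len(context)) + 1):
            ctx = context[len(context) - k:]
            followers = self.counts.get(ctx)
            if not followers:
                break  # longer contexts are unseen as well
            n, t = sum(followers.values()), len(followers)
            p = (followers[ch] + t * p) / (n + t)
        return p

    def bits(self, string, context=""):
        """Code length of string, in bits, when it follows context."""
        total = 0.0
        for ch in string:
            total -= math.log2(self.prob(ch, context))
            context += ch
        return total
```

Comparing `model.bits("s ", context="two wug")` with `model.bits("z ", context="two wug")` gives a wug-style margin in bits; the actual values depend entirely on the training text.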
The two levels above each fail at one thing. The char model learns the form
of the plural (`-s`) but not when to apply it: given "two ___" it keeps the
shorter bare form (`solon_morphology.py` §4). The word level knows when
(number is licensed by "two") but is blind to a novel word's spelling.
`solon_fusion.py` fuses them into one factored code length:
P(number, surface | prev, stem) = P(number | prev) # WHEN (word-level)
· P(surface | number, stem) # FORM (char-level)
- WHEN is learned by counting: `P(+s marked | prev)` comes out as `two` → 0.94, `many` → 0.91, `the` → 0.11, `a`/`one` → 0.00. No labels.
- FORM is a realization distribution: given you're marking, the char model picks the allomorph (`dax` → `daxes`, not `daxs`), normalized so it doesn't pay the raw string-length penalty that made §4 keep the bare form. The productive plural is one rule, not re-derived per word.
- Number marking itself is bootstrapped unsupervised from the `+s`/`+es` alternation: `w` is "marked" iff `w = stem(+e)s` and the stem is also in vocab.
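The bootstrapping rule and the WHEN counts can be sketched in a few lines. The function names `stem_if_marked` and `when_model` are hypothetical, and the choice to normalize over every token that follows `prev` is an assumption; `solon_fusion.py` may restrict the denominator differently.

```python
from collections import Counter


def stem_if_marked(word, vocab):
    """Return the stem if word = stem + 'es' or stem + 's' and the stem is in vocab."""
    for suffix in ("es", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem and stem in vocab:
            return stem
    return None


def when_model(sentences):
    """Estimate P(+s marked | prev) by counting, with no labels."""
    tokens = [s.split() for s in sentences]
    vocab = {w for sent in tokens for w in sent}
    marked, total = Counter(), Counter()
    for sent in tokens:
        for prev, word in zip(sent, sent[1:]):
            total[prev] += 1
            if stem_if_marked(word, vocab) is not None:
                marked[prev] += 1
    return {prev: marked[prev] / total[prev] for prev in total}
```

Note that the same rule also marks third-person verbs such as `runs`, which is why the mechanism can be probed for verb agreement as well as noun number.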
python3 solon_fusion.py # needs tinystories-valid.txt
Result — context-sensitive inflection of words seen zero times:
wug -> a wug | two wugs
dax -> a dax | two daxes <- char model spells the allomorph
number-marking accuracy over 40 (novel stem x cue) decisions: 40/40 = 100%
Neither level alone can do this: char-only is context-blind and would misspell
the plural as `daxs`; word-only can't spell an unseen word at all. Honest limit:
the same mechanism only weakly predicts verb agreement. P(+s|she)=0.32 > P(+s|they)=0.00
is directionally right, but below 0.5 because high-frequency irregular verbs
(`was`, `had`, `went`) carry no `-s` and dilute the cue. Determiner number is
clean; pronoun-cued agreement is genuinely harder from raw counts.
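To make the factored code length concrete, the sketch below adds the WHEN cost and a normalized FORM cost and picks the cheapest surface. The WHEN probabilities 0.94 and 0.00 are the values reported above; the raw char-model costs for `daxes` and `daxs` are hypothetical numbers, and normalizing over the set of candidate marked forms is one plausible reading of "realization distribution", not a confirmed detail of `solon_fusion.py`.

```python
import math


def bits(p):
    """Code length in bits of an event with probability p."""
    return math.inf if p <= 0.0 else -math.log2(p)


def form_distribution(char_bits):
    """Normalize raw char-model code lengths over the candidate surfaces."""
    weights = {surface: 2.0 ** -b for surface, b in char_bits.items()}
    z = sum(weights.values())
    return {surface: w / z for surface, w in weights.items()}


def choose_surface(stem, p_marked, marked_char_bits):
    """Pick the surface minimizing WHEN bits + FORM bits."""
    form = form_distribution(marked_char_bits)
    # Bare form: the only unmarked realization, so its FORM cost is 0 bits.
    costs = {stem: bits(1.0 - p_marked)}
    for surface, p_form in form.items():
        costs[surface] = bits(p_marked) + bits(p_form)
    return min(costs, key=costs.get), costs


# Hypothetical raw char-model costs (bits) for two candidate plurals of "dax".
raw = {"daxes": 9.0, "daxs": 14.0}
after_two, _ = choose_surface("dax", 0.94, raw)  # P(+s | two) from the text
after_a, _ = choose_surface("dax", 0.00, raw)    # P(+s | a) from the text
```

Because FORM is normalized over candidates of the marked type only, the longer string `daxes` is not penalized against the bare stem; the choice between bare and marked is left entirely to the WHEN term.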
Is this novel? The ingredients aren't (factored language models,
two-level morphology, unsupervised morphology induction all exist). The
demonstration is: an unsupervised, backprop-free factored morphology model
that bootstraps number from the +s alternation and resolves novel-word
agreement that neither the sub-word nor the word level can alone — packaged as a
single MDL code length. It's a synthesis, honestly scoped, not a new algorithm
class. The rest of SOLON re-implements classic MDL/distributional acquisition
(RePair, ADIOS, Guo 2001) and is intended as a clean baseline.
- `solon.py`: core system (toy corpus, predictor, RePair, category induction, one-shot learner). Pure Python standard library.
- `solon_tinystories.py`: the same pipeline on ~1M words of real TinyStories.
- `solon_morphology.py`: character-level PPM; the wug test and morphology.
- `solon_fusion.py`: the form/when fusion; context-sensitive inflection of novel words.

Run `pip install tqdm` for progress bars (optional).
Swap `make_corpus()` for a loader over the strict-small 10M-word corpus, move
the predictor to character/sub-word PPM (robust to morphology; real wug tests),
and add a CKY parser so grammaticality uses minimum-description-length parses
rather than class bigrams (now available via `ConstructionGrammar(..., use_chart=True).bits`).
Clustering now supports `mdl=True` (MDL delta stopping: merge iff it shortens the approximate
description length; see `induce_classes(..., mdl=True)`). The RePair sequence is exposed for
sub-word experiments. The eval pipeline (BLiMP, EWOK, reading-time) drops in 2026; bits-per-word
is already the right currency for the reading-time fit. Run with larger `n_words` / `mdl` / chart
flags for scaled experiments (e.g. `python solon_tinystories.py ... 4000000`).