Lulzx/solon

SOLON — learning language by compression

License: MIT

No transformer. No backpropagation. No gradients anywhere.

A working proof-of-concept for the thesis that learning is compression (MDL / Solomonoff): the shortest reusable description of the data is also the one that generalizes furthest beyond it. This is the single most relevant principle for a low-data regime like the BabyLM Challenge (10M–100M words), where the bottleneck is sample efficiency, not compute.

python3 solon.py        # pure stdlib, runs in ~1s on a laptop

The idea

A standard LM stores knowledge in billions of weights, updated slowly by gradient descent, and cannot acquire a new word in one exposure. A child can. SOLON drops the neural net entirely and learns the way a compressor does — by counting and refactoring — yielding four behaviours, each a direct consequence of "shorter code = better model":

Stage | Mechanism | First principle
1. Predict | Witten-Bell back-off model → calibrated bits per word | grammaticality = fewer bits; bits/word = a reading-time proxy
2. Chunk | RePair: replace the most frequent pair with a new symbol | constituents are whatever shrinks the corpus
3. Abstract | merge words that share (IDF-weighted) contexts into categories | the "dream" refactor: generalization manufactured offline
4. Generalize | a new word slots into a category from one context | productivity (the "wug" effect) with zero retraining
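
As a rough illustration of Stage 1, the sketch below scores sentences in bits with a Witten-Bell interpolated bigram model. It is not taken from solon.py: the class name WittenBellBigram, the four-sentence corpus, and the decision to stop at bigrams are all assumptions made to keep the example short.

```python
import math
from collections import Counter, defaultdict


class WittenBellBigram:
    """Illustrative bigram model with Witten-Bell interpolation (hypothetical name)."""

    def __init__(self, sentences):
        self.uni = Counter()
        self.bi = defaultdict(Counter)
        for s in sentences:
            toks = ["<s>"] + s.split()
            for prev, w in zip(toks, toks[1:]):
                self.uni[w] += 1
                self.bi[prev][w] += 1
        self.n = sum(self.uni.values())
        self.vocab = len(self.uni) + 1  # one extra slot reserved for unseen words

    def p_uni(self, w):
        # Interpolate unigram counts with a uniform distribution over the vocabulary.
        t = len(self.uni)
        return (self.uni[w] + t / self.vocab) / (self.n + t)

    def p(self, w, prev):
        follow = self.bi.get(prev)
        if not follow:                 # unseen history: fall back to the unigram model
            return self.p_uni(w)
        c, t = sum(follow.values()), len(follow)
        # t = number of distinct words seen after prev (the Witten-Bell escape count)
        return (follow[w] + t * self.p_uni(w)) / (c + t)

    def bits(self, sentence):
        toks = ["<s>"] + sentence.split()
        return sum(-math.log2(self.p(w, prev)) for prev, w in zip(toks, toks[1:]))


lm = WittenBellBigram([
    "the dog runs .", "the dogs run .", "a dog sleeps .", "the cats sleep .",
])
good = lm.bits("the dog runs .")
bad = lm.bits("the dog run .")
print(f"margin in favour of the grammatical sentence: {bad - good:+.2f} bits")
```

The minimal-pair judgement is just a comparison of the two code lengths: the sentence that costs fewer bits is taken to be the grammatical one.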

What the demo shows

On a toy world with word order and subject–verb agreement, SOLON:

  • induces clean categories from raw text — N.sg, N.pl, V.sg, V.pl, ADJ, DET — separated even by number, with no labels;
  • judges BLiMP-style minimal pairs at 100% (grammatical = fewer bits);
  • learns the nonce word blicket from a single sentence (the blicket runs), files it under N.sg, and then correctly judges agreement and word order on sentences it has never seen:
                                               back-off LM      SOLON
the blicket sleeps . > the blicket sleep .        +0.2 b      +11.0 b
the blicket runs .   > blicket the runs .         +0.0 b      +20.9 b   (LM blind)

The back-off model is blind on a novel word (no n-gram contains it) and can only guess from base rates (~0-bit margin). SOLON answers structurally, because blicket inherited an entire category's grammar from one exposure.
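
A minimal sketch of that one-shot step, under stated assumptions: the categories are hand-written here rather than induced, the helper names (class_bigrams, assign, bits) are hypothetical, word-given-class emission costs are left out, and add-k smoothing stands in for whatever solon.py actually uses. The novel word is filed under the class that best fits its single left and right context, and sentences are then scored with class bigrams.

```python
import math
from collections import Counter

# Hand-written categories standing in for the induced ones (assumption).
CLASSES = {
    "the": "DET", "a": "DET",
    "dog": "N.sg", "cat": "N.sg", "dogs": "N.pl", "cats": "N.pl",
    "runs": "V.sg", "sleeps": "V.sg", "run": "V.pl", "sleep": "V.pl",
    ".": "END",
}
CORPUS = ["the dog runs .", "a cat sleeps .", "the dogs run .", "the cats sleep ."]


def class_bigrams(corpus, classes):
    """Count adjacent class pairs, with a START tag before each sentence."""
    counts = Counter()
    for s in corpus:
        tags = ["START"] + [classes[w] for w in s.split()]
        counts.update(zip(tags, tags[1:]))
    return counts


def assign(word, sentence, classes, counts):
    """File an unseen, non-final word under the class that best fits one context."""
    toks = sentence.split()
    i = toks.index(word)
    left = classes[toks[i - 1]] if i > 0 else "START"
    right = classes[toks[i + 1]]
    cats = sorted(set(classes.values()))
    return max(cats, key=lambda c: counts[(left, c)] * counts[(c, right)])


def bits(sentence, classes, counts, k=0.1):
    """Code length of the class sequence under add-k smoothed class bigrams."""
    tags = ["START"] + [classes[w] for w in sentence.split()]
    n_cats = len(set(classes.values()))
    total = 0.0
    for a, b in zip(tags, tags[1:]):
        row = sum(c for (x, _), c in counts.items() if x == a)
        total += -math.log2((counts[(a, b)] + k) / (row + k * n_cats))
    return total


counts = class_bigrams(CORPUS, CLASSES)
lexicon = dict(CLASSES)
lexicon["blicket"] = assign("blicket", "the blicket runs .", lexicon, counts)
print(lexicon["blicket"])
for good, bad in [("the blicket sleeps .", "the blicket sleep ."),
                  ("the blicket runs .", "blicket the runs .")]:
    margin = bits(bad, lexicon, counts) - bits(good, lexicon, counts)
    print(f"{good} > {bad}   {margin:+.1f} b")
```

Because the score depends only on classes, every sentence frame the category has already seen becomes available to the new word at once, with no counts involving blicket itself.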

How this maps onto the full architecture (SOLON, the design)

This script implements the first four rungs of a larger, deliberately gradient-free design (each marked ✔ below):

  1. Prediction by compression ✔ (here: a back-off model; scales to PPM/CTW)
  2. A growing construction library ✔ (here: RePair; scales to MDL grammar induction / Bayesian Model Merging with variable slots)
  3. Distributional abstraction / "dreaming" ✔ (here: IDF-weighted complete-link clustering; scales to ADIOS-style equivalence classes)
  4. One-shot, test-time acquisition ✔ — learning and inference are the same operation, so the model never stops learning, exactly like a child.
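
For the construction library in rung 2, a naive RePair loop can be sketched as follows. This is an assumption-laden illustration rather than the code in solon.py: the function names repair and expand and the `<k>` symbol format are invented here, the pair counts are recomputed from scratch on every pass (quadratic time), and overlapping pairs are counted without correction.

```python
from collections import Counter


def repair(tokens, min_count=2):
    """Repeatedly replace the most frequent adjacent pair with a fresh symbol."""
    seq = list(tokens)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < min_count:          # no pair repeats: nothing left to compress
            break
        sym = f"<{len(rules)}>"        # hypothetical naming scheme for new symbols
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):            # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(sym, rules):
    """Unfold a symbol back into the original tokens."""
    if sym not in rules:
        return [sym]
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)


tokens = "the dog runs . the dog sleeps . the cat runs .".split()
seq, rules = repair(tokens)
print(seq)
print(rules)
```

Each rule is a candidate chunk; expanding the compressed sequence reproduces the input exactly, so the rewrite is lossless.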

The induced grammar is fully inspectable — you can print the categories and constructions it discovered. For a language-acquisition venue that interpretability is worth as much as the score.

Honest limitations

  • Toy corpus. The training data is a synthetic mini-English: the mechanisms are real, the scale is not. Real text needs three things: sub-word units (the RePair sequence is now exposed, with light tuple support in classes/grammar), a chart/Earley parser over the induced grammar (CKY bit scoring is implemented), and variable-slot constructions rather than flat categories (next on the list).
  • Agreement is captured via adjacent class bigrams. Long-distance dependencies (across embedded clauses) need the hierarchical / slot-binding parser — that is the next rung.
  • Clustering threshold is tuned for this world (0.46, complete-linkage). As an alternative, mdl=True uses a cheap MDL proxy (class model cost + fit) to decide merges; see the code and the call in solon_tinystories.py. Full joint MDL is future work.
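
The threshold-based clustering mentioned in the last bullet can be sketched roughly as below. Everything specific is an assumption: the feature set (one left and one right neighbour), the IDF formula, cosine similarity as the measure, and the function names are illustrative choices, not a description of induce_classes.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def context_vectors(sentences):
    """IDF-weighted counts of left/right neighbours for every word type."""
    raw = defaultdict(Counter)
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for i in range(1, len(toks) - 1):
            raw[toks[i]][("L", toks[i - 1])] += 1
            raw[toks[i]][("R", toks[i + 1])] += 1
    df = Counter(f for vec in raw.values() for f in vec)  # word types per feature
    n = len(raw)
    return {w: {f: c * math.log(n / df[f]) for f, c in vec.items()}
            for w, vec in raw.items()}


def cosine(u, v):
    dot = sum(x * v.get(f, 0.0) for f, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def complete_link(vecs, threshold):
    """Merge clusters while the *least* similar cross pair stays above threshold."""
    clusters = [[w] for w in sorted(vecs)]

    def link(a, b):
        return min(cosine(vecs[x], vecs[y]) for x in a for y in b)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if best < threshold:
            break
        merged = clusters.pop(j)       # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


sentences = ["the dog runs", "the cat runs", "the dog sleeps",
             "the cat sleeps", "a dog runs", "a cat sleeps"]
clusters = complete_link(context_vectors(sentences), threshold=0.46)
print(clusters)
```

The 0.46 passed here is simply the value quoted above for the toy world; on other data it would need retuning, which is the motivation for an MDL-based stopping rule.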

Sub-word edition: morphology by compression (solon_morphology.py)

Word-level tokens are blind to morphology — wug and wugs are unrelated symbols. A character-level PPM (same Witten-Bell engine, order 6) learns the shape of the language and inflects words it has never seen:

python3 solon_morphology.py        # ~2s on 1.2M chars of TinyStories
  • Wug test — for the fully novel stem wug, the regular plural -s costs 5.2 bits vs -z 28.5 / -q 29.5. For blicket, -s = 3.1 bits. The rule generalized; it even respects phonotactics (noun-like stems prefer -s, verb-like stems prefer -ed/-ing).
  • Why sub-word — the word-level model scores wugs and wugz identically (both OOV → blind); the char model prefers the real plural by 23 bits.
  • Held-out compression: 1.70 bits/char on unseen text.
  • Honest limitation — a left-to-right char model learns how to pluralize but not when: given "two ___" it keeps the shorter bare form, because the distant number cue is lost after backing off on the novel stem. Deciding to inflect lives in the word-level categories (solon.py §3). Form (sub-word) and when (word-level) are complementary — the full system needs both.
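
To make the suffix comparison concrete, here is a small character-level model with recursive Witten-Bell interpolation. It is a sketch under assumptions, not solon_morphology.py: the class name CharWB, the order of 3 (the text above uses order 6), the 256-symbol uniform floor, and the tiny training string are all chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class CharWB:
    """Order-k character model, Witten-Bell interpolated down to a uniform floor."""

    def __init__(self, text, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size      # assumed size of the uniform floor
        self.counts = defaultdict(Counter)      # context string -> next-char counts
        for i, ch in enumerate(text):
            for k in range(order + 1):
                if i - k >= 0:
                    self.counts[text[i - k:i]][ch] += 1

    def _p(self, ch, ctx):
        # Recursively interpolate with the next-shorter context.
        lower = self._p(ch, ctx[1:]) if ctx else 1.0 / self.alphabet_size
        follow = self.counts.get(ctx)
        if not follow:                          # context never seen: back off fully
            return lower
        c, t = sum(follow.values()), len(follow)
        return (follow[ch] + t * lower) / (c + t)

    def p(self, ch, ctx):
        ctx = ctx[-self.order:] if self.order else ""
        return self._p(ch, ctx)

    def bits(self, s, ctx=""):
        """Code length of s when it follows ctx."""
        total = 0.0
        for ch in s:
            total += -math.log2(self.p(ch, ctx))
            ctx += ch
        return total


text = "the dogs run . the cats sleep . two dogs sleep . two cats run . "
model = CharWB(text, order=3)
for suffix in ("s ", "z "):
    print(repr(suffix), round(model.bits(suffix, ctx="two wug"), 1), "bits")
```

The stem wug never occurs in the training string, so the model backs off to shorter contexts that it has seen (here, the final g of dog), which is where the preference for -s comes from.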

Form/when fusion: context-sensitive morphology (solon_fusion.py)

The two levels above each fail at one thing. The char model learns the form of the plural (-s) but not when to apply it: given "two ___" it keeps the shorter bare form (solon_morphology.py §4). The word level knows when (number is licensed by "two") but is blind to a novel word's spelling. solon_fusion.py fuses the two into one factored code length:

P(number, surface | prev, stem) = P(number | prev)        # WHEN  (word-level)
                                · P(surface | number, stem) # FORM  (char-level)
  • WHEN is learned by counting — P(+s marked | prev) comes out as two→0.94, many→0.91, the→0.11, a/one→0.00. No labels.
  • FORM is a realization distribution: given you're marking, the char model picks the allomorph (dax → daxes, not daxs), normalized so it doesn't pay the raw string-length penalty that made §4 keep the bare form. The productive plural is one rule, not re-derived per word.
  • Number marking itself is bootstrapped unsupervised from the +s/+es alternation: w is "marked" iff w = stem(+e)s and the stem is also in vocab.
python3 solon_fusion.py        # needs tinystories-valid.txt

Result — context-sensitive inflection of words seen zero times:

wug      ->  a wug      | two wugs
dax      ->  a dax      | two daxes       <- char model spells the allomorph
number-marking accuracy over 40 (novel stem x cue) decisions: 40/40 = 100%

Neither level alone can do this: char-only is context-blind and would misspell the plural as daxs; word-only can't spell an unseen word at all. Honest limit: the same mechanism only weakly predicts verb agreement. The ordering P(+s|she)=0.32 > P(+s|they)=0.00 is directionally right, but P(+s|she) stays below 0.5 because high-frequency irregular verbs (was, had, went) carry no -s and dilute the cue. Determiner number is clean; pronoun-cued agreement is genuinely harder from raw counts.
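
The factored code length used in this section can be sketched as follows. The function names, the clamping constant eps, and the per-suffix bit costs passed to choose are hypothetical; in the real system the suffix costs would come from the character model, whereas here they are made-up numbers that only illustrate the normalization.

```python
import math
from collections import Counter


def is_marked(w, vocab):
    """w counts as number-marked iff w = stem + 's' or stem + 'es' and stem is in vocab."""
    return (w.endswith("es") and w[:-2] in vocab) or (w.endswith("s") and w[:-1] in vocab)


def p_marked_given_prev(sentences):
    """WHEN: estimate P(marked | previous word) by plain counting, with no labels."""
    vocab = {w for s in sentences for w in s.split()}
    marked, total = Counter(), Counter()
    for s in sentences:
        toks = s.split()
        for prev, w in zip(toks, toks[1:]):
            total[prev] += 1
            marked[prev] += is_marked(w, vocab)
    return {prev: marked[prev] / total[prev] for prev in total}


def choose(prev, stem, p_when, suffix_bits, eps=1e-6):
    """Pick the surface form with the shortest WHEN + FORM code length."""
    pm = min(max(p_when.get(prev, 0.5), eps), 1 - eps)   # clamp to avoid log(0)
    z = sum(2.0 ** -b for b in suffix_bits.values())     # normalize over allomorphs
    costs = {stem: -math.log2(1 - pm)}                   # bare form: unmarked
    for suffix, b in suffix_bits.items():
        p_form = 2.0 ** -b / z
        costs[stem + suffix] = -math.log2(pm) - math.log2(p_form)
    return min(costs, key=costs.get)


sentences = ["two dogs run", "two cats sleep", "a dog runs",
             "a cat sleeps", "the dog runs", "the dogs run"]
p_when = p_marked_given_prev(sentences)

# Hypothetical character-model costs (in bits) for each candidate suffix.
wug_bits = {"s": 3.0, "es": 9.0}
dax_bits = {"s": 9.0, "es": 3.0}
for stem, sb in (("wug", wug_bits), ("dax", dax_bits)):
    print(stem, "->", "a", choose("a", stem, p_when, sb),
          "| two", choose("two", stem, p_when, sb))
```

Normalizing the suffix costs over the candidate allomorphs is what removes the raw length penalty: the plural competes only against other plural spellings, while the decision to mark at all is paid for separately by the WHEN term.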

Is this novel? The ingredients aren't (factored language models, two-level morphology, unsupervised morphology induction all exist). The demonstration is: an unsupervised, backprop-free factored morphology model that bootstraps number from the +s alternation and resolves novel-word agreement that neither the sub-word nor the word level can alone — packaged as a single MDL code length. It's a synthesis, honestly scoped, not a new algorithm class. The rest of SOLON re-implements classic MDL/distributional acquisition (RePair, ADIOS, Guo 2001) and is intended as a clean baseline.

Files

  • solon.py — core system (toy corpus, predictor, RePair, category induction, one-shot learner). Pure Python standard library.
  • solon_tinystories.py — the same pipeline on ~1M words of real TinyStories.
  • solon_morphology.py — character-level PPM; the wug test and morphology.
  • solon_fusion.py — the form/when fusion; context-sensitive inflection of novel words. Run pip install tqdm for progress bars (optional).

Scaling to real BabyLM data

Swap make_corpus() for a loader over the strict-small 10M-word corpus and move the predictor to character/sub-word PPM, which is robust to morphology and allows real wug tests. Add a CKY parser so that grammaticality uses minimum-description-length parses rather than class bigrams; this is now available via ConstructionGrammar(..., use_chart=True).bits. Clustering now supports mdl=True, which stops on the MDL delta (merge iff it shortens the approximate description length; see induce_classes(..., mdl=True)). The RePair sequence is exposed for sub-word experiments. The eval pipeline (BLiMP, EWOK, reading-time) drops in 2026; bits-per-word is already the right currency for the reading-time fit. For scaled experiments, run with larger n_words and the mdl / chart flags (e.g. python solon_tinystories.py ... 4000000).

About

Learning language by compression — an MDL language learner with no transformer and no backprop. Predicts, chunks, induces grammatical categories, and learns new words in one shot.
