A Grammar of Bach

Notes on a Lambek type system induced from four hundred and thirteen chorale harmonizations.

· · ·

I. Premise

A Lambek grammar is a system for assigning types to words in such a way that the types of a sentence's words, concatenated and reduced according to a small calculus, yield a single distinguished type — by convention, the type of a complete sentence. The grammar is what licenses sentences as sentences. An inducer's job is to discover, given a corpus of acceptable sentences and a fixed inventory of primitive types, type assignments that make the corpus parse.
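The reduction step can be made concrete with the two application schemata that any Lambek system includes. The notation assumes Lambek's convention, in which A/B seeks a B on its right and A\B seeks an A on its left. The example types in the second line are hypothetical illustrations, not entries from the induced lexicon.

```latex
% Forward and backward application (the application fragment only;
% the full Lambek calculus derives more than these two schemata).
\[
A/B,\; B \;\Rightarrow\; A
\qquad\qquad
A,\; A\backslash B \;\Rightarrow\; B
\]
% Hypothetical instances using the primitives M and At1:
\[
M/\mathrm{At1},\; \mathrm{At1} \;\Rightarrow\; M
\qquad\qquad
M,\; M\backslash M \;\Rightarrow\; M
\]
```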

Bach's chorales provide an unusually clean corpus for an inducer that does not natively know about music. The four-voice texture is settled, the cadential vocabulary is constrained, and the corpus is bounded: four hundred and thirteen harmonizations are available in the music21 library, partitionable into 2,514 melodic phrases by rests of one beat or longer and by fermatas. The decision to treat the soprano's scale-degree sequence as the surface yield, and a phrase as a sentence, makes Bach a corpus and Lambek a tool.
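The segmentation rule stated above (break at rests of one beat or longer, and at fermatas) can be sketched in a few lines. This is a minimal illustration over a hypothetical `Event` record, not the project's actual music21 pipeline; the field names and the choice to let shorter rests pass without breaking the phrase are assumptions.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical soprano event; not a music21 type."""
    degree: Optional[int]   # scale degree 1-7, or None for a rest
    beats: float            # duration in beats
    fermata: bool = False   # True if the note carries a fermata


def split_phrases(events: List[Event], min_rest_beats: float = 1.0) -> List[List[int]]:
    """Split a soprano line into phrases of scale degrees.

    A phrase ends after a note with a fermata, or before a rest lasting
    at least `min_rest_beats`. Shorter rests are skipped (an assumption).
    """
    phrases: List[List[int]] = []
    current: List[int] = []
    for ev in events:
        if ev.degree is None:
            # A rest closes the phrase only if it is long enough.
            if ev.beats >= min_rest_beats and current:
                phrases.append(current)
                current = []
            continue
        current.append(ev.degree)
        if ev.fermata:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


line = [
    Event(1, 1), Event(2, 1), Event(3, 2, True),   # fermata closes phrase 1
    Event(None, 1),                                # full-beat rest, nothing open
    Event(5, 1), Event(4, 1), Event(None, 0.5),    # short rest does not split
    Event(3, 1), Event(2, 1), Event(1, 2, True),
]
print(split_phrases(line))  # [[1, 2, 3], [5, 4, 3, 2, 1]]
```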

The premise is then literal: induce a type system over scale degrees that licenses every observed phrase as parse-complete, and read the resulting type assignments to see what they say.

II. Method

The inducer wraps Google's CP-SAT constraint solver. Four primitive types are supplied by hand: M, the type of a complete melodic phrase, and three arrival points, At1, At3, and At5, for stability at tonic, mediant, and dominant respectively. The maximum derived-type depth is two; the solver runs with a 120-second timeout and a memory cap of 1.5 gigabytes. Phrases are filtered to lengths of three to ten notes, leaving 2,196 phrases over which the solver must succeed simultaneously.
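To give a sense of the search space, the sketch below counts the types available from four primitives at a given depth. It assumes that "depth" means the nesting level of the slash constructors and that both slashes may combine any two types of the previous level; the text does not spell out either point, and the actual inducer may prune this space further.

```python
from itertools import product

PRIMITIVES = ("M", "At1", "At3", "At5")


def types_up_to(depth: int) -> set:
    """All types whose slash-nesting depth is at most `depth`.

    Primitives are strings; a derived type is a tuple
    (slash, left, right) with slash in {"/", "\\"}.
    """
    level = set(PRIMITIVES)
    for _ in range(depth):
        nxt = set(PRIMITIVES)
        # Every ordered pair of existing types yields one "/" and one "\" type.
        for a, b in product(level, repeat=2):
            nxt.add(("/", a, b))
            nxt.add(("\\", a, b))
        level = nxt
    return level


for d in range(3):
    print(d, len(types_up_to(d)))
# 0 4
# 1 36      (4 + 2 * 4^2)
# 2 2596    (4 + 2 * 36^2)
```

Under these assumptions each of the seven scale degrees draws its type set from 2,596 candidates, which is why the induced five to nine types per degree count as a small selection.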

Concretely, each scale degree may be assigned a set of types of depth at most two. A phrase parses if some choice of types for its scale-degree tokens reduces, by application of the Lambek calculus, to M. The inducer then searches for an assignment under which every phrase in the filtered corpus parses. A constraint problem of this size does not fit in memory whole; the solver attacks it one phrase at a time, batched.
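The parse condition can be illustrated with a small chart-based checker. It is a sketch under stated assumptions: it uses only forward and backward application (a fragment of the Lambek calculus, so it may reject phrases the full calculus would accept), it follows Lambek's slash convention, and the toy lexicon is hypothetical rather than the induced one. It also says nothing about how the CP-SAT encoding itself is built.

```python
from itertools import product

# Type encoding (an assumption of this sketch):
#   primitive        -> "M", "At1", ...
#   A/B  (B on right) -> ("/", A, B)
#   A\B  (A on left)  -> ("\\", A, B)


def combine(left, right) -> set:
    """Types obtainable from two adjacent types by application."""
    out = set()
    # Forward application: A/B, B => A
    if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
        out.add(left[1])
    # Backward application: A, A\B => B
    if isinstance(right, tuple) and right[0] == "\\" and right[1] == left:
        out.add(right[2])
    return out


def parses(phrase, lexicon, goal="M") -> bool:
    """True if some choice of lexical types reduces the phrase to `goal`."""
    n = len(phrase)
    chart = {}  # chart[(i, j)] = types derivable for phrase[i:j]
    for i, degree in enumerate(phrase):
        chart[(i, i + 1)] = set(lexicon[degree])
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):
                for l, r in product(chart[(i, k)], chart[(k, j)]):
                    cell |= combine(l, r)
            chart[(i, j)] = cell
    return goal in chart.get((0, n), set())


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {
    1: {"At1"},
    2: {("/", "M", "At1")},    # M/At1: needs an arrival on 1 to its right
    3: {("/", "M", "M")},      # M/M: prefixes a complete phrase
    5: {("\\", "M", "M")},     # M\M: continues a complete phrase
}

print(parses([3, 2, 1], toy_lexicon))  # True
print(parses([2, 1, 5], toy_lexicon))  # True
print(parses([1, 2], toy_lexicon))     # False
```

The inducer's problem is the inverse of this check: rather than testing phrases against a given lexicon, it searches for a lexicon under which the check succeeds for every phrase in the filtered corpus.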

The filtered corpus is diatonic: scale degrees one through seven, with no chromatic alterations surviving in the phrases the solver sees. The lexicon contains seven words and seven only.

III. The Lexicon

The corpus's first arithmetic facts are themselves illuminating, and bear stating before any type is induced. Bach's soprano lines are not uniformly distributed across the seven scale degrees, and the shape of that distribution is part of the structure the inducer must accommodate.

degree     share
do  · 1    17.7%
re  · 2    17.1%
mi  · 3    19.1%
fa  · 4    13.3%
sol · 5    18.4%
la  · 6     7.2%
ti  · 7     7.2%

Figure 1. Distribution of soprano scale degrees across 22,017 notes; ordered by scale degree rather than by share. Mediant first, leading-tone and submediant tied last.

Mi is the most common surface token; fa, despite its weight in cadential theory, is comparatively scarce; la and ti are the rare two. The inducer's constraint problem is, in part, to find a small set of types per scale degree that copes with this skew without any one degree carrying the entire grammar.

Most inducers chew through several thousand types over several thousand lexical items. Here the inducer must squeeze every observed phrase through a vocabulary of seven, and the structure has to come not from dictionary growth but from polysemy: a single scale degree, at different points in different phrases, takes different types from a small finite set. That is the project's actual question. How polysemic is each scale degree?

IV. What the Solver Found

The induced grammar assigns between five and nine distinct types per scale degree — a tight band. Three findings stand out.

Sol is the structurally maximal scale degree. It is the only degree whose type-set carries M\M, the type of phrase continuation, and it reaches every primitive with nine distinct types, more than any other degree. It can either stand or extend; its grammatical role is connector, in the strong sense.

Do, the tonic, functions less as a stable arrival point than as a connector. This is the result that should surprise music theorists, or rather, the result that quietly confirms a long-standing observation: the tonic in Bach's chorales is more often passed through than arrived at.

Fa, alone, cannot end a phrase. There is no type assignment, in any solution returned, that licenses a phrase whose terminal lexical item is the subdominant in isolation. Theorists know this as a tendency. The inducer, given no music theory, renders it as a hard constraint: it finds no such license.

These are not music-theoretical claims dressed in mathematics. They emerge from a constraint solver to which no musical knowledge has been supplied beyond the four primitive labels and the locations of phrase boundaries. The solver finds them because the corpus enforces them.

V. Why This Might Matter

Two angles, briefly. Music-theoretically, the Riemannian apparatus of tonic, subdominant, and dominant function has a long history of phenomenological description. A type-theoretic restatement — fa is the scale degree with no terminal license — is unusually crisp. It also opens an obvious follow-up: induce a type system on a corpus from a different repertoire, Palestrina, say, or Gesualdo, and read off where the lexicon's connector-and-arrival structure differs.

Linguistically, Lambek grammars are a categorial framework whose original target was natural language. Their applicability to a domain with no obvious semantic content — only structural well-formedness — is a low-grade vindication of the formalism's claim to be a theory of combinatorial possibility, not of meaning. The chorales parse for the same reason English sentences parse: the categories have to fit.

This is the sort of result that wants a paper, not a postcard. The present document is an interim report.

· · ·

Project: lambek-type-induction

Corpus: 413 BWV chorale harmonizations · music21 library

Solver: Google CP-SAT, batched per phrase

Calculus: Lambek, primitives M / At1 / At3 / At5

Set in: Iowan Old Style, with Iosevka for the colophon

Findings: Daniel Wymark

Prose: Claude (Anthropic), from D.W.'s materials

Status: Complete

Drafted: Generated 5/1/2026, fact-checked and published 5/16/2026