Editor's note

An earlier article, Every Upload Is a Decision, followed a track through the filters that decide whether it enters the CORPUS library. It left one part for later: what happens when the system tries to describe the music users have uploaded. That layer, the semantic layer, is the subject here. It began with a single word we knew would be difficult, long before any line of code was written.

Executive Summary

CORPUS set out to give every track in its library a deep, objective description. The difficulty sat where we expected it, in a single word: what is an objective description of a piece of music? In practice, trained listeners converge on a description far more than the theory predicts. Musicologist Michael Emanuel Bauer resolved the problem by triangulation: describe each track from four perspectives (what the music is, what it does, where it belongs, and how it would function as film music), and objectivity emerges from the overlap. The result was a written method for human annotators, and that method is what made automation possible: the quality of a description follows from the quality of the question. The long-term payoff is in generation: a corpus described this deeply is what will let a trained model be steered precisely, rather than generating by lottery, in settings where there is no second try. That corpus does not yet exist to train on, so the first results have come in search and discovery, where the same depth turns a taxonomy of fixed tags into a described space that can be searched in plain language. That is the foundation our search and exploration technology is built on.

The description we wanted, and the word at its center

We wanted every track in the library to carry a real description: more than three tags and a tempo, a deep and thorough account of the music, written down and consistent across the whole catalogue. The motivation was, and still is, control over what a model trained on the corpus can do. A model learns from the annotations attached to its training data, and the depth of those descriptions sets the ceiling on how precisely it can be steered. Train on genres and a handful of moods, and generation becomes a lottery. That is tolerable when a person sits at a screen choosing among a few suggestions, and it fails in the human-machine settings that are the point of CORPUS, where there is no second try and the first output has to land.

The ambition turned on a single word, and we knew from the start it would be the hard part. Objective. What is an objective description of a piece of music?

The question is old and looks unanswerable. Music is felt before it is analyzed, two listeners hear two different things, and the same track means one thing at a wedding and another at a funeral. By that reasoning an objective description is a category error, and the field has mostly treated it as one: measure what can be measured, the tempo, the key, the instrumentation, and leave the rest to taste.

Practice tells a different story. Put a group of trained listeners in a room with a track and ask them what it is, and they tend to converge on the genre, on the mood, on what the music is doing and where it would belong, more than one would think. The disagreement at the edges is real, but the shared core is large and stable. The individual, incommunicable listening experience is a smaller part of the whole than the romantic account of music suggests.

The objection to objectivity is strong in the abstract and weak in the room.

[Image: abstract illustration on warm sepia paper. A dark gestural landscape stretches across the middle, overlaid with white concentric circles, dotted lines, points, and tiny notations, a natural scene being mapped diagrammatically.]
The same scene held in more than one register at once. Image: Midjourney

Sending the question to a musicologist

What we were setting out to automate, then, was something musicology itself has never settled. There is no agreed standard for what an objective description of a piece of music should contain, which is why the question went to musicology before it went to engineering. Before a single feature can be extracted from audio, someone has to define what a good description of music actually contains, and why.

We handed it to Michael Emanuel Bauer, the author of Nothing is Original (Wolke) and a long-standing interlocutor of ours on questions of authorship and originality. We asked him whether the convergence we kept seeing could be turned into a method: a way for a group of musicology students to describe music systematically and comparably, so that two annotators working on two different tracks produce descriptions that can be read side by side and trusted to mean the same things.

Four perspectives, one description

Bauer’s solution to the objectivity problem was to stop chasing a single objective vantage point and to triangulate instead. A description in his method approaches a track from four perspectives, and every description CORPUS generates still carries all four.

What the music is. The sonic facts: instrumentation, the tempo and what it implies (104 BPM read as Moderato, a moderate and steady pace), the key, the vocal character, the production technique, the timbral surface. This is the layer closest to measurement, written as prose rather than a list.

What the music does. The perceptual and emotional reading: how it feels, what it evokes, what kind of character or situation it implies. This is interpretation, but disciplined interpretation. It is the part most people assume cannot be objective, and the part the method works hardest to stabilize.

Where the music is at home. Its contexts: the settings it suits, the place in a playlist where it belongs, the room it would sound right in, whether that is a late-night drive, a quiet café, or a documentary about a place.

How the music would function as film music. The track read the way a music supervisor reads it when pitching it to a director: which scene it could carry, which character moment, which narrative turn, played in the background of a room or laid under the action.

No single one of these is the objective description. Objectivity is what emerges from their overlap. A track described four times, from four disciplined angles, gets pinned down the way a position is fixed by triangulation: each line is partial, and the intersection is firm.
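The four perspectives can be pictured as a record with four required fields. The following is a minimal sketch, not the CORPUS schema: the class name, the field names, and the completeness check are all hypothetical, chosen only to mirror the four perspectives listed above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrackDescription:
    """One track described from the four perspectives named in the article.

    All names here are hypothetical; the article does not publish a schema.
    """
    what_it_is: str        # sonic facts: instrumentation, tempo, key, timbre
    what_it_does: str      # perceptual and emotional reading
    where_it_belongs: str  # contexts: settings, playlists, rooms
    as_film_music: str     # how a music supervisor would pitch it

    def missing_perspectives(self) -> list[str]:
        """Return the names of perspectives that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_complete(self) -> bool:
        """A description counts as complete only when all four are filled in."""
        return not self.missing_perspectives()


# A half-finished draft: two perspectives written, two still empty.
draft = TrackDescription(
    what_it_is="Piano, layered vocals, electric guitar; opens around 71 BPM.",
    what_it_does="Introspective and questioning, building to theatrical.",
    where_it_belongs="",
    as_film_music="",
)
print(draft.missing_perspectives())
```

Treating all four fields as required reflects the point of the method as described here: a description with only one or two angles filled in has nothing to intersect with.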


From a human method to a machine that follows it

What Bauer delivered was a method, in effect a work instruction for human annotators. That artifact is what made automation possible, and it is where the value sits. A written, systematic specification of what a description must contain is something that can be applied at scale. Once the technology was capable enough to apply the method consistently across thousands of tracks, the pipeline became real. The order matters: the method came first, the automation second.

The pipeline runs on our own servers in Germany, and the music never leaves them. What is proprietary is the architecture that turns the method into a pipeline at catalogue scale. The line we keep returning to: the quality of the output follows from the quality of the question.

[Image: a desk in black and white, with headphones resting on a vinyl record, sheet music, handwritten pages of notes, and a hand-drawn waveform spread across the working surface.]
A method written down, before the pipeline could run it at scale. Image: Midjourney

What a description sees that a tag cannot

The clearest case for a description over a tag is a track that is really many tracks. “Bohemian Rhapsody” moves from a piano ballad through a mock-operatic passage into hard rock and back to a quiet coda. A single genre tag and a single tempo cannot represent it, because any one number is wrong for most of the piece. Our description reads it as what it is: a multi-part structure that opens slow and contemplative around 71 BPM, accelerates into a fast operatic section, then settles into a driving rock groove, with each section carrying a different musical function in the whole. A tag names the track. The description follows its shape, and tells, in order, what happens. It does not just classify the music; it narrates it.

Here is the first of its four sections, unedited:

What the engine wrote: the first of four sections for "Bohemian Rhapsody"

This is a multi-part rock ballad with a dramatic narrative arc, beginning with a gentle piano introduction and evolving into a powerful, layered rock anthem. The piece opens with a soft, melancholic piano melody and a male vocal singing in a reflective, almost fragile tone. The mood is introspective and questioning, with lyrics that explore themes of identity, mortality, and existential doubt. As the song progresses, it builds in intensity, introducing a full band with electric guitars, bass, and drums, culminating in a climactic, operatic section with choral harmonies and a soaring guitar solo. The tempo is initially slow and deliberate, around 71 BPM (Andante), creating a somber, contemplative atmosphere. Later, the piece shifts into a faster, more driving section at 151 BPM (Vivace), and then transitions again into a mid-tempo rock groove at 138 BPM (Allegro). The instrumentation is rich and varied, featuring piano, electric guitar, bass, drums, and layered vocals. The vocal performance is highly expressive, ranging from intimate and vulnerable to powerful and theatrical. The overall mood is emotional, dramatic, and epic, with a strong sense of theatricality and narrative progression.
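The passage above pairs each BPM figure with an Italian tempo term. How the engine arrives at those terms is not documented here, so the following is only a sketch of one way such a reading could be produced. The band boundaries are hypothetical: published tempo charts disagree with one another, and these cut-offs were picked solely so that the readings quoted in this article (71 as Andante, 104 as Moderato, 138 as Allegro, 151 as Vivace) come out the same way.

```python
from bisect import bisect_right

# Hypothetical band boundaries in BPM. A tempo below 66 reads as Adagio,
# 66 up to (but not including) 98 as Andante, and so on. These values are
# illustrative assumptions, not a standard and not CORPUS's own thresholds.
UPPER_BOUNDS = [66, 98, 120, 145, 170]
TERMS = ["Adagio", "Andante", "Moderato", "Allegro", "Vivace", "Presto"]


def tempo_term(bpm: float) -> str:
    """Map a BPM value to the tempo term of the band it falls into."""
    if bpm <= 0:
        raise ValueError("BPM must be positive")
    return TERMS[bisect_right(UPPER_BOUNDS, bpm)]


def read_tempo(bpm: float) -> str:
    """Render a tempo the way the descriptions quote it, e.g. '71 BPM (Andante)'."""
    return f"{bpm} BPM ({tempo_term(bpm)})"


# The three section tempi named in the description of "Bohemian Rhapsody".
for bpm in (71, 151, 138):
    print(read_tempo(bpm))
```

Keeping the tempo as a list of per-section readings, rather than a single number, is what lets a description say that the piece opens slow, accelerates, and then settles.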

From a taxonomy to a described space

A tag set is a taxonomy: a fixed vocabulary, chosen in advance, into which every track has to be sorted. It is efficient, and it is closed. Whatever the vocabulary did not anticipate, it cannot express, and whatever falls between two tags falls out. Genres no one labelled, moods with no single word, the distance between “triumphant” and “triumphant with something heavy underneath,” all of it disappears into the nearest available box.

That nearest box is also where cultural difference gets flattened. We ran a set of well-known tracks through our pipeline and checked our results against those of the tagging services that lead the market today. Faced with a classic Afrobeat recording sung in Yoruba, one of them filed it as Latin, another as Jazz, with the language read as English; a Portuguese fado came back as Klezmer. A catalogue built this way has quietly decided that the world’s music is a set of variations on the categories its training data already knew. A description does better, though not by escaping the limit: our own model also knows only what its training taught it, and genuinely new music can slip past it. What a description avoids is the forced single label. It can set down what it perceives, the syncopated call-and-response of the horns, the Yoruba vocal, the highlife-and-jazz lineage of the rhythm, and describe the sound even when it cannot name the kind.

A description is open because it is written in language, and language can hold compound, unfamiliar, and in-between things. Once every track carries a description, the catalogue stops being a set of labelled bins and becomes a space that can be searched the way people actually look for music: by what they want it to do. A music supervisor does not think in tags. They think “a slow build, hopeful but not naive, that could carry a montage of someone rebuilding their life.” For most of the history of music catalogues that sentence had nowhere to go, because the catalogue was indexed by keyword and the request had to be translated down into keywords first. The descriptions close that gap. Search runs in plain language against the descriptions, so the query and the catalogue finally speak the same language.
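To make the idea of searching descriptions rather than tags concrete, here is a deliberately small sketch. It ranks descriptions against a plain-language query using bag-of-words cosine similarity, a simple stand-in technique chosen for illustration; it is not the search technology CORPUS runs, and the two-track catalogue, the track ids, and the function names are invented.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, catalogue: dict[str, str]) -> list[tuple[str, float]]:
    """Rank track ids by similarity between the query and each description."""
    q = tokenize(query)
    scored = [(track_id, cosine(q, tokenize(text)))
              for track_id, text in catalogue.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented two-track catalogue, for illustration only.
catalogue = {
    "track-a": "A slow build with a hopeful piano line, suited to a montage of rebuilding.",
    "track-b": "A driving rock groove with distorted guitars, suited to a chase scene.",
}
results = search("a slow build, hopeful but not naive, for a montage", catalogue)
print(results[0][0])
```

Word overlap is the crudest possible version of this: it cannot read "hopeful but not naive" the way a person does. The sketch shows only the shape of the interaction, a sentence in and ranked descriptions out, with no keyword translation step in between.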

What it is for

Two things can be built on a described catalogue. The first is what the descriptions were always for: training models that are predictable and steerable, for the human-machine settings where the first output has to land. That part is still ahead of us. We have to build the music library before we can train our own models, and we are only at the start.

The second is search and discovery, and it already works. The same descriptions that would train a model also make the catalogue findable. The person it helps most knows exactly what they need music for and has no idea which track it is: a film editor, a game designer, an advertising team on a deadline, facing a catalogue too large to listen through. For them, a described catalogue is the difference between finding the track and losing an afternoon to keyword guessing.

There is another reason this matters to us. A training dataset is usually a black box: thousands of tracks turned into model weights, never to be heard as themselves again. A described catalogue is the reverse. Every track stays visible, something a person can find, read, and play. That visibility is what a community can gather around, and it is why we are building Soundbook.

Soundbook is the product on top of all this: a place to meet the corpus and explore it as a creative process rather than a lookup. An early version is already running at intelligence.corpus.music.

An objective description, again

An objective description of music is something people make. Trained listeners describe the same track from enough angles until they converge, and that convergence, written down as a method, is what a machine can repeat.

It works because language can hold what a category cannot: the compound, the unfamiliar, the in-between. We did not take the human out of describing music; we wrote the method down clearly enough that it could run at scale.

Sources and notes on the comparison

The benchmark comparison. The track examples are drawn from CORPUS's own Music Intelligence benchmark analysis (February 2026). The results attributed to the other tagging services come from the Soundcharts benchmark "AI Music Analysis 2026" (Soundcharts, January 2026) and were checked by us. CORPUS results were generated independently on the same tracks. The set covers nine tracks and is illustrative; it is not a controlled evaluation, and no statistical claim is made from it.

The tracks referenced. The Afrobeat example is Fela Kuti, "Water No Get Enemy" (sung in Yoruba); the fado is Amália Rodrigues, "Uma Casa Portuguesa"; the multi-part composition is Queen, "Bohemian Rhapsody."

Related CORPUS writing. On the upload filters that precede the description layer, see Every Upload Is a Decision. On the markets the semantic layer ultimately serves, see The Wrong Debate. On originality and authorship, the questions Bauer's own work turns on, see Originality Is a Story We Keep Telling.