Editor's note

CORPUS has opened its contribution platform to a public beta. Musicians can upload original recordings, stems, and additional files like MIDI to a shared library designed for AI training — and receive compensation based on how their work enriches the dataset. The premise sounds simple. This article is about why it isn’t.

The deceptive simplicity of the task

The core idea behind the CORPUS contribution app can be stated in one sentence: musicians upload music, the system evaluates it, and their contributions earn them a stake in the revenue their work helps generate. Everything else follows from taking that sentence seriously.

We expected the hard problems to live in the scoring system: the question of how to measure quality, originality, and diversity. That is indeed hard, and we have written about it. But before we could even begin to evaluate contributions, a prior set of questions forced itself into view: questions about what happens between the moment a file leaves a contributor’s computer and the moment it enters the library.

These questions turned out to be more consequential and more revealing than we anticipated.


Filtering as a sequence of recognitions

The first question is elementary: is the uploaded file actually music?

This sounds trivial. It is not. People upload field recordings, corrupted audio, test files, spoken word, sound effects. A training corpus contaminated with non-musical material degrades what the models learn from and spends compute on silence. So we built a classifier that distinguishes music from non-music (including technically broken files), and it works well.

Except for noise music, a genre where distortion, feedback, and sonic extremity are not defects but the point. Several people on our team care deeply about this music. The classifier flags it: that is correct behavior from a signal-processing perspective and wrong behavior from a musical one.

This is not an edge case; it is the first instance of a pattern that repeats at every layer of the system: automated filters encode assumptions about what music is, and those assumptions will always be contested. The response cannot be to make the filter smarter. The response must be structural. In this case, a contributor whose work is flagged can appeal. If the appeal is legitimate (a noise artist, an experimental composer), we disable that specific filter for their account. The system adapts to the contributor, not the other way around.
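
The structural response can be sketched in a few lines. This is a minimal illustration, not the CORPUS implementation: the names `Contributor`, `run_filters`, `grant_appeal`, and the `music_score` signal are all hypothetical, and the threshold is made up. The point is only that the exemption lives on the account, while the filter itself stays unchanged for everyone else.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# A filter receives an upload (here a dict of precomputed signals)
# and returns True when it wants to flag that upload.
Filter = Callable[[dict], bool]


@dataclass
class Contributor:
    name: str
    # Filters switched off for this account after a successful appeal.
    disabled_filters: Set[str] = field(default_factory=set)


def run_filters(upload: dict, contributor: Contributor,
                filters: Dict[str, Filter]) -> List[str]:
    """Return the names of all filters that flag this upload."""
    flags = []
    for name, check in filters.items():
        if name in contributor.disabled_filters:
            continue  # appeal granted: skip this filter for this account
        if check(upload):
            flags.append(name)
    return flags


def grant_appeal(contributor: Contributor, filter_name: str) -> None:
    """Record a reviewer's decision to exempt the account from one filter."""
    contributor.disabled_filters.add(filter_name)


# Hypothetical filter: flags uploads whose (made-up) music score is low.
FILTERS: Dict[str, Filter] = {
    "non_music": lambda upload: upload["music_score"] < 0.5,
}

noise_artist = Contributor("noise_artist")
track = {"music_score": 0.2}  # a noise track the classifier scores as non-music

before = run_filters(track, noise_artist, FILTERS)
grant_appeal(noise_artist, "non_music")
after = run_filters(track, noise_artist, FILTERS)
```

In this sketch the same track is flagged before the appeal and passes afterwards, and no other account is affected.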

The same logic extends to every subsequent filter.

Illustration of a saxophone emitting a waveform that meets a stop sign, with an arrow pointing toward a star shape.
Music enters, the system intervenes. Illustration: Jan Stoewe

Duplicates. We do not want the same recording to appear twice. Detection runs on audio fingerprinting, not on titles or filenames, which can be changed to game the system. When a duplicate is found, the question immediately becomes: who is the rightful owner? The contributor may have uploaded their own track twice by accident. Or someone else uploaded it first. Both cases require resolution, so we built a dispute flow that routes conflicts to human review with full audit trails.
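
A rough sketch of that routing logic follows. It makes loud assumptions: the `fingerprint` function is a stand-in that hashes coarsely quantized samples, not a real acoustic fingerprint, and `FingerprintIndex`, `Decision`, and the status names are hypothetical. What it shows is that the filename plays no part in matching, that a collision is classified but not adjudicated, and that every submission is logged.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


def fingerprint(samples: List[int]) -> str:
    """Stand-in for an acoustic fingerprint (assumption, not a real one).

    It hashes coarsely quantized samples, so it only matches near-identical
    data. A production system would use a perceptual fingerprint that
    survives re-encoding.
    """
    coarse = bytes((s // 16) % 256 for s in samples)
    return hashlib.sha256(coarse).hexdigest()


@dataclass
class Decision:
    status: str          # "accepted", "self_duplicate", or "dispute"
    first_uploader: str  # who submitted this audio first


class FingerprintIndex:
    def __init__(self) -> None:
        self._first_uploader: Dict[str, str] = {}
        self.audit_log: List[Tuple[str, str, str, str]] = []

    def submit(self, uploader: str, filename: str,
               samples: List[int]) -> Decision:
        fp = fingerprint(samples)  # the filename is ignored for matching
        earlier = self._first_uploader.get(fp)
        if earlier is None:
            self._first_uploader[fp] = uploader
            decision = Decision("accepted", uploader)
        elif earlier == uploader:
            decision = Decision("self_duplicate", earlier)
        else:
            # Ownership is not decided here; the case goes to human review.
            decision = Decision("dispute", earlier)
        self.audit_log.append((uploader, filename, fp, decision.status))
        return decision


index = FingerprintIndex()
audio = [0, 120, -340, 560, -780, 900]
first = index.submit("alice", "song.wav", audio)
again = index.submit("alice", "song_final.wav", audio)
other = index.submit("bob", "my_track.wav", audio)
```

Note that being first in the index does not make someone the rightful owner; the sketch only records who was first and hands the conflict to a reviewer.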

Covers. Fingerprinting catches near-identical recordings. It does not catch a singer performing someone else’s song on an acoustic guitar. For a corpus that promises rights-cleared training data, this is a critical gap. Standard content identification systems (the kind that power YouTube’s Content ID) cannot help here, because the recordings are too different. We partnered with Aurismatic, a company whose technology was originally built to identify songs performed live at concerts, where the audio bears little resemblance to the studio version. The resulting prototype for our cover detection system currently recognizes the thirty thousand most-covered songs. That is not comprehensive, but it is a barrier where there used to be none.

AI-generated music. Synthetic-audio detection is not a problem you solve once. A detector can learn to identify each generation of models; then a new generation appears, and the detector must catch up. Building a system that stays current permanently is a separate company’s problem, and not one we wanted to take on. To our surprise, a relatively modest effort produced a detector that reliably identifies tracks generated by Suno through its second-most-recent model version. It is not future-proof, and it is not meant to be. It is a checkpoint, one layer in a system where no single filter carries the full burden.

Stem verification. Contributors earn additional points for uploading stems alongside their master recordings, because separated tracks can be more valuable for training. This creates an incentive to upload unrelated audio files as “stems.” So we verify that uploaded stems correspond to the master, not by filename, but by waveform analysis. If they don’t match, the upload is flagged.
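
One simple way to picture a waveform-based check is to sum the stems and compare the result with the master. The sketch below does this with a plain Pearson correlation. It is an illustration under strong assumptions, not a description of the CORPUS verifier: it assumes the signals are time-aligned and equal in length, it ignores mastering effects such as compression, and the function names and the 0.9 threshold are invented.

```python
import math
from typing import Sequence


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length signals (0.0 if one is flat)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def stems_match_master(master: Sequence[float],
                       stems: Sequence[Sequence[float]],
                       threshold: float = 0.9) -> bool:
    """Check whether the summed stems resemble the master waveform."""
    if not stems or any(len(s) != len(master) for s in stems):
        return False
    mix = [sum(values) for values in zip(*stems)]
    return correlation(mix, master) >= threshold


# Synthetic test signals: two "instruments" and one unrelated tone.
N = 64
bass = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]
lead = [0.5 * math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
unrelated = [math.sin(2 * math.pi * 9 * t / N) for t in range(N)]
master = [b + l for b, l in zip(bass, lead)]

genuine = stems_match_master(master, [bass, lead])
padded = stems_match_master(master, [bass, unrelated])
```

Here the genuine stems reconstruct the master and pass, while the set padded with an unrelated file falls below the threshold and would be flagged, whatever the files are called.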

Vocal detection and personality rights. If a track contains singing, a further legal dimension opens. Vocals carry personality rights that go beyond copyright. Our system detects whether a track contains vocals. If it does, the contributor’s collaboration agreement must include a singer who has explicitly consented to the use. Without that consent, the track does not enter the library.
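
The rule can be written as a small gate. The sketch below is hypothetical throughout: `Party`, the `role` values, and `passes_vocal_consent_rule` are illustrative names, and `has_vocals` stands for the output of the vocal detector. It covers only the vocal rule, not the other checks an upload must pass, and it assumes a slightly stricter reading in which every listed singer must have consented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Party:
    name: str
    role: str        # e.g. "producer" or "singer" (illustrative values)
    consented: bool  # explicit consent to use as AI training material


def passes_vocal_consent_rule(has_vocals: bool, parties: List[Party]) -> bool:
    """Apply only the vocal-consent rule to one track.

    Assumption: at least one singer must be named, and every named
    singer must have consented.
    """
    if not has_vocals:
        return True  # the rule does not apply to instrumental tracks
    singers = [p for p in parties if p.role == "singer"]
    return bool(singers) and all(p.consented for p in singers)


producer = Party("P", "producer", True)
singer_yes = Party("S", "singer", True)
singer_no = Party("S", "singer", False)

with_consent = passes_vocal_consent_rule(True, [producer, singer_yes])
without_singer = passes_vocal_consent_rule(True, [producer])
without_consent = passes_vocal_consent_rule(True, [producer, singer_no])
```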


What none of this required

A recurring realization during development: none of these filters is exotic technology. Non-music detection, fingerprint-based deduplication, cover recognition, AI detection, stem verification: each is buildable today with open components and a few months of focused work. The engineering was careful, not heroic.

This makes the absence of such systems elsewhere harder to explain. Major upload platforms process millions of tracks with only fragmentary checks on whether the audio is music, whether it duplicates existing content, whether it is a cover, or whether it was generated by an algorithm. The result is a music ecosystem flooded with redundant, synthetic, and rights-uncertain material. Those problems are then treated as inevitable consequences of scale.

They are not inevitable; they are consequences of choosing not to filter. We chose differently, and the cost was manageable. The filters we built would function on any upload platform. That they don’t exist on most is an infrastructure decision, not a technical limitation.

An orderly workshop with tools hanging on walls and clean workbenches, fully equipped but entirely unused, seen from the doorway.
The tools exist. Most were left on the wall. Image: Midjourney

Splitting compensation before the upload, not after

When multiple people create a track together, compensation must be split. In CORPUS, this happens before the upload, not retroactively.

A contributor creates an agreement that defines a simple percentage split among all parties involved. Each party is invited by email to confirm their share and to consent explicitly to the use of the work as AI training material. The agreement is captured as an immutable snapshot at the moment of upload. If the template is later changed or deleted, past uploads are unaffected.
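
A minimal sketch of the snapshot idea, with several assumptions labeled: shares are whole percentages that must add up to 100, parties are keyed by email address, and the class and function names (`AgreementTemplate`, `AgreementSnapshot`, `snapshot_for_upload`) are hypothetical. The sketch refuses to create a snapshot until every party has confirmed, and it copies the split so that later edits to the template cannot reach past uploads.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Mapping


class AgreementError(ValueError):
    """Raised when a template cannot be turned into a snapshot."""


@dataclass
class AgreementTemplate:
    """Editable, reusable template (hypothetical structure)."""
    shares: Dict[str, int]      # party email -> percentage share
    confirmed: Dict[str, bool]  # party email -> confirmed share and consent


@dataclass(frozen=True)
class AgreementSnapshot:
    """Read-only copy of the split, taken at upload time."""
    shares: Mapping[str, int]


def snapshot_for_upload(template: AgreementTemplate) -> AgreementSnapshot:
    if sum(template.shares.values()) != 100:
        raise AgreementError("shares must add up to 100 percent")
    missing = [p for p in template.shares
               if not template.confirmed.get(p, False)]
    if missing:
        raise AgreementError(f"unconfirmed parties: {missing}")
    # Copy the dict so later template edits cannot leak into the snapshot,
    # and wrap it in a read-only view so the snapshot cannot be modified.
    return AgreementSnapshot(MappingProxyType(dict(template.shares)))


template = AgreementTemplate(
    shares={"a@example.com": 60, "b@example.com": 40},
    confirmed={"a@example.com": True, "b@example.com": True},
)
snap = snapshot_for_upload(template)
template.shares["a@example.com"] = 10  # a later edit to the template
```

After the edit, `snap.shares` still holds the 60/40 split recorded at upload time, and any attempt to assign into it raises a `TypeError`.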

This is where the system’s ambitions collide with usability. Every consent step adds friction. And friction, past a certain threshold, kills adoption — no matter how justified the underlying principle. This is not only our problem. It is the central political challenge of the entire music-AI licensing landscape: collecting societies, regulators, and industry bodies are all designing complex permission architectures. But a system that is correct and unused is worse than a system that is imperfect and adopted. The speed of technological development does not wait for consensus to form or for legislation to pass.

We tried to make the process as frictionless as possible with automated onboarding, single-click consent, and reusable templates. Whether it is simple enough to be accepted by working musicians is the open question of the beta.

One point, however, was non-negotiable. If a track contains vocals, the singer must be named in the agreement and must have explicitly consented. Vocals carry personality rights, a dimension so intimate that no efficiency argument can override it. This is the line we drew: the system can tolerate imperfection elsewhere, but not here.


Controlled growth, not platform logic

CORPUS is not trying to maximize uploads. We are trying to build a library whose integrity we can guarantee. This means knowing our contributors.

The platform distinguishes between visitors and contributors. Both can upload and explore the annotation pipeline. The difference is that only a contributor’s uploads enter the CORPUS library. The transition requires a lightweight vetting process: does this person have a credible production history? Do they make music that predates the generative AI wave? What kind of work is it?

This is not scalable in the way platform economics demands; it is deliberately slow. The alternative (open registration and retroactive moderation) is precisely the model that has produced the content crisis now visible on every major music platform.

We will need to revisit this as the corpus grows. Some of the vetting may become automatable. But the principle is fixed: the quality of the dataset is bounded by the quality of the contributor base, and quality requires knowledge, not just volume.


What the app reveals

We asked the questions that upload infrastructure usually leaves unanswered: what counts as music, who owns it, whether it is original, whether it is human, whether its creators have consented. Most of the systems that answer them work well, in some cases better than we expected. But gaps remain. The most significant is production quality assessment: the ability to reward not just what a track is, but how well it is made. We plan to close that gap in the coming months.

Upload platforms are not passive infrastructure. They are decision architectures. Every file that enters a system either passes through a series of explicit judgments — is this music, is it original, is it human, have its creators consented — or it passes through the absence of those judgments. Both are decisions. The music industry has, for the most part, chosen the second. The consequences are now visible everywhere: flooded catalogs, uncertain rights, eroded trust, and a growing dataset crisis that AI has made urgent but did not create.

CORPUS chose differently. Every upload is a decision. The contribution app is where ours are open to inspection.

The public beta is open. Musicians can contribute their music.