The Atlantic Just Doxxed the Music Used to Train AI — And It's a Goldmine

12 million songs exposed. What comes next?

Alex Novak | Source: The Verge

The Atlantic's Alex Reisner just did what regulators, tech watchdogs, and disgruntled artists couldn't: he pulled back the curtain on the music feeding the AI beast. Four datasets, fully searchable. Two of them are enormous: 12 million tracks and 9 million tracks. The other two are smaller but still represent a significant chunk of creative work.

This isn't just a database. It's a key piece of evidence in an ongoing war over intellectual property, fair use, and the soul of art. If you've ever wondered what songs are feeding the machines that can now write a passable pop ballad or mimic a dead singer's voice, you now have a searchable list.

A Curated Shitshow

The four datasets are a study in scale. The two biggest — let's call them the heavy hitters — are constructed from what looks like YouTube audio streams, ripped from videos, playlists, and uploads. They include everything: studio albums, live recordings, bootlegs, and random YouTube covers. The smaller sets seem more curated, pulled from labeled collections or specific genres.

But here's the thing: nobody asked permission. Not the artists. Not the labels. Not the songwriters. The datasets were assembled by scraping public resources, then used to train models that can now generate music on demand. The Atlantic's database lets you search by artist, track, or album — so you can see if your favorite band is in the mix. Spoiler: they probably are.

Who's in the Mix?

I ran a few searches. Taylor Swift? Yes. The Beatles? Yes. Some obscure indie band from 2005 that only 200 people have heard of? Also yes. The datasets aren't discriminating. They're exhaustive.

This matters because AI companies have been cagey about their training data. OpenAI, Google, and Meta all talk about "publicly available data" like it's a simple concept. But taking a song from YouTube and feeding it into a model's training set isn't the same as reading a news article. Music is protected by copyright, and the use of copyrighted material for AI training is still legally murky.

Reisner's database doesn't just expose the scale; it exposes the hypocrisy. If you're an AI company claiming your models are ethically trained, you now have to answer for every track in these datasets.

The Tectonic Shift

This isn't just a journalism scoop. It's a landmark in the ongoing battle between creators and tech. We've seen lawsuits from The New York Times against OpenAI, from Getty Images against Stability AI, and from countless artists against image generators. But music is different. Music is personal. Music is emotional. And music is big business.

The Recording Industry Association of America (RIAA) has already filed comments about AI training data being used "without authorization." But this database gives them a roadmap. It's a list of every song they could sue over. Every track is a potential claim.

And it's not just the labels. Independent artists, whose work was scraped without their knowledge, now have a tool to see exactly where their music ended up. That's empowering.

What's the AI Industry Going to Do?

So far, the response from AI companies has been predictable: a lot of hand-waving and vague promises about "fair use." One company even argued that since the data was publicly available on YouTube, they had a right to use it. That's like saying because you hung a painting in a museum, I can copy it and sell prints.

The fundamental problem is that AI training relies on scale. You can't train a high-quality music model on 1,000 licensed tracks. You need millions. The cost of licensing that many songs would be astronomical, and that's exactly why the datasets were scraped rather than licensed. The economics of AI demand theft.

Unless, of course, the law steps in. And that's where Reisner's work becomes crucial. Courts love specific evidence. A vague claim that "some music was used" is easy to dismiss. A searchable database of 12 million specific tracks? That's a smoking gun.

The Searchable Future

The Atlantic has made the database publicly available. You can search it right now. Go ahead, look up your favorite artist. I'll wait.

See? It's all there. Every song you've ever loved, probably. And now every AI company that's been building music models has a target on its back.

This is the future of the AI debate. Not abstract philosophical discussions about consciousness, but concrete questions about ownership. Who owns the data? Who gets paid? And who gets to decide what the machines learn?

Reisner's database won't settle those questions. But it lights the fuse. Artists now have a tool. Lawyers now have a list. And tech companies now have a reason to sweat.

The Bottom Line

The Atlantic just handed every musician, label, and copyright lawyer a weapon. Whether they use it or not is up to them. But the days of pretending AI training data is some ambiguous, untraceable fog are over. We know what's in the box. And it's a lot of stolen music.

This is a story about power. The power to create, the power to scrape, and the power to expose. For now, the score is creativity 1, AI theft 0. But the game is far from over.
