How hacktivists scraped 300TB of Spotify music, and why it matters for AI

A group of hacktivists has scraped and archived millions of music files, album artwork, and metadata from Spotify, in a move that could potentially let anyone clone one of the world’s largest music streaming services.

The 86 million audio files and over 256 million rows of track metadata, amounting to roughly 300 terabytes (TB) of storage, have been backed up on Anna’s Archive, an open-source search engine for shadow libraries like Sci-Hub and LibGen.
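To put those figures in perspective, the short calculation below estimates the average size per audio file. It is a rough sketch that assumes decimal terabytes and attributes the full 300TB to audio, so the result is an upper bound; metadata and album art also take up part of that total.

```python
# Rough upper bound on average audio file size from the reported figures.
# Assumptions: 1 TB = 10**12 bytes, and all ~300 TB is treated as audio.
TOTAL_BYTES = 300 * 10**12   # roughly 300 TB
AUDIO_FILES = 86_000_000     # 86 million audio files


def average_file_size_mb(total_bytes: int, n_files: int) -> float:
    """Return the mean size per file in decimal megabytes."""
    return total_bytes / n_files / 10**6


avg_mb = average_file_size_mb(TOTAL_BYTES, AUDIO_FILES)
print(f"About {avg_mb:.2f} MB per file at most")
```

An average of a few megabytes per track is consistent with the group’s stated preference for smaller files over the highest possible audio quality.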

Music metadata is the collection of information that relates to an audio file, such as the artist’s name, producer, writer, song title, release date, genre, and track duration, to name a few. The pirate activist group has so far released only the track metadata scraped from Spotify. It also plans to release the 86 million audio files (representing 99.6 per cent of listens) as torrent files, in order of their popularity.

With the metadata of 256 million tracks and 186 million track ISRCs (International Standard Recording Code), a unique identifier assigned to individual sound recordings, Anna’s Archive now has the largest publicly available music metadata database. It is the world’s first “preservation archive” for music that is fully open and allows anyone with enough disk space to mirror it, the platform claimed in a blog post published on December 20.
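For readers unfamiliar with ISRCs, the sketch below shows how such a code is structured: 12 characters made up of a two-letter country or prefix code, a three-character registrant code, a two-digit year of reference, and a five-digit designation code. The function names are hypothetical and the sample strings are only illustrative; the check covers the format alone and says nothing about whether a code has actually been assigned.

```python
import re

# Format check only: 2 letters, 3 alphanumerics, 2-digit year, 5-digit designation.
ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{2}[0-9]{5}")


def normalize_isrc(raw: str) -> str:
    """Strip the optional hyphens and spaces used in printed ISRCs."""
    return raw.replace("-", "").replace(" ", "").upper()


def is_valid_isrc(raw: str) -> bool:
    """Return True if the string matches the 12-character ISRC layout."""
    return ISRC_PATTERN.fullmatch(normalize_isrc(raw)) is not None


def split_isrc(raw: str) -> dict:
    """Break a well-formed ISRC into its four components."""
    code = normalize_isrc(raw)
    if ISRC_PATTERN.fullmatch(code) is None:
        raise ValueError(f"not a well-formed ISRC: {raw!r}")
    return {
        "country": code[:2],
        "registrant": code[2:5],
        "year": code[5:7],
        "designation": code[7:],
    }


print(is_valid_isrc("US-RC1-76-07839"))
print(split_isrc("USRC17607839"))
```

Because one recording can appear on several albums or in several regional releases, the number of distinct ISRCs in a catalogue can be lower than the number of track rows.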

The scraping of a majority of Spotify’s music library has taken on added significance in the artificial intelligence (AI) era, where vast troves of data are routinely harvested by AI companies to train and build large language models (LLMs) like GPT-5 and Gemini 3. It comes amid growing tension between AI companies and rights-holders, even as key questions around copyright, consent, and compensation for creators remain unresolved for the most part.

In response to the pirate activist group’s claims, Spotify has said it is actively investigating the incident. “An investigation into unauthorised access identified that a third party scraped public metadata and used illicit tactics to circumvent DRM to access some of the platform’s audio files,” a company spokesperson was quoted as saying by Billboard, a music and entertainment magazine.

What is Anna’s Archive?

Anna’s Archive is an open-source search engine for shadow libraries, which typically contain paid or paywalled content that has been pirated or uploaded for free. The platform functions like a regular search engine and helps users find material hosted elsewhere on the internet. It reportedly does not host pirated material itself.

So far, most of the content searchable via Anna’s Archive has been books, research papers, and other literary material because “text has the highest information density,” as per the platform. This is the first time music metadata has been made accessible through the platform.

Those behind Anna’s Archive appear to be ideologically motivated to make information freely accessible, with the platform’s stated mission being “preserving humanity’s knowledge and culture”.

The various domains linked to Anna’s Archive are among the most targeted URLs in Google takedown requests filed by copyright holders, according to the TorrentFreak blog.

What data was scraped from Spotify? Why?

Anna’s Archive said its efforts to scrape audio files and metadata from Spotify were aimed at building a music archive for preservation purposes. While the platform acknowledged that music is already fairly well preserved, it pointed out that existing music libraries mostly contain tracks from the most popular artists and tend to focus too much on archiving files of the highest possible quality.

“This inflates the file size and makes it hard to keep a full archive of all music that humanity has ever produced,” the pirate activist group said. In order to “create an authoritative list of torrents” that represents all music ever produced, Anna’s Archive said it scraped Spotify’s music library at scale, using the streaming app’s “popularity” metric to prioritise tracks.

It said that the following data has been scraped and will be released in a phased manner on its Torrents page:

– Metadata (already released)
– Music files (to be released in order of popularity)
– Additional file metadata (torrent paths and checksums)
– Album art
– .zstdpatch files (to reconstruct original files before embedded metadata was added)

The platform also clarified that only Spotify music files available before July 2025 were scraped, meaning any content uploaded after that date may not be present in the scraped dataset. “For now, this is a torrents-only archive aimed at preservation, but if there is enough interest, we could add downloading of individual files to Anna’s Archive,” it said.

What does it mean for the AI race?

Reactions to the blog post by Anna’s Archive have been split, with some claiming that the soon-to-be publicly available music dataset could help AI researchers in their work and others arguing that it could fall into the wrong hands. “This, indeed, has mostly implications for ML, training, etc. As otherwise, the whole catalog is available to partners, but costs a lot. So Anna did indeed liberate the content, but I’m definitely not switching off my Spotify subscription, even though, in my personal taste, neither quality, nor UI does match Apple Music. It’s still useful to have s.o. serve the content for you,” a user on the public forum Hacker News posted.

Another user said the main users of this dataset would be big tech companies and AI giants such as Meta, Google, OpenAI, Microsoft, and Apple. “For them, 300TB is just cheap,” the user said.

“Anyone can now, in theory, create their own personal free version of Spotify (all music up to 2025) with enough storage and a personal media streaming server like Plex. The only real barriers are copyright law and fear of enforcement,” Yoav Zimmerman, CEO of AI startup Third Chair, wrote in a post on LinkedIn.

However, training AI models on pirated content could spell legal trouble for tech companies. While US courts have held that purchasing books, scanning them into digital copies, and using them for AI training purposes are covered under the fair use exception, they have also ruled that using datasets comprising pirated content may not be transformative enough.

Both Meta and Anthropic have faced separate allegations that they used pirated digital copies of hundreds of thousands of books to train their AI models. “This leak will also be really useful to bad actors who will resell the music from this list without paying royalties to the artists,” another user posted on Hacker News.
