Meta’s Llama 3 AI Trained on Pirated Books: A Copyright Controversy

By Archynetys News

Published: April 2, 2025

The Allure of Illicit Data: Fueling AI Progress

In a move that has ignited a firestorm of controversy, Meta, the tech giant behind Facebook and Instagram, allegedly used Library Genesis (Libgen), a notorious database of pirated works, to train its latest artificial intelligence model, Llama 3. Internal communications, revealed during a copyright infringement trial, suggest that the Llama team considered it “really important […] get books quickly.” This urgency highlights the perceived value of books in AI training, with engineers noting that “Books actually represent a more significant contribution than data from the web.”

Zuckerberg’s Alleged Green Light: A Risky Gamble?

While exploring legal licensing options, Meta reportedly found publishers’ proposals “unreachable” and their implementation “incredibly slow.” A technical director also pointed out a strategic vulnerability: licensing even a single book would undermine the company’s ability to rely on a fair use strategy. To circumvent these hurdles, Meta allegedly secured approval from someone identified as “MZ” (purportedly Mark Zuckerberg) to tap into Libgen’s vast repository of over 7.5 million books and 81 million scientific articles.

The team allegedly downloaded these files via BitTorrent, a peer-to-peer file-sharing protocol, risking the illegal distribution of copyrighted material under U.S. law. One employee even admitted, “Download via Torrent from a corporate computer does not seem great.” This admission underscores the team’s awareness of the legal risks involved.

Navigating the Legal Minefield: Fair Use or Foul Play?

Some employees reportedly recognized the “average-high legal risk” associated with using Libgen, and even suggested measures to conceal the practice, such as “delete data clearly marked as pirated” and “do not publicly cite the use of Libgen.”

Meta, like OpenAI, defends its actions by invoking the doctrine of fair use, arguing that its AI models transform the original content without reproducing or communicating it to the public. However, the legal validity of this argument remains contested. As of 2024, several lawsuits are challenging the use of copyrighted material in AI training, raising questions about the future of AI development and copyright law.

The Ethical Implications: A Pandora’s Box?

The controversy surrounding Meta’s use of pirated books raises significant ethical questions about the responsible development of AI. While access to vast datasets is crucial for training powerful AI models, the means of acquiring that data must be ethical and legal. The potential consequences of normalizing the use of pirated material could be far-reaching, undermining the creative industries and eroding respect for intellectual property rights.

“The use of copyrighted material in AI training is a complex issue with no easy answers. We need a balanced approach that fosters innovation while protecting the rights of creators.”

Dr. Anya Sharma, AI Ethics Researcher at the Institute for the Future

Looking Ahead: The Future of AI Training Data

The Meta controversy highlights the urgent need for clear legal and ethical guidelines regarding the use of copyrighted material in AI training. As AI continues to evolve, it is crucial to establish a framework that promotes innovation while safeguarding the rights of creators and ensuring the responsible development of this transformative technology. The outcome of ongoing legal battles will likely shape the future of AI training data and the balance between innovation and copyright protection.

AI Giants Under Scrutiny: Copyright Concerns Rise Over Data Usage

By Archynetys News Team | Published: 2025-04-02

The rapid advancement of artificial intelligence has sparked a heated debate over the ethical and legal implications of using copyrighted material for training AI models. As AI technologies become increasingly integrated into various aspects of our lives, the question of how to fairly compensate creators and protect intellectual property rights has become paramount.

The Core Issue: Unauthorized Data Usage

The heart of the controversy lies in the alleged unauthorized use of copyrighted works, including books, research papers, and articles, to train large language models (LLMs). Publishers and authors are increasingly concerned that their intellectual property is being exploited without their consent or fair compensation. This concern has intensified with the growing awareness of the datasets used to train these models.

For example, Meta recently confirmed that its AI models were trained using data from Libgen, Sci-Hub, and Z-Library, all known for providing access to copyrighted material without authorization. This revelation has fueled the debate and prompted calls for greater clarity and regulation in the AI industry.

Libgen and the Shadow Libraries: A Double-Edged Sword

Libgen (Library Genesis), established in 2008, was initially intended to provide access to knowledge for individuals in developing countries and those outside conventional academic circles. Its mission was to democratize access to information, especially for those facing financial or geographical barriers.

…people from Africa, India, Pakistan, Iran, Iraq, China, Russia and the former USSR… Those who do not belong to the academic world.

Libgen’s Original Mission Statement

However, over time, Libgen’s database has expanded to include millions of documents, including copyrighted works by prominent authors and articles from prestigious journals like Nature, Science, and The Lancet. This expansion has transformed Libgen into a massive repository of both legitimate and pirated content, making it a valuable resource for AI developers but also a major source of concern for copyright holders.

Legal Battles and Industry Pushback

Publishing professionals have been actively combating unauthorized access to copyrighted material for years. Legal actions and blocking measures have been coordinated by publishers to target Libgen and similar sites. In September 2023, several educational publishers filed a lawsuit in the United States, seeking $30 million in damages.

In the United Kingdom, the Publishers Association obtained an expanded blocking order in November 2024 against Libgen and other similar sites. The organization also initiated judicial investigations, providing technical analyses of Libgen’s operations. Similar actions have been taken in France by the Syndicat National de l’Édition (SNE) since 2019.

The British Perspective: Copyright Infringement and Economic Impact

British publishers view the unauthorized use of protected works as a significant threat to copyright. They argue that this exploitation undermines the creative, human, and financial investments made by writers, researchers, academics, and publishing houses. The Publishers Association emphasizes that its members do not authorize the use of their protected works for training AI models without explicit license agreements.

The economic impact of copyright infringement is considerable. According to a 2024 report by the UK Intellectual Property Office, copyright industries contribute over £150 billion to the UK economy annually and support millions of jobs. Unauthorized use of copyrighted material threatens this vital sector.

Transparency and Regulation: A Call for Action

The Publishers Association and its international counterparts are urging governments to demand greater transparency from AI developers regarding their use of copyrighted works. The goal is to ensure that the benefits of AI are realized in a fair and ethical manner, respecting the rights of creators and copyright holders.

The challenge lies in finding a balance between fostering innovation in AI and protecting intellectual property rights. This requires a collaborative effort involving governments, AI developers, publishers, and authors to establish clear guidelines and regulations that address the complex issues surrounding data usage in the age of AI.

Meta Under Scrutiny: Copyright Infringement Allegations Intensify

Published: 2025-04-02

By Archynetys News Team

AI Training Data Under Legal Fire: Meta Faces Transatlantic Copyright Challenges

Tech giant Meta is facing mounting legal pressure on both sides of the Atlantic over allegations of copyright infringement related to the training of its generative AI models. Accusations center on the unauthorized use of copyrighted literary works, raising critical questions about the ethical and legal boundaries of AI development.

The Core Allegations: Unlawful Data Acquisition

The heart of the controversy lies in the claim that Meta used copyrighted material without obtaining proper licenses or permissions from authors and publishers. This unauthorized data, allegedly sourced from illegally obtained databases, was then used to train Meta’s AI, potentially giving the company an unfair advantage in the rapidly evolving AI market. The use of pirated databases to train AI models raises serious ethical and legal questions about the sourcing of data in the AI industry.

UK Publishers Association Demands Transparency and Fair Compensation

In the United Kingdom, the Publishers Association (PA) is taking a firm stance against Meta. Catriona MacLeod Stevenson, legal director and deputy director of the PA, stated in a press release that they have long suspected the use of illegal pirate sites to train LLMs. She referenced court documents reported by The Atlantic, suggesting Meta employees were actively encouraged to download and use such materials.

We have long suspected that illegal pirate sites have been used to train the LLM. The court documents relayed by The Atlantic show that Meta employees have been actively encouraged to download and use.
Catriona MacLeod Stevenson, Publishers Association

Stevenson emphasized the significant damage to authors’ and publishers’ copyrights, asserting that it cannot go unaddressed. The PA and its members are actively exploring potential legal actions. She further argued that tech companies like Meta have the resources to compensate creators for the content they use, drawing a parallel to paying for the electricity needed to power their operations.

Companies like Meta must show transparency on the protected works they have used and wish to use, and initiate discussions in good faith on licenses, so that beneficiaries are paid for their work.
Catriona MacLeod Stevenson, Publishers Association

The UK government is currently reviewing responses to a public consultation on copyright and artificial intelligence, signaling a potential shift towards stricter regulations in this area. As of 2024, the UK creative industries contributed £109 billion to the economy, highlighting the importance of protecting intellectual property rights.

French Legal Action: Copyright Violation and Economic Parasitism

Across the English Channel, in France, the Syndicat National de l’Édition (SNE), the Société des Gens de Lettres (SGDL), and the Syndicat National des Auteurs et des Compositeurs (SNAC) have jointly filed a lawsuit against Meta. The lawsuit, presented before the 3rd Chamber of the Paris court, accuses Meta of massive copyright infringement by using protected literary works without consent to train its generative AI model.

The French legal action targets both direct copyright violation and “economic parasitism,” arguing that the development of the AI market should not come at the expense of the cultural sector. The plaintiffs are demanding full compliance with copyright laws and the complete removal of unauthorized data used to train AI models. This echoes similar concerns raised by artists and musicians globally, who fear that AI is being trained on their work without proper compensation or attribution.

The Question of Liability: Hacked Data and Legal Consequences

A key question remains: can Meta be held liable for using copyrighted works that were freely available on the web through download services, even if those works were obtained illegally? The answer to this question could have significant implications for the AI industry, potentially setting a precedent for how AI companies source and utilize data. The legal teams involved, particularly Meta’s lawyers, are likely scrutinizing this aspect of the case closely.