Ethical AI Training: Researchers Build Massive Dataset from Openly Licensed Text
A new study challenges the assumption that copyrighted material is necessary for training large language models, demonstrating comparable performance using only openly licensed data.
The debate around the use of copyrighted material in artificial intelligence (AI) training continues, with major AI companies arguing for its necessity. However, a group of AI researchers has taken a different approach, successfully building a substantial dataset using only openly licensed or public domain text, according to the Washington Post.
This eight-terabyte dataset was used to train a 7-billion-parameter language model, achieving performance comparable to industry models like Meta’s Llama 2-7B. The findings, detailed in a paper published Thursday, highlight the feasibility of creating powerful AI tools from ethically sourced data.
The researchers found that the process was far from simple, requiring significant human effort. According to the paper, challenges included inconsistent data formatting and the complexity of determining the correct licenses for various websites, especially given the prevalence of improperly licensed data. “This isn’t a thing where you can just scale up the resources that you have available,” said Stella Biderman, executive director of the nonprofit research institute EleutherAI. She added, “We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that’s just really hard.”
New Ethical Datasets Emerge
Despite the challenges, the team identified new, ethically sound datasets, including a collection of 130,000 English-language books from the Library of Congress, nearly twice the size of Project Gutenberg’s popular books dataset. This initiative builds on other efforts, such as FineWeb from Hugging Face, which seeks to promote ethical AI progress.
“Even partial transparency has a huge amount of social value and a moderate amount of scientific value,”
While Stella Biderman expressed doubt that major companies like OpenAI and Anthropic would adopt such a labor-intensive approach, she hopes the work will encourage greater transparency about the data used to train AI models. Even a return to the transparency levels of 2021 or 2022, when AI companies disclosed more about their training data, would be a positive step, she suggested.
The Future of AI Training Data
The research underscores the potential for creating high-performing AI models without relying on copyrighted material. It also highlights the need for better data provenance and licensing practices within the AI industry.
Frequently Asked Questions
- Is copyrighted material necessary for training high-performing AI models?
- This research suggests that it is not, as comparable performance was achieved using only openly licensed and public domain text.
- What are the challenges of using openly licensed data for AI training?
- Challenges include inconsistent data formatting, determining correct licenses, and the need for manual annotation and verification.
- What are the benefits of using ethical AI training data?
- Ethical AI training data reduces the risk of copyright infringement, promotes transparency, and can lead to more responsible AI development.
