Law & Technology Workshop 1/16, noon ET

Friday, January 16, 2026, noon - 1 p.m. ET
Mehtab Khan, Access to Datasets
Discussant: Michael Goodyear
Current AI development relies largely on large-scale web scraping by commercial companies. This scraping often overwhelms open digital repositories and favors entities with the resources to compile proprietary datasets. The process has sometimes produced open datasets that are poor in quality, untraceable, and culturally narrow. The practice of compiling AI datasets has also resulted in numerous ongoing copyright challenges as companies turn to pirated collections and other archives without authorization. Libraries and other public institutions, as custodians of information, occupy a unique position in addressing these challenges. Their collections are archival and cultural artifacts with social, economic, and political significance, yet they remain underutilized in AI development.
Recent initiatives demonstrate a potential way forward. In summer 2025, Harvard University’s Institutional Data Initiative released over a million books in more than 250 languages, bridging the gap between AI development and public institutions. Such projects highlight how libraries can provide diverse, high-quality datasets for research. The challenge, however, is to scale such practices while establishing norms for public interest use and addressing data governance. Legal uncertainties around copyright and licensing complicate broader adoption. Without clear frameworks for public interest access to the mechanics of AI development, we also risk deepening inequities in dataset availability and consolidating knowledge power in the hands of a few private actors.
Beyond legal considerations, AI development exerts significant infrastructural pressure. Open-source platforms like Wikipedia face heightened demands from automated web crawlers. Other repositories, including Reddit and various digital archives, have implemented defensive measures to mitigate strain, including limiting scraping access. These actions, though protective, signal a broader retreat from open-access principles and threaten the public digital commons.
This article makes three contributions: First, it argues that access to AI datasets should be recognized as a core public good, akin to open libraries and an open internet, shaping the future of knowledge production in society. Second, it situates recent developments at the intersection of copyright law, platform governance, and public access to knowledge. Finally, it makes recommendations for platforms and policymakers navigating the rapidly changing AI dataset development processes.
Call for Papers: Law & Technology Workshop Spring 2026
The Law and Technology Workshop is now accepting paper submissions for its Spring 2026 virtual workshop series.
Submission deadline: Friday, February 6, 2026
Format: Monthly virtual workshops (3rd Fridays), starting March 20, 2026
Scope: Broadly defined law & technology (AI, privacy, platforms, digital assets, and more)
More information and the application form are available at: https://thelawtechworkshop.org. You can also subscribe to the workshop mailing list here: https://thelawtechworkshop.beehiiv.com/subscribe. Please feel free to share with others who may be interested, and don’t hesitate to reach out with questions at [email protected].
P.S. Please note, the Law and Technology Workshop has a new website: https://thelawtechworkshop.org/
