Data Scraping Makes AI Systems Possible, but at Whose Expense?

Hanlin Li, Ph.D., a postdoctoral scholar at UC Berkeley's Center for Long-Term Cybersecurity (CLTC) and an incoming assistant professor at UT Austin, issues a call for regulatory support for data stewardship.

Image credit: Alexa Steinbrück / Better Images of AI / Explainable AI / CC-BY 4.0

Prominent AI technologies, from ChatGPT to Midjourney, are made possible by data scraping: a data collection technique that involves downloading vast swathes of information from the web, from images to articles to code. It usually involves a computer program that scans web pages and stores the information in a structured format. The technique is not new, but it has become immensely popular among technology companies developing AI systems. Advanced AI systems, such as large language models, require massive amounts of training data to be successful. For example, the Common Crawl dataset, a primary data source for training GPT-3, is made up of data scraped from billions of web pages.

The breadth of today's data scraping practices means that every content producer is a data worker, whether they like it or not.

While data scraping has served companies well, its limitations and unintended consequences for society are becoming more evident than ever. Currently, anyone with the right tools and resources can scrape, store, and use data with little to no oversight. Data scrapers can freely traverse the web and collect any public information for their own purposes, from surveilling the public with facial recognition technologies to generating images that mimic an artist's work.

For those of us…
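To make the "scan pages and store the information in a structured format" step concrete, here is a minimal sketch using only Python's standard library. The HTML snippet and the choice of extracting headline text are invented for illustration; a real scraper would fetch pages over the network (e.g. with urllib) and write results to a database or CSV.

```python
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Collects the text of every <h2> heading on a page into a list,
    i.e. turns unstructured HTML into structured data."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.headlines.append(data.strip())


# In practice the HTML would be downloaded from a live site; a hardcoded
# snippet keeps this example self-contained and runnable offline.
page = (
    "<html><body>"
    "<h2>First article</h2><p>Body text...</p>"
    "<h2>Second article</h2>"
    "</body></html>"
)

scraper = HeadlineScraper()
scraper.feed(page)
print(scraper.headlines)  # → ['First article', 'Second article']
```

Scaled up across billions of pages, the same pattern (fetch, parse, store) is what produces datasets like Common Crawl.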