Who Authors the Internet? Analyzing Gender Diversity in ChatGPT-3 Training Material

Jessica Kuntz is the Policy Director for the University of Pittsburgh Institute for Cyber Law, Policy, and Security. Dr. Elise Silva is a Postdoctoral Fellow at the University of Pittsburgh, where she studies information ecosystems at Pitt Cyber’s Disinformation Lab.

A photo of a billboard promoting the movie Barbie in Times Square, New York, July 8, 2023. Shutterstock

“Thanks to Barbie,” the narrator cheerfully proclaims, “all problems of feminism have been solved!” Except, as her audience well knows (and Barbie realizes in due time), they’re not. The summer blockbuster struck a collective nerve because, for all the social progress, the biases, inequities, and indignities of being a woman in the United States remain frustratingly entrenched.

From AI avatar apps that churn out sexualized or nude pictures of female users to text-to-image models that adopt a male default when instructed to produce images of engineers and scientists, AI has a funny way of revealing the social biases we thought we’d fixed. Women currently in the workforce were raised under the banner that we could be whatever we wanted to be. But the subtly pernicious outputs of the AI models poised to impact hiring decisions, medical research, and cultural narratives illuminate the durability of gender stereotypes and glass ceilings.

It was with this in mind that we wanted to dig into the training data of large language models (LLMs), with particular attention to the authorship of those texts. What percentage of the training materials, we wondered, are authored by women? Sociolinguists have…