According to Cointelegraph: LAION-5B, a substantial artificial intelligence (AI) dataset used to train several widely used text-to-image generators, has been pulled by its creator after a study revealed it harbored thousands of suspected instances of child sexual abuse material (CSAM). LAION, the Germany-based Large-scale Artificial Intelligence Open Network, is a non-profit organization known for creating open-source AI models and datasets that serve as the backbone of several prominent text-to-image models.
Researchers at the Stanford Internet Observatory's Cyber Policy Center, in a report published on December 20, identified 3,226 suspected instances of CSAM in the LAION-5B dataset. Many of those instances were verified as CSAM by independent third parties, according to David Thiel, Big Data Architect and Chief Technologist at the Stanford Cyber Policy Center.
Thiel noted that while the CSAM detected in the dataset may not drastically alter the output of models trained on it, it is likely to exert some influence. The repetition of identical CSAM images is a further concern, as it reinforces a model's depiction of specific victims.
Introduced in March 2022, the LAION-5B dataset consists of 5.85 billion image-text pairs. In response to the findings, LAION confirmed in a statement that, as a precautionary measure, it had taken down the datasets in question, including both LAION-5B and LAION-400M. The organization said it will republish the datasets once it has ensured they are safe.