More than a thousand images of child sexual abuse found in dataset used to train AI image-generating tools
A study by researchers at the Stanford Internet Observatory revealed that more than a thousand images of child sexual abuse material were discovered in LAION-5B, a massive public dataset containing billions of images scraped from across the internet, including social media and adult entertainment websites, that is used to train popular AI image-generating models. The presence of such images in the training data raises concerns that AI models could be used to generate new, realistic synthetic images of child sexual abuse.
The dataset, managed by the German nonprofit LAION, has been taken offline in response to the findings. LAION stated that it has a “zero tolerance policy for illegal content” and is working with organizations such as the Internet Watch Foundation to address the issue. The organization plans to conduct a full safety review of LAION-5B and to republish the dataset once the identified issues have been addressed.
The researchers reported the identified images to the National Center for Missing and Exploited Children and the Canadian Centre for Child Protection, and removal of the images is in progress. The study highlights the opaque nature of the training data behind generative AI tools and calls for more carefully curated, well-sourced datasets for publicly distributed models, emphasizing the risks that privacy violations, copyright infringement, and illegal content pose in large web-scale datasets.