Artificial intelligence research company OpenAI announced a new initiative this week, called Data Partnerships, aimed at diversifying and expanding the data used to train AI models. Through the program, OpenAI plans to collaborate with third-party organizations to create new public and private datasets for AI training.
OpenAI wants better data to make its models fairer and more accurate
According to OpenAI, the goal is to create fairer, more accurate and more beneficial models by exposing them to a wider range of data that better reflects the diversity of languages, cultures and topics. Current AI datasets tend to suffer from issues such as Western centrism, lack of diversity, and the inclusion of toxic or biased content.
“To finally make [AI] which is safe and beneficial for all humanity, we would like AI models to deeply understand all subjects, industries, cultures and languages, which requires as broad a training dataset as possible,” OpenAI said in a blog post announcing the program.
Training on data across modalities could deepen models' understanding of patterns
By working with partners to collect large-scale datasets in modalities such as text, images, audio and video, OpenAI hopes to improve its models' understanding of patterns beyond what can easily be extracted from the internet today. The company says it will work to remove any sensitive or personal information and will provide options to keep datasets confidential.
OpenAI has already partnered with organizations such as the Icelandic government, Free Law Project and Miðeind ehf on early versions of the program. However, some experts express skepticism that these efforts can minimize the deep-rooted biases that have thus far impacted AI models.
“Overall, we are looking for partners who want to help us teach AI to understand our world in order to be maximally useful to everyone,” OpenAI said.
More diverse training data would also improve GPT-4
While diversifying AI training data is essential, the program will also clearly benefit OpenAI's commercial models, such as GPT-4. This perceived dual motivation, along with OpenAI's lack of compensation for data partners, has drawn some criticism in light of accusations that the company has used data without permission.
Greater transparency around OpenAI’s dataset collection, bias mitigation efforts, and commercial interests will be key to assessing the impact of data partnerships on the overall AI landscape. But the program reflects a realization that improving future AI requires starting with better, more representative data.
Featured image credit: Andrew Neel/Pexels