April 12, 2021
Facebook released an open-source AI data set of 45,186 videos featuring 3,011 U.S. actors who were paid to participate. The data set is dubbed Casual Conversations because the diverse group was recorded giving unscripted answers to questions about age and gender. Human annotators also labeled skin tone and lighting conditions. Bias has been a persistent problem in AI-enabled technologies such as facial recognition, and most AI data sets feature people who are unaware they are being recorded. Facebook is encouraging its teams to use the new data set.
CNN reports that “Facebook had humans label the lighting conditions in videos and label participants’ skin tones according to the Fitzpatrick scale, which was developed in the 1970s by a dermatologist to classify skin colors.” Until now, Facebook and other tech companies have relied on ImageNet, a huge data set of all kinds of images, to advance AI research.
Casual Conversations, CNN notes, is “composed of the same group of paid actors that Facebook previously used when it commissioned the creation of deepfake videos for another open-source data set.” Casual Conversations is also intended for internal use, and Facebook is “encouraging” its teams to use it.
Facebook AI research manager Cristian Canton Ferrer said Casual Conversations “includes some information that was not used when Facebook created the deepfake data set.” The paid participants spent “several hours being recorded in a studio … [and] can also tell Facebook to remove their information in the future for any reason.”
Canton Ferrer added that “much more work needs to be done to make AI systems fair … [and that] he hopes to get feedback from academic researchers and companies so that, over time, fairness can be better measured.” One area under consideration for expansion is gender, which is currently identified in a binary manner. In Casual Conversations, “participants were asked to self-identify as ‘male,’ ‘female,’ or ‘other’.” Canton Ferrer noted that “other” “encapsulates a huge gamut of options.”
On Facebook’s AI blog, the company states that Casual Conversations is “a tool for AI researchers to surface useful signals that may help them evaluate the fairness of their computer vision and audio models across subgroups of age, gender, apparent skin tone, and ambient lighting.”
“To our knowledge, it’s the first publicly available data set featuring paid individuals who explicitly provided their age and gender themselves,” it continues. “We prefer this human-centered approach and believe that it allows our data to have a relatively unbiased view of age and gender.” Annotators labeled ambient lighting conditions and skin tone.
Facebook says that the Casual Conversations data set “should be used as a supplementary tool for measuring the fairness of computer vision and audio models, in addition to accuracy tests, for communities represented in the data set.”
“It’s designed to surface instances in which performance may be unequal across different subgroups,” it adds. “By making fairness research more transparent and normalizing subgroup measurement, we hope this data set brings the field one step closer to building fairer, more inclusive technology.”
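To illustrate what subgroup measurement can look like in practice, here is a minimal Python sketch that computes a model's accuracy separately for each value of an annotated attribute and reports the gap between the best and worst subgroup. The record layout, the field names (`skin_tone`, `lighting`, `label`, `prediction`), and the toy values are assumptions made up for this example; they are not the actual Casual Conversations schema, and this is not Facebook's evaluation code.

```python
from collections import defaultdict


def accuracy_by_subgroup(records, attribute):
    """Return {subgroup value: accuracy} for one annotated attribute."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[attribute]
        total[group] += 1
        if rec["prediction"] == rec["label"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def largest_gap(per_group):
    """Difference between the best and worst subgroup accuracy."""
    values = list(per_group.values())
    return max(values) - min(values)


# Hypothetical toy records: invented field names and values, for illustration only.
records = [
    {"skin_tone": "II", "lighting": "bright", "label": 1, "prediction": 1},
    {"skin_tone": "II", "lighting": "dark", "label": 0, "prediction": 0},
    {"skin_tone": "V", "lighting": "bright", "label": 1, "prediction": 1},
    {"skin_tone": "V", "lighting": "dark", "label": 1, "prediction": 0},
    {"skin_tone": "V", "lighting": "dark", "label": 0, "prediction": 0},
    {"skin_tone": "II", "lighting": "bright", "label": 0, "prediction": 0},
]

by_tone = accuracy_by_subgroup(records, "skin_tone")
by_lighting = accuracy_by_subgroup(records, "lighting")

for name, result in (("skin_tone", by_tone), ("lighting", by_lighting)):
    print(name, result, "gap:", round(largest_gap(result), 3))
```

A large gap on any attribute is only a signal that performance may be unequal across subgroups. In line with Facebook's framing, a check like this supplements accuracy tests and does not replace them.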