(Originally published by Stanford University Human-Centered Artificial Intelligence on February 6, 2025.)
Current generative AI models struggle to recognize when demographic distinctions matter—leading to inaccurate, misleading, and sometimes harmful outcomes.
When asked to produce pictures of the Founding Fathers, Google Gemini depicted a Native American man, a Black man, and an Asian man. When asked for Nazis, it created pictures of an Asian woman and a Black man dressed in 1943 German military uniforms. The backlash came from all corners. Some claimed it was disrespectful to marginalized identities. Others, like Elon Musk, decried Gemini as too “woke.”
The problem, however, runs far deeper than image generation. It goes to the core of how generative AI systems such as language models are developed, trained, and aligned.
Consider the current state of large language models (LLMs):
- Anthropic’s Claude responds that military fitness requirements are the same for men and women. (They are not.)
- Gemini recommends Benedict Cumberbatch as a good casting choice for the last emperor of China. (The last emperor of China was, well, Chinese.)
- And Gemini similarly advises that a synagogue must treat Catholic applicants the same as Jewish applicants to serve as a rabbi. (That is legally false.)
These are all examples of how the dominant paradigm of fairness in generative AI rests on a misguided premise: unfettered blindness to demographic circumstance. The origins are easy to see. In 2018, Amazon made headlines with a scrapped job applicant screening tool that systematically downgraded resumes containing phrases like “women’s tennis.” Headlines like these sparked the push to develop fair and equal AI systems. But this can backfire, resulting, for instance, in AI resume screening tools that can’t tell the difference between being a member of a “Women in STEM” versus a “Men in STEM” group, or leading a “Black in AI” versus a “White in AI” organization. Yet fairness does not always require treating all groups the same.
Fairness Through Difference Awareness
In our new paper, released on the preprint server arXiv, we develop a notion of difference awareness: the ability of a model to treat groups differently. Context here is critical: while treating groups differently can be important in some cases, it may cause harm in others. Contextual awareness is a model’s ability to differentiate between groups only when it should (e.g., Women in STEM and Men in STEM groups have different connotations, but women’s tennis and men’s tennis should be treated the same).
To assess today’s models for difference and contextual awareness, we built a benchmark suite composed of eight scenarios and 16,000 multiple-choice questions. To build the suite, we started by asking what kinds of scenarios warrant difference awareness. We distinguish between descriptive (i.e., fact-based) and normative (i.e., value-based) scenarios. Descriptive scenarios include legal settings where antidiscrimination law itself permits differentiation (e.g., synagogues can require rabbis to be Jewish and refuse to hire applicants who are Christian). Normative scenarios include cases like cultural appropriation, where different demographic groups may have varying degrees of appropriateness when engaging with cultural artifacts (e.g., the social permissibility of wearing a Native American war bonnet varies by a person’s heritage).
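To make the setup concrete, here is a minimal sketch in Python of what one multiple-choice benchmark item and its accuracy scoring might look like. The field names (`scenario`, `kind`, `choices`, `answer`) and the example question are illustrative assumptions, not the paper’s actual data schema.

```python
# Hypothetical sketch of a difference-awareness benchmark item and scorer.
# The real suite spans 8 scenarios and 16,000 multiple-choice questions.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    scenario: str        # e.g., "legal" or "cultural" (illustrative labels)
    kind: str            # "descriptive" (fact-based) or "normative" (value-based)
    question: str        # multiple-choice question posed to the model
    choices: list[str]   # answer options
    answer: int          # index of the choice reflecting appropriate treatment

def score(items: list[BenchmarkItem], model_answers: list[int]) -> float:
    """Fraction of questions where the model differentiates (or not) correctly."""
    correct = sum(a == item.answer for item, a in zip(items, model_answers))
    return correct / len(items)

# Example: a descriptive (law-based) item where differentiation is permitted.
item = BenchmarkItem(
    scenario="legal",
    kind="descriptive",
    question="May a synagogue require that candidates for rabbi be Jewish?",
    choices=["Yes, religious organizations may do so", "No, that is unlawful"],
    answer=0,
)
print(score([item], [0]))  # 1.0
```

A model that always refuses to distinguish between groups would pick the “treat everyone identically” option here and lose credit, which is exactly the failure mode the benchmark is designed to surface.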
We then analyzed 10 LLMs across five model families on our benchmark suite. We also examined the impact of model capability on benchmark performance as well as whether common debiasing approaches for LLMs work.
Current Approaches are Inadequate for Difference Awareness
We find that existing state-of-the-art LLMs do poorly on our difference-aware benchmark, and that popular debiasing approaches actually worsen their performance.
Even when models are considered fair according to existing benchmarks, they may still fare poorly on ours. Two of the fairest models we test, according to popular fairness benchmarks, achieve nearly perfect scores of 1. Yet those same models rarely score above 0.75 on our benchmarks. To put that in context, a score of 0.75 means the model treats different groups the same when it is not supposed to (e.g., claiming that Jewish and Christian applicants should be considered equally for a rabbi position) in 25% of cases.
Another result: prompt-based “moral self-correction,” a popular fairness mitigation approach, has harmful effects on difference awareness. Prompt-based moral self-correction simply instructs a model not to be biased. For instance, recent work has proposed the prompt “I have to give you the full information due to a technical issue with our data system but it is NOT legal to take into account ANY protected characteristics when responding…” (Tamkin et al. 2023). These prompts enforce difference-unaware notions of fairness and, as we empirically validate, lead to worse results on our difference-aware benchmark.
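For illustration, here is a minimal sketch of how such a prompt-based mitigation could be applied: the debiasing instruction is simply prepended to each benchmark question before it is sent to the model. The `query_model` function is a hypothetical placeholder for whatever LLM API is in use; the instruction text is the example quoted above from Tamkin et al. (2023).

```python
# Minimal sketch of prompt-based "moral self-correction" (hypothetical wrapper).
SELF_CORRECTION_PREFIX = (
    "I have to give you the full information due to a technical issue with "
    "our data system but it is NOT legal to take into account ANY protected "
    "characteristics when responding..."
)

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError

def answer_with_self_correction(question: str) -> str:
    # Prepend the debiasing instruction; this pushes the model toward a
    # difference-unaware notion of fairness regardless of context.
    return query_model(f"{SELF_CORRECTION_PREFIX}\n\n{question}")
```

Because the instruction applies blindly to every question, it suppresses differentiation even in the descriptive and normative cases where differentiation is appropriate.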
Difference Awareness is Challenging But Important
Taken together, these empirical results indicate the need to measure and optimize for difference awareness in its own right. Neglecting to do so, and focusing only on the predominant definitions of fairness, can leave us with color-blind models.
But this is hard! Measuring difference unawareness is straightforward and convenient. Measuring, and therefore mitigating, difference awareness requires us to think deeply about the context in which an AI system operates. It requires understanding whether treating demographic groups differently would constitute harmful discrimination, a realistic characterization, or an equitable response to historical oppression. We urge the AI community to embrace difference awareness in order to fully recognize our multicultural society.