Auditing Large Language Models for Race and Gender Disparities in Hiring
Authored by Prasanna (Sonny) Tambe
Large language models (LLMs) are increasingly used to support high-stakes decisions, including hiring, admissions, and performance evaluation. Their ability to synthesize large volumes of unstructured text—résumés, essays, interview transcripts—makes them especially attractive for human resources (HR) applications. At the same time, these capabilities raise concerns about discrimination and bias, particularly given the opacity of LLM training processes. Policymakers have responded by introducing requirements to audit algorithmic decision systems, but there is little consensus on how such audits should be conducted for LLMs.
In our paper, Auditing Large Language Models for Race and Gender Disparities: Implications for Artificial Intelligence-Based Hiring, we propose and evaluate correspondence experiments as a practical method for auditing LLM-based hiring tools.
The new challenges posed by LLM audits
Much of the existing literature on algorithmic bias focuses on supervised learning systems trained on labeled historical data. In those settings, disparities often reflect biased training labels or statistical tradeoffs among competing definitions of fairness. LLMs differ in important respects. They are pretrained on massive, largely unlabeled text corpora and then post-trained through alignment processes designed to improve safety and compliance with social norms. These stages are opaque, making it difficult to anticipate how sensitive attributes like race and gender might affect downstream outputs.
Regulatory approaches reflect this uncertainty. For example, New York City’s Local Law 144 requires employers using automated employment decision tools to report adverse impact ratios across demographic groups. Although such ratios are widely used, they cannot distinguish disparities driven by differences in applicant qualifications from those caused by discriminatory decision-making. This limitation is particularly acute for LLMs, whose outputs often resemble human judgments rather than simple classifications.
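The adverse impact ratio required by rules like Local Law 144 is a simple descriptive statistic: each group's selection rate divided by the highest group's rate, conventionally flagged when it falls below four-fifths. A minimal sketch, with hypothetical group names and counts (not drawn from the paper's data):

```python
def adverse_impact_ratio(selected, total):
    """Each group's selection rate divided by the highest group's rate.

    `selected` and `total` map group name -> counts; a ratio below 0.8
    is the conventional "four-fifths rule" threshold for adverse impact.
    """
    rates = {g: selected[g] / total[g] for g in total}
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}

# Hypothetical counts of applicants scoring above some hiring threshold.
selected = {"group_a": 40, "group_b": 28}
total = {"group_a": 100, "group_b": 100}
ratios = adverse_impact_ratio(selected, total)
# group_a is the reference (rate 0.40); group_b's rate of 0.28 yields a
# ratio of 0.70, below the four-fifths benchmark.
```

Note what the metric cannot do: nothing in this calculation distinguishes a qualification gap from a discriminatory evaluation, which is precisely the limitation the paper highlights.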
Using correspondence experiments to audit LLMs
To address this gap, we adapt correspondence experiments, a method with a long tradition in labor economics and sociology, to the context of auditing LLMs. In classic correspondence studies, researchers send fictitious but otherwise identical résumés to employers, varying signals of race or gender (often names), and interpret differential treatment as evidence of discrimination. We extend this logic to LLMs acting as evaluators.
Our empirical setting relies on a novel dataset of applications to K–12 teaching positions in a large U.S. public school district. Using public records requests, we obtained 1,373 applications, ultimately focusing on 801 applicants who submitted both résumés and video-based interview responses. These materials resemble the inputs employers might plausibly provide to LLM-based screening systems.
We evaluated 11 prominent LLMs from OpenAI, Anthropic, and Mistral. Each model was prompted to review an applicant’s materials, summarize their qualifications, and provide numerical ratings, including an overall hiring recommendation on a five-point scale. For every applicant, we created eight synthetic dossiers that differed only in implied race (Asian, Black, Hispanic, or White) and gender (male or female), using names, pronouns, and related cues, while holding qualifications constant.
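The crossed design above (four races by two genders) can be sketched as follows. The name and pronoun cues here are illustrative placeholders, not the study's actual name lists, and the template is a stand-in for the full dossier:

```python
from itertools import product

# Illustrative demographic name cues only; the study's actual lists differ.
NAME_CUES = {
    ("Asian", "female"): "Mei Chen",       ("Asian", "male"): "Wei Chen",
    ("Black", "female"): "Lakisha Brown",  ("Black", "male"): "Jamal Brown",
    ("Hispanic", "female"): "Maria Cruz",  ("Hispanic", "male"): "Luis Cruz",
    ("White", "female"): "Emily Walsh",    ("White", "male"): "Greg Walsh",
}
PRONOUNS = {"female": ("she", "her"), "male": ("he", "his")}

def make_dossiers(template):
    """Return the eight race-by-gender variants of one applicant's materials,
    holding qualifications constant and varying only demographic cues."""
    dossiers = []
    for race, gender in product(["Asian", "Black", "Hispanic", "White"],
                                ["female", "male"]):
        subj, poss = PRONOUNS[gender]
        dossiers.append({
            "race": race,
            "gender": gender,
            "text": template.format(name=NAME_CUES[(race, gender)],
                                    subj=subj, poss=poss),
        })
    return dossiers

variants = make_dossiers("{name} has taught for five years; "
                         "{subj} holds {poss} state certification.")
```

Each of the eight variants would then be sent to a model with the same evaluation prompt, so any rating differences are attributable to the manipulated cues alone.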
Before analyzing outcomes, we verified that these manipulations were effective: the models correctly inferred the intended race and gender of the synthetic applicants more than 90% of the time, a rate comparable to human perception in traditional audit studies.
What do adverse impact ratios reveal?
As a baseline, we examined adverse impact ratios using the unmanipulated applicant pool. At higher score thresholds, some models appeared to favor women and non-White applicants, while at lower score thresholds disparities attenuated or reversed. However, these estimates were often statistically imprecise, and most were not significant. More importantly, adverse impact ratios alone cannot tell us whether observed disparities reflect differences in applicant quality or bias in the model’s evaluations.
Evidence from correspondence experiments
Correspondence experiments allow us to overcome this limitation by holding applicant qualifications fixed. Across nearly all models we tested, we found modest but consistent disparities: LLMs rated synthetic female applicants slightly higher than male applicants, and they tended to rate Black, Hispanic, and Asian applicants slightly higher than White applicants.
These effects were not large, but they were systematic. Measured in standard deviation units or percentage-point differences at common hiring thresholds, the disparities were typically a few points: smaller than, but comparable to, those documented in studies of human recruiters.
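One way such a gap can be expressed in standard deviation units is to divide the mean rating difference between two variants of the same applicants by the pooled standard deviation of the ratings. A sketch with made-up five-point recommendations (the function and data are illustrative, not the paper's estimator):

```python
from statistics import mean, stdev

def standardized_gap(ratings_a, ratings_b):
    """Mean rating difference between two variants, in pooled-SD units."""
    pooled = stdev(ratings_a + ratings_b)  # sample SD over both groups
    return (mean(ratings_a) - mean(ratings_b)) / pooled

# Hypothetical five-point hiring recommendations for the same six applicants
# rendered as female vs. male variants.
female_ratings = [4, 3, 5, 4, 3, 4]
male_ratings = [4, 3, 4, 4, 3, 3]
gap = standardized_gap(female_ratings, male_ratings)
# A positive gap indicates higher ratings for the female variants.
```

In the paired correspondence design, each applicant serves as their own control, so even modest gaps like this can be estimated precisely.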
We conducted extensive robustness checks. We varied prompts, restricted inputs to résumés only, and altered contextual cues such as the school district’s demographic composition. Across these variations, the same qualitative pattern persisted, suggesting that our observations were not artifacts of a particular prompt or dataset.
Interpreting the direction of disparities
A striking feature of our findings is that the direction of disparity runs counter to much of the historical literature on discrimination: the models modestly favored women and racial minorities rather than men and White applicants. However, we caution against overinterpreting this result. We hypothesize that these patterns may stem from post-training and alignment processes intended to mitigate discriminatory associations in the training data. In attempting to correct for historical bias, models may overshoot, producing distortions in the opposite direction.
At the same time, we emphasize that the direction and magnitude of disparities are unlikely to generalize across contexts. Other studies have found opposite patterns, and the behavior of any given LLM may vary substantially depending on the task, the prompt, and the applicant pool.
Limitations of correspondence experiments
Although correspondence experiments are a powerful auditing tool, they have important limitations. Race and gender are not easy attributes to isolate. Names signal more than demographic categories; they may also indicate age, socioeconomic status, or cultural background. As a result, we cannot be certain that we have isolated the effect of race or gender alone.
Moreover, correspondence experiments only test sensitivity to the manipulated attributes. An LLM might exhibit little bias with respect to race and gender while still disadvantaging applicants on other dimensions (such as educational pedigree) that indirectly affect protected groups. Finally, audits are inherently context-specific: conclusions drawn in one domain, such as K–12 teaching, may not extend to others.
Implications for research and policy
Despite these limitations, we believe correspondence experiments meaningfully advance audit practice in AI-assisted decision environments. They provide an interpretable way to assess whether sensitive attributes influence LLM outputs, aligning well with regulatory goals while avoiding the pitfalls of purely descriptive metrics like adverse impact ratios. More broadly, our work illustrates how established social-science methods can be adapted to evaluate modern AI systems.
Prasanna (Sonny) Tambe is a Professor of Operations, Information and Decisions at the Wharton School at the University of Pennsylvania. Johann D. Gaebler is a Ph.D. Student in Statistics at Harvard University. Sharad Goel is a Professor of Public Policy at Harvard Kennedy School. Aziz Huq is the Frank and Bernice J. Greenberg Professor of Law at the University of Chicago Law School. This post is based on their recent paper, Auditing Large Language Models for Race & Gender Disparities: Implications for Artificial Intelligence-Based Hiring.
