7·arXiv cs.AI (Artificial Intelligence)·24d ago

Algorithmic Monocultures in Hiring: Racial Disparities and Homogeneous Rejection Patterns

A study of 3 million applicants and 4 million applications, all screened by algorithms from the same vendor, finds significant racial disparities: 14.74% of Asian applicants and 25.87% of Black applicants apply to positions where the algorithm adversely impacts their group under U.S. employment discrimination standards. The paper also documents individual-level homogeneity: 4% of applicants who apply to 10 positions receive rejection recommendations from all of them, a rate above chance. Because the hiring algorithms are deterministically replicable, the authors can simulate counterfactual outcomes, and they show that applicants would need to apply very broadly to receive human review.
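To make the two measurements above concrete, here is a minimal sketch. The summary does not say which statistical test the authors apply; the four-fifths rule of thumb from the EEOC Uniform Guidelines is assumed here as one common reading of "adverse impact". The independence baseline for "above chance" is likewise an assumption, and the function names and all rates are hypothetical, not figures from the paper.

```python
def impact_ratio(group_rate: float, reference_rate: float) -> float:
    """Ratio of a group's selection rate to the reference group's rate."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return group_rate / reference_rate


def has_adverse_impact(group_rate: float, reference_rate: float,
                       threshold: float = 0.8) -> bool:
    """Four-fifths rule of thumb (an assumption, not confirmed by the summary)."""
    return impact_ratio(group_rate, reference_rate) < threshold


def chance_all_rejected(reject_rate: float, n_positions: int) -> float:
    """Probability of n straight rejections if decisions were independent."""
    if not 0.0 <= reject_rate <= 1.0:
        raise ValueError("reject_rate must be in [0, 1]")
    return reject_rate ** n_positions


if __name__ == "__main__":
    # Hypothetical selection rates, not figures from the paper.
    print(has_adverse_impact(group_rate=0.30, reference_rate=0.50))
    # Independence baseline for being rejected by all 10 positions.
    print(chance_all_rejected(reject_rate=0.5, n_positions=10))
```

With these made-up rates the impact ratio is 0.6, below the 0.8 threshold, so the sketch flags adverse impact. An observed all-rejected share well above the independence baseline would point to correlated decisions across positions, which is the monoculture concern.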

Evaluation and Benchmarking · AI Safety Research · Enterprise Deployment Patterns · Regulatory Developments · hiring screening algorithms · algorithmic monoculture · Algorithmic Monocultures in Hiring · U.S. employment discrimination standards

Related guides (4)

AI Safety Research·Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Enterprise Deployment Patterns·Topic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality


Regulatory Developments·Topic guide

AI Regulatory Developments: From Voluntary Frameworks to Government Enforcement


Evaluation and Benchmarking·Topic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability


Related events (8)

5·AI Snake Oil·1mo ago

Does the UK's liver transplant matching algorithm systematically exclude younger patients?

This commentary examines whether the UK's liver transplant matching algorithm contains technical design choices that systematically disadvantage younger patients. The piece argues that seemingly minor algorithmic decisions can have life-or-death consequences in high-stakes medical AI systems. It falls within the broader discourse on algorithmic fairness and unintended bias in deployed AI/ML systems.

AI Safety Research · Enterprise Deployment Patterns · UK liver transplant matching algorithm · NHS · AI Snake Oil

4·MIT Technology Review (AI)·25d ago

A Reality Check on the AI Jobs Hysteria

MIT Technology Review critically examines current narratives around AI-driven white-collar job displacement, questioning whether recent tech-sector layoffs at companies such as Coinbase, Meta, and Cisco genuinely signal broad AI-driven workforce disruption. The piece appears to push back on alarmist framing of AI's near-term labor market impact. Its focus is knowledge workers, including software developers and financial analysts.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Cisco · Coinbase · MIT Technology Review · +1 more

6·arXiv · cs.CL·5d ago

Computational audit finds ClinicalBERT amplifies demographic bias beyond training data distributions

Researchers present a systematic audit of representational bias in ClinicalBERT, a BERT-based model pretrained on MIMIC-III clinical discharge summaries, using two probing methodologies: Log Probability Bias Analysis and Masked Language Model probing across 98 clinical sentence templates and eight intersectional race-gender combinations. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing. The key finding is that bias in ClinicalBERT operates predominantly through model-internal amplification rather than simple inheritance from training data, which has direct implications for clinical AI safety and deployment. This challenges the assumption that auditing training corpora is sufficient to characterize model bias.

Evaluation and Benchmarking · AI Safety Research · A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions · MIMIC-III · ClinicalBERT · +1 more

7·Anthropic News·19d ago

Anthropic Launches Economic Index: First Large-Scale Empirical Study of AI's Labor Market Impact

Anthropic has released the Anthropic Economic Index, an initiative tracking AI's effects on labor markets using anonymized data from approximately one million Claude.ai conversations matched to U.S. Department of Labor O*NET occupational tasks. Key findings show AI use is concentrated in software development and technical writing, with 36% of occupations seeing AI use in at least 25% of their tasks, and usage skewing toward augmentation (57%) over automation (43%). The underlying dataset is being open-sourced to enable independent research, and Anthropic is inviting economists and policy experts to contribute to the ongoing initiative. The analysis was enabled by Clio, Anthropic's privacy-preserving internal conversation analysis tool.

Evaluation and Benchmarking · Enterprise Deployment Patterns · claude.ai · Clio · U.S. Department of Labor · +5 more

6·The Batch·28d ago

Agent Benchmarks Skew Toward Software Engineering, Missing Most Economically Valuable Labor

Researchers from Carnegie Mellon University and Stanford University mapped over 10,000 examples from 43 agent benchmarks to U.S. labor statistics using O*NET occupational taxonomies, finding that current benchmarks heavily over-represent software engineering relative to its share of employment and wages. Office and administrative support (18.2M workers, $869.8B wages) and management (11M workers, $1326.3B wages) are vastly under-represented compared to computer and mathematical occupations (5.2M workers, $563.6B wages). No single benchmark covered more than 50% of work activities, and all 43 benchmarks combined covered only 56.5% of work activities. The study identifies a systematic gap between where agentic AI is being evaluated and where the largest economic opportunity lies.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Carnegie Mellon University · GDPval · Stanford University · +7 more

5·arXiv · cs.CL·5d ago

Study finds AI-generated stories rely on superficial cultural markers rather than holistic localization

Researchers propose a method to measure the degree of 'templated' versus 'holistic' cultural localization in AI-generated stories, finding that only 9-17% of vocabulary accounts for cross-national variation and that a shared culturally-agnostic narrative template underlies most outputs. The study evaluates five models across 125 topics and 193 nationalities. A notable finding is that cultural markers associated with 19 countries—mostly in the Global South—are rated as offensive on average, raising concerns about bias and representation in multilingual/multicultural AI content generation.

Evaluation and Benchmarking · AI Safety Research · Characterizing Cultural Localization in AI-Generated Stories

5·arXiv · cs.CL·15d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking · AI Safety Research · From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation · Meta Llama-3.1-8B

5·arXiv · cs.CL·2d ago

RECOM benchmark reveals validity-discrimination tradeoff in automatic metrics for open-ended QA

Researchers introduce RECOM, a contamination-free evaluation dataset of 15,000 r/AskReddit questions paired with authentic community replies postdating all evaluated models' training cutoffs. Testing five open-source 7–10B LLMs, the paper finds that no standard automatic metric (cosine similarity, BERTScore, LLM judges) simultaneously achieves both validity (distinguishing real from random answers) and discriminative power (ranking models against each other). Cosine similarity is valid but cannot rank models; BERTScore's apparent ranking collapses when response length is controlled. The authors argue this tradeoff is a structural property of metric representation design and recommend reporting metrics on both axes with an explicit random-baseline floor.
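As an illustration of reporting a metric on both axes with a random-baseline floor, the sketch below uses deliberately simple stand-ins. The summary does not give the paper's formal definitions, so `validity_gap` and `model_spread` are hypothetical operationalizations, and the scores are toy values rather than RECOM results.

```python
from statistics import mean
from typing import Mapping, Sequence


def validity_gap(real_scores: Sequence[float],
                 random_scores: Sequence[float]) -> float:
    """Mean score on real question-answer pairs minus the mean score on
    randomly paired answers (the random-baseline floor)."""
    return mean(real_scores) - mean(random_scores)


def model_spread(per_model_scores: Mapping[str, Sequence[float]]) -> float:
    """Gap between the best and worst per-model mean score; a crude
    stand-in for how well the metric separates models."""
    means = [mean(scores) for scores in per_model_scores.values()]
    return max(means) - min(means)


if __name__ == "__main__":
    # Toy scores for illustration only.
    real = [0.8, 0.6]
    floor = [0.2, 0.4]
    by_model = {"model_a": [0.5, 0.7], "model_b": [0.6, 0.6]}
    print("validity gap:", validity_gap(real, floor))
    print("model spread:", model_spread(by_model))
```

In this toy case the metric sits clearly above the random floor but gives both models the same mean, which mirrors the pattern the summary attributes to cosine similarity: valid, yet unable to rank models.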

Evaluation and Benchmarking · BERTScore · RECOM · r/AskReddit