Almanac
6·arXiv cs.CL (Computation and Language)·26d ago

AI-Assisted Systematization for Evaluating GenAI Systems

This paper addresses a foundational gap in GenAI evaluation: the underspecification of broad, contested concepts like 'reasoning,' 'fairness,' or 'creativity.' The authors introduce a structured artifact called a 'concept spec' and a validation worksheet, then build two AI-assisted systematizers—a zero-shot approach and a multi-agent approach—to convert vague evaluation targets into measurable, structured accounts. They apply these tools to hate-based rhetoric and digital empathy, assessing the resulting specs on content validity and information recoverability. The work positions AI assistance as a scalable aid for the cognitively demanding process of evaluation design.

Related events (8)

4·One Useful Thing·1mo ago

Giving your AI a Job Interview

This commentary piece argues that as AI-generated advice becomes more consequential, users need systematic methods to evaluate AI reliability and quality—analogous to a job interview process. The author proposes frameworks for assessing AI outputs before trusting them for important decisions. The piece addresses the practical challenge of calibrating trust in AI systems across different use cases.

6·OpenAI Blog·1mo ago

AI Safety via Debate

OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than themselves. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.

6·OpenAI Blog·1mo ago

AI-Written Critiques Help Humans Notice Flaws in Summaries

OpenAI trained critique-writing models to identify flaws in AI-generated summaries, finding that human evaluators catch significantly more errors when assisted by model-generated critiques. A key finding is that scale improves critique-writing ability more than summary-writing ability. The work is framed as a step toward using AI to assist human oversight of AI systems on difficult tasks, relevant to scalable oversight research.

3·Import AI·1mo ago

Import AI 447: The AGI Economy, AI-Generated Game Testing, and Agent Ecologies

Import AI issue 447 covers speculative analysis of AGI economic structures, including the concept of a 'superintelligence arcology,' alongside coverage of using procedurally generated games to evaluate AI capabilities and discussion of emergent agent ecologies. The newsletter synthesizes recent developments across frontier AI, evaluation methodology, and multi-agent systems. As a tier-2 commentary source, it provides synthesis and framing rather than primary research.

4·arXiv cs.AI·9d ago

Paper introduces 'cognitive colonization' concept to analyze AI's influence on human reasoning

A preprint from arXiv examines three frameworks for understanding AI's cognitive and epistemic effects: Tri-System Theory, Thinkframes, and System 0. The paper argues System 0 occupies a theoretically distinctive position and introduces 'cognitive colonization' — the idea that AI systems can embed external interests within users' cognitive architecture in ways that are imperceptible. The authors frame this as an urgent philosophical and practical concern given widespread AI deployment.

7·Google DeepMind Blog·1mo ago

Measuring progress toward AGI: A cognitive framework

DeepMind is introducing a cognitive framework designed to measure progress toward AGI, providing structured criteria for assessing how close AI systems are to general intelligence. Alongside the framework, it is launching a Kaggle hackathon to crowdsource the development of relevant evaluations. The announcement signals a formal effort by a Tier 1 lab to operationalize the measurement of AGI progress, a task that has historically been contested and handled informally.

5·Anthropic News·18d ago

Anthropic publishes structured harm assessment framework covering physical, psychological, economic, societal, and individual autonomy impacts

Anthropic has released a policy document describing its evolving framework for assessing and mitigating AI harms across five dimensions: physical, psychological, economic, societal, and individual autonomy impacts. The framework complements the company's existing Responsible Scaling Policy and informs decisions on usage policies, red-teaming, detection, and enforcement. Concrete examples include safeguards against fraud and phishing for computer use capabilities, and a reported 45% reduction in unnecessary refusals in Claude 3.7 Sonnet through improved handling of ambiguous prompts. Anthropic frames the framework as a work in progress and invites collaboration from the broader AI ecosystem.

6·Hugging Face Blog·1mo ago

Gaia2 and ARE: Empowering the community to study agents

Hugging Face has released Gaia2 and the Agents Research Environments (ARE) framework, aimed at enabling the research community to study and benchmark AI agents. The post describes new tools and datasets for evaluating agent capabilities, building on the original GAIA benchmark. The release expands the agent evaluation ecosystem with community-oriented tooling.