r/AskReddit
r-askreddit-4f245570 · 1 event · first seen 2d ago
Aliases: r/AskReddit
Recent events (1)
RECOM benchmark reveals validity-discrimination tradeoff in automatic metrics for open-ended QA
Researchers introduce RECOM, a contamination-free evaluation dataset of 15,000 r/AskReddit questions paired with authentic community replies that postdate the training cutoffs of all evaluated models. In tests of five open-source 7–10B-parameter LLMs, the paper finds that no standard automatic metric (cosine similarity, BERTScore, LLM judges) achieves both validity (distinguishing real from random answers) and discriminative power (ranking models against each other). Cosine similarity is valid but cannot rank models; BERTScore's apparent ranking collapses when response length is controlled. The authors argue this tradeoff is a structural property of metric representation design and recommend reporting metrics on both axes with an explicit random-baseline floor.
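
To make the "random-baseline floor" idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the bag-of-words vectors stand in for real sentence embeddings, the toy questions and replies are invented for illustration, and the helper names (`bow`, `cosine`, `mean_score`, `random_baseline`) are hypothetical. The assumption is that a floor can be estimated by scoring each response against a reply taken from a different question.

```python
import math
import random
from collections import Counter


def bow(text):
    """Toy bag-of-words vector (a stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_score(responses, references):
    """Average similarity of each response to its paired reference."""
    pairs = list(zip(responses, references))
    return sum(cosine(bow(r), bow(ref)) for r, ref in pairs) / len(pairs)


def random_baseline(responses, references, seed=0):
    """Floor estimate: pair every response with a reply to a different question.

    Rotating by a nonzero offset guarantees that no response is scored
    against its own reference.
    """
    n = len(references)
    if n < 2:
        raise ValueError("need at least two items to build mismatched pairs")
    offset = random.Random(seed).randrange(1, n)
    mismatched = [references[(i + offset) % n] for i in range(n)]
    return mean_score(responses, mismatched)


# Invented toy data: one community reply and one model response per question.
references = [
    "coffee first then a long walk outside",
    "i would buy a small house near the ocean",
    "my dog learned to open the fridge door",
]
responses = [
    "a long walk and coffee",
    "buy a house near the ocean",
    "the dog can open the door",
]

matched = mean_score(responses, references)
floor = random_baseline(responses, references)
print(f"matched={matched:.3f} floor={floor:.3f} margin={matched - floor:.3f}")
```

Under these assumptions, a metric passes the validity check when the matched score sits clearly above the floor. Discriminative power is a separate question: it asks whether the matched scores of different models differ from one another by more than noise, which this sketch does not measure.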