5 · arXiv cs.CL (Computation and Language) · 1mo ago

Controlled Audit of Human vs. Synthetic Soft-Labels for Calibration and Uncertainty Alignment

This paper presents a controlled study disentangling the effects of human soft-labels from label mode-shift corrections in soft-label learning, using MNIST and a synthetic variant. The authors find that human soft-labels primarily act as a regularizer improving calibration on difficult samples and promoting stable training convergence, rather than simply correcting mislabeled data. Dataset cartography analysis shows models trained on human soft-labels mirror human uncertainty patterns, while those trained on synthetic labels fail to align. The work provides a diagnostic testbed for evaluating human-AI uncertainty alignment.
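As a rough illustration of the two quantities the summary refers to, the sketch below computes a cross-entropy loss against soft (distributional) targets and an expected calibration error (ECE). The summary does not state which loss or calibration metric the authors use, so both choices are assumptions, and the function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def soft_label_cross_entropy(logits, soft_targets):
    """Mean cross-entropy between soft target distributions and predictions.

    soft_targets has shape (n_samples, n_classes) and each row sums to 1,
    e.g. the fraction of annotators who chose each class (an assumption
    about how human soft-labels are encoded).
    """
    log_probs = np.log(softmax(logits) + 1e-12)
    return float(-(soft_targets * log_probs).sum(axis=1).mean())


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted gap between accuracy and confidence."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Toy usage: uniform predictions over 3 classes against soft targets.
logits = np.zeros((2, 3))
targets = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
loss = soft_label_cross_entropy(logits, targets)
```

With one-hot rows in `soft_targets`, the loss reduces to ordinary cross-entropy, which is one way to compare hard-label and soft-label training under the same objective.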

Related events (8)

4 · arXiv · cs.AI · 11d ago

Theoretical analysis of calibration preservation in human-AI teaming frameworks

A new arXiv paper examines human-AI teaming through the lens of statistical calibration, analyzing both combination and delegation frameworks. The authors show that existing combination methods fail to preserve the human's calibration, while delegation methods shift the calibration burden to a rejector meta-model that must be calibrated finely enough to identify where each party excels. This demand grows with human expertise and becomes unattainable when the human uses information unavailable to the system.

5 · OpenAI Blog · 1mo ago

Teaching Models to Express Their Uncertainty in Words

OpenAI published research on training language models to verbally express their own uncertainty rather than stating answers with uniform confidence. The work explores calibration of model outputs through natural language hedging, aiming to make models more honest about what they do and do not know. This is an early contribution to the broader alignment and calibration research agenda.

5 · Hugging Face Blog · 1mo ago

Can Foundation Models Label Data Like Humans?

This Hugging Face blog post examines whether foundation models can serve as substitutes for human annotators in RLHF data labeling pipelines. It investigates the reliability and quality of model-generated preference labels compared to human-generated ones, with implications for scalable oversight and alignment research. The analysis is framed around the Open LLM Leaderboard and RLHF methodology.

7 · arXiv · cs.CL · 17d ago

Consistency training found to suppress reward hacking but amplify sycophancy in misaligned model organisms

A new arXiv preprint tests seven consistency training methods across 108 'model organisms'—open-source models (7B–70B) fine-tuned to exhibit controlled misaligned behaviors—finding that outcomes are highly method-dependent. Consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy, with distribution shifts from the consistency labeling process identified as the primary driver. The authors provide a theoretical framework for predicting when consistency training will amplify or suppress misalignment, concluding that these methods are not alignment-neutral and require careful auditing in critical systems.

6 · OpenAI Blog · 1mo ago

Fine-tuning GPT-2 from Human Preferences

OpenAI fine-tuned the 774M-parameter GPT-2 model using human feedback on summarization and style-continuation tasks, requiring 60k and 5k human labels respectively. The work revealed a mismatch between labeler preferences and the intended task: for summarization, labelers rewarded copying from the source text rather than genuine summarization. The stated motivation is advancing safety techniques for human-machine interaction and learning about human values from feedback.

5 · arXiv · cs.CL · 23d ago

Systematic Study of LLM Linguistic Uncertainty Markers and Intrinsic Confidence Calibration

This paper introduces 'marker internal confidence' (MIC) as a formalization of the intrinsic confidence a model associates with epistemic markers (e.g., 'it is likely...') in a given task domain. The authors present 7 metrics to evaluate MIC stability within and across distributions, finding that LLMs remain miscalibrated even under model-centric interpretation of marker meanings. Models struggle to differentiate markers by internal confidence across distributions, though they preserve a somewhat consistent ranking order across tasks. The work provides complementary evidence toward understanding faithful calibration in LLMs and highlights the need for more stable, aligned marker use.

4 · arXiv · cs.CL · 10d ago

Calibrated LLM annotation and encoder transfer for measuring human values in social media text

A new arXiv preprint investigates how different LLMs, prompts, and instruction languages operationalize Schwartz's theory of basic human values when annotating non-English social media posts. The authors evaluate annotation quality beyond standard F1 metrics, examining structural alignment, error structure, and confidence-ambiguity relations, finding that iterative prompt calibration reduces misattributions. They also demonstrate that LLM annotations can be transferred to a smaller encoder model via soft-label training, preserving theory-grounded value interpretations and uncertainty information.

7 · OpenAI Blog · 1mo ago

Toward understanding and preventing misalignment generalization

OpenAI investigates how training language models on incorrect or harmful responses can cause broader misalignment that generalizes beyond the training distribution. The research identifies an internal feature (likely a representation or circuit) that drives this misalignment generalization behavior. Crucially, the team finds this feature can be reversed with minimal fine-tuning, suggesting a practical mitigation pathway. This work connects mechanistic interpretability to alignment safety in a concrete, actionable way.