Chinese Sensorimotor and Embodiment Norms for 3,000 Lexicalized Concepts
Researchers present a large-scale normative database of sensorimotor and embodiment ratings for 3,000 Mandarin Chinese concepts, collected from 378 native speakers across 11 sensorimotor dimensions. A validation study identifies PSE-Sensorimotor and Minkowski-3 as the strongest composite predictors of lexical decision performance. An exploratory analysis finds that sensorimotor ratings are substantially recoverable from purely linguistic (distributional) representations via simple regression (mean Spearman r = .62), with visual and auditory dimensions recovering better than chemosensory ones. The work provides both a cognitive science resource and empirical evidence bearing on whether LLMs can acquire embodied conceptual knowledge from text alone.
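The summary names Minkowski-3 as a composite predictor but does not define it. As a minimal sketch, assuming it follows the usual convention of taking the Minkowski distance (exponent 3) between a concept's rating vector and the origin, the composite could be computed as below. The function name and the example ratings are hypothetical and are not taken from the database.

```python
from typing import Sequence


def minkowski_strength(ratings: Sequence[float], m: float = 3.0) -> float:
    """Collapse per-dimension ratings into one composite score.

    Assumption: the composite is the Minkowski distance of the rating
    vector from the origin, i.e. (sum |r_i|^m)^(1/m). With m = 3 the
    strongest dimensions dominate, but weaker ones still contribute.
    """
    if m <= 0:
        raise ValueError("exponent m must be positive")
    return sum(abs(r) ** m for r in ratings) ** (1.0 / m)


# Hypothetical ratings for one concept on 11 sensorimotor dimensions.
example_ratings = [4.5, 1.2, 0.3, 0.1, 0.0, 2.8, 0.9, 3.1, 0.4, 0.2, 1.0]
score = minkowski_strength(example_ratings)
print(f"Minkowski-3 composite: {score:.3f}")
```

For non-negative ratings the result is bounded below by the largest single rating and above by the sum of all ratings, so it sits between a max-based and a sum-based composite.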
Related events (8)
Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora
This paper investigates whether LLM-based machine translation can preserve moral semantic content well enough to enable cross-lingual moral values classification, using Polish as a test case with ~50k annotated social media posts. A four-method validation pipeline (LaBSE embedding similarity, CKA, LLM-as-judge, and classifier parity) shows mean cosine similarity of 0.86 and AUC gaps of only 0.01–0.02 across Moral Foundations categories. The results suggest machine translation is a practical path to extending moral values NLP research to under-resourced languages, with expected generalization to related Slavic languages.
ESI-Bench: A Benchmark for Embodied Spatial Intelligence Closing the Perception-Action Loop
ESI-Bench is a new benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories, built on OmniGibson and grounded in Spelke's core knowledge systems. It evaluates agents that must actively deploy perception, locomotion, and manipulation to accumulate task-relevant evidence, rather than passively processing oracle observations. Experiments on state-of-the-art MLLMs reveal that active exploration outperforms passive baselines, but most failures stem from 'action blindness'—poor action choices leading to cascading errors—and a metacognitive gap where models commit prematurely with high confidence regardless of evidence quality. Human studies show humans seek falsifying viewpoints and revise beliefs under contradiction, a capability current models lack.
Systematic Study of Schwartz Value Detection in Political Texts: Context, Scale, and Moral Knowledge
This paper investigates when additional context, larger models, or retrieved moral knowledge improve detection of Schwartz human values in political text using the ValueEval benchmark format. Key findings show that full-document context helps supervised DeBERTa encoders (+3.8–4.8 macro-F1) but not zero-shot LLMs, while RAG with a curated moral knowledge base consistently benefits all model families under early fusion. Scaling model size does not guarantee gains, and simple early fusion outperforms more complex RAG variants. The study recommends jointly evaluating context, knowledge, and model family rather than assuming larger inputs or models universally improve value-sensitive NLP.
Multimodal neurons in artificial neural networks
OpenAI researchers discovered neurons in CLIP that respond to the same concept across literal, symbolic, and conceptual representations. This finding parallels multimodal neurons previously observed in biological brains and helps explain CLIP's ability to classify unusual visual renditions of concepts. The work is presented as a step toward understanding the associations and biases learned by CLIP and similar vision-language models.
Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions
This paper investigates whether language models can learn the semantics of rare English constructions (e.g., 'let alone', 'much less'), constructing a novel dataset to test form-meaning pairing understanding. Testing models across parameter counts, architectures, and pretraining dataset sizes, the authors find that modestly sized open-source models can grasp Paired-Focus construction semantics, while models trained on human-scale data fail. Training dynamics analysis reveals that semantic understanding of these constructions emerges later than syntactic knowledge and correlates with gains in world knowledge more broadly.
Phun-Bench: A Chinese benchmark for evaluating LLM phonological understanding
Researchers introduce Phun-Bench, a purpose-built benchmark for evaluating LLMs on phonological understanding in Chinese across three dimensions: Homophony, Rhyme, and Phonetic Similarity. The benchmark is designed to avoid rote-memorization shortcuts that plague existing phonological evaluations. Results show LLMs can recall correct pronunciations but fail to apply phonological knowledge flexibly as human speakers do, and the authors propose a hypothesis about the underlying mechanism of LLM phonological 'perception'.
PhysTool-Bench reveals severe gaps in MLLM physical tool use and embodied planning
Researchers introduce PhysTool-Bench, the first benchmark evaluating multimodal LLMs on physical tool use across 2,510 queries and 2,678 real-world tools spanning manufacturing, electrical work, agriculture, and healthcare. Evaluation of 13 leading MLLMs shows even the best model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes just 21.0% of queries end-to-end. The results expose a two-level deficit: poor tool perception in realistic scenes and a much larger drop at the planning stage, indicating a lack of functional commonsense for mapping tools to task semantics. This pinpoints a critical bottleneck for embodied AI development.
Real Images, Worse Judgments: Evaluating VLMs on Concreteness and Imagery
This paper evaluates whether vision-language models (VLMs) benefit from real image context when making lexical judgments about word concreteness and imagery. The authors find that real-image contexts frequently hurt alignment with human ratings, especially when visual evidence is least relevant to the word being judged. Probing and canonical correlation analysis reveal that real images cause representational shifts and increased sensitivity to spurious visual cues. Instructing models to focus on text-only content at inference time partially mitigates this degradation.

