5 · arXiv cs.CL (Computation and Language) · 18d ago

FigSIM: Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

FigSIM is the first annotated dataset for analyzing suicide memes, comprising 1,049 memes labeled for suicide severity levels, figurative language phenomena, and suicide-related content categories. The authors benchmark 16 unimodal and multimodal models across three classification tasks. Key findings include systematic underprediction of high-severity cases, particularly for figurative memes, highlighting challenges for automated content moderation. The dataset is publicly released to support future research.
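The underprediction finding above can be made concrete with a small metric. The following is a minimal sketch, not the authors' evaluation code: it assumes severity is encoded as ordered integers (higher means more severe) and that each meme carries a figurative/literal flag. The function name `underprediction_rate`, the 0-3 scale, and the toy labels are all hypothetical.

```python
from collections import defaultdict

def underprediction_rate(gold, pred, groups):
    """Per group, the share of examples whose predicted severity is below the gold severity."""
    totals = defaultdict(int)
    under = defaultdict(int)
    for g, p, grp in zip(gold, pred, groups):
        totals[grp] += 1
        if p < g:  # the model rated the item as less severe than the annotators did
            under[grp] += 1
    return {grp: under[grp] / totals[grp] for grp in totals}

# Toy, made-up labels on an assumed 0-3 scale; this is not FigSIM data.
gold = [3, 3, 2, 1, 3, 0]
pred = [1, 3, 2, 1, 2, 0]
groups = ["figurative", "literal", "figurative", "literal", "figurative", "literal"]

rates = underprediction_rate(gold, pred, groups)
print(rates)
```

Comparing the two per-group rates is one simple way to check whether a classifier misses severe cases more often when the meme is figurative.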

Evaluation and Benchmarking AI Safety Research Multimodal Progress suicide meme content moderation figurative language detection multimodal classification models FigSIM

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 47h ago

Meaning Intelligence Framework addresses context failure in AI processing of Nigerian public discourse

Researchers introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema designed to separate surface sentiment from true communicative intent in Nigerian public discourse. The paper argues that AI systems fail on Nigerian language data primarily due to context failure rather than translation failure, as pragmatic meaning shifts with speaker, audience, and situation. Evaluating Gemini 2.5 Flash on a 30-item calibration dataset, they find zero-shot register classification accuracy of 33.3% rising to 73.3% with schema-informed prompting, demonstrating large gains from structured in-context guidance. The framework and calibration set are released publicly to support reproducibility.

Evaluation and Benchmarking Google Gemini-2.5-Flash-Lite AfriSenti +2 more

4 · arXiv · cs.CL · 29d ago

Image-Semantic Guided Detection of AI-Generated Modern Chinese Poetry Using MLLMs

This paper proposes a multimodal method for detecting AI-generated modern Chinese poetry that pairs each poem's text with images reflecting its content. The approach uses example-driven prompting to bring meaning, imagery, and emotional cues from the images into the analysis alongside the text. A Gemini-based detector using this method achieves 85.65% Macro-F1, outperforming both plain-text LLM baselines and the traditional RoBERTa detector. The work extends AI-generated content detection to modern Chinese poetry, a domain that prior studies had not addressed.

Evaluation and Benchmarking Multimodal Progress RoBERTa image-semantic guided poetry detection modern Chinese poetry AI detection +2 more

4 · Hugging Face Blog · 1mo ago

FineVideo: Behind the Scenes — HuggingFace Video Dataset Release

Hugging Face published a behind-the-scenes account of FineVideo, a curated dataset aimed at advancing video understanding in AI/ML models. The post details the data collection, annotation, and curation methodology used to build the dataset. FineVideo is positioned as a resource for training and evaluating multimodal video models.

Evaluation and Benchmarking Multimodal Progress FineVideo HuggingFace

4 · arXiv · cs.CL · 1mo ago

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

Enterprise Deployment Patterns Agent and Tool Ecosystem International Classification of Diseases (ICD) e5_large Bag of Words (BoW) +3 more

4 · arXiv · cs.CL · 4d ago

RDS Fusion: Hybrid neuro-symbolic gating with compressed CoT for zero-shot irony detection

Researchers introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought reasoning without supervised fine-tuning for irony and sarcasm detection in social media text. Evaluated on TweetEval (N=734) and iSarcasm, the zero-shot system matches fine-tuned BERTweet performance and outperforms supervised SemEval transformer ensembles on the imbalanced iSarcasm dataset. A statistical ablation shows that only the full concurrent fusion of all three signals yields a validated improvement, with individual components providing no significant standalone gain.

Evaluation and Benchmarking TweetEval BERTweet Robust Dual-Signal Fusion +1 more

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.
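One way to picture how community endorsement becomes preference data is sketched below. This is an illustrative assumption, not LLUMI's actual pipeline: the function `build_preference_pairs`, the `min_gap` threshold, and the record fields (`text`, `score`) are hypothetical, and the `prompt`/`chosen`/`rejected` layout is simply a common convention for preference datasets.

```python
from itertools import combinations

def build_preference_pairs(prompt, replies, min_gap=1):
    """Pair replies to the same post; the higher-scored reply becomes 'chosen'.

    replies: list of dicts with 'text' and 'score' (net upvotes minus downvotes).
    min_gap: minimum score difference required before a pair counts as a preference.
    """
    pairs = []
    for a, b in combinations(replies, 2):
        if abs(a["score"] - b["score"]) < min_gap:
            continue  # scores too close to treat as a clear community preference
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append({
            "prompt": prompt,
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs

# Toy, made-up thread; not data from the paper.
replies = [
    {"text": "Reply A", "score": 10},
    {"text": "Reply B", "score": 2},
    {"text": "Reply C", "score": 2},
]
pairs = build_preference_pairs("Original post text", replies)
print(len(pairs))
```

Tied replies (B and C here) produce no pair, so only comparisons with a clear score difference would reach preference training under this sketch.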

Open Weights Progress AI Safety Research Reddit LLUMI Direct Preference Optimization (DPO) +3 more

5 · arXiv · cs.CL · 47h ago

StylisticBias benchmark reveals a small set of visual cues drives most social bias in MLLMs

Researchers introduce StylisticBias, a controlled benchmark of ~25K photorealistic face images with single-attribute variations designed to isolate how specific visual cues shift social judgments in multimodal LLMs. Evaluating six MLLMs across 25 binary social judgment scenarios, they find that age and body type dominate identity-level effects, while fashion style drives the largest attribute-level shifts, with ~15 attributes accounting for ~80% of total bias variation. The benchmark is released publicly on GitHub and Hugging Face, enabling fine-grained bias auditing of multimodal models.

Evaluation and Benchmarking AI Safety Research StylisticBias StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs +1 more

4 · arXiv · cs.CL · 18d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics Enterprise Deployment Patterns MIMIC-III Llama 3.1 70B quantization +4 more