Entity · model

GPT

model · active · gpt-58362d74 · 12 events · first seen May 19, 2026

Aliases: GPT

More like this (12)

GPTs, GPT-f, GPT-1, Image GPT, GPT-4, GPT-4.1, WebGPT, GPT-next, GPT Builder, GPT-5.2, GPT-4V, GPT Pro

Recent events (12)

6 · arXiv · cs.CL · Jul 15, 2026

One-Word Census: Answer-choice conformity measured across 44 language models

Researchers introduce the One-Word Census, a minimal 31-prompt instrument that probes which one-word answers language models select from open-ended categories, and apply it to 44 models. Convergence is extreme (41% of models chose 'serendipity' when asked to pick any word), yet conformity varies fourfold across models in structured ways: persona- and community-tuned models diverge most, while the newest mainline flagships conform most. Within four model lineages (Claude, GPT, Qwen, Grok), conformity rises with each generation but reverses for the latest Claude and GPT flagships, suggesting possible repositioning. Model answers are more lexically concentrated than human norms in 18 of 20 shared categories.

Frontier Model Releases Evaluation and Benchmarking Grok Claude Qwen +4 more
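As a rough illustration of the kind of measurement the One-Word Census entry describes, the sketch below computes the share of models that give the most common answer to a prompt, plus a per-model conformity score. The data layout, the names `modal_share` and `conformity`, and the definition of conformity as "fraction of prompts on which a model gives the modal answer" are assumptions made for illustration; the summary does not state the paper's actual metric.

```python
from collections import Counter


def _normalize(answer):
    # Treat "Serendipity" and " serendipity " as the same one-word answer.
    return answer.strip().lower()


def modal_share(answers):
    """Fraction of models giving the most common answer to one prompt.

    `answers` maps a (hypothetical) model name to its one-word answer.
    """
    counts = Counter(_normalize(a) for a in answers.values())
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers)


def conformity(model, responses):
    """Fraction of prompts on which `model` gives the modal answer.

    `responses` maps each prompt to a {model: answer} dict.
    This definition is an assumption, not the paper's metric.
    """
    hits = 0
    for answers in responses.values():
        counts = Counter(_normalize(a) for a in answers.values())
        modal_answer, _ = counts.most_common(1)[0]
        hits += _normalize(answers[model]) == modal_answer
    return hits / len(responses)


# Toy data: three hypothetical models, two prompts.
responses = {
    "pick any word": {"m1": "serendipity", "m2": "Serendipity", "m3": "ephemeral"},
    "pick a color": {"m1": "blue", "m2": "blue", "m3": "blue"},
}
```

With this toy data, `m3` departs from the modal answer on one of the two prompts, so it scores lower than `m1` under the assumed definition.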

4 · arXiv · cs.CL · Jul 3, 2026

LLMs evaluated for automated grading of Linux/bash exams using four-level cognitive taxonomy

A new arXiv paper evaluates GPT, Claude Opus, Gemini, and GLM on automated grading of 1,200 real student Linux/bash command responses, benchmarked against three expert instructors. Using a four-level cognitive taxonomy, Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC=0.888, MAE=0.10). Key findings: rubric quality mattered more than model choice, and grading accuracy declined consistently at higher cognitive complexity levels. The study proposes a taxonomy-based framework for deciding which exam questions are suitable for AI-assisted grading.

Evaluation and Benchmarking Claude Opus 4.6 Google Gemini-3.0-Pro +5 more
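The MAE figure in the grading entry above can be read as an average gap between a model's score and the instructors' scores. The sketch below shows one way to compute such a value, assuming the reference score for each response is the mean of the expert scores; that aggregation rule, the function names, and the toy numbers are assumptions, since the summary does not say how the paper combines the three instructors. ICC is omitted because the summary does not specify which ICC variant was used.

```python
from statistics import mean


def consensus(expert_scores):
    """Mean expert score per response (assumed aggregation rule)."""
    return [mean(per_response) for per_response in expert_scores]


def mean_absolute_error(model_scores, expert_scores):
    """Average absolute gap between model scores and expert consensus."""
    target = consensus(expert_scores)
    if len(model_scores) != len(target):
        raise ValueError("model and expert score lists differ in length")
    return mean([abs(m - t) for m, t in zip(model_scores, target)])


# Toy data: three responses, each graded by three hypothetical instructors.
model_scores = [1.0, 0.5, 0.0]
expert_scores = [
    [1.0, 1.0, 1.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 0.5],
]
mae = mean_absolute_error(model_scores, expert_scores)
```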

6 · arXiv · cs.LG · Jul 1, 2026

Surrogate Fidelity: Open LLMs often cannot reliably explain closed model behavior

A new arXiv paper from Facebook Research evaluates whether mechanistic interpretability findings from open-weight models transfer to closed API-only models across prediction, attribution, and representation levels. Studying eleven models across four families (Llama, Qwen, GPT, Gemini), the authors find that prediction-level agreement substantially overstates attribution fidelity: models that agree on answers often disagree on why. They document an 'access-validity inversion' where white-box signals like attention patterns are stable across models but weakly predictive of causal attributions, undermining the common practice of using open surrogates to explain closed systems.

Evaluation and Benchmarking AI Safety Research Qwen Surrogate Fidelity: When Can Open LLMs Explain Closed Ones? Llama +3 more

6 · arXiv · cs.CL · Jun 17, 2026

RubricsTree: Scalable hierarchical rubric framework for evaluating personal health AI agents

RubricsTree is a new evaluation framework for LLM-powered personal health agents, built around a hierarchical taxonomy of over 100 clinically-verifiable Boolean rubrics derived from 4,000 real user queries and curated with physician oversight. A context-aware router activates only relevant rubrics per query, enabling scalable yet expert-aligned evaluation. The framework outperforms strong LLM-as-a-judge baselines on expert alignment and, when used as training signal, yields up to ~66% relative gains on HealthBench across Gemini, GPT, and Qwen model families. The work addresses a concrete bottleneck in clinical deployment of health AI: the cost-quality tradeoff in evaluation.

Evaluation and Benchmarking AI Safety Research HealthBench RubricsTree Qwen +2 more
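The RubricsTree entry above describes a hierarchy of Boolean rubrics with a router that activates only the relevant ones for each query. A minimal sketch of that shape follows. Everything in it is hypothetical: the `Rubric` and `Node` classes, the toy taxonomy, and especially the keyword-matching `route` function, which stands in for the paper's context-aware router and is not a description of how that router works.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rubric:
    name: str
    check: Callable[[str], bool]  # Boolean verdict on a response


@dataclass
class Node:
    name: str
    keywords: tuple = ()  # empty means "always relevant"
    rubrics: list = field(default_factory=list)
    children: list = field(default_factory=list)


def route(node, query):
    """Collect rubrics from every branch whose keywords match the query."""
    q = query.lower()
    if node.keywords and not any(k in q for k in node.keywords):
        return []  # prune this whole subtree
    active = list(node.rubrics)
    for child in node.children:
        active.extend(route(child, query))
    return active


def score(response, rubrics):
    """Fraction of activated rubrics the response satisfies."""
    if not rubrics:
        return None
    return sum(r.check(response) for r in rubrics) / len(rubrics)


# Toy taxonomy with made-up rubrics.
tree = Node(
    name="root",
    rubrics=[Rubric("suggests_clinician", lambda r: "doctor" in r.lower())],
    children=[
        Node("sleep", ("sleep", "insomnia"),
             [Rubric("mentions_sleep_routine", lambda r: "sleep" in r.lower())]),
        Node("nutrition", ("diet", "calorie"),
             [Rubric("mentions_portions", lambda r: "portion" in r.lower())]),
    ],
)

active = route(tree, "I have had insomnia lately")
result = score("Keep a regular sleep schedule.", active)
```

Pruning at the node level is what keeps the number of evaluated rubrics small per query in this sketch: the nutrition branch is never visited for a sleep question.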

5 · Hacker News · Jun 15, 2026

HN community discusses replacing Claude/GPT with local models for daily coding

A high-engagement Hacker News thread (510 points, 256 comments) asks whether practitioners have successfully replaced cloud-hosted models like Claude or GPT with local models for daily coding workflows. The discussion likely surfaces real-world comparisons of local vs. hosted model performance, latency, cost, and privacy tradeoffs. High engagement signals this is a live practitioner concern in mid-2026.

Open Weights Progress Inference Economics Claude OpenAI GPT +1 more

5 · GitHub Trending · Jun 3, 2026

HexStrike AI: MCP server exposing 150+ cybersecurity tools to AI agents

HexStrike AI is an open-source MCP server that enables AI agents (Claude, GPT, Copilot, and others) to autonomously invoke over 150 offensive security tools for penetration testing, vulnerability discovery, and bug bounty automation. The project bridges LLMs with real-world offensive security capabilities via the Model Context Protocol. With 9,221 GitHub stars, it represents a notable community signal around agentic security tooling and the expanding attack surface of AI-driven automation.

AI Safety Research Agent and Tool Ecosystem Claude HexStrike AI MCP +1 more

5 · arXiv · cs.CL · May 29, 2026

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress AI Safety Research Reddit LLUMI Direct Preference Optimization (DPO) +3 more
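To make the LLUMI entry's "preference pairs from upvotes/downvotes" concrete, here is a minimal sketch of one way such pairs could be built. The pairing rule (every two replies to the same post whose net scores differ by at least `min_gap`), the threshold value, and the function name are assumptions for illustration; the summary does not describe the paper's actual construction. The `prompt`/`chosen`/`rejected` layout is a common convention for preference data, used here only as an example format.

```python
from itertools import combinations


def build_preference_pairs(post, replies, min_gap=5):
    """Turn scored replies into (chosen, rejected) preference pairs.

    `replies` is a list of (text, net_score) tuples, where net_score is
    upvotes minus downvotes. `min_gap` is a hypothetical threshold that
    skips pairs whose scores are too close to signal a clear preference.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(replies, 2):
        if abs(score_a - score_b) < min_gap:
            continue  # community signal too weak to rank these two
        if score_a > score_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"prompt": post, "chosen": chosen, "rejected": rejected})
    return pairs


# Toy data with made-up scores.
replies = [("reply A", 40), ("reply B", 2), ("reply C", 38)]
pairs = build_preference_pairs("original post text", replies)
```

Here replies A and C are too close in score to be paired, so only the two pairs that rank A or C above B are kept.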

6 · OpenAI Blog · May 20, 2026

Image GPT: Transformer Models Applied to Pixel Sequences for Image Generation and Classification

OpenAI demonstrates that a large transformer model trained autoregressively on pixel sequences can generate coherent image completions and samples, analogous to text generation. The work establishes a correlation between generative sample quality and downstream image classification accuracy. The best generative model achieves features competitive with top convolutional networks in the unsupervised setting, suggesting shared representational principles across modalities.

Frontier Model Releases Multimodal Progress Transformers convolutional neural network OpenAI +2 more

6 · OpenAI Blog · May 20, 2026

Efficient Training of Language Models to Fill in the Middle

OpenAI published research on training language models with a fill-in-the-middle (FIM) objective, enabling models to complete text given both a prefix and a suffix context. The technique allows infilling capabilities to be added at essentially no cost to left-to-right generative performance. This work has direct implications for code completion and editing use cases, and was later incorporated into Codex and related models.

Frontier Model Releases Agent and Tool Ecosystem Fill-in-the-Middle (FIM) OpenAI GPT +1 more
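The fill-in-the-middle objective in the entry above rests on a data transformation: a document is split into prefix, middle, and suffix, and the middle is moved to the end so that an ordinary left-to-right model learns to infill. The sketch below shows a character-level version of that prefix-suffix-middle rearrangement. The sentinel strings, the default `fim_rate`, and the function names are placeholders chosen for illustration, not the tokens or settings of any particular model.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_fim_example(document, rng, fim_rate=0.5):
    """Return the document unchanged, or a prefix-suffix-middle rearrangement.

    `fim_rate` is the probability of applying the transformation; the
    default here is an arbitrary illustrative value.
    """
    if len(document) < 2 or rng.random() >= fim_rate:
        return document  # keep an ordinary left-to-right example
    # Two distinct cut points split the text into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The middle goes last, so predicting it is conditioned on both sides.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def reassemble(example):
    """Invert the rearrangement (assumes the text contains no sentinels)."""
    body = example[len(PRE):]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


doc = "def add(a, b):\n    return a + b\n"
rng = random.Random(0)
fim_example = to_fim_example(doc, rng, fim_rate=1.0)
```

At inference time the same layout is used as a prompt: the prefix and suffix are supplied, the string ends at the middle sentinel, and the model's continuation is the infilled span.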

7 · OpenAI Blog · May 20, 2026

Sora System Card

OpenAI has published the system card for Sora, its video generation model capable of accepting text, image, and video inputs to produce video outputs. The model builds on techniques from DALL-E and GPT and is positioned as a creative storytelling tool. The system card documents safety evaluations, mitigations, and residual risks associated with the model's deployment.

Frontier Model Releases AI Safety Research DALL·E 3 OpenAI Sora +2 more

4 · OpenAI Blog · May 20, 2026

OpenAI Releases GABRIEL: Open-Source Toolkit for AI-Assisted Social Science Research

OpenAI has released GABRIEL, an open-source toolkit that leverages GPT models to convert qualitative text and images into quantitative data for social science research. The tool is designed to help researchers analyze large-scale qualitative datasets that would otherwise be impractical to process manually. It represents an application of frontier LLMs to academic research methodology rather than a new model or capability announcement.

Enterprise Deployment Patterns Agent and Tool Ecosystem GABRIEL OpenAI GPT

7 · OpenAI Blog · May 19, 2026

OpenAI models, Codex, and Managed Agents come to AWS

OpenAI has announced that its GPT models, Codex, and Managed Agents are now available on AWS, allowing enterprise customers to deploy OpenAI capabilities within their existing AWS environments. The partnership extends OpenAI's distribution reach into the major cloud hyperscaler ecosystem. This follows a broader industry pattern of AI labs partnering with cloud providers to reach enterprise customers through familiar procurement and compliance channels.

Inference Economics Enterprise Deployment Patterns OpenAI Managed Agents OpenAI GPT +3 more