iOSWorld

benchmark · active · provisional · iosworld-f27d8cec · 1 event · first seen 8d ago

Aliases: iOSWorld

Recent events (1)

6 · arXiv · cs.CL · 8d ago

iOSWorld: Benchmark for Personalized iOS Phone Agents with Persistent User Identity

Researchers introduce iOSWorld, the first interactive native iOS simulator benchmark designed to evaluate phone agents on personalized, identity-aware tasks across 26 custom-built iOS apps. The benchmark includes 133 tasks spanning single-app, multi-app, and memory/personalization categories, with connected personal data such as transactions, messages, and social relationships. Frontier models reach only 52% overall and 37% on multi-app tasks; privileged vision+XML access improves frontier models by up to 26 percentage points but does not help smaller models. The benchmark is released open-source with all apps, data, tasks, and evaluation code.