Jiasen Lu
jiasen-lu-74869b5e · 1 event · first seen 14d ago
Aliases: Jiasen Lu
Recent events (1)
Apple's AToken: A Unified Multimodal Tokenizer and Encoder for Images, Videos, and 3D Objects
Apple researchers introduced AToken, a transformer model with a single 4D tokenizer and encoder-decoder architecture that handles images, videos, and 3D objects in a shared token space. The model is trained to both reconstruct and classify all three media types, and it builds on a pretrained SigLIP2 vision encoder extended to four dimensions with 4D Rotary Position Embedding. AToken approaches or matches specialized models on image classification (82.2% ImageNet accuracy), image reconstruction (0.21 rFID), and 3D reconstruction (28.28 PSNR), while remaining competitive on video tasks. The work addresses a longstanding tension between generation-focused and classification-focused encoders by forcing embeddings to retain both fine visual detail and semantic content.