WWDC 24: Running Mistral 7B with Core ML
This Hugging Face blog post covers running Mistral 7B on Apple devices using Core ML, tied to WWDC 2024. It addresses on-device inference of a 7B-parameter open-weights model using Apple's ML framework, a practical deployment pattern for running capable open-weights LLMs locally on Apple Silicon hardware.
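To give a rough sense of why a 7B-parameter model is a meaningful target for on-device inference, the sketch below estimates the size of the weights alone at a few precisions. It is back-of-the-envelope arithmetic, not a description of the post's method: the round 7,000,000,000 parameter count and the 16/8/4-bit widths are illustrative assumptions, and the estimate ignores activations, the KV cache, and any per-block quantization metadata.

```python
def weight_footprint_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: a round 7 billion parameters; the real count differs slightly.
PARAMS_7B = 7_000_000_000

if __name__ == "__main__":
    # Illustrative bit widths, not taken from the post.
    for bits in (16, 8, 4):
        size = weight_footprint_gib(PARAMS_7B, bits)
        print(f"{bits:>2}-bit weights: about {size:.1f} GiB")
```

Halving the bits per weight halves the footprint, which is why lower-precision weights matter on devices where the model shares memory with everything else.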
Related events (8)
Using Stable Diffusion with Core ML on Apple Silicon
Hugging Face published a guide on running Stable Diffusion models via Apple's Core ML framework on Apple Silicon hardware. The post covers converting diffusion model weights to Core ML format and integrating them into the Diffusers library for on-device inference. This represents an early effort to enable efficient local image generation on consumer Apple hardware without requiring cloud GPU resources.
Stable Diffusion XL on Mac with Advanced Core ML Quantization
Hugging Face details the process of running Stable Diffusion XL (SDXL) on Apple Silicon Macs using Core ML with advanced quantization techniques. The post covers how quantization reduces model size and memory requirements to make SDXL feasible on consumer Mac hardware. This represents a practical deployment advance for running large diffusion models at the edge on Apple devices.
Faster Stable Diffusion with Core ML on iPhone, iPad, and Mac
Hugging Face published a blog post detailing optimizations for running Stable Diffusion models via Core ML on Apple devices, including iPhone, iPad, and Mac. The post covers techniques to accelerate on-device inference using the Apple Neural Engine and the Core ML framework. This represents progress in deploying capable diffusion models at the edge without cloud dependency.
Releasing Swift Transformers: Run On-Device LLMs in Apple Devices
Hugging Face released Swift Transformers, a Swift library enabling on-device LLM inference on Apple hardware (iOS, macOS) via Core ML. The library provides a pipeline abstraction for text generation and supports models converted to Core ML format. This extends the Hugging Face ecosystem to Apple's native development environment, lowering the barrier for deploying LLMs on Apple Silicon devices.
Mistral Small 3: 24B Latency-Optimized Open-Weight Model Released Under Apache 2.0
Mistral AI has released Mistral Small 3, a 24B-parameter instruction-tuned model optimized for low latency, achieving over 81% on MMLU at 150 tokens/s on a single GPU. The model is competitive with Llama 3.3 70B and Qwen 32B while being more than 3x faster on equivalent hardware, and is released under Apache 2.0 for both pretrained and instruction-tuned checkpoints. It is explicitly not trained with RL or synthetic data, positioning it as a base model for community fine-tuning and reasoning capability development. Deployment targets include local inference on consumer hardware (RTX 4090, MacBook 32GB RAM), agentic function calling, and domain-specific fine-tuning.
Swift Diffusers: Fast Stable Diffusion for Mac
Hugging Face published a blog post introducing Swift Diffusers, a native macOS/iOS application for running Stable Diffusion models locally on Apple Silicon hardware. The post covers optimizations leveraging Apple's Core ML framework to accelerate inference on Mac. This represents an effort to bring on-device diffusion model inference to consumer Apple hardware without cloud dependency.
omlx: LLM inference server with continuous batching and SSD caching for Apple Silicon
omlx is an open-source Python project providing an LLM inference server optimized for Apple Silicon, featuring continuous batching and SSD caching managed via a macOS menu bar interface. The project has accumulated nearly 16,000 GitHub stars with strong daily momentum. It targets local inference on Apple hardware, a growing niche as consumer-grade silicon becomes increasingly capable for running open-weights models.
Mistral Small 3.1: Multimodal, 128k Context, Apache 2.0 Open-Weight Model
Mistral AI releases Mistral Small 3.1, a ~24B-parameter model with multimodal understanding, a 128k-token context window, and claimed best-in-class performance among small models, outperforming Gemma 3 and GPT-4o Mini on text, multimodal, and multilingual benchmarks. The model runs on a single RTX 4090 or a Mac with 32GB of RAM at 150 tokens/second and is released under the Apache 2.0 license with both base and instruct checkpoints. It is available on Hugging Face, Mistral's La Plateforme API, and Google Cloud Vertex AI, with NVIDIA NIM and Azure AI Foundry support coming soon. The release targets enterprise and on-device use cases including document verification, agentic workflows, and domain fine-tuning.



