Unified SDK for deploying AI models on mobile, web, and edge with local-first privacy.
Features include on-device inference at up to 550 tokens/sec, local RAG for offline knowledge retrieval, and adaptive inference routing between device and cloud.

On-Device AI | YC W26

Last Updated: March 19, 2026

Builds a unified, open-source SDK and control plane for deploying and managing AI models (LLMs, speech, vision) directly on mobile, web, and edge devices, enabling private, low-latency, offline-capable AI applications.
RunAnywhere has publicly announced cross-platform SDK support (Swift, Kotlin, React Native, Flutter, WebAssembly), a proprietary MetalRT inference engine for Apple Silicon, OTA model delivery and versioning, hybrid local/cloud policy-based routing, and a browser/WebGPU SDK in beta. Its open-source GitHub repos show active development on RAG pipelines, streaming STT/TTS, and voice agent tooling, all aimed at making on-device AI the default for production apps.
GitHub commit activity shows rapid iteration on WebGPU browser inference and new model format support (vision-language models, tool-calling agents). Hackathon wins and community demos signal experimentation with home automation and IoT integrations. Job-related signals suggest hiring for ML inference optimization and mobile platform engineering. Conference and community appearances hint at enterprise-grade fleet management features and compliance tooling for regulated industries. There are strong indicators of a hybrid on-device/cloud orchestration layer designed to upsell enterprise customers on analytics and control plane SaaS.
Privacy-first on-device voice agents with real-time STT/TTS for mobile apps
Your phone's voice assistant works instantly and privately because the AI brain lives on your device, not in a data center.
RunAnywhere's SDK enables developers to build fully on-device voice agents by bundling streaming speech-to-text (Whisper, Zipformer, Parakeet), text-to-speech (Piper, Kokoro, Matcha), and voice activity detection (Silero) models directly into mobile applications. The entire inference pipeline—from microphone input to natural language response—runs locally on the device's GPU or neural engine, achieving sub-200ms latency without any network dependency. This architecture eliminates cloud API costs, removes privacy risks associated with transmitting audio data, and enables voice-powered experiences in offline or low-connectivity environments such as healthcare, field operations, and regulated industries. The control plane allows OTA model updates so voice models can be improved without app store resubmissions.
It's like having a personal translator living in your pocket who never gossips about your conversations to anyone.
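The voice-agent loop described above (microphone frames gated by VAD, speech transcribed by STT, the transcript answered by a local LLM, the reply synthesized by TTS) can be sketched in miniature. Every component below is a stand-in stub: the energy-based VAD, the `stt`/`llm`/`tts` functions, and the `Frame` type are illustrative assumptions, not RunAnywhere's actual SDK API.

```python
# Minimal sketch of an on-device voice-agent loop. All models are stubs;
# in the real pipeline these would be Silero (VAD), Whisper/Zipformer
# (STT), a local LLM, and Piper/Kokoro (TTS), all running on-device.
from dataclasses import dataclass

@dataclass
class Frame:
    samples: list  # PCM samples for one audio frame

def vad(frame, threshold=0.01):
    """Energy-based stand-in for a VAD model: True if the frame holds speech."""
    energy = sum(s * s for s in frame.samples) / len(frame.samples)
    return energy > threshold

def stt(frames):
    """Stand-in for a streaming speech-to-text model."""
    return f"<transcript of {len(frames)} speech frames>"

def llm(prompt):
    """Stand-in for an on-device LLM answering the transcript."""
    return f"response to: {prompt}"

def tts(text):
    """Stand-in for text-to-speech; returns synthesized audio samples."""
    return [0.0] * len(text)

def voice_agent(frames):
    """Run the full mic -> VAD -> STT -> LLM -> TTS loop locally."""
    speech = [f for f in frames if vad(f)]
    if not speech:
        return None  # pure silence: nothing to transcribe
    reply = llm(stt(speech))
    return tts(reply)

# Example: two speech frames and one silent frame.
audio = [Frame([0.5, -0.4]), Frame([0.0, 0.0]), Frame([0.3, 0.2])]
out = voice_agent(audio)
```

Because every stage is a local function call rather than a network round trip, latency is bounded by on-device compute, which is what makes the sub-200ms target achievable without connectivity.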
On-device RAG-powered document Q&A for offline enterprise copilots
Employees can ask questions about sensitive company documents on their phone and get instant answers without any data ever leaving the device.
RunAnywhere provides a fully on-device Retrieval-Augmented Generation (RAG) pipeline that combines hybrid vector embeddings (Snowflake) with BM25 keyword retrieval to index and query local documents directly on mobile or edge devices. Developers can build enterprise copilot applications where users load PDFs, manuals, or compliance documents onto their device, and the local LLM (Qwen3, Llama 3.2, SmolLM2) answers questions grounded in those documents with cited passages. Because the entire pipeline—embedding generation, index construction, retrieval, and LLM inference—runs on-device via MetalRT or llama.cpp, no proprietary or regulated data is transmitted externally. This is critical for industries like healthcare, legal, finance, and defense where data residency and compliance requirements prohibit cloud processing. The control plane enables fleet-wide model and index updates via OTA delivery.
It's like giving every employee a photographic-memory research assistant who's sworn to secrecy and works without Wi-Fi.
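The hybrid retrieval step described above, fusing dense embedding similarity with BM25-style keyword scores before passing top passages to the local LLM, can be sketched as follows. The toy character-count embedding and the simplified BM25 (term-frequency saturation only, no IDF or length normalization) are illustrative assumptions, not RunAnywhere's actual retrieval code.

```python
# Sketch of hybrid retrieval: dense cosine similarity and a BM25-style
# keyword score are fused per document, and documents are ranked by the
# combined score. In a real pipeline, embed() would be a model such as
# a Snowflake embedding model running on-device.
import math
from collections import Counter

DOCS = [
    "battery replacement procedure for field devices",
    "data residency and compliance policy for patient records",
    "quick start guide for the mobile copilot app",
]

def embed(text):
    """Toy bag-of-characters embedding standing in for a real model."""
    counts = Counter(text)
    return [counts.get(c, 0) for c in "abcdefghijklmnopqrstuvwxyz "]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def bm25_lite(query, doc, k1=1.5):
    """Simplified BM25: term-frequency saturation, no IDF or length norm."""
    tf = Counter(doc.split())
    return sum(
        (tf[t] * (k1 + 1)) / (tf[t] + k1) for t in query.split() if t in tf
    )

def hybrid_search(query, docs, alpha=0.5):
    """Fuse dense and keyword scores; return docs ranked best-first."""
    qv = embed(query)
    scored = []
    for d in docs:
        dense = cosine(qv, embed(d))
        sparse = bm25_lite(query, d)
        scored.append((alpha * dense + (1 - alpha) * sparse, d))
    return [d for _, d in sorted(scored, reverse=True)]

ranked = hybrid_search("compliance policy", DOCS)
```

Keyword scoring catches exact terms (part numbers, policy names) that dense embeddings can miss, while embeddings catch paraphrases the keywords miss, which is the usual rationale for fusing the two.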
Hybrid local/cloud inference routing with policy engine for cost and latency optimization
The system automatically decides whether to run AI on your phone or in the cloud based on rules you set, saving money and keeping things fast.
RunAnywhere's policy-based routing engine allows developers and platform teams to define custom rules that dynamically determine whether each AI inference request is handled on-device or routed to a cloud endpoint. Policies can be configured based on device capability (GPU, RAM, battery), model complexity, network availability, latency requirements, data sensitivity, and cost thresholds. For example, a healthcare app might enforce that all patient-related queries run locally for HIPAA compliance, while general knowledge questions fall back to a cloud LLM for higher accuracy. The control plane provides real-time analytics on inference distribution, device performance, model accuracy, and cost attribution across the entire fleet. This hybrid architecture lets companies start with cloud AI and progressively shift workloads on-device as models shrink and hardware improves, creating a smooth migration path that avoids vendor lock-in to any single cloud AI provider. Fleet-wide policy updates are pushed via OTA without requiring app updates.
It's like a smart thermostat for your AI bills—it automatically runs the cheap, local option when it can and only fires up the expensive cloud furnace when it really needs to.
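A policy engine like the one described above can be sketched as an ordered list of rules, where the first rule that matches a request decides whether it runs on-device or in the cloud. The rule set, the `Request` fields, and the first-match-wins semantics are illustrative assumptions, not RunAnywhere's actual configuration format.

```python
# Sketch of policy-based routing: each rule inspects the request context
# (data sensitivity, connectivity, device capability) and returns a
# target, or None to defer to the next rule.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Request:
    sensitive: bool        # e.g. patient data that must stay local
    online: bool           # is the network currently available?
    device_ram_gb: float   # capability of this handset
    model_params_b: float  # model size the request needs, in billions

Rule = Callable[[Request], Optional[str]]

def build_policy() -> list:
    return [
        # Compliance: sensitive data never leaves the device.
        lambda r: "device" if r.sensitive else None,
        # Offline: no network means local is the only option.
        lambda r: "device" if not r.online else None,
        # Capability: models too big for the device go to the cloud.
        lambda r: "cloud" if r.model_params_b > r.device_ram_gb else None,
    ]

def route(req: Request, policy: list, default: str = "device") -> str:
    """First matching rule wins; otherwise fall back to the default."""
    for rule in policy:
        target = rule(req)
        if target is not None:
            return target
    return default

policy = build_policy()
# A patient-data query stays local even though the device is online.
a = route(Request(sensitive=True, online=True, device_ram_gb=6, model_params_b=70), policy)
# A non-sensitive query needing a 70B-parameter model goes to the cloud.
b = route(Request(sensitive=False, online=True, device_ram_gb=6, model_params_b=70), policy)
```

Ordering the rules encodes priority: compliance constraints outrank capability and cost, which matches the healthcare example where patient queries must run locally regardless of model accuracy trade-offs.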
RunAnywhere combines open-source developer trust with a proprietary inference engine (MetalRT) that achieves up to 550 tokens/sec on Apple Silicon, giving them a performance moat on the fastest-growing consumer hardware while locking in developers through a unified cross-platform SDK that no competitor currently matches in breadth.