Delivers serverless GPU cloud with sub-second model swaps and scale-to-zero pricing for ML teams.
Uses predictive resource scheduling for GPU allocation, multi-model inference optimization via IonRouter, and adaptive memory management with live workload migration.

Cloud Infrastructure | YC W26

Last Updated: March 19, 2026

Builds a serverless, globally aggregated GPU cloud with predictive scheduling, live workload migration, and proprietary inference engines for ultra-fast, cost-efficient AI model hosting, training, and inference.
Serverless GPU cloud with scale-to-zero and pay-per-second billing. Fractional GPU sharing via GPU Credits. IonRouter multi-model inference with IonAttention Engine. NVIDIA GH200 support. NVIDIA Inception member.
CRIU-based GPU workload migration hints at hybrid on-prem + cloud. Custom kernel-level CUDA optimizations. Possible AMD MI300X support. Enterprise tier with SLA guarantees likely.
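The scale-to-zero, pay-per-second model with fractional GPU Credits implies billing arithmetic along these lines. This is a minimal sketch; the rates, the `billing_cost` function, and the credit granularity are illustrative assumptions, not Cumulus Labs' actual pricing.

```python
def billing_cost(gpu_fraction: float, seconds_used: float,
                 hourly_rate: float) -> float:
    """Pay-per-second cost of a fractional GPU slice.

    gpu_fraction: share of one physical GPU (e.g. 0.25)
    seconds_used: billed runtime; scale-to-zero means idle time bills 0
    hourly_rate:  full-GPU price per hour (illustrative number)
    """
    per_second = hourly_rate / 3600.0
    return gpu_fraction * seconds_used * per_second

# Example: a quarter-GPU slice at an assumed $2.00/hr full-GPU rate,
# active for 90 seconds, then scaled to zero.
cost = billing_cost(0.25, 90.0, 2.00)
```

The point of the sketch is that idle time simply never enters the formula: with scale-to-zero, `seconds_used` only accumulates while the slice is actually serving work.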
<p>Predictive GPU scheduling and autonomous workload placement that uses ML agents to forecast resource demand, pre-allocate fractional GPU capacity, and auto-recover failed jobs across a globally distributed compute pool.</p>
An AI dispatcher watches every GPU in the fleet and figures out where to send each job before you even click run, so nothing sits idle and nothing crashes without a backup plan.
Cumulus Labs deploys ML-driven scheduling agents that continuously monitor GPU utilization, memory pressure, thermal state, and job queue depth across their globally aggregated compute pool. These agents use time-series forecasting models trained on historical workload patterns to predict demand spikes and pre-warm fractional GPU slices before requests arrive, eliminating cold-start penalties. When a job is submitted, the scheduler evaluates hundreds of placement candidates across regions and hardware types, optimizing for latency, cost, and fault tolerance simultaneously. If a node fails mid-job, the system leverages CRIU-based checkpointing to capture GPU state and live-migrate the workload to a healthy node with sub-second interruption. The agents also perform automated root-cause diagnosis of failures, feeding anomaly signals back into the scheduling model to avoid problematic nodes in future placements. This closed-loop system transforms what is traditionally a manual, reactive DevOps burden into a fully autonomous, self-healing infrastructure layer.
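The multi-objective placement step described above can be sketched as a weighted scoring pass over candidate nodes. The `Candidate` fields, the normalization scheme, and the weights are illustrative assumptions; the actual scheduler's objective function is not public.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    node_id: str
    est_latency_ms: float   # predicted queueing + network latency
    cost_per_hour: float    # price of the GPU fraction on this node
    failure_rate: float     # anomaly rate fed back by the scheduling agents

def best_placement(candidates, w_latency=0.5, w_cost=0.3, w_fault=0.2):
    """Pick the node minimizing a weighted blend of latency, cost,
    and fault risk -- the scheduler's three stated objectives."""
    def norm(vals):
        lo, hi = min(vals), max(vals)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in vals]

    lat = norm([c.est_latency_ms for c in candidates])
    cost = norm([c.cost_per_hour for c in candidates])
    fault = norm([c.failure_rate for c in candidates])
    scores = [w_latency * l + w_cost * c + w_fault * f
              for l, c, f in zip(lat, cost, fault)]
    return candidates[scores.index(min(scores))]
```

In practice a candidate set would number in the hundreds across regions and hardware types, but the shape of the decision is the same: normalize each objective, blend, take the minimum.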
It's like having an air traffic controller who not only knows where every plane is, but also predicts turbulence before it happens and reroutes flights while passengers are still sipping their coffee.
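One simple way to realize the pre-warming idea is an exponentially smoothed forecast of recent request rates that drives how many fractional slices to keep warm. The smoothing factor, per-slice capacity, and headroom multiplier are illustrative; the source describes the real models only as time-series forecasters trained on historical workload patterns.

```python
import math

def forecast_demand(history, alpha=0.3):
    """Exponentially smoothed request-rate forecast (requests/sec)."""
    level = history[0]
    for rate in history[1:]:
        level = alpha * rate + (1 - alpha) * level
    return level

def slices_to_prewarm(history, capacity_per_slice=5.0, headroom=1.2):
    """Pre-allocate enough fractional GPU slices to absorb the
    forecast demand plus headroom, so requests never hit a cold start."""
    return math.ceil(forecast_demand(history) * headroom / capacity_per_slice)
```

The feedback loop mentioned above would periodically retrain or re-tune this forecaster against observed demand rather than using a fixed `alpha`.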
<p>IonRouter multi-model inference routing with the proprietary IonAttention Engine, enabling concurrent serving of multiple LLMs with shared KV-cache memory pools and sub-750ms model swaps on a single GPU.</p>
Instead of renting a whole GPU for each AI model, IonRouter lets multiple models share one GPU's memory like roommates splitting rent, swapping in and out so fast users never notice.
IonRouter is Cumulus Labs' flagship inference product, built on their proprietary IonAttention Engine. It addresses a critical inefficiency in production AI deployments: most organizations run multiple models (e.g., a coding assistant, a summarizer, a chatbot) but dedicate separate GPU instances to each, resulting in massive underutilization. IonRouter solves this by hosting multiple models concurrently on shared GPU memory, exploiting the mathematical property that transformer KV-cache blocks are immutable once computed. This allows the engine to perform background memory migration and compaction without interrupting active inference streams. When a request arrives for a model not currently resident in GPU HBM, IonRouter uses CUDA graph-based model swapping to load the model in under 750ms for 7B-parameter architectures, leveraging pre-compiled execution graphs to eliminate JIT compilation overhead. The shared KV memory pool means that common prefix tokens across models (e.g., system prompts) can be deduplicated, further reducing memory footprint. A lightweight routing layer uses request metadata and real-time latency signals to decide which model to serve on which GPU fraction, enabling intelligent load balancing across heterogeneous hardware. The result is dramatically higher GPU utilization, lower per-query cost, and a seamless developer experience where deploying a new model is as simple as registering an endpoint.
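The prefix deduplication enabled by immutable KV-cache blocks can be sketched with content-addressed block IDs: identical token prefixes (e.g. a shared system prompt) hash to the same block and are stored once. The block size, hashing scheme, and `SharedKVPool` class are assumptions for illustration; IonRouter's internals are proprietary.

```python
import hashlib

BLOCK = 16  # tokens per KV-cache block (illustrative size)

class SharedKVPool:
    """Content-addressed KV block pool: identical full blocks of
    prefix tokens across models map to the same stored block."""
    def __init__(self):
        self.blocks = {}  # block_id -> token tuple (stands in for KV tensors)

    def cache_prefix(self, tokens):
        """Cache complete blocks of the prefix; return their block IDs."""
        ids = []
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            chunk = tuple(tokens[i:i + BLOCK])
            bid = hashlib.sha256(repr(chunk).encode()).hexdigest()[:16]
            self.blocks.setdefault(bid, chunk)  # dedup: reuse if present
            ids.append(bid)
        return ids
```

Because blocks are immutable once written, two models sharing a system prompt can safely reference the same physical blocks, which is what makes the deduplication (and the background compaction described above) possible without locking active inference streams.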
It's like a restaurant kitchen where one chef can flawlessly switch between Italian, Japanese, and Mexican dishes mid-service because all the shared ingredients are already prepped and within arm's reach.
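A minimal sketch of the routing layer's residency decision, assuming a simple LRU eviction policy (an assumption; the real engine routes on request metadata and live latency signals, and `route` here merely simulates the sub-750ms CUDA-graph swap rather than performing it):

```python
from collections import OrderedDict

class ModelResidency:
    """Track which models are resident in GPU HBM; on a miss,
    evict the least-recently-used model and swap the new one in."""
    def __init__(self, hbm_slots: int):
        self.slots = hbm_slots
        self.resident = OrderedDict()  # model_id -> True, in LRU order
        self.swaps = 0

    def route(self, model_id: str) -> str:
        if model_id in self.resident:
            self.resident.move_to_end(model_id)
            return "hit"                          # serve immediately
        if len(self.resident) >= self.slots:
            self.resident.popitem(last=False)     # evict LRU model
        self.resident[model_id] = True            # stand-in for graph swap
        self.swaps += 1
        return "swap"
```

The economics follow directly: every "hit" is a request served without dedicating a whole GPU to that model, and every "swap" costs under a second rather than the minutes a cold instance spin-up would.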
<p>Lifecycle-driven GPU memory management system that uses ML-informed heuristics to dynamically allocate, migrate, and reclaim GPU HBM across concurrent workloads, enabling fractional GPU sharing with near-zero performance degradation.</p>
The system watches how each AI job uses GPU memory over its lifetime and automatically shuffles, compresses, and reclaims unused space so multiple teams can share expensive hardware without stepping on each other's toes.
Traditional GPU cloud platforms allocate memory statically: a user requests a GPU (or fraction), and that memory is reserved whether it's actively used or not. Cumulus Labs' lifecycle-driven memory management system replaces this with a dynamic, ML-informed approach. The system profiles each workload's memory access patterns across its lifecycle phases—initialization, warm-up, steady-state inference, and teardown—and builds lightweight predictive models of future memory demand per phase. During steady-state inference, for example, transformer models exhibit highly predictable memory patterns because KV-cache blocks are append-only and immutable once written. The system exploits this by identifying cold memory regions (blocks unlikely to be accessed in the near future) and migrating them to host RAM or NVMe storage via asynchronous DMA transfers, freeing HBM for other co-located workloads. When a workload transitions phases (e.g., from idle to active inference), the system pre-fetches its memory blocks back to HBM based on predicted access patterns, hiding migration latency behind compute operations. A feedback loop continuously refines the predictive models using actual access telemetry, improving placement accuracy over time. This enables Cumulus Labs' fractional GPU credit system: users pay only for the memory and compute they actually consume, while the platform maximizes physical GPU utilization across all tenants. The result is a multi-tenant GPU environment that approaches the performance of dedicated instances at a fraction of the cost.
It's like a hyper-efficient hotel concierge who moves your luggage to a nearby storage room the moment you leave for dinner, then has it back in your suite before you return—so the hotel can rent your closet space to someone else in the meantime.
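The cold-block identification step can be sketched as a recency split: blocks untouched past a threshold become candidates for asynchronous migration out of HBM. The threshold and the hot/cold partition are illustrative stand-ins for the ML-informed predictive models described above.

```python
def partition_blocks(blocks, now, cold_after_s=30.0):
    """Split memory blocks into hot (keep in HBM) and cold
    (candidates for async DMA migration to host RAM or NVMe).

    blocks: list of (block_id, last_access_timestamp) tuples
    now:    current timestamp in the same units
    """
    hot, cold = [], []
    for block_id, last_access in blocks:
        (cold if now - last_access > cold_after_s else hot).append(block_id)
    return hot, cold
```

A production version would replace the fixed threshold with the per-phase demand predictions the section describes, so that a block about to be needed by a phase transition is pre-fetched back rather than evicted.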
Rare hands-on experience building GPU compute platforms (TensorDock) and mission-critical ML for government (Palantir, Space Force, NASA), combined with proprietary CRIU-based live migration and a custom inference engine.