Compression middleware cutting LLM costs by up to 20% while improving accuracy by 2.7%.
Using context window information density optimization, real-time token importance scoring, and compression-aware cost analytics.

LLM Middleware | YC W26

Last Updated: March 20, 2026

Builds compression middleware that uses a proprietary ML model to prune and compress tokens sent to and from LLMs, reducing API costs and latency while improving output quality for any model provider.
The Token Company has publicly positioned its core product as a model-agnostic, API-first compression layer for LLM prompts and outputs, with demonstrated benchmarks of up to 20% token reduction, a +2.7% accuracy improvement on financial QA, and 37% latency gains. The company is actively recruiting ML engineers in San Francisco and participating in Y Combinator's W26 batch, signaling a focus on rapid product iteration and go-to-market with developer-first tooling.
Job postings and GitHub signals suggest investment in adaptive compression algorithms, including both lossy (semantic pruning, summarization) and lossless (dictionary-based, meta-token) techniques. Community and conference activity hints at upcoming support for multimodal compression (images, structured data), RAG pipeline integration, and agentic workflow optimization. The emphasis on non-generative, ultra-fast ML models (<100ms for 100K tokens) suggests a future enterprise play around real-time, high-throughput LLM cost management dashboards and CI/CD-integrated prompt optimization tooling.
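To make the lossy/lossless distinction concrete, here is a minimal sketch of the lossless, dictionary-based "meta-token" idea: repeated multi-word phrases are swapped for short placeholder tokens plus a lookup table, so the original text round-trips exactly. This is a toy illustration, not The Token Company's actual algorithm, and all names here are invented.

```python
# Toy sketch of lossless, dictionary-based prompt compression:
# repeated 3-word phrases become short meta-tokens plus a lookup table,
# so the original text is exactly recoverable. Illustrative only.
from collections import Counter

def compress(text: str, n: int = 3, min_count: int = 2):
    words = text.split()
    # Count every n-word phrase; a real system would use suffix structures.
    phrases = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    table = {}
    for phrase, count in phrases.items():
        if count >= min_count:
            table[phrase] = f"\u00a7{len(table)}\u00a7"  # meta-token like §0§
    for phrase, token in table.items():
        text = text.replace(phrase, token)  # assumes § never appears in input
    return text, table

def decompress(text: str, table: dict[str, str]) -> str:
    # Invert the substitution: meta-tokens are unique, so order is safe.
    for phrase, token in table.items():
        text = text.replace(token, phrase)
    return text

prompt = "the quarterly report shows the quarterly report trends"
small, table = compress(prompt)
# small == "§0§ shows §0§ trends"; decompress(small, table) restores prompt.
```

The lossy techniques (semantic pruning, summarization) trade exact recoverability for higher reduction; the sketch above shows why lossless variants can only exploit literal repetition.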
Intelligent Context Window Optimization: ML-driven compression selects and compresses the most semantically relevant retrieved documents to maximize information density within LLM context windows for RAG workflows.
It fits more of the right information into the LLM's reading window so it gives smarter answers from your documents.
In retrieval-augmented generation (RAG) pipelines, the quality of LLM outputs depends heavily on what fits inside the context window. The Token Company's ML model can be applied after the retrieval step to compress and re-rank retrieved document chunks, removing redundant or low-signal tokens while preserving the most decision-relevant content. This means more documents—or more of each document—can be packed into a single LLM call, improving answer accuracy and reducing hallucination risk. The compression model's speed (sub-100ms for 100K tokens) makes it practical even in real-time RAG applications like customer support bots or financial research assistants. By intelligently compressing retrieved context rather than naively truncating it, the system ensures that critical details (numbers, names, conditions) are preserved while boilerplate and repetition are stripped. This use case is especially valuable for enterprise customers with large, heterogeneous knowledge bases where context window limits are a binding constraint on answer quality.
It's like packing for a trip with a tiny suitcase—instead of leaving your best outfits behind, a genius packer folds everything perfectly so you bring more of what matters.
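A rough sketch of where such a compression step would slot into a RAG pipeline: after retrieval, before the LLM call. The `compress_context` function and its unique-words-per-word density heuristic are stand-ins invented here for illustration; the middleware's real scoring model and API are not shown in this profile.

```python
# Placeholder for a post-retrieval compression step in a RAG pipeline.
# The density heuristic is a toy stand-in for the middleware's ML scoring.

def compress_context(chunks: list[str], budget_tokens: int) -> str:
    def density(chunk: str) -> float:
        # Crude information-density proxy: unique words per word.
        words = chunk.lower().split()
        return len(set(words)) / max(len(words), 1)

    kept, used = [], 0
    # Pack the densest chunks first, within the context budget.
    for chunk in sorted(chunks, key=density, reverse=True):
        n = len(chunk.split())  # naive whitespace "token" count
        if used + n <= budget_tokens:
            kept.append(chunk)
            used += n
    return "\n\n".join(kept)

chunks = [
    "Q3 revenue was $4.2M, up 18% YoY, driven by enterprise contracts.",
    "as noted above, as noted above, as noted above, see above.",
    "Churn fell to 2.1% after the pricing change in July.",
]
context = compress_context(chunks, budget_tokens=25)
# High-signal chunks survive; the repetitive filler chunk is dropped.
```

Unlike naive truncation, a budget-aware selector like this keeps decision-relevant chunks regardless of their position in the retrieval ranking.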
Semantic Token Pruning: A proprietary, non-generative ML model identifies and removes redundant or low-importance tokens from LLM prompts in real time, preserving semantic meaning while reducing token count by up to 20%.
It's like a spell-checker for wasted words—automatically trimming the fat from every prompt so you pay less and get answers faster.
The Token Company's core engineering use case is a lightweight, non-generative ML model that performs real-time semantic token pruning on LLM prompts before they reach the provider API. Unlike full LLM-based summarization or naive truncation, this model scores each token's importance to the overall semantic intent of the prompt and removes those below a learned threshold. The model processes up to 100,000 tokens in under 100 milliseconds, making it viable for production-scale, latency-sensitive applications. Benchmarks show up to 20% token reduction, a +2.7% accuracy improvement on financial QA tasks, and up to 37% faster end-to-end LLM latency. The system is model-agnostic and integrates as a drop-in API gateway layer, requiring minimal code changes for developers. This approach creates a compounding cost advantage: every API call is cheaper, every context window is used more efficiently, and downstream LLM outputs improve because noise is removed before inference.
It's like having a brilliant editor who reads every email you send to your lawyer, crosses out the fluff, and somehow makes your case stronger—all in the time it takes to blink.
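The score-then-threshold mechanism described above can be sketched in a few lines. The real scorer and threshold are learned by a proprietary model; here a hand-written stopword list stands in for the learned importance score, and every name is illustrative.

```python
# Toy stand-in for per-token importance scoring with a fixed threshold.
# A production system would learn both the scorer and the threshold.
FILLER = {"please", "kindly", "note", "that", "the", "for", "is", "a", "an"}

def importance(token: str) -> float:
    # Stand-in scorer: filler scores 0.0, everything else 1.0.
    return 0.0 if token.lower().strip(".,") in FILLER else 1.0

def prune(prompt: str, threshold: float = 0.5) -> str:
    # Keep only tokens whose importance clears the threshold.
    return " ".join(t for t in prompt.split() if importance(t) >= threshold)

compact = prune("Please kindly note that the total for the quarter is $4.2M")
# compact == "total quarter $4.2M"  (11 tokens -> 3)
```

Note how the numbers ("$4.2M") survive while politeness filler is dropped: that asymmetry, done with a learned model rather than a word list, is the claimed mechanism behind the accuracy gains.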
LLM Cost & Performance Analytics: ML-powered analytics layer that tracks token savings, compression ratios, accuracy impact, and latency improvements across all LLM API calls, enabling data-driven optimization of AI spend.
It's a smart dashboard that shows exactly how much money and time you're saving on every AI call, and where you can save more.
Beyond compression itself, The Token Company's middleware generates a rich stream of operational data: per-call token counts (before and after compression), compression ratios, downstream accuracy metrics, latency deltas, and cost savings. An ML-powered analytics layer aggregates this data to surface actionable insights—identifying which prompts benefit most from compression, flagging anomalies (e.g., prompts where compression degrades output quality), and recommending optimal compression settings per use case or model. For operations and finance teams, this translates into clear ROI dashboards, budget forecasting for AI spend, and automated alerts when usage patterns shift. For engineering teams, it provides feedback loops to improve prompt design and compression tuning. This analytics capability is a natural extension of the middleware and creates sticky enterprise value: once a company sees granular, real-time visibility into their LLM costs and performance, it becomes a critical operational tool rather than a one-time optimization.
It's like getting an itemized electricity bill that not only shows what each appliance costs to run, but also automatically turns off the ones you forgot about.
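The per-call metrics listed above (before/after token counts, compression ratios, latency deltas, cost savings) reduce to simple arithmetic once logged. A sketch of what such a record and roll-up could look like; the field names, price, and schema are assumptions for illustration, not The Token Company's actual data model.

```python
# Illustrative per-call record and roll-up for compression analytics.
# Field names and the $/1K-token price are assumptions.
from dataclasses import dataclass

@dataclass
class CallRecord:
    tokens_before: int
    tokens_after: int
    latency_before_ms: float
    latency_after_ms: float

    def cost_saved(self, usd_per_1k_tokens: float) -> float:
        # Dollars saved on this call at a flat per-token price.
        return (self.tokens_before - self.tokens_after) / 1000 * usd_per_1k_tokens

def summarize(records: list[CallRecord], usd_per_1k_tokens: float = 0.01) -> dict:
    total_before = sum(r.tokens_before for r in records)
    total_after = sum(r.tokens_after for r in records)
    return {
        "token_reduction_pct": 100 * (1 - total_after / total_before),
        "usd_saved": sum(r.cost_saved(usd_per_1k_tokens) for r in records),
        "avg_latency_gain_ms": sum(
            r.latency_before_ms - r.latency_after_ms for r in records
        ) / len(records),
    }

demo = summarize([
    CallRecord(1000, 800, 900.0, 600.0),
    CallRecord(2000, 1600, 1200.0, 800.0),
])
# demo shows roughly 20% token reduction across both calls.
```

Per-prompt variants of the same roll-up are what would power the anomaly flags and per-use-case tuning recommendations described above.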
Otso Veisterä combines deep AI product experience with venture ecosystem insight (EIR at Lifeline Ventures), enabling him to build infrastructure that solves a universal LLM cost/performance pain point. His lightweight, non-generative ML model is orders of magnitude faster than the LLMs it optimizes: a rare technical moat in a space where most competitors rely on prompt-engineering heuristics or full LLM-based summarization.