
Technology | LLM Middleware | YC W26 | Valuation: Undisclosed

Last Updated: March 24, 2026

Builds compression middleware that uses proprietary bear-1 and bear-1.1 ML models to prune and compress tokens sent to LLMs, reducing API costs by up to 66%, cutting latency, and improving output quality for any model provider.
The Token Company has publicly positioned its core product as a model-agnostic, API-first compression layer for LLM prompts and outputs, with demonstrated benchmarks of 66% token reduction alongside a +1.1% accuracy improvement. In a deployment with Pax Historia (60K DAU; the 15th-largest token consumer on OpenRouter at 193B tokens/month), users preferred outputs generated from compressed inputs, and user purchases rose 5%. The system compresses 100K tokens in under 100ms. The vision is to optimize every LLM request in the world at the token level before it reaches a model.
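As a rough sketch of how an API-first, model-agnostic layer like this sits in the request path: the compression endpoint, URL, and payload fields below are hypothetical placeholders for illustration, not the company's documented API; the downstream call uses OpenAI's standard chat-completions client shape.

```python
# Hypothetical request path: compress the prompt, then forward it to any
# provider. Endpoint URL and JSON fields are placeholders, not a real API.
import requests

COMPRESS_URL = "https://api.example-compressor.dev/v1/compress"  # placeholder

def compress_prompt(prompt: str, api_key: str) -> str:
    """Ask the middleware to prune/compress the prompt before the LLM sees it."""
    resp = requests.post(
        COMPRESS_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": prompt},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["compressed_text"]

def ask(llm_client, model: str, prompt: str):
    """Model-agnostic: only the prompt text changes, so any provider works."""
    compact = compress_prompt(prompt, api_key="YOUR_KEY")
    return llm_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": compact}],
    )
```

Because the layer operates purely on text in and text out, swapping the downstream provider requires no change to the compression step.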
Job postings and GitHub signals suggest investment in adaptive compression algorithms, spanning both lossy (semantic pruning, summarization) and lossless (dictionary-based, meta-token) techniques. Community and conference activity hints at upcoming support for multimodal compression (images, structured data), RAG pipeline integration, and agentic workflow optimization. The emphasis on non-generative, ultra-fast ML models (<100ms for 100K tokens) suggests a future enterprise play in real-time, high-throughput LLM cost management. Notably, YC partners reached out to Otso directly rather than waiting for him to apply to the batch.
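For intuition on the lossless side, a toy dictionary-based meta-token scheme might look like the sketch below. The n-gram size, frequency threshold, and section-sign placeholder tokens are arbitrary choices for this illustration and say nothing about how the bear models actually work.

```python
# Toy lossless compression: frequent multi-word phrases are swapped for
# short meta-tokens that can be expanded back exactly.
from collections import Counter

def dictionary_compress(text: str, ngram: int = 4, min_count: int = 3):
    words = text.split()
    phrases = Counter(
        " ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)
    )
    table = {}
    for idx, (phrase, count) in enumerate(phrases.most_common()):
        if count < min_count:
            break
        meta = f"\u00a7{idx}\u00a7"  # a token unlikely to appear in normal prose
        if meta not in text:  # guard against collisions with the source text
            text = text.replace(phrase, meta)
            table[meta] = phrase
    return text, table

def dictionary_decompress(text: str, table: dict) -> str:
    # Exact inverse: every meta-token expands back to its original phrase.
    for meta, phrase in table.items():
        text = text.replace(meta, phrase)
    return text
```

In practice a scheme like this would prepend the phrase table as a legend the model can read, trading a small fixed overhead for large savings on repetitive prompts.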
Intelligent Context Window Optimization: An ML-driven layer selects and compresses the most semantically relevant retrieved documents, maximizing information density within the LLM context window for RAG workflows.
It fits more of the right information into the LLM's reading window, so the model gives smarter answers from your documents.
It's like packing for a trip with a tiny suitcase—instead of leaving your best outfits behind, a genius packer folds everything perfectly so you bring more of what matters.
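A minimal sketch of budget-aware selection for RAG, assuming relevance scores already exist from a retriever. The greedy relevance-per-token heuristic here is a stand-in for the proprietary ML-driven selection described above, not the actual algorithm.

```python
# Greedy packing: favor chunks with the best relevance-per-token until the
# context budget is spent.
def pack_context(chunks: list[tuple[str, float]], budget_tokens: int) -> list[str]:
    """chunks: (text, relevance) pairs; budget_tokens: context window slice."""
    def n_tokens(text: str) -> int:
        return len(text.split())  # crude proxy; a real tokenizer would be used

    ranked = sorted(chunks, key=lambda c: c[1] / max(n_tokens(c[0]), 1), reverse=True)
    picked, used = [], 0
    for text, _score in ranked:
        cost = n_tokens(text)
        if used + cost <= budget_tokens:
            picked.append(text)
            used += cost
    return picked
```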
Semantic Token Pruning: A proprietary, non-generative ML model identifies and removes redundant or low-importance tokens from LLM prompts in real time, preserving semantic meaning while reducing token count by up to 20%.
It's like a spell-checker for wasted words—automatically trimming the fat from every prompt so you pay less and get answers faster.
It's like having a brilliant editor who reads every email you send to your lawyer, crosses out the fluff, and somehow makes your case stronger—all in the time it takes to blink.
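As a toy illustration of non-generative pruning: a cheap heuristic scorer stands in for the proprietary bear models, and the 20% cap mirrors the figure in the description above.

```python
# Score each token's importance, then drop the lowest-scoring ~20% while
# preserving the original order. The stopword/duplicate heuristic is purely
# illustrative; the real scorer is a proprietary ML model.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "is", "are", "very"}

def prune_tokens(prompt: str, max_drop: float = 0.20) -> str:
    tokens = prompt.split()
    seen: set[str] = set()
    scores = []
    for t in tokens:
        low = t.lower()
        score = 0.1 if low in STOPWORDS else (0.5 if low in seen else 1.0)
        seen.add(low)
        scores.append(score)
    budget = int(len(tokens) * max_drop)
    drop = set(sorted(range(len(tokens)), key=lambda i: scores[i])[:budget])
    return " ".join(t for i, t in enumerate(tokens) if i not in drop)
```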
LLM Cost & Performance Analytics: ML-powered analytics layer that tracks token savings, compression ratios, accuracy impact, and latency improvements across all LLM API calls, enabling data-driven optimization of AI spend.
It's a smart dashboard that shows exactly how much money and time you're saving on every AI call, and where you can save more.
It's like getting an itemized electricity bill that not only shows what each appliance costs to run, but also automatically turns off the ones you forgot about.
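A sketch of what a per-call analytics record could track; the field names and savings formula are assumptions for illustration, not the product's schema.

```python
# Per-call metrics in the spirit of the analytics layer described above.
from dataclasses import dataclass

@dataclass
class CallMetrics:
    tokens_in_raw: int    # tokens before compression
    tokens_in_sent: int   # tokens actually sent to the model
    latency_ms: float
    price_per_1k: float   # provider's input-token price, in dollars

    @property
    def compression_ratio(self) -> float:
        return self.tokens_in_sent / self.tokens_in_raw

    @property
    def tokens_saved(self) -> int:
        return self.tokens_in_raw - self.tokens_in_sent

    @property
    def cost_saved(self) -> float:
        return self.tokens_saved / 1000 * self.price_per_1k

m = CallMetrics(tokens_in_raw=100_000, tokens_in_sent=34_000,
                latency_ms=92.0, price_per_1k=0.01)
print(f"ratio={m.compression_ratio:.2f}, saved=${m.cost_saved:.2f}")
# ratio=0.34, saved=$0.66 -> consistent with the ~66% savings headline above
```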
Otso Veisterä is an 18-year-old solo founder, Finnish National Physics Champion, and Z Fellow who combines precocious technical ability with venture-ecosystem insight from an EIR stint at Lifeline Ventures. He is building infrastructure that solves a universal LLM cost/performance pain point with a lightweight, non-generative ML model that is orders of magnitude faster than the LLMs it optimizes. That speed advantage is a rare technical moat in a space where most competitors rely on prompt-engineering heuristics or full LLM-based summarization.