Cuts LLM input costs by up to 76% while actually improving accuracy through context compression.
It combines token-level context distillation for fine-grained compression, agentic memory management for autonomous agents, and codebase context compression for developer tools.

LLM Optimization | YC W26

Last Updated: March 19, 2026

Builds an LLM-native context compression API gateway that uses ML-driven relevance filtering and token-level distillation to reduce LLM input costs by up to 76% while improving accuracy.
Ships the Context-Gateway API (latte_v1) with coarse- and fine-grained compression, achieving 10x compression on SEC filings as benchmarked on FinanceBench. Integrations with Claude Code, Cursor, and open-source agent frameworks.
Zero-latency background compaction, pre-computed summaries, and logging/history compaction. Show HN positioning and agentic proxy architecture suggest expansion into persistent memory management for autonomous agents.
<p>Compresses massive SEC filings (230K+ tokens) down to ~10.5K tokens for LLM analysis, improving accuracy while cutting costs 76%.</p>
It's like hiring a brilliant paralegal who reads a 500-page SEC filing and hands the analyst only the 25 pages that actually matter—faster, cheaper, and somehow more accurate.
Compresr's latte_v1 API was benchmarked against FinanceBench, a standardized evaluation suite for financial question-answering over SEC filings. Raw GPT-5.2 ingested an average of ~106K tokens per query at baseline, achieving 72.3% accuracy. With Compresr's two-stage compression pipeline—first selecting the most relevant document chunks via coarse-grained ML filtering, then performing fine-grained token-level distillation—the average context was reduced to ~10.5K tokens (a 10x compression ratio). Remarkably, accuracy increased to 74.5%, demonstrating that removing noisy, irrelevant context actually helps the downstream LLM focus on signal. This translates to a 76% reduction in API costs per query, making large-scale financial document analysis economically viable for fintech startups and enterprise compliance teams alike.
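The two-stage pipeline described above can be sketched in miniature. Everything below (function names, the term-overlap scorer, the stopword list, the token budget) is illustrative and assumed, not Compresr's actual implementation:

```python
# Hypothetical sketch of a two-stage compression pipeline: coarse chunk
# selection by query relevance, then fine-grained token-level pruning.
from collections import Counter

def coarse_select(chunks, query, keep=3):
    """Stage 1: score chunks by query-term overlap and keep the top-k."""
    q_terms = set(query.lower().split())
    def score(chunk):
        terms = Counter(chunk.lower().split())
        return sum(terms[t] for t in q_terms)
    return sorted(chunks, key=score, reverse=True)[:keep]

def fine_distill(chunk, budget=20):
    """Stage 2: drop low-signal tokens (here, just stopwords) up to a budget."""
    stopwords = {"the", "a", "an", "of", "to", "and", "in", "is", "was", "for"}
    kept = [t for t in chunk.split() if t.lower() not in stopwords]
    return " ".join(kept[:budget])

def compress(chunks, query, keep=2, budget=20):
    """Run both stages: select relevant chunks, then distill each one."""
    return [fine_distill(c, budget) for c in coarse_select(chunks, query, keep)]
```

A production system would replace the term-overlap scorer with a learned relevance model and the stopword filter with ML-driven token-importance scoring, but the shape of the pipeline (select, then distill) is the same.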
It's like Marie Kondo organizing your filing cabinet before your accountant arrives—less clutter, better answers, and your accountant charges by the hour.
<p>Provides real-time, zero-latency background compaction of agent conversation histories and tool outputs to keep autonomous agents within context limits during long-running tasks.</p>
It's like giving an AI agent a photographic memory that automatically forgets the boring parts so it can keep working on hard problems without losing its train of thought.
As autonomous AI agents (e.g., coding agents, research agents, customer support bots) execute multi-step workflows, their accumulated context—conversation history, tool call results, intermediate reasoning—quickly exceeds LLM context windows, causing truncation, hallucination, or failure. Compresr's Context-Gateway acts as an agentic proxy that continuously compacts this growing context in the background with zero user-perceived latency. Using pre-computed summaries and ML-driven relevance scoring, the system retains critical decision points and discards redundant or low-signal information. This allows agents to maintain coherent, high-quality reasoning over dramatically longer task chains—critical for production-grade autonomous systems in software engineering, research automation, and enterprise process orchestration. GitHub signals indicate active development of logging and history compaction tracking, enabling developers to audit exactly what was retained and why.
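The compaction loop above can be illustrated with a small sketch. The token heuristic, field names, and retention policy here are assumptions for illustration; a real gateway would run this in the background with learned relevance scoring and pre-computed summaries rather than a placeholder string:

```python
# Illustrative history compaction for an agent proxy: when the running context
# exceeds a token budget, older turns are folded into a summary while flagged
# decision points and the most recent turns are kept verbatim.

def estimate_tokens(text):
    # Rough heuristic: ~1 token per 4 characters.
    return max(1, len(text) // 4)

def compact(history, budget, keep_recent=2):
    """history: list of dicts {"text": str, "decision": bool}."""
    total = sum(estimate_tokens(t["text"]) for t in history)
    if total <= budget:
        return history                      # under budget: nothing to do
    head, tail = history[:-keep_recent], history[-keep_recent:]
    kept = [t for t in head if t["decision"]]   # retain critical decision points
    dropped = len(head) - len(kept)
    summary = {"text": f"[summary of {dropped} low-signal turns]",
               "decision": False}
    return kept + [summary] + tail
```

Logging which turns were folded into the summary (and why) is what makes the retained-context audit trail mentioned above possible.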
It's like a road trip GPS that remembers every important turn you need but forgets all the straight highway stretches—so it never runs out of memory mid-journey.
<p>Compresses large codebases and repository context fed to LLM-powered coding assistants in IDEs, enabling more accurate code generation and review within token budgets.</p>
It's like giving your AI coding assistant the ability to skim an entire codebase and pull up only the exact files and functions it needs to write your feature—instead of trying to read the whole repo at once and getting confused.
Modern LLM-powered coding assistants (Cursor, GitHub Copilot, Claude Code) struggle with large repositories because feeding the full codebase context into the model exceeds token limits, forcing naive truncation that loses critical type definitions, utility functions, and architectural patterns. Compresr's IDE integration plugin applies its two-stage compression pipeline directly to repository context: first, coarse-grained retrieval identifies the most relevant files, classes, and functions for the current coding task; then, fine-grained token distillation strips boilerplate, comments, and redundant code patterns while preserving semantic structure. The result is a highly concentrated context payload that fits within standard token budgets and gives the downstream LLM a clearer picture of the codebase's architecture. GitHub activity on the Context-Gateway repo shows active work on Cursor and Claude Code integrations, suggesting this is a near-term priority. For developer tool companies, this creates a compelling embedded infrastructure play—better completions, lower costs, happier developers.
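A toy version of that repository pipeline might look like the following. The ranking heuristic, function names, and noise filter are all assumed for illustration; the real system would use learned retrieval for stage one and semantic-structure-aware distillation for stage two:

```python
# Hypothetical repository-context compression for a coding assistant:
# rank files by overlap with the task description, then strip comment-only
# and blank lines while keeping the code's semantic skeleton.
import re

def terms(text):
    """Lowercase word terms, splitting identifiers on punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def rank_files(files, task, keep=2):
    """files: dict of name -> source. Keep files sharing most terms with the task."""
    task_terms = terms(task)
    def score(item):
        name, src = item
        return len(task_terms & terms(name + " " + src))
    return dict(sorted(files.items(), key=score, reverse=True)[:keep])

def strip_noise(src):
    """Drop blank and comment-only lines; preserve signatures and bodies."""
    lines = [l for l in src.splitlines()
             if l.strip() and not l.strip().startswith("#")]
    return "\n".join(lines)

def compress_repo(files, task, keep=2):
    """Coarse file selection, then fine-grained per-file distillation."""
    return {name: strip_noise(src)
            for name, src in rank_files(files, task, keep).items()}
```

Note that naively stripping comments can discard useful docstrings; a production distiller would weigh token importance rather than apply a blanket filter.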
It's like a librarian who knows exactly which three books on the shelf answer your question, instead of dumping the entire library on your desk and saying "good luck."
Plug-and-play middleware that compresses context without requiring changes to upstream models or downstream code. Proving that compression improves accuracy (not just reduces cost) flips the quality-vs-cost tradeoff, making adoption a no-brainer.
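The plug-and-play claim reduces to a simple wrapper pattern, sketched here with hypothetical names: the caller and the model are untouched, and only the prompt passes through a compressor on the way in.

```python
# Minimal middleware sketch (names are illustrative): wrap any "call the LLM"
# function so its prompt is compressed first, with no changes to the caller
# or to the downstream model.

def with_compression(llm_call, compressor):
    def wrapped(prompt, **kwargs):
        return llm_call(compressor(prompt), **kwargs)
    return wrapped
```

In practice this sits at the HTTP layer (a gateway the client's base URL points at) rather than in application code, which is what makes adoption a one-line configuration change.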