Builds simulated worlds for training AI agents on complex digital workflows.
Uses procedural environment generation for enterprise toolchains, agent benchmark verification, and synthetic RL training data for long-horizon tasks.

AI Simulation & Agent Training | YC W26

Last Updated: March 19, 2026

Builds automatically generated simulated worlds for training and evaluating AI agents on complex, long-horizon, multi-tool digital workflows such as software engineering, enterprise operations, and knowledge work.
Polymath has publicly released Horizon-SWE, a benchmark simulating an end-to-end software company where agents must plan, code, test, deploy, and monitor software. They've announced automated, human-in-the-loop environment generation and integration with real-world digital tools (Slack, GitHub, Linear, email, browsers). Their public roadmap centers on expanding benchmark coverage across enterprise digital workflows and scaling cloud-native simulation infrastructure.
Job postings for Build & DevOps Engineers, Networking Engineers, Data Pipeline Engineers, and Mechanical Designers suggest investment in robotics-adjacent simulation and distributed cloud infrastructure beyond their current digital-workflow focus. GitHub activity and hiring for Founding Research Engineers point toward differentiable simulation and sim-to-real transfer research. Sales hiring (Account Executives) signals an imminent enterprise go-to-market motion. Conference participation and open-source engagement hint at a platform play: enabling third-party developers to build, share, and evaluate agents within Polymath environments. The mechanical designer role is a strong signal they may be expanding into physical/robotic agent simulation alongside digital workflows.
Automated procedural world generation that creates diverse, realistic digital environments for training AI agents without manual scenario authoring.
Instead of humans hand-building every test scenario for AI agents, Polymath's system automatically generates thousands of realistic digital workplaces—complete with tools, data, and tasks—so agents can learn faster and cheaper.
Polymath employs ML-driven procedural generation to automatically create rich, varied digital environments that simulate real-world enterprise toolchains. The system uses learned priors about tool relationships, task dependencies, and realistic data distributions to compose environments featuring interconnected tools like GitHub repositories, Slack channels, Linear boards, email inboxes, and web browsers. A human-in-the-loop feedback mechanism allows researchers to steer generation toward specific complexity profiles or domain requirements. This dramatically reduces the engineering bottleneck of manual environment authoring—traditionally the most time-consuming part of RL research—while ensuring statistical diversity that prevents agent overfitting. The generator can produce thousands of unique world configurations, each with internally consistent state, realistic data artifacts, and verifiable task objectives, enabling scalable agent training pipelines that would be impossible to build by hand.
It's like having an AI dungeon master who instantly conjures up thousands of unique office buildings—each with different teams, tools, and problems—so your AI intern can practice working at all of them before its first real day on the job.
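To make the generation idea concrete, the sketch below shows one way seed-driven environment composition could work: sampling a tool graph from co-occurrence priors and composing cross-tool tasks, with a complexity knob standing in for the human-in-the-loop steering step. Polymath has not published its generator, so every class, field, and probability here is a hypothetical illustration rather than their actual API.

```python
# Illustrative sketch only: names, priors, and parameters are hypothetical.
import random
from dataclasses import dataclass, field


@dataclass
class EnvironmentConfig:
    """One procedurally generated digital-workplace configuration."""
    tools: list[str]                                  # e.g. ["github", "slack", "linear"]
    tasks: list[dict] = field(default_factory=list)
    seed: int = 0


# Hypothetical learned prior: how likely each tool is to co-occur with a code
# host in a realistic enterprise toolchain. A real system would estimate this
# from data rather than hard-coding it.
TOOL_COOCCURRENCE = {
    "slack": 0.9,
    "linear": 0.8,
    "email": 0.6,
    "browser": 0.7,
}


def generate_environment(seed: int, complexity: float = 0.5) -> EnvironmentConfig:
    """Sample a tool set and task list for one world. `complexity` is the knob
    a human reviewer could adjust in a human-in-the-loop feedback step."""
    rng = random.Random(seed)
    tools = ["github"]  # anchor every world on a code host
    for candidate, p in TOOL_COOCCURRENCE.items():
        if rng.random() < p * (0.5 + complexity):
            tools.append(candidate)

    # Compose tasks whose dependencies span multiple tools, so agents must
    # cross tool boundaries (read a ticket, edit code, report in chat).
    n_tasks = max(1, int(complexity * 10))
    tasks = [
        {"id": f"task-{i}", "requires": rng.sample(tools, k=min(2, len(tools)))}
        for i in range(n_tasks)
    ]
    return EnvironmentConfig(tools=tools, tasks=tasks, seed=seed)


# Thousands of distinct, reproducible world configurations from one loop.
worlds = [generate_environment(seed=s, complexity=0.7) for s in range(1000)]
```

In this toy version the seed makes each world reproducible, which is one plausible way to keep thousands of configurations internally consistent and their task objectives verifiable at scale.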
Rigorous, automated evaluation of AI agents on long-horizon, multi-tool software engineering tasks using the Horizon-SWE benchmark with ML-powered verifiers.
Polymath built an automated grading system that tests whether AI coding agents can actually do a software engineer's full job—from reading a ticket to shipping and monitoring code—not just answer trivia questions about programming.
Horizon-SWE is Polymath's flagship benchmark product, simulating a complete software company environment where AI agents must perform end-to-end software engineering: interpreting product requirements from Linear tickets, navigating codebases on GitHub, writing and testing code, deploying via CI/CD pipelines, and monitoring production systems. ML-powered verifiers automatically assess agent performance across multiple dimensions—correctness, efficiency, code quality, tool usage patterns, and task completion—without requiring human graders. These verifiers use learned models trained on expert-annotated trajectories to distinguish between superficially correct and genuinely robust agent behavior. Current frontier models (GPT-4-class and beyond) achieve only approximately 25% on Horizon-SWE, demonstrating the benchmark's difficulty and its value as a discriminating evaluation instrument. The benchmark serves as both a product differentiator for Polymath and a community resource that drives adoption of their simulation platform by AI research labs seeking to measure real-world agent capability.
It's like building a full-scale simulated restaurant kitchen where you test whether a robot chef can handle an entire dinner service—not just see if it can chop an onion.
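The snippet below sketches what a multi-dimensional verifier output might look like: scoring a single agent trajectory for completion, correctness, efficiency, and tool coverage, then combining the dimensions into an overall score. The real Horizon-SWE verifiers are described as learned models trained on expert-annotated trajectories; the hand-written proxies and weights here are hypothetical placeholders that only illustrate the shape of the interface.

```python
# Illustrative sketch only: the Horizon-SWE verifier interface is not public.
from dataclasses import dataclass


@dataclass
class Step:
    tool: str          # e.g. "linear", "github", "ci"
    action: str        # e.g. "open_pr", "run_tests", "deploy"
    succeeded: bool


@dataclass
class Trajectory:
    steps: list[Step]
    tests_passed: bool
    deployed: bool


def verify(traj: Trajectory) -> dict[str, float]:
    """Score one agent run along several dimensions. A production verifier
    would be a learned model; these simple proxies only show the output shape."""
    completion = 1.0 if (traj.tests_passed and traj.deployed) else 0.0
    correctness = sum(s.succeeded for s in traj.steps) / max(1, len(traj.steps))
    # Efficiency proxy: penalize needlessly long trajectories.
    efficiency = min(1.0, 20 / max(1, len(traj.steps)))
    # Tool-usage proxy: did the agent touch ticketing, code, and CI at all?
    tool_coverage = len({s.tool for s in traj.steps} & {"linear", "github", "ci"}) / 3
    return {
        "completion": completion,
        "correctness": correctness,
        "efficiency": efficiency,
        "tool_coverage": tool_coverage,
        "overall": (0.4 * completion + 0.3 * correctness
                    + 0.15 * efficiency + 0.15 * tool_coverage),
    }
```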
Synthetic interaction data generation from agent-environment sessions used to create high-quality training datasets for improving AI agent capabilities via reinforcement learning.
Every time an AI agent practices in Polymath's simulated worlds, it automatically creates labeled training data showing what worked and what didn't—like a flight simulator that writes its own textbook after every session.
As AI agents interact with Polymath's procedurally generated environments, every action, observation, tool invocation, and outcome is captured as structured trajectory data. This synthetic interaction data is then processed through Polymath's automated verifiers to produce richly labeled datasets—annotating successful strategies, failure modes, tool usage patterns, and decision quality at each step. These datasets serve as high-quality training signal for reinforcement learning from environment feedback (RLEF), enabling iterative agent improvement without costly human annotation. The data pipeline supports filtering by task type, difficulty, tool combination, and outcome, allowing researchers to construct targeted training curricula. Because the environments are procedurally generated and internally consistent, the synthetic data exhibits natural diversity that mitigates distribution shift problems common in hand-curated datasets. This creates a powerful flywheel: better environments produce richer data, which trains better agents, whose failures reveal gaps that inform the next generation of environment design.
It's like a driving school where every student's practice session automatically writes a new chapter in the driver's ed manual—complete with highlighted mistakes and gold-star moments—so the next class learns even faster.
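The sketch below illustrates the data side of that flywheel under assumed field names: each session becomes a labeled trajectory record, and simple filters over task type, difficulty, and outcome carve out a targeted training slice. Polymath's actual schema and pipeline are not public, so the record fields and the build_curriculum helper are hypothetical.

```python
# Illustrative sketch only: record fields and filter criteria are hypothetical.
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class TrajectoryRecord:
    """One agent-environment session, labeled by the automated verifiers."""
    env_seed: int
    task_type: str           # e.g. "bugfix", "deploy", "triage"
    difficulty: float        # 0.0 (easy) .. 1.0 (hard)
    tools: tuple[str, ...]
    verifier_score: float    # overall score from the verification stage
    succeeded: bool


def build_curriculum(
    records: Iterable[TrajectoryRecord],
    task_type: str,
    min_difficulty: float,
    keep_failures: bool = True,
) -> Iterator[TrajectoryRecord]:
    """Filter labeled trajectories into a targeted training slice.
    Failures are often worth keeping: contrasting good and bad rollouts on
    the same task is useful signal for RL-style fine-tuning."""
    for rec in records:
        if rec.task_type != task_type:
            continue
        if rec.difficulty < min_difficulty:
            continue
        if not keep_failures and not rec.succeeded:
            continue
        yield rec


# Example slice: hard deployment tasks, including failed attempts.
# curriculum = list(build_curriculum(all_records, "deploy", min_difficulty=0.6))
```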
Polymath's unfair advantage is its focus on automated world generation for multi-tool digital workflows, a gap left by robotics-centric simulators like Isaac Sim and AI2-THOR. By building environments that mirror real enterprise toolchains (GitHub, Slack, Linear, browsers), they create the only realistic training ground for the next generation of AI coding and knowledge-work agents, validated by benchmarks where even frontier models score only ~25%.