Post-training lab building rubric-based reward models and agentic frameworks for LLM alignment.
Using rubric-based reward modeling for LLM evaluation, academic review simulation for paper feedback, and agentic AI orchestration for alignment workflows.

Post-training | YC W26

Last Updated: March 19, 2026

A post-training research and product lab that builds rubric-based reward models, agentic AI frameworks, and open-source developer tooling to align, evaluate, and fine-tune large language models after pre-training.
Rubric AI has publicly released open-source agentic app frameworks (modular packages for agents, memory, events, auth, UI), a CLI bootstrapping tool (create-rubric-app), and CSPaper, a rubric-aligned academic paper feedback tool targeting top ML conferences (ICML, ICLR, SIGIR). Their GitHub monorepo signals continued investment in composable, type-safe developer tooling for LLM-powered applications and agent orchestration (rOS).
GitHub commit patterns suggest active development of an "Agent Operating System" (rOS) with persistent memory and event-driven workflows, pointing toward enterprise-grade agent orchestration. Social signals (Twitter mentions of @calcom, @triggerdotdev, @AgentHub_AI) indicate B2B infrastructure partnerships likely preceding a formal product launch. The absence of Hugging Face model/dataset releases combined with heavy framework development suggests they are building proprietary evaluation and alignment pipelines internally before open-sourcing model artifacts. A Y Combinator listing without disclosed funding hints at imminent fundraising or stealth batch participation. Job market silence suggests a founder-mode deep R&D phase before scaling.
Rubric-based reward modeling replaces subjective human preference scoring with structured, criteria-driven checklists to align LLMs more reliably and interpretably during post-training.
Instead of asking people "which AI answer do you like better?" and hoping they're consistent, Rubric AI gives the graders (human or AI) a detailed checklist—like a cooking competition scorecard—so every model output is judged on the same clear criteria.
Traditional RLHF relies on human annotators ranking model outputs by subjective preference, which introduces inconsistency, annotator bias, and poor scalability. Rubric AI's core engineering innovation replaces this with structured rubric-based reward models: for each prompt or task type, a detailed rubric defines weighted criteria (accuracy, helpfulness, safety, formatting, domain relevance). Human or AI judges score outputs against these rubrics, producing decomposed, interpretable reward signals rather than opaque scalar preferences. These rubric scores feed into Direct Preference Optimization (DPO) or PPO-based RLHF pipelines, enabling more stable training, targeted debugging of failure modes, and curriculum-based fine-tuning that progresses from instruction-following to open-ended generation. The system also supports RLAIF (AI-as-judge) where LLMs self-grade against rubrics, creating synthetic preference data at scale. This approach makes post-training reproducible, auditable, and dramatically cheaper—critical for startups and enterprises that cannot afford Scale AI-level annotation budgets.
It's like replacing a restaurant's "rate your meal 1-5 stars" card with a detailed scorecard for flavor, presentation, temperature, and portion size—suddenly you know exactly what to fix in the kitchen.
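To make the mechanism concrete, here is a minimal sketch of a weighted-checklist rubric feeding a DPO-style preference pair. The criterion names, weights, and function names are illustrative assumptions for this sketch, not Rubric AI's actual schema or code.

```python
from __future__ import annotations
from dataclasses import dataclass

# Illustrative rubric: weighted criteria. Names and weights are assumptions
# for this sketch, not Rubric AI's actual rubric schema.
RUBRIC = {
    "accuracy": 0.35,
    "helpfulness": 0.25,
    "safety": 0.20,
    "formatting": 0.10,
    "domain_relevance": 0.10,
}


@dataclass
class RubricScore:
    """Per-criterion scores in [0, 1] from a human or LLM judge."""
    scores: dict[str, float]

    def reward(self, rubric: dict[str, float]) -> float:
        # Weighted sum collapses the decomposed signal into a scalar usable
        # by PPO; the per-criterion breakdown is kept for debugging.
        return sum(w * self.scores.get(c, 0.0) for c, w in rubric.items())


def to_dpo_pair(prompt: str, a: tuple[str, RubricScore], b: tuple[str, RubricScore]) -> dict:
    """Turn two rubric-scored completions into a DPO (chosen, rejected) pair."""
    (text_a, sa), (text_b, sb) = a, b
    ra, rb = sa.reward(RUBRIC), sb.reward(RUBRIC)
    chosen, rejected = (text_a, text_b) if ra >= rb else (text_b, text_a)
    # The margin can be used to filter out low-confidence preference pairs.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "margin": abs(ra - rb)}


# Example: an AI judge (RLAIF) has already filled in per-criterion scores.
pair = to_dpo_pair(
    "Explain gradient clipping.",
    ("Answer A ...", RubricScore({"accuracy": 0.9, "helpfulness": 0.8, "safety": 1.0,
                                  "formatting": 0.7, "domain_relevance": 0.9})),
    ("Answer B ...", RubricScore({"accuracy": 0.6, "helpfulness": 0.7, "safety": 1.0,
                                  "formatting": 0.9, "domain_relevance": 0.8})),
)
print(pair["chosen"], round(pair["margin"], 3))
```

Because the per-criterion scores travel alongside the scalar reward, a regression on, say, the safety criterion can be traced and debugged directly rather than being buried inside an opaque preference signal.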
CSPaper provides venue-specific, rubric-aligned simulated peer review feedback for academic ML paper submissions, helping researchers identify and fix acceptance blockers before submission.
CSPaper acts like a practice round with a tough-but-fair AI reviewer who knows exactly what ICML or ICLR reviewers look for, so researchers can fix problems before the real reviews come back.
CSPaper is Rubric AI's flagship product application, targeting the academic ML community. For each supported venue (ICML, ICLR, SIGIR, NeurIPS, etc.), CSPaper encodes the actual review rubrics and acceptance criteria used by program committees—novelty, technical soundness, clarity, reproducibility, significance, and ethical considerations. When a researcher uploads a draft, the system uses LLM-based evaluation agents that simulate multi-reviewer panels, each scoring the paper against the venue-specific rubric. The output is a structured review with per-criterion scores, highlighted weaknesses, and prioritized revision suggestions ranked by impact on acceptance probability. Unlike generic grammar or writing tools, CSPaper understands the meta-game of peer review: it flags missing ablation studies, insufficient baselines, unclear threat models, and positioning gaps relative to related work. The tool leverages retrieval-augmented generation to compare submissions against published proceedings and identify novelty gaps. This creates a powerful feedback loop where researchers iterate on rubric-aligned weaknesses, dramatically improving submission quality and reducing the costly cycle of reject-revise-resubmit that plagues the ML community.
It's like having a brutally honest friend who's served on every top conference program committee read your paper and hand you a color-coded fix-it list before you hit submit.
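The sketch below shows the general shape of such a pipeline: a venue-specific rubric, a panel of simulated reviewers scoring each criterion, and weaknesses prioritized by how far they fall below par. The criteria list, thresholds, and function names are invented for illustration, and the LLM judge call is stubbed out rather than reflecting CSPaper's real implementation.

```python
import statistics

# Hypothetical venue rubric; real ICLR/ICML review forms differ in wording
# and detail, so treat these criteria as placeholders.
VENUE_RUBRIC = ["novelty", "technical_soundness", "clarity",
                "reproducibility", "significance"]


def llm_judge(paper_text: str, criterion: str, persona: str) -> tuple[float, str]:
    """Placeholder judge. A real system would prompt an LLM with the venue's
    reviewer guidelines plus retrieved related work and parse a structured
    score and comment; here we return a deterministic dummy value."""
    score = 5.0 + (len(criterion + persona) % 5)
    return score, f"{criterion}: needs stronger evidence"


def simulate_panel(paper_text: str, rubric: list, n_reviewers: int = 3) -> dict:
    personas = [f"reviewer_{i}" for i in range(n_reviewers)]
    per_criterion, weaknesses = {}, []
    for criterion in rubric:
        scores = []
        for persona in personas:
            score, note = llm_judge(paper_text, criterion, persona)
            scores.append(score)
            if score < 6:  # below a typical borderline threshold
                weaknesses.append((criterion, persona, note))
        per_criterion[criterion] = statistics.mean(scores)
    # Order revision suggestions by how far each criterion sits below par,
    # a crude proxy for impact on acceptance probability.
    revise_first = sorted(per_criterion, key=per_criterion.get)
    return {"scores": per_criterion, "weaknesses": weaknesses,
            "revise_first": revise_first[:3]}


review = simulate_panel("(full paper text here)", VENUE_RUBRIC)
print(review["scores"])
print(review["revise_first"])
```

In a production system the judge would also see retrieved related work, which is what allows the tool to flag novelty and positioning gaps against published proceedings rather than only surface-level writing issues.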
An agent operating system (rOS) that orchestrates autonomous AI agents with persistent memory, event-driven workflows, and modular tool integration for enterprise automation tasks.
rOS is like an air traffic control tower for AI agents—it keeps track of what each agent knows, what it's doing, and when to hand off tasks, so complex business workflows run themselves without crashing into each other.
Rubric AI's rOS (Agent Operating System) is a conceptual and technical framework for orchestrating multiple AI agents in production environments. Unlike single-shot LLM calls, rOS provides persistent memory (agents retain context across sessions and tasks), event-driven architecture (agents respond to triggers from external systems like calendars, CRMs, code repositories, and communication tools), and modular action packages (type-safe integrations with third-party services). The open-source monorepo (RubricLab) already includes packages for actions, agents, blocks (UI components), chains (multi-step workflows), events, authentication, and memory—forming the building blocks of rOS. B2B partnership signals with Cal.com (scheduling), Trigger.dev (background job orchestration), and AgentHub AI (agent marketplace) suggest rOS is being designed as connective tissue for enterprise agent deployments. The system uses rubric-based evaluation to continuously assess agent performance against task-specific success criteria, enabling automated quality assurance and self-improvement loops. For enterprises, this means complex multi-step workflows (customer onboarding, incident response, content pipelines) can be delegated to coordinated agent teams.
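Whatever language the real RubricLab packages use, the orchestration pattern described above (persistent memory plus an event bus plus post-task checks) can be sketched in a few dozen lines. All class names, event names, and agents below are invented for illustration and are not rOS's actual API.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MemoryStore:
    """Persistent per-agent memory. This sketch keeps it in-process; a real
    deployment would back it with a database so context survives sessions."""
    def __init__(self) -> None:
        self._store: Dict[str, List[str]] = defaultdict(list)

    def remember(self, agent: str, fact: str) -> None:
        self._store[agent].append(fact)

    def recall(self, agent: str) -> List[str]:
        return list(self._store[agent])


class EventBus:
    """Event-driven core: agents register handlers for external triggers
    (calendar events, CRM updates, repo webhooks, and so on)."""
    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)


memory = MemoryStore()
bus = EventBus()


def onboarding_agent(event: dict) -> None:
    # One step of a multi-step workflow: act, persist what was done,
    # then hand off to the next agent by publishing a follow-up event.
    memory.remember("onboarding", f"provisioned account for {event['customer']}")
    bus.publish("account.provisioned", event)


def qa_agent(event: dict) -> None:
    # Rubric-style self-check hook: score the completed task against
    # task-specific success criteria (a fixed checklist in this sketch).
    checklist = {"account_created": True, "welcome_email_sent": True}
    memory.remember("qa", f"{event['customer']}: {checklist}")


bus.subscribe("customer.signed_up", onboarding_agent)
bus.subscribe("account.provisioned", qa_agent)
bus.publish("customer.signed_up", {"customer": "Acme Corp"})
print(memory.recall("qa"))
```

The key property is that agents never call each other directly: they coordinate only through events and shared memory, which keeps multi-step workflows auditable and, as the analogy above puts it, keeps agents from crashing into each other.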
Rubric AI combines rare production-grade engineering experience (Meta, Apple, Snapchat payments infrastructure) with investment-side product intuition (Climate Capital), enabling the team to bridge the gap between cutting-edge post-training research and scalable, developer-friendly products, a combination almost no other two-person lab possesses.