How Is Ashr Using AI?

Ashr catches AI agent failures before users see them by stress-testing agents across text, voice, and images.

It does this with LLM-driven synthetic test generation, custom ML scorers for business-specific quality evaluation, and multi-modal swarm testing that uncovers rare edge cases at scale.

Company Overview

Ashr builds an automated testing platform that mimics users in production environments to catch AI agent failures before they reach real users. The platform uses synthetic test generation and custom ML scorers to evaluate agents across multiple modalities.

Product Roadmap & Public Announcements

Ashr has launched its testing platform, which mimics users in production environments to catch agent failures. Per its YC profile, the tagline is "Mimic Users in Your Production Environment to Catch Agent Fails." The platform is live with synthetic, scenario-driven testing capabilities.

Signals & Private Analysis

A team size of two, per the YC profile, suggests a focused pre-scaling phase. The AI agent testing space is growing rapidly alongside agentic AI adoption. Ashr is likely building toward CI/CD pipeline integration and framework-specific support.

Machine Learning Use Cases

LLM synthetic scenario generation
For: Product Differentiation · Engineering

Uses large language models to automatically generate diverse, realistic, and adversarial synthetic test scenarios that simulate real-world user interactions with AI agents across multiple modalities.

Layman's Explanation

It's like hiring a thousand creative QA testers who never sleep, each dreaming up unique ways to trick and challenge your AI agent.

Use Case Details

Ashr's core ML use case is leveraging large language models (likely GPT-4-class or fine-tuned open-source models) to automatically generate synthetic test cases for AI agents. Rather than relying on human testers to manually write scenarios, Ashr's platform prompts LLMs to produce diverse, realistic, and adversarial inputs—spanning text, voice transcripts, UI interaction sequences, images, and file uploads—that simulate the full range of real-world user behavior. The system uses prompt engineering, few-shot learning, and domain-specific fine-tuning to ensure generated scenarios are contextually relevant, edge-case-rich, and aligned with the customer's specific agent use case. This enables organizations to achieve dramatically higher test coverage, uncover rare failure modes (hallucinations, cross-modal inconsistencies, instruction-following failures), and iterate on agent quality far faster than traditional manual QA allows. The synthetic generation pipeline is designed to scale horizontally, supporting swarm-based parallel test execution across thousands of scenarios simultaneously.
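
To make the pattern concrete, here is a minimal Python sketch of LLM-driven scenario generation, assuming an OpenAI-compatible chat API. The prompt wording, model choice, and scenario schema are all illustrative assumptions; Ashr's actual generation pipeline is not public.

```python
# Minimal sketch of LLM-driven synthetic scenario generation.
# Assumes an OpenAI-compatible chat API; the prompt, model, and scenario
# schema below are illustrative, not Ashr's actual pipeline.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GENERATOR_PROMPT = """You generate adversarial test scenarios for an AI agent.
Agent under test: {agent_description}
Produce {n} diverse scenarios as a JSON array. Each scenario must have:
  "persona": who the simulated user is,
  "modality": one of "text", "voice_transcript", "image+text",
  "input": the exact user input,
  "expected_behavior": what a correct agent response looks like.
Favor edge cases: ambiguity, conflicting instructions, out-of-scope requests.
Reply with raw JSON only, no code fences."""

def generate_scenarios(agent_description: str, n: int = 10) -> list[dict]:
    """Ask an LLM to invent n adversarial test scenarios."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model would work here
        messages=[{
            "role": "user",
            "content": GENERATOR_PROMPT.format(
                agent_description=agent_description, n=n
            ),
        }],
        temperature=1.0,  # high temperature encourages scenario diversity
    )
    # A production pipeline would use structured outputs; for a sketch,
    # we assume the model honors the "raw JSON only" instruction.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    scenarios = generate_scenarios(
        "A banking support agent that answers balance and transfer questions"
    )
    for s in scenarios:
        print(s["modality"], "->", s["input"][:60])
```

In practice, a pipeline like this would also deduplicate near-identical scenarios and seed the prompt with few-shot examples from the customer's domain, as the details above describe.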

Analogy

It's like having an improv comedy troupe that rehearses every possible audience heckle so your AI agent never freezes on stage.

Custom ML evaluation scoring
For: Decision Quality · Product

Deploys fine-tuned machine learning models as custom scorers that automatically evaluate AI agent outputs against business-specific quality criteria, replacing subjective human review with consistent, scalable, and configurable evaluation.

Layman's Explanation

Instead of having a room full of experts grade every AI response by hand, Ashr trains a digital judge that scores answers the way your business cares about.

Use Case Details

Ashr enables customers to define and deploy fine-tuned ML models—likely transformer-based classifiers or LLM-as-judge architectures—that serve as custom scorers for evaluating AI agent outputs. These scorers go beyond generic accuracy metrics: they can be configured and fine-tuned to assess domain-specific quality dimensions such as factual correctness, tone alignment, regulatory compliance, task completion, and cross-modal consistency. For example, a financial services customer might fine-tune a scorer to flag hallucinated numbers, while a healthcare customer might prioritize medical accuracy and empathy. The platform likely supports both zero-shot LLM-based evaluation (using prompt-driven rubrics) and supervised fine-tuning on labeled customer data for higher precision. Scorers run automatically as part of the testing pipeline, producing structured evaluation reports with pass/fail rates, confidence scores, and inline diffs against expected outputs. This replaces slow, inconsistent manual review with rapid, reproducible, and auditable quality assessment—critical for enterprises deploying agents in regulated or high-stakes environments.
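
As a rough illustration of the LLM-as-judge pattern described above, the sketch below scores one agent response against a configurable rubric. The rubric criteria, judge prompt, and scoring scale are hypothetical; Ashr's scorer architecture is not publicly documented.

```python
# Illustrative LLM-as-judge scorer with a business-specific rubric.
# The criteria names, prompt, and 0-1 scale are assumptions for this sketch.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluation model. Score the agent response
against each rubric criterion from 0 (fail) to 1 (pass).

User input: {user_input}
Agent response: {agent_response}
Rubric criteria: {criteria}

Reply with a JSON object mapping each criterion to a score, plus a
"rationale" string explaining the lowest score."""

def score_response(user_input: str, agent_response: str,
                   criteria: list[str]) -> dict:
    """Evaluate one agent output against a configurable rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_input=user_input,
            agent_response=agent_response,
            criteria=json.dumps(criteria),
        )}],
        temperature=0.0,  # low temperature for more reproducible judging
        response_format={"type": "json_object"},  # force parseable JSON
    )
    return json.loads(response.choices[0].message.content)

# Example: a finance-flavored rubric that flags fabricated figures.
report = score_response(
    user_input="What's my checking balance?",
    agent_response="Your balance is $4,210.55 as of today.",
    criteria=["no_fabricated_figures", "professional_tone", "task_completed"],
)
print(report)
```

A supervised variant would replace the zero-shot judge with a classifier fine-tuned on labeled customer data, trading configurability for precision, which matches the two evaluation modes described above.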

Analogy

It's like training a food critic who knows exactly what your restaurant's regulars love, so every dish gets scored before it leaves the kitchen.

Multi-modal swarm stress testing
For: Risk Reduction · Operations

Orchestrates large-scale, ML-driven swarm tests that simultaneously bombard AI agents with thousands of diverse multi-modal inputs to uncover rare edge cases, performance bottlenecks, and cross-modal failure modes that traditional testing misses.

Layman's Explanation

Imagine unleashing a flash mob of thousands of simulated users—each speaking, typing, uploading, and clicking differently—all at once to see if your AI agent can handle the chaos.

Use Case Details

Ashr's swarm testing capability uses ML-driven orchestration to launch thousands of concurrent, synthetic agent interactions across multiple modalities simultaneously. Each "swarm agent" is an LLM-generated persona with unique behavioral patterns, input types (text queries, voice commands, image uploads, file attachments, UI navigation sequences), and adversarial strategies. The orchestration layer intelligently distributes test load, manages concurrency, and aggregates results in real time. Machine learning is used not only to generate the swarm inputs but also to adaptively prioritize test paths: reinforcement learning or bandit-style algorithms likely guide the swarm toward unexplored regions of the agent's behavior space, maximizing the probability of discovering novel failure modes. This approach is particularly powerful for multi-modal agents, where failures often emerge at the intersection of modalities—for example, an agent that handles text queries well but breaks when the same query arrives as a voice transcript with background noise, or when an image and text instruction conflict. The result is a comprehensive stress test that reveals performance bottlenecks, hallucination patterns, and cross-modal inconsistencies at a scale and speed impossible with manual QA.
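
The sketch below shows the general shape of such an orchestrator: many concurrent simulated users, with a simple epsilon-greedy bandit steering load toward the scenario categories that have surfaced the most failures so far. The category names, bandit policy, and function signatures are illustrative assumptions; Ashr has not published its orchestration design.

```python
# Hedged sketch of swarm-style stress testing: concurrent simulated users
# plus an epsilon-greedy bandit that prioritizes failure-prone categories.
# Everything here is illustrative, not Ashr's actual orchestrator.
import asyncio
import random
from collections import defaultdict

CATEGORIES = ["text", "voice_transcript", "image+text", "conflicting_modal"]
EPSILON = 0.2  # fraction of purely random exploration

failures = defaultdict(int)  # failures observed per category
trials = defaultdict(int)    # attempts per category

def pick_category() -> str:
    """Epsilon-greedy: usually exploit the most failure-prone category."""
    if random.random() < EPSILON or not trials:
        return random.choice(CATEGORIES)
    return max(CATEGORIES, key=lambda c: failures[c] / max(trials[c], 1))

async def run_swarm_agent(call_agent, scenario_for) -> None:
    """One simulated user: pick a category, run a scenario, record outcome."""
    category = pick_category()
    scenario = scenario_for(category)   # e.g. an LLM-generated input
    passed = await call_agent(scenario)  # True if the agent behaved correctly
    trials[category] += 1
    if not passed:
        failures[category] += 1

async def run_swarm(call_agent, scenario_for,
                    total: int = 1000, concurrency: int = 50) -> None:
    """Launch `total` simulated interactions, `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded() -> None:
        async with sem:
            await run_swarm_agent(call_agent, scenario_for)

    await asyncio.gather(*(bounded() for _ in range(total)))
    for c in CATEGORIES:
        print(f"{c}: {failures[c]}/{trials[c]} failures")

if __name__ == "__main__":
    async def flaky_agent(scenario: str) -> bool:
        await asyncio.sleep(0.01)      # stand-in for a real agent call
        return random.random() > 0.1   # ~10% simulated failure rate

    asyncio.run(run_swarm(flaky_agent, lambda c: f"{c} scenario"))
```

A bandit is the simplest adaptive prioritizer; the reinforcement-learning approach mentioned above would replace `pick_category` with a policy trained on richer interaction state.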

Analogy

It's like crash-testing a car not just from the front, but from every angle, speed, and weather condition simultaneously—before a single customer gets behind the wheel.

Key Technical Team Members

  • Shreyas Kaps, Co-founder
  • Rohan Kulkarni, Co-founder
  • Brij Patel, Early Engineer

Shreyas brings direct experience building AI agents in production (finance and DevOps), meaning he has lived the exact testing pain point the product solves. Rohan has a successful exit (Ask Geri, acquired) and Berkeley EECS credentials. They are building testing infrastructure from the perspective of agent builders, not QA generalists.

Funding History

  • 2025: Ashr founded by Shreyas Kaps and Rohan Kulkarni
  • 2026: Accepted into Y Combinator W26 batch

Competitors

  • AI Agent Testing: Patronus AI, Galileo AI, Confident AI (DeepEval)
  • General ML Testing: Weights &amp; Biases, Arize AI, Kolena
  • Traditional QA: Selenium, Playwright, Cypress
  • Emerging: AgentOps, LangSmith, Braintrust