Lets teams edit professional video through natural language commands in the browser, with no traditional NLE (non-linear editor) required.
Uses multimodal LLMs for natural language video editing, automated highlight extraction from raw footage, and semantic search across video libraries.

Video Editing | YC W26

Last Updated: March 19, 2026

AI-powered, browser-based video editor that uses multimodal LLMs and agentic AI to let users edit professional video through natural language commands, automated clip selection, and real-time collaboration. Runs natively in the browser via WebGPU and WebCodecs.
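To make the browser-native claim concrete, here is a minimal TypeScript sketch of frame decoding with the standard WebCodecs API. The codec string, the demuxed chunk source, and the `renderFrame` hook are illustrative assumptions, not Cardboard's actual pipeline.

```typescript
// Hypothetical inputs: a demuxer (e.g. an MP4 parser) would supply these.
interface DemuxedChunk {
  isKeyframe: boolean;
  timestampMicros: number;
  data: Uint8Array;
}
declare const demuxedChunks: DemuxedChunk[];
declare function renderFrame(frame: VideoFrame): void; // e.g. a WebGPU render pass

async function decodeAll(): Promise<void> {
  const decoder = new VideoDecoder({
    output: (frame) => {
      renderFrame(frame); // hand each decoded frame to the renderer
      frame.close();      // release the frame's memory promptly
    },
    error: (e) => console.error("decode failed:", e),
  });

  decoder.configure({ codec: "vp09.00.10.08" }); // VP9, profile 0, level 1.0, 8-bit

  for (const chunk of demuxedChunks) {
    decoder.decode(
      new EncodedVideoChunk({
        type: chunk.isKeyframe ? "key" : "delta",
        timestamp: chunk.timestampMicros,
        data: chunk.data,
      })
    );
  }
  await decoder.flush(); // wait for all queued frames to come out
}
```

Keeping decode and render in the browser this way is what allows editing without a round trip to a desktop NLE or a server-side render farm.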
Browser-based professional editing with natural language commands, automated captioning, cloud collaboration, and export to Premiere Pro/DaVinci Resolve. Tiered pricing (Creator $60/mo, Pro $150/mo, Teams custom). Highest-upvoted Hacker News launch in YC W26.
Pricing tiers suggest an enterprise/agency sales motion. 'Early access to new models' on the Pro tier hints at proprietary model development. Lean engineering team focused on core AI R&D. Likely a plugin/agent extensibility architecture.
AI interprets plain-English editing commands and autonomously executes complex timeline operations on video projects.
Instead of clicking through menus and dragging clips on a timeline, you just tell the editor what you want in plain English and it does it for you.
Cardboard's semantic natural language editing system uses multimodal large language models to parse user intent from free-text prompts and map those instructions to precise timeline operations—cuts, transitions, reordering, audio adjustments, and effects. The system ingests both the text command and the underlying video/audio content, enabling context-aware edits (e.g., "remove all pauses longer than 2 seconds" or "add a zoom-in every time the speaker says 'important'"). This goes beyond simple keyword matching; the model understands temporal, spatial, and semantic relationships within the footage, allowing it to execute multi-step editing workflows from a single prompt. The result is a dramatically lower barrier to entry for professional-quality editing and a significant speed advantage for experienced editors handling repetitive tasks.
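To illustrate the command-to-operation mapping described above, here is a hedged TypeScript sketch. The operation schema, the `callModel` client, and the prompt format are hypothetical stand-ins, not Cardboard's real interface.

```typescript
// A discriminated union of timeline operations the model can emit
// (illustrative; the real operation set is unknown).
type TimelineOp =
  | { kind: "cut"; clipId: string; atSec: number }
  | { kind: "removeRange"; clipId: string; startSec: number; endSec: number }
  | { kind: "addEffect"; clipId: string; effect: "zoomIn" | "fadeOut"; atSec: number }
  | { kind: "adjustAudio"; clipId: string; gainDb: number };

declare function callModel(prompt: string): Promise<string>; // hypothetical LLM client

async function planEdits(command: string, transcript: string): Promise<TimelineOp[]> {
  // Ground the model in a timestamped transcript so references like
  // "every time the speaker says 'important'" resolve to real timecodes.
  const prompt =
    `Transcript with timestamps:\n${transcript}\n\n` +
    `User command: "${command}"\n` +
    `Respond with a JSON array of timeline operations.`;
  const raw = await callModel(prompt);
  return JSON.parse(raw) as TimelineOp[]; // a production system would validate the schema
}

// e.g. planEdits("remove all pauses longer than 2 seconds", transcript)
// might return: [{ kind: "removeRange", clipId: "a-roll", startSec: 12.4, endSec: 15.1 }, ...]
```

The key design point is that the model returns structured, machine-checkable edits rather than free text, so the editor can validate and replay them deterministically on the timeline.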
It's like having a film editor sitting next to you who instantly understands "make it punchier" without you ever touching the Avid.
AI analyzes raw footage to automatically identify and extract the best moments, generate highlight reels, and sync edits to music.
The AI watches all your raw footage, picks out the best moments, and assembles a polished highlight reel before you've even finished your coffee.
Cardboard's automated clip selection engine leverages multimodal LLMs and computer vision to analyze hours of raw video footage—detecting speaker energy, facial expressions, audience engagement cues, audio peaks, and semantic content relevance—to surface the most compelling moments automatically. The system scores each segment on multiple dimensions (visual quality, audio clarity, emotional intensity, topical relevance) and assembles candidate highlight reels ranked by predicted viewer engagement. It can also sync selected clips to uploaded music tracks by aligning cuts to beat patterns and energy curves. For marketing teams and creators producing high volumes of content (podcasts, webinars, live streams), this eliminates the most time-consuming phase of post-production: reviewing and selecting footage. The output is a draft timeline that users can refine, rather than a blank canvas they must build from scratch.
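A minimal sketch of the multi-dimensional scoring idea follows, assuming upstream vision and audio models have already produced per-segment feature scores in [0, 1]. The weights, field names, and greedy selection strategy are illustrative assumptions.

```typescript
// One candidate segment with precomputed feature scores (assumed upstream).
interface Segment {
  startSec: number;
  endSec: number;
  visualQuality: number;
  audioClarity: number;
  emotionalIntensity: number;
  topicalRelevance: number;
}

// Illustrative weights; a real system might learn these from engagement data.
const WEIGHTS = {
  visualQuality: 0.2,
  audioClarity: 0.2,
  emotionalIntensity: 0.3,
  topicalRelevance: 0.3,
};

function score(s: Segment): number {
  return (
    WEIGHTS.visualQuality * s.visualQuality +
    WEIGHTS.audioClarity * s.audioClarity +
    WEIGHTS.emotionalIntensity * s.emotionalIntensity +
    WEIGHTS.topicalRelevance * s.topicalRelevance
  );
}

// Greedily take the highest-scoring segments until the reel hits a target length.
function buildHighlightReel(segments: Segment[], targetSec: number): Segment[] {
  const ranked = [...segments].sort((a, b) => score(b) - score(a));
  const reel: Segment[] = [];
  let total = 0;
  for (const s of ranked) {
    const len = s.endSec - s.startSec;
    if (total + len > targetSec) continue; // skip segments that would overflow
    reel.push(s);
    total += len;
  }
  // Restore chronological order for the draft timeline.
  return reel.sort((a, b) => a.startSec - b.startSec);
}
```

Beat syncing would then adjust cut points within the selected segments toward detected beat timestamps, but that step is omitted here.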
It's like hiring an intern who actually watched all 47 hours of your conference footage and somehow picked the only five minutes worth posting.
AI enables users to search across all uploaded footage by describing what happened in the video, not by filenames or timestamps.
Instead of scrubbing through hours of footage looking for "the part where she holds up the product," you just type that and the AI finds it instantly.
Cardboard's content-based semantic search system indexes all uploaded video assets using multimodal embeddings—combining visual scene understanding, object/person detection, speech transcription, and on-screen text recognition into a unified vector representation. When a user types a natural language query like "the moment the CEO mentions Q3 revenue" or "close-up of the red sneakers," the system performs a semantic similarity search across the entire indexed library and returns timestamped results ranked by relevance. This fundamentally changes how editors and teams interact with large media libraries: instead of relying on manual tagging, folder structures, or memory, they can treat their footage like a searchable knowledge base. For organizations producing hundreds of hours of content monthly, this capability transforms asset management from a bottleneck into a competitive advantage, enabling rapid repurposing and content discovery.
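To ground the retrieval step, here is a small TypeScript sketch of cosine-similarity search over precomputed multimodal embeddings. The `embed` client and the index layout are assumptions for illustration; at scale an approximate-nearest-neighbor index would replace the linear scan.

```typescript
// One indexed moment: a timestamp in an asset plus its embedding vector.
interface IndexedMoment {
  assetId: string;
  timestampSec: number;
  vector: Float32Array; // unified visual + speech + on-screen-text embedding
}

// Hypothetical client; must be the same model used at index time.
declare function embed(text: string): Promise<Float32Array>;

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Embed the query, score every indexed moment, and return the top-k hits.
async function search(query: string, index: IndexedMoment[], k = 5) {
  const q = await embed(query);
  return index
    .map((m) => ({ ...m, score: cosine(q, m.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k); // timestamped results, ranked by relevance
}

// e.g. await search("the moment the CEO mentions Q3 revenue", index)
```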
It's like Google Search, but instead of searching the internet, you're searching your own chaotic mountain of unedited footage—and it actually works.
Saksham and Ishan have known each other for 15 years. Saksham built AI products at Iterate AI and published at ACL. Ishan spent 4.5 years building high-performance web apps at HackerRank and built Hotspoter (5M+ downloads) at age 14. Together they can build the editor core that others avoid.