Case study · HumanFirst · 2025

Execution Mode: making AI prompt execution visible before users pay for it

Designing visibility of system status into a prompt builder, including the trust mechanic I built and then killed two weeks later.

Role: Senior Product Designer, acting as designer and product owner
Team: CEO, two engineers, growth lead
Timeframe: October to November 2025, two week active cycle
Stack: React, TypeScript, Redux, Claude Code, Datadog RUM, Mixpanel

In a rush?

Watch the 2 minute version

Prompt

Write a short summary of {{ company }}. Use the data in @Companies Collection to fill in the variable. Keep it under 100 words.

Execution Settings

Executing on Collection:Companies Collection

3 executions planned

Expected output3 outputs

Atlas Coffee

Atlas Coffee sells roasted coffee subscriptions for small teams and remote offices.

BrightLedger

BrightLedger helps finance teams reconcile invoices and vendor records faster.

Northline Labs

Northline Labs builds workflow software for clinical operations teams.

The shipped execution mode panel as it appears below the prompt editor. Source, mode, count, Run, and expected output all colocated where the user's attention already is.

HumanFirst's prompt builder let users run AI prompts across unstructured datasets.

The challenge: By late 2025, the surface had become powerful and the moment before execution was under explained. A user could click Run without a clear summary of what would happen next: which data would be used, how many executions could start, and whether the selected mode matched the task.

My goal was to make the moment before execution clearer and safer without slowing down quick experimentation.

How I came to the problem

Direct customer access was limited, so I triangulated the problem through product behavior, session replays, analytics, and systematic testing. The evidence pointed to the same issue: the Run flow had become powerful. The system status around it had fallen behind.

I used three signals:

Datadog session replays showed users moving around the Run flow, opening related panels, closing them, and leaving before execution.
Mixpanel showed low run_prompt usage even as the prompt builder became more capable.
While testing the area, I found a smaller bug: the merge toggle produced no visible change when no rows were selected.

The bug gave me a concrete entry point. The larger problem was visibility of system status. Users needed the system to answer a simple question before they committed to the run: what exactly is going to happen when I click this?

The wedge · PromptRunMode component, before

Merge query

A single Switch, labeled Merge query (or Merge context), with a tooltip explaining the behavior. When the user had nothing selected, toggling produced no observable change. The control was technically correct and conceptually empty.

The recreated component, faithful to the shipped version that preceded this work. The bug was that the toggle produced the same visible state when zero items were selected, which I read as a symptom of a deeper mismatch between system state and user model.

This was a product judgment call made under startup constraints. I had enough signal to form a hypothesis, and the fastest responsible way to test it was to prototype the behavior in code.On working with constraints

Key decisions

I worked with the CEO, engineering, and Marketing to turn the execution flow from a loose concept into shipped behavior. The main design question was how much guidance the interface should provide before a user clicked Run.

Mode aware panel

The panel needed to stay visible near execution and appear in the moments where it helped the active workflow. I scoped it to the modes where users needed confirmation before running a job.

Context selector

I used progressive disclosure for the context selector. It appears when the user has a real ambiguity to resolve, keeping every control off the surface by default.

Warning behavior

My first version blocked users when a variable was set with zero selected items. After reviewing the flow with engineering, I changed it to an informative message so experimental runs could stay fast while still making the consequence clear.

The progressive disclosure call · three states of the panel

No variables

Execution Settings

This package will execute once. To execute against each data row, add a variable by typing /.

Variables, no data

Execution Settings

Select rows in Data to populate {{ variables }} during execution.

Variables and data

Execution Settings

Merge once

Per item

Executing on Collection:Companies Collection

3 executions planned

Progressive disclosure across three states. Each state surfaces only what the user can act on. The mode selector and source row appear only when both are necessary.

The trust mechanic I killed

This is the moment of the case study.

On October 23, I shipped the execution mode panel with a token estimator. The goal was sound: before users clicked Run, give them a rough sense of the cost of execution.

The implementation used a lightweight heuristic based on character count and file size approximation. It worked directionally for simple text, then became fragile as files, larger inputs, and model differences entered the flow. Engineering raised the accuracy concern after we had lived with the feature in production.

I made the call with engineering to remove it. On November 6, the token estimator was pulled out of production.

Side by side · the token estimator I killed

Oct 23 to Nov 6, 2025 · shipped, then removed

Execution

Merge once

Per item

Run one per row fromCompanies Collection

3 jobs planned·total ≈ ~1.2K tokensi

Nov 6 onward · the kept version

Execution Settings

Executing on Collection:Companies Collection

3 executions planned

Built and shipped on October 23. Removed on November 6 after engineering raised accuracy concerns. The kept version uses verified signals: a clearer source row, execution count, and the Run button colocated inside the panel.

The estimator solved the right problem in the wrong way. A token number looks authoritative. When that number is rough, slow, or inconsistently accurate, it can damage trust faster than an omitted estimate would.The rationale for the kill

The kept version focused on the trust signals we could stand behind: execution count, selected mode, source, and placement near the Run button.

The version that shipped keeps three trust mechanics that did hold up:

Job count
Driven by priority based selection logic so the number on screen matches the actual execution count across the selected scope.
Mode selector
Merge once and Per item, with a soft notification when a variable is set with no items selected. Inform while preserving momentum.
Adaptive placement
The panel surfaces system status where the user's attention already is, near the Run button, and adapts to the active mode.

Why I prototyped in code

I prototyped this entire feature in code, on an isolated branch, with merge blocks our CTO had approved. Every commit was co authored with Claude Code. The active design and build cycle was three days of concentrated work. The full lifecycle including the token estimator decision was about two weeks.

The single sharpest reason it had to be code: I needed to feel the actual latency of the LLM, the real speed at which the system would return an answer, and whether the token estimate was accurate enough to display. Figma would have hidden that signal.

The decision to remove the estimator depended on lived experience with the running system. Code prototyping gave us the evidence to remove that feature before it continued poisoning user trust. The design review is the secondary proof. I walked the team through a working prototype. They could click. They could try the edge cases of text variable plus context plus selection in another tab. The conversation surfaced edge cases that static frames would have missed.

Two week lifecycle · commit log on the feature branch

9120a79

feat: add execution mode section with job count and token estimation

callebe0 · 2025-10-23co-authored: Claude Code

d7fbeb7

fix: remove redundant check and add tooltip for token estimation

callebe0 · 2025-10-23co-authored: Claude Code

0e578c3

fix: use priority-based selection count instead of view-based

callebe0 · 2025-10-23co-authored: Claude Code

aa13afa

refactor(prompt-panel): change 'job' terminology to 'execution'

Backend engineer · 2025-11-06

af1e671

feat(prompt-panel): remove token estimation feature

Backend engineer · 2025-11-06

ccdf62b

feat(stash-controls): remove redundant 'run in parallel' toggle

Backend engineer · 2025-11-06

Two week lifecycle visible in the commit history. Built fast, lived with, refined, partially killed. Speed was strategic and tied to judgment.

Outcome and what I would do differently

This shipped with a weak outcome metric. run_prompt usage stayed low after launch, which made it a poor measure of success for this specific change. Given another cycle, I would instrument the behavior closer to the trust problem itself.

What I would instrument given another run at this:

Time from prompt focus to Run click, as a behavioral proxy for hesitation.
Abandonment after seeing the executions planned count, as a signal of mismatch between user expectation and system state.
Support tickets about unexpected run results, as a trust failure indicator.
Mode selector usage rate, to validate that the feature is being read and acted on across sessions.

What I take from this

Three things I would carry into the next role:

Trust requirements depend on the user's job at that moment. A quick experimental run needs lightweight visibility. A high stakes batch run may justify slower, more precise estimation.

AI product design needs code prototypes earlier than traditional product work. Latency, model behavior, data shape, and estimation quality need to be judged in the running system.

Strategic design in a small company often starts with a small concrete bug. The bug creates the entry point, and the real work is finding the deeper mismatch between system behavior and user understanding.

Watch the 2 minute version

How I came to the problem

Key decisions

Mode aware panel

Context selector

Warning behavior

The trust mechanic I killed

Job count

Mode selector

Adaptive placement

Why I prototyped in code

Outcome and what I would do differently

What I take from this