Case study · HumanFirst · 2025
Execution Mode: making AI prompt execution visible before users pay for it
Designing visibility of system status into a prompt builder, including the trust mechanic I built and then killed two weeks later.
Atlas Coffee sells roasted coffee subscriptions for small teams and remote offices.
BrightLedger helps finance teams reconcile invoices and vendor records faster.
Northline Labs builds workflow software for clinical operations teams.
HumanFirst's prompt builder let users run AI prompts across unstructured datasets.
The challenge: By late 2025, the surface had become powerful and the moment before execution was under explained. A user could click Run without a clear summary of what would happen next: which data would be used, how many executions could start, and whether the selected mode matched the task.
My goal was to make the moment before execution clearer and safer without slowing down quick experimentation.
How I came to the problem
Direct customer access was limited, so I triangulated the problem through product behavior, session replays, analytics, and systematic testing. The evidence pointed to the same issue: the Run flow had become powerful. The system status around it had fallen behind.
I used three signals:
- Datadog session replays showed users moving around the Run flow, opening related panels, closing them, and leaving before execution.
- Mixpanel showed low
run_promptusage even as the prompt builder became more capable. - While testing the area, I found a smaller bug: the merge toggle produced no visible change when no rows were selected.
The bug gave me a concrete entry point. The larger problem was visibility of system status. Users needed the system to answer a simple question before they committed to the run: what exactly is going to happen when I click this?
The wedge · PromptRunMode component, before
A single Switch, labeled Merge query (or Merge context), with a tooltip explaining the behavior. When the user had nothing selected, toggling produced no observable change. The control was technically correct and conceptually empty.
This was a product judgment call made under startup constraints. I had enough signal to form a hypothesis, and the fastest responsible way to test it was to prototype the behavior in code.On working with constraints
Key decisions
I worked with the CEO, engineering, and Marketing to turn the execution flow from a loose concept into shipped behavior. The main design question was how much guidance the interface should provide before a user clicked Run.
Mode aware panel
The panel needed to stay visible near execution and appear in the moments where it helped the active workflow. I scoped it to the modes where users needed confirmation before running a job.
Context selector
I used progressive disclosure for the context selector. It appears when the user has a real ambiguity to resolve, keeping every control off the surface by default.
Warning behavior
My first version blocked users when a variable was set with zero selected items. After reviewing the flow with engineering, I changed it to an informative message so experimental runs could stay fast while still making the consequence clear.
The progressive disclosure call · three states of the panel
No variables
/.Variables, no data
{{ variables }} during execution.Variables and data
The trust mechanic I killed
This is the moment of the case study.
On October 23, I shipped the execution mode panel with a token estimator. The goal was sound: before users clicked Run, give them a rough sense of the cost of execution.
The implementation used a lightweight heuristic based on character count and file size approximation. It worked directionally for simple text, then became fragile as files, larger inputs, and model differences entered the flow. Engineering raised the accuracy concern after we had lived with the feature in production.
I made the call with engineering to remove it. On November 6, the token estimator was pulled out of production.
Side by side · the token estimator I killed
The estimator solved the right problem in the wrong way. A token number looks authoritative. When that number is rough, slow, or inconsistently accurate, it can damage trust faster than an omitted estimate would.The rationale for the kill
The kept version focused on the trust signals we could stand behind: execution count, selected mode, source, and placement near the Run button.
The version that shipped keeps three trust mechanics that did hold up:
Job count
Driven by priority based selection logic so the number on screen matches the actual execution count across the selected scope.
Mode selector
Merge once and Per item, with a soft notification when a variable is set with no items selected. Inform while preserving momentum.
Adaptive placement
The panel surfaces system status where the user's attention already is, near the Run button, and adapts to the active mode.
Why I prototyped in code
I prototyped this entire feature in code, on an isolated branch, with merge blocks our CTO had approved. Every commit was co authored with Claude Code. The active design and build cycle was three days of concentrated work. The full lifecycle including the token estimator decision was about two weeks.
The single sharpest reason it had to be code: I needed to feel the actual latency of the LLM, the real speed at which the system would return an answer, and whether the token estimate was accurate enough to display. Figma would have hidden that signal.
The decision to remove the estimator depended on lived experience with the running system. Code prototyping gave us the evidence to remove that feature before it continued poisoning user trust. The design review is the secondary proof. I walked the team through a working prototype. They could click. They could try the edge cases of text variable plus context plus selection in another tab. The conversation surfaced edge cases that static frames would have missed.
Two week lifecycle · commit log on the feature branch
Outcome and what I would do differently
This shipped with a weak outcome metric. run_prompt usage stayed low after launch, which made it a poor measure of success for this specific change. Given another cycle, I would instrument the behavior closer to the trust problem itself.
What I would instrument given another run at this:
- Time from prompt focus to Run click, as a behavioral proxy for hesitation.
- Abandonment after seeing the executions planned count, as a signal of mismatch between user expectation and system state.
- Support tickets about unexpected run results, as a trust failure indicator.
- Mode selector usage rate, to validate that the feature is being read and acted on across sessions.
What I take from this
Three things I would carry into the next role:
Trust requirements depend on the user's job at that moment. A quick experimental run needs lightweight visibility. A high stakes batch run may justify slower, more precise estimation.
AI product design needs code prototypes earlier than traditional product work. Latency, model behavior, data shape, and estimation quality need to be judged in the running system.
Strategic design in a small company often starts with a small concrete bug. The bug creates the entry point, and the real work is finding the deeper mismatch between system behavior and user understanding.