Why Building with Voice Is a UX Design Challenge, Not Just a Tech One
Shola Jegede


Publish Date: Aug 4

When people think about building with voice, they think about hard problems:

  • Real-time transcription
  • Latency management
  • GPT inference pipelines
  • Audio quality, noise filtering, etc.

All valid. All difficult.

But none of them are what truly broke my first version of Learnflow AI, a voice-first tutor platform powered by Vapi.

The real challenge? UX.

Because when your user isn't looking at a screen — when they're speaking instead of typing — you lose most of the affordances we've come to rely on.

No hover states. No tooltips. No loading spinners.

And as I learned the hard way: No clarity.

This is a breakdown of what went wrong when I first shipped a real-time voice app and how I reworked it to be understandable, usable, and even delightful.

Voice Tech Is Easy (When Vapi Handles It)

I built the first version of Learnflow AI over a weekend.

  • Vapi handled the entire voice loop: speech-in, text-to-GPT, voice-out
  • Convex tracked sessions, user data, and credits
  • Kinde managed auth, billing, and plan-based access control

Thanks to Vapi, I didn’t need to stitch together Whisper, GPT-4, ElevenLabs, and a WebSocket architecture. One agent definition and a vapi.start() call handled it all.

A sample agent session start looks like this:

const assistantOverrides = {
  variableValues: { subject, topic, style },
  clientMessages: ["transcript"],
  serverMessages: [],
};

vapi.start(configureAssistant(voice, style), assistantOverrides)
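For context, configureAssistant just returns a Vapi assistant definition. Here's a simplified sketch of the kind of object it produces; the providers, model, and template variables shown are illustrative choices for my setup, not anything Vapi mandates:

const configureAssistant = (voice: string, style: string) => ({
  name: "Learnflow Tutor",
  firstMessage: "Hi! What would you like to learn today?",
  transcriber: { provider: "deepgram", model: "nova-2", language: "en" },
  voice: { provider: "11labs", voiceId: voice },
  model: {
    provider: "openai",
    model: "gpt-4",
    messages: [
      {
        // {{subject}}, {{topic}}, and {{style}} get filled in by variableValues above
        role: "system",
        content: "You are a {{style}} tutor teaching {{subject}}. Keep answers short and conversational.",
      },
    ],
  },
});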

But that just gave me the plumbing.

It didn’t solve what my users were facing.

What Actually Went Wrong

1. No Clarity on When the Session Was Active

Vapi is fast — the session starts within seconds. But users had no idea.

They’d click "Start Session"...

Then wait.

Then say, "Hello?"

Then say it again.

Why? Because I didn’t give them visual cues.

There was no feedback that their voice was being heard, transcribed, and responded to. For a voice interface, that’s a dealbreaker.
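Looking back, the missing feedback could have been driven by a tiny bit of state tied to the call lifecycle. A sketch of the idea, inside the session component, using Vapi's call-start and call-end events (the status copy is just an example):

const [status, setStatus] = useState<"connecting" | "active" | "ended">("connecting");

vapi.on("call-start", () => setStatus("active")); // safe to speak now
vapi.on("call-end", () => setStatus("ended"));    // session is over

// Then render it somewhere obvious:
// {status === "connecting" && <p>Connecting…</p>}
// {status === "active" && <p>Listening, go ahead.</p>}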

2. Muted Mic Confusion

Vapi offers a setMuted toggle, but I didn't expose that clearly.

One user turned off the mic thinking it was ending the session.

Another forgot it was off and kept talking. Silence.

3. No Transcript = No Confirmation

Even though I was getting real-time transcripts from Vapi, I didn't display them at first.

Result? Users didn’t know what was being heard, understood, or ignored.

They didn’t trust the app.

Before: The Broken Voice UX

How I Fixed It

Voice UI Is Feedback UI

I rebuilt the voice session component from scratch with one goal:

Always show users what’s happening.

Design Fix 1: Real-Time Transcript Feed

As Vapi emits transcript messages, I append them to a rolling transcript UI.

vapi.on('message', (message) => {
  if (message.type === 'transcript' && message.transcriptType === 'final') {
    const newMessage = { role: message.role, content: message.transcript };
    setMessages((prev) => [newMessage, ...prev]);
  }
});

The transcript appears like a conversation thread. This helps users feel heard.
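Rendering it is just a list of those messages, newest first since setMessages prepends. A simplified sketch of the feed markup:

<div className="transcript">
  {messages.map((msg, i) => (
    <p key={i} className={msg.role === "assistant" ? "assistant" : "user"}>
      {msg.role === "assistant" ? "Tutor: " : "You: "}
      {msg.content}
    </p>
  ))}
</div>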

Design Fix 2: Speaking Animation (Lottie)

When the assistant is speaking, I show a wave animation using Lottie.

vapi.on('speech-start', () => setIsSpeaking(true));
vapi.on('speech-end', () => setIsSpeaking(false));

This became the signal for active state.

Users now intuitively know:

  • When it's listening
  • When it's thinking
  • When it's speaking
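On the render side, the isSpeaking flag just drives the animation. A minimal sketch with lottie-react, inside the session component (the soundwave file is a placeholder):

import { useEffect, useRef } from "react";
import Lottie, { LottieRefCurrentProps } from "lottie-react";
import soundwave from "@/assets/soundwave.json"; // placeholder animation

const lottieRef = useRef<LottieRefCurrentProps>(null);

// Play the wave only while the assistant is talking
useEffect(() => {
  if (isSpeaking) lottieRef.current?.play();
  else lottieRef.current?.stop();
}, [isSpeaking]);

// In the JSX:
// <Lottie lottieRef={lottieRef} animationData={soundwave} loop autoplay={false} />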

Design Fix 3: Microphone Toggle That Makes Sense

I added a visible mic toggle button:

<button onClick={toggleMicrophone}>
  {isMuted ? "Mic Off" : "Mic On"}
</button>

Plus a tooltip: "Turn this off if you want silence. Your session continues."
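Under the hood, the toggle is just Vapi's setMuted control mirrored into local state. A sketch, assuming the same vapi instance from earlier:

const [isMuted, setIsMuted] = useState(false);

const toggleMicrophone = () => {
  const next = !isMuted;
  vapi.setMuted(next); // mutes the user's mic; the session keeps running
  setIsMuted(next);    // mirrored in React for the button label and color
};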

After: Fixed UX Flow

Real User Flow Example

Let’s say Joy signs up for Learnflow AI.

  1. She picks the free plan (10 voice sessions)
  2. Lands on the dashboard and clicks “Start Session”
  3. A Lottie animation appears
  4. She says: “Hey, what’s HTML?”
  5. Sees: “You: Hey, what’s HTML?”
  6. Hears: “The Hypertext Markup Language is the standard markup language for documents designed to be…”
  7. Credit drops from 10 → 9 in real-time

Then a nudge appears: “You have 9 sessions left. Upgrade for 100 sessions/month.”

She clicks “Upgrade”, gets routed to Kinde’s billing page, and instantly returns as a Pro user.
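That 10 → 9 drop in step 7 isn't polling. Convex queries are reactive, so the counter re-renders the moment the backend mutation patches the user. A sketch of the sticky counter (the users.getUser query name is a placeholder for whatever your query is called):

import { useQuery } from "convex/react";
import { api } from "@/convex/_generated/api";

// Re-renders automatically whenever `credits` changes in Convex
const user = useQuery(api.users.getUser, { userId });

return <span className="credit-counter">{user?.credits ?? "…"} sessions left</span>;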

Convex + Kinde: Infra That Made My App’s UX Better

Convex: Session + Credit Logic

Every time a session begins, I log it in Convex and deduct a credit:

// schema.ts
import { defineTable } from "convex/server";
import { v } from "convex/values";

// Both tables get collected into defineSchema({ users, sessions }) as the default export
export const users = defineTable({
  credits: v.number(),
  plan: v.string(),
});

export const sessions = defineTable({
  userId: v.id("users"),
  startedAt: v.number(),
});

// mutation.ts
import { mutation } from "./_generated/server";

export const startSession = mutation(async (ctx, args) => {
  const user = await ctx.db.get(args.userId);
  if (!user || user.credits <= 0) throw new Error("Out of credits");

  // Log the session with its start time (the schema requires startedAt)
  await ctx.db.insert("sessions", {
    userId: args.userId,
    startedAt: Date.now(),
  });

  // Deduct one credit in the same mutation so the two writes stay atomic
  await ctx.db.patch(args.userId, {
    credits: user.credits - 1,
  });
});


If credits hit 0:

  • The user is unable to create a new session with Vapi
  • Full-screen upgrade modal appears
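On the client, that gate runs before vapi.start() is ever called. A sketch (showUpgradeModal and the user object come from my component state; the names are illustrative):

import { useMutation } from "convex/react";
import { api } from "@/convex/_generated/api";

const startSession = useMutation(api.mutation.startSession);

const handleStartSession = async () => {
  // The backend mutation is the source of truth, but check up front for instant feedback
  if (!user || user.credits <= 0) {
    setShowUpgradeModal(true); // full-screen upgrade prompt
    return;
  }

  await startSession({ userId: user._id }); // logs the session + deducts a credit
  vapi.start(configureAssistant(voice, style), assistantOverrides);
};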

Kinde: Role Gating + Plan Sync

I use Kinde’s hosted pricing page.

After signup, each user's plan (starter, pro, or plus) is resolved from their Kinde entitlements:

const user = await getUser();

// `entitlements` comes from Kinde, fetched for the signed-in user
const plans = entitlements?.data?.plans ?? [];
console.log("Plans:", plans);

// Fall back to the free "starter" tier if no paid plan is attached yet
let plan: "starter" | "pro" | "plus" = "starter";

if (plans.some((p: any) => p.key === "pro")) {
  plan = "pro";
} else if (plans.some((p: any) => p.key === "starter")) {
  plan = "starter";
} else if (plans.some((p: any) => p.key === "plus")) {
  plan = "plus";
}

console.log("Plan:", plan);

Then I sync that in Convex for backend enforcement.
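The sync itself is a small mutation that stamps the resolved plan onto the user record, in the same style as the earlier mutation (the mutation name and credit amounts are placeholders):

// convex/users.ts
export const syncPlan = mutation(async (ctx, args) => {
  // Top up credits when the plan changes; the amounts here are illustrative
  const credits = args.plan === "pro" ? 100 : args.plan === "plus" ? 250 : 10;
  await ctx.db.patch(args.userId, { plan: args.plan, credits });
});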

Edge Cases I Had to Handle

  • User hits 0 credits mid-session: block the next attempt with a modal
  • User mutes the mic and thinks the session is paused: clearer copy + a mic color state
  • User switches tabs mid-session: a session timer auto-ends the call after 60s idle (see the sketch after this list)
  • User upgrades mid-session: a full reload refreshes the plan + credit count
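The tab-switch case is just a visibility listener plus a timer. A sketch, assuming the vapi instance and React's useEffect (60_000 ms is the idle window):

useEffect(() => {
  let idleTimer: ReturnType<typeof setTimeout>;

  const onVisibilityChange = () => {
    if (document.hidden) {
      idleTimer = setTimeout(() => vapi.stop(), 60_000); // end the Vapi call after 60s hidden
    } else {
      clearTimeout(idleTimer); // user came back in time
    }
  };

  document.addEventListener("visibilitychange", onVisibilityChange);
  return () => {
    clearTimeout(idleTimer);
    document.removeEventListener("visibilitychange", onVisibilityChange);
  };
}, []);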

Final UX Checklist Before Launch

  • Real-time transcript feed ✅
  • Visual signal for when assistant is speaking ✅
  • Sticky credit counter ✅
  • Mic toggle with explanation ✅
  • Upgrade nudge after session ✅
  • Kinde role sync across backend ✅

What I Learned

  • Building voice is not just about latency and speech quality
  • Voice-first UX is not like chatbot UX
  • UX clarity is everything when there are no visual anchors
  • Trust comes from visibility: show the transcript, show the state
  • Feedback loops build confidence

And most of all:

If your user isn't sure whether they're being heard, they won't speak again.

Takeaways If You’re Building Voice AI Apps

  1. Don’t launch voice without a feedback loop
  2. Show users their words (transcript)
  3. Show agent activity (Lottie or animation)
  4. Use a backend like Convex to gate usage in real time
  5. Use Kinde's roles to simplify access control
  6. Let something like Vapi handle the hard infra
  7. Don't assume your user will "figure it out"; measure their hesitation and guide them
  8. Build with latency in mind, but design for confidence

Your Turn

Have you tried building voice-first UX?

Did you run into any of these challenges?

Drop a comment, let’s compare notes below.

Written by Shola Jegede, building Learnflow AI

Built with late nights, live feedback, and lots of learning.

See you in the comments.
